In 2016, NVIDIA and OpenAI began a collaboration to push the boundaries of artificial intelligence (AI), starting with the delivery of the first NVIDIA DGX system to OpenAI. That partnership continues today with the introduction of two new open-weight AI models: OpenAI gpt-oss-20b and gpt-oss-120b. These models are designed to take full advantage of NVIDIA’s Blackwell architecture, reaching up to 1.5 million tokens per second (TPS) on an NVIDIA GB200 NVL72 system.
The gpt-oss models are large language models (LLMs) built for text-reasoning tasks, with chain-of-thought processing and tool-calling capabilities. They use a mixture of experts (MoE) architecture with SwiGLU activations, and their attention layers use RoPE with a 128k context, alternating between full-context attention and a sliding 128-token window. Released in FP4 precision, the models fit on a single 80 GB data center GPU and are natively supported by NVIDIA’s Blackwell architecture.
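To make the attention scheme concrete, here is a minimal PyTorch sketch of how a layer could alternate between a full-context causal mask and a 128-token sliding-window mask. Which layers use which pattern, and the exact window semantics, are assumptions for illustration rather than the models’ published configuration.

```python
import torch

def attention_mask(layer_idx: int, seq_len: int, window: int = 128) -> torch.Tensor:
    """Causal mask; even layers use full context, odd layers a sliding window.

    The even/odd ordering and the window semantics are illustrative assumptions.
    """
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if layer_idx % 2 == 0:
        return causal  # full-context attention layer
    # Sliding-window layer: each query attends only to the previous `window` tokens.
    offsets = torch.arange(seq_len).unsqueeze(1) - torch.arange(seq_len).unsqueeze(0)
    return causal & (offsets < window)

full_mask = attention_mask(layer_idx=0, seq_len=8)               # lower-triangular
sliding_mask = attention_mask(layer_idx=1, seq_len=8, window=4)  # banded lower-triangular
```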
Training these models was no small feat: the gpt-oss-120b model required over 2.1 million hours on NVIDIA H100 Tensor Core GPUs, and the gpt-oss-20b model took roughly one-tenth of that. NVIDIA worked with the teams behind several leading open-source frameworks, including Hugging Face Transformers, Ollama, and vLLM, alongside its own NVIDIA TensorRT-LLM, to develop optimized kernels and model enhancements, reflecting NVIDIA’s commitment to giving developers the tools they need.
### Model Specifications
The gpt-oss-20b model comprises 24 transformer blocks, 20 billion total parameters, and 3.6 billion active parameters per token. The larger gpt-oss-120b model features 36 transformer blocks, 117 billion total parameters, and 5.1 billion active parameters per token. Both models support a 128k input context length for long-context text reasoning.
### Performance Optimization
NVIDIA has optimized these models with several kernel- and framework-level features (a conceptual sketch of the MoE routing these kernels accelerate follows the list). These include:
1. TensorRT-LLM Gen for attention prefill, attention decode, and MoE low-latency on Blackwell.
2. CUTLASS MoE kernels on Blackwell.
3. XQA kernel for specialized attention on Hopper.
4. Optimized attention and MoE routing kernels, accessible through the FlashInfer kernel-serving library for LLMs.
5. OpenAI Triton kernel MoE support, utilized in both TensorRT-LLM and vLLM.
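As a rough illustration of the MoE expert routing that these kernels accelerate, the sketch below picks the top-k experts per token with plain PyTorch. The expert count, `top_k` value, and normalization are assumptions for illustration, not the models’ published configuration.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden: torch.Tensor, router_weight: torch.Tensor, top_k: int = 4):
    """Select the top-k experts per token and return normalized routing weights."""
    logits = hidden @ router_weight             # [num_tokens, num_experts]
    topk_vals, topk_idx = logits.topk(top_k, dim=-1)
    weights = F.softmax(topk_vals, dim=-1)      # weights over the selected experts only
    return topk_idx, weights

tokens = torch.randn(16, 512)   # 16 tokens with an illustrative hidden size of 512
router = torch.randn(512, 32)   # 32 experts, also illustrative
expert_ids, expert_weights = route_tokens(tokens, router)
```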
### Deploying with vLLM
In collaboration with vLLM, NVIDIA has verified the accuracy of these models and optimized performance for both the Hopper and Blackwell architectures. Data center developers can access NVIDIA’s optimized kernels through the FlashInfer LLM serving kernel library. To streamline deployment, vLLM recommends using `uv` for Python dependency management; a single command sets up an OpenAI-compatible web server and automatically downloads the model. For detailed instructions, refer to the vLLM Cookbook guide.
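Once the server is up, it exposes an OpenAI-compatible endpoint, so it can be queried with the standard OpenAI Python client. The base URL, port, and model identifier below are assumptions; match them to however the server was launched.

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally,
# e.g. at http://localhost:8000/v1 serving openai/gpt-oss-20b (both assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize chain-of-thought prompting in one paragraph."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```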
### Deploying with TensorRT-LLM
NVIDIA offers deployment optimizations through its TensorRT-LLM GitHub repository. Developers can follow a deployment guide that outlines the steps for launching high-performance servers, provides a Docker container, and explains how to configure for both low-latency and maximum-throughput scenarios.
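For quick experiments outside a full server deployment, TensorRT-LLM also exposes a high-level Python LLM API. A minimal sketch is shown below; the model identifier and sampling settings are assumptions to adjust per the deployment guide.

```python
from tensorrt_llm import LLM, SamplingParams

# Model identifier and sampling settings are illustrative assumptions.
llm = LLM(model="openai/gpt-oss-20b")
params = SamplingParams(max_tokens=128, temperature=0.7)

outputs = llm.generate(
    ["Explain mixture-of-experts routing in two sentences."], params
)
for output in outputs:
    print(output.outputs[0].text)
```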
### Unleashing the Power of GB200 NVL72
NVIDIA engineers have worked closely with OpenAI to ensure that the new gpt-oss models deliver accelerated performance right from Day 0 on both the NVIDIA Blackwell and NVIDIA Hopper platforms. Early performance measurements have shown that a single GB200 NVL72 rack-scale system can serve the computationally demanding gpt-oss-120b model at an impressive rate of 1.5 million tokens per second, accommodating approximately 50,000 concurrent users. This remarkable feat is achieved through Blackwell’s advanced architectural capabilities, including a second-generation Transformer Engine with FP4 Tensor Cores and fifth-generation NVIDIA NVLink and NVIDIA NVLink Switch, enabling 72 Blackwell GPUs to function as a single, massive GPU.
The combination of performance, versatility, and rapid innovation within the NVIDIA platform empowers the ecosystem to deploy the latest models with high throughput and low cost per token.
### Experiment with NVIDIA Launchable
Deploying with TensorRT-LLM is made even more accessible through the NVIDIA Launchable platform. Developers can utilize the Python API in a JupyterLab notebook to test GPUs from multiple cloud platforms. With just a single click, they can deploy the optimized model in a pre-configured environment, facilitating seamless experimentation and development.
### Introducing NVIDIA Dynamo
NVIDIA Dynamo is an open-source inference serving platform designed for large-scale applications. It integrates with major inference backends and offers features such as LLM-aware routing, elastic autoscaling, and disaggregated serving. For applications with long input sequence lengths (ISL), Dynamo’s disaggregated serving significantly enhances performance. At a 32K ISL, Dynamo delivers a fourfold improvement in interactivity at the same system throughput and GPU budget compared to aggregated serving. To deploy with Dynamo, refer to the provided guide.
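For intuition only, the sketch below separates prefill and decode into distinct workers, which is the core idea behind disaggregated serving; it is a conceptual outline, not Dynamo’s actual API.

```python
from dataclasses import dataclass

@dataclass
class PrefillResult:
    request_id: str
    kv_cache_handle: str  # in practice, a reference to GPU memory or a transfer buffer

def prefill_worker(request_id: str, prompt_tokens: list[int]) -> PrefillResult:
    # Processes the long input sequence (e.g. 32K tokens) once, producing a KV cache.
    return PrefillResult(request_id, kv_cache_handle=f"kv://{request_id}")

def decode_worker(result: PrefillResult, max_new_tokens: int) -> list[str]:
    # Generates output tokens one at a time against the transferred KV cache,
    # so long prompts from other requests do not stall interactive decoding.
    return [f"<token_{i}>" for i in range(max_new_tokens)]

handle = prefill_worker("req-1", prompt_tokens=list(range(32_000)))
generated = decode_worker(handle, max_new_tokens=8)
```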
### Running Locally on NVIDIA GeForce RTX AI PCs
Developers seeking faster iteration, lower latency, and enhanced data privacy can run AI models locally on NVIDIA RTX PRO GPUs. Both gpt-oss models can be deployed on professional workstations, and the gpt-oss-20b model can also run on any GeForce RTX AI PC with a minimum of 16 GB of VRAM, utilizing MXFP4 precision. This setup allows developers to experience these models through their preferred apps and SDKs, using tools like Ollama, Llama.cpp, or Microsoft AI Foundry Local. For more information, visit the RTX AI Garage.
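As one example of local use, the snippet below calls a model served by Ollama through its Python client. The model tag is an assumption; check `ollama list` for the tag actually pulled on your machine.

```python
import ollama

# The model tag below is an assumption; verify it with `ollama list`.
response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What can I run in 16 GB of VRAM?"}],
)
print(response["message"]["content"])
```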
### Streamlining Enterprise Deployments with NVIDIA NIM
Enterprise developers have the opportunity to try the gpt-oss models for free using the NVIDIA NIM Preview API and the web playground environment available in the NVIDIA API Catalog. These models are packaged as NVIDIA NIM microservices, simplifying deployment on any GPU-accelerated infrastructure while ensuring flexibility, data privacy, and enterprise-grade security.
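The NIM Preview API in the API Catalog follows the same OpenAI-compatible pattern, so a hosted endpoint can be queried much like a local one. The base URL, model name, and environment variable holding the API key below are assumptions to verify against the API Catalog UI.

```python
import os
from openai import OpenAI

# Base URL, model name, and key variable are assumptions; confirm them in the API Catalog.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Give one use case for tool calling."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```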
By integrating gpt-oss models into every layer of the NVIDIA developer ecosystem, developers can choose the solution that best suits their needs. To get started, explore the NVIDIA API Catalog UI or consult the NVIDIA developer guide in the OpenAI Cookbook.
For more information on this exciting development, you can refer to the original article on the NVIDIA blog.