NVIDIA Blackwell Ultra NVL72 Sets New Benchmark for Agentic AI Performance
Artificial Analysis has launched AgentPerf, the first benchmark designed specifically for agentic AI systems. Initial results show that NVIDIA’s Blackwell Ultra NVL72 platform (GB300 NVL72) can run up to 20 times more agents per megawatt than the previous-generation, Hopper-based NVIDIA HGX H200 system, a significant advance in AI infrastructure efficiency.
Understanding Agentic AI and Its Workloads
Agentic AI differs fundamentally from traditional conversational AI. While conversational AI typically involves single interactions—like a single large language model (LLM) call followed by one response—agentic AI operates in a more complex manner. An agent breaks down a goal into multiple steps and continues executing until the task is completed. This process involves chaining together numerous LLM calls and tool interactions to gather context, reason, and act.
The cost of an agentic task compounds rather than simply adding up. A single task can involve dozens or even hundreds of LLM calls, each building on the context accumulated from previous calls and often interleaved with tool calls such as code execution or database searches. Measuring this kind of workload requires a different approach from existing benchmarks, which focus solely on the efficiency of a single LLM call.
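To make the compounding concrete, here is a minimal Python sketch of an agent loop. It is not AgentPerf code: `fake_llm`, `fake_tool`, and every token count are hypothetical placeholders, and the sketch assumes each LLM call re-reads the full accumulated context with no caching.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Tracks how context and call counts grow over one agentic task."""
    context_tokens: int
    llm_calls: int = 0
    tool_calls: int = 0
    tokens_processed: int = 0  # input tokens summed over every LLM call


def fake_llm(step: int, max_steps: int) -> tuple[str, int]:
    """Hypothetical stand-in for a model call: returns (action, output_tokens)."""
    action = "finish" if step == max_steps - 1 else "tool"
    return action, 200  # assumed output size, illustrative only


def fake_tool() -> int:
    """Hypothetical tool call: returns the tokens its result adds to the context."""
    return 800  # assumed tool-result size, illustrative only


def run_agent(initial_context: int = 2_000, max_steps: int = 50) -> AgentRun:
    run = AgentRun(context_tokens=initial_context)
    for step in range(max_steps):
        # Each call reads everything accumulated so far (no caching assumed).
        run.tokens_processed += run.context_tokens
        action, output_tokens = fake_llm(step, max_steps)
        run.llm_calls += 1
        run.context_tokens += output_tokens
        if action == "finish":
            break
        # The tool result is appended to the context for the next call.
        run.context_tokens += fake_tool()
        run.tool_calls += 1
    return run


if __name__ == "__main__":
    print(run_agent())
```

Under these assumptions, the context grows linearly with the number of steps, while the total input tokens processed grow roughly quadratically. That gap is one reason a single-call benchmark says little about agentic workloads.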
NVIDIA GB300 NVL72: A Performance Powerhouse
In its first round of testing, AgentPerf evaluated agentic performance using DeepSeek V4 Pro, a large mixture-of-experts (MoE) model that represents cutting-edge capabilities in agentic AI. The results showed that NVIDIA’s GB300 NVL72 runs up to 20 times more agents per megawatt than the NVIDIA HGX H200 system.
The performance of the GB300 NVL72 stems from its design, which integrates 72 GPUs into a single rack-scale system. This architecture allows the execution of large MoE models like DeepSeek V4 Pro to be distributed efficiently across GPUs. Additionally, CUDA kernels improve this efficiency by overlapping communication with computation, reducing latency during coordination across experts.
NVIDIA’s TensorRT LLM further optimizes performance as the number of concurrent agent sessions grows. It separates input processing (prefill) from output generation (decode), so each stage can be tuned independently. AgentPerf’s methodology, in turn, is meant to reflect real-world operating conditions for agentic AI systems.
AgentPerf: Tailored for Real-World Applications
AgentPerf is built on real coding-agent trajectories in which an agent receives a task, reads files, writes and edits code, executes commands, and iterates on the results. The trajectories are derived from public code repositories spanning more than 12 programming languages. The benchmark simulates tool calls using representative CPU processing times rather than executing them directly, so that performance differences can be attributed solely to the accelerated computing platform.
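The idea of simulating tool calls can be sketched as a small trajectory replayer. This is an illustration, not AgentPerf’s actual harness: the `Step` type, the `replay` function, the tool names, and the latency table are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class Step:
    kind: str  # "llm" or "tool"
    name: str  # prompt label for LLM steps, tool name for tool steps


# Assumed "representative" CPU times in seconds; illustrative values only.
SIMULATED_TOOL_SECONDS = {
    "read_file": 0.01,
    "edit_file": 0.02,
    "run_tests": 0.05,
}


def replay(trajectory, llm_call, sleep=time.sleep):
    """Replay a recorded trajectory, timing real LLM calls and simulating tools.

    Returns (llm_seconds, tool_seconds). Tool steps are never executed: each
    one just waits for its fixed, pre-assigned duration, so tool time is
    identical on every system under test.
    """
    llm_seconds = 0.0
    tool_seconds = 0.0
    for step in trajectory:
        if step.kind == "llm":
            start = time.perf_counter()
            llm_call(step.name)  # the only part that touches the accelerator
            llm_seconds += time.perf_counter() - start
        else:
            delay = SIMULATED_TOOL_SECONDS[step.name]
            sleep(delay)
            tool_seconds += delay
    return llm_seconds, tool_seconds
```

Because the tool time is fixed by the table, any difference in end-to-end time between two systems replaying the same trajectory comes from the LLM calls alone.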
By measuring how many concurrent agentic tasks can be executed per accelerator and per megawatt of power consumed, AgentPerf gives enterprises looking to deploy AI agents at scale a concrete basis for comparing infrastructure investments.
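The two normalized metrics reduce to simple arithmetic. The Python sketch below shows one way to compute them. The function names are hypothetical, and the figures in the usage example are made-up placeholders rather than AgentPerf measurements.

```python
def agents_per_megawatt(concurrent_agents: float, power_watts: float) -> float:
    """Concurrent agents sustained per megawatt of power drawn."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return concurrent_agents * 1_000_000 / power_watts


def agents_per_accelerator(concurrent_agents: float, accelerators: int) -> float:
    """Concurrent agents sustained per accelerator."""
    if accelerators <= 0:
        raise ValueError("accelerators must be positive")
    return concurrent_agents / accelerators


if __name__ == "__main__":
    # Placeholder inputs for two hypothetical systems, NOT benchmark results.
    baseline = agents_per_megawatt(concurrent_agents=100, power_watts=50_000)
    candidate = agents_per_megawatt(concurrent_agents=1_440, power_watts=120_000)
    print(f"baseline:  {baseline:.0f} agents/MW")
    print(f"candidate: {candidate:.0f} agents/MW")
    print(f"ratio:     {candidate / baseline:.1f}x")
    print(f"per accelerator: {agents_per_accelerator(1_440, 72):.1f}")
```

Normalizing by power rather than by raw agent count matters because two systems with very different sizes and power draws can only be compared fairly on a per-megawatt or per-accelerator basis.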
NVIDIA Partners Leverage Blackwell’s Superior Performance
Several leading inference providers are already utilizing NVIDIA’s Blackwell architecture to support agentic workloads on advanced models like DeepSeek V4 Pro. Companies such as Baseten, DeepInfra, and Together AI are actively deploying production applications powered by this new technology.
Together AI is enhancing real-time inference for Cursor, an AI-driven coding platform that employs agents to debug issues and generate features while developers maintain their workflow. Similarly, DeepInfra supports Pam.ai, an AI workforce platform designed for car dealerships that deploys agents to manage service appointments and sales campaigns entirely on NVIDIA Blackwell infrastructure.
As NVIDIA continues to refine its inference software alongside the open-source community, improvements in both performance and efficiency for agentic workloads are anticipated. The introduction of the Vera Rubin architecture is expected to further elevate infrastructure capacity to meet the increasing demands of scalable agentic AI applications.
What This Means
The launch of AgentPerf marks a pivotal moment in how the performance of agentic AI systems is evaluated. With NVIDIA’s Blackwell Ultra NVL72 leading the initial results in efficiency and scalability, enterprises can now better assess how their infrastructure investments translate into productive outcomes when deploying complex AI agents at scale. As this technology matures and becomes more widely adopted, it could reshape how organizations use artificial intelligence across sectors.