New GB300 NVL72 Ensures Reliable AI Power, Says NVIDIA


The contemporary electrical grid is primarily structured to handle stable and predictable loads, such as those from lighting, household appliances, and industrial machines that maintain a constant power usage. However, the rapid evolution of data centers—especially those handling artificial intelligence (AI) workloads—has significantly altered this balance.

The New Power Dynamics of Data Centers

Data centers are notorious for being power-hungry, consuming a substantial portion of the available capacity of power plants and transformers. Traditionally, the varied activities within these centers tend to average out power consumption, maintaining a certain equilibrium. However, training large-scale AI models introduces sudden and significant fluctuations in power demand, presenting unique challenges for grid operators:

  1. Delayed Response to Demand Surges: If there is a sudden spike in power demand, it can take anywhere from a minute to 90 minutes for generation resources to respond. This delay is primarily due to the physical limitations inherent in their ramp rates.
  2. Equipment Stress from Power Transients: Repeated power transients can cause resonance, which in turn stresses the equipment.
  3. Handling Excess Energy: If a data center suddenly reduces its power consumption, energy production systems might find themselves with surplus energy that has no immediate outlet.

These sudden changes can manifest as voltage spikes or sags for other grid customers, potentially disrupting their operations.

NVIDIA’s Innovative Approach to Power Management

In light of these challenges, NVIDIA has developed a new power supply unit (PSU) with integrated energy storage for the GB300 NVL72. The PSU is designed to mitigate power spikes from AI workloads, reducing peak grid demand by up to 30%. This technology will soon be available in GB200 NVL72 systems as well.

The following sections describe how the solution handles AI training workloads at each stage: the start of a run, sustained full load, and the end of the run. They also present empirical results demonstrating the effectiveness of the power smoothing solution.

The Impact of Synchronized Workloads

AI training involves thousands of graphics processing units (GPUs) working in unison, performing identical computations on different portions of the data. This synchronization produces power fluctuations that are visible at the grid level. Unlike conventional data center tasks, where uncorrelated activity smooths out aggregate power consumption, AI workloads drive abrupt transitions between idle and high-power states.

Visualizing these GPUs as rows on a heatmap shows why AI data centers present unique challenges to power delivery systems. Traditional workloads run asynchronously across the computing infrastructure, but in AI training the GPUs operate in lockstep, so the total power drawn by the cluster mirrors, and even amplifies, the power pattern of a single node.
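To make the lockstep effect concrete, here is a minimal simulation (all power figures are illustrative assumptions, not measurements) comparing the aggregate draw of synchronized nodes against the same nodes with uncorrelated phases:

```python
import random

def node_power(t, phase, period=10, idle_w=200.0, peak_w=1200.0):
    """Square-wave power draw of one node: half the period at peak, half idle."""
    return peak_w if (t + phase) % period < period / 2 else idle_w

def peak_to_avg(phases, horizon=100):
    """Peak-to-average ratio of the cluster's total power over `horizon` steps."""
    totals = [sum(node_power(t, p) for p in phases) for t in range(horizon)]
    return max(totals) / (sum(totals) / len(totals))

random.seed(0)
n = 1000
lockstep = peak_to_avg([0.0] * n)                                   # synchronized AI training
staggered = peak_to_avg([random.uniform(0, 10) for _ in range(n)])  # traditional mixed load
print(f"lockstep peak/avg: {lockstep:.2f}, staggered peak/avg: {staggered:.2f}")
```

With identical phases, the cluster total is just the single-node square wave scaled by the node count, so the grid sees the full idle-to-peak swing; with random phases, the highs and lows cancel and the total stays near the average.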

Power Smoothing in the GB300 NVL72

To tackle these challenges, NVIDIA has introduced a comprehensive power smoothing solution in the GB300 platform. The solution combines several mechanisms across different operational phases: a power cap, energy storage, and a GPU power burn feature that together smooth the power demand leaving the rack.

At the start of a workload, the power cap feature regulates the GPUs’ power draw: the power controller sets a maximum power level for the GPUs and raises it gradually, at a ramp rate the grid can absorb. Ramp-down uses a complementary strategy: if a workload ends suddenly, the GPU burn system continues dissipating power by running the GPUs in a dedicated power burner mode, producing a smooth transition rather than an abrupt drop.
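The ramp-up and ramp-down behavior can be sketched as a simple rate limiter (the step sizes and demand profile below are illustrative assumptions, not NVIDIA's actual controller):

```python
def smooth_grid_power(demand, max_step=50.0, decay=25.0):
    """Shape raw rack demand (watts per time step) into a grid-friendly profile.

    Ramp-up is capped at `max_step` W per step (the power-cap phase); when
    demand drops, the level is ramped down by `decay` W per step (the
    power-burn phase) instead of falling instantly.
    """
    out, level = [], 0.0
    for d in demand:
        if d >= level:
            level = min(d, level + max_step)   # slow, capped ramp-up
        else:
            level = max(d, level - decay)      # gradual burn-down
        out.append(level)
    return out

# Idle, then a sudden 1 kW workload, then a sudden stop:
shaped = smooth_grid_power([0.0] * 2 + [1000.0] * 5 + [0.0] * 5)
```

No two consecutive values in `shaped` differ by more than the configured ramp rate, which is exactly the property the grid-facing profile needs.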

For rapid, short-term power fluctuations during steady-state operation, energy storage components, specifically electrolytic capacitors, have been integrated into the GB300 NVL72 power shelves. These energy storage units charge during periods of low GPU power demand and discharge during periods of high demand.
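The charge/discharge behavior can be modeled with a toy energy buffer (the grid limit, capacity, and demand values are assumptions for illustration; the real shelves use dedicated charge management hardware):

```python
def buffer_with_storage(demand, grid_limit, capacity):
    """Toy model: a capacitor bank covers demand above `grid_limit` and
    recharges from the grid when demand is below it (1-second steps, watts)."""
    energy = capacity          # joules stored, start full
    grid, trace = [], []
    for d in demand:
        if d > grid_limit and energy > 0:
            supplied = min(d - grid_limit, energy)   # discharge to cover the excess
            energy -= supplied
            grid.append(d - supplied)
        else:
            recharge = min(grid_limit - d, capacity - energy) if d < grid_limit else 0.0
            energy += recharge                       # charge while demand is low
            grid.append(d + recharge)
        trace.append(energy)
    return grid, trace

# A rack oscillating between 500 W and 900 W, buffered to a 700 W grid draw:
grid, stored = buffer_with_storage([500.0, 900.0] * 5, grid_limit=700.0, capacity=1000.0)
```

After the first step, the grid-side draw holds flat at the limit even though the rack-side demand keeps swinging, which is the smoothing effect the capacitors provide.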

The solution also includes power burn hardware and a software algorithm that detects when GPU power has fallen to idle levels. The software driver implementing the power smoothing algorithm activates the hardware power burner, which holds power consumption constant while waiting for the workload to resume. If it does not resume, the burner gradually ramps power down; if the GPU workload resumes, the burner disengages immediately.
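That engage, hold, taper, and disengage sequence can be sketched as a small state machine (the idle threshold, hold duration, and taper rate are illustrative assumptions):

```python
class PowerBurner:
    """Toy model of the power-burn driver: engage when GPU power drops to idle,
    hold total power constant, then taper it down unless the workload resumes."""

    def __init__(self, idle_w=200.0, hold_steps=3, taper_w=100.0):
        self.idle_w = idle_w          # below this, the GPU counts as idle
        self.hold_steps = hold_steps  # steps to wait before tapering
        self.taper_w = taper_w        # ramp-down per step once the hold expires
        self.burn = 0.0
        self.held = 0

    def step(self, gpu_w, prev_total):
        if gpu_w > self.idle_w:                       # workload active: disengage at once
            self.burn, self.held = 0.0, 0
        elif self.burn == 0.0 and prev_total > self.idle_w:
            self.burn = prev_total - gpu_w            # engage: hold total power constant
            self.held = 0
        elif self.held < self.hold_steps:
            self.held += 1                            # hold phase: wait for a resume
        else:
            self.burn = max(0.0, self.burn - self.taper_w)  # gradual ramp-down
        return gpu_w + self.burn

# A run at 1 kW, a sudden drop to idle, then the workload resumes:
burner, totals, prev = PowerBurner(), [], 0.0
for w in [1000.0] * 3 + [100.0] * 8 + [1000.0] * 2:
    prev = burner.step(w, prev)
    totals.append(prev)
```

In the trace, total power stays at 1000 W through the drop and the hold window, steps down by 100 W per step afterward, and snaps back to real GPU power the moment the workload returns.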

Users can fine-tune the system’s behavior using configurable parameters, which can be set with NVIDIA’s System Management Interface tool (nvidia-smi) or over the Redfish protocol.

Measured Benefits and Results

Empirical data from both the previous-generation (GB200) power supply units and the new (GB300) units with energy storage show significant improvements. Measurements from a GB200 rack show that, with the older power supply, the AC power drawn from the grid mirrors the fluctuations in rack power consumption. With the new energy-storage-enhanced power shelves, these input power variations are substantially reduced.

Remarkably, the peak power demand seen by the grid is reduced by 30% when training large AI models, such as the Megatron LLM, and rapid fluctuations are significantly damped.

Inside the GB300 power supply, approximately half of the unit’s volume is dedicated to capacitors for energy storage. NVIDIA collaborated with power supply vendor LITEON Technology to shrink the power electronics, using the reclaimed space to incorporate 65 joules per GPU of energy storage. Combined with a new charge management controller, this delivers a fast transient power smoothing solution at the rack level.
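A back-of-the-envelope calculation puts the 65 J/GPU figure in perspective; the rack-level transient size below is a hypothetical assumption, only the per-GPU energy and GPU count come from the system description:

```python
gpus_per_rack = 72                 # GB300 NVL72: 72 GPUs per rack
energy_per_gpu_j = 65              # joules of storage per GPU (from the article)
rack_energy_j = gpus_per_rack * energy_per_gpu_j      # total joules per rack

assumed_swing_w = 20_000           # hypothetical 20 kW rack-level transient
ride_through_s = rack_energy_j / assumed_swing_w      # seconds the buffer can cover it
print(f"{rack_energy_j} J can cover a {assumed_swing_w / 1000:.0f} kW swing "
      f"for roughly {ride_through_s * 1000:.0f} ms")
```

Hundreds of milliseconds is short, but it matches the role described in the article: bridging fast, sub-second transients rather than sustained load changes, which the power cap and burn features handle.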

System Design Implications

Incorporating energy storage not only smooths power transients but also reduces the peak demand the data center must be provisioned for. Previously, facilities had to be sized for maximum instantaneous power consumption. With effective energy storage, provisioning can track average consumption more closely, allowing more racks within the same power budget or a smaller total power allocation.
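The provisioning math is straightforward; the 30% peak reduction comes from the article, while the facility budget and per-rack peak below are hypothetical numbers for illustration:

```python
budget_w = 10_000_000       # hypothetical 10 MW data-hall power budget
rack_peak_w = 140_000       # hypothetical 140 kW per-rack peak, unsmoothed
peak_reduction = 0.30       # 30% peak reduction (from the article)

racks_before = budget_w // rack_peak_w                      # provisioned for raw peak
smoothed_peak_w = round(rack_peak_w * (1 - peak_reduction))
racks_after = budget_w // smoothed_peak_w                   # provisioned for smoothed peak
print(f"racks without smoothing: {racks_before}, with smoothing: {racks_after}")
```

Cutting the per-rack peak by 30% raises rack count under a fixed budget by roughly 1/0.7, about 43% more racks in the same facility.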

The system is designed to contain fluctuations within the rack: the compute nodes and internal DC buses are built to handle rapid power state changes, while the energy storage mechanism shapes the load profile seen by the grid and does not return energy to the utility.

Both the GB200 and GB300 NVL72 systems use multiple power shelves per rack, so strategies for integrating energy storage and load smoothing must account for aggregation at the rack and data hall levels. Reducing power peaks enables either higher rack density or lower provisioning requirements for the entire data center.

Conclusion

The energy storage and advanced ramp-rate management algorithms in the GB300 NVL72 power shelves achieve a significant reduction in the peak and transient load presented to the grid. The advanced PSU with energy storage, along with the hardware and software that implement the power cap and power burn features, will be available with the GB300 NVL72.

Data center operators are encouraged to adopt advanced power smoothing and energy storage technologies to reduce peak power consumption, increase compute density, and lower operating costs.

This work draws on contributions from Jared Huntington, Gabriele Gorla, Apoorv Gupta, and others. For additional details, refer to the original announcement on NVIDIA’s blog.


Neil S