Nvidia has long held the title of market leader in AI computing for the cloud, where the majority of AI computation, both training and inference, takes place. However, this year brought increased competition from other chip companies, making it crucial for Nvidia to reassert its dominance at its annual GTC event. The unveiling of the Blackwell family of products was a significant step in that direction.
Reestablishing GPU Leadership with the Blackwell Family
At GTC 2024, Nvidia showcased its new Blackwell family of GPUs, designed to scale from single GPUs to entire datacenters. These products are interconnected using Nvidia’s Mellanox InfiniBand technology, acquired in 2019. That acquisition has been pivotal to Nvidia’s ability to build hyperscale and HPC-scale systems with low latency and high bandwidth.
The Power of the B200 Chip
The Blackwell B200 chip stands out in this new lineup. It combines two 104-billion-transistor GPU dies into a single package, for 208 billion transistors in total, paired with HBM3E memory. Both the B100 and B200 variants use 8 Gbps HBM3E on a 4,096-bit bus per die, delivering 8 TB/s of memory bandwidth from 192GB of VRAM. The B200 outperforms the B100 with a peak FP4 dense tensor performance of 9 PFLOPS, compared to the B100’s 7 PFLOPS. Both GPUs support NVLink 5’s 1,800 GB/s interconnect bandwidth and PCIe 6.0.
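As a quick sanity check on the memory figures, here is a back-of-the-envelope calculation. It assumes one 4,096-bit HBM3E interface per die running at 8 Gbps per pin; the per-die bus width is an inference from the stated totals rather than an official spec-sheet breakdown.

```python
# Back-of-the-envelope check of the B200 memory-bandwidth figure,
# assuming one 4,096-bit HBM3E interface per die at 8 Gbps per pin.
pin_rate_gbps = 8        # HBM3E data rate per pin (Gbps), as cited above
bus_width_bits = 4096    # assumed memory bus width per Blackwell die
dies = 2                 # the B200 packages two GPU dies

total_bus_bits = bus_width_bits * dies
bandwidth_gb_s = pin_rate_gbps * total_bus_bits / 8  # bits -> bytes

print(f"Aggregate memory bandwidth: ~{bandwidth_gb_s / 1000:.1f} TB/s")  # ~8.2 TB/s
```

The result lands at roughly 8 TB/s, matching the figure Nvidia quotes for both the B100 and B200.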
Manufactured on TSMC’s 4NP process node, the B200 has a 1,000-watt TDP. Nvidia designed the B100 as a drop-in replacement for the H100, maintaining the same 700-watt TDP while delivering roughly 80% faster performance. The B200’s extra power headroom buys it roughly 28% more peak tensor throughput than the B100 (9 versus 7 PFLOPS FP4 dense), underscoring how far the Blackwell architecture advances beyond Nvidia’s previous Hopper architecture.
The GB200 “Superchip”
Taking performance to another level, the GB200 “superchip” pairs two B200 GPUs with a Grace Arm server CPU, connected by 900 GB/s NVLink chip-to-chip interconnects. The GB200 delivers 20 PFLOPS of FP4 tensor performance, more than double that of a single B200, and includes 384GB of HBM3E memory. Integrating roughly 496 billion transistors (four Blackwell dies plus an 80-billion-transistor Grace server chip), the GB200 has a 2,700-watt TDP. It is available in rack-mount configurations and in more compact DGX/HGX systems.
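The headline GB200 numbers follow directly from the per-die figures above. A minimal sketch of that arithmetic, using the transistor and memory counts cited in this article rather than any official spec-sheet breakdown:

```python
# Aggregate GB200 "superchip" figures derived from the per-die numbers above.
# These are back-of-the-envelope sums based on the counts cited in the text.
blackwell_die_transistors_b = 104   # billions of transistors per Blackwell die
grace_transistors_b = 80            # billions of transistors in the Grace CPU (as cited)
hbm3e_per_b200_gb = 192             # GB of HBM3E per B200 (two dies)

blackwell_dies = 4                  # two B200s = four Blackwell dies
b200_count = 2

total_transistors_b = blackwell_dies * blackwell_die_transistors_b + grace_transistors_b
total_hbm3e_gb = b200_count * hbm3e_per_b200_gb

print(f"Transistors: ~{total_transistors_b} billion")  # ~496 billion
print(f"HBM3E capacity: {total_hbm3e_gb} GB")          # 384 GB
```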
Efficiency and Scalability
Despite its power demands, the Blackwell architecture offers significant space and power savings at scale. Nvidia highlighted that training a 1.8-trillion-parameter GPT-MoE model with Hopper GPUs would require 8,000 GPUs and 15 megawatts over 90 days. In contrast, Blackwell GB200 NVL72 systems would need only 2,000 GPUs and 4 megawatts over the same 90 days. This efficiency is crucial as power consumption becomes a more pressing issue for AI and cloud computing.
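To put those training figures in perspective, here is a rough energy comparison based on the numbers Nvidia quoted, assuming the stated power draw is sustained for the full 90-day run:

```python
# Rough training-energy comparison from Nvidia's quoted figures:
# 8,000 Hopper GPUs at 15 MW vs. 2,000 Blackwell GPUs at 4 MW, both over 90 days.
# Assumes the quoted power is sustained for the whole run.
days = 90
hours = days * 24

hopper_mw, hopper_gpus = 15, 8000
blackwell_mw, blackwell_gpus = 4, 2000

hopper_mwh = hopper_mw * hours        # 32,400 MWh
blackwell_mwh = blackwell_mw * hours  #  8,640 MWh

print(f"Hopper:    {hopper_gpus} GPUs, {hopper_mwh:,} MWh")
print(f"Blackwell: {blackwell_gpus} GPUs, {blackwell_mwh:,} MWh")
print(f"Energy reduction: {hopper_mwh / blackwell_mwh:.2f}x, "
      f"GPU reduction: {hopper_gpus / blackwell_gpus:.0f}x")
```

By this estimate, the same training run uses roughly a quarter of the energy and a quarter of the GPUs.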
Accelerating AI Adoption with NIM Microservices
To further drive AI adoption, Nvidia announced a new catalog of NIM microservices and cloud endpoints for pretrained AI models, optimized for CUDA-capable Nvidia GPUs. These microservices support tasks such as LLM customization, inference, retrieval-augmented generation, and guardrails. Available at no charge on Nvidia’s website and integrated with its AI Enterprise 5.0 software suite, these microservices aim to improve time-to-market and result quality, benefiting the industry as a whole.
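To make the idea concrete, here is a minimal sketch of what calling one of these hosted inference endpoints could look like. It assumes an OpenAI-style chat-completions API over HTTPS; the endpoint URL, model name, and API-key environment variable below are illustrative assumptions rather than details confirmed in the announcement.

```python
# Hypothetical example of querying a hosted NIM-style inference endpoint.
# URL, model name, and env var are illustrative assumptions, not confirmed details.
import os
import requests

API_KEY = os.environ["NVIDIA_API_KEY"]  # assumed env var holding your API key
URL = "https://integrate.api.nvidia.com/v1/chat/completions"  # assumed endpoint

payload = {
    "model": "meta/llama3-8b-instruct",  # assumed catalog model name
    "messages": [
        {"role": "user", "content": "Summarize the Blackwell B200 in one sentence."}
    ],
    "max_tokens": 128,
}

resp = requests.post(
    URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The appeal of the microservice packaging is exactly this: the same request pattern works whether the model runs behind Nvidia’s cloud endpoints or as a container on a customer’s own CUDA-capable GPUs.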
Conclusion
Nvidia’s unveiling of the Blackwell B200 chip and the broader Blackwell family at GTC 2024 marks a significant milestone in AI computing. By reestablishing its leadership in GPU technology and demonstrating substantial improvements in performance and efficiency, Nvidia has set a new standard for the industry. As competition in the AI computing market intensifies, Nvidia’s innovations in the Blackwell architecture will play a crucial role in maintaining its dominant position and driving the next era of AI advancements.