AI models are exploding in complexity as they take on next-level challenges such as conversational AI. Training them requires massive compute power and scalability.
NVIDIA A100 Tensor Cores with Tensor Float 32 (TF32) provide up to 20X higher performance over NVIDIA Volta with zero code changes, and an additional 2X boost with automatic mixed precision and FP16. When combined with NVIDIA® NVLink®, NVIDIA NVSwitch™, PCIe Gen4, NVIDIA® InfiniBand®, and the NVIDIA Magnum IO™ SDK, it’s possible to scale to thousands of A100 GPUs.
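The reason TF32 needs zero code changes is that it keeps FP32's 8-bit exponent (so the dynamic range of existing FP32 models is preserved) while reducing the mantissa to 10 bits, like FP16. The sketch below simulates that rounding in NumPy to show the contrast with FP16; it is an illustrative approximation of the numeric format, not NVIDIA's hardware implementation, and `round_to_tf32` is a hypothetical helper name.

```python
import numpy as np

def round_to_tf32(x: np.ndarray) -> np.ndarray:
    """Simulate TF32 precision: keep FP32's sign and 8-bit exponent,
    truncate the 23-bit mantissa down to TF32's 10 bits.
    (Illustrative only -- real Tensor Cores round, not truncate.)"""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Zero the low 13 mantissa bits (23 - 10 = 13).
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.array([1.0, 3.14159265, 1e20], dtype=np.float32)
tf32 = round_to_tf32(x)          # slightly coarser values, same range
fp16 = x.astype(np.float16)      # 1e20 overflows FP16's range to inf
```

Because TF32 retains FP32's range, values like `1e20` survive rounding where a cast to FP16 would overflow, which is why FP32 training scripts run on TF32 Tensor Cores without loss-scaling or other code changes.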
A training workload like BERT can be trained at scale in under a minute by 2,048 A100 GPUs, a world record for time to solution.
For the largest models with massive data tables like deep learning recommendation models (DLRM), A100 80GB reaches up to 1.3 TB of unified memory per node and delivers up to a 3X throughput increase over A100 40GB.
NVIDIA’s leadership in MLPerf is demonstrated by multiple performance records set in the industry-wide benchmark for AI training.