NVIDIA Ampere-Based Architecture
- A100 accelerates workloads big and small. Whether using MIG to partition an A100 GPU into smaller instances, or NVLink to connect multiple GPUs to accelerate large-scale workloads, the A100 easily handles different-sized application needs, from the smallest job to the biggest multi-node workload.
Third-Generation Tensor Cores
- First introduced in the NVIDIA Volta architecture, NVIDIA Tensor Core technology has brought dramatic speedups to AI training and inference, cutting training times from weeks to hours and greatly accelerating inference. The NVIDIA Ampere architecture builds on these innovations by providing up to 20x higher FLOPS for AI. It does so by improving the performance of existing precisions and adding new precisions (TF32, INT8, and FP64) that accelerate and simplify AI adoption and extend the power of NVIDIA Tensor Cores to HPC.
TF32 for AI: 20x Higher Performance, Zero Code Change
- As AI networks and datasets continue to expand exponentially, their computing appetite grows with them. Lower-precision math has brought huge performance speedups, but it has historically required some code changes. A100 brings a new precision, TF32, which works just like FP32 while providing up to 20x higher FLOPS for AI without requiring any code change. NVIDIA’s automatic mixed precision feature enables a further 2x boost in performance with just one additional line of code, using FP16 precision. A100 Tensor Cores also support BFLOAT16, INT8, and INT4 precision, making A100 a versatile accelerator for both AI training and inference.
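To make the precision trade-off concrete, the sketch below emulates TF32's reduced precision in plain Python. It assumes the publicly described TF32 layout (the same sign bit and 8 exponent bits as FP32, but only 10 of the 23 mantissa bits), which this document does not state. It also truncates instead of rounding, a simplification that may differ from what the hardware does. The function name `to_tf32` is hypothetical.

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 precision by dropping the low 13 FP32 mantissa bits.

    Assumption: TF32 keeps FP32's sign bit and 8 exponent bits but only
    10 mantissa bits. Truncation is used here for simplicity.
    """
    # Reinterpret the value as a 32-bit IEEE 754 single-precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Zero the 13 least-significant mantissa bits (23 - 10 = 13).
    bits &= 0xFFFFE000
    # Reinterpret the masked pattern as a float again.
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Because the exponent field is untouched, the representable range matches FP32. Only the step between neighboring values gets coarser, so `1.0 + 2**-10` survives the conversion while `1.0 + 2**-11` collapses to `1.0`.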
Double-Precision Tensor Cores: The Biggest Milestone Since FP64 for HPC
- A100 brings the power of Tensor Cores to HPC, providing the biggest milestone since the introduction of double-precision GPU computing for HPC. The third generation of Tensor Cores in A100 enables matrix operations in full, IEEE-compliant, FP64 precision. Through enhancements in NVIDIA CUDA-X math libraries, a range of HPC applications that need double-precision math can now see a boost of up to 2.5x in performance and efficiency compared to prior generations of GPUs.
Multi-Instance GPU (MIG)
- Every AI and HPC application can benefit from acceleration, but not every application needs the performance of a full A100. With Multi-Instance GPU (MIG), each A100 can be partitioned into as many as seven GPU instances, fully isolated at the hardware level with their own high-bandwidth memory, cache, and compute cores. Now, developers can access breakthrough acceleration for all their applications, big and small, and get guaranteed quality of service. And IT administrators can offer right-sized GPU acceleration for optimal utilization and expand access to every user and application.
MIG is available in both bare-metal and virtualized environments and is supported by NVIDIA Container Runtime, which supports all major runtimes such as LXC, Docker, CRI-O, Containerd, Podman, and Singularity. Each MIG instance is a new GPU type in Kubernetes and will be available across all Kubernetes distributions, such as Red Hat OpenShift, VMware Project Pacific, and others, on-premises and on public clouds via NVIDIA Device Plugin for Kubernetes. Administrators can also benefit from hypervisor-based virtualization on MIG instances through NVIDIA vComputeServer, including KVM-based hypervisors such as Red Hat RHEL/RHV, and VMware ESXi.
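As a rough illustration of right-sizing, the sketch below checks whether a set of requested MIG instances fits on one A100 with 40 GB. The profile table is an assumption: the names follow a `<compute slices>g.<memory>gb` pattern and do not come from this document. The check only compares totals against the seven-instance and 40 GB budgets; real hardware also imposes placement rules that this sketch ignores. `PROFILES` and `fits_on_one_gpu` are hypothetical names.

```python
# Hypothetical profile table: name -> (compute slices, memory in GB).
# These entries are illustrative assumptions, not values from this document.
PROFILES = {
    "1g.5gb": (1, 5),
    "2g.10gb": (2, 10),
    "3g.20gb": (3, 20),
    "4g.20gb": (4, 20),
    "7g.40gb": (7, 40),
}

MAX_COMPUTE_SLICES = 7  # "as many as seven GPU instances"
TOTAL_MEMORY_GB = 40    # A100 with 40 GB of memory

def fits_on_one_gpu(requested):
    """Return True if the requested profiles fit within one GPU's totals.

    Only the summed compute slices and memory are checked; hardware
    placement constraints are deliberately left out of this sketch.
    """
    compute = sum(PROFILES[name][0] for name in requested)
    memory = sum(PROFILES[name][1] for name in requested)
    return compute <= MAX_COMPUTE_SLICES and memory <= TOTAL_MEMORY_GB
```

For example, seven `1g.5gb` instances or one `3g.20gb` plus one `4g.20gb` stay within budget, while two `4g.20gb` instances would need eight compute slices and are rejected.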
HBM2e
- With 40 gigabytes (GB) of high-bandwidth memory (HBM2e), A100 delivers improved raw bandwidth of 1.6 TB/s, as well as a higher dynamic random-access memory (DRAM) utilization efficiency of 95 percent. A100 delivers 1.7x higher memory bandwidth than the previous generation.
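The figures above can be turned into a quick back-of-the-envelope calculation. The sketch below assumes decimal units (1 TB = 1000 GB) and treats the 95 percent utilization efficiency as a simple multiplier on raw bandwidth, which is an interpretation and not something the text confirms. All names are hypothetical.

```python
PEAK_TB_PER_S = 1.6          # raw bandwidth quoted above
DRAM_EFFICIENCY = 0.95       # utilization efficiency quoted above
SPEEDUP_OVER_PREVIOUS = 1.7  # bandwidth gain over the previous generation
MEMORY_GB = 40

def sweep_time_ms(size_gb, bandwidth_tb_per_s):
    """Milliseconds needed to stream size_gb once at the given bandwidth."""
    seconds = size_gb / (bandwidth_tb_per_s * 1000.0)  # 1 TB = 1000 GB
    return seconds * 1000.0

# Assumed interpretation: efficiency scales the usable bandwidth.
effective_tb_per_s = PEAK_TB_PER_S * DRAM_EFFICIENCY

# Previous-generation bandwidth implied by the 1.7x claim.
previous_gen_tb_per_s = PEAK_TB_PER_S / SPEEDUP_OVER_PREVIOUS
```

Under these assumptions, one full pass over the 40 GB of memory takes about 25 ms at the quoted peak and about 26.3 ms at 95 percent efficiency, and the implied previous-generation bandwidth is roughly 0.94 TB/s.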
Structural Sparsity
- AI networks are big, having millions to billions of parameters. Not all of these parameters are needed for accurate predictions, and some can be converted to zeros to make the models “sparse” without compromising accuracy. Tensor Cores in A100 can provide up to 2x higher performance for sparse models. While the sparsity feature more readily benefits AI inference, it can also improve the performance of model training.
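One way to picture "converting parameters to zeros" is the sketch below, which prunes a flat list of weights so that every group of four keeps only its two largest-magnitude values. The 2-out-of-4 pattern is an assumption based on NVIDIA's public descriptions of A100 sparsity and is not stated in this document. Magnitude-based selection is just one common heuristic, and real workflows typically fine-tune the model after pruning. The name `prune_2_of_4` is hypothetical.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four.

    Assumption: the hardware-friendly pattern is two nonzero values per
    group of four consecutive weights.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        # Keep the selected weights and replace the rest with zeros.
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned
```

The result has the same length as the input, so half of the values in each group are zeros that sparsity-aware hardware could skip.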
Next Generation NVLink
- NVIDIA NVLink in A100 delivers 2x higher throughput than the previous generation, at up to 600 GB/s, to unleash the highest application performance possible on a single server. Two NVIDIA A100 PCIe boards can be bridged via NVLink, and multiple pairs of NVLink-connected boards can reside in a single server (the number varies with server enclosure, thermals, and power supply capacity).
Every Deep Learning Framework, 700+ GPU-Accelerated Applications
- The NVIDIA A100 Tensor Core GPU is the flagship product of the NVIDIA data center platform for deep learning, HPC, and data analytics. It accelerates every major deep learning framework and over 700 HPC applications. It’s available everywhere, from desktops to servers to cloud services, delivering both dramatic performance gains and cost-saving opportunities.
Virtualization Capabilities
- A100 supports virtualized compute workloads such as AI, deep learning, and high-performance computing (HPC) with NVIDIA Virtual Compute Server (vCS). The NVIDIA A100 PCIe is an ideal upgrade path for existing V100/V100S Tensor Core GPU infrastructure.