Tensor Cores

NVIDIA TENSOR CORES

POWERING THE AI AND DEEP LEARNING REVOLUTION

Select NVIDIA Quadro Volta architecture GPUs, and all Quadro RTX series GPUs, feature Tensor Cores, a breakthrough technology that provides unprecedented AI performance. Tensor Cores accelerate the matrix operations that are foundational to AI, performing mixed-precision matrix multiply and accumulate calculations in a single operation. With hundreds of Tensor Cores operating in parallel in a single NVIDIA Quadro GPU, this enables massive increases in throughput and efficiency.

BREAKTHROUGH INFERENCE PERFORMANCE

Quadro RTX Powered by Turing Tensor Cores

Quadro RTX series products introduce NVIDIA Turing Tensor Core technology with multi-precision computing for the world's most efficient AI inference. Turing Tensor Cores provide a full range of inference precisions, from FP32 and FP16 down to INT8 and INT4, delivering giant leaps in performance over NVIDIA Pascal GPUs.

THE MOST ADVANCED TRAINING AND INFERENCE PLATFORM

Quadro RTX delivers breakthrough performance for deep learning, with FP32 and FP16 for training and INT8, INT4, and binary precisions for inference. With up to 500 TeraOPS (TOPS) available (Quadro RTX 8000), Turing delivers the world's highest inference efficiency at a fraction of the power consumption required by traditional CPUs. Quadro RTX is an ideal solution for scale-out servers and inference solutions at the edge.

THE WORLD’S HIGHEST DEEP LEARNING THROUGHPUT

NVIDIA Quadro GV100 Powered by Volta Tensor Cores

Designed specifically for deep learning, the first-generation Tensor Cores in Volta deliver groundbreaking performance with mixed-precision matrix multiply in FP16 and FP32, providing up to 12x higher peak teraflops (TFLOPS) for training and 6x higher peak TFLOPS for inference over prior-generation NVIDIA Pascal offerings. This key capability enables Volta to deliver 3x performance speedups in training and inference over Pascal.

Each of Quadro GV100's 640 Tensor Cores operates on a 4 x 4 matrix, and their associated data paths are custom-designed to power the world's fastest floating point compute throughput with high energy efficiency.
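The 4 x 4 operation described above can be sketched in plain Python. This is an illustrative model, not real Tensor Core code: on hardware, A and B hold FP16 values and each product is accumulated in FP32 within a single clock, while ordinary Python floats stand in for both precisions here.

```python
def tensor_core_mma(A, B, C):
    """Return D = A x B + C for 4x4 matrices given as lists of lists.

    Models the multiply-accumulate a single Tensor Core performs on a
    4x4 tile: on hardware A and B are FP16 and the accumulation runs
    in FP32; here plain Python floats approximate both.
    """
    N = 4
    D = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = C[i][j]                  # accumulator seeded from C
            for k in range(N):
                acc += A[i][k] * B[k][j]   # one fused multiply-add
            D[i][j] = acc
    return D
```

Note that each call performs 4 x 4 x 4 = 64 multiply-adds, matching the 64 FMA operations per clock cited for each Volta Tensor Core later in this document.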


A BREAKTHROUGH IN TRAINING AND INFERENCE

For inference, Quadro GV100 also achieves more than a 3x performance advantage over the previous generation and is 47x faster than a CPU-based server. With the NVIDIA TensorRT Programmable Inference Accelerator, these speedups come in large part from Tensor Cores accelerating inference work using mixed precision.


Volta is equipped with 640 Tensor Cores, each performing 64 floating-point fused multiply-add (FMA) operations per clock. That delivers up to 125 TFLOPS for training and inference applications. This means that developers can run deep learning training using mixed precision, FP16 compute with FP32 accumulate, achieving both a 3x speedup over the previous generation and rapid convergence to a deep neural network's expected accuracy levels.
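The 125 TFLOPS figure follows directly from the per-core numbers above. A back-of-envelope check, assuming a GPU boost clock of roughly 1.53 GHz (the clock rate is an assumption; it is not stated in this document):

```python
# Peak Tensor Core throughput = cores x FMAs/clock x FLOPs/FMA x clock
tensor_cores = 640       # Tensor Cores in Volta (per this document)
fmas_per_clock = 64      # FMA operations per core per clock
flops_per_fma = 2        # one multiply plus one add
clock_hz = 1.53e9        # assumed boost clock, not stated in the source

tflops = tensor_cores * fmas_per_clock * flops_per_fma * clock_hz / 1e12
print(round(tflops, 1))  # ~125.3
```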

This 3x speedup in performance is a key breakthrough of Tensor Core technology. Now, deep learning training can happen in mere hours.

Contact gopny@pny.com for additional information.