Volta is equipped with 640 Tensor Cores, each performing 64 floating-point fused-multiply-add (FMA) operations per clock. That delivers up to 125 TFLOPS for training and inference applications. This means that developers can run deep learning training using a mixed precision of FP16 compute with FP32 accumulate, achieving both a 3x speedup over the previous generation and rapid convergence to a deep neural network’s expected accuracy levels.
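The numbers above can be sanity-checked, and the compute pattern sketched, in a few lines. This is an illustration, not production code: the ~1.53 GHz boost clock is an assumed figure (the published V100 spec, not stated in the text), and NumPy is used only to emulate the FP16-multiply/FP32-accumulate behavior of a 4x4 Tensor Core tile on the CPU.

```python
import numpy as np

# Sanity-check the peak-throughput claim: 640 Tensor Cores, each doing
# 64 FMAs per clock, where one FMA counts as 2 floating-point ops.
# The ~1.53 GHz boost clock is an assumption (V100 spec), not from the text.
tensor_cores = 640
fma_per_core_per_clock = 64
ops_per_fma = 2            # one multiply + one add
clock_hz = 1.53e9          # assumed boost clock
tflops = tensor_cores * fma_per_core_per_clock * ops_per_fma * clock_hz / 1e12
print(f"peak: {tflops:.1f} TFLOPS")   # ~125 TFLOPS

# Emulate the mixed-precision pattern D = A*B + C on one 4x4 tile:
# FP16 inputs, with products accumulated into an FP32 result.
rng = np.random.default_rng(0)
a = rng.random((4, 4)).astype(np.float16)   # FP16 input
b = rng.random((4, 4)).astype(np.float16)   # FP16 input
c = np.zeros((4, 4), dtype=np.float32)      # FP32 accumulator
d = a.astype(np.float32) @ b.astype(np.float32) + c
print(d.dtype)             # float32
```

Accumulating in FP32 is what lets mixed-precision training keep the dynamic range needed for convergence while the multiplies run at FP16 speed.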
This 3x speedup in performance is a key breakthrough of Tensor Core technology. Now, deep learning training can happen in mere hours.