PNY Technologies Inc.
NVIDIA® H100

  • SKU: NVH100TCGPU-KIT
  • Description

    NVIDIA H100 PCIe

    Unprecedented Performance, Scalability, and Security for Every Data Center

    The NVIDIA® H100 Tensor Core GPU enables an order-of-magnitude leap for large-scale AI and HPC with unprecedented performance, scalability, and security for every data center, and includes the NVIDIA AI Enterprise software suite to streamline AI development and deployment. H100 accelerates exascale workloads with a dedicated Transformer Engine for trillion-parameter language models. For smaller jobs, H100 can be partitioned down to right-sized Multi-Instance GPU (MIG) partitions. With Hopper Confidential Computing, this scalable compute power can secure sensitive applications on shared data center infrastructure. The inclusion of NVIDIA AI Enterprise with H100 PCIe purchases reduces development time, simplifies deployment of AI workloads, and makes H100 the most powerful end-to-end AI and HPC data center platform.

    The NVIDIA Hopper architecture delivers unprecedented performance, scalability, and security to every data center. Hopper builds on prior generations with advances ranging from new compute core capabilities, such as the Transformer Engine, to faster networking, powering the data center with an order-of-magnitude speedup over the prior generation. NVIDIA NVLink supports ultra-high bandwidth and extremely low latency between two H100 boards, and supports memory pooling and performance scaling (application support required). Second-generation MIG securely partitions the GPU into isolated, right-sized instances to maximize quality of service (QoS) for 7x more secured tenants. The inclusion of NVIDIA AI Enterprise (exclusive to the H100 PCIe), a software suite that optimizes the development and deployment of accelerated AI workflows, maximizes performance through these new H100 architectural innovations. These technology breakthroughs fuel the H100 Tensor Core GPU, the world's most advanced GPU ever built.

     

    Performance Highlights

    FP64: 26 TFLOPS
    FP64 Tensor Core: 51 TFLOPS
    FP32: 51 TFLOPS
    TF32 Tensor Core: 756 TFLOPS | Sparsity
    BFLOAT16 Tensor Core: 1513 TFLOPS | Sparsity
    FP16 Tensor Core: 1513 TFLOPS | Sparsity
    FP8 Tensor Core: 3026 TFLOPS | Sparsity
    INT8 Tensor Core: 3026 TOPS | Sparsity
    GPU Memory: 80 GB HBM2e
    GPU Memory Bandwidth: 2.0 TB/sec
    Maximum Power Consumption: 350 W

    World's Most Advanced Chip

    • Built with 80 billion transistors using a cutting-edge TSMC 4N process custom-tailored for NVIDIA's accelerated compute needs, H100 is the world's most advanced chip ever built. It features major advances to accelerate AI, HPC, memory bandwidth, interconnect, and communication at data center scale.

    NVIDIA Hopper Architecture

    • The NVIDIA H100 Tensor Core GPU powered by the NVIDIA Hopper GPU architecture delivers the next massive leap in accelerated computing performance for NVIDIA's data center platforms. H100 securely accelerates diverse workloads from small enterprise workloads, to exascale HPC, to trillion parameter AI models. Implemented using TSMC's 4N process customized for NVIDIA with 80 billion transistors, and including numerous architectural advances, H100 is the world's most advanced chip ever built.

    Fourth-Generation Tensor Cores

    • New fourth-generation Tensor Cores are up to 6x faster chip-to-chip than A100, a gain that combines the per-SM speedup, the additional SM count, and the higher clocks of H100. On a per-SM basis, the Tensor Cores deliver 2x the MMA (Matrix Multiply-Accumulate) computational rate of the A100 SM on equivalent data types, and 4x that rate using the new FP8 data type compared to the previous generation's 16-bit floating-point options. The Sparsity feature exploits fine-grained structured sparsity in deep learning networks, doubling the performance of standard Tensor Core operations.
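
    To make the MMA operation concrete, here is a minimal sketch using CUDA's portable nvcuda::wmma API to multiply two 16x16 FP16 tiles into an FP32 accumulator. It illustrates Tensor Core programming in general (the WMMA API predates Hopper) rather than any H100-specific instruction; the kernel name and launch parameters are illustrative.

    ```cuda
    #include <mma.h>
    #include <cuda_fp16.h>

    using namespace nvcuda;

    // One warp computes D = A * B + 0 on a single 16x16x16 tile.
    // a and b are row-major FP16 tiles; the accumulator is FP32.
    __global__ void tile_mma(const half *a, const half *b, float *d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::fill_fragment(acc_frag, 0.0f);                 // zero the accumulator
        wmma::load_matrix_sync(a_frag, a, 16);               // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // one Tensor Core MMA
        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }

    // Launch with a single warp: tile_mma<<<1, 32>>>(d_a, d_b, d_d);
    ```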

    Structural Sparsity

    • AI networks are big, having millions to billions of parameters. Not all of these parameters are needed for accurate predictions, and some can be converted to zeros to make the models “sparse” without compromising accuracy. Tensor Cores in H100 can provide up to 2x higher performance for sparse models. While the sparsity feature more readily benefits AI inference, it can also improve the performance of model training.
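
    The pattern behind this feature is 2:4 structured sparsity: at most two nonzero values in every group of four consecutive weights. Below is a minimal CUDA sketch of offline magnitude pruning to that pattern; in practice, the sparse Tensor Core path is typically driven through NVIDIA libraries such as cuSPARSELt, which this sketch does not use.

    ```cuda
    #include <cfloat>  // FLT_MAX

    // Prune weights to the 2:4 structured-sparsity pattern by zeroing the two
    // smallest-magnitude entries in each consecutive group of four.
    __global__ void prune_2_of_4(float *w, int n)
    {
        int g = (blockIdx.x * blockDim.x + threadIdx.x) * 4;  // group start
        if (g + 3 >= n) return;

        for (int pass = 0; pass < 2; ++pass) {  // zero the smallest survivor, twice
            int   min_idx = -1;
            float min_abs = FLT_MAX;
            for (int i = 0; i < 4; ++i) {
                float v = fabsf(w[g + i]);
                if (w[g + i] != 0.0f && v < min_abs) { min_abs = v; min_idx = g + i; }
            }
            if (min_idx >= 0) w[min_idx] = 0.0f;  // groups may already contain zeros
        }
    }
    ```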

    Transformer Engine Supercharges AI, Up to 30x Higher Performance

    • Transformer models are the backbone of language models used widely today, from BERT to GPT-3. Initially developed for natural language processing (NLP) use cases, the Transformer's versatility is increasingly being applied to computer vision, drug discovery, and more. Model sizes continue to increase exponentially, now reaching trillions of parameters, and the sheer volume of math-bound computation stretches training times into months, which is impractical for business needs. The Transformer Engine uses software and custom Hopper Tensor Core technology designed specifically to accelerate training for models built from the world's most important AI model building block, the Transformer. Hopper Tensor Cores can apply mixed 8-bit floating point (FP8) and FP16 precision formats to dramatically accelerate AI calculations for transformers.
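
    CUDA 11.8 and later expose the two FP8 formats as host/device types in cuda_fp8.h. Below is a minimal sketch of the precision trade-off between them (E4M3 carries more mantissa bits, E5M2 more dynamic range); the values shown are illustrative.

    ```cuda
    #include <cstdio>
    #include <cuda_fp8.h>  // __nv_fp8_e4m3 / __nv_fp8_e5m2 (CUDA 11.8+)

    int main()
    {
        float x = 3.14159f;

        // Round-trip a value through both 8-bit floating-point formats.
        float via_e4m3 = float(__nv_fp8_e4m3(x));  // 4 exponent, 3 mantissa bits
        float via_e5m2 = float(__nv_fp8_e5m2(x));  // 5 exponent, 2 mantissa bits

        printf("fp32: %f  e4m3: %f  e5m2: %f\n", x, via_e4m3, via_e5m2);
        return 0;
    }
    ```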

    New DPX Instructions

    • Dynamic programming is an algorithmic technique for solving a complex recursive problem by breaking it down into simpler subproblems. By storing the results of subproblems so that they need not be recomputed later, it reduces the time and complexity of solving otherwise exponential problems. Dynamic programming is commonly used across a broad range of use cases. For example, Floyd-Warshall is a route optimization algorithm that can be used to map the shortest routes for shipping and delivery fleets, and the Smith-Waterman algorithm is used for DNA sequence alignment and protein folding applications. Hopper introduces DPX instructions to accelerate dynamic programming algorithms by up to 40x compared to CPUs and up to 7x compared to NVIDIA Ampere architecture GPUs. This leads to dramatically faster times in disease diagnosis, real-time routing optimizations, and even graph analytics.
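
    As a concrete example of the dynamic-programming pattern DPX accelerates, here is a minimal, unblocked CUDA sketch of one Floyd-Warshall relaxation pass. It shows the min/add inner update itself; it does not explicitly invoke DPX instructions, and the matrix layout is illustrative.

    ```cuda
    // dist is an n x n row-major matrix of current shortest-path estimates
    // (dist[i*n + i] initialized to 0, missing edges to a large value).
    // Each launch relaxes all pairs (i, j) through intermediate vertex k.
    __global__ void floyd_warshall_step(float *dist, int n, int k)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= n || j >= n) return;

        // DP recurrence: the shortest i->j path over vertices {0..k}
        // either avoids vertex k or passes through it.
        float through_k = dist[i * n + k] + dist[k * n + j];
        if (through_k < dist[i * n + j]) dist[i * n + j] = through_k;
    }

    // Host side: for (int k = 0; k < n; ++k)
    //     floyd_warshall_step<<<grid, block>>>(d_dist, n, k);
    ```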

    New Thread Block Cluster Feature

    • Allows programmatic control of locality at a granularity larger than a single Thread Block on a single SM. This extends the CUDA programming model by adding another level to the programming hierarchy, which now includes Threads, Thread Blocks, Thread Block Clusters, and Grids. Clusters enable multiple Thread Blocks running concurrently across multiple SMs to synchronize and collaboratively fetch and exchange data.
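
    A minimal sketch of this hierarchy as exposed through CUDA 12 cooperative groups (requires compilation for compute capability 9.0); the cluster size, buffer size, and kernel name here are illustrative.

    ```cuda
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Two Thread Blocks per Cluster; each block reads its partner's shared memory.
    __global__ void __cluster_dims__(2, 1, 1) cluster_exchange(int *out)
    {
        __shared__ int smem[128];  // launch with 128 threads per block
        cg::cluster_group cluster = cg::this_cluster();

        smem[threadIdx.x] = (int)(threadIdx.x + cluster.block_rank() * 1000);
        cluster.sync();  // every block in the cluster has filled its shared memory

        // Distributed shared memory: map the partner block's smem into our view.
        unsigned partner = (cluster.block_rank() + 1) % cluster.num_blocks();
        int *remote = cluster.map_shared_rank(smem, partner);
        out[blockIdx.x * blockDim.x + threadIdx.x] = remote[threadIdx.x];

        cluster.sync();  // keep smem alive until the partner finishes reading it
    }

    // Launch with grid.x a multiple of 2: cluster_exchange<<<grid, 128>>>(d_out);
    ```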

    Enhanced Asynchronous Execution Features

    • New asynchronous execution features include a new Tensor Memory Accelerator (TMA) unit that can transfer large blocks of data very efficiently between global memory and shared memory. TMA also supports asynchronous copies between Thread Blocks in a Cluster, and a new Asynchronous Transaction Barrier enables atomic data movement and synchronization.
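
    In CUDA source code, these capabilities surface through the asynchronous copy APIs. Below is a minimal sketch using cooperative groups' memcpy_async to stage a tile into shared memory; whether a given copy is serviced by dedicated copy hardware is an implementation detail of the compiler and architecture, and the kernel name and sizes are illustrative.

    ```cuda
    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    namespace cg = cooperative_groups;

    // Stage a per-block tile from global into shared memory asynchronously, then
    // reduce it. out is assumed zero-initialized; launch with dynamic shared
    // memory of per_block floats.
    __global__ void staged_sum(const float *in, float *out, int per_block)
    {
        extern __shared__ float smem[];
        cg::thread_block block = cg::this_thread_block();

        // The bulk copy proceeds asynchronously; threads could perform
        // independent work here while the data is in flight.
        cg::memcpy_async(block, smem, in + (size_t)blockIdx.x * per_block,
                         sizeof(float) * per_block);
        cg::wait(block);  // barrier: the tile is now visible in shared memory

        float s = 0.0f;
        for (int i = threadIdx.x; i < per_block; i += blockDim.x) s += smem[i];
        atomicAdd(&out[blockIdx.x], s);
    }

    // Launch: staged_sum<<<blocks, 256, per_block * sizeof(float)>>>(in, out, per_block);
    ```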

    Second-Generation Multi-Instance GPU (MIG) Technology

    • With Multi-Instance GPU (MIG), previously introduced in Ampere, a GPU can be partitioned into several smaller, fully isolated instances with their own memory, cache, and compute cores. The Hopper architecture further enhances MIG by supporting multi-tenant, multi-user configurations in virtualized environments across up to seven secure GPU instances, isolating each instance with confidential computing at the hardware and hypervisor level. Dedicated video decoders for each MIG instance deliver secure, high-throughput intelligent video analytics (IVA) on shared infrastructure. With Hopper's concurrent MIG profiling, administrators can monitor right-sized GPU acceleration and optimize resource allocation for users. Researchers with smaller workloads can elect to use MIG to securely isolate a portion of a GPU, rather than renting a full CSP instance, while being assured that their data is secure at rest, in transit, and in use.

    New Confidential Computing Support

    • Today's confidential computing solutions are CPU-based, which is too limited for compute-intensive workloads like AI and HPC. NVIDIA Confidential Computing is a built-in security feature of the NVIDIA Hopper architecture that makes NVIDIA H100 the world's first accelerator with confidential computing capabilities. Users can protect the confidentiality and integrity of their data and applications in use while accessing the unsurpassed acceleration of H100 GPUs. It creates a hardware-based trusted execution environment (TEE) that secures and isolates the entire workload running on a single H100 GPU, multiple H100 GPUs within a node, or individual MIG instances. GPU-accelerated applications can run unchanged within the TEE and don't have to be partitioned. Users can combine the power of NVIDIA software for AI and HPC with the security of a hardware root of trust offered by NVIDIA Confidential Computing.

    HBM2e Memory Subsystem

    • H100 brings massive amounts of compute to data centers. To fully utilize that compute performance, the NVIDIA H100 PCIe features HBM2e memory with a class-leading 2 terabytes per second (TB/sec) of memory bandwidth, a 50 percent increase over the previous generation. In addition to 80 gigabytes (GB) of HBM2e memory, H100 includes 50 megabytes (MB) of L2 cache. The combination of this faster HBM memory and larger cache provides the capacity to accelerate the most computationally intensive AI models.

    Fourth-Generation NVIDIA NVLink

    • Provides a 3x bandwidth increase on all-reduce operations and a 50% general bandwidth increase over the prior-generation NVLink, with 900 GB/sec total bandwidth for multi-GPU IO, roughly 7x the bandwidth of PCIe Gen5.

    PCIe Gen5 for State of the Art CPUs and DPUs

    • The H100 is NVIDIA's first GPU to support PCIe Gen5, providing the highest speeds possible at 128GB/s (bi-directional). This fast communication enables optimal connectivity with the highest performing CPUs, as well as with NVIDIA ConnectX-7 SmartNICs and BlueField-3 DPUs, which allow up to 400Gb/s Ethernet or NDR 400Gb/s InfiniBand networking acceleration for secure HPC and AI workloads.

    Enterprise Ready: AI Software Streamlines Development and Deployment

    • Enterprise adoption of AI is now mainstream and organizations require end-to-end, AI ready infrastructure that will future proof them for this new era. NVIDIA H100 Tensor Core GPUs for mainstream servers (PCIe) come with NVIDIA AI Enterprise software, making AI accessible to nearly every organization with the highest performance in training, inference, and data-science. NVIDIA AI Enterprise together with NVIDIA H100 simplifies the building of an AI-ready platform, accelerates AI development and deployment with enterprise-grade support, and delivers the performance, security, and scalability to gather insights faster and achieve business value sooner.

    Warranty

    Free dedicated phone and email technical support
    (1-800-230-0130)

    Dedicated NVIDIA professional products Field Application Engineers

    Contact gopny@pny.com for additional information.

  • Features

    NVIDIA H100

    PERFORMANCE AND USABILITY FEATURES

    Data Center Class Reliability

    Designed for 24 x 7 data center operations and driven by power-efficient hardware and components selected for optimum performance, durability, and longevity. Every NVIDIA H100 board is designed, built and tested by NVIDIA® to the most rigorous quality and performance standards, ensuring that leading OEMs and systems integrators can meet or exceed the most demanding real-world conditions.

    NVIDIA Hopper Architecture

    The H100 PCIe Gen5 configuration provides all the capabilities of H100 SXM5 GPUs in just 350 watts of Thermal Design Power (TDP). This configuration can optionally use an NVLink bridge to connect two GPUs at 600 GB/s of bandwidth, nearly five times that of PCIe Gen5. Well suited for mainstream accelerated servers that go into standard racks and offer lower power per server, the H100 PCIe provides great performance for applications that scale to one or two GPUs at a time, including AI inference and some HPC applications. Across a basket of ten top data analytics, AI, and HPC applications, a single H100 PCIe GPU efficiently delivers 65% of the performance of the H100 SXM5 GPU while consuming 50% of the power.

    H100 SM Architecture

    Building upon the NVIDIA A100 Tensor Core GPU SM architecture, the H100 SM quadruples A100's peak per-SM floating point computational power, due to the introduction of FP8, and doubles A100's raw SM computational power on all previous Tensor Core and FP32 / FP64 data types, clock-for-clock. The new Transformer Engine, combined with Hopper's FP8 Tensor Cores, delivers up to 9x faster AI training and 30x faster AI inference speedups on large language models compared to the prior generation A100. Hopper's new DPX instructions enable up to 7x faster Smith-Waterman algorithm processing for genomics and protein sequencing. Hopper's new fourth-generation Tensor Core, Tensor Memory Accelerator, and many other new SM and general H100 architecture improvements together deliver up to 3x faster HPC and AI performance in many other cases.

    H100 Tensor Core Architecture

    Tensor Cores are specialized high-performance compute cores for matrix multiply and accumulate (MMA) math operations that provide groundbreaking performance for AI and HPC applications. Tensor Cores operating in parallel across SMs in one NVIDIA GPU deliver massive increases in throughput and efficiency compared to standard Floating-Point (FP), Integer (INT), and FMA (Fused Multiply-Accumulate) operations. Tensor Cores were first introduced in the NVIDIA Tesla V100 GPU, and further enhanced in each new NVIDIA GPU architecture generation. The new fourth-generation Tensor Core architecture in H100 delivers double the raw dense and sparse matrix math throughput per SM, clock-for-clock, compared to A100, and even more when considering the higher GPU Boost clock of H100 over A100. FP8, FP16, BF16, TF32, FP64, and INT8 MMA data types are supported.

    H100 Compute Performance Summary

    Overall, H100 provides approximately 6x the compute performance of A100 when factoring in all of its new compute technology advances. To summarize: H100's 132 SMs provide a 22% SM count increase over A100's 108 SMs; each H100 SM is 2x faster thanks to its fourth-generation Tensor Cores; within each Tensor Core, the new FP8 format and associated Transformer Engine provide another 2x improvement; and increased clock frequencies deliver roughly another 1.3x. In total, these improvements give H100 approximately 6x the peak compute throughput of A100, a major leap for the world's most compute-hungry workloads.
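
    As a worked check of that arithmetic, with the factors exactly as named above:

    ```latex
    \underbrace{\tfrac{132}{108}}_{\text{SM count}} \times
    \underbrace{2}_{\text{4th-gen Tensor Core}} \times
    \underbrace{2}_{\text{FP8 Transformer Engine}} \times
    \underbrace{1.3}_{\text{clock frequency}}
    \;\approx\; 1.22 \times 2 \times 2 \times 1.3 \;\approx\; 6.3
    ```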

    MULTI-GPU TECHNOLOGY SUPPORT

    Fourth-Generation NVLink

    Provides a 3x bandwidth increase on all-reduce operations and a 50% general bandwidth increase over the prior-generation NVLink, with 900 GB/sec total bandwidth for multi-GPU IO, roughly 7x the bandwidth of PCIe Gen5.

    SOFTWARE SUPPORT

    NVIDIA AI Enterprise is Bundled with the H100 PCIe

    Support for NVIDIA Virtual Compute Server (vCS) accelerates virtualized compute workloads such as high-performance computing (HPC), AI, data science, and big-data analytics. Additionally, NVIDIA offers every H100 PCIe purchaser an NVIDIA AI Enterprise license: an end-to-end, cloud-native suite of AI and data analytics software, optimized so every organization can excel at AI, certified to deploy anywhere from the enterprise data center to the cloud, and backed by global enterprise support so AI projects stay on track.

    Software Optimized for AI

    Deep learning frameworks such as Caffe2, MXNet, CNTK, TensorFlow, and others deliver dramatically faster training times and higher multi-node training performance. GPU-accelerated libraries such as cuDNN, cuBLAS, and TensorRT deliver higher performance for both deep learning inference and High-Performance Computing (HPC) applications.

    NVIDIA CUDA Parallel Computing Platform

    Natively execute standard programming languages like C/C++ and Fortran, and APIs such as OpenCL, OpenACC, and DirectCompute, to accelerate techniques such as ray tracing, video and image processing, and computational fluid dynamics.
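
    A minimal, self-contained CUDA C++ example of the platform in action, the classic SAXPY kernel, compilable with nvcc (array size and values are illustrative):

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    // Classic SAXPY: y[i] = a * x[i] + y[i], one element per thread.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));  // unified memory for brevity
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f (expected 4.0)\n", y[0]);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }
    ```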

  • Specifications

    NVIDIA H100

    SPECIFICATIONS

    Product: NVIDIA H100 Tensor Core GPU Accelerator
    Architecture: Hopper
    Process Size: TSMC 4N
    Transistors: 80 Billion
    Die Size: 814 mm²
    FP64: 26 TFLOPS
    FP64 Tensor Core: 51 TFLOPS
    FP32: 51 TFLOPS
    TF32 Tensor Core: 756 TFLOPS | Sparsity
    BFLOAT16 Tensor Core: 1513 TFLOPS | Sparsity
    FP16 Tensor Core: 1513 TFLOPS | Sparsity
    FP8 Tensor Core: 3026 TFLOPS | Sparsity
    INT8 Tensor Core: 3026 TOPS | Sparsity
    GPU Memory: 80 GB HBM2e
    Memory Bandwidth: 2.0 TB/sec
    NVLink: 2-Way, 2-Slot, 600 GB/s Bidirectional
    Gen2 MIG (Multi-Instance GPU) Support: Yes, Up to 7 GPU Instances at 10 GB Each
    vGPU Support: NVIDIA Virtual Compute Server with MIG Support
    NVIDIA AI Enterprise Support: Bundled with NVIDIA H100 PCIe
    Maximum Power Consumption: 350 W
    Thermal Solution: Passive

    AVAILABLE ACCESSORIES

    • RTXA6000NVLINK-KIT provides an NVLink connector for the NVIDIA H100 PCIe suitable for standard PCIe slot spacing motherboards, effectively fusing two physical boards into one logical entity. Application support is required and each pair of H100 PCIe boards requires three (3) NVLink kits for correct operation.
    • RTXA6000NVLINK-3S-KIT provides an NVLink connector for the NVIDIA H100 PCIe for motherboards implementing wider PCIe slot spacing. All other features, benefits, application support, and three (3) NVLink kits per pair of H100 boards are identical to the standard slot spacing version.

    SUPPORTED OPERATING SYSTEMS

    • Windows Server 2012 R2
    • Windows Server 2016 1607, 1709
    • Windows Server 2019
    • RedHat CoreOS 4.7
    • Red Hat Enterprise Linux 8.1-8.3
    • Red Hat Enterprise Linux 7.7-7.9
    • Red Hat Linux 6.6+
    • SUSE Linux Enterprise Server 15 SP2
    • SUSE Linux Enterprise Server 12 SP3+
    • Ubuntu 14.04 LTS / 16.04 LTS / 18.04 LTS / 20.04 LTS

    WARRANTY

    • Dedicated NVIDIA professional products Field Application Engineers

    PACKAGE CONTAINS

    • NVIDIA H100 PCIe (Entitles purchaser to NVIDIA AI Enterprise, fulfilled by NVIDIA)