Evaluating Llama 3.3 70B Inference on NVIDIA H100 and A100 GPUs

Derek Lewis
Chief Technology Officer

Large‑scale language models quickly expose the limits of yesterday’s hardware. To understand how much practical head‑room Hopper offers over Ampere in a production‑style setting, I profiled llama-3.3-70b-instruct on two 4‑GPU hosts—one populated with A100 80GB (PCIe), the other with H100 80GB (SXM5). Inference was served via NVIDIA NIM using the default TensorRT‑LLM profiles (TP = 4, PP = 1, bfloat16).

Workloads and measurement were driven by NVIDIA's benchmarking tool genai‑perf; the accompanying charts were produced directly from benchmark.py with no manual post‑processing. The benchmark code and raw data can be found in this GitHub repo. genai-perf measures several useful metrics, including Time-To-First-Token (TTFT), Inter-Token Latency (ITL), and Tokens/Second, all driven by synthetically generated prompts for various patterned workloads.
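For illustration, a single point in the concurrency sweep looks roughly like the genai-perf invocation below. The flag names follow genai-perf's documented interface, but the exact commands live in benchmark.sh in the repo, so treat this as a sketch rather than the authoritative command:

```shell
# One point in the sweep: 200-token synthetic prompts, 200-token
# completions, 50 concurrent users, against a local NIM endpoint.
genai-perf profile \
  -m meta/llama-3.3-70b-instruct \
  --endpoint-type chat \
  --streaming \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 200 \
  --concurrency 50 \
  --url localhost:8000
```

Sweeping concurrency is then just a matter of repeating this with different `--concurrency` values and collecting the exported metrics.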


Test Methodology

Model: llama‑3.3‑70b‑instruct
Container: nvcr.io/nim/meta/llama-3.3-70b-instruct:1.8.2
Precision: bf16
Parallelism: Tensor parallelism = 4, Pipeline parallelism = 1
Traffic model: Synthetic prompts via genai-perf
  • 200 → 200 tokens (translation / Q&A)
  • 1,000 → 200 tokens (summarization)
Concurrency sweep: 1, 2, 5, 10, 50, 100, 250, 500 users
Metrics captured:
  • Total Tokens / Second (TPS)
  • Median Time‑To‑First‑Token (TTFT)
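For intuition on what these two metrics mean, TTFT and TPS can be computed from any token stream with a few lines. This is an illustrative helper, not genai-perf's actual implementation; with a real NIM endpoint the iterable would be the streaming response chunks:

```python
import time

def measure_stream(stream):
    """Return (TTFT, tokens/sec) for an iterable of decoded tokens.

    TTFT is the delay from request start to the first token;
    TPS is total tokens divided by total wall-clock time.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # First token arrived: record time-to-first-token.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    tps = count / total if total > 0 else 0.0
    return ttft, tps
```

genai-perf additionally reports inter-token latency (the gap between consecutive tokens), which this sketch omits for brevity.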

Throughput Results

200 → 200 tokens

Performance (200/200) – TPS vs TTFT

  • H100 scaled almost linearly up to 500 users, peaking at ≈ 7,000 TPS.
  • A100 saturated at ≈ 570 TPS around 50 users; additional users primarily increased queueing delay.

This corresponds to a ≈ 12–14× throughput advantage for the H100 configuration across the sweep.

1,000 → 200 tokens

Performance (1,000/200) – TPS vs TTFT

Longer inputs magnify memory pressure during decoding, yet the relative gap remains wide:

  • H100 delivered ≈ 2,600 TPS at 250 concurrent users.
  • A100 remained under ≈ 230 TPS at the same load.

Latency Under Load

200 → 200 tokens

TTFT vs Concurrency (200/200)

  • H100 kept TTFT under 5s all the way to 500 simultaneous sessions.
  • A100 exceeded 10s TTFT by 100 users and climbed steeply under higher load.

1,000 → 200 tokens

TTFT vs Concurrency (1,000/200)

  • H100 maintained <5s TTFT out to the full user sweep, indicating additional capacity beyond 500 users.
  • A100 TTFT spiked above 5s at just 10–20 concurrent users.

Discussion

  1. Hopper brings several architectural improvements over Ampere beyond faster clocks, more SMs, and higher HBM3 memory bandwidth. The Tensor Memory Accelerator (TMA), FP8 support, and the Transformer Engine all contribute to the throughput and latency gains seen at higher user counts.
  2. In most cases, H100s will achieve target latency and concurrency requirements at a lower effective cost than A100s. An on-demand AWS p5.48xlarge (8xH100) costs 2.4x more than a p4de.24xlarge (8xA100), but delivers up to 14x the throughput. Put differently, you’d need about 13 p4de.24xlarge instances to match the throughput of a single p5.48xlarge.
  3. Longer input & output sequence lengths will increase latency and decrease concurrency/throughput.
  4. Blackwell should widen this gap further. I plan to run the same benchmarks on Blackwell hardware once it becomes more available.
  5. Although the llama-3.3-70b model will fit on 2x80GB A100s or H100s at bfloat16 precision, it leaves very little room for a KV Cache, so the supported minimum per the NVIDIA NIM documentation is 4x80GB A100s or H100s.
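The instance math in point 2 can be checked with a few lines. The TPS figures are the peaks measured above for the 200/200 workload; the 2.4x price ratio is the on-demand ratio cited, which will drift with AWS pricing:

```python
import math

h100_tps = 7000   # peak TPS observed on the 4x H100 host (200/200)
a100_tps = 570    # peak TPS observed on the 4x A100 host (200/200)
cost_ratio = 2.4  # p5.48xlarge vs p4de.24xlarge on-demand price ratio

speedup = h100_tps / a100_tps            # ~12.3x throughput advantage
hosts_to_match = math.ceil(speedup)      # A100 hosts per H100 host -> 13
perf_per_dollar = speedup / cost_ratio   # throughput per dollar advantage

print(f"H100 throughput advantage: {speedup:.1f}x")
print(f"A100 hosts needed to match one H100 host: {hosts_to_match}")
print(f"Throughput-per-dollar advantage: {perf_per_dollar:.1f}x")
```

Even at the conservative 12.3x measured ratio (rather than the 14x best case), H100 comes out roughly 5x ahead on throughput per dollar at on-demand prices.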

Takeaways

  1. Hopper’s advantage over Ampere is an order of magnitude, not a marginal bump. The H100 system delivered ~12x the throughput and far lower p50 latency across every concurrency level tested.
  2. If your service has tight TTFT targets under real concurrency, A100 requires heavy over-provisioning, while H100 stays under 5s TTFT even at 500 users.
  3. Know your input/output sequence lengths (ISL/OSL). Summarization tasks with long inputs behave very differently from short Q&A. Reasoning models shift the balance further by generating many more output tokens relative to input.
  4. bf16 is the practical default. It balances memory footprint with throughput on both architectures and is the easiest path inside NIM. fp8 and int4 profiles are not currently available for llama-3.3-70b-instruct in NIM.
  5. Benchmark at expected concurrency. Single-request numbers don’t tell the full story; queueing effects take over as user count climbs. 5 seconds TTFT is my rule of thumb for acceptable latency.

Reproducing the Experiment

The benchmarks can be reproduced by running the start_nim.sh and benchmark.sh scripts provided in the GitHub repository above. The “NVIDIA NIM LLMs Benchmarking” documentation for genai-perf is also an excellent resource, with insights into the various benchmarking metrics as well as NVIDIA’s own benchmarking data across various ISL/OSL combinations and NIM profiles.
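For reference, starting the NIM container by hand looks roughly like the following, mirroring the pattern in the NIM documentation. start_nim.sh in the repo is the authoritative version; an NGC API key is required, and the cache volume path is an assumption:

```shell
# Launch the Llama 3.3 70B NIM across all four GPUs.
# Assumes NGC_API_KEY is already set and `docker login nvcr.io` has
# been done; the host cache dir avoids re-downloading model weights.
docker run -d --rm --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$HOME/.cache/nim:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.3-70b-instruct:1.8.2
```

Once the container reports ready, the OpenAI-compatible endpoint on port 8000 is what genai-perf targets.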

How Silex Can Help

Silex Data Solutions helps organizations evaluate GPU hardware, optimize inference infrastructure, and plan AI/ML deployments. If you’re making hardware decisions or scaling LLM workloads, our team can help.

Contact us to learn more.