Evaluating Llama 3.3 70B Inference on NVIDIA H100 and A100 GPUs
Large‑scale language models quickly expose the limits of yesterday’s hardware. To understand how much practical head‑room Hopper offers over Ampere in a production‑style setting, I profiled llama-3.3-70b-instruct on two 4‑GPU hosts—one populated with A100 80GB (PCIe), the other with H100 80GB (SXM5). Inference was served via NVIDIA NIM using the default TensorRT‑LLM profiles (TP = 4, PP = 1, bfloat16).
Workloads and measurement were driven by NVIDIA's benchmarking tool genai-perf; the accompanying charts were produced directly from benchmark.py with no manual post-processing. The benchmark code and raw data are available in this GitHub repo. genai-perf measures several useful metrics, such as Time-To-First-Token (TTFT), Inter-Token Latency (ITL), and Tokens/Second, all driven by synthetically generated prompts for various patterned workloads.
Test Methodology
| Dimension | Setting |
|---|---|
| Model | llama‑3.3‑70b‑instruct |
| Container | nvcr.io/nim/meta/llama-3.3-70b-instruct:1.8.2 |
| Precision | bf16 |
| Parallelism | Tensor parallelism = 4, Pipeline parallelism = 1 |
| Traffic model | Synthetic prompts via genai-perf: • 200 → 200 tokens (translation / Q&A) • 1,000 → 200 tokens (summarization) |
| Concurrency sweep | 1, 2, 5, 10, 50, 100, 250, 500 users |
| Metrics captured | • Total Tokens / Second (TPS) • Median Time‑To‑First‑Token (TTFT) |
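The repository's benchmark.sh drives this sweep; a minimal sketch of how the same sweep could be assembled programmatically is below. The genai-perf flag names follow recent releases of the tool but may differ by version, so treat them as assumptions and check `genai-perf profile --help` locally.

```python
# Sketch: build one genai-perf invocation per concurrency level in the sweep.
# Flag names are assumed from recent genai-perf releases; verify against
# your installed version before running.

MODEL = "meta/llama-3.3-70b-instruct"
CONCURRENCIES = [1, 2, 5, 10, 50, 100, 250, 500]

def build_command(concurrency, isl=200, osl=200):
    """Return the argv list for one point in the sweep (ISL -> OSL tokens)."""
    return [
        "genai-perf", "profile",
        "-m", MODEL,
        "--synthetic-input-tokens-mean", str(isl),
        "--output-tokens-mean", str(osl),
        "--concurrency", str(concurrency),
    ]

commands = [build_command(c) for c in CONCURRENCIES]
```

Each command would be executed (e.g. via `subprocess.run`) against the running NIM endpoint, once per concurrency level and once per traffic pattern.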
Throughput Results
200 → 200 tokens

- H100 scaled almost linearly up to 500 users, peaking at ≈ 7,000 TPS.
- A100 saturated at ≈ 570 TPS around 50 users; additional users primarily increased queueing delay.
This corresponds to a ≈ 12–14× throughput advantage for the H100 configuration across the sweep.
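The headline ratio falls straight out of the two peak figures reported above; a quick back-of-the-envelope check:

```python
# Sanity check of the reported peaks on the 200 -> 200 workload.
h100_peak_tps = 7000   # approximate H100 peak at 500 users
a100_peak_tps = 570    # approximate A100 saturation point near 50 users

speedup = h100_peak_tps / a100_peak_tps
print(f"H100 / A100 throughput ratio: {speedup:.1f}x")  # ~12.3x
```

The 12-14x range quoted in the text reflects this ratio varying slightly across concurrency levels and workloads.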
1,000 → 200 tokens

Longer inputs magnify memory pressure during decoding, yet the relative gap remains wide:
- H100 delivered ≈ 2,600 TPS at 250 concurrent users.
- A100 remained under ≈ 230 TPS at the same load.
Latency Under Load
200 → 200 tokens

- H100 kept median TTFT under 5 s up to 500 simultaneous sessions.
- A100 TTFT exceeds 10 s by 100 users and climbs steeply under higher load.
1,000 → 200 tokens

- H100 maintains <5s TTFT latency out to the full user sweep, indicating additional capacity beyond 500 users.
- A100 TTFT spikes above 5s at just 10-20 concurrent users.
Discussion
- Hopper has several architectural improvements over Ampere beyond faster clock speeds, more SMs, and HBM3 memory bandwidth. Tensor Memory Accelerators (TMAs), FP8 support, and Transformer Engine all contribute to the throughput and latency gains seen at higher user counts.
- In most cases, H100s will achieve target latency and concurrency requirements at a lower effective cost than A100s. An on-demand AWS `p5.48xlarge` (8x H100) costs 2.4x more than a `p4de.24xlarge` (8x A100), but delivers up to 14x the throughput. Put differently, you'd need about 13 `p4de.24xlarge` instances to match the throughput of a single `p5.48xlarge`.
- Longer input & output sequence lengths will increase latency and decrease concurrency/throughput.
- Blackwell should widen this gap further. I plan to run the same benchmarks on Blackwell hardware once it becomes more available.
- Although the `llama-3.3-70b` model will fit on 2x 80GB A100s or H100s at bfloat16 precision, it leaves very little room for a KV cache, so the supported minimum per the NVIDIA NIM documentation is 4x 80GB A100s or H100s.
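The instance-count claim in the cost comparison above can be spelled out with the peak numbers from this benchmark. The 2.4x price ratio is the on-demand figure cited in the text, not a live quote, and the 4-GPU throughput peaks are used as a proxy for the 8-GPU instances, assuming the per-GPU ratio carries over:

```python
# How many A100 instances match one H100 instance, and the effective
# performance-per-dollar advantage. TPS figures are the 4-GPU peaks from
# this benchmark, used as a per-GPU-ratio proxy for the 8-GPU instances.
import math

h100_tps = 7000
a100_tps = 570
cost_ratio = 2.4   # on-demand p5.48xlarge vs p4de.24xlarge, per the text

instances_to_match = math.ceil(h100_tps / a100_tps)
perf_per_dollar_advantage = (h100_tps / a100_tps) / cost_ratio
print(instances_to_match, round(perf_per_dollar_advantage, 1))  # 13 5.1
```

So even at a 2.4x instance price, the H100 configuration comes out roughly 5x cheaper per token at these loads.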
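To see why a 2-GPU deployment leaves so little KV-cache headroom, a rough sizing sketch helps. The architecture numbers below are the published Llama-3-70B shape (80 layers, 8 KV heads via GQA, head dim 128); the leftover-memory figure ignores activations and runtime overhead, so treat it as an upper bound:

```python
# Rough sizing: bf16 weights plus per-token KV cache on 2x 80 GB GPUs.
BYTES_BF16 = 2
N_LAYERS = 80      # Llama-3-70B published shape (assumption for 3.3)
N_KV_HEADS = 8     # grouped-query attention
HEAD_DIM = 128

def kv_bytes_per_token():
    # One K and one V tensor per layer
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_BF16

weights_gb = 70e9 * BYTES_BF16 / 1e9        # ~140 GB of bf16 weights
free_gb = 2 * 80 - weights_gb               # ~20 GB left across 2x 80 GB
tokens_budget = free_gb * 1e9 / kv_bytes_per_token()
print(f"KV cache: {kv_bytes_per_token() / 1024:.0f} KiB/token, "
      f"~{tokens_budget:,.0f} tokens fit in the remainder")
```

Roughly 60k total cached tokens across all concurrent sequences, before runtime overhead, is far too tight for production batching, which is consistent with NIM's 4-GPU minimum.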
Takeaways
- Hopper’s advantage over Ampere is an order of magnitude, not a marginal bump. The H100 system delivered ~12x the throughput and far lower p50 latency across every concurrency level tested.
- If your service targets tight TTFT under real concurrency, A100 requires heavy over-provisioning, while H100 stays under 5s TTFT even at 500 users.
- Know your input/output sequence lengths (ISL/OSL). Summarization tasks with long inputs behave very differently from short Q&A. Reasoning models shift the balance further by generating many more output tokens relative to input.
- bf16 is the practical default. It balances memory footprint with throughput on both architectures and is the easiest path inside NIM. `fp8` and `int4` profiles are not currently available for `llama-3.3-70b-instruct` in NIM.
- Benchmark at expected concurrency. Single-request numbers don't tell the full story; queueing effects take over as user count climbs. 5 seconds TTFT is my rule of thumb for acceptable latency.
Reproducing the Experiment
The benchmarks can be reproduced using the start_nim.sh and benchmark.sh scripts provided in the GitHub repository above. The "NVIDIA NIM LLMs Benchmarking" documentation for genai-perf is also an excellent resource, with insights into the various benchmarking metrics as well as NVIDIA's own benchmark data for various ISL/OSLs and NIM profiles.
How Silex Can Help
Silex Data Solutions helps organizations evaluate GPU hardware, optimize inference infrastructure, and plan AI/ML deployments. If you’re making hardware decisions or scaling LLM workloads, our team can help.
Contact us to learn more.