I ran a comparison - on throughput and cost - of:
- 8x v6e from Google
- 2x H100 SXM from Nvidia
- 1x H200 SXM from Nvidia
running the Gemma 3 27B IT (instruction-tuned) model.
In short: running vLLM on each, Nvidia works out about 4-5x cheaper per token.
Lots more in the full video on the Trelis Research channel on YouTube.
And get access to the benchmarking scripts here:
Cheers, Ronan
P.S. 🛠️ (NEW) Trelis Benchmarking Seminars - learn more here.
Video Links:
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
TPU vs NVIDIA GPU Benchmarking: Performance Analysis of Gemma 27B Inference
This analysis compares inference performance between Google's TPU v6e and NVIDIA's H100/H200 GPUs running the Gemma 27B model. The comparison examines hardware specifications, throughput metrics, and cost efficiency.
Hardware Specifications
TPU v6e:
VRAM: 32 GB per unit
HBM Speed: ~1.2 TB/s
Interconnect: ~450 GB/s
FP16 FLOPS: ~420 TFLOPS
NVIDIA H100:
VRAM: 80 GB
HBM Speed: ~3.0 TB/s
Interconnect: ~900 GB/s
FP16 FLOPS: ~400 TFLOPS
NVIDIA H200:
VRAM: 141 GB
HBM Speed: ~4.0 TB/s
Interconnect: ~900 GB/s
FP16 FLOPS: ~400 TFLOPS
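Tallying the per-unit figures above into per-configuration totals makes the comparison easier to eyeball. A minimal sketch, using only the approximate numbers listed in this write-up (not official datasheet values):

```python
# Approximate per-unit specs as listed above (values from the write-up, not datasheets).
specs = {
    "TPU v6e": {"vram_gb": 32, "hbm_tbps": 1.2, "fp16_tflops": 420},
    "H100":    {"vram_gb": 80, "hbm_tbps": 3.0, "fp16_tflops": 400},
    "H200":    {"vram_gb": 141, "hbm_tbps": 4.0, "fp16_tflops": 400},
}
# The three configurations benchmarked: (chip, number of units).
configs = {"8x TPU v6e": ("TPU v6e", 8), "2x H100": ("H100", 2), "1x H200": ("H200", 1)}

def totals(config_name: str) -> dict:
    """Sum per-unit specs across all units in a configuration."""
    chip, n = configs[config_name]
    return {key: value * n for key, value in specs[chip].items()}

for name in configs:
    print(name, totals(name))
```

Note that the 8x v6e configuration has by far the most aggregate compute (~3360 TFLOPS vs ~800 for 2x H100), which is relevant to the bandwidth/kernel bottleneck discussion in the findings below.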
Benchmark Configuration
Test Setup:
Model: Gemma 27B (16-bit precision)
TPU Configuration: 8x v6e units (256 GB total VRAM)
GPU Configurations: 1x H200 (141 GB VRAM) and 2x H100 (160 GB total VRAM)
Library: vLLM
Input Tokens: 5000 ±50
Output Tokens: 1000 ±50
Concurrency Tests: 1, 8, and 64 simultaneous requests
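The token-count jitter in the setup above (5000 ±50 in, 1000 ±50 out) can be sketched as request construction against vLLM's OpenAI-compatible completions endpoint. This is an illustration, not the actual Trelis benchmarking script; the prompt-trimming step is a placeholder, since a real harness would trim by tokens rather than characters:

```python
import random

def make_request(rng: random.Random, prompt_pool: list) -> dict:
    """Build one jittered benchmark request (illustrative, not the Trelis script)."""
    input_tokens = 5000 + rng.randint(-50, 50)   # 5000 +/- 50 input tokens
    output_tokens = 1000 + rng.randint(-50, 50)  # 1000 +/- 50 output tokens
    return {
        "model": "google/gemma-3-27b-it",
        # Placeholder: slicing by characters; a real harness trims by token count.
        "prompt": prompt_pool[0][:input_tokens],
        "max_tokens": output_tokens,
        "temperature": 0.0,
        # Streaming is needed so time-to-first-token can be measured per request.
        "stream": True,
    }

rng = random.Random(0)
req = make_request(rng, ["x" * 6000])
print(req["max_tokens"], req["stream"])
```

At each concurrency level (1, 8, 64), the harness fires that many such requests simultaneously and records time to first token and tokens per second.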
Performance Results
Time to First Token:
TPU v6e (8x): 0.76s at concurrency 1, 0.79s at concurrency 8
H200 (1x): 0.9s at concurrency 1 and 8
H100 (2x): 0.9s at concurrency 1 and 8
Token Generation Speed (per request):
H100 (2x): Highest per-request token generation rate
TPU v6e (8x): Slightly faster than the single H200
All configurations show decreased per-request speed at concurrency 64
Cost Analysis
Cost per Million Tokens (at concurrency 8):
H200 (1x): $0.57
H100 (2x): $0.74
TPU v6e (8x): $2.85
Hourly Hardware Costs:
H200: $3.99/hour
H100: $2.99/hour per unit ($5.98 total)
TPU v6e: $21.60/hour total for 8 units
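The cost figures above follow directly from hourly price and sustained aggregate throughput: dollars per million tokens = hourly cost / 3600 / (tokens per second) x 1e6. A small sketch, including the inverse (the throughput implied by the reported $/Mtok), assuming the cost counts generated tokens at a steady rate:

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Cost to generate one million tokens at a sustained aggregate throughput."""
    return hourly_usd / 3600.0 / tokens_per_sec * 1e6

def implied_throughput(hourly_usd: float, cost_per_mtok: float) -> float:
    """Invert the formula: aggregate tokens/sec implied by a reported $/Mtok."""
    return hourly_usd / 3600.0 / (cost_per_mtok / 1e6)

# Aggregate throughput at concurrency 8 implied by the figures above:
for name, hourly, cpm in [("1x H200", 3.99, 0.57),
                          ("2x H100", 5.98, 0.74),
                          ("8x v6e", 21.60, 2.85)]:
    print(f"{name}: ~{implied_throughput(hourly, cpm):.0f} tok/s")
```

Notably, the implied throughputs are in the same ballpark across all three configurations; the 4-5x cost gap comes largely from the TPU cluster's much higher hourly price.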
Key Findings
TPUs showed faster time to first token but higher overall cost per token
NVIDIA configurations demonstrated superior cost efficiency
TPUs offer high compute capacity but appear bottlenecked, either by memory bandwidth relative to that compute or by less mature libraries/kernels.
Concurrency of 64 proved impractical across all configurations due to slow per-request token generation.
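One way to see the bandwidth-vs-compute point: decode is memory-bandwidth-bound, so the chip's compute-to-bandwidth ratio (FLOPs available per byte read from HBM) indicates how much compute sits idle during generation. Using the approximate spec figures from this write-up:

```python
# Approximate (tflops, hbm_tbps) per chip, from the spec figures in this write-up.
chips = {"TPU v6e": (420, 1.2), "H100": (400, 3.0), "H200": (400, 4.0)}

for name, (tflops, tbps) in chips.items():
    # TFLOPS / (TB/s) cancels to FLOPs per byte of HBM traffic.
    ratio = tflops / tbps
    print(f"{name}: ~{ratio:.0f} FLOP/byte")
```

By this rough measure the v6e has roughly 2.5-3.5x more compute per byte of bandwidth than the H100/H200, so a bandwidth-bound decode workload leaves proportionally more of its compute unused, consistent with the bottleneck observation above (though vLLM kernel maturity on TPU could contribute as well).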