I ran a comparison - on throughput and cost - of:
- 8x v6e from Google
- 2x H100 SXM from Nvidia
- 1x H200 SXM from Nvidia
running the Gemma 3 27B IT (instruction-tuned) model.
In short: running vLLM on each, Nvidia works out about 4-5x cheaper per token.
Lots more in the full video on the Trelis Research channel on YouTube.
And get access to the benchmarking scripts here:
Cheers, Ronan
P.S. 🛠️ (NEW) Trelis Benchmarking Seminars - learn more here.
Video Links:
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
TPU vs NVIDIA GPU Benchmarking: Performance Analysis of Gemma 27B Inference
This analysis compares inference performance between Google's TPU v6e and NVIDIA's H100/H200 GPUs running the Gemma 27B model. The comparison examines hardware specifications, throughput metrics, and cost efficiency.
Hardware Specifications
TPU v6e:
VRAM: 32 GB per unit
HBM Speed: ~1.2 TB/s
Interconnect: ~450 GB/s
FP16 FLOPS: ~420 TFLOPS
NVIDIA H100:
VRAM: 80 GB
HBM Speed: ~3.0 TB/s
Interconnect: ~900 GB/s
FP16 FLOPS: ~400 TFLOPS
NVIDIA H200:
VRAM: 141 GB
HBM Speed: ~4.0 TB/s
Interconnect: ~900 GB/s
FP16 FLOPS: ~400 TFLOPS
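Tallying the per-unit figures above into per-configuration totals makes the comparison easier to eyeball. A minimal sketch, using only the approximate numbers listed in this write-up (not official datasheet values):

```python
# Approximate per-unit specs as listed above (values from the write-up, not datasheets).
specs = {
    "TPU v6e": {"vram_gb": 32, "hbm_tbps": 1.2, "fp16_tflops": 420},
    "H100":    {"vram_gb": 80, "hbm_tbps": 3.0, "fp16_tflops": 400},
    "H200":    {"vram_gb": 141, "hbm_tbps": 4.0, "fp16_tflops": 400},
}
# The three configurations benchmarked: (chip, number of units).
configs = {"8x TPU v6e": ("TPU v6e", 8), "2x H100": ("H100", 2), "1x H200": ("H200", 1)}

def totals(config_name: str) -> dict:
    """Sum per-unit specs across all units in a configuration."""
    chip, n = configs[config_name]
    return {key: value * n for key, value in specs[chip].items()}

for name in configs:
    print(name, totals(name))
```

Note that the 8x v6e configuration has by far the most aggregate compute (~3360 TFLOPS vs ~800 for 2x H100), which is relevant to the bandwidth/kernel bottleneck discussion in the findings below.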
Benchmark Configuration
Test Setup:
Model: Gemma 27B (16-bit precision)
TPU Configuration: 8x v6e units (256 GB total VRAM)
GPU Configurations: 1x H200 (141 GB VRAM) and 2x H100 (160 GB total VRAM)
Library: vLLM
Input Tokens: 5000 ±50
Output Tokens: 1000 ±50
Concurrency Tests: 1, 8, and 64 simultaneous requests
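The token-count jitter in the setup above (5000 ±50 in, 1000 ±50 out) can be sketched as request construction against vLLM's OpenAI-compatible completions endpoint. This is an illustration, not the actual Trelis benchmarking script; the prompt-trimming step is a placeholder, since a real harness would trim by tokens rather than characters:

```python
import random

def make_request(rng: random.Random, prompt_pool: list) -> dict:
    """Build one jittered benchmark request (illustrative, not the Trelis script)."""
    input_tokens = 5000 + rng.randint(-50, 50)   # 5000 +/- 50 input tokens
    output_tokens = 1000 + rng.randint(-50, 50)  # 1000 +/- 50 output tokens
    return {
        "model": "google/gemma-3-27b-it",
        # Placeholder: slicing by characters; a real harness trims by token count.
        "prompt": prompt_pool[0][:input_tokens],
        "max_tokens": output_tokens,
        "temperature": 0.0,
        # Streaming is needed so time-to-first-token can be measured per request.
        "stream": True,
    }

rng = random.Random(0)
req = make_request(rng, ["x" * 6000])
print(req["max_tokens"], req["stream"])
```

At each concurrency level (1, 8, 64), the harness fires that many such requests simultaneously and records time to first token and tokens per second.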
Performance Results
Time to First Token:
TPU v6e (8x): 0.76s at concurrency 1, 0.79s at concurrency 8
H200 (1x): 0.9s at concurrency 1 and 8
H100 (2x): 0.9s at concurrency 1 and 8
Token Generation Speed (per request):
H100 (2x): Highest per-request token generation rate
TPU v6e (8x): Slightly faster than the single H200
All configurations show decreased per-request speed at concurrency 64
Cost Analysis
Cost per Million Tokens (at concurrency 8):
H200 (1x): $0.57
H100 (2x): $0.74
TPU v6e (8x): $2.85
Hourly Hardware Costs:
H200: $3.99/hour
H100: $2.99/hour per unit ($5.98 total)
TPU v6e: $21.60/hour total for 8 units
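The cost figures above follow directly from hourly price and sustained aggregate throughput: dollars per million tokens = hourly cost / 3600 / (tokens per second) x 1e6. A small sketch, including the inverse (the throughput implied by the reported $/Mtok), assuming the cost counts generated tokens at a steady rate:

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Cost to generate one million tokens at a sustained aggregate throughput."""
    return hourly_usd / 3600.0 / tokens_per_sec * 1e6

def implied_throughput(hourly_usd: float, cost_per_mtok: float) -> float:
    """Invert the formula: aggregate tokens/sec implied by a reported $/Mtok."""
    return hourly_usd / 3600.0 / (cost_per_mtok / 1e6)

# Aggregate throughput at concurrency 8 implied by the figures above:
for name, hourly, cpm in [("1x H200", 3.99, 0.57),
                          ("2x H100", 5.98, 0.74),
                          ("8x v6e", 21.60, 2.85)]:
    print(f"{name}: ~{implied_throughput(hourly, cpm):.0f} tok/s")
```

Notably, the implied throughputs are in the same ballpark across all three configurations; the 4-5x cost gap comes largely from the TPU cluster's much higher hourly price.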
Key Findings
TPUs showed faster time to first token but higher overall cost per token
NVIDIA configurations demonstrated superior cost efficiency
TPUs offer high compute capacity but appear bottlenecked, either by memory bandwidth relative to that compute or by less mature libraries/kernels.
Concurrency of 64 proved impractical across all configurations due to slow per-request token generation.
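One way to see the bandwidth-vs-compute point: decode is memory-bandwidth-bound, so the chip's compute-to-bandwidth ratio (FLOPs available per byte read from HBM) indicates how much compute sits idle during generation. Using the approximate spec figures from this write-up:

```python
# Approximate (tflops, hbm_tbps) per chip, from the spec figures in this write-up.
chips = {"TPU v6e": (420, 1.2), "H100": (400, 3.0), "H200": (400, 4.0)}

for name, (tflops, tbps) in chips.items():
    # TFLOPS / (TB/s) cancels to FLOPs per byte of HBM traffic.
    ratio = tflops / tbps
    print(f"{name}: ~{ratio:.0f} FLOP/byte")
```

By this rough measure the v6e has roughly 2.5-3.5x more compute per byte of bandwidth than the H100/H200, so a bandwidth-bound decode workload leaves proportionally more of its compute unused, consistent with the bottleneck observation above (though vLLM kernel maturity on TPU could contribute as well).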