Testing Every GPU!
AMD and Nvidia GPUs
MI300X, Hopper, Lovelace and Ampere
sglang (lmsysorg) and vLLM
bf16 and fp8
batch size 64, ~550 input tokens, ~150 output tokens
and I got a few takeaways:
sglang with fp8 gives ~2x speedup versus vLLM on an A40 at large batch size (64).
fp8 gives a large speed-up on Ampere GPUs (A40, A100) at large batch size; the gain is smaller on Hopper and Lovelace, but still present.
The A40 and RTX 3090 give the best price per token, although their responses aren't quite as fast as on an H100, H200, or MI300X.
The A100 strikes a nice balance between throughput and cost.
And that’s it for this week, cheers, Ronan
Easy LoRA Inference at Finetunehost.com (closed alpha signup here).
Performance and Cost Analysis Across 12 GPUs [AI Summary]
Trelis presents a comprehensive benchmarking study of 12 different GPUs, examining their performance and cost-effectiveness for machine learning inference. The study encompasses a wide range of architectures, from AMD's MI300X to NVIDIA's latest Hopper H200 and H100, through to consumer-grade GPUs like the RTX 3090.
Test Methodology
The benchmarking methodology is rigorous and consistent across all GPUs. Each test involves the following setup (a rough reproduction sketch follows the list):
- 500 input tokens (with small variance)
- 150 output tokens (with small variance)
- 64 concurrent requests
- Testing in both 8-bit and 16-bit formats
- Using the Llama 3.1 8B model
- Primary testing through the SGLang inference library
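As a rough illustration of that setup, the sketch below fires 64 concurrent streaming requests at an OpenAI-compatible endpoint (which both SGLang and vLLM expose) and records time to first token and output throughput. The endpoint URL, model name, and prompt are placeholders; this is not the exact harness used in the video.

```python
# Minimal concurrent-benchmark sketch against an OpenAI-compatible server
# (e.g. SGLang or vLLM). URL, model id, and prompt are placeholder assumptions.
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model id
PROMPT = "word " * 500  # crude stand-in for ~500 input tokens

async def one_request():
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    stream = await client.completions.create(
        model=MODEL, prompt=PROMPT, max_tokens=150, stream=True
    )
    async for _chunk in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1  # roughly one token per streamed chunk
    return first_token_at - start, n_chunks, time.perf_counter() - start

async def main():
    # 64 concurrent requests, matching the batch size described above.
    results = await asyncio.gather(*[one_request() for _ in range(64)])
    ttfts = [r[0] for r in results]
    total_tokens = sum(r[1] for r in results)
    wall = max(r[2] for r in results)  # approximate wall-clock for the batch
    print(f"mean TTFT: {sum(ttfts) / len(ttfts):.2f}s")
    print(f"output throughput: {total_tokens / wall:.1f} tok/s")

asyncio.run(main())
```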
Key Performance Metrics
The analysis focuses on three crucial metrics: time to first token, token generation throughput, and price per million tokens.
The newest GPUs, particularly the H200 and H100, demonstrate superior time-to-first-token performance, averaging around half a second in both 8-bit and 16-bit formats. The A40, despite being a far more affordable option at around 30 cents per hour (compared to roughly $4 per hour for an H100), shows surprisingly competitive performance.
Cost-Performance Analysis
One of the study's most interesting findings is that older generation GPUs often provide the lowest cost on a dollar-per-token basis. While newer GPUs like the H200 and H100 offer higher throughput, their premium pricing means they're not always the most cost-effective choice for all use cases.
The A40 emerges as a particularly interesting option, offering excellent value, especially when running in 8-bit format, achieving costs as low as 4 cents per million input tokens. The A100 also shows strong price-performance characteristics, offering throughput comparable to newer generations at a lower price point.
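The arithmetic behind those dollar-per-token figures is simple: divide the hourly rental price by the number of tokens the card processes in an hour. A minimal sketch, using placeholder prices and throughputs rather than the video's measured numbers:

```python
# Back-of-the-envelope cost per million tokens:
# (hourly GPU price) / (tokens processed per hour) * 1,000,000.
# The prices and throughputs below are illustrative assumptions only.
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# Example: an A40 at ~$0.30/hr vs an H100 at ~$4/hr (assumed throughputs).
print(f"A40:  ${cost_per_million_tokens(0.30, 1_500):.3f} per 1M tokens")
print(f"H100: ${cost_per_million_tokens(4.00, 10_000):.3f} per 1M tokens")
```

The point the numbers make is that a cheap card only needs to be "fast enough": unless the expensive card's throughput scales in proportion to its hourly price, the older GPU wins on cost per token.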
FP8 vs 16-bit Performance
The study makes a strong case for FP8 (8-bit floating point) format in production environments. Unlike 4-bit quantization, FP8 causes minimal quality degradation relative to 16-bit formats while remaining well supported across GPUs and inference libraries. The analysis demonstrates that FP8 often improves throughput significantly, particularly on older GPUs.
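One related hardware note: FP8 tensor cores only arrived with Ada Lovelace (compute capability 8.9) and Hopper (9.0), so Ampere cards such as the A40 and A100 typically run FP8 checkpoints in weight-only mode (FP8 weights, 16-bit compute); their speed-ups presumably come from reduced memory traffic rather than native FP8 matrix math. A quick PyTorch check for native FP8 support:

```python
# Check whether the local GPU has native FP8 tensor cores.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    # Ada Lovelace is 8.9, Hopper is 9.0; Ampere (8.0 / 8.6) has no FP8 tensor cores.
    print(f"compute capability {major}.{minor}, "
          f"native FP8 tensor cores: {(major, minor) >= (8, 9)}")
```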
SGLang vs vLLM Comparison
The research includes a comparative analysis of SGLang and vLLM inference libraries. On the H200, both libraries show similar performance in time-to-first-token metrics. However, on the A40, SGLang demonstrates notably better performance, particularly with 8-bit models. This suggests that SGLang might be the preferred choice for Ampere architecture GPUs running at high batch sizes.
Important Caveats
The study acknowledges several important considerations:
- Major providers like OpenAI and Anthropic likely achieve better performance through custom optimizations
- The test setup uses relatively short input/output lengths
- The pricing analysis focuses on input tokens only
- The current industry pricing model (separate input/output token pricing) might not perfectly reflect computational costs
Practical Applications
While newer GPUs offer superior raw performance, the study suggests that for many applications, older generation GPUs might provide better value. The A40, in particular, emerges as a strong contender for cost-effective deployment, especially when running FP8 models.
Technical Implementation
The study concludes with a practical guide to implementing FP8 quantization, using tools like LLM Compressor. This section provides valuable insights for practitioners looking to optimize their model deployment, demonstrating how to achieve significant size reduction while maintaining model quality.
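For orientation, a one-shot dynamic FP8 conversion with LLM Compressor looks roughly like the sketch below. This follows the library's published example rather than the exact commands from the video, and import paths or argument names may differ between llm-compressor versions.

```python
# Rough outline of one-shot dynamic FP8 quantization with llm-compressor
# (pip install llmcompressor); treat paths and names as version-dependent.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model id

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to FP8, keeping the output head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```

The resulting checkpoint is roughly half the size of the 16-bit original and can be served directly by SGLang or vLLM.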
Overall, this video is particularly relevant for organizations looking to optimize their ML deployment costs while maintaining high performance standards.