Large Language Models (LLMs) like Anthropic's Claude Sonnet and Google's Gemini now accept massive input contexts - 100,000 to over 1 million tokens!
But processing all that context is slow and expensive...
If you're reusing the same background documentation across queries, there's a trick to get a 2x speed-up and a 4x cost reduction:
➡️ Context Caching
- Stores the results of the computation the model already did on your background information (the attention key/value cache)
- Reuses those results for later queries
=> 2x faster, 4x cheaper.
💡 Implementation tips:
- Put all background info at the start of your prompt
- Use specific headers/parameters for Claude and Gemini (see video)
- For open-source deployment, tools like SGLang can automate caching
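Here's a minimal sketch of what the Claude side looks like - building a Messages API request body that marks the background doc as cacheable. The `cache_control` field follows Anthropic's prompt-caching docs; the model id and doc text are placeholders, so check the current API reference before relying on this.

```python
import json

# Placeholder standing in for your large, reused background documentation.
BACKGROUND_DOC = "...thousands of tokens of reference docs..."

def build_request(question: str) -> dict:
    """Build a Messages API payload that caches the background doc prefix."""
    return {
        "model": "claude-3-5-sonnet-20241022",  # placeholder model id
        "max_tokens": 500,
        # Put the big, stable context FIRST so the cached prefix matches
        # across requests; only the user question below changes.
        "system": [
            {
                "type": "text",
                "text": BACKGROUND_DOC,
                # Marks everything up to this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }

payload = build_request("Summarise section 2 of the docs.")
print(json.dumps(payload, indent=2))
```

Gemini works differently - you upload the shared context once as a cached content object and reference it by name in later calls - but the same "stable prefix first, changing question last" principle applies.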
📊 Cost comparison (100K input tokens, 500 output, 10 requests):
Gemini Pro with caching: ~$1
Gemini Pro without caching: ~$4
Claude Sonnet with caching: ~$0.67
Claude Sonnet without caching: ~$3
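The comparison above comes down to simple arithmetic. Here's a back-of-envelope cost model - the per-million-token prices are assumptions (roughly Claude Sonnet's published rates when this was written), so swap in current pricing; with these numbers it lands near the Claude figures above.

```python
# Assumed prices in USD per million tokens (check current pricing pages).
INPUT_PRICE = 3.00        # uncached input
OUTPUT_PRICE = 15.00      # output
CACHE_WRITE_PRICE = 3.75  # writing the cached prefix (first request)
CACHE_READ_PRICE = 0.30   # reading the cached prefix (later requests)

def cost(requests: int = 10, ctx: int = 100_000,
         out: int = 500, cached: bool = False) -> float:
    """Total cost in USD for `requests` queries over a `ctx`-token context."""
    output_cost = requests * out * OUTPUT_PRICE / 1e6
    if not cached:
        # Every request pays full price for the whole context.
        return requests * ctx * INPUT_PRICE / 1e6 + output_cost
    # First request writes the cache; the remaining ones read it cheaply.
    prefix_cost = (ctx * CACHE_WRITE_PRICE
                   + (requests - 1) * ctx * CACHE_READ_PRICE) / 1e6
    return prefix_cost + output_cost

print(f"without caching: ${cost():.2f}")             # roughly $3
print(f"with caching:    ${cost(cached=True):.2f}")  # roughly $0.72
```

The savings grow with more requests, since the expensive cache write is paid only once.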
Cheers, Ronan
More resources at Trelis.com/About