Large Language Models (LLMs) like Anthropic's Claude Sonnet and Google's Gemini now accept massive input contexts - 100,000 to over 1 million tokens!
But processing all that context is slow and expensive...
If you're reusing the same background documentation across queries, there's a trick to get a 2x speed-up and a 4x cost reduction:
➡️ Context Caching
- Stores the results of the computation the model already did on your background information (the attention key/value cache)
- Reuses those results for later queries
=> 2x faster, 4x cheaper.
💡 Implementation tips:
- Put all background info at the start of your prompt
- Use specific headers/parameters for Claude and Gemini (see video)
- For open-source deployment, tools like SGLang can automate caching
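Here's a minimal sketch of what the Claude side looks like - building a Messages API request body that marks the background doc as cacheable. The `cache_control` field follows Anthropic's prompt-caching docs; the model id and doc text are placeholders, so check the current API reference before relying on this.

```python
import json

# Placeholder standing in for your large, reused background documentation.
BACKGROUND_DOC = "...thousands of tokens of reference docs..."

def build_request(question: str) -> dict:
    """Build a Messages API payload that caches the background doc prefix."""
    return {
        "model": "claude-3-5-sonnet-20241022",  # placeholder model id
        "max_tokens": 500,
        # Put the big, stable context FIRST so the cached prefix matches
        # across requests; only the user question below changes.
        "system": [
            {
                "type": "text",
                "text": BACKGROUND_DOC,
                # Marks everything up to this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }

payload = build_request("Summarise section 2 of the docs.")
print(json.dumps(payload, indent=2))
```

Gemini works differently - you upload the shared context once as a cached content object and reference it by name in later calls - but the same "stable prefix first, changing question last" principle applies.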
📊 Cost comparison (100K input tokens, 500 output, 10 requests):
Gemini Pro with caching: ~$1
Gemini Pro without caching: ~$4
Claude Sonnet with caching: ~$0.67
Claude Sonnet without caching: ~$3
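The comparison above comes down to simple arithmetic. Here's a back-of-envelope cost model - the per-million-token prices are assumptions (roughly Claude Sonnet's published rates when this was written), so swap in current pricing; with these numbers it lands near the Claude figures above.

```python
# Assumed prices in USD per million tokens (check current pricing pages).
INPUT_PRICE = 3.00        # uncached input
OUTPUT_PRICE = 15.00      # output
CACHE_WRITE_PRICE = 3.75  # writing the cached prefix (first request)
CACHE_READ_PRICE = 0.30   # reading the cached prefix (later requests)

def cost(requests: int = 10, ctx: int = 100_000,
         out: int = 500, cached: bool = False) -> float:
    """Total cost in USD for `requests` queries over a `ctx`-token context."""
    output_cost = requests * out * OUTPUT_PRICE / 1e6
    if not cached:
        # Every request pays full price for the whole context.
        return requests * ctx * INPUT_PRICE / 1e6 + output_cost
    # First request writes the cache; the remaining ones read it cheaply.
    prefix_cost = (ctx * CACHE_WRITE_PRICE
                   + (requests - 1) * ctx * CACHE_READ_PRICE) / 1e6
    return prefix_cost + output_cost

print(f"without caching: ${cost():.2f}")             # roughly $3
print(f"with caching:    ${cost(cached=True):.2f}")  # roughly $0.72
```

The savings grow with more requests, since the expensive cache write is paid only once.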
Cheers, Ronan
More resources at Trelis.com/About