Are you paying to send the same 100-page document with every API call? It's like commissioning a full translation every time you ask a question. Gemini Context Caching is the solution that slashes input token costs by up to 90% in applications with stable, repeated contexts. At Meteora Web, we've implemented it for clients analyzing technical manuals, legal contracts, and product documentation. Let's dive into how it works and how to put it into production now.
What is Gemini Context Caching and why does it save you money?
Context Caching is a feature of the Gemini API (models like Gemini 1.5 Pro and Flash) that allows you to store a context (a document, a long conversation, a set of instructions) and reuse it across multiple requests. Instead of paying the full input price for the entire context every time, you pay a reduced rate for the cached tokens and the normal rate only for the new tokens you send.
The savings are direct: if you have a 200,000 token document you send with every query, with caching you pay an hourly storage fee while the cache lives, a discounted rate for the cached tokens on each request, and the normal rate for query and response tokens. For 100 queries a day, the cost drops dramatically. We saw a client go from €0.50 to €0.05 per request on technical product documents.
This technology is ideal for: chatbots with custom knowledge bases, recurring contract analysis, legal documents, manuals, assistants that work on a static set of documents for long sessions.
How does Gemini Context Caching work?
The mechanism is simple: you upload the content once (a file, text, a list of messages) and get a unique identifier (cachedContent). Then, in subsequent calls, you pass that identifier instead of the original content. Gemini retrieves the context from cache and processes only the new tokens (user questions, updates).
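The "upload once, reuse the identifier" flow can be sketched on the application side as a small registry keyed by a hash of the content. This is only an illustrative sketch: `CacheRegistry`, `content_key`, and `fake_create` are hypothetical names, and the fake upload function stands in for the real API call so the example runs offline.

```python
import hashlib
from typing import Callable, Dict


def content_key(text: str) -> str:
    """Return a stable fingerprint for a document (SHA-256 of its UTF-8 bytes)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class CacheRegistry:
    """Maps document fingerprints to cache identifiers (hypothetical helper)."""

    def __init__(self, create_cache: Callable[[str], str]) -> None:
        # create_cache uploads the text and returns the cache identifier.
        self._create_cache = create_cache
        self._names: Dict[str, str] = {}

    def get_or_create(self, text: str) -> str:
        key = content_key(text)
        if key not in self._names:
            # First time this exact content is seen: upload it once.
            self._names[key] = self._create_cache(text)
        return self._names[key]


# Stand-in for the real upload so the sketch runs without network access.
uploads = []


def fake_create(text: str) -> str:
    uploads.append(text)
    return f"cachedContents/demo{len(uploads)}"


registry = CacheRegistry(fake_create)
first = registry.get_or_create("manual v1")
second = registry.get_or_create("manual v1")   # same content: identifier reused
changed = registry.get_or_create("manual v2")  # content changed: new cache
print(first, second, changed, len(uploads))
```

The sketch does not track expiry, so a production version would also need to drop entries whose cache has passed its TTL.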
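What can be cached? Anything that goes into the contents parameter of a request: arrays of messages (chat), PDFs/images converted to tokens (with multimodal support), long texts. The upper limit is the model's context window (up to 1 million tokens for Gemini 1.5 Pro). There is also a minimum cacheable size per model, so check the documentation for the current threshold before caching small documents.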
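Configurable TTL (Time-To-Live): you can set how long the cache stays active (from minutes to 30 days; the default is 1 hour). After expiry the cache is deleted and storage billing stops, so using the same context again means creating a new cache. Ideal for daily or weekly sessions.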
Costs: storage is billed per hour (e.g., €0.10/hour for 200K tokens on Gemini 1.5 Flash – always check ai.google.dev/pricing). The savings are huge when you send many requests with the same context.
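To make the cost structure concrete, here is a rough estimate of daily input cost with and without a cache. All three prices are placeholder assumptions, not current Google rates, and the function names are hypothetical; output tokens are ignored because they cost the same in both cases.

```python
def daily_cost_without_cache(context_tokens, query_tokens, requests_per_day,
                             input_price_per_m):
    """Every request pays full input price for context plus query."""
    tokens = (context_tokens + query_tokens) * requests_per_day
    return tokens / 1_000_000 * input_price_per_m


def daily_cost_with_cache(context_tokens, query_tokens, requests_per_day,
                          input_price_per_m, cached_price_per_m,
                          storage_price_per_m_hour, hours_alive):
    """Context is billed at the cached rate, plus storage for the cache lifetime."""
    cached = context_tokens * requests_per_day / 1_000_000 * cached_price_per_m
    fresh = query_tokens * requests_per_day / 1_000_000 * input_price_per_m
    storage = context_tokens / 1_000_000 * storage_price_per_m_hour * hours_alive
    return cached + fresh + storage


# Placeholder prices per million tokens (assumptions for illustration only).
INPUT_PRICE = 1.00      # regular input tokens
CACHED_PRICE = 0.25     # cached input tokens
STORAGE_PRICE = 1.00    # per million tokens per hour of storage

plain = daily_cost_without_cache(200_000, 500, 100, INPUT_PRICE)
cached = daily_cost_with_cache(200_000, 500, 100, INPUT_PRICE,
                               CACHED_PRICE, STORAGE_PRICE, hours_alive=8)
print(f"without cache: {plain:.2f}  with cache: {cached:.2f}")
```

Swap in the figures from ai.google.dev/pricing for your model before drawing conclusions from the numbers.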
How to implement Context Caching with the Gemini API in Python?
Here's the code. We use the official google-generativeai library (version ≥0.7.0).
Step 1: Install and configure
pip install google-generativeai
import google.generativeai as genai
genai.configure(api_key="YOUR_API_KEY")
Step 2: Create a cache from a long text
# Example: a 100-page document in a string.
# manual.txt is assumed to contain text already extracted from the PDF;
# a binary .pdf file cannot be read as UTF-8 text.
with open("manual.txt", "r", encoding="utf-8") as f:
    long_text = f.read()
# Create cached content (use a single user message or system message).
# Caching needs a pinned model version, hence the -001 suffix.
cached_content = genai.caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    contents=[
        {
            "role": "user",
            "parts": [long_text]
        }
    ],
    ttl=3600  # 1 hour, in seconds
)
print(cached_content.name)  # e.g. 'cachedContents/abc123'
Step 3: Use the cache in requests
# Retrieve cache by id
cache = genai.caching.CachedContent.get("cachedContents/abc123")
# Create a model bound to the cache (it uses the model the cache was created with)
model_cached = genai.GenerativeModel.from_cached_content(cached_content=cache)
# Now ask questions
response = model_cached.generate_content("What is the maintenance chapter?")
print(response.text)
Note: from_cached_content also accepts the cache name as a string (e.g. 'cachedContents/abc123'), which saves the explicit get call, but the method above makes the lookup visible.
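To confirm that a request really hit the cache, you can look at the token counts the API reports in response.usage_metadata. The sketch below assumes the fields prompt_token_count and cached_content_token_count, with the prompt count including the cached tokens; `cached_fraction` is a hypothetical helper, and a SimpleNamespace stands in for the real metadata object so the example runs offline.

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Share of prompt tokens served from the cache, between 0.0 and 1.0."""
    prompt = getattr(usage, "prompt_token_count", 0) or 0
    cached = getattr(usage, "cached_content_token_count", 0) or 0
    if prompt == 0:
        return 0.0
    return cached / prompt


# Stand-in for response.usage_metadata (values are made up for illustration).
usage = SimpleNamespace(
    prompt_token_count=200_120,
    cached_content_token_count=200_000,
    candidates_token_count=85,
)
print(f"{cached_fraction(usage):.1%} of prompt tokens came from the cache")
```

With a real response you would call cached_fraction(response.usage_metadata); a value near zero suggests the request was sent without the cache.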
Step 4: Update TTL and delete
# Reset the TTL so the cache expires 2 hours from now
cache.update(ttl=7200)
# Delete when no longer needed
cache.delete()
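Because storage is billed for as long as the cache exists, it can help to guarantee deletion even when a session fails halfway. One way to sketch this is a context manager; `temporary_cache` and `FakeCache` are hypothetical names, and the fake object only mimics the delete() method so the example runs without the API.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cache(create):
    """Create a cache, hand it to the caller, and always delete it afterwards."""
    cache = create()
    try:
        yield cache
    finally:
        # Runs on normal exit and on exceptions alike.
        cache.delete()


class FakeCache:
    """Offline stand-in exposing the same delete() method as a real cache."""

    def __init__(self):
        self.name = "cachedContents/demo"
        self.deleted = False

    def delete(self):
        self.deleted = True


fake = FakeCache()
try:
    with temporary_cache(lambda: fake) as cache:
        raise RuntimeError("simulated failure mid-session")
except RuntimeError:
    pass
print(fake.deleted)
```

With the real SDK, the create argument would be a function that calls genai.caching.CachedContent.create with your model, contents, and TTL.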
What are the limitations and best practices of Context Caching?
Limitations:
- The cache has an hourly storage cost for as long as it exists, whether or not you query it. If you make few requests, it's better not to use it.
- You can't modify the internal context without recreating the cache. If the document changes often, you need to invalidate and reload.
- Works only with models that support caching: Gemini 1.5 Pro and Flash or newer, not older ones.
Best practices from Meteora Web:
- Calculate the break-even point: for files <10K tokens and fewer than 10 requests/day, it's cheaper to pay normal input. Beyond that, caching wins.
- Set TTL based on session length: for daily assistants use 24 hours; for fixed documents, 7-30 days.
- Monitor costs: use the Google Cloud console to track storage and API logs to follow requests.
- Optimize the content: before caching, reduce text to what's needed (remove boilerplate, indexes). Fewer tokens = lower storage costs.
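The break-even point can be estimated with a simple hourly model. Per million context tokens, r requests per hour cost r × input price without a cache, and r × cached price + storage price with one, so caching pays off once r exceeds storage price / (input price − cached price). The function name and the prices below are assumptions for illustration, not Google's rates.

```python
def break_even_requests_per_hour(input_price_per_m, cached_price_per_m,
                                 storage_price_per_m_hour):
    """Requests per hour above which caching is cheaper than plain input."""
    saving_per_m = input_price_per_m - cached_price_per_m
    if saving_per_m <= 0:
        raise ValueError("cached tokens must be cheaper than regular input tokens")
    return storage_price_per_m_hour / saving_per_m


# Placeholder prices per million tokens (replace with the current price list).
threshold = break_even_requests_per_hour(
    input_price_per_m=1.00,
    cached_price_per_m=0.25,
    storage_price_per_m_hour=1.00,
)
print(f"caching pays off above {threshold:.2f} requests per hour")
```

In this simplified model the context size cancels out, so request frequency while the cache is alive is the deciding factor; document size still matters in practice through the minimum cacheable size and the absolute amounts at stake.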
What to do next
- Identify repetition patterns in your project: are you sending the same document, product list, or legal contract with multiple calls?
- Integrate the code above into your backend (Python, Node.js, Go – SDKs all support it).
- Test with a 50K token document and measure cost savings after 10–20 requests.
- Read the official documentation for deeper insight: Google AI – Context Caching.
- If you need custom support for your AI architecture, go back to our pillar guide on Gemini API.
At Meteora Web we've been using Context Caching since its release. On the right projects, the savings are clear and immediate. Try it.