You just integrated the Claude API into your Python project, everything works. Then you check the bill: every request costs you because you always send the same 40k token context – system prompt, internal documentation, conversation history – even when the answer is just "yes." At $0.15 per million input tokens, it seems small, but multiply by thousands of daily calls: the bill climbs fast. And latency increases because the model must re-read everything each time.
At Meteora Web, we faced this exact problem on platforms that manage clients with fixed technical documents (manuals, policies, catalogs). The solution is prompt caching, a native feature of Claude’s API (Sonnet 4, Opus 4 models) that allows caching portions of the prompt – you pay for writing once and reuse the cache for subsequent requests. In this guide we show how to integrate it correctly, with working Python code, and how to avoid common pitfalls.
Why prompt caching changes the game
Imagine having to upload a 200-page PDF as context for a customer support bot every single time. Without caching, each call sends those tokens, you pay for them, and you wait for encoding. With caching, you send the content once, get a cache identifier, and then reference it. The savings are stark: up to 90% on input cost and a latency reduction of 2-3 seconds per request because the model skips the encoding phase.
We applied it to a corporate FAQ system: before, each answer cost $0.12 in input; after caching, it dropped to $0.015. Over 10,000 monthly requests, the saving was over $1,000. And the perceived latency went from 5 to 2 seconds.
When to use it (and when not)
Caching is useful when:
- You have a long, static system prompt (rules, tone, fixed examples).
- You include reference documents (manuals, PDFs, entire textual databases).
- You manage multi-turn conversations where the history remains stable for many exchanges.
It’s not needed when:
Sponsored Protocol
- The prompt changes radically every request (e.g., single sentence translations).
- Input tokens are already low (under 1,000) – caching has minimal overhead but is unnecessary.
Basic Claude API setup with Python
First, install the official Anthropic client. We always use the latest stable version – the anthropic SDK for Python.
pip install anthropicGet your API key from console.anthropic.com and set it as an environment variable. Never hardcode it.
import os
from anthropic import Anthropic
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
response = client.messages.create(
model="claude-sonnet-4-20250514", # model supporting caching
max_tokens=1024,
messages=[
{"role": "user", "content": "Hello, who are you?"}
]
)
print(response.content[0].text)Note: not all models support caching. At the time of writing, claude-sonnet-4 and claude-opus-4 do (always check the official documentation).
Implementing prompt caching step by step
The mechanism is straightforward: you mark a portion of the prompt as cacheable using a special header or a field in the content. With the Python SDK, we use the metadata field inside a text content block. Here’s how:
from anthropic.types.message import Message
# Static system content (to cache)
system_prompt = """You are an expert automotive mechanic.
Answer precisely and concisely. Always use correct technical terminology.
Never invent specs. If unsure, admit it."""
# Call with caching activated on system prompt
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=[
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{"role": "user", "content": "What is the difference between manual and automatic transmission?"}
]
)
# Read cache headers from response
print("Cache hit:", response.cache_read_input_tokens)
print("Cache creation:", response.cache_creation_input_tokens)The parameter cache_control: {"type": "ephemeral"} tells Anthropic to cache that content. The cache is ephemeral: it lasts about 5 minutes of inactivity, then expires. If you keep sending requests with the exact same system block, the cache stays active.
Sponsored Protocol
Reading cache status
In the response, two fields are critical:
cache_read_input_tokens: tokens read from cache (zero if cache was cold).cache_creation_input_tokens: tokens written to cache (only on first request).
Monitor them to verify savings. On every subsequent request with the same system prompt, cache_read_input_tokens should be positive, while cache_creation_input_tokens will be zero.
Caching on user messages: documents and long context
Often the static context is not just the system prompt but also attached documents (PDFs, articles). You can apply caching to message blocks too. Example: in a support chat, the user manual stays the same for the whole session.
# Simulate a long document (e.g., manual excerpt)
manuale = """
CHAPTER 1: INSTALLATION
1.1 Open the packaging and verify contents...
... (1000 tokens of text) ...
"""
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are a technical support assistant for product X.",
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": maniale,
"cache_control": {"type": "ephemeral"}
},
{
"type": "text",
"text": "How do I install the product?"
}
]
}
]
)Warning: the cache is per exact text block. If you change even one character, the content is re-created. Ensure the cached text is perfectly stable.
Sponsored Protocol
Common errors and debugging
We’ve seen clients struggle to get cache hits. Here’s what we check:
- Unsupported model: use only Sonnet 4 or Opus 4. Older models (Claude 3) don’t support caching.
- Cache too short: if requests are more than 5 minutes apart, cache expires. For long sessions, keep the flow active or recreate the cache periodically.
- Same content, different formatting: spaces, newlines, Unicode characters – matching is exact. Always use the same string.
- Forgetting to read headers: if you don’t check
cache_read_input_tokens, you won’t know if caching is working. We always log it in production.
Cache warming strategy
In production, we handle the case where the cache hasn’t been created (first request) or has expired. The code could warm the cache every 4 minutes with a dummy request if the load is low.
import time
def warm_cache(client, system_text, model="claude-sonnet-4-20250514"):
"""Keep cache alive by sending a minimal request every 4 minutes."""
client.messages.create(
model=model,
max_tokens=1,
system=[{"type": "text", "text": system_text, "cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": "ping"}]
)
# In a separate thread
while True:
warm_cache(client, system_prompt)
time.sleep(240) # 4 minutesOf course, evaluate whether the cost of the maintenance request is lower than the savings. Usually yes, because a 1-token request is nearly free.
Sponsored Protocol
Full integration: conversational assistant template
Let’s put it all together: an assistant that caches the system prompt and a reference document, and manages user history.
import os
from anthropic import Anthropic
client = Anthropic(api_key=os.environ["ANTHROPIC_KEY"])
SYSTEM = """You are a GDPR expert for SMEs.
Answer in English, clearly, with references to the regulation articles.
"""
CONTEXT = """Excerpt from EU Regulation 2016/679:
Article 5: Principles...
Article 17: Right to erasure...
"""
cache_created = False
def ask(question: str, history: list = None) -> str:
global cache_created
messages = history or []
messages.append({"role": "user", "content": question})
# Build content blocks
user_content = [
{"type": "text", "text": CONTEXT, "cache_control": {"type": "ephemeral"}},
{"type": "text", "text": question}
]
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=2048,
system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
messages=[
*messages[:-1], # previous history
{"role": "user", "content": user_content}
]
)
# Debug log
print(f"Cache creation: {response.cache_creation_input_tokens}, Cache read: {response.cache_read_input_tokens}")
reply = response.content[0].text
messages.append({"role": "assistant", "content": reply})
return reply, messagesNote: we do not cache the conversation history because it changes every turn. Only static blocks go into cache.
Sponsored Protocol
Measuring real savings
Anthropic bills cache tokens at a reduced rate (about 1/10 of normal). For claude-sonnet-4, approximate prices:
- Normal input: $3.00 / million tokens
- Cache input: $0.30 / million tokens
- Cache write: $3.75 / million tokens (one-time only)
So, if your prompt has 10,000 fixed tokens and you make 100 requests:
- Without cache: 100 × 10,000 = 1,000,000 input tokens → $3.00
- With cache: first request 10,000 tokens write ($0.0375) + 99 cache reads: 99 × 10,000 = 990,000 tokens at $0.30/million → $0.297 + $0.0375 = $0.3345
Savings: from $3.00 to $0.33 – about 90%.
What to do next
- Identify static prompts in your project: system prompt, reference documents, fixed instructions.
- Implement caching with
cache_controlas shown above, starting with a single content block in isolation. - Monitor cache headers in the response to verify real reuse.
- Consider cache warming with a timer if requests are sporadic but frequent.
- Check official documentation for model updates and pricing: Anthropic Prompt Caching Guide.
At Meteora Web, we apply these techniques every day in our clients' projects. For deeper insights, check our article on the ALE benchmark, which reveals the real limits of AI – another piece to avoid marketing illusions: ALE Benchmark.