Claude Prompt Caching: Optimizing API Costs & Performance
By Dorian Laurenceau
📅 Last reviewed: April 24, 2026. Updated with April 2026 findings and community feedback.
🔗 Pillar article: Claude API: Complete Guide
What is Prompt Caching?
Prompt caching lets you store the portions of your prompt that don't change between requests, server-side, so they can be reused. Instead of reprocessing the same tokens on every call, Claude reads them directly from cache.
Without Cache vs With Cache
| Aspect | Without cache | With cache |
|---|---|---|
| Processing | All tokens on every call | Only new tokens are processed |
| Input token cost | Standard price | 90% reduction on cached tokens |
| Latency | Proportional to total prompt | Reduced by ~85% for cached tokens |
| First call | Standard | Slightly more expensive (+25% cache write) |
| Subsequent calls | Standard | Much cheaper (cache read) |
The Caching Flow
Call 1 (cache miss):
[System prompt 10K tokens] → Cache Write → Cost: 1.25x (write overhead)
[User question 100 tokens] → Normal
Call 2+ (cache hit):
[System prompt 10K tokens] → Cache Read → Cost: 0.10x (90% savings)
[User question 100 tokens] → Normal
The honest read on prompt caching from engineers who run it in production, surfaced on r/LocalLLaMA, r/MachineLearning, and the Anthropic Discord: the 90% cost reduction number is real and routinely achieved, but only when your workload actually looks like cache hits. A call that reuses the same 10K-token system prompt across thousands of user queries saves money; a call where the "cached" section changes every request is paying the 1.25x write penalty and getting nothing back. The Anthropic prompt caching docs are explicit about this — the 5-minute TTL means that low-traffic endpoints rarely hit warm cache in practice.
Where the community correctly pushes back on naive benchmarks: the cost savings headline assumes a cache hit rate above ~50%. Below that, the write overhead on misses eats the gains. Teams that have measured it properly — see the OpenRouter caching benchmarks and the LangSmith observability posts — get different effective savings depending on whether the workload is a support chatbot (high hit rate), a RAG system with dynamic context (low hit rate), or a batch evaluation (near-100% hit rate once warm).
Pragmatic rule from people who actually shipped it: instrument the cache_read_input_tokens and cache_creation_input_tokens fields on every call, compute hit rate per endpoint, and turn off caching on any endpoint that sits below 30%. The feature pays for itself only on the workloads that match its shape.
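That per-endpoint rule can be sketched as a small tracker — a hypothetical helper, not part of the Anthropic SDK — that consumes the `cache_read_input_tokens` field from each response and flags endpoints that fall below the ~30% threshold:

```python
from collections import defaultdict

class CacheHitTracker:
    """Track cache hit rate per endpoint from Anthropic usage fields."""

    def __init__(self, threshold=0.30):
        self.threshold = threshold
        self.stats = defaultdict(lambda: {"hits": 0, "calls": 0})

    def record(self, endpoint, usage):
        """Feed this the `response.usage` object from each API call."""
        s = self.stats[endpoint]
        s["calls"] += 1
        if getattr(usage, "cache_read_input_tokens", 0) > 0:
            s["hits"] += 1

    def hit_rate(self, endpoint):
        s = self.stats[endpoint]
        return s["hits"] / s["calls"] if s["calls"] else 0.0

    def should_cache(self, endpoint):
        """Per the rule above: disable caching below ~30% hit rate."""
        return self.hit_rate(endpoint) >= self.threshold
```

A support chatbot endpoint will typically clear the threshold easily; a RAG endpoint whose "cached" context changes per request will not, and is a candidate for turning caching off.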
Implementation
Python
```python
import anthropic

client = anthropic.Anthropic()

# Long system prompt with cache_control
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are an expert legal assistant specializing in employment law.
Here is the complete Employment Code that you must use as reference:

Section 1: Employment contracts are subject to common law rules...
[... 10,000 tokens of legal content ...]
Section 500: The provisions of this code are applicable...
""",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "What are an employee's rights in case of economic layoff?"}
    ]
)

# Check cache statistics
print(f"Cache write: {response.usage.cache_creation_input_tokens} tokens")
print(f"Cache read: {response.usage.cache_read_input_tokens} tokens")
print(f"Normal input: {response.usage.input_tokens} tokens")
```
TypeScript
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: `You are a technical assistant. Here is the API documentation:
[... detailed documentation ...]`,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    { role: "user", content: "How do I create a REST endpoint?" },
  ],
});

console.log(`Cache write: ${response.usage.cache_creation_input_tokens}`);
console.log(`Cache read: ${response.usage.cache_read_input_tokens}`);
```
Caching Strategies
1. Fixed System Prompt
The most common use case: a system prompt that never changes between requests.
```python
# The same system prompt is cached for all users
system = [{
    "type": "text",
    "text": "You are an assistant [detailed role + instructions + context]...",
    "cache_control": {"type": "ephemeral"}
}]

# User 1
client.messages.create(model=model, max_tokens=1024, system=system,
                       messages=[{"role": "user", "content": "User 1's question"}])

# User 2 (benefits from cache)
client.messages.create(model=model, max_tokens=1024, system=system,
                       messages=[{"role": "user", "content": "User 2's question"}])
```
2. Reference Documents
Include a large document once, ask many questions about it.
```python
# Cache the document
system = [{
    "type": "text",
    "text": f"Reference document:\n\n{long_document}",
    "cache_control": {"type": "ephemeral"}
}]

# 50 questions on the same document → 49 cache hits
questions = ["Summarize chapter 3", "What risks are mentioned?", ...]
for q in questions:
    response = client.messages.create(
        model=model, max_tokens=1024, system=system,
        messages=[{"role": "user", "content": q}]
    )
```
3. Few-Shot Examples
Cache a set of examples reused in many requests.
```python
system = [{
    "type": "text",
    "text": """Sentiment classification. Examples:

Text: "This product is fantastic!" → Positive
Text: "Late delivery, disappointed." → Negative
Text: "Fine, nothing special." → Neutral
[... 50 examples ...]
""",
    "cache_control": {"type": "ephemeral"}
}]

# Classify thousands of texts with cached examples
for text in texts_to_classify:
    response = client.messages.create(
        model=model, max_tokens=100, system=system,
        messages=[{"role": "user", "content": f"Classify: \"{text}\""}]
    )
```
4. Cache in Messages (Conversation)
You can also cache content in messages (not just the system).
```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Here is a 50-page report:\n\n{report}",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": "Summarize the main conclusions."
            }
        ]
    }
]
```
TTL and Cache Management
How TTL Works
| Event | Effect on TTL |
|---|---|
| Cache write (first call) | TTL initialized to 5 minutes |
| Cache hit (subsequent call) | TTL reset to 5 minutes |
| No call for 5 min | Cache expired, next call = cache write |
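Because each hit resets the 5-minute clock, a request stream stays warm only if every gap between consecutive calls is under the TTL — which is exactly why low-traffic endpoints pay repeated write penalties. A small simulation of this mechanic (assuming the 5-minute TTL and hit-resets-TTL behavior from the table above):

```python
TTL_SECONDS = 300  # 5-minute cache TTL; each warm call resets it

def classify_calls(arrival_times, ttl=TTL_SECONDS):
    """Given sorted call timestamps (in seconds), label each call
    'write' (cold cache, 1.25x cost) or 'hit' (warm cache, 0.10x cost)."""
    labels = []
    last = None
    for t in arrival_times:
        if last is None or t - last > ttl:
            labels.append("write")  # cache expired -> pay write overhead
        else:
            labels.append("hit")    # warm -> 90% discount on cached tokens
        last = t  # both writes and hits refresh the TTL
    return labels
```

For example, `classify_calls([0, 60, 120, 700])` yields one write, two hits, then another write, because the 580-second gap exceeds the TTL. An endpoint that averages one call every 6+ minutes never sees a hit at all.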
Optimizing TTL
```python
import time
import threading

def keep_cache_warm(client, model, system, interval=240):
    """Keep the cache active by sending periodic pings.
    Note: each ping is itself a billed API call (a cache read
    plus a few uncached tokens)."""
    def ping():
        while True:
            client.messages.create(
                model=model,
                max_tokens=1,
                system=system,
                messages=[{"role": "user", "content": "ping"}]
            )
            time.sleep(interval)  # 4 minutes < 5-minute TTL
    thread = threading.Thread(target=ping, daemon=True)
    thread.start()
```
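Before running a keepalive like this, check that it pays for itself: each ping is a billed cache read. A rough back-of-envelope sketch, assuming Sonnet pricing of $3/M input tokens ($0.30/M cache read, $3.75/M cache write):

```python
def keepalive_cost_per_hour(cached_tokens, interval_s=240,
                            read_price=0.30e-6):
    """Hourly cost of pinging every `interval_s` seconds to keep
    `cached_tokens` warm (ignoring the few uncached ping tokens)."""
    pings_per_hour = 3600 / interval_s
    return pings_per_hour * cached_tokens * read_price

def rewrite_cost(cached_tokens, write_price=3.75e-6):
    """Cost of one cold cache write if the cache had expired instead."""
    return cached_tokens * write_price
```

Under these assumed prices, keeping a 10K-token prompt warm costs about $0.045/hour, while a single cold rewrite costs about $0.0375 — so the keepalive only wins if expiry-and-rewrite would otherwise happen more than roughly once an hour, and the real benefit is usually latency, not cost.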
Savings Calculation
Savings Formula
Savings per request = (cached_tokens × normal_price) - (cached_tokens × cache_read_price)
= cached_tokens × normal_price × 0.90
First request surcharge = cached_tokens × normal_price × 0.25
Break-even = 1 + (surcharge / savings_per_request)
= 1 + (0.25 / 0.90) ≈ 1.28 requests (so from the 2nd call)
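The formula above can be turned into a small calculator, using the same pricing assumptions as the table that follows (base input $3/M, cache write at 1.25x, cache read at 0.10x, full hit rate after the first call):

```python
BASE = 3e-6          # $/token, Claude Sonnet input (assumed)
WRITE = BASE * 1.25  # cache write surcharge
READ = BASE * 0.10   # cache read discount

def cost_without_cache(cached_tokens, n_requests):
    return n_requests * cached_tokens * BASE

def cost_with_cache(cached_tokens, n_requests):
    """1 cache write + (n-1) cache reads (assumes every later call hits)."""
    return cached_tokens * WRITE + (n_requests - 1) * cached_tokens * READ

def savings_pct(cached_tokens, n_requests):
    without = cost_without_cache(cached_tokens, n_requests)
    return 1 - cost_with_cache(cached_tokens, n_requests) / without
```

Note that under a full hit rate the percentage saved depends only on the number of requests, not on prompt size: at 100 requests it is about 89% for any token count, and caching already beats no-cache by the 2nd request (1.25 + 0.10 = 1.35x vs 2x).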
Savings Table
| Cached tokens | Without cache (100 req) | With cache (100 req) | Savings |
|---|---|---|---|
| 1,000 | $0.30 | $0.03 | ~89% |
| 5,000 | $1.50 | $0.17 | ~89% |
| 10,000 | $3.00 | $0.33 | ~89% |
| 50,000 | $15.00 | $1.67 | ~89% |
| 100,000 | $30.00 | $3.35 | ~89% |
Prices based on Claude Sonnet at $3/M input tokens, over 100 requests (1 cache write at 1.25x + 99 cache reads at 0.10x). With a full hit rate the percentage saved is independent of prompt size; real-world savings drop as the hit rate falls.
Monitoring Cache
Metrics to Track
```python
def log_cache_metrics(response):
    """Log cache metrics for monitoring."""
    usage = response.usage
    cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
    cache_read = getattr(usage, 'cache_read_input_tokens', 0)
    regular_input = usage.input_tokens

    cache_hit = cache_read > 0
    total = cache_read + regular_input
    cache_ratio = cache_read / total if total > 0 else 0

    print(f"Cache hit: {cache_hit}")
    print(f"Cache read: {cache_read} tokens")
    print(f"Cache write: {cache_write} tokens")
    print(f"Regular input: {regular_input} tokens")
    print(f"Cache ratio: {cache_ratio:.1%}")

    return {
        "cache_hit": cache_hit,
        "cache_read_tokens": cache_read,
        "cache_write_tokens": cache_write,
        "regular_input_tokens": regular_input,
        "cache_ratio": cache_ratio
    }
```
Cache Dashboard
| Metric | Target | Alert if |
|---|---|---|
| Cache hit rate | > 90% | < 70% |
| Cache ratio (cached tokens / total) | > 80% | < 50% |
| Cache writes / hour | Stable | Sudden spikes (cache expired) |
| Monthly savings | Predictable | Unexpected decrease |
Constraints and Limits
| Constraint | Detail |
|---|---|
| Minimum tokens | Content to cache must contain at least 1,024 tokens (2,048 for Haiku) |
| TTL | 5 minutes by default; a 1-hour TTL is offered at a higher cache-write price |
| Block order | Cached content must be at the beginning (prefix) |
| Number of breakpoints | Maximum 4 cache points per request |
| Compatibility | Works with streaming, tool use, images |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| No cache hit | Content differs slightly between calls | The cached prefix must be identical byte for byte |
| Cache expires too quickly | No calls for 5+ minutes | Implement a warmup ping |
| Unexpected surcharge | Too many cache writes vs reads | Verify content stability |
| Minimum not met | Content < 1,024 tokens | Add context or combine content |
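The "identical byte for byte" requirement is easy to violate silently — a timestamp interpolated into an f-string, or JSON re-serialized with different key order, produces a guaranteed miss on every call. One defensive sketch (a hypothetical helper, not an SDK feature) is to hash the prefix you intend to cache and flag any drift between calls:

```python
import hashlib

_last_prefix_hash = {}

def check_prefix_stable(endpoint, cached_text):
    """Return True if the cacheable prefix is byte-identical to the
    previous call on this endpoint; a change means a guaranteed miss."""
    h = hashlib.sha256(cached_text.encode("utf-8")).hexdigest()
    stable = _last_prefix_hash.get(endpoint) == h
    _last_prefix_hash[endpoint] = h
    return stable
```

Logging the result alongside the usage metrics above makes "why did my hit rate collapse?" a one-line query instead of a debugging session.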
FAQ
What is Claude prompt caching?
Prompt caching lets you cache portions of your prompt (system prompt, documents, examples) so they aren't reprocessed on every request. Cached tokens cost 90% less and reduce latency by 85%.
How does prompt caching work?
Add a cache_control marker on a content block. On the first call, the content is cached (cache write). Subsequent calls with the same content use the cache (cache hit) at reduced cost.
How long does the cache last?
The default TTL (Time-To-Live) is 5 minutes. Each cache hit resets the TTL. The cache expires automatically after 5 minutes without use.
When should I use prompt caching?
Use it when you send the same large content in multiple requests: long system prompts, reference documents, few-shot examples, fixed conversation context.
Is there an extra cost for cache writes?
Yes, cache writes cost 25% more than normal input tokens. But cache reads cost 90% less. Break-even is reached on the 2nd call with a cache hit.