Claude Prompt Caching: Optimizing API Costs & Performance
By Learnia Team
📅 Last updated: March 10, 2026 — Covers prompt caching with the Messages API, Python and TypeScript SDKs.
🔗 Pillar article: Claude API: Complete Guide
What is Prompt Caching?
Prompt caching lets you store the unchanging portions of your prompt (system instructions, reference documents, examples) on Anthropic's servers between requests. Instead of reprocessing the same tokens on every call, Claude reads them directly from the cache.
Without Cache vs With Cache
| Aspect | Without cache | With cache |
|---|---|---|
| Processing | All tokens on every call | Only new tokens are processed |
| Input token cost | Standard price | 90% reduction on cached tokens |
| Latency | Proportional to total prompt | Reduced by ~85% for cached tokens |
| First call | Standard | Slightly more expensive (+25% cache write) |
| Subsequent calls | Standard | Much cheaper (cache read) |
The Caching Flow
Call 1 (cache miss):
[System prompt 10K tokens] → Cache Write → Cost: 1.25x (write overhead)
[User question 100 tokens] → Normal
Call 2+ (cache hit):
[System prompt 10K tokens] → Cache Read → Cost: 0.10x (90% savings)
[User question 100 tokens] → Normal
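To make the flow concrete, here is a quick cost sketch for the example above, assuming Claude Sonnet input pricing of $3 per million tokens and the multipliers from the table (1.25x for a cache write, 0.10x for a cache read):

```python
# Rough cost of the two calls sketched above, assuming Claude Sonnet
# input pricing of $3 per million tokens (write = 1.25x, read = 0.10x).
PRICE_PER_MTOK = 3.00
SYSTEM_TOKENS = 10_000   # cached system prompt
QUESTION_TOKENS = 100    # fresh user question on every call

def cost(tokens: int, multiplier: float = 1.0) -> float:
    """Dollar cost of `tokens` input tokens at the given price multiplier."""
    return tokens * multiplier * PRICE_PER_MTOK / 1_000_000

call_1 = cost(SYSTEM_TOKENS, 1.25) + cost(QUESTION_TOKENS)  # cache write
call_2 = cost(SYSTEM_TOKENS, 0.10) + cost(QUESTION_TOKENS)  # cache read
print(f"Call 1 (cache miss): ${call_1:.5f}")  # $0.03780
print(f"Call 2+ (cache hit): ${call_2:.5f}")  # $0.00330
```

The 10,000 cached tokens dominate the bill, which is why the per-call cost drops by an order of magnitude once the cache is warm.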
Implementation
Python
```python
import anthropic

client = anthropic.Anthropic()

# Long system prompt with cache_control
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are an expert legal assistant specializing in employment law.
Here is the complete Employment Code that you must use as reference:

Section 1: Employment contracts are subject to common law rules...
[... 10,000 tokens of legal content ...]
Section 500: The provisions of this code are applicable...
""",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "What are an employee's rights in case of economic layoff?"}
    ],
)

# Check cache statistics
print(f"Cache write: {response.usage.cache_creation_input_tokens} tokens")
print(f"Cache read: {response.usage.cache_read_input_tokens} tokens")
print(f"Normal input: {response.usage.input_tokens} tokens")
```
TypeScript
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: `You are a technical assistant. Here is the API documentation:
[... detailed documentation ...]`,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    { role: "user", content: "How do I create a REST endpoint?" },
  ],
});

console.log(`Cache write: ${response.usage.cache_creation_input_tokens}`);
console.log(`Cache read: ${response.usage.cache_read_input_tokens}`);
```
Caching Strategies
1. Fixed System Prompt
The most common use case: a system prompt that never changes between requests.
```python
# The same system prompt is cached for all users
system = [{
    "type": "text",
    "text": "You are an assistant [detailed role + instructions + context]...",
    "cache_control": {"type": "ephemeral"},
}]

# User 1
client.messages.create(model=model, max_tokens=1024, system=system,
                       messages=[{"role": "user", "content": "User 1's question"}])

# User 2 (benefits from cache)
client.messages.create(model=model, max_tokens=1024, system=system,
                       messages=[{"role": "user", "content": "User 2's question"}])
```
2. Reference Documents
Include a large document once, ask many questions about it.
```python
# Cache the document
system = [{
    "type": "text",
    "text": f"Reference document:\n\n{long_document}",
    "cache_control": {"type": "ephemeral"},
}]

# 50 questions on the same document → 49 cache hits
questions = ["Summarize chapter 3", "What risks are mentioned?", ...]
for q in questions:
    response = client.messages.create(
        model=model, max_tokens=1024, system=system,
        messages=[{"role": "user", "content": q}],
    )
```
3. Few-Shot Examples
Cache a set of examples reused in many requests.
```python
system = [{
    "type": "text",
    "text": """Sentiment classification. Examples:

Text: "This product is fantastic!" → Positive
Text: "Late delivery, disappointed." → Negative
Text: "Fine, nothing special." → Neutral
[... 50 examples ...]
""",
    "cache_control": {"type": "ephemeral"},
}]

# Classify thousands of texts with cached examples
for text in texts_to_classify:
    response = client.messages.create(
        model=model, max_tokens=100, system=system,
        messages=[{"role": "user", "content": f"Classify: \"{text}\""}],
    )
```
4. Cache in Messages (Conversation)
You can also cache content in messages (not just the system).
```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Here is a 50-page report:\n\n{report}",
                "cache_control": {"type": "ephemeral"},
            },
            {
                "type": "text",
                "text": "Summarize the main conclusions.",
            },
        ],
    }
]
```
TTL and Cache Management
How TTL Works
| Event | Effect on TTL |
|---|---|
| Cache write (first call) | TTL initialized to 5 minutes |
| Cache hit (subsequent call) | TTL reset to 5 minutes |
| No call for 5 min | Cache expired, next call = cache write |
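The reset behavior above can be sketched with a small tracker — a hypothetical helper for predicting whether the next call should be a cache write or a cache hit, not part of the SDK (times in seconds; 5 minutes is the default TTL):

```python
import time

class CacheTTLTracker:
    """Predict whether the next call will be a cache write or a cache hit."""

    TTL_SECONDS = 5 * 60  # default prompt-cache TTL

    def __init__(self):
        self.last_use = None  # monotonic timestamp of the last cache use

    def record_call(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last_use is None or now - self.last_use >= self.TTL_SECONDS:
            expected = "write"  # cold or expired cache
        else:
            expected = "hit"    # still warm
        self.last_use = now     # every use (write or hit) resets the TTL
        return expected

tracker = CacheTTLTracker()
print(tracker.record_call(now=0))    # write (cold cache)
print(tracker.record_call(now=120))  # hit   (2 min later, still warm)
print(tracker.record_call(now=500))  # write (380 s since last use: expired)
```

Comparing the tracker's prediction against the actual `cache_read_input_tokens` in responses is a cheap way to spot unexpected expirations.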
Optimizing TTL
```python
import time
import threading

def keep_cache_warm(client, model, system, interval=240):
    """Keep the cache active by sending periodic pings."""
    def ping():
        while True:
            client.messages.create(
                model=model,
                max_tokens=1,
                system=system,
                messages=[{"role": "user", "content": "ping"}],
            )
            time.sleep(interval)  # 4 minutes < 5-minute TTL

    thread = threading.Thread(target=ping, daemon=True)
    thread.start()
```
Savings Calculation
Savings Formula
```
savings_per_request = cached_tokens × normal_price − cached_tokens × cache_read_price
                    = cached_tokens × normal_price × 0.90

first_call_surcharge = cached_tokens × normal_price × 0.25

break_even = 1 + (surcharge / savings_per_request)
           = 1 + (0.25 / 0.90) ≈ 1.28 requests (so from the 2nd call)
```
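The same arithmetic as a small helper, using the 1.25x write and 0.10x read multipliers from the pricing above:

```python
def break_even_requests(write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Number of requests after which caching beats not caching.

    The first call pays a (write_mult - 1) surcharge per cached token;
    every later cache hit saves (1 - read_mult) per cached token.
    """
    surcharge = write_mult - 1.0       # 0.25
    savings_per_hit = 1.0 - read_mult  # 0.90
    return 1 + surcharge / savings_per_hit

print(f"{break_even_requests():.2f}")  # 1.28 → caching pays off from the 2nd call
```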
Savings Table
| Cached tokens | Without cache (100 req) | With cache (100 req) | Savings |
|---|---|---|---|
| 1,000 | $0.30 | $0.03 | 89% |
| 5,000 | $1.50 | $0.17 | 89% |
| 10,000 | $3.00 | $0.33 | 89% |
| 50,000 | $15.00 | $1.67 | 89% |
| 100,000 | $30.00 | $3.35 | 89% |
Prices based on Claude Sonnet at $3/M input tokens, over 100 requests (1 cache write + 99 cache reads).
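The rows can be reproduced from those stated assumptions ($3/M input tokens, 1 cache write followed by 99 cache reads), counting only the cached portion of the prompt:

```python
PRICE = 3.00  # $ per million input tokens (Claude Sonnet)

def cost_over_requests(cached_tokens: int, requests: int = 100):
    """Return (cost without cache, cost with cache) for the cached portion only."""
    without = requests * cached_tokens * PRICE / 1e6
    with_cache = (cached_tokens * 1.25                     # 1 cache write
                  + (requests - 1) * cached_tokens * 0.10  # 99 cache reads
                  ) * PRICE / 1e6
    return without, with_cache

for tokens in (1_000, 5_000, 10_000, 50_000, 100_000):
    w, c = cost_over_requests(tokens)
    print(f"{tokens:>7,} tokens: ${w:.2f} → ${c:.2f} ({1 - c / w:.0%} saved)")
```

Note that under these assumptions the savings ratio is the same for every row: with one write and 99 reads, the cached portion always costs (1.25 + 99 × 0.10) / 100 ≈ 11% of the uncached price.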
Monitoring Cache
Metrics to Track
```python
def log_cache_metrics(response):
    """Log cache metrics for monitoring."""
    usage = response.usage
    cache_write = getattr(usage, "cache_creation_input_tokens", 0)
    cache_read = getattr(usage, "cache_read_input_tokens", 0)
    regular_input = usage.input_tokens
    cache_hit = cache_read > 0
    total = cache_read + regular_input
    cache_ratio = cache_read / total if total > 0 else 0

    print(f"Cache hit: {cache_hit}")
    print(f"Cache read: {cache_read} tokens")
    print(f"Cache write: {cache_write} tokens")
    print(f"Regular input: {regular_input} tokens")
    print(f"Cache ratio: {cache_ratio:.1%}")

    return {
        "cache_hit": cache_hit,
        "cache_read_tokens": cache_read,
        "cache_write_tokens": cache_write,
        "regular_input_tokens": regular_input,
        "cache_ratio": cache_ratio,
    }
```
Cache Dashboard
| Metric | Target | Alert if |
|---|---|---|
| Cache hit rate | > 90% | < 70% |
| Cache ratio (cached tokens / total) | > 80% | < 50% |
| Cache writes / hour | Stable | Sudden spikes (cache expired) |
| Monthly savings | Predictable | Unexpected decrease |
Constraints and Limits
| Constraint | Detail |
|---|---|
| Minimum tokens | Content to cache must contain at least 1,024 tokens (2,048 for Haiku models) |
| TTL | 5 minutes by default, refreshed on each use; an extended 1-hour TTL is available at a higher cache-write price |
| Block order | Cached content must be at the beginning (prefix) |
| Number of breakpoints | Maximum 4 cache points per request |
| Compatibility | Works with streaming, tool use, images |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| No cache hit | Content slightly different between calls | Cached content must be byte-for-byte identical |
| Cache expires too quickly | No calls for 5+ minutes | Implement a warmup ping |
| Unexpected surcharge | Too many cache writes vs reads | Verify content stability |
| Minimum not met | Content < 1,024 tokens | Add context or combine content |
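The first error in the table — silent cache misses caused by slightly different content — is easiest to catch by logging a fingerprint of the blocks you mark for caching. `cache_key` below is a hypothetical helper, not part of the SDK:

```python
import hashlib

def cache_key(blocks) -> str:
    """Short fingerprint of the content blocks marked for caching.

    Log this on every request: if it changes between calls, so does the
    prompt prefix, and you pay for a new cache write instead of a hit.
    """
    h = hashlib.sha256()
    for block in blocks:
        h.update(block["text"].encode("utf-8"))
    return h.hexdigest()[:16]

system_a = [{"type": "text", "text": "You are a helpful assistant.",
             "cache_control": {"type": "ephemeral"}}]
system_b = [{"type": "text", "text": "You are a helpful assistant. ",  # one extra space
             "cache_control": {"type": "ephemeral"}}]

print(cache_key(system_a) == cache_key(system_a))  # True  → cache hit expected
print(cache_key(system_a) == cache_key(system_b))  # False → cache miss expected
```

A single stray byte — a timestamp, re-serialized JSON, trailing whitespace — changes the fingerprint and therefore the cache prefix.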
Read Next
- →Contextual Retrieval and Advanced RAG — Leverage prompt caching in your RAG pipelines
- →Advanced MCP: Prompt Caching and Transports — Advanced Model Context Protocol patterns
FAQ
What is Claude prompt caching?
Prompt caching lets you cache portions of your prompt (system prompt, documents, examples) so they aren't reprocessed on every request. Cached tokens cost 90% less and reduce latency by 85%.
How does prompt caching work?
Add a cache_control marker on a content block. On the first call, the content is cached (cache write). Subsequent calls with the same content use the cache (cache hit) at reduced cost.
How long does the cache last?
The default TTL (Time-To-Live) is 5 minutes. Each cache hit resets the TTL. The cache expires automatically after 5 minutes without use.
When should I use prompt caching?
Use it when you send the same large content in multiple requests: long system prompts, reference documents, few-shot examples, fixed conversation context.
Is there an extra cost for cache writes?
Yes, cache writes cost 25% more than normal input tokens. But cache reads cost 90% less. Break-even is reached on the 2nd call with a cache hit.