Claude Prompt Caching: Optimizing API Costs & Performance
By Dorian Laurenceau
📅 Last reviewed: April 24, 2026. Updated with April 2026 findings and community feedback.
🔗 Pillar article: Claude API: Complete Guide
What is Prompt Caching?
Prompt caching lets you store the portions of your prompt that don't change between requests, server-side, so they can be reused. Instead of reprocessing the same tokens on every call, Claude reads them directly from cache.
Without Cache vs With Cache
| Aspect | Without cache | With cache |
|---|---|---|
| Processing | All tokens on every call | Only new tokens are processed |
| Input token cost | Standard price | 90% reduction on cached tokens |
| Latency | Proportional to total prompt | Reduced by ~85% for cached tokens |
| First call | Standard | Slightly more expensive (+25% cache write) |
| Subsequent calls | Standard | Much cheaper (cache read) |
The Caching Flow
Call 1 (cache miss):
[System prompt 10K tokens] → Cache Write → Cost: 1.25x (write overhead)
[User question 100 tokens] → Normal
Call 2+ (cache hit):
[System prompt 10K tokens] → Cache Read → Cost: 0.10x (90% savings)
[User question 100 tokens] → Normal
The honest read on prompt caching from engineers who run it in production, surfaced on r/LocalLLaMA, r/MachineLearning, and the Anthropic Discord: the 90% cost reduction number is real and routinely achieved, but only when your workload actually looks like cache hits. A call that reuses the same 10K-token system prompt across thousands of user queries saves money; a call where the "cached" section changes every request is paying the 1.25x write penalty and getting nothing back. The Anthropic prompt caching docs are explicit about this — the 5-minute TTL means that low-traffic endpoints rarely hit warm cache in practice.
Where the community correctly pushes back on naive benchmarks: the cost savings headline assumes a cache hit rate above ~50%. Below that, the write overhead on misses eats the gains. Teams that have measured it properly — see the OpenRouter caching benchmarks and the LangSmith observability posts — get different effective savings depending on whether the workload is a support chatbot (high hit rate), a RAG system with dynamic context (low hit rate), or a batch evaluation (near-100% hit rate once warm).
Pragmatic rule from people who actually shipped it: instrument the cache_read_input_tokens and cache_creation_input_tokens fields on every call, compute hit rate per endpoint, and turn off caching on any endpoint that sits below 30%. The feature pays for itself only on the workloads that match its shape.
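That per-endpoint rule can be sketched as a small tracker — a hypothetical helper, not part of the Anthropic SDK — that consumes the `cache_read_input_tokens` field from each response and flags endpoints that fall below the ~30% threshold:

```python
from collections import defaultdict

class CacheHitTracker:
    """Track cache hit rate per endpoint from Anthropic usage fields."""

    def __init__(self, threshold=0.30):
        self.threshold = threshold
        self.stats = defaultdict(lambda: {"hits": 0, "calls": 0})

    def record(self, endpoint, usage):
        """Feed this the `response.usage` object from each API call."""
        s = self.stats[endpoint]
        s["calls"] += 1
        if getattr(usage, "cache_read_input_tokens", 0) > 0:
            s["hits"] += 1

    def hit_rate(self, endpoint):
        s = self.stats[endpoint]
        return s["hits"] / s["calls"] if s["calls"] else 0.0

    def should_cache(self, endpoint):
        """Per the rule above: disable caching below ~30% hit rate."""
        return self.hit_rate(endpoint) >= self.threshold
```

A support chatbot endpoint will typically clear the threshold easily; a RAG endpoint whose "cached" context changes per request will not, and is a candidate for turning caching off.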
Implementation
Python
```python
import anthropic

client = anthropic.Anthropic()

# Long system prompt with cache_control
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are an expert legal assistant specializing in employment law.
Here is the complete Employment Code that you must use as reference:

Section 1: Employment contracts are subject to common law rules...
[... 10,000 tokens of legal content ...]
Section 500: The provisions of this code are applicable...
""",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "What are an employee's rights in case of economic layoff?"}
    ]
)

# Check cache statistics
print(f"Cache write: {response.usage.cache_creation_input_tokens} tokens")
print(f"Cache read: {response.usage.cache_read_input_tokens} tokens")
print(f"Normal input: {response.usage.input_tokens} tokens")
```
TypeScript
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: `You are a technical assistant. Here is the API documentation:
[... detailed documentation ...]`,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    { role: "user", content: "How do I create a REST endpoint?" },
  ],
});

console.log(`Cache write: ${response.usage.cache_creation_input_tokens}`);
console.log(`Cache read: ${response.usage.cache_read_input_tokens}`);
```
Caching Strategies
1. Fixed System Prompt
The most common use case: a system prompt that never changes between requests.
```python
# The same system prompt is cached for all users
system = [{
    "type": "text",
    "text": "You are an assistant [detailed role + instructions + context]...",
    "cache_control": {"type": "ephemeral"}
}]

# User 1
client.messages.create(model=model, max_tokens=1024, system=system,
                       messages=[{"role": "user", "content": "User 1's question"}])

# User 2 (benefits from cache)
client.messages.create(model=model, max_tokens=1024, system=system,
                       messages=[{"role": "user", "content": "User 2's question"}])
```
2. Reference Documents
Include a large document once, ask many questions about it.
```python
# Cache the document
system = [{
    "type": "text",
    "text": f"Reference document:\n\n{long_document}",
    "cache_control": {"type": "ephemeral"}
}]

# 50 questions on the same document → 49 cache hits
questions = ["Summarize chapter 3", "What risks are mentioned?", ...]
for q in questions:
    response = client.messages.create(
        model=model, max_tokens=1024, system=system,
        messages=[{"role": "user", "content": q}]
    )
```
3. Few-Shot Examples
Cache a set of examples reused in many requests.
```python
system = [{
    "type": "text",
    "text": """Sentiment classification. Examples:

Text: "This product is fantastic!" → Positive
Text: "Late delivery, disappointed." → Negative
Text: "Fine, nothing special." → Neutral
[... 50 examples ...]
""",
    "cache_control": {"type": "ephemeral"}
}]

# Classify thousands of texts with cached examples
for text in texts_to_classify:
    response = client.messages.create(
        model=model, max_tokens=100, system=system,
        messages=[{"role": "user", "content": f"Classify: \"{text}\""}]
    )
```
4. Cache in Messages (Conversation)
You can also cache content in messages (not just the system).
```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Here is a 50-page report:\n\n{report}",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": "Summarize the main conclusions."
            }
        ]
    }
]
```
TTL and Cache Management
How TTL Works
| Event | Effect on TTL |
|---|---|
| Cache write (first call) | TTL initialized to 5 minutes |
| Cache hit (subsequent call) | TTL reset to 5 minutes |
| No call for 5 min | Cache expired, next call = cache write |
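Because each hit resets the 5-minute clock, a request stream stays warm only if every gap between consecutive calls is under the TTL — which is exactly why low-traffic endpoints pay repeated write penalties. A small simulation of this mechanic (assuming the 5-minute TTL and hit-resets-TTL behavior from the table above):

```python
TTL_SECONDS = 300  # 5-minute cache TTL; each warm call resets it

def classify_calls(arrival_times, ttl=TTL_SECONDS):
    """Given sorted call timestamps (in seconds), label each call
    'write' (cold cache, 1.25x cost) or 'hit' (warm cache, 0.10x cost)."""
    labels = []
    last = None
    for t in arrival_times:
        if last is None or t - last > ttl:
            labels.append("write")  # cache expired -> pay write overhead
        else:
            labels.append("hit")    # warm -> 90% discount on cached tokens
        last = t  # both writes and hits refresh the TTL
    return labels
```

For example, `classify_calls([0, 60, 120, 700])` yields one write, two hits, then another write, because the 580-second gap exceeds the TTL. An endpoint that averages one call every 6+ minutes never sees a hit at all.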
Optimizing TTL
```python
import time
import threading

def keep_cache_warm(client, model, system, interval=240):
    """Keep the cache active by sending periodic pings.
    Note: each ping is itself a billed API call (a cache read
    plus a few uncached tokens)."""
    def ping():
        while True:
            client.messages.create(
                model=model,
                max_tokens=1,
                system=system,
                messages=[{"role": "user", "content": "ping"}]
            )
            time.sleep(interval)  # 4 minutes < 5-minute TTL
    thread = threading.Thread(target=ping, daemon=True)
    thread.start()
```
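Before running a keepalive like this, check that it pays for itself: each ping is a billed cache read. A rough back-of-envelope sketch, assuming Sonnet pricing of $3/M input tokens ($0.30/M cache read, $3.75/M cache write):

```python
def keepalive_cost_per_hour(cached_tokens, interval_s=240,
                            read_price=0.30e-6):
    """Hourly cost of pinging every `interval_s` seconds to keep
    `cached_tokens` warm (ignoring the few uncached ping tokens)."""
    pings_per_hour = 3600 / interval_s
    return pings_per_hour * cached_tokens * read_price

def rewrite_cost(cached_tokens, write_price=3.75e-6):
    """Cost of one cold cache write if the cache had expired instead."""
    return cached_tokens * write_price
```

Under these assumed prices, keeping a 10K-token prompt warm costs about $0.045/hour, while a single cold rewrite costs about $0.0375 — so the keepalive only wins if expiry-and-rewrite would otherwise happen more than roughly once an hour, and the real benefit is usually latency, not cost.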
Savings Calculation
Savings Formula
Savings per request = (cached_tokens × normal_price) - (cached_tokens × cache_read_price)
= cached_tokens × normal_price × 0.90
First request surcharge = cached_tokens × normal_price × 0.25
Break-even = 1 + (surcharge / savings_per_request)
= 1 + (0.25 / 0.90) ≈ 1.28 requests (so from the 2nd call)
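The formula above can be turned into a small calculator, using the same pricing assumptions as the table that follows (base input $3/M, cache write at 1.25x, cache read at 0.10x, full hit rate after the first call):

```python
BASE = 3e-6          # $/token, Claude Sonnet input (assumed)
WRITE = BASE * 1.25  # cache write surcharge
READ = BASE * 0.10   # cache read discount

def cost_without_cache(cached_tokens, n_requests):
    return n_requests * cached_tokens * BASE

def cost_with_cache(cached_tokens, n_requests):
    """1 cache write + (n-1) cache reads (assumes every later call hits)."""
    return cached_tokens * WRITE + (n_requests - 1) * cached_tokens * READ

def savings_pct(cached_tokens, n_requests):
    without = cost_without_cache(cached_tokens, n_requests)
    return 1 - cost_with_cache(cached_tokens, n_requests) / without
```

Note that under a full hit rate the percentage saved depends only on the number of requests, not on prompt size: at 100 requests it is about 89% for any token count, and caching already beats no-cache by the 2nd request (1.25 + 0.10 = 1.35x vs 2x).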
Savings Table
| Cached tokens | Without cache (100 req) | With cache (100 req) | Savings |
|---|---|---|---|
| 1,000 | $0.30 | $0.03 | ~89% |
| 5,000 | $1.50 | $0.17 | ~89% |
| 10,000 | $3.00 | $0.33 | ~89% |
| 50,000 | $15.00 | $1.67 | ~89% |
| 100,000 | $30.00 | $3.35 | ~89% |
Prices based on Claude Sonnet at $3/M input tokens, over 100 requests (1 cache write at 1.25x + 99 cache reads at 0.10x). With a full hit rate the percentage saved is independent of prompt size; real-world savings drop as the hit rate falls.
Monitoring Cache
Metrics to Track
```python
def log_cache_metrics(response):
    """Log cache metrics for monitoring."""
    usage = response.usage
    cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
    cache_read = getattr(usage, 'cache_read_input_tokens', 0)
    regular_input = usage.input_tokens

    cache_hit = cache_read > 0
    total = cache_read + regular_input
    cache_ratio = cache_read / total if total > 0 else 0

    print(f"Cache hit: {cache_hit}")
    print(f"Cache read: {cache_read} tokens")
    print(f"Cache write: {cache_write} tokens")
    print(f"Regular input: {regular_input} tokens")
    print(f"Cache ratio: {cache_ratio:.1%}")

    return {
        "cache_hit": cache_hit,
        "cache_read_tokens": cache_read,
        "cache_write_tokens": cache_write,
        "regular_input_tokens": regular_input,
        "cache_ratio": cache_ratio
    }
```
Cache Dashboard
| Metric | Target | Alert if |
|---|---|---|
| Cache hit rate | > 90% | < 70% |
| Cache ratio (cached tokens / total) | > 80% | < 50% |
| Cache writes / hour | Stable | Sudden spikes (cache expired) |
| Monthly savings | Predictable | Unexpected decrease |
Constraints and Limits
| Constraint | Detail |
|---|---|
| Minimum tokens | Content to cache must contain at least 1,024 tokens (2,048 for Haiku) |
| TTL | 5 minutes by default; a 1-hour TTL is offered at a higher cache-write price |
| Block order | Cached content must be at the beginning (prefix) |
| Number of breakpoints | Maximum 4 cache points per request |
| Compatibility | Works with streaming, tool use, images |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| No cache hit | Content differs slightly between calls | The cached prefix must be identical byte for byte |
| Cache expires too quickly | No calls for 5+ minutes | Implement a warmup ping |
| Unexpected surcharge | Too many cache writes vs reads | Verify content stability |
| Minimum not met | Content < 1,024 tokens | Add context or combine content |
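The "identical byte for byte" requirement is easy to violate silently — a timestamp interpolated into an f-string, or JSON re-serialized with different key order, produces a guaranteed miss on every call. One defensive sketch (a hypothetical helper, not an SDK feature) is to hash the prefix you intend to cache and flag any drift between calls:

```python
import hashlib

_last_prefix_hash = {}

def check_prefix_stable(endpoint, cached_text):
    """Return True if the cacheable prefix is byte-identical to the
    previous call on this endpoint; a change means a guaranteed miss."""
    h = hashlib.sha256(cached_text.encode("utf-8")).hexdigest()
    stable = _last_prefix_hash.get(endpoint) == h
    _last_prefix_hash[endpoint] = h
    return stable
```

Logging the result alongside the usage metrics above makes "why did my hit rate collapse?" a one-line query instead of a debugging session.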
FAQ
What is Claude prompt caching?
Prompt caching lets you cache portions of your prompt (system prompt, documents, examples) so they aren't reprocessed on every request. Cached tokens cost 90% less and reduce latency by 85%.
How does prompt caching work?
Add a cache_control marker on a content block. On the first call, the content is cached (cache write). Subsequent calls with the same content use the cache (cache hit) at reduced cost.
How long does the cache last?
The default TTL (Time-To-Live) is 5 minutes. Each cache hit resets the TTL. The cache expires automatically after 5 minutes without use.
When should I use prompt caching?
Use it when you send the same large content in multiple requests: long system prompts, reference documents, few-shot examples, fixed conversation context.
Is there an extra cost for cache writes?
Yes, cache writes cost 25% more than normal input tokens. But cache reads cost 90% less. Break-even is reached on the 2nd call with a cache hit.