
Claude Prompt Caching: Optimizing API Costs & Performance

By Learnia Team


📅 Last updated: March 10, 2026 — Covers prompt caching with the Messages API, Python and TypeScript SDKs.

🔗 Pillar article: Claude API: Complete Guide


What is Prompt Caching?

Prompt caching lets you cache the portions of your prompt that don't change between requests. Instead of reprocessing the same tokens on every call, Claude reads them directly from the cache.

Without Cache vs With Cache

| Aspect | Without cache | With cache |
| --- | --- | --- |
| Processing | All tokens on every call | Only new tokens are processed |
| Input token cost | Standard price | 90% reduction on cached tokens |
| Latency | Proportional to total prompt length | Reduced by ~85% for cached tokens |
| First call | Standard | Slightly more expensive (+25% cache write) |
| Subsequent calls | Standard | Much cheaper (cache read) |

The Caching Flow

Call 1 (cache miss):
  [System prompt 10K tokens] → Cache Write → Cost: 1.25x (write overhead)
  [User question 100 tokens] → Normal

Call 2+ (cache hit):
  [System prompt 10K tokens] → Cache Read → Cost: 0.10x (90% savings)
  [User question 100 tokens] → Normal
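To see what these multipliers mean in dollars, here is the arithmetic for the flow above, assuming Claude Sonnet's $3 per million input tokens (the rate used in the savings table later in this article):

```python
# Per-call cost of the flow above, at an assumed $3 per million input tokens.
PRICE_PER_TOKEN = 3.0 / 1_000_000

SYSTEM_TOKENS = 10_000   # cached prefix
QUESTION_TOKENS = 100    # processed normally on every call

# Call 1: cache write (1.25x) + question at the normal rate
call_1 = SYSTEM_TOKENS * 1.25 * PRICE_PER_TOKEN + QUESTION_TOKENS * PRICE_PER_TOKEN

# Call 2+: cache read (0.10x) + question at the normal rate
call_2 = SYSTEM_TOKENS * 0.10 * PRICE_PER_TOKEN + QUESTION_TOKENS * PRICE_PER_TOKEN

print(f"call 1: ${call_1:.4f}")  # call 1: $0.0378
print(f"call 2: ${call_2:.4f}")  # call 2: $0.0033
```

The write overhead is tiny next to the recurring savings: every cached call costs roughly a tenth of an uncached one.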

Implementation

Python

import anthropic

client = anthropic.Anthropic()

# Long system prompt with cache_control
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are an expert legal assistant specializing in employment law.

Here is the complete Employment Code that you must use as reference:

Section 1: Employment contracts are subject to common law rules...
[... 10,000 tokens of legal content ...]
Section 500: The provisions of this code are applicable...
""",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "What are an employee's rights in case of economic layoff?"}
    ]
)

# Check cache statistics
print(f"Cache write: {response.usage.cache_creation_input_tokens} tokens")
print(f"Cache read: {response.usage.cache_read_input_tokens} tokens")
print(f"Normal input: {response.usage.input_tokens} tokens")

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: `You are a technical assistant. Here is the API documentation:
      
      [... detailed documentation ...]`,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    { role: "user", content: "How do I create a REST endpoint?" },
  ],
});

console.log(`Cache write: ${response.usage.cache_creation_input_tokens}`);
console.log(`Cache read: ${response.usage.cache_read_input_tokens}`);

Caching Strategies

1. Fixed System Prompt

The most common use case: a system prompt that never changes between requests.

# The same system prompt is cached for all users
system = [{
    "type": "text",
    "text": "You are an assistant [detailed role + instructions + context]...",
    "cache_control": {"type": "ephemeral"}
}]

# User 1
client.messages.create(model=model, max_tokens=1024, system=system,
    messages=[{"role": "user", "content": "User 1's question"}])

# User 2 (benefits from cache)
client.messages.create(model=model, max_tokens=1024, system=system,
    messages=[{"role": "user", "content": "User 2's question"}])

2. Reference Documents

Include a large document once, ask many questions about it.

# Cache the document
system = [{
    "type": "text",
    "text": f"Reference document:\n\n{long_document}",
    "cache_control": {"type": "ephemeral"}
}]

# 50 questions on the same document → 49 cache hits
questions = ["Summarize chapter 3", "What risks are mentioned?", ...]
for q in questions:
    response = client.messages.create(
        model=model, max_tokens=1024, system=system,
        messages=[{"role": "user", "content": q}]
    )

3. Few-Shot Examples

Cache a set of examples reused in many requests.

system = [{
    "type": "text",
    "text": """Sentiment classification. Examples:

Text: "This product is fantastic!" → Positive
Text: "Late delivery, disappointed." → Negative
Text: "Fine, nothing special." → Neutral
[... 50 examples ...]
""",
    "cache_control": {"type": "ephemeral"}
}]

# Classify thousands of texts with cached examples
for text in texts_to_classify:
    response = client.messages.create(
        model=model, max_tokens=100, system=system,
        messages=[{"role": "user", "content": f"Classify: \"{text}\""}]
    )

4. Cache in Messages (Conversation)

You can also cache content inside messages, not just in the system prompt.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Here is a 50-page report:\n\n{report}",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": "Summarize the main conclusions."
            }
        ]
    }
]
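The cached block only produces hits if it stays byte-identical across requests. A minimal sketch of building message lists that share the same cached prefix (the report string and the questions are placeholders):

```python
# Sketch: reuse one cached report block across several questions.
# The cached block must stay byte-identical; only what follows it changes.
report = "..."  # stands in for the 50-page report

cached_block = {
    "type": "text",
    "text": f"Here is a 50-page report:\n\n{report}",
    "cache_control": {"type": "ephemeral"},
}

def messages_for(question):
    """Build a message list: the cached report first, then the new question."""
    return [{
        "role": "user",
        "content": [cached_block, {"type": "text", "text": question}],
    }]

m1 = messages_for("Summarize the main conclusions.")
m2 = messages_for("What risks does the report identify?")

# Identical prefix on both requests → the second one reads the report from cache
print(m1[0]["content"][0] == m2[0]["content"][0])  # True
```

Pass each list to `client.messages.create` as usual; only the question block after the cached prefix is reprocessed.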

TTL and Cache Management

How TTL Works

| Event | Effect on TTL |
| --- | --- |
| Cache write (first call) | TTL initialized to 5 minutes |
| Cache hit (subsequent call) | TTL reset to 5 minutes |
| No call for 5 minutes | Cache expires; the next call is a cache write |

Optimizing TTL

import time
import threading

def keep_cache_warm(client, model, system, interval=240):
    """Keep the cache active by sending periodic pings."""
    def ping():
        while True:
            client.messages.create(
                model=model,
                max_tokens=1,
                system=system,
                messages=[{"role": "user", "content": "ping"}]
            )
            time.sleep(interval)  # 4 minutes < 5-minute TTL
    
    thread = threading.Thread(target=ping, daemon=True)
    thread.start()

Savings Calculation

Savings Formula

Savings per request = (cached_tokens × normal_price) - (cached_tokens × cache_read_price)
                    = cached_tokens × normal_price × 0.90

First request surcharge = cached_tokens × normal_price × 0.25

Break-even = 1 + (surcharge / savings_per_request)
           = 1 + (0.25 / 0.90) ≈ 1.28 requests (so from the 2nd call)
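The break-even can also be checked numerically. A small sketch using the 1.25x write and 0.10x read multipliers from the formula above:

```python
# Numerically verify the break-even from the formula above
# (cache write = 1.25x, cache read = 0.10x the normal input price).

def cost_without_cache(n_requests, cached_tokens, price_per_mtok=3.0):
    """All cached-prefix tokens billed at the normal input price on every call."""
    return n_requests * cached_tokens * price_per_mtok / 1_000_000

def cost_with_cache(n_requests, cached_tokens, price_per_mtok=3.0):
    """One cache write (1.25x) followed by n-1 cache reads (0.10x)."""
    write = cached_tokens * 1.25 * price_per_mtok / 1_000_000
    reads = (n_requests - 1) * cached_tokens * 0.10 * price_per_mtok / 1_000_000
    return write + reads

# First request count at which caching becomes cheaper
n = 1
while cost_with_cache(n, 10_000) >= cost_without_cache(n, 10_000):
    n += 1
print(n)  # 2 → caching pays off from the 2nd call, as the formula predicts
```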

Savings Table

| Cached tokens | Without cache (100 req) | With cache (100 req) | Savings |
| --- | --- | --- | --- |
| 1,000 | $0.30 | $0.05 | 83% |
| 5,000 | $1.50 | $0.23 | 85% |
| 10,000 | $3.00 | $0.41 | 86% |
| 50,000 | $15.00 | $1.88 | 87% |
| 100,000 | $30.00 | $3.68 | 88% |

Prices based on Claude Sonnet at $3/M input tokens, over 100 requests (1 cache write + 99 cache reads).

Monitoring Cache

Metrics to Track

def log_cache_metrics(response):
    """Log cache metrics for monitoring."""
    usage = response.usage
    
    cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
    cache_read = getattr(usage, 'cache_read_input_tokens', 0)
    regular_input = usage.input_tokens
    
    cache_hit = cache_read > 0
    cache_ratio = cache_read / (cache_read + regular_input) if (cache_read + regular_input) > 0 else 0
    
    print(f"Cache hit: {cache_hit}")
    print(f"Cache read: {cache_read} tokens")
    print(f"Cache write: {cache_write} tokens")
    print(f"Regular input: {regular_input} tokens")
    print(f"Cache ratio: {cache_ratio:.1%}")
    
    return {
        "cache_hit": cache_hit,
        "cache_read_tokens": cache_read,
        "cache_write_tokens": cache_write,
        "regular_input_tokens": regular_input,
        "cache_ratio": cache_ratio
    }

Cache Dashboard

| Metric | Target | Alert if |
| --- | --- | --- |
| Cache hit rate | > 90% | < 70% |
| Cache ratio (cached tokens / total) | > 80% | < 50% |
| Cache writes / hour | Stable | Sudden spikes (cache expired) |
| Monthly savings | Predictable | Unexpected decrease |
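The dicts returned by log_cache_metrics above can feed such a dashboard. A minimal aggregation sketch, with alert thresholds taken from the table (the summarize helper is hypothetical, not part of the SDK):

```python
# Aggregate per-request cache metrics and flag the table's alert conditions.

def summarize(metrics, hit_rate_alert=0.70, ratio_alert=0.50):
    """metrics: list of dicts as returned by log_cache_metrics."""
    hits = sum(1 for m in metrics if m["cache_hit"])
    hit_rate = hits / len(metrics)
    avg_ratio = sum(m["cache_ratio"] for m in metrics) / len(metrics)
    alerts = []
    if hit_rate < hit_rate_alert:
        alerts.append("low hit rate")
    if avg_ratio < ratio_alert:
        alerts.append("low cache ratio")
    return {"hit_rate": hit_rate, "avg_cache_ratio": avg_ratio, "alerts": alerts}

sample = [
    {"cache_hit": True, "cache_ratio": 0.95},
    {"cache_hit": True, "cache_ratio": 0.93},
    {"cache_hit": False, "cache_ratio": 0.0},  # first call = cache write, no hit
]
summary = summarize(sample)
print(summary["alerts"])  # ['low hit rate'] — hit rate ≈ 67%, below the 70% threshold
```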

Constraints and Limits

| Constraint | Detail |
| --- | --- |
| Minimum tokens | Content to cache must contain at least 1,024 tokens (2,048 for Haiku models) |
| TTL | 5 minutes by default; each cache hit resets it |
| Block order | Cached content must form the beginning of the prompt (prefix) |
| Number of breakpoints | Maximum 4 cache breakpoints per request |
| Compatibility | Works with streaming, tool use, and images |

Common Errors

| Error | Cause | Solution |
| --- | --- | --- |
| No cache hit | Content differs slightly between calls | Cached content must be identical, byte for byte |
| Cache expires too quickly | No calls for 5+ minutes | Implement a warmup ping |
| Unexpected surcharge | Too many cache writes vs. reads | Verify that the cached content is stable |
| Minimum not met | Content < 1,024 tokens | Add context or combine content into one block |
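For the "no cache hit" case, it helps to fingerprint the cacheable prefix before each call: any difference, even a trailing space, yields a different hash and therefore a cache miss. A small debugging sketch (cache_key is a hypothetical helper, not part of the SDK):

```python
import hashlib
import json

def cache_key(system_blocks):
    """Fingerprint the cacheable prefix; any change in it means a cache miss."""
    payload = json.dumps(system_blocks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

system_a = [{"type": "text", "text": "You are an assistant.",
             "cache_control": {"type": "ephemeral"}}]
system_b = [{"type": "text", "text": "You are an assistant. ",  # trailing space
             "cache_control": {"type": "ephemeral"}}]

print(cache_key(system_a) == cache_key(system_b))  # False: one character breaks the hit
```

Logging this key alongside the cache metrics makes it easy to spot which deploy or template change started invalidating the cache.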


FAQ

What is Claude prompt caching?

Prompt caching lets you cache portions of your prompt (system prompt, documents, examples) so they aren't reprocessed on every request. Cached tokens cost 90% less and reduce latency by 85%.

How does prompt caching work?

Add a cache_control marker on a content block. On the first call, the content is cached (cache write). Subsequent calls with the same content use the cache (cache hit) at reduced cost.

How long does the cache last?

The default TTL (Time-To-Live) is 5 minutes. Each cache hit resets the TTL. The cache expires automatically after 5 minutes without use.

When should I use prompt caching?

Use it when you send the same large content in multiple requests: long system prompts, reference documents, few-shot examples, fixed conversation context.

Is there an extra cost for cache writes?

Yes, cache writes cost 25% more than normal input tokens. But cache reads cost 90% less. Break-even is reached on the 2nd call with a cache hit.