Contextual Retrieval: The Advanced RAG Technique That Reduces Errors by 67%
By Learnia AI Research Team
Classic RAG has a fundamental problem: when you split a document into chunks and encode them individually, each chunk loses the document's context. The sentence "Revenue grew by 3% over the previous quarter" tells you nothing about which company, which period, or which type of revenue. Contextual Retrieval solves this by injecting the missing context into each chunk before encoding — reducing retrieval failures by 49%, and by 67% when combined with reranking.
The Problem with Classic RAG
To understand why Contextual Retrieval is necessary, let's review how standard RAG works. If you're new to RAG, start with our RAG fundamentals guide.
The classic RAG pipeline:
- Break the knowledge base into chunks (typically a few hundred tokens each)
- Encode chunks into vectors (embeddings)
- Store them in a vector database
- At query time: find similar chunks → add them to the prompt
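To make the chunking step concrete, here is a minimal sketch of fixed-size splitting with overlap between neighboring chunks. `chunk_document` is a hypothetical helper (not a library function), and whitespace-separated words stand in for real tokenizer tokens:

```python
def chunk_document(text: str, max_tokens: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap between neighbors."""
    words = text.split()          # whitespace words as a stand-in for tokens
    step = max_tokens - overlap   # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary from being lost to both chunks, at the cost of some duplicated tokens in the index.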
The problem: Each chunk is encoded in isolation. A chunk that says "Revenue grew by 3%" loses information about which company, which document, which period.
BM25: The Essential Complement to Embeddings
Before diving into Contextual Retrieval, we need to understand BM25 (Best Matching 25), a text ranking algorithm based on exact lexical matching. Where embeddings capture semantic similarity (synonyms, related concepts), BM25 rewards exact term overlap, weighted so that rare terms count more — which makes it excellent for error codes, technical identifiers, and proper nouns that embeddings tend to blur.
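As a rough illustration, here is a simplified from-scratch BM25 scorer (one common variant of the Okapi BM25 formula; a production system would use a tuned library implementation rather than this sketch):

```python
import math

def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against the query with a simple BM25 variant."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: in how many docs each query term appears
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        score = 0.0
        for t in query_terms:
            tf = doc.count(t)
            if tf == 0:
                continue
            # rare terms get a higher idf weight
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A query containing a rare exact token like an error code will score highly only against documents that literally contain it — exactly the behavior embeddings lack.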
The key: Combining embeddings AND BM25 via rank fusion gives better results than either alone.
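One common way to do this rank fusion is Reciprocal Rank Fusion (RRF), which merges the two ranked lists without needing to calibrate their raw scores against each other. A minimal sketch, assuming each retriever returns an ordered list of chunk IDs (the IDs here are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so items ranked highly by both retrievers win.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

embedding_ranking = ["c3", "c1", "c7", "c2"]  # from the vector search
bm25_ranking = ["c1", "c3", "c5", "c2"]       # from the lexical search
fused = reciprocal_rank_fusion([embedding_ranking, bm25_ranking])
```

The constant `k` dampens the influence of top ranks; 60 is a conventional default, not a tuned value.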
The Solution: Contextual Retrieval
The idea is simple but powerful: add explanatory context to each chunk BEFORE encoding it.
Before vs After
# ❌ Classic chunk (no context)
"The company's revenue grew by 3% over the previous quarter."
# ✅ Contextualized chunk
"This chunk is from an SEC filing on ACME Corp's performance
in Q2 2023; the previous quarter's revenue was $314 million.
The company's revenue grew by 3% over the previous quarter."
The Contextualization Prompt
For each chunk, Claude generates 50-100 tokens of context:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk
within the overall document for the purposes of improving
search retrieval of the chunk. Answer only with the succinct
context and nothing else.
Result: The generated context is prepended to the chunk, then everything is encoded (both embedding AND BM25).
Implementation with Prompt Caching
The cost of contextualizing each chunk with the full document could be prohibitive. The solution: prompt caching.
How it works:
- Load the full document into the cache (one-time cost)
- For each chunk, send only the chunk + contextualization prompt
- The cached document is reused automatically
Cost: ~$1.02 per million document tokens (one-time cost during indexing).
To learn more about prompt caching, see our Claude prompt caching guide.
```python
import anthropic

client = anthropic.Anthropic()

def contextualize_chunks(document: str, chunks: list[str]) -> list[str]:
    """Add context to each chunk via Claude with prompt caching."""
    contextualized = []
    for chunk in chunks:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": [
                    {
                        # the full document is cached once and reused
                        # across all chunk calls
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"}
                    },
                    {
                        "type": "text",
                        "text": (
                            f"<chunk>\n{chunk}\n</chunk>\n\n"
                            "Give a short succinct context to situate "
                            "this chunk within the overall document for "
                            "improving search retrieval. "
                            "Answer only with the succinct context."
                        )
                    }
                ]
            }]
        )
        context = response.content[0].text
        # prepend the generated context before indexing
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized
```
Complete Pipeline: Contextual Embeddings + Contextual BM25 + Reranking
The optimal pipeline combines all three techniques: contextualize each chunk, index it with both embeddings and BM25, fuse the two rankings, then rerank the candidates before generation.
Performance Results
Contextual embeddings plus contextual BM25 reduce the retrieval failure rate by 49%; adding a reranking step on top brings the reduction to 67%.
Reranking Explained
Reranking is a post-processing step that re-evaluates chunk relevance:
- Initial retrieval: Top 150 chunks (embeddings + BM25)
- Reranking: A specialized model scores each chunk against the query
- Selection: Top 20 most relevant chunks
- Generation: The LLM uses these 20 chunks to respond
Why top-20 rather than top-10 or top-5? Benchmarks consistently show that top-20 gives better results than smaller windows, because the LLM has more context to synthesize its response.
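The rerank-then-select stage can be sketched as follows. The word-overlap scorer here is a toy stand-in for a real cross-encoder reranker (such as Cohere Rerank or a BGE reranker model), and `rerank` / `overlap_score` are illustrative names, not library APIs:

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 20) -> list[str]:
    """Re-score candidate chunks against the query and keep the best top_k."""
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_k]

def overlap_score(query: str, chunk: str) -> float:
    # toy relevance signal: count of shared lowercase words
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = [
    "revenue grew 3% in Q2 2023",
    "the weather was nice",
    "ACME revenue Q2 results",
]
top = rerank("ACME revenue Q2", chunks, overlap_score, top_k=2)
```

In practice the 150 candidates come from the fused embedding + BM25 retrieval, and the reranker's scores, not word overlap, decide the final top-20.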
Key Findings and Recommendations
- Embeddings + BM25 > Embeddings alone — Always combine both via rank fusion
- Top-20 chunks > Top-10 or Top-5 — More context = better responses
- Adding context to chunks significantly improves accuracy — This is the core of Contextual Retrieval
- Reranking > No reranking — The final step that maximizes precision
- All gains stack — Contextual Embeddings + Contextual BM25 + Reranking + Top-20
Comparison with Advanced RAG: Lost in the Middle
Contextual Retrieval and the "Lost in the Middle" problem are two different facets of RAG optimization:
- Contextual Retrieval addresses context loss during encoding (the retrieval side)
- Lost in the Middle addresses attention loss in the middle of long contexts (the generation side)
For a complete analysis of the Lost in the Middle phenomenon, see our advanced RAG guide.
Continue Your Learning
- RAG Fundamentals and Context Engineering — The basics before Contextual Retrieval
- Advanced RAG: Lost in the Middle — The other RAG challenge
- Claude Prompt Caching — Essential for reducing contextualization costs
- Context Engineering: The 4 Pillars — Holistic view of context management
- The 5 Agent Architecture Patterns — Integrating RAG into agentic systems
FAQ
What is Contextual Retrieval?
Contextual Retrieval adds explanatory context to each document chunk before embedding. Instead of encoding 'Revenue grew by 3%', you encode 'This chunk is from ACME Corp's Q2 2023 SEC filing. Revenue grew by 3%'. This reduces retrieval failures by 49% (67% with reranking).
What's the difference between BM25 and vector embeddings?
Vector embeddings capture semantic meaning (synonyms, similar concepts). BM25 excels at exact lexical matching (error codes, technical identifiers, proper nouns). Combining both via rank fusion gives the best results.
How much does implementing Contextual Retrieval cost?
About $1.02 per million document tokens thanks to prompt caching. The document is loaded into cache once, then each chunk is contextualized with a minimal call. It's a one-time cost during indexing.
Should I use reranking on top of Contextual Retrieval?
Yes, reranking provides significant additional gains: from 49% error reduction (without reranking) to 67% (with reranking). Reranking filters the top-150 chunks down to the 20 most relevant ones.