Contextual Retrieval: The Advanced RAG Technique That Reduces Errors by 67%
By Learnia AI Research Team
Classic RAG has a fundamental problem: when you split a document into chunks and encode them individually, each chunk loses the document's context. The sentence "Revenue grew by 3% over the previous quarter" tells you nothing about which company, which period, or which type of revenue. Contextual Retrieval solves this by injecting the missing context into each chunk before encoding — reducing retrieval failures by 49%, and by 67% when combined with reranking.
The Problem with Classic RAG
To understand why Contextual Retrieval is necessary, let's review how standard RAG works. If you're new to RAG, start with our RAG fundamentals guide.
The classic RAG pipeline:
- Break the knowledge base into chunks (typically a few hundred tokens each)
- Encode chunks into vectors (embeddings)
- Store them in a vector database
- At query time: find similar chunks → add them to the prompt
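To make the chunking step concrete, here is a minimal sketch of fixed-size splitting with overlap between neighboring chunks. `chunk_document` is a hypothetical helper (not a library function), and whitespace-separated words stand in for real tokenizer tokens:

```python
def chunk_document(text: str, max_tokens: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap between neighbors."""
    words = text.split()          # whitespace words as a stand-in for tokens
    step = max_tokens - overlap   # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary from being lost to both chunks, at the cost of some duplicated tokens in the index.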
The problem: Each chunk is encoded in isolation. A chunk that says "Revenue grew by 3%" loses information about which company, which document, which period.
BM25: The Essential Complement to Embeddings
Before diving into Contextual Retrieval, we need to understand BM25 (Best Matching 25), a text ranking algorithm based on exact lexical matching. Where embeddings capture semantic similarity (synonyms, related concepts), BM25 rewards exact term overlap, weighted so that rare terms count more — which makes it excellent for error codes, technical identifiers, and proper nouns that embeddings tend to blur.
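As a rough illustration, here is a simplified from-scratch BM25 scorer (one common variant of the Okapi BM25 formula; a production system would use a tuned library implementation rather than this sketch):

```python
import math

def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against the query with a simple BM25 variant."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: in how many docs each query term appears
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        score = 0.0
        for t in query_terms:
            tf = doc.count(t)
            if tf == 0:
                continue
            # rare terms get a higher idf weight
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A query containing a rare exact token like an error code will score highly only against documents that literally contain it — exactly the behavior embeddings lack.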
The key: Combining embeddings AND BM25 via rank fusion gives better results than either alone.
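One common way to do this rank fusion is Reciprocal Rank Fusion (RRF), which merges the two ranked lists without needing to calibrate their raw scores against each other. A minimal sketch, assuming each retriever returns an ordered list of chunk IDs (the IDs here are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so items ranked highly by both retrievers win.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

embedding_ranking = ["c3", "c1", "c7", "c2"]  # from the vector search
bm25_ranking = ["c1", "c3", "c5", "c2"]       # from the lexical search
fused = reciprocal_rank_fusion([embedding_ranking, bm25_ranking])
```

The constant `k` dampens the influence of top ranks; 60 is a conventional default, not a tuned value.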
The Solution: Contextual Retrieval
The idea is simple but powerful: add explanatory context to each chunk BEFORE encoding it.
Before vs After
# ❌ Classic chunk (no context)
"The company's revenue grew by 3% over the previous quarter."
# ✅ Contextualized chunk
"This chunk is from an SEC filing on ACME Corp's performance
in Q2 2023; the previous quarter's revenue was $314 million.
The company's revenue grew by 3% over the previous quarter."
The Contextualization Prompt
For each chunk, Claude generates 50-100 tokens of context:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk
within the overall document for the purposes of improving
search retrieval of the chunk. Answer only with the succinct
context and nothing else.
Result: The generated context is prepended to the chunk, then everything is encoded (both embedding AND BM25).
Implementation with Prompt Caching
The cost of contextualizing each chunk with the full document could be prohibitive. The solution: prompt caching.
How it works:
- Load the full document into the cache (one-time cost)
- For each chunk, send only the chunk + contextualization prompt
- The cached document is reused automatically
Cost: ~$1.02 per million document tokens (one-time cost during indexing).
To learn more about prompt caching, see our Claude prompt caching guide.
```python
import anthropic

client = anthropic.Anthropic()

def contextualize_chunks(document: str, chunks: list[str]) -> list[str]:
    """Add context to each chunk via Claude with prompt caching."""
    contextualized = []
    for chunk in chunks:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": [
                    {
                        # the full document is cached once and reused
                        # across all chunk calls
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"}
                    },
                    {
                        "type": "text",
                        "text": (
                            f"<chunk>\n{chunk}\n</chunk>\n\n"
                            "Give a short succinct context to situate "
                            "this chunk within the overall document for "
                            "improving search retrieval. "
                            "Answer only with the succinct context."
                        )
                    }
                ]
            }]
        )
        context = response.content[0].text
        # prepend the generated context before indexing
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized
```
Complete Pipeline: Contextual Embeddings + Contextual BM25 + Reranking
The optimal pipeline combines all three techniques: contextualize each chunk, index it with both embeddings and BM25, fuse the two rankings, then rerank the candidates before generation.
Performance Results
Contextual embeddings plus contextual BM25 reduce the retrieval failure rate by 49%; adding a reranking step on top brings the reduction to 67%.
Reranking Explained
Reranking is a post-processing step that re-evaluates chunk relevance:
- Initial retrieval: Top 150 chunks (embeddings + BM25)
- Reranking: A specialized model scores each chunk against the query
- Selection: Top 20 most relevant chunks
- Generation: The LLM uses these 20 chunks to respond
Why top-20 rather than top-10 or top-5? Benchmarks consistently show that top-20 gives better results than smaller windows, because the LLM has more context to synthesize its response.
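The rerank-then-select stage can be sketched as follows. The word-overlap scorer here is a toy stand-in for a real cross-encoder reranker (such as Cohere Rerank or a BGE reranker model), and `rerank` / `overlap_score` are illustrative names, not library APIs:

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 20) -> list[str]:
    """Re-score candidate chunks against the query and keep the best top_k."""
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_k]

def overlap_score(query: str, chunk: str) -> float:
    # toy relevance signal: count of shared lowercase words
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = [
    "revenue grew 3% in Q2 2023",
    "the weather was nice",
    "ACME revenue Q2 results",
]
top = rerank("ACME revenue Q2", chunks, overlap_score, top_k=2)
```

In practice the 150 candidates come from the fused embedding + BM25 retrieval, and the reranker's scores, not word overlap, decide the final top-20.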
Key Findings and Recommendations
- Embeddings + BM25 > Embeddings alone — Always combine both via rank fusion
- Top-20 chunks > Top-10 or Top-5 — More context = better responses
- Adding context to chunks significantly improves accuracy — This is the core of Contextual Retrieval
- Reranking > No reranking — The final step that maximizes precision
- All gains stack — Contextual Embeddings + Contextual BM25 + Reranking + Top-20
Comparison with Advanced RAG: Lost in the Middle
Contextual Retrieval and the "Lost in the Middle" problem are two different facets of RAG optimization:
- Contextual Retrieval addresses context loss during encoding (the retrieval side)
- Lost in the Middle addresses attention loss in the middle of long contexts (the generation side)
For a complete analysis of the Lost in the Middle phenomenon, see our advanced RAG guide.
Continue Your Learning
- RAG Fundamentals and Context Engineering — The basics before Contextual Retrieval
- Advanced RAG: Lost in the Middle — The other RAG challenge
- Claude Prompt Caching — Essential for reducing contextualization costs
- Context Engineering: The 4 Pillars — Holistic view of context management
- The 5 Agent Architecture Patterns — Integrating RAG into agentic systems
FAQ
What is Contextual Retrieval?
Contextual Retrieval adds explanatory context to each document chunk before embedding. Instead of encoding 'Revenue grew by 3%', you encode 'This chunk is from ACME Corp's Q2 2023 SEC filing. Revenue grew by 3%'. This reduces retrieval failures by 49% (67% with reranking).
What's the difference between BM25 and vector embeddings?
Vector embeddings capture semantic meaning (synonyms, similar concepts). BM25 excels at exact lexical matching (error codes, technical identifiers, proper nouns). Combining both via rank fusion gives the best results.
How much does implementing Contextual Retrieval cost?
About $1.02 per million document tokens thanks to prompt caching. The document is loaded into cache once, then each chunk is contextualized with a minimal call. It's a one-time cost during indexing.
Should I use reranking on top of Contextual Retrieval?
Yes, reranking provides significant additional gains: from 49% error reduction (without reranking) to 67% (with reranking). Reranking filters the top-150 chunks down to the 20 most relevant ones.