Long Context Windows: Working with Million-Token AI Models
By Learnia Team
The context window—how much information an AI model can process at once—has expanded dramatically. What started at 2,000 tokens has grown to 1-2 million tokens, enough to process entire codebases, book series, or years of documents in a single prompt. This capability unlocks new applications but requires new strategies to use effectively.
This comprehensive guide explores how to work with long context windows, from understanding the technology to practical implementation patterns.
Context Window Evolution
The Journey
| Year | Model | Context Window |
|---|---|---|
| 2020 | GPT-3 | 2,048 tokens |
| 2022 | GPT-3.5 | 4,096 tokens |
| 2023 | GPT-4 | 8,192 → 128K tokens |
| 2023 | Claude 2 | 100K tokens |
| 2024 | Gemini 1.5 | 1M → 2M tokens |
| 2024 | Claude 3 | 200K tokens |
| 2025 | Multiple | 1M+ tokens standard |
What Long Context Enables
| Context Size | What Fits | Use Cases |
|---|---|---|
| 8K tokens | ~20 pages | Single document analysis |
| 128K tokens | ~300 pages | Long document, small codebase |
| 1M tokens | ~2,500 pages | Multiple books, large codebase |
| 2M tokens | ~5,000 pages | Entire repository, document collections |
Understanding Token Limits
Token Basics
Tokens are the units models process:
"Hello, world!" = 4 tokens
"artificial intelligence" = 2 tokens
"supercalifragilisticexpialidocious" = 7 tokens (broken up)
Rough estimates:
- 1 token ≈ 4 characters in English
- 1 token ≈ 0.75 words
- 1,000 tokens ≈ 750 words ≈ 1.5 pages
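Exact counts depend on the tokenizer, so the figures above are approximations. One way to check real counts is OpenAI's tiktoken library (other model families ship their own tokenizers or counting endpoints, so their numbers will differ):

# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models;
# other models use different tokenizers, so counts will vary.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, world!", "artificial intelligence"]:
    print(f"{text!r}: {len(enc.encode(text))} tokens")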
Context = Input + Output
Important: context window includes both input AND output:
If context window = 100,000 tokens
And your input = 90,000 tokens
Maximum output = 10,000 tokens
If you need 20,000 token output:
Maximum input = 80,000 tokens
Model Comparison (2026)
| Model | Context Window | Effective for |
|---|---|---|
| Gemini 2.0 Pro | 2M tokens | Largest single context |
| Claude 3.5 Sonnet | 200K tokens | Strong analysis |
| GPT-4 Turbo | 128K tokens | Broad capabilities |
| Llama 3.1 (70B) | 128K tokens | Open source |
Use Cases for Long Context
1. Codebase Analysis
Use case: Entire repository understanding
Input:
- All source files (~500K tokens)
- Documentation (~50K tokens)
- Test files (~100K tokens)
- Configuration (~10K tokens)
Query: "Identify potential security vulnerabilities
across the entire codebase, considering how
modules interact."
Advantage: Cross-file analysis without chunking
2. Document Collection Analysis
Use case: Legal discovery
Input:
- 500 contracts and legal documents
- Communication archives
- Policy documents
Query: "Find all clauses across these documents that
may conflict with GDPR requirements."
Advantage: Find patterns across entire corpus
3. Book-Length Content
Use case: Novel analysis
Input:
- Complete book text (~200K tokens)
Queries:
- "Track character development arcs"
- "Identify foreshadowing for the ending"
- "Analyze thematic progression"
Advantage: Holistic understanding
4. Multi-Document Synthesis
Use case: Research synthesis
Input:
- 50 research papers in a field
- Full text of each
Query: "Synthesize the current state of research
on [topic], identifying consensus, conflicts,
and gaps."
Advantage: Comprehensive literature view
5. Conversation History
Use case: Long-running projects
Input:
- Months of conversation history
- Related documents referenced
- Code changes made
Query: Continue working with full context of
everything discussed and decided.
Advantage: No "forgetting" previous context
Strategies for Long Context
Strategy 1: Full Context Inclusion
When to use:
- Documents must be analyzed together
- Cross-reference relationships matter
- Consistency across content is required
Implementation:
def full_context_analysis(documents, question):
    # "model" is assumed to be any long-context LLM client
    # exposing a generate(prompt) method.
    combined = "\n\n---\n\n".join(documents)
    prompt = f"""
I'm providing {len(documents)} documents for analysis.
Please analyze them holistically.

{combined}

Based on all documents, answer: {question}
"""
    return model.generate(prompt)
Strategy 2: Structured Chunking
When context exceeds limits or for efficiency:
def structured_chunking(content, question, chunk_size=50000):
    # split_into_chunks is assumed to split the text into pieces
    # of roughly chunk_size tokens each.
    chunks = split_into_chunks(content, chunk_size)

    # First pass: analyze each chunk independently
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        summary = model.generate(f"""
Analyze section {i + 1}/{len(chunks)}:

{chunk}

Provide key findings relevant to: {question}
""")
        chunk_summaries.append(summary)

    # Second pass: synthesize the per-chunk findings
    joined = "\n\n".join(chunk_summaries)
    final = model.generate(f"""
Synthesize these section analyses into a comprehensive answer:

{joined}

Question: {question}
""")
    return final
Strategy 3: Hierarchical Processing
For very large content:
Level 1: Individual documents → Key points
Level 2: Key points grouped → Theme summaries
Level 3: Theme summaries → Final synthesis
Example:
100 documents (1M tokens total)
→ 100 key point summaries (50K tokens)
→ 10 theme summaries (10K tokens)
→ Final answer (2K tokens)
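A minimal sketch of this pattern, reusing the generic model.generate() client assumed in the earlier strategies (the group size and prompt wording are illustrative):

def hierarchical_summarize(documents, question, group_size=10):
    # Level 1: reduce each document to its key points
    key_points = [
        model.generate(f"Summarize the key points of this document "
                       f"relevant to: {question}\n\n{doc}")
        for doc in documents
    ]

    # Level 2: group key points and summarize each group into a theme
    theme_summaries = []
    for i in range(0, len(key_points), group_size):
        group = "\n\n".join(key_points[i:i + group_size])
        theme_summaries.append(
            model.generate(f"Summarize the common themes in these notes:\n\n{group}")
        )

    # Level 3: synthesize the theme summaries into a final answer
    return model.generate(
        f"Question: {question}\n\nTheme summaries:\n\n"
        + "\n\n".join(theme_summaries)
        + "\n\nProvide a final, synthesized answer."
    )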
Strategy 4: Retrieval-Augmented (RAG)
Combine retrieval with long context:
def rag_with_long_context(query, document_store):
    # Retrieve the most relevant chunks; with a long context window
    # we can afford a generous top_k.
    relevant = document_store.search(query, top_k=50)

    # Include all retrieved content (it still fits in a long context).
    # format_chunks is assumed to render chunks with their source labels.
    prompt = f"""
Question: {query}

Relevant information from our documents:
{format_chunks(relevant)}

Based on this information, provide a comprehensive answer.
"""
    return model.generate(prompt)
Best Practices
1. Structure Your Input
Good organization:
# CONTEXT OVERVIEW
You have access to [description of content]
# DOCUMENT 1: [Title]
[Content of document 1]
# DOCUMENT 2: [Title]
[Content of document 2]
# YOUR TASK
[Clear question or instruction]
# EXPECTED OUTPUT FORMAT
[Description of desired format]
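A small helper can assemble this structure automatically; the function and parameter names below are illustrative, not a fixed API:

def build_structured_prompt(documents, task, output_format):
    # documents: list of (title, text) pairs
    parts = ["# CONTEXT OVERVIEW",
             f"You have access to {len(documents)} documents."]
    for i, (title, text) in enumerate(documents, start=1):
        parts.append(f"\n# DOCUMENT {i}: {title}\n{text}")
    parts.append(f"\n# YOUR TASK\n{task}")
    parts.append(f"\n# EXPECTED OUTPUT FORMAT\n{output_format}")
    return "\n".join(parts)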
2. Be Explicit About Relevance
"Focus particularly on sections discussing [topic].
Other content is provided for context but may be
less relevant to this specific question."
3. Request Citations
"When you reference information, cite the specific
document and section, e.g., [Document 3, Section 2.1]"
4. Handle Position Bias
Models tend to attend more to the beginning and end of a long context than to the middle:
Strategies:
- Put most important context first
- Repeat key information
- Explicitly reference middle sections in prompts
- Consider shuffling order across queries
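One way to apply the first two strategies is to state the task before the documents and repeat it after them, so the instruction sits at both high-attention positions. This is a sketch, not a guaranteed fix:

def sandwich_prompt(documents, task):
    # State the task up front, then again after the long middle section
    body = "\n\n---\n\n".join(documents)
    return (f"TASK (read first): {task}\n\n"
            f"{body}\n\n"
            f"REMINDER: {task}\n"
            f"Be sure to draw on material from the middle documents as well.")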
5. Monitor Token Usage
def estimate_tokens(text):
    # Rough estimate: ~4 characters per token for English text
    return len(text) // 4

def check_capacity(content, model_limit, output_reserve=4000):
    # Reserve room for the model's response, then check the input fits
    content_tokens = estimate_tokens(content)
    available = model_limit - output_reserve
    if content_tokens > available:
        print(f"Warning: {content_tokens} tokens exceeds "
              f"available {available} tokens")
        return False
    return True
Performance Considerations
Latency
| Context Size | Typical Latency |
|---|---|
| 10K tokens | 2-5 seconds |
| 100K tokens | 10-30 seconds |
| 1M tokens | 60-180 seconds |
Cost
Most models charge per token:
Example pricing (hypothetical):
Input: $0.01 per 1K tokens
Output: $0.03 per 1K tokens
For 500K token input + 2K token output:
Cost = (500 × $0.01) + (2 × $0.03) = $5.06 per query
Full codebase analysis might cost $5-20 per query
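Wrapping that arithmetic in a small helper makes it easy to budget before sending a request; the default prices below are the hypothetical ones from the example above, not real rates:

def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1k=0.01, output_price_per_1k=0.03):
    # Hypothetical prices: $0.01 per 1K input tokens, $0.03 per 1K output tokens
    return (input_tokens / 1000) * input_price_per_1k + \
           (output_tokens / 1000) * output_price_per_1k

print(round(estimate_cost(500_000, 2_000), 2))  # 5.06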
Accuracy
Research shows:
- Performance is generally strong across the entire context
- Some degradation when retrieving very specific details from the middle
- Explicit references to sections improve accuracy
- Structured formatting improves performance
Comparison with RAG
| Aspect | Long Context | RAG |
|---|---|---|
| Setup complexity | Low | High |
| Token efficiency | Lower | Higher |
| Retrieval accuracy | N/A (all included) | Depends on retrieval |
| Cross-document reasoning | Strong | Limited |
| Cost per query | Higher | Lower |
| Latency | Higher | Lower |
| Update flexibility | Re-process all | Update index |
When to use Long Context:
- Need cross-document reasoning
- Content fits within limits
- Setup simplicity valued
- Query frequency is low
When to use RAG:
- Content vastly exceeds limits
- Fast response needed
- Many queries expected
- Frequent content updates
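These criteria can be expressed as a rough heuristic; treat the thresholds below as illustrative starting points rather than fixed rules:

def choose_approach(corpus_tokens, context_limit,
                    queries_per_day, updates_per_day):
    # Illustrative decision heuristic based on the criteria above
    if corpus_tokens > context_limit:
        return "RAG"            # content doesn't fit in one context
    if updates_per_day > 1:
        return "RAG"            # frequent updates favor updating an index
    if queries_per_day > 100:
        return "RAG"            # high query volume amortizes indexing cost
    return "long context"       # fits, low volume, simplest setup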
Practical Examples
Example 1: Code Review
# Load entire codebase
codebase = load_repository("./my-project")
all_code = format_codebase(codebase) # ~200K tokens
prompt = f"""
# CODEBASE FOR REVIEW
{all_code}
# REVIEW REQUEST
Perform a comprehensive code review focusing on:
1. Security vulnerabilities
2. Performance issues
3. Code organization problems
4. Missing error handling
For each issue, provide:
- File and line reference
- Description of issue
- Recommended fix
"""
response = long_context_model.generate(prompt)
Example 2: Meeting History Analysis
# Load all meeting notes
meetings = load_meetings_year(2025) # 100 meetings
all_notes = format_meetings(meetings) # ~150K tokens
prompt = f"""
# MEETING NOTES: All 2025 Meetings
{all_notes}
# ANALYSIS REQUEST
1. What are the recurring themes discussed?
2. What decisions were made and when?
3. What action items remain unresolved?
4. What topics have evolved over time?
"""
response = long_context_model.generate(prompt)
Key Takeaways
- Context windows have reached 1-2 million tokens, enabling analysis of entire codebases or document collections
- Context includes both input and output; reserve tokens for the response
- Full context beats chunking for cross-document reasoning when the content fits
- Structure your input with clear organization and explicit instructions
- Consider performance tradeoffs: latency, cost, and accuracy
- Choose between long context and RAG based on use case requirements
- Position bias exists; structure content and prompts to mitigate it
Master AI Fundamentals
Understanding context windows is fundamental to working effectively with modern AI. The right approach depends on your specific use case, content, and requirements.
In our Module 0 — AI Fundamentals, you'll learn:
- How language models process information
- Token economics and optimization
- When to use different approaches
- Model selection criteria
- Practical prompt engineering
- Staying current with AI evolution
These fundamentals help you make better decisions about AI usage.
Module 0 — Prompting Fundamentals
Build your first effective prompts from scratch with hands-on exercises.