LLM Routing: Choosing the Right Model for Each Task
By Dorian Laurenceau
Last reviewed: April 24, 2026. Updated with April 2026 findings and community feedback.
Should every question go to GPT-4? That's like using a sports car to fetch groceries: overkill for simple tasks and needlessly expensive. LLM routing matches each query to the right model, optimizing cost and speed without sacrificing quality.
LLM routing: what production teams actually do vs the blog-post version
LLM routing has become one of the most-discussed cost-optimisation techniques in production LLM applications. The threads on r/LangChain, r/LocalLLaMA, and r/MachineLearning reflect a gap between the concept's simplicity and the production reality.
What routing actually looks like in production:
- Tiered cascades. Cheapest model first; escalate to more capable models if the cheap model flags uncertainty or the output fails validation. Deployed at scale by Anyscale's RouteLLM research and similar projects.
- Task-based routing. Code queries to Claude/GPT-5 Codex; writing queries to Claude Opus; retrieval-heavy queries to Gemini; multimodal to GPT-5 or Gemini. The "best model" depends on the task, not popularity.
- Latency-sensitive routing. Interactive chat gets fast/cheap models (Haiku, Gemini Flash, GPT-4o mini); background jobs get the slower, more capable ones.
- Reliability routing. Production systems route to a backup provider when the primary is slow or erroring. Single-vendor dependency is operational risk.
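Reliability routing reduces to a failover loop: try providers in order and accept the first healthy response. A minimal sketch, where the provider callables are stand-ins for real SDK clients (nothing here is a real API):

```python
def call_with_failover(prompt, providers, is_healthy=lambda resp: resp is not None):
    """Try each (name, call) pair in order; fall back on errors or
    unhealthy responses. The callables are stand-ins for real SDK calls."""
    errors = {}
    for name, call in providers:
        try:
            resp = call(prompt)
            if is_healthy(resp):
                return name, resp
            errors[name] = "unhealthy response"
        except Exception as exc:
            errors[name] = repr(exc)
    raise RuntimeError(f"all providers failed: {errors}")

# Dummy providers: the primary times out, the backup answers.
def flaky_primary(prompt):
    raise TimeoutError("primary timed out")

def backup(prompt):
    return f"answer to: {prompt}"

name, resp = call_with_failover("hello", [("primary", flaky_primary), ("backup", backup)])
```

In production you would also add timeouts and a health-check window so a slow primary is skipped proactively rather than waited on every request.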
What makes routing hard:
- Classification cost. The router itself has to decide quickly which model to use. A 200 ms classifier call in front of a 500 ms generation is a 40% latency penalty with no quality gain.
- Drift. The "cheap model is good enough" assumption changes as models update. Routing logic needs periodic re-evaluation.
- Consistent quality evaluation. You need evaluation that works across models, or you can't know whether cheap-tier responses are acceptable. Braintrust, LangSmith, and similar tools emerged to address this.
- Vendor lock-in in subtle forms. Different models have different JSON-mode quirks, function-calling formats, and prompt styles. Supporting three models well is three times the prompt-engineering work of one.
What's genuinely worth doing:
- Start with two tiers. A cheap model for the 80% of easy cases, a capable model for the 20% of hard ones. Two-tier routing captures most of the value with a fraction of the complexity.
- Measure before you optimise. If your LLM bill is under a certain threshold, routing complexity costs more than it saves.
- Use providers with similar APIs. OpenAI, Anthropic, and Google have converged on similar function-calling and structured-output patterns. Starting from that compatibility makes routing cheaper to implement.
- Cache, then route. Prompt caching often beats routing for cost reduction on workloads with reused prefixes.
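The "cache, then route" idea can be sketched with a client-side exact-match cache in front of the router. Note this is not the same thing as provider-side prompt caching (which works on shared prefixes and is configured with the provider); `call_model` and the model names below are placeholders, not a real SDK:

```python
from functools import lru_cache

def call_model(model, prompt):
    # Placeholder for a real SDK call.
    return f"[{model}] response to: {prompt}"

@lru_cache(maxsize=1024)
def cached_answer(prompt):
    # Exact-match cache in front of the router: a repeated prompt
    # never reaches a model at all.
    model = "cheap-model" if len(prompt.split()) < 20 else "big-model"
    return call_model(model, prompt)

cached_answer("What are your hours?")  # miss: routes and calls the model
cached_answer("What are your hours?")  # hit: served from the cache, no API call
```

For FAQ-style traffic, a cache hit costs nothing and returns instantly, which no routing decision can match.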
The honest framing: LLM routing is useful but usually over-applied. For most applications at most scales, picking one good model and using it well beats an elaborate routing system. Reach for routing when you have evidence it will pay off, not because a conference talk made it sound clever.
What Is LLM Routing?
LLM routing is the practice of directing different queries to different AI models based on task requirements.
The Basic Concept
User query → Router → Appropriate model
"What's 2+2?" → Fast, cheap model (GPT-3.5)
"Analyze this legal contract" → Powerful model (GPT-4)
"Generate a poem" → Creative model (Claude)
Why Routing Matters
The Cost Reality
| Model | Input Cost (per 1M tokens) | Quality |
|---|---|---|
| GPT-3.5 Turbo | $0.50 | Good |
| GPT-4 Turbo | $10.00 | Excellent |
| GPT-4o | $2.50 | Very Good |
| Claude 3 Haiku | $0.25 | Good |
| Claude 3 Opus | $15.00 | Excellent |
A 20-60× price difference between models.
The Math
Without routing:
1000 queries/day × GPT-4 ($0.01/query) = $10/day = $300/month
With routing (70% simple, 30% complex):
700 queries × GPT-3.5 ($0.0005) = $0.35
300 queries × GPT-4 ($0.01) = $3.00
Total: $3.35/day ≈ $100/month
Savings: 67% cost reduction
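The arithmetic above is easy to check in code, using the article's per-query prices and a 30-day month:

```python
def monthly_cost(queries_per_day, cost_per_query, days=30):
    return queries_per_day * cost_per_query * days

baseline = monthly_cost(1000, 0.01)                           # everything on GPT-4: $300/month
routed = monthly_cost(700, 0.0005) + monthly_cost(300, 0.01)  # 70/30 split: ~$100/month
savings = 1 - routed / baseline                               # ~0.665, the ~67% above
```

The savings fraction depends only on the price ratio and the simple/complex split, so you can rerun it with your own traffic mix before committing to a router.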
Routing Strategies
1. Task-Based Routing
Route based on what the user is asking:
Classification/extraction → Small model
Creative writing → Medium model
Complex reasoning → Large model
Code generation → Specialized model
2. Complexity-Based Routing
Estimate query difficulty:
Simple: "What's the weather?"
→ Fast model
Medium: "Summarize this article"
→ Balanced model
Complex: "Compare these three legal arguments"
→ Powerful model
3. Cascade Routing
Try smaller model first, escalate if needed:
Step 1: Send to GPT-3.5
Step 2: Check confidence/quality
Step 3: If uncertain → re-send to GPT-4
4. Intent-Based Routing
Classify intent, then route:
Intent: customer_support → Support-tuned model
Intent: code_help → Code-specialized model
Intent: creative → Creative model
Intent: analysis → Reasoning model
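Intent-based routing is often just a lookup table behind an intent classifier. A minimal sketch; the model names are placeholders, and in production the intent label would come from a small classifier model rather than being passed in directly:

```python
INTENT_MODELS = {
    "customer_support": "support-tuned-model",
    "code_help": "code-model",
    "creative": "creative-model",
    "analysis": "reasoning-model",
}

def route_by_intent(intent, default="general-model"):
    # Unknown intents fall through to a safe general-purpose default.
    return INTENT_MODELS.get(intent, default)

route_by_intent("code_help")   # routes to the code-specialized model
route_by_intent("smalltalk")   # unknown intent: falls back to the default
```

Keeping the mapping in data rather than code makes it easy to re-point an intent at a new model without touching routing logic.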
What Makes a Query "Complex"?
Signals of Complexity
- Multi-step reasoning required
- Domain expertise needed
- Long context to process
- Nuanced judgment required
- High-stakes outcome
Signals of Simplicity
- Single fact lookup
- Simple format conversion
- Short, clear instruction
- Low-stakes outcome
- Well-defined output
Real-World Routing Examples
Customer Support Bot
"What are your hours?"
→ Route to FAQ lookup + small model
Cost: $0.0001 | Latency: 200ms
"I'm having a complex billing dispute about..."
→ Route to support-specialized model + human escalation flag
Cost: $0.005 | Latency: 1s
Code Assistant
"Add a comment to this line"
→ Small, fast model
Cost: $0.0002
"Refactor this 500-line function for performance"
→ Large model with long context
Cost: $0.02
Research Assistant
"When was the Eiffel Tower built?"
→ Small model (factual recall)
"Compare the economic impacts of three trade policies"
→ Large model (analysis + reasoning)
Cascade Pattern Deep Dive
The cascade approach is particularly powerful:
Step 1: User Query arrives
Step 2: Tier 1 - Small model (GPT-3.5/Haiku)
- Confident? → Return answer
- Not confident? → Escalate
Step 3: Tier 2 - Medium model (GPT-4o/Sonnet)
- Confident? → Return answer
- Not confident? → Escalate
Step 4: Tier 3 - Large model (GPT-4/Opus)
- Always returns an answer
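The cascade steps above can be sketched as a loop over tiers, cheapest first. Here the tier callables and the `confidence` scorer are stand-ins for real model calls and a real quality check (self-reported confidence, a validator, or a judge model):

```python
def cascade(query, tiers, confidence, threshold=0.7):
    """Walk the tiers cheapest-first and return the first confident answer.
    `tiers` is a list of (name, call) pairs; `confidence` scores an
    answer in [0, 1]."""
    name, answer = None, None
    for name, call in tiers:
        answer = call(query)
        if confidence(answer) >= threshold:
            return name, answer
    return name, answer  # the last (most capable) tier wins by default

# Dummy tiers: the small model punts, the large model answers.
small = lambda q: "not sure"
large = lambda q: f"detailed answer to: {q}"
conf = lambda a: 0.2 if a == "not sure" else 0.9
tier, answer = cascade("hard question", [("small", small), ("large", large)], conf)
```

The hard part in practice is the `confidence` function, not the loop: a scorer that is too lenient silently degrades quality, and one that is too strict erases the cost savings.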
Advantages of Cascading
- Most queries resolved by the cheap model
- Complex queries still get the best model
- Natural quality/cost optimization
- Built-in escalation path
Implementing a Simple Router
Conceptual Approach
1. Analyze incoming query
- Length
- Keywords (e.g., "analyze", "compare", "simple")
- Domain detection
2. Assign complexity score (0-10)
- 0-3: Simple → Small model
- 4-6: Medium → Medium model
- 7-10: Complex → Large model
3. Route to selected model
4. (Optional) Evaluate response quality
- If low quality, retry with larger model
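The conceptual approach above (score complexity 0-10, then pick a tier) can be sketched with a crude length-and-keyword heuristic. The model names, word lists, and thresholds are all illustrative, not tuned:

```python
COMPLEX_WORDS = ("analyze", "compare", "evaluate")
SIMPLE_WORDS = ("what is", "define", "when", "who")

def complexity_score(query):
    """Crude 0-10 heuristic from length and keyword signals."""
    q = query.lower()
    score = min(len(q.split()) // 25, 3)  # very long queries add up to 3
    if any(w in q for w in COMPLEX_WORDS):
        score += 7                        # analysis verbs dominate the score
    elif any(w in q for w in SIMPLE_WORDS):
        score = max(0, score - 2)         # factual-lookup phrasing lowers it
    return min(score, 10)

def pick_model(query):
    score = complexity_score(query)
    if score <= 3:
        return "small-model"
    if score <= 6:
        return "medium-model"
    return "large-model"
```

A heuristic like this is where most teams start; the step up is replacing `complexity_score` with a small classifier model trained on labeled queries, keeping `pick_model` unchanged.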
Routing Signals
Simple query indicators:
- Short (< 20 words)
- Contains "what is", "define", "when"
- Single question
Complex query indicators:
- Long (> 100 words)
- Contains "analyze", "compare", "evaluate"
- Multiple sub-questions
- Technical jargon
- Attached documents
Common Routing Mistakes
1. Over-Routing to Expensive Models
❌ Send everything to GPT-4 "just in case"
✅ Trust smaller models for simple tasks
2. Under-Routing Complex Tasks
❌ Always use the cheapest model
✅ Accept higher cost for quality-critical tasks
3. Ignoring Latency
❌ Route based only on cost
✅ Consider: simple queries need fast responses
4. No Fallback
❌ Single model, no backup
✅ Have an escalation path when confidence is low
Core Insights
- LLM routing matches queries to appropriate models
- Can reduce costs by 50-70% without quality loss
- Route by task type, complexity, or cascade
- Simple queries → cheap/fast models; complex → powerful models
- Monitor and iterate: routing rules need tuning
Ready to Build Smart AI Workflows?
This article covered the what and why of LLM routing. But production routing systems require implementation patterns, monitoring, and optimization.
In our Module 4, Chaining & Routing, you'll learn:
- Designing multi-model architectures
- Implementing routing logic
- Cascade patterns and fallbacks
- Cost optimization strategies
- Monitoring and improving routing accuracy
Module 4: Chaining & Routing
Build multi-step prompt workflows with conditional logic.
Dorian Laurenceau
Full-Stack Developer & Learning Designer. I spent 4 years as a freelance full-stack developer and 4 years teaching React, JavaScript, HTML/CSS and WordPress to adult learners. Today I design learning paths in web development and AI, grounded in learning science. I founded learn-prompting.fr to make AI practical and accessible, and built the Bluff app to gamify political transparency.
FAQ
What is LLM routing?
LLM routing automatically selects the best AI model for each query. Simple questions go to fast, cheap models; complex tasks go to powerful, expensive ones. It optimizes cost and latency without sacrificing quality.
Why not always use the best model?
Cost and speed. GPT-4 costs 10-30x more than GPT-3.5 and is slower. For 'What's 2+2?', a small model suffices. Routing uses expensive models only when necessary.
How does model routing work?
A classifier (often a small LLM) analyzes the query and predicts task difficulty. Based on this, it routes to the appropriate model. Some systems use cascades: try cheap models first, escalate if needed.
What are the benefits of LLM routing?
60-80% cost reduction with similar quality, faster responses for simple queries, ability to use specialized models for specific tasks, and automatic fallback when models fail.