
Claude Extended Thinking: Deep Reasoning in Practice

By Dorian Laurenceau

📅 Last reviewed: April 24, 2026. Updated with April 2026 findings and community feedback.

🔗 Pillar article: Claude API: Complete Guide


What is Extended Thinking?

Extended Thinking is a feature that allows Claude to think deeply before responding. Instead of immediately generating a response, Claude:

  1. Explores different approaches to the problem
  2. Verifies its reasoning step by step
  3. Corrects its errors along the way
  4. Formulates a more reliable final response

It's the equivalent of asking an expert to take time to think rather than answering instantly.

Without vs With Extended Thinking

| Aspect      | Without Extended Thinking            | With Extended Thinking           |
|-------------|--------------------------------------|----------------------------------|
| Response    | Immediate, direct                    | Thoughtful, structured           |
| Mathematics | Frequent errors on complex problems  | Methodical solving               |
| Code        | Possible subtle bugs                 | Step-by-step verification        |
| Analysis    | Superficial                          | Multi-perspective                |
| Latency     | Low                                  | Higher (proportional to budget)  |
| Cost        | Standard                             | + thinking tokens                |

The honest state of "extended thinking" features across frontier models (Anthropic's extended thinking, OpenAI's o-series, DeepSeek's reasoning mode) is that they trade latency and cost for a real but narrow quality gain. Practitioners on r/MachineLearning and r/LocalLLaMA keep landing on the same profile: math, logical multi-step reasoning, and code with non-obvious constraints benefit clearly; rephrasing, summarization, and single-hop Q&A do not. Turning on extended thinking for every task is a waste of tokens; turning it off for the tasks that actually need it is a waste of accuracy.

Where the community correctly pushes back on the marketing: "thinking" tokens are not observable reasoning in any philosophical sense; they are an extended scratchpad that statistically correlates with better final answers on reasoning-heavy tasks. Research like Chain-of-Thought prompting and the subsequent debates about whether CoT faithfully represents model cognition are the right context here. The short version: treat the thinking trace as a debugging tool, not as a literal transcript of the model's beliefs.

Pragmatic operating rule: benchmark your own workload with and without extended thinking on a fixed eval set, then adopt it only where the accuracy gain exceeds the latency and token cost. For most production workflows, a small budget (1-4k thinking tokens) on the 20% of hardest tasks pays for itself; blanket enabling rarely does.
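The operating rule above can be sketched as a small comparison harness. Everything here is illustrative: `call_model` stands in for whatever wrapper you use to send requests (with `thinking_budget=None` meaning thinking disabled), and `grade` is your own correctness check.

```python
# Minimal with/without comparison on a fixed eval set. `call_model` and
# `grade` are supplied by you; nothing here is Anthropic-specific.
def compare_thinking(eval_set, call_model, grade, budget=4000):
    """Return accuracy with and without extended thinking."""
    results = {"with": 0, "without": 0}
    for case in eval_set:
        for mode, b in (("with", budget), ("without", None)):
            answer = call_model(case["prompt"], thinking_budget=b)
            if grade(answer, case["expected"]):
                results[mode] += 1
    n = len(eval_set)
    return {mode: count / n for mode, count in results.items()}
```

If the "with" accuracy does not beat "without" by more than your latency and token budget is worth, leave thinking off for that workload.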

Enabling Extended Thinking via the API

Basic Implementation

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Maximum budget for thinking
    },
    messages=[{
        "role": "user",
        "content": "Solve this dynamic programming problem: given an array of integers, find the longest increasing subsequence."
    }]
)

# Iterate through response blocks
for block in response.content:
    if block.type == "thinking":
        print("🧠 Thinking:")
        print(block.thinking)
        print("---")
    elif block.type == "text":
        print("💬 Response:")
        print(block.text)

Adaptive Thinking (Claude 4.6)

Claude 4.6's adaptive thinking automatically adjusts reasoning effort based on complexity:

response = client.messages.create(
    model="claude-opus-4-20250918",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # MAXIMUM budget, not fixed
    },
    messages=[{
        "role": "user",
        "content": "What is the capital of France?"
    }]
)
# → Minimal thinking (~50 tokens) because the question is simple

response2 = client.messages.create(
    model="claude-opus-4-20250918",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[{
        "role": "user",
        "content": "Prove that the square root of 2 is irrational."
    }]
)
# → Extended thinking (~5000 tokens) because the problem is complex

Budget Tokens: Optimizing Cost

The budget_tokens parameter defines the maximum number of tokens Claude can use for thinking.

| Budget          | Best for                          | Added latency |
|-----------------|-----------------------------------|---------------|
| 1,000 - 3,000   | Easy questions, clarifications    | < 2s          |
| 3,000 - 8,000   | Moderate reasoning, code debugging| 2-5s          |
| 8,000 - 15,000  | Complex mathematics, architecture | 5-15s         |
| 15,000 - 32,000 | Very complex problems, proofs     | 15-30s        |

Allocation Strategy

def get_thinking_budget(task_type):
    """Returns a thinking budget suited to the task type."""
    budgets = {
        "simple_qa": 1000,
        "code_review": 5000,
        "bug_fix": 8000,
        "algorithm_design": 12000,
        "math_proof": 20000,
        "architecture": 15000,
    }
    return budgets.get(task_type, 5000)

# Usage
budget = get_thinking_budget("bug_fix")
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": budget},
    messages=[{"role": "user", "content": "..."}]
)

Important Constraints

  • Minimum: budget_tokens must be ≥ 1024
  • Maximum: budget_tokens must be less than max_tokens (thinking tokens count toward the max_tokens limit)
  • Billing: thinking tokens are billed at the output token rate
  • No caching: thinking blocks are not eligible for prompt caching
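These constraints can be checked client-side before a request goes out. A minimal sketch, where `validate_thinking_config` is a hypothetical helper (not part of the SDK) that also enforces that the budget stays below `max_tokens`, since thinking tokens consume part of the output window:

```python
MIN_THINKING_BUDGET = 1024  # API minimum for budget_tokens

def validate_thinking_config(budget_tokens, max_tokens):
    """Check a thinking configuration against the constraints above
    and return the `thinking` parameter dict on success."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be >= {MIN_THINKING_BUDGET}")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be less than max_tokens")
    return {"type": "enabled", "budget_tokens": budget_tokens}
```

Failing fast here is cheaper than letting the API reject the request after a round trip.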

When to Use Extended Thinking?

✅ Use It

| Use case           | Example                                      | Expected gain  |
|--------------------|----------------------------------------------|----------------|
| Mathematics        | Solving equations, proving theorems          | +40% accuracy  |
| Programming        | Designing algorithms, debugging complex code | +35% accuracy  |
| Logical analysis   | Detecting flaws in an argument               | +30% accuracy  |
| Planning           | Creating a technical migration plan          | Better coverage|
| Structured writing | Writing an RFC or specification              | More coherent  |

โŒ Avoid It

  • Simple factual questions: "What is the capital of France?" → not needed
  • Open-ended creative tasks: "Write a poem" → thinking doesn't improve creativity
  • Low-latency chatbots: thinking adds latency
  • Tight budgets: thinking tokens are billed and add up quickly
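One way to encode this use-it/avoid-it split is to gate thinking behind a task-type allowlist. The task names below are illustrative, not an official taxonomy:

```python
# Task types that tend to benefit from extended thinking (illustrative set).
THINKING_TASKS = {"math", "debugging", "logic", "planning", "spec_writing"}

def build_request_kwargs(task_type, prompt, budget=8000):
    """Build messages.create() kwargs, enabling thinking only where it pays off."""
    kwargs = {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 16000,
        "messages": [{"role": "user", "content": prompt}],
    }
    if task_type in THINKING_TASKS:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget}
    return kwargs
```

A chatbot route would then call `client.messages.create(**build_request_kwargs("smalltalk", msg))` and never pay for thinking tokens.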

Streaming with Extended Thinking

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Design a database for a social network."}]
) as stream:
    current_type = None
    for event in stream:
        if hasattr(event, 'type'):
            if event.type == "content_block_start":
                block_type = event.content_block.type
                if block_type == "thinking":
                    print("\n🧠 Thinking in progress...")
                    current_type = "thinking"
                elif block_type == "text":
                    print("\n💬 Response:")
                    current_type = "text"
            elif event.type == "content_block_delta":
                if current_type == "thinking" and hasattr(event.delta, 'thinking'):
                    print(event.delta.thinking, end="")
                elif current_type == "text" and hasattr(event.delta, 'text'):
                    print(event.delta.text, end="")

Block Order in Streaming

1. message_start
2. content_block_start (type: "thinking")
3. content_block_delta (thinking content...)  ← thinking blocks
4. content_block_stop
5. content_block_start (type: "text")
6. content_block_delta (text content...)     ← final response
7. content_block_stop
8. message_stop

Performance Benchmarks

Extended Thinking significantly improves performance on reasoning benchmarks:

| Benchmark      | Without Thinking | With Thinking | Improvement |
|----------------|------------------|---------------|-------------|
| MATH (level 5) | 71.2%            | 93.4%         | +22.2%      |
| GPQA (Diamond) | 65.0%            | 81.3%         | +16.3%      |
| SWE-bench      | 38.2%            | 52.1%         | +13.9%      |
| HumanEval      | 88.5%            | 95.7%         | +7.2%       |
| ARC-Challenge  | 89.1%            | 96.8%         | +7.7%       |

Advanced Patterns

Multi-Turn Conversation with Thinking

messages = []

def chat_with_thinking(user_msg, budget=5000):
    messages.append({"role": "user", "content": user_msg})
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=messages
    )
    
    # Add the complete response (without thinking blocks)
    assistant_content = [
        block for block in response.content
        if block.type == "text"
    ]
    messages.append({"role": "assistant", "content": assistant_content})
    
    return response

# Usage
chat_with_thinking("Design a recommendation algorithm.", budget=10000)
chat_with_thinking("Optimize it for scalability.", budget=8000)

Dynamic Budget Based on Complexity

def estimate_complexity(message):
    """Estimates question complexity to adjust the budget."""
    complexity_indicators = {
        "prove": 3, "demonstrate": 3, "optimize": 2,
        "algorithm": 2, "architecture": 2, "compare": 1,
        "design": 2, "debug": 2, "why": 1,
        "analyze": 1, "explain": 0.5
    }
    
    score = sum(
        weight for keyword, weight in complexity_indicators.items()
        if keyword in message.lower()
    )
    
    if score <= 1:
        return 2000
    elif score <= 3:
        return 6000
    elif score <= 5:
        return 12000
    else:
        return 20000

budget = estimate_complexity("Prove and demonstrate that this algorithm is optimal.")
# → 20000 tokens (high complexity)

Common Errors

| Error                  | Cause                             | Solution                             |
|------------------------|-----------------------------------|--------------------------------------|
| Budget too small       | budget_tokens < 1024              | Minimum 1024 tokens                  |
| Truncated response     | max_tokens too low after thinking | Increase max_tokens                  |
| Unexpectedly high cost | Thinking on every request         | Enable thinking only when necessary  |
| Empty thinking         | Question too simple               | Adaptive thinking may skip reasoning |
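The "truncated response" failure mode is detectable at runtime via the response's stop_reason. A sketch of one recovery strategy, assuming `client` is an `anthropic.Anthropic()` instance (the retry-with-double-headroom policy is a suggestion, not an official pattern):

```python
def create_with_headroom(client, messages, budget=8000,
                         max_tokens=16000, model="claude-sonnet-4-20250514"):
    """Send a thinking-enabled request; if the output window was exhausted,
    retry once with double the max_tokens."""
    kwargs = dict(
        model=model,
        max_tokens=max_tokens,
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=messages,
    )
    response = client.messages.create(**kwargs)
    if response.stop_reason == "max_tokens":
        # Thinking consumed too much of the window; give the answer more room
        kwargs["max_tokens"] = max_tokens * 2
        response = client.messages.create(**kwargs)
    return response
```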


Dorian Laurenceau

Full-Stack Developer & Learning Designer

Full-stack web developer and learning designer. I spent 4 years as a freelance full-stack developer and 4 years teaching React, JavaScript, HTML/CSS and WordPress to adult learners. Today I design learning paths in web development and AI, grounded in learning science. I founded learn-prompting.fr to make AI practical and accessible, and built the Bluff app to gamify political transparency.

Published: March 10, 2026 · Updated: April 24, 2026

FAQ

What is Claude's Extended Thinking?

Extended Thinking allows Claude to 'think' before responding, exploring different approaches and verifying its reasoning. Thinking blocks are visible in the API response and are billed at the output token rate.

When should I use Extended Thinking?

Use it for complex problems: advanced mathematics, multi-step reasoning, code analysis, planning, bug resolution, and any task requiring deep thought.

What is the cost of Extended Thinking?

Thinking tokens are billed at the same rate as output tokens. However, Claude 4.6's adaptive thinking automatically optimizes the budget by using only the tokens needed.

Does Extended Thinking work with streaming?

Yes, Extended Thinking is compatible with streaming. You first receive the thinking blocks, then the final response content. You can choose to show or hide thinking blocks on the client side.

What is the difference between Extended Thinking and Claude 4.6 adaptive thinking?

Classic Extended Thinking uses a fixed budget. Claude 4.6's adaptive thinking automatically adjusts the reasoning effort based on question complexity, optimizing cost and latency.

How do I disable extended thinking in Claude?

In the API, omit the 'thinking' parameter or set its type to 'disabled'. In the claude.ai web interface, extended thinking activates automatically based on question complexity, so there is nothing to disable manually. In Claude Code, use the --no-thinking flag to force quick responses.

Which Claude model is best for reasoning?

Opus 4.6 with extended thinking enabled is the best Claude model for complex reasoning (math, logic, system architecture). For standard reasoning, Sonnet 4.6 with its adaptive thinking offers excellent value: it automatically adjusts its reasoning effort based on difficulty.