What four years of scaling taught us about how LLMs actually work
The "LLMs predict the next token" explanation is accurate and insufficient. Since 2022, the research on how these models work internally has exploded, and the picture is significantly weirder than the textbook summary. The threads on r/MachineLearning, r/MLScaling, and the interpretability research coming out of Anthropic and DeepMind have reshaped how practitioners think about what's inside.
What mechanistic interpretability has shown:
- LLMs don't just pattern-match; they build internal representations. Anthropic's Scaling Monosemanticity paper identified millions of interpretable features inside Claude Sonnet, including concepts like "unsafe code," "internal conflict," and "coding syntax errors."
- Some circuits are well understood; most aren't. Induction heads, attention patterns, and basic arithmetic circuits have been mapped. The vast majority of model behaviour remains a black box.
- Scaling changes behaviour non-linearly. Capabilities emerge at scale thresholds. Chain-of-thought, in-context learning, and instruction-following all emerged rather than being designed in.
What the "token prediction" framing misses:
- Tokenisation shapes capability. GPT-style BPE tokenisers split numbers and non-English text in ways that measurably hurt math and multilingual performance. Tokenisers matter more than prompt engineering for many tasks (see the sketch after this list).
- Context windows aren't uniform. "200k context" doesn't mean equal attention across 200k tokens. Lost-in-the-middle and uneven retrieval are persistent properties; see the RULER benchmark for honest long-context measurement.
- Temperature and sampling aren't cosmetic. They reshape the output distribution enough to affect correctness on factual tasks. Defaults are rarely optimal; see the OpenAI API reference and Anthropic's sampling docs for the honest specification.
- Attention is expensive and getting more efficient. FlashAttention, sparse attention, and linear-attention variants have cut the cost of long-context inference substantially; model architecture continues to evolve.
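To make the tokenisation point concrete, here is a minimal sketch using the tiktoken library (an assumption; any tokeniser inspector works) that shows how digit strings and non-English text fragment into uneven pieces:

```python
# Sketch: inspect how a BPE tokeniser splits numbers and non-English text.
# Assumes the `tiktoken` library (pip install tiktoken); cl100k_base is the
# encoding used by GPT-4-era models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["12345", "3.14159", "hello world", "こんにちは"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r}: {len(tokens)} tokens -> {pieces}")
```

A model never sees "3.14159" as one unit; it sees whatever pieces the tokeniser produced, which is part of why digit-level arithmetic is harder for models than it looks.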
What practitioners should internalise:
- Models are statistical artefacts trained on data. Everything they "know" is what was in the training corpus, filtered through human feedback. They don't have access to real-time information unless tools provide it.
- "Hallucination" is not a bug that will be fixed. It's a property of probabilistic generation. Mitigation is architectural (RAG, verification, uncertainty), not a patch waiting to ship; a minimal sketch of the pattern follows this list.
- Reasoning models are different. The o-series, GPT-5 Thinking, and Claude's extended thinking all use internal search at inference time. Their cost, latency, and quality profile is qualitatively different from the base models of 2022-2023.
- The gap between benchmark and production is large. Models that ace MMLU can still fail on your specific domain. Benchmark on your actual task before deciding.
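The retrieve-then-generate pattern mentioned above, as a toy sketch: keyword overlap stands in for a real vector store, and the assembled prompt pins the model to the retrieved sources. All names and data here are illustrative:

```python
# Sketch of the retrieve-then-generate (RAG) pattern: ground the model in
# retrieved text instead of trusting its parametric memory.
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query: str, corpus: list[str]) -> str:
    """Build a prompt that instructs the model to stay inside the sources."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return ("Answer using ONLY the sources below. If they don't contain "
            f"the answer, say so.\n\nSources:\n{context}\n\nQuestion: {query}")

corpus = ["The 2024 audit found 12 critical issues.",
          "Deploys are frozen on Fridays."]
print(grounded_prompt("How many critical issues did the audit find?", corpus))
```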
The honest framing: the 2022 "token prediction" explanation is still the best starting point for newcomers, but it hides most of what actually matters for production use. Tokenisation, attention patterns, sampling, scaling effects, and the difference between reasoning and base models all matter more than the high-level summary suggests. The practitioners who ship reliable LLM products treat the model as an empirical system to be measured, not a clean abstraction.
Why Understanding LLMs Matters
Most AI users treat models as magic black boxes. They type a prompt, hope for the best, and blame the AI when results disappoint. But LLMs follow predictable rules. When you understand those rules, you can:
- Write prompts that work with the model's architecture, not against it
- Predict when a model will fail and prevent it
- Choose the right parameters (temperature, top-p) for each task
- Understand why context length matters and how to manage it
Tokens: The Atoms of AI Language
LLMs do not read words; they read tokens. A token is a chunk of text, typically 3-4 characters. Understanding tokenization explains many AI quirks.
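A quick way to check that rule of thumb yourself, again assuming the tiktoken library:

```python
# Sketch: measure characters-per-token to sanity-check the "3-4 characters"
# rule of thumb. Assumes the `tiktoken` library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sample = ("Large language models read tokens, not words. Unusual words, "
          "numbers like 1234567, and emoji 🙂 all split differently.")
n_tokens = len(enc.encode(sample))
print(f"{len(sample)} chars / {n_tokens} tokens "
      f"= {len(sample) / n_tokens:.1f} chars per token")
```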
Context Windows: The Model's Memory
The context window is the total number of tokens a model can process at once, both your input AND the model's output combined. Think of it as the model's working memory.
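One practical consequence: you have to budget the window between input and output yourself. A sketch of that budgeting, with made-up limits and a crude character-based counter standing in for a real tokenizer:

```python
# Sketch: budget a fixed context window between input and output.
# MAX_CONTEXT and RESERVED_OUTPUT are illustrative, not any model's real limits.
MAX_CONTEXT = 8192        # total window: input + output combined
RESERVED_OUTPUT = 1024    # leave room for the model's reply

def fit_history(messages: list[str], count_tokens) -> list[str]:
    """Drop the oldest messages until the prompt fits the input budget."""
    budget = MAX_CONTEXT - RESERVED_OUTPUT
    kept, used = [], 0
    for msg in reversed(messages):      # keep the most recent turns first
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

approx = lambda s: max(1, len(s) // 4)  # crude ~4-chars-per-token estimate
history = ["old turn " * 4000, "recent turn", "latest question"]
print(fit_history(history, approx))     # the oversized old turn is dropped
```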
Temperature and Top-p: Controlling Creativity
These two parameters control HOW the model selects the next token from its probability distribution.
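A toy numpy sketch of what that means: temperature rescales the logits before the softmax, and top-p truncates the tail. The logits below are made up:

```python
# Sketch: how temperature and top-p reshape a next-token distribution.
import numpy as np

rng = np.random.default_rng(0)

def sample(logits, temperature=1.0, top_p=1.0):
    # Temperature rescales logits: <1 sharpens, >1 flattens the distribution.
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest set of tokens whose mass >= top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

logits = np.array([4.0, 3.5, 2.0, 1.0, 0.5])   # five candidate tokens
for t in (0.2, 1.0, 1.5):
    picks = [sample(logits, temperature=t, top_p=0.9) for _ in range(10)]
    print(f"temperature={t}: {picks}")
```

At temperature 0.2 virtually all samples collapse onto the top token; at 1.5 the nucleus widens and tail tokens get real probability mass. That shift is large enough to matter on factual tasks.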
The Attention Mechanism: How LLMs Focus
The secret sauce of modern LLMs is the Transformer architecture and its attention mechanism. This is what allows the model to understand relationships between distant words.
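The core operation is small enough to write out. A minimal numpy sketch of scaled dot-product attention; real models add learned projections, many heads, and causal masking:

```python
# Sketch: scaled dot-product attention, the heart of the Transformer.
# Shapes: n tokens, d-dimensional queries/keys/values; toy random inputs.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # similarity of every pair of tokens
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                  # each output mixes all value vectors

rng = np.random.default_rng(0)
n, d = 4, 8                             # 4 tokens, 8-dim embeddings
Q, K, V = rng.standard_normal((3, n, d))
print(attention(Q, K, V).shape)         # (4, 8): one mixed vector per token
```

Every output row is a weighted mix of every token's value vector, which is how the model relates distant words in a single step.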
Advanced: Decoding Strategies
Test Your Understanding
What's Next
You now understand the internal mechanics of LLMs: tokenization, context windows, temperature, and attention. Next, you will learn prompt engineering techniques (zero-shot, one-shot, and few-shot) to leverage this knowledge in practice.
Continue to the next article: Prompt Engineering Techniques to master the art of few-shot prompting.
Why Prompting Techniques Matter
The same model can produce wildly different results depending on HOW you ask. Zero-shot is fast but imprecise. Few-shot is slower to set up but dramatically more reliable. Choosing the right technique for the right task is the core skill of prompt engineering.
The honest read on zero-shot vs. few-shot vs. chain-of-thought in 2026, tracked across r/LocalLLaMA, r/PromptEngineering, and r/MachineLearning: the research that underpins these techniques is solid, but the community's lived experience has updated the textbook rankings. The original few-shot paper (Brown et al., 2020) established that examples dramatically improve in-context learning, the chain-of-thought paper (Wei et al., 2022) showed that intermediate reasoning helps, and the Anthropic prompting guide remains the cleanest practical reference. What's changed: with GPT-5-class and Claude-Opus-class models, the gap between zero-shot and few-shot has narrowed on many tasks, and the gap between zero-shot and chain-of-thought has narrowed on reasoning tasks because reasoning is increasingly baked into the model by default.
Where the community correctly pushes back on the "always use few-shot" doctrine: examples cost tokens and risk anchoring the output too tightly to the pattern you showed. If your task is novel or creative, zero-shot often outperforms because the model isn't constrained by your examples. If your task needs consistency (tone, format, structure) across many runs, few-shot wins. The test is simple: write one good example, run it both ways, compare.
Pragmatic rule from practitioners who've moved past the hype: pick the technique based on what failure looks like. "Wrong format" → few-shot with a format example. "Wrong reasoning" → chain-of-thought. "Wrong style" → few-shot with style examples. "Too generic" → zero-shot with stronger role and constraint specification. The technique is the tool; the failure mode tells you which tool.
The Three Techniques Explained
Zero-Shot Prompting
You give the model an instruction with NO examples. The model relies entirely on its training knowledge.
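A minimal zero-shot call, sketched with the official openai Python client; the model name is a placeholder, and any chat model works:

```python
# Sketch: zero-shot. One instruction, no examples.
# Assumes the official `openai` client and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder; use whatever model you benchmark
    messages=[{
        "role": "user",
        "content": "Classify the sentiment of this review as positive, "
                   "negative, or mixed: 'Great screen, terrible battery.'",
    }],
)
print(resp.choices[0].message.content)
```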
Few-Shot Prompting
You provide 3-5 examples of input-output pairs BEFORE your actual request. The model learns the pattern from your examples.
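The same task, few-shot: three input-output pairs pin the label format before the real input arrives (same client assumptions as the zero-shot sketch):

```python
# Sketch: few-shot. The examples teach the exact output pattern.
from openai import OpenAI

FEW_SHOT = """Classify the sentiment. Answer with one word.

Review: "Absolutely love it, works perfectly."
Sentiment: positive

Review: "Broke after two days. Avoid."
Sentiment: negative

Review: "Fast shipping, but the fabric feels cheap."
Sentiment: mixed

Review: "Great screen, terrible battery."
Sentiment:"""

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    messages=[{"role": "user", "content": FEW_SHOT}],
)
print(resp.choices[0].message.content)
```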
The 5 Components of an Effective Prompt
Beyond shot techniques, every prompt benefits from five structural components.
Technique Effectiveness Across Tasks
Advanced: Prompt Chaining with Techniques
Test Your Understanding
Continue Learning
You now know when to use zero-shot, one-shot, and few-shot, plus the 5 components of an effective prompt. Next, you will build your own prompt book: a reusable library of templates using these techniques.
- The Prompt Engineering Process: a systematic 6-step method for optimizing your prompts
Continue to the workshop: Build Your Prompt Book to create templates you will use every day.
Why You Need a Prompt Book
Every time you write a prompt from scratch, you pay a creativity tax. You reinvent structure, forget constraints, and get inconsistent results. A prompt book eliminates this waste.
Think of it like code libraries. No developer writes sorting algorithms from scratch; they import a library. Your prompt book is the same: tested, reusable, version-controlled.
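One way to make that concrete, sketched with Python's standard library; the template names, fields, and versioning scheme are all illustrative:

```python
# Sketch: a prompt book as a small, versioned template library.
from dataclasses import dataclass
from string import Template

@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: str       # bump when you refine the wording
    body: Template     # $placeholders mark the variable slots

BOOK = {
    "summarise": PromptTemplate(
        name="summarise",
        version="1.2",
        body=Template(
            "You are a $role. Summarise the text below in $length bullet "
            "points for a $audience audience.\n\nText:\n$text"
        ),
    ),
}

prompt = BOOK["summarise"].body.substitute(
    role="technical editor", length="3", audience="executive",
    text="(paste source text here)",
)
print(prompt)
```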
Workshop: Build 5 Templates in 30 Minutes
The Iterative Refinement Process
Good templates are not written; they are refined. Here is the process.
Organizing Your Prompt Book
Common Template Anti-Patterns
Test Your Understanding
Where to Go From Here
You now have a 5-template prompt book and the skills to refine and expand it. In the next module, you will learn to get structured outputs from AI (JSON, tables, and schemas), the backbone of production AI workflows.
Continue to Structured AI Outputs to master JSON extraction and data formatting.