How LLMs Work: Tokens, Prediction & Architecture
By Dorian Laurenceau
📅 Last reviewed: April 24, 2026. Updated with April 2026 findings and community feedback.
You use AI every day, but do you know what happens between pressing Enter and seeing a response? Understanding the engine behind ChatGPT, Claude, and Gemini turns you from a casual user into a power user. By the end of this article, you will understand the four pillars of every LLM: tokens, context windows, temperature, and the attention mechanism.
What four years of scaling taught us about how LLMs actually work
The "LLMs predict the next token" explanation is accurate and insufficient. Since 2022, the research on how these models work internally has exploded, and the picture is significantly weirder than the textbook summary. The threads on r/MachineLearning, r/MLScaling, and the interpretability research coming out of Anthropic and DeepMind have reshaped how practitioners think about what's inside.
What mechanistic interpretability has shown:
- LLMs don't just pattern-match; they build internal representations. Anthropic's Scaling Monosemanticity paper identified millions of interpretable features inside Claude Sonnet, including concepts like "unsafe code," "internal conflict," and "coding syntax errors."
- Some circuits are well understood; most aren't. Induction heads, attention patterns, and basic arithmetic circuits have been mapped. The vast majority of model behaviour remains a black box.
- Scaling changes behaviour non-linearly. Capabilities emerge at scale thresholds: chain-of-thought, in-context learning, and instruction-following all emerged rather than being designed in.
What the "token prediction" framing misses:
- Tokenisation shapes capability. GPT-style BPE tokenisers split numbers and non-English text in ways that measurably hurt math and multilingual performance; the sketch after this list shows the effect. For many tasks, the tokeniser matters more than prompt engineering.
- Context windows aren't uniform. "200k context" doesn't mean equal attention across 200k tokens. Lost-in-the-middle and uneven retrieval are persistent properties; see the RULER benchmark for honest long-context measurement.
- Temperature and sampling aren't cosmetic. They reshape the output distribution enough to affect correctness on factual tasks. Defaults are rarely optimal; see the OpenAI API reference and Anthropic's sampling docs for the exact parameter definitions.
- Attention is expensive and getting more efficient. FlashAttention, sparse attention, and linear-attention variants have cut the cost of long-context inference substantially, and model architecture continues to evolve.
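To make the tokenisation point concrete, here is a minimal sketch using the open-source tiktoken library. The choice of cl100k_base is an assumption for illustration; any BPE tokeniser shows the same effect, with splits that vary by vocabulary.

```python
# Minimal sketch: how a BPE tokeniser (tiktoken's cl100k_base, used by
# GPT-4-era OpenAI models) splits numbers and non-English text into pieces.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["12345.6789", "strawberry", "Les modèles de langage", "language models"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r} -> {len(tokens)} tokens: {pieces}")
```

On most BPE vocabularies, common English prose compresses into fewer, larger tokens than digits or non-English text, which is exactly the capability tax described above.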
What practitioners should internalise:
- Models are statistical artefacts trained on data. Everything they "know" was in the training corpus, filtered through human feedback. They don't have access to real-time information unless tools provide it.
- "Hallucination" is not a bug that will be fixed. It's a property of probabilistic generation. Mitigation is architectural (RAG, verification, uncertainty estimation), not a patch waiting to ship.
- Reasoning models are different. The o-series, GPT-5 Thinking, and Claude's extended thinking spend extra compute on internal search at inference time; their cost, latency, and quality profile is qualitatively different from the base models of 2022-2023.
- The gap between benchmark and production is large. Models that ace MMLU can still fail on your specific domain. Benchmark on your actual task before deciding; a minimal harness is sketched after this list.
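As one hedged illustration of task-specific benchmarking, here is a sketch using the official openai Python client. The model name, the test case, and the substring scorer are placeholder assumptions; swap in your own model, examples drawn from production traffic, and a scoring rule that fits your task.

```python
# Minimal sketch of "benchmark on your actual task": run your own cases
# through the model and score them, instead of trusting public leaderboards.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

cases = [  # placeholder; replace with examples from your real traffic
    {"prompt": "Extract the invoice number from: 'Invoice #A-1042, due May 3'",
     "expected": "A-1042"},
]

correct = 0
for case in cases:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # swap in whichever model you are evaluating
        messages=[{"role": "user", "content": case["prompt"]}],
        temperature=0,  # near-deterministic decoding for repeatable scores
    )
    answer = resp.choices[0].message.content.strip()
    correct += case["expected"] in answer  # crude scorer; adapt to your domain

print(f"task accuracy: {correct}/{len(cases)}")
```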
The honest framing: the 2022 "token prediction" explanation is still the best starting point for newcomers, but it hides most of what actually matters for production use. Tokenisation, attention patterns, sampling, scaling effects, and the difference between reasoning and base models all matter more than the high-level summary suggests. The practitioners who ship reliable LLM products treat the model as an empirical system to be measured, not a clean abstraction.
Why Understanding LLMs Matters
Most AI users treat models as magic black boxes. They type a prompt, hope for the best, and blame the AI when results disappoint. But LLMs follow predictable rules. When you understand those rules, you can:
- Write prompts that work with the model's architecture, not against it
- Predict when a model will fail and prevent it
- Choose the right parameters (temperature, top-p) for each task
- Understand why context length matters and how to manage it
Tokens: The Atoms of AI Language
LLMs do not read words; they read tokens. A token is a chunk of text, typically 3-4 characters of English. Understanding tokenization explains many AI quirks, from miscounted letters to inflated costs for non-English text.
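You can check the rule of thumb yourself; a quick sketch, again assuming the tiktoken library:

```python
# Sketch: tokens are not words. Count both for a sentence and check the
# "3-4 characters per token" rule of thumb for English text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sentence = "Tokenization explains many of the strange quirks you see in LLM output."

tokens = enc.encode(sentence)
print("words          :", len(sentence.split()))
print("tokens         :", len(tokens))
print("chars per token:", round(len(sentence) / len(tokens), 2))
```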
Context Windows: The Model's Memory
The context window is the total number of tokens a model can process at once, both your input AND the model's output combined. Think of it as the model's working memory.
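The practical consequence: the room left for the model's answer is whatever the window leaves after your prompt. A sketch of the budget arithmetic, where the window size is a hypothetical value (check your provider's documentation for each model's real limit):

```python
# Sketch: prompt and completion share one context window, so the space
# available for the model's reply shrinks as your prompt grows.
import tiktoken

CONTEXT_WINDOW = 128_000  # hypothetical limit; varies by model and provider

enc = tiktoken.get_encoding("cl100k_base")
prompt = "System instructions, conversation history, and the user question..."

prompt_tokens = len(enc.encode(prompt))
room_for_output = CONTEXT_WINDOW - prompt_tokens
print(f"prompt uses {prompt_tokens} tokens; {room_for_output} remain for the reply")
```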
Temperature and Top-p: Controlling Creativity
These two parameters control HOW the model selects the next token from its probability distribution.
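A minimal numpy sketch of both knobs. The four-token vocabulary and its logits are made up for illustration; a real model scores tens of thousands of tokens at each step.

```python
# Sketch of temperature and top-p (nucleus) sampling over toy logits.
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    rng = rng or np.random.default_rng()
    # Temperature rescales logits: <1 sharpens the distribution, >1 flattens it.
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    kept = probs[keep] / probs[keep].sum()  # renormalise over the kept tokens
    return rng.choice(keep, p=kept)

logits = np.array([4.0, 3.5, 2.0, 0.5])  # toy scores for 4 candidate tokens
for t in (0.2, 1.0, 1.5):
    draws = [sample_next_token(logits, temperature=t, top_p=0.9) for _ in range(1000)]
    print(f"T={t}: top token chosen {draws.count(0) / 1000:.0%} of the time")
```

Low temperature concentrates almost all picks on the top token (good for extraction and factual tasks); higher temperature spreads them out (useful for brainstorming and variety).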
The Attention Mechanism: How LLMs Focus
The secret sauce of modern LLMs is the Transformer architecture and its attention mechanism. This is what allows the model to understand relationships between distant words.
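A minimal numpy sketch of scaled dot-product attention, the core operation: a single head, no masking, and random vectors in place of learned projections.

```python
# Sketch of scaled dot-product attention. Each token's query is compared
# to every token's key; the resulting weights decide how much of each
# token's value flows into that token's updated representation.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V  # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # toy sizes: 5 tokens, 8-dimensional embeddings
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 8): one updated vector per token
```

Because every query attends to every key, the cost grows quadratically with sequence length, which is why the long-context optimisations mentioned earlier matter.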
Advanced: Decoding Strategies
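Beyond the temperature and top-p sampling covered above, two other common strategies are greedy decoding (always take the single most likely token) and beam search, which keeps the k most probable partial sequences at every step. A toy beam-search sketch, where the hand-written transition table stands in for a real model's next-token distribution:

```python
# Toy beam search over a 3-token vocabulary. NEXT[prev][tok] is the
# probability of emitting `tok` after `prev` (None = start of sequence).
import math

NEXT = {
    None: {0: 0.5, 1: 0.4, 2: 0.1},
    0:    {0: 0.1, 1: 0.3, 2: 0.6},
    1:    {0: 0.7, 1: 0.2, 2: 0.1},
    2:    {0: 0.3, 1: 0.3, 2: 0.4},
}

def beam_search(steps=3, k=2):
    beams = [((), 0.0)]  # (sequence so far, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            prev = seq[-1] if seq else None
            for tok, p in NEXT[prev].items():
                candidates.append((seq + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

for seq, score in beam_search():
    print(seq, round(score, 3))
```

In this table, greedy decoding takes the path (0, 2, 2) with probability 0.5 × 0.6 × 0.4 = 0.12, while beam search surfaces (1, 0, 2) with probability 0.4 × 0.7 × 0.6 ≈ 0.17: a token that looks second-best now can lead to a better continuation.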
What's Next
You now understand the internal mechanics of LLMs: tokenization, context windows, temperature, and attention. Next, you will learn prompt engineering techniques (zero-shot, one-shot, and few-shot) to put this knowledge into practice.
Continue to the next article: Prompt Engineering Techniques to master the art of few-shot prompting.
Dorian Laurenceau
Full-Stack Developer & Learning Designer. I spent 4 years as a freelance full-stack developer and 4 years teaching React, JavaScript, HTML/CSS and WordPress to adult learners. Today I design learning paths in web development and AI, grounded in learning science. I founded learn-prompting.fr to make AI practical and accessible, and built the Bluff app to gamify political transparency.
FAQ
What will I learn in this Prompt Engineering guide?
Understand how Large Language Models generate text token by token. Learn about attention mechanisms, context windows, temperature, and top-p parameters with clear examples.