GPT-5.4 Guide: Features, Benchmarks & What's New
By Dorian Laurenceau
Last reviewed: April 24, 2026. Updated with April 2026 findings and community feedback.
What practitioners actually report about GPT-5.4 (the gap between launch demos and production)
GPT-5.4 launched in March 2026 with an event centred on native computer use. The threads on r/OpenAI, r/ChatGPTPro, r/LocalLLaMA, and r/MachineLearning in the weeks after launch tracked the predictable arc: initial enthusiasm, then bug reports, then a more measured consensus on where it actually wins.
Where GPT-5.4 genuinely improves on prior models:
- Tool selection in long-running agents. The reduction in spurious tool calls is the change practitioners notice most. Long agentic loops that used to derail on tool over-use now stay on task longer.
- Computer use as a first-class capability. It works for narrow, well-defined screen tasks (form filling, data extraction from known sites). It's still expensive in tokens and latency.
- Cheaper input pricing on non-cached calls vs. GPT-5.3-Codex makes high-volume RAG and summarisation pipelines easier to justify.
- Better behaviour on long-context retrieval. Less lost-in-the-middle than 5.3, though the RULER benchmark shows the gap to advertised context length is still real.
Where the launch demos overpromised:
- Computer use on novel sites is brittle. The demo websites were fine-tuned in. Production deployments report 30-60% failure rates on first attempts at unknown SaaS interfaces; success rises with retries and explicit DOM hints.
- "Fewer tool calls" cuts both ways. Some practitioners on r/LangChain report the model now skips tool calls it should make, particularly for fact-checking. Tune your tool descriptions and required-tool prompts.
- Coding gains over GPT-5.3-Codex are smaller than the SWE-bench numbers suggest. Real codebases with non-trivial dependencies still trip both models.
- Latency on reasoning-heavy tasks is not always lower despite the marketing. Watch your p95s in production.
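If the model is under-calling a tool you consider mandatory, one option is to force the call explicitly rather than relying on prompt wording alone. The sketch below builds a request dict using the OpenAI Chat Completions `tools`/`tool_choice` shapes; the `check_fact` tool itself is a made-up example, and you should verify the exact schema against the current API reference before relying on it.

```python
# Sketch: forcing a fact-checking tool call via `tool_choice`.
# `check_fact` is a hypothetical tool defined for illustration; the
# tools/tool_choice structure follows the OpenAI Chat Completions API.

fact_check_tool = {
    "type": "function",
    "function": {
        "name": "check_fact",
        "description": "Verify a factual claim against a trusted source "
                       "before including it in the answer.",
        "parameters": {
            "type": "object",
            "properties": {"claim": {"type": "string"}},
            "required": ["claim"],
        },
    },
}

request = {
    "model": "gpt-5.4",
    "messages": [
        {"role": "user", "content": "When was the Eiffel Tower built?"}
    ],
    "tools": [fact_check_tool],
    # Naming the function forces this specific tool to be called;
    # "required" (instead of a named function) would force *some* tool call
    # while leaving the choice of tool to the model.
    "tool_choice": {"type": "function", "function": {"name": "check_fact"}},
}
```

Forcing a tool on every turn costs latency, so most teams reserve it for the specific steps (fact-checking, final validation) where a skipped call is unacceptable.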
What to actually do with GPT-5.4 in 2026:
- Use it for agentic workflows that need tool selection discipline. This is the genuine improvement.
- Use computer use only for tasks you can verify cheaply. Treat the agent as a trainee whose work is checked, not as autonomous staff.
- Benchmark against Claude Opus 4.6 and Gemini 2.5 Pro on your task. The leaderboards trade places monthly; your task is the only benchmark that matters.
- Track OpenAI status and model deprecation announcements. GPT-5.4 will be deprecated like every model before it; budget the migration cost up front.
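The "verify cheaply" advice amounts to a retry loop: run the agent step, run a cheap deterministic check, and retry a bounded number of times before escalating to a human. A minimal sketch, where `run_step` and `check` are placeholders for your own agent call and verifier (e.g. "is the form field now populated?"):

```python
# Verify-then-retry wrapper for a computer-use step: the agent is a
# trainee whose work is checked, not autonomous staff.

def run_with_verification(run_step, check, max_attempts=3):
    """Run an agent step until a cheap check passes or attempts run out.

    Returns (result, attempts_used); raises if every attempt fails
    verification, at which point a human should take over.
    """
    for attempt in range(1, max_attempts + 1):
        result = run_step(attempt)
        if check(result):
            return result, attempt
    raise RuntimeError(f"step failed verification after {max_attempts} attempts")

# Demo with a stub that succeeds on the second try.
def flaky_step(attempt):
    return "ok" if attempt >= 2 else "partial"

result, used = run_with_verification(flaky_step, lambda r: r == "ok")
# result == "ok", used == 2
```

The point of the pattern is that the check is deterministic and cheap (a DOM query, a file diff, a regex on extracted data), so the expensive model call is the only unreliable component.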
The honest framing: GPT-5.4 is a meaningful step on agentic capability and a modest one on raw reasoning. It is neither the AGI inflection some launch posts implied nor the incremental disappointment cynics expected. Pick it for the specific things it does well, benchmark on your actual task, and don't trust the demo videos.
What Is GPT-5.4?
GPT-5.4 is OpenAI's latest flagship model, released on March 5, 2026. It is the first general-purpose model with native computer-use capabilities, meaning it can see your screen, move the mouse, type on the keyboard, and execute multi-step workflows, all without third-party plugins.
The model is available in three surfaces: ChatGPT (as GPT-5.4 Thinking), the API (model ID gpt-5.4), and Codex. A higher-capacity variant, GPT-5.4 Pro, targets the most demanding professional tasks. Codex users get access to up to 1 million context tokens, the largest window OpenAI has shipped to date.
GPT-5.2 Thinking will stay accessible under Legacy Models in ChatGPT until June 5, 2026, giving teams a three-month migration window.
Key Improvements Over GPT-5.3-Codex
Knowledge work
GPT-5.4 scores 83.0% on GDPval, up from 70.9% for both GPT-5.3-Codex and GPT-5.2, a 12-point absolute gain. It also produces 33% fewer false claims compared to GPT-5.2, and hits 87.3% on IB financial-modeling tasks.
Computer use
The standout feature is native computer use. GPT-5.4 achieves 75.0% on OSWorld, surpassing the human baseline of 72.4%. The model sees screenshots at up to 10.24 million pixels (original image detail) and controls the keyboard and mouse directly.
Tool use & browsing
A new tool search feature reduces token consumption by 47% when working across 36 MCP servers. BrowseComp jumps to 82.7% (up from 77.3%), and MCP Atlas reaches 67.2%. The Toolathlon benchmark rises from 51.9% to 54.6%.
Coding
GPT-5.4 matches GPT-5.3-Codex on SWE-Bench Pro (57.7% vs 56.8%) and adds a /fast mode with 1.5× token velocity, plus a new Playwright Interactive skill for browser-based testing.
Steerability
GPT-5.4 introduces mid-response adjustment: you can steer the model's behavior while it is still generating. It also adds an automatic preamble for complex queries that outlines the reasoning plan before diving in.
Benchmark Comparison Table
| Benchmark | GPT-5.4 | GPT-5.3-Codex | GPT-5.2 | Claude Opus 4.6* |
|---|---|---|---|---|
| GDPval (knowledge work) | 83.0% | 70.9% | 70.9% | — |
| SWE-Bench Pro (coding) | 57.7% | 56.8% | 55.6% | — |
| OSWorld (computer use) | 75.0% | 74.0% | 47.3% | ~65% |
| BrowseComp (web search) | 82.7% | 77.3% | 65.8% | — |
| Toolathlon (tool use) | 54.6% | 51.9% | 46.3% | — |
| MMMU Pro (vision) | 81.2% | — | 79.5% | — |
| ARC-AGI-2 (abstract reasoning) | 73.3% | — | 52.9% | — |
| GPQA Diamond (science) | 92.8% | 92.6% | 92.4% | — |
| Humanity's Last Exam (w/ tools) | 52.1% | — | 45.5% | — |
| FrontierMath Tier 4 | 27.1% | — | 18.8% | — |
* Claude Opus 4.6 figures are approximate third-party estimates where available.
GPT-5.4 Pro pushes the ceiling further: BrowseComp 89.3%, ARC-AGI-2 83.3%, Humanity's Last Exam 58.7%, FrontierMath Tier 4 38.0%.
Pricing & Availability
| Model | Input | Cached Input | Output |
|---|---|---|---|
| gpt-5.4 | $2.50 / M tokens | $0.25 / M tokens | $15.00 / M tokens |
| gpt-5.4-pro | $30.00 / M tokens | — | $180.00 / M tokens |
| gpt-5.2 (reference) | $1.75 / M tokens | $0.175 / M tokens | $14.00 / M tokens |
GPT-5.4 is available to ChatGPT Plus, Team, and Pro subscribers. API access is open to all tiers. The cached-input price ($0.25/M) makes long-context and agentic workloads remarkably affordable: ten times cheaper than full-price input tokens.
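The pricing table above is easy to turn into a per-call estimator. A minimal sketch (the prices are the ones listed above; the 80% cache-hit figure in the example is an illustrative assumption, not a measured workload):

```python
# Per-call cost estimator from the pricing table (USD per million tokens).
PRICES = {
    "gpt-5.4": {"input": 2.50, "cached_input": 0.25,  "output": 15.00},
    "gpt-5.2": {"input": 1.75, "cached_input": 0.175, "output": 14.00},
}

def call_cost(model, input_tokens, output_tokens, cached_fraction=0.0):
    """Cost in USD; cached_fraction is the share of input served from cache."""
    p = PRICES[model]
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return (fresh * p["input"]
            + cached * p["cached_input"]
            + output_tokens * p["output"]) / 1_000_000

# Example: a 100k-token agentic context, 80% cache hits, 2k output tokens.
cost = call_cost("gpt-5.4", 100_000, 2_000, cached_fraction=0.8)
# → $0.10 per call
```

Note how the output tokens dominate here ($0.03 of the $0.10) even at 2k tokens, which is why verbose agents get expensive faster than long contexts do.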
Computer Use: A significant shift
Computer use in GPT-5.4 is not a plugin; it is a native capability baked into the model. It processes raw screenshots, identifies UI elements, and emits keyboard and mouse actions in a single inference pass.
On the OSWorld benchmark, which tests real desktop tasks like filling spreadsheets, navigating file managers, and using web apps, GPT-5.4 reaches 75.0%, above the human baseline of 72.4%. This is a massive leap from GPT-5.2's 47.3%.
The original image-detail level supports screenshots up to 10.24 million pixels, giving the model enough resolution to read small UI text and interact with dense interfaces. For developers, this opens up a new class of automation: testing desktop apps, filling government forms, migrating data across legacy systems, tasks that previously required brittle RPA scripts.
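The 10.24-megapixel ceiling is worth checking before you capture and send screenshots, since multi-monitor captures easily exceed it. A quick sketch of the downscale arithmetic (the pixel limit comes from the text above; the helper function itself is our own):

```python
import math

MAX_PIXELS = 10_240_000  # 10.24 MP ceiling cited above for original detail

def fit_within_budget(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height) scaled down to fit the pixel budget,
    preserving aspect ratio; unchanged if already within budget."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / pixels)
    return int(width * scale), int(height * scale)

# A single 4K screenshot (3840x2160 ≈ 8.3 MP) fits as-is.
# A dual-4K capture (7680x2160 ≈ 16.6 MP) must be downscaled first,
# at the cost of small UI text becoming harder to read.
```

When a capture has to be downscaled, cropping to the active window is usually the better trade than shrinking the whole desktop, since it preserves full resolution on the UI the agent actually needs.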
Tool Search & Efficiency
As agent architectures grow, so does the number of tools a model has to consider. OpenAI's new tool search feature allows GPT-5.4 to query a registry of tools instead of loading all definitions into the prompt.
The result: a 47% reduction in tokens when operating with 36 MCP (Model Context Protocol) servers. Fewer tokens means faster responses and lower costs, especially in production pipelines that chain multiple tools.
The MCP Atlas benchmark, which measures a model's ability to discover, select, and call the right tool from a large registry, improves from roughly 60% to 67.2%. Partners like Zapier confirm the gains: "GPT-5.4 xhigh is the new state of the art for multi-step tool use."
Prompting Tips for GPT-5.4
- Use tool search for large toolsets. If you manage more than 10 tools, define them in an MCP registry and let GPT-5.4 search rather than reading all schemas up front. This cuts token spend significantly.
- Leverage mid-response steering. GPT-5.4 supports real-time adjustments. If the model starts heading in the wrong direction, you can course-correct without re-prompting from scratch.
- Set `image_detail: original` for computer-use tasks. High-resolution screenshots let the model read fine UI elements. Lower detail levels save tokens but may miss small buttons or text.
- Use `/fast` mode for throughput-sensitive coding tasks. The 1.5× token-velocity mode is ideal for batch refactoring or CI/CD-integrated code reviews where latency matters more than reasoning depth.
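Putting the screenshot tip into a request body might look like the sketch below. Be warned that the parameter names here (`"detail": "original"`, model `gpt-5.4`) are taken from this article rather than verified API documentation, so treat the shape as a hypothesis to check against the current API reference, not a copy-paste recipe.

```python
# Hypothetical computer-use request shape combining the tips above.
# The image content-part structure follows the OpenAI chat format;
# "original" as a detail level is as described in this article.

computer_use_request = {
    "model": "gpt-5.4",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Fill the invoice form with the attached data."},
            {"type": "image_url",
             "image_url": {
                 "url": "data:image/png;base64,...",  # screenshot placeholder
                 "detail": "original",  # full resolution for small UI text
             }},
        ],
    }],
}
```

Dropping `detail` to a lower level on intermediate steps, and reserving `original` for steps that read dense UI, is the obvious token-saving lever if the parameter works as described.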
GPT-5.4 vs GPT-5.3-Codex vs Claude Opus 4.6
GPT-5.4 vs GPT-5.3-Codex: GPT-5.4 is a clear upgrade for knowledge work (+12 points on GDPval), browsing (+5 on BrowseComp), and tool use. For pure coding, the gap is narrower (57.7% vs 56.8% on SWE-Bench Pro), but the addition of computer use and tool search makes GPT-5.4 a more versatile agent backbone.
GPT-5.4 vs Claude Opus 4.6: The two models occupy different niches. GPT-5.4 dominates on computer use (75% vs ~65% estimated for Claude on OSWorld) and tool orchestration (BrowseComp 82.7%). Claude Opus 4.6 holds an edge on SWE-Bench Verified (81.4%) and long-form extended thinking tasks. Cursor's internal benchmarks rank GPT-5.4 as the current leader overall, while coding-heavy teams may still prefer Claude for deep refactoring. Harvey reports 91% on BigLaw Bench with GPT-5.4, positioning it as the top choice for legal AI.
In practice, many teams will route different tasks to different models: GPT-5.4 for browsing, tool use, and computer-use agents; Claude Opus 4.6 for complex code and nuanced reasoning.
Should You Upgrade?
| If you… | Recommendation |
|---|---|
| Build agents that use tools or browse the web | Upgrade immediately; tool search and BrowseComp gains are substantial. |
| Need desktop/browser automation | Upgrade; native computer use is unmatched at 75% OSWorld. |
| Run professional knowledge work (finance, law, consulting) | Upgrade; 83% GDPval and 33% fewer hallucinations are a step change. |
| Do mostly coding with Codex | Marginal gain for pure coding. Evaluate /fast mode and Playwright Interactive. |
| Are budget-constrained | The input price rose from $1.75 to $2.50/M tokens, but cached input at $0.25/M and 47% fewer tool tokens can offset this. Run the numbers for your workload. |
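"Run the numbers" for the budget-constrained row is a short calculation. The sketch below uses the prices from the pricing table and the 47% tool-token reduction cited earlier; the workload shape (a tool-schema-heavy prompt, 60% cache hits) is an illustrative assumption, so substitute your own figures.

```python
# Break-even check: does 5.4's tool-token reduction offset its higher
# input price? Prices from the pricing table; workload mix is assumed.

def input_cost(full_price, cached_price, cache_hit, payload, tool_tokens):
    """USD input cost per call for a prompt of payload + tool-definition
    tokens, with cache_hit as the fraction served at the cached price."""
    tokens = payload + tool_tokens
    blended = full_price * (1 - cache_hit) + cached_price * cache_hit
    return tokens * blended / 1_000_000

# Assumed workload: 20k payload tokens + 80k of tool schemas per call,
# 60% cache-hit rate on both models; 5.4's tool search cuts the tool
# tokens by the 47% figure cited above.
c52 = input_cost(1.75, 0.175, 0.6, 20_000, 80_000)
c54 = input_cost(2.50, 0.25,  0.6, 20_000, 80_000 * 0.53)
# c54 comes out lower than c52 for this tool-heavy mix
```

The crossover depends heavily on how much of your prompt is tool schemas: with a payload-dominated prompt the 5.2 pricing still wins, which is exactly why the table says to run the numbers for your workload.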
Bottom Line
GPT-5.4 is the most capable general-purpose model OpenAI has released. Native computer use, tool search, and a 12-point jump on professional knowledge work make it an immediate upgrade for agent builders, enterprise automation, and anyone who chains tools at scale. The coding gap over GPT-5.3-Codex is modest, but every other dimension shows clear, measurable progress. With GPT-5.2 retiring on June 5, now is the time to migrate.
Dorian Laurenceau
Full-Stack Developer & Learning Designer. I spent 4 years as a freelance full-stack developer and 4 years teaching React, JavaScript, HTML/CSS and WordPress to adult learners. Today I design learning paths in web development and AI, grounded in learning science. I founded learn-prompting.fr to make AI practical and accessible, and built the Bluff app to gamify political transparency.
FAQ
Is GPT-5.4 out?
Yes. GPT-5.4 was released on March 5, 2026. It's available in ChatGPT (as GPT-5.4 Thinking), the API (model ID: gpt-5.4), and Codex.
What is new in GPT-5.4?
GPT-5.4 adds native computer-use capabilities, tool search for 47% fewer tokens, 83% on GDPval (professional tasks), 75% on OSWorld (surpassing human 72.4%), and 1M context in Codex.
How much does GPT-5.4 cost?
API pricing: $2.50/M input tokens ($0.25 cached), $15/M output tokens. GPT-5.4 Pro: $30/M input, $180/M output. Available to ChatGPT Plus, Team, and Pro users.
GPT-5.4 vs Claude Opus 4.6: which is better?
GPT-5.4 leads in computer use (75% OSWorld vs Claude's ~65%) and tool use (BrowseComp 82.7%). Claude Opus 4.6 leads on SWE-Bench Verified (81.4%) and extended thinking. Each excels in different areas.
When will GPT-5.2 be retired?
GPT-5.2 Thinking will remain available until June 5, 2026 under Legacy Models in ChatGPT, then be retired.