There is a version of the GPT-5.4 story that reads as a familiar beat: new model, record benchmarks, incremental improvement. OpenAI's own marketing leans into that framing, calling GPT-5.4 its "most capable and efficient frontier model for professional work"[1]. But a more careful reading of the technical choices behind the release suggests something less routine: a deliberate reorientation of OpenAI's flagship model toward a world in which AI doesn't answer questions, it completes jobs.
Released on March 5[1], GPT-5.4 is the first general-purpose OpenAI model to unify reasoning, coding, agentic workflows, and native computer use in a single architecture[3]. That consolidation is the story. Not the benchmark numbers - though those are striking - but the engineering logic that produced them.
Until now, OpenAI's model lineup was a patchwork of specializations. GPT-5.2 handled reasoning. GPT-5.3 Codex owned coding. Computer use lived in separate frameworks layered on top. GPT-5.4 collapses that into one. OpenAI describes it as the first "mainline reasoning model" to incorporate the frontier coding capabilities of GPT-5.3 Codex[1] - a consolidation the company says is intended to simplify model selection, particularly for developers building in Codex.
The practical implication is significant. Enterprises building agentic workflows previously had to stitch together multiple models and manage the handoffs between them. A single model that reasons, codes, and controls software natively removes one category of integration complexity - and reduces the surface area for failure in autonomous pipelines.
GPT-5.4 matched or exceeded industry professionals in 83% of comparisons on GDPval, OpenAI's internal benchmark spanning 44 occupations across the nine highest-revenue U.S. industries[1]. That is up from 70.9% for GPT-5.2[1].
The most technically revealing change in this release isn't the context window or the benchmark scores. It's Tool Search.
Agentic systems typically require every available tool definition to be loaded into the model's context at the start of a session. For complex enterprise deployments - where a model might have access to dozens or hundreds of tools - that approach is expensive and slow. Tool Search changes the mechanism: instead of loading all tool definitions upfront, the model retrieves definitions on demand, as it determines they're needed.
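The pattern is straightforward to sketch. Below is a minimal, hypothetical illustration of the idea in Python - the registry, the `search_tools` function, and all names are invented for illustration, and this is not OpenAI's actual Tool Search API. The point is the shape of the mechanism: the model's context starts with a single lookup capability rather than the full catalog of tool definitions.

```python
# Hypothetical sketch of on-demand tool loading (illustrative names only;
# not OpenAI's actual Tool Search API).

# In a real deployment this registry might hold hundreds of tool definitions.
TOOL_REGISTRY = {
    "create_invoice": {
        "description": "Create an invoice for a customer",
        "parameters": {"customer_id": "string", "amount": "number"},
    },
    "query_crm": {
        "description": "Look up a customer record in the CRM",
        "parameters": {"customer_id": "string"},
    },
}

def search_tools(query: str, limit: int = 3) -> list[dict]:
    """Return only the tool definitions whose descriptions match the query.

    Instead of serializing the entire registry into the prompt upfront,
    the agent calls this when the model decides it needs a capability,
    and only the matching definitions enter the context.
    """
    matches = [
        {"name": name, **spec}
        for name, spec in TOOL_REGISTRY.items()
        if query.lower() in spec["description"].lower()
    ]
    return matches[:limit]
```

In this toy version, matching is a substring check; a production system would use semantic retrieval. Either way, the context cost scales with the tools a task actually touches rather than with the size of the catalog.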
The efficiency gains are substantial. OpenAI's own testing shows a 47% reduction in token consumption on complex multi-tool tasks[1]. One caveat: The Decoder notes that per-token prices increase alongside the efficiency gains, meaning the cost benefit to enterprises will depend on deployment specifics. Still, for organizations running agents at scale, the design choice reveals something more important than the efficiency figure itself: OpenAI is explicitly engineering for deployments with large, heterogeneous tool ecosystems - the kind that exist inside actual enterprises, not in demos.
The performance numbers are genuinely impressive, with some caveats worth naming. On OSWorld-Verified, which evaluates a model's ability to navigate a real desktop environment using screenshots and keyboard and mouse input, GPT-5.4 hit a 75% success rate - above the human performance benchmark of 72.4%, and a significant jump from GPT-5.2's 47.3%[1]. On Mercor's APEX-Agents benchmark, which tests sustained professional performance in investment banking, consulting, and corporate law, GPT-5.4 claimed the top position[3].
On hallucinations, OpenAI reports individual factual claims are 33% less likely to be incorrect compared to GPT-5.2, and that overall responses are 18% less likely to contain errors[1]. One important note: these comparisons are drawn against GPT-5.2 rather than the more recent GPT-5.3, which makes for a softer baseline. The headline numbers should be read accordingly.
The ARC-AGI-2 result is also notable. GPT-5.4 Pro scored 83.3%, up from GPT-5.2 Pro's 54.2% - a 29-point improvement on a benchmark specifically designed to resist memorization and require genuine abstract reasoning[1]. That is a harder number to dismiss.
The API version of GPT-5.4 supports a context window of up to 1 million tokens - the largest OpenAI has shipped to date[1]. For context, GPT-5.3 Codex offered a combined input and output total of 400,000 tokens; the new window is more than double that[4]. For organizations managing large document repositories, codebases, or multi-session workflows, a 1-million-token window isn't a luxury feature. It starts to resemble infrastructure.
For investment banking modeling tasks specifically, GPT-5.4 scored 87.3% on GDPval's spreadsheet evaluation, compared to 68.4% for its predecessor[1]. Human evaluators preferred GPT-5.4's presentation outputs 68% of the time, citing improved aesthetics and visual variety[1]. OpenAI also launched a ChatGPT add-in for Excel targeting enterprise customers alongside the model release[3].
GPT-5.4 Thinking is accompanied by OpenAI's most detailed system card to date[2]. One section deserves particular attention: chain-of-thought monitorability.
AI safety researchers have long flagged a specific risk with reasoning models: if a model can produce an internal chain of thought that doesn't accurately reflect its actual computation, that chain of thought becomes useless as a safety monitoring tool. OpenAI's evaluations of GPT-5.4 Thinking found that CoT controllability - the degree to which a model's reasoning can be manipulated to diverge from its actions - remains near 0.3% even for 10,000-character reasoning chains[2]. The system card's conclusion: the model "lacks the ability to hide its reasoning," suggesting that chain-of-thought monitoring remains a viable safety mechanism at this capability level[2].
GPT-5.4 Thinking is also the first general-purpose OpenAI model to ship with mitigations for "High capability" in cybersecurity - a classification within OpenAI's Preparedness Framework that triggers additional safeguards[2]. The Cyber Range evaluation, which tests the model across 15 real-world attack scenarios, showed a combined pass rate of 73.33%, with four scenarios still failing[2]. OpenAI characterizes this as acceptable given the safeguard layers in place, but the frank acknowledgment of incomplete coverage is notable.
GPT-5.4 launched during one of OpenAI's more complicated weeks. The company's deal with the U.S. Department of Defense triggered a sharp backlash among consumers: U.S. mobile app uninstalls spiked 295% in a single day on February 28, according to Sensor Tower data reported by TechCrunch[5]. OpenAI also faced a public divergence with Anthropic, which declined to partner with the Pentagon on the terms OpenAI accepted - a split that has been characterized as both commercial and ideological[5]. The model release itself has drawn less attention than it might have under calmer circumstances.
That context matters. OpenAI is making its most deliberate push yet into enterprise and government infrastructure at the same moment its consumer brand is under pressure. The two dynamics are not unrelated: a model positioned as professional-grade infrastructure is less exposed to consumer sentiment than a chatbot. The architecture of GPT-5.4 - integrated, efficient, designed for autonomous operation at scale - is consistent with an organization that has decided where its next decade of revenue lies.