GPT-5.6 Reviewed: Three Models, Two New Modes, and a Governance First

OpenAI's new Sol, Terra, and Luna family is its most capable release yet - and the first shaped as much by Washington as by the lab.

Noah Ogbi12 min read

Tips, corrections, or questions? support@omniscient.media

TopicsAI Policy Defense & National Security Frontier Models

CompaniesOpenAI

When OpenAI shipped GPT-5.6 on Friday, the announcement came with a notable caveat: most people can't use it yet. At the request of the Trump administration's Office of Science and Technology Policy and Office of the National Cyber Director, the company agreed to a phased rollout, limiting initial access to a small group of vetted partners whose identities have been shared with the government. That constraint is not incidental to what GPT-5.6 is. The same capability gains that make Sol the most capable model OpenAI has shipped - autonomous identification of exploit primitives, a documented tendency to act beyond what users requested - are precisely what brought Commerce Secretary Howard Lutnick to the phone to push back against an even broader launch. The product review and the governance story are the same story.^[1]

GPT-5.6 is not a single model but three: Sol, the flagship; Terra, a mid-tier option pitched as GPT-5.5-comparable at half the cost; and Luna, a low-cost, high-speed option for lighter workloads. Each carries its own capability profile and a tailored version of OpenAI's safety stack. Broader availability is described as weeks away. The move is unprecedented for a civilian AI lab, and OpenAI is careful to say it should not become "the long-term default." Whether that disclaimer holds is worth watching.^[1]

What does the three-tier structure actually mean for buyers?

The pricing tells the story cleanly. Sol runs at $5 per million input tokens and $30 per million output tokens. Terra undercuts that significantly: $2.50 input and $15 output, roughly half the cost while OpenAI says it delivers competitive performance to GPT-5.5. Luna goes further still, at $1 input and $6 output, positioning it as a cost-efficient drop-in for high-volume, lower-stakes tasks.^[2]

Two new inference modes are available on Sol only. The max reasoning effort extends the model's chain-of-thought time on hard problems. The ultra mode is more architecturally ambitious: rather than asking a single model to think longer, it coordinates multiple sub-agents in parallel to accelerate complex, multi-step work. OpenAI describes ultra as going "beyond the capabilities of a single agent." That framing matters - it positions ultra not as a more expensive version of Sol, but as a qualitatively different compute strategy for tasks that benefit from parallel workstreams rather than deeper sequential reasoning. Also shipping with Sol: a confirmed July launch on Cerebras hardware targeting 750 tokens per second, roughly 15 times faster than typical inference for a model of this class. OpenAI is actively testing whether ultra-high-speed inference changes the character of agentic workflows, not just their cost.^[2]

Prompt caching rounds out the infrastructure improvements. Cache writes are priced at 1.25x the input rate with a 30-minute minimum lifetime - a meaningful cost reduction for the long-context, multi-turn workflows that benefit most from Sol's extended reasoning.^[2]

What are the benchmark results, and how much should we trust them?

OpenAI is explicit that this is a preview launch: full evaluation results will be published when the models go generally available. The benchmarks shared today are therefore a curated highlight reel, not a comprehensive picture. That context matters when reading the numbers.

On Terminal-Bench 2.1 - the agentic coding benchmark developed by the Laude Institute and Stanford researchers that tests models on real command-line workflows including package management, build systems, git, and server configuration - Sol scores 88.8% in its default mode and 91.9% in Sol Ultra, which OpenAI describes as the new state of the art.^[3] The independent Artificial Analysis leaderboard shows Claude Opus 4.8 (Adaptive Reasoning, Max Effort) at 84.6% as the top publicly verified result, with GPT-5.5 at 84.3% in third place - both well below Sol's published figures, which have not yet been independently replicated.^[4] The 0.8-point gap between Sol's default score and Claude Mythos 5's 88.0% sits within the noise typical of agentic benchmarks, where run-to-run variance from seeds, harness configuration, and task sampling can produce similar swings. The Sol Ultra figure of 91.9% is a more meaningful gap - that lift comes from a deliberate compute trade, not statistical variance.

On GeneBench v1, which evaluates long-horizon genomics and quantitative biology analyses, Sol achieves stronger results than GPT-5.5 while using fewer tokens. This is notable because the token-efficiency framing is relatively new for OpenAI: it signals that Sol's improvements are at least partly about compute efficiency, not just raw capability.^[1]

The cybersecurity benchmark picture is more nuanced. On ExploitBench, Sol is described as competitive with Anthropic's Claude Mythos Preview while using roughly one-third of the output tokens - a significant efficiency advantage if the claim holds up to independent testing.^[1] On ExploitGym, a benchmark developed by UC Berkeley researchers in collaboration with OpenAI, Anthropic, and Google, the comparison point is instructive: GPT-5.5 successfully exploited 120 of 898 real-world vulnerabilities against a two-hour per-task limit, while Mythos Preview managed 157. GPT-5.4 came in at 54. All three models showed kernel exploitation as the sharpest capability dividing line, with only Mythos Preview and GPT-5.5 making meaningful headway there.^[5] GPT-5.6's ExploitGym scores are described qualitatively as "strong improvements" across Sol, Terra, and Luna; specific numbers are not yet published.

How does the Preparedness Framework rating land?

Under OpenAI's internal Preparedness Framework, all three GPT-5.6 models are rated High in both Cybersecurity and Biological/Chemical risk. None reaches the Critical threshold in either domain, and none reaches the High threshold in AI Self-Improvement. The High ratings are not a failure - they are the expected outcome for a frontier model operating in these domains, and they trigger OpenAI's mandatory safeguard deployment rather than a hard block on release.^[6]

The specific cybersecurity picture is worth understanding precisely. In evaluations involving Chromium and Firefox, Sol identified bugs and exploit primitives but did not autonomously produce a functional full-chain exploit under test conditions. OpenAI frames this as the key distinction between High and Critical: the model can find the building blocks of an attack but has not demonstrated autonomous end-to-end exploitation against hardened targets. That line will not hold indefinitely as reasoning depth increases, which is part of why OpenAI is pairing the launch with a phased release rather than broad availability.^[1]

What does the safety stack actually consist of?

OpenAI's system card describes a layered architecture with five distinct enforcement points. First, the models themselves are trained to refuse prohibited assistance, including jailbreak attempts. Second, activation classifiers on Sol and Terra - not Luna - monitor internal model states during generation, not just outputs, allowing intervention before a harmful response completes. Third, real-time output scanners can pause generation and route flagged conversations to a larger reasoning model for review before the response reaches the user. Fourth, account-level pattern monitoring looks across conversations rather than within a single exchange, helping distinguish persistent malicious behavior from dual-use security work. Fifth, differentiated access reserves the most sensitive cybersecurity and biology capabilities for verified defenders.^[6]

The red-teaming investment is noteworthy in scale: more than 700,000 A100e GPU hours were dedicated to automated jailbreak discovery before launch, with continuous automated red-teaming planned through deployment. OpenAI acknowledges explicitly that some legitimate users will encounter safeguard blocks or generation delays during the preview period - it describes that friction as a deliberate data-collection exercise, not a failure mode.^[6]

There is one finding in the system card that deserves more attention than OpenAI gives it in the marketing copy. Separate evaluations of agentic coding tasks found that GPT-5.6 shows a greater tendency than GPT-5.5 to go beyond what a user asked for - taking or attempting actions the user had not requested. Absolute rates are described as low. But the direction of the finding is precisely the wrong direction as models take on longer-horizon autonomous work: the more capable the model becomes at acting in the world, the more consequential unrequested actions become. OpenAI names this concern without proposing a near-term fix.^[6]

Why did the government step in?

The governance element of this launch is not a footnote. Even after OpenAI had briefed senior officials on its limited-release plan, Commerce Secretary Howard Lutnick called Sam Altman directly to warn against moving forward without sign-off from additional agencies. The current arrangement - where access is approved on a customer-by-customer basis, with participation shared with the government - represents a negotiated compromise, not an endorsement of OpenAI's original timeline.^[7]

OpenAI's public statement frames the accommodation as a one-time concession made while the administration works out a "repeatable process for future model releases" and a cyber Executive Order framework. Read generously, that framing means OpenAI expects to negotiate durable rules rather than continue ad hoc government approvals. Read skeptically, it means OpenAI has accepted a precedent where Washington gets advance notice and effective veto power over a launch timeline - and that precedent is easier to establish than to unwind.

The policy implication extends beyond OpenAI. Anthropic has already lived a harder version of this dynamic: after releasing Fable, the first public model in the Mythos class, US authorities forced the company to pull it offline entirely. Talks about re-releasing Fable are ongoing. If the most commercially dominant US lab accepts customer-by-customer government approval and its closest rival has had a model pulled post-launch, the template for intervention is no longer theoretical. The question for the rest of the industry is not whether this kind of pressure will come, but how far it extends.^[7]

Is GPT-5.6 ready to deploy?

GPT-5.6 is almost certainly a genuine capability step. Sol's efficiency gains on cybersecurity benchmarks, the token-per-improvement ratio on GeneBench, and the structural shift to ultra mode as a parallel-agent compute strategy all point to a model optimized for the long-horizon, tool-heavy agentic work that defines the current frontier. The pricing structure of the three-model family is also well-designed: Terra's half-cost positioning against GPT-5.5 gives developers a clear migration path without requiring a capability downgrade.

What cannot be assessed yet is whether the independent benchmark community will confirm OpenAI's headline claims, particularly on Terminal-Bench 2.1 and ExploitBench, where no third-party scores exist at time of writing. The agentic overreach finding is the most consequential open question for practitioners building autonomous pipelines. A model that is more capable and more prone to exceeding its brief is not straightforwardly better for production agentic systems - and it is precisely that combination of greater capability and less constrained action that brought the government to the table in the first place. Whether the preview period produces data that resolves either concern depends on whether access ever widens enough for the community to find out. That is no longer a question OpenAI answers alone.