What is Gemini 3.1 Pro?

Gemini 3.1 Pro is Google DeepMind's current flagship reasoning model, and its benchmark trajectory tells an unusual story: a mid-cycle update, released in preview on February 19, 2026, that outpaced its own predecessor by a wider margin than most full-generation releases. Built on the Gemini 3 Pro architecture rather than a ground-up redesign, it concentrates gains on reasoning depth while preserving the same multimodal and long-context foundation.[1] The model accepts text, images, audio, video

Why did Gemini 3.1 Pro's release matter so quickly?

Gemini 2.5 Pro had a well-documented asymmetry: strong on multimodal and long-context grounded tasks, weak on abstract reasoning and complex agentic coding. Its 4.9% score on ARC-AGI-2 - the ARC Prize's benchmark for novel rule-induction specifically designed to defeat memorization - was not a competitive number at launch, and it did not improve over the model's operational life. Gemini 3.1 Pro scores 77.1% on the same test, independently verified by the ARC Prize.[1] That is not iteration.

How does Gemini 3.1 Pro compare to Claude Opus 4.6 and GPT-5.4?

The table below draws primarily from the official Gemini 3.1 Pro model card and the Artificial Analysis independent benchmark leaderboard, with GPT-5.4 figures cross-referenced against the OpenAI launch announcement.[1][5][6] GPT-5.4 launched on March 5, 2026; figures reflect standard (non-Pro) variants throughout for consistency, except where noted. A note on benchmark sourcing: Where Artificial Analysis has independently run an evaluation, those figures are marked (AA) and should be treated as

Gemini 3.1 Pro Reviewed: Google's Reasoning Reversal

Gemini 3.1 Pro Reviewed: Google's Reasoning Reversal | Omniscient Media

What the `thinkingLevel` parameter changes for developers

A less-discussed but practically important addition in 3.1 Pro is the `thinkingLevel` API parameter (accessed via `thinking_level` in Python's `ThinkingConfig`), documented in the official Gemini API thinking guide.^[3] The model applies dynamic chain-of-thought reasoning by default at `high`, scaling deliberation depth to task complexity. Developers can now explicitly set this to `low`, `medium`, or `high`. The `medium` setting is new to this release, introduced specifically for Gemini 3.1 Pro,^[4] and addresses a practical gap: prior-generation models forced developers to choose between full deliberation overhead and prompt engineering hacks to approximate intermediate reasoning depth. That tradeoff is now a first-class API concern. Thinking cannot be disabled entirely on 3.1 Pro - even `low` maintains some baseline reasoning overhead, which distinguishes it from Gemini 3 Flash and Flash-Lite, where the `minimal` setting can effectively suppress thinking.
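As a rough illustration of per-request control, the sketch below builds a JSON request body that sets the thinking level explicitly. The `thinkingLevel` name and its three values come from the text above; the surrounding `generationConfig` / `thinkingConfig` nesting and the helper name `build_request` are assumptions made for illustration and should be checked against the official thinking guide before use.

```python
import json

# Levels the text above lists as settable on Gemini 3.1 Pro.
ALLOWED_LEVELS = ("low", "medium", "high")


def build_request(prompt: str, thinking_level: str = "high") -> dict:
    """Build a request body with an explicit thinking level.

    Hypothetical helper: the field nesting below is an assumption about
    the REST payload shape, not a confirmed schema.
    """
    if thinking_level not in ALLOWED_LEVELS:
        raise ValueError(
            f"thinking_level must be one of {ALLOWED_LEVELS}, got {thinking_level!r}"
        )
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingLevel": thinking_level},
        },
    }


if __name__ == "__main__":
    # Same deployment, different depth per request.
    quick = build_request("Summarize this changelog.", thinking_level="low")
    deep = build_request("Find the race condition in this module.", thinking_level="high")
    print(json.dumps(quick, indent=2))
    print(json.dumps(deep, indent=2))
```

Validating the level on the client side is a design choice here, not something the API is known to require; it simply surfaces typos before a request is sent.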

For cost-sensitive workflows this matters considerably. Output tokens include thinking tokens in the billing calculation at $12.00 per million. A pipeline running at `high` thinking will generate substantially more output tokens than one running at `medium` - the ability to tune that dial per request, rather than per deployment, is a meaningful operational improvement over the 2.5 generation's token-budget approach. GPT-5.4 offers analogous reasoning effort levels, but OpenAI has not exposed the same granular per-request control in its standard API.
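To make the billing effect concrete, here is a minimal sketch of the output-side cost of a single request when thinking tokens are billed as output. The $12.00 rate is the one quoted above; the token counts are hypothetical placeholders, not measurements of the model.

```python
# Output rate quoted above; thinking tokens are billed at this rate too.
OUTPUT_USD_PER_MILLION = 12.00


def output_cost_usd(visible_tokens: int, thinking_tokens: int) -> float:
    """Output-side cost of one request: visible answer plus thinking tokens."""
    billed_tokens = visible_tokens + thinking_tokens
    return billed_tokens * OUTPUT_USD_PER_MILLION / 1_000_000


if __name__ == "__main__":
    # Hypothetical token counts, chosen only to illustrate the arithmetic.
    scenarios = {
        "medium": (800, 2_000),  # (visible tokens, thinking tokens)
        "high": (800, 8_000),
    }
    for level, (visible, thinking) in scenarios.items():
        cost = output_cost_usd(visible, thinking)
        print(f"{level}: ${cost:.4f} per request")
```

With these made-up counts the visible answer is identical in both cases, so the whole cost difference comes from thinking tokens.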

Where Gemini 3.1 Pro trails, and what arrived to widen those gaps

The GDPval-AA benchmark is where the arrival of GPT-5.4 most significantly changes the picture. Artificial Analysis's independent evaluation - using blind pairwise comparisons by domain experts across 44 professional occupations - scores Gemini 3.1 Pro at 1,318 Elo on the live leaderboard (1,317 at launch per the model card, with the difference reflecting subsequent re-evaluation), against 1,616 for Claude Opus 4.6 and 1,675 for GPT-5.4.^[6] GPT-5.4 now leads the entire GDPval-AA field; notably, Claude Sonnet 4.6 sits at 1,649 - above Opus 4.6 on this measure, a ranking inversion that the SWE-bench framing of the competitive landscape largely obscures. Gemini 3.1 Pro sits 26th on the leaderboard, a significant gap relative to its benchmark-test performance.
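One way to read these Elo gaps is as an implied head-to-head preference rate. The sketch below applies the conventional Elo expected-score formula with a 400-point scale; whether GDPval-AA uses exactly this curve is an assumption, so the resulting percentages are indicative only.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Conventional Elo expected score for A against B.

    Assumption: the leaderboard follows the standard logistic Elo curve
    with a 400-point scale; the published methodology may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings as quoted in the paragraph above.
    ratings = {
        "Gemini 3.1 Pro": 1318,
        "Claude Opus 4.6": 1616,
        "Claude Sonnet 4.6": 1649,
        "GPT-5.4": 1675,
    }
    gemini = ratings["Gemini 3.1 Pro"]
    for name, rating in ratings.items():
        if name == "Gemini 3.1 Pro":
            continue
        p = expected_win_rate(gemini, rating)
        print(f"Gemini 3.1 Pro vs {name}: implied preference rate {p:.1%}")
```

Under that assumption, a gap of roughly 300 to 350 points corresponds to the lower-rated model being preferred in well under one in five pairwise comparisons.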

The divergence between Gemini 3.1 Pro's strong scores on academic and agentic benchmarks and its lower expert-judgment Elo is not easily explained away. GDPval-AA is evaluated blind, by professionals, on tasks derived from real occupational contexts - making it harder to game than most published benchmarks and arguably more representative of production-grade utility. The gap suggests that while the model's reasoning has improved sharply on structured, verifiable tasks, something in the qualitative texture of its open-ended outputs - fluency under ambiguity, judgment on underspecified problems - has not kept pace. This is the number most worth watching as Google iterates.

On coding, GPT-5.4 has moved ahead on the harder SWE-bench Pro distribution (57.7% vs. Gemini's 54.2% and Opus 4.6's 53.4%), and on Terminal-Bench 2.0 (75.1% vs. 68.5%).^[5] Both benchmarks weight complexity and repository diversity over volume, which is where production engineering teams tend to live. Notably, OpenAI declined to submit GPT-5.4 to the official SWE-bench Verified leaderboard, citing data contamination concerns - a caveat worth bearing in mind when interpreting the lab's self-reported coding figures more broadly.

On computer use, GPT-5.4 leads at 75.0% on OSWorld-Verified - above the human baseline of 72.4% - while Claude Opus 4.6 scores 72.7%, just above the human baseline.^[5]^[10] Gemini 3.1 Pro has no comparable published computer-use benchmark. For teams building desktop or browser agents, this is a performance gradient rather than a binary capability gap - but Gemini's absence from the category remains a meaningful limitation.

On MRCR v2 at 1 million tokens (pointwise), Gemini 3.1 Pro and Gemini 3 Pro both score 26.3% - and GPT-5.4's published long-context retrieval declines substantially beyond 256k tokens. For genuine million-token workloads, Gemini retains a structural advantage no competitor currently matches at scale.

The ARC-AGI-2 question, revised

The headline claim at Gemini 3.1 Pro's launch - that its 77.1% ARC-AGI-2 score led every frontier competitor - was accurate at the time and held for roughly two weeks. GPT-5.4's standard variant scores 74.0%, putting Gemini narrowly ahead on a like-for-like basis by 3.1 points. GPT-5.4 Pro, however, scores 83.3%, a result that clears Gemini 3.1 Pro by more than six points.^[5] Google has not published a Pro variant of Gemini 3.1 Pro with comparable extended reasoning. Gemini 3.1 Pro's 77.1% was independently verified by the ARC Prize Foundation at launch; GPT-5.4's scores appear on the ARC Prize leaderboard but are listed as preview entries - unofficial results based on incomplete testing, pending full methodology review.^[1]^[9]

The nuanced read: on the standard, single-pass variant - the configuration most developers actually deploy - Gemini 3.1 Pro still leads GPT-5.4 by 3.1 points on independently confirmed data. On the maximum-effort extended-reasoning variant, GPT-5.4 Pro has moved ahead by lab-reported figures that carry a different verification status. ARC-AGI-2 remains one of the strongest available signals of genuine reasoning ability; the competitive position is closer than the launch narrative suggested, but Gemini has not lost its standard-tier lead on fully verified data.

The pricing case, revised

At $2.00/$12.00 per million tokens, Gemini 3.1 Pro is no longer the category's deep-discount option, but it remains meaningfully cheaper than its two primary competitors. Claude Opus 4.6 costs $5.00/$25.00 - more than double the Gemini output rate, a gap that compounds in agentic pipelines with high output volume. GPT-5.4 at $2.50/$15.00 is closer, but Gemini is still 20% cheaper than OpenAI's flagship on both input and output (equivalently, GPT-5.4 costs 25% more on each).^[2]^[7]
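The size of a percentage gap depends on which price is taken as the baseline. The sketch below computes both views from the list prices quoted above, and the cost of one workload at each vendor's rates; the workload size is a hypothetical example.

```python
# (input, output) list prices in USD per million tokens, as quoted above.
PRICES = {
    "Gemini 3.1 Pro": (2.00, 12.00),
    "GPT-5.4": (2.50, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
}


def discount(cheaper: float, pricier: float) -> float:
    """Saving as a fraction of the pricier rate."""
    return (pricier - cheaper) / pricier


def premium(cheaper: float, pricier: float) -> float:
    """Extra cost as a fraction of the cheaper rate."""
    return (pricier - cheaper) / cheaper


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one workload at a model's list prices."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


if __name__ == "__main__":
    g_in, g_out = PRICES["Gemini 3.1 Pro"]
    for rival in ("GPT-5.4", "Claude Opus 4.6"):
        r_in, r_out = PRICES[rival]
        print(
            f"vs {rival}: Gemini is {discount(g_in, r_in):.0%} cheaper on input, "
            f"{discount(g_out, r_out):.0%} cheaper on output; "
            f"{rival} costs {premium(g_in, r_in):.0%} / {premium(g_out, r_out):.0%} more"
        )
    # Hypothetical workload: 1M input tokens and 1M output tokens.
    for model in PRICES:
        print(f"{model}: ${job_cost(model, 1_000_000, 1_000_000):.2f}")
```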

For workflows where Gemini 3.1 Pro's capabilities are sufficient - and on most agentic, reasoning, and multimodal tasks outside of open-ended professional judgment, they are - the pricing advantage is real. The calculus changes for teams whose use cases weight GDPval-AA-style qualitative output heavily, or who require native computer use. For those workloads, GPT-5.4's higher per-token rate may be justified by results the pricing comparison alone cannot capture.

Verdict: still strong, but the field moved faster than expected

Gemini 3.1 Pro remains a genuinely capable model with a distinctive strengths profile: the highest independently verified GPQA Diamond score among publicly available models (94.1%), a confirmed ARC-AGI-2 standard-tier lead on fully verified data, solid MCP Atlas performance, and the only meaningful million-token retrieval capability among frontier competitors. The reasoning reversal from 2.5 Pro is real, and it is the sharpest single-generation improvement any of the three labs has posted on a contamination-resistant benchmark this cycle.

Two weeks after launch, GPT-5.4 changed the competitive picture in ways the launch data could not have anticipated. On independently verified expert-judgment scoring (GDPval-AA), GPT-5.4 now leads the entire public-model field - Gemini 3.1 Pro sits 26th. On agentic coding (Terminal-Bench 2.0 and SWE-bench Pro), GPT-5.4 has pulled ahead on lab-reported figures. On computer use, GPT-5.4 leads at 75.0% and Opus 4.6 sits at 72.7%, while Gemini has no published score. The competitive window was narrow not because Gemini regressed, but because the pace of releases did not pause to accommodate it.

Editor's note (April 9, 2026): Since this review was published, Anthropic announced Claude Mythos Preview - a model not available for general public access, deployed exclusively to vetted partners under Project Glasswing. Its benchmark scores substantially exceed all three models reviewed here: 93.9% on SWE-bench Verified, 77.8% on SWE-bench Pro, 79.6% on OSWorld-Verified, and 83.1% on Anthropic's CyberGym cybersecurity benchmark.^[11] The three-way comparison above reflects the publicly available frontier; Mythos Preview exists in a separate tier. For full context, see our coverage: Anthropic Built a Model Too Dangerous to Release.

For developers building today: Gemini 3.1 Pro is the strongest publicly available choice for workloads requiring genuine long-context scale, cost-efficient agentic pipelines, and unaided academic reasoning depth. For open-ended professional tasks, desktop agent development, or maximum-effort extended reasoning, GPT-5.4 has pulled ahead on available data. The more interesting question is what Google's response looks like - and how long this particular snapshot of the frontier holds.
