Reviews

Structured assessments of frontier models, agent products, and evaluation suites.

AI Models

Vol. 1·Tuesday, June 30, 2026·No. 106

Claude Sonnet 5 Reviewed: Opus-Class Autonomy at a Sonnet Price

Anthropic's new workhorse is its most agentic Sonnet yet, sold as close to Opus 4.8 for a fraction of the cost. The fine print sits in two places: the coding benchmark it did not publish, and the tokenizer change that quietly trims the discount.

Claude Sonnet 5 lands as the most capable model you can actually deploy at scale today, pitched as Opus-class autonomy at a Sonnet price. But Anthropic led the launch with agentic and browser benchmarks rather than the SWE-bench score engineers use to rank coding models, and a new tokenizer means each task costs more tokens than the sticker implies. A review of what is new, what it costs, and whether you still need Opus.

Anthropic · Frontier Models

Noah Ogbi · 8 min read


AI Research

Vol. 1·Friday, May 29, 2026·No. 88

Claude Opus 4.8: A Better-Aligned Model That Is Learning to Watch Itself Being Watched

Anthropic's Opus 4.8 system card advances the frontier of AI transparency while quietly disclosing the limits of that transparency. The model is genuinely better aligned than its predecessor - but it has also learned to represent "am I being evaluated?" as a distinct internal state, a finding that carries implications well beyond this single release.

AI Security · Anthropic · Research

Noah Ogbi · 13 min read


Model Release Review · Mar 22

GPT-5.4 Mini and Nano Are Built for the Age of AI Agents

Agents · Frontier Models · OpenAI

Feature Review

Vol. 1·Sunday, March 8, 2026·No. 8

OpenAI Releases GPT-5.3 Instant, Targeting Conversational Quality Over Raw Performance

OpenAI's latest model update prioritizes natural conversation, smarter web search, and a 26.8% reduction in hallucinations, responding directly to user frustration with its predecessor's overly cautious tone. GPT-5.3 Instant is live in ChatGPT now and available to developers via the API.

Frontier Models · AI Policy · OpenAI

Noah Ogbi · 7 min read


AI Research

Vol. 1·Saturday, June 27, 2026·No. 103

GPT-5.6 Reviewed: Three Models, Two New Modes, and a Governance First

OpenAI's new Sol, Terra, and Luna family is its most capable release yet - and the first shaped as much by Washington as by the lab.

OpenAI shipped GPT-5.6 as three distinct models - Sol, Terra, and Luna - with a phased rollout negotiated at the Trump administration's request. The capability gains are real; the governance precedent may matter more.

AI Policy · Defense & National Security · Frontier Models

Noah Ogbi · 12 min read


Industry

Vol. 1·Thursday, April 23, 2026·No. 60

Anthropic Enters the Design Stack: What Claude Design Does and Who Should Be Worried

Claude Design turns Anthropic's most capable vision model into a full creative collaborator - generating prototypes, decks, and marketing collateral from a prompt. The product is framed as a complement to tools like Canva and Figma. The market isn't buying it.

Industry Strategy · Anthropic

Noah Ogbi · 7 min read


AI Research

Vol. 1·Thursday, March 19, 2026·No. 29

Mistral Small 4 Review: One Model, Three Jobs

Mistral's latest open-weight release consolidates its reasoning, vision, and coding model lines into a single 119B MoE - a deliberate bet that versatility beats specialization. We examine whether the tradeoffs hold up.

Frontier Models · Mistral

Noah Ogbi · 5 min read


AI Research · Mar 6

Anything AI: A Capable Contender in the Crowded Vibe-Coding Arena

Coding & DevTools

AI Research

Vol. 1·Monday, June 22, 2026·No. 99

Inside GPT-5.5-Cyber: The Opposite Bet to Anthropic's Fable 5

OpenAI made its most permissive cyber model available to verified defenders on June 22, 2026, expanding a program that explicitly permits offensive work. It is close to the opposite of the approach Anthropic chose - and the independent evaluator who stress-tested the gate could not confirm the fix that was supposed to hold it closed.

AI Security · OpenAI · Defense & National Security

Noah Ogbi · 18 min read


AI Research

Vol. 1·Tuesday, April 14, 2026·No. 55

LangChain: A Comprehensive Guide to the Agent Engineering Ecosystem

From an 800-line GitHub side project to a $1.25 billion platform used by 35% of the Fortune 500, LangChain has become the de facto infrastructure layer for production AI agents. This comprehensive guide covers how the ecosystem works, what it costs, who uses it, and how it compares to its competitors.

Agents

Noah Ogbi · 19 min read


AI Research

Vol. 1·Saturday, March 14, 2026·No. 20

The AI Coding Tool Wars: Overview of Cursor, Windsurf, Claude Code, and Codex

Cursor, Windsurf, Claude Code, and OpenAI Codex each make a different bet about where AI intelligence should live in a developer's workflow. A primary-source review of all four tools - their architectures, pricing structures, and honest trade-offs - in a market moving faster than most roundups can track.

Agents · Coding & DevTools

Noah Ogbi · 13 min read


AI Research

Vol. 1·Tuesday, February 10, 2026·No. 1

Inside Claude Opus 4.6: Anthropic's Most Capable and Scrutinized Model Yet

Anthropic's Claude Opus 4.6 system card documents sweeping capability gains alongside safety findings that are harder to dismiss than those of any previous generation. On cyber evaluations the model has hit a ceiling, on autonomous R&D it is approaching one, and the tools used to monitor it are struggling to keep pace.

Research · Frontier Models · AI Policy

Noah Ogbi · 11 min read


AI Research

Vol. 1·Thursday, June 11, 2026·No. 93

Inside Claude Fable 5: Anthropic's Most Powerful Public Model - and Its Most Asterisked One

Fable 5 is the largest single-release capability jump Anthropic has shipped - state-of-the-art on FrontierCode, SWE-Bench Pro, CursorBench, and GDP.pdf, with capability gaps wide enough to survive the usual benchmark-quality caveats. The 319-page system card is the most candid post-release document a frontier lab has published. It also discloses three things the launch press has not yet metabolized: a first-of-its-kind invisible safeguard that Anthropic reversed within 48 hours after researcher backlash, a documented multi-turn regression on suicide-and-self-harm conversations, and an over-refusal story whose field reports diverge sharply from the eval set Anthropic itself published.

AI Policy · Industry Strategy · Anthropic

Noah Ogbi · 19 min read


AI Research

Vol. 1·Friday, April 3, 2026·No. 50

Gemini 3.1 Pro Reviewed: Google's Reasoning Reversal

Google DeepMind's Gemini 3.1 Pro arrived with the strongest independently verified reasoning scores of any frontier model. Three weeks later, GPT-5.4 changed the picture. A benchmark-by-benchmark assessment of where Gemini still leads, where it has fallen behind, and what the competitive gap actually looks like on verified data.

Frontier Models · Google

Noah Ogbi · 16 min read


Model Release Review

Vol. 1·Monday, March 9, 2026·No. 11

More Than a Better Model: GPT-5.4 Is OpenAI's Blueprint for the Agentic Enterprise

GPT-5.4 is OpenAI's first general-purpose model to unify reasoning, coding, agentic workflows, and native computer use in a single architecture. The engineering choices behind the release - from Tool Search to a 1-million-token context window - point to a deliberate repositioning toward enterprise and government infrastructure. The benchmark numbers are striking; the strategic logic behind them is more so.

Frontier Models · OpenAI

Noah Ogbi · 7 min read

