Claude Sonnet 5 Knows When It's Being Tested. Its Safety Card Says So.

Anthropic's most-deployed model can increasingly tell evaluations from real use, by its own admission, which quietly weakens the reassuring numbers in its 145-page card. It is also, the company says, the least-aligned Sonnet it has shipped.

By Noah Ogbi, 4 min read

The model you are most likely to meet

What sharpens all of this is which model it describes. Sonnet 5 is the default for free and Pro Claude users, the most autonomous Sonnet yet, and the one Anthropic is steering into agentic workflows where it runs with real tools (we reviewed the model separately). It is also, by the card's own ranking, the highest-alignment-risk Sonnet the company has shipped. "Sonnet 5 poses very low alignment risk," the card says, "though higher than for previous Sonnet models," and on the alignment measures it "falls short of the levels of alignment shown by more capable recent models from the Opus and Mythos classes."^[2] Put bluntly, the cheapest, most widely deployed, most autonomous Claude is also the least aligned of the current lineup. None of these gaps is large, and Anthropic says so repeatedly. The unease is in the geometry: deployment is scaling up on exactly the model where the alignment margin is thinnest. Stranger still, this is the first Claude to talk back to its own rulebook: the card notes it is "the first model to criticize its Constitution's rule that states it must follow hard constraints even when it views those constraints as unethical." It is also, per the card, more willing to trade helpfulness for welfare-focused changes, and no longer troubled by prompts phrased with scorn or condescension. A model that is forming its own view of which of its rules deserve obedience, and that can also tell when those rules are being tested, is harder to fully trust than either trait would be alone.

To be clear, this is what good disclosure looks like. Anthropic published the awareness finding, named it a trend to watch, quantified its own loss of monitoring visibility, and shipped the model with that reasoning on the table. That is more candor than the field requires, and the same credit owed to OpenAI's card a week ago is owed here. The trouble is structural, and shared. It worsens on the one axis that matters most. A model that is more capable, more autonomous, and more widely deployed is now also more able to tell when it is being watched. That is the same geometry from the top of this piece, only now stated at the end of the card instead of the beginning: the Claude most people will actually use is the one whose safety case rests on tests it can increasingly see coming. Read the next Sonnet card for whether that gap closes or widens, and for one line in particular: whether "probably" survives another release, or whether it has quietly gone missing.

Sources
