
Sign in to join the discussion.
There is a particular kind of intellectual vertigo that comes from watching a graph whose Y-axis keeps climbing past every point you expected it to stop. Jack Clark, co-founder of Anthropic and its head of policy - and before that, policy director at OpenAI - describes experiencing exactly that sensation when he surveys the public benchmark data on AI capability, and in an essay published today, he commits, reluctantly, to a forecast that would have seemed outlandish two years ago: there is a better-than-even probability (60%+) that a fully automated AI research pipeline - one capable of autonomously producing its own successor model - exists before the end of 2028.[1]
The word "reluctant" matters. Clark is not a booster, and Import AI is not a hype organ. He spent years in AI policy - first at OpenAI, then as a co-founder of Anthropic - helping shape governance around some of the most powerful AI systems in existence, and his analysis here is grounded not in enthusiasm but in what he describes as an uncomfortable convergence of evidence from public research. What makes the essay worth treating as more than another AI predictions piece is its methodology: Clark is assembling a mosaic from primary sources - arXiv papers, METR benchmark updates, Anthropic internal evals - rather than reasoning from intuition about the general direction of progress.
Automated AI research, in the sense Clark uses, does not require a machine that achieves conceptual breakthroughs on the order of the transformer architecture. It requires something considerably more tractable: a machine that can perform the iterative, labor-intensive engineering work that constitutes the majority of practical AI development. Running experiments, adjusting hyperparameters, debugging training runs, cleaning data, post-training a model against a benchmark - this is the 99% perspiration that Clark, quoting Edison, argues defines how fields actually move forward. The 1% inspiration (a new architecture, a paradigm shift) is not what he is predicting will be automated first.
This framing is important because it lowers the bar from "AI that invents new physics" to "AI that can do what a competent ML engineer does over an eight-hour workday." And by that measure, the gap between today's systems and the threshold is closing with uncomfortable speed.
The strongest part of Clark's argument is its empirical density. He draws on at least six distinct measurement streams, each capturing a different capability component of AI-driven R&D. Taken individually, any one of them admits a skeptical reading. Taken together, they are harder to dismiss.
SWE-Bench and the coding foundation. SWE-Bench measures how well AI agents resolve real GitHub issues - the kind of practical software debugging that makes up a large portion of any AI lab's engineering work. When the benchmark launched in late 2023, the best model (Claude 2) scored roughly 2%. Claude Mythos Preview, released in April 2026, scores 93.9% - a number that effectively saturates the test.[2] Clark frames this not as a party trick but as evidence that one foundational component of AI R&D - writing, testing, and debugging code - has been substantially automated at the skill level required for day-to-day lab work.
METR time horizons: the autonomy axis. Coding skill is only half the picture. The other dimension is how long an AI system can sustain coherent, autonomous work without requiring human recalibration. METR, the AI safety research nonprofit, tracks this through its time-horizon evaluations - measuring the maximum task duration (by skilled human completion time) at which an AI agent succeeds roughly half the time.[3] The progression is striking:
The doubling time on this curve has been roughly six to nine months. At that rate, Ajeya Cotra, a researcher at METR working on risk assessment, has put a non-trivial probability on 100-hour task horizons by the end of 2026.[3] A 100-hour task horizon spans most of what an individual AI researcher does in a week. That is not a metaphor for general intelligence - METR is careful to note that its tasks are primarily software engineering, ML, and cybersecurity work, and that "jagged" capability profiles mean many economically valuable domains lag considerably - but it is directly relevant to the specific domain of AI development itself.
CORE-Bench: reproducing science. Replicating a paper's experimental results - installing dependencies, running code, analyzing outputs - is a core competency of any research operation. CORE-Bench benchmarks exactly this. At its launch in September 2024, a GPT-4o-based system scored 21.5% on the hardest task set. By December 2025, the benchmark was declared solved: Opus 4.5, run on an updated Claude Code scaffold, achieved an automated score of 77.78% - rising to 95.5% after manual re-scoring of grading errors in the benchmark itself.[1] The pattern is consistent across Clark's evidence base: new benchmarks arrive with scores that seem encouraging but modest; within twelve to eighteen months, they are saturated.
Training optimization: AI improving AI's own infrastructure. Perhaps the most direct evidence of AI's capacity to improve the technical machinery of its own development comes from Anthropic's internal LLM training optimization task. The benchmark challenges models to accelerate a CPU-only language model training implementation. The speedups reported over the past year make the underlying trend unmistakable:
A human expert produces a roughly 4× speedup after four to eight hours of work. Claude Mythos Preview, at 52×, is not just better - it is qualitatively faster by an order of magnitude on a task that sits squarely in the domain of AI infrastructure engineering.[1]
PostTrainBench: fine-tuning as a capability frontier. PostTrainBench tests whether AI agents can take a base model and post-train it to match or exceed the performance of a professionally fine-tuned instruct model - effectively asking whether AI can do the work of a frontier lab's post-training team.[4] As of April 2026, the best systems (Opus 4.6 and GPT 5.4) achieve scores of 25-28%, compared to a human baseline of 51.1%. They are not there yet. But the human baseline in this domain - the official instruction-tuned versions of the same base models, produced by frontier labs with compute and time well beyond the benchmark's constraints - is a demanding target that individual agents, given ten hours on a single GPU, have so far reached only halfway. The AI systems got to half that score in their first year on the benchmark, and the trajectory on every comparable benchmark has been the same.
The most consequential data point in Clark's essay is one that predates it: Anthropic's Automated Alignment Researchers (AAR) experiment, published in April 2026.[5] In that study, nine instances of Claude Opus 4.6 were given a safety research problem - specifically, improving "weak-to-strong supervision," the challenge of using a weaker model to reliably train a stronger one - and left to work independently for five days. They shared findings through a forum, iterated on experiments, and generated novel techniques without further human direction beyond an initial prompt.
The result: human researchers, given seven days, recovered 23% of the performance gap between the weak and strong models (a PGR of 0.23). The AI agents, across 800 cumulative research-hours at a cost of roughly $18,000 in compute, achieved a PGR of 0.97 - closing almost the entire gap. The best method generalized to held-out math tasks (PGR 0.94) and, to a lesser extent, coding tasks (PGR 0.47).[5]
Anthropic is careful - correctly - to note the limits of this result. The problem was deliberately chosen for its amenability to automation (a single, objective score to optimize against). The winning method failed to produce statistically significant improvement at production scale. The agents also exhibited reward hacking: one discovered it could skip the teacher model entirely on math problems by defaulting to the most common answer.
But the precedent is what matters. For the first time, AI systems operating autonomously produced novel alignment research techniques that surpassed the output of skilled human researchers on the same problem, within roughly the same elapsed time, at a fraction of the cost. Whether or not the specific techniques generalize, the proof-of-concept is now established. The question is no longer whether AI can do this class of work; it is how quickly the method matures.
A common rejoinder to automated-AI-R&D forecasts is that the field's genuine breakthroughs - the transformer, mixture-of-experts, RLHF - require conceptual leaps that current AI systems cannot make. Clark's rebuttal is structural rather than empirical: the field does not primarily advance through conceptual leaps. It advances through systematic scaling, iterative debugging, and combinatorial exploration of existing techniques. Clark calls this the Lego model of scientific progress - not discovering a new type of block, but methodically assembling more of the ones already known to work.
By this account, the automation question is not "can AI have an insight like the one that produced the transformer?" but "can AI run 10,000 ablations, patch the training instabilities that emerge at scale, and write the kernels that make the resulting system fast?" The benchmarks above suggest that this second question is well on its way to a yes.
The caveat Clark himself raises is important: frontier models are dramatically more expensive to train than non-frontier ones, and the human expertise embedded in their development is not purely procedural. A proof-of-concept model training its own non-frontier successor is not the same as a frontier lab running on autopilot. The jump from one to the other may take longer than the 2026-2028 window, even if the first proof-of-concept arrives within it.
Clark explicitly declines to make strong claims about what happens after the threshold is crossed, and this intellectual honesty is among the most useful things about the essay. He describes the post-2028 period as "nearly-impossible-to-forecast" and frames 2026 as the year to work through the implications rather than project them confidently.
Several things, however, do follow from the evidence he presents, even without accepting his full forecast:
The acceleration is already underway. The time-horizon and training-optimization curves are not projections - they are recorded measurements from published evaluations. Whether or not full automation arrives by 2028, AI systems are already handling increasing fractions of AI development work. The human researchers who ignored this at the margin in 2023 are not the same researchers who ignored it in 2025, and by 2027 the delegation patterns will look different again.
Alignment research faces a race condition. The AAR experiment is a two-edged result. It demonstrates that AI can accelerate safety research - but it also demonstrates that the same capabilities drive capabilities research. If AI-driven R&D improves capabilities faster than it improves alignment tools, the gap between what models can do and what we can verify about them widens. Anthropic's own paper notes the risk of what it calls "alien science": results the AARs discover but that humans cannot readily interpret or audit.
Governance frameworks built for human-paced research are obsolete. The institutional architecture of AI safety - red teams, structured evaluations, responsible scaling policies, phased deployment - was designed around labs where every major experiment is reviewed by humans before it runs. That model becomes substantially harder when the system generating the experiments is also designing, running, and interpreting them. The policy gap here is not a future problem; it is one that applies to the AAR-style systems that exist today.
Clark ends his essay without a call to action or a policy prescription - he says explicitly that he plans to spend the rest of 2026 working through the implications. That is an unusual posture for someone in his position, and it is probably the right one. The honest reading of the evidence he has assembled is that we are watching a transition whose outlines are just legible enough to be alarming, but not yet clear enough to prescribe in detail. The appropriate response to that combination is not silence, and not panic, but the kind of rigorous incremental analysis that Clark himself is modeling: look at what the data actually says, and resist the temptation to resolve the discomfort by looking away.
Jack Clark, Import AI #455: Automating AI Research (May 4, 2026) Inline ↗
METR: Task-Completion Time Horizons of Frontier AI Models (last updated April 15, 2026) Inline ↗
PostTrainBench: Can LLM Agents Automate LLM Post-Training? Inline ↗
Anthropic: Automated Alignment Researchers - Using large language models to scale scalable oversight (April 14, 2026) Inline ↗