
Omniscient

AI intelligence briefings, analysis, and commentary — delivered in broadsheet form.

By Noah Ogbi



© 2026 Omniscient Media.


AI Research

Vol. 1·Monday, March 16, 2026

What Does It Mean for AI to Beat Humans at Using a Computer? A Beginner's Guide to OSWorld


The number that surprised everyone, including the researchers

When researchers at the University of Hong Kong's XLANG Lab, together with collaborators at Salesforce Research, Carnegie Mellon University, and the University of Waterloo, published OSWorld in April 2024, they included a baseline measurement: how well an average person performs on the same set of tasks. The answer was 72.36%.[1] At the same time, the best AI model in the world could only complete 12.24% of those tasks.[1] That 60-point gap was the point. OSWorld was designed to be hard - not a toy problem but a genuine reflection of what it takes to sit at a computer and get things done.

Less than two years later, that gap has effectively closed. Simular's Agent S crossed the human baseline in December 2025, reaching 72.6%.[2] Then AGI Inc.'s OS Agent pushed to 76.26% on the original OSWorld leaderboard.[3] And in early March 2026, OpenAI's GPT-5.4 posted a 75.0% score on the updated OSWorld-Verified platform, against a human benchmark of 72.4%.[4] That last figure - GPT-5.4's 75.0% - arrived with the most corporate weight behind it and drew the widest headlines. The number is self-reported by OpenAI and awaits independent verification by the benchmark's maintainers. What it means, though, requires some unpacking.

[Chart: OSWorld top score progression from launch in April 2024 to March 2026. Human baseline: 72.36%.]

What OSWorld actually tests

Most AI benchmarks are, at bottom, question-and-answer tests. They present a model with text - a problem, a passage, a code snippet - and ask for a response. OSWorld is structurally different. An agent is placed inside a real, running computer environment: a live Ubuntu, Windows, or macOS desktop, populated with real software. It then receives an instruction in plain English, something like "resize image" in GIMP, "concatenate two CSV files," or "find the next available flight from Chicago to London on a specific site." The agent must complete the task the same way a human would: by looking at the screen, moving a cursor, clicking buttons, typing, and navigating menus.[1]
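That observe-act loop can be sketched in a few lines. Everything below is a stand-in - the environment driver, the policy call, and the action format are hypothetical, not the benchmark's actual API - but the shape is the same: look at the screen, pick a GUI action, execute it, repeat until done.

```python
# A minimal sketch of the observe-act loop an OSWorld-style agent runs.
# All interfaces here are hypothetical stand-ins, not the benchmark's API.
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str                      # e.g. "click", "type", "done"
    payload: dict = field(default_factory=dict)


def fake_policy(instruction, screenshot, history):
    """Stand-in for a multimodal model call: maps the current screen
    plus the plain-English instruction to the next GUI action."""
    if not history:
        return Action("click", {"x": 120, "y": 240})
    return Action("done")


def run_episode(instruction, take_screenshot, execute, max_steps=15):
    history = []
    for _ in range(max_steps):
        shot = take_screenshot()               # observe the live desktop
        action = fake_policy(instruction, shot, history)
        if action.kind == "done":
            break
        execute(action)                        # click / type / scroll
        history.append(action)
    return history


# Toy drivers so the loop runs without a real desktop.
log = []
steps = run_episode(
    "resize image",
    take_screenshot=lambda: b"\x89PNG...",     # fake screenshot bytes
    execute=log.append,
)
```

The important structural point is that the model only ever sees pixels and emits low-level input events; it gets no privileged access to the application's internals.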

The benchmark covers 369 real-world tasks spread across browsers, productivity software, code editors, email clients, and file systems - the kinds of things knowledge workers actually spend their days doing.[1] Crucially, success is not measured by whether the agent's answer sounds correct. It is measured by execution: did the file actually get renamed? Did the spreadsheet formula produce the right output? Did the email send? This execution-based scoring is what separates OSWorld from benchmarks that an LLM can game by producing plausible-sounding text.
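Execution-based scoring can be shown with a toy task. The file names and checker below are invented for illustration, but the principle matches the one the paper describes: the verdict comes from inspecting the machine's resulting state, not from grading the agent's text.

```python
# Execution-based scoring in miniature: success is judged by the state of
# the file system after the agent acts, not by what the agent says it did.
# Task, file names, and checker are invented for illustration.
import os
import tempfile


def check_rename(workdir):
    """Pass only if report.csv really became final.csv."""
    return (os.path.exists(os.path.join(workdir, "final.csv"))
            and not os.path.exists(os.path.join(workdir, "report.csv")))


with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "report.csv"), "w").close()

    claimed_only = check_rename(d)   # agent said "done" but changed nothing

    os.rename(os.path.join(d, "report.csv"),
              os.path.join(d, "final.csv"))
    actually_done = check_rename(d)  # now the state itself proves the work
```

A plausible-sounding answer scores zero here; only the performed action counts, which is exactly what closes the gaming loophole the article describes.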

Why the "Verified" update matters

The version GPT-5.4 was evaluated on - OSWorld-Verified - is not the same benchmark that launched in 2024. The XLANG Lab spent roughly two months in mid-2025 systematically auditing the original test suite, addressing over 300 reported issues.[5] The problems they found are illuminating: websites had changed their HTML structure since tasks were written, anti-bot measures were blocking agents on tasks that required visiting live sites, some instructions were genuinely ambiguous (does "resize image" mean resize the layer or the canvas?), and some evaluation functions were penalizing correct answers that used an unexpected but valid method.[5]

The infrastructure was overhauled at the same time. The original benchmark ran on individual machines using VMware, making parallel evaluation slow and results hard to reproduce consistently. OSWorld-Verified migrated to AWS, enabling up to 50 simultaneous test environments and compressing evaluation time from over ten hours to minutes.[5] The practical consequence: scores on the Verified platform are more trustworthy than the headline numbers attached to many earlier systems, including some that reported crossing the human baseline before the benchmark itself was reliable enough to make that claim credibly.
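The wall-clock win from parallel environments is simple queueing arithmetic. The sketch below uses a stub task in place of a real cloud VM; the timings and worker count are illustrative, not the benchmark's actual infrastructure.

```python
# Why parallelism collapses evaluation time: run N independent task
# environments concurrently instead of one VM at a time.
# run_task is a stub; real runs drive cloud-hosted desktops.
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id):
    time.sleep(0.05)            # stand-in for minutes of real VM work
    return task_id, True        # (task, success)


tasks = range(20)

start = time.perf_counter()
serial = [run_task(t) for t in tasks]
serial_s = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:   # cf. 50 parallel envs
    parallel = list(pool.map(run_task, tasks))
parallel_s = time.perf_counter() - start

speedup = serial_s / parallel_s   # roughly the worker count, minus overhead
```

Because each OSWorld task runs in its own isolated environment, results are order-independent, which is what makes this trivially parallelizable once the infrastructure supports it.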

From 12% to 75%: what changed

The 60-point improvement from the 2024 best (12.24%) to GPT-5.4's 75.0% is the kind of progress curve that is easy to quote and hard to intuit. A few concrete factors drove it.

First, vision capabilities improved dramatically. Early agents struggled to reliably identify where buttons were on a screen, particularly when UI layouts changed or elements were small. Modern multimodal models can process high-resolution screenshots with significantly better spatial accuracy.[1] Second, computer use became a first-class capability rather than an afterthought layered on top of a language model. GPT-5.4 is the first general-purpose OpenAI model with native computer use built in, rather than relying on separate agentic frameworks that added integration complexity and failure modes.[4] Third, training on human demonstration data played a significant role. The XLANG Lab's own analysis noted that leading models' success rates correlate closely with the availability of human trajectory data for similar tasks - meaning models got better partly by learning from recordings of humans actually doing these things.[5]

The gap that doesn't show in the score

A score of 75% against a human baseline of 72.4% does not mean AI is "better than humans at computers" in any general sense. It means that on this specific set of 369 tasks, in a controlled environment, with no limit on time or attempts, the best AI systems are now completing roughly the same proportion of tasks as an average human participant.

What the score does not capture is speed, reliability under novel conditions, or judgment. Independent research on OSWorld agent efficiency found that large model calls for planning and reflection account for most overall latency, and that agents slow down noticeably on multi-step tasks - each successive step can take three times as long as steps at the beginning of a task.[6] A human doing the same tasks would not exhibit that degradation pattern. And the tasks in OSWorld, while genuine and varied, are still bounded: they have a defined correct answer. Much of what makes human computer work valuable is precisely the open-ended judgment that comes before you know what task to perform.
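As a back-of-envelope model of that degradation pattern - all numbers below are invented for illustration, not taken from the efficiency study - consider per-step latency dominated by a model call that slows as the trajectory lengthens:

```python
# Illustrative latency model: model calls dominate total time, and late
# steps in a long task run roughly 3x slower than early ones.
MODEL_CALL_S = 4.0   # seconds per planning/reflection call (assumed)
UI_EXEC_S = 0.4      # seconds to execute a click/keystroke (assumed)


def step_latency(step, slowdown=3.0, horizon=15):
    # Model-call time grows linearly toward `slowdown`x over the horizon,
    # e.g. as context and reflection overhead accumulate.
    growth = 1 + (slowdown - 1) * min(step / horizon, 1.0)
    return MODEL_CALL_S * growth + UI_EXEC_S


latencies = [step_latency(s) for s in range(15)]
model_share = sum(l - UI_EXEC_S for l in latencies) / sum(latencies)
late_vs_early = latencies[-1] / latencies[0]
```

Under these assumed numbers, model calls account for over 90% of total time and the final step runs close to 3x slower than the first - the qualitative shape the research reports, and one a human worker would not exhibit.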

There is also a benchmark saturation concern worth naming. OSWorld-Verified's maintainers noted in their July 2025 update that leading models' performance "stems primarily from extensive human trajectory data" - implying that some portion of the improvement reflects training on data that looks like the benchmark rather than generalizable computer-use ability.[5] This is not unique to OSWorld; it is a structural feature of any benchmark that becomes widely known.

Why this still matters

None of the above caveats diminish what the 2024-to-2026 trajectory represents. One year before Simular crossed the human baseline, the best agents were stuck around 20%.[2] The improvement is not a statistical artifact. Tasks that genuinely required understanding a screen, navigating real software, and producing verifiable outputs were simply not being completed reliably, and now they largely are.

For people new to following AI, OSWorld is a better window into near-term impact than most benchmarks precisely because it is grounded in tasks that correspond to real economic activity. Scheduling, document editing, data manipulation, web navigation - these are things that organizations pay people to do. A system that can reliably perform them at human-level accuracy on a standardized test is closer to a deployed workforce tool than a system that scores well on abstract reasoning puzzles.

The question OSWorld does not answer - and was not designed to answer - is how quickly that laboratory reliability translates into enterprise deployment. That involves robustness across novel systems, security, cost per task, legal accountability, and the tolerance of organizations for AI errors that cost real money. The benchmark is a starting line, not a finish line. But knowing where the starting line is, and understanding what it actually measures, is the prerequisite for everything that follows.


Sources

  1. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (XLANG Lab, University of Hong Kong, 2024)

  2. Simular's Agent S Outperforms Humans on OSWorld Benchmark (Simular, December 2025)

  3. The World's Most Capable Computer Agent (AGI, Inc.)

  4. OpenAI's GPT-5.4 Sets New Records on Professional Benchmarks (The Next Web, March 5, 2026)

  5. Introducing OSWorld-Verified (XLANG Lab, July 2025)

  6. OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents (arXiv)