ALE Benchmark Humiliates GPT 5.5 and Claude Fable 5: Real AI Stays at or Below 24% • Meteora Web Agency

The artificial intelligence world has been shaken by a surprising upset. A new evaluation tool called Agents' Last Exam, abbreviated ALE, has tested the most advanced language models on complex professional tasks, and the verdict is brutal. Even the best system, OpenAI's GPT 5.5 running on the Codex harness, achieved only a 24% pass rate. Anthropic's Claude Fable 5, released just yesterday, placed third with 22%. These figures suggest that, despite remarkable progress, AI remains far from replacing human professionals in real-world work environments.

What Makes Agents' Last Exam Different

ALE is the result of a collaboration between the Center for Responsible, Decentralized Intelligence at the University of California, Berkeley, and over 300 domain experts from more than 100 institutions. Its goal is to close the gap between artificially high scores on academic benchmarks and real GDP-relevant labor impact. Traditional tests rely on static question answering or narrow text-based terminal environments, which are easy to game. ALE forces models to operate within a Generalist Computer-Use Agent framework, where they must use vision, reasoning, and manipulation to navigate Linux and Windows virtual machines, interacting with professional software such as Siemens NX for 3D modeling, Unreal Engine for virtual scenes, FSLeyes for neuroimaging analysis, and Adobe After Effects for video compositing.

The evaluation system is almost entirely deterministic: only 6.8% of tasks rely on an LLM-as-a-judge. For the rest, the agent's output is compared via code against an expert reference, which rules out tricks like reading hidden answer keys in Git history, a problem recently exposed in other benchmarks such as SWE-Bench Pro. ALE also combats data contamination through a controlled release strategy: only 10% of the dataset is public on GitHub and Hugging Face, while over 1,300 tasks remain private and are rotated periodically.
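To make the idea of code-based grading concrete, here is a minimal Python sketch of what comparing an agent's output against an expert reference could look like. ALE's actual graders are not described in this article, so everything here is an assumption for illustration: the function names `grade_task` and `pass_rate`, the dictionary format, the field names, and the tolerance are all hypothetical.

```python
import math

def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def grade_task(agent_output: dict, expert_reference: dict, rel_tol: float = 1e-3) -> bool:
    """Hypothetical deterministic grader: pass only if every reference field matches.

    Numeric fields are compared within a relative tolerance; all other
    fields must be exactly equal. No model is asked for an opinion.
    """
    for key, expected in expert_reference.items():
        if key not in agent_output:
            return False
        actual = agent_output[key]
        if _is_number(expected):
            if not _is_number(actual):
                return False
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                return False
        elif actual != expected:
            return False
    return True

def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks that passed (0.0 for an empty run)."""
    return sum(results) / len(results) if results else 0.0

# Illustrative reference for a made-up 3D modeling task.
reference = {"volume_mm3": 1250.0, "face_count": 6, "units": "mm"}
good = {"volume_mm3": 1250.4, "face_count": 6, "units": "mm"}
bad = {"volume_mm3": 1300.0, "face_count": 6, "units": "mm"}

results = [grade_task(good, reference), grade_task(bad, reference)]
print(results, pass_rate(results))
```

The point of a check like this is that the same output always gets the same verdict, and the agent never sees the reference, so there is no answer key to find.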

Leaderboard Results and Implications for the Tech World

The ALE leaderboard places GPT 5.5 with Codex first (24% pass rate), followed by Ale Claw, also on GPT 5.5 (23%), Claude Code with Fable 5 (22%), OpenClaw (21.1%), and Cursor CLI (20.4%). On the hardest tier, called Last Exam, almost all models score a devastating 0%. This means that for tasks at the frontier of professional competence, AI is simply inadequate. The data is especially critical for companies betting billions on autonomous agents: without a benchmark like ALE, the risk of overestimating real capabilities is enormous.
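For readers who want to see how tightly packed these results are, the short Python snippet below ranks the five pass rates quoted above and computes the gap between first and fifth place. The scores come from this article; the dictionary layout and variable names are only illustrative.

```python
# Pass rates (in percent) as reported in the article.
leaderboard = {
    "Codex (GPT 5.5)": 24.0,
    "Ale Claw (GPT 5.5)": 23.0,
    "Claude Code (Fable 5)": 22.0,
    "OpenClaw": 21.1,
    "Cursor CLI": 20.4,
}

# Sort from highest to lowest pass rate.
ranked = sorted(leaderboard.items(), key=lambda item: item[1], reverse=True)

# Gap between the best and the fifth-placed entry, in percentage points.
spread = round(ranked[0][1] - ranked[-1][1], 1)

for position, (name, score) in enumerate(ranked, start=1):
    print(f"{position}. {name}: {score:.1f}% passed, {100 - score:.1f}% failed")

print(f"Spread between 1st and 5th: {spread} percentage points")
```

The top five sit within 3.6 percentage points of each other, and even the leader fails roughly three tasks out of four.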

The research also showed that GPT 5.5 excels at following complex multi-part instructions, while Claude Fable 5 tends to forget intermediate steps, a fatal flaw in long-horizon workflows. For developers who want to test their own agents, ALE provides two leaderboards: Full (with proprietary software) and Unlicensed (free tools only), ensuring fair comparisons. This is a significant step forward compared to earlier tests. Capability is only one part of the picture, however: for the security side, see our coverage of the critical Starlette vulnerability that put millions of AI agents at risk.
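As a rough sketch of how a Full versus Unlicensed split could work in practice, the Python example below filters a task list by whether a task needs proprietary software. This is not ALE's real schema: the `Task` class, the `needs_proprietary_software` flag, the placeholder task IDs, and the assumption that the Full board includes every task are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    task_id: str
    needs_proprietary_software: bool  # hypothetical flag, not ALE's schema

# Placeholder tasks for illustration only.
tasks = [
    Task("task-001", needs_proprietary_software=True),
    Task("task-002", needs_proprietary_software=False),
    Task("task-003", needs_proprietary_software=True),
    Task("task-004", needs_proprietary_software=False),
]

def leaderboard_tasks(all_tasks: list[Task], board: str) -> list[Task]:
    """Select the tasks that count toward a given leaderboard.

    Assumption: "full" scores every task, "unlicensed" scores only
    tasks that can be run with free tools.
    """
    if board == "full":
        return list(all_tasks)
    if board == "unlicensed":
        return [t for t in all_tasks if not t.needs_proprietary_software]
    raise ValueError(f"unknown leaderboard: {board!r}")

full = leaderboard_tasks(tasks, "full")
unlicensed = leaderboard_tasks(tasks, "unlicensed")
print(len(full), len(unlicensed))
```

Splitting the boards this way means a team without commercial licenses is scored only on tasks it can actually run, rather than being penalized for software it cannot install.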

For industry professionals, ALE is a reliable compass. If an agent ever manages to pass this exam, it will mean it is ready to join the workforce. Until then, the 24% ceiling of GPT 5.5 is a sobering reality check. Tools like ChatGPT for developers remain extremely useful for debugging and code review, but we should not overestimate their autonomy. The road to truly productive AI is still long and requires rigorous metrics like those provided by ALE.

To learn more about AI benchmarks and evaluation methods, you can visit the Wikipedia page on AI benchmarks.

Source: https://venturebeat.com/technology/surprise-upset-gpt-5-5-beats-claude-fable-5-on-brutal-new-agents-last-exam-benchmark