A new benchmark, Agents' Last Exam (ALE), has just shaken the AI world. OpenAI's GPT-5.5 unexpectedly surpassed Anthropic's brand-new Claude Fable 5, with a pass rate of 24.0% versus 22.0%. However, the most striking finding is that on the hardest difficulty tier, many models score a flat zero.
A test that measures real economic value
ALE is no academic exercise: it evaluates AI agents on complex professional workflows, from 3D modeling in Siemens NX to compositing in After Effects. The results suggest that, despite recent progress, models still struggle to turn their capabilities into tangible productivity. For companies investing billions in AI agents, this is a wake-up call. Even the leader, GPT-5.5, passes only 24.0% of the time, which means it fails 76% of the time.
Mounting regulation and internal tensions
In the background, Anthropic CEO Dario Amodei proposed a shock regulation: treat AI like commercial aviation, with mandatory testing and deployment holds for models above a certain power threshold. This FAA-style approach could freeze frontier model releases. Meanwhile, a lawsuit against xAI alleges that an engineer was fired for raising safety concerns about Grok, days before SpaceX's IPO. Together, these events show that trust in the sector is not guaranteed.
What enterprises must do
For technical decision-makers, the lesson is twofold. First, build multi-model architectures to avoid vendor lock-in, since a model could be withdrawn or blocked by regulators. Second, prepare for stringent cybersecurity compliance, treating model weights as trade secrets. As WWDC 2026 showed, AI is everywhere, but its reliability is still in question. Companies that do not plan for compliance and resilience now risk falling behind.
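To make the multi-model idea concrete, here is a minimal sketch of a fallback router, assuming each vendor is wrapped behind a plain function that takes a prompt and returns text. The names `ModelRouter`, `ProviderUnavailable`, and the `blocked` set are hypothetical and invented for illustration; they do not come from any vendor SDK, and a real deployment would also need timeouts, logging, and output checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical error: a provider cannot serve the request
    (outage, model withdrawal, or a regulatory hold)."""


@dataclass
class ModelRouter:
    # Provider names in order of preference; earlier entries are tried first.
    order: List[str]
    # Maps a provider name to a callable that takes a prompt and returns text.
    providers: Dict[str, Callable[[str], str]]
    # Providers disabled by internal policy, e.g. pending a compliance review.
    blocked: Set[str] = field(default_factory=set)

    def complete(self, prompt: str) -> Tuple[str, str]:
        """Return (provider_name, response) from the first usable provider."""
        errors = []
        for name in self.order:
            if name in self.blocked:
                errors.append(f"{name}: blocked by policy")
                continue
            try:
                return name, self.providers[name](prompt)
            except ProviderUnavailable as exc:
                errors.append(f"{name}: {exc}")
        # Every provider was blocked or failed; surface all reasons at once.
        raise ProviderUnavailable("; ".join(errors))


# Stub providers standing in for real vendor clients (assumptions, not real APIs).
def withdrawn_model(prompt: str) -> str:
    raise ProviderUnavailable("model withdrawn")


def backup_model(prompt: str) -> str:
    return f"backup answer to: {prompt}"


if __name__ == "__main__":
    router = ModelRouter(
        order=["primary", "backup"],
        providers={"primary": withdrawn_model, "backup": backup_model},
    )
    used, answer = router.complete("Summarize the Q3 report")
    print(used, "->", answer)
```

The design choice is to keep vendor-specific code behind one narrow interface, so that withdrawing or blocking a model becomes a configuration change rather than a rewrite.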