2025-11-25

The company had rolled out Claude Sonnet 4.5 in late September and followed up with Claude Haiku 4.5 in October.

Claude Opus 4.5 has achieved an unprecedented score of 80.9% on the SWE-bench Verified test, a benchmark that evaluates real-world software engineering skills. This milestone makes it the first model to surpass the 80% threshold. In comparison, Google's Gemini 3 Pro scored 76.2%, while OpenAI’s GPT-5.1 Codex Max achieved 77.9%.

The new model outperformed all human applicants on Anthropic's challenging two-hour engineering assessment, which evaluates practical coding and problem-solving abilities.

"The take-home test is designed to assess technical ability and judgment under time pressure. It doesn’t test for other crucial skills candidates may possess, like collaboration, communication, or the instincts that develop over the years. But this result — where an AI model outperforms strong candidates on important technical skills — raises questions about how AI will change engineering as a profession," the company said.

Anthropic claims that its latest AI model outperforms competitors on Tau2-bench, a benchmark designed to evaluate agents handling real-world, multi-turn tasks. In one scenario, the model acts as an airline service representative, and the benchmark expects it to refuse a change to a basic economy booking because airline policy prohibits such modifications.

“Instead, Opus 4.5 found an insightful (and legitimate) way to solve the problem: upgrade the cabin first, then modify the flights,” Anthropic said.

Designed for reliable long-form content creation, Opus 4.5 can generate narrative chapters spanning 10 to 15 pages while maintaining consistency. It excels in sophisticated 3D reasoning exercises, offering richer and more precise spatial scene descriptions than before, the company said.