Anthropic today published evaluation results for Claude Opus 4, demonstrating state-of-the-art performance on complex reasoning benchmarks including GPQA Diamond, ARC-AGI, and the newly introduced ReasonBench suite of multi-step logical deduction tasks.

Claude Opus 4 scored 78 percent on GPQA Diamond, surpassing the previous best score by six percentage points, and achieved 85 percent on ReasonBench, a benchmark designed to resist the pattern-matching and memorization strategies commonly seen in large language models.

The model is available through Anthropic's API with a 200,000-token context window and supports an extended thinking mode that lets users observe the model's step-by-step reasoning before it produces a final answer.
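As a rough illustration of how extended thinking might be enabled through the Messages API, the sketch below assembles a request body with a `thinking` configuration. The model identifier (`claude-opus-4`) and the token budget are illustrative assumptions, not confirmed values from the announcement.

```python
# Sketch: assembling a Messages API request with extended thinking enabled.
# The model name and budget_tokens value are assumptions for illustration.

def build_request(prompt: str) -> dict:
    """Build a request body that asks for visible step-by-step reasoning."""
    return {
        "model": "claude-opus-4",  # assumed identifier, not confirmed
        "max_tokens": 16000,
        # Extended thinking: the model emits reasoning blocks before its
        # final answer, capped at the given token budget (assumed value).
        "thinking": {"type": "enabled", "budget_tokens": 8000},
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_request("Prove that the square root of 2 is irrational.")
print(request["thinking"]["type"])  # → enabled
```

With a request like this, the response would interleave the model's reasoning content ahead of its final answer, which is what allows users to inspect the deduction steps the article describes.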