Anthropic today published evaluation results for Claude Opus 4, demonstrating state-of-the-art performance on complex reasoning benchmarks including GPQA Diamond, ARC-AGI, and the newly introduced ReasonBench suite of multi-step logical deduction tasks.

Claude Opus 4 scored 78 percent on GPQA Diamond, surpassing the previous best score by six percentage points, and achieved 85 percent on ReasonBench, a benchmark designed to resist the pattern-matching and memorization strategies commonly seen in large language models.

The model is available through Anthropic's API with a 200,000-token context window and supports an extended thinking mode that lets users observe the model's step-by-step reasoning before it produces a final answer.
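As a rough illustration of how extended thinking might be enabled through the Messages API, the sketch below assembles a request body with a `thinking` configuration. The model identifier (`claude-opus-4`) and the token budget are illustrative assumptions, not confirmed values from the announcement.

```python
# Sketch: assembling a Messages API request with extended thinking enabled.
# The model name and budget_tokens value are assumptions for illustration.

def build_request(prompt: str) -> dict:
    """Build a request body that asks for visible step-by-step reasoning."""
    return {
        "model": "claude-opus-4",  # assumed identifier, not confirmed
        "max_tokens": 16000,
        # Extended thinking: the model emits reasoning blocks before its
        # final answer, capped at the given token budget (assumed value).
        "thinking": {"type": "enabled", "budget_tokens": 8000},
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_request("Prove that the square root of 2 is irrational.")
print(request["thinking"]["type"])  # → enabled
```

With a request like this, the response would interleave the model's reasoning content ahead of its final answer, which is what allows users to inspect the deduction steps the article describes.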