Anthropic has released Claude 4.5 Opus, which achieves the highest scores yet recorded on multiple AI reasoning benchmarks, including surpassing the average human PhD score on graduate-level science and mathematics examinations.
Benchmark Results
Claude 4.5 Opus sets new state-of-the-art across reasoning, coding, and long-context tasks.
- GPQA Diamond (PhD-level science): 92.1% (previous best: 89.2%, human PhD average: 81%)
- SWE-bench Verified (real-world coding): 72.3% (previous best: 65.8%)
- MATH (competition mathematics): 98.1%
- Context window: 1 million tokens with 99.2% accuracy on needle-in-haystack retrieval
- Multilingual reasoning: maintains 95%+ performance across 28 languages
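Anthropic has not published the harness behind the needle-in-haystack figure above, but the general technique is well known: plant a unique fact ("needle") at a controlled depth inside a long filler document ("haystack"), ask the model to retrieve it, and score whether the answer contains the fact. The sketch below illustrates that setup; the function names and the sample needle are hypothetical, and the "model answer" is simulated rather than produced by an API call.

```python
def build_haystack(needle: str, filler: str, n_fillers: int, depth: float) -> str:
    """Embed a needle sentence at a relative depth (0.0-1.0) within filler text."""
    idx = int(n_fillers * depth)
    parts = [filler] * n_fillers
    parts.insert(idx, needle)  # needle lands ~depth of the way through
    return " ".join(parts)

def score_retrieval(answer: str, expected: str) -> bool:
    """Pass/fail scoring: did the answer surface the planted fact?"""
    return expected.lower() in answer.lower()

# Plant a unique fact 70% of the way into a long document.
needle = "The magic number for project Aurora is 48151."
haystack = build_haystack(needle, "The sky was clear that day.", 1000, 0.7)

# A real evaluation would send the haystack plus a question like
# "What is the magic number for project Aurora?" to the model.
# Here we simulate the model's reply to show how scoring works.
model_answer = "The magic number is 48151."
print(score_retrieval(model_answer, "48151"))  # True
```

A full harness sweeps the needle across many depths and context lengths and reports the fraction of passes, which is how a single headline accuracy number like 99.2% is typically derived.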
Enterprise Impact
Early enterprise adopters report that Claude 4.5 Opus can independently handle complex multi-step research tasks, legal document analysis, and scientific literature review that previously required teams of specialists. Anthropic emphasizes that its Constitutional AI approach ensures the model refuses harmful requests while remaining helpful on legitimate tasks.