Leaked benchmark results for Anthropic's upcoming Claude 4 Opus model show it scoring 94% on GPQA (Graduate-Level Google-Proof Q&A) and outperforming GPT-5 on 8 of 10 major benchmarks.
Leaked Benchmark Scores
- GPQA: 94% (GPT-5: 89%)
- MATH: 96.5% (GPT-5: 94%)
- HumanEval: 98% (GPT-5: 95%)
- MMLU-Pro: 91% (GPT-5: 88%)
- SWE-bench Verified: 72% (GPT-5: 65%)
What's New
Sources close to Anthropic say Claude 4 features a dramatically improved reasoning engine, a 500K-token context window, native multimodal capabilities including video understanding, and significantly better instruction following in agentic workflows.
Anthropic has not confirmed the leak, but the company is expected to announce Claude 4 at an event in late April.