Leaked benchmark results for Anthropic's upcoming Claude 4 Opus model show it scoring 94% on GPQA (Graduate-Level Google-Proof Q&A) and outperforming GPT-5 on 8 of 10 major benchmarks.
Leaked Benchmark Scores
- GPQA: 94% (GPT-5: 89%)
- MATH: 96.5% (GPT-5: 94%)
- HumanEval: 98% (GPT-5: 95%)
- MMLU-Pro: 91% (GPT-5: 88%)
- SWE-bench Verified: 72% (GPT-5: 65%)
What's New
Sources close to Anthropic say Claude 4 features a dramatically improved reasoning engine, a 500K-token context window, native multimodal capabilities including video understanding, and significantly better instruction following in agentic workflows.
Anthropic has not confirmed the leak, but the company is expected to announce Claude 4 at an event in late April.