Anthropic has released Claude 4 Opus, which achieved state-of-the-art results across major AI reasoning benchmarks, surpassing GPT-5.4 and Gemini Ultra 2.0 in multiple categories.
Benchmark Results
- GPQA Diamond: 78.2% (GPT-5.4: 72.1%, Gemini: 69.8%)
- MATH-500: 96.4% (GPT-5.4: 94.1%)
- SWE-bench Verified: 72.0% (GPT-5.4: 64.2%)
- HumanEval+: 98.2% (GPT-5.4: 96.8%)
- ARC-AGI-2: 42.1% (GPT-5.4: 35.7%)
Key Improvements
Claude 4 Opus features a 500K-token context window, improved instruction following, and significantly lower hallucination rates. Anthropic credits the gains to advances in constitutional AI training and chain-of-thought reasoning.
Pricing
API pricing is $15 per million input tokens and $75 per million output tokens. A free tier is available through claude.ai with usage limits.
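To make the rates concrete, here is a minimal sketch of how one might estimate the cost of a single API request at the published prices. The function name and example token counts are illustrative, not part of any official SDK.

```python
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD at the article's published rates.

    Rates: $15 per million input tokens, $75 per million output tokens.
    (Illustrative helper, not an official Anthropic API.)
    """
    INPUT_RATE_PER_M = 15.0
    OUTPUT_RATE_PER_M = 75.0
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

# Example: a request with 10,000 input tokens and 2,000 output tokens.
print(f"${estimate_cost(10_000, 2_000):.2f}")  # → $0.30
```

Note that output tokens cost five times as much as input tokens, so long generations dominate the bill even when prompts are large.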