Leaked benchmark results purportedly from OpenAI's upcoming GPT-5 model show performance matching or exceeding that of human experts on graduate-level mathematics, physics, and legal reasoning tasks, sparking intense debate about AI capabilities.
Benchmark Results
The leaked evaluation results, first reported by The Information and partially confirmed by OpenAI insiders, show dramatic improvements over GPT-4:
- GPQA Diamond (graduate-level science): 89.2% (human expert average: 81%)
- MATH benchmark (competition math): 96.4% (GPT-4: 76.6%)
- Multilingual legal reasoning: 91.3% across 12 languages
- Long-context coherence: maintains accuracy across contexts of 500,000+ tokens
Industry Reaction
AI researchers are divided between those who see the results as a clear step toward artificial general intelligence and skeptics who argue that benchmarks do not capture the full spectrum of human reasoning. OpenAI has declined to comment on the leak but is expected to announce GPT-5 at a May event.