Anthropic has released benchmark results for Claude Opus 4 showing significant advances in complex reasoning, mathematical problem-solving, and code generation. The model achieved state-of-the-art results on the GPQA benchmark with 72.1% accuracy and demonstrated improved performance on multi-turn conversational tasks that require sustained context tracking.
Notably, Claude Opus 4 showed marked improvements on tasks that require the model to acknowledge uncertainty and avoid hallucination. In blind evaluations conducted by independent researchers, the model correctly identified the limits of its knowledge 91% of the time, compared with 73% for its nearest competitor.