Researchers are retooling AI benchmarks to better reflect real-world agent tasks. New suites emphasize multi-step planning, tool use, and long-horizon reliability.
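
To make the shape of such tasks concrete, here is a minimal sketch in Python of what one multi-step, tool-use evaluation item could look like. Everything named here (the `Task` record, the `run_episode` harness, and the toy `calculator` tool) is a hypothetical illustration, not the design of any particular suite.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool registry: each tool is a named function the agent may call.
TOOLS: dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only
}

@dataclass
class Task:
    """One multi-step benchmark item: a goal plus a checker for the final answer."""
    prompt: str
    max_steps: int
    check: Callable[[str], bool]

def run_episode(agent, task: Task) -> bool:
    """Drive the agent for up to max_steps.

    The agent returns ("tool", name, arg) to call a tool,
    or ("answer", text) to submit its final answer.
    """
    observation = task.prompt
    for _ in range(task.max_steps):
        kind, *rest = agent(observation)
        if kind == "answer":
            return task.check(rest[0])
        tool_name, arg = rest
        observation = TOOLS[tool_name](arg)  # feed tool output back as the next observation
    return False  # step budget exhausted: a long-horizon failure

# Example item: success requires a correct tool call, then a final answer.
task = Task(
    prompt="What is 17 * 23? Use the calculator tool.",
    max_steps=3,
    check=lambda ans: ans.strip() == "391",
)

def scripted_agent(obs: str):
    """Stand-in agent: calls the tool once, then answers with its output."""
    if "17 * 23" in obs:
        return ("tool", "calculator", "17 * 23")
    return ("answer", obs)

print(run_episode(scripted_agent, task))  # True
```

The property being tested is that success depends on chaining a correct tool call into a final answer within a fixed step budget; a wrong call at any step, or running out of steps, scores the episode as a failure.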

Transparent evaluation methods are a key priority. The community hopes the new benchmarks will reduce leaderboard gaming and steer development toward meaningful improvements.
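
One concrete form transparency can take is shipping the full, machine-readable trace of every episode alongside the headline score, together with an uncertainty estimate. The record schema and bootstrap interval below are illustrative assumptions, not any benchmark's actual reporting format.

```python
import json
import random

# Hypothetical episode records: one per task attempt, with full step traces,
# so anyone can audit the score rather than trust an opaque leaderboard number.
episodes = [
    {"task_id": 0, "passed": True,
     "trace": [{"step": 0, "action": "tool:calculator(17 * 23)"},
               {"step": 1, "action": "answer:391"}]},
    {"task_id": 1, "passed": False,
     "trace": [{"step": 0, "action": "answer:guess"}]},
]

def pass_rate(records):
    return sum(r["passed"] for r in records) / len(records)

def bootstrap_ci(records, n_resamples=1000, seed=0):
    """95% confidence interval on the pass rate via bootstrap resampling,
    so reported gaps between models can be judged against sampling noise."""
    rng = random.Random(seed)
    rates = sorted(
        pass_rate(rng.choices(records, k=len(records)))
        for _ in range(n_resamples)
    )
    return rates[int(0.025 * n_resamples)], rates[int(0.975 * n_resamples)]

report = {
    "pass_rate": pass_rate(episodes),
    "pass_rate_95ci": bootstrap_ci(episodes),
    "episodes": episodes,  # full traces ship with the score
}
print(json.dumps(report, indent=2))
```

Publishing traces and intervals like these makes it harder to game a single aggregate number, since any claimed score can be replayed and checked step by step.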