A Stanford study tested leading LLM agents on 500 multi-step real-world tasks and found success rates below 42 percent.
Models performed best on tasks requiring fewer than 12 steps, with success rates degrading sharply beyond 25 steps.
The researchers argue that memory and verification, rather than raw model capability, are the larger bottlenecks.