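Researchers argue that current multimodal benchmarks are either saturated (top models score near the ceiling, so results no longer separate them) or contaminated (test items have leaked into training data).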

They propose dynamic, leak-resistant evaluation suites.
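
The summary does not say how such a suite would work, so the following is only an illustrative sketch of one common reading of "dynamic": test items are regenerated from a fresh seed for each evaluation round, so memorizing a previously published set gives no advantage. Everything here is an assumption for illustration, including the names `Item`, `generate_items`, `score`, and `make_memorizer`, and the text-only arithmetic template that stands in for a real multimodal item.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    prompt: str
    answer: str


def generate_items(seed: int, n: int) -> list[Item]:
    """Hypothetical generator: build n templated items from a seed."""
    rng = random.Random(seed)  # local RNG, so each seed reproduces its own set
    items = []
    for _ in range(n):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        items.append(Item(prompt=f"What is {a} + {b}?", answer=str(a + b)))
    return items


def score(model, items: list[Item]) -> float:
    """Fraction of items the model answers exactly (model: prompt -> str)."""
    correct = sum(model(item.prompt).strip() == item.answer for item in items)
    return correct / len(items)


def make_memorizer(leaked: list[Item]):
    """Toy 'contaminated' model: it only knows items it has already seen."""
    table = {item.prompt: item.answer for item in leaked}
    return lambda prompt: table.get(prompt, "")


# Assumed usage: a past round leaks, then a new round is drawn from a new seed.
leaked_round = generate_items(seed=1, n=50)
fresh_round = generate_items(seed=2, n=50)
memorizer = make_memorizer(leaked_round)
print("score on leaked round:", score(memorizer, leaked_round))
print("score on fresh round:", score(memorizer, fresh_round))
```

In this toy setup the memorizing model is scored only on what it can reproduce from the leaked round, so its score on a freshly seeded round reflects how little of that set it has seen before. A real suite would also need to keep the seed private until after evaluation and to check that generated items are of comparable difficulty across rounds.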

Community adoption is the biggest obstacle.