Researchers argue current multimodal benchmarks are saturated or contaminated.
They propose dynamic, leak-resistant evaluation suites.
Community adoption is the biggest obstacle.