BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation. arXiv:2506.00482, published May 31, 2025.
Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy. arXiv:2412.17759, published Dec 23, 2024.
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia. arXiv:2503.07920, published Mar 10, 2025.
SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use. arXiv:2505.17332, published May 22, 2025.
Why do LLaVA Vision-Language Models Reply to Images in English? arXiv:2407.02333, published Jul 2, 2024.
Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review. arXiv:2502.19614, published Feb 26, 2025.
MVTamperBench: Evaluating Robustness of Vision-Language Models. arXiv:2412.19794, published Dec 27, 2024.
ClimSim: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate simulators. arXiv:2306.08754, published Jun 14, 2023.