SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks Paper • 2602.12670 • Published 5 days ago • 33
mlfoundations-dev/tulu-3-sft-personas-algebra-sandboxes-traces-terminus-2 • Updated Oct 4, 2025 • 9.95k • 25
mlfoundations-dev/tulu-3-sft-personas-math-grade-filtered-sandboxes-traces-terminus-2 • Updated Oct 4, 2025 • 9.29k • 10
mlfoundations-dev/wizardlm_orca-evol-instruct-110k-sandboxes-traces-terminus-2 • Updated Oct 4, 2025 • 10k • 25 • 1
mlfoundations-dev/magicoder-evol-instruct-110k-sandboxes-traces-terminus-2 • Updated Oct 4, 2025 • 9.98k • 9 • 1
mlfoundations-dev/stackexchange-codereview-sandboxes-traces-terminus-2 • Updated Oct 4, 2025 • 9.99k • 3
mlfoundations-dev/glaive-code-assistant-sandboxes-traces-terminus-2 • Updated Oct 4, 2025 • 8.51k • 19