- RubricBench: Aligning Model-Generated Rubrics with Human Standards
  Paper • 2603.01562 • Published • 50
- T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
  Paper • 2603.03790 • Published • 104
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
  Paper • 2505.20411 • Published • 93
- SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
  Paper • 2602.23866 • Published • 73
Paul S PRO
SuperPauly
AI & ML interests
None yet
Recent Activity
liked a model about 22 hours ago
coder3101/Qwen3.5-4B-heretic
liked a model about 23 hours ago
janhq/Jan-code-4b
updated a collection about 23 hours ago
Agent Loops, Character, Work Ethics & Behavior
Organizations
None yet