Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning Paper • 2509.22824 • Published Sep 26, 2025 • 21
VideoScore2: Think before You Score in Generative Video Evaluation Paper • 2509.22799 • Published Sep 26, 2025 • 26
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use Paper • 2509.01055 • Published Sep 1, 2025 • 78
StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs Paper • 2505.20139 • Published May 26, 2025 • 19
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research Paper • 2505.19955 • Published May 26, 2025 • 14
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design Paper • 2505.16175 • Published May 22, 2025 • 42
General-Reasoner: Advancing LLM Reasoning Across All Domains Paper • 2505.14652 • Published May 20, 2025 • 24
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation Paper • 2504.00043 • Published Mar 30, 2025 • 10
Small Models Struggle to Learn from Strong Reasoners Paper • 2502.12143 • Published Feb 17, 2025 • 39
ACECODER: Acing Coder RL via Automated Test-Case Synthesis Paper • 2502.01718 • Published Feb 3, 2025 • 28
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning Paper • 2502.01100 • Published Feb 3, 2025 • 21
Vision Arena (Testing VLMs side-by-side) • Running • Featured • 559 • Transform your ideas into stunning visual art
On Memorization of Large Language Models in Logical Reasoning Paper • 2410.23123 • Published Oct 30, 2024 • 18
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks Paper • 2410.10563 • Published Oct 14, 2024 • 37