Reinforcement Learning from Rich Feedback with Distributional DAgger
Abstract
A forward cross-entropy objective combined with distributional imitation learning enables monotonic policy improvement and better performance on reasoning tasks than traditional reinforcement learning methods.
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
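To make the two families of objectives in the abstract concrete, here is a minimal sketch for a single decision with a categorical action distribution. It is not the paper's implementation: the function names (`forward_cross_entropy`, `reverse_kl`, `fit_student`), the softmax parameterization, the step size, and the toy expert distribution are all assumptions chosen for illustration. The forward cross-entropy is an expectation under the expert, so in principle it only needs expert probabilities (or samples) on the visited state, whereas reverse KL is an expectation under the student.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_cross_entropy(expert_probs, student_probs, eps=1e-12):
    """H(expert, student) = -sum_a expert(a) * log student(a)."""
    return -sum(p * math.log(max(q, eps))
                for p, q in zip(expert_probs, student_probs))

def reverse_kl(expert_probs, student_probs, eps=1e-12):
    """KL(student || expert) = sum_a student(a) * log(student(a) / expert(a))."""
    return sum(q * math.log(max(q, eps) / max(p, eps))
               for p, q in zip(expert_probs, student_probs) if q > 0)

def forward_ce_logit_grad(expert_probs, student_logits):
    """Gradient of H(expert, softmax(logits)) w.r.t. the logits: student - expert."""
    student = softmax(student_logits)
    return [q - p for p, q in zip(expert_probs, student)]

def fit_student(expert_probs, student_logits, steps=500, lr=0.5):
    """Plain gradient descent on the forward cross-entropy (illustrative only)."""
    logits = list(student_logits)
    for _ in range(steps):
        grad = forward_ce_logit_grad(expert_probs, logits)
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return logits

if __name__ == "__main__":
    expert = [0.7, 0.2, 0.1]      # hypothetical expert distribution on one state
    init_logits = [0.0, 0.0, 0.0]  # student starts uniform
    fitted = fit_student(expert, init_logits)
    print("forward CE before:", forward_cross_entropy(expert, softmax(init_logits)))
    print("forward CE after: ", forward_cross_entropy(expert, softmax(fitted)))
    print("reverse KL after: ", reverse_kl(expert, softmax(fitted)))
```

In this toy setting the gradient with respect to the student logits is simply the student probabilities minus the expert probabilities, so descent moves the student toward the expert. The sequence-level credit assignment and the guarantees discussed in the abstract concern the full objective over trajectories, which this single-state sketch does not attempt to reproduce.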
Community
We found something surprising about existing self-distillation methods:
Even when the feedback-conditioned "teacher" is better than the student, imitating it can still make the student worse.
This is particularly striking because self-distillation has become one of the most promising ways to go beyond RLVR, where every token receives the same trajectory-level reward.
So we asked:
Can we learn from rich feedback in a way that actually guarantees monotonic policy improvement?
Introducing DistIL: Reinforcement Learning from Rich Feedback with Distributional DAgger.
Core idea: view rich-feedback RL as distributional imitation learning.
This gives:
• monotonic policy improvement guarantees
• regret bounds
And empirically, DistIL improves over RLVR, SDPO, and OPSD on:
• science reasoning
• coding
• mathematical reasoning