6/ Our paper is out: arxiv.org/abs/2502.02671. This work was the result of my internship at Google DeepMind—huge thanks to the team: Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, @ramealexandre.bsky.social, @mblondel.bsky.social!
Posts by Daniil Tiapkin
5/ Our suggestions are the following:
- Use online generations during distillation;
- Train on more diverse prompt datasets;
- Expand the dataset with multiple completions per prompt.
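A minimal sketch of how the three suggestions shape the training data (all names are hypothetical stand-ins; real training would sample from an LLM and apply a KL-based distillation loss):

```python
import random

random.seed(0)

def student_generate(prompt):
    """Hypothetical: sample a completion from the *current* student (online data),
    instead of reusing a fixed offline dataset."""
    return f"{prompt} -> completion#{random.randint(0, 9)}"

prompts = ["p1", "p2", "p3", "p4"]  # a more diverse prompt set
completions_per_prompt = 4          # multiple completions per prompt

batch = [
    (p, student_generate(p))
    for p in prompts                          # diversity across prompts
    for _ in range(completions_per_prompt)    # expansion per prompt
]
print(len(batch))  # 16 examples: diversity on both the prompt and completion axes
```

The common thread is coverage: online, diverse, multi-completion data leaves the student fewer unvisited regions in which to over-fit the teacher's quirks.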
4/ The results? Teacher hacking is real: better approximating the teacher does not always translate into a better approximation of the oracle. Fortunately, we found some strategies to mitigate it.
3/ The key intuition: distillation optimizes a proxy objective, since the teacher isn’t perfect—just as RLHF optimizes an imperfect reward model. To study this, we built a controlled setup in which an oracle model plays the role of the ground-truth objective.
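A toy illustration of the proxy-objective intuition above (hypothetical numbers, not the paper's code): a student can match the teacher perfectly—driving the distillation loss to zero—while remaining at a fixed distance from the oracle, because the teacher itself is imperfect.

```python
import numpy as np

def kl(p, q):
    """Forward KL divergence KL(p || q) between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical next-token distributions over a 3-token vocabulary.
oracle  = np.array([0.7, 0.2, 0.1])   # ground-truth objective
teacher = np.array([0.6, 0.3, 0.1])   # imperfect proxy for the oracle
student = np.array([0.6, 0.3, 0.1])   # student that matched the teacher exactly

# Distillation drives the proxy gap to zero ...
proxy_gap  = kl(teacher, student)
# ... but the gap to the oracle (what we actually care about) stays positive.
oracle_gap = kl(oracle, student)
print(proxy_gap, oracle_gap)
```

Teacher hacking is the pathological version of this: pushing the proxy gap down further can actually push the oracle gap up.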
2/ In our new work from Google DeepMind, “On Teacher Hacking in Language Model Distillation,” we analyze this potential failure mode—a critical one if real, since distillation is becoming central to the post-training of modern LLMs.
1/ If you’re familiar with RLHF, you’ve likely heard of reward hacking—where over-optimizing an imperfect reward model leads to unintended behaviors. But what about teacher hacking in knowledge distillation: can the teacher be hacked, like rewards in RLHF?
Hope I'm not too late 😅