6/ Our paper is out: arxiv.org/abs/2502.02671. This work was the result of my internship at Google DeepMind—huge thanks to the team: Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, @ramealexandre.bsky.social, @mblondel.bsky.social!
Posts by Daniil Tiapkin
5/ Our suggestions are the following:
- Use online generations during distillation;
- Train on more diverse prompt datasets;
- Expand the dataset with multiple completions per prompt.
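A minimal sketch of how the three suggestions shape the training data (all names are hypothetical stand-ins; real training would sample from an LLM and apply a KL-based distillation loss):

```python
import random

random.seed(0)

def student_generate(prompt):
    """Hypothetical: sample a completion from the *current* student (online data),
    instead of reusing a fixed offline dataset."""
    return f"{prompt} -> completion#{random.randint(0, 9)}"

prompts = ["p1", "p2", "p3", "p4"]  # a more diverse prompt set
completions_per_prompt = 4          # multiple completions per prompt

batch = [
    (p, student_generate(p))
    for p in prompts                          # diversity across prompts
    for _ in range(completions_per_prompt)    # expansion per prompt
]
print(len(batch))  # 16 examples: diversity on both the prompt and completion axes
```

The common thread is coverage: online, diverse, multi-completion data leaves the student fewer unvisited regions in which to over-fit the teacher's quirks.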
4/ The results? Teacher hacking is real: better approximating the teacher does not always translate into a better approximation of the oracle. Fortunately, we found some strategies to mitigate it.
3/ The key intuition: distillation optimizes a proxy objective, since the teacher isn’t perfect—just as RLHF optimizes an imperfect reward model. To study this, we built a controlled setup in which an oracle model plays the role of the ground-truth objective.
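A toy illustration of the proxy-objective intuition above (hypothetical numbers, not the paper's code): a student can match the teacher perfectly—driving the distillation loss to zero—while remaining at a fixed distance from the oracle, because the teacher itself is imperfect.

```python
import numpy as np

def kl(p, q):
    """Forward KL divergence KL(p || q) between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical next-token distributions over a 3-token vocabulary.
oracle  = np.array([0.7, 0.2, 0.1])   # ground-truth objective
teacher = np.array([0.6, 0.3, 0.1])   # imperfect proxy for the oracle
student = np.array([0.6, 0.3, 0.1])   # student that matched the teacher exactly

# Distillation drives the proxy gap to zero ...
proxy_gap  = kl(teacher, student)
# ... but the gap to the oracle (what we actually care about) stays positive.
oracle_gap = kl(oracle, student)
print(proxy_gap, oracle_gap)
```

Teacher hacking is the pathological version of this: pushing the proxy gap down further can actually push the oracle gap up.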
2/ In our new work from Google DeepMind, “On Teacher Hacking in Language Model Distillation,” we analyze this potential failure mode—a critical one if real, since distillation is becoming central to the post-training of modern LLMs.
1/ If you’re familiar with RLHF, you’ve likely heard of reward hacking—where over-optimizing an imperfect reward model leads to unintended behaviors. But what about teacher hacking in knowledge distillation: can the teacher be hacked, like rewards in RLHF?
Hope I'm not too late 😅