Huge thanks to amazing collaborators: Justus, Farrin, Theo, Sameer and Stephan!
Meet us at #ICLR: Apr 23, morning poster P3-#608
Posts by Felix Draxler
We verify that this works in a speculative decoding experiment: we distill Vicuna-7B on conversations to predict the same outputs at greatly reduced latency. We achieve a 2.4x speedup over autoregressive decoding on diverse text tasks on a single GPU, with 3.2x possible with an optimized implementation.
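To make the speculative decoding setup concrete, here is a minimal toy sketch of one greedy speculative step: a cheap draft model proposes several tokens, and the target model verifies them, accepting the longest matching prefix. The function names and toy models are illustrative assumptions, not the paper's actual implementation.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative-decoding step (toy sketch).

    draft_next / target_next: functions mapping a token sequence to the
    next token id (stand-ins for the cheap draft and expensive target
    models). Returns the tokens accepted this step (always >= 1 new
    token per target call).
    """
    # Draft proposes k tokens autoregressively (cheap).
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # Target verifies the proposed positions. Here this is a loop, but
    # on a GPU it is a single batched forward pass -- that is where the
    # speedup comes from.
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        want = target_next(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            # Replace the first mismatch with the target's token, stop.
            accepted.append(want)
            break
    else:
        # All k proposals accepted: the target call yields a bonus token.
        accepted.append(target_next(ctx))
    return accepted
```

The more of the draft's proposals the target accepts, the more tokens are produced per expensive model call, which is exactly what distilling the draft to match the target improves.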
Autoregressive models produce text by predicting a probability distribution over the next token. Auxiliary random variables then determine which token is sampled.
Parallel Token Prediction directly learns which external randomness maps to which token. This makes it possible to predict many tokens at once.
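A toy illustration of the idea: once the external noise is fixed, autoregressive sampling is a deterministic map from (context, noise) to tokens, e.g. via inverse-CDF sampling. This sketch only shows that deterministic map; how PTP actually parametrizes and learns it is an assumption left to the paper.

```python
import numpy as np

def inverse_cdf_sample(probs, u):
    """Map an external uniform variable u in [0, 1) to a token id via
    the inverse CDF of the next-token distribution. This is one way to
    write sampling as a deterministic function of the noise."""
    return int(np.searchsorted(np.cumsum(probs), u))

def ar_decode(next_probs, prefix, noise):
    """Autoregressive decoding with fixed noise: a deterministic map
    (context, u_1..u_n) -> tokens. PTP's goal is to evaluate such a map
    for many positions in a single model call instead of n calls."""
    ctx = list(prefix)
    for u in noise:
        ctx.append(inverse_cdf_sample(next_probs(ctx), u))
    return ctx[len(prefix):]
```

Because the output is a deterministic function of the noise, a model trained to predict all n tokens from (context, u_1..u_n) at once can remain consistent with the sequential sampler.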
LLMs are autoregressive and slow? No! Parallel Token Prediction (PTP) decodes multiple consistent tokens in one model call. Unlike discrete diffusion, PTP captures arbitrary dependencies in a single call. Practical: 2.4x speedup.
github.com/mandt-lab/ptp