Use a small model to generate a 'draft' output, then use a larger and smarter model to score the draft tokens in a single pass, then use a rejection sampling scheme to accept the draft tokens that the large model agrees with (rejected tokens are resampled from the large model).
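A minimal sketch of how that rejection scheme can work, using NumPy and toy uniform distributions. The function name and array shapes here are illustrative assumptions, not taken from DeepMind's implementation: each drafted token is accepted with probability min(1, p/q), where q is the small model's probability and p is the large model's, and a rejection triggers a resample from the normalized residual max(0, p - q).

```python
import numpy as np

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """One speculative-sampling step (illustrative sketch, not the paper's code).

    draft_probs:  (K, vocab) -- small model's distribution at each drafted position.
    target_probs: (K+1, vocab) -- large model's distribution at those positions,
                  plus one extra row for the 'bonus' token if all drafts pass.
    draft_tokens: the K tokens the small model actually sampled.
    Returns the accepted tokens (plus one correction or bonus token).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        q = draft_probs[i, tok]   # small-model probability of the draft token
        p = target_probs[i, tok]  # large-model probability of the same token
        # Accept with probability min(1, p/q); this keeps the final output
        # distributed exactly as if the large model had sampled alone.
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), normalized.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted
    # All K drafts accepted: sample one extra token from the large model's
    # distribution at the next position, so every step yields >= 1 token.
    accepted.append(int(rng.choice(target_probs.shape[1], p=target_probs[-1])))
    return accepted
```

When the two models agree exactly (p = q everywhere), every draft token is accepted and the step yields K+1 tokens for a single large-model pass, which is where the speedup comes from.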

In tests, they find that a draft model gives speedups ranging from 1.92X (on the summarization benchmark XSum) to 2.46X (on the code generation benchmark HumanEval).

Fwd: Import AI 317: DeepMind speeds up language model sampling; voice cloning tech gets abused; more scaling laws for RL
from Josh Beckman