ML Tea: Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
Speakers: Tian Jin & Ellie Cheng
Abstract: Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work has explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are the PASTA-LANG annotation language and its interpreter: PASTA-LANG allows LLMs to express semantic independence in their own responses, and the interpreter acts on these annotations to orchestrate parallel decoding on-the-fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction-following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric-mean speedups ranging from 1.21× to 1.93× with corresponding quality changes of +2.2% to -7.1%, measured as length-controlled win rates.
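The abstract does not show PASTA-LANG itself, so the snippet below is a purely illustrative sketch of the general idea: a model marks semantically independent chunks of its response with annotations, and an interpreter decodes those chunks concurrently before splicing them back in order. The `<async topic="..."/>` tag, the `decode()` stub, and all names here are hypothetical stand-ins, not the actual PASTA-LANG syntax or interpreter API.

```python
# Illustrative sketch only: tag names and decode() are hypothetical,
# not the actual PASTA-LANG annotation language or interpreter.
import re
from concurrent.futures import ThreadPoolExecutor

def decode(prompt: str) -> str:
    """Stand-in for one sequential LLM decoding call (hypothetical)."""
    return f"<text decoded for: {prompt!r}>"

def interpret(annotated: str) -> str:
    # Suppose the model marked independent chunks with a hypothetical
    # <async topic="..."/> tag. Decode each chunk concurrently, then
    # splice the results back into the response in order.
    topics = re.findall(r'<async topic="([^"]+)"/>', annotated)
    with ThreadPoolExecutor() as pool:
        chunks = list(pool.map(decode, topics))
    out = annotated
    for topic, chunk in zip(topics, chunks):
        out = out.replace(f'<async topic="{topic}"/>', chunk, 1)
    return out

if __name__ == "__main__":
    response = ('Two tips for better sleep: '
                '<async topic="consistent schedule"/> '
                '<async topic="limiting screen time"/>')
    print(interpret(response))
```

In the real system, per the abstract, the model learns via finetuning where to place such annotations so that the parallelism preserves response quality; the sketch above only conveys the interpreter's orchestration role.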
Bios:
Tian Jin is a 5th-year Ph.D. student at MIT, advised by Michael Carbin and Jonathan Ragan-Kelley. His research focuses on machine learning and programming systems. Previously, Tian was a Research Engineer at IBM Research, where he led efforts to enable deep neural network inference on IBM mainframe machines and contributed to compiler support for the IBM Summit Supercomputer. He holds a dual degree in Computer Science and Mathematics from Haverford College.
Ellie is a 3rd-year Ph.D. student at MIT CSAIL, advised by Michael Carbin. Her research interests lie at the intersection of programming languages and machine learning.