Autoregressive text generation

The mechanism by which LLMs produce text: one token at a time, where each new token is sampled from a probability distribution conditioned on all previously generated tokens. The model doesn’t plan ahead or produce a sequence all at once — it makes a single next-token prediction, appends that token to the context, and repeats.
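This loop can be sketched in a few lines. The snippet below uses a toy bigram table in place of a real model's logits (all names and values are illustrative, not from any actual LLM library), but the control flow is the same: compute a distribution over the vocabulary, sample one token, append it to the context, repeat.

```python
import math
import random

# Toy "model": fixed next-token logits keyed by the previous token.
# A real LLM would compute these with a forward pass over the whole context.
LOGITS = {
    "<s>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 1.5, "dog": 1.0, "<end>": 0.2},
    "a":   {"cat": 1.0, "dog": 1.2, "<end>": 0.3},
    "cat": {"sat": 1.0, "<end>": 1.5},
    "dog": {"ran": 1.0, "<end>": 1.5},
    "sat": {"<end>": 2.0},
    "ran": {"<end>": 2.0},
}

def softmax(logits):
    # Convert raw logits to a probability distribution.
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def generate(max_tokens=10, seed=0):
    rng = random.Random(seed)
    context = ["<s>"]
    for _ in range(max_tokens):
        # One "forward pass": a distribution conditioned on the context
        # (here only the last token matters; a real model attends to all of it).
        probs = softmax(LOGITS[context[-1]])
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<end>":
            break
        context.append(token)  # append and repeat: the autoregressive loop
    return context[1:]

print(generate())
```

Swapping `rng.choices` for `max(probs, key=probs.get)` gives greedy decoding; temperature and top-k/top-p sampling are variations on how this one distribution is sampled.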

This has several practical consequences:

  • Generation is sequential and cannot be trivially parallelized. Each new token depends on all the tokens before it, so producing 100 tokens requires 100 sequential forward passes. This is why tokens per second is the relevant throughput metric, and why memory bandwidth (not raw compute) is the primary bottleneck during decoding: every forward pass re-reads the model weights and cache just to produce a single token.

  • The KV cache exists because of this loop. Rather than recomputing attention over the entire context on every step, previously computed key/value states are cached and reused, amortizing the cost across tokens.

  • Errors compound. A poorly chosen token early in generation shifts the probability distribution for all subsequent tokens. There is no backtracking — the model is committed to its prior output.

See also: Autoregressive model - Wikipedia