Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Speculative decoding: more tokens per forward pass

In Chapter 3 we drew the line that organizes this whole book: a decode step reads the entire model and the KV cache to produce a single token, so it is bandwidth-bound, with the GPU’s arithmetic units mostly idle. Chapter 5 then showed how continuous batching fills those idle FLOPs by stacking many sequences into one step. Speculative decoding attacks the same waste from a different angle, and for a different shape of traffic: instead of needing many concurrent requests to saturate the hardware, it lets a single request consume more of that spare compute, by checking several candidate tokens in the forward pass that would otherwise verify one.

The trick rests on an asymmetry. Producing the next token costs one full memory pass over the weights. But producing the logits for a prefix you already have in hand also costs one memory pass, whether that prefix is one token long or ten. To see why, recall what dominates the cost of a decode step. A transformer forward pass does two kinds of work: it reads the weights out of GPU memory once, and it multiplies those weights against the tokens flowing through. For the handful of tokens we care about here, the matmuls are tiny and the weight read dwarfs everything else, so the time the pass takes is set by how much memory it has to stream, not by how many tokens ride along inside it. Feeding two tokens or ten tokens through that single weight read costs almost the same wall-clock time. That is the lever speculative decoding pulls.

So if you can cheaply guess the next handful of tokens, you can feed [last_token, guess_1, guess_2, ..., guess_k] through the big model in one shot and get, in parallel, the model’s true probability distribution at each of those positions. Call the big model the target (it is the source of truth, the model whose output you are obligated to reproduce) and call the cheap guesser the drafter (it proposes candidates the target then checks). Where the drafter’s guesses match what the target would have produced anyway, you keep them for free, because the single weight read already paid for verifying them. Where they diverge, you fall back to the target’s own choice and discard the rest. You have spent roughly one decode’s worth of bandwidth and harvested up to $k+1$ tokens instead of one, where $k$ is the number of drafted tokens.

The diagram below traces that loop end to end: one cheap draft step proposes $k$ candidate tokens, one target forward pass verifies all of them at once, and a verifier keeps the longest correct prefix before the loop repeats from wherever it stopped.

flowchart TD
    A["target has produced last_token"] --> B["drafter cheaply guesses<br/>guess_1 ... guess_k"]
    B --> C["build prefix<br/>[last_token, guess_1, ..., guess_k]"]
    C --> D["ONE target forward pass<br/>(single weight read)"]
    D --> E["target distribution p at every position"]
    E --> F["verifier: accept longest correct prefix"]
    F --> G["emit accepted drafts plus 1 target token<br/>(bonus if all accepted, else recovered)"]
    G --> A

The two hard questions are: who does the guessing, and how do you keep the guessed tokens without changing the model’s output distribution? vLLM answers the first with a family of proposers under vllm/v1/spec_decode/ (the “drafter” box above), and the second with a single rejection sampler in vllm/v1/sample/rejection_sampler.py (the “verifier” box). The rest of this chapter is those two halves and the scheduler glue that joins them.

The contract: drafting must be free, verification must be exact

The foundational result is Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192). Before the rule itself, fix two pieces of notation that recur throughout this section. Write $q(x)$ for the probability the drafter assigned to a token $x$ when it proposed it, and $p(x)$ for the probability the target assigns to that same token now that it has verified it. The drafter is trying to imitate the target, so $q$ is an approximation of $p$; the whole accept/reject machinery exists to correct the gap between them. The paper’s contribution is a rejection-sampling scheme that takes draft tokens from any cheap proposal distribution $q$ and accepts or rejects them against the target distribution $p$ such that the tokens you emit are distributed exactly as if you had sampled from $p$ directly. This is the load-bearing guarantee: speculation is a pure latency optimization, not an approximation. Read it first; everything else in this chapter is an engineering elaboration of its accept/reject rule.

vLLM names the rule directly. From the class docstring in vllm/v1/sample/rejection_sampler.py:

class RejectionSampler(nn.Module):
    """
    The implementation strictly follows the algorithm described in
        https://arxiv.org/abs/2211.17192.
    ...
    bonus tokens:
        If all proposed tokens are accepted, the bonus token is added to the
        end of the sequence. The bonus token is only sampled from the target
        probabilities.
    """

The “bonus token” is why you get $k+1$ and not just $k$: the forward pass over the drafted prefix also yields the logits for the position after the last draft, so if every draft is accepted you get one extra token sampled straight from the target for nothing.

The accept/reject decision lives in a Triton kernel. The greedy case is the easiest to see — for greedy sampling, “the target distribution” is just its argmax, so a draft is accepted exactly when it equals what the target would have picked. From rejection_greedy_sample_kernel in the same file:

            else:
                token_id = target_argmax_id
                rejected = draft_token_id != target_argmax_id
            tl.store(
                output_token_ids_ptr + req_idx * (max_spec_len + 1) + pos,
                token_id,
            )

Source: vllm/v1/sample/rejection_sampler.py

Notice the kernel walks positions left to right and stops at the first rejection (if not rejected: guards each step). A draft is a chain: each guess was produced assuming every earlier guess was correct, so once the target diverges at position i, every later draft was conditioned on a token the target just rejected and is now meaningless. You cannot keep guess 5 if guess 3 was wrong, because guess 5 only made sense in a world where guess 3 stood. So the verifier accepts a contiguous prefix and discards the entire tail. This is why acceptance length, not raw draft length, is the metric that matters: drafting ten tokens does you no good if the chain reliably breaks at the second one.

The diagram below traces that left-to-right walk for a four-token draft where the target diverges at position 2. Positions 0 and 1 match and are kept; position 2 mismatches, so it is replaced by the target’s own token and the walk stops, throwing away position 3 even though it was already verified in the same forward pass.

flowchart LR
    P0["pos 0: draft == target?"] -->|"yes, accept draft"| P1["pos 1: draft == target?"]
    P1 -->|"yes, accept draft"| P2["pos 2: draft == target?"]
    P2 -->|"no, reject"| R["emit target's token here<br/>STOP the walk"]
    R --> X["pos 3: discarded<br/>(was conditioned on rejected pos 2)"]

The random (temperature > 0) case is where 2211.17192 earns its keep. From rejection_random_sample_kernel:

                target_prob = tl.load(
                    target_probs_ptr + (start_idx + pos) * vocab_size + draft_token_id
                )
                # NOTE(woosuk): While the draft probability should never be 0,
                # we check it to avoid NaNs. If it happens to be 0, we reject.
                accepted = draft_prob > 0 and target_prob / draft_prob >= uniform_prob

Source: vllm/v1/sample/rejection_sampler.py

Accept the draft token $x$ with probability $\min\left(1, p(x)/q(x)\right)$. The intuition: if the target wanted $x$ at least as much as the drafter did ($p(x) \geq q(x)$), the draft is always kept, because the drafter did not over-propose it. But if the drafter over-proposes $x$ relative to the target’s true taste ($p(x) < q(x)$), you keep it only with probability $p(x)/q(x)$, scaling down exactly the excess. That is the accept step, and on its own it would emit x too rarely for the tokens the target slightly disfavored and never emit tokens the drafter happened not to guess at all, so the distribution would be biased.

The rejection step repairs that bias. On rejection you do not simply resample from $p$ — that would double-count the probability mass the accept step already spent. Instead you sample from the residual distribution $\max(p - q, 0)$, normalized. Read that residual as “the part of the target’s distribution the drafter under-served”: wherever the target wanted a token more than the drafter proposed it, the leftover mass $p - q$ lives there, and drawing from it precisely fills the gap the accept step left. Add the two steps together and the token you finally emit is distributed exactly as $p$. vLLM precomputes those “recovered” tokens (the residual draws) in sample_recovered_tokens before the kernel runs, using a Gumbel-max trick over max(target_prob - draft_prob, 0.0), so the kernel can grab one with a single load when it needs it. The math is subtle; the payoff is that the output stream is provably indistinguishable from ordinary sampling. A reviewer who is nervous about correctness can stop worrying here: this is the whole point of the design.

One detail worth flagging for n-gram drafting, which has no probabilities of its own: the kernel falls back to draft_prob = 1 when NO_DRAFT_PROBS is set, which reduces the random rule to “accept iff target_prob >= uniform_prob.” It still samples correctly; it just cannot claim the draft agreed with the target, so its acceptance rate is whatever the target happens to assign.

Who drafts: a zoo of proposers behind one interface

The proposers in vllm/v1/spec_decode/ differ entirely in how they produce candidates and not at all in how those candidates are verified. They sit on a rough spectrum from “no model at all” to “a second neural network.” The choice is a trade between how cheap the drafter is to run and how often its guesses are accepted: a free CPU-side lookup wins big when it hits but contributes nothing when it misses, while a learned drafter costs real GPU time per step but lands far more of its drafts. The right metric is not raw draft length but the expected number of accepted tokens per step: if each draft position $i$ is accepted with probability $\alpha_i$, a step that drafts $k$ tokens emits on average

$$\mathbb{E}[\text{tokens per step}] = 1 + \sum_{i=1}^{k} \prod_{j=1}^{i} \alpha_j$$

tokens (the leading $1$ is the guaranteed target token, and each draft only counts if every earlier draft in its chain was also accepted). That formula is the whole economics of speculation in one line, and it is sharply nonlinear: the curve below evaluates it for a constant per-position acceptance rate $\alpha$ and shows expected tokens per step climbing as $\alpha$ rises. Two things jump out. First, the payoff is convex in $\alpha$, so the last stretch of acceptance is worth far more than the first. Second, a longer draft $k$ only helps when acceptance is already high, because every extra position is gated behind the product of all the ones before it. Drafting eight tokens at $\alpha = 0.5$ yields about two per step, barely more than drafting two; drafting eight at $\alpha = 0.9$ yields over six.

Computed exactly from the formula above under the simplifying assumption that every draft position shares one acceptance rate $\alpha$; real drafters have a different $\alpha_i$ at each position, which the next chart shows.

The diagram below lays out that spectrum; the prose then walks it left to right.

flowchart LR
    L["cheaper to run,<br/>no GPU cost"] --> A["n-gram / prompt lookup<br/>(no model, copies past text)"]
    A --> B["suffix decoding<br/>(suffix tree, dynamic depth)"]
    B --> C["Medusa<br/>(parallel heads on target)"]
    C --> D["EAGLE / MTP<br/>(autoregressive feature drafter)"]
    D --> E["draft model<br/>(small standalone network)"]
    E --> R["heavier drafter,<br/>real GPU latency per step"]

The cheapest is n-gram / prompt lookup (ngram_proposer.py). An n-gram here just means a short run of $n$ consecutive tokens. The proposer does not run a model at all; it takes the last few tokens the request has produced (the current suffix), searches earlier in the same context for a place where that exact run appeared before, and proposes whatever tokens followed it last time as the draft. If the model said “the quick brown fox” once and has just produced “the quick brown” again, the obvious guess is “fox.” This is shockingly effective for tasks with verbatim copying — code editing, RAG with quoted context, summarization that echoes the source. The core is a Knuth-Morris-Pratt-style longest-suffix-match (KMP is the classic linear-time string-search algorithm), JIT-compiled with numba and run across the batch in parallel. The proposer’s docstring states the rule plainly:

    """
    Find the longest n-gram which matches the suffix of the given tokens
    whose length is within [min_ngram, max_ngram] (inclusive).

    If found, we will extract k right after the matched ngram.
    """

Because it costs nothing but CPU cycles, n-gram drafting is essentially free latency upside when it hits, and a no-op when it misses. The same file is careful never to draft past the model length (k = min(k, max_model_len - total_token)), which is the kind of boundary bookkeeping every proposer has to get right.

A step up is suffix decoding (suffix_decoding.py), which generalizes n-gram lookup to a per-prompt suffix tree maintained across requests (it wraps Arctic Inference’s implementation and is described in arXiv:2411.04975). It can speculate a dynamic number of tokens per step depending on how confident the tree is, rather than a fixed $k$.

Then come the learned drafters. Medusa (medusa.py) is the simplest of these. Its contribution, Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv:2401.10774), is to bolt several small feed-forward “heads” onto the target model itself, each trained to predict the token at offset +1, +2, +3, and so on from the current hidden state. There is no separate model to run autoregressively — one forward pass of the target produces a hidden state, and the heads fan it out into several draft tokens at once:

        # Generate blocks and compute logits
        blocks = self.model(target_hidden_states)
        logits = self.model.compute_logits(blocks)

        # Compute argmax for each Medusa head and stack into a single tensor
        # Shape: [batch_size, num_heads]
        draft_tokens = torch.stack([logit.argmax(dim=-1) for logit in logits], dim=1)

The catch, which Medusa accepts, is that the +2 head is predicting from the same hidden state as the +1 head, with no knowledge of what +1 actually emitted. The drafts are not properly conditioned on each other, so acceptance falls off fast with depth.

EAGLE fixes exactly that. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (arXiv:2401.15077) argues that the right thing to draft autoregressively is not tokens but the target’s feature (hidden state) sequence, which is smoother and more predictable, and then to read tokens off those features. The EAGLE head is a small transformer that does run autoregressively, consuming the target’s last hidden state to draft a properly-conditioned chain. In vLLM this is almost pure configuration on top of a shared base — the entire proposer is:

class EagleProposer(SpecDecodeBaseProposer):
    def __init__(
        self,
        vllm_config: VllmConfig,
        device: torch.device,
        runner=None,
    ):
        super().__init__(
            vllm_config,
            device,
            pass_hidden_states_to_model=True,
            runner=runner,
        )

Source: vllm/v1/spec_decode/eagle.py

That pass_hidden_states_to_model=True flag is the whole distinction from a generic draft model (draft_model.py, the classic 2211.17192 setup of a small standalone model of the same family), which sets it to False and works from tokens alone. EAGLE and its MTP variants are the production default for most 2026 open models that ship a trained draft head; the long list of *_mtp method names in vllm/config/speculative.py is the roster of models that bake the draft head into their release.

All of this is selected through one config surface. SpeculativeConfig in vllm/config/speculative.py carries num_speculative_tokens (the $k$ above), the method, and the optional draft model, and can auto-detect the method from the model when you do not name it.

Fitting speculation into the token-budget scheduler

Here is where Chapter 5 pays off. We promised that the token-budget scheduler treats prefill, decode, chunking, and speculation as one problem, and the design note in vllm/v1/core/sched/scheduler.py says so outright:

        # NOTE(woosuk) on the scheduling algorithm:
        # There's no "decoding phase" nor "prefill phase" in the scheduler.
        # Each request just has the num_computed_tokens and
        # num_tokens_with_spec. num_tokens_with_spec =
        # len(prompt_token_ids) + len(output_token_ids) + len(spec_token_ids).
        # At each step, the scheduler tries to assign tokens to the requests
        # so that each request's num_computed_tokens can catch up its
        # num_tokens_with_spec.

Unpack the two counters in that note. num_computed_tokens is how many tokens of this request the model has actually run through a forward pass; num_tokens_with_spec is how many tokens the request wants run, including the speculative tail the proposer just appended as spec_token_ids. A speculating request is simply one whose num_tokens_with_spec is a few tokens ahead of where it would otherwise be. The scheduler does not have a special “verify a draft” code path; it allocates num_new_tokens = num_tokens_with_spec - num_computed_tokens slots, exactly as it would for a chunk of prefill, and the same forward pass that would have decoded one token now happens to cover last_token plus the $k$ drafts. This is the elegance of the V1 design: speculation rides on the same token-level rate limiter you already understand, and the same KV-block allocation, with no separate machinery.

The interesting half is the cleanup after the model runs, because the scheduler advanced num_computed_tokens optimistically. Before the forward pass it had no way to know how many drafts would be accepted, so it assumed the best case and counted all $k$ drafts as computed. The forward pass then verified $k$ drafts, the rejection sampler accepted some prefix of them, and now num_computed_tokens is too high by exactly the number of rejected drafts — those positions ran through the model but their tokens never became part of the real sequence. The fix is to roll the counter back. From update_from_output:

            if scheduled_spec_token_ids and generated_token_ids:
                num_draft_tokens = len(scheduled_spec_token_ids)
                num_accepted = len(generated_token_ids) - 1
                num_rejected = num_draft_tokens - num_accepted
                ...
                if request.num_computed_tokens > 0:
                    request.num_computed_tokens -= num_rejected

Source: vllm/v1/core/sched/scheduler.py

That num_accepted = len(generated_token_ids) - 1 is the bonus token at work: the sampler always returns at least one real token (the bonus, or the recovered token at the rejection point), so everything beyond that first one is an accepted draft. The rejected tail is rolled back by decrementing num_computed_tokens, and the KV slots those rejected tokens wrote are simply overwritten on the next step. No preemption, no recompute — speculation degrades to ordinary decode when acceptance is zero, costing only the wasted draft compute.

The sequence diagram below traces one full speculative step across the three actors, including the optimistic advance and the corrective rollback. Read it top to bottom as a single iteration of the loop from the first diagram in this chapter.

sequenceDiagram
    participant S as Scheduler
    participant P as Proposer drafter
    participant M as Model and RejectionSampler
    P->>S: append spec_token_ids (k drafts)
    S->>S: advance num_computed_tokens by k (optimistic)
    S->>M: allocate slots for last_token + k drafts
    M->>M: one forward pass over the whole prefix
    M->>M: walk drafts, accept prefix, add bonus token
    M->>S: generated_token_ids (accepted + 1 bonus)
    S->>S: num_rejected = k - (len - 1)
    S->>S: roll back num_computed_tokens by num_rejected
    Note over S,M: rejected KV slots overwritten next step

The engine also keeps score. SpecDecodingStats in vllm/v1/spec_decode/metrics.py tracks acceptance both in aggregate and per position:

    def observe_draft(self, num_draft_tokens: int, num_accepted_tokens: int):
        self.num_drafts += 1
        self.num_draft_tokens += num_draft_tokens
        self.num_accepted_tokens += num_accepted_tokens
        assert num_accepted_tokens <= self.num_spec_tokens
        for i in range(num_accepted_tokens):
            self.num_accepted_tokens_per_pos[i] += 1

The per-position array is the diagnostic you will actually use in production. It tells you the acceptance decay curve: how often position 0 is accepted versus position 3. A curve that collapses after the first token means your $k$ is set too high for this drafter on this traffic, and you are burning verification compute on drafts that never land. The chart below sketches that decay for two kinds of drafter on the same traffic: a strong feature drafter (EAGLE-style) whose per-position acceptance falls slowly, and a weak drafter (a shallow n-gram match, say) that lands the first guess often but collapses by position two or three. The shaded region under each is exactly the $\alpha_i$ sequence that feeds the chained product in the formula above, so a curve that has gone flat means the drafts at those positions are pure wasted verification.

Illustrative shapes, not measured: the exact numbers are model- and traffic-dependent, and the num_accepted_tokens_per_pos array is precisely what you would log to recover the real curve.

Where it helps, where it hurts, what is unsolved

Speculative decoding is not free throughput; it is a trade. You spend extra FLOPs (the draft, plus verifying tokens you may discard) to buy fewer memory passes per accepted token. That trade is favorable precisely when the GPU has spare FLOPs — i.e. at low-to-moderate batch sizes, where decode is most bandwidth-starved. At high concurrency, where continuous batching has already filled the compute, speculation competes with real requests for the same FLOPs and can reduce aggregate throughput even as it lowers single-request latency. This is the central tension to internalize: speculation and batching both target idle compute, so they are partly substitutes, and the right num_speculative_tokens is load-dependent. A server tuned for a packed batch should speculate little or not at all; one serving a latency-sensitive trickle should speculate aggressively.

The open problems are real. Dynamic speculation length — letting $k$ track the live acceptance curve and the current batch size rather than sitting at a fixed config value — is only partly solved (suffix decoding’s variable depth is one attempt). Draft quality is the other frontier: EAGLE-style feature drafting raised acceptance dramatically over plain draft models, but every gain in the drafter costs drafter latency, and the optimum is model- and task-specific. And there is an interaction we have deliberately deferred: when output is constrained to a grammar, the bitmask that bans illegal tokens must be applied to draft positions too, and rolled back over rejected drafts so the FSM state stays consistent. That handshake is the explicit reason Chapter 14 builds on this one — structured generation has to teach the constraint engine to speak speculation’s accept/reject language. We pick it up there.

Further reading

  • Fast Inference from Transformers via Speculative Decoding — arXiv:2211.17192 — the rejection-sampling rule vLLM implements verbatim; read it to understand why accepted tokens are distributed exactly as the target.
  • EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — arXiv:2401.15077 — draft at the feature (hidden-state) level autoregressively rather than at the token level; the production-default drafter for trained heads.
  • Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — arXiv:2401.10774 — bolt parallel prediction heads onto the target model itself; simplest learned drafter, with acceptance that decays fast with depth.
  • SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications — arXiv:2411.04975 — a model-free approach that generalizes n-gram lookup to per-prompt suffix trees with dynamic speculation depth.