The sampler and the egress path: logits to streamed text

Everything in Part II so far has been about getting the forward pass to happen efficiently: paging the KV cache, budgeting the batch, slicing prefills, reusing prefixes, dispatching the right attention kernel, and finally (in Chapter 10) replaying the whole thing as a CUDA graph while the next step’s scheduling overlaps on the CPU. But a forward pass produces a tensor of hidden states, and a hidden-state tensor is not an HTTP response. This chapter follows the last stretch of the pipeline: from the model’s output projection, through the sampler, across the process boundary back to the API server, and out as Server-Sent Events.

It is a deceptively boring-sounding stretch, which is exactly why it’s worth a chapter. Two things hide here that people get wrong. First, the sampler is a small GPU pipeline with exactly one mandatory GPU-to-CPU synchronization, and async scheduling from Chapter 10 turns even that into something you can skip on the hot path. Second, the path from sampled token ids to a correct SSE stream contains a distributed-cancellation race between two processes, and getting it wrong means either leaked GPU work or truncated outputs.

Before diving into either, it helps to have the whole route in view. The diagram below traces a single token from the moment the forward pass finishes to the moment its text leaves the server. Notice that the journey crosses a process boundary in the middle: the left half lives in the engine process (on and near the GPU), the right half lives in the frontend process (the HTTP server). That boundary is where the second problem hides.

flowchart LR
    subgraph engine["engine process (on and near the GPU)"]
        A["hidden states (GPU)"] --> B["gather last position per request"]
        B --> C["output projection to vocab logits"]
        C --> D["sampler: penalties, temperature, top-k/top-p, draw"]
        D --> E["sampled token ids (still on GPU)"]
        E --> F["the one D2H sync: copy ids to CPU"]
    end
    F -->|"ZMQ socket"| G
    subgraph frontend["frontend process (HTTP server)"]
        G["receive EngineCoreOutput"] --> H["incremental detokenizer: ids to text"]
        H --> I["stop-string check and abort handshake"]
        I --> J["SSE frame: data ... then DONE"]
    end

Only the last position matters

Recall from Chapter 1 that decode appends one token at a time, and from Chapter 5 that a continuously-batched step is a flat bag of tokens with no prefill/decode distinction. A subtle consequence: most of the positions in that flat bag produce hidden states you will immediately throw away. During prefill, the model computes a hidden state for every prompt token, but you only ever sample from the last one. During a chunked prefill (Chapter 6) you don’t even want that, because the partial request’s last position isn’t the real end of the prompt yet.

So before the engine pays for the (large) output projection from hidden dimension to vocabulary size, it gathers only the rows it actually needs. The “output projection” is the final linear layer that maps each hidden state (a vector of size hidden_dim, a few thousand) to a logit vector of size vocab_size (often 128k or more). One logit per vocabulary entry, per position. Running that projection on every position in the flat batch would produce a giant [num_tokens, vocab] tensor, almost all of which is discarded. So the runner picks the keepers first. In the model runner:

# vllm/v1/worker/gpu_model_runner.py
logits_indices = query_start_loc[1:] - 1

Source: vllm/v1/worker/gpu_model_runner.py

To unpack that one line: query_start_loc is the cumulative-sum boundary array for the batch (the same ragged-sequence bookkeeping the attention kernel uses in Chapter 9). If three requests contribute 5, 3, and 8 tokens to the flat batch, query_start_loc is [0, 5, 8, 16]: each of the first three entries is where a request starts, and the final entry is the total token count. Dropping the first entry with [1:] gives [5, 8, 16], the position just past the end of each request, and subtracting 1 gives [4, 7, 15], the last token of each request. So logits_indices is, in plain terms, “the last position of each request.” Those indices select hidden states, and only then is the vocabulary projection applied:

# vllm/v1/worker/gpu_model_runner.py
sample_hidden_states = hidden_states[logits_indices]
logits = self.model.compute_logits(sample_hidden_states)

Source: vllm/v1/worker/gpu_model_runner.py

This is why Chapter 3 could say prefill “produces no logits worth keeping.” The compute-bound prefill phase exists to fill the KV cache; the only logit it yields is the one that seeds decode. Gathering first also keeps the expensive [num_sampled, vocab] logits tensor as small as the batch allows.
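The index arithmetic above is easy to check by hand. The following is a minimal pure-Python sketch of the same bookkeeping; the helper name last_token_indices is made up for illustration, and the real runner does this with tensors on the device rather than Python lists.

```python
from itertools import accumulate


def last_token_indices(tokens_per_request):
    """Return (query_start_loc, logits_indices) for a flat batch.

    Hypothetical helper for illustration only.
    """
    # Cumulative boundaries: entry i is where request i starts, and the
    # final entry is the total number of tokens in the flat batch.
    query_start_loc = [0, *accumulate(tokens_per_request)]
    # Dropping the leading 0 gives one-past-the-end of each request;
    # stepping back by one lands on each request's last token.
    logits_indices = [end - 1 for end in query_start_loc[1:]]
    return query_start_loc, logits_indices


starts, last = last_token_indices([5, 3, 8])
print(starts)  # [0, 5, 8, 16]
print(last)    # [4, 7, 15]
```

Only three rows out of sixteen reach the vocabulary projection in this example, which is the whole point of gathering first.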

The sampler as an ordered pipeline

The sampler turns a logit vector into a single chosen token id. A logit is an unnormalized score for each vocabulary entry; a softmax over the logits would turn them into a probability distribution, and “sampling” means drawing one token from that distribution. Between the raw logits and the draw sits a stack of transforms that reshape the distribution: penalties that discourage repetition, masks that forbid certain tokens, temperature that flattens or sharpens the curve, and top-k/top-p that lop off the unlikely tail. The sampler is an nn.Module, and its docstring is unusually candid about being a fixed sequence of those stages. The class comment in vllm/v1/sample/sampler.py lays out the order explicitly: compute logprobs if requested, cast to float32, apply allowed-token and bad-word masks, apply the non-argmax-invariant logit processors and penalties, then sample, which itself temperature-scales, applies argmax-invariant processors (min-p), applies top-k/top-p, and draws.

That order is the crux, so it is worth naming the organizing principle: every transform is classified by whether it can move the argmax (the single highest-scoring token). A transform is “argmax-invariant” if it can never change which token is on top, and “non-argmax-invariant” if it can. This matters because a greedy request (temperature zero) simply takes the argmax, so any argmax-invariant transform is wasted work for it. The pipeline below puts all the argmax-changing transforms first, outside sample(), and tucks the argmax-preserving ones inside sample() where a greedy request can skip them entirely. The diagram traces both paths through the sampler.

flowchart TD
    L["logits (gathered, one row per request)"] --> F32["cast to float32"]
    F32 --> M["apply allowed-token and bad-word masks"]
    M --> P["apply penalties and logit bias (can move the argmax)"]
    P --> Q{"all greedy?"}
    Q -->|"yes"| G["greedy_sample: just take the argmax"]
    Q -->|"no"| S["sample(): temperature, min-p, top-k/top-p, draw"]
    S --> W{"row is greedy?"}
    W -->|"yes (mixed batch)"| GR["use that row's argmax"]
    W -->|"no"| RD["use that row's random draw"]
    G --> OUT["sampled token ids (GPU tensors)"]
    GR --> OUT
    RD --> OUT
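The argmax-invariant versus non-argmax-invariant classification can be checked numerically. The sketch below uses plain Python lists and three toy transforms; the function names and the additive penalty are simplifications invented for this example, not vLLM’s implementations.

```python
def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=xs.__getitem__)


def scale_by_temperature(logits, temperature):
    # Dividing by a positive constant preserves the ordering of the logits.
    return [x / temperature for x in logits]


def keep_top_k(logits, k):
    # Everything below the k-th largest logit is masked to -inf;
    # the largest logit always survives.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]


def penalize(logits, seen_token_ids, penalty):
    # Toy additive penalty on already-generated tokens (an assumption for
    # this sketch); it can push the current leader below a rival.
    return [x - penalty if i in seen_token_ids else x
            for i, x in enumerate(logits)]


logits = [2.0, 3.5, 3.0, -1.0]
print(argmax(logits))                              # 1
print(argmax(scale_by_temperature(logits, 0.7)))   # 1 (invariant)
print(argmax(keep_top_k(logits, 2)))               # 1 (invariant)
print(argmax(penalize(logits, {1}, 1.0)))          # 2 (argmax moved)
```

Temperature and top-k leave the winner alone, so a greedy request loses nothing by skipping them; the penalty changes the winner, so it has to run before the greedy shortcut.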

Order is not arbitrary here, and the code is careful about which transforms can change a greedy result. Penalties, logit bias, and masks come first because they can shift the argmax. Temperature, min-p, and top-k/top-p come inside sample() because for a greedy request they are irrelevant, so the sampler can short-circuit:

# vllm/v1/sample/sampler.py
if sampling_metadata.all_random:
    greedy_sampled = None
else:
    greedy_sampled = self.greedy_sample(logits)
    if sampling_metadata.all_greedy:
        ...
        return greedy_sampled, processed_logprobs

Source: vllm/v1/sample/sampler.py

A pure-greedy batch never sorts a single logit; it returns from greedy_sample before sample() is even called. The interesting case is a mixed batch, where some requests in the same step want greedy decoding and others want random sampling. The GPU does not branch per row cheaply, so vLLM computes both answers for the whole batch (the greedy argmax for every row and the random draw for every row) and then selects per row with a torch.where keyed on whether each request’s temperature is below epsilon. The greedy rows pay for a random draw they throw away, and the random rows pay for an argmax they throw away, but both run as dense vectorized kernels with no data-dependent control flow, which on a GPU is far cheaper than trying to handle each row separately. Note the small honesty in apply_temperature: it rewrites a zero temperature to 1.0 before dividing, purely to avoid a divide-by-zero for the greedy rows whose result will be discarded anyway.

# vllm/v1/sample/sampler.py
@staticmethod
def apply_temperature(logits, temp, all_random):
    if not all_random:
        temp = torch.where(temp < _SAMPLING_EPS, 1.0, temp)
    return logits.div_(temp.unsqueeze(dim=1))

Source: vllm/v1/sample/sampler.py
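The mixed-batch strategy described above (compute both answers for every row, then select per row) can be sketched in pure Python. This is an illustration under assumptions: SAMPLING_EPS stands in for _SAMPLING_EPS with a guessed value, the loop replaces what would be dense tensor kernels, and rng.choices replaces the real random draw.

```python
import math
import random

SAMPLING_EPS = 1e-5  # assumed value; stands in for _SAMPLING_EPS


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mixed_batch(logits, temperatures, rng):
    """Compute a greedy and a random candidate per row, then pick one."""
    # Greedy candidate for every row, whether or not the row wants it.
    greedy = [max(range(len(row)), key=row.__getitem__) for row in logits]
    # Rewrite near-zero temperatures to 1.0 so the division is safe; the
    # random draw for those rows is discarded below anyway.
    safe_temps = [1.0 if t < SAMPLING_EPS else t for t in temperatures]
    drawn = []
    for row, t in zip(logits, safe_temps):
        probs = softmax([x / t for x in row])
        drawn.append(rng.choices(range(len(row)), weights=probs)[0])
    # Per-row select, the list analogue of a torch.where on temperature.
    return [g if t < SAMPLING_EPS else d
            for g, d, t in zip(greedy, drawn, temperatures)]


batch_logits = [[0.1, 5.0, 0.2], [1.0, 1.0, 1.0]]
batch_temps = [0.0, 1.0]  # row 0 is greedy, row 1 samples
picked = sample_mixed_batch(batch_logits, batch_temps, random.Random(0))
print(picked[0])  # 1, the argmax of the greedy row
```

The greedy row always returns its argmax regardless of the seed; only the second row’s result depends on the random state.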

The whole thing runs in float32 (logits = logits.to(torch.float32)) even when the model ran in bf16, because the softmax and the cumulative sums that top-p needs are numerically nasty in low precision and this tensor is small enough that the cast is cheap.

Temperature is the one knob that literally reshapes the softmax curve, and the effect is hard to picture from the formula alone: dividing every logit by $T$ before the softmax sharpens the distribution toward the argmax when $T<1$ and flattens it toward uniform when $T>1$. The curves below take a fixed set of eight logits and apply three temperatures. At $T=0.5$ the top token carries over 80% of the mass (a near-greedy draw); at $T=1$ the model’s native distribution shows through; at $T=2$ the mass spreads out and the long tail becomes far more likely to be sampled.

Illustrative: probabilities computed from a fixed eight-entry logit vector [4, 3, 2.5, 2, 1.5, 1, 0.5, 0]; the shape is exact for these logits but the logits themselves are chosen for clarity, not measured.
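The numbers behind that illustration can be reproduced in a few lines. This sketch applies a temperature-scaled softmax to the same eight logits; the function name is made up for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


LOGITS = [4, 3, 2.5, 2, 1.5, 1, 0.5, 0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(LOGITS, t)
    print(f"T={t}: top token {probs[0]:.3f}, last token {probs[-1]:.4f}")
```

For these logits the top token’s share comes out at roughly 0.82 at $T=0.5$, 0.52 at $T=1$, and 0.31 at $T=2$, which matches the sharpen-then-flatten pattern described above.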

Where top-p came from

Top-p, or nucleus sampling, has a clean provenance. The Curious Case of Neural Text Degeneration (arXiv:1904.09751) observed that greedy and beam search on large LMs produce degenerate, repetitive text, and that naive top-k truncation either leaves in garbage or cuts off the natural variety of the distribution. Their fix was to sample from the smallest set of tokens whose cumulative probability exceeds a threshold $p$, a set whose size adapts to how peaked the distribution is. That adaptive nucleus is exactly what the cumulative-sum mask in vLLM implements; the paper is the fastest way to understand why top-p is the production default rather than top-k.

Sorting is the enemy

The textbook way to do top-p is to sort the vocabulary, take a cumulative sum, and mask the tail. vLLM’s native path does precisely that:

# vllm/v1/sample/ops/topk_topp_sampler.py
logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
...
probs_sort = logits_sort.softmax(dim=-1)
probs_sum = torch.cumsum(probs_sort, dim=-1, out=probs_sort)
top_p_mask = probs_sum <= 1 - p.unsqueeze(dim=1)

Source: vllm/v1/sample/ops/topk_topp_sampler.py
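Because the sort is ascending, the mask reads slightly backwards from the usual description: it removes the low-probability tail whose cumulative mass fits inside $1 - p$, which leaves the nucleus. A pure-Python sketch of that idea for a single row follows; the function name is hypothetical, and always keeping the top token is written here as an explicit safeguard of the sketch.

```python
import math


def top_p_filter(logits, p):
    """Return a copy of logits with tokens outside the nucleus set to -inf."""
    order = sorted(range(len(logits)), key=logits.__getitem__)  # ascending
    m = max(logits)
    exps = [math.exp(logits[i] - m) for i in order]
    total = sum(exps)
    filtered = list(logits)
    running = 0.0
    for rank, token_id in enumerate(order):
        running += exps[rank] / total  # cumulative mass from the bottom up
        is_top_token = rank == len(order) - 1
        # Mask the tail while its cumulative mass still fits within 1 - p.
        # The most likely token is never masked.
        if running <= 1.0 - p and not is_top_token:
            filtered[token_id] = float("-inf")
    return filtered


row = [math.log(0.7), math.log(0.15), math.log(0.1), math.log(0.05)]
kept = top_p_filter(row, p=0.8)
print([x != float("-inf") for x in kept])  # [True, True, False, False]
```

With $p = 0.8$ the two rarest tokens (combined mass 0.15) are dropped and the surviving pair carries 0.85, the smallest set whose mass exceeds the threshold.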

A full sort over a 128k-or-larger vocabulary, for every request, every step, is not free. So on CUDA, vLLM prefers FlashInfer’s sorting-free sampler, which uses a rejection-sampling scheme to draw from the truncated distribution without ever materializing a sorted order:

# vllm/v1/sample/ops/topk_topp_sampler.py
next_token_ids = flashinfer.sampling.top_k_top_p_sampling_from_logits(
    logits, k, p, deterministic=True
)

Source: vllm/v1/sample/ops/topk_topp_sampler.py

The honesty here is in the docstring: this is statistically equivalent to the sorting path, not bit-identical to it. The dispatch is decided once at construction in TopKTopPSampler.__init__, which binds self.forward to forward_cuda only when flashinfer_sampler_supported() holds and the logprobs mode doesn’t need post-filter logits (FlashInfer doesn’t expose them). Even forward_cuda falls back to the native path when there’s nothing to filter or when per-request RNG generators are present, which FlashInfer 0.2.3+ can’t honor. This is a recurring shape in vLLM: a fast kernel for the common case, guarded by a wall of correctness conditions that quietly route the awkward cases to the slow, simple path.

One more detail concerns how the random draw itself is done. The obvious way to draw a token from a probability distribution is torch.multinomial, but it causes a CPU-GPU sync, which would defeat everything Chapter 10 set up. So random_sample uses the Gumbel-max trick instead. The trick is a small identity: for a categorical distribution with probabilities $p_i$, if you perturb each log-probability with independent Gumbel noise $g_i$ and then take the argmax, the token you land on is distributed exactly as if you had sampled from the original distribution:

$$\arg\max_i \left( \log p_i + g_i \right) \sim p$$

Equivalently, dividing each probability by an independent unit-exponential sample $q_i$ and taking $\arg\max_i (p_i / q_i)$ gives the same draw. The payoff is that an argmax is a plain reduction with no host round-trip, so the entire draw stays on the GPU. vLLM draws the exponential noise and takes that argmax:

# vllm/v1/sample/ops/topk_topp_sampler.py
def random_sample(probs, generators, use_fp64_gumbel=False):
    """We use this function instead of torch.multinomial because
    torch.multinomial causes CPU-GPU synchronization."""
    q = empty_exponential_noise_like(probs, use_fp64_gumbel)
    if len(generators) != probs.shape[0]:
        q.exponential_()
    ...
    return sample_with_exponential_noise(probs, q)

Source: vllm/v1/sample/ops/topk_topp_sampler.py
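The exponential-noise form of the identity is easy to verify empirically. The sketch below is a CPU-only, standard-library illustration; exponential_race_sample is a name invented here, and the sample count and seed are arbitrary choices.

```python
import random
from collections import Counter


def exponential_race_sample(probs, rng):
    """Draw an index distributed according to probs using argmax(p_i / q_i)."""
    # q_i ~ Exponential(1), drawn independently for each entry.
    scores = [p / rng.expovariate(1.0) for p in probs]
    # A plain argmax: on a GPU this is a reduction with no host round-trip.
    return max(range(len(probs)), key=scores.__getitem__)


rng = random.Random(0)
probs = [0.6, 0.3, 0.1]
num_draws = 20_000
counts = Counter(exponential_race_sample(probs, rng) for _ in range(num_draws))
freqs = [counts[i] / num_draws for i in range(len(probs))]
print([round(f, 2) for f in freqs])
```

The observed frequencies should land close to the target probabilities. The reason is that $q_i / p_i$ is exponential with rate $p_i$, and the smallest of a set of independent exponentials falls at index $i$ with probability proportional to its rate.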

The one mandatory sync

After all of that, the sampler returns GPU tensors. The comment in sampler.py is blunt: “These are GPU tensors.” The sampled token ids live in device memory. But the scheduler, the detokenizer, and the API server all run on the CPU, in (mostly) a different process. At some point the ids have to come down. That copy, a device-to-host transfer (D2H), is the single mandatory GPU-to-CPU synchronization of the entire step. The reason it is a synchronization and not just a copy is worth pinning down: GPU work is normally launched asynchronously, with the CPU queueing kernels and racing ahead without waiting for results. But to read an actual value the GPU computed, such as which token it sampled, the CPU has no choice but to stop and wait until that value is finished and copied down. Everything else, the whole forward pass, can stay asynchronous; this is the one place the CPU must learn a value the GPU computed.

The runner does this copy through a pinned host buffer for speed, and it is deliberate about how it waits:

# vllm/v1/worker/gpu_model_runner.py
def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
    # `tolist` would trigger a cuda wise stream sync, which
    # would block other copy ops from other cuda streams.
    # A cuda event sync would avoid such a situation.
    pinned = self.sampled_token_ids_pinned_cpu[: sampled_token_ids.shape[0]]
    pinned.copy_(sampled_token_ids, non_blocking=True)
    self.transfer_event.record()
    self.transfer_event.synchronize()
    return pinned.tolist()

Source: vllm/v1/worker/gpu_model_runner.py

The subtlety is the difference between two ways of waiting. A naive tensor.tolist() triggers a device-wide stream synchronize: it blocks until every outstanding operation on the device is done, which also stalls unrelated copy streams that happen to be running other work (the comment links a real regression that hurt a disaggregated setup, foreshadowing Chapter 17). A CUDA event is narrower: transfer_event.record() drops a marker into the stream right after this one copy, and transfer_event.synchronize() waits only until execution reaches that marker. So the CPU blocks for exactly this copy and nothing else. This is exactly the kind of micro-cost that Chapter 10 cared about: when the GPU finishes a decode step in under a millisecond, a stray device-wide sync is a measurable tax.

Skipping even the sync: the async-scheduling shortcut

Chapter 10 introduced async scheduling, where the CPU plans step N+1 while the GPU runs step N. There is a chicken-and-egg problem buried in it: step N+1 needs the token sampled in step N as its input. If the CPU has to wait for that token to come down before it can build the next batch, the overlap collapses. vLLM’s answer is to not bring the token down at all on the hot path. In _bookkeeping_sync, the async branch keeps the sampled ids on the GPU and remembers their layout:

# vllm/v1/worker/gpu_model_runner.py
else:
    valid_sampled_token_ids = []
    invalid_req_indices = discard_sampled_tokens_req_indices.tolist()
    ...
    # Cache the sampled tokens on the GPU and avoid CPU sync.
    if self.input_batch.prev_sampled_token_ids is None:
        assert sampled_token_ids.shape[-1] == 1
        self.input_batch.prev_sampled_token_ids = sampled_token_ids

Source: vllm/v1/worker/gpu_model_runner.py

The next step’s input preparation copies prev_sampled_token_ids directly into the input buffer on-device, so the token round-trips GPU-to-GPU and never touches the CPU until it’s needed for output. Notice it builds an invalid_req_indices list instead of clearing tokens immediately: for partial (chunked-prefill) requests we sampled a junk token “for simplicity,” and those indices get blanked later. The host copy itself is deferred into AsyncGPUModelRunnerOutput, which fires the D2H on a separate copy stream and only blocks when someone calls get_output():

# vllm/v1/worker/gpu_model_runner.py
with torch.cuda.stream(async_output_copy_stream):
    async_output_copy_stream.wait_stream(default_stream)
    self.sampled_token_ids_cpu = self._sampled_token_ids.to(
        "cpu", non_blocking=True)
    ...
    self.async_copy_ready_event.record()

Source: vllm/v1/worker/gpu_model_runner.py

So the one mandatory sync is real, but it’s been pushed as far down the pipeline as it can go, off the critical path that gates the next forward.

Across the process boundary

Here the architecture from Chapter 1 reasserts itself. vLLM splits the frontend (HTTP, tokenization, detokenization, output processing) from the EngineCore (scheduler plus model runner) into separate processes, wired together over ZMQ. The core_client.py header states the roster plainly:

# vllm/v1/engine/core_client.py
* InprocClient: In process EngineCore (for V0-style LLMEngine use)
* SyncMPClient: ZMQ + background proc EngineCore (for LLM)
* AsyncMPClient: ZMQ + background proc EngineCore w/ asyncio (for AsyncLLM)

This split exists so that Python’s GIL on the busy HTTP server doesn’t steal cycles from the engine loop, and vice versa. The engine emits batches of EngineCoreOutput (token ids, finish reasons, logprob tensors already pulled to CPU) over the socket. On the frontend side, a single background task drains them. In AsyncLLM._run_output_handler, that loop pulls, chunks (so it never hogs the event loop), processes, and crucially handles aborts:

# vllm/v1/engine/async_llm.py
processed_outputs = output_processor.process_outputs(
    outputs_slice, outputs.timestamp, iteration_stats)
...
# 3) Abort any reqs that finished due to stop strings.
if processed_outputs.reqs_to_abort:
    await engine_core.abort_requests_async(
        processed_outputs.reqs_to_abort)

Source: vllm/v1/engine/async_llm.py

That last line is the seam where the cancellation race lives. Hold onto it.

Turning ids back into text, incrementally

The detokenizer’s job is to turn token ids back into the characters a user will read. The naive approach would be to call tokenizer.decode(all_ids) from scratch on every step, but that is both wasteful (re-decoding the entire prefix each step) and, more importantly, wrong at the boundaries. BPE and SentencePiece tokenizers are stateful at the byte level: a single emoji or CJK character spans multiple tokens, so the bytes of one character can arrive split across two decode steps, and a half-character is not valid text. Decoding a prefix can also produce a different string than decoding the whole thing (the classic leading-space problem, where a token’s rendering depends on whether it follows whitespace, and the partial-UTF-8 problem just described). So vLLM detokenizes incrementally, holding a streaming decoder per request that remembers the in-progress byte state from one token to the next and only emits characters once they are complete. The fast path, for tokenizers >= 0.22.0, primes a native DecodeStream with the prompt and steps it one token at a time:

# vllm/v1/engine/detokenizer.py
self.stream = tokenizers.decoders.DecodeStream(
    ids=request.prompt_token_ids,
    skip_special_tokens=self.skip_special_tokens,
)

Source: vllm/v1/engine/detokenizer.py

Each new token is fed through _protected_step, which exists entirely to survive the rough edges of real tokenizers: it catches overflow on bad ids and, on an “Invalid prefix encountered” error from a non-monotonic UTF-8 sequence, resets the stream rather than crashing the request. Both branches cite real issues. Incremental detokenization looks trivial and is full of one-off bugs, which is why the code reads defensively.
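The partial-character problem can be seen without any tokenizer at all. As an analogy only (this is not how DecodeStream is implemented), Python’s standard incremental UTF-8 decoder shows the same hold-until-complete behavior when the bytes of one character arrive in two chunks.

```python
import codecs


def stream_decode(byte_chunks):
    """Decode chunks one at a time, emitting only complete characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    # The decoder keeps any trailing incomplete byte sequence in its own
    # state and prepends it to the next chunk.
    return [decoder.decode(chunk) for chunk in byte_chunks]


emoji_bytes = "\U0001F600".encode("utf-8")  # one character, four bytes
chunks = [b"hi ", emoji_bytes[:2], emoji_bytes[2:]]
pieces = stream_decode(chunks)
print(pieces)  # ['hi ', '', '\U0001F600'] with the emoji rendered

# Decoding the half-character on its own is an error, which is why a
# stateless per-step decode cannot work.
try:
    emoji_bytes[:2].decode("utf-8")
    naive_failed = False
except UnicodeDecodeError:
    naive_failed = True
```

The middle step emits an empty string: nothing is streamed until the character is whole, which is the behavior a per-request streaming decoder needs.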

The stop-string buffer

A stop string is a piece of text that, when generated, should end the request. Stop strings are where detokenization stops being a pure decode and starts being a control-flow decision. Two facts make them tricky. First, a user-supplied stop string like "\n\n" might straddle a token boundary, so it can only be detected after decoding, in the character stream, not in the token-id stream. Second, and worse, if the stop string is not to be included in the output, you cannot stream the last few characters until you are sure they are not the beginning of a stop match. Imagine the stop string is "END" and you have just decoded "...the EN": you must not send those last two characters yet, because the next token might complete "END", in which case they need to vanish from the output. So the detokenizer holds back a buffer of $\max_s |s| - 1$ characters, where $s$ ranges over the configured stop strings and $|s|$ is the length of stop string $s$:

# vllm/v1/engine/detokenizer.py
if self.stop and not self.include_stop_str_in_output:
    self.stop_buffer_length = max(len(s) for s in self.stop) - 1
else:
    self.stop_buffer_length = 0

Source: vllm/v1/engine/detokenizer.py

get_next_output_text then trims that many trailing characters from each streamed delta (revealing the held-back tail only once the request is finished). And check_stop_strings searches only the newly-added characters plus enough lookback to catch a match that spans the previous chunk, then returns where to truncate. The point to internalize: a stop string is detected on the frontend, by inspecting decoded text, not by the engine inspecting token ids. That asymmetry is the whole reason the next section’s race exists.
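The hold-back rule can be sketched as a small streaming helper. StopStringStreamer is a hypothetical class written for this example, not vLLM’s detokenizer; for simplicity it rescans the full text on every push, whereas the text above notes that check_stop_strings searches only the new characters plus a lookback.

```python
class StopStringStreamer:
    """Streams text while withholding a possible stop-string prefix."""

    def __init__(self, stop_strings):
        self.stop_strings = list(stop_strings)
        # Longest stop string minus one: the most characters that could be
        # the start of a match that has not completed yet.
        self.hold_back = max(len(s) for s in self.stop_strings) - 1
        self.text = ""      # all decoded text so far
        self.emitted = 0    # how many characters were already streamed
        self.finished = False

    def push(self, new_text):
        """Add newly decoded text and return the delta safe to stream."""
        self.text += new_text
        hits = [i for i in (self.text.find(s) for s in self.stop_strings)
                if i != -1]
        if hits:
            # Truncate at the earliest match; the stop string is excluded.
            self.text = self.text[:min(hits)]
            self.finished = True
            safe_end = len(self.text)
        else:
            safe_end = max(self.emitted, len(self.text) - self.hold_back)
        delta = self.text[self.emitted:safe_end]
        self.emitted = safe_end
        return delta

    def finish(self):
        """Release the held-back tail once the request ends without a stop."""
        delta = self.text[self.emitted:]
        self.emitted = len(self.text)
        return delta


stopped = StopStringStreamer(["END"])
stopped_deltas = [stopped.push("the EN"), stopped.push("D of")]

continued = StopStringStreamer(["END"])
continued_deltas = [continued.push("the EN"), continued.push("TRY"),
                    continued.finish()]
```

In the first run the withheld "EN" never reaches the client because the next chunk completes "END". In the second run the same two characters are released one step later, once "ENTRY" rules out a match.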

The abort-on-stop race

Put the pieces together. The EngineCore schedules and samples; it does not detokenize, so it cannot see a stop string (it only knows about stop token ids and length limits, which it can enforce itself). The frontend detokenizes and therefore is the only party that can detect a stop string. When it does, the engine is still happily generating that request, one step ahead, wasting GPU on tokens nobody will read. The sequence diagram below traces exactly this: the engine keeps stepping while the abort message is still in flight, and the frontend has to discard the outputs that arrive in the gap.

sequenceDiagram
    participant E as "EngineCore (engine process)"
    participant F as "Output processor (frontend)"
    E->>F: "step N output: token ids"
    F->>F: "detokenize, no stop string yet"
    E->>F: "step N+1 output: completes the stop string"
    F->>F: "detect stop string, finish request locally"
    F->>E: "abort_requests_async(req_id)"
    Note over E: "engine already sampled step N+2 before abort arrived"
    E->>F: "step N+2 output: stale token ids"
    F->>F: "request_states.get(req_id) is None, drop it"
    E->>E: "free request, stop scheduling it"

So the output processor, on detecting a stop string, both finishes the request locally and signals that the engine must be told to abort it:

# vllm/v1/engine/output_processor.py
stop_string = req_state.detokenizer.update(
    new_token_ids, finish_reason == FinishReason.STOP)
if stop_string:
    finish_reason = FinishReason.STOP
    stop_reason = stop_string
...
if not engine_core_output.finished:
    # If req not finished in EngineCore, but Detokenizer
    # detected stop string, abort needed in EngineCore.
    reqs_to_abort.append(req_id)

Source: vllm/v1/engine/output_processor.py

That reqs_to_abort list is what flowed back to the abort_requests_async call in the output handler. The race is the window between “frontend decides to stop” and “engine actually frees the request.” In that window the engine may already have sampled the next token (or several, under speculation) for a request the frontend considers done. The frontend must therefore ignore late-arriving outputs for requests it has already finished, which is exactly what the top of process_outputs does:

# vllm/v1/engine/output_processor.py
req_state = self.request_states.get(req_id)
if req_state is None:
    # Ignore output for already-aborted request.
    continue

Source: vllm/v1/engine/output_processor.py

The other direction of the same race is client disconnect. When an HTTP client hangs up, the generate() async generator is cancelled or garbage-collected, and vLLM must abort the in-flight request or leak GPU work indefinitely:

# vllm/v1/engine/async_llm.py
except (asyncio.CancelledError, GeneratorExit):
    if q is not None:
        await self.abort(q.request_id, internal=True)

Source: vllm/v1/engine/async_llm.py

Both directions resolve to the same primitive, an abort message to the engine, and the same defensive rule, drop outputs whose request state is already gone. The reason most people get this wrong is that a single-process mental model hides it: if detokenization and generation shared a loop, you’d stop the moment you saw the stop string. Across a process boundary, with one side a step ahead, stopping is a two-phase handshake with a guaranteed window of wasted work and a guaranteed stream of stale outputs to discard.
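The handshake and the drop rule can be simulated in a single process. The sketch below is a toy model under loud assumptions: the Frontend class, its fields, and the use of text chunks instead of token ids are all invented for illustration, and a deque stands in for the abort message that would travel back to the engine.

```python
from collections import deque


class Frontend:
    """Toy output processor: detects stop strings, drops stale outputs."""

    def __init__(self, stop_by_request):
        # State exists only while the frontend considers a request live.
        self.request_states = {
            req_id: {"text": "", "stop": stop}
            for req_id, stop in stop_by_request.items()
        }
        self.abort_queue = deque()  # stands in for the abort message
        self.dropped = 0

    def process_output(self, req_id, new_text):
        state = self.request_states.get(req_id)
        if state is None:
            # Output for a request already finished locally: ignore it.
            self.dropped += 1
            return None
        state["text"] += new_text
        cut = state["text"].find(state["stop"])
        if cut == -1:
            return state["text"]
        # Phase one: finish locally. Phase two: ask the engine to abort.
        del self.request_states[req_id]
        self.abort_queue.append(req_id)
        return state["text"][:cut]


frontend = Frontend({"req-0": "\n\n"})
# The engine is a step ahead, so one more output arrives after the stop.
engine_outputs = ["Hello", " world\n", "\nExtra", " stale"]
results = [frontend.process_output("req-0", chunk) for chunk in engine_outputs]
print(results[2])            # Hello world
print(frontend.dropped)      # 1
```

The third output completes the stop string and triggers the abort; the fourth is the stale output from the window before the abort lands, and it is discarded rather than appended to a finished response.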

Out the door as SSE

The final hop is almost anticlimactic. The OpenAI-compatible server iterates the RequestOutputs yielded by generate() and serializes each into a Server-Sent Events frame. The streaming generator’s shape is exactly what you’d expect from a FastAPI endpoint:

# vllm/entrypoints/openai/chat_completion/serving.py
async for res in result_generator:
    ...
    yield f"data: {data}\n\n"
...
yield "data: [DONE]\n\n"

Source: vllm/entrypoints/openai/chat_completion/serving.py

The data: ...\n\n framing and the terminal [DONE] sentinel are the SSE wire format the OpenAI client libraries expect. The stream_interval and DELTA-mode logic in RequestState.make_request_output decide how often to emit and whether to send full text or just the new delta, trading client chattiness against per-token latency visibility. But by the time text reaches this loop, every hard problem is already solved upstream: the sampling distribution, the one D2H sync, the incremental decode, the stop handshake.
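The framing itself fits in a few lines. The sketch below is a standalone generator, not the server’s code: the {"delta": ...} payload is a made-up shape for illustration and much simpler than the real chat-completion chunk schema.

```python
import json


def sse_frames(deltas):
    """Yield one SSE frame per text delta, then the terminal sentinel."""
    for delta in deltas:
        # json.dumps escapes newlines inside the text, so the payload can
        # never contain the blank line that terminates a frame.
        payload = json.dumps({"delta": delta})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"


frames = list(sse_frames(["Hel", "lo\n\nworld"]))
for frame in frames:
    print(repr(frame))
```

Serializing through JSON matters for correctness as well as convenience: generated text routinely contains blank lines, and an unescaped "\n\n" inside a frame would be read by the client as the end of that event.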

What’s unsolved, and what’s next

The egress path is mature but not finished. Detokenization is single-threaded per request and runs in the frontend process; under very high concurrency it can become a CPU bottleneck that the chunked output-handler loop only partially mitigates. The abort handshake is correct but not instantaneous, so a deployment with long stop-string-terminated generations always burns some tokens past the stop, and under speculative decoding (Chapter 13) it can be several. And the GPU-resident sampled-token shortcut interacts delicately with pipeline parallelism, where the first and last stages don’t share memory and the scheduler has to ferry tokens back the long way, a caveat the code calls out by name.

Two threads from this chapter feed directly forward. The rejection-sampler machinery glimpsed in parse_output is the verification step that Chapter 13 builds speculative decoding on: the sampler already knows how to accept-or-reject a batch of candidate tokens. And the allowed-token mask applied right at the top of the sampler is the GPU-side half of the CPU-computes-mask / GPU-applies-mask handshake that Chapter 14 uses for grammar-constrained generation, which must additionally roll that mask back over speculative tokens the verifier rejected. The sampler, in other words, is not the end of the pipeline. It’s the seam where the next two chapters’ techniques splice in.

Inference Serving Roadmap
