Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Structured and guided generation

A surprising fraction of production LLM traffic does not want free-form prose. It wants JSON that parses, a tool call whose arguments match a schema, an enum that is exactly one of three strings, a date in YYYY-MM-DD. The moment a model output feeds another program rather than a human reader, “mostly valid” is a bug. Retrying until the model happens to emit well-formed output is the naive fix, and it is expensive: you pay full decode latency for a response you then throw away.

The better fix turns out to be small and surgical, and it slots neatly into machinery we already have. Recall from Chapter 11 that the sampler is a pipeline operating on the last-position logits, the vector of one raw score per vocabulary token that the model produces before any token is chosen, and that the single mandatory GPU-to-CPU sync is the sampled token id. If we could reach into that pipeline one step earlier and forbid the tokens that would violate our grammar, the model would be physically unable to produce malformed output. That is the whole idea of structured generation: at every decode step, compute which tokens are legal given what has been emitted so far, and set the logits of all the illegal ones to negative infinity before sampling. A token whose logit is negative infinity has zero probability after the softmax, so the sampler can never pick it, no matter the temperature or sampling method. The legality test comes from a grammar, a formal description of every valid string (a JSON-schema, an EBNF rule set, a fixed list of choices), which the engine compiles into a finite-state machine (FSM): a little automaton whose current state encodes “given what has been emitted so far, here is exactly the set of tokens that may come next.”

The catch is where that “compute which tokens are legal” work happens. Walking that FSM over a 150,000-entry vocabulary to decide one bit per token is pure CPU bookkeeping, and the GPU is the scarce resource we spent the entire first half of this book learning not to stall. So the engineering problem is a scheduling problem dressed as a parsing problem: keep the FSM work off the GPU’s critical path, hand the GPU a finished bitmask, and have it do nothing more than a cheap masked write. And because Chapter 13 taught the engine to speculate several tokens ahead in one forward pass, the mask machinery has to cope with tokens that get drafted, masked, and then rejected, rolling the grammar’s state back as if they never happened.

The diagram below shows the division of labor that the rest of this chapter unpacks. The CPU walks the grammar and produces a bitmask; the GPU does its normal forward pass in parallel and then applies that mask in a single cheap kernel right before sampling. The two lanes overlap, so the masking adds almost nothing to per-step latency.

flowchart LR
    subgraph CPU["CPU lane (off the GPU critical path)"]
        G["grammar FSM: current state"] --> F["fill_next_token_bitmask: 1 bit per vocab token"]
        F --> M["packed bitmask (int32 per 32 tokens)"]
    end
    subgraph GPU["GPU lane"]
        FW["forward pass: produce logits"] --> AP["apply_token_bitmask_inplace: illegal logits set to -inf"]
        AP --> S["sampler: pick a legal token"]
    end
    M -->|"non_blocking host-to-device copy"| AP
    S -->|"sampled token id"| G

A mask is a bitmask

Constraining the vocabulary is conceptually a boolean vector of length vocab_size: one bit per token, set if the token is currently allowed. Storing 150,000 separate booleans would waste memory and bandwidth, so vLLM packs them, 32 tokens per int32, the way a compiler packs flags into bit fields. Allocating one such vector per token would also thrash the allocator, so the backends carve out the whole batch’s worth at once, a two-dimensional buffer of shape (sequences, vocab-words). The xgrammar backend’s allocator is a one-liner:

# vllm/v1/structured_output/backend_xgrammar.py
def allocate_token_bitmask(self, max_num_seqs: int):
    return xgr.allocate_token_bitmask(max_num_seqs, self.vocab_size)

Source: vllm/v1/structured_output/backend_xgrammar.py

A grammar is compiled once per request into a matcher object, and at each step the matcher fills one row of that bitmask:

# vllm/v1/structured_output/backend_xgrammar.py
def fill_bitmask(self, bitmask: torch.Tensor, idx: int) -> None:
    self.matcher.fill_next_token_bitmask(bitmask, idx)

Source: vllm/v1/structured_output/backend_xgrammar.py

fill_next_token_bitmask is the expensive part, and it is the part XGrammar exists to make fast. The naive cost is brutal: for each of the 150,000 vocabulary tokens, ask “could the grammar’s current state accept this token’s characters?”, which is a scan over the whole vocabulary every single step. XGrammar’s trick, summarized in the XGrammar paper (below), is to notice that most of that answer does not depend on the exact FSM state. A token like { or , is either always-structurally-plausible or never, regardless of where in the grammar you are; only a minority of tokens are “context-dependent” and need a live check. XGrammar precomputes the always/never answer per FSM state ahead of time, so the per-step work shrinks to a cache lookup over the precomputed set plus a small live check over the few context-dependent tokens, rather than a fresh scan. The point for us is that this is a CPU computation reading CPU-resident grammar state, with no GPU involvement whatsoever, which is exactly what lets it hide behind the GPU’s forward pass.

The curve below makes the difference concrete: the naive per-step cost is one live check per vocabulary token, so it grows linearly as vocabularies climb toward and past 150,000 tokens, while XGrammar’s per-step live work tracks only the small context-dependent set (a few percent of the vocabulary) and stays almost flat, which is what shrinks the per-step bookkeeping enough to hide behind a forward pass.

Illustrative: the naive line is one check per token by definition; the XGrammar line assumes a representative ~3% context-dependent fraction (the precomputed always/never tokens cost only a cache lookup, not a live check). The real fraction depends on the grammar and tokenizer.

Two backend details are worth pausing on because they reappear later. First, the matcher is created with a rollback budget tied to speculation:

# vllm/v1/structured_output/backend_xgrammar.py
return XgrammarGrammar(
    matcher=xgr.GrammarMatcher(
        ctx,
        max_rollback_tokens=self.num_speculative_tokens,
    ),
    ...
)

Source: vllm/v1/structured_output/backend_xgrammar.py

That max_rollback_tokens is not incidental. It is exactly the number of draft tokens the speculator may propose, because every one of them might be rejected and need un-doing. Second, compilation is not free. Compiling a JSON schema or EBNF grammar into a matcher can take tens of milliseconds, which is catastrophic if it happens inline on the engine’s step loop. vLLM moves it off the loop entirely.

Keeping compilation off the loop: the WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR state

When a structured request arrives, the manager submits its grammar compilation to a thread pool and stores the resulting Future on the request rather than blocking:

# vllm/v1/structured_output/__init__.py
if self._use_async_grammar_compilation:
    grammar = self.executor.submit(self._create_grammar, request)
else:
    grammar = self._create_grammar(request)
request.structured_output_request.grammar = grammar

Source: vllm/v1/structured_output/__init__.py

The pool is sized deliberately for CPU-bound work, half the cores ((cpu_count() + 1) // 2), not the Python default of five times the core count that suits I/O:

# vllm/v1/structured_output/__init__.py
max_workers = max(1, (multiprocessing.cpu_count() + 1) // 2)
self.executor = ThreadPoolExecutor(max_workers=max_workers)

Source: vllm/v1/structured_output/__init__.py

While that future is pending, the scheduler from Chapter 5 must not run the request, or it would reach the sampler with no mask to apply, and the model could emit a malformed first token before the grammar even exists. This is the purpose of the WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR request status: it is a blocked-waiting state, parked in a side queue, that the scheduler skips until the grammar is ready. The state diagram below traces a structured request from arrival (the node labelled WAITING_FOR_GRAMMAR is that WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR status, abbreviated to fit): it is held out of the runnable WAITING queue while its compilation future is pending, and is only promoted once a poll finds the future resolved.

stateDiagram-v2
    [*] --> WAITING_FOR_GRAMMAR: structured request arrives, compile submitted to thread pool
    WAITING_FOR_GRAMMAR --> WAITING_FOR_GRAMMAR: scheduler poll, future still pending, skip
    WAITING_FOR_GRAMMAR --> WAITING: poll finds grammar ready, promote
    WAITING --> RUNNING: scheduler admits request to a batch
    RUNNING --> RUNNING: decode step, mask, sample, advance FSM
    RUNNING --> FINISHED_ERROR: accept_tokens fails, mask and FSM disagree
    RUNNING --> [*]: grammar terminated or stop token

The promotion check is just a poll of the future’s completion:

# vllm/v1/core/sched/scheduler.py
if request.status == RequestStatus.WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR:
    structured_output_req = request.structured_output_request
    if not (structured_output_req and structured_output_req.grammar):
        return False
    request.status = RequestStatus.WAITING
    return True

Source: vllm/v1/core/sched/scheduler.py

The comment in StructuredOutputManager.__init__ is candid about why this is conditional on the executor backend: under external_launcher mode there is one scheduler per TP rank, and asynchronous compilation would let the WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR -> WAITING transition fire at different times on different ranks, breaking the lockstep determinism those ranks rely on. So in that mode compilation is forced synchronous. This is a recurring flavor of distributed-systems tax: an optimization that is purely local on one process becomes a consistency hazard the moment the same decision must be made identically on many.

The CPU-computes / GPU-applies handshake

Now the steady-state path. Each step, after the scheduler decides the batch, it asks the structured-output manager to build a compacted bitmask covering only the structured requests, in a known order:

# vllm/v1/core/sched/scheduler.py
bitmask = self.structured_output_manager.grammar_bitmask(
    self.requests,
    structured_output_request_ids,
    scheduler_output.scheduled_spec_decode_tokens,
)
return GrammarOutput(structured_output_request_ids, bitmask)

Source: vllm/v1/core/sched/scheduler.py

grammar_bitmask fills rows by calling each grammar’s fill_bitmask, and for large batches without speculation it shards that filling across a second, dedicated thread pool in chunks of 16 requests:

# vllm/v1/structured_output/__init__.py
if len(batch) == self.fill_bitmask_parallel_batch_size:
    promises.append(self._async_submit_fill_bitmask(batch))
    batch = []

Source: vllm/v1/structured_output/__init__.py

Then comes the seam that matters most. The completed mask is converted to a NumPy array, deliberately, because the comment notes that “serialization of np.ndarray is much more efficient than a tensor” when shipping it to the GPU workers, and on the worker the application is a single kernel:

# vllm/v1/structured_output/utils.py
grammar_bitmask = torch.from_numpy(sorted_bitmask).to(
    logits.device, non_blocking=True
)
...
xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor)

Source: vllm/v1/structured_output/utils.py

Note the non_blocking=True host-to-device copy and the in-place kernel that sets masked logits to negative infinity. The non_blocking flag lets the small bitmask copy be queued on the GPU stream without the CPU stalling to wait for it, so it rides alongside the work already in flight. The GPU does the absolute minimum: receive a small bitmask, AND it into the logits. All the grammar walking already happened on CPU, overlapped with the GPU’s forward pass for this very step. The sequence diagram below traces one decode step and makes the overlap explicit: while the GPU is busy with the forward pass, the CPU is already filling the bitmask for the same step, so when the logits land the mask is waiting.

sequenceDiagram
    participant Sched as "Scheduler (CPU)"
    participant SOM as "StructuredOutputManager (CPU)"
    participant GPU as "Model runner (GPU)"
    Sched->>GPU: launch forward pass for the batch
    Sched->>SOM: grammar_bitmask(requests, ids, spec tokens)
    Note over SOM: fill_bitmask per request, sharded across a thread pool
    SOM-->>Sched: packed bitmask as np.ndarray
    GPU-->>Sched: logits ready
    Sched->>GPU: apply_grammar_bitmask(logits, bitmask)
    Note over GPU: non_blocking copy then in-place AND, illegal logits to -inf
    GPU->>GPU: sample a legal token
    GPU-->>Sched: sampled token ids
    Sched->>SOM: accept_tokens, advance each FSM

The call site lives right before sampling in the model runner, exactly where Chapter 11 said the logits live:

# vllm/v1/worker/gpu_model_runner.py
if grammar_output is not None:
    apply_grammar_bitmask(
        scheduler_output, grammar_output, self.input_batch, logits
    )

Source: vllm/v1/worker/gpu_model_runner.py

One subtlety in apply_grammar_bitmask repays attention, and it is purely a row-alignment problem. The logits tensor has one row per request in the batch, in the batch’s order, mixing structured and free-form requests freely. The bitmask, by contrast, was built in a different order (sorted by structured_output_request_ids) and contains only the structured requests, because those are the only ones with a grammar to fill. The kernel ANDs row i of the mask into row i of the logits, so if the two orderings do not match, request A’s constraints land on request B’s logits, silently corrupting one output while the batch looks healthy. To prevent this, the function rebuilds a sorted_bitmask that has the same number of rows as the logits and is aligned to them: every free-form row is filled with -1 (all bits set, meaning every token allowed, so the AND is a no-op), and only the structured rows are overwritten with their real masks. As a fast path, if every logit row turns out to be a structured row, the alignment is already trivial and it skips passing the per-row indices to the kernel altogether. This is the kind of off-by-one bookkeeping any batched per-request transform demands; here it is load-bearing for correctness.

After the FSM: advancing and accepting

Masking is only half of correctness. Masking decides what may come next; advancing records what actually came, so the two must stay in step. Once a token is sampled, the grammar’s FSM has to advance so that the next step’s mask reflects the new state, otherwise the grammar would keep offering the same opening tokens forever. That advance happens back in the scheduler’s update_from_output, gated by should_advance. The gate exists because some models emit a reasoning phase first, a free-form scratchpad of “thinking” tokens before the structured answer begins, and those tokens must not drive the grammar (the grammar describes the answer, not the scratchpad). So should_advance returns false during reasoning and the FSM stays put until the real answer starts:

# vllm/v1/core/sched/scheduler.py
if new_token_ids and self.structured_output_manager.should_advance(request):
    struct_output_request = request.structured_output_request
    ...
    if not struct_output_request.grammar.accept_tokens(req_id, new_token_ids):
        ...
        request.status = RequestStatus.FINISHED_ERROR

Source: vllm/v1/core/sched/scheduler.py

In the happy path accept_tokens always succeeds, because the mask guaranteed the sampled token was legal. The defensive failure branch terminates the request, which is the honest thing to do: a rejection here means the mask and the FSM disagreed, a bug, not a recoverable condition.

Rolling back the mask over rejected speculative tokens

This is where the chapter’s dependence on Chapter 13 becomes concrete. With speculation, a single forward pass verifies several drafted tokens at once: a small draft model proposes a few likely next tokens, and the big model checks them all in one pass, keeping the longest accepted prefix. Each draft position needs its own mask, because the legal set at position $k+1$ depends on what was emitted at position $k$. To know what tokens are legal after the draft token at position $k$, the FSM must first be advanced as if that token were accepted. So the masks for a speculative step can only be computed by walking the FSM forward through the draft, one hypothetical token at a time. Concretely, grammar_bitmask allocates room for every speculative slot plus a bonus token (the bonus is the extra “free” token a verification pass yields when the whole draft is accepted):

# vllm/v1/structured_output/__init__.py
self._grammar_bitmask = self.backend.allocate_token_bitmask(
    max_batch_size * (1 + max_num_spec_tokens)
)

Source: vllm/v1/structured_output/__init__.py

and the serial spec-decode path walks the draft tokens, advancing the FSM as it fills each position, then rolls the FSM back by exactly the number of advances it made:

# vllm/v1/structured_output/__init__.py
for token in itertools.chain(req_tokens, (-1,)):
    self._fill_bitmasks(((grammar, cumulative_index, apply_bitmask),))
    ...
    if apply_bitmask and not grammar.is_terminated():
        accepted = grammar.accept_tokens(req_id, [token])
        ...
        state_advancements += 1
    cumulative_index += 1
if state_advancements > 0:
    grammar.rollback(state_advancements)

Source: vllm/v1/structured_output/__init__.py

The reason for the rollback is delicate, and it is the crux of how masking and speculation coexist. To predict the mask for draft position $k$, the FSM had to be advanced through the drafted tokens of positions $0 \ldots k-1$, hypothesizing that they will be accepted. But the rejection sampler (Chapter 13) has the final say, and it may reject some suffix of those drafts. The FSM cannot commit to the hypothetical advances, because if it did and the drafts were then rejected, the grammar’s state would be ahead of the tokens actually emitted, and every future mask would be wrong. So the manager treats the forward walk as throwaway scaffolding: it speculatively advances purely to compute the masks, counts exactly how many advances it made (state_advancements), and immediately rolls all the way back to the pre-step state. The real advance, over only the tokens the rejection sampler actually accepted, happens afterward in update_from_output via accept_tokens. The -1 sentinel appended to the loop marks the bonus/non-speculative slot; once it is hit the code stops advancing, so it fills that slot’s mask but never walks the grammar past a padding token.

The flowchart below traces the loop for one request: fill a mask, hypothetically advance, repeat per draft token, then unwind every advance so the grammar is back where it started.

flowchart TD
    A["start: FSM at pre-step state, state_advancements = 0"] --> B["fill bitmask for current position"]
    B --> C{"sentinel -1 or grammar terminated?"}
    C -->|"no"| D["accept_tokens([draft token]): hypothetically advance FSM"]
    D --> E["state_advancements += 1"]
    E --> B
    C -->|"yes"| F{"state_advancements > 0?"}
    F -->|"yes"| G["rollback(state_advancements): FSM back to pre-step state"]
    F -->|"no"| H["done; masks ready for verification pass"]
    G --> H

The grammar’s own rollback keeps its bookkeeping honest, decrementing the processed-token count and re-checking termination:

# vllm/v1/structured_output/backend_xgrammar.py
def rollback(self, num_tokens: int) -> None:
    self.matcher.rollback(num_tokens)
    self.num_processed_tokens -= num_tokens
    self._is_terminated = self.matcher.is_terminated()

Source: vllm/v1/structured_output/backend_xgrammar.py

This is why max_rollback_tokens was wired to num_speculative_tokens back at matcher construction: the rollback depth the engine needs is bounded precisely by how far speculation looks ahead. Get that bound wrong and rollback fails; tie it to the spec config and it is correct by construction.

Backends are interchangeable behind one interface

vLLM does not bet on a single grammar engine. The manager dispatches on a configured backend name to XGrammar, Guidance (llguidance), Outlines, or LM Format Enforcer, all implementing the same StructuredOutputBackend / StructuredOutputGrammar pair. The Guidance backend, for instance, presents the identical fill_bitmask shape over a different matcher:

# vllm/v1/structured_output/backend_guidance.py
def fill_bitmask(self, bitmask: torch.Tensor, idx: int) -> None:
    llguidance_torch.fill_next_token_bitmask(self.ll_matcher, bitmask, idx)
    self.check_error()

Source: vllm/v1/structured_output/backend_guidance.py

The differences live below the interface: which JSON-schema features each supports (xgrammar rejects multipleOf, uniqueItems, patternProperties, and more in has_xgrammar_unsupported_json_features; guidance rejects patternProperties), how grammars are specified (xgrammar converts Lark, a grammar notation, into EBNF, a standard grammar-description syntax; a simple list of allowed choices becomes a tiny root ::= "a" | "b" grammar, read as “the whole output is either the string a or the string b”), and engine-specific niceties like jump-forward decoding, which both backends note as future work. The uniform interface is what lets the rollback-and-mask machinery above stay backend-agnostic: swap the engine and the scheduler code above does not change.

What is still hard

The handshake is clean, but it is not free, and several rough edges remain. Compilation latency for large recursive schemas can still stall a request’s first token even off the critical path, because the request waits in WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR until its future resolves; pathological grammars can dominate TTFT. The per-step CPU mask fill, even sharded, competes for the same cores as detokenization and scheduling, so under high structured-request concurrency the CPU can become the bottleneck the GPU never was, which is exactly the kind of regression Chapter 21 will teach you to localize. Jump-forward decoding is the obvious next throughput win and is still unimplemented in both major backends here: when the grammar leaves exactly one legal continuation (after "name": a JSON object must continue with a ", for instance), the engine could emit those forced characters directly and skip the model forward pass entirely, since the model has no real choice to make. It is free tokens, but the bookkeeping to splice forced tokens into the KV cache and the sampler path is nontrivial, so it remains future work. And the interaction with reasoning models, where the grammar must hold its fire during a thinking phase and engage only at the answer, is governed by the should_advance / should_fill_bitmask logic and remains visibly under active development, with comments flagging code paths slated for unification.

The deeper lesson generalizes past grammars. Structured generation is a template for any “constrain or bias the next token” feature: do the bookkeeping on the CPU, overlap it with the GPU’s forward pass, hand the GPU a compact mask, and respect speculation by making every hypothetical advance reversible. The next part of the book leaves the single replica behind and asks how a model spreads across many GPUs at all, starting with tensor, pipeline, and expert parallelism.

Further reading

  • XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models — arXiv:2411.15100 — Precomputes context-independent token sets per FSM state to make per-step vocabulary masking cheap; read it for the data structures that make fill_next_token_bitmask fast enough to hide behind a forward pass.