Killing per-step overhead: CUDA graphs and async scheduling
By now the engine in Part II looks formidable. The token-budget scheduler from Chapter 5 re-plans the batch every step; the paged attention backends from Chapter 9 attend over ragged, block-scattered sequences at something close to bandwidth roofline. Each decode step does a real amount of GPU work. And yet, on a fast accelerator running a modestly sized model, you can profile a steady-state decode loop and find the GPU sitting idle for a meaningful fraction of every step, waiting on the CPU.
This is the embarrassing part of inference serving that nobody warns you about. In decode, every step produces one token per sequence, so the actual matmuls and attention calls are short. But before any of them can run, Python has to walk the model’s forward method, dispatch each operator through PyTorch’s eager-mode machinery, and ask the CUDA driver to launch each kernel. That is dozens of kernels per layer times dozens of layers, each launch costing microseconds of pure host-side overhead, so a single decode step can spend hundreds of microseconds of host time just issuing work. Stack enough and the launch of the work outlasts the work itself: the GPU finishes a tiny GEMM and stalls, idling, while the host queues up the next one.
To make the overhead concrete, picture the host-to-device pipeline for one decode step. The CPU is the producer: it walks Python, dispatches operators, and enqueues kernels onto a CUDA stream, which is just an ordered queue of work the GPU drains. The GPU is the consumer: it pulls kernels off the stream and runs them. Each kernel does microseconds of compute but also costs microseconds of host time to enqueue. When the compute is large (prefill), the GPU is the bottleneck and the host’s enqueueing hides behind it. When the compute is tiny (decode), the roles invert: the host cannot enqueue kernels fast enough to keep the GPU fed, so the GPU drains the stream, finds it empty, and waits. The diagram below traces both regimes.
flowchart LR
P["CPU producer: walk forward, dispatch ops, enqueue kernels"] -->|"CUDA stream"| Q["GPU consumer: drain stream, run kernels"]
Q --> R{"kernel size vs enqueue rate"}
R -->|"prefill: kernels big, GPU is the bottleneck, host hides behind it"| OK["GPU stays busy"]
R -->|"decode: kernels tiny, host can't keep up, stream runs dry"| IDLE["GPU idles, waiting on host"]
The crossover the diagram describes is easier to feel as a curve. The host time to enqueue a decode step is roughly fixed: it is the same procession of per-kernel launches regardless of how many sequences ride along in the batch. The GPU compute, by contrast, grows with the batch. The curves below plot both against batch size. Where GPU compute sits below the flat host-launch line (small batches, the decode regime), the host is the bottleneck and the GPU idles in the gap between the two; where it climbs above (large batches, toward prefill-like work), the GPU is busy enough to hide the launches behind it.
Illustrative: the flat host line and linear-ish GPU curve have the right shapes and the crossover is real, but the absolute microseconds and the exact crossover batch depend on the model, the accelerator, and the kernel mix.
You know the shape of this from systems work. It is the per-request fixed cost that dominates once the variable cost gets small enough, and decode made the variable cost small. This chapter is two complementary attacks on the fixed cost. The first, CUDA graphs, removes the launch overhead within a step by recording the whole sequence of kernels once and replaying it with a single launch call. The second, async scheduling, removes the scheduling overhead between steps by overlapping the CPU’s planning of step N+1 with the GPU’s execution of step N. Both are, in vLLM, a careful exercise in identity and timing: a CUDA graph only replays correctly if the data sits at exactly the addresses it was recorded against, and async scheduling only works if the engine can advance to the next step before it has even learned what the last step produced.
A step is a static shape, most of the time
Here is the observation that makes CUDA graphs viable at all. A CUDA graph is a recording of a stream of GPU operations that can be replayed with a single launch call, amortizing all the per-kernel host overhead. Recording happens once: you run the model normally inside a capture context, and instead of executing, CUDA writes down the exact sequence of kernels, their launch parameters, and the device addresses they read and write. Afterwards, one replay() call re-issues that entire recorded sequence to the GPU without the host walking the model again. The catch is in the word exact: a graph records exact operations on exact memory addresses. You cannot replay a graph captured for a batch of 32 sequences against a batch of 47; the tensor shapes, and therefore the kernel launch dimensions, are baked into the recording.
So CUDA graphs are only viable if a step’s shape repeats often enough that one recording serves many steps. Decode is friendly to this. In a uniform decode step, every running sequence contributes exactly one query token, so the input is a tensor of shape [num_seqs, ...] and only two things vary step to step: num_seqs (how many sequences are decoding right now, which drifts as requests arrive and finish) and the KV-cache lengths (how far each sequence has progressed, which grows by one every step). vLLM neutralizes each of these so the recording stays valid. It handles the varying num_seqs by capturing a separate graph for each of a fixed set of batch sizes and padding any real batch up to the next captured size, so the graph always sees a shape it recorded. It handles the varying KV-cache lengths by keeping the variable-length attention computation outside the graph entirely. That second point is the crux, and it is why vLLM does not just wrap the whole model in one graph: attention is precisely the operator whose work depends on those ever-changing lengths, so it is the one part that cannot be frozen into a static recording. The next section shows how vLLM carves attention out.
Splitting the graph at attention
When vLLM compiles a model with torch.compile under its custom backend, torch.compile first traces the model into an FX graph — a flat, ordered list of the operators the forward pass executes, the same intermediate representation you would get from any PyTorch tracing. vLLM does not hand that whole FX graph to a single CUDA-graph capture. It first splits the graph at the attention operators. The list of ops that constitute a split boundary lives in the compilation config:
# vllm/config/compilation.py
_attention_ops: ClassVar[list[str]] = [
"vllm::unified_attention_with_output",
"vllm::unified_mla_attention_with_output",
"vllm::mamba_mixer2",
"vllm::mamba_mixer",
...
"vllm::deepseek_v4_attention",
]
Source: vllm/config/compilation.py
That is the same unified_attention_with_output custom op you met in Chapter 9 as the dispatch seam into the attention backend. Here it doubles as a fence: a marker that says “cut the graph here.” The backend’s VllmBackend.__call__ resolves splitting_ops (which defaults to this _attention_ops list) and calls split_graph, which walks the FX nodes in order and assigns each one a subgraph id, bumping the id every time it crosses a splitting op so that attention ends up alone in its own segment and the dense ops cluster into the segments between:
# vllm/compilation/backends.py
if should_split(node, splitting_ops):
subgraph_id += 1
node_to_subgraph_id[node] = subgraph_id
split_op_graphs.append(subgraph_id)
# keep consecutive splitting ops together
if should_split(node.next, splitting_ops):
subgraph_id -= 1
else:
subgraph_id += 1
else:
node_to_subgraph_id[node] = subgraph_id
Source: vllm/compilation/backends.py
The result is an alternating structure: a compilable subgraph of everything between attention calls (QKV projection, MLP, layernorms, residual adds, MoE routing), then the attention op running eagerly, then the next compilable subgraph, and so on. vLLM calls this piecewise compilation, and the matching CUDA-graph mode is PIECEWISE. The diagram below shows the layout for two transformer layers: the dense segments (green) are each wrapped in their own CUDA graph and replayed with one launch, while the attention ops (orange) run in eager mode in the gaps between them.
flowchart LR
A["dense segment: QKV proj, layernorm (CUDA graph)"] --> B["attention (eager)"]
B --> C["dense segment: out proj, MLP, residual (CUDA graph)"]
C --> D["attention (eager)"]
D --> E["dense segment: MLP, layernorm (CUDA graph)"]
style A fill:#22863a,color:#fff
style C fill:#22863a,color:#fff
style E fill:#22863a,color:#fff
style B fill:#d97706,color:#fff
style D fill:#d97706,color:#fff
The reason for keeping attention out of the graph is exactly Chapter 9: attention is the one operation whose work depends on the content of the batch (sequence lengths, block tables), and its backends often allocate scratch, run variable-iteration kernels, or branch on metadata in ways that do not capture cleanly. A recording freezes a fixed kernel shape, but attention’s per-step work is intrinsically variable, so freezing it would record the wrong amount of work. So vLLM replays the dense, shape-stable parts and leaves attention eager in the gaps. The launch overhead it eliminates is the overhead of the dense segments, which is where the vast majority of the per-step kernels live; the handful of eager attention launches per layer are cheap by comparison.
Each non-splitting subgraph becomes a PiecewiseBackend compiled for the relevant shapes, and wrap_with_cudagraph_if_needed wraps it in a CUDAGraphWrapper tagged PIECEWISE:
# vllm/compilation/backends.py
return static_graph_wrapper_class(
runnable=piecewise_backend,
vllm_config=vllm_config,
runtime_mode=CUDAGraphMode.PIECEWISE,
cudagraph_options=CUDAGraphOptions(
debug_log_enable=is_first_graph,
gc_disable=not is_first_graph,
weak_ref_output=is_last_graph,
),
)
Source: vllm/compilation/backends.py
Note the per-graph options. Only the first subgraph logs (capturing happens for many shapes, and you do not want a log line each time), and garbage collection is disabled for all but the first graph because running gc.collect() between every layer’s capture would make capture pathologically slow.
Capture, replay, and the iron law of identity
The wrapper itself, in vllm/compilation/cuda_graph.py, is small and ruthless about memory. It is a tiny state machine per shape: the first call for a given shape records, and every subsequent call with that shape replays. The state diagram below traces that lifecycle.
stateDiagram-v2
[*] --> NotCaptured
NotCaptured --> Capturing: first call for this shape
Capturing --> Captured: record kernels and input addresses
Captured --> Captured: later call same shape, replay (one launch)
note right of Capturing
save data_ptr of every input tensor
end note
note right of Captured
replay re-runs recorded kernels,
which dereference the saved addresses
end note
The capture path is the interesting one:
# vllm/compilation/cuda_graph.py
input_addresses = [
x.data_ptr() for x in args if isinstance(x, torch.Tensor)
]
entry.input_addresses = input_addresses
cudagraph = torch.cuda.CUDAGraph()
...
with torch.cuda.graph(cudagraph, pool=self.graph_pool, stream=current_stream()):
output = self.runnable(*args, **kwargs)
Read the capture path together with the replay behavior to see why addresses matter so much. At capture time the wrapper records data_ptr() — the raw device address — of every input tensor, then runs the model once inside torch.cuda.graph(...) so CUDA writes down the kernel sequence. A captured graph reads from and writes to the exact device addresses that were live at that moment. Crucially, replaying it does not re-read whatever Python objects you pass as arguments next time; it re-runs the recorded kernels, and those kernels dereference the raw pointers they were recorded against. If the data you want this step to process sits at a different address than last step, replay will silently process whatever happens to be at the old address instead. This is the single most important fact about CUDA graphs and the source of nearly every bug people hit with them: the inputs must live at the same addresses on every replay. The wrapper does not solve this itself, on purpose. Its docstring is explicit that it stores no persistent buffers and copies nothing; it assumes the caller feeds it tensors that already live in stable buffers. What it does do, under debug logging, is verify the assumption:
# vllm/compilation/cuda_graph.py
new_input_addresses = [
x.data_ptr() for x in args if isinstance(x, torch.Tensor)
]
assert new_input_addresses == entry.input_addresses, (
f"Input addresses for cudagraphs are different "
f"during replay. Expected {entry.input_addresses}, "
f"got {new_input_addresses}"
)
Source: vllm/compilation/cuda_graph.py
If that assert ever fires, someone allocated a fresh tensor where a persistent buffer was expected, and the graph would have silently read garbage.
The stable buffers are supplied by the model runner. In gpu_model_runner.py the runner allocates persistent input tensors once, sized to the maximum, and copies each step’s data into them rather than allocating anew:
# vllm/v1/worker/gpu_model_runner.py
# Persistent buffers for CUDA graphs.
self.input_ids = self._make_buffer(self.max_num_tokens, dtype=torch.int32)
self.positions = torch.zeros(
self.max_num_tokens, dtype=torch.int64, device=self.device
)
This is the resolution of the address problem. Because self.input_ids is allocated exactly once and reused, its device address never changes. Every step writes the real token ids into the first slots of that one buffer (overwriting last step’s contents in place) and lets the captured graph read from that fixed address. The two invariants a graph needs are now both satisfied by construction: the dispatcher’s padding (next section) guarantees the shape the graph sees is one it captured, and the persistent buffer guarantees the address is the one it recorded. Shape from padding, address from the persistent buffer; that pair is the whole contract.
The output side is equally careful about lifetime. The wrapper always stores the cached output as a weak reference (weak_ref_tensors) so PyTorch’s graph pool, not Python, owns that memory. The weak_ref_output option then handles a subtlety of the piecewise chain: an intermediate subgraph’s output is fed straight into the next subgraph’s capture and must stay alive, but the last subgraph’s output “will not be used by any other cuda graph,” so it is the only one safe to release immediately inside the capture context (weak_ref_output=is_last_graph). Getting this wrong leaks the entire graph pool. The file’s inline comment, “mind-exploding: carefully manage the reference and memory,” is not hyperbole.
Dispatch: which graph, padded to what
At runtime the runner does not decide capture-versus-replay itself; it asks the CudagraphDispatcher to map the current batch onto a captured graph. Think of the dispatcher as a lookup table from “what does this batch look like” to “which recording, if any, fits it.” The key is a BatchDescriptor — the dataclass that captures everything a graph is specialized on (padded token count, request count, whether it is a uniform decode, LoRA state). The dispatcher holds two sets of these keys, one per runtime mode (FULL and PIECEWISE), because the two modes specialize on different things. Dispatch pads the batch up to the nearest captured size and tries the keys in priority order:
# vllm/v1/cudagraph_dispatcher.py
if CUDAGraphMode.FULL in allowed_modes:
batch_desc_to_check = batch_desc
if batch_desc_to_check in self.cudagraph_keys[CUDAGraphMode.FULL]:
return CUDAGraphMode.FULL, batch_desc_to_check
if CUDAGraphMode.PIECEWISE in allowed_modes:
batch_desc_to_check = replace(batch_desc, num_reqs=None, uniform=False)
if batch_desc_to_check in self.cudagraph_keys[CUDAGraphMode.PIECEWISE]:
return CUDAGraphMode.PIECEWISE, batch_desc_to_check
return CUDAGraphMode.NONE, BatchDescriptor(num_tokens)
Source: vllm/v1/cudagraph_dispatcher.py
The diagram below traces this three-way decision, and the three things to read out of it follow.
flowchart TD
Start["incoming batch, padded to nearest captured size"] --> F{"FULL key matches?"}
F -->|yes| UseF["replay FULL graph: whole step including attention, one launch"]
F -->|no| P{"PIECEWISE key matches? (relax num_reqs and uniform)"}
P -->|yes| UseP["replay PIECEWISE graphs: dense parts captured, attention eager"]
P -->|no| None["mode NONE: run fully eager, launch every kernel by hand"]
First, FULL is preferred when available. A FULL graph captures the entire step including attention, so it eliminates even the eager attention launches that PIECEWISE leaves in the gaps. This is the FULL_AND_PIECEWISE default: decode batches get a full graph (Chapter 9’s flash-attention backends support full-graph capture for uniform decode, where the attention shape is regular enough to freeze) while prefill or mixed batches, whose attention is too irregular, fall back to piecewise. Second, the piecewise lookup relaxes the key before checking it — it nulls out num_reqs and uniform — because a piecewise graph, with attention living outside the recording, genuinely does not care how the tokens are partitioned into requests; only the token count matters to the dense segments. The full graph cannot relax those fields, because its in-graph attention kernel reads metadata (block tables, sequence boundaries) that depends on exactly how many requests there are and whether they are a uniform decode. The relaxation is why a single piecewise recording covers many request-count variations that would each need their own full graph. Third, if neither mode has a matching key (the batch exceeds the largest captured size, or a feature like cascade attention is incompatible), dispatch returns NONE and the step runs fully eager, paying the per-kernel launch overhead this whole chapter is trying to kill. That last fallback is exactly the “cudagraph fallback” regression the observability chapter teaches you to spot: a few oversized batches quietly costing you the launch overhead you thought you had eliminated.
The padding is not free. Padding a batch of 47 up to a captured size of 48 means running one sequence’s worth of phantom work, but a batch of 33 pads all the way to 48, wasting a third of the step. The waste is a sawtooth in the real batch size: it spikes just above each capture size (the worst case in the set below is a batch of 9, which pads to 16 and wastes nearly half the step) and falls to zero right at a captured size. The curve below plots that wasted fraction against the real batch size for a representative set of capture sizes; the teeth get shallower toward the large captured sizes because the gaps between them shrink in relative terms.
Illustrative: the sawtooth shape and the spike-above-each-capture-size pattern are exact for the listed capture sizes; vLLM’s actual capture-size set and cap are tunable, so the real teeth differ.
vLLM bounds this with a sensible set of capture sizes and a capped maximum, and the dispatcher precomputes a dense bs_to_padded_graph_size table so the padding decision is a constant-time lookup. The waste is real but small relative to the launch overhead it eliminates, and the capture sizes are a tunable knob.
The other idle gap: scheduling
CUDA graphs kill the launch overhead inside a step. They do nothing for the overhead between steps. Even with a perfectly graphed step, there is a seam: after the GPU finishes step N, the host has to copy the sampled token ids back, run the scheduler’s update_from_output, decide the next batch, and build the input tensors, and only then can it launch step N+1. The synchronous loop alternates strictly between the two, so the GPU sits idle during every one of those host phases. The top half of the diagram below shows that idle gap; the bottom half shows the fix.
sequenceDiagram
participant CPU
participant GPU
Note over CPU,GPU: synchronous: GPU idles while CPU schedules
CPU->>GPU: launch step N
GPU-->>CPU: step N done
Note over GPU: idle
CPU->>CPU: schedule step N+1
CPU->>GPU: launch step N+1
Note over CPU,GPU: async: schedule N+1 while GPU runs N
CPU->>GPU: launch step N
CPU->>CPU: schedule step N+1 (overlaps GPU)
CPU->>GPU: launch step N+1
GPU-->>CPU: step N done
The fix is the classic one: pipeline it. Run the CPU scheduling of step N+1 concurrently with the GPU execution of step N, so that the moment the GPU finishes step N the next batch is already queued and waiting. vLLM calls this async scheduling, and it is gated by a single scheduler-config flag:
# vllm/config/scheduler.py
async_scheduling: bool | None = None
"""If set to False, disable async scheduling. Async scheduling helps to
avoid gaps in GPU utilization, leading to better latency and throughput.
"""
Source: vllm/config/scheduler.py
When it is on, the engine runs step_with_batch_queue instead of the plain step. The structural change is that execute_model returns a future instead of blocking for the GPU result, and that future goes into a queue; the engine then prefers to schedule and launch the next batch rather than immediately wait on the previous one. It only blocks on a future when the queue is full or there is nothing left to schedule, which is what keeps the GPU continuously fed:
# vllm/v1/engine/core.py
if not deferred_scheduler_output:
batch_queue.appendleft((future, scheduler_output, exec_future))
if len(batch_queue) < self.batch_queue_size and (
model_executed or self.scheduler.has_requests()
):
# Don't block on next worker response unless the queue is full
# or there are no more requests to schedule.
return None, model_executed
Source: vllm/v1/engine/core.py
The queue size is max_concurrent_batches, which is where async scheduling and pipeline parallelism meet. The same machinery that overlaps scheduling with execution is what keeps a pipeline-parallel deployment’s stages full:
# vllm/config/vllm.py
def max_concurrent_batches(self) -> int:
# PP requires PP-size concurrent batches to fill the pipeline.
# Async scheduling requires 2 concurrent batches to overlap.
pp_size = self.parallel_config.pipeline_parallel_size
if self.scheduler_config.async_scheduling:
if self.use_v2_model_runner:
return pp_size + 1
if pp_size <= 1:
return 2
return pp_size
Source: vllm/config/vllm.py
Async scheduling on a single replica wants two batches in flight; PP wants pp_size; the v2 runner combining both wants pp_size + 1. It is the same idea you reach for in any producer/consumer pipeline: queue just enough work to hide the other side’s latency without unbounded buffering.
The catch: scheduling the next step before you know the last token
There is a genuine problem buried in that overlap. To schedule step N+1, the scheduler needs each sequence’s next input token — which is the one sampled by step N, still running on the GPU. Waiting to read it back would kill the overlap. So async scheduling schedules step N+1 without yet knowing the sampled tokens, using placeholders.
The AsyncScheduler advances each running request’s bookkeeping optimistically, recording how many tokens are “in flight” as placeholders rather than concrete ids:
# vllm/v1/core/sched/async_scheduler.py
cur_num_spec_tokens = len(spec_decode_tokens.get(req_id, ()))
request.num_output_placeholders += 1 + cur_num_spec_tokens
# Add placeholders for the new draft/spec tokens.
# We will update the actual spec token ids in the worker process.
request.spec_token_ids = self._spec_token_placeholders
Source: vllm/v1/core/sched/async_scheduler.py
The genuinely clever part is in the worker, and it is what lets the placeholders work without ever stalling. Recall the chicken-and-egg problem: step N+1’s input is the token sampled by step N, but that token is still being computed on the GPU. The naive fix would be to copy the token back to the host and feed it into the next step’s input tensor, but that device-to-host copy is exactly the synchronization that would destroy the overlap. The insight is that the token never needs to leave the GPU to become the next step’s input, because both the sampled-token tensor and the persistent input buffer already live on the device. The model runner caches the previous step’s sampled-token tensor (prev_sampled_token_ids) and, when preparing the next batch’s input_ids, scatters those device-side tokens straight into the persistent input buffer with a GPU kernel:
# vllm/v1/worker/gpu_model_runner.py
self.input_ids.gpu.scatter_(
dim=0,
index=sampled_tokens_index_tensor,
src=self.input_batch.prev_sampled_token_ids[
prev_common_req_indices_tensor, 0
],
)
Source: vllm/v1/worker/gpu_model_runner.py
The host never learns the token id before launching the next step; the handoff from step N to step N+1 happens entirely on-device via a GPU scatter, never touching the CPU on the critical path. The sequence diagram below contrasts the two handoff paths: the on-device path (solid, on the critical path) carries the token from step N straight into step N+1, while the host-bound copy (the background detokenize/stream path) runs off to the side.
sequenceDiagram
participant CPU
participant GPU
CPU->>GPU: launch step N
GPU->>GPU: sample token, store in prev_sampled_token_ids (on device)
GPU->>GPU: scatter prev tokens into persistent input_ids buffer
CPU->>GPU: launch step N+1 (reads tokens already on device)
GPU-->>CPU: background copy of tokens (separate stream, for detokenize and stop checks)
There is even a common-case fast path: if the decode batch is unchanged and nothing was reordered, the whole thing collapses to one contiguous slice copy:
# vllm/v1/worker/gpu_model_runner.py
if common_indices_match and max_flattened_index == (num_common_tokens - 1):
self.input_ids.gpu[:num_common_tokens].copy_(
self.input_batch.prev_sampled_token_ids[:num_common_tokens, 0],
non_blocking=True,
)
return
Source: vllm/v1/worker/gpu_model_runner.py
The device-to-host copy of the sampled tokens still happens — the host needs them eventually to detokenize, stream (Chapter 11), and check stop conditions — but it is moved off the critical path. Instead of blocking the scheduling loop, it runs on a separate CUDA stream, so the host can keep launching steps while the copy proceeds in the background, and the result is only awaited when something actually needs the token ids on the CPU. AsyncGPUModelRunnerOutput kicks it off non-blocking and only synchronizes when the output is actually consumed:
# vllm/v1/worker/gpu_model_runner.py
with torch.cuda.stream(async_output_copy_stream):
async_output_copy_stream.wait_stream(default_stream)
self.sampled_token_ids_cpu = self._sampled_token_ids.to(
"cpu", non_blocking=True
)
...
self.async_copy_ready_event.record()
Source: vllm/v1/worker/gpu_model_runner.py
Where the optimism leaks
The placeholders are an optimistic bet: the scheduler assumes the steps it launched speculatively will all turn out to be wanted. Optimism has costs, paid in the corner cases where that bet is wrong and the engine has already committed work it now needs to unwind. The cleanest example: when something forces a running request out from under in-flight work. If the prefix cache is reset and every running request is force-preempted, any output frames the engine already launched for those requests are now stale — they computed tokens for a request that no longer exists in its old form — and must be dropped when they return from the GPU. The scheduler records exactly how many such frames are outstanding — the request’s placeholder count at preemption time — and the async scheduler drains them one per call:
# vllm/v1/core/sched/async_scheduler.py
if request.async_tokens_to_discard > 0:
# The request was force-preempted in reset_prefix_cache; drop one
# stale in-flight async output frame per call until the counter
# is drained.
request.async_tokens_to_discard -= 1
return [], False
Source: vllm/v1/core/sched/async_scheduler.py
The counter is set in reset_prefix_cache itself, where request.async_tokens_to_discard = request.num_output_placeholders captures the in-flight count before zeroing the placeholders. The everyday case — a sampled token that turns out to be a stop token, after step N+1 was already scheduled for that sequence — is handled more quietly: the over-produced tokens are simply truncated when the stopped request’s output is reconciled, the same num_output_placeholders bookkeeping unwinding by the number of tokens that actually came back.
Speculative decoding (Chapter 13) makes this harder, because the number of tokens accepted per step is not known until verification runs on the GPU. That is why the placeholders above include 1 + cur_num_spec_tokens and why the engine’s post_step skips its usual draft-token bookkeeping under async scheduling — the draft ids are resolved in the worker, not the engine:
# vllm/v1/engine/core.py
if not self.async_scheduling and self.use_spec_decode and model_executed:
draft_token_ids = self.model_executor.take_draft_token_ids()
if draft_token_ids is not None:
self.scheduler.update_draft_token_ids(draft_token_ids)
Source: vllm/v1/engine/core.py
And structured output (Chapter 14) is the hardest interaction of all, because the grammar bitmask for step N+1 depends on which token was actually sampled at step N. When a request is constrained by a grammar, the engine cannot blindly sample the next step; it must defer sampling until the prior output is processed, which is the deferred_scheduler_output branch in step_with_batch_queue above. Async scheduling and constrained decoding pull in opposite directions, and the code threads the needle by giving up the overlap precisely for the requests that need the prior token to compute their mask.
Honest limits
These techniques are mature and on by default, but not magic. CUDA graphs cost capture time at startup (one capture per shape), cost memory (the graph pool plus padding waste), and fall back to eager whenever a batch does not match a captured descriptor — the larger your max_num_seqs relative to your capture sizes, the more often that happens. Full-graph capture depends on the attention backend supporting it, so a Chapter 9 backend choice quietly decides whether you get FULL or only PIECEWISE. Async scheduling buys throughput at the cost of a more intricate state machine whose edge cases (stops, preemption, spec-decode rejection, grammar masks) are where subtle correctness bugs live, and it does not compose freely: the v1 runner does not fully support async scheduling with PP, which is why max_concurrent_batches special-cases it.
Both techniques share a theme that runs through the engine: the win comes from refusing to let the GPU wait on the host. We removed the launch wait by recording it and the scheduling wait by overlapping it. The one host round-trip we have not removed is the one that turns sampled token ids into streamed text — the device-to-host copy AsyncGPUModelRunnerOutput defers but cannot eliminate. That copy, the sampler pipeline that produces it, and the surprisingly racy path from token ids back to an SSE stream are the next chapter.
Further reading
This chapter’s mechanisms are engineering rather than research, so there is no single paper to cite; the techniques are best learned from the vLLM source quoted above and from the upstream documentation for torch.compile and torch.cuda.CUDAGraph. For background on the kernel-launch-overhead problem that motivates CUDA graphs, the roofline framing from Chapter 3 (Data Movement Is All You Need, arXiv:2007.00072) and Chapter 9’s attention-kernel discussion are the relevant prior reading within this book.