Prefill/decode disaggregation

Every chapter in Part II fought the same quiet war. Chapter 3 named the enemy: a prefill step is compute-bound, saturating the GPU’s tensor cores over the whole prompt at once, while a decode step is memory-bound, dragging the entire model and KV cache through HBM to emit a single token. Chapter 6 then put both kinds of work into one batch and one token budget, and chunked prefill exists precisely because mixing them is awkward. A long prefill, left whole, stalls every decode sharing its step; sliced into chunks, it merely taxes them. Chunked prefill is a truce, not a victory. The two phases still ride the same GPU, still draw from the same token budget, still interfere.

There is a more radical option, and this chapter is about it: stop sharing the GPU at all. Run prefill on one pool of GPUs and decode on another. Let the prefill pool run flat-out at its compute roofline with no decodes to slow it down, let the decode pool run flat-out at its bandwidth roofline with no prefill spikes wrecking its inter-token latency, and ship the KV cache from the first pool to the second over the network. The two pools scale independently, because their bottlenecks are independent. And the wire that carries the KV between them is not a new abstraction at all. It is the connector API from Chapter 16, the same KVConnectorBase_V1 that spilled cold blocks to CPU and disk, now stretched across a NIC.

That last point is the through-line. Offloading and disaggregation are the same machinery pointed at different destinations.

A piece of vocabulary first, because the rest of the chapter leans on it. The prefill instance (the “P” side, or producer) is the GPU pool that reads the whole prompt and computes its KV cache once. The decode instance (the “D” side, or consumer) is the pool that takes that KV cache and generates output tokens one at a time. The KV cache is the per-token key/value tensors that attention needs in order to attend back over everything seen so far; computing it for an N-token prompt is exactly the expensive prefill work, and once computed it is just bytes that can be copied. Disaggregation’s entire bet is that copying those bytes from P to D is cheaper than recomputing them on D. The diagram below shows the shape of the deployment: a request flows through P, its KV cache crosses the network, and D streams tokens back to the user.

flowchart LR
    U["client request"] --> P["prefill pool (producer): compute full KV cache once, compute-bound"]
    P -->|"KV cache over RDMA"| D["decode pool (consumer): emit one token per step, memory-bound"]
    D -->|"output tokens"| U
    P -.->|"scales on prompt-token rate"| AS["autoscaler"]
    D -.->|"scales on output-token rate"| AS

The dashed edges hint at the payoff the next section makes precise: because P is throttled by compute and D by memory bandwidth, the two pools can be sized and scaled by completely different signals.

Why separate at all: the goodput argument

Chunked prefill tames interference but cannot erase it, and it forces a single configuration onto two workloads with opposite needs. Prefill wants large batches and big tensor-parallel groups to feed the compute units; decode wants the smallest TP degree that still fits the model, so that each token’s mandatory walk through HBM is as short as possible. One GPU pool cannot be tuned for both. You pick a compromise and both phases pay for it.

Two 2024 papers made this argument quantitative and gave the field its vocabulary. DistServe (arXiv:2401.09670) disaggregates prefill and decode onto separate GPU pools and optimizes for goodput rather than raw throughput. The distinction is worth pinning down, because it is the whole reason to bother. Throughput counts every token a GPU emits, fast or slow. Goodput counts only the requests that actually met both of their latency targets from Chapter 2: TTFT (time to first token, set by prefill) and TPOT (time per output token, set by decode). A GPU can have wonderful throughput while missing SLOs left and right, if its batches are large but laggy. The insight is that prefill and decode degrade different SLOs when they interfere: a long prefill sharing a step inflates everyone’s TPOT, and a flood of decodes sharing a step inflates everyone’s TTFT. Co-locating them forces a single batching and parallelism configuration onto two phases with opposite optimal points, so you are always sacrificing one SLO to protect the other. DistServe’s finding is that splitting the phases, even after paying to move the KV cache, raises the number of requests served within both SLOs per GPU. Read it for the goodput formulation and the placement search over parallelism degrees. Splitwise (arXiv:2311.18677) reaches the same structural conclusion from a production-fleet angle: it splits the two phases onto separate machine pools, observes that prefill and decode have different power and hardware sweet spots (you can even use different GPU SKUs per pool), and works out KV-transfer and provisioning so the split pays off across a real cluster. Read it for the heterogeneous-hardware and capacity-planning view.
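
The throughput/goodput distinction is easy to state in code. The sketch below follows the definition used in this chapter (a request counts only if it met both its TTFT and TPOT targets); the `RequestStats` type, the SLO values, and the sample numbers are all invented for illustration and are not taken from either paper.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    ttft_s: float       # time to first token, in seconds
    tpot_s: float       # mean time per output token, in seconds
    output_tokens: int  # tokens generated for this request


def throughput_tokens_per_s(reqs, window_s):
    """Throughput: every emitted token counts, fast or slow."""
    return sum(r.output_tokens for r in reqs) / window_s


def goodput_requests_per_s(reqs, window_s, ttft_slo_s, tpot_slo_s):
    """Goodput: only requests that met BOTH latency targets count."""
    ok = [r for r in reqs if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s]
    return len(ok) / window_s


# Four made-up requests observed over a 10-second window.
reqs = [
    RequestStats(ttft_s=0.20, tpot_s=0.03, output_tokens=200),  # meets both
    RequestStats(ttft_s=0.90, tpot_s=0.03, output_tokens=200),  # misses TTFT
    RequestStats(ttft_s=0.20, tpot_s=0.12, output_tokens=200),  # misses TPOT
    RequestStats(ttft_s=0.25, tpot_s=0.04, output_tokens=200),  # meets both
]

print(throughput_tokens_per_s(reqs, 10.0))
print(goodput_requests_per_s(reqs, 10.0, ttft_slo_s=0.5, tpot_slo_s=0.05))
```

All four requests contribute equally to throughput, but only two of them contribute to goodput, which is the gap the disaggregation argument is about.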

Both papers leave you with the same caveat, and it is the one to hold onto: disaggregation only wins when the KV transfer is cheap relative to the prefill it replaces. Move the bytes too slowly, or move them for a prompt too short to be worth it, and you have added a network hop for nothing. Everything in vLLM’s implementation below is, in one way or another, an attempt to keep that transfer off the critical path.
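
A back-of-envelope version of that caveat can be sketched as follows. The KV size formula (keys plus values, per layer, per KV head) is standard; everything else is an assumption made for illustration: the model shape, the link bandwidth, the effective prefill FLOP rate, and the rough "2 FLOPs per parameter per token" rule, which ignores the attention term that grows with prompt length.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def transfer_seconds(prompt_tokens, bytes_per_token, link_bytes_per_s,
                     fixed_latency_s=0.0):
    # Fixed setup latency plus the time to stream the bytes.
    return fixed_latency_s + prompt_tokens * bytes_per_token / link_bytes_per_s


def prefill_seconds(prompt_tokens, num_params, effective_flops_per_s):
    # Rough rule of thumb: about 2 FLOPs per parameter per prompt token.
    return 2 * num_params * prompt_tokens / effective_flops_per_s


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128, 2)

# Hypothetical hardware: 50 GB/s link, 400 TFLOP/s effective prefill rate.
prompt = 4096
t_xfer = transfer_seconds(prompt, per_token, link_bytes_per_s=50e9)
t_prefill = prefill_seconds(prompt, num_params=8e9, effective_flops_per_s=400e12)

print(per_token, t_xfer, t_prefill, t_xfer < t_prefill)
```

Under these made-up numbers the transfer is much cheaper than the prefill it replaces. Shrinking the prompt, slowing the link, or adding a fixed per-transfer latency moves the comparison the other way.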

The shape of a disaggregated deployment

vLLM does not have a “disaggregation mode.” It has a connector and a role flag, and disaggregation is what you get when you point a producer instance and a consumer instance at the same connector. The role lives in KVTransferConfig:

# vllm/config/kv_transfer.py
kv_role: KVRole | None = None
"""Whether this vLLM instance produces, consumes KV cache, or both. Choices
are 'kv_producer', 'kv_consumer', and 'kv_both'."""

Source: vllm/config/kv_transfer.py

A prefill instance runs as a producer, a decode instance as a consumer. Each is an ordinary vLLM engine with an ordinary scheduler and paged KV cache; the only difference is that a connector is wired in. That connector is created once, lazily, and stashed as a process-global on the worker side:

# vllm/distributed/kv_transfer/kv_transfer_state.py
if (
    vllm_config.kv_transfer_config.is_kv_transfer_instance
    and _KV_CONNECTOR_AGENT is None
):
    _sync_engine_id_across_tp(vllm_config)

    _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
        config=vllm_config,
        role=KVConnectorRole.WORKER,
        kv_cache_config=kv_cache_config,
    )

Source: vllm/distributed/kv_transfer/kv_transfer_state.py

Note _sync_engine_id_across_tp: every worker in a TP group must agree on the engine id, because that id is how a remote instance addresses this one across the network. Disaggregation is point-to-point between specific engines, not a broadcast.

The connector you choose decides how the bytes actually move, and there are many, not one. The factory registers a whole zoo — NixlConnector, LMCacheConnectorV1, MooncakeConnector, the offloading connectors from Chapter 16, and MultiConnector which composes several at once. NIXL (NVIDIA’s transfer library, point-to-point RDMA) is the reference disaggregation backend and the one we will read, but the API is deliberately backend-agnostic. Mooncake (arXiv:2407.00079, met in Chapter 16) is both a paper and a backend in this tree: a KV-centric tiered store that a decode pool can pull from. The connector abstraction is what lets all of these be alternatives rather than forks.
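
To make "a connector and a role flag" concrete, here is a minimal sketch that builds one transfer configuration per pool. The excerpt above confirms only `kv_role` and its three choices; the `kv_connector` key, and the idea of handing the result to the engine as a JSON string, are assumptions here, and the helper name `kv_transfer_config` is hypothetical.

```python
import json

ALLOWED_ROLES = {"kv_producer", "kv_consumer", "kv_both"}


def kv_transfer_config(role: str, connector: str = "NixlConnector") -> str:
    """Build a JSON blob describing one instance's KV-transfer role.

    Hypothetical helper: the "kv_connector" key is an assumption, not
    something the excerpt above shows.
    """
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown kv_role: {role!r}")
    return json.dumps({"kv_connector": connector, "kv_role": role})


# One config per pool; both sides name the same connector.
prefill_cfg = kv_transfer_config("kv_producer")
decode_cfg = kv_transfer_config("kv_consumer")

print(prefill_cfg)
print(decode_cfg)
```

The point of the sketch is that the two instances differ only in the role string. Everything else about them is an ordinary engine configuration.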

The connector lifecycle, split across two sides

The base class documents its own contract better than I can paraphrase, so here is the seam. Every connector has a scheduler-side half and a worker-side half:

# vllm/distributed/kv_transfer/kv_connector/v1/base.py
    Scheduler-side: runs in the scheduler, binds metadata, which
    is used by the worker-side to load/save KV cache.
        get_num_new_matched_tokens() - get number of new tokens
            that exist in the remote KV cache. ...
    Worker-side: runs in each worker, loads/saves KV cache to/from
    the Connector based on the metadata.
        start_load_kv() - starts loading all KVs (maybe async)
        ...
        get_finished() - called with ids of finished requests, returns
            ids of requests that have completed async sending/recving.

Source: vllm/distributed/kv_transfer/kv_connector/v1/base.py

The split is the key to understanding everything that follows, so it is worth saying plainly what each half is and why there are two. The scheduler-side half lives in the engine’s step loop on the CPU. It never touches a GPU. Its job is purely to decide and to bookkeep: it answers “are these tokens available remotely?”, “have they arrived yet?”, and “may I free these blocks?”, and it emits small metadata structs describing what should move. The worker-side half lives inside execute_model, next to the GPU. Its job is to do the bytes: issue the RDMA reads, poll for completions, and hand a list of finished requests back up. Crucially, the two halves run on different clocks. The scheduler issues a transfer in one step but the bytes may not land for several steps, so the scheduler cannot simply block and wait. It has to park the request, keep stepping other work, and learn about completion asynchronously. That gap between “transfer requested” and “transfer done” is the central design problem of this whole chapter, and it is why a request needs a dedicated waiting state, which is exactly where the next section goes. The diagram below traces the two halves and the asynchronous handoff between them.

flowchart TD
    subgraph SCHED["scheduler-side (CPU, step loop)"]
        A["get_num_new_matched_tokens(): how many tokens are remote?"]
        B["park request in WAITING_FOR_REMOTE_KVS, reserve blocks"]
        C["each step: is request in finished_recving set yet?"]
        D["yes: rejoin schedulable pool, run forward"]
    end
    subgraph WORK["worker-side (GPU, execute_model)"]
        E["start_load_kv(): post RDMA read"]
        F["forward pass runs while bytes stream in"]
        G["get_finished(): poll completions"]
    end
    A --> B
    B -.->|"metadata describes the transfer"| E
    E --> F --> G
    G -.->|"finished_recving ids flow back up"| C
    C -->|"not yet"| C
    C --> D
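
The two-clock handoff can be imitated with a toy model. The classes below are not vLLM code: the method names echo `start_load_kv` and `get_finished`, but the signatures are simplified and the "network" is just a countdown of scheduler steps.

```python
class ToyWorker:
    """Worker-side half: starts transfers, reports completions later."""

    def __init__(self, latency_steps):
        self.latency_steps = latency_steps
        self.inflight = {}  # request id -> steps until the bytes land

    def start_load_kv(self, req_ids):
        # Fire and forget: record the transfer, do not wait for it.
        for rid in req_ids:
            self.inflight[rid] = self.latency_steps

    def get_finished(self):
        # Poll: advance every transfer by one step, return those now done.
        done = set()
        for rid in list(self.inflight):
            self.inflight[rid] -= 1
            if self.inflight[rid] <= 0:
                done.add(rid)
                del self.inflight[rid]
        return done


class ToyScheduler:
    """Scheduler-side half: parks requests and learns of completion later."""

    def __init__(self):
        self.parked = set()  # stand-in for WAITING_FOR_REMOTE_KVS
        self.to_load = []    # metadata handed to the worker next step
        self.runnable = []   # requests whose KV has arrived

    def admit_remote(self, rid):
        self.parked.add(rid)
        self.to_load.append(rid)

    def step(self, worker):
        worker.start_load_kv(self.to_load)
        self.to_load = []
        # (the forward pass for other requests would run here)
        for rid in worker.get_finished():
            if rid in self.parked:
                self.parked.remove(rid)
                self.runnable.append(rid)


worker = ToyWorker(latency_steps=3)
sched = ToyScheduler()
sched.admit_remote("r1")
for _ in range(3):
    sched.step(worker)
    print(sorted(sched.parked), sched.runnable)
```

The scheduler never blocks: it keeps stepping, and the request moves from parked to runnable only on the step where the worker reports it finished.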

The decode side: get_num_new_matched_tokens and the WAITING_FOR_REMOTE_KVS state

Start at the consumer. A request arrives at a decode instance with no local KV cache — the prompt was prefilled elsewhere. The decode scheduler must somehow learn that the tokens it would normally need to compute are already sitting in a remote prefill instance, allocate space for them, and not schedule any forward work until they have arrived. The hook for the first step is get_num_new_matched_tokens, called in the scheduler’s WAITING-queue pass right after the local prefix-cache lookup from Chapter 7:

# vllm/v1/core/sched/scheduler.py
if self.connector is not None:
    ext_tokens, load_kv_async = (
        self.connector.get_num_new_matched_tokens(
            request, num_new_local_computed_tokens
        )
    )

Source: vllm/v1/core/sched/scheduler.py

The connector returns two things: how many tokens it can supply beyond what the local prefix cache already has, and whether those tokens load asynchronously. For the NIXL consumer, a remote-prefill request reports its whole prompt as remotely available and flags the load as async:

# vllm/distributed/kv_transfer/kv_connector/v1/nixl/scheduler.py
if params is not None and params.get("do_remote_prefill"):
    # Remote prefill: get all prompt blocks from remote.
    token_ids = request.prompt_token_ids or []
    actual = self._mamba_prefill_token_count(len(token_ids))
    count = actual - num_computed_tokens
    if count > 0:
        return count, True

Source: vllm/distributed/kv_transfer/kv_connector/v1/nixl/scheduler.py

That True is what reshapes the whole step. The scheduler allocates blocks for the external tokens, then parks the request instead of running it:

# vllm/v1/core/sched/scheduler.py
if load_kv_async:
    # If loading async, allocate memory and put request
    # into the WAITING_FOR_REMOTE_KV state.
    request.status = RequestStatus.WAITING_FOR_REMOTE_KVS
    ...
    request.num_computed_tokens = num_computed_tokens
    self._inflight_prefills.add(request)
    continue

Source: vllm/v1/core/sched/scheduler.py

WAITING_FOR_REMOTE_KVS is a sibling of WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR from Chapter 14: a request that is admitted and has memory reserved, but is blocked on an external event before it can be scheduled. The distinction that makes it cheap is that it consumes KV blocks but no token budget — recall from Chapter 5 that the token budget is the scheduler’s per-step allowance of tokens it will actually push through the model. A parked request reserves the memory its KV will land in, but because no forward work runs for it, it does not eat into the budget that the running decodes are competing over. So a decode instance can hold a large backlog of inbound prefills without starving the tokens it is currently generating.
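
The "blocks but no budget" property amounts to two separate counters. The following is a toy accounting model, not the scheduler's real data structures; the 16-token block size and all the quantities are assumptions chosen for the example.

```python
from dataclasses import dataclass


def blocks_for(num_tokens, block_size=16):
    # Ceiling division: a partly filled block still occupies a whole block.
    return -(-num_tokens // block_size)


@dataclass
class StepAccounting:
    token_budget: int  # tokens the scheduler may push through the model this step
    free_blocks: int   # KV blocks not yet reserved

    def park_remote_prefill(self, prompt_tokens):
        """Reserve landing space for inbound KV; spend no token budget."""
        needed = blocks_for(prompt_tokens)
        if needed > self.free_blocks:
            return False
        self.free_blocks -= needed
        return True

    def schedule_decode(self):
        """A running decode spends one token of budget this step."""
        if self.token_budget < 1:
            return False
        self.token_budget -= 1
        return True


acct = StepAccounting(token_budget=4, free_blocks=200)
parked = [acct.park_remote_prefill(1000) for _ in range(3)]
decoded = [acct.schedule_decode() for _ in range(4)]
print(parked, decoded, acct.free_blocks, acct.token_budget)
```

Parking three 1000-token prompts drains the block counter but leaves the token budget untouched, so all four decodes still run in the same step. What eventually limits the backlog is free memory, not budget.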

Notice too that num_computed_tokens is set to the full external count before the bytes arrive. This is deliberate optimistic bookkeeping: the scheduler pretends the tokens are already computed so that, once they land, the request can step straight into decode with nothing left to do. The comment in the code is careful to flag the optimism, because if the transfer fails, num_computed_tokens is reset to the number of tokens that actually loaded, and the request falls back to recomputing the rest locally.

The request sits in that state until the worker-side connector reports the receive complete. The scheduler learns this through KVConnectorOutput.finished_recving, and only then does the request rejoin the schedulable pool:

# vllm/v1/core/sched/scheduler.py
if request.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
    if request.request_id not in self.finished_recving_kv_req_ids:
        return False
    self._update_waiting_for_remote_kv(request)

Source: vllm/v1/core/sched/scheduler.py

_update_waiting_for_remote_kv does one more thing that ties back to Chapter 7. The transferred blocks are now real cached KV, so it registers them with the local block pool — a remote prefill becomes a local prefix-cache entry, free to be reused by the next request that shares the prompt. And it handles a subtle edge worth unpacking. Prefill computes two distinct things for the prompt: the KV cache (one key/value pair per token, which is what got transferred) and the logits (the probability distribution over the next token, which prefill produces only for the final position and which decode needs in order to sample the first output token). The transfer carries the KV but not the logits, so after a full-prompt transfer the decode instance has KV for every token yet no logits for the last one. The fix is to mark just the final position as not-yet-computed, forcing the model to run a single-token forward pass over it. That one pass re-derives the missing logits from the KV that is already present.

# vllm/v1/core/sched/scheduler.py
# on a full prompt hit, we need to re-compute the last token
# in order to be able to sample the next token
if request.num_computed_tokens == request.num_tokens:
    request.num_computed_tokens = request.num_tokens - 1

Source: vllm/v1/core/sched/scheduler.py

One position of recompute, never the whole prompt. That is the entire saving disaggregation buys, and it is exactly the saving prefix caching buys; the only difference is the cache lives across a network.
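
The arithmetic of that saving fits in one small function. This is a sketch that mirrors the adjustment in the excerpt above; the function name and the standalone form are hypothetical, not part of the scheduler.

```python
def tokens_to_run_after_transfer(num_tokens, num_loaded):
    """How many prompt positions still need a forward pass on the decode side."""
    num_computed = num_loaded
    # Full hit: step back one position so the model produces the logits
    # needed to sample the first output token.
    if num_computed == num_tokens:
        num_computed = num_tokens - 1
    return num_tokens - num_computed


print(tokens_to_run_after_transfer(1000, 1000))  # full transfer
print(tokens_to_run_after_transfer(1000, 800))   # partial load, rest recomputed
print(tokens_to_run_after_transfer(1000, 0))     # nothing loaded, full local prefill
```

A full transfer leaves a single position to run, a partial load leaves only the missing tail, and a failed transfer degrades to ordinary local prefill.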

We have now seen both ends in isolation. The sequence diagram below stitches them into the single timeline of one request, from arrival at the decode instance through to the first sampled token, and shows where the parking and the pull-based transfer fall.

sequenceDiagram
    participant U as Client
    participant DS as Decode scheduler (D, CPU)
    participant DW as Decode worker (D, GPU)
    participant PW as Prefill worker (P, GPU)
    U->>DS: request, prompt already prefilled on P
    DS->>DS: get_num_new_matched_tokens reports N tokens remote, async
    DS->>DS: reserve blocks, state = WAITING_FOR_REMOTE_KVS
    Note over DS: consumes KV blocks, no token budget
    DW->>PW: start_load_kv posts RDMA READ of KV blocks (pull)
    PW-->>DW: KV bytes stream in over the wire
    DW->>DS: get_finished puts request id in finished_recving
    DS->>DS: register blocks in local prefix cache
    DS->>DS: mark last position uncomputed (need logits)
    DS->>DW: schedule single-token forward over last position
    DW->>U: first output token
    Note over DW,U: then normal decode, one token per step

The prefill side: holding blocks until they are read

Now the producer. A prefill instance does its normal compute-bound forward pass, but it must not free the KV blocks the instant the request “finishes,” because a decode instance still needs to read them. The scheduler-side request_finished is where it decides to defer:

# vllm/distributed/kv_transfer/kv_connector/v1/nixl/scheduler.py
delay_free_blocks = any(len(group) > 0 for group in block_ids)
...
if delay_free_blocks:
    # Prefill request on remote. It will be read from D upon completion
    request_kv_blocks_ttl = self._kv_lease_duration
    ...
    self._reqs_need_send[request.request_id] = (
        time.perf_counter() + request_kv_blocks_ttl
    )

Source: vllm/distributed/kv_transfer/kv_connector/v1/nixl/scheduler.py

Returning delay_free_blocks=True tells the engine the connector has taken over responsibility for these blocks; they will not be freed when the request finishes. This is the same return contract request_finished uses for async offload saves in Chapter 16. The difference is the lease. A lease here means a time-bounded pin: rather than holding the blocks until some consumer definitely reads them (which might be never, if the consumer crashes), the producer pins them for only _kv_lease_duration seconds. If no decode instance shows up to read them before the lease expires, they are reclaimed anyway. For transfers that are legitimately still in flight, heartbeats (_heartbeat_by_engine) renew the lease so a slow-but-live reader is not cut off. The motivation is a failure mode: without the timeout, a crashed or slow consumer would strand prefill memory forever, and because every parked block is capacity the prefill pool cannot reuse, the pool would slowly bleed throughput until it could admit no new prompts. Distributed cache coherence under failure is, unsurprisingly, where most of the operational complexity hides.
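
The lease idea can be sketched as a table of deadlines. This is a simplified illustration, not the NIXL connector's implementation: the `LeaseTable` class is hypothetical, and it renews per request, whereas the source's `_heartbeat_by_engine` suggests renewal is tracked per engine. The clock is injectable so the example runs deterministically.

```python
import time


class LeaseTable:
    """Time-bounded pins: blocks stay held until read, released, or expired."""

    def __init__(self, lease_s, clock=time.perf_counter):
        self.lease_s = lease_s
        self.clock = clock
        self.deadline = {}  # request id -> time at which the pin expires

    def pin(self, req_id):
        self.deadline[req_id] = self.clock() + self.lease_s

    def heartbeat(self, req_id):
        # A live reader pushes the deadline out; unknown ids are ignored.
        if req_id in self.deadline:
            self.deadline[req_id] = self.clock() + self.lease_s

    def release(self, req_id):
        # The consumer finished reading (or did not need the blocks).
        self.deadline.pop(req_id, None)

    def reap_expired(self):
        # Reclaim every pin whose deadline has passed.
        now = self.clock()
        expired = [r for r, d in self.deadline.items() if d <= now]
        for r in expired:
            del self.deadline[r]
        return expired


fake_now = [0.0]
table = LeaseTable(lease_s=10.0, clock=lambda: fake_now[0])
table.pin("a")
table.pin("b")
fake_now[0] = 8.0
table.heartbeat("a")         # "a" has a slow but live reader
fake_now[0] = 12.0
print(table.reap_expired())  # "b" was never read or renewed
fake_now[0] = 19.0
print(table.reap_expired())  # "a" expires once heartbeats stop
```

The design choice to notice is that reclamation depends only on the producer's own clock, so a consumer that disappears cannot hold prefill memory past the lease.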

The transfer itself is a pull, not a push, and the direction matters. A push would have the producer send bytes the instant prefill finishes, but the producer does not know where the blocks should land on the consumer until the consumer has allocated its own block slots, and it does not know whether the consumer even still needs them (it may already hold them in its prefix cache). A pull inverts this: the consumer, which knows its own layout and its own needs, reaches across RDMA and reads the blocks it wants from the producer’s memory. That is why all the initiating logic we saw lives on the decode side. The worker-side read for NIXL posts a point-to-point transfer per remote worker:

# vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py
def _read_blocks(
    self,
    read_spec: ReadSpec,
    dst_engine_id: str,
    ...
):
    """
    Post a READ point-to-point xfer request from a single local worker to
    a single remote worker.
    """

Source: vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py

The full-prefix-cache-hit case is the elegant degenerate one: if the decode instance already has all the blocks locally, it reads nothing and merely sends a notification so the prefill side knows it can release its lease early.

# vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py
# Full prefix cache hit: do not need to read remote blocks,
# just notify P worker that we have the blocks we need.
if len(local_block_ids) == 0:

Source: vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py

Where the transfer hides inside the forward pass

The reason any of this stays off the critical path is overlap: the worker-side load and save are bolted onto execute_model rather than run as separate phases, so the bytes move while the GPU is busy with other work instead of during an idle stall. “Off the critical path” means the transfer’s latency hides inside time the GPU was going to spend computing anyway, rather than adding to the wall-clock the user waits. The model-runner mixin achieves this by wrapping the forward pass in a context manager that kicks off the load before the forward and harvests completions after:

# vllm/v1/worker/kv_connector_model_runner_mixin.py
# Background KV cache transfers happen here.
kv_connector.start_load_kv(get_forward_context())
try:
    yield output
finally:
    if wait_for_save and not defer_finalize:
        kv_connector.wait_for_save()

    output.finished_sending, output.finished_recving = (
        kv_connector.get_finished(scheduler_output.finished_req_ids)
    )

Source: vllm/v1/worker/kv_connector_model_runner_mixin.py

Reading it in order: start_load_kv fires the receive without waiting for it; the forward pass proceeds on the GPU while those bytes arrive in the background; then get_finished polls which transfers have actually completed and feeds those ids back to the scheduler as the finished_recving set, which is precisely the signal that springs WAITING_FOR_REMOTE_KVS requests loose in the section above. The overlap goes the other way too. On the producer side the layer-by-layer save_kv_layer / wait_for_save hooks let the KV for each transformer layer start streaming out the moment that layer is computed, rather than waiting for the whole forward pass to finish. So a request’s early-layer KV can already be on the wire to the consumer while the producer is still computing its later layers — the transfer overlaps the prefill that produced it.
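
The ordering described above is the ordinary context-manager pattern: work before the `yield`, the forward pass inside the `with` body, harvesting in `finally`. The toy below uses a stand-in connector that only records call order; the names echo the real hooks but the signatures are simplified, and `with_kv_transfer` is a hypothetical name.

```python
from contextlib import contextmanager


class ToyConnector:
    """Stand-in connector that records the order in which hooks fire."""

    def __init__(self):
        self.log = []

    def start_load_kv(self):
        self.log.append("start_load_kv")

    def wait_for_save(self):
        self.log.append("wait_for_save")

    def get_finished(self):
        self.log.append("get_finished")
        # (finished_sending, finished_recving): pretend one receive completed.
        return set(), {"req-1"}


@contextmanager
def with_kv_transfer(connector, output):
    connector.start_load_kv()  # kick off the load, do not wait for it
    try:
        yield output           # the forward pass runs inside the with-body
    finally:
        connector.wait_for_save()
        output["finished_sending"], output["finished_recving"] = (
            connector.get_finished()
        )


conn = ToyConnector()
out = {}
with with_kv_transfer(conn, out):
    conn.log.append("forward")  # placeholder for the model's forward pass

print(conn.log)
print(out["finished_recving"])
```

Because the harvest sits in `finally`, completions are collected even if the forward pass raises, and the forward pass always runs between the load being posted and the completions being polled.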

This overlap is not free, and the cost lands on Chapter 10’s machinery. The connector base flags that these async layer operations cannot be captured in a CUDA graph and force requires_piecewise_for_cudagraph. A CUDA graph captures a fixed sequence of GPU launches and replays them with near-zero per-step overhead, but a transfer whose timing depends on the network is not a fixed sequence, so the graph has to be broken into pieces around it. The trade is explicit: you give up some of the launch-overhead savings from CUDA graphs in exchange for the ability to overlap transfer with compute at all.

What is still hard

Disaggregation is real and shipping, but it is not free goodput. The transfer must be amortized, meaning the cost of moving the KV has to be small next to the prefill cost it replaces; otherwise you have spent a network round trip to save a forward pass that was cheaper than the round trip. This is the same caveat both papers raised, now made concrete: NIXL exposes a kv_recompute_threshold (default 64 tokens) below which it simply recomputes the prompt locally on the decode instance rather than pulling it, because for a prompt that short the network round trip costs more than the prefill it would save. The two curves below make the crossover visible: local recompute cost grows linearly with prompt length from near zero, while the pull cost starts at a fixed RDMA round-trip floor and rises only gently, so for short prompts recompute wins and for long prompts the pull wins, and the threshold is just where they cross. Set it too low and you pull tiny prompts that would have been cheaper to recompute; set it too high and you recompute prompts that would have been cheaper to pull. You lose either way.

Figure (illustrative): local recompute cost and remote pull cost, in microseconds, plotted against prompt length. The curve shapes (linear recompute from zero, a fixed round-trip floor plus gentle slope for the pull) are the real relationship and the crossover is placed at NIXL’s 64-token default, but the absolute microsecond values are made up, not measured.
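
The crossover itself is a one-line calculation once the two cost models are written down. In the sketch below, recompute cost is linear in prompt length and pull cost is a fixed floor plus a smaller per-token slope. The three constants are invented, chosen only so that the curves cross at 64 tokens; they are not measurements and do not describe how NIXL picks its default.

```python
def recompute_cost_us(n_tokens, per_token_us):
    # Local prefill: grows linearly from zero.
    return per_token_us * n_tokens


def pull_cost_us(n_tokens, floor_us, per_token_us):
    # Remote pull: fixed round-trip floor plus a gentle per-token slope.
    return floor_us + per_token_us * n_tokens


def crossover_tokens(recompute_per_token_us, pull_floor_us, pull_per_token_us):
    # Solve a*n = floor + b*n for n; only meaningful when a > b.
    if recompute_per_token_us <= pull_per_token_us:
        raise ValueError("pull never becomes cheaper than recompute")
    return pull_floor_us / (recompute_per_token_us - pull_per_token_us)


# Made-up constants: 1.0 us/token recompute, 48 us floor, 0.25 us/token pull.
A, FLOOR, B = 1.0, 48.0, 0.25
threshold = crossover_tokens(A, FLOOR, B)


def should_pull(n_tokens):
    return pull_cost_us(n_tokens, FLOOR, B) < recompute_cost_us(n_tokens, A)


print(threshold, should_pull(32), should_pull(128))
```

Below the threshold the fixed floor dominates and recompute wins; above it the steeper recompute slope dominates and the pull wins.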

The deeper open problems are about balance and failure. The prefill and decode pools must be sized in proportion to the actual workload, because each pool is loaded by a different quantity: the prefill pool’s load scales with total prompt tokens per second $\lambda_{\text{prompt}}$, the decode pool’s with total output tokens per second $\lambda_{\text{output}}$. A workload of short prompts and long answers needs lots of decode and little prefill; a workload of long documents summarized in a sentence needs the reverse. The right split tracks the ratio $\lambda_{\text{prompt}} / \lambda_{\text{output}}$, which is a property of the traffic, not the deployment, and it drifts hour to hour, so a split sized for yesterday’s traffic is mis-sized today. This is exactly why disaggregation pushes hard on the autoscaling story of Chapter 20: the two pools have to be scaled independently and continuously. Two more sharp edges remain. Heterogeneous TP between the pools (the producer and consumer running different tensor-parallel degrees) means a single logical block is split across a different number of GPUs on each side, so the byte layouts do not line up, which is the whole reason _read_blocks carries that block_size_ratio remapping and staging-buffer logic. And the lease-and-heartbeat dance is a distributed-systems problem in its own right: every pinned block on a producer is memory a crashed consumer can strand until a timeout fires.
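
The sizing argument at the start of this passage can be put in numbers. The sketch below sizes each pool from its own load signal; the per-GPU capacities and the two traffic mixes are invented for illustration, and a real deployment would measure them.

```python
import math


def size_pools(prompt_tok_per_s, output_tok_per_s,
               prefill_cap_per_gpu, decode_cap_per_gpu):
    """Return (prefill_gpus, decode_gpus) needed to absorb the offered load."""
    prefill_gpus = math.ceil(prompt_tok_per_s / prefill_cap_per_gpu)
    decode_gpus = math.ceil(output_tok_per_s / decode_cap_per_gpu)
    return prefill_gpus, decode_gpus


# Hypothetical capacities: one GPU prefills 10k prompt tok/s or decodes 2k output tok/s.
PREFILL_CAP, DECODE_CAP = 10_000, 2_000

# Short prompts, long answers: the split leans toward decode.
chat = size_pools(20_000, 40_000, PREFILL_CAP, DECODE_CAP)

# Long documents summarized in a sentence: the split leans toward prefill.
summarize = size_pools(200_000, 4_000, PREFILL_CAP, DECODE_CAP)

print(chat, summarize)
```

Both mixes need the same total number of GPUs under these assumptions, but in opposite proportions, which is why a split sized for one traffic mix is mis-sized for the other.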

Which brings us to the obvious next question. If a request must be routed to a decode instance that can read from the prefill instance that holds its KV — and ideally to a prefill instance that already has its prompt prefix cached — then routing is no longer stateless. The router has to know each replica’s P/D role, its KV residency, and its queue depth, and steer accordingly. That is Chapter 18.

Further reading

  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — arXiv:2401.09670 — disaggregates the two phases onto separate pools and optimizes for goodput (requests meeting both TTFT and TPOT SLOs); the foundational goodput-and-placement argument.
  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting — arXiv:2311.18677 — splits phases across machine pools at fleet scale, including different GPU SKUs per pool and KV-transfer provisioning; read for the heterogeneous-hardware and capacity view.
  • Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — arXiv:2407.00079 — KV-centric tiered store underpinning a disaggregated cluster; both a paper and a connector backend in vLLM’s tree.