When the cache overflows: KV offloading and external KV stores

In Chapter 7 the GPU’s free KV pool turned into a content-addressed cache: identical prefixes get hashed, deduplicated, and reused, and prefix reuse became the single largest TTFT lever in real traffic. But that cache has a hard ceiling. The block pool lives in HBM, and HBM is the most expensive, scarcest resource in the box. When a popular system prompt has been pushed out by other traffic, the next request that wants it pays full prefill again, even though the exact KV bytes existed a few seconds ago. Chapter 15 then spread the model across many GPUs, which adds aggregate HBM but does nothing to change the eviction problem on any single device. The cache still overflows; it just overflows on more GPUs at once.

This chapter is about giving the prefix cache somewhere to fall back to. CPU DRAM (the host’s main memory, reached over the PCIe bus) is roughly an order of magnitude cheaper per byte than HBM (the GPU’s on-package high-bandwidth memory, where the block pool lives), and a host has a lot more of it. Local NVMe (a solid-state disk on the same machine) is cheaper still, and a networked store (memory or disk on another machine, reached over the network) cheaper again. The KV blocks you evicted from the GPU are not garbage; they are cold cache lines. If you can spill them to a cheaper tier and pull them back faster than you could recompute them, you have extended the effective prefix cache far beyond device capacity. The catch is the “faster than recompute” part, and the abstraction vLLM uses to manage the spilling is, deliberately, the same connector API that the next chapter will stretch across a network to move KV between separate machines.

These tiers form a classic memory hierarchy: each step down is cheaper and larger but slower to reach. The diagram below shows where KV blocks can live and roughly how fast each tier answers a request for them. Think of HBM as the working set, DRAM as the overflow, and the slower tiers as deep archive that only pays off for prefixes long and popular enough to amortize the trip.

flowchart TD
    GPU["GPU HBM block pool: fastest, smallest, most expensive"]
    DRAM["CPU DRAM (over PCIe): ~10x cheaper, much larger"]
    NVME["Local NVMe disk: cheaper still, larger"]
    NET["Networked / external store: cheapest, largest, shared across replicas"]
    GPU -->|"evict / spill (store)"| DRAM
    DRAM -->|"fetch on prefix hit (load)"| GPU
    DRAM -->|"spill"| NVME
    NVME -->|"load"| DRAM
    NVME -->|"spill"| NET
    NET -->|"load"| NVME

That diagram shows where blocks can live and how they move, but not how far apart the tiers really are. The cost of the gap is the whole reason offloading is a bet rather than a free win: each step down is roughly an order of magnitude slower to reach. The chart below shows representative round-trip latencies to fetch one KV block from each tier, on a log scale so the roughly-10x-per-step spacing is visible at a glance.

Illustrative orders of magnitude: the shape (about 10x slower per step down) is the point, not the exact figures, which vary with bus generation, block size, and contention.

The economics: when is a load cheaper than a recompute?

Decode is memory-bound and prefill is compute-bound; that asymmetry from Chapter 3 is exactly what makes offloading viable. Recomputing a prefix means running prefill over every cached token: a compute-bound pass whose cost grows with prefix length. Loading the same prefix from CPU means a PCIe transfer whose cost is the KV byte count divided by bus bandwidth. For a prefix of $n$ tokens the KV payload follows directly from the model shape:

$$\text{bytes} = n \cdot 2 \cdot L \cdot H \cdot d_{\text{head}} \cdot b$$

where $L$ is the number of layers, $H$ the number of KV heads, $d_{\text{head}}$ the per-head dimension, $b$ the bytes per element, and the factor $2$ counts the separate key and value tensors. The load time is then $t_{\text{load}} = \text{bytes} / \text{BW}_{\text{bus}}$, while the recompute time $t_{\text{recompute}}$ rises with $n$ on the compute engine. The decision is therefore a direct comparison of two times, both of which grow with prefix length: one on the compute engine, the other on the bus. Because moving already-computed bytes is far cheaper per token than re-deriving them, the two curves cross: below some prefix length recompute wins, above it the load wins. For a long shared prefix the transfer can win by a wide margin. For a short prefix it can lose, because the fixed overheads of staging a transfer (setting up the transfer descriptor that tells the copy engine which bytes to move, synchronizing CUDA streams, and copying a partial first block) swamp the few hundred microseconds prefill would have taken anyway.
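As a quick sanity check on the payload formula, the sketch below plugs in numbers. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and the 25 GB/s effective bus bandwidth are hypothetical values chosen for illustration, not measurements of any particular model or machine.

```python
def kv_bytes(n_tokens: int, layers: int, kv_heads: int,
             head_dim: int, bytes_per_elem: int) -> int:
    """KV payload for a prefix: n * 2 * L * H * d_head * b."""
    return n_tokens * 2 * layers * kv_heads * head_dim * bytes_per_elem


def load_seconds(payload_bytes: int, bus_bytes_per_s: float) -> float:
    """Pure transfer time, ignoring any fixed staging overhead."""
    return payload_bytes / bus_bytes_per_s


# Hypothetical shape: 32 layers, 8 KV heads, d_head = 128, 2 bytes per element.
payload = kv_bytes(4096, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)
print(payload // 2**20, "MiB")                               # 512 MiB
# Hypothetical effective bus bandwidth of 25 GB/s.
print(round(load_seconds(payload, 25e9) * 1e3, 2), "ms")     # 21.47 ms
```

Under these assumptions each token costs 128 KiB of KV, so a 4096-token prefix is half a gibibyte on the bus. Whether roughly 21 ms beats prefill over those 4096 tokens depends on the compute side, which this sketch deliberately leaves out.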

The crossover below makes this concrete. Recompute cost is essentially linear in prefix length with no fixed startup, so it is a straight line through the origin. Load cost is also linear in length but with a cheaper per-token slope (you are moving bytes, not re-deriving them) plus a fixed staging overhead (transfer-descriptor setup, stream sync, partial first block), so it starts high and climbs slowly. Below the crossing point the fixed overhead makes the load lose; above it the cheaper slope makes the load win by a widening margin.

Illustrative: recompute = 8 us/token, load = 300 us fixed + 5 us/token. Real slopes and the fixed overhead depend on model shape, GPU, and bus bandwidth, but the qualitative crossover (short prefix favors recompute, long prefix favors load) is the durable point.
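With the illustrative figures above, the crossing point can be computed directly by setting the two linear costs equal. The function names here are hypothetical, and the constants are the chapter's illustrative ones rather than measured values.

```python
def recompute_us(n_tokens: int, per_token_us: float = 8.0) -> float:
    """Recompute cost: a straight line through the origin."""
    return per_token_us * n_tokens


def load_us(n_tokens: int, fixed_us: float = 300.0,
            per_token_us: float = 5.0) -> float:
    """Load cost: fixed staging overhead plus a cheaper per-token slope."""
    return fixed_us + per_token_us * n_tokens


def crossover_tokens(recompute_slope: float = 8.0, fixed_us: float = 300.0,
                     load_slope: float = 5.0) -> float:
    """Prefix length where load and recompute cost the same."""
    if recompute_slope <= load_slope:
        raise ValueError("load never wins if its slope is not cheaper")
    return fixed_us / (recompute_slope - load_slope)


print(crossover_tokens())                  # 100.0
print(recompute_us(50), load_us(50))       # 400.0 550.0  (recompute wins)
print(recompute_us(1000), load_us(1000))   # 8000.0 5300.0 (load wins)
```

With these numbers the break-even prefix is 100 tokens. Doubling the fixed overhead doubles the break-even length, which is why staging cost matters so much for short prefixes.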

So offloading is not a free win; it is a bet that the prefix is both long enough and reused often enough to amortize the round trip. vLLM encodes this skepticism directly. The CPU spec defaults to offloading prompt blocks only, skipping decode-phase KV (the KV produced one token at a time during generation, which is unlikely to be a shared prefix for any future request), and it can require a block to be requested more than once before it earns a slot in the cache. The store threshold and the prompt-only default both exist because indiscriminately offloading everything just burns PCIe bandwidth on blocks no one will ask for again.
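The two filters just described can be sketched as a small admission check. This is a toy model under assumed names (`StoreFilter`, `store_threshold`, `prompt_only` are hypothetical), not vLLM's actual configuration surface or implementation.

```python
from collections import Counter


class StoreFilter:
    """Toy admission policy: prompt blocks only, and only after repeat requests."""

    def __init__(self, store_threshold: int = 2, prompt_only: bool = True):
        self.store_threshold = store_threshold
        self.prompt_only = prompt_only
        self._times_seen = Counter()

    def should_store(self, block_key: bytes, is_prompt_block: bool) -> bool:
        # Decode-phase KV is unlikely to be a shared prefix, so skip it.
        if self.prompt_only and not is_prompt_block:
            return False
        # A block earns a slot only once it has been requested often enough.
        self._times_seen[block_key] += 1
        return self._times_seen[block_key] >= self.store_threshold
```

A block seen once is not stored; the second request for the same key is. Raising the threshold trades hit rate on moderately popular prefixes for less bus traffic spent on one-off prompts.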

The decision flow below traces how the engine resolves a single waiting request against this tiered cache. It checks the fastest tier first and falls through to the next only when a tier misses; prefill then runs over whatever tokens no tier could supply.

flowchart TD
    REQ["new request: hash prompt into block keys"] --> GPUHIT{"blocks already in GPU HBM?"}
    GPUHIT -->|"yes"| REUSE["reuse in place, no transfer"]
    GPUHIT -->|"no"| OFFHIT{"blocks found in offload tier? (lookup)"}
    OFFHIT -->|"no"| RECOMP["recompute: run prefill over the missing tokens"]
    OFFHIT -->|"yes"| WORTH{"long and reused enough to beat recompute?"}
    WORTH -->|"no"| RECOMP
    WORTH -->|"yes"| LOAD["prepare_load: pin blocks, transfer DRAM to HBM"]
    REUSE --> SCHED["schedule remaining tokens for prefill / decode"]
    LOAD --> SCHED
    RECOMP --> SCHED
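
The same flow can be written as a small function. This is a simplified sketch: the function name and the single `min_load_tokens` threshold are hypothetical stand-ins for the "long and reused enough" test, which in practice depends on more than prefix length.

```python
def resolve_prefix(block_key: bytes, gpu_keys: set, offload_keys: set,
                   prefix_tokens: int, min_load_tokens: int = 100) -> str:
    """Check the fastest tier first and fall through on a miss."""
    if block_key in gpu_keys:
        return "reuse"        # already in HBM, no transfer
    if block_key not in offload_keys:
        return "recompute"    # no tier holds it
    if prefix_tokens < min_load_tokens:
        return "recompute"    # present, but too short to beat prefill
    return "load"             # pin and transfer DRAM to HBM


gpu, dram = {b"a"}, {b"b", b"c"}
print(resolve_prefix(b"a", gpu, dram, prefix_tokens=2000))  # reuse
print(resolve_prefix(b"b", gpu, dram, prefix_tokens=2000))  # load
print(resolve_prefix(b"c", gpu, dram, prefix_tokens=32))    # recompute
print(resolve_prefix(b"z", gpu, dram, prefix_tokens=2000))  # recompute
```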

The OffloadingManager: a second-tier block allocator in the scheduler

The scheduler-side brain of offloading is the OffloadingManager, and the docstrings on its primitives lay out a vocabulary that should feel familiar from the GPU block pool, only one level down the hierarchy.

# materials/.../vllm/v1/kv_offload/base.py
class OffloadingManager(ABC):
    @abstractmethod
    def lookup(self, key: OffloadKey, req_context: ReqContext) -> bool | None:
        """Checks whether a single block is offloaded and ready to be read."""
    @abstractmethod
    def prepare_load(self, keys, req_context) -> LoadStoreSpec:
        """Prepare the given blocks to be read.
        The given blocks will be protected from eviction..."""
    @abstractmethod
    def prepare_store(self, keys, req_context) -> PrepareStoreOutput | None:
        """Prepare the given blocks to be written."""

Source: vllm/v1/kv_offload/base.py

Read these three methods as the verbs of a second allocator. lookup asks whether a cold block is present in the offload tier (the answer feeds the “blocks found in offload tier?” branch in the diagram above). prepare_load pins blocks so they cannot be evicted mid-transfer (pinning means marking them in-use so the eviction policy skips them) and hands back a LoadStoreSpec, a small record describing where the bytes live so the worker knows what to copy. prepare_store reserves space for blocks on their way out of the GPU, and crucially returns the list of other blocks it had to evict to make room, so the caller learns exactly what was displaced. This is an allocator with its own eviction policy, sitting behind the GPU’s allocator, addressed by the same content hashes Chapter 7 computed for prefix caching. The key is just the block hash packed with its KV-cache-group index (the index identifying which group of attention layers a block belongs to, since a model can have several such groups):

# materials/.../vllm/v1/kv_offload/base.py
def make_offload_key(block_hash: bytes, group_idx: int) -> OffloadKey:
    """Pack a block hash and group index into an `OffloadKey`."""
    return OffloadKey(block_hash + group_idx.to_bytes(4, "big", signed=False))

Source: vllm/v1/kv_offload/base.py

Because the offload key derives from the same parent-chained block hash the GPU prefix cache uses, a block evicted from HBM and a block resident in DRAM share an identity. A later request hashing its prompt will produce the same keys, and a lookup will find them. The two caches are tiers of one logical cache, not two separate caches that happen to hold similar data.
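A short sketch makes the shared identity concrete. The `make_offload_key` body is the one quoted above; treating `OffloadKey` as a thin wrapper over `bytes` is an assumption, and `chained_block_hash` is a hypothetical SHA-256 stand-in for the real parent-chained block hash, which is computed differently.

```python
import hashlib
from typing import NewType, Tuple

# Assumption: the key type is a thin alias over bytes.
OffloadKey = NewType("OffloadKey", bytes)


def make_offload_key(block_hash: bytes, group_idx: int) -> OffloadKey:
    """Pack a block hash and group index into an OffloadKey."""
    return OffloadKey(block_hash + group_idx.to_bytes(4, "big", signed=False))


def chained_block_hash(parent: bytes, token_ids: Tuple[int, ...]) -> bytes:
    """Hypothetical stand-in: hash this block's tokens together with its parent."""
    return hashlib.sha256(parent + repr(token_ids).encode()).digest()


# Two requests with the same prompt prefix derive the same key independently.
first = make_offload_key(chained_block_hash(b"", (1, 2, 3, 4)), group_idx=0)
later = make_offload_key(chained_block_hash(b"", (1, 2, 3, 4)), group_idx=0)
other_group = make_offload_key(chained_block_hash(b"", (1, 2, 3, 4)), group_idx=1)
print(first == later, first == other_group)  # True False
```

Because the key is pure content plus group index, neither tier needs to tell the other what it holds; any request that hashes the same prefix arrives at the same key.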

The CPU manager is, under the hood, exactly a paged allocator with a pluggable eviction policy:

# materials/.../vllm/v1/kv_offload/cpu/manager.py
_CACHE_POLICIES: dict[str, type[CachePolicy]] = {
    "lru": LRUCachePolicy,
    "arc": ARCCachePolicy,
}

Source: vllm/v1/kv_offload/cpu/manager.py

LRU is the default; ARC (adaptive replacement) is offered for workloads where pure recency misbehaves. Either way the manager owns ref-counting, a free list, and event emission, and delegates only the “which block dies next” decision to the policy. If you have built a buffer cache, none of this is new; what is new is that the cache lines are KV blocks keyed by prompt content.
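To make the three verbs concrete, here is a toy LRU tier built on an ordered dictionary. It is a sketch of the idea only: the class name, the `complete_load` method, and the plain-list return values are hypothetical simplifications, and the real manager returns `LoadStoreSpec` and `PrepareStoreOutput` objects and emits events.

```python
from collections import OrderedDict
from typing import List, Optional


class ToyLRUOffloadTier:
    """Fixed-capacity block store with pinning and LRU eviction."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self._pins = OrderedDict()  # key -> pin count, oldest first

    def lookup(self, key: bytes) -> bool:
        return key in self._pins

    def prepare_load(self, keys: List[bytes]) -> List[bytes]:
        for key in keys:
            self._pins[key] += 1          # protect from eviction mid-transfer
            self._pins.move_to_end(key)   # mark as most recently used
        return list(keys)

    def complete_load(self, keys: List[bytes]) -> None:
        for key in keys:
            self._pins[key] -= 1          # transfer done, evictable again

    def prepare_store(self, keys: List[bytes]) -> Optional[List[bytes]]:
        new_keys = [k for k in keys if k not in self._pins]
        overflow = len(self._pins) + len(new_keys) - self.capacity
        victims: List[bytes] = []
        if overflow > 0:
            # Oldest unpinned blocks die first.
            victims = [k for k, pins in self._pins.items() if pins == 0][:overflow]
            if len(victims) < overflow:
                return None               # not enough evictable space
        for key in victims:
            del self._pins[key]
        for key in new_keys:
            self._pins[key] = 0
        return victims                    # the caller learns what was displaced
```

In this sketch, storing a third block into a two-block tier evicts the least recently used unpinned block and reports it, while a store that would need to evict a pinned block is refused with `None`.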

The connector API: the seam this chapter introduces

The manager decides what to move. Something else has to actually move bytes between HBM and DRAM during the forward pass, and that something is the connector. KVConnectorBase_V1 is the abstraction this chapter introduces and the next chapter reuses wholesale. It is split cleanly into scheduler-side and worker-side halves, and its scheduler-side entry point is the hook that lets external KV participate in admission:

# materials/.../vllm/distributed/kv_transfer/kv_connector/v1/base.py
@abstractmethod
def get_num_new_matched_tokens(
    self, request: "Request", num_computed_tokens: int,
) -> tuple[int | None, bool]:
    """Get number of new tokens that can be loaded from the
    external KV cache beyond the num_computed_tokens."""

Source: vllm/distributed/kv_transfer/kv_connector/v1/base.py

The scheduler calls this for every waiting request, right after it has computed the GPU prefix-cache hit, and folds the answer into how much prefill it must actually schedule:

# materials/.../vllm/v1/core/sched/scheduler.py
# Get externally-cached tokens if using a KVConnector.
if self.connector is not None:
    ext_tokens, load_kv_async = (
        self.connector.get_num_new_matched_tokens(
            request, num_new_local_computed_tokens
        )
    )

Source: vllm/v1/core/sched/scheduler.py

This is the heart of the design. The GPU prefix cache answers “how many tokens are already in HBM”; the connector answers “how many more are recoverable from a cheaper tier.” Prefill only runs over what neither tier could supply. The return signature carries two subtleties. A None first element means “ask me again later,” which lets a slow backend kick off an asynchronous lookup without blocking the scheduler step (the scheduler moves on and re-polls the request on a later step). The boolean second element distinguishes synchronous loads (the transfer finishes before the next step, so the request is ready immediately) from asynchronous ones (the request waits across several steps for its KV to arrive over a slow link). The asynchronous path is where this machinery starts to look like disaggregation, and it is the seam Chapter 17 widens.
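One way to read that return value is sketched below. The function `plan_prefill` and its action labels are hypothetical and exist only to show the three cases; the real scheduler folds this logic into its own request state machine.

```python
from typing import Optional, Tuple


def plan_prefill(prompt_len: int, gpu_hit_tokens: int,
                 connector_answer: Tuple[Optional[int], bool]) -> dict:
    """Interpret (ext_tokens, load_kv_async) for one waiting request."""
    ext_tokens, load_kv_async = connector_answer
    if ext_tokens is None:
        # Backend is still looking: skip this request and re-poll later.
        return {"action": "retry_later", "prefill_tokens": None}
    # Prefill covers only what neither tier could supply.
    remaining = prompt_len - gpu_hit_tokens - ext_tokens
    if load_kv_async and ext_tokens > 0:
        action = "wait_for_kv"      # KV arrives over several steps
    else:
        action = "schedule_now"     # ready within this step
    return {"action": action, "prefill_tokens": remaining}


print(plan_prefill(1000, 256, (512, False)))
# {'action': 'schedule_now', 'prefill_tokens': 232}
print(plan_prefill(1000, 256, (None, False)))
# {'action': 'retry_later', 'prefill_tokens': None}
```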

The worker-side half is built around the forward pass. Its methods bracket model execution:

# materials/.../vllm/distributed/kv_transfer/kv_connector/v1/base.py
@abstractmethod
def start_load_kv(self, forward_context, **kwargs) -> None:
    """Start loading the KV cache from the connector to vLLM's paged
    KV buffer... before the forward pass to enable async loading..."""
@abstractmethod
def wait_for_save(self):
    """Block until all the save operations is done..."""

Source: vllm/distributed/kv_transfer/kv_connector/v1/base.py

These methods bracket the forward pass, rather than running wholly before or after it, so that transfers can overlap compute. A copy and a compute kernel can proceed at the same time on separate streams, so the connector hides the transfer behind work the GPU was going to do anyway: start_load_kv fires before the layers run so a load can overlap compute, while save_kv_layer and wait_for_save push freshly computed KV out without stalling the GPU. The model runner wraps all of this in a context manager so every execute_model enters and exits the connector lifecycle uniformly:

# materials/.../vllm/v1/worker/kv_connector_model_runner_mixin.py
kv_connector.bind_connector_metadata(scheduler_output.kv_connector_metadata)
# Background KV cache transfers happen here.
kv_connector.start_load_kv(get_forward_context())
try:
    yield output
finally:
    if wait_for_save and not defer_finalize:
        kv_connector.wait_for_save()
    output.finished_sending, output.finished_recving = (
        kv_connector.get_finished(scheduler_output.finished_req_ids)
    )

Source: vllm/v1/worker/kv_connector_model_runner_mixin.py

Note kv_connector_no_forward: the engine runs the connector even on steps with no tokens to compute, because a request blocked on an asynchronous remote load still needs its transfer driven forward. The connector is not a passenger on the forward pass; it has its own work to do whether or not the model runs.
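The bracketing pattern is easy to see in isolation. The sketch below mirrors the shape of the excerpt with a fake connector that only records the order of calls; `RecordingConnector` and `connector_lifecycle` are hypothetical names, and no real transfers or GPU work happen here.

```python
from contextlib import contextmanager


class RecordingConnector:
    """Fake connector that logs lifecycle calls instead of moving bytes."""

    def __init__(self):
        self.events = []

    def bind_connector_metadata(self, metadata):
        self.events.append("bind")

    def start_load_kv(self):
        self.events.append("start_load_kv")

    def wait_for_save(self):
        self.events.append("wait_for_save")

    def get_finished(self, finished_req_ids):
        self.events.append("get_finished")
        return set(), set()


@contextmanager
def connector_lifecycle(connector, metadata, finished_req_ids, wait_for_save=True):
    connector.bind_connector_metadata(metadata)
    connector.start_load_kv()             # loads start before the layers run
    output = {}
    try:
        yield output                      # the forward pass happens here
    finally:
        if wait_for_save:
            connector.wait_for_save()     # saves drain after compute
        output["finished_sending"], output["finished_recving"] = (
            connector.get_finished(finished_req_ids)
        )


conn = RecordingConnector()
with connector_lifecycle(conn, metadata=None, finished_req_ids=set()) as out:
    conn.events.append("forward_pass")
print(conn.events)
# ['bind', 'start_load_kv', 'forward_pass', 'wait_for_save', 'get_finished']
```

Putting the teardown in `finally` means the save and the finished-transfer report still run if the forward pass raises, which keeps scheduler and worker bookkeeping aligned.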

The sequence diagram below traces one engine step from the scheduler’s admission check through the worker’s bracketed forward pass, showing how a load is launched before compute and a store is deferred until after it. The scheduler-side hook and the worker-side lifecycle are two halves of the same connector, talking through the metadata the model runner binds at the top of every step.

sequenceDiagram
    participant S as Scheduler
    participant C as Connector scheduler side
    participant W as Connector worker side
    participant G as GPU forward pass
    S->>C: get_num_new_matched_tokens(request)
    C-->>S: extra tokens recoverable from offload tier
    S->>S: schedule prefill only over the remainder
    S->>W: bind connector metadata for this step
    W->>W: start_load_kv begins DRAM-to-HBM copy
    W->>G: run layers, compute overlaps the load
    G-->>W: fresh KV produced
    W->>W: wait_for_save drains last step's deferred stores
    W->>W: queue this step's new stores for next step
    W-->>S: report finished sending/receiving ids

Moving the bytes: streams, copy engines, and partial blocks

When OffloadingConnector actually transfers a block it delegates to a worker that drives async copies on dedicated CUDA streams, and the comments in the copy handler are where the hardware reality shows through:

# materials/.../vllm/v1/kv_offload/cpu/gpu_worker.py
def _select_swap_blocks_fn(kv_cache_groups_data_refs, gpu_to_cpu):
    """Resolve the swap_blocks function for a handler at init time."""
    # GPU->CPU is bandwidth-bound; the dedicated copy engine beats Triton.
    if gpu_to_cpu:
        return ops.swap_blocks_batch

Source: vllm/v1/kv_offload/cpu/gpu_worker.py

GPU-to-CPU stores route to the dedicated DMA copy engine (a piece of hardware whose only job is moving bytes across the bus) because the transfer is purely bandwidth-bound and the copy engine runs it without occupying SMs, the streaming multiprocessors that do the model’s math. Spending compute units to shuffle bytes would steal them from the next forward pass for no benefit. Small CPU-to-GPU loads, by contrast, can win with a Triton kernel (a small GPU program) that gathers many tiny page copies in one launch, but only when the payloads are small and 8-byte aligned: at that size the per-transfer launch overhead dominates, and batching the copies into one kernel beats issuing many separate DMA descriptors. This is the same memory-vs-compute reasoning as everywhere else in the book, applied to the interconnect: match the transfer to the hardware unit that runs it most cheaply.

Two ordering details matter for correctness. Stores must wait for the model to finish writing the KV they are reading, and loads from pinned host memory can let the driver reorder source reads because nothing on the GPU is concurrently writing them:

# materials/.../vllm/v1/kv_offload/cpu/gpu_worker.py
if self.gpu_to_cpu:
    # wait for model computation to finish before offloading
    stream.wait_stream(current_platform.current_stream())

Source: vllm/v1/kv_offload/cpu/gpu_worker.py

And stores are deliberately deferred to the start of the next engine step, so that offloading bandwidth does not contend with the latency-critical transfers around token sampling:

# materials/.../vllm/distributed/kv_transfer/kv_connector/v1/offloading/worker.py
# NOTE(orozery): defer the store to the beginning of the next
# engine step, so that offloading starts AFTER transfers related
# to token sampling, thereby avoiding delays to token generation.
self._unsubmitted_store_jobs.append((job_id, entry.transfer_spec))

Source: vllm/distributed/kv_transfer/kv_connector/v1/offloading/worker.py

This is the kind of detail that separates a working offload path from a fast one: a naive implementation that flushes stores synchronously inside the step adds DRAM-write latency to every decode, exactly where the ITL budget is thinnest.
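The deferral itself is a small piece of bookkeeping, sketched here with a toy worker. `DeferredStoreWorker`, `queue_store`, and `begin_step` are hypothetical names; only the idea of an unsubmitted-jobs list drained at the start of the next step comes from the excerpt above.

```python
class DeferredStoreWorker:
    """Toy worker that holds stores back until the next engine step begins."""

    def __init__(self):
        self._unsubmitted_store_jobs = []
        self.submitted = []

    def queue_store(self, job_id: int) -> None:
        # Called during a step: record the job but do not start the copy yet.
        self._unsubmitted_store_jobs.append(job_id)

    def begin_step(self) -> None:
        # Called at the top of the next step, after sampling-related
        # transfers from the previous step have already been issued.
        self.submitted.extend(self._unsubmitted_store_jobs)
        self._unsubmitted_store_jobs.clear()


worker = DeferredStoreWorker()
worker.begin_step()          # step 1 starts, nothing queued yet
worker.queue_store(7)        # step 1 produces KV to offload
print(worker.submitted)      # []
worker.begin_step()          # step 2 starts, the deferred store is submitted
print(worker.submitted)      # [7]
```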

There is also a granularity mismatch the worker has to absorb. Offloaded blocks are often coarser than GPU blocks: one offloaded block can hold several GPU blocks, a ratio $r = \text{block\_size}_{\text{offload}} / \text{block\_size}_{\text{GPU}}$ set by block_size in the connector config, so the slow tier stores fewer, larger pages. That keeps its index small and its transfers efficient, but it means a load no longer lines up cleanly with the GPU’s finer blocks. When a request’s prefix hit does not start exactly on one of those coarse boundaries, say the request shares the second half of a coarse block but not the first, the leading offloaded block is only partially relevant. The worker therefore carries block_indices to know how many sub-blocks to skip at the front of that first block so it copies only the bytes the request actually needs. The GPULoadStoreSpec docstring spells out precisely why this bookkeeping exists, and it is the price of making the offload tier’s page size independent of the GPU’s 16-token blocks.
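The skip count is plain integer arithmetic. The sketch below assumes a 64-token offloaded block over 16-token GPU blocks (so four sub-blocks per offloaded block) and a load that starts on a GPU block boundary; `leading_sub_blocks_to_skip` is a hypothetical helper, not the worker's actual code.

```python
def leading_sub_blocks_to_skip(load_start_token: int, gpu_block_size: int = 16,
                               offload_block_size: int = 64) -> int:
    """How many GPU-sized sub-blocks to skip in the first offloaded block."""
    if offload_block_size % gpu_block_size != 0:
        raise ValueError("offload block must be a whole number of GPU blocks")
    ratio = offload_block_size // gpu_block_size
    first_gpu_block = load_start_token // gpu_block_size
    return first_gpu_block % ratio


print(leading_sub_blocks_to_skip(32))   # 2: load starts halfway into a coarse block
print(leading_sub_blocks_to_skip(64))   # 0: load starts on a coarse boundary
print(leading_sub_blocks_to_skip(112))  # 3: only the last sub-block is needed
```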

Eviction, races, and the unglamorous correctness work

The genuinely hard part of offloading is not the happy path; it is keeping the two allocators consistent while requests preempt, abort, and reuse blocks underneath in-flight transfers. The danger is a use-after-free in slow motion: a store is still reading a GPU block to copy it down to DRAM when the scheduler, seeing that block as free, hands it to a new request that overwrites it; the store then captures garbage. The fix is a fence: code that forces the in-flight transfer to finish (a flush) before the block can be reused. The scheduler-side connector is full of these. A store that is still draining to DRAM holds claims on GPU blocks the scheduler might want to reallocate, so the scheduler tracks which pending jobs touch which block ID and forces a flush before any tracked block is handed to a new request:

# materials/.../vllm/distributed/kv_transfer/kv_connector/v1/offloading/scheduler.py
# Flush jobs that contain re-allocated blocks.
if (
    self._block_id_to_pending_jobs
    and not self._block_id_to_pending_jobs.keys().isdisjoint(
        self._current_batch_allocated_block_ids
    )
):
    self._current_batch_jobs_to_flush.update(...)

Source: vllm/distributed/kv_transfer/kv_connector/v1/offloading/scheduler.py

Preempted requests get their pending stores flushed too, because preemption (the load-shedding valve from Chapter 6) frees blocks that an outstanding store may still be reading. Sliding-window models add another wrinkle: their blocks can be dropped before the request even finishes, so those blocks are watched from the moment the store is created rather than at request completion. None of this is exotic distributed-systems theory; it is the careful reference-counting that any cache with asynchronous write-back and a shared backing store has to get exactly right, and it is worth reading the scheduler in full to appreciate how much of the file is fences rather than transfers.
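The block-to-job tracking behind these fences can be sketched in a few lines. `StoreFenceTracker` and its method names are hypothetical; the sketch keeps only the core idea from the excerpt, a map from GPU block ID to the pending store jobs that still read it.

```python
from collections import defaultdict
from typing import Iterable, Set


class StoreFenceTracker:
    """Tracks which pending store jobs still read which GPU blocks."""

    def __init__(self):
        self._block_id_to_pending_jobs = defaultdict(set)

    def track_store(self, job_id: int, block_ids: Iterable[int]) -> None:
        for block_id in block_ids:
            self._block_id_to_pending_jobs[block_id].add(job_id)

    def complete(self, job_id: int) -> None:
        # Drop the finished job and forget blocks nobody is reading anymore.
        for block_id in list(self._block_id_to_pending_jobs):
            jobs = self._block_id_to_pending_jobs[block_id]
            jobs.discard(job_id)
            if not jobs:
                del self._block_id_to_pending_jobs[block_id]

    def jobs_to_flush(self, newly_allocated: Iterable[int]) -> Set[int]:
        # Any pending store touching a re-allocated block must finish first.
        to_flush: Set[int] = set()
        for block_id in newly_allocated:
            to_flush |= self._block_id_to_pending_jobs.get(block_id, set())
        return to_flush


tracker = StoreFenceTracker()
tracker.track_store(job_id=1, block_ids=[10, 11])
tracker.track_store(job_id=2, block_ids=[12])
print(tracker.jobs_to_flush([11, 99]))  # {1}
tracker.complete(1)
print(tracker.jobs_to_flush([11]))      # set()
```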

Tiers below CPU, and pluggable external stores

CPU DRAM is the first tier, not the last. The TieringOffloadingSpec makes CPU a primary tier with configurable secondaries:

# materials/.../vllm/v1/kv_offload/tiering/spec.py
"secondary_tiers": [
    {"type": "example", "custom_param": 67}
]

Source: vllm/v1/kv_offload/tiering/spec.py

A FileMapper hashes blocks to filenames for an NVMe tier (the block hash becomes the file path, so the same content always maps to the same file); object-storage and networked tiers slot in behind the same OffloadingManager interface. The contract never changes: lookup, prepare_load, prepare_store. The tier just answers more slowly and holds more. This is the payoff of having defined the offload tier as an allocator with those three verbs: a new tier only has to implement the verbs, and the scheduler and worker above it are none the wiser.
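A content-addressed file layout of this kind might look like the sketch below. The function name, the `.kv` suffix, and the two-character fan-out directory are hypothetical choices for illustration; the real FileMapper may lay files out differently.

```python
from pathlib import PurePosixPath


def block_file_path(root: str, offload_key: bytes,
                    fanout_chars: int = 2) -> PurePosixPath:
    """Map a block key to a deterministic path under root."""
    hex_key = offload_key.hex()
    # A short prefix directory keeps any single directory from growing huge.
    return PurePosixPath(root) / hex_key[:fanout_chars] / f"{hex_key}.kv"


print(block_file_path("/mnt/nvme/kv", bytes.fromhex("ab12")))
# /mnt/nvme/kv/ab/ab12.kv
```

Because the path is a pure function of the key, a lookup on this tier reduces to a file-existence check, and no separate index has to survive a restart.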

Above the in-tree offloading machinery sits a second integration path: external KV stores that bring their own caching brain. The connector registry makes these first-class. LMCacheConnectorV1 wraps the LMCache engine, delegating every connector method straight through, and even advertises when it needs piecewise CUDA graphs because its layerwise loads cannot be captured (a callback to Chapter 10’s capture-vs-eager split):

# materials/.../vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py
@classmethod
def requires_piecewise_for_cudagraph(cls, extra_config) -> bool:
    return extra_config.get("use_layerwise", False)

Source: vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py

The Mooncake store is another such backend living in the same connector tree (mooncake/store/). What these external stores buy you is a KV cache that is shared across replicas rather than private to one process, which is precisely the bridge from this chapter’s single-replica offloading to the fleet-level cache-aware routing of Chapter 18, and to disaggregation, where producer and consumer are different machines entirely.

What is still unsolved

Offloading does not make the recompute-versus-load decision for you; it gives you the mechanism and leaves the policy underspecified. The store threshold and prompt-only defaults are blunt heuristics, and the genuinely right answer depends on prefix length distribution, reuse frequency, and live PCIe contention, none of which the engine measures end to end. Compression is the other open frontier: KV blocks can in principle be quantized or entropy-coded before they hit the slow tier, trading a little decode-time fidelity for a lot more effective cache and cheaper transfers, but in-tree vLLM offloads raw bytes and the integration surface for compressed KV is still maturing. And every tier you add lengthens the tail: a prefix that lives on NVMe behind a busy queue can be slower to fetch than to recompute, which means the lookup needs to be load-aware, not just presence-aware. The API has a None-means-retry escape hatch for slow backends, but choosing not to wait is still a manual call.

The connector you just met is intentionally more general than offloading needs. Its asynchronous-load path, its scheduler hook that admits external tokens, its worker-side load/save lifecycle: all of it works just as well when the “external store” is another GPU’s HBM reached over a network instead of this host’s DRAM reached over PCIe. That is the whole idea of the next chapter. Prefill and decode have opposite hardware appetites, so we will run them on separate GPU pools and ship the KV cache between them, over this same connector API, and the only thing that really changes is that the wire gets longer.
