Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Request routing and cache-aware load balancing

We are out of the engine now. For the whole of Parts II through IV the unit of attention was one vLLM process, or one model spread across a few GPUs that still presented itself as a single scheduler. This part of the book steps up a level: there are many replicas behind a virtual endpoint, requests arrive at some front door, and something has to decide which replica each request goes to. That something is the router, and the thesis of this chapter is that the router you would build out of habit, the one you have built a dozen times for stateless services, is wrong here for a specific and interesting reason.

Your instinct, honed on the kind of traffic infrastructure this book assumes you already know, is least-connections or least-requests-in-flight, maybe power-of-two-choices to avoid the herd. That instinct is built on an assumption: that replicas are interchangeable, so the only thing that distinguishes them at routing time is how busy they are. Send the request to the least-loaded box because any box can serve it equally well. For a stateless RPC that is exactly right.

It is wrong for LLM inference because replicas are not interchangeable, and the thing that distinguishes them is invisible to a connection counter. Chapter 7 showed that a replica that has recently served your system prompt is holding its KV blocks in the idle pool, content-addressed and ready to reuse. Route there and your 2,000-token prefill collapses to a cache lookup. Route to an identically-loaded replica that has never seen the prompt and you pay the full prefill on the critical path. Two replicas with the same connection count, the same queue depth, the same GPU utilization can differ by an order of magnitude in the TTFT they will give this particular request, and the difference is entirely about what is in their caches. Least-connections cannot see it. The right replica is the one that already holds your prefix and is not about to preempt to make room, and finding it means the router has to consume signals from inside the engine that a stateless load balancer never needed.

The diagram below contrasts the two worlds. On the left, the stateless router treats every replica as a black box and reads only one number, in-flight connections, so it sends the request to whichever box looks least busy. On the right, the cache-aware router peers inside each replica: it knows which prefixes are cached and how close each replica is to its memory wall, so it can route the request to the replica that already holds the prompt even when that replica is not the emptiest.

flowchart LR
    R["incoming request: prompt = SYS + user"]

    subgraph Stateless["stateless router (least-connections)"]
        direction TB
        SR["reads: open connections only"]
        SA["replica A: 3 conns, holds SYS prefix"]
        SB["replica B: 1 conn, cold cache"]
        SR -->|"fewer conns -> pick B"| SB
    end

    subgraph CacheAware["cache-aware router"]
        direction TB
        CR["reads: cache map + load gauges"]
        CA["replica A: 3 conns, holds SYS prefix"]
        CB["replica B: 1 conn, cold cache"]
        CR -->|"A holds prefix -> pick A"| CA
    end

    R --> SR
    R --> CR
    SB -.->|"full prefill on critical path"| Slow["slow TTFT"]
    CA -.->|"prefill collapses to cache hit"| Fast["fast TTFT"]

The rest of this chapter is the story of how the router on the right earns that knowledge: which numbers the engine exports cheaply, how the engine streams its cache contents so the router can rebuild a picture of where every prefix lives, and how a routing policy weighs locality against load when both signals are stale.

What the engine is willing to tell you

Start with the cheap signals, the ones every replica already exports. Chapter 2 introduced the Prometheus series; here we read them as a router, not an operator. Two gauges carry most of the load information you want, and PrometheusStatLogger defines them plainly:

# vllm/v1/metrics/loggers.py
gauge_scheduler_waiting = self._gauge_cls(
    name="vllm:num_requests_waiting",
    documentation="Number of requests waiting to be processed.",
    multiprocess_mode="mostrecent",
    labelnames=labelnames,
)

Source: vllm/v1/metrics/loggers.py

# vllm/v1/metrics/loggers.py
gauge_kv_cache_usage = self._gauge_cls(
    name="vllm:kv_cache_usage_perc",
    documentation="KV-cache usage. 1 means 100 percent usage.",
    multiprocess_mode="mostrecent",
    labelnames=labelnames,
)

Source: vllm/v1/metrics/loggers.py

These two are the load-balancing primitives for inference. num_requests_waiting is queue depth, the thing your least-connections instinct already reaches for, except it is the engine’s real queue, the WAITING set from Chapter 5, not a count of open sockets. kv_cache_usage_perc is the one with no analog in stateless serving: it tells you how close a replica is to the memory wall, and a replica near 1.0 is a replica about to preempt (Chapter 6). Routing a fresh request to a replica at 0.98 KV usage does not just queue it; it can trigger the eviction of cached prefixes other requests were counting on, a second-order cost a connection counter is structurally blind to. A good router treats high KV usage as a strong negative signal even when the queue looks empty, because the queue being empty and the cache being full is precisely the state that precedes a preemption storm.

Scraping Prometheus on an interval is fine for slow-moving decisions, but it has a freshness problem: if the router scrapes every two seconds, its picture of a replica can be two seconds stale, which is many requests of drift on a busy fleet. A router making a choice per request wants the load attached to the response it already has in hand, so that each completed request doubles as a fresh measurement of the replica that served it. vLLM supports exactly this through an ORCA-style load header. The mechanism is a small contract between client and server: the client adds a request header that names which load-report format it wants back, and the server reads that header on the way in.

# vllm/entrypoints/openai/chat_completion/api_router.py
metrics_header_format = raw_request.headers.get(
    ENDPOINT_LOAD_METRICS_FORMAT_HEADER_LABEL, ""
)

Source: vllm/entrypoints/openai/chat_completion/api_router.py

On the way out, the server attaches a response header built from the same live gauges, translating each Prometheus series name into the ORCA field name the client expects:

# vllm/entrypoints/serve/utils/orca_metrics.py
prometheus_to_orca_metrics = {
    "vllm:kv_cache_usage_perc": "kv_cache_usage_perc",
    "vllm:num_requests_waiting": "num_requests_waiting",
}

Source: vllm/entrypoints/serve/utils/orca_metrics.py

This is the Open Request Cost Aggregation convention from the service-mesh world, repurposed for inference. ORCA was designed so that a backend could report its own load in-band, riding along on the responses it was already sending, instead of forcing the load balancer to poll it out of band. That is precisely the freshness fix we want here: every response carries the queue depth and KV usage of the replica that served it, measured at the instant of that response. A router that records the header from each completed request therefore holds load samples that are at most one request stale per replica, and it can run the equivalent of weighted-least-request against them with no separate scrape path at all. It is the same idea as the in-band server_load_metrics counter that load_aware_call maintains for simple in-flight tracking, but carrying the engine-internal numbers, queue depth and memory pressure, that actually matter for inference rather than a plain connection count.

The diagram below traces one request through this loop. Note that the load report the router learns from describes the previous state of the replica, the state just after it finished the prior request; this is the structural reason every signal in this chapter is slightly behind reality.

sequenceDiagram
    participant C as Client / router
    participant V as vLLM replica
    C->>V: request + header (report load as ORCA)
    Note over V: serve request, sample live gauges
    V->>C: response + header (kv_cache_usage_perc, num_requests_waiting)
    Note over C: record this replica's load for the next decision
    C->>V: next request, now routed with fresher load info

The signal least-connections can’t fake: who holds the prefix

Queue depth and KV usage make the router smarter about load. They say nothing about cache content, which is the whole reason replicas stopped being interchangeable. For that you need a different kind of signal, and Chapter 7 already showed you where it comes from: the block-lifecycle events the engine emits as its cache changes.

# vllm/distributed/kv_events.py
class BlockStored(KVCacheEvent):
    block_hashes: list[ExternalBlockHash]
    parent_block_hash: ExternalBlockHash | None
    token_ids: list[int]
    block_size: int

Source: vllm/distributed/kv_events.py

# vllm/distributed/kv_events.py
class BlockRemoved(KVCacheEvent):
    block_hashes: list[ExternalBlockHash]
    medium: str | None
    group_idx: int | None = None

Source: vllm/distributed/kv_events.py

Look at what BlockStored carries: a block’s hash and its parent_block_hash. We saw in Chapter 7 that these hashes are parent-chained, meaning each block’s hash is computed from its own tokens and the hash of the block before it, so a single hash names not just one block but the entire token prefix leading up to it. That chaining is what makes the stream useful. A sequence of (hash, parent_hash) pairs is exactly the edge list of a tree: each pair says “this block hangs off that parent,” and following the parent links from any node back to the root spells out a cached prefix token by token. The engine is, in effect, serializing its prefix tree onto the wire one edge at a time as it caches new blocks.

A router can reassemble that tree on the other side. Subscribe to one replica’s event stream, apply each BlockStored as a new edge and each BlockRemoved as a pruned one, and you hold an external, approximate copy of which prefixes live in that replica’s cache, rebuilt without ever reading a byte of its GPU memory. Subscribe to every replica’s stream and keep one such tree per replica, and you have a global map of where each prefix is cached. BlockRemoved prunes the tree as the engine evicts blocks; AllBlocksCleared resets a replica’s tree to empty. The router’s job at request time then reduces to a longest-prefix-match query against this map: hash the incoming prompt with the same scheme the engine uses, walk that hash chain against each per-replica tree, and route to the replica whose tree matches the deepest, because that is the replica that will reuse the most of your prompt and prefill the least.

The diagram below traces a single block-store event from the engine into the router’s reconstructed tree, then shows a request resolving against it.

flowchart TD
    subgraph Engine["vLLM replica: BlockPool"]
        E1["caches block for prompt prefix"]
        E2["emit BlockStored: hash=H3, parent=H2, tokens=..."]
        E1 --> E2
    end

    E2 -->|"PUB socket, async"| Pub["ZmqEventPublisher"]
    Pub --> Router

    subgraph Router["router: per-replica prefix tree"]
        T0["root"]
        T1["H1: SYS tokens"]
        T2["H2: SYS + tools"]
        T3["H3: SYS + tools + user"]
        T0 --> T1 --> T2 --> T3
    end

    Q["new request: hash prompt -> H1, H2, H3"] -->|"longest-prefix match"| Router
    Router -->|"deepest match here -> route to this replica"| Decision["chosen replica"]

The honest word in all of this is approximate, and it is worth being precise about why. The event stream is asynchronous and lossy by nature. ZmqEventPublisher ships events over a PUB socket, which is fire-and-forget: the publisher does not wait for subscribers to acknowledge, it only keeps a bounded replay buffer to help a reconnecting subscriber catch up. A subscriber that falls behind, or that reconnects after the buffer has rolled over, simply misses events, and its tree drifts away from the engine’s real cache. Even with no losses at all, there is an unavoidable lag: the router’s tree is always a few milliseconds behind the engine, so between the moment a block is actually evicted and the moment the matching BlockRemoved arrives, the router still believes the prefix is cached and can route a request to a replica that has already thrown it away.

There is a subtler trap too, which is double-counting. A replica is often not one process but several, and if every part emitted its own events the router’s tree would see the same block stored two, four, or eight times and could badly misjudge what is cached. vLLM handles this in two ways depending on where the events come from. For the core prefix cache the problem mostly does not arise: a tensor-parallel replica still has a single scheduler driving all its GPU ranks, and the BlockPool that emits these events lives in that one scheduler, so the router sees a single already-deduplicated stream per replica rather than one copy per rank. Where events genuinely do originate from many workers, as with the KV-transfer connectors of Chapters 16 and 17 that run independently per worker, vLLM deduplicates explicitly with KVEventAggregator. It counts how many workers have reported each event and emits an event only once it has been seen from all of them, so a block counts as truly cached on the replica only when every worker agrees it is:

# vllm/distributed/kv_events.py
def get_common_events(self) -> list[KVCacheEvent]:
    return [
        event
        for event, count in self._event_counter.items()
        if count == self._num_workers
    ]

Source: vllm/distributed/kv_events.py

Either way, the cache map is a hint, never a guarantee. A router built on it must degrade gracefully when the hint is wrong: a prefix-match miss should cost you a full prefill, not a failed request, and the routing policy has to blend the cache signal with the load signals rather than obey it blindly. Sending every request that shares a popular system prompt to the one replica that cached it first is how you turn a cache hit into a hotspot. This is the central tension of cache-aware routing, and it is worth saying plainly: you are balancing the locality benefit of reuse against the load benefit of spreading, with stale information about both.

The policy: locality versus balance

The research that named this tradeoff sharply is worth reading as the conceptual backbone of any router you build.

SGLang: Efficient Execution of Structured LM Programs (arXiv:2312.07104) is the RadixAttention paper from Chapter 7, but its serving side describes cache-aware routing across workers: keep an approximate radix tree of what each worker has cached and route to maximize prefix reuse. Read it for the framing that a router can hold a coarse model of every replica’s cache and act on it.

Preble: Efficient Distributed Prompt Scheduling for LLM Serving (arXiv:2407.00023) attacks exactly the hotspot problem above. It schedules prompts across replicas to balance prefix-cache locality against load, explicitly trading some reuse for spread when a popular prefix would otherwise overload one replica. Read it for the cost model that decides when locality is worth a load imbalance and when it is not.

The shape both arrive at is the same one your queueing intuition would predict once you accept the cache as state: routing becomes an optimization over a cost that has at least two terms. Write the depth of the matched prefix on replica $r$ as $d_r$, its queue depth as $w_r$, and its KV-cache utilization as $u_r$. The first term is a benefit, the prefill compute you save by hitting a warm prefix, which grows with $d_r$. The second term is a penalty, the queueing delay and preemption risk you incur by piling onto a replica that is already busy or near its memory wall, which grows with $w_r$ and $u_r$. A router scores each candidate replica by combining the two,

$$\text{score}(r) = \alpha \cdot d_r - \beta \cdot w_r - \gamma \cdot u_r,$$

where $\alpha$, $\beta$, and $\gamma$ weight locality against the two load penalties, deepest match pulling toward one replica, lightest load pulling toward another, and picks the replica that maximizes the combined score, $\arg\max_r \text{score}(r)$. When prefixes are long and reuse is high, the saved-prefill term is large and locality dominates, so you should chase the deepest match. When the cache map is stale or the best-matching replica is saturated, the load penalty swamps the locality benefit and balance dominates, so you fall back to least-loaded. The interesting routers are the ones that estimate both terms online from the very signals above, the BlockStored tree for the locality term, num_requests_waiting and kv_cache_usage_perc for the load term, and recompute the tradeoff per request.

The curve below makes the crossover concrete. It scores two candidates as the warm replica fills up: a replica that already holds a deep prefix match (high $d_r$) but whose KV utilization $u_r$ is climbing, against a cold replica that holds nothing ($d_r = 0$) but sits idle. While the warm replica has headroom its locality benefit keeps its score on top, so the router chases the prefix; but as $u_r$ rises the $-\gamma \cdot u_r$ penalty drags its score down until it falls below the cold, idle replica, and at that point the router spills to balance instead. The crossing point is exactly the utilization at which preemption risk outweighs the prefill you would have saved.

Illustrative weights ($\alpha=1,\beta=1,\gamma=10$; warm replica $d_r=8, w_r=2$; cold replica $d_r=0, w_r=0, u_r=0.1$): the locality-versus-load crossover is the durable point, not the exact numbers, which depend on how the router calibrates $\alpha$, $\beta$, $\gamma$ against measured prefill savings and queueing cost.

The flowchart below sketches that per-request decision. It is deliberately a soft decision, not a hard switch: a real router blends the two terms into a single score rather than taking one branch or the other, but the branches make the dominant force at each extreme legible.

flowchart TD
    Start["request arrives"] --> Hash["hash prompt, query per-replica prefix trees"]
    Hash --> Match{"any deep prefix match?"}
    Match -->|"no match"| LeastLoad["route by load only: lowest num_requests_waiting and kv_cache_usage_perc"]
    Match -->|"deep match found"| Busy{"is the matched replica near its limits? high queue or KV usage > threshold"}
    Busy -->|"replica has headroom"| Local["route to matched replica: reuse the warm prefix"]
    Busy -->|"replica saturated"| Spread["spill: route to a less-loaded replica, accept a colder cache"]
    LeastLoad --> Done["chosen replica"]
    Local --> Done
    Spread --> Done

P/D changes the question

Chapter 17 split prefill and decode onto separate pools. That fractures the routing problem into two, because the two pools care about different things. A prefill replica is compute-bound and short-lived per request; what matters there is cache locality (does it already hold the prefix?) and compute headroom. A decode replica is memory-bound and holds the request for its entire generation; what matters there is KV headroom and how many sequences it is already decoding. The same kv_cache_usage_perc gauge means something different on each: on a prefill node it is transient pressure, on a decode node it is the binding constraint for the request’s whole lifetime.

So a P/D-aware router is really two routers chained by a handoff. First it picks a prefill replica by prefix locality, the deepest-match logic from the section above, since that is where reuse pays off. It lets that replica compute the KV for the prompt. Then it picks a decode replica that has room to receive that KV and hold the request for the whole generation, scored by KV headroom rather than locality. Finally the KV cache itself is shipped from the prefill replica to the decode replica over the connector API from Chapters 16 and 17, and decode proceeds there. The router has to know each replica’s role, and crucially it has to make the two choices jointly: routing for prefill locality is pointless if the only available decode target has no KV headroom, because then the carefully-reused prefix just sits in the prefill replica’s hands, stalled, waiting for a decode slot to open. The locality win on the prefill side evaporates if the decode side cannot accept the result. This is the routing layer’s version of the same disaggregation bargain from Chapter 17, now expressed as a placement problem across two pools instead of a scheduling problem inside one engine.

The diagram below traces a request through the handoff and shows where each routing decision sits.

sequenceDiagram
    participant Cl as Client
    participant Ro as P/D router
    participant Pf as Prefill replica (by locality)
    participant De as Decode replica (by KV headroom)
    Cl->>Ro: request (prompt)
    Note over Ro: decision 1: pick prefill replica with deepest prefix match
    Ro->>Pf: route for prefill
    Note over Pf: compute full KV for the prompt
    Note over Ro: decision 2: pick decode replica with room
    Pf->>De: transfer KV over connector API
    De->>Cl: stream decoded tokens

Who actually consumes these signals

vLLM emits; it does not route. The routing lives in a small ecosystem of external components, and it helps to see them as consumers of exactly the signals above. The gateway-api-inference-extension, the Kubernetes-native effort, runs an endpoint picker that scrapes num_requests_waiting and kv_cache_usage_perc and increasingly consumes the KV-event stream for prefix awareness; it is the path most production fleets will reach for because it slots into existing ingress. sglang-router is the reference implementation of the radix-tree approach from the SGLang paper, maintaining the approximate per-worker cache trees and doing longest-prefix-match routing directly. Dynamo, NVIDIA’s serving framework, builds a global KV-aware router on the same block-event idea, layered with its own KV manager. llm-d assembles the Kubernetes pieces, the inference gateway plus vLLM’s events and metrics, into an opinionated cache-aware deployment. They differ in substrate and ambition, but they all reduce to the same loop: subscribe to block events to know where prefixes live, scrape or read-from-header the load gauges to know who is busy, and route each request to maximize reuse without creating a hotspot. The engine’s contribution is to make all of that observable; it was built to emit its cache state and its load, not to decide.

What is still unsolved

Be honest about the rough edges, because this is the youngest layer in the book. The cache map is fundamentally stale and lossy, and nobody has a clean answer for how aggressively to trust it; the safe default of blending it with load signals leaves real reuse on the table when the map happens to be accurate. Hotspot avoidance is a live tradeoff with no universal cost model, which is exactly why Preble is a paper and not a config flag. The signals themselves are coarse: a single kv_cache_usage_perc scalar collapses a whole prefix tree into one number, so the router learns that a replica is full but not which prefixes it would evict to admit you, which is the thing you actually need to predict your second-order cost. Multi-tenancy adds a routing dimension this chapter has only gestured at: a router that ignores LoRA adapters will scatter requests for one adapter across replicas that each then pay to load it, which is the problem Chapter 19 takes up next, where the vllm:lora_requests_info gauge becomes another routing signal and the per-adapter prefix-cache fork from Chapter 7 means the cache map itself has to be adapter-aware. And the whole edifice assumes prefixes repeat, an assumption the engine cannot enforce and the router can only exploit when the traffic above it cooperates.

The throughline of this chapter is the one to carry forward: state at the replica level turns load balancing from a stateless connection-counting problem into a placement problem over engine-internal signals, and the signals were there waiting because earlier chapters built the engine to expose them. Next we keep climbing the fleet: serving hundreds of fine-tuned variants from one base model, where the adapter becomes both a batching problem inside the engine and, as we just saw, another thing the router upstream has to route on.

Further reading

  • SGLang: Efficient Execution of Structured LM Programs — arXiv:2312.07104 — cache-aware routing across workers via approximate per-worker radix trees, the conceptual basis for prefix-locality routing.
  • Preble: Efficient Distributed Prompt Scheduling for LLM Serving — arXiv:2407.00023 — a cost model for trading prefix-cache locality against load when a popular prefix would overload one replica.