Multi-LoRA serving

Most of this book has assumed one model per replica. The token-budget scheduler from Chapter 5 batches requests against a single set of weights; the prefix cache from Chapter 7 reuses KV blocks across requests that share context; the tensor-parallel sharding from Chapter 15 splits one weight matrix across GPUs. None of that asked what happens when the requests in a batch want different weights.

In practice they often do. A platform that fine-tunes a base model for hundreds of customers, or hundreds of tasks, ends up with hundreds of variants that each differ from the base by a tiny amount. Low-rank adaptation (LoRA) is what makes this cheap: instead of a full fine-tune, you learn a pair of small matrices $A$ and $B$ per target layer such that the effective weight is $W + \frac{\alpha}{r} \cdot B A$, where $r$, the rank, is typically 8 to 64. To unpack that formula: $W$ is the original weight matrix, say of shape $d_{\text{out}} \times d_{\text{in}}$; $A$ is $r \times d_{\text{in}}$ and $B$ is $d_{\text{out}} \times r$, so their product $BA$ has the same shape as $W$ but is forced to be low-rank, because it factors through the narrow $r$-dimensional waist. The scalar $\frac{\alpha}{r}$ rescales the delta so that changing the rank does not change its typical magnitude. The base $W$ is frozen and shared; the two thin matrices that make up the adapter are a few megabytes. The modeling question is settled. The serving question is not, and it is the one this chapter is about: how do you batch a request for adapter 7 next to a request for adapter 113 next to a request for the base model, run them through the same fused kernels, and still apply each request’s own low-rank delta?
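To put a number on "thin", the sketch below counts the parameters in one adapter pair against the base matrix it modifies. The function names and the 4096-wide square projection are illustrative assumptions, not figures from any particular model.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in one adapter pair: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in the frozen base matrix W, which is d_out x d_in."""
    return d_out * d_in


# Hypothetical square projection, 4096 wide, adapted at rank 16.
d_out, d_in, r = 4096, 4096, 16
adapter = lora_param_count(d_out, d_in, r)
base = full_param_count(d_out, d_in)

print(adapter)          # 131072
print(base)             # 16777216
print(base // adapter)  # 128
```

At these assumed dimensions the adapter for one matrix is 128 times smaller than the matrix itself, which is why sharing $W$ and swapping only $A$ and $B$ is the economical arrangement.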

The naive answer is to refuse the problem: give each adapter its own replica. That throws away the entire premise. The base weights dominate memory, and a replica serving one rarely-used adapter wastes a whole GPU’s worth of frozen weights to host a few megabytes of delta. The whole point is statistical multiplexing across adapters on shared base weights, which means the batch will be heterogeneous, and the engine has to make a heterogeneous batch run at homogeneous-batch efficiency. As the title promises, this is a memory-and-batching problem, not a modeling one.

The wrong way and the Punica way

Start with what you must avoid. You could merge each adapter into the base weights ($W' = W + \frac{\alpha}{r} \cdot B A$) and run a clean single-model forward pass. That is fast per request and catastrophic across requests: merging is per-adapter, so a batch with $k$ distinct adapters needs $k$ distinct weight matrices, and you are back to one-model-per-batch with the added cost of merging on every switch. Conversely, you could keep adapters separate and loop over the batch, running a small GEMM per request. Correct, but a decode step is already memory-bound (Chapter 3), and launching a tiny kernel per sequence drowns the GPU in launch overhead, exactly the per-step cost Chapter 10 worked to kill.

The way out is a batched LoRA kernel that handles many adapters in one launch by carrying a per-token index telling it which adapter’s weights to use. The trick is to stop thinking of the batch as a set of requests, each with its own weights, and start thinking of it as a flat list of token rows, each carrying a small integer that names its adapter. The kernel then does one big matrix multiply over all the rows at once, but for each row it reads its index and gathers the matching adapter’s $A$ and $B$ from a shared pool. Many adapters, one launch, no per-request loop. This is the Punica contribution, and vLLM’s implementation cites it directly:

# vllm/lora/punica_wrapper/punica_base.py
"""
Based on:
Chen, L., Ye, Z., Wu, Y., Zhuo, D., Ceze, L., & Krishnamurthy, A. (2023).
Punica: Multi-Tenant LoRA Serving.
https://arxiv.org/abs/2310.18547
"""

Source: vllm/lora/punica_wrapper/punica_base.py

The base-layer math, applied to the output of a wrapped linear layer, is two GEMMs — a shrink down to rank $r$ and an expand back to the output dimension:

# vllm/lora/punica_wrapper/punica_gpu.py
self.add_shrink(buffer, x, lora_a_stacked, scale, **kwargs)
self.add_expand(y, buffer, lora_b_stacked, output_slices,
                add_inputs=add_inputs, **kwargs)

Source: vllm/lora/punica_wrapper/punica_gpu.py

Read those two calls as the two halves of the delta $\frac{\alpha}{r} \cdot B A$ applied to the input x. With each token’s hidden vector stored as a row of x, the shrink computes $x A^\top$ and lands in buffer, the rank-$r$ intermediate: it projects each token’s hidden vector down from the model width $d_{\text{in}}$ to the narrow rank $r$. It is computed at float32 for numerical headroom. The expand then computes $\text{buffer} \cdot B^\top$ and accumulates it into the base output y, projecting back up from $r$ to the output width $d_{\text{out}}$. Splitting the delta into shrink-then-expand is what keeps it cheap: instead of one $d_{\text{out}} \times d_{\text{in}}$ multiply per token you do two skinny multiplies through the $r$-wide waist, which is the whole reason low rank is affordable. Note the sequencing: the base linear has already produced y for every token; LoRA is a residual added on top, never a replacement, which is exactly why a base-model request and an adapted request can ride in the same batch.

What makes it multi-tenant is that lora_a_stacked and lora_b_stacked are not single matrices. They are stacks indexed by an adapter slot (a small integer naming one shelf in the GPU’s adapter pool), and the kernel reads a per-token mapping to pick the right slice for each row of x. The diagram below traces one batched launch: four token rows belonging to three different adapters and the base model, all flowing through a single shrink/expand pair that selects each row’s A/B by its slot index.

flowchart TD
    subgraph BATCH["input rows x (one per token)"]
        T0["row 0  slot 0"]
        T1["row 1  slot 0"]
        T2["row 2  slot 2"]
        T3["row 3  base, index -1"]
    end
    subgraph POOL["stacked adapter buffers (indexed by slot)"]
        S0["slot 0: A0 / B0"]
        S1["slot 1: A1 / B1"]
        S2["slot 2: A2 / B2"]
    end
    BATCH --> SHRINK["shrink: buffer = x @ A[slot]  (down to rank r)"]
    POOL -->|"A[slot]"| SHRINK
    SHRINK --> EXPAND["expand: y += buffer @ B[slot]  (back to d_out)"]
    POOL -->|"B[slot]"| EXPAND
    BASE["base linear y = x @ W (runs for all rows)"] --> EXPAND
    EXPAND --> OUT["output y (base + per-row LoRA delta; index -1 skipped)"]

One launch, heterogeneous adapters, no merging, no per-request loop.
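The semantics of that launch can be pinned down with a plain-Python reference. This is a sketch of what the kernel computes, not how it computes it: the explicit loop here is exactly the per-row iteration the real kernel replaces with one launch. The function names and the tiny integer matrices are hypothetical, and the $\frac{\alpha}{r}$ scale is assumed to be already folded into each $B$.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def apply_lora_reference(x_rows, y_rows, slot_of_row, a_stacked, b_stacked):
    """Add each row's LoRA delta to the base output, skipping rows with slot -1."""
    for i, slot in enumerate(slot_of_row):
        if slot < 0:
            continue  # base-model row: leave the base output untouched
        buffer = matvec(a_stacked[slot], x_rows[i])  # shrink: d_in -> r
        delta = matvec(b_stacked[slot], buffer)      # expand: r -> d_out
        y_rows[i] = [y + d for y, d in zip(y_rows[i], delta)]
    return y_rows


# Two slots, rank 1, d_in = d_out = 2. A[slot] is r x d_in, B[slot] is d_out x r.
a_stacked = [[[1, 1]], [[1, 0]]]
b_stacked = [[[1], [2]], [[10], [0]]]

x_rows = [[1, 2], [1, 2], [1, 2]]           # identical inputs
y_rows = [[100, 100] for _ in range(3)]     # pretend base-linear output
slot_of_row = [0, 1, -1]                    # slot 0, slot 1, base model

print(apply_lora_reference(x_rows, y_rows, slot_of_row, a_stacked, b_stacked))
# [[103, 106], [110, 100], [100, 100]]
```

Three identical input rows produce three different outputs, and the only thing that distinguishes them is the integer in `slot_of_row`.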

Slots, not adapters

The kernel indexes by slot, and slots are a fixed, small resource — this is the memory side of the problem. LoRAConfig names the two numbers that govern everything downstream:

# vllm/config/lora.py
max_lora_rank: MaxLoRARanks = 16
"""Max LoRA rank."""
max_loras: int = Field(default=1, ge=1)
"""Max number of LoRAs in a single batch."""

Source: vllm/config/lora.py

max_loras is the number of GPU-resident adapter slots — how many distinct adapters can appear in one batched kernel launch. max_cpu_loras (defaulting to max_loras) is how many adapters are kept staged in CPU memory. The key design choice is that the GPU-resident weight buffers are allocated once, up front, at max_loras * max_lora_rank and padded, rather than allocated per adapter on demand. That has two consequences worth making explicit. First, a batch can never ask for more slots than were provisioned, so the kernel’s indexing is always in bounds. Second, every adapter occupies a slot sized for max_lora_rank regardless of its actual rank, so a slot is a fixed-size shelf you swap adapters into and out of, not a custom-fit allocation. Fixed shelves are what avoid fragmentation: with per-adapter allocations of varying size, a churning population of adapters would leave the GPU’s adapter memory pocked with holes too small to reuse. This is the S-LoRA insight in vLLM’s shape: a unified pool of uniform adapter slots with a fixed budget, rather than per-adapter allocations that fragment as the adapter population churns.
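A rough sizing sketch makes the fixed-shelf cost concrete. It assumes 2-byte weights, one wrapped layer, and no alignment padding, so treat the byte counts as a lower bound for illustration; the helper names are hypothetical and are not part of vLLM.

```python
def adapter_pool_bytes(max_loras, max_lora_rank, d_in, d_out, bytes_per_param=2):
    """Bytes for the stacked A and B buffers of ONE wrapped layer."""
    a_bytes = max_loras * max_lora_rank * d_in * bytes_per_param
    b_bytes = max_loras * d_out * max_lora_rank * bytes_per_param
    return a_bytes + b_bytes


def slot_utilization(adapter_rank, max_lora_rank):
    """Fraction of a fixed-size slot that an adapter of a given rank fills."""
    if adapter_rank > max_lora_rank:
        raise ValueError("adapter rank exceeds the provisioned slot rank")
    return adapter_rank / max_lora_rank


# Hypothetical layer: 4096 x 4096, four slots provisioned at rank 16.
pool = adapter_pool_bytes(max_loras=4, max_lora_rank=16, d_in=4096, d_out=4096)
print(pool)                     # 1048576
print(slot_utilization(8, 16))  # 0.5
```

The pool size depends only on the two configured numbers, never on which adapters are loaded, and a rank-8 adapter in a rank-16 shelf leaves half of it empty.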

Activation is the act of binding a registered adapter to a free slot index and copying its weights into the stacked GPU buffers:

# vllm/lora/model_manager.py
first_free_slot = next(
    ((i, lora_id) for i, lora_id in enumerate(self.lora_index_to_id)
     if lora_id is None), None)
if first_free_slot is None:
    raise ValueError("No free lora slots")
index, _ = first_free_slot
self._active_adapters[lora_id] = None
lora_model = self._registered_adapters[lora_id]
self.lora_index_to_id[index] = lora_model.id

Source: vllm/lora/model_manager.py

lora_index_to_id is the slot table: a list of length max_loras mapping each physical slot to the adapter currently living in it, or None for free. It is the single source of truth that ties the two worlds together. On one side, activation finds a hole (a slot holding None), records the binding, and then walks every wrapped module to copy that adapter’s A/B matrices into that slot of the stacked GPU buffers via module.set_lora(index, ...). Deactivation just nulls the slot. On the other side, the batched kernel from the previous section reads per-token slot indices derived from this same table, so it addresses adapters by a small integer. That indirection is the whole point: the kernel never needs a weight pointer or an adapter id, only “which shelf,” and the slot table is what assigns and reclaims shelves.
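The slot table is small enough to model in a few lines. The class below is a minimal sketch that borrows the name lora_index_to_id from the excerpt; the class itself and its methods are hypothetical, and the weight copy that real activation performs is left out.

```python
from typing import Optional


class SlotTable:
    """Maps each physical slot index to the adapter id living in it, or None."""

    def __init__(self, max_loras: int) -> None:
        self.lora_index_to_id: list[Optional[int]] = [None] * max_loras

    def activate(self, lora_id: int) -> int:
        """Bind lora_id to the first free slot and return the slot index."""
        if lora_id in self.lora_index_to_id:
            return self.lora_index_to_id.index(lora_id)  # already resident
        for index, occupant in enumerate(self.lora_index_to_id):
            if occupant is None:
                self.lora_index_to_id[index] = lora_id
                return index
        raise ValueError("No free lora slots")

    def deactivate(self, lora_id: int) -> None:
        """Free the slot held by lora_id; the weights are simply overwritten later."""
        index = self.lora_index_to_id.index(lora_id)
        self.lora_index_to_id[index] = None


table = SlotTable(max_loras=2)
print(table.activate(7))       # 0
print(table.activate(113))     # 1
table.deactivate(7)
print(table.activate(42))      # 0
print(table.lora_index_to_id)  # [42, 113]
```

Adapter 42 lands on the shelf adapter 7 vacated, which is the reuse that fixed-size slots make trivial.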

When more adapters are wanted than there are slots, the same LRU logic the prefix cache used in Chapter 7 reappears, one level up. The active set is an LRU cache keyed by adapter id; activating a new adapter when the GPU slots are full evicts the least-recently-used one:

# vllm/lora/model_manager.py
if (lora_id not in self._active_adapters
        and len(self._active_adapters) >= self.lora_slots):
    self._active_adapters.remove_oldest()
result = super().activate_adapter(lora_id)
self._active_adapters.touch(lora_id)

Source: vllm/lora/model_manager.py

There are two such caches stacked, forming a memory hierarchy for adapters that mirrors the disk/CPU/GPU hierarchy you already know for KV blocks: a CPU-side LRU of registered adapters (capacity max_cpu_loras) and a GPU-side LRU of activated adapters (capacity max_loras). An adapter starts cold on disk; on registration it is loaded into CPU memory; on first use in a batch it is staged from CPU into a free GPU slot (or one freed by evicting the least-recently-used resident adapter); and as long as it keeps getting used, recency keeps it resident. The worker_manager.py loader stamps the docstring on its own class: “Every request, the requested LoRAs will be loaded (unless they are already loaded), and every other LoRA will be unloaded.” The state diagram below traces one adapter through these tiers and the events that move it between them.

stateDiagram-v2
    [*] --> OnDisk: adapter checkpoint exists
    OnDisk --> InCPU: register (load to host RAM)
    InCPU --> OnDisk: CPU LRU evict (max_cpu_loras full)
    InCPU --> InGPU: activate (copy A/B into a free slot)
    InGPU --> InCPU: GPU LRU evict (max_loras full, slot needed)
    InGPU --> InGPU: touch (used this step, stays warm)
    note right of InGPU
        occupies one of max_loras slots
        addressable by the batched kernel
    end note

Adapter residency is a caching problem with the same eviction pressure, the same locality assumptions, and the same cliff when the working set exceeds capacity that you already know from KV blocks. The two capacities matter independently: max_cpu_loras bounds how much disk reloading you suffer, and max_loras bounds how much GPU weight-copying you suffer, and a workload can be bottlenecked on either tier.
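The independence of the two capacities shows up in a small simulation. This is a sketch of two stacked LRU caches under the assumption that every use touches both tiers; the class and counter names are hypothetical, and it models only residency, not the cost of a load or a copy.

```python
from collections import OrderedDict


class AdapterResidency:
    """Two stacked LRU tiers: CPU (registered) above GPU (activated)."""

    def __init__(self, max_loras: int, max_cpu_loras: int) -> None:
        self.max_loras = max_loras
        self.max_cpu_loras = max_cpu_loras
        self.cpu = OrderedDict()  # oldest entry first
        self.gpu = OrderedDict()
        self.disk_loads = 0       # CPU-tier misses
        self.gpu_copies = 0       # GPU-tier misses

    def _touch(self, cache, capacity, lora_id) -> bool:
        """Mark lora_id most recent in cache; return True on a miss."""
        if lora_id in cache:
            cache.move_to_end(lora_id)
            return False
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict least recently used
        cache[lora_id] = True
        return True

    def use(self, lora_id: int) -> None:
        if self._touch(self.cpu, self.max_cpu_loras, lora_id):
            self.disk_loads += 1
        if self._touch(self.gpu, self.max_loras, lora_id):
            self.gpu_copies += 1


# Three adapters cycling through two GPU slots, with room for all three in CPU.
residency = AdapterResidency(max_loras=2, max_cpu_loras=3)
for lora_id in [1, 2, 3, 1, 2, 3]:
    residency.use(lora_id)

print(residency.disk_loads)  # 3
print(residency.gpu_copies)  # 6
```

Each adapter is read from disk once, yet every single use pays a GPU copy: this workload is bottlenecked on the GPU tier alone.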

Building the per-token map every step

The batched kernel needs, for each token in the batch, the slot index of that token’s adapter (or -1 for the base model). But the scheduler thinks in requests, and the kernel thinks in rows, so something has to translate one into the other on every step. That translation is what this section is about: turning “request 42 wants adapter 7, request 43 wants the base model” into a flat per-token vector of slot indices the GPU can read. The v1 model-runner mixin assembles it from the input batch:

# vllm/v1/worker/lora_model_runner_mixin.py
prompt_lora_mapping, token_lora_mapping, lora_requests = (
    input_batch.make_lora_inputs(num_scheduled_tokens, num_sampled_tokens)
)
return self._set_active_loras(
    prompt_lora_mapping, token_lora_mapping, lora_requests, mapping_type)

Source: vllm/v1/worker/lora_model_runner_mixin.py

Two mappings, because two parts of the model need adapter indices at different granularities. token_lora_mapping has one entry per scheduled token, for the linear-layer LoRAs that run over every position (every token passes through the attention and MLP projections). prompt_lora_mapping has one entry per sampled token, for the logits/sampler LoRA that only runs at the last position of each sequence, where the next-token distribution is produced. The two granularities exist because the model applies LoRA in two different places that see different numbers of rows: thousands of token positions in the layers, but only one sampled position per request at the head.
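The expansion from requests to rows can be sketched as follows. The ScheduledRequest type, the use of None for the base model, and the function name are assumptions made for illustration; the real make_lora_inputs works on the engine's own batch structures.

```python
from typing import NamedTuple, Optional


class ScheduledRequest(NamedTuple):
    lora_id: Optional[int]     # None means the base model (illustrative encoding)
    num_scheduled_tokens: int  # rows this request contributes to the layers
    num_sampled_tokens: int    # rows this request contributes to the sampler


def make_lora_mappings(requests):
    """Expand per-request adapter ids into per-token and per-sample lists."""
    token_lora_mapping = []
    prompt_lora_mapping = []
    for req in requests:
        token_lora_mapping.extend([req.lora_id] * req.num_scheduled_tokens)
        prompt_lora_mapping.extend([req.lora_id] * req.num_sampled_tokens)
    return prompt_lora_mapping, token_lora_mapping


batch = [
    ScheduledRequest(lora_id=7, num_scheduled_tokens=3, num_sampled_tokens=1),
    ScheduledRequest(lora_id=None, num_scheduled_tokens=1, num_sampled_tokens=1),
    ScheduledRequest(lora_id=113, num_scheduled_tokens=1, num_sampled_tokens=1),
]
prompt_map, token_map = make_lora_mappings(batch)
print(token_map)   # [7, 7, 7, None, 113]
print(prompt_map)  # [7, None, 113]
```

The first request is mid-prefill and contributes three rows to the layers but only one to the sampler, which is why the two lists have different lengths.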

Note this is recomputed every step against the current batch composition. Continuous batching (Chapter 5) means the set of requests, and therefore the set of active adapters, changes step to step; the LoRA index tensors are part of the per-step metadata the scheduler hands the worker, no different in spirit from the block tables. The sequence diagram below traces a single decode step, from the scheduler’s request-level decisions down to the kernel reading slot indices.

sequenceDiagram
    participant Sched as Scheduler
    participant Batch as InputBatch
    participant Mixin as ModelRunner mixin
    participant Punica as PunicaWrapper on GPU
    participant Kernel as Triton shrink/expand kernel
    Sched->>Batch: requests admitted this step (each with adapter id)
    Batch->>Mixin: make_lora_inputs returns token + prompt mappings
    Mixin->>Punica: set_active_loras (host-side mapping)
    Punica->>Punica: update_metadata maps ids to slot indices via lora_index_to_id
    Punica->>Kernel: token_lora_indices, sampler_indices (-1 = base/unslotted)
    Kernel->>Kernel: per row: skip if -1, else shrink/expand from that slot

On the GPU, update_metadata turns the host-side mapping into the index tensors the Triton kernels consume:

# vllm/lora/punica_wrapper/punica_gpu.py
self._update_base_metadata(mapping, lora_index_to_id, max_loras, vocab_size)
# Prepare cuda kernel metadata tensors
self.token_mapping_meta.prepare_tensors(self.token_lora_indices)
self.prompt_mapping_meta.prepare_tensors(self.sampler_indices)

Source: vllm/lora/punica_wrapper/punica_gpu.py

A token whose adapter is not currently slotted, or a base-model request, carries index -1, and the kernel’s convention is that -1 means “skip”: for those rows the shrink and expand simply do not write, leaving the base output untouched. This single sentinel value is what makes the whole scheme uniform. There is no separate code path for base-model requests and no special-casing in the scheduler; the base forward pass produces y for everyone, and the LoRA kernel adds a delta only where the per-token index points at a real slot. Base and adapted requests differ by one integer in a vector, nothing more.
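The id-to-slot translation with its sentinel fits in one function. This is a minimal sketch: the function name is hypothetical and None stands in for a base-model row, which is an illustrative encoding rather than the engine's own.

```python
def to_slot_indices(lora_ids, lora_index_to_id):
    """Translate per-token adapter ids into slot indices, -1 where there is none."""
    slot_of = {
        lora_id: slot
        for slot, lora_id in enumerate(lora_index_to_id)
        if lora_id is not None
    }
    # Base-model rows (None) and unslotted adapters both fall through to -1.
    return [slot_of.get(lora_id, -1) for lora_id in lora_ids]


lora_index_to_id = [113, None, 7]        # slot 0 -> 113, slot 1 free, slot 2 -> 7
token_lora_ids = [7, 7, 7, None, 113, 9]  # adapter 9 is not slotted

print(to_slot_indices(token_lora_ids, lora_index_to_id))
# [2, 2, 2, -1, 0, -1]
```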

Tensor parallelism, packing, and where the sharding goes

Chapter 15 sharded the base weights across a TP (tensor-parallel) group, splitting each weight matrix across GPUs so each rank holds only a slice. The adapters have to follow the same split, otherwise the LoRA delta would not line up with the base output it is being added to, and how they shard differs by layer type. This is why vLLM wraps each parallel-linear flavor in its own LoRA layer. Take ColumnParallelLinear, which shards its output dimension: each rank computes one chunk of the output columns. The base output y is therefore already partitioned across ranks, so the LoRA delta must be partitioned the same way. Since $B$ is the matrix that produces the output, $B$ shards along its $d_{\text{out}}$ dimension to match, while $A$ (which only reaches the rank-$r$ waist) is replicated on every rank. The consequence is that the shrink $x A^\top$ is computed redundantly on each rank, and only the expand $\text{buffer} \cdot B^\top$ is sharded. A RowParallelLinear, which shards its input dimension, is the mirror image: $A$ shards along $d_{\text{in}}$ and $B$ is replicated. The wrapper layers (column_parallel_linear.py, row_parallel_linear.py) encode exactly this by overriding slice_lora_a/slice_lora_b to cut each adapter’s weights to the local shard at activation time.

That replicated half is wasted work: every rank redoes the same small GEMM. The optional fully_sharded_loras flag pushes the split further, sharding the otherwise-replicated half too at the cost of an extra communication step to reassemble it. It pays off “at high sequence length, max rank or tensor parallel size” per the config docstring, because that is exactly when the redundant compute grows large enough to outweigh the added communication. It is the same communication-versus-redundant-compute tradeoff that governs base-weight sharding, and you can read it off the same roofline reasoning as everything else.
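A toy check confirms that both default splits reproduce the unsharded delta. The sketch simulates the tensor-parallel ranks with a loop on one process, uses hypothetical function names and integer matrices, and stands in for the real collectives with a concatenation (column-parallel) and a sum (row-parallel).

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_delta(a, b, x):
    """Unsharded reference: B (A x), with A r x d_in and B d_out x r."""
    return matvec(b, matvec(a, x))


def column_parallel_delta(a, b, x, tp_size):
    """Shard d_out: A replicated, B split by output rows, outputs concatenated."""
    rows = len(b) // tp_size
    out = []
    for rank in range(tp_size):
        buffer = matvec(a, x)  # the same shrink, redone on every rank
        out.extend(matvec(b[rank * rows:(rank + 1) * rows], buffer))
    return out


def row_parallel_delta(a, b, x, tp_size):
    """Shard d_in: A split by input columns, B replicated, partial outputs summed."""
    cols = len(x) // tp_size
    total = [0] * len(b)
    for rank in range(tp_size):
        lo, hi = rank * cols, (rank + 1) * cols
        a_shard = [row[lo:hi] for row in a]
        partial = matvec(b, matvec(a_shard, x[lo:hi]))
        total = [t + p for t, p in zip(total, partial)]
    return total


a = [[1, 2, 3, 4], [0, 1, 0, 1]]        # r = 2, d_in = 4
b = [[1, 0], [0, 1], [1, 1], [2, -1]]   # d_out = 4, r = 2
x = [1, 1, 1, 1]

print(lora_delta(a, b, x))                # [10, 2, 12, 18]
print(column_parallel_delta(a, b, x, 2))  # [10, 2, 12, 18]
print(row_parallel_delta(a, b, x, 2))     # [10, 2, 12, 18]
```

In the column-parallel case the call to `matvec(a, x)` inside the loop is the redundant shrink that fully_sharded_loras exists to remove.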

There is a packing wrinkle worth noting because it interacts with how real checkpoints are laid out. Models fuse q, k, v into one qkv_proj GEMM and gate/up into one gate_up_proj. The adapter weights for those fused projections arrive as separate slices and get packed to match:

# vllm/lora/lora_weights.py
class PackedLoRALayerWeights(LoRALayerWeights):
    """LoRA used for packed layers (eg. qkv_proj)."""

Source: vllm/lora/lora_weights.py

pack also folds the per-adapter scaling $\frac{\alpha}{r}$ into $B$ once, at load time, so the hot path multiplies by $1$ instead of applying a scale on every step. It is a small optimize(), but it matters because a batched kernel serving many adapters would otherwise need a different scale for each row’s adapter. MoE adds a third axis: each expert is its own GEMM, so a LoRA-adapted MoE layer needs adapter weights stacked over (num_experts, rank, ...) and a fused kernel that routes tokens to experts and to adapter slots simultaneously (punica_wrapper’s add_lora_w13/add_lora_w2, building on the expert-parallel routing of Chapter 15). The complexity compounds, but the principle does not change: one batched launch, per-token indices selecting both expert and adapter.
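Folding the scale is a one-time elementwise multiply. The helper below is a hypothetical sketch of that idea on plain lists, not the library's optimize() itself.

```python
def fold_scaling(lora_b, alpha, r):
    """Return (alpha / r) * B and the scale left for the hot path, which is 1.0."""
    scale = alpha / r
    folded = [[scale * value for value in row] for row in lora_b]
    return folded, 1.0


b = [[1.0, 0.5], [2.0, -1.0]]
folded_b, remaining_scale = fold_scaling(b, alpha=32, r=16)

print(folded_b)         # [[2.0, 1.0], [4.0, -2.0]]
print(remaining_scale)  # 1.0
```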

The prefix cache forks per adapter

Chapter 7 left a hook we now have to honor. Prefix caching rests on one assumption: the KV (the cached keys and values for a span of tokens) is a pure function of the tokens, so two requests with byte-identical prefixes can share the same KV blocks. Adapters break that assumption. The keys and values are produced by the k_proj and v_proj attention projections, and those projections are among the weights LoRA typically modifies. Even when an adapter leaves those two projections alone, any adapted layer earlier in the stack changes the hidden states that reach them in later layers. So the same prefix tokens run through adapter 7 produce different keys and values than through adapter 113, even though the token ids are identical. The KV is now a function of the tokens and the adapter. Reusing one adapter’s KV for another would silently feed wrong keys and values into attention, a correctness bug that produces plausible-looking but incorrect output. The fix is the extra_keys mechanism from Chapter 7, forked per adapter:

# vllm/v1/core/kv_cache_utils.py
def _gen_lora_extra_hash_keys(request: Request) -> list[str]:
    if not request.lora_request:
        return []
    return [request.lora_request.lora_name]

Source: vllm/v1/core/kv_cache_utils.py

The adapter name folds into the block hash, so KV blocks computed under adapter 7 are addressable only by requests also using adapter 7. Prefix sharing now happens within an adapter’s request stream, not across adapters. This is the right semantics, and it has a real fleet consequence: a popular system prompt that would have been computed once is now computed once per adapter that uses it, fragmenting cache reuse along adapter lines. There is no free fix; correctness requires the fork. It is, however, the kind of thing a cache-aware router should know about, which is the next handoff.
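The effect of the fork on block identity can be shown with a stand-in hash. The function below is an assumption made for illustration: it uses SHA-256 over a repr of the inputs, which is not the engine's actual block-hash scheme, and the adapter names are invented.

```python
import hashlib
from typing import Optional, Sequence


def block_hash(parent_hash: str, token_ids: Sequence[int],
               lora_name: Optional[str]) -> str:
    """Hash one KV block from its parent, its tokens, and any adapter extra key."""
    extra_keys = [lora_name] if lora_name else []
    payload = repr((parent_hash, tuple(token_ids), tuple(extra_keys)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


system_prompt_block = [101, 2023, 2003, 1037]  # same tokens for every request

base = block_hash("root", system_prompt_block, None)
adapter_a = block_hash("root", system_prompt_block, "adapter-7")
adapter_a_again = block_hash("root", system_prompt_block, "adapter-7")
adapter_b = block_hash("root", system_prompt_block, "adapter-113")

print(adapter_a == adapter_a_again)  # True
print(adapter_a == adapter_b)        # False
print(adapter_a == base)             # False
```

Identical tokens hash to three distinct blocks, one per adapter plus one for the base model, so the shared system prompt is stored and computed three times.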

Backpressure and the routing signal

The slot budget is a hard scheduling constraint, not a soft preference, and the token-budget scheduler from Chapter 5 enforces it directly. When considering a waiting request, the scheduler checks whether admitting it would need a slot beyond max_loras:

# vllm/v1/core/sched/scheduler.py
if (self.lora_config and request.lora_request
        and (len(scheduled_loras) == self.lora_config.max_loras
             and request.lora_request.lora_int_id not in scheduled_loras)):
    # Scheduling would exceed max_loras, skip.
    request_queue.pop_request()
    step_skipped_waiting.prepend_request(request)
    continue

Source: vllm/v1/core/sched/scheduler.py
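Stripped of everything else the scheduler does, the same admission rule can be sketched on its own. The function below is a simplification under stated assumptions: it ignores the token budget and KV-block checks, represents requests as hypothetical (request_id, lora_id) pairs, and uses None for the base model.

```python
from collections import deque


def schedule_step(waiting, max_loras):
    """Admit requests in order, deferring any that would need a slot past max_loras."""
    scheduled = []
    scheduled_loras = set()
    skipped = []
    while waiting:
        request_id, lora_id = waiting.popleft()
        needs_new_slot = lora_id is not None and lora_id not in scheduled_loras
        if needs_new_slot and len(scheduled_loras) == max_loras:
            skipped.append((request_id, lora_id))  # deferred, not dropped
            continue
        if lora_id is not None:
            scheduled_loras.add(lora_id)
        scheduled.append(request_id)
    waiting.extend(skipped)  # retried on the next step, original order kept
    return scheduled


waiting = deque([("a", 7), ("b", 113), ("c", 9), ("d", 7), ("e", None)])
scheduled = schedule_step(waiting, max_loras=2)

print(scheduled)      # ['a', 'b', 'd', 'e']
print(list(waiting))  # [('c', 9)]
```

Request d is admitted after c is skipped because adapter 7 already holds a slot, and the base-model request e never counts against the budget.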

A request whose adapter would not fit is deferred, not dropped — it goes back to the waiting queue and is retried next step, the same load-shedding-by-waiting valve from Chapter 6, now keyed on adapter slots rather than KV blocks. This is why the engine exposes adapter occupancy as a first-class metric. The Prometheus gauge reports which adapters are running, which are waiting, and the slot ceiling:

# vllm/v1/metrics/loggers.py
self.gauge_lora_info = self._gauge_cls(
    name="vllm:lora_requests_info",
    documentation="Running stats on lora requests.",
    ...
    labelnames=[self.labelname_max_lora,
                self.labelname_waiting_lora_adapters,
                self.labelname_running_lora_adapters])

Source: vllm/v1/metrics/loggers.py

That gauge is exactly the kind of engine-internal signal Chapter 18 argued a router must consume. Least-connections routing is wrong here for the same reason it was wrong for prefix locality: the right replica for a request is one that already has that adapter slotted (and won’t have to evict to admit it) and that holds the request’s prefix in the adapter-forked cache. A router that scrapes vllm:lora_requests_info can steer adapter-7 traffic toward replicas already serving adapter 7, packing the adapter working set per replica instead of smearing every adapter across every replica and forcing constant activation churn. The slot budget and the cache fork both push toward adapter affinity in routing, the multi-tenant cousin of the prefix affinity from Chapter 18.
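An adapter-affinity policy can be sketched as a three-tier preference. Everything here is hypothetical: the replica dictionaries, their field names, and the tie-break on in-flight requests are assumptions, and how a router would parse the gauge's labels into this shape is not shown.

```python
def pick_replica(replicas, lora_id):
    """Prefer a replica already running lora_id, then one with a free slot."""

    def score(replica):
        if lora_id in replica["running_adapters"]:
            tier = 0  # adapter already slotted: no activation needed
        elif len(replica["running_adapters"]) < replica["max_lora"]:
            tier = 1  # free slot: activation without eviction
        else:
            tier = 2  # full: admitting this adapter forces an eviction
        return (tier, replica["in_flight"])  # break ties by load

    return min(replicas, key=score)["name"]


replicas = [
    {"name": "A", "running_adapters": {7, 9}, "max_lora": 2, "in_flight": 10},
    {"name": "B", "running_adapters": {113}, "max_lora": 2, "in_flight": 2},
    {"name": "C", "running_adapters": {1, 2}, "max_lora": 2, "in_flight": 0},
]

print(pick_replica(replicas, lora_id=7))   # A
print(pick_replica(replicas, lora_id=42))  # B
```

Least-connections would send both requests to the idle replica C, where each would have to evict a resident adapter before it could run.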

What is still hard

The headline win is real: hundreds of adapters multiplexed on shared base weights, batched without merging, at close to base-model throughput when the active set fits in slots. The honest caveats are about that last clause and its edges.

Slot thrash is the dominant failure mode. If the live adapter population exceeds max_loras, every step risks evicting an adapter the next step needs, and activation is not free: it copies weights into the stacked GPU buffers and, on a CPU-cache miss, reads them from disk first. A workload with a long tail of rarely-used adapters can spend real time shuffling weights in and out, and the symptom (requests deferred, latency spiking) looks like load when it is actually adapter cache pressure. Raising max_loras costs GPU memory provisioned at full rank whether or not adapters use it; the tradeoff is genuine and workload-specific.

The shape of that failure is the same eviction cliff the prefix cache had in Chapter 7, one level up. The curve below sketches sustained throughput as the live adapter working set grows against a fixed slot budget: while distinct adapters fit in max_loras, every activation is a hit and throughput holds near the base-model rate; once the working set crosses the slot ceiling, each step starts evicting an adapter the next step wants, and throughput falls off as activation cost (GPU weight copies, and disk reloads on a CPU-cache miss) crowds out useful compute.

Figure: sustained throughput against live adapter working set, for a fixed 16-slot budget. Illustrative: the flat region (working set within the 16-slot budget) and the post-cliff falloff (thrash as activation cost dominates) have the right shape, but the exact slope of the decline depends on the workload’s adapter reuse pattern and the disk-versus-CPU mix of activations.
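The cliff itself is easy to reproduce with an LRU hit-rate sweep. This sketch assumes strict round-robin access, which is the worst case for LRU and makes the drop abrupt; a skewed workload would decline more gradually. The function names are hypothetical and hit rate is used as a stand-in for throughput.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Fraction of accesses that find their key already in an LRU cache."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits / len(accesses)


def round_robin(num_adapters, rounds):
    """Adapter ids 0..num_adapters-1, cycled in order for the given rounds."""
    return [i for _ in range(rounds) for i in range(num_adapters)]


for working_set in (8, 16, 17, 32):
    rate = lru_hit_rate(round_robin(working_set, rounds=100), capacity=16)
    print(working_set, rate)
# 8 0.99
# 16 0.99
# 17 0.0
# 32 0.0
```

One adapter past the slot budget takes the hit rate from 99% to zero under this access pattern, because each new activation evicts exactly the adapter that will be needed soonest.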

Heterogeneous rank is the quiet tax. Buffers are sized at max_lora_rank, so a deployment mixing rank-8 and rank-64 adapters either provisions every slot for 64 (leaving seven-eighths of each rank-8 adapter’s shelf empty) or sets max_lora_rank below 64 and cannot host the large ones.

And the cache fork, while correct, is a standing efficiency loss: shared context is recomputed per adapter, with no general way to share the part of the KV the adapter did not change, because in attention the adapter changes the very projections that produce the KV. The S-LoRA and Punica papers below go deeper on slot management and batched-kernel design respectively; both predate the MoE and fully-sharded refinements now in the tree, so read them for the load-bearing ideas, not the current shapes.

With adapters multiplexed and the routing signal exposed, the fleet can serve many models from few replicas. What it cannot yet do is bring a new base model online quickly, or decide how many replicas to run at all. That cold-start-and-scale problem is Chapter 20.

Further reading

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters — arXiv:2311.03285 — unified paged adapter memory and a fixed slot budget that let one replica hold thousands of adapters; the source of vLLM’s slot-and-eviction model.
  • Punica: Multi-Tenant LoRA Serving — arXiv:2310.18547 — the batched LoRA (SGMV/BGMV) kernel that applies many adapters in one launch via per-token indices; vLLM’s punica_wrapper is named for and cites it.