Multimodal requests: the encoder cache and a second budget

Up to now every request in this book has been a sequence of token ids. The token-budget scheduler from Chapter 5 plans a batch by counting tokens; chunked prefill in Chapter 6 slices a long prompt against that single budget; prefix caching in Chapter 7 lets identical prefixes skip the prefill compute entirely, keyed by a content hash of the block’s tokens plus a few extra_keys. One budget, one cache, one currency. That tidy picture survives exactly until someone sends a picture.

A multimodal request arrives with an image (or several images, audio, video) interleaved into the prompt. The language model cannot attend to raw pixels. Before it sees anything, a separate vision (or audio) tower has to run a forward pass that turns the media into a block of embeddings, and those embeddings are then spliced into the token stream where placeholder tokens stand in for them. This is the architecture that LLaVA (Visual Instruction Tuning, arXiv:2304.08485) made the default template: a pretrained vision encoder, a small projection into the language model’s embedding space, and an otherwise ordinary autoregressive decoder. The serving consequence is blunt. There are now two models behind one request, the encoder runs first, and its output is large, reusable, and expensive enough that you do not want to recompute it.

So the engine grows a second budget and a second cache. The decoder still spends the token budget you already understand. The encoder spends a separate compute budget, measured in encoder embeddings, and parks its outputs in a separate cache, the EncoderCacheManager. Both of those resources are arbitrated inside the same schedule() pass you read in Chapter 5, which is what makes this a scheduling problem and not just a model-loading detail. This chapter is about how those two budgets compete step by step, how the encoder cache becomes its own prefix-caching problem keyed by media hash, and how that media hash threads back into the Chapter 7 block hash so the two caches stay consistent.

The diagram below traces the path of a single multimodal request through the two models, and shows where the two budgets are spent and where the two caches sit. Pixels enter on the left, become embeddings in the encoder, get spliced into the placeholder slots in the prompt, and only then does the decoder run. Each model has a resource that gates it: the encoder is gated by the encoder compute budget plus the EncoderCacheManager, the decoder by the token budget plus the KV cache from Chapter 4.

flowchart LR
    IMG["image bytes"] --> HASH["media hash (identifier)"]
    HASH --> ENC["vision encoder (one forward pass)"]
    ENC --> EMB["block of embeddings"]
    EMB --> ECACHE["EncoderCacheManager (keyed by media hash)"]
    ECACHE --> SPLICE["splice embeddings into placeholder slots"]
    PROMPT["text tokens + placeholder tokens"] --> SPLICE
    SPLICE --> DEC["language decoder (autoregressive)"]
    DEC --> KV["KV cache (Chapter 4)"]
    EBUD["encoder compute budget"] -->|gates| ENC
    TBUD["token budget (Chapter 5)"] -->|gates| DEC

The two halves look symmetric, but their resources behave differently, and the rest of the chapter is mostly about that asymmetry: where the media hash comes from and why it has to be shared between the two caches, how the encoder’s separate budget competes with the decoder’s inside one scheduling step, and what happens to the encoder cache over the lifetime of a request.

Where the media hash comes from

Everything starts with a hash of the bytes. Before a request becomes a Request, the multimodal processor (dispatched through vllm/multimodal/registry.py) turns each media item into a MultiModalFeatureSpec, and the engine attaches a list of those specs to the request. From vllm/v1/request.py:

self.mm_features = mm_features or []

Each spec carries the modality, the processed data, the placeholder location in the prompt, and two hashes. Two hashes, because the same image can be processed under different LoRA adapters (Chapter 19), and an encoder output produced under one adapter must not be served to a request using another. The mm_hash keys the processor output (the raw, adapter-independent result of decoding the bytes); the identifier keys the encoder output and folds in a LoRA prefix when one applies, so it is the stricter of the two. The fields that matter for the rest of this chapter are in vllm/multimodal/inputs.py:

identifier: str
"""The hash for caching encoder outputs (with LoRA prefix if applicable)."""
...
mm_position: PlaceholderRange
"""The location of the `modality` tokens corresponding to this item in the prompt..."""
mm_hash: str | None = None
"""The hash for caching processor outputs (without LoRA prefix)."""

The mm_position (a PlaceholderRange) is the third piece you need to hold onto: it records where in the prompt this item’s placeholder tokens sit. A multimodal prompt is not pure media. It is a token sequence in which a contiguous run of placeholder tokens marks the slot where the image’s embeddings will be substituted, surrounded by ordinary text. The mm_position is what tells the scheduler the start and length of that slot, and it is what later lets the scheduler decide whether an image even overlaps the token range it is about to compute this step.
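The overlap question that mm_position answers can be sketched in a few lines. This is a minimal illustration, not vLLM's code: ToyPlaceholderRange and overlaps_step are hypothetical names, and treating both the slot and the step as half-open intervals is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPlaceholderRange:
    """Hypothetical stand-in for a placeholder slot: start offset and length."""
    offset: int
    length: int


def overlaps_step(slot: ToyPlaceholderRange,
                  num_computed_tokens: int,
                  num_new_tokens: int) -> bool:
    """True if the slot intersects the half-open token range this step computes."""
    step_start = num_computed_tokens
    step_end = num_computed_tokens + num_new_tokens
    slot_end = slot.offset + slot.length
    return slot.offset < step_end and slot_end > step_start


# An image whose placeholders occupy token positions 100..355.
image_slot = ToyPlaceholderRange(offset=100, length=256)

before = overlaps_step(image_slot, num_computed_tokens=0, num_new_tokens=100)
partial = overlaps_step(image_slot, num_computed_tokens=0, num_new_tokens=101)
after = overlaps_step(image_slot, num_computed_tokens=356, num_new_tokens=64)
```

Here before is False (the step stops one token short of the slot), partial is True (the step reaches the first placeholder), and after is False (prefill has already moved past the image).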

The hash itself is content-addressed, computed in vllm/multimodal/hasher.py. For a PIL image it serializes mode, pixels, and palette; for a tensor it serializes dtype, shape, and raw bytes; then it folds everything through blake3 (or sha256/sha512 for FIPS):

@classmethod
def hash_kwargs(cls, **kwargs: object) -> str:
    hasher_factory = _get_hasher_factory(envs.VLLM_MM_HASHER_ALGORITHM)
    hasher = hasher_factory()
    for k, v in sorted(kwargs.items(), key=lambda kv: kv[0]):
        for bytes_ in cls.iter_item_to_bytes(k, v):
            hasher.update(bytes_)
    return hasher.hexdigest()
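The sorted-key, feed-everything-through-one-hasher pattern can be reproduced with the standard library alone. The sketch below is an assumption-laden stand-in, not vLLM's hasher: it uses hashlib.sha256 instead of blake3, takes pre-serialized bytes instead of PIL images or tensors, and the length-prefixing of each field is a choice made here to keep field boundaries unambiguous. The function name hash_media_fields is hypothetical.

```python
import hashlib


def hash_media_fields(**fields: bytes) -> str:
    """Content-address a set of named byte fields, independent of argument order."""
    hasher = hashlib.sha256()
    for key in sorted(fields):
        # Length-prefix key and value so ("ab", "c") and ("a", "bc") differ.
        for part in (key.encode("utf-8"), fields[key]):
            hasher.update(len(part).to_bytes(8, "little"))
            hasher.update(part)
    return hasher.hexdigest()


same_a = hash_media_fields(mode=b"RGB", pixels=b"\x00\x01\x02")
same_b = hash_media_fields(pixels=b"\x00\x01\x02", mode=b"RGB")   # keys reordered
different = hash_media_fields(mode=b"RGB", pixels=b"\x00\x01\x03")  # one byte changed
```

same_a and same_b are equal because the keys are sorted before hashing, while different diverges on a single changed byte. That is the property the encoder cache relies on: identical media, identical key, regardless of who sent it.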

This identifier is the join key for the whole chapter. The same image sent by two different users hashes identically, so its encoder output can be shared. And, crucially, this is the same hash that Chapter 7’s prefix cache folds into its block hash. Recall that the block hash takes extra_keys; for multimodal blocks those keys are generated in vllm/v1/core/kv_cache_utils.py:

# The block contains the current mm input. Include its offset
# relative to the start of the block so prefix-cache keys stay
# distinct when the same MM item appears at different positions
# within otherwise-identical placeholder blocks.
extra_keys.append((mm_feature.identifier, offset - start_token_idx))

That single line is the seam between the two caches. To see why it has to exist, picture two requests whose prompts are byte-for-byte identical as token ids: same text, same number of placeholder tokens, same positions. To Chapter 7’s prefix cache, which hashes block contents by token id, those two prompts look like the same prefix and would share KV blocks. But if the two requests carried different images behind those identical placeholders, sharing the blocks would be a correctness bug, because the KV computed for one image would be reused for the other. Folding the image’s identifier into the block’s extra_keys closes that hole: the decoder’s KV blocks covering placeholder tokens are a prefix-cache hit only when the image bytes also match. (The offset half of the tuple guards a subtler case: the same image appearing at a different position inside otherwise-identical placeholder blocks must still hash distinctly.)
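A toy block hash makes the hole and the fix concrete. Everything here is illustrative: toy_block_hash is a hypothetical function, the repr-plus-sha256 serialization is an assumption chosen for brevity, and the placeholder token id and media identifiers are invented. Only the shape of the extra key, an (identifier, offset) tuple, follows the excerpt above.

```python
import hashlib


def toy_block_hash(parent_hash, token_ids, extra_keys=()):
    """Hash a block from its parent hash, its token ids, and any extra keys."""
    payload = repr((parent_hash, tuple(token_ids), tuple(extra_keys)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


PLACEHOLDER_ID = 32000                 # hypothetical placeholder token id
block = [PLACEHOLDER_ID] * 16          # a block made entirely of placeholders

# Without extra keys, two requests with different images collide.
naive_cat = toy_block_hash(None, block)
naive_dog = toy_block_hash(None, block)

# With (identifier, offset) folded in, the collision disappears.
keyed_cat = toy_block_hash(None, block, [("hash-of-cat", 0)])
keyed_dog = toy_block_hash(None, block, [("hash-of-dog", 0)])
keyed_cat_again = toy_block_hash(None, block, [("hash-of-cat", 0)])
keyed_cat_shifted = toy_block_hash(None, block, [("hash-of-cat", 4)])
```

naive_cat equals naive_dog, which is the correctness bug. keyed_cat differs from keyed_dog, matches keyed_cat_again (so legitimate reuse still works), and differs from keyed_cat_shifted (the offset case).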

So the media hash is doing double duty, and the diagram below traces both jobs from the single hash. Computed once from the bytes, it keys the encoder-output cache directly, and it is also injected into the decoder’s prefix-cache block hash so the KV side stays honest about which pixels a block actually saw.

flowchart TD
    BYTES["image bytes"] --> H["media hash (identifier)"]
    H --> J1["join 1: key into EncoderCacheManager"]
    H --> J2["join 2: extra_keys in KV block hash"]
    J1 --> R1["same image, two users: share one encoder output"]
    J2 --> R2["different images, same placeholder tokens: never share KV blocks"]

A second cache with its own eviction

The EncoderCacheManager in vllm/v1/core/encoder_cache_manager.py is, structurally, a smaller and simpler cousin of the KV block pool from Chapter 4. It is sized in encoder embeddings, not blocks, and it tracks who is using what:

# mm_hash of mm_data => ids of requests that reference the mm_data
self.cached: dict[str, set[str]] = {}
...
# mm_hash of mm_data => num_encoder_embeds of the mm_data
self.freeable: OrderedDict[str, int] = OrderedDict()
self.freed: list[str] = []

The pattern mirrors the refcounted, lazy-LRU eviction you saw in block_pool.py. “Refcounted” means each entry tracks which live requests point at it (the set[str] of request ids in cached); “lazy” means an unused entry is not freed at the moment its reference set empties but only later, when the space is actually needed. So an entry whose reference set is non-empty is pinned. When the last request lets go, the entry is not deleted; it slides into freeable, an OrderedDict acting as an LRU (least-recently-used) queue, and stays resident in case a future request asks for the same media. Only under memory pressure does it actually die, and only then in the order entries became freeable (the one released longest ago goes first):

while num_embeds > self.num_free_slots:
    mm_hash, num_free_embeds = self.freeable.popitem(last=False)
    del self.cached[mm_hash]
    self.freed.append(mm_hash)
    self.num_free_slots += num_free_embeds

Source: vllm/v1/core/encoder_cache_manager.py

This is encoder-output prefix caching. The check_and_update_cache method is the cache lookup: if the media hash is already present, the request simply joins the reference set and the encoder is never run for that item. If the entry had been sitting in freeable, it gets rescued back into the live pool, exactly the touch-refcount-rescue move from Chapter 7. The worker side learns what to drop through get_freed_mm_hashes(), which hands the scheduler the list of hashes to tell the model runner to evict from its embedding store, the same scheduler-tells-worker contract that carries everything else in SchedulerOutput.
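The lookup, rescue, and eviction moves fit in a small self-contained toy. This is a simplified sketch under stated assumptions, not the EncoderCacheManager itself: ToyEncoderCache, its method names, and the sizes helper dict are hypothetical, keys are plain strings, and callers are assumed to call lookup before allocate. The cached, freeable, freed, and num_free_slots fields follow the excerpts above.

```python
from collections import OrderedDict


class ToyEncoderCache:
    """Refcounted cache with lazy eviction, sized in encoder embeddings."""

    def __init__(self, capacity: int) -> None:
        self.num_free_slots = capacity
        self.cached: dict[str, set[str]] = {}            # hash -> referencing request ids
        self.freeable: OrderedDict[str, int] = OrderedDict()  # unpinned hash -> size
        self.freed: list[str] = []                       # evicted hashes for the worker
        self.sizes: dict[str, int] = {}                  # hypothetical helper: hash -> size

    def lookup(self, mm_hash: str, request_id: str) -> bool:
        """Cache hit: join the reference set, rescuing from freeable if needed."""
        if mm_hash not in self.cached:
            return False
        self.freeable.pop(mm_hash, None)
        self.cached[mm_hash].add(request_id)
        return True

    def allocate(self, mm_hash: str, request_id: str, num_embeds: int) -> bool:
        """Reserve slots for a new entry, evicting unpinned entries oldest-first."""
        reclaimable = self.num_free_slots + sum(self.freeable.values())
        if num_embeds > reclaimable:
            return False
        while num_embeds > self.num_free_slots:
            victim, victim_size = self.freeable.popitem(last=False)
            del self.cached[victim]
            del self.sizes[victim]
            self.freed.append(victim)
            self.num_free_slots += victim_size
        self.num_free_slots -= num_embeds
        self.cached[mm_hash] = {request_id}
        self.sizes[mm_hash] = num_embeds
        return True

    def release(self, mm_hash: str, request_id: str) -> None:
        """Drop one reference; the entry becomes freeable when none remain."""
        refs = self.cached[mm_hash]
        refs.discard(request_id)
        if not refs:
            self.freeable[mm_hash] = self.sizes[mm_hash]


cache = ToyEncoderCache(capacity=10)
cache.allocate("cat", "req-1", 6)      # encoder runs; 4 slots remain
cache.release("cat", "req-1")          # unpinned but still resident
hit = cache.lookup("cat", "req-2")     # rescued: no second encoder pass
cache.release("cat", "req-2")
cache.allocate("dog", "req-3", 6)      # needs 6, only 4 free: evicts "cat"
miss = cache.lookup("cat", "req-4")    # evicted, so this one must re-encode
```

In this run hit is True and miss is False, and cache.freed ends up as ["cat"], the list a scheduler would hand to the worker so the embeddings can be dropped there too.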

One subtlety worth internalizing: the docstring is explicit that the cache counts embeddings, not the placeholder tokens around them. Break tokens and text tokens interleaved between images do not consume encoder-cache slots. The two budgets really do measure different things.

The second budget, and how it competes

The compute budget is computed once at startup in vllm/v1/core/encoder_cache_manager.py:

encoder_compute_budget = max(
    scheduler_config.max_num_encoder_input_tokens, max_tokens_per_mm_item
)
encoder_cache_size = max(
    scheduler_config.encoder_cache_size, max_tokens_per_mm_item
)

Both default to max_num_batched_tokens (the very token budget from Chapter 5), as vllm/config/scheduler.py shows, with a floor: each must be at least one full media item, or the largest image could never be scheduled at all. So out of the box the engine carves an encoder compute allowance roughly the size of the decoder token budget, plus an encoder cache of comparable size. These are not yet user-tunable knobs (max_num_encoder_input_tokens and encoder_cache_size are init=False), which is a real limitation we will come back to.
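The effect of the floor is easiest to see with numbers. The helper below is a hypothetical sketch that assumes both configured values simply equal max_num_batched_tokens, as the defaults described above do; the token counts are invented for illustration.

```python
def toy_encoder_budgets(max_num_batched_tokens: int,
                        max_tokens_per_mm_item: int) -> tuple[int, int]:
    """Return (compute budget, cache size), each floored at one full media item."""
    configured_compute = max_num_batched_tokens   # assumed default
    configured_cache = max_num_batched_tokens     # assumed default
    compute = max(configured_compute, max_tokens_per_mm_item)
    cache = max(configured_cache, max_tokens_per_mm_item)
    return compute, cache


# Token budget comfortably larger than the biggest item: the default wins.
roomy = toy_encoder_budgets(max_num_batched_tokens=8192, max_tokens_per_mm_item=1024)

# Biggest item larger than the token budget: the floor takes over.
floored = toy_encoder_budgets(max_num_batched_tokens=2048, max_tokens_per_mm_item=16384)
```

roomy is (8192, 8192) and floored is (16384, 16384). In the second case the encoder budget is eight times the decoder token budget, which is exactly the situation the floor exists for: without it the largest item could never be scheduled.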

The competition happens inside the running and waiting passes of schedule(). Each step starts the encoder budget fresh:

encoder_compute_budget = self.max_num_encoder_input_tokens

Source: vllm/v1/core/sched/scheduler.py

and for every request with media, before committing decoder tokens, the scheduler calls _try_schedule_encoder_inputs. That function is where the two budgets collide. Its contract, from vllm/v1/core/sched/scheduler.py, is that an encoder input is scheduled only if its embeddings overlap the token range about to be computed this step, it is not already cached, there is encoder compute budget left, and the encoder cache has room. The decisive moment is the rollback when an encoder input does not fit:

if not self.encoder_cache_manager.can_allocate(
    request, i, encoder_compute_budget, num_embeds_to_schedule
):
    if num_computed_tokens + shift_computed_tokens < start_pos:
        # We only schedule the decoder tokens just before the
        # encoder input.
        num_new_tokens = start_pos - (
            num_computed_tokens + shift_computed_tokens
        )
    else:
        num_new_tokens = 0
    break

Read that carefully, because it is the whole thesis in code. When the encoder budget or cache is exhausted, the scheduler does not skip the image and barrel ahead. It shrinks the decoder’s num_new_tokens down to the boundary just before the image, then breaks. The reason it must stop there rather than skip past the image is causal: the decoder cannot prefill the placeholder tokens until their embeddings exist, and the embeddings will not exist until the encoder runs, which it cannot do without budget. So the decoder is forced to halt at the wall. The two budgets are thereby coupled: running out of encoder capacity directly throttles how many decoder tokens this request gets this step. A request can stall mid-prompt, having prefilled the text up to the picture, waiting for a future step where encoder room opens up. This is the same chunked-prefill machinery from Chapter 6, now answering to a second resource constraint, and can_allocate itself will trigger eviction from freeable to try to make room before giving up.
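Stripped of the surrounding loop, the rollback is a small pure function. The sketch below assumes the image already overlaps the range this step wants to compute, and it collapses the can_allocate call into a boolean; clamp_before_image and encoder_fits are hypothetical names, while the other parameters follow the excerpt.

```python
def clamp_before_image(num_computed_tokens: int,
                       num_new_tokens: int,
                       start_pos: int,
                       encoder_fits: bool,
                       shift_computed_tokens: int = 0) -> int:
    """Decoder tokens granted this step for a request that is approaching an image."""
    if encoder_fits:
        return num_new_tokens                  # encoder scheduled; no clamp needed
    already_done = num_computed_tokens + shift_computed_tokens
    if already_done < start_pos:
        return start_pos - already_done        # only the text before the image
    return 0                                   # already at the wall; wait


# Image placeholders start at token 100; the step would like 512 tokens.
fits = clamp_before_image(0, 512, start_pos=100, encoder_fits=True)
clamped = clamp_before_image(0, 512, start_pos=100, encoder_fits=False)
stalled = clamp_before_image(100, 512, start_pos=100, encoder_fits=False)
```

fits is 512, clamped is 100, and stalled is 0. The last case is the mid-prompt stall: the request has prefilled everything up to the picture and makes no progress until encoder room opens up.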

The diagram below traces this decision for one image inside one scheduling step. The path that matters is the “no” branch out of can_allocate: instead of failing the request or jumping over the image, it clamps the decoder’s token grant to the boundary just before the placeholder run and breaks out of the loop, leaving the rest for a later step.

flowchart TD
    START["scheduling step: request with media"] --> OVL{"image embeddings overlap the token range to compute now?"}
    OVL -->|no| SKIP["leave it; nothing to schedule for this image yet"]
    OVL -->|yes| HIT{"media hash already in cache?"}
    HIT -->|yes| JOIN["join reference set; encoder is NOT run"]
    HIT -->|no| ALLOC{"can_allocate: encoder budget AND cache room? (evicts from freeable to try)"}
    ALLOC -->|yes| DEBIT["debit budget per embedding; queue encoder run + cache slot"]
    ALLOC -->|no| CLAMP["shrink decoder num_new_tokens to boundary before image, then break"]
    CLAMP --> STALL["request stalls mid-prompt; retries next step"]

When an encoder input does fit, the budget is debited per embedding and the item is queued for both compute and cache allocation:

num_embeds_to_schedule += num_encoder_embeds
encoder_compute_budget -= num_encoder_embeds
mm_hashes_to_schedule.add(item_identifier)
encoder_inputs_to_schedule.append(i)

Source: vllm/v1/core/sched/scheduler.py

The allocation is then committed back in the main loop, where self.encoder_cache_manager.allocate(request, i) reserves the slots and encoder_compute_budget = new_encoder_compute_budget carries the debit forward to the next request in the same step. Note mm_hashes_to_schedule: it dedupes within a single step, so a prompt that repeats the same image twice pays the encoder once.
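The debit-and-dedupe bookkeeping can be sketched on its own. This is a simplified model, not the scheduler's loop: debit_encoder_budget is a hypothetical name, items are reduced to (identifier, num_embeds) pairs, the cache-room check is omitted, and stopping at the first item that does not fit is an assumption that mirrors the break in the rollback path.

```python
def debit_encoder_budget(items: list[tuple[str, int]],
                         budget: int) -> tuple[list[int], int]:
    """Return (indices scheduled for encoding, budget left) for one request."""
    scheduled: list[int] = []
    hashes_this_step: set[str] = set()
    for i, (identifier, num_embeds) in enumerate(items):
        if identifier in hashes_this_step:
            continue                     # same media already queued this step
        if num_embeds > budget:
            break                        # out of budget: stop at this item
        budget -= num_embeds
        hashes_this_step.add(identifier)
        scheduled.append(i)
    return scheduled, budget


prompt_items = [("cat", 600), ("cat", 600), ("dog", 500)]

tight = debit_encoder_budget(prompt_items, budget=1000)
roomy = debit_encoder_budget(prompt_items, budget=1200)
```

tight is ([0], 400): the repeated cat costs nothing extra, but the dog does not fit in what remains. roomy is ([0, 2], 100): both distinct images are encoded, and the duplicate is paid for once.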

There is also a release path. The encoder output is needed only until it has been consumed into the decoder’s KV cache. Once the prefill has moved past the image, _free_encoder_inputs lets the reference go:

elif start_pos + num_tokens <= request.num_computed_tokens:
    # The encoder output is already processed and stored
    # in the decoder's KV cache.
    self.encoder_cache_manager.free_encoder_input(request, input_id)

Source: vllm/v1/core/sched/scheduler.py

After this point the embeddings live on only in freeable, available for cross-request reuse but no longer pinned. The encoder cache is therefore busiest during prefill and quiet during decode, the opposite of the KV cache’s lifetime, which is part of why it gets a separate budget rather than being folded into the token budget.

The state diagram below pulls the whole life of one encoder-cache entry together, because the three structures introduced piecemeal above (cached, freeable, freed) are really three phases of one lifecycle. An entry is pinned while any request references it. It moves to freeable, an LRU queue, when the last reference is released, where it stays resident for cross-request reuse. A future request asking for the same media rescues it straight back to pinned. Only genuine memory pressure evicts it, oldest-released first, into freed, which is the list the scheduler ships to the worker so the model runner drops the embeddings from its store.

stateDiagram-v2
    [*] --> Pinned: allocate, encoder runs or cache hit
    Pinned --> Pinned: another request references same hash
    Pinned --> Freeable: last reference freed, KV has consumed it
    Freeable --> Pinned: new request asks for same media, rescue
    Freeable --> Freed: memory pressure, evict oldest LRU
    Freed --> [*]: worker drops embeddings

Variable resolution, chunked encoders, and pruning

The encoder budget would be a minor accounting trick if every image cost the same, but costs vary widely. Qwen2-VL (arXiv:2409.12191) makes resolution dynamic: a high-resolution image or a long video can expand into thousands of embeddings, while a thumbnail is cheap. The encoder budget exists precisely because the largest item can dwarf a single step’s capacity, which is why compute_mm_encoder_budget floors the budget at max_tokens_per_mm_item and why disable_chunked_mm_input exists as an escape hatch. When chunking is allowed, the scheduler can split a giant encoder input across steps; when it is disabled, the rollback in _try_schedule_encoder_inputs refuses to partially schedule a media item and instead defers the whole thing to a step that can hold it:

if (
    self.scheduler_config.disable_chunked_mm_input
    and num_computed_tokens < start_pos
    and (num_computed_tokens + num_new_tokens)
    < (start_pos + num_encoder_tokens)
):
    num_new_tokens = max(
        0, start_pos - (num_computed_tokens + shift_computed_tokens)
    )
    break

Source: vllm/v1/core/sched/scheduler.py

The reason you would ever disable chunking is in the comment one screen up in the source: encoders typically use bidirectional attention, so the whole item often wants to be processed together. That bidirectionality is what makes the encoder different from the causal decoder, and it is why the encoder budget is denominated in whole embeddings rather than a streaming token count.
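The all-or-nothing rule can be isolated as a function of four integers. The sketch below mirrors only the quoted condition; tokens_without_mm_chunking is a hypothetical name, and the early return for a step that never reaches the image stands in for the overlap check the real scheduler performs elsewhere.

```python
def tokens_without_mm_chunking(num_computed_tokens: int,
                               num_new_tokens: int,
                               start_pos: int,
                               num_encoder_tokens: int,
                               shift_computed_tokens: int = 0) -> int:
    """Decoder tokens granted when a media item may not be split across steps."""
    step_end = num_computed_tokens + num_new_tokens
    if start_pos >= step_end:
        return num_new_tokens            # step never reaches the image
    would_split = (
        num_computed_tokens < start_pos
        and step_end < start_pos + num_encoder_tokens
    )
    if would_split:
        # Stop at the image boundary and defer the whole item.
        return max(0, start_pos - (num_computed_tokens + shift_computed_tokens))
    return num_new_tokens


# Image occupies tokens 200..455 (256 placeholder tokens).
deferred = tokens_without_mm_chunking(0, 300, start_pos=200, num_encoder_tokens=256)
whole = tokens_without_mm_chunking(0, 512, start_pos=200, num_encoder_tokens=256)
untouched = tokens_without_mm_chunking(0, 100, start_pos=200, num_encoder_tokens=256)
```

deferred is 200: a 300-token step would cut the image at token 300, so the grant shrinks to the text before it. whole is 512 because the step covers the entire item, and untouched is 100 because the step ends before the image begins.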

Video pushes the problem further, and vLLM answers with pruning rather than pure budgeting. vllm/multimodal/evs.py (labeled “EVS” in the source) implements similarity-based token dropping: it measures the cosine similarity between adjacent video frame embeddings, keeps the first frame whole, and discards the most redundant tokens from the rest. Let $q$ be the pruning ratio, so a fraction $1 - q$ of tokens survive. With $t$ tokens per frame over $f$ frames, the retention count is the simple part, a floored fraction:

$$\text{kept} = \max\left(t,\ \lfloor t \cdot f \cdot (1 - q) \rfloor\right)$$

total_tokens = tokens_per_frame * num_frames
evs_num_tokens = int(total_tokens * (1 - q))
min_num_tokens = tokens_per_frame
return max(min_num_tokens, evs_num_tokens)

This is a different lever on the same constraint: instead of finding budget for every embedding a long video would produce, drop the embeddings that carry no new information before they ever reach the cache. It directly shrinks both the encoder cache footprint and the number of placeholder tokens the decoder must process. The curve below plots that formula for a 16-frame clip at 256 tokens per frame: as the pruning ratio $q$ climbs, the retained token count (and with it the encoder-cache footprint) falls off linearly until it hits the one-frame floor of $t$ tokens, below which it cannot drop no matter how aggressive the pruning.

[Figure: retained token count versus pruning ratio $q$ for a 16-frame clip at 256 tokens per frame.] Illustrative: shape follows the chapter’s $\max(t,\ \lfloor t \cdot f \cdot (1-q) \rfloor)$ formula for chosen $t$ and $f$; real per-frame token counts and useful pruning ratios depend on the model and the video’s redundancy.
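The retention formula is small enough to evaluate by hand for the 16-frame, 256-tokens-per-frame clip. The function below restates the four quoted lines as a standalone helper; the name evs_retained_tokens and the sampled values of $q$ are choices made for this sketch.

```python
def evs_retained_tokens(tokens_per_frame: int, num_frames: int, q: float) -> int:
    """Tokens kept after pruning a fraction q, never fewer than one full frame."""
    total_tokens = tokens_per_frame * num_frames
    evs_num_tokens = int(total_tokens * (1 - q))
    min_num_tokens = tokens_per_frame
    return max(min_num_tokens, evs_num_tokens)


t, f = 256, 16   # 4096 tokens before pruning
retained = {q: evs_retained_tokens(t, f, q) for q in (0.0, 0.25, 0.5, 0.75, 0.9375, 0.99)}
```

retained maps 0.0 to 4096, 0.25 to 3072, 0.5 to 2048, 0.75 to 1024, and both 0.9375 and 0.99 to 256. The one-frame floor is reached once $t \cdot f \cdot (1 - q) \le t$, that is, at $q \ge 1 - 1/f$, which for $f = 16$ is $q \ge 0.9375$; pruning harder than that buys nothing further.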

What is still rough

The honest assessment is that this subsystem is younger than the KV path. Three open edges stand out.

First, the budgets are not configurable. Both max_num_encoder_input_tokens and encoder_cache_size are derived from max_num_batched_tokens and carry TODO comments asking to expose them. An operator who knows their traffic is image-heavy cannot yet hand more memory to the encoder cache without inflating the decoder token budget too, which couples two things that should be tunable apart. This is exactly the kind of contention Chapter 21 will teach you to read off the metrics.

Second, the encoder cache is per-replica and ephemeral. There is an ec_connector path in the scheduler (the has_cache_item / external_load_encoder_input branches) that mirrors the KV-connector abstraction from Chapter 16, hinting that encoder outputs could be offloaded or shared across replicas the way KV blocks are. But cross-replica encoder-output sharing is far less mature than cross-replica prefix caching, and a fleet-level router (Chapter 18) has no clean signal yet for “which replica already encoded this image.”

Third, the two-cache consistency rests entirely on that one extra_keys line. The media hash has to be a faithful function of the bytes; the EXIF-ImageID shortcut in the hasher, which trusts a UUID embedded in the image rather than rehashing pixels, is a small reminder that content addressing is only as sound as the content you choose to address. Get the hash wrong and you either lose reuse or, worse, serve one user’s image embeddings keyed under another’s prompt.

The mechanism is sound and the through-line is clean: a second model means a second budget and a second cache, both arbitrated inside the same step loop, both keyed by a media hash that also keeps the decoder’s prefix cache honest. With the encoder cache in place, every input to the language model, text or pixels, is now a sequence of embeddings the engine knows how to schedule and reuse. The next chapter drops below the scheduler entirely, into the attention kernels that have to read all of those embeddings, paged and ragged, at the bandwidth roofline.
