Deployment, model loading, and autoscaling the fleet

Every chapter so far has assumed the replica is already running. The KV-cache manager from Chapter 4, the token-budget scheduler from Chapter 5, the cache-aware router from Chapter 18 — all of them presuppose a warm process with weights resident in GPU memory, ready to schedule the next step. That assumption is the one this chapter pulls out from under you.

Because the thing your autoscaler does when traffic spikes is create a cold replica, and a cold replica is useless until its weights are on the GPU. A “cold” replica here means a freshly started process that has no model weights in GPU memory yet — it cannot run a single forward pass, because the matrices it needs to multiply are not there. The interval between “scale-up decision” and “first token served” is not network setup or container pull (those you can hide); it is dominated by reading tens or hundreds of gigabytes of weights off storage and landing them in HBM. HBM is High Bandwidth Memory, the GPU’s own on-package RAM where weights must live for the matrix multiplies to read them at full speed. A 70B model in bf16 is $70 \times 10^9 \times 2\ \text{bytes} = 140\ \text{GB}$. Even at a healthy 5 GB/s from local NVMe, that is $140 / 5 = 28$ seconds, nearly half a minute of pure I/O before the scheduler can admit a single request. So the cold-start cost and the autoscaling policy are not two separate concerns. They are the same latency story told from two ends: how fast can a loader fill HBM, and how do you arrange your fleet so you rarely have to pay full price for it.

The diagram below traces the path those bytes take and names the three places they can get stuck — the same three places the rest of this chapter attacks one by one.

flowchart LR
    A["scale-up decision"] --> B["start cold process"]
    B --> C["storage: NVMe / NFS / object store"]
    C -->|"read bytes"| D["CPU RAM (staging)"]
    D -->|"copy H2D"| E["GPU HBM (resident weights)"]
    E --> F["scheduler can admit first request"]
    C -.->|"lever 1: prefetch sequentially"| D
    C -.->|"lever 2: stream straight to GPU"| E
    C -.->|"lever 3: contiguous blob format"| E

This is your home turf: autoscaling on a queue-depth signal is bread and butter for a traffic engineer. What is unfamiliar is the scale-up latency, which in stateless RPC land is milliseconds and here is seconds to minutes, and which is set almost entirely by code in vllm/model_executor/model_loader/.

The loader is a dispatch, not an algorithm

vLLM does not have one way to load weights; it has a family of loaders, and which one runs is selected by a single config field rather than by any runtime logic. A loader, concretely, is a class that knows how to take a checkpoint on storage and populate the model’s tensors with it. The menu lives in vllm/config/load.py:

# vllm/config/load.py
@config
class LoadConfig:
    """Configuration for loading the model weights."""

    load_format: str | LoadFormats = "auto"

The docstring beneath it enumerates the formats (auto, safetensors, runai_streamer, tensorizer, bitsandbytes, sharded_state, and more), and each maps to a BaseModelLoader subclass, the common interface every loader implements. The point worth internalizing is that the default ("auto") gives you the slowest reasonable path, and every other value is someone trading generality for cold-start latency. The default loader is correct everywhere; the fast loaders are correct somewhere specific. So the engineering question this chapter keeps returning to is: what do you know about your storage and your checkpoint that lets you pick a faster loader without breaking correctness?
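
To make "a dispatch, not an algorithm" concrete, here is a minimal sketch of a format-keyed registry. Every name in it (ToyLoader, ToyLoadConfig, LOADERS, get_loader) is a hypothetical stand-in invented for illustration and is not vLLM's actual registry; the only thing taken from the source is the idea that one string field selects the loader class.

```python
from dataclasses import dataclass


class ToyLoader:
    """Hypothetical stand-in for a BaseModelLoader-style interface."""

    name = "base"

    def describe(self) -> str:
        return f"{self.name} loader"


class ToyDefaultLoader(ToyLoader):
    name = "default"


class ToyShardedLoader(ToyLoader):
    name = "sharded_state"


# Hypothetical table: one config string selects one loader class.
LOADERS: dict[str, type[ToyLoader]] = {
    "auto": ToyDefaultLoader,
    "safetensors": ToyDefaultLoader,
    "sharded_state": ToyShardedLoader,
}


@dataclass
class ToyLoadConfig:
    load_format: str = "auto"


def get_loader(config: ToyLoadConfig) -> ToyLoader:
    """Pick a loader purely from the config field; nothing is probed at runtime."""
    try:
        return LOADERS[config.load_format]()
    except KeyError:
        raise ValueError(f"unknown load_format: {config.load_format!r}") from None
```

Note that nothing in get_loader inspects the storage or the checkpoint. That knowledge has to come from whoever sets the field, which is why the choice of loader is an operator decision rather than something the engine can infer.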

The default path is in vllm/model_executor/model_loader/default_loader.py. Its job is unglamorous: figure out which files on disk (or on the Hub — HuggingFace’s model repository) hold the weights, then stream them tensor by tensor into the model. “Tensor by tensor” matters: the model is not one giant array but thousands of named weight matrices (one per layer per projection), and the loader walks them one at a time. The file-resolution logic is a cascade of format guesses — it does not know in advance whether the checkpoint is .safetensors or the older .bin pickle format, so it picks a glob pattern based on load_format:

# vllm/model_executor/model_loader/default_loader.py
if load_format == "hf":
    allow_patterns = ["*.safetensors", "*.bin"]
elif (
    load_format == "safetensors"
    or load_format == "fastsafetensors"
    or load_format == "instanttensor"
):
    use_safetensors = True
    allow_patterns = ["*.safetensors"]

Once it knows the files, load_weights simply hands an iterator to the model and times the result:

# vllm/model_executor/model_loader/default_loader.py
loaded_weights = model.load_weights(self.get_all_weights(model_config, model))

self.counter_after_loading_weights = time.perf_counter()
logger.info_once(
    "Loading weights took %.2f seconds",
    self.counter_after_loading_weights - self.counter_before_loading_weights,
)

Source: vllm/model_executor/model_loader/default_loader.py

That time.perf_counter() delta is the cold-start cost, logged on every boot. (time.perf_counter() is a high-resolution wall-clock timer; subtracting the reading taken before loading from the one taken after gives the elapsed seconds.) It is the number your autoscaler is implicitly fighting, and the number every alternate loader exists to shrink.

Where the seconds go, and three ways to claw them back

To see why there are exactly three levers, it helps to remember the path from the first diagram: bytes travel storage → CPU RAM → GPU HBM. Each lever attacks a different segment of that path. The first attacks the storage read pattern, the second removes the CPU hop, and the third removes the parsing work by changing the on-disk format. The diagram below puts the three side by side against the default path so you can see what each one skips.

flowchart TD
    subgraph Default["default safetensors (slowest, works everywhere)"]
        d1["storage: mmap, lazy random reads"] --> d2["CPU RAM"] --> d3["GPU HBM"]
    end
    subgraph L1["lever 1: prefetch strategy"]
        a1["storage: large sequential reads, page cache warmed"] --> a2["CPU RAM"] --> a3["GPU HBM"]
    end
    subgraph L2["lever 2: Run:ai streamer"]
        b1["object store: S3 / GCS / Azure"] -->|"concurrent streams, no CPU bounce"| b3["GPU HBM"]
    end
    subgraph L3["lever 3: tensorizer format"]
        c1["pre-serialized contiguous blob"] -->|"one read, no parse"| c3["GPU HBM"]
    end

The default safetensors path memory-maps each file and copies tensors out lazily. “Memory-mapping” (mmap) means the file is presented to the program as if it were already an array in memory; the operating system fetches the actual bytes from storage only when a tensor is touched. “Lazily” means a tensor’s bytes are read at the moment the model first asks for that tensor, not up front. On local SSD that is fine — a page fault hits fast local flash. On a network filesystem it is a disaster, because each page fault becomes a round trip over the network, so mmap turns into a storm of small random reads, each paying network latency. vLLM’s first lever is brute-force prefetching: warm the OS page cache (the kernel’s in-RAM buffer of recently read file pages) with large sequential reads before the loader touches the file, so that when mmap later faults, the bytes are already local. The work is sharded across ranks so a tensor-parallel group — the set of GPUs that split one model’s matrices among themselves, from Chapter 15 — does not all hammer the same bytes. Each rank takes a strided slice of the file list:

# vllm/model_executor/model_loader/weight_utils.py
if torch.distributed.is_initialized():
    rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()
else:
    rank = 0
    world_size = 1
paths_to_prefetch = sorted_files[rank::world_size]

Source: vllm/model_executor/model_loader/weight_utils.py

The safetensors_load_strategy field documented in load.py (lazy, eager, prefetch) is exactly this lever exposed: on NFS or Lustre (network and parallel-cluster filesystems) you tell the loader to read everything up front instead of paging it in randomly. Note what this does not change: the bytes still travel storage → CPU RAM → GPU, the same three hops as the default path. Only the access pattern changes (sequential instead of random), and for network storage that alone can be the difference between a two-minute load and a twenty-second one, because one large sequential read amortizes the network latency that thousands of small random reads each pay in full.
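
The prefetch idea can be sketched in a few lines. This is an illustration under stated assumptions, not vLLM's implementation: shard_for_rank reproduces the strided slice from the excerpt above, while warm_page_cache, make_dummy_shards, and the 16 MiB chunk size are hypothetical choices made for the demo.

```python
import tempfile
from pathlib import Path


def shard_for_rank(sorted_files, rank, world_size):
    """Strided slice: rank r takes files r, r + world_size, r + 2 * world_size, ..."""
    return sorted_files[rank::world_size]


def warm_page_cache(paths, chunk_bytes=16 * 1024 * 1024):
    """Read each file front to back in large chunks and discard the data.

    The goal is the side effect: the kernel may keep the pages it just read,
    so a later mmap fault can be served from RAM. Returns total bytes read.
    """
    total = 0
    for path in paths:
        with open(path, "rb", buffering=0) as f:
            while chunk := f.read(chunk_bytes):
                total += len(chunk)
    return total


def make_dummy_shards(directory, count, size_bytes):
    """Demo helper (hypothetical): write fake shard files, return sorted paths."""
    paths = []
    for i in range(count):
        p = Path(directory) / f"model-{i:05d}.safetensors"
        p.write_bytes(b"\0" * size_bytes)
        paths.append(str(p))
    return sorted(paths)


# Two ranks split four files; rank 1 warms only its own half.
with tempfile.TemporaryDirectory() as tmp:
    files = make_dummy_shards(tmp, count=4, size_bytes=1024)
    rank1_files = shard_for_rank(files, rank=1, world_size=2)
    rank1_bytes = warm_page_cache(rank1_files)
```

The strided slice keeps the ranks from duplicating each other's reads: together they touch every file exactly once, and each rank issues only large sequential reads.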

The second lever is to skip the CPU bounce entirely. In the default path every byte lands in CPU RAM first and is then copied across the PCIe bus to the GPU — two trips through memory. The Run:ai Model Streamer loader (runai_streamer_loader.py) instead streams safetensors straight from object storage — S3, GCS, Azure Blob, the bucket APIs cloud replicas actually pull from — into GPU memory, overlapping many transfers at once. “Concurrency” here is how many object-store reads are in flight simultaneously, and a “memory limit” caps how much it buffers while doing so; both are tunable through environment variables the loader sets:

# vllm/model_executor/model_loader/runai_streamer_loader.py
if isinstance(concurrency := extra_config.get("concurrency"), int):
    os.environ["RUNAI_STREAMER_CONCURRENCY"] = str(concurrency)
if isinstance(memory_limit := extra_config.get("memory_limit"), int):
    os.environ["RUNAI_STREAMER_MEMORY_LIMIT"] = str(memory_limit)

This matters for autoscaling specifically because cloud replicas pull from object storage, not a local disk image. A streaming loader with high concurrency turns “download the whole checkpoint, then load it” into one overlapped pipeline, which is the difference that lets you bake a thin container and keep weights in a bucket.
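
One detail of the excerpt above is easy to miss: a tuning value is forwarded only if it is an int. The sketch below isolates that logic in a pure function so the behavior can be checked without touching the process environment. The function name streamer_env is hypothetical; the two environment variable names and the isinstance checks are the ones shown in the excerpt.

```python
def streamer_env(extra_config: dict) -> dict[str, str]:
    """Return the env vars the excerpt's logic would set for a given config.

    A value is forwarded only when it is an int, and it is stringified
    because environment variables are always strings.
    """
    env: dict[str, str] = {}
    if isinstance(concurrency := extra_config.get("concurrency"), int):
        env["RUNAI_STREAMER_CONCURRENCY"] = str(concurrency)
    if isinstance(memory_limit := extra_config.get("memory_limit"), int):
        env["RUNAI_STREAMER_MEMORY_LIMIT"] = str(memory_limit)
    return env
```

Under this logic a value such as "16" (a string) or 16.0 (a float) is silently ignored rather than rejected, so a config templated as text can leave the streamer on its default settings without any error to tell you so.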

The third lever is to change the format so loading is a single contiguous blob read rather than a parse. The default safetensors path spends real time figuring out where each tensor lives in the file and reshaping it into the model’s layout — that is the “parse” work. CoreWeave’s tensorizer (tensorizer.py) sidesteps it by serializing a vLLM model once into a layout that matches what the device wants, so deserialization is little more than copying a contiguous blob onto the GPU:

# vllm/model_executor/model_loader/tensorizer.py
deserializer.load_into_module(model)
end = time.perf_counter()

total_bytes_str = convert_bytes(deserializer.total_tensor_bytes)
duration = end - start
per_second = convert_bytes(deserializer.total_tensor_bytes / duration)

It even reports throughput, because throughput is the whole pitch. But there is an honest caveat baked into the source: tensorizer is only fast when the checkpoint was serialized from vLLM. Hand it a vanilla HuggingFace repo and it warns you it will fall back to a CPU load:

# vllm/model_executor/model_loader/tensorizer.py
logger.warning(
    "Deserializing HuggingFace models is not optimized for "
    "loading on vLLM, as tensorizer is forced to load to CPU. "
    "Consider deserializing a vLLM model instead for faster "
    "load times. ..."
)

Source: vllm/model_executor/model_loader/tensorizer.py

So the fast loaders cost you a pre-processing step (serialize once, deserialize fast forever) — a classic cache-build tradeoff. You pay setup cost so that every subsequent cold start is cheap, which is exactly the bargain you want when an autoscaler will create that replica hundreds of times over its life.

Distributed loading: don’t read what you don’t own

The single biggest structural win, and the place this chapter leans hardest on Chapter 15, is to not load the whole model on every GPU. Recall from Chapter 15 that under tensor parallelism each GPU (each “rank”) holds only a slice of every weight matrix — it never needs the other ranks’ slices to do its own share of the math. Yet the default loader makes every rank read the entire checkpoint and then throw away the parts it does not own, which means an 8-GPU group reads the same 140 GB eight times over. A sharded-state checkpoint fixes this by pre-splitting the weights into per-rank files, so rank k opens only the file holding rank k’s slice. The diagram below contrasts the two.

flowchart TD
    subgraph Naive["default: every rank reads everything"]
        F0["full checkpoint (140 GB)"] --> R0["rank 0 reads all, keeps 1/8"]
        F0 --> R1["rank 1 reads all, keeps 1/8"]
        F0 --> R2["rank ... reads all, keeps 1/8"]
        F0 --> R3["rank 7 reads all, keeps 1/8"]
    end
    subgraph Sharded["sharded_state: each rank reads only its slice"]
        S0["shard 0"] --> T0["rank 0"]
        S1["shard 1"] --> T1["rank 1"]
        S2["shard ..."] --> T2["rank ..."]
        S3["shard 7"] --> T3["rank 7"]
    end

A tensor-parallel rank only needs its shard; a sharded-state checkpoint lets each rank read just its slice:

# vllm/model_executor/model_loader/sharded_state_loader.py
class ShardedStateLoader(BaseModelLoader):
    """
    Model loader that directly loads each worker's model state dict, which
    enables a fast load path for large tensor-parallel models where each worker
    only needs to read its own shard rather than the entire checkpoint.
    """

Source: vllm/model_executor/model_loader/sharded_state_loader.py

For a TP-8 deployment (tensor parallelism across 8 GPUs) this cuts per-rank bytes read by roughly 8x, from 140 GB to about 17.5 GB, and cuts the aggregate read from 8 × 140 GB to a single 140 GB. Each rank reads its own independent file, so the ranks no longer compete for the same bytes, and wall-clock load time falls by roughly the same factor whenever storage reads dominate the load. Expert parallelism (EP, also from Chapter 15) extends the same idea to mixture-of-experts (MoE) weights, where the model has many “expert” sub-networks and each rank is responsible for only a subset of them. The default loader, when EP weight filtering is enabled, computes which experts a rank owns and refuses to read the rest from disk:
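
The bytes-read arithmetic behind the sharded-state win is worth writing down once. This sketch assumes weights split evenly across ranks, which ignores small replicated tensors; the function name bytes_read_gb is hypothetical.

```python
def bytes_read_gb(checkpoint_gb: float, tp_size: int, sharded: bool) -> tuple[float, float]:
    """Return (per_rank_gb, aggregate_gb) pulled from storage.

    Assumes an even split across ranks; replicated tensors are ignored.
    """
    per_rank = checkpoint_gb / tp_size if sharded else checkpoint_gb
    return per_rank, per_rank * tp_size


default_per_rank, default_total = bytes_read_gb(140, 8, sharded=False)
sharded_per_rank, sharded_total = bytes_read_gb(140, 8, sharded=True)
```

For the 140 GB checkpoint at TP-8, the default path has each rank read 140 GB for an aggregate of 1,120 GB, while the sharded path has each rank read 17.5 GB for an aggregate of 140 GB.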

# vllm/model_executor/model_loader/default_loader.py
self.local_expert_ids = compute_local_expert_ids(
    num_experts,
    ep_size,
    ep_rank,
    placement=parallel_config.expert_placement_strategy,
)

The comment a few lines up is precise about the one case where this is unsafe — EPLB (Expert-Parallel Load Balancing, Chapter 15) dynamically reshuffles which physical GPU slot serves which logical expert to even out load, and it can place a redundant copy of an expert on a rank that the static ownership math says does not own it. If the loader had filtered that expert out, the rank would later be asked to run an expert whose weights it never read. So when EPLB is on, the filter is disabled and every rank loads every expert. Fast loading and dynamic expert rebalancing are in tension, and vLLM resolves it by choosing correctness over the load-time saving. That is the kind of coupling you only see when you read the code.
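
The ownership math and the EPLB override fit in a short sketch. The contiguous-block placement below is an assumption chosen for illustration; vLLM's compute_local_expert_ids takes a placement strategy and may assign experts differently. Both function names here are hypothetical.

```python
def local_expert_ids(num_experts: int, ep_size: int, ep_rank: int) -> set[int]:
    """Contiguous-block ownership (assumed placement for this sketch).

    When num_experts is not divisible by ep_size, the earliest ranks
    each take one extra expert.
    """
    base, extra = divmod(num_experts, ep_size)
    start = ep_rank * base + min(ep_rank, extra)
    count = base + (1 if ep_rank < extra else 0)
    return set(range(start, start + count))


def should_read_expert(expert_id: int, owned: set[int], eplb_enabled: bool) -> bool:
    """Skip experts this rank does not own, unless EPLB may move them here later."""
    if eplb_enabled:
        return True  # static ownership is not trustworthy, so load everything
    return expert_id in owned
```

With 8 experts over 4 ranks, rank 1 owns experts 2 and 3 and skips the other six. Turning on eplb_enabled makes should_read_expert return True for all eight, which is the correctness-over-speed choice described above.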

One more loader deserves a mention because it blurs the line with Chapter 12: the bitsandbytes path can quantize on the way in. The relevant comment in weight_utils.py, which sits just above the inflight-bitsandbytes branch, makes the mechanism explicit — online quantization ignores the checkpoint’s quant config and instead converts full-precision weights as they are read:

# vllm/model_executor/model_loader/weight_utils.py
# Online quantization doesn't read from checkpoint configs - it quantizes
# fp16/bf16 weights on the fly during loading.

Quantizing during load shrinks the bytes that land in HBM but adds CPU/GPU work to the load itself, so it trades a smaller resident footprint for a slower cold start. Whether that is a win depends entirely on whether you are memory-constrained or scale-up-latency-constrained — and now you can see those are different axes.

The process model you are scaling

Before autoscaling makes sense you need to know what a “replica” actually is, because it is not one process. vLLM splits the frontend (the FastAPI/OpenAI HTTP server that accepts requests) from the EngineCore (the GPU-bound process that runs the scheduler and the model worker), and they talk over ZMQ — ZeroMQ, a lightweight message-passing library — the split first seen in Chapter 11. Why split them? So the Python HTTP layer’s overhead never blocks the GPU loop, and so the two halves can have independent lifecycles. The diagram below shows the structure a fleet operator is actually scaling, from the single engine up to a data-parallel group behind one readiness bit.

flowchart TD
    LB["orchestrator / load balancer"] -->|"routes only when ready"| SUP["DP supervisor"]
    SUP -->|"spawn + poll /health"| C0["child 0: frontend + EngineCore (GPU 0)"]
    SUP -->|"spawn + poll /health"| C1["child 1: frontend + EngineCore (GPU 1)"]
    SUP -->|"spawn + poll /health"| C2["child N: frontend + EngineCore (GPU N)"]
    C0 -->|"ZMQ"| E0["scheduler + worker"]
    C1 -->|"ZMQ"| E1["scheduler + worker"]
    C2 -->|"ZMQ"| E2["scheduler + worker"]
    SUP -->|"all_healthy gate"| READY["group ready bit"]

The frontend’s lifecycle is in vllm/entrypoints/launcher.py, and the load-bearing detail for a fleet operator is graceful drain on shutdown:

# vllm/entrypoints/launcher.py
timeout = engine_client.vllm_config.shutdown_timeout
mode = "abort" if timeout == 0 else "drain"

Read that as: a shutdown_timeout of zero means “abort immediately,” and any positive value means “drain” — keep serving until either the in-flight requests finish or the timeout expires. Scale-down has its own correctness hazard: if you kill a replica that is mid-decode, you drop in-flight requests, and a decode that was forty tokens into a fifty-token answer is simply lost. A drain timeout lets running sequences finish before the process exits, which is the autoscaler-facing half of the lifecycle that people forget until they see a latency cliff during every scale-in event.
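
The drain-versus-abort behavior can be sketched as a bounded wait. This is not the launcher's code: in_flight is a hypothetical callable that reports how many requests are still running, and the clock and sleep parameters exist only so the loop can be exercised without real waiting. The shutdown_mode function mirrors the one-line rule from the excerpt.

```python
import time


def shutdown_mode(shutdown_timeout: float) -> str:
    """Zero means abort immediately; any other value means drain."""
    return "abort" if shutdown_timeout == 0 else "drain"


def drain(in_flight, shutdown_timeout, poll_interval=0.05,
          clock=time.monotonic, sleep=time.sleep) -> bool:
    """Wait for running requests to finish, for at most shutdown_timeout seconds.

    Returns True if everything finished, False if the deadline passed with
    work still running (those requests would be dropped).
    """
    deadline = clock() + shutdown_timeout
    while in_flight() > 0:
        if clock() >= deadline:
            return False
        sleep(poll_interval)
    return True
```

With a timeout of zero the deadline is already in the past on the first check, so any in-flight work is abandoned at once, which matches the "abort" reading. An orchestrator's own termination grace period would need to be at least as long as the drain timeout for the drain to be allowed to complete.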

Above a single engine, data-parallel deployments use a supervisor to fan out one process per local GPU. (Data parallelism here means running independent full copies of the model, one per GPU, each serving its own requests — as opposed to the tensor parallelism above, where one model is split across GPUs.) vllm/entrypoints/openai/dp_supervisor.py spawns the children and, crucially, owns a readiness probe — a check that answers “is this process ready to receive traffic yet?”:

# vllm/entrypoints/openai/dp_supervisor.py
for local_rank in range(self.args.data_parallel_size_local):
    child_args = _build_vllm_dp_server_args(self.args, local_rank)
    child_env = _build_vllm_dp_server_env(self.args, local_rank)
    process = context.Process(
        target=_run_vllm_dp_server,
        name=f"APIServer_DPRank_{child_args.data_parallel_rank}",
        args=(child_args, child_env),
    )
    process.start()

The supervisor then polls each child’s /health and only declares the group ready once all children pass:

# vllm/entrypoints/openai/dp_supervisor.py
all_healthy = all(r is True for r in results)

if all_healthy:
    ...
    self._is_ready = True

That all_healthy gate is precisely the signal a Kubernetes readiness probe should be wired to. (A readiness probe is the endpoint Kubernetes polls to decide whether to send a pod traffic; until it passes, the pod stays out of the load-balancer rotation.) The crucial design choice is the all: the group reports ready only when every child is, because routing a request to a child that is still loading would either fail or queue behind the cold start. The orchestration platform (K8s, llm-d, and friends) does not need to understand weight loading; it just needs to not route traffic until the slowest child has finished its cold start. The supervisor turns an N-GPU group into a single ready/not-ready bit, and the reason that bit stays false for tens of seconds is model load time. This is the precise point where the two halves of the chapter meet: the loader determines how long the bit stays false, and the autoscaler reacts to the bit.
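
The gate itself is small enough to sketch in full. The probes here are hypothetical zero-argument callables standing in for an HTTP GET against each child's /health; treating a raised exception as "not ready" is an assumption of this sketch, while the final all(...) expression is the one from the excerpt.

```python
from typing import Callable, Iterable


def group_ready(probes: Iterable[Callable[[], bool]]) -> bool:
    """True only if every child's health probe returns exactly True.

    A probe that raises (for example, the child is not listening yet)
    is counted as not ready rather than propagating the error.
    """
    results = []
    for probe in probes:
        try:
            results.append(probe())
        except Exception:
            results.append(False)
    return all(r is True for r in results)
```

Because the check is a conjunction, the group's time to ready is the maximum of the children's load times, not the average. One child on a slow disk holds the whole group out of rotation.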

Autoscaling on signals the engine already emits

Here is where your existing instincts mostly transfer, with one substitution. You already know how to scale on a saturation signal. The only question is which signal, and vLLM hands you two from Chapter 2’s metric catalog. The first is queue depth:

# vllm/v1/metrics/loggers.py
gauge_scheduler_waiting = self._gauge_cls(
    name="vllm:num_requests_waiting",
    documentation="Number of requests waiting to be processed.",
    multiprocess_mode="mostrecent",
    labelnames=labelnames,
)

The second is how full the KV cache is:

# vllm/v1/metrics/loggers.py
gauge_kv_cache_usage = self._gauge_cls(
    name="vllm:kv_cache_usage_perc",
    ...
)

Why two? Because, as Chapter 2 insisted, “fast” is meaningless until you name the metric. num_requests_waiting rising means admission is backing up — requests are arriving faster than the scheduler can start them, so you are out of scheduling capacity. kv_cache_usage_perc near 1.0 means the paged KV cache from Chapter 4 is nearly full — you are out of memory — and from Chapter 6 you know what comes next when it stays there: preemption, the engine shedding load by evicting a running sequence’s cached state and recomputing it later when room frees up. A spike in vllm:num_preemptions is therefore not a warning that pain is coming; it is the system telling you it is already in pain. So a good policy watches KV usage as a leading indicator and queue depth as a confirming one: KV pressure predicts the preemption cliff before you go over it, and queue depth measures the SLO miss after you already have. Scaling on CPU utilization — the reflex from stateless services — tells you almost nothing here, because the work happens on the GPU, and a GPU pegged at 100% util can be either perfectly healthy (saturated and serving) or thrashing (preempting and recomputing). Utilization cannot distinguish the two; the KV and queue gauges can.
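
A policy along these lines might look like the sketch below. The thresholds (0.85 KV usage, 8 waiting requests) and the three decision labels are hypothetical values picked for illustration, not recommendations; the fields correspond to the gauges named above, with preemptions expressed as the increase since the previous scrape.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    kv_cache_usage: float   # vllm:kv_cache_usage_perc, 0.0 to 1.0
    requests_waiting: int   # vllm:num_requests_waiting
    preemptions_delta: int  # increase in vllm:num_preemptions since last scrape


def scale_decision(s: Sample, kv_high: float = 0.85, queue_high: int = 8) -> str:
    """Classify one scrape. Thresholds are illustrative assumptions."""
    kv_pressure = s.kv_cache_usage >= kv_high
    queue_pressure = s.requests_waiting >= queue_high
    if s.preemptions_delta > 0 or (kv_pressure and queue_pressure):
        return "scale_up_urgent"  # already in pain, or both signals agree
    if kv_pressure or queue_pressure:
        return "scale_up"         # one signal fired: act before the cliff
    return "hold"
```

GPU utilization does not appear anywhere in the decision. A production controller would also want smoothing over several scrapes and a cooldown between actions, both omitted here for brevity.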

There is a subtlety that ties back to Chapter 18’s routing, and it is the crux of the whole chapter. Scaling is a feedback control loop: a signal is scraped, a controller decides, a replica boots, and capacity arrives — but every stage adds delay. The scrape is periodic, not instantaneous, and the signal is laggy relative to a per-step scheduler that makes decisions thousands of times a second. By the time num_requests_waiting has climbed high enough to trip a scale-up, and a new replica has paid its cold-start cost, the burst may already be over. The sequence diagram below lays the timeline out so the mismatch is visible: the new capacity arrives after the spike it was meant to absorb.

sequenceDiagram
    participant Traffic
    participant Metrics as metrics scrape
    participant Auto as autoscaler
    participant New as new replica
    Traffic->>Metrics: burst begins, queue climbs
    Note over Metrics: scrape interval delay
    Metrics->>Auto: num_requests_waiting high
    Auto->>New: scale-up, start cold replica
    Note over New: cold start, load weights (tens of s)
    Traffic->>Traffic: burst ends on its own
    New->>Traffic: ready, but too late

This is the fundamental autoscaling-for-LLMs problem the diagram makes concrete: your reaction time (tens of seconds of weight loading) is long compared to the burst duration. You cannot out-react a spike when reacting takes thirty seconds. The curves below put numbers on that mismatch: an incoming burst that arrives and recedes inside about a minute, against the capacity a reactive autoscaler actually delivers — flat through the scrape interval, then through the cold-start load, so the extra replica only comes online after the burst has already drained. The shaded gap between demand and capacity is exactly the unserved load that spills into queue depth and preemptions.

Figure (illustrative): a synthetic burst against a capacity step that lags by a scrape interval plus a ~30 s cold start; shapes and timing are representative, not measured. The second replica (capacity jumps from 300 to 600 req/s near t = 55 s) lands after demand has already fallen back below 150 req/s.
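
The timing mismatch reduces to adding up the delays in the loop. In the sketch below the 15 s scrape interval, 10 s controller delay, and 45 s burst duration are assumptions chosen so the total lands near the t = 55 s step described above; only the roughly 30 s cold start comes from the text.

```python
def capacity_arrival_s(scrape_interval_s: float, decision_delay_s: float,
                       cold_start_s: float) -> float:
    """Worst-case seconds from burst onset until the new replica is ready."""
    return scrape_interval_s + decision_delay_s + cold_start_s


def arrives_in_time(burst_duration_s: float, **lags: float) -> bool:
    """True if the new capacity is ready before the burst has ended."""
    return capacity_arrival_s(**lags) < burst_duration_s


# Assumed delays: 15 s scrape, 10 s controller, 30 s cold start.
arrival = capacity_arrival_s(scrape_interval_s=15, decision_delay_s=10, cold_start_s=30)
helped = arrives_in_time(45, scrape_interval_s=15, decision_delay_s=10, cold_start_s=30)
```

Under these assumptions the replica is ready 55 s after the burst begins and so misses a 45 s burst entirely. Shrinking the cold start is the only term a loader can influence; the scrape and controller delays belong to the monitoring stack.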

The honest answers are all forms of not starting cold — that is, removing the cold-start latency from the critical path rather than trying to win a race against it. Keep a pool of warm-but-idle replicas sized to your burst variance, so a spike is absorbed by capacity that already exists (you pay for idle GPUs to buy latency). Keep weights in the page cache or an object store close to the GPU so the loader’s prefetch path is short. Scale from a snapshot rather than from scratch. And place models so that statistical multiplexing absorbs bursts without any scaling action at all — which is the contribution of AlpaServe (arXiv:2302.11665). Its insight, paraphrased: when you serve many models with bursty, uncorrelated demand, deliberately spreading each model across GPUs via model parallelism lets a burst for one model borrow idle capacity from others, so the right placement raises the load you can serve within an SLO without adding hardware. Read it for the framing that placement and multiplexing are a substitute for fast reaction, not a complement to it — exactly the lever you reach for when, as here, you cannot react fast enough.

What remains unsolved

None of this makes cold start free. The fast loaders shrink the constant; they do not change the fact that scale-up latency is bounded below by weight bytes over storage bandwidth,

$$t_{\text{load}} \geq \frac{\text{weight bytes}}{\text{storage bandwidth}},$$

and that bound grows with model size while autoscaler reaction-time budgets do not. The curves below plot that bound for bf16 checkpoints (2 bytes per parameter) across the three storage regimes this chapter discussed: fast local NVMe, a streamed object store, and an mmap-over-network filesystem doing small random reads. The 70B/140 GB/5 GB/s point from the opening paragraph sits where the NVMe curve crosses 28 seconds, and the same model on a network filesystem is already minutes away from serving its first token.

Figure (illustrative): load times computed from the bytes/bandwidth bound using representative storage bandwidths; real loads add parsing and copy overhead, so these are lower bounds, not measurements.
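
The bound is simple enough to compute directly. In this sketch the 5 GB/s NVMe figure is the one used in the chapter's opening example, while the object-store and network-filesystem bandwidths are assumed values for illustration only; GB means 10^9 bytes throughout.

```python
def load_time_lower_bound_s(params_billion: float, bytes_per_param: float,
                            bandwidth_gb_s: float) -> float:
    """Lower bound on load time: weight bytes divided by storage bandwidth."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return weight_gb / bandwidth_gb_s


ASSUMED_BANDWIDTH_GB_S = {
    "local NVMe": 5.0,                # from the chapter's opening example
    "streamed object store": 2.0,     # assumed for illustration
    "network FS, random reads": 0.5,  # assumed for illustration
}

# 70B parameters in bf16 (2 bytes each) across the three regimes.
bounds = {
    name: load_time_lower_bound_s(70, 2, bw)
    for name, bw in ASSUMED_BANDWIDTH_GB_S.items()
}
```

With these inputs the bound is 28 s on NVMe, 70 s from the object store, and 280 s on the network filesystem. Halving bytes_per_param halves every entry, which is the load-time side of the quantization trade discussed earlier.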

Scale-from-snapshot (forking a warm process’s GPU state) is promising and still rough at the edges. Right-sizing a warm pool is an open forecasting problem dressed up as a config value. And the laggy-signal problem means the most robust deployments today are the ones that scale least dynamically: generous warm pools and good placement, with reactive autoscaling as a backstop rather than the primary mechanism.

That closes the loop on the fleet. We can load a model fast, stand up a replica, route to it (Chapter 18), serve many adapters on it (Chapter 19), and grow or shrink the set on a signal. The one thing left is to see all of it well enough to know which knob to turn when a number moves the wrong way — which is the subject of the final chapter.
