Observability and performance tuning: closing the loop
Everything in this book is a mechanism with a knob. The token-budget scheduler from Chapter 5 admits work against max_num_batched_tokens; preemption from Chapter 6 sheds load when the KV pool runs dry; prefix caching from Chapter 7 turns repeated context into a TTFT discount; the encoder budget from Chapter 8 competes for the same step; CUDA graphs from Chapter 10 amortize launch overhead but only for batch shapes that were captured; grammar masks from Chapter 14 add CPU work off the critical path, mostly. Every one of those mechanisms is invisible at the API boundary. A request goes in, tokens come out, and a latency number gets recorded. When that number regresses, you have a dozen suspects and one symptom.
This final chapter is about the bridge between the symptom and the suspect. You already know how to run a service: you watch percentiles, you alert on saturation, you correlate. What is different here is the vocabulary. An LLM engine emits signals that mean nothing unless you know the internals, and it hides the most useful ones inside a separate process. The skill is reading those signals well enough to say not “p99 TTFT is up” but “p99 TTFT is up because the prefix hit rate fell, because a tenant changed their system prompt, and the fix is a larger KV pool, not more replicas.” Localize to a mechanism, then turn the right knob.
That sentence describes a loop, and the loop is the spine of this chapter. You start at a symptom (a percentile moved), narrow it to one mechanism using the engine’s metrics, turn the single knob that controls that mechanism, then watch the same metrics to confirm the mechanism moved the way you predicted and that you did not shove the regression somewhere else. The diagram below traces that cycle, and the rest of the chapter fills in each box: what the metrics are, how to read them, and which knob each one points at.
flowchart LR
S["symptom: a percentile regressed"] --> L["localize: read engine metrics, find the one mechanism responsible"]
L --> K["turn the single knob for that mechanism"]
K --> V["verify: watch the same metrics, did the mechanism move?"]
V -->|"yes, and no new regression"| DONE["done, for now"]
V -->|"no, or it broke something else"| L
The series catalog, and why the shape of each metric matters
vLLM’s Prometheus surface is defined in one place, vllm/v1/metrics/loggers.py, in the constructor of PrometheusStatLogger. It is worth knowing the taxonomy before the names, because the taxonomy tells you what each metric can and cannot answer. Prometheus, the metrics system vLLM exports to, defines several metric types, and three of them carry everything in this chapter; each answers a different shape of question. A gauge is a single number that can go up or down, sampled at the moment of scrape; it answers “what is true right now.” A counter only ever increases, so its value is meaningless in isolation, but its slope over a window (computed by the rate(...) function) tells you how fast events are happening. A histogram is a set of counters, one per predefined bucket of values, that together let you reconstruct a distribution and ask for a percentile after the fact. Match the question to the kind and the metric reads itself; mismatch them and you will compute nonsense, like a percentile over a gauge or a “current value” of a counter. The diagram below sorts vLLM’s most diagnostic series into these three kinds, with the question each one answers.
flowchart TD
M["vLLM metric series in loggers.py"] --> G["gauge: value now"]
M --> C["counter: read as a rate"]
M --> H["histogram: distribution, query percentiles"]
G --> G1["num_requests_running / waiting: saturation"]
G --> G2["kv_cache_usage_perc: how full is the KV pool"]
C --> C1["num_preemptions: is the engine shedding load"]
C --> C2["prefix_cache_hits / queries: reuse rate"]
H --> H1["time_to_first_token_seconds: TTFT SLO"]
H --> H2["inter_token_latency_seconds: ITL SLO"]
H --> H3["queue / prefill / decode time: phase breakdown"]
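To make the counter case concrete, here is a minimal Python sketch of what a rate over a counter computes. It is an illustration under stated assumptions: the function name and the sample values are made up, and the real Prometheus rate() also extrapolates to the edges of the window, which this sketch leaves out.

```python
from typing import List, Tuple

# One scraped sample: (unix_timestamp_seconds, counter_value).
Sample = Tuple[float, float]


def counter_rate(samples: List[Sample]) -> float:
    """Per-second increase across a window of counter samples.

    A drop in value is treated as a process restart (counter reset),
    so the post-reset value is counted as fresh increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes of a preemption counter, 5 s apart.
preemptions = [(0.0, 100.0), (5.0, 100.0), (10.0, 104.0), (15.0, 110.0)]
print(counter_rate(preemptions))  # 10 events over 15 s, about 0.67 per second
```

The absolute value 110 says nothing on its own; the slope does. A gauge, by contrast, would be read directly from the most recent sample.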
Gauges are instantaneous. They answer “what is happening right now”:
# vllm/v1/metrics/loggers.py
gauge_scheduler_running = self._gauge_cls(
    name="vllm:num_requests_running",
    documentation="Number of requests in model execution batches.",
    multiprocess_mode="mostrecent",
    labelnames=labelnames,
)
Source: vllm/v1/metrics/loggers.py
vllm:num_requests_running and vllm:num_requests_waiting are your saturation signals, the LLM analogue of an in-flight count and a queue depth: how many requests are in the engine’s execution batches, prefilling or decoding, versus how many are stuck behind them. vllm:kv_cache_usage_perc is the one with no analogue in a stateless service. It is the fraction of the paged KV pool from Chapter 4 that is currently allocated to live requests, and it, not CPU and not request count, is the resource that actually caps how many sequences can run at once. The reason is the autoregressive shape of the workload: every request holds KV blocks for as long as it is generating, and a long generation holds them for a long time, so concurrency is bounded by how much KV memory you have, not by how much compute. When this gauge sits near 1.0 the pool is nearly full, every block is spoken for, and the next running request that needs a fresh block to extend its sequence will find none free, which forces the scheduler to evict another running request to make room. That eviction is preemption, and it is the bridge from this gauge to the counter in the next section.
Counters are monotonic; you read them as rates. The preemption counter is the single most diagnostic line in the whole file:
# vllm/v1/metrics/loggers.py
counter_num_preempted_reqs = self._counter_cls(
    name="vllm:num_preemptions",
    documentation="Cumulative number of preemption from the engine.",
    labelnames=labelnames,
)
Source: vllm/v1/metrics/loggers.py
A nonzero rate(vllm:num_preemptions[1m]) means the engine is doing the load-shedding from Chapter 6: freeing a running request’s KV, resetting its computed-token count, and prepending it to the waiting queue to be recomputed later. (The Python Prometheus client appends _total to counter names on export, so the series you actually scrape is vllm:num_preemptions_total; the same holds for every counter in this chapter.) That recompute is pure waste, and it shows up as ITL spikes for the victims. Preemption is not a bug; it is backpressure working as designed. But a sustained preemption rate means you have admitted more concurrency than the cache can hold, and no amount of routing cleverness upstream will fix it.
The prefix-cache counters come as a query/hit pair, deliberately counted in tokens, not requests:
# vllm/v1/metrics/loggers.py
counter_prefix_cache_queries = self._counter_cls(
    name="vllm:prefix_cache_queries",
    documentation=(
        "Prefix cache queries, in terms of number of queried tokens."
    ),
    labelnames=labelnames,
)
Source: vllm/v1/metrics/loggers.py
The hit rate is rate(vllm:prefix_cache_hits[5m]) / rate(vllm:prefix_cache_queries[5m]), with both counters read over the same window. Token-weighting matters: one request that reuses a 4000-token system prompt is worth far more than a hundred that share nothing, and a request-weighted rate would hide exactly the cache behavior you care about. A drop in this ratio is the most common silent TTFT regression in production, because the cause is upstream of the engine entirely: a routing change that stopped sending similar prompts to the same replica, or a tenant who appended a timestamp to their system prompt and busted every hash from Chapter 7.
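The difference between the two weightings is easy to see with numbers. The sketch below uses invented per-request figures (one request reusing most of a 4000-token prompt, a hundred short requests sharing nothing); the function names are hypothetical and nothing here is vLLM code.

```python
from typing import List, Tuple

# (queried_tokens, hit_tokens) per request. The values are made up.
Request = Tuple[int, int]
requests: List[Request] = [(4000, 3900)] + [(50, 0)] * 100


def token_weighted_hit_rate(reqs: List[Request]) -> float:
    """Hit tokens over queried tokens, the shape of the counter pair."""
    queried = sum(q for q, _ in reqs)
    hits = sum(h for _, h in reqs)
    return hits / queried if queried else 0.0


def request_weighted_hit_rate(reqs: List[Request]) -> float:
    """Fraction of requests that reused anything at all."""
    if not reqs:
        return 0.0
    return sum(1 for _, h in reqs if h > 0) / len(reqs)


print(f"token-weighted:   {token_weighted_hit_rate(requests):.3f}")
print(f"request-weighted: {request_weighted_hit_rate(requests):.3f}")
```

With these inputs the token-weighted rate is 3900/9000, roughly 0.43, while the request-weighted rate is 1/101, roughly 0.01. The second number would suggest the cache is doing nothing, even though it skipped almost half of all prefill tokens.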
Histograms are where the SLOs from Chapter 2 actually live. A histogram fixes a set of bucket boundaries up front, and every observation increments the counter for the bucket it falls into; at query time Prometheus interpolates across those buckets to estimate any percentile you ask for. The catch is that resolution lives entirely in the boundaries: a percentile can only be as precise as the buckets near it are dense. That is why the boundaries are hand-tuned rather than evenly spaced. TTFT and inter-token latency each get their own histogram:
# vllm/v1/metrics/loggers.py
histogram_time_to_first_token = self._histogram_cls(
    name="vllm:time_to_first_token_seconds",
    documentation="Histogram of time to first token in seconds.",
    buckets=[
        0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5,
        0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0,
        640.0, 2560.0,
    ],
    labelnames=labelnames,
)
Source: vllm/v1/metrics/loggers.py
Those buckets are dense from a millisecond to a second and sparse after, because that is the range where a TTFT SLO is decided and where a percentile query needs resolution. Past a few seconds the request is already failing its SLO and you do not need fine resolution to know it; below a millisecond nothing interesting happens. Plotting each boundary against its position in the list, as the curve below does on a log y-axis, makes the hand-tuning literal: the first twelve boundaries cover a millisecond to one second, and the remaining ten stretch from 2.5 seconds all the way to 2560 seconds, with each of the last two steps a factor of four. Where the line is shallow the buckets are tight and a percentile is sharp; where it shoots up the buckets are coarse and a percentile there is only a rough bound.
The same constructor builds vllm:inter_token_latency_seconds for ITL.
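To see how bucket density limits percentile precision, the following Python sketch bins observations into the TTFT boundaries quoted above and estimates a quantile by linear interpolation inside the bucket, in the spirit of Prometheus’s histogram_quantile. It is a simplified illustration, not the Prometheus implementation; the helper names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Sequence

TTFT_BUCKETS = [
    0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5,
    0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0,
    640.0, 2560.0,
]


def bucket_counts(observations: Sequence[float],
                  bounds: Sequence[float]) -> List[int]:
    """Cumulative counts per upper bound, plus a final +Inf bucket."""
    per_bucket = [0] * (len(bounds) + 1)
    for x in observations:
        # Index of the first bound >= x ("less than or equal" buckets).
        per_bucket[bisect_left(bounds, x)] += 1
    cumulative, running = [], 0
    for count in per_bucket:
        running += count
        cumulative.append(running)
    return cumulative


def quantile(q: float, cumulative: Sequence[int],
             bounds: Sequence[float]) -> float:
    """Estimate the q-quantile by interpolating inside one bucket."""
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    idx = next(i for i, c in enumerate(cumulative) if c >= rank)
    if idx == len(bounds):      # landed in +Inf: report the last finite bound
        return bounds[-1]
    lower = bounds[idx - 1] if idx > 0 else 0.0
    below = cumulative[idx - 1] if idx > 0 else 0
    in_bucket = cumulative[idx] - below
    if in_bucket == 0:
        return lower
    return lower + (bounds[idx] - lower) * (rank - below) / in_bucket


fast = bucket_counts([0.3] * 100, TTFT_BUCKETS)   # true p99 is 0.3 s
slow = bucket_counts([30.0] * 100, TTFT_BUCKETS)  # true p99 is 30 s
print(quantile(0.99, fast, TTFT_BUCKETS))
print(quantile(0.99, slow, TTFT_BUCKETS))
```

Every observation at 0.3 s lands in the (0.25, 0.5] bucket, so the estimate can be off by at most a quarter of a second. Every observation at 30 s lands in (20, 40], where the same interpolation can be off by up to twenty seconds. The estimator is identical; only the bucket width differs.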
The histograms that turn a latency regression into a diagnosis, though, are the per-request phase histograms: vllm:request_queue_time_seconds, vllm:request_prefill_time_seconds, and vllm:request_decode_time_seconds. Each one measures how long a request spent in one stage of its life. The engine-core process stamps a timestamp on a handful of lifecycle events as a request moves through it: queued_ts when the request first lands in the waiting queue, scheduled_ts when the scheduler first admits it into a batch, first_token_ts when the model emits its first output token, and last_token_ts when it emits its final one. The three intervals are just differences between consecutive stamps, computed in vllm/v1/metrics/stats.py:
# vllm/v1/metrics/stats.py
# Queued interval is from first QUEUED event to first SCHEDULED
queued_time = req_stats.scheduled_ts - req_stats.queued_ts
# Prefill interval is from first SCHEDULED to first NEW_TOKEN
prefill_time = req_stats.first_token_ts - req_stats.scheduled_ts
# Decode interval is from first NEW_TOKEN to last NEW_TOKEN
decode_time = req_stats.last_token_ts - req_stats.first_token_ts
The diagram below traces a single request through those four timestamps and shows which interval each pair of stamps defines. Read it left to right as the request’s own timeline; the brackets underneath are the three histograms.
sequenceDiagram
participant Q as Waiting queue
participant S as Scheduler
participant M as Model on GPU
Q->>S: queued_ts, request arrives
Note over Q,S: queue_time = scheduled_ts - queued_ts
S->>M: scheduled_ts, admitted to batch
Note over S,M: prefill_time = first_token_ts - scheduled_ts
M->>M: first_token_ts, first output token
Note over M: decode_time = last_token_ts - first_token_ts
M->>M: last_token_ts, final output token
This decomposition is the first cut of any latency investigation. End-to-end latency is up, but where did the time go: queue time (the request sat waiting because the engine was saturated and could not schedule it), prefill time (a long prompt, or a cold prefix cache that forced the engine to compute KV it could have reused), or decode time (the batch grew large and per-token latency suffered)? The three histograms answer that directly, and the answer matters because each phase maps onto a different fix, which is exactly the knob map at the end of this chapter. One subtlety makes these intervals more honest than they look: by design the prefill and decode intervals absorb any preemption that happened during them, because preemption does not get its own timestamp; it just stretches the wall-clock gap between the surrounding events. So a request preempted mid-decode shows inflated decode time rather than a separate “preempted” bucket, which is exactly the cross-check you want: a decode-time histogram that suddenly grows a long tail should line up with a nonzero preemption rate.
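The decomposition can be sketched in a few lines of Python. The RequestTimestamps container and the helper names below are hypothetical stand-ins for illustration; only the four timestamp names and the three subtractions come from the source.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RequestTimestamps:
    """Hypothetical container for the four lifecycle stamps (seconds)."""
    queued_ts: float
    scheduled_ts: float
    first_token_ts: float
    last_token_ts: float


def phase_breakdown(ts: RequestTimestamps) -> Dict[str, float]:
    """Differences between consecutive stamps, one per phase histogram."""
    return {
        "queue": ts.scheduled_ts - ts.queued_ts,
        "prefill": ts.first_token_ts - ts.scheduled_ts,
        "decode": ts.last_token_ts - ts.first_token_ts,
    }


def dominant_phase(ts: RequestTimestamps) -> str:
    """Name of the phase that consumed the most wall-clock time."""
    phases = phase_breakdown(ts)
    return max(phases, key=phases.get)


# A made-up request: 2.5 s queued, 0.4 s prefill, 2.1 s decode.
req = RequestTimestamps(queued_ts=10.0, scheduled_ts=12.5,
                        first_token_ts=12.9, last_token_ts=15.0)
print(phase_breakdown(req))
print(dominant_phase(req))
```

For this request the queue phase dominates, which points at saturation rather than at the prompt or the batch size. In production you would ask the same question of the three histograms’ percentiles rather than of a single request.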
Tokens by source: the prefill that wasn’t
One subtle metric deserves its own paragraph because it closes the loop between Chapters 7, 16, and 17. The scheduler does not just count prompt tokens; it attributes each one to where its KV came from. Recall that before a model can decode, every prompt token needs a key/value entry in the cache, and there are exactly three ways that entry can come to exist: the engine computed it on this GPU just now, it found it already sitting in the local paged cache from a previous request (a prefix-cache hit, Chapter 7), or it pulled it across a KV connector from somewhere else, an offload tier or a remote prefill node (Chapters 16 and 17). Those three sources are mutually exclusive and exhaustive, which is why PromptTokenStats in stats.py states the bookkeeping as an invariant in its own docstring:
# vllm/v1/metrics/stats.py
# Invariants:
# computed + local_cache_hit + external_kv_transfer = total
# local_cache_hit + external_kv_transfer = cached_tokens
The first line says every prompt token is accounted for by exactly one of the three sources; the second line says the two non-computed sources together are what we call “cached.” Exposed as vllm:prompt_tokens_by_source, this lets you watch the prefix cache do its job in absolute terms rather than as a ratio: how many prompt tokens were actually computed (local_compute) versus served from the local paged cache (local_cache_hit) versus pulled over a KV connector from the offload tier or a remote prefill (external_kv_transfer). The hit-rate ratio from the previous section tells you the fraction reused; this metric tells you the absolute count, which is what you need when you are deciding whether a change paid for itself. When you tune gpu_memory_utilization to enlarge the cache, this is the metric that tells you whether the extra blocks bought you reuse (local_cache_hit rises) or just sat idle (it does not). It is also the honest denominator for cost: GPU-seconds are spent computing KV, so they are spent on local_compute tokens and almost nothing else, and a healthy serving deployment drives that number down over time without dragging accuracy or hit rate down with it.
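The two invariants are simple enough to check mechanically. The sketch below is a hypothetical helper, not vLLM’s PromptTokenStats: it takes the five quantities as plain integers, verifies both identities, and reports the fraction of prompt tokens that actually cost GPU compute.

```python
def check_prompt_token_invariants(computed: int,
                                  local_cache_hit: int,
                                  external_kv_transfer: int,
                                  total: int,
                                  cached_tokens: int) -> None:
    """Raise ValueError if either bookkeeping identity is violated."""
    if computed + local_cache_hit + external_kv_transfer != total:
        raise ValueError("sources do not sum to total prompt tokens")
    if local_cache_hit + external_kv_transfer != cached_tokens:
        raise ValueError("non-computed sources do not sum to cached_tokens")


def computed_fraction(computed: int, total: int) -> float:
    """Share of prompt tokens whose KV had to be computed on this GPU."""
    return computed / total if total else 0.0


# Made-up numbers for one scrape window.
check_prompt_token_invariants(computed=1200, local_cache_hit=2500,
                              external_kv_transfer=300,
                              total=4000, cached_tokens=2800)
print(computed_fraction(1200, 4000))  # 0.3
```

Here only 30 percent of prompt tokens were computed locally; the rest were reused. Tracking that fraction before and after a change to the cache size is one way to judge whether the change paid for itself.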
CUDA-graph fallback, hidden in plain sight
Chapter 10’s whole argument was that decode steps can replay as captured CUDA graphs, a recorded sequence of GPU kernel launches that the driver fires in one shot instead of the CPU dispatching each kernel individually, which erases per-step launch overhead. The catch is that a graph is captured for a specific batch size: a graph recorded for 32 sequences only knows how to run 32. So at every step the engine faces a batch of whatever size the scheduler produced and has to pick one of three outcomes. If the size exactly matches a captured graph, it replays it, the fast path. If it is smaller than a captured size, it can pad up, run the next-larger captured graph and waste compute on the padding rows. If it is larger than anything captured, it falls back to eager execution, dispatching kernels one at a time and paying exactly the launch overhead the graph was meant to erase. The two slow outcomes are real costs and easy to miss because no latency histogram names them, so vLLM records a per-step stat whose fields tell you exactly which case you hit:
# vllm/compilation/cuda_graph.py
@dataclasses.dataclass(frozen=True)
class CUDAGraphStat:
    num_unpadded_tokens: int
    num_padded_tokens: int
    num_paddings: int
    runtime_mode: str
Source: vllm/compilation/cuda_graph.py
Those four fields are exactly the three-way decision made visible. num_unpadded_tokens is the batch the scheduler actually produced; num_padded_tokens is what it was rounded up to in order to hit a captured size; num_paddings is the difference, the wasted rows; and runtime_mode records which path ran. The diagram below is that per-step decision, and the stat is just a tally of which branch each step took.
flowchart TD
B["decode step: batch of N sequences"] --> Q{"N matches a captured graph size?"}
Q -->|"yes"| R["replay graph (fast path, num_paddings = 0)"]
Q -->|"no, N smaller than a captured size"| P["pad up to next size, replay (num_paddings > 0)"]
Q -->|"no, N larger than any captured size"| E["eager fallback (runtime_mode = eager, pay launch overhead)"]
Aggregated by CUDAGraphLogging into a frequency table over many steps, this is how you catch a regression that no latency histogram explains cleanly. If runtime_mode is frequently the eager fallback, your live batch sizes are landing outside the captured set, and you are paying the launch overhead the graph was supposed to erase. If num_paddings is large, you captured coarse batch sizes and are wasting compute padding small batches up to the next one. Both point at cudagraph_mode and the capture-size list, not at the scheduler. This metric is gated behind observability_config.cudagraph_metrics precisely because computing it every step is itself overhead; you turn it on when you suspect graph trouble and leave it off otherwise.
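The three-way decision and its frequency table can be sketched as follows. This is an illustration only: the StepOutcome type, the "graph" and "eager" labels, and the capture-size list are assumptions made for the example, and they are not the values vLLM itself records in runtime_mode.

```python
from bisect import bisect_left
from collections import Counter
from typing import NamedTuple, Sequence


class StepOutcome(NamedTuple):
    unpadded: int
    padded: int
    num_paddings: int
    runtime_mode: str  # "graph" or "eager"; labels are illustrative


def classify_step(n: int, capture_sizes: Sequence[int]) -> StepOutcome:
    """Pick replay, pad-and-replay, or eager for a batch of size n.

    capture_sizes must be sorted in ascending order.
    """
    i = bisect_left(capture_sizes, n)   # first captured size >= n
    if i == len(capture_sizes):         # larger than anything captured
        return StepOutcome(n, n, 0, "eager")
    padded = capture_sizes[i]
    return StepOutcome(n, padded, padded - n, "graph")


capture_sizes = [1, 2, 4, 8, 16, 32]    # hypothetical capture list
batches = [1, 3, 8, 20, 32, 40]         # hypothetical live batch sizes

outcomes = [classify_step(n, capture_sizes) for n in batches]
tally = Counter(
    "eager" if o.runtime_mode == "eager"
    else ("padded" if o.num_paddings > 0 else "exact")
    for o in outcomes
)
wasted_rows = sum(o.num_paddings for o in outcomes)
print(tally, wasted_rows)
```

With these inputs three steps replay exactly, two pad (3 up to 4, 20 up to 32, for 13 wasted rows), and one falls back to eager. A table skewed toward "padded" argues for finer capture sizes; one skewed toward "eager" argues for a larger maximum capture size.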
MFU: are you even near the roofline?
The deepest metric in the tree is the one that connects back to Chapter 3’s roofline. MFU, model FLOPs utilization, is the fraction of your accelerator’s peak arithmetic throughput that the model is actually using, achieved FLOP/s over peak FLOP/s; the bandwidth-utilization figure is its memory-side twin, the fraction of peak HBM bandwidth in use. Neither number is measured directly from hardware counters. Instead vllm/v1/metrics/perf.py carries an analytic model of the transformer: given the exact batch the scheduler ran, it computes from first principles how many floating-point operations that forward pass requires and how many bytes it must read and write per GPU, broken down by component (attn, ffn, unembed). Those estimates are exported as counters, and you turn them into utilization with a rate query. The header on the Prometheus class spells out the intended form:
# vllm/v1/metrics/perf.py
# rate(vllm:estimated_flops_per_gpu_total[1m]) / 1e12
#
# Average memory bandwidth in GB/s can be calculated using:
# (rate(vllm:estimated_read_bytes_per_gpu_total[1m]) +
# rate(vllm:estimated_write_bytes_per_gpu_total[1m])) / 1e9
Divide the estimated FLOP/s by your accelerator’s peak FLOP/s and you get MFU; divide the byte rate by peak HBM bandwidth and you get bandwidth utilization, achieved bytes/s over peak bytes/s. Here the crucial part is that a low MFU is not automatically bad, and reading it correctly requires the prefill-versus-decode asymmetry from Chapter 3. Decode is memory-bound: each step reads the whole model and the KV cache to produce a single token per sequence, so there is very little arithmetic to do per byte moved, a low arithmetic intensity $I = \text{FLOPs} / \text{byte}$, and a healthy decode-heavy workload will sit near the bandwidth roofline while showing single-digit MFU. That low MFU is the expected, correct reading, not a problem to fix, because there are simply no FLOPs there to extract. The identical single-digit MFU in a prefill-heavy workload means the opposite. Prefill is compute-bound, so low MFU there says you are leaving arithmetic on the table, almost always because the batches are too small to keep the matrix units busy, and the fix is to admit more work per step. The roofline draws the line between the two regimes at the balance-point intensity $I_{\text{balance}}$, the FLOPs-per-byte at which the compute and bandwidth ceilings meet:
$$I_{\text{balance}} = \frac{\text{peak FLOP/s}}{\text{peak bytes/s}}$$
A workload with $I < I_{\text{balance}}$ is bandwidth-bound (decode), and one with $I > I_{\text{balance}}$ is compute-bound (prefill). Same number, opposite diagnosis, and the only way to tell them apart is to know which phase dominates. The roofline below draws both regimes on one log-log plot: the sloped left arm is the bandwidth ceiling (achievable FLOP/s rises with intensity because more arithmetic rides on each byte moved), the flat right arm is the compute ceiling (peak FLOP/s, you cannot go faster no matter the intensity), and the knee where they meet is $I_{\text{balance}}$. Decode lives far down the sloped arm at low intensity, hard against the bandwidth ceiling; prefill lives out on the flat arm where the only way up is bigger batches. A single-digit MFU on the left arm is the ceiling; the identical number on the right arm is wasted compute.
Illustrative: ceilings use round H100-class numbers (peak ~1.0 PFLOP/s, ~3.35 TB/s HBM, so $I_{\text{balance}}\approx 299$ FLOP/byte); the shape and the two-regime split are the point, not the exact coordinates.
The model is approximate (it assumes perfect MoE expert load balance, for one, which Chapter 15 showed is a fiction under skew, so it overstates utilization exactly when routing is lopsided), and it is gated behind enable_mfu_metrics for the cost of computing it. But it is the closest thing the engine gives you to an answer to “are we using the hardware we are paying for.”
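The arithmetic behind those readings is short enough to sketch in Python. The peak figures are the same round H100-class numbers used for the illustration above, and the achieved rates are invented; in practice they would come from the rate queries over the estimated FLOP and byte counters. All function names are hypothetical.

```python
PEAK_FLOPS = 1.0e15          # assumed peak FLOP/s (round number)
PEAK_BYTES_PER_S = 3.35e12   # assumed peak HBM bytes/s (round number)


def balance_intensity(peak_flops: float, peak_bw: float) -> float:
    """FLOPs per byte at which the two roofline ceilings meet."""
    return peak_flops / peak_bw


def utilization(achieved: float, peak: float) -> float:
    """Achieved rate over peak rate; used for both MFU and bandwidth."""
    return achieved / peak


def regime(flops_per_s: float, bytes_per_s: float,
           peak_flops: float, peak_bw: float) -> str:
    """Classify a workload by its achieved arithmetic intensity."""
    intensity = flops_per_s / bytes_per_s
    if intensity < balance_intensity(peak_flops, peak_bw):
        return "bandwidth-bound"
    return "compute-bound"


# Made-up decode-heavy reading: 50 TFLOP/s while moving 2.8 TB/s.
flops_rate, byte_rate = 50e12, 2.8e12
print(round(balance_intensity(PEAK_FLOPS, PEAK_BYTES_PER_S)))  # 299
print(utilization(flops_rate, PEAK_FLOPS))        # MFU
print(utilization(byte_rate, PEAK_BYTES_PER_S))   # bandwidth utilization
print(regime(flops_rate, byte_rate, PEAK_FLOPS, PEAK_BYTES_PER_S))
```

In this made-up reading MFU is 5 percent while bandwidth utilization is about 84 percent, and the achieved intensity of roughly 18 FLOP/byte sits far below the balance point. That is the healthy decode signature described above, not a compute problem.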
Scrape it, and mind which process you are profiling
The collection side is mundane and that is the point. The example in examples/observability/prometheus_grafana/ is a Prometheus job pointed at the server’s /metrics:
# examples/observability/prometheus_grafana/prometheus.yaml
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: vllm
    static_configs:
      - targets:
          - 'host.docker.internal:8000'
Source: examples/observability/prometheus_grafana/prometheus.yaml
Two non-obvious things bite people here. First, scrape interval interacts with your window math: a 5-second scrape and a rate(...[1m]) window will smear short preemption bursts together, averaging a sharp spike down into a low plateau, so for incident forensics you want either a tighter rate window or the raw counter samples. Second, and more important, recall the frontend/EngineCore process split from Chapter 11. vLLM does not run in one process. The HTTP server, request tokenization, and SSE response streaming live in the API-server (frontend) process; the scheduler, the model forward pass, and the GPU live in a separate EngineCore process, and the two talk over ZMQ, a fast inter-process message socket. That split is what makes the phase histograms so valuable for localization, because the boundary between “frontend cost” and “engine cost” runs straight through it. The diagram below shows the two processes, what each holds, and how a metric scrape versus a profiler attach sees them differently.
flowchart LR
CLIENT["client"] -->|"HTTP"| FE
subgraph FE["API-server process (frontend)"]
HTTP["HTTP server"]
TOK["tokenization"]
SSE["SSE streaming / detokenization"]
end
FE <-->|"ZMQ"| EC
subgraph EC["EngineCore process"]
SCHED["scheduler"]
MODEL["model forward pass"]
GPU["GPU / KV cache"]
end
PROM["Prometheus /metrics"] -.->|"multiprocess mode: aggregates both"| FE
PROM -.-> EC
PROF["torch profiler"] -.->|"attaches to ONE pid only"| EC
The /metrics endpoint aggregates both processes via Prometheus multiprocess mode, so the numbers you scrape already span the whole engine. A profiler, though, attaches to a single process and sees only that one, which is why localization has to come first. When TTFT is slow but the phase histograms say queue and prefill time are fine, the lost time is in the frontend, in tokenization, detokenization, or the egress path, and you profile the API-server process; when the phase histograms say prefill or decode is slow, the cost is in the engine, and you profile EngineCore.
Profiling the engine therefore means attaching to the right PID. vLLM exposes /start_profile and /stop_profile endpoints that the API server forwards over ZMQ to the EngineCore, which calls down into the worker:
# vllm/v1/engine/core.py
def profile(self, is_start: bool = True, profile_prefix: str | None = None):
    self.model_executor.profile(is_start, profile_prefix)
Source: vllm/v1/engine/core.py
That call wraps a Torch profiler around the worker process, enabled by setting --profiler-config.torch_profiler_dir (so that both the frontend’s CPU trace and the worker’s combined CPU-and-GPU trace land in one directory for side-by-side viewing). Where a metric is a single aggregated number, a trace is a timeline of every kernel and every gap between kernels on the GPU, and that resolution is what lets you see whether a slow decode step is bound by the attention kernel itself, by the launch gaps that CUDA graphs were supposed to close, or by the grammar bitmask copy from Chapter 14 quietly landing on the critical path. The division of labor is clean: metrics localize a regression to a process and a mechanism, and the trace then localizes it to a specific kernel.
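A capture session around a short workload might be driven from Python as in the sketch below. It rests on assumptions: that the server listens at a base URL such as http://localhost:8000, that the two endpoints accept an empty POST, and that profiling was enabled at startup. The helper names are hypothetical, and nothing is sent until capture_trace is called.

```python
import urllib.request
from typing import Callable


def profile_request(base_url: str, start: bool) -> urllib.request.Request:
    """Build (but do not send) the request that starts or stops profiling."""
    path = "/start_profile" if start else "/stop_profile"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=b"",          # empty body; POST is an assumption of this sketch
        method="POST",
    )


def capture_trace(base_url: str,
                  run_workload: Callable[[], None],
                  timeout: float = 30.0) -> None:
    """Start the profiler, run a short workload, and always stop it."""
    urllib.request.urlopen(profile_request(base_url, True),
                           timeout=timeout).close()
    try:
        run_workload()
    finally:
        # Stop even if the workload raises, so the trace is flushed.
        urllib.request.urlopen(profile_request(base_url, False),
                               timeout=timeout).close()


# Inspect the request without touching the network.
req = profile_request("http://localhost:8000/", start=True)
print(req.get_method(), req.full_url)
```

Keeping the profiled window short matters: a trace records every kernel, so a few dozen requests that reproduce the regression usually produce a more readable file than minutes of production traffic.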
The knob-to-mechanism map
Once you have localized, tuning is almost mechanical, because each symptom has a primary knob and most of the knobs trade along the throughput-versus-latency axis from Chapter 2. The diagram below is that mapping as a decision tree: start at the symptom, read the two or three metrics that distinguish its causes, and arrive at the one knob to turn. The prose after it walks each branch and names the tradeoff you are accepting.
flowchart TD
START["a percentile regressed; read the metrics"] --> W{"num_requests_waiting high?"}
W -->|"yes, but kv usage moderate and preemptions zero"| K1["scheduler-throughput-limited: raise max-num-batched-tokens"]
W -->|"yes, kv usage near 1.0 with steady preemptions"| K2["cache-limited: lower max-num-seqs or raise gpu-memory-utilization"]
START --> T{"TTFT p99 dragged up by rare long prompts, hurting everyone's ITL?"}
T -->|"yes"| K3["chunked-prefill interference: clamp long-prefill-token-threshold"]
START --> MM{"multimodal: good throughput but erratic TTFT?"}
MM -->|"yes"| K4["encoder budget contending: tune max-num-batched-tokens"]
START --> CG{"CUDA-graph stats show eager fallback or heavy padding?"}
CG -->|"yes"| K5["fix cudagraph_mode and capture sizes (not the scheduler)"]
If vllm:num_requests_waiting is high but kv_cache_usage_perc is moderate and preemptions are zero, you are scheduler-throughput-limited, not memory-limited: raise --max-num-batched-tokens to let more work into each step, accepting some ITL cost. If kv_cache_usage_perc rides near 1.0 with a steady preemption rate, you are cache-limited: lower --max-num-seqs to admit fewer concurrent sequences, or raise --gpu-memory-utilization to grow the pool if you have headroom. If TTFT p99 is dragged up by occasional very long prompts while ITL for everyone else suffers, that is the chunked-prefill interference from Chapter 6; clamp it with --long-prefill-token-threshold so a giant prefill is sliced thinner and shares each step more politely. If a multimodal deployment shows good token throughput but erratic TTFT, the encoder budget from Chapter 8 is contending with the token budget. In this version of vLLM that budget is not yet an independent flag, it is pinned to max_num_batched_tokens (max_num_encoder_input_tokens is marked not-currently-configurable in vllm/config/scheduler.py), so --max-num-batched-tokens is the lever you actually have, and you trade encoder headroom against decode ITL when you turn it. And if the CUDA-graph stats show frequent eager fallback or heavy padding, the fix is cudagraph_mode and the capture sizes, not anything in the scheduler.
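The branches that can be decided from numeric metrics alone can be written as a small triage function. Everything about this sketch is an assumption made for illustration: the Snapshot fields, the threshold defaults, and the function name are hypothetical, and the thresholds in particular would need calibrating against your own workload.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical summary of one observation window."""
    waiting: float            # vllm:num_requests_waiting
    kv_usage: float           # vllm:kv_cache_usage_perc, 0.0 to 1.0
    preemption_rate: float    # preemptions per second
    eager_fraction: float = 0.0     # share of steps that ran eagerly
    padding_fraction: float = 0.0   # share of rows that were padding


def suggest_knob(s: Snapshot,
                 waiting_high: float = 10.0,   # assumed threshold
                 kv_full: float = 0.95,        # assumed threshold
                 graph_bad: float = 0.2) -> str:
    """Map a metrics snapshot to the primary knob for its mechanism."""
    if s.waiting >= waiting_high:
        if s.kv_usage >= kv_full and s.preemption_rate > 0:
            return ("cache-limited: lower --max-num-seqs "
                    "or raise --gpu-memory-utilization")
        if s.preemption_rate == 0:
            return ("scheduler-throughput-limited: "
                    "raise --max-num-batched-tokens")
    if s.eager_fraction >= graph_bad or s.padding_fraction >= graph_bad:
        return "graph trouble: revisit cudagraph_mode and capture sizes"
    return "no single knob indicated: keep localizing"


print(suggest_knob(Snapshot(waiting=40, kv_usage=0.55, preemption_rate=0.0)))
print(suggest_knob(Snapshot(waiting=40, kv_usage=0.98, preemption_rate=0.3)))
```

The long-prompt and multimodal branches are left out on purpose, since recognizing them needs the shape of the TTFT and ITL distributions rather than a handful of scalars.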
The honesty clause: none of these knobs is independent. Raising max_num_batched_tokens helps prefill throughput and hurts decode ITL. Raising gpu_memory_utilization grows the cache but shrinks the activation headroom, and too far will OOM under a burst. Lowering max_num_seqs calms preemption but caps throughput. There is no setting that is good for every workload, which is why the loop is a loop: change one knob, watch the same metrics, confirm the mechanism moved the way you predicted, and check that you did not push the regression somewhere else.
What is still open
The biggest unsolved problem is that these signals are emitted per replica, and a fleet is many replicas behind a router. The router from Chapter 18 wants prefix-locality and queue-depth signals to make good decisions, and the block-storage events from Chapter 7 give it an approximate cache view, but turning per-replica metrics into a fleet-level control loop, autoscaling on num_requests_waiting and kv_cache_usage_perc together rather than CPU, as Chapter 20 began, remains more craft than science. Attribution across the connector boundary is genuinely hard: when external_kv_transfer tokens dominate, the latency is partly someone else’s prefill pool. And the analytic MFU model degrades exactly where serving gets interesting, under MoE skew, speculative decoding’s variable acceptance, and disaggregated phases that no single replica’s counters can see whole.
That is the real shape of the work. The engine will tell you almost everything, in a vocabulary this book has spent twenty chapters teaching you to read. The loop you close with it is never finished, only quieter.
Further reading
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — arXiv:2403.02310 — the goodput framing and chunked-prefill tradeoff that every knob in this chapter is tuning against.
- Orca: A Distributed Serving System for Transformer-Based Generative Models — OSDI ’22 — iteration-level batching, the source of the per-step scheduling signals you are reading.
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — arXiv:2401.09670 — why prefill and decode want different rooflines, and why one replica’s MFU cannot see the whole picture once they are split.