Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

What “serving” means for an autoregressive model

If you have spent your career moving bytes, you carry a model of “serving” in your bones. A request arrives, a handler does a bounded amount of work, a response goes out, and the connection is recycled. The unit of accounting is the call. Capacity is requests per second. The hard problems are the ones you already know how to name: head-of-line blocking, tail latency, fan-out amplification, backpressure when a downstream slows. Everything in that worldview rests on one quiet assumption, so deeply baked in that it rarely gets stated: a request’s cost is roughly knowable up front, and the time it occupies a worker is short and bounded.

An autoregressive language model breaks that assumption at the root. Not at the edges, not in some pathological tail, but for the median request on a quiet day. This chapter is about why, and about the shape the rest of the stack is forced into as a consequence. The thesis is simple to state and expensive to absorb: an LLM request is not a function call, it is a long-lived state machine, and the cost it incurs is variable, two-phase, and stateful. Once you take that seriously, almost every later technique in this book stops looking like a clever trick and starts looking like the only thing you could have done.

A token at a time, forever

Start from the computation itself. A transformer generating text does not produce its answer in one shot. It produces one token, appends that token to its own input, and runs the whole forward pass again to produce the next one. (A token is the model’s atomic unit of text, roughly a word-piece; the model’s vocabulary is a fixed list of them, and a forward pass is one full sweep of the input through all the model’s layers, ending in a probability over that vocabulary from which the next token is drawn.) The output is fed back as input, over and over, until the model emits a special end-of-sequence token or hits a configured length cap. That feedback loop is what “autoregressive” means: each output is conditioned on, and literally becomes part of, the input for the next prediction.

The diagram below traces that loop for a three-token answer. Notice that there is no single “run the model” box; there is a box per output token, and each one feeds the next.

flowchart LR
    P["prompt: 'The capital of France is'"] --> F1["forward pass 1"]
    F1 --> T1["emit ' Paris'"]
    T1 --> F2["forward pass 2 (prompt + ' Paris')"]
    F2 --> T2["emit '.'"]
    T2 --> F3["forward pass 3 (prompt + ' Paris.')"]
    F3 --> STOP["emit end-of-sequence: done"]

This loop has two consequences that a stateless-RPC mental model has no slot for.

The first is that the length of the work is not known when the request arrives. A chat completion might generate eight tokens or eight hundred. The client did not tell you, the model does not know yet, and the only way to find out is to run it and watch for the stop condition. Your scheduler is admitting a job whose duration is a random variable it cannot observe until the job is nearly done. Every queueing intuition you have that depends on knowing or estimating service time needs an asterisk.

The second is that the work is stateful across steps. To produce the next token, the model’s attention mechanism lets the current position look back at every previous position and weigh how relevant each is. Concretely, each earlier token contributed a key vector (what it offers) and a value vector (what it carries), and the new token’s query vector is compared against all of those keys to decide how much of each value to mix in. The crucial point for serving is that the keys and values of the earlier tokens do not change once computed. So recomputing them on every step would mean re-deriving the whole sequence’s keys and values at step 1, then again at step 2, and so on, work that grows with the square of the length, $O(n^2)$ in the sequence length $n$. The standard fix is the KV cache: the per-token key and value vectors computed in earlier steps are kept in GPU memory and reused, so each new step only computes the new token’s own key, value, and query and reads the rest straight out of the cache. The two curves below show what that buys: the total key/value work to generate a sequence is quadratic in its length when you recompute every step, and linear once the cache is in place. We dissect that cache in Chapter 4.

Illustrative: the shapes are exact ($n(n{+}1)/2$ versus $n$ in units of per-position key/value computations), but the absolute counts ignore per-token constant factors and weight reads. What matters here is the structural fact it forces: a request in flight owns GPU memory that grows with every token it generates, and that memory is the live, non-reconstructable state of the computation. Drop it and the request has to start over. This is the opposite of a stateless handler. It is closer to a long-lived TCP connection with a per-connection buffer that the kernel cannot page out for free.

Two phases, two completely different cost profiles

There is a further wrinkle that has no analogue at all in conventional serving: a single request runs in two phases that stress the hardware in opposite ways.

When the request first arrives, the model has to ingest the whole prompt. Every prompt token is processed, and the KV cache for the entire prompt is populated in essentially one big parallel pass. This is prefill. Because all the prompt tokens are available at once, the engine can stack them into one large matrix multiply and keep the GPU’s arithmetic units saturated; the bottleneck is how fast the chip can do math, so prefill is compute-bound (the arithmetic units are the limiting resource). Then the request flips into decode, generating output tokens one at a time. Here is the asymmetry. A decode step processes exactly one new token, so it does only one token’s worth of arithmetic, but to do even that it must stream the entire model’s weights and the request’s entire growing KV cache out of GPU memory and through the compute units. One token’s worth of math, the whole model’s worth of memory traffic. The bottleneck is no longer how fast the chip computes but how fast it can move bytes, so decode is memory-bandwidth-bound, and because a long answer is hundreds of such steps, decode is where almost all of a long request’s wall-clock time goes.

The diagram below contrasts the two phases at a glance. The key takeaway is that the thing that makes prefill efficient (many tokens in flight at once, so the expensive-to-fetch weights are amortized over a lot of arithmetic) is exactly the thing decode structurally lacks: one token per step means the weights get re-read for almost no math.

flowchart TD
    subgraph PREFILL["prefill: once, at request start"]
        A["all N prompt tokens at once"] --> B["one big parallel matrix multiply"]
        B --> C["KV cache for the whole prompt"]
        B --> D["bottleneck: arithmetic units (compute-bound)"]
    end
    subgraph DECODE["decode: repeated, one step per output token"]
        E["one new token"] --> F["read ALL weights + growing KV cache"]
        F --> G["emit one token, append to KV cache"]
        G --> E
        F --> H["bottleneck: memory bandwidth (memory-bound)"]
    end
    PREFILL --> DECODE

Chapter 3 turns this asymmetry into a roofline argument; for now, just hold the picture: one request, two regimes, and what each regime is starved of is the opposite of the other.

This is why “fast” is a meaningless adjective for an inference system, and why Chapter 2 has to split latency into at least two SLOs that trade against each other. Time to first token is dominated by prefill and queueing. The per-token latency of the stream that follows is a decode property. They are different metrics measuring different phases of the same request, and you can be excellent at one while being terrible at the other.

The request is a state machine, and vLLM says so out loud

Because the work is long-lived, variable, and stateful, the engine cannot treat a request as a stack frame that exists for the duration of one call. It has to model the request as an explicit object with an explicit lifecycle, persisted across hundreds of scheduler ticks. vLLM makes this concrete. Every request the engine knows about carries a status drawn from a small enum, in vllm/v1/request.py:

class RequestStatus(enum.IntEnum):
    """Status of a request."""

    WAITING = enum.auto()
    WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR = enum.auto()
    WAITING_FOR_REMOTE_KVS = enum.auto()
    WAITING_FOR_STREAMING_REQ = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Note: anything after PREEMPTED will be considered
    # as a finished status.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()
    FINISHED_IGNORED = enum.auto()
    FINISHED_ERROR = enum.auto()
    FINISHED_REPETITION = enum.auto()

Read this enum as the table of contents for Part II. WAITING is the admission queue: the request is known to the engine but holds no GPU memory yet. RUNNING is in the live batch, actively advancing a few tokens per step. The various WAITING_FOR_* states are stalls on resources that do not exist in a stateless world: a request can be parked because its grammar has not finished compiling (Chapter 14), or because its KV cache is being fetched from a remote node (Chapters 16 and 17). And PREEMPTED is the one that should make a traffic engineer sit up: it is the state of a request that was running and got evicted to free GPU memory for something else, then sent back to the waiting queue to be re-admitted later.

The lifecycle these states describe is genuinely a state machine, and the diagram below traces the paths a request can take through it. Two edges are the ones with no stateless analogue: the loop where RUNNING advances itself step after step without leaving the state, and the demotion from RUNNING back to WAITING through PREEMPTED.

stateDiagram-v2
    [*] --> WAITING: request arrives
    WAITING --> WAITING_FOR_RESOURCE: needs grammar or remote KV
    WAITING_FOR_RESOURCE --> WAITING: resource ready
    WAITING --> RUNNING: admitted, KV memory granted
    RUNNING --> RUNNING: step advances a few tokens
    RUNNING --> PREEMPTED: evicted to free GPU memory
    PREEMPTED --> WAITING: re-queued, KV thrown away
    RUNNING --> FINISHED: stop token / length cap / abort / error
    FINISHED --> [*]

The note in the source is precise about a subtlety that the integer ordering encodes: anything numerically past PREEMPTED counts as finished, so the engine can test status > RequestStatus.PREEMPTED rather than comparing against a set of terminal values. There is not one terminal state but six, and the distinction is load-bearing because each finish reason maps to a different thing the client is told:

_FINISHED_REASON_MAP = {
    RequestStatus.FINISHED_STOPPED: FinishReason.STOP,
    RequestStatus.FINISHED_LENGTH_CAPPED: FinishReason.LENGTH,
    RequestStatus.FINISHED_ABORTED: FinishReason.ABORT,
    RequestStatus.FINISHED_IGNORED: FinishReason.LENGTH,
    RequestStatus.FINISHED_ERROR: FinishReason.ERROR,
    ...
}

Source: vllm/v1/request.py

The Request object that carries this status is not a thin wrapper around the input. It accumulates the state of the computation as it runs. The fields that matter most are the bookkeeping the scheduler will lean on every tick:

        self.spec_token_ids: list[int] = []
        self.num_computed_tokens = 0
        self.cache_salt: str | None = cache_salt

Source: vllm/v1/request.py

num_computed_tokens is the heart of it. It is how far through its own token sequence the request has been processed so far, and it is what makes the two-phase distinction disappear from the scheduler’s point of view. A request mid-prefill and a request mid-decode are both just “a sequence whose num_computed_tokens is behind its total length,” and the scheduler’s only job is to help each one catch up under a budget. Chapter 5 builds the whole continuous-batching scheduler on exactly this framing. The reason it can is visible right here in the request’s own state: the engine never stopped tracking how much of each request it had already done.

Note also that num_computed_tokens is a plain mutable integer, and preemption simply resets it to zero. That is the entire recovery mechanism. There is no checkpoint, no partial-result salvage. When the engine reclaims a preempted request’s memory it throws away the computed KV and the request re-prefills from scratch on readmission. Preemption-as-load-shedding (Chapter 6) is brutal precisely because the state it sheds is expensive and non-reconstructable, which is the whole point: in a stateless system, shedding load is cheap, because there is no state. Here it costs you everything that request had computed.

The wire format admits it too

The state machine is not just an internal convenience; it leaks into the contracts between processes, because vLLM is structured as a frontend and an engine core in separate OS processes (the split is documented in docs/design/arch_overview.md and docs/design/multiprocessing.md, and Chapter 11 returns to it). The frontend ships an EngineCoreRequest in, and the engine streams EngineCoreOutputs back. Look at what an output actually is, in vllm/v1/engine/__init__.py:

class EngineCoreOutput(
    msgspec.Struct,
    ...
):
    request_id: str
    new_token_ids: list[int]

    new_logprobs: LogprobsLists | None = None
    ...
    finish_reason: FinishReason | None = None
    stop_reason: int | str | None = None
    events: list[EngineCoreEvent] | None = None

This is not a response. It is an incremental update keyed by request_id, carrying the new tokens since last time and a finish_reason that is None until the request actually ends:

    @property
    def finished(self) -> bool:
        return self.finish_reason is not None

Source: vllm/v1/engine/__init__.py

A single logical request produces a stream of these, one per engine step that advanced it, over its entire multi-hundred-step life. The engine even timestamps the request’s transitions explicitly, because the lifecycle is something operators need to observe rather than infer:

class EngineCoreEventType(enum.IntEnum):
    """The type of engine core request event."""

    QUEUED = 1
    SCHEDULED = 2
    PREEMPTED = 3

Source: vllm/v1/engine/__init__.py

QUEUED, SCHEDULED, PREEMPTED are the edges of the state machine made into telemetry. The intervals between them are what become the TTFT and queue-time metrics of Chapter 2. The protocol between the engine’s two halves is, in other words, a state-machine-replication protocol, not a request/response one.

One step, many requests in flight

Tie it together at the engine’s core loop. A conventional server processes a request to completion before reclaiming the worker. vLLM does the opposite: it re-plans the entire in-flight set on every single forward pass. The loop in vllm/v1/engine/core.py is almost anticlimactic:

    def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]:
        ...
        if not self.scheduler.has_requests():
            return {}, False
        scheduler_output = self.scheduler.schedule()
        future = self.model_executor.execute_model(scheduler_output, non_block=True)
        grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
        ...
            model_output = future.result()
        ...
        engine_core_outputs = self.scheduler.update_from_output(
            scheduler_output, model_output
        )

Walk the three calls in order. schedule() decides which requests, and how many of each one’s tokens, get GPU time this tick, packing them into a single batch under a fixed token budget. execute_model runs one forward pass for that whole assembled batch, so a prompt still being prefilled and a request a hundred tokens into its answer ride the same matrix multiply. update_from_output then appends the freshly sampled token to each request, advances its num_computed_tokens, checks each one’s stop conditions, flips finished requests into the right FINISHED_* state, and emits the incremental outputs. Then the loop does it again.

The diagram below traces one such step and shows why a request is never “owned” by the engine for the duration of its answer. The scheduler reassembles the batch every tick from whoever is currently runnable, so across a request’s life it drifts in and out of batches it shares with strangers.

sequenceDiagram
    participant S as "scheduler"
    participant E as "model executor (GPU)"
    participant B as "in-flight requests"
    loop every step
        S->>B: pick runnable requests under token budget
        S->>E: execute_model(batch)
        E->>E: one forward pass for the whole batch
        E-->>S: one new token per request in batch
        S->>B: append token, advance num_computed_tokens
        S->>B: check stop conditions, mark FINISHED_*
        S-->>S: emit incremental EngineCoreOutputs
    end

A request lives across thousands of these iterations, sharing each one with dozens of unrelated requests at wildly different points in their own lifecycles. There is no moment where the engine is “handling your request.” There is only a steady cadence of steps, and your request is a row in the batch that some steps include and others do not.

That is the genuinely new thing. The unit of work is not the request and it is not even the token. It is the step: a fixed-budget tick that advances many long-lived state machines by a little. Capacity is tokens per step times steps per second, divided across whoever is in flight. Tail latency is not “one slow handler” but “your request kept losing the budget fight, or got preempted, or stalled waiting for a remote KV block.” The familiar problems are all still here, but they have been pushed down a level, from between requests to within the batch.

What this forces, and where we go next

If you accept that a request is a long-lived, variable-length, two-phase, memory-owning state machine, the rest of the book stops being a grab bag and becomes a consequence. The KV cache is state, so memory, not compute, is what caps concurrency, and you have to manage it like a memory allocator (Chapter 4). The work is variable-length and re-planned per step, so the scheduler is a token-level rate limiter, not a thread pool (Chapter 5), and load-shedding means evicting live state (Chapter 6). Identical prefixes are recomputed state you already paid for, so you cache and share them (Chapter 7). The two phases stress the hardware so differently that you eventually run them on separate machines entirely (Chapter 17). Every one of these is an answer to a question that a stateless-RPC stack never had to ask.

One honest caveat before we proceed. The picture of a monotonically growing per-request KV cache is the common case, not a law. Some attention variants compress that state into a small fixed-size latent (MLA, Chapter 9), and some architectures replace the growing cache with a constant-size recurrent state, which changes the memory math in Chapter 4. The state machine stays; what varies is how much memory each state costs. We will be careful about that distinction when it matters.

The problem statement is now on the table. The next chapter makes it measurable: which latencies, at which percentiles, under which load, and why the cost denominator that decides whether your fleet is affordable is GPU-seconds rather than requests. Until you can name the metric, you cannot tell whether any of the machinery in Part II is actually helping.

Further reading

  • Efficient Memory Management for Large Language Model Serving with PagedAttention — arXiv:2309.06180 — Frames the central claim that KV-cache memory, not arithmetic, is the binding constraint on LLM serving concurrency; read it now for the problem statement, and again with Chapter 4 for the mechanism.