Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Continuous batching and the token-budget scheduler

Chapter 4 left us with a paged KV cache: a pool of fixed-size blocks that lets a single replica hold many sequences in GPU memory at once without pre-reserving a contiguous buffer for each one’s worst case. That solved the memory problem. It said nothing about time — about which of those resident sequences actually gets to run on the next forward pass, and for how many tokens.

That is the scheduler’s job, and it is where the two-phase cost structure from Part I comes back to bite. Recall the asymmetry: prefill is compute-bound (one big matmul over the whole prompt) and decode is memory-bound (read the whole model and KV cache to emit one token per sequence). A naive serving loop treats these as two different operations on two different kinds of work, and the obvious design — the one almost everyone reaches for first — is static batching: collect a batch of requests, run all of them to completion, then collect the next batch. This chapter is about why that design is quietly catastrophic for LLM serving, and how vLLM replaces it with a scheduler that does not believe in prefill or decode as distinct phases at all.

Why static batching wastes a GPU

Hold the static-batching picture in your head for a moment. You batch eight requests. They have different prompt lengths and, worse, wildly different output lengths — one stops after 12 tokens, another runs for 800. Under static batching the whole batch is hostage to the longest member. Seven sequences finish and then sit in the batch as dead weight, their slots still occupied, their padding still consuming compute on every decode step, until the eighth finally emits its stop token. Meanwhile new requests pile up in a queue they are not allowed to enter, because the batch is “in flight.”

You already know this failure mode from traffic infrastructure. It is head-of-line blocking, and the batch is the convoy. The GPU’s decode throughput is roughly flat in batch size up to the memory-bandwidth roofline — adding more concurrent sequences to a memory-bound step is nearly free until you run out of bandwidth or KV cache. The curve below shows why: a decode step’s cost is dominated by reading the model weights once, so packing more sequences into that one read raises aggregate token throughput almost linearly until bandwidth saturates and the curve bends into a flat plateau. Every slot wasted on a finished sequence is throughput from that linear region you paid for and threw away.

Illustrative: the shape (near-linear rise, then a saturating knee at the bandwidth roofline) is the point, not the absolute token/s numbers, which depend on model and GPU.

The diagram below contrasts the two regimes. Read each row as one sequence and the trailing dots as forward passes ticking by. Under static batching (top), once a sequence finishes its slot stays locked, and the queued request C cannot start until the whole batch drains. Under continuous batching (bottom), a finished slot is reclaimed on the very next pass and C slides into it immediately.

flowchart TB
    subgraph STATIC["static batching: batch runs to completion, then refills"]
        direction TB
        SA["A: decode . . . . . . . done"]
        SB["B: decode done (slot wasted, still padded)"]
        SC["C: WAITING in queue . . . . . . . . starts only now"]
    end
    subgraph CONT["continuous batching: membership re-decided every pass"]
        direction TB
        CA["A: decode . . . . . . . done"]
        CB["B: decode done, slot freed"]
        CC["C: waiting, then admitted into freed slot, then decode . . ."]
    end

The fix, introduced by Orca, is to stop scheduling at the granularity of a request and start scheduling at the granularity of an iteration — a single forward pass. An iteration (also called a step) is exactly one call to the model: the engine takes the current batch, runs one forward pass, and gets back one new token for every sequence in it. Scheduling “at iteration granularity” means the scheduler gets to redraw the batch before each of those passes, rather than once per request.

Iteration-level scheduling

The key idea from Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI ’22) is what the paper calls iteration-level scheduling, and what the rest of us call continuous batching. Instead of admitting a batch and running it to completion, the scheduler re-decides the batch’s membership before every forward pass. A sequence that finished on the previous step is evicted from the batch immediately, and a waiting request can take its place on the very next iteration — no waiting for the convoy to clear. Orca pairs this with “selective batching” to handle the fact that operations like attention cannot be naively batched across sequences of different lengths the way a feed-forward layer can. Read it as the origin point for everything in this chapter; it is the paper that reframed the batch from a static container into a per-step decision. (Orca is cited by title and venue; the description here is a paraphrase, not a quotation.)

vLLM takes this idea and pushes it one step further, to a place that is cleaner than Orca’s own framing. In vLLM there is no “selective batching” special case for attention versus everything else, and — the part that matters for this chapter — there is no prefill phase and no decode phase in the scheduler at all. The design note at the top of schedule() states it directly:

# NOTE(woosuk) on the scheduling algorithm:
# There's no "decoding phase" nor "prefill phase" in the scheduler.
# Each request just has the num_computed_tokens and
# num_tokens_with_spec. num_tokens_with_spec =
# len(prompt_token_ids) + len(output_token_ids) + len(spec_token_ids).
# At each step, the scheduler tries to assign tokens to the requests
# so that each request's num_computed_tokens can catch up its
# num_tokens_with_spec.

(vllm/v1/core/sched/scheduler.py)

Sit with what this collapses. A request is not “in prefill” or “in decode.” It is a pair of counters:

  • num_computed_tokens — how many of this request’s tokens the model has already pushed through a forward pass and recorded in the KV cache. It starts at 0 and only grows.
  • num_tokens_with_spec — how many tokens exist for this request right now: the prompt, plus every output token generated so far, plus any speculative tokens being proposed this step. The design note spells it out as len(prompt_token_ids) + len(output_token_ids) + len(spec_token_ids).

The gap between them, num_tokens_with_spec - num_computed_tokens, is the work still owed: tokens that exist but have not yet been run through the model. A fresh request with a 2,000-token prompt and nothing computed has a gap of 2,000 — that is what we used to call “prefill.” A request mid-generation has already computed its entire prompt and all prior outputs, so its gap is just the single new token it is about to extend by: a gap of 1 (or a few, when speculation proposes several at once — Chapter 13). That is what we used to call “decode.” The scheduler’s only job, every step, is to hand out token allotments that shrink these gaps. Prefill is just a large allotment; decode is a small one. They are the same operation at different scales, and the only thing distinguishing them is the size of one subtraction.

The state diagram below traces a single request through these counters. Notice there is no “prefill state” and no “decode state” — only WAITING, RUNNING, and the gap shrinking toward zero, with the same loop handling the 2,000-token first step and every 1-token step after it.

stateDiagram-v2
    [*] --> WAITING: request arrives, computed = 0
    WAITING --> RUNNING: blocks allocated, admitted to batch
    RUNNING --> RUNNING: step, computed += allotment, gap shrinks
    RUNNING --> WAITING: preempted, no free KV blocks
    RUNNING --> FINISHED: stop token, gap closed for good
    FINISHED --> [*]: KV blocks returned to pool

This is the thesis of the chapter, and it is worth saying plainly: once you frame scheduling this way, the scheduler becomes a token-level rate limiter, and you already know how to reason about those.

The token budget

The rate limit is a single integer. At the top of every schedule() call:

token_budget = self.max_num_scheduled_tokens

(vllm/v1/core/sched/scheduler.py)

That budget is the maximum number of tokens the engine will process in one forward pass, and it is set from configuration:

self.max_num_scheduled_tokens = (
    self.scheduler_config.max_num_scheduled_tokens
    if self.scheduler_config.max_num_scheduled_tokens is not None
    else self.scheduler_config.max_num_batched_tokens
)

(vllm/v1/core/sched/scheduler.py)

So --max-num-batched-tokens is the knob, and max_num_scheduled_tokens is usually equal to it (it can be slightly smaller when speculative decoding might append tokens, per the config docstring). The default is modest — DEFAULT_MAX_NUM_BATCHED_TOKENS = 2048 in vllm/config/scheduler.py, though in real deployments it is set far higher by EngineArgs. There is a second, orthogonal limit: max_num_seqs (default 128), the maximum number of distinct sequences in a batch. The token budget caps total work per step; max_num_seqs caps the width of the batch. Decode steps usually hit the sequence cap first (one token each, lots of sequences); a single long prefill can blow the whole token budget on its own.

This is the rate-limiter framing made literal. The budget is a bucket of tokens that refills to the same full value at the start of every step. The scheduler walks its queues and spends from the bucket — one token of work charges one unit — and the moment the bucket hits zero, scheduling stops for that step: every loop in schedule() is gated on token_budget > 0. Whatever did not fit simply waits for the next refill, one forward pass later. What makes it a good rate limiter for this workload is that a token is a token: a prefill token and a decode token cost the scheduler the same one unit of budget, which is exactly the uniformity the design note promised. The bucket does not know or care whether the tokens it is paying out belong to one giant prompt or to fifty sequences each advancing by one — it only counts to its limit and stops.

Two passes: running first, then waiting

With the budget in hand, the structure of schedule() is two sequential passes over two queues. The running queue holds requests already in the batch and generating; the waiting queue holds new arrivals and previously-preempted requests that want in. The order is deliberate and is the central policy choice of the whole scheduler: spend the budget on the running requests first, and only offer the leftovers to waiting ones. The flowchart below traces one full schedule() call. Every diamond that tests the budget is the same token_budget > 0 guard; when the bucket empties, the loop stops wherever it is and the rest of the queue waits for the next step.

flowchart TD
    START["schedule begins: token_budget = max_num_scheduled_tokens"] --> P1Q{"running queue empty or budget == 0?"}
    P1Q -->|"no"| P1G["gap = num_tokens_with_spec - num_computed_tokens"]
    P1G --> P1C["allotment = min of gap and token_budget"]
    P1C --> P1A{"allocate_slots: KV blocks available?"}
    P1A -->|"yes"| P1S["add to batch, token_budget -= allotment"]
    P1S --> P1Q
    P1A -->|"no"| P1P["preempt a running request (Chapter 6), mark step preempted"]
    P1P --> P1Q
    P1Q -->|"yes, Pass 1 done"| P2Q{"any preemption this step, or budget == 0?"}
    P2Q -->|"yes"| DONE["emit SchedulerOutput plan"]
    P2Q -->|"no"| P2W{"waiting queue empty or batch at max_num_seqs?"}
    P2W -->|"no"| P2G["check prefix cache, then same gap-clamp-allocate"]
    P2G --> P2S["move request from WAITING to RUNNING, token_budget -= allotment"]
    P2S --> P2W
    P2W -->|"yes"| DONE

Pass 1 covers the requests already running:

# First, schedule the RUNNING requests.
req_index = 0
while req_index < len(self.running) and token_budget > 0:
    request = self.running[req_index]
    ...
    num_new_tokens = (
        request.num_tokens_with_spec
        + request.num_output_placeholders
        - request.num_computed_tokens
    )

(vllm/v1/core/sched/scheduler.py)

There is the gap computation, straight out of the design note: num_new_tokens is how far this request still has to go. For a request that is mid-generation, that gap is 1 — it needs one decode token. For a request still chunking through a long prompt, the gap is whatever is left of the prompt. The scheduler does not care which; it just wants to schedule num_new_tokens for it, then clamps that figure down to whatever budget remains:

num_new_tokens = min(num_new_tokens, token_budget)

(vllm/v1/core/sched/scheduler.py)

That one min is the seam where prefill and decode reunite, and it is the hinge for the next chapter: a long prefill that wants 2,000 tokens but finds only 256 left in the budget gets chunked to 256 and rides alongside the decodes that already claimed the rest. We will develop chunked prefill properly in Chapter 6; for now, notice that it falls out of the rate limiter for free, with no special case.

Once a token count is settled, the request needs KV blocks for those tokens, and this is where Chapter 4’s allocator re-enters:

new_blocks = self.kv_cache_manager.allocate_slots(
    request,
    num_new_tokens,
    num_lookahead_tokens=self.num_lookahead_tokens,
)

if new_blocks is not None:
    # The request can be scheduled.
    break

(vllm/v1/core/sched/scheduler.py)

If allocate_slots returns None, the paged pool is out of blocks, and the scheduler must make room by preempting a running request — the load-shedding valve we hand off to Chapter 6. For now the happy path: blocks granted, the request joins the batch, and the budget is debited.

scheduled_running_reqs.append(request)
request_id = request.request_id
req_to_new_blocks[request_id] = new_blocks
num_scheduled_tokens[request_id] = num_new_tokens
token_budget -= num_new_tokens

(vllm/v1/core/sched/scheduler.py)

token_budget -= num_new_tokens is the rate limiter spending from its bucket. Running requests are served first, every step, which is the policy choice that protects inter-token latency: a request that is already generating keeps its decode cadence rather than being starved by a flood of new arrivals.

Pass 2 handles the waiting queue — new arrivals and previously-preempted requests — but only with whatever budget Pass 1 left behind:

# Next, schedule the WAITING requests.
if not preempted_reqs and self._pause_state == PauseState.UNPAUSED:
    ...
    while (self.waiting or self.skipped_waiting) and token_budget > 0:
        if len(self.running) == self.max_num_running_reqs:
            break

(vllm/v1/core/sched/scheduler.py)

Two guards worth noting. First, if Pass 1 had to preempt anyone (preempted_reqs is non-empty), Pass 2 is skipped entirely — there is no point admitting new work in a step where the engine is already shedding load. Second, the max_num_running_reqs ceiling (that’s max_num_seqs) caps batch width here, independently of the token budget. A waiting request goes through the same gap-and-clamp logic, checks the prefix cache for already-computed blocks (Chapter 7), allocates slots, and on success moves to RUNNING.

The order of the queues

Which waiting request is at the front? That is the entire content of request_queue.py, and it is deliberately boring — which is the point. The two policies are an enum:

class SchedulingPolicy(Enum):
    """Enum for scheduling policies."""

    FCFS = "fcfs"
    PRIORITY = "priority"

(vllm/v1/core/sched/request_queue.py)

FCFS is literally a deque; pop_request is popleft, add_request is append. Priority is a binary heap ordered by (priority, arrival_time). The default is "fcfs" (vllm/config/scheduler.py). The reason this is so thin is that all the interesting decisions — how many tokens, fit-or-preempt, prefix reuse — live in schedule() against the token budget. The queue only answers “who’s next,” and for a system whose hard constraints are tokens and blocks, plain arrival order is a defensible default. This is the part of the design that should feel most familiar to a traffic engineer: it is admission control with a token bucket sitting in front of an FCFS queue, and the cleverness is in the bucket, not the queue.

Note the preemption asymmetry baked into the queue API. When a request is preempted (Chapter 6), _preempt_request does self.waiting.prepend_request(request) — it goes back to the front under FCFS, so a preempted request resumes ahead of newer arrivals. Under priority there is no real front; prepend_request just re-inserts by (priority, arrival_time). Same call site, different semantics, chosen by the policy object.

The output contract and closing the loop

The scheduler does not run the model. It produces a plan and hands it to the worker. That plan is SchedulerOutput, and its shape is the scheduler-to-worker contract:

@dataclass
class SchedulerOutput:
    # list of the requests that are scheduled for the first time.
    scheduled_new_reqs: list[NewRequestData]
    # list of the requests that have been scheduled before.
    # ... we only send the diff to minimize the communication cost.
    scheduled_cached_reqs: CachedRequestData

    # req_id -> num_scheduled_tokens
    num_scheduled_tokens: dict[str, int]
    # Total number of tokens scheduled for all requests.
    total_num_scheduled_tokens: int

(vllm/v1/core/sched/output.py)

The num_scheduled_tokens dict is the per-step plan: for each request in the batch, how many of its tokens to process this pass. Note the new/cached split — a brand-new request ships its full payload once (NewRequestData), and on subsequent steps only the diff (CachedRequestData) crosses the process boundary, because the worker caches request state. That diff-minimization matters because, as Chapter 11 will show, the scheduler and the worker live in separate processes connected by ZMQ, and this object is serialized across that seam every single step.

The loop that drives it all is short enough to quote whole, in EngineCore.step():

scheduler_output = self.scheduler.schedule()
future = self.model_executor.execute_model(scheduler_output, non_block=True)
grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
...
    model_output = future.result()
...
engine_core_outputs = self.scheduler.update_from_output(
    scheduler_output, model_output
)

(vllm/v1/engine/core.py)

Schedule, execute, update — repeat. That three-beat loop is the engine’s heartbeat, so it is worth tracing the data crossing the process boundary on a single beat. The sequence diagram below shows one step: the scheduler emits a plan, the worker runs the model and samples, and the result flows back so the scheduler can re-plan. The arrows that cross from worker to scheduler are exactly what update_from_output consumes.

sequenceDiagram
    participant S as Scheduler
    participant W as Worker (model executor)
    S->>S: schedule, spend token budget, advance num_computed_tokens
    S->>W: SchedulerOutput, num_scheduled_tokens per request
    W->>W: execute_model, one forward pass then sample
    W->>S: sampled tokens (model_output)
    S->>S: update_from_output, append token, check stop, roll back rejected drafts
    S->>S: next step, schedule again

The third call is what makes the batching continuous. update_from_output takes the sampled tokens back from the worker and reconciles each request’s state. For a normal decode it appends the new token and checks for a stop condition; for speculative decoding it also rolls num_computed_tokens back over rejected drafts:

if request.num_computed_tokens > 0:
    request.num_computed_tokens -= num_rejected

(vllm/v1/core/sched/scheduler.py)

That single subtraction is the same uniform machinery again: a rejected speculative token simply means the request did not advance as far as the plan assumed, so the gap it must catch up next step grows back by num_rejected. No phase, no special path — just an adjustment to the counter the design note built everything around.

To see why a subtraction is the natural correction here, look at the ordering. The accounting groundwork is laid before the worker even runs, in _update_after_schedule, which advances num_computed_tokens immediately at schedule time:

request.num_computed_tokens += num_scheduled_token

(vllm/v1/core/sched/scheduler.py)

The point of advancing the counter optimistically, at schedule time rather than after results come back, is latency: the engine pipelines steps, so the next schedule() can plan against the assumption that this step succeeded without blocking on its output. For an ordinary decode that assumption always holds — the token gets generated. For speculative decoding it might not, because some proposed draft tokens get rejected. So the correction is to undo the part of the optimistic advance that turned out to be wrong, which is exactly what the -= num_rejected rollback above does. Advance optimistically, walk it back only on rejection.

When a request stops, it is dropped from running, its request_id lands in finished_req_ids so the worker frees its cached state, and its KV blocks return to the pool — instantly available to the waiting request that the very next schedule() will admit into the freed slot. That is continuous batching closing the loop: no convoy, no head-of-line block, the batch reshaped to fit reality on every pass.

Tradeoffs and what is still unsolved

The token-budget design buys uniformity, but it does not buy you out of the throughput-versus-latency tension from Chapter 2 — it just relocates it onto one knob. A large max_num_batched_tokens lets big prefills run in fewer chunks and raises throughput, but a step dominated by a 4,000-token prefill stalls every concurrent decode for the duration of that step, spiking inter-token latency for everyone else. A small budget keeps decodes responsive but chunks prefills into more, smaller pieces, adding per-step overhead and slowing time-to-first-token. The two curves below show the bind: as the budget grows, throughput climbs and saturates, but the worst-case inter-token latency (a decode trapped behind a maxed-out step) climbs roughly in step with it. The scheduler gives you a clean rate limiter; it does not tell you where to set the rate. Chapter 6 is where that knob earns its keep, once chunked prefill and preemption give you finer control over how a step’s budget is spent.

Illustrative: throughput saturating while worst-case inter-token latency grows roughly linearly with the budget is the relationship to take away, not the specific values.

Two honest limitations to carry forward. First, FCFS is starvation-resistant but not fairness-aware: short cheap requests and long expensive ones share one budget with no notion of per-tenant shares — multi-tenancy is a fleet-level concern (Part V), not something this single-replica scheduler addresses. Second, the scheduler plans blind to output length: it knows a prompt’s size exactly but cannot know whether it will generate 5 tokens or 5,000, so it cannot anticipate when KV pressure is about to force a preemption. It reacts when allocate_slots returns None rather than predicting it. Learned output-length estimation to drive smarter admission and preemption remains genuinely open.

For now we have the engine’s heartbeat: a per-step, token-budgeted re-planning loop in which prefill and decode are the same operation seen at different scales. The next chapter spends that budget deliberately — slicing long prefills into chunks that ride alongside decodes, and turning preemption-with-recompute into the engine’s backpressure valve.

Further reading

  • Orca: A Distributed Serving System for Transformer-Based Generative Models — OSDI ’22 — introduces iteration-level (continuous) batching and selective batching; the conceptual origin of this chapter’s scheduler.
  • Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — arXiv:2403.02310 — formalizes the throughput/latency tension the token budget exposes and motivates the chunked-prefill mechanism developed in Chapter 6.