Distributed inference: tensor, pipeline, and expert parallelism with expert-parallel load balancing

Everything in Part II assumed the model fit on one GPU. That assumption holds for a 7B model in bfloat16, and Chapter 12 stretched it further: quantize the weights to FP8 or 4-bit and a surprisingly large model squeezes onto a single device, with the KV cache from Chapter 4 fighting it for the leftover HBM. But the assumption breaks eventually. A 405B dense model is most of a terabyte of weights before you reserve a single block for KV. A trillion-parameter mixture-of-experts model is far past that. And even when the weights fit, a model can be too slow on one GPU: decode is memory-bandwidth-bound, and one device only has so many bytes per second to read its weights with.

So you split the model across devices. The interesting part is that there is no single way to split it. There are three axes, each cutting the model along a different grain, and each pays for itself in a different communication currency with a different failure mode. Before going deep, it helps to hold the three side by side. Tensor parallelism (TP) cuts across every layer, so the work of one matrix multiply is shared by several GPUs; pipeline parallelism (PP) cuts along the depth of the model, so each GPU owns a contiguous run of whole layers; expert parallelism (EP) exists only for mixture-of-experts models and cuts by which expert, putting different expert sub-networks on different GPUs. The diagram below contrasts the three cuts on the same toy model.

flowchart LR
    subgraph TP["Tensor parallel: shard within a layer"]
        direction TB
        TPL["one matmul"] --> TPa["GPU 0: left columns"]
        TPL --> TPb["GPU 1: right columns"]
    end
    subgraph PP["Pipeline parallel: split by depth"]
        direction TB
        PPa["GPU 0: layers 0 to N"] --> PPb["GPU 1: layers N to 2N"]
    end
    subgraph EP["Expert parallel: split by expert"]
        direction TB
        EPa["GPU 0: experts 0 to 7"]
        EPb["GPU 1: experts 8 to 15"]
    end

This chapter walks the three, grounds each in vLLM’s process-group machinery, and then confronts what makes large MoE serving genuinely hard in production: once experts live on different GPUs, throughput is hostage to how evenly traffic spreads across them. The chapter closes by watching vLLM move experts between GPUs under live load to fix that skew.

One mesh, four axes

Before any layer is sharded, vLLM has to decide which GPU belongs to which group. That bookkeeping lives in initialize_model_parallel, and the cleanest way to understand the whole chapter is to read how it lays out the ranks:

# vllm/distributed/parallel_state.py
# the layout order is: ExternalDP x DP x PP x TP
all_ranks = torch.arange(world_size).reshape(
    -1,
    data_parallel_size,
    pipeline_model_parallel_size,
    prefill_context_model_parallel_size,
    tensor_model_parallel_size,
)  # noqa

Source: vllm/distributed/parallel_state.py

A rank is just a global integer name for one GPU process: in a cluster of world_size GPUs, the ranks run from 0 to world_size - 1. The reshape above takes that flat list of ranks and folds it into a multi-dimensional grid. There are five dimensions here, ordered from slowest-varying to fastest: the trailing -1 absorbs whatever is left over as “external data parallel”, then data parallel (DP), pipeline parallel (PP), a prefill-context-parallel dimension (a fifth axis for splitting long-prompt prefill that this chapter sets aside), and finally tensor parallel (TP) as the innermost, fastest-varying axis. The three axes this chapter cares about — TP, PP, and EP — are all carved from this same grid. Every GPU is one cell in it, and its coordinates along each axis say which TP group, which PP stage, and which DP replica it belongs to. The grid is the single source of truth from which every communication group is derived.

To build a particular parallel group, vLLM transposes the axis it cares about to the end, flattens, and unbinds — that is, it lines up the rank-grid so the axis of interest runs contiguously, then reads off each row as one group of ranks that will talk to each other. Tensor-parallel groups are just consecutive ranks (all_ranks.view(-1, tensor_model_parallel_size)), so ranks 0–7 form one TP group, 8–15 the next, and so on; pipeline groups stride across the PP dimension (all_ranks.transpose(2, 4)...), picking one rank from each stage; the data-parallel and expert-parallel groups come from their own transposes. Each call hands its rank lists to init_model_parallel_group, which wraps them in a GroupCoordinator — the object that owns the actual NCCL communicators (the GPU-to-GPU message channels) and exposes all_reduce, all_gather, send, recv for that group. The docstring at the top of the function is worth internalizing because it dictates topology:

# vllm/distributed/parallel_state.py
# Note that for efficiency, the caller should make sure adjacent ranks
# are on the same DGX box.

Source: vllm/distributed/parallel_state.py

Adjacency is not cosmetic. TP groups are consecutive precisely because TP is the most communication-hungry axis and wants the fattest links (NVLink within a box), while PP, which exchanges far less, can tolerate slower cross-node hops. Hold that ordering in mind; it is the reason the three axes have the cost profiles they do.

Tensor parallelism: split every matmul, pay an all-reduce per layer

Tensor parallelism is the Megatron idea (arXiv:1909.08053): shard the weight matrices within a layer and have every GPU compute a slice of every operation. vLLM implements it as two complementary linear layers. A ColumnParallelLinear splits the weight along its output dimension so each rank owns a vertical slab:

# vllm/model_executor/layers/linear.py
# Divide the weight matrix along the last dimension.
self.output_size_per_partition = divide(output_size, self.tp_size)

Source: vllm/model_executor/layers/linear.py

Its partner, RowParallelLinear, splits along the input dimension, so each rank consumes the matching slice of the previous layer’s sharded output. The cleverness of Megatron’s pairing is that a column-parallel matmul followed by a row-parallel matmul needs communication only at the very end. Walk it through. The first matmul gives each rank its own vertical slab of the intermediate result — no rank has the whole thing, but together they have all of it, with no overlap. That sharded intermediate then flows straight into the second matmul: because RowParallelLinear is split along its input, each rank’s slab is exactly the input slice that rank needs, so no data has to move between the two matmuls. Each rank multiplies its shard against its slice and produces a partial result — a same-shaped tensor that is correct except that it is missing every other rank’s contribution. To finish, you add the partials together elementwise across ranks. That summation is the all-reduce: a collective in which every rank contributes its tensor and every rank receives the sum. It happens with a single collective:

# vllm/model_executor/layers/linear.py
if self.reduce_results and self.tp_size > 1:
    output = tensor_model_parallel_all_reduce(output_parallel)
else:
    output = output_parallel

Source: vllm/model_executor/layers/linear.py

That tensor_model_parallel_all_reduce is a thin shim over the TP group’s communicator (vllm/distributed/communication_op.py just calls get_tp_group().all_reduce(...)). The diagram below traces a single column-then-row pair on two ranks, showing where the data stays sharded and where the one all-reduce stitches it back together.

flowchart TD
    X["input activation (replicated on every rank)"]
    X --> C0["GPU 0: ColumnParallelLinear, left half of weight"]
    X --> C1["GPU 1: ColumnParallelLinear, right half of weight"]
    C0 --> R0["GPU 0: RowParallelLinear on its shard, partial output"]
    C1 --> R1["GPU 1: RowParallelLinear on its shard, partial output"]
    R0 --> AR["all-reduce: sum partials across ranks"]
    R1 --> AR
    AR --> Y["full output (now replicated again)"]

In a transformer block, this happens twice — once after attention’s output projection, once after the MLP’s down-projection — so a TP-sharded forward pass costs two all-reduces per layer per token. That is the defining property of tensor parallelism: it is exact and load-balanced (every rank does identical work), but it injects a blocking collective on the critical path of every single layer. Double the TP degree and you double those collectives. This is why TP wants NVLink and rarely scales gracefully past one node: the all-reduce latency, not the compute, becomes the wall. Even the vocabulary embedding is sharded this way — vocab_parallel_embedding.py splits the token table across ranks and all-reduces the lookups — so that no single GPU holds a full copy of anything.

A subtle TP detail that bites people: bias. Because the partial results get summed, adding a bias on every rank would add it tp_size times. vLLM fuses the bias only on rank 0:

# vllm/model_executor/layers/linear.py
# Only fuse bias add into GEMM for rank 0 (this ensures that
# bias will not get added more than once in TP>1 case)
bias_ = None if (self.tp_rank > 0 or self.skip_bias_add) else self.bias

Source: vllm/model_executor/layers/linear.py

Pipeline parallelism: split by layer, pay a bubble

Pipeline parallelism cuts the other way. Instead of slicing each layer across GPUs, you give whole contiguous blocks of layers to each GPU: stage 0 holds layers 0–N, stage 1 holds N–2N, and a request’s activations flow from one stage to the next. The communication is tiny by comparison — a single hidden-state tensor handed across the stage boundary — and vLLM exposes it as a plain point-to-point send and receive on the PP group:

# vllm/distributed/parallel_state.py
def send(self, tensor: torch.Tensor, dst: int | None = None) -> None:
    """Sends a tensor to the destination rank in a blocking way"""

Source: vllm/distributed/parallel_state.py

So PP’s communication currency is cheap, which is exactly why it survives across slow inter-node links where TP would choke. Its cost is structural instead of bandwidth-bound: the bubble. The bubble is idle GPU time that comes purely from the shape of the dependency, not from any slow operation. Because each stage can only start once the stage before it has handed over its activations, a single batch threads through the stages one at a time: while stage 0 works on a batch, stages $1$ through $P-1$ sit idle waiting for its output; when stage 0 finishes and passes the baton, it goes idle, since there is nothing new behind that batch to feed it. With one batch in flight the pipeline is mostly empty — at any instant only one of the $P$ stages is busy.

The standard fix from training — Megatron again — is to keep several microbatches in flight so a stage always has something to chew on: as soon as stage 0 hands batch 1 forward, it picks up batch 2, and the stages fill up like a factory line. vLLM applies the same idea at serving time by keeping several batches in flight through the pipeline at once — a batch queue sized to the PP degree, since “PP requires PP-size concurrent batches to fill the pipeline” (max_concurrent_batches in vllm/config/vllm.py, drained by step_with_batch_queue in vllm/v1/engine/core.py). The sequence below shows two stages first running one batch (idle gaps everywhere) and then overlapping two batches so each stage stays busy.

sequenceDiagram
    participant S0 as Stage 0 layers 0..N
    participant S1 as Stage 1 layers N..2N
    Note over S0,S1: one batch in flight, note the bubble
    S0->>S0: compute batch A
    S0->>S1: send activations of A
    Note over S0: S0 idle, the bubble
    S1->>S1: compute batch A
    Note over S0,S1: two batches in flight, bubble shrinks
    S0->>S0: compute batch A
    S0->>S1: send A
    S1->>S1: compute A
    S0->>S0: compute batch B
    S0->>S1: send B
    S1->>S1: compute B

But it never fully hides the bubble at the fill and drain edges — at the very start only stage 0 has work, and at the very end only the last stage does. The fill-and-drain idle fraction of a $P$-stage pipeline fed $m$ microbatches is $(P-1)/(m+P-1)$: with a single batch in flight ($m=1$) it collapses to $(P-1)/P$, which is the “only one of the $P$ stages is busy” worst case, and it shrinks toward zero only as $m$ grows well past $P$. The curve below shows that bubble fraction against pipeline depth for a few microbatch counts.

Illustrative: computed from the ideal $(P-1)/(m+P-1)$ bubble formula, which ignores per-stage compute imbalance, communication time, and decode’s one-token-per-step starvation; real serving bubbles are larger.

Decode makes this worse: a decode step is a single token per sequence, so there is precious little work to overlap. PP buys you capacity for weights that will not otherwise fit, at the price of latency you can only partially reclaim. TP and PP are routinely combined — TP within a box, PP across boxes — which is exactly what the ExternalDP x DP x PP x TP layout is built to express.

Expert parallelism: split by expert, pay an all-to-all

The third axis only exists for mixture-of-experts (MoE) models, and it is where the chapter’s real subject begins. In a dense transformer, every token passes through the same single MLP. In an MoE layer that one MLP is replaced by many independent expert MLPs — separate sets of weights, often dozens or hundreds of them — plus a small router network that looks at each token and picks a few experts (the top-k, often just one or two) to actually run for that token. Every other expert is skipped for that token. Switch Transformers (arXiv:2101.03961) showed you can scale parameter count enormously this way while keeping per-token compute roughly fixed, because each token only ever touches a handful of experts: you can add experts to grow the model’s capacity without making any single token’s forward pass more expensive. The natural way to serve such a model is expert parallelism: put different experts on different GPUs. vLLM builds the EP group only when the model actually has experts —

# vllm/distributed/parallel_state.py
# Don't create EP group for dense models.
if config.model_config is None or config.model_config.is_moe:

Source: vllm/distributed/parallel_state.py

— and the placement of experts onto ranks is computed by determine_expert_map, which by default spreads them linearly and returns a global-to-local map with -1 marking experts this rank does not own (so a rank can quickly tell “not mine, this token’s expert lives elsewhere”). The communication pattern is the giveaway. A token’s chosen experts can live on any GPU, not the one the token currently sits on, so the layer must route every token to wherever its chosen experts are, run them, then route the results back. The forward trip is the dispatch: each rank sends every token to the rank that owns the expert that token picked. After the experts compute, the combine sends each result back to the rank the token came from. Each of those is an all-to-all collective — a pattern where, in the general case, every rank sends a (possibly different) chunk of data to every other rank simultaneously. The diagram below traces one token’s round trip through dispatch and combine.

flowchart LR
    T["token on GPU 0, router picks expert 11"]
    T -->|dispatch all-to-all| E["GPU 1 runs expert 11"]
    E -->|combine all-to-all| B["result returns to GPU 0"]
    B --> N["token continues to next layer on GPU 0"]

vLLM has a whole family of backends for these two collectives (vllm/distributed/device_communicators/all2all.py, with the choice exposed as all2all_backend in vllm/config/parallel.py, defaulting to an allgather/reduce-scatter scheme). All-to-all is the most expensive collective shape there is, because every rank talks to every other rank, but the saving grace is that each token only carries the experts it actually selected, so the volume of data on the wire is small even though the fan-out is total.

And here is the catch that the rest of the chapter is about. An all-to-all is a synchronizing step: it cannot finish until every rank has both sent and received its share, so the whole collective is gated by the slowest participant. And the slowest participant is whichever rank has the most expert work to do. If the router sends 40% of all tokens to experts that happen to live on GPU 3, then GPU 3 is doing 40% of the expert compute while the other GPUs idle, and every all-to-all stalls until GPU 3 finishes. Expert parallelism is only as fast as its most overloaded expert. Routers, trained on data that does not look like your production traffic, are reliably not uniform — a few experts become popular, others go nearly cold — and so a freshly deployed MoE model often runs at a fraction of its theoretical throughput purely from load skew. The DeepSeek-V3 technical report (arXiv:2412.19437) documents this at scale and describes the production answer: replicate the hot experts and actively rebalance which GPU hosts which. That answer is the EPLB subsystem.

Expert-parallel load balancing: rebalancing under live traffic

The vocabulary first, because EPLB’s whole design rests on it. The docstring of eplb_state.py lays it out:

# vllm/distributed/eplb/eplb_state.py
# - **Logical Expert**: An expert that is part of the model's logical structure.
# - **Redundant Expert**: To achieve load balancing, for some popular logical
#   experts, we create additional copies of the expert weights.
# - **Physical Expert**: An expert that is instantiated on a specific device.

The distinction is the whole trick, so make it concrete. A logical expert is one of the experts the model architecture defines — there are a fixed number of them, say 256, and the router only ever picks logical experts. A physical expert is an actual instantiated copy of an expert’s weights sitting on a specific GPU. Normally there is one physical copy per logical expert. EPLB breaks that one-to-one assumption: it deploys more physical slots than there are logical experts (the extras are the redundant experts), so DeepSeek-R1 with 32 redundant experts becomes 288 physical experts across the cluster. The freedom this buys is that a single hot logical expert can be backed by several physical copies on several GPUs. The router still picks the logical expert; the runtime then spreads the tokens that chose it across its physical copies, so its load is divided instead of piling onto one device. The diagram below shows a hot expert (logical 11) given two physical copies while cold experts keep one.

flowchart TD
    L11["logical expert 11 (hot, 40% of tokens)"]
    L42["logical expert 42 (cold)"]
    L11 --> P0["physical slot on GPU 0"]
    L11 --> P1["physical slot on GPU 1"]
    L42 --> P2["physical slot on GPU 2"]
    P0 --> G0["GPU 0 now carries ~20%"]
    P1 --> G1["GPU 1 now carries ~20%"]
    P2 --> G2["GPU 2 carries its small share"]

The job of EPLB is to decide, from observed traffic, the mapping from physical slots to logical experts, and to keep that mapping current as traffic drifts.

EPLB runs as a loop with three phases: observe the load every step, periodically decide a better mapping, then apply it by moving weights. The observation half runs every step. The engine records how many tokens each physical expert served into a sliding window (a count over the most recent N steps, so old traffic ages out), and periodically logs how lopsided things are — the step method computes a balancedness ratio, the mean load over the max load across ranks,

$$\text{balancedness} = \frac{\text{avg tokens}}{\text{max tokens}}$$

which is exactly the “busiest rank sets the pace” intuition turned into a number you can alert on:

# vllm/distributed/eplb/eplb_state.py
balancedness = avg_tokens / max_tokens if max_tokens > 0 else 0.0

Source: vllm/distributed/eplb/eplb_state.py

A balancedness of 1.0 is perfect; 0.4 means your worst GPU is carrying more than double its fair share and your effective throughput is roughly that fraction of peak. The bars below make the earlier “40% of tokens land on GPU 3” example concrete on an 8-GPU cluster: skewed, the busiest GPU holds 40% against a 12.5% fair share, so balancedness is $12.5/40 \approx 0.31$ and the all-to-all runs at roughly a third of peak; after EPLB replicates the hot expert across more physical slots, the per-GPU load flattens back toward that 12.5% line and balancedness climbs near 1.0.

Illustrative: an 8-GPU sketch of the chapter’s 40%-on-one-GPU example; real per-rank loads come from the observed token counts EPLB records, and the post-rebalance distribution depends on how many redundant slots the hot expert receives.

The window size and how often to act on it are tunable — window_size defaults to 1000 steps and step_interval to 3000 in EPLBConfig (vllm/config/parallel.py) — so EPLB reacts to sustained skew, not to momentary noise. The flowchart below traces the full loop, including the asynchronous apply path detailed below.

flowchart TD
    A["every step: record per-expert token counts in sliding window"]
    A --> B{"step_interval reached?"}
    B -->|no| A
    B -->|yes| C["run placement: replicate hot experts, pack onto GPUs"]
    C --> D["stage new expert weights into off-critical-path buffer (inference keeps running on old weights)"]
    D --> E{"all ranks confirm transfer done?"}
    E -->|no, keep serving and wait| E
    E -->|yes| F["atomic swap: move_from_buffer to live weights, all ranks flip together"]
    F --> A

When the rearrangement step fires, the recorded loads feed a placement algorithm in vllm/distributed/eplb/policy/default.py, adapted from DeepSeek’s open-source EPLB. It is a two-part packing problem: first decide how many copies each expert gets, then decide which GPU each copy lives on. The first part, replication, hands the redundant physical slots to the heaviest logical experts, greedily. It repeatedly looks at the load each expert would carry if its current copies split its tokens evenly — weight / logcnt, the expert’s total load divided by its replica count — and gives the next spare copy to whichever expert is worst off by that measure, then updates that expert’s replica count and repeats. An expert pulling 40% of traffic with one copy looks far worse than one pulling 5%, so it gets the first extra copy; after that its per-copy load halves and some other expert may become the worst, so the next copy may go elsewhere —

# vllm/distributed/eplb/policy/default.py
for i in range(num_log, num_phy):
    redundant_indices = np.argmax(weight / logcnt, axis=-1)
    phy2log[:, i] = redundant_indices
    logcnt[arangen, redundant_indices] += 1

— and second, pack those physical experts onto GPUs so the per-GPU totals come out even. This is bin packing: each GPU is a bin with a load budget, each physical expert is an item weighing its share of the traffic, and balanced_packing places items so no bin ends up far heavier than the others (lightest-bin-first). The policy is hierarchical: it first packs expert groups onto nodes, then replicates within a node, then packs onto GPUs within a node, because intra-node NVLink is cheap and cross-node traffic is not — the same topology awareness that drove the rank layout at the top of the chapter. Keeping a hot expert’s copies on the same node means its share of every all-to-all stays on the fast intra-node links. One detail reveals how much this is built for live reconfiguration rather than restart: preserve_intragpu_slots reorders the new mapping so that any expert staying on a given GPU keeps its old slot, which means its weights never have to move:

# vllm/distributed/eplb/policy/default.py
# Reorder the new mapping per GPU so that experts that remain on the same GPU
# keep their previous slot positions when possible.

Source: vllm/distributed/eplb/policy/default.py

That matters because actually applying a new mapping means shipping expert weights between GPUs while the model is serving requests. rearrange_expert_weights_inplace in rebalance_execute.py walks the layers and, for each, copies the experts that need to move into a pre-allocated buffer and exchanges them over the EP group’s process group. Note that EPLB uses its own communicator, deliberately separated from the forward-pass collectives:

# vllm/distributed/parallel_state.py
# This is a separate process group to isolate EPLB communications
# from MoE forward pass collectives and prevent deadlocks when
# using torch.distributed in execution with torch.distributed in EPLB.

Source: vllm/distributed/parallel_state.py

The most production-critical refinement is that this can run asynchronously. A synchronous rearrangement would freeze every GPU mid-traffic while terabytes of expert weights shuffle around — a latency spike no SLO survives. The async path (vllm/distributed/eplb/async_worker.py, driven by the is_async branches in step) copies the new weights into a separate buffer while normal inference keeps running on the old weights, then performs a fast in-place swap only once every transfer is finished, via move_from_buffer. The swap must be all-or-nothing across the whole cluster, and that is why the step method checks _all_ranks_result_ready before committing: if some ranks switched to the new physical-to-logical mapping while others were still on the old one, the dispatch all-to-all would send tokens to a GPU that no longer hosts the expert they expect, corrupting the routing. Waiting for every rank to confirm guarantees all of them flip on the same step. The same machinery generalizes to changing the number of GPUs entirely: vllm/distributed/elastic_ep/ reuses the rank-remapping logic (the rank_mapping argument threaded through rearrange_expert_weights_inplace) to scale an EP deployment up or down, while vllm/model_executor/layers/fused_moe/eep_reconfigure.py rebuilds the fused-MoE expert kernels for the new EP size — together redistributing experts onto a changed device set without a full restart.

What is still hard

None of this is settled engineering. Picking parallel degrees — how much TP versus PP versus EP for a given model, cluster, and SLO — is still mostly empirical; the cost models are crude and the search space is large. EPLB rebalances on past load, so a sharp traffic shift outruns it until the next window closes, and choosing window_size/step_interval trades responsiveness against thrashing the network with weight shuffles. The replication count (num_redundant_experts) spends HBM you could have given to KV cache, re-opening the Chapter 4 tradeoff at cluster scale: every redundant copy is blocks you are not caching prefixes in. And all of this multiplies the failure surface — a single dead GPU now takes down a TP shard, a pipeline stage, or an expert host, each with a different blast radius. Fault tolerance for inference at this scale is genuinely unsolved; most systems today simply restart the affected replica.

What we have not split is the KV cache itself. We sharded weights every way there is, but a request’s KV cache still lives wholly within one replica, and we have been assuming it fits. The next chapter breaks that assumption — when the prefix you want has been evicted from GPU memory, you spill cold KV blocks to CPU, disk, or an external store through a connector abstraction. That same connector, stretched across a network, is what finally lets prefill and decode run on separate GPU pools two chapters from now.

Inference Serving Roadmap

Distributed inference: tensor, pipeline, and expert parallelism with expert-parallel load balancing

One mesh, four axes

Tensor parallelism: split every matmul, pay an all-reduce per layer

Pipeline parallelism: split by layer, pay a bubble

Expert parallelism: split by expert, pay an all-to-all

Expert-parallel load balancing: rebalancing under live traffic

What is still hard

Further reading

Keyboard shortcuts

Inference Serving Roadmap