Quantization: weights, KV cache, and activations

In Chapter 3 we established the asymmetry that drives this whole book: a decode step is memory-bound. To emit one token for one sequence, the GPU streams the entire model’s weights from HBM through the compute units, does a trivially small amount of arithmetic per byte, and writes one token back. The arithmetic units sit mostly idle, waiting on the memory system. Chapter 4 then showed that the second thing the GPU must read on every step is the KV cache, which grows linearly with context and is what actually caps concurrency.

Put those two facts together and a blunt lever falls out. If decode time is dominated by bytes read, then reading fewer bytes buys you time almost one-for-one. Halve the bytes of the weights and a memory-bound decode step gets close to $2\times$ faster; shrink the KV cache and you fit more sequences in the same HBM, which is more concurrency at the same latency. Quantization is the family of techniques that does exactly this: store numbers in fewer bits than the BF16 the model was trained in.

A word on the units, since the whole chapter turns on them. BF16 (“brain floating point, 16-bit”) is the format models are typically trained in: two bytes per number, with enough exponent range to represent both tiny gradients and large activations. FP8 is a one-byte floating-point format; it comes in two flavors that trade range against precision, e4m3 (4 exponent bits, 3 mantissa bits: more precision, narrower range) and e5m2 (5 exponent bits, wider range, less precision). INT4 and INT8 are 4-bit and 8-bit integers, which carry no exponent at all and so need an explicit scale (a multiplier) to map the small integer back to the real value it stands for. The arithmetic of the lever is simple: BF16 to FP8 halves the bytes, BF16 to INT4 quarters them, and on a memory-bound step that fraction is roughly the speedup ceiling. The bars below put the two sides of that trade next to each other: bytes per weight element (what you store and stream) and the corresponding decode speedup ceiling (16 bits divided by the format’s bits).

Speedup ceiling is the upper bound for a perfectly memory-bound decode step (BF16 bits / format bits); real speedups fall short of it because of overheads like on-the-fly dequant and scale reads.
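The arithmetic of the lever is small enough to write down. The sketch below computes bytes per element and the memory-bound speedup ceiling for the formats named above; the 70B parameter count is an illustrative assumption, and the weight sizes ignore the extra bytes spent on scales and zero-points.

```python
# Bits per stored element for the formats discussed above.
FORMAT_BITS = {"BF16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def bytes_per_element(fmt: str) -> float:
    """Storage cost of one element, ignoring scale/zero-point metadata."""
    return FORMAT_BITS[fmt] / 8


def speedup_ceiling(fmt: str, baseline: str = "BF16") -> float:
    """Upper bound for a perfectly memory-bound step: baseline bits / format bits."""
    return FORMAT_BITS[baseline] / FORMAT_BITS[fmt]


def weight_gib(num_params: float, fmt: str) -> float:
    """Weight footprint in GiB for a model with num_params parameters."""
    return num_params * bytes_per_element(fmt) / 2**30


PARAMS = 70e9  # assumption: a hypothetical 70B-parameter model

for fmt in FORMAT_BITS:
    print(
        f"{fmt:5s} {bytes_per_element(fmt):.1f} B/elem  "
        f"ceiling {speedup_ceiling(fmt):.0f}x  "
        f"weights {weight_gib(PARAMS, fmt):.1f} GiB"
    )
```

The ceiling is a bound, not a prediction: it is what you would get only if the step were purely limited by weight reads.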

That is the optimistic framing. The pessimistic one, which we will spend most of the chapter on, is that “fewer bits” is not a single knob. There are three distinct things you can quantize, each with its own accuracy cost, its own kernel requirements, and its own place in the forward pass: the weights, the KV cache, and the activations. They are usually discussed together, but they are genuinely different problems. This chapter walks them in the order their payoff and their difficulty suggest: weights first (biggest, easiest win), KV cache second (the concurrency lever), activations last (the hardest, because they are dynamic).

It helps to see all three on the same picture before we separate them. The diagram below traces a single decode step through one transformer layer and marks the three quantizable tensors at the points where the GPU actually reads or writes them. The weights are read at every linear layer (the GEMMs in attention’s projections and in the MLP). The activations are the intermediate tensors that flow along the arrows from one operation to the next, recomputed fresh on every step. The KV cache is read and written at the attention block, where the new token’s K and V are appended and the whole history is read back. Three different tensors, three different lifetimes: weights live for the whole serving run, the KV cache lives for the duration of one request, and activations live for the span of one operation.

flowchart LR
    X["activation in (per token)"] --> QKV["QKV projection (GEMM)"]
    W1["weights (static)"] -. read .-> QKV
    QKV --> ATT["attention"]
    KV["KV cache (per request)"] -. "read and write" .-> ATT
    ATT --> O["output projection (GEMM)"]
    W2["weights (static)"] -. read .-> O
    O --> MLP["MLP (GEMMs)"]
    W3["weights (static)"] -. read .-> MLP
    MLP --> Y["activation out (per token)"]

Why weights are the natural first target

The weights are the largest single thing read per decode step and, crucially, they are static. They do not change between requests or between steps, so you can quantize them once, offline, and ship the quantized checkpoint. All the hard statistical work, finding scales that minimize error, happens before serving ever starts. This is the regime of post-training quantization, and it is where the canonical research lives.

Two papers define the practical landscape. GPTQ (arXiv:2210.17323) showed you can quantize an LLM’s weights to 3-4 bits in one pass by greedily rounding columns and using second-order (Hessian) information to compensate the not-yet-quantized weights for each rounding error, keeping accuracy where naive rounding falls apart. AWQ (arXiv:2306.00978) made the sharper observation that not all weights matter equally: a small fraction of “salient” weight channels, identified by the magnitude of the activations that flow through them, dominate the error, so you protect those by per-channel scaling and quantize the rest aggressively. Both are worth reading not for the algorithm details (vLLM consumes their output, it does not run them) but for the mental model: 4-bit weight-only quantization is accurate enough for production because the error is concentrated and can be steered, not because rounding to 4 bits is innocuous.

vLLM’s job is the serving half: take a checkpoint someone already quantized with GPTQ, AWQ, or llm-compressor, and run a forward pass that is actually faster. The configuration surface is deliberately small. vllm/config/quantization.py names the schemes a user can ask for:

# vllm/config/quantization.py
QUANT_KEY_NAMES: dict[str, QuantKey] = {
    "fp8_per_tensor_static": kFp8StaticTensorSym,
    "fp8_per_tensor_dynamic": kFp8DynamicTensorSym,
    "fp8_per_token": kFp8DynamicTokenSym,
    "fp8_per_channel_static": kFp8StaticChannelSym,
    "fp8_per_block_static": kFp8Static128BlockSym,
    "fp8_per_block_dynamic": kFp8Dynamic128Sym,
    "mxfp8": kMxfp8Dynamic,
    "mxfp4": kMxfp4Dynamic,
    "int8_per_channel_static": kInt8StaticChannelSym,
}

Read the suffixes carefully, because they encode the real design space along two axes. The first axis is granularity: how many real numbers share a single scale. Recall that a low-bit integer only means something once you multiply it by a scale; the question is how finely you vary that scale across the tensor. per_tensor uses one scale for the entire weight matrix, which is the cheapest to store and apply but coarse, because one multiplier has to serve numbers of wildly different magnitude. per_channel gives each row or column its own scale; per_token gives each token’s activation vector its own scale; per_block slices the tensor into fixed-size blocks (commonly 128 elements) and gives each block a scale. Finer granularity means the scale tracks local magnitude more closely, so the rounding error is smaller, at the cost of storing and reading more scales.
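A toy example makes the granularity axis tangible. The sketch below is not vLLM code: it applies plain symmetric INT8 rounding to two hand-picked rows whose magnitudes differ by about 100x, and compares the error on the small row when it shares one scale with the large row (per_tensor), gets its own scale (per_channel), or is split into blocks of two elements (per_block). The values and the block size of 2 are assumptions chosen to keep the example short.

```python
def quantize_dequantize(values, num_bits=8):
    """Symmetric quantization of one group of numbers that share a single scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [qi * scale for qi in q]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Two rows (channels) whose magnitudes differ by about 100x.
small_row = [0.01, -0.02, 0.015, 0.005]
large_row = [1.0, -2.0, 1.5, 0.5]

# per_tensor: one scale serves all eight numbers.
per_tensor_small = quantize_dequantize(small_row + large_row)[:4]

# per_channel: each row gets its own scale (two scales in total).
per_channel_small = quantize_dequantize(small_row)

# per_block with block size 2: four scales in total for the two rows.
per_block_small = (
    quantize_dequantize(small_row[:2]) + quantize_dequantize(small_row[2:])
)

err_tensor = max_error(small_row, per_tensor_small)
err_channel = max_error(small_row, per_channel_small)
err_block = max_error(small_row, per_block_small)

print(f"per_tensor  max error on small row: {err_tensor:.6f}")
print(f"per_channel max error on small row: {err_channel:.6f}")
print(f"per_block   max error on small row: {err_block:.6f}")
```

With a shared scale the small row is rounded on a grid sized for the large row; giving it its own scale shrinks the grid step to match its magnitude, at the cost of storing more scales.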

The second axis is timing: when the scale is computed. A static scale is fixed ahead of time, measured during an offline calibration run and baked into the checkpoint, so serving just reads it. A dynamic scale is computed at runtime from the actual tensor about to be quantized, which adapts perfectly to each input but adds a small reduction (find the max, derive the scale) on the critical path. Static is free at serving time but only as good as the calibration data; dynamic is always well-matched to the input but never free. Every quantization decision in this chapter is some point in this granularity-by-timing space, and the suffix on each scheme name tells you exactly where it sits.
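The timing trade-off can be sketched the same way. In the example below, which is an illustration rather than vLLM's implementation, the static scale is derived once from a made-up calibration sample, while the dynamic scale is recomputed from the runtime input. The runtime input deliberately contains a value larger than anything seen in calibration, so the static path saturates and the dynamic path does not.

```python
QMAX = 127  # symmetric INT8 range


def quantize(values, scale):
    """Round to INT8 with the given scale, saturating at +/-QMAX."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


def dequantize(q, scale):
    return [qi * scale for qi in q]


def amax_scale(values):
    """The reduction both modes use: find the max magnitude, derive the scale."""
    return max(abs(v) for v in values) / QMAX


# Static: measured once, offline, on hypothetical calibration data.
calibration_sample = [0.9, -1.1, 0.4, 1.0]
static_scale = amax_scale(calibration_sample)

# At serving time an input arrives that exceeds the calibrated range.
runtime_input = [0.5, -3.0, 1.2, 0.1]

static_out = dequantize(quantize(runtime_input, static_scale), static_scale)

# Dynamic: the same reduction, but paid on the critical path for every input.
dynamic_scale = amax_scale(runtime_input)
dynamic_out = dequantize(quantize(runtime_input, dynamic_scale), dynamic_scale)

print("static :", [round(v, 3) for v in static_out])
print("dynamic:", [round(v, 3) for v in dynamic_out])
```

The static result clips -3.0 to roughly -1.1 because the calibration data never saw a value that large; the dynamic result recovers it, but only because it spent a reduction over the input first.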

The dispatch seam, again

Here is the load-bearing connection to Chapter 9. There we saw attention dispatched behind an opaque custom op so the engine could pick the best kernel for the hardware without the model code knowing. Quantized linear layers do exactly the same thing, for exactly the same reason: a 4-bit weight times a 16-bit activation is not a GEMM any stock library implements, so vLLM carries its own family of mixed-precision GEMM kernels and chooses among them at load time.

The chooser is choose_mp_linear_kernel, and it walks a platform-specific priority list trying each kernel until one says it can handle this layer:

# vllm/model_executor/kernels/linear/__init__.py
for kernel in platform_kernels:
    if kernel.__name__ in envs.VLLM_DISABLED_KERNELS:
        ...
        continue
    if (compute_capability is not None
            and kernel.get_min_capability() > compute_capability):
        ...
        continue
    can_implement, failure_reason = kernel.can_implement(config)
    if can_implement:
        return kernel

Source: vllm/model_executor/kernels/linear/__init__.py

The priority order is where the performance engineering lives:

# vllm/model_executor/kernels/linear/__init__.py
_POSSIBLE_KERNELS: dict[PlatformEnum, list[type[MPLinearKernel]]] = {
    PlatformEnum.CUDA: [
        CutlassW4A8LinearKernel,
        MacheteLinearKernel,
        AllSparkLinearKernel,
        MarlinLinearKernel,
        HummingLinearKernel,
        ConchLinearKernel,
        ExllamaLinearKernel,
        TritonW4A16LinearKernel,
    ],
    ...
}

Source: vllm/model_executor/kernels/linear/__init__.py

This is the same pattern as Chapter 9’s attention backend registry: an ordered list of candidates, a capability gate, and a can_implement predicate. The flowchart below traces the chooser’s loop for a single layer. It walks candidates in priority order; for each one it checks three things in turn (is the kernel disabled by an env var, does the GPU’s compute capability meet the kernel’s minimum, and does can_implement accept this layer’s exact dtype/granularity) and returns the first kernel that clears all three. “Compute capability” is NVIDIA’s version number for a GPU architecture, written here as an integer so that 8.0 becomes 80 (Ampere is 80, Hopper is 90, Blackwell higher still); a kernel that needs 90 simply will not be considered on an 80-class card. Because the list is ordered fastest-first, the chooser naturally lands on the best kernel the hardware can actually run.

flowchart TD
    START["choose_mp_linear_kernel for this layer"] --> NEXT{"more kernels in priority list?"}
    NEXT -->|no| FAIL["raise: no kernel can serve this config"]
    NEXT -->|yes| K["take next kernel"]
    K --> DIS{"disabled by VLLM_DISABLED_KERNELS?"}
    DIS -->|yes| NEXT
    DIS -->|no| CAP{"GPU compute capability meets kernel minimum?"}
    CAP -->|no| NEXT
    CAP -->|yes| IMPL{"can_implement accepts this layer?"}
    IMPL -->|no| NEXT
    IMPL -->|yes| WIN["return this kernel"]

Machete is the Hopper-and-newer path (get_min_capability returns 90); Marlin is the broadly-compatible workhorse that runs on older cards. The names you will hear most in 2026 production, Marlin and Machete, are simply the two entries in this list that win on the most common hardware.
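The chooser's three gates are easy to reproduce in miniature. The sketch below is a stand-in, not vLLM's code: ToyHopperKernel, ToyAmpereKernel, LayerConfig, and the acceptance rules inside can_implement are all hypothetical, chosen only so that the priority walk has something to accept and reject.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple, Type


@dataclass(frozen=True)
class LayerConfig:
    """Hypothetical stand-in for the kernel config built per layer."""
    weight_bits: int
    group_size: int


class ToyKernel:
    min_capability = 0

    @classmethod
    def can_implement(cls, config: LayerConfig) -> Tuple[bool, Optional[str]]:
        return True, None


class ToyHopperKernel(ToyKernel):
    min_capability = 90  # assumption: needs a 90-class GPU

    @classmethod
    def can_implement(cls, config):
        if config.weight_bits not in (4, 8):
            return False, "only 4- or 8-bit weights"
        return True, None


class ToyAmpereKernel(ToyKernel):
    min_capability = 80  # assumption: runs on 80-class GPUs and newer

    @classmethod
    def can_implement(cls, config):
        if config.weight_bits != 4:
            return False, "only 4-bit weights"
        return True, None


PRIORITY: List[Type[ToyKernel]] = [ToyHopperKernel, ToyAmpereKernel]  # fastest first


def choose_kernel(config, compute_capability, priority,
                  disabled: FrozenSet[str] = frozenset()):
    """Return the first kernel that is enabled, supported by the GPU, and accepts config."""
    reasons = []
    for kernel in priority:
        if kernel.__name__ in disabled:
            reasons.append(f"{kernel.__name__}: disabled")
            continue
        if kernel.min_capability > compute_capability:
            reasons.append(f"{kernel.__name__}: needs {kernel.min_capability}")
            continue
        ok, why = kernel.can_implement(config)
        if ok:
            return kernel
        reasons.append(f"{kernel.__name__}: {why}")
    raise ValueError("no kernel can serve this config: " + "; ".join(reasons))


cfg = LayerConfig(weight_bits=4, group_size=128)
print(choose_kernel(cfg, 90, PRIORITY).__name__)
print(choose_kernel(cfg, 80, PRIORITY).__name__)
```

Collecting the rejection reasons is a design choice of this sketch: when nothing matches, the error says why each candidate was skipped, which is the information you want when a checkpoint refuses to load on a given card.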

What does the kernel actually do? Machete’s apply is the clearest illustration that “mixed-precision GEMM” is a real, distinct operation:

# vllm/model_executor/kernels/linear/mixed_precision/machete.py
output = ops.machete_mm(
    a=x_2d,
    b_q=w_q,
    b_type=c.weight_type,
    b_group_zeros=w_zp,
    b_group_scales=w_s,
    b_group_size=c.group_size,
)

Source: vllm/model_executor/kernels/linear/mixed_precision/machete.py

The activations a arrive in BF16/FP16; the weight b_q is still packed 4-bit integers with its group scales (w_s) and zero-points (w_zp) alongside. (A “zero-point” is the integer value that represents real-zero; together with the scale it defines the affine map from 4-bit integer back to real number, $\text{real} = \text{scale} \cdot (q - \text{zero\_point})$.) The kernel reads the weight as 4 bits per element, dequantizes on the fly inside the GEMM, and accumulates in high precision.

That on-the-fly dequant is the whole trick, and it is worth stating exactly why it works. The expensive thing on a decode step is moving the weight from HBM into the compute units; that move happens at 4 bits per element, so it is $4\times$ cheaper than BF16. Only once a small tile of weights has arrived in fast on-chip memory does the kernel expand it back to high precision and multiply. The memory traffic is 4-bit (the win), but the math is effectively 16-bit (the accuracy). The expansion costs some arithmetic, but on a memory-bound step the arithmetic units were idle anyway, so it is close to free. This is “weight-only” quantization, often written W4A16: weights at 4 bits, activations at 16. It is the safest, most accurate form, and it is exactly the right tool for a memory-bound decode step where bytes are the bottleneck and FLOPs are free.
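To make the mechanism concrete, here is a pure-Python sketch of a W4A16 dot product. It is not how Machete or Marlin lay out memory: the low-nibble-first packing, the group size of 4, and the helper names are all assumptions made for illustration. What it does show faithfully is the shape of the trick: weights are stored two per byte, and each group is expanded with scale * (q - zero_point) only at the moment it is multiplied.

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) two per byte, low nibble first."""
    assert len(values) % 2 == 0 and all(0 <= v <= 15 for v in values)
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed):
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


def quantize_group(weights):
    """Asymmetric 4-bit quantization of one group: returns (q, scale, zero_point)."""
    lo = min(min(weights), 0.0)  # include 0 so real-zero is representable
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    zero_point = max(0, min(15, round(-lo / scale)))
    q = [max(0, min(15, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def w4a16_dot(activations, packed, scales, zero_points, group_size):
    """Read 4-bit weights, expand one group at a time, accumulate in float."""
    q = unpack_int4(packed)
    acc = 0.0
    for g, start in enumerate(range(0, len(q), group_size)):
        end = start + group_size
        # Dequantize only this small tile: real = scale * (q - zero_point).
        tile = [scales[g] * (qi - zero_points[g]) for qi in q[start:end]]
        acc += sum(a * w for a, w in zip(activations[start:end], tile))
    return acc


group_size = 4
weights = [-0.41, 0.12, 0.33, -0.17, 0.9, -0.2, 0.05, 0.6]
activations = [1.0, -2.0, 0.5, 3.0, -1.0, 0.25, 2.0, 1.5]

q_all, scales, zero_points = [], [], []
for start in range(0, len(weights), group_size):
    q, scale, zp = quantize_group(weights[start:start + group_size])
    q_all += q
    scales.append(scale)
    zero_points.append(zp)

packed = pack_int4(q_all)
exact = sum(a * w for a, w in zip(activations, weights))
approx = w4a16_dot(activations, packed, scales, zero_points, group_size)
print(f"{len(weights)} weights stored in {len(packed)} bytes")
print(f"exact={exact:.4f} approx={approx:.4f}")
```

In a real kernel the unpack and dequant happen per tile in registers or shared memory, never as a full pass over the weight matrix; the loop over groups stands in for that tiling.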

This is the crucial difference from quantizing activations, which we reach at the end of the chapter. The diagram below contrasts the two. In weight-only (W4A16) the storage is low-precision but the multiply is high-precision, so the kernel pays a dequant before every multiply. In weight-and-activation (for example W8A8) both operands enter the multiply already low-precision, so the tensor cores do genuine low-precision math, which speeds up compute-bound prefill as well, but only on hardware that has the matching low-precision tensor-core path.

flowchart TD
    subgraph WO["W4A16: weight-only"]
        A1["activation: BF16"] --> M1["multiply (high precision)"]
        WQ1["weight: 4-bit in HBM"] --> D1["dequant to BF16 on-chip"]
        D1 --> M1
        M1 --> ACC1["accumulate: BF16"]
    end
    subgraph WA["W8A8: weight and activation"]
        A2["activation: quantize to FP8"] --> M2["multiply (FP8 tensor core)"]
        WQ2["weight: FP8 in HBM"] --> M2
        M2 --> ACC2["accumulate: BF16"]
    end

The plumbing that connects checkpoint to kernel is uniform. AWQ’s linear method, for instance, builds the kernel config and calls the chooser at weight-creation time:

# vllm/model_executor/layers/quantization/awq_marlin.py
kernel_type = choose_mp_linear_kernel(mp_linear_kernel_config)
...
self.kernel = kernel_type(
    mp_linear_kernel_config,
    w_q_param_name="qweight",
    w_s_param_name="scales",
    w_zp_param_name="qzeros",
)

Source: vllm/model_executor/layers/quantization/awq_marlin.py

and apply is then a one-liner that delegates to whatever kernel won: return self.kernel.apply_weights(layer, x, bias). AWQ’s checkpoint quirks (a non-standard 4-bit packing order, _REVERSE_AWQ_PACK_ORDER, repacked in process_weights_after_loading) are absorbed before the kernel ever sees the weights, so the same Marlin/Machete kernels serve GPTQ, AWQ, and compressed-tensors checkpoints alike. The checkpoint format is the front door; the kernel is the back end; the seam between them is choose_mp_linear_kernel.

By 2026 the dominant checkpoint format is compressed-tensors, the llm-compressor output format. Its config class is a dispatcher that reads the per-layer quant args and resolves them to a concrete scheme:

# vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
if self._is_nvfp4_format(weight_quant):
    if input_quant is None:
        return CompressedTensorsW4A4Fp4(use_a16=True)
    ...
    return CompressedTensorsW4A4Fp4()
...
if (self._is_wNa16_group_channel(weight_quant, input_quant)
        and (format == CompressionFormat.pack_quantized.value)
        and (weight_quant.num_bits in WNA16_SUPPORTED_BITS)):
    return CompressedTensorsWNA16(...)

Source: vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py

The scheme names are the chapter’s three targets made literal: W4A16 (weight-only, the case above), W8A8 (weights and activations at 8 bits), W4A4 (both at 4). The split between W*A16 and W*A8/W*A4 is precisely the line between “I only quantized weights” and “I also quantized activations”, and the rest of the chapter is about why crossing that line is hard.
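The naming convention is regular enough to read mechanically. The helper below is hypothetical, not a vLLM function; it only encodes the rule stated above, that an activation width below 16 bits means the scheme also quantizes activations.

```python
import re


def parse_scheme(name: str):
    """Split a name like 'W4A16' into (weight_bits, activation_bits)."""
    match = re.fullmatch(r"W(\d+)A(\d+)", name)
    if match is None:
        raise ValueError(f"not a W<bits>A<bits> scheme name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def quantizes_activations(name: str, baseline_bits: int = 16) -> bool:
    """True when activations are stored below the 16-bit baseline."""
    _, activation_bits = parse_scheme(name)
    return activation_bits < baseline_bits


for scheme in ("W4A16", "W8A8", "W4A4"):
    kind = "weights and activations" if quantizes_activations(scheme) else "weight-only"
    print(f"{scheme}: {kind}")
```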

The KV cache: a separate budget, a separate dtype

Weight quantization shrinks the model. It does nothing for the KV cache, which Chapter 4 identified as the real concurrency limit. So the KV cache gets its own, completely independent knob, kv_cache_dtype, set in vllm/config/cache.py:

# vllm/config/cache.py
CacheDType = Literal[
    "auto", "float16", "bfloat16",
    "fp8", "fp8_e4m3", "fp8_e5m2", "fp8_inc", "fp8_ds_mla",
    ...
    "int8_per_token_head", "fp8_per_token_head", "nvfp4",
]

Storing KV in fp8 instead of BF16 halves the bytes per cached token, which roughly doubles how much context (or how many concurrent sequences) fit in the same HBM. The curves below show KV-cache footprint growing linearly with context length for a single sequence on a representative 70B-class model (80 layers, 8 KV heads, head dim 128, so K and V together are $2 \times 80 \times 8 \times 128 = 163{,}840$ cached elements per token, which is 327,680 bytes at BF16 and 163,840 bytes at fp8): the fp8 line sits at exactly half the slope, so any given HBM budget reaches roughly twice the context. This is orthogonal to weight quantization: you can run BF16 weights with an fp8 cache, or 4-bit weights with a BF16 cache, in any combination.

Illustrative: slopes are exact (fp8 is half of BF16), absolute GiB values assume the 70B-class config above; a different model’s per-token byte count shifts both lines but not their 2:1 ratio.
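The footprint arithmetic is worth spelling out, because it is easy to drop a factor of two. The sketch below uses the 70B-class shape from the text (80 layers, 8 KV heads, head dim 128), counts K and V separately, and makes the element width an explicit argument. The 20 GiB cache budget is an arbitrary assumption for the example.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes cached per token: one K and one V vector per layer and KV head."""
    elements = 2 * num_layers * num_kv_heads * head_dim  # factor 2 = K and V
    return elements * bytes_per_element


def kv_gib(context_len, bytes_per_token):
    """Cache footprint in GiB for one sequence of context_len tokens."""
    return context_len * bytes_per_token / 2**30


def max_tokens(budget_gib, bytes_per_token):
    """How many cached tokens fit in a given HBM budget."""
    return int(budget_gib * 2**30 // bytes_per_token)


LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

bf16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

print(f"BF16: {bf16_per_token:,} bytes/token")
print(f"fp8 : {fp8_per_token:,} bytes/token")

for context in (8_192, 32_768, 131_072):
    print(
        f"{context:>7} tokens: "
        f"BF16 {kv_gib(context, bf16_per_token):5.1f} GiB, "
        f"fp8 {kv_gib(context, fp8_per_token):5.1f} GiB"
    )

BUDGET_GIB = 20  # assumption: HBM left for the KV cache after weights
print("tokens that fit at BF16:", max_tokens(BUDGET_GIB, bf16_per_token))
print("tokens that fit at fp8 :", max_tokens(BUDGET_GIB, fp8_per_token))
```

Whatever the model shape, the ratio between the two lines stays 2:1, so the same budget holds twice as many cached tokens at fp8.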

The accuracy story is also different, and worth being honest about. Weights are quantized once with a careful offline algorithm. KV-cache entries are produced during inference, token by token, so quantizing them means quantizing a fresh tensor on every step with whatever scale you have. The cache is touched twice per step, and it is worth being precise about both touches. On the write, the new token’s freshly computed K and V (in BF16) are divided by their scales and rounded to fp8 before being stored into the paged cache. On the read, the entire stored history is loaded back as fp8 and multiplied by the same scales to recover an approximate BF16 before attention multiplies it against the query. The scale is the bridge in both directions, which is why getting it right matters so much.

vLLM’s BaseKVCacheMethod attaches _k_scale/_v_scale to each attention layer (vllm/model_executor/layers/quantization/kv_cache.py), and those scales are used “to quantize k/v_cache entries before saving them to the cache” and “dequantize … before fetching them”. The write happens inside the paged-cache op from Chapter 4:

# vllm/v1/attention/backends/flash_attn.py
reshape_and_cache_flash(
    key, value,
    key_cache, value_cache,
    slot_mapping,
    self.kv_cache_dtype,
    layer._k_scale,
    layer._v_scale,
)

Source: vllm/v1/attention/backends/flash_attn.py

The cleanest way to get the scale is to calibrate it offline (run a calibration set, record the typical magnitude of K and V per layer, bake k_scale/v_scale into the checkpoint). process_weights_after_loading in kv_cache.py loads exactly those, and warns loudly if they are missing:

# vllm/model_executor/layers/quantization/kv_cache.py
if k_scale == 1.0 and v_scale == 1.0 and "e5m2" not in layer.kv_cache_dtype:
    logger.warning_once(
        "Using KV cache scaling factor 1.0 for fp8_e4m3. "
        "If this is unintended, verify that k/v_scale "
        "scaling factors are properly set in the checkpoint.")

Source: vllm/model_executor/layers/quantization/kv_cache.py

A scale of 1.0 means “no real calibration”, which for fp8_e4m3 (narrow exponent range) risks clipping. That warning is a tripwire for a common production mistake: turning on kv_cache_dtype=fp8 without a calibrated checkpoint and silently losing accuracy. Note also the newer *_per_token_head cache dtypes, where the scale is computed dynamically per token and head inside the kernel at write time, sidestepping the calibration problem at some kernel cost. There is no free lunch here, only a choice of where to pay.
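The clipping risk is easy to reproduce with a simplified software model of e4m3. The rounding function below is an approximation written for this example, not the kernel's conversion: it assumes the e4m3 variant whose largest finite value is 448, ignores NaN encodings, and uses made-up K values. The write divides by the scale and rounds to the fp8 grid; the read multiplies by the same scale.

```python
import math

E4M3_MAX = 448.0  # assumption: largest finite value of the e4m3 variant modeled here


def round_to_e4m3(x: float) -> float:
    """Round to the nearest value on a simplified e4m3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)        # mag = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)     # clamp to the smallest normal exponent
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, E4M3_MAX)


def cache_write(values, scale):
    """Write path: divide by the scale, round to fp8, store."""
    return [round_to_e4m3(v / scale) for v in values]


def cache_read(stored, scale):
    """Read path: multiply the stored fp8 values by the same scale."""
    return [s * scale for s in stored]


k_values = [0.02, -3.5, 1000.0, 12.0]  # hypothetical K entries with one large value

# Uncalibrated: scale 1.0, so anything above 448 saturates on the write.
uncalibrated = cache_read(cache_write(k_values, 1.0), 1.0)

# Calibrated: map the largest observed magnitude onto the top of the fp8 range.
k_scale = max(abs(v) for v in k_values) / E4M3_MAX
calibrated = cache_read(cache_write(k_values, k_scale), k_scale)

print("uncalibrated:", uncalibrated)
print("calibrated  :", [round(v, 3) for v in calibrated])
```

With scale 1.0 the value 1000.0 comes back as 448.0, silently; with a calibrated scale it survives, and the cost moves to coarser rounding of the small values instead.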

The sequence below traces the full life of one cached entry across two steps, showing where the scale enters on the write and again on the read. Notice that the stored bytes are always fp8 and the scale never lives in the cache itself; it lives on the layer, applied at the boundary each time the kernel crosses between fp8 storage and BF16 math.

sequenceDiagram
    participant Attn as "attention layer"
    participant Cache as "paged KV cache (fp8)"
    Note over Attn: step N (token just computed)
    Attn->>Attn: "compute K,V in BF16"
    Attn->>Cache: "reshape_and_cache_flash: divide by k_scale/v_scale, round to fp8, store"
    Note over Attn: step N+1 (next token)
    Cache->>Attn: "load history as fp8"
    Attn->>Attn: "multiply by k_scale/v_scale to recover BF16"
    Attn->>Attn: "attention(query, dequantized K,V)"

One important caveat ties back to Chapter 9: MLA models store a compressed latent KV, not raw K and V. Quantizing that latent (the fp8_ds_mla dtype) is a different operation from quantizing per-head K/V, which is why the dtype list has model-family-specific entries. The KV-cache lever exists for every architecture, but what exactly you are shrinking depends on the attention design.

Activations: the hard one

That leaves activations, the tensors flowing between layers, recomputed on every forward pass for every token. Quantizing them is the prize because it unlocks genuine low-precision GEMMs (FP8 tensor cores doing FP8-times-FP8 math, not just FP8-storage-then-dequant), which speeds up the compute-bound prefill and not just decode. It is also the hardest, for a reason the research names directly.

SmoothQuant (arXiv:2211.10438) is the paper to read here. Its core finding: activations contain systematic outlier channels whose magnitudes are far larger than the rest, and those outliers wreck naive per-tensor activation quantization, while the weights are comparatively smooth and easy. The reason outliers are so damaging is direct: a per-tensor scale has to be large enough to represent the biggest value in the tensor, so a single huge outlier channel stretches the scale and forces every ordinary value to round to a coarse grid, destroying their precision. Weights, by contrast, have no comparable outlier channels, so a shared scale fits them well and they quantize cleanly.

SmoothQuant’s fix is to migrate the difficulty from the hard side to the easy side. A linear layer computes a product of an activation $X$ and a weight $W$, that is $X \cdot W$. If you divide a troublesome activation channel by some factor $s$ and multiply the matching weight channel by the same $s$, the product is unchanged, since $(X / s) \cdot (s \cdot W) = X \cdot W$, but the activation is now tame and the weight has absorbed the bump. Since the weights were quantization-friendly to begin with, they tolerate the bump well, and now both operands quantize cleanly. This algebraic shuffle is done offline and baked into the checkpoint; it is the conceptual ancestor of AWQ’s salient-channel idea and the reason W8A8 checkpoints can exist at all.
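The identity is simple enough to check numerically. The sketch below uses tiny hand-made matrices (an assumption, not real model data) with one outlier activation channel, and picks the per-channel factor with the SmoothQuant rule $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ at $\alpha = 0.5$. It then verifies that the product is unchanged and measures how many INT8 steps an ordinary channel spans under a single per-tensor scale, before and after smoothing.

```python
ALPHA = 0.5  # migration strength; 0.5 splits the difficulty evenly


def column_amax(matrix):
    """Largest magnitude in each column (one value per input channel of X)."""
    return [max(abs(row[j]) for row in matrix) for j in range(len(matrix[0]))]


def row_amax(matrix):
    """Largest magnitude in each row (one value per input channel of W)."""
    return [max(abs(v) for v in row) for row in matrix]


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def smooth(x, w, alpha=ALPHA):
    """Return (X / s, s * W, s) with one factor s per input channel."""
    s = [ax ** alpha / aw ** (1 - alpha)
         for ax, aw in zip(column_amax(x), row_amax(w))]
    x_s = [[v / s[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * s[k] for v in row] for k, row in enumerate(w)]
    return x_s, w_s, s


def int8_levels(matrix, channel):
    """INT8 steps spanned by a channel's largest value under one per-tensor scale."""
    amax = column_amax(matrix)
    return 127 * amax[channel] / max(amax)


# Activations: 2 tokens x 3 channels, channel 1 is the outlier.
X = [[0.5, 60.0, -0.8],
     [-1.0, -50.0, 0.3]]
# Weights: 3 input channels x 2 outputs, all of similar magnitude.
W = [[0.4, -0.5],
     [0.3, 0.5],
     [-0.5, 0.2]]

X_s, W_s, s = smooth(X, W)
Y, Y_s = matmul(X, W), matmul(X_s, W_s)

max_diff = max(abs(a - b) for ra, rb in zip(Y, Y_s) for a, b in zip(ra, rb))
levels_before = int8_levels(X, 0)
levels_after = int8_levels(X_s, 0)

print(f"max |Y - Y_s| = {max_diff:.2e}")
print(f"channel 0 spans {levels_before:.1f} INT8 steps before, {levels_after:.1f} after")
```

Before smoothing, the ordinary channel is squeezed into about two integer steps because the outlier sets the scale; after smoothing it gets several times more, while the weight rows absorb a bump they can tolerate.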

In vLLM, the static-vs-dynamic distinction from the config surface becomes concrete for activations. A static scheme has one activation scale baked in from calibration. A dynamic scheme measures the scale at runtime, per token, just before the GEMM, which is more accurate (it adapts to each input) but adds a quantize step on the critical path. The FP8 method keeps both options visible:

# vllm/model_executor/layers/quantization/fp8.py
ACTIVATION_SCHEMES = ["static", "dynamic"]

Source: vllm/model_executor/layers/quantization/fp8.py

and the apply path is candid that a true FP8 GEMM is not always the route taken: the comment guarding the dispatch reads “we will use BF16 dequant when direct FP8 is not supported.” In the batch-invariant path, for instance, when no CUTLASS FP8 kernel applies, it simply dequantizes the weight back to BF16 and runs an ordinary GEMM:

# vllm/model_executor/layers/quantization/fp8.py
# per-tensor/channel: dequant to BF16 and run GEMM
weight_fp8 = layer.weight.to(torch.bfloat16)
weight_scale = layer.weight_scale.to(torch.bfloat16)

Source: vllm/model_executor/layers/quantization/fp8.py

The point generalizes beyond that one branch: low-precision activation math is hardware-gated, and where the right tensor-core path is unavailable the engine pays for a dequant rather than producing a wrong answer.

The compressed-tensors scheme dispatcher reflects the same reality: W8A8 and W4A4 are real schemes, but they sit behind capability checks and specific kernels (CUTLASS FP8, NVFP4), and a checkpoint that asks for activation quantization the hardware cannot accelerate gets a slower path rather than a wrong answer.

MoE, and where this is still unsettled

Mixture-of-experts models stress every byte argument harder, because their expert weights dominate the parameter count while only a couple of experts fire per token. Quantizing expert weights is therefore the single biggest lever for MoE memory, and it gets a dedicated path, MoeWNA16Config (“W8A16/W4A16 quantization”, vllm/model_executor/layers/quantization/moe_wna16.py) feeding weight-only quantized weights through the fused-MoE kernels. The fact that it is weight-only (A16) is telling: activation quantization inside the expert routing is even harder than in dense layers, so production MoE quant in 2026 is overwhelmingly weight-only.

What remains genuinely unsolved is worth stating plainly. There is no single accuracy metric that survives quantization cleanly: perplexity can look fine while a specific reasoning or code-generation capability degrades, and the degradation is model- and task-specific. Calibration data quality for activation and KV scales is a real, under-tooled operational burden. And the kernel zoo keeps growing (NVFP4, MXFP4, MXFP8), each new format racing hardware support, so the choose_mp_linear_kernel list is a moving target, not a settled design. Quantization buys throughput almost for free in the easy cases and demands careful, honest evaluation in the hard ones.

We have now spent fewer bytes per token. The next chapter spends fewer forward passes per token: speculative decoding verifies several cheaply-drafted tokens in one bandwidth-bound pass, trading the spare compute that quantization just made even more abundant for fewer trips through memory.

Further reading

  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — arXiv:2210.17323 — one-shot second-order weight quantization to 3-4 bits; the accuracy foundation under vLLM’s GPTQ/Marlin path.
  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — arXiv:2306.00978 — protect the small set of salient weight channels identified by activation magnitude; read it to understand why 4-bit weights work.
  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — arXiv:2211.10438 — migrate activation outliers into the weights so both quantize cleanly; the key to why W8A8 activation quantization is feasible.