Preface

This is a roadmap of the engineering problems you must solve to serve large language models at scale, and of the techniques the field has converged on to solve them. It is organized as a single argument that builds from the bottom up: it begins with the shape of the problem and the hardware asymmetry that drives everything else, works through every mechanism inside one well-run replica, then layers on the algorithms that lower the cost of a token, the strategies for spreading one model across many GPUs, and finally the concerns of operating a whole fleet. Each chapter assumes the ones before it.

It is written for engineers who already know how to build and operate distributed systems and now need to reason precisely about inference serving, whether that means evaluating a serving stack, tuning one in production, or working out why a latency SLO is being missed. It is not an introduction to transformers or to deep learning, and it is not a how-to for any single deployment. It is a way of thinking about where the costs are and which lever moves which number.

Throughout, vLLM is treated as the canonical implementation. Where a technique is described, it is grounded in vLLM’s actual source so that the abstraction stays honest and you can go read the code yourself. vLLM is distributed under the Apache License 2.0; source quoted here is reproduced under that license, with attribution to the vLLM project and its contributors. Quotations are kept short and serve to anchor the discussion in the real control flow, not to substitute for reading the repository.

Research papers are cited as jumping-off points, not reproduced. When a chapter rests on an idea from the literature, it names the paper and explains the idea in its own words, then points you to the original for the proofs, the measurements, and the nuance. Each chapter ends with a short “Further reading” list for exactly this purpose, and those lists are aggregated in the References at the back.

This is a living roadmap of a field that is still moving quickly. The mechanisms described here are the ones that matter as of this writing, but the frontier shifts with every model release and every kernel. Treat the structure as durable and the specifics as a snapshot: the questions each chapter asks will outlast the particular answers vLLM gives today.
