References

The papers cited across this book, aggregated from each chapter’s “Further reading” section and deduplicated. They are listed in order of arXiv identifier; the one venue-only citation is listed last.

The Curious Case of Neural Text Degeneration — arXiv:1904.09751
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — arXiv:1909.08053
Data Movement Is All You Need: A Case Study on Optimizing Transformers — arXiv:2007.00072
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — arXiv:2101.03961
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — arXiv:2210.17323
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — arXiv:2211.10438
Fast Inference from Transformers via Speculative Decoding — arXiv:2211.17192
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving — arXiv:2302.11665
LLaVA: Visual Instruction Tuning — arXiv:2304.08485
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — arXiv:2306.00978
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — arXiv:2307.08691
Efficient Memory Management for Large Language Model Serving with PagedAttention — arXiv:2309.06180
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — arXiv:2310.07240
Punica: Multi-Tenant LoRA Serving — arXiv:2310.18547
S-LoRA: Serving Thousands of Concurrent LoRA Adapters — arXiv:2311.03285
Splitwise: Efficient Generative LLM Inference Using Phase Splitting — arXiv:2311.18677
SGLang: Efficient Execution of Structured Language Model Programs — arXiv:2312.07104
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — arXiv:2401.09670
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — arXiv:2401.10774
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — arXiv:2401.15077
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — arXiv:2403.02310
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — arXiv:2405.04434
Preble: Efficient Distributed Prompt Scheduling for LLM Serving — arXiv:2407.00023
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — arXiv:2407.00079
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution — arXiv:2409.12191
SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications — arXiv:2411.04975
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models — arXiv:2411.15100
DeepSeek-V3 Technical Report — arXiv:2412.19437
Orca: A Distributed Serving System for Transformer-Based Generative Models — OSDI ’22
