References
The papers cited across this book, aggregated from each chapter’s “Further reading” section and deduplicated. They are listed in order of arXiv identifier; the one venue-only citation is listed last.
- The Curious Case of Neural Text Degeneration — arXiv:1904.09751
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — arXiv:1909.08053
- Data Movement Is All You Need: A Case Study on Optimizing Transformers — arXiv:2007.00072
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — arXiv:2101.03961
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — arXiv:2210.17323
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — arXiv:2211.10438
- Fast Inference from Transformers via Speculative Decoding — arXiv:2211.17192
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving — arXiv:2302.11665
- LLaVA: Visual Instruction Tuning — arXiv:2304.08485
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — arXiv:2306.00978
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — arXiv:2307.08691
- Efficient Memory Management for Large Language Model Serving with PagedAttention — arXiv:2309.06180
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — arXiv:2310.07240
- Punica: Multi-Tenant LoRA Serving — arXiv:2310.18547
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters — arXiv:2311.03285
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting — arXiv:2311.18677
- SGLang: Efficient Execution of Structured Language Model Programs — arXiv:2312.07104
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — arXiv:2401.09670
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — arXiv:2401.10774
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — arXiv:2401.15077
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — arXiv:2403.02310
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — arXiv:2405.04434
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving — arXiv:2407.00023
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — arXiv:2407.00079
- Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution — arXiv:2409.12191
- SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications — arXiv:2411.04975
- XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models — arXiv:2411.15100
- DeepSeek-V3 Technical Report — arXiv:2412.19437
- Orca: A Distributed Serving System for Transformer-Based Generative Models — OSDI ’22