References

This page collects the papers cited across this book, aggregated from each chapter’s “Further reading” section and deduplicated. They are listed in order of arXiv identifier; the one venue-only citation is listed last.

  • The Curious Case of Neural Text Degeneration — arXiv:1904.09751
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — arXiv:1909.08053
  • Data Movement Is All You Need: A Case Study on Optimizing Transformers — arXiv:2007.00072
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — arXiv:2101.03961
  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — arXiv:2210.17323
  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — arXiv:2211.10438
  • Fast Inference from Transformers via Speculative Decoding — arXiv:2211.17192
  • AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving — arXiv:2302.11665
  • LLaVA: Visual Instruction Tuning — arXiv:2304.08485
  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — arXiv:2306.00978
  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — arXiv:2307.08691
  • Efficient Memory Management for Large Language Model Serving with PagedAttention — arXiv:2309.06180
  • CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — arXiv:2310.07240
  • Punica: Multi-Tenant LoRA Serving — arXiv:2310.18547
  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters — arXiv:2311.03285
  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting — arXiv:2311.18677
  • SGLang: Efficient Execution of Structured Language Model Programs — arXiv:2312.07104
  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — arXiv:2401.09670
  • Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — arXiv:2401.10774
  • EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — arXiv:2401.15077
  • Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — arXiv:2403.02310
  • DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — arXiv:2405.04434
  • Preble: Efficient Distributed Prompt Scheduling for LLM Serving — arXiv:2407.00023
  • Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — arXiv:2407.00079
  • Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution — arXiv:2409.12191
  • SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications — arXiv:2411.04975
  • XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models — arXiv:2411.15100
  • DeepSeek-V3 Technical Report — arXiv:2412.19437
  • Orca: A Distributed Serving System for Transformer-Based Generative Models — OSDI ’22