llm-d

A study guide to the llm-d/llm-d repository (checked out at materials/llm-d/llm-d, tracking main around the v0.7 release, 2026-05).

1. What it is

llm-d is a Kubernetes-native distributed inference serving stack: an orchestration and routing layer that sits above model servers (vLLM, SGLang) and below your clients, built out of the Kubernetes Gateway API, Envoy ext-proc, and engine-level telemetry. From README.md:

llm-d is a high-performance distributed inference serving stack optimized for production deployments on Kubernetes. We help you achieve the fastest “time to state-of-the-art (SOTA) performance” for key OSS large language models across most hardware accelerators…

llm-d is a Cloud Native Computing Foundation (CNCF) sandbox project, founded by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA.

Positioning, in the project’s own words (README.md): “Model servers like vLLM and SGLang handle efficiently running large language models on accelerators. llm-d provides state-of-the-art orchestration and optimizations above model servers to serve high-scale real-world traffic efficiently and reliably.”

Two principles from PROJECT.md explain its shape better than any diagram:

  1. We respect our upstreams - vLLM and the Kubernetes Inference Gateway are where code changes start, no forks
  2. vLLM-first but not vLLM-only - build the modular architecture for most people and collaborate with other projects

So llm-d is not a monolith — it is a composition of the Gateway API Inference Extension (GAIE), vLLM, and a set of llm-d-org components, wired together by the recipes in this repo. Project maintainers (per PROJECT.md) are Carlos Costa, Clayton Coleman, and Robert Shaw — representing inference research, the llm-d Router, and vLLM respectively.

2. Why you care

This project is the intersection of what you already know and what you’re learning:

  • The data plane is Envoy. The “llm-d Router” is literally Envoy (or any conformant L7 proxy: Istio, agentgateway, GKE ALB) plus an ext-proc gRPC service called the Endpoint Picker (EPP). The repo ships a complete, readable static Envoy config (guides/no-kubernetes-deployment/router/envoy/envoy.yaml) — ext_proc filter in FULL_DUPLEX_STREAMED mode, an ORIGINAL_DST cluster keyed off a routing header, circuit breakers, gRPC health checks on the EPP. It will look like home.
  • The control plane is Gateway API. Gateway + HTTPRoute + a new backend kind, InferencePool, from the GAIE project you’re studying separately. llm-d is GAIE’s most prominent downstream consumer and contributor.
  • The novel part is the scoring signal. Instead of least-request/ring-hash on connection counts, endpoints are scored on KV-cache utilization, prefix-cache locality (approximate radix-style hashing or precise event-driven indexing), queue depth, and predicted latency — i.e., load balancing where the “load” is HBM contents. This is cache-aware routing and P/D disaggregation, productionized.
  • The benchmark argument is a traffic argument. guides/optimized-baseline/README.md compares the EPP against a stock Kubernetes Service round-robining the same 8 vLLM pods: output tokens/sec 5,722 → 13,163 (+130%), TTFT p90 107.43s → 0.206s at high rates. The headline gains of this whole space come from routing, not kernels.

3. What’s actually in this repo (and what isn’t)

This is the umbrella/docs/deployment repo of the llm-d GitHub org. There is no service source code here — no Go, no Rust, no Python services. What you deploy comes from sibling repos and OCI registries; what lives here is the documentation, the Helm values + Kustomize overlays that configure those components, and the Dockerfiles for llm-d’s custom vLLM container images.

| Path | What it is |
| --- | --- |
| README.md, PROJECT.md, SIGS.md, CONTRIBUTING.md, MAINTAINERS.md | Project identity, governance, SIG structure (Kubernetes-style OWNERS files throughout) |
| docs/architecture/ | The system design docs, the core of this guide: core/ (router, EPP, InferencePool, model servers) and advanced/ (KV management, disaggregation, autoscaling, batch, latency predictor) |
| docs/well-lit-paths/ | Concept pages for each supported pattern (the “why”), one per pattern |
| docs/api-reference/ | InferencePool, InferenceObjective, InferenceModelRewrite CRDs; EndpointPickerConfig schema; EPP HTTP headers/APIs; glossary |
| docs/getting-started/ | quickstart.md and artifacts.md (the authoritative map of charts, images, and source repos) |
| docs/infra-providers/ | GKE, AKS, OpenShift, DigitalOcean, minikube notes |
| docs/resources/observability/ | Metrics catalog, PromQL cookbook, tracing setup; plus docs/resources/rdma/ for the network prerequisites |
| docs/proposals/ | Design proposals (autoscaler, batch gateway, non-Kubernetes mode, distributed tracing, …) |
| guides/ | The deployable recipes (“well-lit paths”, the “how”): per-guide Helm values files for the router chart + Kustomize overlays for model servers, per accelerator (NVIDIA/AMD/Intel XPU/Gaudi/TPU/CPU) |
| guides/recipes/ | Shared building blocks: base router values, base model-server Deployments, gateway install kustomizations (Istio, agentgateway, kgateway, GKE) |
| docker/ | Dockerfile.cuda, .rocm, .cpu, .hpu: build the ghcr.io/llm-d/llm-d-cuda etc. images, i.e. vLLM plus the RDMA/P2P stack (UCX, NVSHMEM, NIXL, GDRCopy, DeepEP, LMCache, InfiniStore) |
| patches/ | NVSHMEM patches applied in those image builds |
| helpers/, scripts/, release/, .github/ | Benchmark harness docs, client setup, lint/CI plumbing |

The actual code, by sibling repo (from the table in docs/getting-started/artifacts.md, names only):

  • llm-d-router (Go) — the EPP: routing engine, plugin framework, flow control. Older docs call it llm-d-inference-scheduler; it builds on the GAIE endpoint-picker framework. Most architecture docs here link directly into its pkg/epp/framework/plugins/... tree.
  • llm-d-routing-sidecar — the P/D routing proxy sidecar in decode pods (image ghcr.io/llm-d/llm-d-routing-sidecar).
  • llm-d-kv-cache (Go/Python/C++) — KV-block locality indexer and filesystem offloading connector.
  • llm-d-latency-predictor (Python) — XGBoost training + prediction sidecars for predicted-latency scheduling.
  • llm-d-workload-variant-autoscaler (Go) — SLO-aware autoscaler.
  • llm-d-batch-gateway, llm-d-async (incubation) — OpenAI Batch API and queue-based async processing.
  • llm-d-benchmark (Python) — the harness invoked by every guide’s benchmarking section.
  • llm-d-inference-sim (Go) — a GPU-free vLLM simulator (important for you; see section 7).

One subtlety worth knowing (from docs/getting-started/artifacts.md): the Helm charts themselves are currently published by GAIE (oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone in the quickstart; oci://ghcr.io/llm-d/charts/llm-d-router-standalone-dev in the v0.7 guides), with a note that publishing will move to llm-d. The boundary between GAIE and llm-d is deliberately thin.

4. The architecture as documented

4.1 Three core concepts

From docs/architecture/README.md: the llm-d Router (= Proxy + EPP), the InferencePool, and the Model Server.

  • Proxy: A high-performance L7 proxy (typically Envoy) that accepts user requests and consults the EPP via the ext-proc protocol to determine the optimal destination.
  • Endpoint Picker (EPP): The routing engine that scores and selects model server pods based on real-time metrics, KV-cache affinity, and configured policies.

InferencePool is described as an “LLM-optimized Service” — a label-selector grouping of pods serving one base model, with Variants (sub-groupings via pod labels, e.g. prefill vs decode roles, expressed as llm-d.ai/role: prefill|decode|prefill-decode).
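
To make the “label-selector grouping with Variants” idea concrete, here is a minimal Python sketch of the selection logic. It is not the GAIE or llm-d implementation: the `Pod` class and both function names are hypothetical, and treating a `prefill-decode` pod as a member of both role variants is an assumption about how the combined role value is meant to be read. Only the label keys and values come from the text above.

```python
from dataclasses import dataclass, field

ROLE_LABEL = "llm-d.ai/role"  # label key quoted in the paragraph above


@dataclass
class Pod:
    """Hypothetical stand-in for a Kubernetes pod: a name plus labels."""
    name: str
    labels: dict = field(default_factory=dict)


def select_pool(pods, match_labels):
    """Return pods whose labels contain every key/value in match_labels."""
    return [p for p in pods
            if all(p.labels.get(k) == v for k, v in match_labels.items())]


def variant(pods, role):
    """Sub-group pool members by role.

    Assumption: a pod labeled prefill-decode counts for both roles.
    """
    return [p for p in pods
            if p.labels.get(ROLE_LABEL) in (role, "prefill-decode")]
```

Used together, `variant(select_pool(pods, {"llm-d.ai/guide": "optimized-baseline"}), "decode")` would yield the decode-capable members of one pool.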

Terminology is pinned down in docs/architecture/core/router/README.md and worth internalizing because older blog posts disagree: llm-d Router = Proxy + EPP (the whole entry point); Inference Gateway = the Router when operating in Gateway Mode; Request Scheduler = the decision engine inside the EPP. The EPP also carries dual responsibilities — routing and “fairness and prioritization”, i.e. which requests run at all when consolidating multi-tenant workloads onto shared model servers.

4.2 Request flow: gateway → EPP → vLLM pod

From docs/architecture/core/router/README.md:

When an inference request arrives at the Proxy, the Proxy “parks” the request and initiates a callback to the EPP via the ext-proc (External Processing) protocol. The EPP evaluates the request against the current state of the InferencePool—considering factors like KV-cache locality, current load, and priority—and returns the address of the optimal model server pod back to the Proxy.

Inside the EPP, the lifecycle is enumerated in docs/architecture/core/router/epp/README.md:

  1. Request arrival at the proxy (Gateway).
  2. External processing — proxy invokes the EPP via ext-proc, passing headers and body.
  3. Request handling — parses the request (OpenAI HTTP, vLLM gRPC; parser is a plugin, so custom protocols slot in) into the internal InferenceRequest.
  4. Flow control — if enabled, queues, prioritizes, and “holds requests when the pool is saturated”.
  5. Request scheduling — Filter → Score → Pick against the InferencePool.
  6. Request proxying — EPP returns the chosen endpoint address; proxy forwards.

Asynchronously, a Data Layer watches the Kube API for pool membership, scrapes model-server metrics, and maintains in-memory state such as the prefix-cache tree; “consultant” sidecars (latency predictor, KV indexer, tokenizer) plug in here. The same doc warns that the only supported ext-proc body mode is FULL_DUPLEX_STREAMED.

Flow control deserves a closer look from a traffic engineer (docs/architecture/core/router/epp/README.md, deep dive in flow-control.md): saturation-gated admission via pluggable SaturationDetectors (e.g. a concurrency detector on per-endpoint in-flight counts), priority bands separating latency-sensitive chat from background batch, and two pluggable fairness layers — FairnessPolicy distributing dispatch opportunities among flows within a band (e.g. round robin) and OrderingPolicy ordering requests within a flow (FIFO, SLO-based). It is an application-level admission/queuing tier of the kind you’d otherwise build with Envoy adaptive concurrency plus priority queues — but keyed on inference signals.
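
The three layers described above (saturation gate, priority bands, fairness across flows plus ordering within a flow) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the EPP's code: `FlowQueue`, `pool_saturated`, the “higher number is more urgent” convention, and the “all endpoints at the limit” saturation rule are all hypothetical choices made for illustration. Round robin and FIFO are the example policies named in the text.

```python
from collections import OrderedDict, deque


def pool_saturated(in_flight, limit):
    """Toy concurrency detector: saturated when every endpoint is at its limit.

    in_flight maps endpoint -> current in-flight request count (assumption).
    """
    return all(count >= limit for count in in_flight.values())


class FlowQueue:
    """Priority bands -> flows (round robin) -> requests (FIFO)."""

    def __init__(self):
        self.bands = {}  # priority -> OrderedDict(flow_id -> deque)

    def enqueue(self, priority, flow_id, request):
        flows = self.bands.setdefault(priority, OrderedDict())
        flows.setdefault(flow_id, deque()).append(request)

    def dispatch(self, saturated):
        """Return the next request, or None if gated or nothing is queued."""
        if saturated:
            return None  # hold everything while the pool is saturated
        # Assumption: a larger priority number means a more urgent band.
        for priority in sorted(self.bands, reverse=True):
            flows = self.bands[priority]
            if not flows:
                continue
            flow_id, queue = next(iter(flows.items()))
            request = queue.popleft()      # FIFO within the flow
            del flows[flow_id]
            if queue:
                flows[flow_id] = queue     # move flow to the back: round robin
            return request
        return None
```

With two chat flows in band 1 and one batch flow in band 0, dispatch alternates between the chat flows and only reaches the batch flow once band 1 is empty.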

The scheduler (docs/architecture/core/router/epp/scheduling.md) is a weighted-scorer framework — recognizably the GAIE scheduling framework with llm-d’s plugin set. Scorers include kv-cache-utilization-scorer, queue-depth-scorer, prefix-scorer, lora-affinity-scorer, latency-scorer, session-affinity-scorer, and no-hit-lru-scorer (spreads cold prefills across the pool); pickers are max-score-picker, random-picker, weighted-random-picker. Where KV-cache-aware decisions happen: scoring, fed by the data layer.

The default “optimized baseline” policy is just YAML, shipped in this repo as Helm values (guides/optimized-baseline/router/optimized-baseline.values.yaml):

pluginsCustomConfig:
  optimized-baseline-plugins.yaml: |
    apiVersion: llm-d.ai/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: queue-scorer
    - type: kv-cache-utilization-scorer
    - type: prefix-cache-scorer
    - type: no-hit-lru-scorer
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: queue-scorer
        weight: 2
      - pluginRef: kv-cache-utilization-scorer
        weight: 2
      - pluginRef: prefix-cache-scorer
        weight: 3
      - pluginRef: no-hit-lru-scorer
        weight: 2
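
One way to read those weights is as coefficients in a weighted sum over per-scorer scores, followed by a max-score pick. The Python sketch below assumes exactly that: each scorer returns a value in [0, 1] and the profile sums weight × score. That combination rule is an assumption for illustration, not a statement of how the EPP computes it, and the function names and sample scores are hypothetical. The plugin names and weights are copied from the YAML above.

```python
# Weights copied from the schedulingProfiles block above.
WEIGHTS = {
    "queue-scorer": 2,
    "kv-cache-utilization-scorer": 2,
    "prefix-cache-scorer": 3,
    "no-hit-lru-scorer": 2,
}


def weighted_score(scores, weights=WEIGHTS):
    """Sum weight * score over all configured scorers (missing score = 0)."""
    return sum(w * scores.get(name, 0.0) for name, w in weights.items())


def pick_max(endpoints, weights=WEIGHTS):
    """endpoints maps endpoint name -> {scorer name: score in [0, 1]}."""
    return max(endpoints, key=lambda ep: weighted_score(endpoints[ep], weights))


# Hypothetical scores: pod-a is idle but cold, pod-b is busier but holds the prefix.
candidates = {
    "pod-a": {"queue-scorer": 0.9, "kv-cache-utilization-scorer": 0.8,
              "prefix-cache-scorer": 0.0, "no-hit-lru-scorer": 0.5},
    "pod-b": {"queue-scorer": 0.4, "kv-cache-utilization-scorer": 0.5,
              "prefix-cache-scorer": 1.0, "no-hit-lru-scorer": 0.0},
}
chosen = pick_max(candidates)
```

Under these made-up numbers the prefix hit on pod-b outweighs pod-a's lighter load, which is the trade-off the weight of 3 on the prefix scorer is expressing.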

4.3 The Envoy wiring (read this file first)

guides/no-kubernetes-deployment/router/envoy/envoy.yaml is the entire data plane in one static config. The selected endpoint is conveyed via a header consumed by an ORIGINAL_DST cluster:

clusters:
  - name: original_destination_cluster
    type: ORIGINAL_DST
    connect_timeout: 1000s
    lb_policy: CLUSTER_PROVIDED
    original_dst_lb_config:
      use_http_header: true
      http_header_name: x-gateway-destination-endpoint

and the ext_proc filter that calls the EPP:

- name: envoy.filters.http.ext_proc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
    grpc_service:
      envoy_grpc:
        cluster_name: ext_proc
        authority: localhost:9002
      timeout: 10s
    processing_mode:
      request_body_mode: FULL_DUPLEX_STREAMED
      response_body_mode: FULL_DUPLEX_STREAMED

That is the GAIE Endpoint Picker Protocol made concrete: EPP picks a pod IP, writes it into x-gateway-destination-endpoint, Envoy’s original-dst LB forwards to it. Response bodies also stream back through the EPP for post-processing (metrics, prefix-index updates).
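
The handoff is small enough to model in Python. This sketch only mimics the contract: one side writes the header, the other side routes on it. The function names are hypothetical, the `ip:port` value format is an assumption consistent with how an original-destination cluster consumes a header, and nothing here speaks gRPC or ext-proc.

```python
DEST_HEADER = "x-gateway-destination-endpoint"  # header name from envoy.yaml above


def epp_pick(candidates, score):
    """EPP side: choose the best endpoint and express it as a header mutation."""
    best = max(candidates, key=score)
    return {DEST_HEADER: best}


def proxy_resolve(headers):
    """Proxy side: turn the header value into the upstream (host, port)."""
    target = headers.get(DEST_HEADER)
    if target is None:
        raise ValueError(f"missing {DEST_HEADER}; no endpoint was picked")
    host, _, port = target.rpartition(":")  # assumed 'ip:port' format
    return host, int(port)
```

The design point to notice is that the proxy holds no endpoint list of its own: the routing decision travels entirely in the request headers.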

4.4 Two deployment modes for the proxy

docs/architecture/core/router/proxy.md defines:

  • Standalone Mode — Envoy runs as a sidecar in the EPP pod, ext-proc over localhost, no Gateway/HTTPRoute/controller needed. For testing, batch, RL pipelines, legacy-Ingress clusters.
  • Gateway Mode (= “Inference Gateway”) — the GAIE integration: a shared Gateway hosts HTTPRoutes whose backendRefs are InferencePools (group: inference.networking.k8s.io), alongside ordinary Service backends. Used for shared infra, multi-cluster LB, traffic splitting/canary. Supported providers documented: Istio, GKE Gateway, agentgateway (kgateway deprecated as of v0.7).

4.5 The well-lit paths

Indexed in docs/well-lit-paths/README.md and guides/README.md, grouped as: Intelligent Routing (optimized baseline; predicted-latency routing), Advanced KV-Cache Management (precise prefix-cache routing; tiered prefix cache offload to CPU/NVMe), Serving Large Models (P/D disaggregation; wide expert-parallelism), Operational Excellence (flow control; workload autoscaling; rollouts), Workloads (agentic inference; multimodal), and Experimental (async processing; batch gateway; no-Kubernetes deployment).

Prefix-cache-aware routing (docs/architecture/advanced/kv-management/prefix-cache-aware-routing.md) has two implementations:

| Feature | Approximate | Precise |
| --- | --- | --- |
| Precision | Heuristic (character-based block hashing) | 100% (token-based) |
| State source | Local EPP assumptions after each routing decision | Real-time KVEvents from model servers over ZMQ |
| Dependencies | None | vLLM /v1/completions/render tokenizer endpoint, ZMQ |

The approximate path splits prompts into fixed-size blocks, keeps a rolling-hash LRU index of which prefixes were routed where, and “learns” from its own decisions. The precise path subscribes to vLLM KV-cache block add/evict events and maintains a global KV-block index (the llm-d-kv-cache component), with speculative indexing to cover the decision-to-event blind spot. guides/precise-prefix-cache-routing/router/precise-prefix-cache-routing.values.yaml shows the full production config including active-active HA EPP replicas.
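
The approximate path is simple enough to sketch. The Python below is a toy version under stated assumptions, not the plugin's code: the block size, the use of SHA-256 as the chained hash, the capacity, and the class and method names are all hypothetical. What it does preserve from the description is the shape: fixed-size character blocks, a hash chained over the prefix, an LRU-bounded index, and state learned only from the router's own decisions.

```python
import hashlib
from collections import OrderedDict

BLOCK_CHARS = 16  # hypothetical block size, in characters


def block_hashes(prompt, block_chars=BLOCK_CHARS):
    """Hash each full block chained with its predecessor, so a hash names a prefix."""
    hashes, prev = [], b""
    full = len(prompt) - len(prompt) % block_chars  # ignore the trailing partial block
    for i in range(0, full, block_chars):
        block = prompt[i:i + block_chars].encode("utf-8")
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes


class ApproxPrefixIndex:
    """LRU map of prefix hash -> endpoints the router believes hold that prefix."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = OrderedDict()

    def record(self, prompt, endpoint):
        """Learn from a routing decision: assume endpoint now caches this prompt."""
        for h in block_hashes(prompt):
            self.entries.setdefault(h, set()).add(endpoint)
            self.entries.move_to_end(h)
            while len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used

    def score(self, prompt, endpoint):
        """Fraction of the prompt's blocks matched as a contiguous prefix."""
        hashes = block_hashes(prompt)
        if not hashes:
            return 0.0
        matched = 0
        for h in hashes:
            if endpoint not in self.entries.get(h, ()):
                break
            matched += 1
        return matched / len(hashes)
```

Because the index only records where requests were sent, it can drift from what the engine actually holds after evictions. That drift is the gap the precise, event-driven path closes.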

P/D disaggregation (docs/architecture/advanced/disaggregation/README.md) — the EPP’s disagg-profile-handler runs a decode profile, asks a decider plugin whether the uncached suffix on the chosen decode pod is large enough to justify disaggregation, and only then runs the prefill profile. Decode endpoint becomes the proxy’s primary destination; the prefill endpoint rides along in the x-prefiller-host-port header. The sequence, from that doc:

sequenceDiagram
    Client->>Proxy: Request
    Proxy-->>EPP: Run EPP protocol
    EPP-->>Proxy: Selects P Worker and D Worker
    Proxy->>DSidecar: Request
    DSidecar->>PWorker: Request with max_tokens=1, do_remote_decode=True
    PWorker->>DSidecar: Response with KVTransferParams
    DSidecar->>DWorker: Request with KVTransferParams and do_remote_prefill=True
    DWorker-->>PWorker: Pull KV Cache (NIXL RDMA)

KV transfer uses NIXL (NVIDIA’s transfer library, from the ai-dynamo org — shared infrastructure with Dynamo) over UCX/UCCL/libfabric on IB/RoCE/EFA; TCP fallback exists “for testing and development” only. The vLLM protocol (nixlv2) is two-phase sequential; SGLang’s is concurrent with out-of-band bootstrap coordination. The corresponding EPP config is in guides/pd-disaggregation/router/pd-disaggregation.values.yaml — two schedulingProfiles named prefill and decode, each with its own prefill-filter/decode-filter plus prefix/queue/kv-utilization scorers, composed by disagg-profile-handler. The reference deployment is openai/gpt-oss-120b with 8 TP=1 prefill instances and 2 TP=4 decode instances (“heterogeneous parallelism”; tune your xPyD ratio to your ISL/OSL).
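
The decide-then-prefill step described at the start of this subsection reduces to a threshold check on the uncached suffix. The sketch below is a hypothetical reading of that logic in Python: the function names, the token-count inputs, and the 512-token default threshold are all assumptions, and the real decider is a configurable plugin. Only the x-prefiller-host-port header name comes from the text.

```python
PREFILL_HEADER = "x-prefiller-host-port"  # header name from the text above


def should_disaggregate(prompt_tokens, cached_prefix_tokens, min_uncached_tokens=512):
    """Disaggregate only when the uncached suffix is big enough to be worth a transfer.

    min_uncached_tokens is a hypothetical threshold, not a documented default.
    """
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return uncached >= min_uncached_tokens


def route(prompt_tokens, cached_prefix_tokens, decode_pod, pick_prefill,
          min_uncached_tokens=512):
    """Return (primary destination, extra headers) for one request.

    The decode pod is always the primary destination; the prefill profile
    (pick_prefill) runs only if the decider says yes.
    """
    headers = {}
    if should_disaggregate(prompt_tokens, cached_prefix_tokens, min_uncached_tokens):
        headers[PREFILL_HEADER] = pick_prefill()
    return decode_pod, headers
```

A long prompt with a mostly cached prefix therefore stays on the decode pod alone, which is why prefix-cache hits and P/D disaggregation interact rather than compete.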

Wide expert-parallelism (guides/wide-ep-lws/README.md) — DeepSeek-R1 at DP=16 prefill + DP=16 decode over 32 H200/B200 GPUs, deployed with the LeaderWorkerSet (LWS) controller, requiring full-mesh all-to-all RDMA for DeepEP (“rail-only connectivity will fail”). This is the path the custom ghcr.io/llm-d/llm-d-cuda images (built from docker/Dockerfile.cuda with NVSHMEM/DeepEP patches from patches/) exist for.

4.6 Role of the Gateway API Inference Extension

Cross-reference for your gateway-api-inference-extension study: llm-d consumes GAIE at three layers, all visible in this repo.

  1. CRDs: every guide starts with kubectl apply -f .../gateway-api-inference-extension/releases/download/v1.5.0/v1-manifests.yaml (InferencePool is inference.networking.k8s.io/v1; InferenceObjective and InferenceModelRewrite are llm-d.ai/v1alpha2 — see docs/api-reference/README.md).
  2. Charts: the standalone/gateway router Helm charts are published from GAIE’s config/charts/ today (docs/getting-started/artifacts.md).
  3. Framework: the EPP in llm-d-router is built on GAIE’s endpoint-picker framework and Endpoint Picker Protocol; llm-d adds the plugin set (precise prefix cache, disagg handler, latency predictor, flow control policies) and the production recipes. PROJECT.md’s “no forks” principle is the governing relationship.

In short: GAIE is the routing framework and API; llm-d is the opinionated, benchmarked distribution of it, plus the engine-side pieces (sidecar, KV indexer, images) GAIE doesn’t own.

4.7 Deployment shape: what a minimal install looks like

From docs/getting-started/quickstart.md, an install is three commands, the first of which applies the CRDs:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/${GAIE_VERSION}/v1-manifests.yaml

helm install ${GUIDE_NAME} \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
    -f guides/recipes/router/base.values.yaml \
    -f guides/optimized-baseline/router/optimized-baseline.values.yaml \
    -n ${NAMESPACE} --version ${GAIE_VERSION}

kubectl apply -n ${NAMESPACE} -k guides/optimized-baseline/modelserver/gpu/vllm/

Pattern: Helm for the router (chart from the GAIE/llm-d registries, configured by layered values files from this repo) plus Kustomize for the model servers. The base is guides/recipes/modelserver/base/single-host/default/decode-deployment.yaml, a plain Deployment labeled llm-d.ai/role: decode with /health and /v1/models probes. Each guide and accelerator patches it; for example, guides/optimized-baseline/modelserver/gpu/vllm/base/patch-vllm.yaml sets vllm serve Qwen/Qwen3-32B --tensor-parallel-size=2 --gpu-memory-utilization=0.95 with 8 replicas. The EPP discovers pods via the InferencePool’s matchLabels (llm-d.ai/guide: "optimized-baseline"). Worth reading in guides/recipes/router/base.values.yaml: the Envoy sidecar args block, whose inline comment explains why --log-level warn and --concurrency 8 override the chart defaults (trace logging and hardware_concurrency() worker threads oversubscribing the cgroup CPU slice). It is your kind of footnote.

5. Suggested reading path

  1. README.md, then PROJECT.md — what it is, who runs it, the no-forks upstream policy.
  2. docs/architecture/README.md → docs/architecture/core/router/README.md → proxy.md: Router = Proxy + EPP, standalone vs gateway modes.
  3. guides/no-kubernetes-deployment/router/envoy/envoy.yaml + router/epp/config.yaml — the whole data plane and a complete EndpointPickerConfig in two files, no cluster abstractions in the way.
  4. docs/architecture/core/router/epp/README.md → scheduling.md → flow-control.md → datalayer.md → configuration.md: the EPP internals; then docs/api-reference/endpointpickerconfig.md and epp-http-headers.md as reference.
  5. docs/architecture/core/inferencepool.md + docs/api-reference/inferencepool.md — the CRD bridge to GAIE.
  6. guides/optimized-baseline/README.md (skim manifests, read the benchmark report at the bottom) — the canonical deployment and the RR-vs-EPP numbers.
  7. docs/architecture/advanced/kv-management/ (all four files) and guides/precise-prefix-cache-routing/ — approximate vs precise cache-aware routing.
  8. docs/architecture/advanced/disaggregation/README.md + guides/pd-disaggregation/README.md + guides/recipes/modelserver/base/single-host/pd/vllm/patch-sidecar.yaml — P/D end to end, including the sidecar.
  9. guides/wide-ep-lws/README.md + docker/Dockerfile.cuda — wide-EP and what actually goes into the engine image.
  10. docs/resources/observability/ — the operational surface: metrics.md for enabling scraping, promql.md for ready-made queries over the EPP scheduler metrics (inference_pool_per_pod_queue_size, inference_extension_prefix_indexer_hit_ratio, per-plugin latency distributions — cataloged in the tables at the bottom of docs/architecture/core/router/epp/scheduling.md) plus vLLM metrics, and tracing.md for OTel setup.
  11. Skim docs/well-lit-paths/ for the rest (flow control, autoscaling, batch, agentic), and docs/proposals/ if you want to see where the project is heading (non-kubernetes-mode.md, distributed-tracing.md, autoscaler.md).

6. Connections to your other study repos

  • gateway-api-inference-extension — the routing brain llm-d builds on. Everything in section 4.6; read GAIE’s Endpoint Picker Protocol proposal alongside docs/architecture/core/router/epp/README.md and the Envoy config above. llm-d is effectively GAIE’s flagship consumer plus an extended plugin catalog (the EPP code itself lives in sibling repo llm-d-router).
  • vllm — the engine being orchestrated. llm-d leans on vLLM features you can study in that repo: automatic prefix caching (what the prefix scorers exploit), KV-cache events over ZMQ (what precise routing consumes), the KV-connector/NIXL interface and kv_transfer_params (what P/D rides on), /v1/completions/render tokenization, data-parallel deployment (one pod, multiple endpoints), and the OffloadingConnector (tiered prefix cache).
  • dynamo — the closest competitor, from NVIDIA, and the sharpest contrast: Dynamo is a self-contained distributed runtime (its own Rust router/frontend, etcd/NATS control plane, planner) that runs on or off Kubernetes; llm-d is deliberately not a runtime — it reuses Kubernetes primitives (Gateway API, Deployments, LWS, HPA) and Envoy, adding only the EPP and sidecars. They share NIXL for KV transfer (note NIXL_REPO=github.com/ai-dynamo/nixl in docker/Dockerfile.cuda). Compare Dynamo’s KV-aware router with the EPP’s scorer pipeline, and Dynamo’s planner with llm-d’s Workload Variant Autoscaler.
  • sglang — the second engine: first-class in the optimized-baseline and P/D guides (guides/optimized-baseline/modelserver/gpu/sglang/), with its own concurrent bootstrap-room KV-transfer protocol documented in docs/architecture/advanced/disaggregation/README.md — contrast with vLLM’s sequential nixlv2.
  • nano-vllm — a minimal engine is the right mental model for what an llm-d “endpoint” is: the EPP only needs an OpenAI-compatible HTTP surface plus standard metrics; everything llm-d scores (queue depth, KV utilization, prefix reuse) maps to structures you can see in nano-vllm’s scheduler and block manager in a few hundred lines.
  • xgrammar / flashinfer — below llm-d’s abstraction line. They live inside the engine pods (structured-output constraints and attention/sampling kernels respectively); llm-d never sees them except as their effects on per-request latency and throughput — exactly the signals the latency predictor learns. flashinfer also illustrates why decode is memory-bandwidth-bound, which is the entire premise of P/D specialization.

7. Hands-on without a 16-GPU cluster

Honest assessment: the headline paths are out of home reach — the optimized baseline wants 16 GPUs (8×TP=2 for Qwen3-32B), P/D wants RDMA between nodes, wide-EP wants 32 H200s. With one RTX 5080 (16 GB) on Windows, treat this as a docs-and-configs repo plus three realistic exercises:

  1. No-Kubernetes deployment, scaled down (best option). guides/no-kubernetes-deployment/README.md runs the real stack — EPP container + Envoy container + vLLM worker(s) — with Docker only, endpoints declared in a YAML file (file-discovery plugin, hot-reloaded via atomic rename). Under WSL2 with CUDA Docker, substitute the 32B model with something that fits 16 GB (e.g. a 4–8B model at --tensor-parallel-size=1) and drop the TP/shm settings; the EPP and Envoy configs are model-agnostic. You can then watch scheduling happen: send repeated shared-prefix prompts, scrape EPP metrics on :9090 (inference_extension_prefix_indexer_hit_ratio, per-pod queue gauges), and poke Envoy admin on :19000 to see the ext_proc cluster and original-dst routing.
  2. GPU-free routing experiments with the simulator. docs/getting-started/artifacts.md lists sibling repo llm-d-inference-sim, a “GPU-free vLLM simulator.” Point the file-discovery endpoints.yaml (or a kind/minikube InferencePool) at several simulator instances and you can exercise the full EPP plugin pipeline — scorer weights, flow-control priority bands, even the disagg profile handler — with zero accelerators. For a traffic engineer, this is the highest-signal-per-watt exercise in the project.
  3. kind/minikube for the control plane only. The GAIE CRDs, router Helm chart, Gateway+HTTPRoute wiring (guides/recipes/gateway/), and EPP all install on a CPU-only cluster; only the model-server Kustomize step needs GPUs (swap in CPU vLLM via guides/optimized-baseline/modelserver/cpu/vllm/ — though it wants 64 cores/replica — or the simulator). Note docs/infra-providers/minikube/README.md is currently a stub (“TBD”), so expect to adapt the quickstart yourself.
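
For the “watch scheduling happen” part of these exercises, a small parser for the Prometheus text format is enough to turn a metrics scrape into per-pod numbers. The Python sketch below is a minimal helper under stated assumptions: the function names are hypothetical, the sample text and its `name` label are made up for illustration, and the parser handles only simple sample lines (it does not unescape label values). The metric name is the one cited in the reading path above.

```python
def parse_metric(text, name):
    """Return [(label_string, value), ...] for every sample of metric `name`."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "{" in line:
            metric = line[:line.index("{")]
            labels = line[line.index("{") + 1:line.rindex("}")]
            rest = line[line.rindex("}") + 1:].split()
        else:
            metric, *rest = line.split()
            labels = ""
        if metric == name and rest:
            samples.append((labels, float(rest[0])))  # rest[1], if any, is a timestamp
    return samples


def busiest(samples):
    """Sample with the largest value, or None when there are no samples."""
    return max(samples, key=lambda s: s[1]) if samples else None


# Hypothetical scrape output; the label name is an assumption.
SAMPLE = """\
# TYPE inference_pool_per_pod_queue_size gauge
inference_pool_per_pod_queue_size{name="pod-a"} 3
inference_pool_per_pod_queue_size{name="pod-b"} 0
"""
queue_sizes = parse_metric(SAMPLE, "inference_pool_per_pod_queue_size")
```

In practice the text would come from an HTTP GET against the EPP metrics port mentioned above (the exact path is an assumption to check against the guide), sampled before and after a burst of shared-prefix prompts.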

Pure reading also pays here more than in most repos: the architecture docs are recent (v0.7), unusually candid about trade-offs (approximate-vs-precise tables, P/D “not a target for all workloads” guidance, NIXL TCP-is-dev-only warnings), and every claim is tied to a manifest you can open in the same checkout.