llm-d
A study guide to the llm-d/llm-d repository (checked out at materials/llm-d/llm-d, tracking main around the v0.7 release, 2026-05).
1. What it is
llm-d is a Kubernetes-native distributed inference serving stack: an orchestration and routing layer that sits above model servers (vLLM, SGLang) and below your clients, built out of the Kubernetes Gateway API, Envoy ext-proc, and engine-level telemetry. From README.md:
llm-d is a high-performance distributed inference serving stack optimized for production deployments on Kubernetes. We help you achieve the fastest “time to state-of-the-art (SOTA) performance” for key OSS large language models across most hardware accelerators…
llm-d is a Cloud Native Computing Foundation (CNCF) sandbox project, founded by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA.
Positioning, in the project’s own words (README.md): “Model servers like vLLM and SGLang handle efficiently running large language models on accelerators. llm-d provides state-of-the-art orchestration and optimizations above model servers to serve high-scale real-world traffic efficiently and reliably.”
Two principles from PROJECT.md explain its shape better than any diagram:
- We respect our upstreams - vLLM and the Kubernetes Inference Gateway are where code changes start, no forks
- vLLM-first but not vLLM-only - build the modular architecture for most people and collaborate with other projects
So llm-d is not a monolith — it is a composition of the Gateway API Inference Extension (GAIE), vLLM, and a set of llm-d-org components, wired together by the recipes in this repo. Project maintainers (per PROJECT.md) are Carlos Costa, Clayton Coleman, and Robert Shaw — representing inference research, the llm-d Router, and vLLM respectively.
2. Why you care
This project is the intersection of what you already know and what you’re learning:
- The data plane is Envoy. The “llm-d Router” is literally Envoy (or any conformant L7 proxy: Istio, agentgateway, GKE ALB) plus an ext-proc gRPC service called the Endpoint Picker (EPP). The repo ships a complete, readable static Envoy config (guides/no-kubernetes-deployment/router/envoy/envoy.yaml): ext_proc filter in FULL_DUPLEX_STREAMED mode, an ORIGINAL_DST cluster keyed off a routing header, circuit breakers, gRPC health checks on the EPP. It will look like home.
- The control plane is Gateway API. Gateway + HTTPRoute + a new backend kind, InferencePool, from the GAIE project you’re studying separately. llm-d is GAIE’s most prominent downstream consumer and contributor.
- The novel part is the scoring signal. Instead of least-request/ring-hash on connection counts, endpoints are scored on KV-cache utilization, prefix-cache locality (approximate radix-style hashing or precise event-driven indexing), queue depth, and predicted latency, i.e., load balancing where the “load” is HBM contents. This is cache-aware routing and P/D disaggregation, productionized.
- The benchmark argument is a traffic argument. guides/optimized-baseline/README.md compares the EPP against a stock Kubernetes Service round-robining the same 8 vLLM pods: output tokens/sec 5,722 → 13,163 (+130%), TTFT p90 107.43s → 0.206s at high rates. The headline gains of this whole space come from routing, not kernels.
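A quick arithmetic check on those benchmark figures, using only the numbers quoted above (the variable names are illustrative). It confirms the +130% throughput figure and shows that the TTFT change is a ratio of several hundred, which is the shape you would expect from removing queueing rather than from a faster kernel:

```python
# Figures as quoted above from guides/optimized-baseline/README.md.
rr_tokens_per_s = 5722.0     # stock Service, round robin
epp_tokens_per_s = 13163.0   # same pods behind the EPP
rr_ttft_p90_s = 107.43
epp_ttft_p90_s = 0.206

# Relative throughput gain in percent, and TTFT p90 as a plain ratio.
throughput_gain_pct = (epp_tokens_per_s / rr_tokens_per_s - 1.0) * 100.0
ttft_ratio = rr_ttft_p90_s / epp_ttft_p90_s

print(f"throughput gain: {throughput_gain_pct:+.1f}%")
print(f"TTFT p90 ratio:  {ttft_ratio:.1f}x")
```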
3. What’s actually in this repo (and what isn’t)
This is the umbrella/docs/deployment repo of the llm-d GitHub org. There is no service source code here — no Go, no Rust, no Python services. What you deploy comes from sibling repos and OCI registries; what lives here is the documentation, the Helm values + Kustomize overlays that configure those components, and the Dockerfiles for llm-d’s custom vLLM container images.
| Path | What it is |
|---|---|
| README.md, PROJECT.md, SIGS.md, CONTRIBUTING.md, MAINTAINERS.md | Project identity, governance, SIG structure (Kubernetes-style OWNERS files throughout) |
| docs/architecture/ | The system design docs, the core of this guide. core/ (router, EPP, InferencePool, model servers) and advanced/ (KV management, disaggregation, autoscaling, batch, latency predictor) |
| docs/well-lit-paths/ | Concept pages for each supported pattern (the “why”), one per pattern |
| docs/api-reference/ | InferencePool, InferenceObjective, InferenceModelRewrite CRDs; EndpointPickerConfig schema; EPP HTTP headers/APIs; glossary |
| docs/getting-started/ | quickstart.md and artifacts.md (the authoritative map of charts, images, and source repos) |
| docs/infra-providers/ | GKE, AKS, OpenShift, DigitalOcean, minikube notes |
| docs/resources/observability/ | Metrics catalog, PromQL cookbook, tracing setup; plus docs/resources/rdma/ for the network prerequisites |
| docs/proposals/ | Design proposals (autoscaler, batch gateway, non-Kubernetes mode, distributed tracing, …) |
| guides/ | The deployable recipes (“well-lit paths”, the “how”): per-guide Helm values files for the router chart + Kustomize overlays for model servers, per accelerator (NVIDIA/AMD/Intel XPU/Gaudi/TPU/CPU) |
| guides/recipes/ | Shared building blocks: base router values, base model-server Deployments, gateway install kustomizations (Istio, agentgateway, kgateway, GKE) |
| docker/ | Dockerfile.cuda, .rocm, .cpu, .hpu, which build the ghcr.io/llm-d/llm-d-cuda etc. images: vLLM plus the RDMA/P2P stack (UCX, NVSHMEM, NIXL, GDRCopy, DeepEP, LMCache, InfiniStore) |
| patches/ | NVSHMEM patches applied in those image builds |
| helpers/, scripts/, release/, .github/ | Benchmark harness docs, client setup, lint/CI plumbing |
The actual code, by sibling repo (from the table in docs/getting-started/artifacts.md, names only):
- llm-d-router (Go): the EPP (routing engine, plugin framework, flow control). Older docs call it llm-d-inference-scheduler; it builds on the GAIE endpoint-picker framework. Most architecture docs here link directly into its pkg/epp/framework/plugins/... tree.
- llm-d-routing-sidecar: the P/D routing proxy sidecar in decode pods (image ghcr.io/llm-d/llm-d-routing-sidecar).
- llm-d-kv-cache (Go/Python/C++): KV-block locality indexer and filesystem offloading connector.
- llm-d-latency-predictor (Python): XGBoost training + prediction sidecars for predicted-latency scheduling.
- llm-d-workload-variant-autoscaler (Go): SLO-aware autoscaler.
- llm-d-batch-gateway, llm-d-async (incubation): OpenAI Batch API and queue-based async processing.
- llm-d-benchmark (Python): the harness invoked by every guide’s benchmarking section.
- llm-d-inference-sim (Go): a GPU-free vLLM simulator (important for you; see section 7).
One subtlety worth knowing (from docs/getting-started/artifacts.md): the Helm charts themselves are currently published by GAIE (oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone in the quickstart; oci://ghcr.io/llm-d/charts/llm-d-router-standalone-dev in the v0.7 guides), with a note that publishing will move to llm-d. The boundary between GAIE and llm-d is deliberately thin.
4. The architecture as documented
4.1 Three core concepts
From docs/architecture/README.md: the llm-d Router (= Proxy + EPP), the InferencePool, and the Model Server.
- Proxy: A high-performance L7 proxy (typically Envoy) that accepts user requests and consults the EPP via the ext-proc protocol to determine the optimal destination.
- Endpoint Picker (EPP): The routing engine that scores and selects model server pods based on real-time metrics, KV-cache affinity, and configured policies.
InferencePool is described as an “LLM-optimized Service” — a label-selector grouping of pods serving one base model, with Variants (sub-groupings via pod labels, e.g. prefill vs decode roles, expressed as llm-d.ai/role: prefill|decode|prefill-decode).
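To make the "label-selector grouping plus Variants" idea concrete, here is a minimal sketch of the selection logic. It is not the InferencePool controller; the pod records, the app label, and the function names are hypothetical, and only the llm-d.ai/role key and its values come from the text above.

```python
from collections import defaultdict

ROLE_LABEL = "llm-d.ai/role"  # label key quoted in the text above


def select_pool(pods, match_labels):
    """Return the pods whose labels contain every key/value in match_labels."""
    return [
        pod for pod in pods
        if all(pod["labels"].get(k) == v for k, v in match_labels.items())
    ]


def group_variants(pods):
    """Group already-selected pods by role label (the 'Variants' of a pool)."""
    variants = defaultdict(list)
    for pod in pods:
        variants[pod["labels"].get(ROLE_LABEL, "unlabeled")].append(pod["name"])
    return dict(variants)


# Hypothetical pods; "app" is an illustrative selector label.
pods = [
    {"name": "p0", "labels": {"app": "demo-model", ROLE_LABEL: "prefill"}},
    {"name": "d0", "labels": {"app": "demo-model", ROLE_LABEL: "decode"}},
    {"name": "d1", "labels": {"app": "demo-model", ROLE_LABEL: "decode"}},
    {"name": "x0", "labels": {"app": "other-model", ROLE_LABEL: "decode"}},
]
pool = select_pool(pods, {"app": "demo-model"})
variants = group_variants(pool)
print(variants)
```

The point of the split is that pool membership and role are independent labels: one selector defines "pods serving this base model", and the role label carves that set into prefill and decode groups the scheduler can filter on.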
Terminology is pinned down in docs/architecture/core/router/README.md and worth internalizing because older blog posts disagree: llm-d Router = Proxy + EPP (the whole entry point); Inference Gateway = the Router when operating in Gateway Mode; Request Scheduler = the decision engine inside the EPP. The EPP also carries dual responsibilities — routing and “fairness and prioritization”, i.e. which requests run at all when consolidating multi-tenant workloads onto shared model servers.
4.2 Request flow: gateway → EPP → vLLM pod
From docs/architecture/core/router/README.md:
When an inference request arrives at the Proxy, the Proxy “parks” the request and initiates a callback to the EPP via the ext-proc (External Processing) protocol. The EPP evaluates the request against the current state of the InferencePool - considering factors like KV-cache locality, current load, and priority - and returns the address of the optimal model server pod back to the Proxy.
Inside the EPP, the lifecycle is enumerated in docs/architecture/core/router/epp/README.md:
- Request arrival at the proxy (Gateway).
- External processing: proxy invokes the EPP via ext-proc, passing headers and body.
- Request handling: parses the request (OpenAI HTTP, vLLM gRPC; parser is a plugin, so custom protocols slot in) into the internal InferenceRequest.
- Flow control: if enabled, queues, prioritizes, and “holds requests when the pool is saturated”.
- Request scheduling: Filter → Score → Pick against the InferencePool.
- Request proxying: EPP returns the chosen endpoint address; proxy forwards.
Asynchronously, a Data Layer watches the Kube API for pool membership, scrapes model-server metrics, and maintains in-memory state such as the prefix-cache tree; “consultant” sidecars (latency predictor, KV indexer, tokenizer) plug in here. The same doc warns that the only supported ext-proc body mode is FULL_DUPLEX_STREAMED.
Flow control deserves a closer look from a traffic engineer (docs/architecture/core/router/epp/README.md, deep dive in flow-control.md): saturation-gated admission via pluggable SaturationDetectors (e.g. a concurrency detector on per-endpoint in-flight counts), priority bands separating latency-sensitive chat from background batch, and two pluggable fairness layers — FairnessPolicy distributing dispatch opportunities among flows within a band (e.g. round robin) and OrderingPolicy ordering requests within a flow (FIFO, SLO-based). It is an application-level admission/queuing tier of the kind you’d otherwise build with Envoy adaptive concurrency plus priority queues — but keyed on inference signals.
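The three layers described above (saturation gate, priority bands, fairness across flows, ordering within a flow) fit in a few lines. This is a toy model under stated assumptions, not the llm-d-router implementation: the class and method names are hypothetical, a higher number is assumed to mean higher priority, the saturation check is a simple global in-flight cap, fairness is round robin, and ordering is FIFO.

```python
from collections import OrderedDict, deque


class FlowController:
    """Toy admission queue: strict priority across bands, round robin across
    flows inside a band, FIFO inside a flow. All names are hypothetical."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.bands = {}  # priority -> OrderedDict[flow_id, deque of requests]

    def enqueue(self, priority, flow_id, request):
        flows = self.bands.setdefault(priority, OrderedDict())
        flows.setdefault(flow_id, deque()).append(request)

    def saturated(self):
        # Stand-in for a concurrency-style saturation detector.
        return self.in_flight >= self.max_in_flight

    def dispatch(self):
        """Return the next request to admit, or None if saturated or empty."""
        if self.saturated():
            return None
        for priority in sorted(self.bands, reverse=True):  # highest band first
            flows = self.bands[priority]
            if not flows:
                continue
            flow_id, queue = next(iter(flows.items()))
            request = queue.popleft()      # FIFO within the flow
            del flows[flow_id]
            if queue:
                flows[flow_id] = queue     # rotate the flow to the back
            self.in_flight += 1
            return request
        return None

    def complete(self):
        self.in_flight -= 1


fc = FlowController(max_in_flight=3)
fc.enqueue(0, "batch", "b1")
fc.enqueue(10, "tenant-a", "a1")
fc.enqueue(10, "tenant-a", "a2")
fc.enqueue(10, "tenant-b", "c1")
order = [fc.dispatch() for _ in range(4)]
print(order)
```

In this run the two chat tenants alternate, the batch request waits behind them, and the fourth dispatch is held because the in-flight cap is reached; calling fc.complete() frees a slot and lets the batch request through.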
The scheduler (docs/architecture/core/router/epp/scheduling.md) is a weighted-scorer framework — recognizably the GAIE scheduling framework with llm-d’s plugin set. Scorers include kv-cache-utilization-scorer, queue-depth-scorer, prefix-scorer, lora-affinity-scorer, latency-scorer, session-affinity-scorer, and no-hit-lru-scorer (spreads cold prefills across the pool); pickers are max-score-picker, random-picker, weighted-random-picker. Where KV-cache-aware decisions happen: scoring, fed by the data layer.
The default “optimized baseline” policy is just YAML, shipped in this repo as Helm values (guides/optimized-baseline/router/optimized-baseline.values.yaml):
```yaml
pluginsCustomConfig:
  optimized-baseline-plugins.yaml: |
    apiVersion: llm-d.ai/v1alpha1
    kind: EndpointPickerConfig
    plugins:
      - type: queue-scorer
      - type: kv-cache-utilization-scorer
      - type: prefix-cache-scorer
      - type: no-hit-lru-scorer
    schedulingProfiles:
      - name: default
        plugins:
          - pluginRef: queue-scorer
            weight: 2
          - pluginRef: kv-cache-utilization-scorer
            weight: 2
          - pluginRef: prefix-cache-scorer
            weight: 3
          - pluginRef: no-hit-lru-scorer
            weight: 2
```
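A minimal sketch of what Filter → Score → Pick with these weights amounts to. The plugin names and weights are taken from the config above; everything else is an assumption for illustration: the scorer formulas, the endpoint fields, the readiness filter, and the function name are hypothetical, and no-hit-lru-scorer is left out to keep the example short.

```python
def schedule(endpoints, filters, scorers, weights):
    """Filter, then sum weight * score per endpoint, then pick the maximum.
    Each scorer maps an endpoint dict to a value in [0, 1]."""
    candidates = [ep for ep in endpoints if all(f(ep) for f in filters)]
    if not candidates:
        return None, 0.0
    totals = {
        ep["name"]: sum(weights[name] * fn(ep) for name, fn in scorers.items())
        for ep in candidates
    }
    best = max(totals, key=totals.get)  # behaves like a max-score picker
    return best, totals[best]


# Hypothetical scorer formulas; only the names and weights come from the YAML.
scorers = {
    "queue-scorer": lambda ep: 1.0 - min(ep["queue_depth"], 10) / 10.0,
    "kv-cache-utilization-scorer": lambda ep: 1.0 - ep["kv_utilization"],
    "prefix-cache-scorer": lambda ep: ep["prefix_hit_fraction"],
}
weights = {
    "queue-scorer": 2,
    "kv-cache-utilization-scorer": 2,
    "prefix-cache-scorer": 3,
}
filters = [lambda ep: ep["ready"]]

endpoints = [
    {"name": "pod-a", "ready": True, "queue_depth": 0, "kv_utilization": 0.2, "prefix_hit_fraction": 0.0},
    {"name": "pod-b", "ready": True, "queue_depth": 4, "kv_utilization": 0.6, "prefix_hit_fraction": 0.9},
    {"name": "pod-c", "ready": False, "queue_depth": 0, "kv_utilization": 0.0, "prefix_hit_fraction": 1.0},
]
best, total = schedule(endpoints, filters, scorers, weights)
print(best, round(total, 2))
```

With these made-up inputs the idle pod-a totals 3.6 and the busier but cache-warm pod-b totals 4.7, so the prefix weight of 3 outvotes the two load signals. That trade-off is exactly what the weights in the values file tune.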
4.3 The Envoy wiring (read this file first)
guides/no-kubernetes-deployment/router/envoy/envoy.yaml is the entire data plane in one static config. The selected endpoint is conveyed via a header consumed by an ORIGINAL_DST cluster:
```yaml
clusters:
  - name: original_destination_cluster
    type: ORIGINAL_DST
    connect_timeout: 1000s
    lb_policy: CLUSTER_PROVIDED
    original_dst_lb_config:
      use_http_header: true
      http_header_name: x-gateway-destination-endpoint
```
and the ext_proc filter that calls the EPP:
```yaml
- name: envoy.filters.http.ext_proc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
    grpc_service:
      envoy_grpc:
        cluster_name: ext_proc
        authority: localhost:9002
      timeout: 10s
    processing_mode:
      request_body_mode: FULL_DUPLEX_STREAMED
      response_body_mode: FULL_DUPLEX_STREAMED
```
That is the GAIE Endpoint Picker Protocol made concrete: EPP picks a pod IP, writes it into x-gateway-destination-endpoint, Envoy’s original-dst LB forwards to it. Response bodies also stream back through the EPP for post-processing (metrics, prefix-index updates).
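The header handoff can be modeled in a few lines. This is a conceptual stand-in, not Envoy or EPP code: the two function names are hypothetical, and the only detail taken from the config is the header name and the fact that its value is an IP:port pair.

```python
import ipaddress

DEST_HEADER = "x-gateway-destination-endpoint"  # header name from envoy.yaml


def epp_pick_headers(pod_ip, port):
    """What the picker conceptually hands back: one header naming the pod."""
    return {DEST_HEADER: f"{pod_ip}:{port}"}


def resolve_upstream(request_headers):
    """Proxy-side stand-in for the header-driven ORIGINAL_DST lookup."""
    value = request_headers.get(DEST_HEADER)
    if value is None:
        raise LookupError("no destination header, nothing to route to")
    host, _, port = value.rpartition(":")
    host = host.strip("[]")          # tolerate bracketed IPv6 literals
    ipaddress.ip_address(host)       # raises ValueError unless it is an IP
    return host, int(port)


headers = epp_pick_headers("10.0.0.7", 8000)
print(resolve_upstream(headers))
```

Because the proxy trusts this header to choose an upstream, a request that reaches the cluster without passing through the picker has no valid destination, which is why the sketch fails loudly on a missing header.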
4.4 Two deployment modes for the proxy
docs/architecture/core/router/proxy.md defines:
- Standalone Mode: Envoy runs as a sidecar in the EPP pod, ext-proc over localhost, no Gateway/HTTPRoute/controller needed. For testing, batch, RL pipelines, legacy-Ingress clusters.
- Gateway Mode (= “Inference Gateway”): the GAIE integration, in which a shared Gateway hosts HTTPRoutes whose backendRefs are InferencePools (group: inference.networking.k8s.io), alongside ordinary Service backends. Used for shared infra, multi-cluster LB, traffic splitting/canary. Supported providers documented: Istio, GKE Gateway, agentgateway (kgateway deprecated as of v0.7).
4.5 The well-lit paths
Indexed in docs/well-lit-paths/README.md and guides/README.md, grouped as: Intelligent Routing (optimized baseline; predicted-latency routing), Advanced KV-Cache Management (precise prefix-cache routing; tiered prefix cache offload to CPU/NVMe), Serving Large Models (P/D disaggregation; wide expert-parallelism), Operational Excellence (flow control; workload autoscaling; rollouts), Workloads (agentic inference; multimodal), and Experimental (async processing; batch gateway; no-Kubernetes deployment).
Prefix-cache-aware routing (docs/architecture/advanced/kv-management/prefix-cache-aware-routing.md) has two implementations:
| Feature | Approximate | Precise |
|---|---|---|
| Precision | Heuristic (character-based block hashing) | 100% (token-based) |
| State source | Local EPP assumptions after each routing decision | Real-time KVEvents from model servers over ZMQ |
| Dependencies | None | vLLM /v1/completions/render tokenizer endpoint, ZMQ |
The approximate path splits prompts into fixed-size blocks, keeps a rolling-hash LRU index of which prefixes were routed where, and “learns” from its own decisions. The precise path subscribes to vLLM KV-cache block add/evict events and maintains a global KV-block index (the llm-d-kv-cache component), with speculative indexing to cover the decision-to-event blind spot. guides/precise-prefix-cache-routing/router/precise-prefix-cache-routing.values.yaml shows the full production config including active-active HA EPP replicas.
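A minimal sketch of the approximate path, to show why it needs no engine cooperation. Assumptions are loud here: the block size, class name, and method names are hypothetical, and chained SHA-256 over character blocks is swapped in for whatever rolling hash the real plugin uses. The structure is the part that matches the description: fixed-size blocks, a hash per prefix, an LRU-bounded map of which endpoint was sent which prefix.

```python
import hashlib
from collections import OrderedDict

BLOCK_CHARS = 16  # hypothetical block size


def block_hashes(prompt, block_chars=BLOCK_CHARS):
    """Chained hashes of full fixed-size blocks: hash i covers blocks 0..i."""
    hashes, prev = [], b""
    full = len(prompt) - len(prompt) % block_chars  # ignore the partial tail
    for start in range(0, full, block_chars):
        block = prompt[start:start + block_chars].encode("utf-8")
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes


class ApproxPrefixIndex:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.index = OrderedDict()  # block hash -> set of endpoint names

    def record(self, prompt, endpoint):
        """Learn from a routing decision: assume endpoint now caches these blocks."""
        for h in block_hashes(prompt):
            self.index.setdefault(h, set()).add(endpoint)
            self.index.move_to_end(h)
        while len(self.index) > self.capacity:
            self.index.popitem(last=False)  # evict the least recently used hash

    def score(self, prompt, endpoint):
        """Fraction of the prompt's blocks matching as a prefix on endpoint."""
        hashes = block_hashes(prompt)
        matched = 0
        for h in hashes:
            if endpoint not in self.index.get(h, ()):
                break
            matched += 1
        return matched / len(hashes) if hashes else 0.0


shared = "S" * 64                      # four shared blocks
idx = ApproxPrefixIndex()
idx.record(shared + "A" * 32, "pod-a")
hit = idx.score(shared + "B" * 32, "pod-a")
miss = idx.score(shared + "B" * 32, "pod-b")
print(round(hit, 3), miss)
```

Note what the index actually stores: the router's own past decisions, not the engine's cache contents. If the engine evicts those blocks, this index does not find out, which is the gap the precise path closes with KV events.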
P/D disaggregation (docs/architecture/advanced/disaggregation/README.md) — the EPP’s disagg-profile-handler runs a decode profile, asks a decider plugin whether the uncached suffix on the chosen decode pod is large enough to justify disaggregation, and only then runs the prefill profile. Decode endpoint becomes the proxy’s primary destination; the prefill endpoint rides along in the x-prefiller-host-port header. The sequence, from that doc:
```mermaid
sequenceDiagram
    Client->>Proxy: Request
    Proxy-->>EPP: Run EPP protocol
    EPP-->>Proxy: Selects P Worker and D Worker
    Proxy->>DSidecar: Request
    DSidecar->>PWorker: Request with max_tokens=1, do_remote_decode=True
    PWorker->>DSidecar: Response with KVTransferParams
    DSidecar->>DWorker: Request with KVTransferParams and do_remote_prefill=True
    DWorker-->>PWorker: Pull KV Cache (NIXL RDMA)
```
KV transfer uses NIXL (NVIDIA’s transfer library, from the ai-dynamo org — shared infrastructure with Dynamo) over UCX/UCCL/libfabric on IB/RoCE/EFA; TCP fallback exists “for testing and development” only. The vLLM protocol (nixlv2) is two-phase sequential; SGLang’s is concurrent with out-of-band bootstrap coordination. The corresponding EPP config is in guides/pd-disaggregation/router/pd-disaggregation.values.yaml — two schedulingProfiles named prefill and decode, each with its own prefill-filter/decode-filter plus prefix/queue/kv-utilization scorers, composed by disagg-profile-handler. The reference deployment is openai/gpt-oss-120b with 8 TP=1 prefill instances and 2 TP=4 decode instances (“heterogeneous parallelism”; tune your xPyD ratio to your ISL/OSL).
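The decide-then-prefill logic described above reduces to a threshold on the uncached suffix. A minimal sketch under stated assumptions: the function names and the threshold value are hypothetical, and the real decider plugin may use a different criterion. Only the two header names come from this guide.

```python
def plan_request(prompt_tokens, cached_prefix_tokens, min_uncached_tokens,
                 decode_ep, pick_prefill):
    """Return (decode_endpoint, prefill_endpoint_or_None).
    The prefill profile (pick_prefill) only runs when the suffix is large."""
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    if uncached < min_uncached_tokens:
        return decode_ep, None       # decode pod handles its own short prefill
    return decode_ep, pick_prefill()


def routing_headers(decode_ep, prefill_ep):
    """Decode pod is the primary destination; prefill rides along if chosen."""
    headers = {"x-gateway-destination-endpoint": decode_ep}
    if prefill_ep is not None:
        headers["x-prefiller-host-port"] = prefill_ep
    return headers


THRESHOLD = 512  # hypothetical cutoff, in tokens

warm = plan_request(4096, 3900, THRESHOLD, "10.0.0.5:8000", lambda: "10.0.0.9:8000")
cold = plan_request(4096, 0, THRESHOLD, "10.0.0.5:8000", lambda: "10.0.0.9:8000")
print(routing_headers(*warm))
print(routing_headers(*cold))
```

The warm request has only 196 uncached tokens, so it goes straight to the decode pod; the cold one carries the prefill address for the sidecar to use. This ordering is why prefix locality on the decode side is evaluated first: a good cache hit makes the KV transfer unnecessary.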
Wide expert-parallelism (guides/wide-ep-lws/README.md) — DeepSeek-R1 at DP=16 prefill + DP=16 decode over 32 H200/B200 GPUs, deployed with the LeaderWorkerSet (LWS) controller, requiring full-mesh all-to-all RDMA for DeepEP (“rail-only connectivity will fail”). This is the path the custom ghcr.io/llm-d/llm-d-cuda images (built from docker/Dockerfile.cuda with NVSHMEM/DeepEP patches from patches/) exist for.
4.6 Role of the Gateway API Inference Extension
Cross-reference for your gateway-api-inference-extension study: llm-d consumes GAIE at three layers, all visible in this repo.
- CRDs: every guide starts with kubectl apply -f .../gateway-api-inference-extension/releases/download/v1.5.0/v1-manifests.yaml (InferencePool is inference.networking.k8s.io/v1; InferenceObjective and InferenceModelRewrite are llm-d.ai/v1alpha2; see docs/api-reference/README.md).
- Charts: the standalone/gateway router Helm charts are published from GAIE’s config/charts/ today (docs/getting-started/artifacts.md).
- Framework: the EPP in llm-d-router is built on GAIE’s endpoint-picker framework and Endpoint Picker Protocol; llm-d adds the plugin set (precise prefix cache, disagg handler, latency predictor, flow control policies) and the production recipes.
PROJECT.md’s “no forks” principle is the governing relationship.
In short: GAIE is the routing framework and API; llm-d is the opinionated, benchmarked distribution of it, plus the engine-side pieces (sidecar, KV indexer, images) GAIE doesn’t own.
4.7 Deployment shape: what a minimal install looks like
From docs/getting-started/quickstart.md, an install is exactly three commands, the first of which applies the GAIE CRDs:
```shell
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/${GAIE_VERSION}/v1-manifests.yaml

helm install ${GUIDE_NAME} \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
  -f guides/recipes/router/base.values.yaml \
  -f guides/optimized-baseline/router/optimized-baseline.values.yaml \
  -n ${NAMESPACE} --version ${GAIE_VERSION}

kubectl apply -n ${NAMESPACE} -k guides/optimized-baseline/modelserver/gpu/vllm/
```
Pattern: Helm for the router (chart from GAIE/llm-d registries, configured by layered values files from this repo) + Kustomize for the model servers (base Deployment in guides/recipes/modelserver/base/single-host/default/decode-deployment.yaml — a plain Deployment labeled llm-d.ai/role: decode with /health + /v1/models probes — patched per guide/accelerator, e.g. guides/optimized-baseline/modelserver/gpu/vllm/base/patch-vllm.yaml setting vllm serve Qwen/Qwen3-32B --tensor-parallel-size=2 --gpu-memory-utilization=0.95, 8 replicas). The EPP discovers pods via the InferencePool’s matchLabels (llm-d.ai/guide: "optimized-baseline"). Worth reading in guides/recipes/router/base.values.yaml: the Envoy sidecar args block, with an inline comment explaining why --log-level warn and --concurrency 8 override the chart defaults (trace logging and hardware_concurrency() worker threads oversubscribing the cgroup CPU slice — your kind of footnote).
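The "layered values files" half of that pattern relies on how Helm combines multiple -f files: later files win, nested maps merge key by key, and lists are replaced wholesale rather than concatenated. A small sketch of that merge rule follows; the keys in the example dicts are hypothetical and do not reflect the real chart schema, and Helm's null-deletes-key behavior is not modeled.

```python
def merge_values(base, override):
    """Deep-merge two values dicts the way layered -f files behave:
    nested maps merge key by key, anything else in the later file wins."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced, not appended
    return merged


# Hypothetical keys, for illustration only.
base_values = {
    "router": {"replicas": 1, "envoy": {"args": ["--log-level", "trace"]}},
}
guide_values = {
    "router": {
        "envoy": {"args": ["--log-level", "warn", "--concurrency", "8"]},
        "pluginsConfigFile": "optimized-baseline-plugins.yaml",
    },
}
merged = merge_values(base_values, guide_values)
print(merged)
```

The list replacement is the part that bites in practice: a guide-level file that sets an args list has to restate every argument it wants to keep from the layer below.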
5. Suggested reading path
- README.md, then PROJECT.md: what it is, who runs it, the no-forks upstream policy.
- docs/architecture/README.md → docs/architecture/core/router/README.md → proxy.md: Router = Proxy + EPP, standalone vs gateway modes.
- guides/no-kubernetes-deployment/router/envoy/envoy.yaml + router/epp/config.yaml: the whole data plane and a complete EndpointPickerConfig in two files, no cluster abstractions in the way.
- docs/architecture/core/router/epp/README.md → scheduling.md → flow-control.md → datalayer.md → configuration.md: the EPP internals; then docs/api-reference/endpointpickerconfig.md and epp-http-headers.md as reference.
- docs/architecture/core/inferencepool.md + docs/api-reference/inferencepool.md: the CRD bridge to GAIE.
- guides/optimized-baseline/README.md (skim manifests, read the benchmark report at the bottom): the canonical deployment and the RR-vs-EPP numbers.
- docs/architecture/advanced/kv-management/ (all four files) and guides/precise-prefix-cache-routing/: approximate vs precise cache-aware routing.
- docs/architecture/advanced/disaggregation/README.md + guides/pd-disaggregation/README.md + guides/recipes/modelserver/base/single-host/pd/vllm/patch-sidecar.yaml: P/D end to end, including the sidecar.
- guides/wide-ep-lws/README.md + docker/Dockerfile.cuda: wide-EP and what actually goes into the engine image.
- docs/resources/observability/: the operational surface. metrics.md for enabling scraping, promql.md for ready-made queries over the EPP scheduler metrics (inference_pool_per_pod_queue_size, inference_extension_prefix_indexer_hit_ratio, per-plugin latency distributions, all cataloged in the tables at the bottom of docs/architecture/core/router/epp/scheduling.md) plus vLLM metrics, and tracing.md for OTel setup.
- Skim docs/well-lit-paths/ for the rest (flow control, autoscaling, batch, agentic), and docs/proposals/ if you want to see where the project is heading (non-kubernetes-mode.md, distributed-tracing.md, autoscaler.md).
6. Connections to your other study repos
- gateway-api-inference-extension: the routing brain llm-d builds on. Everything in section 4.6; read GAIE’s Endpoint Picker Protocol proposal alongside docs/architecture/core/router/epp/README.md and the Envoy config above. llm-d is effectively GAIE’s flagship consumer plus an extended plugin catalog (the EPP code itself lives in sibling repo llm-d-router).
- vllm: the engine being orchestrated. llm-d leans on vLLM features you can study in that repo: automatic prefix caching (what the prefix scorers exploit), KV-cache events over ZMQ (what precise routing consumes), the KV-connector/NIXL interface and kv_transfer_params (what P/D rides on), /v1/completions/render tokenization, data-parallel deployment (one pod, multiple endpoints), and the OffloadingConnector (tiered prefix cache).
- dynamo: the closest competitor, from NVIDIA, and the sharpest contrast. Dynamo is a self-contained distributed runtime (its own Rust router/frontend, etcd/NATS control plane, planner) that runs on or off Kubernetes; llm-d is deliberately not a runtime: it reuses Kubernetes primitives (Gateway API, Deployments, LWS, HPA) and Envoy, adding only the EPP and sidecars. They share NIXL for KV transfer (note NIXL_REPO=github.com/ai-dynamo/nixl in docker/Dockerfile.cuda). Compare Dynamo’s KV-aware router with the EPP’s scorer pipeline, and Dynamo’s planner with llm-d’s Workload Variant Autoscaler.
- sglang: the second engine, first-class in the optimized-baseline and P/D guides (guides/optimized-baseline/modelserver/gpu/sglang/), with its own concurrent bootstrap-room KV-transfer protocol documented in docs/architecture/advanced/disaggregation/README.md; contrast with vLLM’s sequential nixlv2.
- nano-vllm: a minimal engine is the right mental model for what an llm-d “endpoint” is: the EPP only needs an OpenAI-compatible HTTP surface plus standard metrics; everything llm-d scores (queue depth, KV utilization, prefix reuse) maps to structures you can see in nano-vllm’s scheduler and block manager in a few hundred lines.
- xgrammar / flashinfer: below llm-d’s abstraction line. They live inside the engine pods (structured-output constraints and attention/sampling kernels respectively); llm-d never sees them except as their effects on per-request latency and throughput, exactly the signals the latency predictor learns. flashinfer also illustrates why decode is memory-bandwidth-bound, which is the entire premise of P/D specialization.
7. Hands-on without a 16-GPU cluster
Honest assessment: the headline paths are out of home reach — the optimized baseline wants 16 GPUs (8×TP=2 for Qwen3-32B), P/D wants RDMA between nodes, wide-EP wants 32 H200s. With one RTX 5080 (16 GB) on Windows, treat this as a docs-and-configs repo plus three realistic exercises:
- No-Kubernetes deployment, scaled down (best option). guides/no-kubernetes-deployment/README.md runs the real stack (EPP container + Envoy container + vLLM worker(s)) with Docker only, endpoints declared in a YAML file (file-discovery plugin, hot-reloaded via atomic rename). Under WSL2 with CUDA Docker, substitute the 32B model with something that fits 16 GB (e.g. a 4–8B model at --tensor-parallel-size=1) and drop the TP/shm settings; the EPP and Envoy configs are model-agnostic. You can then watch scheduling happen: send repeated shared-prefix prompts, scrape EPP metrics on :9090 (inference_extension_prefix_indexer_hit_ratio, per-pod queue gauges), and poke Envoy admin on :19000 to see the ext_proc cluster and original-dst routing.
- GPU-free routing experiments with the simulator. docs/getting-started/artifacts.md lists sibling repo llm-d-inference-sim, a “GPU-free vLLM simulator.” Point the file-discovery endpoints.yaml (or a kind/minikube InferencePool) at several simulator instances and you can exercise the full EPP plugin pipeline (scorer weights, flow-control priority bands, even the disagg profile handler) with zero accelerators. For a traffic engineer, this is the highest-signal-per-watt exercise in the project.
- kind/minikube for the control plane only. The GAIE CRDs, router Helm chart, Gateway + HTTPRoute wiring (guides/recipes/gateway/), and EPP all install on a CPU-only cluster; only the model-server Kustomize step needs GPUs (swap in CPU vLLM via guides/optimized-baseline/modelserver/cpu/vllm/, though it wants 64 cores/replica, or the simulator). Note docs/infra-providers/minikube/README.md is currently a stub (“TBD”), so expect to adapt the quickstart yourself.
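For the "send repeated shared-prefix prompts" step in these exercises, a small generator of request bodies is enough. This sketch only builds the JSON; it assumes an OpenAI-style completions body (model, prompt, max_tokens), and the model name, prefix text, and function name are placeholders to replace with whatever your scaled-down setup serves.

```python
import json


def shared_prefix_payloads(model, shared_prefix, questions, max_tokens=32):
    """Build completion request bodies that all begin with the same prefix,
    so a prefix-aware scorer has something to match on."""
    return [
        {
            "model": model,
            "prompt": f"{shared_prefix}\n\nQuestion: {q}\nAnswer:",
            "max_tokens": max_tokens,
        }
        for q in questions
    ]


# Placeholder values, for illustration only.
prefix = "You are a careful assistant. Use the reference notes below. " * 20
payloads = shared_prefix_payloads(
    model="my-small-model",
    shared_prefix=prefix,
    questions=["What is ext-proc?", "What is an InferencePool?", "What is NIXL?"],
)
for body in payloads:
    print(json.dumps(body)[:80], "...")
```

POST each body to the router's completions endpoint with the HTTP client of your choice, then interleave a second prefix family and watch whether requests with the same prefix keep landing on the same endpoint.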
Pure reading also pays here more than in most repos: the architecture docs are recent (v0.7), unusually candid about trade-offs (approximate-vs-precise tables, P/D “not a target for all workloads” guidance, NIXL TCP-is-dev-only warnings), and every claim is tied to a manifest you can open in the same checkout.