Gateway API Inference Extension
Kubernetes SIG project (kubernetes-sigs/gateway-api-inference-extension, “GIE” or “Inference Gateway”) that turns any ext-proc-capable, Gateway API-conformant proxy — Envoy Gateway, Istio, kgateway, GKE Gateway, agentgateway — into an inference-aware L7 load balancer. It ships two things: a set of CRDs (InferencePool and friends) that model a fleet of model-server pods as a routable backend, and the Endpoint Picker (EPP), a Go gRPC server implementing Envoy’s External Processing protocol that picks the destination pod per request using live vLLM/SGLang/TensorRT-LLM metrics.
Repo status note (important for contribution targeting). Per README.md:16-23, the EPP, the InferenceObjective/InferenceModelRewrite APIs, and the Body-Based Router are migrating to llm-d/llm-d-inference-scheduler and llm-d/llm-d-inference-payload-processor; this repo remains the home of the InferencePool API, the lightweight EPP (lwepp), and the conformance suite. This checkout still contains the complete EPP (pkg/epp/), so it is the right place to study the architecture, but new scheduler/EPP PRs should go to the llm-d org. Everything below cites code in this repo.
1. What it is, and what classic LB cannot do
Classic load balancing assumes requests are roughly interchangeable and backends are roughly stateless: round-robin, least-request, EWMA, maglev/ring-hash for affinity. LLM serving violates every assumption:
- Per-request cost variance is enormous. A request is not a unit of work; its cost is O(prompt_tokens) for prefill plus O(output_tokens × batch_interference) for decode. Two in-flight requests can differ by 1000x in GPU-seconds. Least-request counts requests, not tokens.
- Backends are stateful. Each replica holds a KV cache (prefix cache). Routing a request to a pod that already has its prompt prefix cached skips most of prefill, but no L7 hash policy can know which pod has which prefix blocks resident, because residency changes with every admission and eviction.
- Latency is load-dependent and saturation is a cliff. Once a vLLM replica’s KV blocks are exhausted it starts queueing or preempting (recompute), and decode latency for everyone in the batch degrades. You want load-aware routing on queue depth and KV utilization, and admission control before the cliff, not after.
- LoRA multiplexing. A pod can only hold N adapters in GPU memory; routing a request for adapter X to a pod that must first swap X in costs hundreds of ms.
The project’s answer: leave the data plane in Envoy, and put a per-request scheduling decision into an external processor that watches model-server metrics at 50ms granularity. The EPP filters/scores/picks among the pods of an InferencePool and tells Envoy where to send the request via a header consumed by an ORIGINAL_DST cluster. Selection criteria (from README.md:71-81): KV-cache pressure, queue depth, prefix-cache affinity, LoRA adapter residency, request priority — with saturation-aware shedding of low-priority traffic.
2. Why you care
This is precisely the system a traffic-infra engineer sketches from first principles after learning how continuous batching works: “I need least-loaded routing where load = queue depth + KV pressure, plus consistent-hash-like affinity on prompt prefixes, plus load shedding before the saturation cliff — and I want it in my existing Envoy fleet, not a bespoke proxy.”
- The EPP is an Envoy ext-proc server (envoy.service.ext_proc.v3.ExternalProcessor) speaking full-duplex streamed gRPC. The protocol, processing modes, header mutations, dynamic metadata, ClearRouteCache, ImmediateResponse for 429/503: all the machinery is the Envoy machinery you already know.
- The scheduling layer is a clean filter → scorer → picker plugin pipeline (deliberately modeled on kube-scheduler), so the inference-specific knowledge is encapsulated in ~100-line plugins with explicit scoring math. You can read the entire decision logic in an afternoon.
- It is the vendor-neutral standard for this layer: Envoy Gateway, Istio, kgateway, GKE, and agentgateway all implement InferencePool conformance; llm-d (Google/Red Hat/IBM + vLLM collaboration) builds its inference scheduler directly on this framework. Contributions here (or in llm-d/llm-d-inference-scheduler, where the EPP now lives) land in every implementation.
- It’s GA (v1 InferencePool), but the interesting frontiers are wide open: flow control / fairness (pkg/epp/flowcontrol/ is explicitly experimental), P/D-disaggregation profiles, latency-SLO prediction-based scheduling, multi-cluster InferencePoolImport. Envoy + Go + queueing theory is exactly your toolkit.
3. Architecture map
    client ──HTTP──> Gateway (Envoy)                        [data plane]
                       │  ext_proc filter (FULL_DUPLEX_STREAMED gRPC)
                       ▼
                     EPP (this repo)                        [decision plane]
                       ├─ handlers/        ext-proc protocol state machine
                       ├─ requestcontrol   Director: parse → admit → schedule → mutate
                       ├─ flowcontrol/     priority queues, fairness, saturation
                       ├─ scheduling/      filters → scorers → picker (plugins)
                       ├─ datalayer/       per-pod collectors ──/metrics──> vLLM pods
                       └─ controller/      watches InferencePool/Pods/Objectives
                       │
                       └──returns x-gateway-destination-endpoint = ip:port
                       ▼
                     Envoy ORIGINAL_DST cluster ──> chosen model-server pod
The APIs
- InferencePool (v1, api/v1/inferencepool_types.go:32): the backend type you put in an HTTPRoute.backendRefs. Spec is just three things: a label selector for member pods, targetPorts (1-8 ports; each ip:port is a distinct endpoint, used for data-parallel ranks), and endpointPickerRef pointing at the EPP Service. endpointPickerRef.failureMode (api/v1/inferencepool_types.go:168-189) is FailClose by default, exactly Envoy ext-proc’s failure_mode_allow semantics surfaced as API. Status carries per-parent Accepted/ResolvedRefs conditions written by each Gateway controller.
- InferenceObjective (v1alpha2, apix/v1alpha2/inferenceobjective_types.go:60-73): attaches a workload identity and an integer priority (default 0; priority < 0 = sheddable, see pkg/epp/util/request/sheddable.go:20-22) to requests, selected via the x-gateway-inference-objective header.
- InferenceModelRewrite (v1alpha2): maps client-facing model names to backend models/adapters with weighted splits (the EPP rewrites "model" in the JSON body and un-rewrites it in responses, pkg/epp/handlers/server.go:446-460).
- InferencePoolImport (v1alpha1): experimental multi-cluster pool export/import.
- EndpointPickerConfig (apix/config/v1alpha1/endpointpickerconfig_types.go:33): not a CRD but a config-file schema (mounted ConfigMap) declaring which plugins to instantiate, scheduling profiles with per-scorer weights, flow control, saturation detector, and parser. This file is the EPP’s “scheduler policy” surface.
The EPP process
cmd/epp/main.go → cmd/epp/runner/runner.go wires everything: a controller-runtime manager watching the pool/pods, the datalayer runtime (one collector goroutine per endpoint), the plugin registry (cmd/epp/runner/runner.go:459-509 registers every in-tree plugin factory), the scheduler, optional flow control, and the gRPC ext-proc server (default port 9002, gRPC health on 9003, Prometheus on 9090 — pkg/epp/server/options.go:93-118). The default plugin config when you supply none is queue + KV-cache + prefix scorers (pkg/epp/config/loader/defaults.go:46-103).
cmd/lwepp/ is the lightweight EPP that stays in this repo: same ext-proc protocol and InferencePool discovery, but a trivial round-robin picker (pkg/lwepp/handlers/server.go:84-89) and no metrics scraping — a minimal reference data plane for conformance and for gateways that want endpoint subsetting without smart scheduling (it recently gained port-aware filtering for data-parallel ranks).
4. Core mechanisms
4.1 The ext-proc flow
Where it sits in Envoy: the canonical config used by e2e tests is test/testdata/envoy.yaml. The ext_proc HTTP filter runs before the router with all body modes set to full-duplex streaming:
# test/testdata/envoy.yaml:97-109
- name: envoy.filters.http.ext_proc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
    grpc_service:
      envoy_grpc:
        cluster_name: ext_proc
        authority: vllm-qwen3-32b-epp.$E2E_NS:9002
      timeout: 10s
    processing_mode:
      request_header_mode: SEND
      response_header_mode: SEND
      request_body_mode: FULL_DUPLEX_STREAMED
      response_body_mode: FULL_DUPLEX_STREAMED
and the route targets an ORIGINAL_DST cluster that takes its destination from the EPP-set header — this is the whole trick that lets an external process steer per-request endpoint selection without EDS churn:
# test/testdata/envoy.yaml:151-162
- name: original_destination_cluster
  type: ORIGINAL_DST
  connect_timeout: 1000s
  lb_policy: CLUSTER_PROVIDED
  circuit_breakers:
    thresholds:
    - max_connections: 40000
      max_pending_requests: 40000
      max_requests: 40000
  original_dst_lb_config:
    use_http_header: true
    http_header_name: x-gateway-destination-endpoint
Per request, Envoy streams to the EPP: RequestHeaders → N× RequestBody chunks (EoS on last) → later ResponseHeaders → N× ResponseBody chunks. The EPP’s Process() loop (pkg/epp/handlers/server.go:162) runs a per-stream state machine (StreamRequestState, pkg/epp/handlers/server.go:142-154) with a single reader goroutine and a select that can also fire an eviction channel — flow control can yank an already-queued request mid-stream by sending ImmediateResponse(429) (pkg/epp/handlers/server.go:467-480).
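The single-reader-plus-select shape is easy to see with plain channels. The following is a simplified sketch, not the code in pkg/epp/handlers/server.go: the msg type, the outcome strings, and the processStream name are all hypothetical, and a real ext-proc stream carries protobuf messages over gRPC rather than values on a Go channel.

```go
package main

import "fmt"

// msg is a hypothetical stand-in for one ext-proc ProcessingRequest.
type msg struct {
	kind        string // "headers" or "body"
	endOfStream bool
}

// processStream consumes messages until the request body ends, the
// stream closes, or an eviction signal arrives, whichever comes first.
// The returned outcome names are illustrative only.
func processStream(msgs <-chan msg, evicted <-chan struct{}) string {
	for {
		select {
		case m, ok := <-msgs:
			if !ok {
				return "aborted" // peer closed the stream before the body ended
			}
			if m.kind == "body" && m.endOfStream {
				return "schedule" // full body seen: parsing and scheduling would run here
			}
		case <-evicted:
			return "evicted" // flow control pulled the request: reply 429
		}
	}
}

func main() {
	msgs := make(chan msg, 3)
	msgs <- msg{kind: "headers"}
	msgs <- msg{kind: "body"}
	msgs <- msg{kind: "body", endOfStream: true}
	fmt.Println(processStream(msgs, make(chan struct{}))) // schedule

	evicted := make(chan struct{})
	close(evicted)
	fmt.Println(processStream(make(chan msg), evicted)) // evicted
}
```

The point of the select is that eviction does not have to wait for the next message from Envoy: a request that is parked in a queue produces no inbound traffic, so only a second channel can wake the loop.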
The scheduling decision happens at request-body-EoS (it needs the parsed body for model name and prefix hashing): pkg/epp/handlers/server.go:319-348 parses the body, calls director.HandleRequest, then builds the response Envoy applies. The decision is communicated both as a header mutation and as dynamic metadata (namespace envoy.lb, key x-gateway-destination-endpoint — pkg/epp/metadata/consts.go:26-28), because gateway integrations differ in which they consume:
// pkg/epp/handlers/request.go:87-99
return &extProcPb.ProcessingResponse{
	Response: &extProcPb.ProcessingResponse_RequestHeaders{
		RequestHeaders: &extProcPb.HeadersResponse{
			Response: &extProcPb.CommonResponse{
				ClearRouteCache: true,
				HeaderMutation: &extProcPb.HeaderMutation{
					SetHeaders: s.generateHeaders(ctx, reqCtx),
				},
			},
		},
	},
	DynamicMetadata: dynamicMetadata,
}
The full EPP↔proxy contract is specified in docs/proposals/004-endpoint-picker-protocol/README.md: the proxy may pass a candidate subset via filter metadata envoy.lb.subset_hint / x-gateway-destination-endpoint-subset; the EPP may return a comma-separated fallback list of endpoints (pkg/epp/requestcontrol/director.go:329-333 joins multiple picks) that Envoy walks on retry; 503 if no ready endpoint, 429 to shed. Bodyless requests (e.g. GET /health) get a random endpoint at header time (pkg/epp/handlers/request.go:41-54), since there is no body end-of-stream to wait for. Response bodies are streamed back through the EPP so it can observe usage/latency per chunk and run post-response plugins; the EPP is on the data path for the whole response, which is why the protocol insists on full-duplex streaming.
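To make the comma-separated fallback list concrete, here is a minimal sketch of how a consumer of x-gateway-destination-endpoint might split and validate it. The parseDestinations helper is hypothetical, and the rule that every host must be an IP literal is an assumption drawn from the ip:port wording above rather than from the protocol document.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseDestinations splits a header value such as
// "10.0.0.1:8000,10.0.0.2:8000" into ordered host:port entries.
// The first entry is the primary pick; the rest are fallbacks.
func parseDestinations(header string) ([]string, error) {
	var out []string
	for _, part := range strings.Split(header, ",") {
		ep := strings.TrimSpace(part)
		if ep == "" {
			continue
		}
		host, port, err := net.SplitHostPort(ep)
		if err != nil {
			return nil, fmt.Errorf("bad endpoint %q: %w", ep, err)
		}
		// Assumption of this sketch: endpoints are pod IPs, never hostnames.
		if net.ParseIP(host) == nil {
			return nil, fmt.Errorf("endpoint %q: host is not an IP", ep)
		}
		out = append(out, net.JoinHostPort(host, port))
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("no endpoints in %q", header)
	}
	return out, nil
}

func main() {
	eps, err := parseDestinations("10.0.0.1:8000, 10.0.0.2:8000")
	fmt.Println(eps, err)

	_, err = parseDestinations("10.0.0.1")
	fmt.Println(err != nil) // true: the port is missing
}
```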
The Director orchestration sequence (pkg/epp/requestcontrol/director.go:151-229): model rewrite → resolve InferenceObjective/priority → admission control → locate candidates (subset hint or full pool) → run PrepareData plugins (e.g. prefix matching) → run admission plugins → schedule → set TargetEndpoint + run PreRequest plugins.
4.2 The scheduling framework
pkg/epp/scheduling/scheduler.go:54 runs a loop of profiles chosen by a ProfileHandler (multi-profile exists for P/D disaggregation: a “prefill” profile and a decode profile can each pick an endpoint per request; see the hardcoded experimentalDefaultPrefillProfile = "prefill" in pkg/epp/framework/plugins/requestcontrol/dataproducer/approximateprefix/types.go:77-86). Each SchedulerProfile.Run is the kube-scheduler pattern:
// pkg/epp/scheduling/scheduler_profile.go:117-128
func (p *SchedulerProfile) Run(ctx context.Context, request *fwksched.InferenceRequest, cycleState *fwksched.CycleState, candidateEndpoints []fwksched.Endpoint) (*fwksched.ProfileRunResult, error) {
	endpoints := p.runFilterPlugins(ctx, request, cycleState, candidateEndpoints)
	if len(endpoints) == 0 {
		return nil, errcommmon.Error{Code: errcommmon.Internal, Msg: "no endpoints available for the given request"}
	}
	// if we got here, there is at least one endpoint to score
	weightedScorePerEndpoint := p.runScorerPlugins(ctx, request, cycleState, endpoints)
	result := p.runPickerPlugin(ctx, cycleState, weightedScorePerEndpoint)
	return result, nil
}
Scorers return map[endpoint]float64 clamped to [0,1]; the profile accumulates score × weight (pkg/epp/scheduling/scheduler_profile.go:165-168). The default profile (pkg/epp/config/loader/defaults.go:47-49 and helm config/charts/epplib/templates/_config.yaml:77-84) is:
| plugin | weight |
|---|---|
| queue-scorer | 2 |
| kv-cache-utilization-scorer | 2 |
| prefix-cache-scorer | 3 |
| max-score-picker (injected by default) | n/a |
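A worked example helps show how the 2/2/3 weights interact. The sketch below recomputes the three default scores for three made-up pods and takes the maximum. The pod struct, field names, and sample numbers are hypothetical; only the weights and the scoring formulas (min-max on queue length, 1 minus KV usage, matched-prefix ratio) come from the text.

```go
package main

import "fmt"

// pod holds the three signals the default profile scores on.
// Field names are illustrative, not the repo's types.
type pod struct {
	name        string
	waiting     int     // waiting-queue length
	kvUsage     float64 // KV-cache utilization in [0,1]
	prefixRatio float64 // matched prefix blocks / total prompt blocks
}

// Default profile weights as quoted in the text.
const (
	queueWeight  = 2.0
	kvWeight     = 2.0
	prefixWeight = 3.0
)

// totals returns each pod's weighted score sum, in input order.
func totals(pods []pod) []float64 {
	if len(pods) == 0 {
		return nil
	}
	lo, hi := pods[0].waiting, pods[0].waiting
	for _, p := range pods {
		if p.waiting < lo {
			lo = p.waiting
		}
		if p.waiting > hi {
			hi = p.waiting
		}
	}
	out := make([]float64, len(pods))
	for i, p := range pods {
		queue := 1.0 // all queues equal: every pod gets the same score
		if hi != lo {
			queue = float64(hi-p.waiting) / float64(hi-lo)
		}
		kv := 1 - p.kvUsage
		out[i] = queueWeight*queue + kvWeight*kv + prefixWeight*p.prefixRatio
	}
	return out
}

// best returns the highest-scoring pod. The first pod wins ties here;
// a shuffle before the scan would make tie-breaking random.
func best(pods []pod) string {
	winner, top := "", -1.0
	for i, s := range totals(pods) {
		if s > top {
			winner, top = pods[i].name, s
		}
	}
	return winner
}

func main() {
	pods := []pod{
		{"pod-a", 0, 0.2, 0.0},  // idle, but no cached prefix
		{"pod-b", 4, 0.6, 0.75}, // busiest, partial prefix hit
		{"pod-c", 2, 0.5, 1.0},  // moderately loaded, full prefix hit
	}
	for i, s := range totals(pods) {
		fmt.Printf("%s %.2f\n", pods[i].name, s)
	}
	fmt.Println("picked:", best(pods)) // picked: pod-c
}
```

With these numbers the idle pod loses to the moderately loaded pod that already holds the whole prefix: the prefix weight of 3 contributes more than the queue and KV terms give back. That is the intended bias of the default profile, and it is also why the saturation filter exists as a separate guard.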
The actual plugin inventory (registered in cmd/epp/runner/runner.go:459-509):
Scorers (pkg/epp/framework/plugins/scheduling/scorer/):
- queue-scorer: min-max normalization of vLLM’s waiting-queue length across candidates:
// pkg/epp/framework/plugins/scheduling/scorer/queuedepth/queue.go:93-100
// endpointScoreFunc calculates the score based on the queue size of each endpoint. Longer queue gets a lower score.
endpointScoreFunc := func(endpoint framework.Endpoint) float64 {
	if maxQueueSize == minQueueSize {
		// If all pods have the same queue size, return a neutral score
		return 1.0
	}
	return float64(maxQueueSize-endpoint.GetMetrics().WaitingQueueSize) / float64(maxQueueSize-minQueueSize)
}
- kv-cache-utilization-scorer: simply 1 - KVCacheUsagePercent (pkg/epp/framework/plugins/scheduling/scorer/kvcacheutilization/kvcache_utilization.go:76-82). Absolute, not normalized: on an idle fleet every pod scores 1.0 here, and the queue scorer also returns 1.0 for every pod when all queues are equal, so the prefix scorer or the picker’s random tie-break decides.
- prefix-cache-scorer: matchedBlocks / totalBlocks of the prompt found in the per-pod prefix index (pkg/epp/framework/plugins/scheduling/scorer/prefix/plugin.go:108-111); see below for how the index is built.
- lora-affinity-scorer: tiered constants from adapter residency metrics:
// pkg/epp/framework/plugins/scheduling/scorer/loraaffinity/lora_affinity.go:85-98
switch {
// Ideal: The adapter is already active on this model server.
case active:
	scores[endpoint] = 1.0
// Good: The model server has capacity to load at least one more adapter.
case len(endpoint.GetMetrics().ActiveModels)+len(endpoint.GetMetrics().WaitingModels) < endpoint.GetMetrics().MaxActiveModels:
	scores[endpoint] = 0.8
// Moderate: The adapter is already in the queue to be loaded on this model server.
case waiting:
	scores[endpoint] = 0.6
// Unsuitable: The model server has reached its maximum capacity and cannot load the adapter.
default:
	scores[endpoint] = 0.0
}
- running-requests-scorer: min-max on RunningRequestsSize (.../runningrequests/runningrequest.go:77-107), i.e. least-request but on the server’s own gauge.
- token-load-scorer: 1 - inFlightTokens/threshold using EPP-side in-flight token accounting (.../tokenload/token_load.go:82-105), a leading indicator that doesn’t wait for the 50ms scrape.
- latency-scorer (+ slo-headroom-tier-filter, latency-slo-admitter, predicted-latency-producer): experimental SLO-driven scheduling against a sidecar latency-prediction service (latencypredictor/, Python). Headroom-tiered filtering then scoring by predicted TTFT/TPOT headroom.
Filters: prefix-cache-affinity-filter (pkg/epp/framework/plugins/scheduling/filter/prefixcacheaffinity/plugin.go — keeps only pods whose prefix-match ratio ≥ threshold, used as a two-gate strict/loose pair around the SLO tier filter in the helm config config/charts/epplib/templates/_config.yaml:49-60), slo-headroom-tier-filter, the utilization-detector doubling as a filter (below), and a header-based test filter.
Pickers (pkg/epp/framework/plugins/scheduling/picker/): max-score-picker (shuffle for random tie-break, stable sort desc, take top-N — maxscore/picker.go:87-115), random-picker, weighted-random-picker (score-proportional sampling, used with the latency predictor to avoid thundering-herd on the best pod).
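Score-proportional sampling is worth seeing next to max-score picking, since it is what spreads load when many requests arrive between metric updates. This is a generic sketch of the technique, not the repo's weighted-random-picker: weightedPick and its explicit u argument are hypothetical, chosen so the result is deterministic for a given draw.

```go
package main

import (
	"fmt"
	"sort"
)

// weightedPick selects a key with probability proportional to its score.
// u must be in [0,1); in real use it would come from rand.Float64().
// Keys are visited in sorted order so a given u always maps to the same key.
func weightedPick(scores map[string]float64, u float64) (string, bool) {
	names := make([]string, 0, len(scores))
	for name, s := range scores {
		if s > 0 {
			names = append(names, name)
		}
	}
	if len(names) == 0 {
		return "", false // nothing has a positive score
	}
	sort.Strings(names)
	total := 0.0
	for _, name := range names {
		total += scores[name]
	}
	target := u * total
	acc := 0.0
	for _, name := range names {
		acc += scores[name]
		if target < acc {
			return name, true
		}
	}
	return names[len(names)-1], true // guard against rounding at the top edge
}

func main() {
	scores := map[string]float64{"pod-a": 1, "pod-b": 3}
	for _, u := range []float64{0.0, 0.2, 0.25, 0.9} {
		name, _ := weightedPick(scores, u)
		fmt.Printf("u=%.2f -> %s\n", u, name)
	}
}
```

Here pod-b is chosen for three quarters of the unit interval, so it receives most of the traffic without receiving all of it, which is the thundering-herd protection the text refers to.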
Prefix-cache affinity is the most interesting subsystem (proposal: docs/proposals/0602-prefix-cache-aware-routing-proposal/). The EPP cannot see vLLM’s actual block tables, so it maintains an approximation: an LRU index mapping xxhash block hashes → set of pods that recently served them. The chunking mirrors vLLM’s block hashing, chained like a Merkle list and salted by model name:
// pkg/epp/framework/plugins/requestcontrol/dataproducer/approximateprefix/hashing.go:70-86
h := xxhash.New()
// Different models should have different hashes even with the same body.
_, _ = h.Write([]byte(request.TargetModel))
if cacheSalt := request.Body.CacheSalt(); cacheSalt != "" {
	_, _ = h.Write([]byte(cacheSalt))
}
prevBlockHash := blockHash(h.Sum64())
i := 0
for ; i+cacheBlockSizeChars <= len(userInput); i += cacheBlockSizeChars {
	h.Reset()
	_, _ = h.Write(userInput[i : i+cacheBlockSizeChars])
	_, _ = h.Write(toBytes(prevBlockHash))
	res = append(res, blockHash(h.Sum64()))
	prevBlockHash = res[len(res)-1]
}
Defaults (approximateprefix/types.go:88-113): block size 16 tokens (vLLM default; auto-tuned from the server’s cache_config_info metric when AutoTune is on), ~4 chars/token heuristic (no tokenizer in the hot path), max 256 blocks matched, LRU capacity 31,250 entries/pod (sized from an H100 KV-budget calculation in the comment). The flow is split across two hooks of the same plugin: PrepareRequestData hashes the prompt and annotates every candidate with PrefixCacheMatchInfo before scheduling; PreRequest records the chosen pod (and the prefill pod, in P/D mode) against those hashes after scheduling, asynchronously (approximateprefix/plugin.go:140-205). The scorer itself just reads the annotation — production of state and consumption are decoupled through the datalayer attribute system (Consumes()/Produces() declarations let the config loader validate the plugin graph).
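The produce-then-consume flow can be exercised end to end in a few lines. The sketch below is a standard-library approximation under loud assumptions: FNV-1a is swapped in for xxhash, the index is a plain map with no LRU eviction, and blockHashes, index, record, and matchedBlocks are hypothetical names rather than the plugin's API.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// blockHashes splits input into fixed-size chunks and hashes each chunk
// together with the previous block's hash, seeded by the model name.
// FNV-1a stands in for xxhash so the sketch needs only the standard library.
func blockHashes(model string, input []byte, blockSize int) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(model))
	prev := h.Sum64()
	var out []uint64
	for i := 0; i+blockSize <= len(input); i += blockSize {
		h.Reset()
		h.Write(input[i : i+blockSize])
		var buf [8]byte
		binary.LittleEndian.PutUint64(buf[:], prev)
		h.Write(buf[:])
		prev = h.Sum64()
		out = append(out, prev)
	}
	return out
}

// index maps a block hash to the set of pods believed to hold it.
// Eviction is omitted; the real index is an LRU.
type index map[uint64]map[string]bool

// record notes that pod served a request containing these blocks.
func (ix index) record(pod string, hashes []uint64) {
	for _, h := range hashes {
		if ix[h] == nil {
			ix[h] = map[string]bool{}
		}
		ix[h][pod] = true
	}
}

// matchedBlocks counts the leading blocks the pod holds, stopping at the
// first miss because later blocks are useless without their predecessors.
func (ix index) matchedBlocks(pod string, hashes []uint64) int {
	n := 0
	for _, h := range hashes {
		if !ix[h][pod] {
			break
		}
		n++
	}
	return n
}

func main() {
	first := blockHashes("model-x", []byte("AAAAAAAABBBBBBBBCCCCCCCC"), 8)
	second := blockHashes("model-x", []byte("AAAAAAAABBBBBBBBDDDDDDDD"), 8)

	ix := index{}
	ix.record("pod-a", first) // after scheduling: remember who served it

	m := ix.matchedBlocks("pod-a", second) // before scheduling the next request
	fmt.Printf("pod-a: %d/%d blocks\n", m, len(second))
	fmt.Printf("pod-b: %d/%d blocks\n", ix.matchedBlocks("pod-b", second), len(second))
}
```

Chaining each hash through its predecessor is what makes a single map lookup per block sufficient: a block hash can only match if every block before it matched too, so no separate prefix tree is needed.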
4.3 Per-pod state: the datalayer
Service discovery: a controller watches pods matching the pool selector (or a static --endpoint-selector in standalone mode) and registers one collector per endpoint. Each collector runs a ticker loop polling its data sources and feeding extractors:
// pkg/epp/datalayer/collector.go:139-158
case <-ticker.Channel():
	for _, src := range sources {
		tn := src.TypedName()
		key := tn.String()
		ctx, cancel := context.WithTimeout(c.ctx, defaultCollectionTimeout)
		data, err := src.Poll(ctx, endpoint)
		cancel()
		logErrorTransition(logger, c.lastPollErrors, key, "poll", "source", err)
		if err != nil {
			continue
		}
		if srcExtractors, ok := exts[tn.Name]; ok && data != nil {
			for _, ext := range srcExtractors {
				extErr := ext.Extract(ctx, data, endpoint)
Defaults (pkg/epp/server/options.go:102-109): scrape every 50ms per pod with a 1s timeout, metrics-staleness-threshold 2s, and the vLLM metric names as defaults: vllm:num_requests_waiting, vllm:num_requests_running, vllm:kv_cache_usage_perc, vllm:lora_requests_info, vllm:cache_config_info. The metrics-data-source plugin GETs /metrics and parses Prometheus text; the core-metrics-extractor maps families into the shared Metrics snapshot (pkg/epp/framework/interface/datalayer/metrics.go:26-42: ActiveModels/WaitingModels/MaxActiveModels, RunningRequestsSize, WaitingQueueSize, KVCacheUsagePercent, CacheBlockSize, CacheNumBlocks, UpdateTime). Engine dialects are table-driven — the same extractor handles vLLM, SGLang, trtllm-serve, and Triton via per-engine metric specs selected by a pod label (inference.networking.k8s.io/engine-type):
// pkg/epp/framework/plugins/datalayer/extractor/metrics/factories.go:89-98
{
	Name:                    "sglang",
	QueuedRequestsSpec:      "sglang:num_queue_reqs",
	RunningRequestsSpec:     "sglang:num_running_reqs",
	KVUsageSpec:             "sglang:token_usage",
	LoRASpec:                "",
	CacheInfoSpec:           "sglang:cache_config_info",
	CacheBlockSizeLabelName: "page_size",
	CacheNumBlocksLabelName: "num_pages",
},
The scrape contract a model server must satisfy is the Model Server Protocol (docs/proposals/003-model-server-protocol/README.md). Staleness handling is consumer-side: every snapshot carries UpdateTime, and consumers like the saturation detector treat stale pods as saturated (its own staleness bound defaults to 200ms — pkg/epp/framework/plugins/flowcontrol/saturationdetector/utilization/config.go:36), while readiness/debug listers partition pods into fresh/stale against the 2s threshold (pkg/epp/datalayer/logger/logger.go:78-93).
4.4 Flow control: admission, queueing, saturation
Two admission paths exist behind the AdmissionController interface (pkg/epp/requestcontrol/admission.go:39-56):
Legacy (default) — stateless shed-or-pass: sheddable (priority < 0) requests are rejected with 429 when pool saturation ≥ 1.0; non-sheddable always pass to the scheduler. Saturation comes from the utilization-detector, a roofline model on the two scraped pressure signals:
// pkg/epp/framework/plugins/flowcontrol/saturationdetector/utilization/detector.go:124-136
if metrics == nil || time.Since(metrics.UpdateTime) > d.config.MetricsStalenessThreshold {
	totalScore += 1.0
	continue
}
qRatio := float64(metrics.WaitingQueueSize) / float64(d.config.QueueDepthThreshold)
kvRatio := metrics.KVCacheUsagePercent / d.config.KVCacheUtilThreshold
// Roofline Analysis: The pod is saturated if either resource is exhausted.
totalScore += max(qRatio, kvRatio)
The same detector doubles as a scheduling filter that drops pods beyond threshold × (1 + headroom) — with a fail-open clause returning all endpoints if everything is saturated (detector.go:142-168).
Flow Control layer (experimental, pkg/epp/flowcontrol/) — a full queueing system: FlowControlAdmissionController.Admit wraps the request and blocks in FlowController.EnqueueAndWait (pkg/epp/flowcontrol/controller/controller.go:206). The controller is a supervisor over sharded ShardProcessor workers; requests land in per-flow queues keyed by x-gateway-inference-fairness-id within priority bands. Pluggable policies, all registered in the same plugin registry: fairness (round-robin, global-strict), ordering (fcfs, edf earliest-deadline-first, slo-deadline), usage limits (static token budgets), and queue data structures (ListQueue, MaxMinHeap — pkg/epp/flowcontrol/framework/plugins/queue/). Outcomes map to ext-proc responses: dispatch → continue to scheduler; reject/TTL-expiry → 429/503; and queued requests can be evicted post-enqueue, which is what that eviction channel in the handler’s select loop is for (pkg/epp/handlers/server.go:260-275). The design doc pkg/epp/flowcontrol/README.md is one of the best-written pieces in the repo — head-of-line blocking, displacement, and band-relative fairness are all spelled out.
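The band-plus-flow structure is easier to hold in your head with a toy version. This sketch keeps one FIFO per fairness ID inside each priority band and dispatches round-robin across flows; strict priority between bands is an assumption made here for simplicity, and none of the types or method names below come from pkg/epp/flowcontrol/.

```go
package main

import (
	"fmt"
	"sort"
)

type request struct {
	flow     string // fairness ID
	priority int
	id       string
}

// band holds per-flow FIFO queues for one priority level plus the
// round-robin cursor over its flows.
type band struct {
	flows  []string // flow IDs in first-seen order
	queues map[string][]request
	next   int
}

type controller struct {
	bands map[int]*band
}

// enqueue appends the request to its flow's queue in its priority band.
func (c *controller) enqueue(r request) {
	if c.bands == nil {
		c.bands = map[int]*band{}
	}
	b, ok := c.bands[r.priority]
	if !ok {
		b = &band{queues: map[string][]request{}}
		c.bands[r.priority] = b
	}
	if _, seen := b.queues[r.flow]; !seen {
		b.flows = append(b.flows, r.flow)
	}
	b.queues[r.flow] = append(b.queues[r.flow], r)
}

// dispatch serves the highest non-empty band first (an assumption of this
// sketch) and rotates across flows inside that band.
func (c *controller) dispatch() (request, bool) {
	prios := make([]int, 0, len(c.bands))
	for p := range c.bands {
		prios = append(prios, p)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(prios)))
	for _, p := range prios {
		b := c.bands[p]
		for i := 0; i < len(b.flows); i++ {
			idx := (b.next + i) % len(b.flows)
			q := b.queues[b.flows[idx]]
			if len(q) == 0 {
				continue
			}
			b.queues[b.flows[idx]] = q[1:]
			b.next = (idx + 1) % len(b.flows)
			return q[0], true
		}
	}
	return request{}, false
}

func main() {
	c := &controller{}
	for _, r := range []request{
		{"tenant-a", 0, "a1"}, {"tenant-a", 0, "a2"}, {"tenant-a", 0, "a3"},
		{"tenant-b", 0, "b1"}, {"tenant-c", -1, "c1"}, {"tenant-a", 1, "x1"},
	} {
		c.enqueue(r)
	}
	for r, ok := c.dispatch(); ok; r, ok = c.dispatch() {
		fmt.Print(r.id, " ")
	}
	fmt.Println() // x1 a1 b1 a2 a3 c1
}
```

Tenant b's single request is served second in its band even though tenant a queued three requests first, which is the head-of-line protection that per-flow queues buy over one shared FIFO.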
4.5 Conformance: what a Gateway implementation must do
conformance/ vendors the upstream Gateway API conformance machinery and defines a Gateway profile (conformance/conformance.go:60-66). To claim InferencePool support an implementation must pass tests covering, among others (conformance/tests/):
- inferencepool_accepted, inferencepool_resolvedrefs_condition: status conditions written per parent Gateway.
- gateway_following_epp_routing (+ _dp for multi-port data parallelism): deploy a real EPP configured with a header-based test filter, assert the gateway actually routes to the exact pod the EPP names. This is the heart of conformance: the proxy must honor x-gateway-destination-endpoint.
- epp_unavailable_fail_open: kill the EPP; failureMode: FailOpen pools must still serve.
- gateway_destination_endpoint_served: the proxy must report which endpoint actually served via response-path metadata.
- inferencepool_invalid_epp_service, httproute_invalid_inferencepool_ref, port validation, multiple-gateways/pools weighting tests.
Backends in conformance are echo servers, not model servers (conformance/resources/base.yaml:60-83) — conformance tests routing mechanics, not scheduling quality. Reports are published per-implementation under conformance/reports/. If Discord ever fronts inference with its own Envoy control plane, this suite is the compliance bar.
5. Suggested reading path
- README.md, then docs/proposals/002-api-proposal/README.md (why pool+objective) and docs/proposals/004-endpoint-picker-protocol/README.md (the EPP↔Envoy contract; short, read fully).
- api/v1/inferencepool_types.go, apix/v1alpha2/inferenceobjective_types.go: the API surface.
- test/testdata/envoy.yaml: the raw Envoy config; map every field to the protocol doc.
- pkg/epp/handlers/server.go (Process loop) and pkg/epp/handlers/request.go: the ext-proc state machine.
- pkg/epp/requestcontrol/director.go:151-229: the orchestration spine.
- pkg/epp/scheduling/scheduler.go + scheduler_profile.go: the framework; then docs/proposals/0845-scheduler-architecture-proposal/.
- Scorers in order: queuedepth/queue.go, kvcacheutilization/kvcache_utilization.go, loraaffinity/lora_affinity.go, then the prefix pair: requestcontrol/dataproducer/approximateprefix/{hashing,indexer,plugin}.go + scorer/prefix/plugin.go.
- pkg/epp/datalayer/collector.go + framework/plugins/datalayer/extractor/metrics/: the metrics pipeline; pkg/epp/server/options.go for every default.
- pkg/epp/flowcontrol/README.md, then requestcontrol/admission.go and saturationdetector/utilization/detector.go; go deeper into flowcontrol/controller/ only if fairness interests you.
- pkg/epp/config/loader/defaults.go + apix/config/v1alpha1/endpointpickerconfig_types.go: how config becomes a plugin graph.
- conformance/tests/gateway_following_epp_routing.go and test/integration/epp/harness.go: how it’s all verified.
6. Connections to your other study repos
- llm-d: the most direct. llm-d’s inference scheduler is this EPP framework (the code is migrating to llm-d/llm-d-inference-scheduler), with llm-d adding disaggregated prefill/decode plugins, KV-cache-event-based (precise, not approximate) prefix indexing, and the vLLM-side integration. The multi-profile scheduler and the hardcoded "prefill" profile name here are the seams llm-d plugs into. Study GIE first; llm-d then reads as “GIE plus opinionated vLLM deployment.”
- vllm: the backend whose telemetry drives everything. vllm:num_requests_waiting, vllm:kv_cache_usage_perc, vllm:lora_requests_info are the EPP’s eyes (pkg/epp/server/options.go:105-109). The approximate prefix index mirrors vLLM’s block-hash chaining (prefix_caching design) at 16-token granularity. KV pressure → preemption → latency cliff is the vLLM behavior the saturation detector encodes.
- sgl-router & dynamo: the architectural counterpoint. Both put inference-aware routing in the data plane process (SGLang’s Rust router with cache-aware tree matching; Dynamo’s distributed runtime with its own KV-aware planner/router tier), getting tokenizer-exact prefix matching and event-driven cache state at the cost of owning the proxy: TLS, HTTP/2, retries, observability, deployment. GIE instead splits decision plane (EPP) from data plane (any conformant Envoy), paying one ext-proc RTT (~ms, plus full-duplex body streaming) and accepting approximate cache state, in exchange for vendor-neutrality and reuse of the Envoy ecosystem. Note GIE hedges on precision: char-based heuristic hashing here, with the event-driven precise indexer living in llm-d. Knowing both sides of this trade is exactly the judgment an infra interviewer probes.
- nano-vllm: the minimal lab for why these signals exist. See its block manager and scheduler to internalize what kv_cache_usage_perc and num_requests_waiting physically mean before tuning scorers that consume them.
- xgrammar: orthogonal layer (constrained decoding inside the engine); the only contact point is that structured-output requests skew decode cost, which token-load/latency scorers absorb statistically.
- flashinfer: the layer beneath vLLM that makes decode latency batch-size-dependent (paged KV attention kernels). That kernel-level fact is the root cause of “load-dependent latency” that makes least-request insufficient; the EPP is the system-level compensation.
7. Hands-on without a GPU fleet
Everything here runs on a laptop:
- Unit tests (make test-unit): pure Go, no cluster. Highest-value reads/runs: pkg/epp/scheduling/scheduler_test.go (full filter/score/pick cycles against fake metrics), approximateprefix/plugin_test.go and indexer_test.go (prefix matching end-to-end), saturationdetector/utilization/detector_test.go, pkg/epp/handlers/server_abort_test.go.
- Hermetic integration tests (make test-integration, test/integration/epp/hermetic_test.go): boots the real EPP gRPC server via controller-runtime envtest (local kube-apiserver, no kubelets), injects pod metrics through FakePodMetricsClient / a mock datalayer source (test/integration/epp/harness.go:240-263), then drives the actual ext-proc stream with handcrafted ProcessingRequests and asserts on the returned header mutations. In other words, you can watch a scheduling decision change as you flip a fake pod’s WaitingQueueSize. This is the best place to experiment with new scorer behavior.
- Fake backends are first-class: the e2e suite and all getting-started guides default to the vLLM simulator ghcr.io/llm-d/llm-d-inference-sim (config/manifests/vllm/sim-deployment.yaml), a CPU-only container that speaks the OpenAI API and emits protocol-conformant vllm:* metrics, including LoRA. A kind cluster + any conformant gateway (kgateway, Istio, Envoy Gateway) + sim deployment gives you the full path: site-src/guides/index.md is the walkthrough; make test-e2e automates it (test/e2e/epp/README.md).
- Standalone mode, no Gateway API at all (site-src/guides/standalone.md): EPP + Envoy as a sidecar pair, pods discovered by --endpoint-selector app=... label selector. This is the minimal lab to watch ext-proc traffic with the simulator, and incidentally the deployment shape closest to “drop an inference scheduler into an existing Envoy fleet.”
- Conformance against kind: go test ./conformance --run TestConformance with --gateway-class pointed at your gateway; echo-server backends only, no accelerators (site-src/guides/conformance-tests.md).
- Flow control benchmarks: pkg/epp/flowcontrol/benchmark/ + make test-benchmark for queue/fairness micro-benchmarks.
Good first contribution surfaces (judged from TODOs and experimental markers in-tree, modulo the llm-d migration): the RequestContext protocol/lifecycle decoupling (pkg/epp/handlers/server.go:93-95 TODO), configurable prepareDataTimeout (pkg/epp/requestcontrol/director.go:52-55), the canonical P/D profile mechanism replacing the hardcoded "prefill" name (issue #2080), flow-control policy plugins, and conformance tests — which this repo explicitly retains and which are pure Gateway/Envoy engineering.