Gateway API Inference Extension

Kubernetes SIG project (kubernetes-sigs/gateway-api-inference-extension, “GIE” or “Inference Gateway”) that turns any ext-proc-capable, Gateway API-conformant proxy — Envoy Gateway, Istio, kgateway, GKE Gateway, agentgateway — into an inference-aware L7 load balancer. It ships two things: a set of CRDs (InferencePool and friends) that model a fleet of model-server pods as a routable backend, and the Endpoint Picker (EPP), a Go gRPC server implementing Envoy’s External Processing protocol that picks the destination pod per request using live vLLM/SGLang/TensorRT-LLM metrics.

Repo status note (important for contribution targeting). Per README.md:16-23, the EPP, InferenceObjective/InferenceModelRewrite APIs, and the Body-Based Router are migrating to llm-d/llm-d-inference-scheduler and llm-d/llm-d-inference-payload-processor; this repo remains the home of the InferencePool API, the lightweight EPP (lwepp), and the conformance suite. This checkout still contains the complete EPP (pkg/epp/), so it is the right place to study the architecture — but new scheduler/EPP PRs should go to the llm-d org. Everything below cites code in this repo.

1. What it is, and what classic LB cannot do

Classic load balancing assumes requests are roughly interchangeable and backends are roughly stateless: round-robin, least-request, EWMA, maglev/ring-hash for affinity. LLM serving violates every assumption:

  • Per-request cost variance is enormous. A request is not a unit of work; its cost is O(prompt_tokens) for prefill plus O(output_tokens × batch_interference) for decode. Two in-flight requests can differ by 1000x in GPU-seconds. Least-request counts requests, not tokens.
  • Backends are stateful. Each replica holds a KV cache (prefix cache). Routing a request to a pod that already has its prompt prefix cached skips most of prefill — but no L7 hash policy can know which pod has which prefix blocks resident, because residency changes with every admission and eviction.
  • Latency is load-dependent and saturation is a cliff. Once a vLLM replica’s KV blocks are exhausted it starts queueing or preempting (recompute), and decode latency for everyone in the batch degrades. You want load-aware routing on queue depth and KV utilization, and admission control before the cliff, not after.
  • LoRA multiplexing. A pod can only hold N adapters in GPU memory; routing a request for adapter X to a pod that must first swap X in costs hundreds of ms.

The project’s answer: leave the data plane in Envoy, and put a per-request scheduling decision into an external processor that watches model-server metrics at 50ms granularity. The EPP filters/scores/picks among the pods of an InferencePool and tells Envoy where to send the request via a header consumed by an ORIGINAL_DST cluster. Selection criteria (from README.md:71-81): KV-cache pressure, queue depth, prefix-cache affinity, LoRA adapter residency, request priority — with saturation-aware shedding of low-priority traffic.

2. Why you care

This is precisely the system a traffic-infra engineer sketches from first principles after learning how continuous batching works: “I need least-loaded routing where load = queue depth + KV pressure, plus consistent-hash-like affinity on prompt prefixes, plus load shedding before the saturation cliff — and I want it in my existing Envoy fleet, not a bespoke proxy.”

  • The EPP is an Envoy ext-proc server (envoy.service.ext_proc.v3.ExternalProcessor) speaking full-duplex streamed gRPC. The protocol, processing modes, header mutations, dynamic metadata, ClearRouteCache, ImmediateResponse for 429/503 — all the machinery is the Envoy machinery you already know.
  • The scheduling layer is a clean filter → scorer → picker plugin pipeline (deliberately modeled on kube-scheduler), so the inference-specific knowledge is encapsulated in ~100-line plugins with explicit scoring math. You can read the entire decision logic in an afternoon.
  • It is the vendor-neutral standard for this layer: Envoy Gateway, Istio, kgateway, GKE, and agentgateway all implement InferencePool conformance; llm-d (Google/Red Hat/IBM + vLLM collaboration) builds its inference scheduler directly on this framework. Contributions here (or in llm-d/llm-d-inference-scheduler, where the EPP now lives) land in every implementation.
  • It’s GA (v1 InferencePool), but the interesting frontiers are wide open: flow control / fairness (pkg/epp/flowcontrol/ is explicitly experimental), P/D-disaggregation profiles, latency-SLO prediction-based scheduling, multi-cluster InferencePoolImport. Envoy + Go + queueing theory is exactly your toolkit.

3. Architecture map

client ──HTTP──> Gateway (Envoy)                          [data plane]
                   │  ext_proc filter (FULL_DUPLEX_STREAMED gRPC)
                   ▼
                 EPP (this repo)                          [decision plane]
                   ├─ handlers/      ext-proc protocol state machine
                   ├─ requestcontrol Director: parse → admit → schedule → mutate
                   ├─ flowcontrol/   priority queues, fairness, saturation
                   ├─ scheduling/    filters → scorers → picker (plugins)
                   ├─ datalayer/     per-pod collectors  ──/metrics──> vLLM pods
                   └─ controller/    watches InferencePool/Pods/Objectives
                   │
                   └──returns x-gateway-destination-endpoint = ip:port
                   ▼
                 Envoy ORIGINAL_DST cluster ──> chosen model-server pod

The APIs

  • InferencePool (v1, api/v1/inferencepool_types.go:32) — the backend type you put in an HTTPRoute.backendRefs. Spec is just three things: a label selector for member pods, targetPorts (1-8 ports; each ip:port is a distinct endpoint, used for data-parallel ranks), and endpointPickerRef pointing at the EPP Service. endpointPickerRef.failureMode (api/v1/inferencepool_types.go:168-189) is FailClose by default — exactly Envoy ext-proc’s failure_mode_allow semantics surfaced as API. Status carries per-parent Accepted/ResolvedRefs conditions written by each Gateway controller.
  • InferenceObjective (v1alpha2, apix/v1alpha2/inferenceobjective_types.go:60-73) — attaches a workload identity and an integer priority (default 0; priority < 0 = sheddable, see pkg/epp/util/request/sheddable.go:20-22) to requests, selected via the x-gateway-inference-objective header.
  • InferenceModelRewrite (v1alpha2) — maps client-facing model names to backend models/adapters with weighted splits (the EPP rewrites "model" in the JSON body and un-rewrites it in responses, pkg/epp/handlers/server.go:446-460).
  • InferencePoolImport (v1alpha1) — experimental multi-cluster pool export/import.
  • EndpointPickerConfig (apix/config/v1alpha1/endpointpickerconfig_types.go:33) — not a CRD but a config-file schema (mounted ConfigMap) declaring which plugins to instantiate, scheduling profiles with per-scorer weights, flow control, saturation detector, and parser. This file is the EPP’s “scheduler policy” surface.

The EPP process

cmd/epp/main.go → cmd/epp/runner/runner.go wires everything: a controller-runtime manager watching the pool/pods, the datalayer runtime (one collector goroutine per endpoint), the plugin registry (cmd/epp/runner/runner.go:459-509 registers every in-tree plugin factory), the scheduler, optional flow control, and the gRPC ext-proc server (default port 9002, gRPC health on 9003, Prometheus on 9090; see pkg/epp/server/options.go:93-118). The default plugin config when you supply none is queue + KV-cache + prefix scorers (pkg/epp/config/loader/defaults.go:46-103).

cmd/lwepp/ is the lightweight EPP that stays in this repo: same ext-proc protocol and InferencePool discovery, but a trivial round-robin picker (pkg/lwepp/handlers/server.go:84-89) and no metrics scraping — a minimal reference data plane for conformance and for gateways that want endpoint subsetting without smart scheduling (it recently gained port-aware filtering for data-parallel ranks).
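
To make "trivial round-robin picker" concrete, here is a minimal sketch of that idea. It is not the code at pkg/lwepp/handlers/server.go; the type roundRobin and its Pick method are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin is a hypothetical picker: it cycles through whatever
// endpoint list it is handed and knows nothing about pod load.
type roundRobin struct {
	next atomic.Uint64
}

// Pick returns the next endpoint in rotation, or "" when the list is empty.
func (r *roundRobin) Pick(endpoints []string) string {
	if len(endpoints) == 0 {
		return ""
	}
	n := r.next.Add(1) - 1 // safe under concurrent ext-proc streams
	return endpoints[n%uint64(len(endpoints))]
}

func main() {
	rr := &roundRobin{}
	pods := []string{"10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"}
	for i := 0; i < 4; i++ {
		fmt.Println(rr.Pick(pods)) // the fourth call wraps back to the first pod
	}
}
```

The contrast with the full EPP is that nothing here reads metrics: the only state is a counter.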

4. Core mechanisms

4.1 The ext-proc flow

Where it sits in Envoy: the canonical config used by e2e tests is test/testdata/envoy.yaml. The ext_proc HTTP filter runs before the router with all body modes set to full-duplex streaming:

# test/testdata/envoy.yaml:97-109
- name: envoy.filters.http.ext_proc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
    grpc_service:
      envoy_grpc:
        cluster_name: ext_proc
        authority: vllm-qwen3-32b-epp.$E2E_NS:9002
      timeout: 10s
    processing_mode:
      request_header_mode: SEND
      response_header_mode: SEND
      request_body_mode: FULL_DUPLEX_STREAMED
      response_body_mode: FULL_DUPLEX_STREAMED

and the route targets an ORIGINAL_DST cluster that takes its destination from the EPP-set header — this is the whole trick that lets an external process steer per-request endpoint selection without EDS churn:

# test/testdata/envoy.yaml:151-162
- name: original_destination_cluster
  type: ORIGINAL_DST
  connect_timeout: 1000s
  lb_policy: CLUSTER_PROVIDED
  circuit_breakers:
    thresholds:
      - max_connections: 40000
        max_pending_requests: 40000
        max_requests: 40000
  original_dst_lb_config:
    use_http_header: true
    http_header_name: x-gateway-destination-endpoint

Per request, Envoy streams to the EPP: RequestHeaders → N× RequestBody chunks (EoS on last) → later ResponseHeaders → N× ResponseBody chunks. The EPP’s Process() loop (pkg/epp/handlers/server.go:162) runs a per-stream state machine (StreamRequestState, pkg/epp/handlers/server.go:142-154) with a single reader goroutine and a select that can also fire an eviction channel — flow control can yank an already-queued request mid-stream by sending ImmediateResponse(429) (pkg/epp/handlers/server.go:467-480).

The scheduling decision happens at request-body-EoS (it needs the parsed body for model name and prefix hashing): pkg/epp/handlers/server.go:319-348 parses the body, calls director.HandleRequest, then builds the response Envoy applies. The decision is communicated both as a header mutation and as dynamic metadata (namespace envoy.lb, key x-gateway-destination-endpoint; pkg/epp/metadata/consts.go:26-28), because gateway integrations differ in which they consume:

// pkg/epp/handlers/request.go:87-99
return &extProcPb.ProcessingResponse{
    Response: &extProcPb.ProcessingResponse_RequestHeaders{
        RequestHeaders: &extProcPb.HeadersResponse{
            Response: &extProcPb.CommonResponse{
                ClearRouteCache: true,
                HeaderMutation: &extProcPb.HeaderMutation{
                    SetHeaders: s.generateHeaders(ctx, reqCtx),
                },
            },
        },
    },
    DynamicMetadata: dynamicMetadata,
}

The full EPP↔proxy contract is specified in docs/proposals/004-endpoint-picker-protocol/README.md: the proxy may pass a candidate subset via filter metadata envoy.lb.subset_hint / x-gateway-destination-endpoint-subset; the EPP may return a comma-separated fallback list of endpoints (pkg/epp/requestcontrol/director.go:329-333 joins multiple picks) that Envoy walks on retry; 503 if no ready endpoint, 429 to shed. Headerless requests (e.g. GET /health) get a random endpoint at header time (pkg/epp/handlers/request.go:41-54). Response bodies are streamed back through the EPP so it can observe usage/latency per chunk and run post-response plugins — the EPP is on the data path for the whole response, which is why the protocol insists on full-duplex streaming.
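
The comma-separated fallback list is easy to picture from both sides of the contract. The sketch below is an assumption-laden illustration, not the Director's code: joinEndpoints and splitEndpoints are hypothetical helpers, and only the header name comes from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// destinationHeader is the header the ORIGINAL_DST cluster reads.
const destinationHeader = "x-gateway-destination-endpoint"

// joinEndpoints builds the header value on the EPP side:
// the primary pick first, fallbacks after it.
func joinEndpoints(picks []string) string {
	return strings.Join(picks, ",")
}

// splitEndpoints is the proxy-side view: an ordered list of
// candidates to try, tolerating stray whitespace and empty items.
func splitEndpoints(value string) []string {
	var out []string
	for _, part := range strings.Split(value, ",") {
		if part = strings.TrimSpace(part); part != "" {
			out = append(out, part)
		}
	}
	return out
}

func main() {
	value := joinEndpoints([]string{"10.0.0.7:8000", "10.0.0.9:8000"})
	fmt.Printf("%s: %s\n", destinationHeader, value)
	for attempt, ep := range splitEndpoints(value) {
		fmt.Printf("attempt %d -> %s\n", attempt+1, ep)
	}
}
```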

The Director orchestration sequence (pkg/epp/requestcontrol/director.go:151-229): model rewrite → resolve InferenceObjective/priority → admission control → locate candidates (subset hint or full pool) → run PrepareData plugins (e.g. prefix matching) → run admission plugins → schedule → set TargetEndpoint + run PreRequest plugins.

4.2 The scheduling framework

pkg/epp/scheduling/scheduler.go:54 runs a loop of profiles chosen by a ProfileHandler (multi-profile exists for P/D disaggregation: a “prefill” profile and a decode profile can each pick an endpoint per request; see the hardcoded experimentalDefaultPrefillProfile = "prefill" in pkg/epp/framework/plugins/requestcontrol/dataproducer/approximateprefix/types.go:77-86). Each SchedulerProfile.Run is the kube-scheduler pattern:

// pkg/epp/scheduling/scheduler_profile.go:117-128
func (p *SchedulerProfile) Run(ctx context.Context, request *fwksched.InferenceRequest, cycleState *fwksched.CycleState, candidateEndpoints []fwksched.Endpoint) (*fwksched.ProfileRunResult, error) {
	endpoints := p.runFilterPlugins(ctx, request, cycleState, candidateEndpoints)
	if len(endpoints) == 0 {
		return nil, errcommmon.Error{Code: errcommmon.Internal, Msg: "no endpoints available for the given request"}
	}
	// if we got here, there is at least one endpoint to score
	weightedScorePerEndpoint := p.runScorerPlugins(ctx, request, cycleState, endpoints)

	result := p.runPickerPlugin(ctx, cycleState, weightedScorePerEndpoint)

	return result, nil
}

Scorers return map[endpoint]float64 clamped to [0,1]; the profile accumulates score × weight (pkg/epp/scheduling/scheduler_profile.go:165-168). The default profile (pkg/epp/config/loader/defaults.go:47-49 and helm config/charts/epplib/templates/_config.yaml:77-84) is:

plugin                         weight
queue-scorer                   2
kv-cache-utilization-scorer    2
prefix-cache-scorer            3
max-score-picker               (injected by default)
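
The score × weight accumulation can be sketched with the default 2/2/3 weights. This is a simplified illustration, not the profile's implementation: the scorer struct, weightedScores, and pickMax are hypothetical names, and the per-pod scores are made up.

```go
package main

import "fmt"

// scorer is a hypothetical stand-in for a scorer plugin: a weight plus a
// function returning a score in [0,1] for a candidate endpoint.
type scorer struct {
	name   string
	weight float64
	score  func(endpoint string) float64
}

// clamp keeps a plugin's score inside [0,1] before it is weighted.
func clamp(v float64) float64 {
	if v < 0 {
		return 0
	}
	if v > 1 {
		return 1
	}
	return v
}

// weightedScores accumulates score × weight per endpoint across all scorers.
func weightedScores(endpoints []string, scorers []scorer) map[string]float64 {
	total := make(map[string]float64, len(endpoints))
	for _, s := range scorers {
		for _, ep := range endpoints {
			total[ep] += clamp(s.score(ep)) * s.weight
		}
	}
	return total
}

// pickMax returns the endpoint with the highest total; the first one wins ties.
func pickMax(endpoints []string, total map[string]float64) string {
	best := ""
	for _, ep := range endpoints {
		if best == "" || total[ep] > total[best] {
			best = ep
		}
	}
	return best
}

func main() {
	endpoints := []string{"pod-a", "pod-b"}
	// Made-up scores: pod-a has the shorter queue, pod-b has the prefix cached.
	queue := map[string]float64{"pod-a": 1.0, "pod-b": 0.0}
	kv := map[string]float64{"pod-a": 0.5, "pod-b": 0.75}
	prefix := map[string]float64{"pod-a": 0.0, "pod-b": 1.0}
	scorers := []scorer{
		{"queue-scorer", 2, func(ep string) float64 { return queue[ep] }},
		{"kv-cache-utilization-scorer", 2, func(ep string) float64 { return kv[ep] }},
		{"prefix-cache-scorer", 3, func(ep string) float64 { return prefix[ep] }},
	}
	total := weightedScores(endpoints, scorers)
	fmt.Println(total, "->", pickMax(endpoints, total))
}
```

With these inputs pod-a totals 2×1.0 + 2×0.5 + 3×0.0 = 3.0 and pod-b totals 2×0.0 + 2×0.75 + 3×1.0 = 4.5, so a full prefix hit outweighs an empty queue under the default weights.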

The actual plugin inventory (registered in cmd/epp/runner/runner.go:459-509):

Scorers (pkg/epp/framework/plugins/scheduling/scorer/):

  • queue-scorer — min-max normalization of vLLM’s waiting-queue length across candidates:
// pkg/epp/framework/plugins/scheduling/scorer/queuedepth/queue.go:93-100
// endpointScoreFunc calculates the score based on the queue size of each endpoint. Longer queue gets a lower score.
endpointScoreFunc := func(endpoint framework.Endpoint) float64 {
    if maxQueueSize == minQueueSize {
        // If all pods have the same queue size, return a neutral score
        return 1.0
    }
    return float64(maxQueueSize-endpoint.GetMetrics().WaitingQueueSize) / float64(maxQueueSize-minQueueSize)
}
  • kv-cache-utilization-scorer — simply 1 - KVCacheUsagePercent (pkg/epp/framework/plugins/scheduling/scorer/kvcacheutilization/kvcache_utilization.go:76-82). Absolute, not normalized: an empty fleet scores all 1.0 and the queue scorer breaks the tie.
  • prefix-cache-scorer: matchedBlocks / totalBlocks of the prompt found in the per-pod prefix index (pkg/epp/framework/plugins/scheduling/scorer/prefix/plugin.go:108-111); see below for how the index is built.
  • lora-affinity-scorer — tiered constants from adapter residency metrics:
// pkg/epp/framework/plugins/scheduling/scorer/loraaffinity/lora_affinity.go:85-98
switch {
// Ideal: The adapter is already active on this model server.
case active:
    scores[endpoint] = 1.0
// Good: The model server has capacity to load at least one more adapter.
case len(endpoint.GetMetrics().ActiveModels)+len(endpoint.GetMetrics().WaitingModels) < endpoint.GetMetrics().MaxActiveModels:
    scores[endpoint] = 0.8
// Moderate: The adapter is already in the queue to be loaded on this model server.
case waiting:
    scores[endpoint] = 0.6
// Unsuitable: The model server has reached its maximum capacity and cannot load the adapter.
default:
    scores[endpoint] = 0.0
}
  • running-requests-scorer — min-max on RunningRequestsSize (.../runningrequests/runningrequest.go:77-107), i.e. least-request but on the server’s own gauge.
  • token-load-scorer: 1 - inFlightTokens/threshold using EPP-side in-flight token accounting (.../tokenload/token_load.go:82-105), a leading indicator that doesn’t wait for the 50ms scrape.
  • latency-scorer (+ slo-headroom-tier-filter, latency-slo-admitter, predicted-latency-producer) — experimental SLO-driven scheduling against a sidecar latency-prediction service (latencypredictor/, Python). Headroom-tiered filtering then scoring by predicted TTFT/TPOT headroom.
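
Three of the scoring formulas above fit in a few lines each. The sketch below restates them under assumptions: podMetrics is a trimmed-down hypothetical struct, InFlightTokens stands in for the EPP-side accounting, and flooring the token-load score at 0 is my assumption rather than something the excerpts show.

```go
package main

import "fmt"

// podMetrics is a hypothetical, trimmed-down view of one pod's state.
type podMetrics struct {
	WaitingQueueSize    int
	KVCacheUsagePercent float64 // fraction in [0,1]
	InFlightTokens      int     // assumed EPP-side counter
}

// queueScores applies min-max normalization across candidates:
// the shortest queue scores 1, the longest scores 0.
func queueScores(pods map[string]podMetrics) map[string]float64 {
	minQ, maxQ := 0, 0
	first := true
	for _, m := range pods {
		q := m.WaitingQueueSize
		if first || q < minQ {
			minQ = q
		}
		if first || q > maxQ {
			maxQ = q
		}
		first = false
	}
	scores := make(map[string]float64, len(pods))
	for name, m := range pods {
		if maxQ == minQ {
			scores[name] = 1.0 // all queues equal: neutral score
			continue
		}
		scores[name] = float64(maxQ-m.WaitingQueueSize) / float64(maxQ-minQ)
	}
	return scores
}

// kvScore is absolute rather than normalized: 1 - usage.
func kvScore(m podMetrics) float64 {
	return 1 - m.KVCacheUsagePercent
}

// tokenLoadScore is 1 - inFlightTokens/threshold, floored at 0 (assumption).
func tokenLoadScore(m podMetrics, threshold int) float64 {
	if threshold <= 0 {
		return 0
	}
	s := 1 - float64(m.InFlightTokens)/float64(threshold)
	if s < 0 {
		return 0
	}
	return s
}

func main() {
	pods := map[string]podMetrics{
		"pod-a": {WaitingQueueSize: 0, KVCacheUsagePercent: 0.25, InFlightTokens: 2000},
		"pod-b": {WaitingQueueSize: 10, KVCacheUsagePercent: 0.90, InFlightTokens: 9000},
	}
	fmt.Println(queueScores(pods))
	for name, m := range pods {
		fmt.Println(name, kvScore(m), tokenLoadScore(m, 8000))
	}
}
```

The relative/absolute split matters: the queue score only ranks pods against each other, while the KV and token-load scores also say how close a pod is to its limit.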

Filters: prefix-cache-affinity-filter (pkg/epp/framework/plugins/scheduling/filter/prefixcacheaffinity/plugin.go — keeps only pods whose prefix-match ratio ≥ threshold, used as a two-gate strict/loose pair around the SLO tier filter in the helm config config/charts/epplib/templates/_config.yaml:49-60), slo-headroom-tier-filter, the utilization-detector doubling as a filter (below), and a header-based test filter.

Pickers (pkg/epp/framework/plugins/scheduling/picker/): max-score-picker (shuffle for random tie-break, stable sort desc, take top-N — maxscore/picker.go:87-115), random-picker, weighted-random-picker (score-proportional sampling, used with the latency predictor to avoid thundering-herd on the best pod).
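
The shuffle-then-stable-sort trick deserves a closer look: shuffling first randomizes the relative order of equal scores, and a stable sort then preserves that random order among ties. A minimal sketch, with hypothetical names (scored, picker, topN) rather than the in-tree types:

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// scored pairs an endpoint with its accumulated weighted score.
type scored struct {
	Endpoint string
	Score    float64
}

// picker is a hypothetical max-score picker with its own random source.
type picker struct {
	rng *rand.Rand
}

func newPicker(seed int64) *picker {
	return &picker{rng: rand.New(rand.NewSource(seed))}
}

// topN shuffles a copy of the candidates, stable-sorts it by descending
// score, and returns the first n entries.
func (p *picker) topN(candidates []scored, n int) []scored {
	out := append([]scored(nil), candidates...) // leave the caller's slice untouched
	p.rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if n < 0 {
		n = 0
	}
	if n > len(out) {
		n = len(out)
	}
	return out[:n]
}

func main() {
	p := newPicker(42)
	candidates := []scored{{"pod-a", 0.9}, {"pod-b", 0.9}, {"pod-c", 0.4}}
	// pod-a and pod-b tie; which one comes first depends on the shuffle.
	fmt.Println(p.topN(candidates, 2))
}
```

Without the shuffle, a stable sort would always favor whichever tied pod appears first in the candidate list, concentrating load on it.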

Prefix-cache affinity is the most interesting subsystem (proposal: docs/proposals/0602-prefix-cache-aware-routing-proposal/). The EPP cannot see vLLM’s actual block tables, so it maintains an approximation: an LRU index mapping xxhash block hashes → set of pods that recently served them. The chunking mirrors vLLM’s block hashing, chained like a Merkle list and salted by model name:

// pkg/epp/framework/plugins/requestcontrol/dataproducer/approximateprefix/hashing.go:70-86
h := xxhash.New()
// Different models should have different hashes even with the same body.
_, _ = h.Write([]byte(request.TargetModel))
if cacheSalt := request.Body.CacheSalt(); cacheSalt != "" {
    _, _ = h.Write([]byte(cacheSalt))
}
prevBlockHash := blockHash(h.Sum64())
i := 0
for ; i+cacheBlockSizeChars <= len(userInput); i += cacheBlockSizeChars {
    h.Reset()
    _, _ = h.Write(userInput[i : i+cacheBlockSizeChars])
    _, _ = h.Write(toBytes(prevBlockHash))
    res = append(res, blockHash(h.Sum64()))
    prevBlockHash = res[len(res)-1]
}

Defaults (approximateprefix/types.go:88-113): block size 16 tokens (vLLM default; auto-tuned from the server’s cache_config_info metric when AutoTune is on), ~4 chars/token heuristic (no tokenizer in the hot path), max 256 blocks matched, LRU capacity 31,250 entries/pod (sized from an H100 KV-budget calculation in the comment). The flow is split across two hooks of the same plugin: PrepareRequestData hashes the prompt and annotates every candidate with PrefixCacheMatchInfo before scheduling; PreRequest records the chosen pod (and the prefill pod, in P/D mode) against those hashes after scheduling, asynchronously (approximateprefix/plugin.go:140-205). The scorer itself just reads the annotation — production of state and consumption are decoupled through the datalayer attribute system (Consumes()/Produces() declarations let the config loader validate the plugin graph).
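
The hash-then-record flow can be reproduced with the standard library alone. This sketch makes several substitutions that should be read as assumptions: FNV-1a replaces xxhash, a plain map replaces the LRU index, the block size is fixed at 64 characters (16 tokens × ~4 chars), and matching counts the leading run of known blocks. The names hashBlocks, index, record, and matchRatio are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

const blockChars = 64 // 16 tokens × ~4 chars/token, per the defaults above

// hashBlocks chains block hashes: each block's hash covers its own bytes
// plus the previous block's hash, and the chain is seeded by the model name.
func hashBlocks(model string, prompt []byte) []uint64 {
	h := fnv.New64a() // stand-in for xxhash
	h.Write([]byte(model))
	prev := h.Sum64()
	var out []uint64
	var buf [8]byte
	for i := 0; i+blockChars <= len(prompt); i += blockChars {
		h.Reset()
		h.Write(prompt[i : i+blockChars])
		binary.LittleEndian.PutUint64(buf[:], prev)
		h.Write(buf[:])
		prev = h.Sum64()
		out = append(out, prev)
	}
	return out
}

// index maps a block hash to the pods believed to hold it.
// A plain map with no eviction; the real index is an LRU.
type index map[uint64]map[string]struct{}

// record notes that pod served a request containing these blocks.
func (ix index) record(pod string, hashes []uint64) {
	for _, h := range hashes {
		if ix[h] == nil {
			ix[h] = map[string]struct{}{}
		}
		ix[h][pod] = struct{}{}
	}
}

// matchRatio returns the leading run of blocks the pod has seen,
// divided by the total number of blocks in the prompt.
func (ix index) matchRatio(pod string, hashes []uint64) float64 {
	if len(hashes) == 0 {
		return 0
	}
	matched := 0
	for _, h := range hashes {
		if _, ok := ix[h][pod]; !ok {
			break // chained hashes: a miss invalidates everything after it
		}
		matched++
	}
	return float64(matched) / float64(len(hashes))
}

func main() {
	system := make([]byte, 128) // two full blocks of shared "system prompt"
	for i := range system {
		system[i] = 's'
	}
	first := append(append([]byte(nil), system...), make([]byte, 64)...)
	second := append([]byte(nil), first...)
	second[len(second)-1] = '?' // same prefix, different final block

	ix := index{}
	ix.record("pod-a", hashBlocks("demo-model", first))
	fmt.Println(ix.matchRatio("pod-a", hashBlocks("demo-model", second)))
	fmt.Println(ix.matchRatio("pod-b", hashBlocks("demo-model", second)))
}
```

Because each hash folds in its predecessor, two prompts share block N only if they share every block before it, which is why a prefix match can stop at the first miss.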

4.3 Per-pod state: the datalayer

Service discovery: a controller watches pods matching the pool selector (or a static --endpoint-selector in standalone mode) and registers one collector per endpoint. Each collector runs a ticker loop polling its data sources and feeding extractors:

// pkg/epp/datalayer/collector.go:139-158
case <-ticker.Channel():
    for _, src := range sources {
        tn := src.TypedName()
        key := tn.String()

        ctx, cancel := context.WithTimeout(c.ctx, defaultCollectionTimeout)
        data, err := src.Poll(ctx, endpoint)
        cancel()

        logErrorTransition(logger, c.lastPollErrors, key, "poll", "source", err)
        if err != nil {
            continue
        }

        if srcExtractors, ok := exts[tn.Name]; ok && data != nil {
            for _, ext := range srcExtractors {
                extErr := ext.Extract(ctx, data, endpoint)

Defaults (pkg/epp/server/options.go:102-109): scrape every 50ms per pod with a 1s timeout, metrics-staleness-threshold 2s, and the vLLM metric names as defaults: vllm:num_requests_waiting, vllm:num_requests_running, vllm:kv_cache_usage_perc, vllm:lora_requests_info, vllm:cache_config_info. The metrics-data-source plugin GETs /metrics and parses Prometheus text; the core-metrics-extractor maps families into the shared Metrics snapshot (pkg/epp/framework/interface/datalayer/metrics.go:26-42: ActiveModels/WaitingModels/MaxActiveModels, RunningRequestsSize, WaitingQueueSize, KVCacheUsagePercent, CacheBlockSize, CacheNumBlocks, UpdateTime). Engine dialects are table-driven — the same extractor handles vLLM, SGLang, trtllm-serve, and Triton via per-engine metric specs selected by a pod label (inference.networking.k8s.io/engine-type):

// pkg/epp/framework/plugins/datalayer/extractor/metrics/factories.go:89-98
{
    Name:                    "sglang",
    QueuedRequestsSpec:      "sglang:num_queue_reqs",
    RunningRequestsSpec:     "sglang:num_running_reqs",
    KVUsageSpec:             "sglang:token_usage",
    LoRASpec:                "",
    CacheInfoSpec:           "sglang:cache_config_info",
    CacheBlockSizeLabelName: "page_size",
    CacheNumBlocksLabelName: "num_pages",
},

The scrape contract a model server must satisfy is the Model Server Protocol (docs/proposals/003-model-server-protocol/README.md). Staleness handling is consumer-side: every snapshot carries UpdateTime, and consumers like the saturation detector treat stale pods as saturated (its own staleness bound defaults to 200ms — pkg/epp/framework/plugins/flowcontrol/saturationdetector/utilization/config.go:36), while readiness/debug listers partition pods into fresh/stale against the 2s threshold (pkg/epp/datalayer/logger/logger.go:78-93).
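
A scrape boils down to reading a few gauges out of Prometheus text and stamping the result with a time that consumers can check. The sketch below is deliberately naive and is not the metrics-data-source plugin: snapshot, parseMetrics, and isStale are hypothetical names, and it handles only the three vLLM gauges named above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// snapshot is a hypothetical, trimmed-down per-pod metrics record.
type snapshot struct {
	WaitingQueueSize    int
	RunningRequestsSize int
	KVCacheUsagePercent float64
	UpdateTime          time.Time
}

// parseMetrics reads Prometheus text exposition and keeps three gauges.
// Labels are skipped, and the first number after them is taken as the value.
func parseMetrics(body string, now time.Time) snapshot {
	s := snapshot{UpdateTime: now}
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // blank line, HELP, or TYPE
		}
		i := strings.IndexAny(line, "{ ")
		if i < 0 {
			continue
		}
		name, rest := line[:i], line[i:]
		if j := strings.LastIndexByte(rest, '}'); j >= 0 {
			rest = rest[j+1:] // drop the label set
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		switch name {
		case "vllm:num_requests_waiting":
			s.WaitingQueueSize = int(v)
		case "vllm:num_requests_running":
			s.RunningRequestsSize = int(v)
		case "vllm:kv_cache_usage_perc":
			s.KVCacheUsagePercent = v
		}
	}
	return s
}

// isStale is the consumer-side check: the snapshot is too old to trust.
func isStale(s snapshot, now time.Time, threshold time.Duration) bool {
	return now.Sub(s.UpdateTime) > threshold
}

func main() {
	body := `# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 3
vllm:num_requests_running{model_name="demo"} 7
vllm:kv_cache_usage_perc{model_name="demo"} 0.42
`
	scrapedAt := time.Now()
	s := parseMetrics(body, scrapedAt)
	fmt.Println(s.WaitingQueueSize, s.RunningRequestsSize, s.KVCacheUsagePercent)
	fmt.Println("stale after 3s:", isStale(s, scrapedAt.Add(3*time.Second), 2*time.Second))
}
```

Keeping the staleness decision with the consumer lets each one pick its own bound, as the saturation detector does with its tighter default.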

4.4 Flow control: admission, queueing, saturation

Two admission paths exist behind the AdmissionController interface (pkg/epp/requestcontrol/admission.go:39-56):

Legacy (default) — stateless shed-or-pass: sheddable (priority < 0) requests are rejected with 429 when pool saturation ≥ 1.0; non-sheddable always pass to the scheduler. Saturation comes from the utilization-detector, a roofline model on the two scraped pressure signals:

// pkg/epp/framework/plugins/flowcontrol/saturationdetector/utilization/detector.go:124-136
if metrics == nil || time.Since(metrics.UpdateTime) > d.config.MetricsStalenessThreshold {
    totalScore += 1.0
    continue
}

qRatio := float64(metrics.WaitingQueueSize) / float64(d.config.QueueDepthThreshold)
kvRatio := metrics.KVCacheUsagePercent / d.config.KVCacheUtilThreshold

// Roofline Analysis: The pod is saturated if either resource is exhausted.
totalScore += max(qRatio, kvRatio)

The same detector doubles as a scheduling filter that drops pods beyond threshold × (1 + headroom) — with a fail-open clause returning all endpoints if everything is saturated (detector.go:142-168).
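
Putting the roofline score and the legacy shed-or-pass rule together gives a compact picture of the default admission path. In this sketch, averaging the per-pod scores into a pool value and treating an empty pool as saturated are my assumptions (the excerpt shows only the accumulation), and poolSaturation, admitLegacy, and detectorConfig are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// podMetrics is a hypothetical, trimmed-down per-pod snapshot.
type podMetrics struct {
	WaitingQueueSize    int
	KVCacheUsagePercent float64
	UpdateTime          time.Time
}

// detectorConfig mirrors the three thresholds used in the excerpt above.
type detectorConfig struct {
	QueueDepthThreshold       int
	KVCacheUtilThreshold      float64
	MetricsStalenessThreshold time.Duration
}

// poolSaturation scores each pod as max(queue ratio, KV ratio), counts
// missing or stale pods as 1.0, and averages across the pool (assumption).
func poolSaturation(pods []*podMetrics, cfg detectorConfig, now time.Time) float64 {
	if len(pods) == 0 {
		return 1.0 // assumption: no pods means nothing can absorb load
	}
	total := 0.0
	for _, m := range pods {
		if m == nil || now.Sub(m.UpdateTime) > cfg.MetricsStalenessThreshold {
			total += 1.0
			continue
		}
		qRatio := float64(m.WaitingQueueSize) / float64(cfg.QueueDepthThreshold)
		kvRatio := m.KVCacheUsagePercent / cfg.KVCacheUtilThreshold
		total += math.Max(qRatio, kvRatio)
	}
	return total / float64(len(pods))
}

// admitLegacy is the stateless rule: sheddable (priority < 0) requests get
// a 429 once saturation reaches 1.0; everything else passes to the scheduler.
func admitLegacy(priority int, saturation float64) (admitted bool, status int) {
	if priority < 0 && saturation >= 1.0 {
		return false, 429
	}
	return true, 0
}

func main() {
	now := time.Now()
	cfg := detectorConfig{
		QueueDepthThreshold:       5,
		KVCacheUtilThreshold:      0.8,
		MetricsStalenessThreshold: 200 * time.Millisecond,
	}
	pods := []*podMetrics{
		{WaitingQueueSize: 1, KVCacheUsagePercent: 0.4, UpdateTime: now},
		{WaitingQueueSize: 0, KVCacheUsagePercent: 0.1, UpdateTime: now.Add(-time.Second)}, // stale
	}
	sat := poolSaturation(pods, cfg, now)
	ok, status := admitLegacy(-1, sat)
	fmt.Println(sat, ok, status)
}
```

Note how the stale pod contributes a full 1.0 even though its last reading was nearly idle: an unobservable pod is treated as unavailable capacity.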

Flow Control layer (experimental, pkg/epp/flowcontrol/): a full queueing system. FlowControlAdmissionController.Admit wraps the request and blocks in FlowController.EnqueueAndWait (pkg/epp/flowcontrol/controller/controller.go:206). The controller is a supervisor over sharded ShardProcessor workers; requests land in per-flow queues keyed by x-gateway-inference-fairness-id within priority bands. Pluggable policies, all registered in the same plugin registry: fairness (round-robin, global-strict), ordering (fcfs, edf earliest-deadline-first, slo-deadline), usage limits (static token budgets), and queue data structures (ListQueue, MaxMinHeap; pkg/epp/flowcontrol/framework/plugins/queue/). Outcomes map to ext-proc responses: dispatch → continue to scheduler; reject/TTL-expiry → 429/503; and queued requests can be evicted post-enqueue, which is what that eviction channel in the handler’s select loop is for (pkg/epp/handlers/server.go:260-275). The design doc pkg/epp/flowcontrol/README.md is one of the best-written pieces in the repo: head-of-line blocking, displacement, and band-relative fairness are all spelled out.
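
The combination of priority bands, per-flow queues, round-robin fairness, and FCFS ordering can be modeled in a single-threaded toy. This is a sketch of the concept only: it has no sharding, TTLs, or eviction, serving bands in strict priority order is my assumption, and the names request, band, queues, enqueue, and dispatch are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// request carries the two keys the queueing layer cares about.
type request struct {
	ID       string
	FlowID   string // value of x-gateway-inference-fairness-id
	Priority int
}

// band holds the per-flow FIFO queues for one priority level.
type band struct {
	flows  map[string][]request
	order  []string // flow IDs in first-seen order, for round-robin
	cursor int      // index of the flow to try first on the next dispatch
}

// queues groups bands by priority.
type queues struct {
	bands map[int]*band
}

func newQueues() *queues {
	return &queues{bands: map[int]*band{}}
}

// enqueue appends the request to its flow's queue inside its priority band.
func (q *queues) enqueue(r request) {
	b := q.bands[r.Priority]
	if b == nil {
		b = &band{flows: map[string][]request{}}
		q.bands[r.Priority] = b
	}
	if _, seen := b.flows[r.FlowID]; !seen {
		b.order = append(b.order, r.FlowID)
	}
	b.flows[r.FlowID] = append(b.flows[r.FlowID], r)
}

// dispatch serves the highest-priority band that has work (assumption),
// rotating round-robin across its flows and FCFS within each flow.
func (q *queues) dispatch() (request, bool) {
	prios := make([]int, 0, len(q.bands))
	for p := range q.bands {
		prios = append(prios, p)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(prios)))
	for _, p := range prios {
		b := q.bands[p]
		for i := 0; i < len(b.order); i++ {
			idx := (b.cursor + i) % len(b.order)
			flow := b.order[idx]
			if pending := b.flows[flow]; len(pending) > 0 {
				b.flows[flow] = pending[1:]
				b.cursor = (idx + 1) % len(b.order)
				return pending[0], true
			}
		}
	}
	return request{}, false
}

func main() {
	q := newQueues()
	// Tenant A floods the queue; tenant B sends one request; C is sheddable.
	for _, r := range []request{
		{"a1", "tenant-a", 0}, {"a2", "tenant-a", 0}, {"a3", "tenant-a", 0},
		{"b1", "tenant-b", 0}, {"c1", "tenant-c", -1},
	} {
		q.enqueue(r)
	}
	for r, ok := q.dispatch(); ok; r, ok = q.dispatch() {
		fmt.Print(r.ID, " ")
	}
	fmt.Println()
}
```

The point of keying queues by flow is visible in the dispatch order: tenant B's single request is served second, ahead of tenant A's backlog, instead of waiting behind it.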

4.5 Conformance: what a Gateway implementation must do

conformance/ vendors the upstream Gateway API conformance machinery and defines a Gateway profile (conformance/conformance.go:60-66). To claim InferencePool support an implementation must pass tests covering, among others (conformance/tests/):

  • inferencepool_accepted, inferencepool_resolvedrefs_condition — status conditions written per parent Gateway.
  • gateway_following_epp_routing (+ _dp for multi-port data parallelism) — deploy a real EPP configured with a header-based test filter, assert the gateway actually routes to the exact pod the EPP names. This is the heart of conformance: the proxy must honor x-gateway-destination-endpoint.
  • epp_unavailable_fail_open — kill the EPP, failureMode: FailOpen pools must still serve.
  • gateway_destination_endpoint_served — the proxy must report which endpoint actually served via response-path metadata.
  • inferencepool_invalid_epp_service, httproute_invalid_inferencepool_ref, port validation, multiple-gateways/pools weighting tests.

Backends in conformance are echo servers, not model servers (conformance/resources/base.yaml:60-83) — conformance tests routing mechanics, not scheduling quality. Reports are published per-implementation under conformance/reports/. If Discord ever fronts inference with its own Envoy control plane, this suite is the compliance bar.

5. Suggested reading path

  1. README.md, then docs/proposals/002-api-proposal/README.md (why pool+objective) and docs/proposals/004-endpoint-picker-protocol/README.md (the EPP↔Envoy contract — short, read fully).
  2. api/v1/inferencepool_types.go, apix/v1alpha2/inferenceobjective_types.go — the API surface.
  3. test/testdata/envoy.yaml — the raw Envoy config; map every field to the protocol doc.
  4. pkg/epp/handlers/server.go (Process loop) and pkg/epp/handlers/request.go — the ext-proc state machine.
  5. pkg/epp/requestcontrol/director.go:151-229 — the orchestration spine.
  6. pkg/epp/scheduling/scheduler.go + scheduler_profile.go — the framework; then docs/proposals/0845-scheduler-architecture-proposal/.
  7. Scorers in order: queuedepth/queue.go, kvcacheutilization/kvcache_utilization.go, loraaffinity/lora_affinity.go, then the prefix pair: requestcontrol/dataproducer/approximateprefix/{hashing,indexer,plugin}.go + scorer/prefix/plugin.go.
  8. pkg/epp/datalayer/collector.go + framework/plugins/datalayer/extractor/metrics/ — metrics pipeline; pkg/epp/server/options.go for every default.
  9. pkg/epp/flowcontrol/README.md, then requestcontrol/admission.go and saturationdetector/utilization/detector.go; go deeper into flowcontrol/controller/ only if fairness interests you.
  10. pkg/epp/config/loader/defaults.go + apix/config/v1alpha1/endpointpickerconfig_types.go — how config becomes a plugin graph.
  11. conformance/tests/gateway_following_epp_routing.go and test/integration/epp/harness.go — how it’s all verified.

6. Connections to your other study repos

  • llm-d — the most direct: llm-d’s inference scheduler is this EPP framework (the code is migrating to llm-d/llm-d-inference-scheduler), with llm-d adding disaggregated prefill/decode plugins, KV-cache-event-based (precise, not approximate) prefix indexing, and the vLLM-side integration. The multi-profile scheduler and the hardcoded "prefill" profile name here are the seams llm-d plugs into. Study GIE first; llm-d then reads as “GIE plus opinionated vLLM deployment.”
  • vllm — the backend whose telemetry drives everything: vllm:num_requests_waiting, vllm:kv_cache_usage_perc, vllm:lora_requests_info are the EPP’s eyes (pkg/epp/server/options.go:105-109). The approximate prefix index mirrors vLLM’s block-hash chaining (prefix_caching design) at 16-token granularity. KV pressure → preemption → latency cliff is the vLLM behavior the saturation detector encodes.
  • sgl-router & dynamo — the architectural counterpoint. Both put inference-aware routing in the data plane process (SGLang’s Rust router with cache-aware tree matching; Dynamo’s distributed runtime with its own KV-aware planner/router tier), getting tokenizer-exact prefix matching and event-driven cache state at the cost of owning the proxy: TLS, HTTP/2, retries, observability, deployment. GIE instead splits decision plane (EPP) from data plane (any conformant Envoy), paying one ext-proc RTT (~ms, plus full-duplex body streaming) and accepting approximate cache state, in exchange for vendor-neutrality and reuse of the Envoy ecosystem. Note GIE hedges on precision: char-based heuristic hashing here, with the event-driven precise indexer living in llm-d. Knowing both sides of this trade is exactly the judgment an infra interviewer probes.
  • nano-vllm — the minimal lab for why these signals exist: see its block manager and scheduler to internalize what kv_cache_usage_perc and num_requests_waiting physically mean before tuning scorers that consume them.
  • xgrammar — orthogonal layer (constrained decoding inside the engine); only contact point is that structured-output requests skew decode cost, which token-load/latency scorers absorb statistically.
  • flashinfer — the layer beneath vLLM that makes decode latency batch-size-dependent (paged KV attention kernels). That kernel-level fact is the root cause of “load-dependent latency” that makes least-request insufficient — the EPP is the system-level compensation.

7. Hands-on without a GPU fleet

Everything here runs on a laptop:

  • Unit tests (make test-unit) — pure Go, no cluster. Highest-value reads/runs: pkg/epp/scheduling/scheduler_test.go (full filter/score/pick cycles against fake metrics), approximateprefix/plugin_test.go and indexer_test.go (prefix matching end-to-end), saturationdetector/utilization/detector_test.go, pkg/epp/handlers/server_abort_test.go.
  • Hermetic integration tests (make test-integration, test/integration/epp/hermetic_test.go) — boots the real EPP gRPC server via controller-runtime envtest (local kube-apiserver, no kubelets), injects pod metrics through FakePodMetricsClient / a mock datalayer source (test/integration/epp/harness.go:240-263), then drives the actual ext-proc stream with handcrafted ProcessingRequests and asserts on the returned header mutations — i.e., you can watch a scheduling decision change as you flip a fake pod’s WaitingQueueSize. This is the best place to experiment with new scorer behavior.
  • Fake backends are first-class: the e2e suite and all getting-started guides default to the vLLM simulator ghcr.io/llm-d/llm-d-inference-sim (config/manifests/vllm/sim-deployment.yaml) — a CPU-only container that speaks the OpenAI API and emits protocol-conformant vllm:* metrics, including LoRA. A kind cluster + any conformant gateway (kgateway, Istio, Envoy Gateway) + sim deployment gives you the full path: site-src/guides/index.md is the walkthrough; make test-e2e automates it (test/e2e/epp/README.md).
  • Standalone mode, no Gateway API at all (site-src/guides/standalone.md): EPP + Envoy as a sidecar pair, pods discovered by --endpoint-selector app=... label selector — the minimal lab to watch ext-proc traffic with the simulator, and incidentally the deployment shape closest to “drop an inference scheduler into an existing Envoy fleet.”
  • Conformance against kind: go test ./conformance --run TestConformance with --gateway-class pointed at your gateway; echo-server backends only, no accelerators (site-src/guides/conformance-tests.md).
  • Flow control benchmarks: pkg/epp/flowcontrol/benchmark/ + make test-benchmark for queue/fairness micro-benchmarks.

Good first contribution surfaces (judged from TODOs and experimental markers in-tree, modulo the llm-d migration): the RequestContext protocol/lifecycle decoupling (pkg/epp/handlers/server.go:93-95 TODO), configurable prepareDataTimeout (pkg/epp/requestcontrol/director.go:52-55), the canonical P/D profile mechanism replacing the hardcoded "prefill" name (issue #2080), flow-control policy plugins, and conformance tests — which this repo explicitly retains and which are pure Gateway/Envoy engineering.