NVIDIA Dynamo

1. What it is

Dynamo is NVIDIA’s open-source, datacenter-scale inference serving framework. It does not run models itself — it is the orchestration layer above inference engines (vLLM, SGLang, TensorRT-LLM), turning a fleet of single-node engines into one coordinated serving system:

OpenAI-compatible frontend (Rust, axum) with preprocessing (chat templates, tokenization)
KV-aware router that scores workers by cached-prefix overlap plus live load
Disaggregated prefill/decode with GPU-to-GPU KV-cache transfer via NIXL
KV Block Manager (KVBM) for multi-tier KV offload (GPU → pinned host → disk → S3)
Planner for SLA-driven autoscaling of prefill/decode pools
A distributed runtime (service discovery, leases, request plane, event plane) that all of the above is built on

The mental model: engines optimize a box; Dynamo optimizes the fleet. If you serve one model on one GPU, you don’t need it. If you serve across nodes and want to never recompute a prefix that already exists somewhere in the cluster, this is the layer that does it.

Version in this snapshot: workspace 1.3.0 (Cargo.toml:36). Core is Rust (edition 2024, tokio); engine adapters and the planner are Python; everything meets at PyO3 bindings.

2. Why you care

This repo is the closest thing in the ecosystem to “your current job, but for LLM inference”:

It is traffic infrastructure in Rust. The hot path (HTTP ingress, routing decision, request dispatch, token streaming) is all tokio/axum/tonic code. The same vocabulary you use at the edge (connection pools, pub/sub planes, leases, backpressure, softmax-with-temperature load balancing, sticky sessions, request migration) appears here with one new twist: the load balancer has to reason about where bytes of KV cache already live.
Routing is the product. A request router whose cost function is prefill_blocks_you'd_recompute + decode_blocks_in_flight is cache-aware load balancing — the same family of problem as consistent-hash + least-loaded at Discord, but the cache is GPU VRAM and a miss costs a multi-second prefill.
There is literally an Envoy ext_proc server in here. deploy/inference-gateway/ext-proc is a Rust gRPC ext_proc implementation of the Gateway API Inference Extension’s Endpoint Picker, backed by the same dynamo-kv-router crate. Your Envoy background maps directly.
Token egress is a designed transport, not an afterthought: a call-home TCP response plane, separate from the request plane, separate from the event plane.

3. Architecture map

The Rust workspace

Workspace members are declared in Cargo.toml:4-32. The crates that matter, roughly by layer:

Crate	Path	What it does
`dynamo-runtime`	`lib/runtime`	The distributed runtime: `DistributedRuntime`, Namespace/Component/Endpoint model, discovery backends (K8s, etcd, file, memory), etcd leases, NATS/TCP/ZMQ transports, the pipeline graph (`Source → Operator → Sink`), push router, metrics
`dynamo-llm`	`lib/llm`	Everything LLM-specific: OpenAI/Anthropic HTTP service, preprocessor (chat template + tokenize), `KvPushRouter`, `PrefillRouter` (disagg orchestration), migration, model discovery/cards, KVBM block manager v2, gRPC (KServe) frontend, engines glue
`dynamo-kv-router`	`lib/kv-router`	The routing core, runtime-free: radix-tree indexers (single-thread, concurrent, compressed), scheduling (cost function, queue, admission), active-sequence tracking, replica sync, ZMQ wire format
`dynamo-tokens`	`lib/tokens`	Token sequences, block hashing helpers
`dynamo-kv-hashing`	`lib/kv-hashing`	“Request → PositionalLineageHash” contract so router, KVBM, and engines agree on block identity
`dynamo-protocols`	`lib/protocols`	OpenAI-compatible API types with NVIDIA extensions (`nvext`)
`dynamo-tokenizers` / `dynamo-renderer`	`lib/tokenizers`, `lib/renderer`	HF tokenizers and minijinja chat-template rendering, runtime-free
`dynamo-parsers`	`lib/parsers`	Tool-calling and reasoning-trace parsers
`dynamo-memory` + `kvbm-*`	`lib/memory`, `lib/kvbm-*`	Memory arenas, NIXL agent bindings (`lib/memory/src/nixl/agent.rs`), and the KV Block Manager split into logical/physical/engine/kernels crates
`dynamo-mocker`	`lib/mocker`	GPU-free simulated engine: vLLM/SGLang-style schedulers, KV events, disagg — drives the real router/frontend without hardware
`dynamo-backend-common`	`lib/backend-common`	Shared glue for writing pure-Rust backends
`dynamo-ext-proc`	`deploy/inference-gateway/ext-proc`	Envoy ext_proc gRPC server (GAIE Endpoint Picker) reusing `dynamo-kv-router`

Python bindings and components

lib/bindings/python is a PyO3 crate exposing the runtime and LLM stack as dynamo._core (lib/bindings/python/rust/lib.rs). Python gets DistributedRuntime, Endpoint, KvEventPublisher, register_model, make_engine, the planner hooks, etc.
components/src/dynamo/ is the Python product surface: frontend (thin wrapper that launches the Rust HTTP server + router), vllm / sglang / trtllm (engine workers), planner, mocker, router (standalone router process), global_router, profiler.

How engines plug in as workers

An engine worker is just a Python process that (1) builds a DistributedRuntime, (2) creates an endpoint like dyn://namespace.component.generate, (3) registers a model card, and (4) serves an async generator. From the vLLM component:

# components/src/dynamo/vllm/worker_factory.py:196-230 (abridged)
generate_endpoint = runtime.endpoint(
    f"{config.namespace}.{config.component}.{config.endpoint}"
)
...
await register_model(
    ModelInput.Tokens, ModelType.Empty, generate_endpoint, config.model,
    model_name=config.served_model_name or config.model,
    worker_type=WorkerType.Encode,
    needs=[[WorkerType.Prefill, WorkerType.Decode], [WorkerType.Aggregated]],
)
...
await generate_endpoint.serve_endpoint(
    handler.generate, metrics_labels=[("model", config.model)]
)

register_model publishes a Model Deployment Card into discovery; the frontend’s ModelWatcher (lib/llm/src/discovery/watcher.rs) sees it and wires a new pipeline. The engine never speaks HTTP — it receives already-tokenized PreprocessedRequests and yields token deltas. The handler’s generate(request, context) is the entire engine contract.

4. Core mechanisms

4.1 Request flow end to end

Standalone-mode flow: client → Frontend (HTTP) → [preprocess → migration → detokenize → PrefillRouter → KvPushRouter] → worker, with tokens streaming back through the same chain. The whole thing is one typed pipeline built from operators with forward and backward edges:

// lib/llm/src/entrypoint/input/common.rs:352-362
let engine = frontend
    .link(preprocessor_op.forward_edge())?
    .link(migration.forward_edge())?
    .link(token_backend.forward_edge())?
    .link(prefill_op.forward_edge())?
    .link(backend)?
    .link(prefill_op.backward_edge())?
    .link(token_backend.backward_edge())?
    .link(migration.backward_edge())?
    .link(preprocessor_op.backward_edge())?
    .link(frontend)?;

Forward edges transform the request (OpenAI JSON → templated prompt → token IDs); backward edges transform the response stream (token IDs → detokenized deltas → OpenAI SSE chunks). Migration sits high enough to replay/redirect an in-flight request if a worker dies.
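
The forward/backward edge idea can be illustrated with a toy pipeline. This is a minimal sketch, not Dynamo's pipeline API: `Operator`, `run_pipeline`, the tiny vocabulary, and the echo backend are all hypothetical names made up for illustration. It only shows the ordering property: request hooks run in link order on the way in, response hooks run in reverse order on the way out.

```python
class Operator:
    """Hypothetical operator: one hook for the request, one for the response stream."""

    def __init__(self, name, forward, backward):
        self.name = name
        self.forward = forward    # request -> request
        self.backward = backward  # response stream -> response stream


def run_pipeline(operators, backend, request):
    # Forward edges run in link order on the way in.
    for op in operators:
        request = op.forward(request)
    stream = backend(request)
    # Backward edges run in reverse link order on the way out.
    for op in reversed(operators):
        stream = op.backward(stream)
    return stream


VOCAB = ["hello", "world", "again"]  # toy vocabulary (assumption)


def tokenize(req):
    # Forward edge of a toy preprocessor: prompt text -> token IDs.
    return {**req, "token_ids": [VOCAB.index(w) for w in req["prompt"].split()]}


def to_chunks(stream):
    # Backward edge of the toy preprocessor: text deltas -> response chunks.
    for text in stream:
        yield {"delta": text}


def detokenize(stream):
    # Backward edge of a toy token backend: token IDs -> text deltas.
    for token_id in stream:
        yield VOCAB[token_id]


def echo_backend(req):
    # Stand-in for an engine: streams the prompt tokens back.
    yield from req["token_ids"]


operators = [
    Operator("preprocessor", tokenize, to_chunks),
    Operator("token_backend", lambda r: r, detokenize),
]
chunks = list(run_pipeline(operators, echo_backend, {"prompt": "hello world"}))
```

Because the backward hooks wrap the stream lazily, tokens flow through the whole chain one at a time, which is the property that makes streaming egress work.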

Transport choices — three independent planes, which is the part worth studying closely:

Client-facing egress is axum SSE. The engine’s response stream is mapped through metrics observers and a disconnect monitor, then wrapped:

// lib/llm/src/http/service/openai.rs:760-768
let stream = monitor_for_disconnects(stream, ctx, inflight_guard, stream_handle);

let mut sse_stream = Sse::new(stream);

if let Some(keep_alive) = state.sse_keep_alive() {
    sse_stream = sse_stream.keep_alive(KeepAlive::default().interval(keep_alive));
}

Ok(sse_stream.into_response())

Request plane (frontend → worker RPC) defaults to direct TCP, selectable to NATS via DYN_REQUEST_PLANE (docs/design-docs/request-plane.md). Routing mode aside, the actual dispatch is PushRouter::direct(request, instance_id) (lib/runtime/src/pipeline/network/egress/push_router.rs).
Response plane is the part that will feel familiar from designing token-egress paths: the requester runs a TcpStreamServer (lib/runtime/src/pipeline/network/tcp/server.rs:87-93), and the request carries connection info; the worker calls home over a fresh TCP stream (CallHomeHandshake, same file) and frames response chunks back with a two-part codec. So request fan-out and token fan-in are decoupled — tokens never transit NATS even when the request did. Frame inactivity timeouts, prologue handshakes, and per-stream contexts for cancellation are all explicit (lib/runtime/src/pipeline/network/egress/push_router.rs:52-62).
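
To make the "two-part codec" idea concrete, here is a minimal sketch of a header-plus-body frame format. The wire layout (two big-endian u32 lengths followed by a JSON header and raw body) and the names `encode_frame` and `decode_frames` are assumptions for illustration only; Dynamo's actual framing lives in the runtime crate and may differ in every detail.

```python
import json
import struct

# Hypothetical layout: header length and body length as big-endian u32s.
PREFIX_FMT = "!II"
PREFIX_LEN = struct.calcsize(PREFIX_FMT)


def encode_frame(header: dict, body: bytes) -> bytes:
    """Serialize one frame: length prefix, then header bytes, then body bytes."""
    header_bytes = json.dumps(header).encode("utf-8")
    prefix = struct.pack(PREFIX_FMT, len(header_bytes), len(body))
    return prefix + header_bytes + body


def decode_frames(buffer: bytes):
    """Return (frames, remainder): every complete frame plus leftover bytes."""
    frames = []
    offset = 0
    while len(buffer) - offset >= PREFIX_LEN:
        header_len, body_len = struct.unpack_from(PREFIX_FMT, buffer, offset)
        end = offset + PREFIX_LEN + header_len + body_len
        if end > len(buffer):
            break  # partial frame: wait for more bytes from the socket
        header_start = offset + PREFIX_LEN
        header = json.loads(buffer[header_start:header_start + header_len])
        body = buffer[header_start + header_len:end]
        frames.append((header, body))
        offset = end
    return frames, buffer[offset:]


wire = encode_frame({"stream_id": 7}, b"tok1") + encode_frame(
    {"stream_id": 7, "final": True}, b""
)
# Simulate a TCP read that stops three bytes short of the second frame.
first_frames, leftover = decode_frames(wire[:-3])
second_frames, rest = decode_frames(leftover + wire[-3:])
```

Keeping control information in the header and tokens in the body lets the receiver act on cancellation or end-of-stream markers without parsing payload bytes.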

There is also a separate event plane (NATS Core or ZMQ, DYN_EVENT_PLANE) carrying KV events and router replica sync — covered next. A KServe gRPC frontend exists alongside HTTP (lib/llm/src/grpc), and lib/llm/src/http/service/realtime.rs covers websocket realtime.

4.2 KV-aware routing

The router answers: given these tokens, which worker would do the least new work? Three pieces: block hashing, a global prefix index fed by worker events, and a cost function that fuses cache overlap with live load.

Block hashing. Token sequences are chunked into engine-sized blocks (16/64 tokens) and hashed with XXH3; LoRA adapter identity is mixed into the seed so adapters don’t cross-hit:

// lib/kv-router/src/protocols.rs:85-97
pub fn compute_block_hash_for_seq(
    tokens: &[u32],
    kv_block_size: u32,
    options: BlockHashOptions<'_>,
) -> Vec<LocalBlockHash> {
    if kv_block_size == 0 {
        return Vec::new();
    }
    let seed = match options.lora_name.filter(|n| !n.is_empty()) {
        Some(name) => XXH3_SEED.wrapping_add(xxh3::xxh3_64(name.as_bytes())),
        None => XXH3_SEED,
    };
    // ... (rest of the function elided in this excerpt)

These are local hashes computed from token content, so the router never depends on engines agreeing on block IDs (docs/design-docs/router-design.md, “Deterministic Event IDs”).
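
A small sketch of content-derived block hashing, under stated assumptions: keyed BLAKE2b from the standard library stands in for seeded XXH3, `BASE_SEED` is a made-up constant, and skipping a trailing partial block is an assumption rather than something the excerpt above confirms. The point is only that identical token prefixes produce identical leading hashes, and that mixing the adapter name into the seed separates LoRA traffic.

```python
import hashlib
from typing import List, Optional

BASE_SEED = 1337  # hypothetical value; the real constant is XXH3_SEED


def _hash64(data: bytes, seed: int) -> int:
    # Stand-in for seeded XXH3-64: BLAKE2b keyed with the seed, 64-bit digest.
    key = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little")


def block_hashes(
    tokens: List[int], block_size: int, lora_name: Optional[str] = None
) -> List[int]:
    if block_size == 0:
        return []
    seed = BASE_SEED
    if lora_name:
        # Mix adapter identity into the seed, wrapping at 64 bits.
        seed = (BASE_SEED + _hash64(lora_name.encode("utf-8"), 0)) % (1 << 64)
    hashes = []
    # Only full blocks are hashed; a trailing partial block is skipped (assumption).
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        data = b"".join(t.to_bytes(4, "little") for t in block)
        hashes.append(_hash64(data, seed))
    return hashes


short = block_hashes(list(range(10)), 4)
longer = block_hashes(list(range(8)) + [99, 100, 101, 102], 4)
with_lora = block_hashes(list(range(10)), 4, lora_name="adapter-a")
```

Here `short` covers two full blocks and `longer` covers three; the first two hashes agree because the first eight tokens agree.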

KV events → indexer. Engines publish stored/removed/cleared events as blocks enter and leave their paged KV caches (vLLM via its ZMQ KvEventPublisher, the mocker natively):

// lib/kv-router/src/protocols.rs:640-646
pub enum KvCacheEventData {
    Stored(KvCacheStoreData),
    Removed(KvCacheRemoveData),
    Cleared,
}

The KvIndexer folds these into a radix tree over block hashes where every node records which workers hold that prefix. A lookup walks the query’s hash sequence and returns per-worker matched depth:

// lib/kv-router/src/indexer/radix_tree.rs:200-206
pub fn find_matches(&self, sequence: Vec<LocalBlockHash>, early_exit: bool) -> OverlapScores {
    self.find_match_details(sequence, early_exit).overlap_scores
}

pub fn apply_event(&mut self, event: RouterEvent) -> Result<(), KvCacheEventError> {
    self.apply_event_with_counters(event, None)
}

OverlapScores is just FxHashMap<WorkerWithDpRank, u32> (lib/kv-router/src/protocols.rs:942-947). Implementations range from a single-threaded compressed tree to a sharded ConcurrentRadixTree with sticky per-worker write threads (lib/kv-router/src/indexer/). Event-loss handling is sequence-number based: each worker numbers its events monotonically; the router detects gaps and re-queries the worker’s local indexer for the missing range, and it requests a full dump of a worker’s state when that worker is discovered. Alternatively, you can opt into NATS JetStream durable consumption with snapshots in object store (--durable-kv-events).
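
The indexer's shape can be sketched as a plain prefix tree keyed by block hash. This is an illustrative toy, not the crate's radix tree: `PrefixIndex` and its method names are hypothetical, there is no compression or sharding, and only "stored" and "cleared" events are modeled.

```python
from typing import Dict, Hashable, List, Set


class _Node:
    def __init__(self):
        self.children: Dict[int, "_Node"] = {}
        self.workers: Set[Hashable] = set()


class PrefixIndex:
    """Toy prefix tree: each node records which workers hold that prefix."""

    def __init__(self):
        self.root = _Node()

    def stored(self, worker: Hashable, hashes: List[int]) -> None:
        # A worker reported that it now caches this chain of blocks.
        node = self.root
        for h in hashes:
            node = node.children.setdefault(h, _Node())
            node.workers.add(worker)

    def cleared(self, worker: Hashable) -> None:
        # A worker dropped its whole cache: remove it from every node.
        stack = [self.root]
        while stack:
            node = stack.pop()
            node.workers.discard(worker)
            stack.extend(node.children.values())

    def find_matches(self, hashes: List[int]) -> Dict[Hashable, int]:
        # Walk the query and record, per worker, how many leading blocks match.
        scores: Dict[Hashable, int] = {}
        node = self.root
        active = None
        for depth, h in enumerate(hashes, start=1):
            node = node.children.get(h)
            if node is None:
                break
            active = set(node.workers) if active is None else active & node.workers
            if not active:
                break
            for worker in active:
                scores[worker] = depth
        return scores


index = PrefixIndex()
index.stored("w0", [1, 2, 3])
index.stored("w1", [1, 2])
index.stored("w2", [9])
before = index.find_matches([1, 2, 3, 4])
index.cleared("w0")
after = index.find_matches([1, 2, 3, 4])
```

Workers with no matching prefix are simply absent from the result, which mirrors a sparse map of overlap scores.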

Scoring. Selection happens in DefaultWorkerSelector — the cost (“logit”, lower is better) per worker is computed in worker_logit:

// lib/kv-router/src/scheduling/selector.rs:194-203
let effective_overlap_score_credit = weights.overlap_score_credit * overlap_credit_decay;
let overlap_credit_blocks = effective_overlap_score_credit * device_overlap_blocks
    + self.kv_router_config.host_cache_hit_weight * host_overlap_blocks
    + self.kv_router_config.disk_cache_hit_weight * disk_overlap_blocks
    + shared_overlap_blocks;
let adjusted_prefill_blocks = (raw_prefill_blocks - overlap_credit_blocks).max(0.0);
let prefill_cost_blocks = weights.prefill_load_scale * adjusted_prefill_blocks;
let worker_load = worker_load.unwrap_or_default();
let decode_cost_blocks = worker_load.potential_decode_blocks() as f64;
let logit = prefill_cost_blocks + decode_cost_blocks;

Read it as: blocks I’d have to prefill from scratch (discounted by what the worker already holds on device, in pinned host memory, on disk, or in a shared external cache — each tier with its own credit weight) plus blocks already decoding there. An overlap_credit_decay term softly forfeits cache affinity on workers whose prefill backlog exceeds the fleet floor — the explicit TTFT-vs-ITL tradeoff knob. Selection is min-cost with reservoir-sampled tie breaking, or softmax sampling when router_temperature > 0 (lib/kv-router/src/scheduling/selector.rs:29-85). Taints implement topology/required-zone constraints, multiplying scores for preferred taints and filtering for required ones.
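
A worked example helps when reading the cost function. The sketch below is a simplification under stated assumptions: it keeps only the device and host tiers, omits the decay term, disk and shared credits, and taints, uses made-up default weights, and breaks ties with a uniform random choice instead of reservoir sampling. `WorkerView`, `worker_cost`, `softmax_probs`, and `select` are hypothetical names.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class WorkerView:
    device_overlap_blocks: float
    host_overlap_blocks: float = 0.0
    decode_blocks: float = 0.0


def worker_cost(raw_prefill_blocks, w, overlap_credit=1.0, host_weight=0.5,
                prefill_load_scale=1.0):
    # Credit cached blocks, clamp at zero, then add decode work already in flight.
    credit = overlap_credit * w.device_overlap_blocks + host_weight * w.host_overlap_blocks
    adjusted = max(raw_prefill_blocks - credit, 0.0)
    return prefill_load_scale * adjusted + w.decode_blocks


def softmax_probs(costs, temperature):
    # Lower cost means higher probability; subtract the minimum for stability.
    lowest = min(costs.values())
    weights = {k: math.exp(-(c - lowest) / temperature) for k, c in costs.items()}
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def select(costs, temperature=0.0, rng=None):
    rng = rng or random.Random()
    if temperature <= 0.0:
        best = min(costs.values())
        # Uniform tie break (the source describes reservoir sampling instead).
        return rng.choice(sorted(k for k, c in costs.items() if c == best))
    probs = softmax_probs(costs, temperature)
    names = sorted(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


workers = {
    "w0": WorkerView(device_overlap_blocks=8, decode_blocks=6),  # warm but busy
    "w1": WorkerView(device_overlap_blocks=0, decode_blocks=1),  # cold and idle
    "w2": WorkerView(device_overlap_blocks=4, host_overlap_blocks=4, decode_blocks=2),
}
costs = {name: worker_cost(10, w) for name, w in workers.items()}
probs = softmax_probs(costs, temperature=2.0)
winner = select(costs)
```

For a 10-block request the costs come out to 8 for w0, 11 for w1, and 6 for w2, so the partially warm, lightly loaded worker wins over both the warmest and the idlest one.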

Load signals come from the router’s own bookkeeping, not engine polling: a slot manager (ActiveSequencesMultiWorker, lib/kv-router/src/sequences/) predicts active blocks at route time, marks prefill complete on first output token, and frees on stream end. With multiple router replicas, these predictions sync over NATS core (AddRequest / MarkPrefillCompleted / Free, lib/kv-router/src/sequences/replica_sync.rs). Admission is serialized through SchedulerQueue::admit_one so projected load and booking can’t race (lib/kv-router/src/scheduling/CLAUDE.md documents the invariants — worth reading as a design doc). The KV-routing wrapper that ties selection to dispatch:

// lib/llm/src/kv_router/push_router.rs:321-335 (abridged)
let mut response_stream = cancel_on_stop(
    request_context.as_ref(),
    &context_id,
    self.inner
        .direct(updated_request, instance_id)
        .instrument(tracing::info_span!(
            "kv_router.route_request",
            request_id = %context_id,
            worker_id = instance_id,
            overlap_blocks = overlap_amount,
        )),
)
.await??;

KvRouter itself (lib/llm/src/kv_router.rs:186-205) is deliberately “decide, don’t route”: it owns the Indexer + KvScheduler and emits FindBestMatchOutcome::{Routed, Backpressure} — backpressure carries queued-token depth so the HTTP layer can 429/503 instead of piling on.
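
The router-side load bookkeeping described earlier in this section (book at route time, mark prefill complete on the first token, free at stream end) can be sketched as a small state table. This is a toy under stated assumptions: `ActiveSequences`, its fields, and the block counts are hypothetical, and replica sync is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class _Slot:
    worker: str
    prefill_blocks: int
    decode_blocks: int
    prefilling: bool = True


class ActiveSequences:
    """Toy per-request bookkeeping used to predict worker load without polling."""

    def __init__(self):
        self._slots = {}

    def add_request(self, request_id, worker, prefill_blocks, decode_blocks):
        # Booked at route time, before the worker has done any work.
        self._slots[request_id] = _Slot(worker, prefill_blocks, decode_blocks)

    def mark_prefill_completed(self, request_id):
        # Called when the first output token arrives.
        self._slots[request_id].prefilling = False

    def free(self, request_id):
        # Called when the response stream ends or is cancelled.
        self._slots.pop(request_id, None)

    def load(self, worker):
        """Return (pending_prefill_blocks, decode_blocks) for one worker."""
        slots = [s for s in self._slots.values() if s.worker == worker]
        prefill = sum(s.prefill_blocks for s in slots if s.prefilling)
        decode = sum(s.decode_blocks for s in slots)
        return prefill, decode


seqs = ActiveSequences()
seqs.add_request("r1", "w0", prefill_blocks=8, decode_blocks=10)
seqs.add_request("r2", "w0", prefill_blocks=4, decode_blocks=6)
at_route_time = seqs.load("w0")
seqs.mark_prefill_completed("r1")
after_first_token = seqs.load("w0")
seqs.free("r2")
after_stream_end = seqs.load("w0")
```

Because every transition is driven by events the router already sees, the load estimate is available at decision time with no extra round trip to the engine.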

4.3 Disaggregated prefill/decode

Disaggregation is orchestrated entirely at the routing layer by PrefillRouter (lib/llm/src/kv_router/prefill_router/mod.rs:36-43), an operator that sits below migration and the token backend and directly above the decode router (see the pipeline chain above). The split:

Clone the preprocessed request, force max_tokens = 1, route it to a prefill worker (KV-aware, using the same selector with prefill-specific config):

// lib/llm/src/kv_router/prefill_router/mod.rs:114-116
// Prepare prefill request with max_tokens = 1 (clone after tracker is set)
let mut prefill_req = req.clone();
prefill_req.stop_conditions.max_tokens = Some(1);

The prefill worker computes KV and returns transfer metadata, not KV bytes. In the vLLM handler the metadata is vLLM’s kv_transfer_params (block IDs + connection info):

# components/src/dynamo/vllm/handlers.py:3641-3651 (abridged)
output: Dict[str, Any] = {
    "token_ids": list(token_ids),
    "disaggregated_params": self._build_disaggregated_params(
        kv_protocol.decode_request_kv_transfer_params(res),
        embedding_params,
    ),
    "completion_usage": ...,
}

PrefillRouter injects that metadata into the decode request, restores the original max_tokens, and forwards to the decode router:

// lib/llm/src/kv_router/prefill_router/mod.rs:316-335 (abridged)
match outcome {
    PrefillOutcome::Bootstrap { bootstrap_info, worker_id } => {
        decode_req.bootstrap_info = Some(bootstrap_info);
        decode_req.routing_mut().prefill_worker_id = Some(worker_id);
    }
    PrefillOutcome::Completed { result, worker_id, worker_link } => {
        decode_req.prefill_result = Some(result);
        decode_req.migration_link = worker_link;
        ...
    }
};

The decode worker uses the metadata to pull KV directly from the prefill worker’s VRAM via NIXL (NVLink / InfiniBand-UCX / PCIe), non-blocking with respect to its forward passes. The transfer itself runs inside the engines’ KV-connector layer; Dynamo’s own Rust NIXL agent (lib/memory/src/nixl/agent.rs) is used by KVBM and multimodal RDMA.
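
The steps above can be sketched end to end with stub workers. This is a minimal sketch of the synchronous shape only, and everything concrete in it is an assumption: the dictionary layout, the `block_ids` and `remote_host` fields, and the stub functions are invented for illustration, and no real KV transfer happens.

```python
import copy


def prefill_worker(request):
    # Stub: pretend to compute KV and return transfer metadata, not KV bytes.
    assert request["stop_conditions"]["max_tokens"] == 1
    return {
        "token_ids": [101],
        # Hypothetical metadata fields standing in for kv_transfer_params.
        "disaggregated_params": {"block_ids": [0, 1, 2], "remote_host": "prefill-0"},
    }


def decode_worker(request):
    # Stub: a real worker would pull KV using the metadata before decoding.
    assert request["prefill_result"]["disaggregated_params"]["block_ids"]
    for i in range(request["stop_conditions"]["max_tokens"]):
        yield 200 + i


def disaggregated_generate(request):
    # 1. Clone the request and force a single-token prefill.
    prefill_req = copy.deepcopy(request)
    prefill_req["stop_conditions"]["max_tokens"] = 1
    result = prefill_worker(prefill_req)

    # 2. Build the decode request from the original, so max_tokens is intact,
    #    and attach the transfer metadata returned by prefill.
    decode_req = copy.deepcopy(request)
    decode_req["prefill_result"] = result
    return decode_worker(decode_req)


request = {"token_ids": [1, 2, 3], "stop_conditions": {"max_tokens": 3}}
tokens = list(disaggregated_generate(request))
```

Working on deep copies keeps the caller's request untouched, which matters when the same request may later be replayed by migration.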

Two execution shapes (docs/design-docs/disagg-serving.md): SGLang-style bootstrap — prefill worker publishes an RDMA rendezvous endpoint, so prefill is spawned as a background task and decode routing proceeds immediately, overlapping transfer with decode scheduling; vLLM/TRT-LLM-style synchronous — decode waits for the prefill response. Decode-side routing then runs with an override that zeroes overlap credit and prompt-load tracking (build_decode_router_override, asserted at prefill_router/mod.rs:453-463), because the decode pool’s cache state was just mutated by the transfer, not by prefix reuse.

Operationally notable details for someone who has run this class of system:

Prefill is deliberately not linked as a child of the request context: cancelling it mid-NIXL-transfer would permanently leak KV blocks, so it runs to completion and wasted compute is the accepted tradeoff (prefill_router/mod.rs:153-162 — comment cites the bug).
Backpressure from the prefill queue surfaces as ResourceExhausted rather than silently re-entering the saturated queue (prefill_router/mod.rs:181-203).
Fallback is explicit: no prefill workers → aggregated mode, unless --enforce-disagg fails fast; all-prefill-death flips a deactivated flag checked per request (prefill_router/mod.rs:96-104).
Topology constraints (“decode must be NVLink-reachable from this prefill worker”) are merged into the decode request as required/preferred taints (merge_decode_topology_constraints, prefill_router/mod.rs:424-442).

4.4 The distributed runtime

Everything above runs on dynamo-runtime. The object model is a three-level hierarchy — Namespace / Component / Endpoint (lib/runtime/src/component.rs:4-30) — addressed as dyn://namespace.component.endpoint. An Instance is one live process serving an endpoint, identified by a u64 instance_id and a transport address (lib/runtime/src/component.rs:106-115). The hello-world server is the whole pattern:

// lib/runtime/examples/hello_world/src/bin/server.rs:54-65
async fn backend(runtime: DistributedRuntime) -> anyhow::Result<()> {
    // attach an ingress to an engine
    let ingress = Ingress::for_engine(RequestHandler::new())?;

    let component = runtime.namespace(DEFAULT_NAMESPACE)?.component("backend")?;
    component
        .endpoint("generate")
        .endpoint_builder()
        .handler(ingress)
        .start()
        .await
}

Discovery is pluggable (lib/runtime/src/distributed.rs:148-184): a Discovery trait with two families — Kubernetes-native (CRD DynamoWorkerMetadata + EndpointSlices, no etcd) and KV-store backed (etcd, a plain filesystem directory for laptop dev via --discovery-backend file, or in-memory for tests). Workers write instance records and model cards; frontends list_and_watch a DiscoveryQuery and react to Added/Removed.

Liveness in the etcd backend is lease-based — every registration is attached to a TTL lease kept alive by a background task; if keep-alive fails the process’s cancellation token fires, taking the whole worker down rather than serving as a zombie:

// lib/runtime/src/transports/etcd/lease.rs:15-27
pub async fn create_lease(
    connector: Arc<Connector>,
    ttl: u64,
    token: CancellationToken,
) -> anyhow::Result<u64> {
    let mut lease_client = connector.get_client().lease_client();
    let lease = lease_client.grant(ttl as i64, None).await?;

    let id = lease.id() as u64;
    let ttl = lease.ttl() as u64;
    let child = token.child_token();
    // ... (rest of the function elided in this excerpt)

Lease IDs double as instance IDs, which is why the TCP layer keeps short-lived tombstones keyed on them during re-registration races (lib/runtime/src/pipeline/network/tcp/server.rs:15-19).
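
Lease-based liveness is easy to reason about with an explicit clock. The sketch below is a toy registry, not the etcd API: `LeaseTable` and its methods are hypothetical, time is passed in as a plain number, and the returned list of expired IDs stands in for the Removed events a discovery watcher would see.

```python
class LeaseTable:
    """Toy lease registry driven by an explicit clock (not the etcd API)."""

    def __init__(self):
        self._next_id = 1
        self._leases = {}  # lease_id -> (ttl_seconds, expires_at)

    def grant(self, ttl, now):
        # The lease ID doubles as the instance ID in this sketch.
        lease_id = self._next_id
        self._next_id += 1
        self._leases[lease_id] = (ttl, now + ttl)
        return lease_id

    def keep_alive(self, lease_id, now):
        # Returns False when the lease is gone or already past its deadline;
        # a worker should treat that as fatal and shut down.
        if lease_id not in self._leases:
            return False
        ttl, expires_at = self._leases[lease_id]
        if now >= expires_at:
            return False
        self._leases[lease_id] = (ttl, now + ttl)
        return True

    def expire(self, now):
        # Drop every lease whose deadline has passed and report the IDs.
        dead = sorted(i for i, (_, exp) in self._leases.items() if exp <= now)
        for lease_id in dead:
            del self._leases[lease_id]
        return dead


table = LeaseTable()
a = table.grant(ttl=10, now=0)
b = table.grant(ttl=10, now=0)
renewed = table.keep_alive(a, now=8)   # a now expires at t=18
removed_at_12 = table.expire(now=12)   # b missed its renewal
late_renewal = table.keep_alive(b, now=12)
removed_at_18 = table.expire(now=18)
```

The failed late renewal is the interesting case: once a lease is gone the only safe reaction is to stop serving, which is why the real keep-alive task is wired to a cancellation token.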

NATS is optional everywhere: requests default to TCP, discovery defaults to K8s/file, and KV events can ride ZMQ. NATS earns its place for JetStream-durable KV events, router replica sync, and brokered request plane (README.md:246-262).

4.5 Planner / autoscaling

components/src/dynamo/planner is the SLA autoscaler (Python). Signals are forward-pass metrics (FPM) scraped from workers — queued prefill tokens, KV-cache utilization, observed TTFT/ITL — fed into a state machine with two modes: simple threshold scaling and SLA mode backed by per-GPU performance models (interpolated from offline profiling or AIConfigurator predictions). The non-SLA thresholds are legible enough to read directly:

# components/src/dynamo/planner/core/load_scaling.py:24-34
# Prefill: ratio of queued_prefill_tokens / context_length
_PREFILL_THROUGHPUT_SCALE_UP = 1.0  # queued >= context_length
_PREFILL_THROUGHPUT_SCALE_DOWN = 0.1  # queued < context_length / 10
_PREFILL_LATENCY_SCALE_UP = 0.1  # queued >= context_length / 10
_PREFILL_LATENCY_SCALE_DOWN = 0.0  # queued == 0

# Decode/Agg: KV cache utilization (scheduled + queued) / max_kv_tokens
_DECODE_THROUGHPUT_SCALE_UP = 1.0  # util > 100%
_DECODE_THROUGHPUT_SCALE_DOWN = 0.6  # util < 60%
_DECODE_LATENCY_SCALE_UP = 0.4  # util > 40%
_DECODE_LATENCY_SCALE_DOWN = 0.1  # util < 10%
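
One way to read the decode thresholds listed above is as a hysteresis band around a utilization ratio. The sketch below is an interpretation, not the planner's code: `scaling_decision` and `decode_utilization` are hypothetical helpers, the handling of values exactly on a threshold is an assumption, and the real planner combines these signals inside a larger state machine.

```python
def scaling_decision(value: float, scale_up: float, scale_down: float) -> str:
    """Return "up", "down", or "hold" for one signal and one threshold pair."""
    if value >= scale_up:
        return "up"
    if value <= scale_down:
        return "down"
    return "hold"  # inside the band: avoid flapping


# (scale_up, scale_down) pairs copied from the constants listed above.
DECODE_THROUGHPUT = (1.0, 0.6)
DECODE_LATENCY = (0.4, 0.1)


def decode_utilization(scheduled_tokens, queued_tokens, max_kv_tokens):
    # KV cache utilization: (scheduled + queued) / max_kv_tokens.
    return (scheduled_tokens + queued_tokens) / max_kv_tokens


util = decode_utilization(30_000, 20_000, 100_000)
throughput_view = scaling_decision(util, *DECODE_THROUGHPUT)
latency_view = scaling_decision(util, *DECODE_LATENCY)
```

At 50% utilization the two threshold pairs disagree: the throughput pair sees spare capacity and would scale down, while the latency pair already wants more replicas. That gap is the cost of holding a tighter latency target.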

Prefill and decode pools scale independently (the economic argument for disaggregation), with budget caps and connectors that execute decisions against Kubernetes (DGD resources via the operator) or virtual targets for simulation. The DynamoGraphDeploymentRequest CRD (README.md:162-175) chains AIConfigurator profiling → planner topology → deployment from a single model + SLA spec. A global_planner component coordinates across deployments.

5. Suggested reading path

Orientation (30 min). README.md, then docs/design-docs/architecture.md and docs/design-docs/dynamo-flow.md (the numbered S1-S9 disagg walkthrough).
Runtime primitives. lib/runtime/examples/hello_world/src/bin/server.rs and client.rs (74 lines total — the whole component/endpoint/push-router model), then lib/runtime/src/component.rs doc comments and lib/runtime/src/distributed.rs:117-184 for discovery backends. Skim lib/runtime/src/pipeline/network/tcp/server.rs for the call-home response stream.
The frontend pipeline. lib/llm/src/entrypoint/input/common.rs:323-365 (operator chain), lib/llm/src/http/service/openai.rs (SSE handler), and lib/llm/src/preprocessor.rs (what “preprocess” actually does).
The router, bottom-up. lib/kv-router/src/protocols.rs (hashes, events, OverlapScores), lib/kv-router/src/indexer/radix_tree.rs (single-threaded tree first — the concurrent versions are optimizations of the same shape), lib/kv-router/src/scheduling/selector.rs (cost function; tests at the bottom are executable documentation), lib/kv-router/src/scheduling/CLAUDE.md (invariants), then up to lib/llm/src/kv_router.rs and lib/llm/src/kv_router/push_router.rs for the integration. docs/design-docs/router-design.md alongside.
Disaggregation. docs/design-docs/disagg-serving.md, then lib/llm/src/kv_router/prefill_router/mod.rs top to bottom (the generate method is the whole story), with components/src/dynamo/vllm/handlers.py prefill/decode handlers as the engine-side counterpart.
Worker lifecycle in Python. components/src/dynamo/vllm/main.py + worker_factory.py, and lib/bindings/python/rust/lib.rs to see how thin the binding is.
Optional depth. KVBM (lib/llm/src/block_manager.md, lib/kvbm-logical, lib/memory/src/nixl/), planner (components/src/dynamo/planner/core/), the Envoy EPP (deploy/inference-gateway/ext-proc/src/picker.rs, epp.rs).

6. Connections to your other study repos

vllm / sglang — the engines Dynamo orchestrates. What you studied as their internals (paged KV, prefix caching, continuous batching) becomes Dynamo’s signal surface: vLLM’s ZMQ KV-event publisher and kv_transfer_params, SGLang’s RDMA bootstrap, both wrapped by components/src/dynamo/{vllm,sglang}. Dynamo’s mocker reimplements their schedulers in Rust (lib/mocker) — a compact second take on what you read in those codebases.
sgl-router (in the sglang repo) — the same cache-aware-routing idea, one tier, one binary: an approximate radix tree built from the requests it routed, no event plane. Dynamo’s router is the heavyweight sibling: event-fed exact index, multi-replica state sync, tiered cache credits, prefill/decode phase awareness, admission queue. Comparing sgl-router’s tree to lib/kv-router/src/indexer/ is a great compare-and-contrast.
llm-d — the K8s-native competitor (Google/Red Hat/IBM orbit). Same thesis (disagg + cache-aware scheduling above vLLM) but architected as Kubernetes: Go, Inference Gateway EPP as the router, vLLM-only. Dynamo is engine-agnostic, Rust, and owns its own runtime, with K8s as one discovery backend among several — contrast llm-d’s gateway-resident scorer with Dynamo’s in-frontend KvPushRouter plus optional EPP.
gateway-api-inference-extension — the standard llm-d builds on, and Dynamo meets it too: deploy/inference-gateway/ext-proc is a Rust ext_proc Endpoint Picker that mirrors the Go LW-EPP interface (picker.rs:5-8 cites GAIE #2834) but scores with dynamo-kv-router. In gateway mode the frontend runs --router-mode direct and honors the EPP’s x-prefill-instance-id/worker headers (README.md:111-122) — directly relevant to your Envoy work.
nano-vllm — the minimal engine; useful here as the mental model for what a Dynamo worker does between receiving PreprocessedRequest and yielding token deltas.
xgrammar — structured-output constraints; surfaces in Dynamo only at the edges (lib/parsers parses tool calls; guided decoding is delegated to engines via nvext).
flashinfer — two layers below: engine kernels. Dynamo never touches it, but KVBM’s layout/transfer code (lib/kvbm-kernels, lib/memory) is where Dynamo gets closest to that altitude.

7. Tinkering on your machine

Windows host: plan on WSL2. Published wheels are manylinux-only — pip on Windows will not resolve ai-dynamo (docs/reference/release-artifacts.md), and the from-source path assumes Ubuntu packages (libhwloc-dev, protobuf-compiler; README.md:191-204). Your RTX 5080 (16 GB, Blackwell) works under WSL2 Ubuntu 24.04 with a CUDA 13 driver; small models (Qwen3-0.6B, Llama-3.2-1B/3B) leave room to spare. NIXL GPU-to-GPU disagg across two real workers is not happening on one consumer GPU — use the mocker for that topology.

Tier 0 — no GPU, no Linux strictly required (read/test the Rust core). The routing and protocol crates are pure Rust with no CUDA or engine dependency:

cargo test -p dynamo-kv-router          # radix tree, selector, queue, sequences
cargo test -p dynamo-tokens -p dynamo-kv-hashing -p dynamo-protocols
cargo test -p dynamo-mocker             # simulated vLLM/SGLang schedulers
cargo bench -p dynamo-llm --bench kv_router_bench --features kv-router-stress

The selector tests (lib/kv-router/src/scheduling/selector.rs:465-1383) are the fastest way to internalize the cost function — modify a weight, predict the winner, run. Inside WSL2 the full workspace including dynamo-runtime and dynamo-llm builds and tests CPU-only.

Tier 1 — full serving stack, zero GPUs (mocker). This exercises the real frontend, preprocessor, KV router, and even disaggregated flow against a simulated engine (docs/dynosim/mocker.md):

python -m dynamo.frontend --http-port 8000 --router-mode kv &
python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --num-workers 4 --speedup-ratio 10
# disagg without GPUs:
python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --disaggregation-mode prefill --bootstrap-ports 50100 &
python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --disaggregation-mode decode

Watch router decisions with RUST_LOG=dynamo_llm::kv_router=debug — the selector logs the full cost decomposition per worker (selector.rs:219-229). This is the highest learning-per-watt setup in the repo for someone who cares about routing.

Tier 2 — one real GPU (WSL2). The README quick start, no etcd/NATS needed:

uv pip install --prerelease=allow "ai-dynamo[vllm]"
python -m dynamo.frontend --http-port 8000 --discovery-backend file &
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
    --kv-events-config '{"enable_kv_cache_events": false}'
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d \
  '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"hi"}],"max_tokens":50,"stream":true}'

With ~16 GB you can also run two vLLM workers on tiny models (cap --gpu-memory-utilization around 0.4 each) and turn on --router-mode kv to watch cache-aware routing pick the warm worker on repeated prefixes — the single most instructive experiment in the repo.

Tier 3 — runtime-only Rust, near-zero deps. lib/runtime/examples/hello_world runs two processes against file/memory discovery; lib/llm has in-process engine examples (echo_full) reachable via dynamo-run-style entrypoints (lib/llm/src/entrypoint/), and cargo run -p dynamo-llm --bin generate-frontend-openapi exercises the HTTP surface without any worker at all.

Things that genuinely need a cluster (read, don’t run): NIXL GPU-to-GPU transfer, KVBM multi-tier offload under pressure, the K8s operator/DGDR path, planner scaling against real FPM streams, JetStream-durable KV events with multi-replica routers.
