NVIDIA Dynamo
1. What it is
Dynamo is NVIDIA’s open-source, datacenter-scale inference serving framework. It does not run models itself — it is the orchestration layer above inference engines (vLLM, SGLang, TensorRT-LLM), turning a fleet of single-node engines into one coordinated serving system:
- OpenAI-compatible frontend (Rust, axum) with preprocessing (chat templates, tokenization)
- KV-aware router that scores workers by cached-prefix overlap plus live load
- Disaggregated prefill/decode with GPU-to-GPU KV-cache transfer via NIXL
- KV Block Manager (KVBM) for multi-tier KV offload (GPU → pinned host → disk → S3)
- Planner for SLA-driven autoscaling of prefill/decode pools
- A distributed runtime (service discovery, leases, request plane, event plane) that all of the above is built on
The mental model: engines optimize a box; Dynamo optimizes the fleet. If you serve one model on one GPU, you don’t need it. If you serve across nodes and want to never recompute a prefix that already exists somewhere in the cluster, this is the layer that does it.
Version in this snapshot: workspace 1.3.0 (Cargo.toml:36). Core is Rust (edition 2024,
tokio); engine adapters and the planner are Python; everything meets at PyO3 bindings.
2. Why you care
This repo is the closest thing in the ecosystem to “your current job, but for LLM inference”:
- It is traffic infrastructure in Rust. The hot path (HTTP ingress, routing decision, request dispatch, token streaming) is all tokio/axum/tonic code. The same vocabulary you use at the edge (connection pools, pub/sub planes, leases, backpressure, softmax-with-temperature load balancing, sticky sessions, request migration) appears here with one new twist: the load balancer has to reason about where bytes of KV cache already live.
- Routing is the product. A request router whose cost function is prefill_blocks_you'd_recompute + decode_blocks_in_flight is cache-aware load balancing: the same family of problem as consistent-hash + least-loaded at Discord, but the cache is GPU VRAM and a miss costs a multi-second prefill.
- There is literally an Envoy ext_proc server in here. deploy/inference-gateway/ext-proc is a Rust gRPC ext_proc implementation of the Gateway API Inference Extension’s Endpoint Picker, backed by the same dynamo-kv-router crate. Your Envoy background maps directly.
- Token egress is a designed transport, not an afterthought: a call-home TCP response plane, separate from the request plane, separate from the event plane.
3. Architecture map
The Rust workspace
Workspace members are declared in Cargo.toml:4-32. The crates that matter, roughly by layer:
| Crate | Path | What it does |
|---|---|---|
| dynamo-runtime | lib/runtime | The distributed runtime: DistributedRuntime, Namespace/Component/Endpoint model, discovery backends (K8s, etcd, file, memory), etcd leases, NATS/TCP/ZMQ transports, the pipeline graph (Source → Operator → Sink), push router, metrics |
| dynamo-llm | lib/llm | Everything LLM-specific: OpenAI/Anthropic HTTP service, preprocessor (chat template + tokenize), KvPushRouter, PrefillRouter (disagg orchestration), migration, model discovery/cards, KVBM block manager v2, gRPC (KServe) frontend, engines glue |
| dynamo-kv-router | lib/kv-router | The routing core, runtime-free: radix-tree indexers (single-thread, concurrent, compressed), scheduling (cost function, queue, admission), active-sequence tracking, replica sync, ZMQ wire format |
| dynamo-tokens | lib/tokens | Token sequences, block hashing helpers |
| dynamo-kv-hashing | lib/kv-hashing | “Request → PositionalLineageHash” contract so router, KVBM, and engines agree on block identity |
| dynamo-protocols | lib/protocols | OpenAI-compatible API types with NVIDIA extensions (nvext) |
| dynamo-tokenizers / dynamo-renderer | lib/tokenizers, lib/renderer | HF tokenizers and minijinja chat-template rendering, runtime-free |
| dynamo-parsers | lib/parsers | Tool-calling and reasoning-trace parsers |
| dynamo-memory + kvbm-* | lib/memory, lib/kvbm-* | Memory arenas, NIXL agent bindings (lib/memory/src/nixl/agent.rs), and the KV Block Manager split into logical/physical/engine/kernels crates |
| dynamo-mocker | lib/mocker | GPU-free simulated engine: vLLM/SGLang-style schedulers, KV events, disagg; drives the real router/frontend without hardware |
| dynamo-backend-common | lib/backend-common | Shared glue for writing pure-Rust backends |
| dynamo-ext-proc | deploy/inference-gateway/ext-proc | Envoy ext_proc gRPC server (GAIE Endpoint Picker) reusing dynamo-kv-router |
Python bindings and components
- lib/bindings/python is a PyO3 crate exposing the runtime and LLM stack as dynamo._core (lib/bindings/python/rust/lib.rs). Python gets DistributedRuntime, Endpoint, KvEventPublisher, register_model, make_engine, the planner hooks, etc.
- components/src/dynamo/ is the Python product surface: frontend (thin wrapper that launches the Rust HTTP server + router), vllm/sglang/trtllm (engine workers), planner, mocker, router (standalone router process), global_router, profiler.
How engines plug in as workers
An engine worker is just a Python process that (1) builds a DistributedRuntime, (2) creates
an endpoint like dyn://namespace.component.generate, (3) registers a model card, and
(4) serves an async generator. From the vLLM component:
# components/src/dynamo/vllm/worker_factory.py:196-230 (abridged)
generate_endpoint = runtime.endpoint(
    f"{config.namespace}.{config.component}.{config.endpoint}"
)
...
await register_model(
    ModelInput.Tokens, ModelType.Empty, generate_endpoint, config.model,
    model_name=config.served_model_name or config.model,
    worker_type=WorkerType.Encode,
    needs=[[WorkerType.Prefill, WorkerType.Decode], [WorkerType.Aggregated]],
)
...
await generate_endpoint.serve_endpoint(
    handler.generate, metrics_labels=[("model", config.model)]
)
register_model publishes a Model Deployment Card into discovery; the frontend’s
ModelWatcher (lib/llm/src/discovery/watcher.rs) sees it and wires a new pipeline. The
engine never speaks HTTP — it receives already-tokenized PreprocessedRequests and yields
token deltas. The handler’s generate(request, context) is the entire engine contract.
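To make that contract concrete, here is a minimal, dependency-free sketch of a handler with the generate(request, context) shape. It does not import Dynamo at all: the request dict layout, the FakeContext class, and the EchoHandler name are assumptions made for illustration, not the real PreprocessedRequest or context types.

```python
import asyncio


class FakeContext:
    """Hypothetical stand-in for the per-request context (cancellation only)."""

    def __init__(self):
        self._stopped = False

    def stop(self):
        self._stopped = True

    def is_stopped(self):
        return self._stopped


class EchoHandler:
    """Toy 'engine': echoes prompt token IDs back, one delta at a time."""

    async def generate(self, request, context):
        # Assumed request shape: already-tokenized input plus a max_tokens cap.
        budget = request["stop_conditions"]["max_tokens"]
        for token_id in request["token_ids"][:budget]:
            if context.is_stopped():
                return  # requester cancelled; stop producing tokens
            await asyncio.sleep(0)  # yield to the event loop, like an engine step
            yield {"token_ids": [token_id]}


async def collect(handler, request):
    """Drain the async generator the way a serving loop would."""
    return [delta async for delta in handler.generate(request, FakeContext())]


request = {"token_ids": [11, 22, 33, 44], "stop_conditions": {"max_tokens": 3}}
deltas = asyncio.run(collect(EchoHandler(), request))
print(deltas)
```

The point of the shape is that the handler is an async generator: the serving side can stream each delta as soon as it is yielded and can stop iterating when the requester goes away.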
4. Core mechanisms
4.1 Request flow end to end
Standalone-mode flow: client → Frontend (HTTP) → [preprocess → migration → detokenize → PrefillRouter → KvPushRouter] → worker, with tokens streaming back through the same chain.
The whole thing is one typed pipeline built from operators with forward and backward edges:
// lib/llm/src/entrypoint/input/common.rs:352-362
let engine = frontend
    .link(preprocessor_op.forward_edge())?
    .link(migration.forward_edge())?
    .link(token_backend.forward_edge())?
    .link(prefill_op.forward_edge())?
    .link(backend)?
    .link(prefill_op.backward_edge())?
    .link(token_backend.backward_edge())?
    .link(migration.backward_edge())?
    .link(preprocessor_op.backward_edge())?
    .link(frontend)?;
Forward edges transform the request (OpenAI JSON → templated prompt → token IDs); backward
edges transform the response stream (token IDs → detokenized deltas → OpenAI SSE chunks).
Migration sits high enough to replay/redirect an in-flight request if a worker dies.
Transport choices — three independent planes, which is the part worth studying closely:
- Client-facing egress is axum SSE. The engine’s response stream is mapped through metrics observers and a disconnect monitor, then wrapped:
// lib/llm/src/http/service/openai.rs:760-768
let stream = monitor_for_disconnects(stream, ctx, inflight_guard, stream_handle);
let mut sse_stream = Sse::new(stream);
if let Some(keep_alive) = state.sse_keep_alive() {
    sse_stream = sse_stream.keep_alive(KeepAlive::default().interval(keep_alive));
}
Ok(sse_stream.into_response())
- Request plane (frontend → worker RPC) defaults to direct TCP, selectable to NATS via DYN_REQUEST_PLANE (docs/design-docs/request-plane.md). Routing mode aside, the actual dispatch is PushRouter::direct(request, instance_id) (lib/runtime/src/pipeline/network/egress/push_router.rs).
- Response plane is the part that will feel familiar from designing token-egress paths: the requester runs a TcpStreamServer (lib/runtime/src/pipeline/network/tcp/server.rs:87-93), and the request carries connection info; the worker calls home over a fresh TCP stream (CallHomeHandshake, same file) and frames response chunks back with a two-part codec. So request fan-out and token fan-in are decoupled: tokens never transit NATS even when the request did. Frame inactivity timeouts, prologue handshakes, and per-stream contexts for cancellation are all explicit (lib/runtime/src/pipeline/network/egress/push_router.rs:52-62).
There is also a separate event plane (NATS Core or ZMQ, DYN_EVENT_PLANE) carrying KV
events and router replica sync — covered next. A KServe gRPC frontend exists alongside HTTP
(lib/llm/src/grpc), and lib/llm/src/http/service/realtime.rs covers websocket realtime.
4.2 KV-aware routing
The router answers: given these tokens, which worker would do the least new work? Three pieces: block hashing, a global prefix index fed by worker events, and a cost function that fuses cache overlap with live load.
Block hashing. Token sequences are chunked into engine-sized blocks (16/64 tokens) and hashed with XXH3; LoRA adapter identity is mixed into the seed so adapters don’t cross-hit:
// lib/kv-router/src/protocols.rs:85-97
pub fn compute_block_hash_for_seq(
    tokens: &[u32],
    kv_block_size: u32,
    options: BlockHashOptions<'_>,
) -> Vec<LocalBlockHash> {
    if kv_block_size == 0 {
        return Vec::new();
    }
    let seed = match options.lora_name.filter(|n| !n.is_empty()) {
        Some(name) => XXH3_SEED.wrapping_add(xxh3::xxh3_64(name.as_bytes())),
        None => XXH3_SEED,
    };
    // ... (rest of the function is not shown in this excerpt)
}
These are local hashes computed from token content, so the router never depends on engines
agreeing on block IDs (docs/design-docs/router-design.md, “Deterministic Event IDs”).
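The chunk-and-hash step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the crate's implementation: BLAKE2b from the standard library stands in for XXH3 (which is not in the stdlib), BASE_SEED is a placeholder rather than the real XXH3_SEED, and dropping a trailing partial block is assumed behavior.

```python
import hashlib
import struct

BASE_SEED = 1337  # placeholder value; NOT the real XXH3_SEED


def _h64(data: bytes) -> int:
    """64-bit digest; BLAKE2b stands in for XXH3 in this sketch."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def block_hashes(tokens, block_size, lora_name=None):
    """Hash each full block of token IDs; a trailing partial block is skipped."""
    if block_size == 0:
        return []
    seed = BASE_SEED
    if lora_name:
        # Mix adapter identity into the seed, wrapping at 64 bits.
        seed = (seed + _h64(lora_name.encode())) & 0xFFFFFFFFFFFFFFFF
    out = []
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = struct.pack("<Q", seed) + struct.pack(f"<{len(block)}I", *block)
        out.append(_h64(payload))
    return out


base = block_hashes(list(range(40)), 16)
with_lora = block_hashes(list(range(40)), 16, lora_name="adapter-a")
print(len(base), base != with_lora)
```

Because each hash depends only on the seed and the block's token content, two processes that see the same tokens and the same adapter name produce the same hashes without exchanging any IDs.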
KV events → indexer. Engines publish stored/removed/cleared events as blocks enter and
leave their paged KV caches (vLLM via its ZMQ KvEventPublisher, the mocker natively):
// lib/kv-router/src/protocols.rs:640-646
pub enum KvCacheEventData {
    Stored(KvCacheStoreData),
    Removed(KvCacheRemoveData),
    Cleared,
}
The KvIndexer folds these into a radix tree over block hashes where every node records
which workers hold that prefix. A lookup walks the query’s hash sequence and returns
per-worker matched depth:
// lib/kv-router/src/indexer/radix_tree.rs:200-206
pub fn find_matches(&self, sequence: Vec<LocalBlockHash>, early_exit: bool) -> OverlapScores {
    self.find_match_details(sequence, early_exit).overlap_scores
}
pub fn apply_event(&mut self, event: RouterEvent) -> Result<(), KvCacheEventError> {
    self.apply_event_with_counters(event, None)
}
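As a rough mental model of that lookup, the sketch below builds a tiny prefix tree keyed by block hashes and returns per-worker matched depth. The PrefixIndex class and its method names are hypothetical, it handles only whole-prefix "stored" and whole-worker "cleared" events, and it ignores everything the real indexers do for compression and concurrency.

```python
from collections import defaultdict


class PrefixIndex:
    """Toy prefix tree over block hashes; each node records which workers hold it."""

    def __init__(self):
        self.children = {}
        self.workers = set()

    def store(self, worker, hashes):
        """Apply a 'stored' event: worker now caches this block-hash prefix."""
        node = self
        for h in hashes:
            node = node.children.setdefault(h, PrefixIndex())
            node.workers.add(worker)

    def remove_worker(self, worker):
        """Apply a 'cleared' event: forget everything this worker held."""
        self.workers.discard(worker)
        for child in self.children.values():
            child.remove_worker(worker)

    def find_matches(self, hashes):
        """Return {worker: number of leading query blocks it has cached}."""
        scores = defaultdict(int)
        node = self
        for h in hashes:
            node = node.children.get(h)
            if node is None:
                break
            for w in node.workers:
                scores[w] += 1
        return dict(scores)


index = PrefixIndex()
index.store("w0", [1, 2, 3])
index.store("w1", [1, 2])
index.store("w1", [1, 9])
print(index.find_matches([1, 2, 3, 4]))
```

The returned mapping plays the role of the overlap scores: a larger value means more of the query's prefix is already cached on that worker.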
OverlapScores is just FxHashMap<WorkerWithDpRank, u32> (lib/kv-router/src/protocols.rs:942-947).
Implementations range from a single-threaded compressed tree to a sharded
ConcurrentRadixTree with sticky per-worker write threads (lib/kv-router/src/indexer/).
Event-loss handling is sequence-number based: each worker numbers its events monotonically,
the router detects gaps and re-queries the worker’s local indexer for the missing range,
and dumps full worker state on discovery — or you can opt into NATS JetStream durable
consumption with snapshots in object store (--durable-kv-events).
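Sequence-number gap detection is simple enough to sketch directly. The EventGapTracker name and its return convention are assumptions for illustration; in particular, treating the first event from a worker as "no gap" assumes the full state dump on discovery has already happened.

```python
class EventGapTracker:
    """Tracks the last event id seen per worker and reports missing ranges."""

    def __init__(self):
        self.last_seen = {}

    def observe(self, worker, event_id):
        """Return the inclusive (first, last) range of missed ids, or None."""
        prev = self.last_seen.get(worker)
        if prev is not None and event_id <= prev:
            return None  # duplicate or stale event; ignore it
        self.last_seen[worker] = event_id
        if prev is not None and event_id > prev + 1:
            return (prev + 1, event_id - 1)  # range to re-query from the worker
        return None


tracker = EventGapTracker()
results = [tracker.observe("w0", i) for i in (1, 2, 5, 5, 6)]
print(results)
```

A non-None result is the range a router would ask the worker's local indexer to replay before trusting its view of that worker again.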
Scoring. Selection happens in DefaultWorkerSelector — the cost (“logit”, lower is
better) per worker is computed in worker_logit:
// lib/kv-router/src/scheduling/selector.rs:194-203
let effective_overlap_score_credit = weights.overlap_score_credit * overlap_credit_decay;
let overlap_credit_blocks = effective_overlap_score_credit * device_overlap_blocks
    + self.kv_router_config.host_cache_hit_weight * host_overlap_blocks
    + self.kv_router_config.disk_cache_hit_weight * disk_overlap_blocks
    + shared_overlap_blocks;
let adjusted_prefill_blocks = (raw_prefill_blocks - overlap_credit_blocks).max(0.0);
let prefill_cost_blocks = weights.prefill_load_scale * adjusted_prefill_blocks;
let worker_load = worker_load.unwrap_or_default();
let decode_cost_blocks = worker_load.potential_decode_blocks() as f64;
let logit = prefill_cost_blocks + decode_cost_blocks;
Read it as: blocks I’d have to prefill from scratch (discounted by what the worker already
holds on device, in pinned host memory, on disk, or in a shared external cache — each tier
with its own credit weight) plus blocks already decoding there. An overlap_credit_decay
term softly forfeits cache affinity on workers whose prefill backlog exceeds the fleet floor
— the explicit TTFT-vs-ITL tradeoff knob. Selection is min-cost with reservoir-sampled tie
breaking, or softmax sampling when router_temperature > 0
(lib/kv-router/src/scheduling/selector.rs:29-85). Taints implement topology/required-zone
constraints, multiplying scores for preferred taints and filtering for required ones.
Load signals come from the router’s own bookkeeping, not engine polling: a slot manager
(ActiveSequencesMultiWorker, lib/kv-router/src/sequences/) predicts active blocks at
route time, marks prefill complete on first output token, and frees on stream end. With
multiple router replicas, these predictions sync over NATS core
(AddRequest / MarkPrefillCompleted / Free, lib/kv-router/src/sequences/replica_sync.rs).
Admission is serialized through SchedulerQueue::admit_one so projected load and booking
can’t race (lib/kv-router/src/scheduling/CLAUDE.md documents the invariants — worth reading
as a design doc). The KV-routing wrapper that ties selection to dispatch:
// lib/llm/src/kv_router/push_router.rs:321-335 (abridged)
let mut response_stream = cancel_on_stop(
    request_context.as_ref(),
    &context_id,
    self.inner
        .direct(updated_request, instance_id)
        .instrument(tracing::info_span!(
            "kv_router.route_request",
            request_id = %context_id,
            worker_id = instance_id,
            overlap_blocks = overlap_amount,
        )),
)
.await??;
KvRouter itself (lib/llm/src/kv_router.rs:186-205) is deliberately “decide, don’t route”:
it owns the Indexer + KvScheduler and emits FindBestMatchOutcome::{Routed, Backpressure}
— backpressure carries queued-token depth so the HTTP layer can 429/503 instead of piling on.
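The router-side load bookkeeping described earlier in this section (book at route time, drop the prefill share on first token, free on stream end) can be sketched as a small state table. The ActiveBlocks class and its fields are hypothetical and far simpler than ActiveSequencesMultiWorker; the three methods are named after the AddRequest / MarkPrefillCompleted / Free messages.

```python
class ActiveBlocks:
    """Router-side estimate of per-worker load, updated on request lifecycle events."""

    def __init__(self):
        self.requests = {}  # request_id -> [worker, prefill_blocks, decode_blocks]

    def add_request(self, request_id, worker, prefill_blocks, decode_blocks):
        # Booked at route time, before the worker has done any work.
        self.requests[request_id] = [worker, prefill_blocks, decode_blocks]

    def mark_prefill_completed(self, request_id):
        # First output token seen: the prompt no longer counts as pending prefill.
        self.requests[request_id][1] = 0

    def free(self, request_id):
        # Stream ended: release everything booked for this request.
        self.requests.pop(request_id, None)

    def load(self, worker):
        """Return (pending_prefill_blocks, decode_blocks) booked on a worker."""
        rows = [r for r in self.requests.values() if r[0] == worker]
        return sum(r[1] for r in rows), sum(r[2] for r in rows)


slots = ActiveBlocks()
slots.add_request("r1", "w0", prefill_blocks=8, decode_blocks=10)
slots.add_request("r2", "w0", prefill_blocks=4, decode_blocks=6)
before = slots.load("w0")
slots.mark_prefill_completed("r1")
after_first_token = slots.load("w0")
slots.free("r2")
after_free = slots.load("w0")
print(before, after_first_token, after_free)
```

Since every transition is driven by events the router already sees, the estimate never needs a round trip to the engine.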
4.3 Disaggregated prefill/decode
Disaggregation is orchestrated entirely at the routing layer by PrefillRouter
(lib/llm/src/kv_router/prefill_router/mod.rs:36-43), an operator that sits after migration
and detokenization and immediately before the decode router (see the pipeline chain above). The split:
- Clone the preprocessed request, force max_tokens = 1, route it to a prefill worker (KV-aware, using the same selector with prefill-specific config):
// lib/llm/src/kv_router/prefill_router/mod.rs:114-116
// Prepare prefill request with max_tokens = 1 (clone after tracker is set)
let mut prefill_req = req.clone();
prefill_req.stop_conditions.max_tokens = Some(1);
- The prefill worker computes KV and returns transfer metadata, not KV bytes. In the vLLM handler the metadata is vLLM’s kv_transfer_params (block IDs + connection info):
# components/src/dynamo/vllm/handlers.py:3641-3651 (abridged)
output: Dict[str, Any] = {
    "token_ids": list(token_ids),
    "disaggregated_params": self._build_disaggregated_params(
        kv_protocol.decode_request_kv_transfer_params(res),
        embedding_params,
    ),
    "completion_usage": ...,
}
- PrefillRouter injects that metadata into the decode request, restores the original max_tokens, and forwards to the decode router:
// lib/llm/src/kv_router/prefill_router/mod.rs:316-335 (abridged)
match outcome {
    PrefillOutcome::Bootstrap { bootstrap_info, worker_id } => {
        decode_req.bootstrap_info = Some(bootstrap_info);
        decode_req.routing_mut().prefill_worker_id = Some(worker_id);
    }
    PrefillOutcome::Completed { result, worker_id, worker_link } => {
        decode_req.prefill_result = Some(result);
        decode_req.migration_link = worker_link;
        ...
    }
};
- The decode worker uses the metadata to pull KV directly from the prefill worker’s VRAM via NIXL (NVLink / InfiniBand-UCX / PCIe), non-blocking with respect to its forward passes. The transfer itself runs inside the engines’ KV-connector layer; Dynamo’s own Rust NIXL agent (lib/memory/src/nixl/agent.rs) is used by KVBM and multimodal RDMA.
Two execution shapes (docs/design-docs/disagg-serving.md): SGLang-style bootstrap —
prefill worker publishes an RDMA rendezvous endpoint, so prefill is spawned as a background
task and decode routing proceeds immediately, overlapping transfer with decode scheduling;
vLLM/TRT-LLM-style synchronous — decode waits for the prefill response. Decode-side
routing then runs with an override that zeroes overlap credit and prompt-load tracking
(build_decode_router_override, asserted at prefill_router/mod.rs:453-463), because the
decode pool’s cache state was just mutated by the transfer, not by prefix reuse.
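The synchronous shape of the split can be sketched end to end with two fake workers. Everything here is illustrative: the dict-based request, the run_disaggregated helper, and the contents of the transfer metadata are assumptions, and no KV bytes or NIXL transfer are involved.

```python
import copy


def run_disaggregated(request, prefill_worker, decode_worker):
    """Prefill with max_tokens=1, then decode with the returned transfer metadata."""
    prefill_req = copy.deepcopy(request)
    prefill_req["stop_conditions"]["max_tokens"] = 1
    prefill_out = prefill_worker(prefill_req)

    decode_req = copy.deepcopy(request)  # original max_tokens is intact here
    decode_req["prefill_result"] = prefill_out["disaggregated_params"]
    return decode_worker(decode_req)


def fake_prefill(req):
    # Returns a handle describing where the KV lives, never the KV itself.
    assert req["stop_conditions"]["max_tokens"] == 1
    return {
        "token_ids": [101],
        "disaggregated_params": {"blocks": [0, 1], "host": "prefill-0"},
    }


def fake_decode(req):
    # A real decode worker would pull the KV blocks from the named host here.
    budget = req["stop_conditions"]["max_tokens"]
    return {
        "pulled_from": req["prefill_result"]["host"],
        "token_ids": list(range(budget)),
    }


request = {"token_ids": [1, 2, 3], "stop_conditions": {"max_tokens": 4}}
result = run_disaggregated(request, fake_prefill, fake_decode)
print(result)
```

Working on copies keeps the caller's request untouched, which is the same reason the Rust code clones the request before overriding max_tokens.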
Operationally notable details for someone who has run this class of system:
- Prefill is deliberately not linked as a child of the request context: cancelling it mid-NIXL-transfer would permanently leak KV blocks, so it runs to completion and wasted compute is the accepted tradeoff (prefill_router/mod.rs:153-162; the comment cites the bug).
- Backpressure from the prefill queue surfaces as ResourceExhausted rather than silently re-entering the saturated queue (prefill_router/mod.rs:181-203).
- Fallback is explicit: no prefill workers → aggregated mode, unless --enforce-disagg fails fast; all-prefill-death flips a deactivated flag checked per request (prefill_router/mod.rs:96-104).
- Topology constraints (“decode must be NVLink-reachable from this prefill worker”) are merged into the decode request as required/preferred taints (merge_decode_topology_constraints, prefill_router/mod.rs:424-442).
4.4 The distributed runtime
Everything above runs on dynamo-runtime. The object model is a three-level hierarchy —
Namespace / Component / Endpoint (lib/runtime/src/component.rs:4-30) — addressed as
dyn://namespace.component.endpoint. An Instance is one live process serving an endpoint,
identified by a u64 instance_id and a transport address
(lib/runtime/src/component.rs:106-115). The hello-world server is the whole pattern:
// lib/runtime/examples/hello_world/src/bin/server.rs:54-65
async fn backend(runtime: DistributedRuntime) -> anyhow::Result<()> {
    // attach an ingress to an engine
    let ingress = Ingress::for_engine(RequestHandler::new())?;
    let component = runtime.namespace(DEFAULT_NAMESPACE)?.component("backend")?;
    component
        .endpoint("generate")
        .endpoint_builder()
        .handler(ingress)
        .start()
        .await
}
Discovery is pluggable (lib/runtime/src/distributed.rs:148-184): a Discovery trait
with two families — Kubernetes-native (CRD DynamoWorkerMetadata + EndpointSlices, no etcd)
and KV-store backed (etcd, a plain filesystem directory for laptop dev via
--discovery-backend file, or in-memory for tests). Workers write instance records and
model cards; frontends list_and_watch a DiscoveryQuery and react to Added/Removed.
Liveness in the etcd backend is lease-based — every registration is attached to a TTL lease kept alive by a background task; if keep-alive fails the process’s cancellation token fires, taking the whole worker down rather than serving as a zombie:
// lib/runtime/src/transports/etcd/lease.rs:15-27
pub async fn create_lease(
    connector: Arc<Connector>,
    ttl: u64,
    token: CancellationToken,
) -> anyhow::Result<u64> {
    let mut lease_client = connector.get_client().lease_client();
    let lease = lease_client.grant(ttl as i64, None).await?;
    let id = lease.id() as u64;
    let ttl = lease.ttl() as u64;
    let child = token.child_token();
    // ... (rest of the function is not shown in this excerpt)
}
Lease IDs double as instance IDs, which is why the TCP layer keeps short-lived tombstones
keyed on them during re-registration races (lib/runtime/src/pipeline/network/tcp/server.rs:15-19).
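Lease-based liveness reduces to a table of expiry times plus a renewal call. The sketch below is a local simulation with an explicit clock argument, not an etcd client; LeaseTable and its methods are hypothetical names.

```python
class LeaseTable:
    """Toy TTL leases: an instance is live only while its lease keeps being renewed."""

    def __init__(self):
        self._expiry = {}
        self._next_id = 1

    def grant(self, ttl, now):
        """Create a lease; its id doubles as the instance id in this sketch."""
        lease_id = self._next_id
        self._next_id += 1
        self._expiry[lease_id] = now + ttl
        return lease_id

    def keep_alive(self, lease_id, ttl, now):
        """Renew if still valid; False means the holder should shut itself down."""
        if self._expiry.get(lease_id, 0) <= now:
            self._expiry.pop(lease_id, None)
            return False
        self._expiry[lease_id] = now + ttl
        return True

    def live_instances(self, now):
        return sorted(i for i, t in self._expiry.items() if t > now)


table = LeaseTable()
a = table.grant(ttl=10, now=0)
b = table.grant(ttl=10, now=0)
renewed = table.keep_alive(a, ttl=10, now=8)   # a renews in time
live = table.live_instances(now=12)            # b missed its window
late = table.keep_alive(b, ttl=10, now=12)     # too late: lease is gone
print(renewed, live, late)
```

A failed keep_alive is the signal that corresponds to firing the cancellation token: the registration has already vanished from discovery, so continuing to serve would make the process a zombie.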
NATS is optional everywhere: requests default to TCP, discovery defaults to K8s/file,
and KV events can ride ZMQ. NATS earns its place for JetStream-durable KV events, router
replica sync, and brokered request plane (README.md:246-262).
4.5 Planner / autoscaling
components/src/dynamo/planner is the SLA autoscaler (Python). Signals are forward-pass
metrics (FPM) scraped from workers — queued prefill tokens, KV-cache utilization, observed
TTFT/ITL — fed into a state machine with two modes: simple threshold scaling and SLA mode
backed by per-GPU performance models (interpolated from offline profiling or AIConfigurator
predictions). The non-SLA thresholds are legible enough to read directly:
# components/src/dynamo/planner/core/load_scaling.py:24-34
# Prefill: ratio of queued_prefill_tokens / context_length
_PREFILL_THROUGHPUT_SCALE_UP = 1.0 # queued >= context_length
_PREFILL_THROUGHPUT_SCALE_DOWN = 0.1 # queued < context_length / 10
_PREFILL_LATENCY_SCALE_UP = 0.1 # queued >= context_length / 10
_PREFILL_LATENCY_SCALE_DOWN = 0.0 # queued == 0
# Decode/Agg: KV cache utilization (scheduled + queued) / max_kv_tokens
_DECODE_THROUGHPUT_SCALE_UP = 1.0 # util > 100%
_DECODE_THROUGHPUT_SCALE_DOWN = 0.6 # util < 60%
_DECODE_LATENCY_SCALE_UP = 0.4 # util > 40%
_DECODE_LATENCY_SCALE_DOWN = 0.1 # util < 10%
Prefill and decode pools scale independently (the economic argument for disaggregation),
with budget caps and connectors that execute decisions against Kubernetes (DGD resources via
the operator) or virtual targets for simulation. The DynamoGraphDeploymentRequest CRD
(README.md:162-175) chains AIConfigurator profiling → planner topology → deployment from a
single model + SLA spec. A global_planner component coordinates across deployments.
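The throughput-mode thresholds quoted above translate into a very small decision function. This sketch reuses the numeric values from the excerpt but simplifies the comparisons (it uses >= for both scale-up checks) and says nothing about how the real planner aggregates metrics across workers; the function names are hypothetical.

```python
PREFILL_SCALE_UP = 1.0    # ratio: queued_prefill_tokens / context_length
PREFILL_SCALE_DOWN = 0.1
DECODE_SCALE_UP = 1.0     # ratio: KV tokens (scheduled + queued) / max_kv_tokens
DECODE_SCALE_DOWN = 0.6


def decide(ratio, scale_up, scale_down):
    """Return +1 to add a replica, -1 to remove one, 0 to hold."""
    if ratio >= scale_up:
        return 1
    if ratio < scale_down:
        return -1
    return 0


def plan(queued_prefill_tokens, context_length, kv_tokens, max_kv_tokens):
    """Prefill and decode pools are decided independently of each other."""
    return {
        "prefill": decide(queued_prefill_tokens / context_length,
                          PREFILL_SCALE_UP, PREFILL_SCALE_DOWN),
        "decode": decide(kv_tokens / max_kv_tokens,
                         DECODE_SCALE_UP, DECODE_SCALE_DOWN),
    }


busy_prefill = plan(queued_prefill_tokens=9000, context_length=8192,
                    kv_tokens=5000, max_kv_tokens=10000)
steady = plan(queued_prefill_tokens=2000, context_length=8192,
              kv_tokens=8000, max_kv_tokens=10000)
print(busy_prefill, steady)
```

In the first case the prefill pool grows while the decode pool shrinks, which is the independent scaling that disaggregation makes possible.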
5. Suggested reading path
- Orientation (30 min). README.md, then docs/design-docs/architecture.md and docs/design-docs/dynamo-flow.md (the numbered S1-S9 disagg walkthrough).
- Runtime primitives. lib/runtime/examples/hello_world/src/bin/server.rs and client.rs (74 lines total: the whole component/endpoint/push-router model), then lib/runtime/src/component.rs doc comments and lib/runtime/src/distributed.rs:117-184 for discovery backends. Skim lib/runtime/src/pipeline/network/tcp/server.rs for the call-home response stream.
- The frontend pipeline. lib/llm/src/entrypoint/input/common.rs:323-365 (operator chain), lib/llm/src/http/service/openai.rs (SSE handler), and lib/llm/src/preprocessor.rs (what “preprocess” actually does).
- The router, bottom-up. lib/kv-router/src/protocols.rs (hashes, events, OverlapScores), lib/kv-router/src/indexer/radix_tree.rs (single-threaded tree first; the concurrent versions are optimizations of the same shape), lib/kv-router/src/scheduling/selector.rs (cost function; tests at the bottom are executable documentation), lib/kv-router/src/scheduling/CLAUDE.md (invariants), then up to lib/llm/src/kv_router.rs and lib/llm/src/kv_router/push_router.rs for the integration. docs/design-docs/router-design.md alongside.
- Disaggregation. docs/design-docs/disagg-serving.md, then lib/llm/src/kv_router/prefill_router/mod.rs top to bottom (the generate method is the whole story), with components/src/dynamo/vllm/handlers.py prefill/decode handlers as the engine-side counterpart.
- Worker lifecycle in Python. components/src/dynamo/vllm/main.py + worker_factory.py, and lib/bindings/python/rust/lib.rs to see how thin the binding is.
- Optional depth. KVBM (lib/llm/src/block_manager.md, lib/kvbm-logical, lib/memory/src/nixl/), planner (components/src/dynamo/planner/core/), the Envoy EPP (deploy/inference-gateway/ext-proc/src/picker.rs, epp.rs).
6. Connections to your other study repos
- vllm / sglang: the engines Dynamo orchestrates. What you studied as their internals (paged KV, prefix caching, continuous batching) becomes Dynamo’s signal surface: vLLM’s ZMQ KV-event publisher and kv_transfer_params, SGLang’s RDMA bootstrap, both wrapped by components/src/dynamo/{vllm,sglang}. Dynamo’s mocker reimplements their schedulers in Rust (lib/mocker), a compact second take on what you read in those codebases.
- sgl-router (in the sglang repo): the same cache-aware-routing idea, one tier, one binary: an approximate radix tree built from the requests it routed, no event plane. Dynamo’s router is the heavyweight sibling: event-fed exact index, multi-replica state sync, tiered cache credits, prefill/decode phase awareness, admission queue. Comparing sgl-router’s tree to lib/kv-router/src/indexer/ is a great compare-and-contrast.
- llm-d: the K8s-native competitor (Google/Red Hat/IBM orbit). Same thesis (disagg + cache-aware scheduling above vLLM) but architected as Kubernetes: Go, Inference Gateway EPP as the router, vLLM-only. Dynamo is engine-agnostic, Rust, and owns its own runtime, with K8s as one discovery backend among several. Contrast llm-d’s gateway-resident scorer with Dynamo’s in-frontend KvPushRouter plus optional EPP.
- gateway-api-inference-extension: the standard llm-d builds on, and Dynamo meets it too: deploy/inference-gateway/ext-proc is a Rust ext_proc Endpoint Picker that mirrors the Go LW-EPP interface (picker.rs:5-8 cites GAIE #2834) but scores with dynamo-kv-router. In gateway mode the frontend runs --router-mode direct and honors the EPP’s x-prefill-instance-id/worker headers (README.md:111-122), which is directly relevant to your Envoy work.
- nano-vllm: the minimal engine; useful here as the mental model for what a Dynamo worker does between receiving PreprocessedRequest and yielding token deltas.
- xgrammar: structured-output constraints; surfaces in Dynamo only at the edges (lib/parsers parses tool calls; guided decoding is delegated to engines via nvext).
- flashinfer: two layers below: engine kernels. Dynamo never touches it, but KVBM’s layout/transfer code (lib/kvbm-kernels, lib/memory) is where Dynamo gets closest to that altitude.
7. Tinkering on your machine
Windows host: plan on WSL2. Published wheels are manylinux-only — pip on Windows will
not resolve ai-dynamo (docs/reference/release-artifacts.md), and the from-source path
assumes Ubuntu packages (libhwloc-dev, protobuf-compiler; README.md:191-204). Your
RTX 5080 (16 GB, Blackwell) works under WSL2 Ubuntu 24.04 with a CUDA 13 driver; small
models (Qwen3-0.6B, Llama-3.2-1B/3B) leave room to spare. NIXL GPU-to-GPU disagg across two
real workers is not happening on one consumer GPU — use the mocker for that topology.
Tier 0 — no GPU, no Linux strictly required (read/test the Rust core). The routing and protocol crates are pure Rust with no CUDA or engine dependency:
cargo test -p dynamo-kv-router # radix tree, selector, queue, sequences
cargo test -p dynamo-tokens -p dynamo-kv-hashing -p dynamo-protocols
cargo test -p dynamo-mocker # simulated vLLM/SGLang schedulers
cargo bench -p dynamo-llm --bench kv_router_bench --features kv-router-stress
The selector tests (lib/kv-router/src/scheduling/selector.rs:465-1383) are the fastest way
to internalize the cost function — modify a weight, predict the winner, run. Inside WSL2 the
full workspace including dynamo-runtime and dynamo-llm builds and tests CPU-only.
Tier 1 — full serving stack, zero GPUs (mocker). This exercises the real frontend,
preprocessor, KV router, and even disaggregated flow against a simulated engine
(docs/dynosim/mocker.md):
python -m dynamo.frontend --http-port 8000 --router-mode kv &
python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --num-workers 4 --speedup-ratio 10
# disagg without GPUs:
python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --disaggregation-mode prefill --bootstrap-ports 50100 &
python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --disaggregation-mode decode
Watch router decisions with RUST_LOG=dynamo_llm::kv_router=debug — the selector logs the
full cost decomposition per worker (selector.rs:219-229). This is the highest
learning-per-watt setup in the repo for someone who cares about routing.
Tier 2 — one real GPU (WSL2). The README quick start, no etcd/NATS needed:
uv pip install --prerelease=allow "ai-dynamo[vllm]"
python -m dynamo.frontend --http-port 8000 --discovery-backend file &
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
--kv-events-config '{"enable_kv_cache_events": false}'
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d \
'{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"hi"}],"max_tokens":50,"stream":true}'
With ~16 GB you can also run two vLLM workers on tiny models (cap --gpu-memory-utilization
around 0.4 each) and turn on --router-mode kv to watch cache-aware routing pick the warm
worker on repeated prefixes — the single most instructive experiment in the repo.
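The streaming curl request in Tier 2 returns OpenAI-style server-sent events, which are easy to post-process when you capture them to a file. The helper below is a small sketch that joins the delta.content fields from such a capture; the function name is hypothetical and the sample payload is hand-written for illustration, not output recorded from Dynamo.

```python
import json


def parse_sse_chat_stream(raw: str) -> str:
    """Concatenate delta.content from 'data:' lines until the [DONE] sentinel."""
    text = []
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue  # skip blank separators and ':' keep-alive comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content") or "")
    return "".join(text)


sample = (
    'data: {"choices":[{"delta":{"role":"assistant","content":""}}]}\n\n'
    ': keep-alive\n\n'
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n'
    'data: [DONE]\n\n'
)
print(parse_sse_chat_stream(sample))
```

Timestamping each data line as it arrives would give a rough client-side view of time to first token and inter-token latency for the routing experiments above.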
Tier 3 — runtime-only Rust, near-zero deps. lib/runtime/examples/hello_world runs two
processes against file/memory discovery; lib/llm has in-process engine examples
(echo_full) reachable via dynamo-run-style entrypoints (lib/llm/src/entrypoint/), and
cargo run -p dynamo-llm --bin generate-frontend-openapi exercises the HTTP surface without
any worker at all.
Things that genuinely need a cluster (read, don’t run): NIXL GPU-to-GPU transfer, KVBM multi-tier offload under pressure, the K8s operator/DGDR path, planner scaling against real FPM streams, JetStream-durable KV events with multi-replica routers.