
kvfleet Internals: Building a KV-Cache-Aware Routing Control Plane for LLM Fleets

Tags: kvfleet, llm-routing, kv-cache, python, distributed-systems, ai-infrastructure

kvfleet Routing Flow

From request intake to cache-aware model execution

`kvfleet` treats routing as a control-plane decision: normalize the request, apply enterprise constraints, score for cache locality and performance, then execute through fallback-aware adapters.

Request Intake (input)

  • An SDK client or OpenAI-compatible app sends a chat request
  • Messages are normalized into a single ChatRequest envelope
  • Enabled chat-capable models are loaded from the ModelRegistry

Decision Plane (control plane)

  • PolicyEngine and TenantManager remove invalid candidates first
  • PromptFingerprinter computes full, system, prefix, and conversation hashes
  • KVAffinityScorer combines session memory, prefix memory, and consistent hashing
  • RoutingStrategy ranks candidates with cost, latency, quality, cache, health, and compliance signals

Execution + Feedback (data plane)

  • FallbackChain executes via vLLM, TGI, Triton, Ollama, OpenAI-compatible, or Custom HTTP adapters
  • The selected endpoint handles inference while preserving warm-route locality where possible
  • RouteExplanation, metrics, and affinity records feed the next routing decision

Three properties anchor the design:

  • Policy first: PII, data-class, tenant, and capability checks happen before ranking.
  • Cache locality: prefix and session affinity reduce expensive prompt re-prefill on GPUs.
  • Explainable outcomes: every decision emits a structured trace for tuning, audits, and debugging.

Self-hosting LLMs is no longer the hard part. Keeping them fast, cheap, compliant, and predictable under real traffic is the hard part.

That is the problem space behind kvfleet, the Python library I open-sourced to act as a routing control plane for self-hosted and hybrid LLM fleets. At a surface level, kvfleet looks like a smart gateway. Under the hood, it is doing something more specific: it tries to preserve KV-cache locality, enforce policy, rank models across competing objectives, explain every decision, and still remain drop-in compatible with OpenAI-style clients.

This article is about the internals, not the announcement. The goal is to walk through the core design decisions in kvfleet v0.11.2, show how the routing path actually works, and explain where it differs from adjacent tools like LiteLLM, RouteLLM, and semantic routers.

The Challenge

The standard way teams scale inference is to place multiple replicas behind a stateless load balancer and spread requests across them using round-robin, least-connections, or weighted routing. That works for HTTP services whose hot state is either externalized or cheap to rebuild. LLM inference is different because the expensive state often lives inside GPU memory as a KV cache.

If a user sends a long system prompt, a large document, or a multi-turn conversation, the prefill phase builds a substantial cache of attention keys and values. The next request is only fast if it lands on the same replica, or at least on a replica with an equivalent warm prefix. If it lands elsewhere, the system recomputes the prompt prefix from scratch. Prefill cost grows roughly linearly with prompt length, so a routing miss on a 32K-token conversation is not a small miss. It is often the dominant latency event in the transaction.
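The scale of that miss is easy to put in numbers. Here is a back-of-envelope sketch of the linear prefill model described above; the throughput figure is purely illustrative (real prefill speed depends on model size, batching, and hardware), not a kvfleet constant:

```python
def prefill_ms(prompt_tokens: int, tokens_per_second: float = 8_000.0) -> float:
    """Linear prefill-latency model: cost scales with prompt length.

    tokens_per_second is an illustrative number for this sketch only.
    """
    return prompt_tokens / tokens_per_second * 1_000.0


# A warm route only prefills the new turn; a miss re-prefills everything.
warm_ms = prefill_ms(512)      # new user turn lands on the replica holding the cache
cold_ms = prefill_ms(32_768)   # routing miss: recompute the full conversation
print(f"warm ~{warm_ms:.0f} ms, cold ~{cold_ms:.0f} ms")  # warm ~64 ms, cold ~4096 ms
```

At these assumed rates the cold path is two orders of magnitude slower than the warm path, which is why replica selection, not model selection, dominates tail latency for long conversations.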

Why common approaches fail:

  • Round-robin is oblivious to prompt reuse, so cache locality is accidental.
  • Least-connections optimizes queue depth, not GPU-resident state.
  • Sticky sessions help with conversation locality, but not with shared system prompts, near-duplicate prefixes, or cost and policy constraints.
  • Pure semantic routers choose models based on content or intent, but usually do not choose replicas based on cache locality.
  • API gateways normalize providers and credentials well, but that still leaves the replica-selection problem unsolved for self-hosted backends.
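A tiny simulation makes the first two bullets concrete. This is not kvfleet code; it compares a round-robin picker against a deterministic session-hash picker on a synthetic stream of multi-turn conversations, counting how often a conversation lands on the replica that served its previous turn:

```python
import hashlib

REPLICAS = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]


def round_robin(i: int, _session: str) -> str:
    return REPLICAS[i % len(REPLICAS)]


def session_hash(_i: int, session: str) -> str:
    # Deterministic: the same conversation always maps to the same replica.
    h = int(hashlib.sha256(session.encode()).hexdigest(), 16)
    return REPLICAS[h % len(REPLICAS)]


def cache_hit_rate(pick) -> float:
    last_seen: dict[str, str] = {}
    hits = 0
    for i in range(4000):
        session = f"conv-{i % 7}"  # 7 interleaved conversations, many turns each
        replica = pick(i, session)
        hits += last_seen.get(session) == replica
        last_seen[session] = replica
    return hits / 4000


print(f"round-robin hit rate:  {cache_hit_rate(round_robin):.0%}")   # 0%
print(f"session-hash hit rate: {cache_hit_rate(session_hash):.0%}")  # 100%
```

With a different session count or replica count, round-robin can just as easily land at 100% by accident, which is exactly the point: its locality is a coincidence of strides, not a property you can rely on.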

There is also a second problem that only shows up in enterprise deployments: the best model from a latency perspective may be the wrong model from a governance perspective. A request may carry internal or confidential data, may belong to a specific tenant, may require tool calling or JSON-mode, or may have a strict budget ceiling. Routing therefore becomes a constrained optimization problem, not just a load-balancing problem.

That is the design center of kvfleet: routing is treated as a control-plane decision that combines prompt identity, model priors, live health, compliance policy, and failure handling.

The Blueprint

The routing path in the diagram above maps almost one-to-one to the actual Router.route() implementation.

At startup, the router builds a ModelRegistry from FleetConfig.models, initializes a ScoringEngine, creates the selected routing strategy, constructs a PromptFingerprinter, and prepares a KVAffinityScorer. The affinity scorer itself combines three data structures:

  • A session store with TTL for conversation-level stickiness.
  • A prefix cache that remembers which endpoint previously served a prefix fingerprint.
  • A consistent hash ring used as a deterministic tie-breaker when there is no explicit warm-route memory.

The prompt fingerprint is intentionally multi-level. PromptFingerprinter derives:

  • full_hash for exact request identity.
  • system_hash for stable instructions shared across many requests.
  • prefix_hash for the first prefix_hash_tokens worth of normalized prompt text.
  • conversation_hash for the user-turn sequence.

Those values are cheap to compute and cheap to compare. The tradeoff is precision: the current implementation uses whitespace splitting and simple normalization rather than a tokenizer-aware prefix boundary. That is deliberate. The library is optimizing for low routing overhead in Python, not for perfect linguistic equivalence.
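The multi-level scheme can be sketched in a few lines. This is an approximation of the idea, not kvfleet's PromptFingerprinter; the message shape and helper names are assumptions, but the whitespace-split, hash-per-level approach mirrors what the source describes:

```python
import hashlib


def _digest(text: str) -> str:
    # Whitespace-normalized hashing: cheap, tokenizer-agnostic, approximate.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def fingerprint(messages: list[dict], prefix_hash_tokens: int = 128) -> dict:
    """Sketch of multi-level prompt fingerprints (field names illustrative)."""
    full_text = " ".join(m["content"] for m in messages)
    system_text = " ".join(m["content"] for m in messages if m["role"] == "system")
    user_turns = " | ".join(m["content"] for m in messages if m["role"] == "user")
    prefix_words = full_text.split()[:prefix_hash_tokens]  # whitespace "tokens"
    return {
        "full_hash": _digest(full_text),
        "system_hash": _digest(system_text),
        "prefix_hash": _digest(" ".join(prefix_words)),
        "conversation_hash": _digest(user_turns),
    }


a = fingerprint([{"role": "system", "content": "You are helpful."},
                 {"role": "user", "content": "Summarize this doc."}])
b = fingerprint([{"role": "system", "content": "You are helpful."},
                 {"role": "user", "content": "Now translate it."}])
print(a["system_hash"] == b["system_hash"])  # True: shared instructions match
print(a["full_hash"] == b["full_hash"])      # False: the requests differ
```

The payoff of keeping the levels separate is visible even in this toy: two different requests that share a system prompt still collide on system_hash, which is exactly the signal a prefix-affinity router wants.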

Once the request is normalized into a ChatRequest, the router executes this sequence:

  1. Enumerate enabled chat-capable models from the registry.
  2. Apply policy filters such as PII handling and allowed data classes.
  3. Apply tenant filters and capability filters such as tool-use and JSON-mode support.
  4. Fingerprint the prompt and compute per-model cache-affinity scores.
  5. Build a ScoringContext carrying runtime signals such as data class, tags, health, and affinity.
  6. Ask the configured routing strategy to rank candidates.
  7. Execute the selected model through a FallbackChain.
  8. Record the endpoint in the affinity store for future requests.
  9. Emit metrics, route explanation, and optional shadow traffic.

The result is a layered decision model. Policy is a hard gate. Capability is a hard gate. Affinity is a soft but high-value signal. Scoring is a weighted ranking. Fallback is the recovery path. This separation matters because it keeps the system explainable. RouteExplanation stores candidate scores, policy decisions, fallback history, cache-hit state, total latency, and metadata for downstream debugging.
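The hard-gate-then-soft-score layering can be condensed into a sketch. Every field name and the scoring formula below are illustrative, not the kvfleet schema; the point is the shape: filters eliminate, weights only rank what survives:

```python
def rank_candidates(models: list[dict], ctx: dict, weights: dict) -> list[dict]:
    """Hard gates first (policy, capability), then weighted soft scoring."""
    survivors = [
        m for m in models
        if ctx["data_class"] in m["allowed_data_classes"]       # policy gate
        and (not ctx["needs_json"] or m["supports_json_mode"])  # capability gate
    ]

    def score(m: dict) -> float:
        return (weights["quality"] * m["quality"]
                + weights["latency"] * (1.0 - m["latency_norm"])
                + weights["cache_affinity"] * ctx["affinity"].get(m["name"], 0.0))

    return sorted(survivors, key=score, reverse=True)


models = [
    {"name": "fast-8b", "allowed_data_classes": {"public", "internal", "confidential"},
     "supports_json_mode": True, "quality": 0.72, "latency_norm": 0.2},
    {"name": "gpt-4o-fallback", "allowed_data_classes": {"public"},
     "supports_json_mode": True, "quality": 0.96, "latency_norm": 0.5},
]
ctx = {"data_class": "confidential", "needs_json": True, "affinity": {"fast-8b": 0.9}}
weights = {"quality": 0.30, "latency": 0.25, "cache_affinity": 0.15}
ranked = rank_candidates(models, ctx, weights)
print([m["name"] for m in ranked])  # ['fast-8b'] (the API model fails the policy gate)
```

Note that the highest-quality model never even gets scored here: a confidential request removes it at the gate, which is the "compliance is not free" trade discussed later.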

From an algorithmic perspective, the hot path is lightweight:

  • Fingerprinting is O(L) in prompt length because it walks message text and hashes normalized strings.
  • Policy filtering is roughly O(M + R) for M candidates and R rules.
  • Scoring is O(M).
  • The current hash-ring lookup is effectively O(K) over sorted ring keys because it linearly scans instead of using binary search, where K is the number of virtual nodes. That is a conscious simplicity tradeoff today, and one of the easiest future optimizations.

Implementation Guide

The implementation pattern I recommend in production is:

  1. Treat configuration as code.
  2. Keep routing policy and model priors explicit in YAML.
  3. Simulate before sending live traffic.
  4. Run the router close to the inference backends so the control-plane hop stays cheap.

Start with a fleet config that encodes not just endpoints, but the operating assumptions behind them.

fleet.yaml:
fleet_name: prod-fleet
strategy: hybrid_score
 
models:
  - name: llama-3-8b-fast
    endpoint: http://gpu-a.internal:8000
    replicas:
      - http://gpu-b.internal:8000
    provider: vllm
    model_id: meta-llama/Llama-3-8B-Instruct
    latency_p50_ms: 180
    quality_score: 0.72
    cost_per_1k_input_tokens: 0.0
    allowed_data_classes: [public, internal, confidential]
    tags:
      tier: fast
      domain: general
    capabilities:
      supports_tools: true
      supports_json_mode: true
      model_type: chat
 
  - name: llama-3-70b-quality
    endpoint: http://gpu-c.internal:8000
    provider: vllm
    model_id: meta-llama/Llama-3-70B-Instruct
    latency_p50_ms: 850
    quality_score: 0.93
    cost_per_1k_input_tokens: 0.0
    allowed_data_classes: [public, internal, confidential]
    tags:
      tier: quality
      domain: general
 
  - name: gpt-4o-fallback
    endpoint: https://api.openai.com
    provider: openai_compat
    model_id: gpt-4o
    latency_p50_ms: 420
    quality_score: 0.96
    cost_per_1k_input_tokens: 0.005
    allowed_data_classes: [public]
    tags:
      tier: premium
 
scoring_weights:
  # Why: keep cost and latency important, but leave room for cache reuse
  # because a warm cache can dominate the end-to-end user experience.
  cost: 0.20
  latency: 0.25
  quality: 0.30
  cache_affinity: 0.15
  hardware_load: 0.05
  compliance: 0.05
 
cache_affinity:
  enabled: true
  session_ttl_seconds: 3600
  prefix_hash_tokens: 128
  min_affinity_score: 0.30
  consistent_hash_replicas: 150
 
policy:
  enabled: true
  pii_detection: true
  default_data_class: internal
  rules:
    - name: confidential-stays-private
      condition: "data_class == confidential"
      action: require_private
      priority: 1
 
fallback:
  enabled: true
  max_attempts: 3
  promote_on_timeout: true
  timeout_ms: 10000
  fallback_order: [llama-3-8b-fast, llama-3-70b-quality, gpt-4o-fallback]

Then wire it from Python in a way that makes routing observable before it becomes mandatory infrastructure.

router_demo.py:
import asyncio
 
from kvfleet import Router, load_config
from kvfleet.adapters.base import ChatMessage
 
 
async def main() -> None:
    config = load_config("fleet.yaml")
    router = Router(config)
 
    # Why: warm health state before the first live request so the scoring
    # engine can consider endpoint health instead of assuming a neutral score.
    await router.health_check_all()
 
    messages = [
        ChatMessage(
            role="system",
            content=(
                "You are a production support assistant. "
                "Return concise JSON with root_cause and next_actions."
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Investigate a latency spike in our inference gateway. "
                "Requests include customer identifiers."
            ),
        ),
    ]
 
    # Why: simulate first. In rollout phases, this gives you an explanation
    # trace without actually spending tokens or touching backends.
    dry_run = await router.simulate(
        messages=messages,
        data_class="confidential",
        tenant_id="platform-eng",
        tags={"domain": "operations"},
    )
    print(dry_run.summary())
 
    response, explanation = await router.route(
        messages=messages,
        data_class="confidential",
        tenant_id="platform-eng",
        tags={"domain": "operations"},
        temperature=0.2,
        max_tokens=300,
        response_format={"type": "json_object"},
    )
 
    print(f"Selected model: {explanation.selected_model}")
    print(f"Endpoint: {explanation.selected_endpoint}")
    print(f"Latency: {explanation.total_latency_ms:.1f}ms")
    print(response.content)
 
    await router.close()
 
 
if __name__ == "__main__":
    asyncio.run(main())

There are a few details in that snippet that matter operationally:

  • health_check_all() matters because hardware load is part of the scoring context when available.
  • simulate() is more than a convenience; it is how you validate policies and routing priors before production cutover.
  • response_format={"type": "json_object"} is a useful capability filter, because models that cannot honor JSON-mode should be excluded before execution rather than after a malformed response.
  • tenant_id and data_class should be first-class request attributes, not afterthoughts bolted onto logs.

If you want a transparent gateway instead of a Python embedding, kvfleet also exposes an OpenAI-compatible server mode. That matters when you want to migrate existing SDK consumers by swapping a base URL rather than rewriting client logic.

Performance & Trade-offs

kvfleet is opinionated about where to spend latency budget. It is willing to add a small CPU-side decision cost in order to avoid much larger GPU-side recomputation cost.

That tradeoff is correct in the common case. A few hundred microseconds or a couple of milliseconds spent fingerprinting, filtering, and scoring is negligible compared to re-prefilling a long prompt on a remote GPU. The system is essentially buying down prefill latency variance with a deterministic control-plane hop.

But the design does make explicit sacrifices.

First, the current route memory is process-local. SessionAffinityStore uses a threading.Lock and an in-memory dictionary. The prefix cache is also in-memory. That means a single router instance can reason well about warm routes, but multiple router pods do not automatically share route memory. If you deploy kvfleet horizontally, you either need a shared affinity store or a network design that preserves ingress locality.

Second, prompt fingerprinting is approximate by design. The library normalizes text and splits on whitespace instead of using model-specific tokenizers. That keeps routing cheap and backend-agnostic, but it means the prefix boundary is not a perfect proxy for actual token cache reuse. In practice that is often a good engineering compromise, but it is still a compromise.

Third, the consistent hash ring favors implementation simplicity over theoretical optimality right now. get_node() iterates through sorted ring keys rather than using binary search, so lookup is not as asymptotically efficient as it could be. For small to moderate ring sizes, this is acceptable. At very large fleet sizes, a bisect-based lookup is the obvious next step.
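For reference, the bisect-based variant is small. This is a sketch of the described optimization, not kvfleet's current get_node(); the class shape and virtual-node count are assumptions:

```python
import bisect
import hashlib


def _ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with O(log K) lookup over K virtual nodes."""

    def __init__(self, nodes: list[str], virtual_nodes: int = 150) -> None:
        self._ring: list[tuple[int, str]] = sorted(
            (_ring_hash(f"{node}#{v}"), node)
            for node in nodes
            for v in range(virtual_nodes)
        )
        self._keys = [h for h, _ in self._ring]

    def get_node(self, key: str) -> str:
        # First ring position clockwise of the key's hash, wrapping at the end.
        idx = bisect.bisect(self._keys, _ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["gpu-a", "gpu-b", "gpu-c"])
# The same fingerprint deterministically lands on the same replica.
print(ring.get_node("prefix:3f9a") == ring.get_node("prefix:3f9a"))  # True
```

The behavior is identical to a linear scan over the same sorted keys; only the lookup cost changes, which is why it is a low-risk swap once ring sizes grow.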

Fourth, enterprise policy is treated as a hard filter before ranking. That is the correct safety posture, but it can reduce the candidate set enough that cost or latency optimization becomes secondary. In other words, compliance is not free. The library makes that trade explicit rather than pretending every route is equally available.

Existing Libraries and Where kvfleet Stands Out

This space already has useful tools, but they optimize for different centers of gravity.

LiteLLM is excellent if your primary problem is provider abstraction, unified OpenAI-style APIs, gateway deployment, and managed-model interoperability. It is not primarily a KV-locality router for self-hosted replicas.

RouteLLM is strong when the core problem is choosing between stronger and weaker models to optimize quality versus spend. That is a model-selection problem. kvfleet overlaps there, but extends the decision surface to include endpoint locality, replica affinity, tenant and policy constraints, and backend heterogeneity.

Semantic routers are useful when prompt meaning is the routing signal, especially for intent-specific models, tools, or agent pathways. kvfleet includes semantic and domain-aware strategies too, but those are only one part of the control plane. The distinguishing feature is that semantic intent can coexist with cache-affinity, policy gates, and fallback execution in one pipeline.

In short, kvfleet stands out when all of these are true at once:

  • You are running self-hosted or hybrid fleets, not just third-party APIs.
  • Replica-level cache locality matters to latency and cost.
  • Governance and tenant isolation are part of the routing decision.
  • You need an explanation trace for every decision, not just a selected model name.
  • You want one control plane that can operate as both a Python library and an OpenAI-compatible gateway.

Lessons Learned

  • Multi-turn affinity and prefix affinity are related but not identical. Conflating them reduces accuracy, which is why kvfleet fingerprints both conversation context and prompt prefix separately.
  • A routing layer without an explanation model becomes impossible to tune. Once cost, latency, compliance, and cache-affinity interact, "why did this route fire?" must be a first-class API.
  • Process-local state is enough to prove the architecture and deliver value, but high-availability deployments eventually need shared route memory if they want consistent affinity across pods.
  • Capability filters should happen before inference, not after. It is cheaper to reject a non-tool-capable or non-JSON-capable model at selection time than to recover from a malformed downstream response.

kvfleet exists because LLM routing is no longer just a model-choice problem. It is a systems problem. Once you operate real fleets, the winning route is the one that respects locality, constraints, economics, and failure recovery at the same time. That is the engineering niche kvfleet is designed to fill.
