Inference & PerformanceAI Agent

KV Cache Hit Rate: The #1 Cost Lever for Agent Inference

Published Jun 25, 2026

Anatomy of a ReAct Agent’s Bill

A ReAct agent executes 10 tool calls, producing only 500 output tokens in total, yet consuming 800,000 input tokens. Generating those 500 tokens takes the model mere milliseconds, but before each round of tool calls, it must first process tens or even hundreds of thousands of input tokens from scratch to understand the current state. This process is called prefill—it takes several seconds, and it gets billed anew every single round.

This isn’t some edge case; it’s the universal characteristic of agent workloads. Spheron measured in production that prefill accounts for 85-95% of agent inference tasks, with the raw input-to-output ratio reaching 267:1—every token generated requires re-reading 267. Under this load profile, the primary variable governing inference cost and latency isn’t model selection; it’s the KV cache hit rate. Take the same agent workload, raise the cache hit rate from 0% to 90%, and monthly GPU bills can drop from $20,000 to $2,000. arXiv 2605.26297’s inference cost analysis independently validates this: once effective caching kicks in, the append ratio collapses from an original range of 53.9× to 559.8× down to 1.5× to 7.3×, and the decode phase reclaims 91% to 98.6% of the time share.

Prefill Is the Dominant Component of the Agent Bill

To understand the judgment above, you first need to touch the internal mechanics of LLM inference. When an LLM receives input tokens, it must compute a set of key-value matrices (the K matrix and the V matrix) for each token and store them in GPU memory for lookup during subsequent generation—this computation process is prefill. The size of the KV matrices scales proportionally with the number of input tokens: longer input means heavier prefill and more memory consumed. Generating new tokens is fast; re-reading everything that came before is slow. Cockroach Labs’s cost breakdown of agent inference at scale confirms this—the re-read overhead occupies a far larger share of the total bill than intuition would suggest.

Prefix caching is precisely the optimization that targets this re-read overhead. If the next round’s input shares a long leading segment identical to the previous round’s, that segment’s KV matrices don’t need to be recomputed—they can be reused straight from GPU memory. This shared prefix grows automatically each round; the more cache hits, the fewer tokens actually need fresh computation. A hit is a cache hit, a miss is a cache miss, and hit rate is the direct metric measuring how well this mechanism performs.

Agent scenarios push the re-read problem to its extreme. Each round’s tool-call prompt consists of three parts: the system prompt, the tool schema, and the conversation history. The system prompt and tool schema stay completely unchanged across all rounds, and the longer the conversation history grows, the larger the share of the invariant prefix becomes. Stanford’s statistics show that re-sent context accounts for 62% of the agent inference bill—the vast majority of the cost goes toward making the model repeatedly chew through information it already knows. The Manus team calls KV cache hit rate the #1 metric for production agents.

Pricing structures across API providers offer corroborating evidence: cached input is universally priced at 0.1 to 0.5 times the rate of non-cached input. Providers incur almost no extra GPU overhead when reusing prefixes, so naturally they can pass the discount through. That itself is a structural signal that prefill is the dominant component of the bill.

Of course, this judgment holds only when the agent workload has stable prefix reuse. If every request’s input is completely non-overlapping with no reusable prefix structure, the benefit of prefix caching approaches zero. Likewise, when concurrent requests’ total KV cache footprint exceeds GPU memory capacity, caches get constantly evicted and reloaded, prefill reasserts dominance, and the system enters a cache thrashing state.

A Three-Layer Engineering Stack: Compression, Routing, and API Caching

Once you know cache hit rate is the cost lever, the question shifts to how to push it higher. As of June 2026, three layers of engineering practice answer this question from different directions, each addressing a different controller. The compression layer modifies the internal algorithms and memory management of inference engines like vLLM and SGLang—only engine maintainers or self-hosting infra teams can touch it. The routing layer modifies load balancing across multi-replica deployments—only teams operating their own GPU clusters have control here. The API-layer prompt caching is the one part that agent developers using Claude or OpenAI APIs can directly influence. The three layers stack on top of each other and evolve independently, together forming a complete hit-rate improvement stack. You can find your seat according to your role.

The compression layer tackles memory efficiency within a single inference pass. The KV matrices themselves are sizable—a 100K-token context can easily produce a KV cache occupying tens of GB of GPU memory. If you can shrink the KV matrices without losing too much information, the cache holds more context, and the hit rate naturally rises.

CompressKV, published on arXiv in June 2026, opened up a new path. The paper’s authors discovered that different attention heads in a transformer don’t divide labor equally: a small subset of heads specialize in semantic retrieval—dubbed Semantic Retrieval Heads—and their attention scores pinpoint exactly which tokens in the context are valuable to the current generation task. Using these heads’ scores to decide which tokens to retain, keeping only 3% of the KV cache suffices to preserve 97% of performance on LongBench QA. In the more extreme Needle-in-a-Haystack test, a mere 0.7% of capacity still achieved 90% accuracy. Around the same time, STAR-KV and R-KV explored the upper bounds of KV cache compression from different angles, while CacheBlend supplemented the picture on semantic fidelity of compressed caches.

The compression algorithms themselves are already good enough. The bottleneck has now shifted to getting them to run inside a production serving stack. NVIDIA’s blog post identifies two engineering walls: FlashAttention kernels don’t expose attention scores, so methods relying on scores for eviction decisions can’t get the data; and paged attention manages memory at block granularity, releasing memory only when an entire block is empty—after eviction, surviving tokens scatter across different blocks, and fragmentation prevents memory from actually being freed. Tangram’s solution is to fully staticize the dynamic overhead required by non-uniform compression through offline calibration, boosting throughput 2.6×. UltraQuant takes the hardware route, adapting 4-bit quantization directly to the matrix core’s native instructions. Three paths advancing in parallel from different directions.

The routing layer, meanwhile, solves the cache affinity problem in multi-replica deployments. Say you have 8 vLLM replicas behind a standard Kubernetes service, and the load balancer defaults to round-robin distribution. The first request arrives carrying certain prefixes, and the KV cache gets written to replica 1. The second request arrives with the exact same system prompt, and round-robin sends it to replica 2. Replica 2 has no KV matrices for this prefix in its memory, so it computes everything from scratch—a complete miss. The bigger the cluster, the closer the probability that the same prefix lands on the same replica a second time approaches 1/N, and prefix caching’s gains are entirely negated by the routing layer. TrueFoundry’s tests provide quantitative evidence: when requests sharing the same prefix get repeatedly routed to different GPU replicas, KV cache hit rate drops to zero, and latency and cost immediately revert to the all-prefill baseline.

The routing problem transitioned from an implicit assumption to an explicit engineering goal in the first half of 2026, and three major ecosystems have already crystallized. llm-d, from the Red Hat community, takes a Kubernetes-native approach to precise prefix-cache scheduling, demonstrating a decisive gap on an 8-Pod, 16-H100 setup: TTFT p90 dropped from 92.5 seconds under random scheduling to 0.54 seconds. SGLang SMG’s cache-aware routing achieved a 1.9× throughput improvement and a 3.8× increase in hit rate. GKE Inference Gateway compressed TTFT by 92.8%. Snap’s production environment steadily runs at a 75-80% hit rate. Together CPD’s cache-aware disaggregated inference and DigitalOcean’s production practices are converging on the same direction. vLLM Router’s consistent hashing approach offers yet another option. The divergence of these solutions means prefix-boundary tuning still requires manual intervention, but routing is no longer invisible from this point forward.

The API-layer prompt caching is the fastest path most agent developers can use directly. Anthropic’s Claude prompt caching bills cached input at 0.1×, equivalent to a 90% discount. It allows setting 4 cache breakpoints in the prompt, naturally corresponding to the four static zones of an agent prompt: one for the system prompt, one for the tool schema, one for few-shot examples, and one for long documents. Each inference round pays full price only on the parts of the prompt that change; everything invariant enjoys the discount.

This scheme suffered an incident on March 6, 2026, exposing the single-point fragility of the entire mechanism. On that day, Anthropic silently reduced the default cache TTL from 1 hour to 5 minutes, and agent users’bills instantly surged 100×. The fallout spread rapidly through the community, and it was eventually traced to the need to explicitly declare a "ttl": 3600 field to restore the previous cache duration.

API-layer prompt caching has graduated from optional optimization to default infrastructure, but there are still plenty of pitfalls. Requesty compiled in April 2026 the cache hit rates of the same model across different platforms: Claude connected directly hit 77.5%, whereas the same Claude routed through Google Vertex AI managed only 23.5%—same model, different gateway, a 3× difference in hit rate. Morph’s five-leverage stacking model shows that systematically combining cache breakpoint placement, TTL, prompt structure (static portions first, variable portions last), vendor selection, and hit-rate monitoring can achieve over 90% cost savings. On the storage side, KVFlow and Lablup & VAST Data’s KV cache offload benchmark widen the cache capacity frontier, while LMCache’s new architecture boosts MoE model inference performance 10×.

Context Engineering Is Crystallizing Into a New Engineering Discipline

These three layers aren’t three independent directions—they’re projections of the same problem onto different foundational strata: how to drive the cost of having the model re-read known information toward zero without sacrificing inference quality. Achieving this goal requires simultaneously managing cache hit rate (cost), context rot (quality degradation from stale content), and error propagation (cache errors propagating forward)—no single-point optimization covers all three dimensions.

Prompt engineering cares about how to ask and optimizes output quality, with typical cost savings of 5% to 8%. RAG cares about what to retrieve and optimizes the relevance and coverage of retrieved content. Context engineering cares about the order in which information enters the context, the caching strategy, and the compression method—optimizing three metrics at once: input cost, latency, and quality, with typical savings reaching 55% to 60%. It fills the layer that the previous two practices leave untouched.

arXiv 2605.27744’s proposed agent runtime layer architecture cross-validates this from another angle. The paper lists 9 cross-module policies—including caching, retry, aggregation, identity management, and more—each requiring agent identity as a shared coordinate to coordinate, something that can’t be handled purely at the framework layer or the engine layer alone. A survey on Preprints reviews this line from the theoretical angle of sufficient state approximation, giving it a systematic academic skeleton. Cognition AI calls context engineering the #1 job in building agents. Gartner marks 2026 as the Year of Context.

The term context engineering carries the risk of being diluted by marketing—some of Gartner’s descriptions mix in realms like data engineering and corporate anthropology that are far from the engineering substance. But the engineering foundation is solid: quantifiable ROI, a near-complete toolchain, theoretical grounding spanning from attention heads to routing strategies to API breakpoints—these aren’t things buzzwords can prop up. New work such as Nexus Sampling and IntentKV continues to deliver new upper bounds on compression efficiency and semantic fidelity, indicating that this discipline is moving from empirical intuition toward systematic design.

From Cost Accounting to Action Items

Teams running multi-round agents: turn on prompt caching before discussing model selection. This is the highest-ROI, lowest-integration-cost step—no code architecture changes, no impact on model behavior, acting directly on the bill. Claude’s 4 breakpoints plus 1-hour TTL is currently the most controllable scheme: put cache breakpoints on the system prompt, tool schema, few-shot examples, and long documents respectively, paying full price only on the changing parts each round. At the same time, make cache hit rate the first metric on your dashboard—its correlation with cost is more direct than any model version number.

Teams self-hosting multiple replicas: routing strategy matters more than switching models. The minimum bar is swapping round-robin for prefix-hash routing, so requests sharing the same prefix land on the same replica, immediately restoring prefix caching gains. Larger fleets can adopt llm-d’s precise scheduling, compressing TTFT p90 from minutes to seconds, or go the vLLM Router consistent-hashing route. NVIDIA Dynamo’s low-latency distributed framework offers another integration approach.

Context engineering as a direction addresses precisely the layer that prompt engineering and RAG leave untouched: how to sequence information entering the context so cache breakpoints fall on repetitive segments, how compression algorithms can trim KV cache without degrading retrieval quality, and how routing strategies can coalesce identical prefixes across replicas. It isn’t a marketing term—it’s a structural new dimension that naturally emerges as agent inference costs come into focus.