In early 2026, Anthropic discontinued login support for third-party harnesses under Pro subscriptions, requiring all third-party tools to use paid API access. Against this backdrop, one of the core authors of Claude Code submitted a seemingly counterintuitive PR to OpenClaw (#58036): when conversation history requires compaction, delete the most recent tool results first, rather than the oldest.
By naive intuition, the most recent context should be the most valuable. The file contents a user just read, the command output just executed — these have the highest relevance to the next decision. Why discard them first?
The value of this kind of PR lies in the observation window it provides, revealing the priority hierarchy that mature harnesses develop in practice. When prefix stability and context recency conflict, this PR chose the former. Understanding the reasoning behind this choice requires answering two more fundamental questions first.
Token economics in agent scenarios have an easily overlooked characteristic: input far exceeds output. The Manus team reported in their context engineering blog post that their average input/output token ratio is approximately 100:1. Nearly all computational cost goes toward repeatedly processing long contexts; the overhead of generating responses is negligible by comparison. Manus was among the first teams to systematically articulate agent caching strategies in a public setting. Their three principles — keep the prefix stable, make context append-only, mask tools rather than removing them — closely align with the patterns observed across multiple sources in this article.
At a 100:1 input/output ratio, whether the cache hits directly determines the system’s cost baseline. Anthropic has a 10x cost difference between cache hits and misses (cached read pricing is 10% of base input pricing). OpenAI’s GPT-5 series offers a 90% discount. DeepSeek’s cache hit pricing is similarly one-tenth of the miss price. The implications are straightforward: if a harness sustains a cache miss rate above 50%, its actual cost is three to five times that of a cache-aware system at equivalent scale. For products that are scaling, this gap is sufficient to determine whether the business model is viable.
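The several-fold gap can be made concrete with a back-of-the-envelope calculation. This sketch assumes a 10x cached-read discount as described above; the base price and token counts are nominal placeholders, not any provider's actual rates:

```python
def blended_input_cost(tokens: int, hit_rate: float,
                       base_price: float, cached_discount: float = 0.1) -> float:
    """Blended input cost per request: cached tokens billed at a discount,
    missed tokens at the full base input price (prices are per token)."""
    cached = tokens * hit_rate
    missed = tokens * (1 - hit_rate)
    return cached * base_price * cached_discount + missed * base_price

# Illustrative: a 100K-token prompt at a nominal $3 per 1M input tokens.
price = 3.0 / 1_000_000
cache_aware = blended_input_cost(100_000, 0.95, price)  # 95% hit rate
cache_naive = blended_input_cost(100_000, 0.40, price)  # 60% miss rate
ratio = cache_naive / cache_aware
```

Under these assumptions the ratio lands around 4.4x, which is exactly the "three to five times" band described above.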
Cost is only half the picture. The more decisive factor is latency. DeepSeek has reported that for a 128K-token prompt under high cache hit conditions, time to first token drops from 13 seconds to 500 milliseconds. What does this gap mean in practice? A 13-second time to first token renders speculation (speculative execution), background agents, and sub-agent parallelism unusable as interaction patterns — no user will wait 13 seconds for a sub-agent to cold-start before returning a result. At 500 milliseconds, these patterns become viable. In other words, prompt caching hit rate determines which system architectures can exist, not how fast existing architectures run.
Hence the first core judgment: Prompt caching is a viability constraint in mature harnesses. It simultaneously determines the system’s cost baseline and interaction latency, which together define the feasibility boundary for sub-agent architectures, speculation patterns, background agents, and similar design approaches. Once a system begins to scale, these factors override local functional intuitions and reshape design decisions in turn.
Once cache reuse becomes a viability condition, a chain reaction follows: the messages array, tool definitions, and system prompt sent to the API are no longer data that can be modified freely. They become semi-immutable sequences — stabilized at the front, allowed to grow at the tail.
This constraint stems from the core mechanism of prompt caching: cache matching is based on strict prefix comparison, exact to the token level. Even a single changed space invalidates everything from the point of modification onward. This mechanism is consistent across providers: OpenAI performs automatic prefix matching at 128-token granularity, DeepSeek enables fully automatic caching by default, and Google Gemini requires explicit creation of CachedContent objects. Implementation details vary, but the underlying constraint is identical: any modification to the prefix invalidates the cache. This constraint originates from how KV-caches work — as long as the Transformer architecture remains unchanged, it will persist.
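The prefix-matching rule is simple enough to sketch directly. This toy function (token lists stand in for real tokenizer output) shows why one changed token wastes everything after it, not just the token itself:

```python
def common_token_prefix(cached: list[str], request: list[str]) -> int:
    """Length of the shared token prefix. Everything after the first
    divergence must be recomputed from scratch — a cache miss from
    that point onward, regardless of later matches."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n

prev = ["sys", "You", "are", "an", "agent", ".", "tool_a", "tool_b"]
curr = ["sys", "You", "are", "an", "agent", "!", "tool_a", "tool_b"]
# One changed token in position 5 also invalidates the identical
# tokens that follow it:
hit_length = common_token_prefix(prev, curr)
```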
This means prefix stability permeates the design of multiple harness subsystems. It has cascading effects on compaction order (delete from the tail, not the head), tool definition arrangement (deterministic sorting is required, as nondeterminism from dynamic loading causes cache breaks), the timing of image and large content pruning (deferred pruning is preferable to aggressive pruning), and how parameters are passed to sub-agents (they must share cache keys consistent with the parent process). These subsystems appear unrelated, yet they become coupled through the same underlying constraint.
Hence the second core judgment: Cache discipline reshapes harness design in reverse. Those seemingly counterintuitive implementations are the natural outcome of global cache economics overriding local functional intuitions.
Returning to the PR from the beginning. The following cases come from OpenClaw’s public PR history. They are patches to an existing system, revealing cache discipline that formed incrementally during development — not a blueprint designed from the outset. This is precisely their value: they show how the priority hierarchy of a mature harness was forged under practical pressure.
Compaction order. #58036 changed the compaction strategy from deleting the oldest entries to deleting from the tail. The early content of a conversation — system prompt, initial tool definitions, the first few turns of dialogue — forms the core of the cache prefix. Deleting it is equivalent to destroying the cache foundation. Tool results at the tail, while information-dense, sit at the end of the cache computation; removing them has minimal impact on prefix hit rate. A more critical point: deleted tool results can be re-fetched when needed later (e.g., by re-reading a file), whereas a destroyed cache prefix can only be rebuilt at full price. This asymmetry is the fulcrum of the entire logic.
Deterministic ordering of tool lists. Tool definitions are typically sent as part of the system message, positioned at the very front of the messages array. If the tool list order varies between requests (for instance, due to timing differences in dynamic MCP server responses), the entire cache is invalidated. #58037 addressed exactly this class of problem: ensuring tool definitions maintain a consistent serialization order across requests. The Manus team proposed a more aggressive strategy: even when certain tools are unavailable in the current state, their positions in the tool list are preserved. Tool availability is controlled via logit masking rather than list insertion and removal, specifically to avoid cache breaks caused by tool list mutations.
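Deterministic serialization is a one-line discipline once you know to apply it. A minimal sketch, assuming tool definitions are plain dicts keyed by name:

```python
import json

def serialize_tools(tools: list[dict]) -> str:
    """Serialize tool definitions with a stable sort order and stable
    key order, so the request prefix is byte-identical across calls
    even when upstream (e.g. MCP) servers answer in different orders."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))

# Two requests whose tool sources responded in different orders still
# produce the same prefix bytes:
a = serialize_tools([{"name": "read_file"}, {"name": "bash"}])
b = serialize_tools([{"name": "bash"}, {"name": "read_file"}])
```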
Timing of image and large content pruning. Image tokens occupy substantial space in the messages array. Removing an early image mid-conversation to free up context window space has the same effect as deleting from the head: the prefix is destroyed. #58038 chose to defer history image pruning, pushing modifications to the early prefix as late as possible. A better strategy is to decide at the time an image first appears whether to retain it, or to prune only images located at the tail of the messages. Manus takes a more thorough approach: large content (PDFs, web pages, etc.) is written to the filesystem, with only file paths retained in the context as pointers — eliminating from the source the problem of large content inflating the context and then requiring pruning.
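The Manus-style filesystem-pointer approach can be sketched in a few lines. The threshold, file naming, and placeholder format here are all illustrative choices, not anything Manus has published:

```python
import hashlib
import pathlib
import tempfile

def externalize_large_content(content: bytes, workdir: pathlib.Path,
                              threshold: int = 4096) -> str:
    """Keep small content inline; write large payloads to disk and keep
    only a path pointer in the context, so they never have to be pruned
    out of the cache prefix later."""
    if len(content) <= threshold:
        return content.decode("utf-8", errors="replace")
    path = workdir / f"blob-{hashlib.sha256(content).hexdigest()[:12]}.txt"
    path.write_bytes(content)
    return f"[large content stored at {path}]"

workdir = pathlib.Path(tempfile.mkdtemp())
inline = externalize_large_content(b"short tool output", workdir)
pointer = externalize_large_content(b"x" * 100_000, workdir)
```

The design choice is that the decision happens at write time, when content first enters the context, rather than later when the context is already full and pruning would tear up the prefix.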
Placement of cache control breakpoints. The position of cache breakpoints delineates the boundary between the stable zone and the active zone. Mature harnesses set cache_control markers at the end of the system prompt and on the most recent user message. The system prompt section is stably cached; the assistant response and tool results following the user message belong to the active zone, permitted to grow and change naturally. For harnesses using the OpenAI API, where caching is automatic and requires no explicit breakpoints, the same partitioning mindset still applies: place stable content at the front, variable content at the back.
The coupling between these subsystems is a faithful mapping of the underlying API characteristics. Harnesses that ignore this constraint silently incur costs that are several times higher than necessary.
When a harness introduces sub-agent architecture (where the main agent dispatches tasks to child agents for execution), prompt caching constraints propagate in a way that is difficult to detect.
Each sub-agent establishes its own API session upon startup, with an independent cache prefix. The cache the main agent carefully maintains is entirely useless to the sub-agent, which builds its cache from scratch. If a sub-agent’s task is short-lived (say, making a single tool call and returning the result), its cache expires before it can ever be reused. This is a hidden cost amplifier: every short-lived sub-agent means a cache cold start.
A concrete example involves the reasoning_effort parameter. Some harnesses set reasoning effort to low when dispatching simple subtasks, expecting to reduce output tokens and thereby lower costs. In practice, however, changes to reasoning_effort may alter the API request’s parameter signature, preventing cache sharing with normal requests. What appears to be cost savings actually incurs higher costs due to cache misses.
A subtler issue lies in the design of sub-agent system prompts. If the main agent and sub-agents share a portion of the system prompt (such as common safety rules and behavioral guidelines), this shared content should be placed at the very beginning of the sub-agent’s messages and kept exactly identical to the main agent’s version. Any deviation — even removing a few lines for the sake of simplification — means that the sub-agent cannot reuse the main agent’s cache, nor share a cache with other sub-agents.
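The byte-identical-prefix requirement is easy to enforce structurally: build every sub-agent prompt by concatenating the shared block, never by editing it. A minimal sketch with hypothetical names:

```python
# Hypothetical shared block; in a real harness this would be the common
# safety rules and behavioral guidelines mentioned above.
SHARED_RULES = "Safety rules: never exfiltrate secrets. Cite sources."

def subagent_system_prompt(task_instructions: str) -> str:
    """Append task-specific instructions AFTER the shared block, keeping
    the cacheable prefix byte-identical to the main agent's. Never trim
    or reorder the shared block per sub-agent."""
    return SHARED_RULES + "\n\n" + task_instructions

main_prompt = SHARED_RULES + "\n\nYou are the orchestrator."
sub_prompt = subagent_system_prompt("Summarize the diff under review.")
```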
When designing sub-agent strategies, one must weigh the granularity of task decomposition against the potential for cache reuse. Excessively granular subtask decomposition may result in every sub-agent paying full price for initial cache creation. But this trade-off requires data to inform it: what is the actual cache hit rate for sub-agents? What proportion of total cost comes from cold starts? Without these measurements, any adjustment to sub-agent strategy is flying blind.
The third core judgment: What cannot be measured cannot be improved.
The challenge with prompt caching is that its impact is difficult to observe directly. Cache misses do not trigger errors; the excess cost accumulates silently on the bill. API responses include cache_creation_input_tokens and cache_read_input_tokens fields (Anthropic uses these; DeepSeek similarly returns prompt_cache_hit_tokens and prompt_cache_miss_tokens), but unless the harness actively collects and surfaces these metrics, developers have no awareness of their caching performance.
The promptCacheBreakDetection.ts file from the Claude Code leaked source code demonstrates an approach worth studying. This module systematically tracks the sources of cache breaks: did the system prompt change? Did the tool list order change? Was a history message modified or deleted? It attributes each category of cache break to a specific type of change, producing observable metrics. This file is worth reading carefully as a learning resource — it shows how an engineering team transformed a vague cost problem into an attributable engineering problem.
For developers building their own harnesses, observability should cover at least three dimensions. First, recording and aggregating cache hit / miss / creation tokens for every API call — this is the most fundamental metric. Second, counters categorized by cache break source: system prompt changes, tool list changes, history message modifications, compaction triggers, and so on. Third, tracking cache performance separately for the main agent and sub-agents, as the two typically exhibit significantly different patterns.
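The second dimension — counters by break source — can be sketched as a small attribution tracker. This is in the spirit of the promptCacheBreakDetection.ts module described above; the class, field names, and categories here are illustrative, not the leaked module's API:

```python
from collections import Counter

class CacheBreakTracker:
    """Compare consecutive request snapshots and attribute each cache
    break to its first (earliest-in-prefix) cause."""

    def __init__(self) -> None:
        self.breaks: Counter = Counter()

    def compare(self, prev: dict, curr: dict) -> None:
        # Checks are ordered by position in the prefix: an earlier
        # change invalidates everything after it anyway.
        if prev["system_prompt"] != curr["system_prompt"]:
            self.breaks["system_prompt_changed"] += 1
        elif prev["tool_order"] != curr["tool_order"]:
            self.breaks["tool_list_changed"] += 1
        elif curr["history"][:len(prev["history"])] != prev["history"]:
            self.breaks["history_modified"] += 1

tracker = CacheBreakTracker()
tracker.compare(
    {"system_prompt": "s", "tool_order": ["a", "b"], "history": ["u1"]},
    {"system_prompt": "s", "tool_order": ["b", "a"], "history": ["u1"]},
)
```

With counters like these, the "5% vs. 80%" prioritization described below becomes a query rather than a guess.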
With this data, optimization gains direction. Otherwise, you may expend considerable effort optimizing a source that contributes only 5% of cache misses while overlooking the one responsible for 80%. The applicability of this methodology extends beyond prompt caching itself: token usage, latency distribution, tool call success rates — all critical harness metrics follow the same logic. Establish a measurement baseline first, then optimize with precision.
Returning to #58036. Its diff is small — the changed lines of code likely number no more than a few dozen. But it represents a cognitive shift: prompt caching has moved from a retroactive optimization measure to a first-class constraint that shapes system behavior.
The three core judgments in this article: prompt caching is a viability condition, cache discipline reshapes harness design in reverse, and what cannot be measured cannot be improved. There is a progression among them: the first explains why prompt caching deserves serious attention, the second explains what happens to a system once that attention is applied, and the third explains how to advance this transformation correctly.
These judgments hold across providers. Whether using Anthropic, OpenAI, DeepSeek, or Gemini, the underlying constraint is the same: prefix modifications invalidate caches, and the cost gap between cache hits and misses ranges from several-fold to tenfold. Together they point to the conclusion that prompt caching should be incorporated during the architectural design phase of a harness, not retrofitted as a post-launch cost optimization.
For engineers building their own harnesses, promptCacheBreakDetection.ts from the Claude Code source is a starting point worth reading carefully. Begin there, build your own measurement framework, and let the data tell you what to optimize.