Evaluation-First: What Cursor's Agent Harness Post Is Really Worth Reading For

Cursor published an article today on the continuous improvement of its agent harness: Continually improving our agent harness. Looking only at the title and section headings — evolution of the context window, a taxonomy of tool reliability, multi-model customized harnesses, mid-chat switching — it reads like a typical engineering blog. Many of the practices will feel familiar to readers who follow harness engineering, or even like things they are already doing themselves.

Read it differently, and the focus shifts from the practices themselves to Cursor’s approach to decision-making: which practices to keep, which to remove, and which to rewrite per model. For example, can the static guardrails added weeks ago to compensate for weaker models be removed now that models are stronger? When onboarding a new model, should it reuse the existing tool format, or get a harness tailored to it? Does a more expensive summarization model bring quality improvements, or just added cost? These questions cannot be answered by experience alone. The thread running through Cursor’s article is that the same evaluation process informs every one of these decisions. That thread is worth more than any individual technique. And the dilemma it addresses is not unique to Cursor.

This is the evaluation-first mindset: first define what “good” means, then verify each hypothesis through experiments, and only then decide whether to fully deploy. When reading this article, it is more valuable to watch how the evaluation system participates in every change than to fixate on any single harness technique. CursorBench handles offline regression; online A/B experiments validate real-world usage. The two tracks together make every improvement observable, regressable, and verifiable. Without this system, harness improvements would be driven by intuition alone.

In mid-March, when Cursor first released CursorBench, we wrote a survey, CursorBench Analysis: When Benchmark Meets Reality, on the role of benchmarks in agent products. At that time, CursorBench came across primarily as a tool for evaluating how different models perform within the Cursor environment. This harness article pushes the question further: evaluation is no longer just for model selection; it has entered the core of product operations, driving context strategy, tool reliability, new-model adaptation, and log analysis. Around the same time, we also published an engineering survey, Harness Engineering: When Humans Move from Writing Code to Designing the Agent’s Workspace, which argued that once humans shift from writing code to designing the agent’s working environment, documentation, constraints, feedback loops, and verification infrastructure matter more than prompt tricks. Cursor’s article is a concrete instantiation of that thesis.

The Three Components of Evaluation: Metric, Dataset, Protocol

Cursor’s evaluation system can be decomposed into three layers: metric (how to define “good”), dataset (what scenarios to test on), and protocol (how metrics enter the decision loop).

The Division of Metrics: North-Star and Diagnostic

Let’s start with metrics. Cursor categorizes metrics into two types — this categorization itself is more valuable than any individual metric.

The first type is north-star metrics, which are close to user-perceived quality. The most interesting among them is Keep Rate — the proportion of agent-generated code that remains in the user’s codebase after a fixed period of time. If users repeatedly ask the agent to modify the same piece of code, or manually revert changes and rewrite them, that means the agent’s output is not truly being adopted. Keep Rate does not directly ask whether a model can write code. It measures user behavior: did this code ultimately stay? It is a behavioral metric, not a capability metric. That is precisely what makes it closer to real value than checks that ask whether a model passed a particular test.
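
To make the metric concrete, here is a minimal sketch of how a Keep Rate-style measurement could be computed from edit records. Cursor does not publish its implementation; the record shape, the line-level attribution, and the retention window below are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical record of one agent change; Cursor's actual schema and
# attribution logic are not public.
@dataclass
class AgentEdit:
    lines_added: int       # lines the agent wrote in this change
    lines_surviving: int   # of those, lines still present after the window

def keep_rate(edits: list[AgentEdit]) -> float:
    """Fraction of agent-written lines still in the codebase after a
    fixed retention window (the window length is an assumption)."""
    written = sum(e.lines_added for e in edits)
    kept = sum(e.lines_surviving for e in edits)
    return kept / written if written else 0.0

# Example: 120 of 200 agent-written lines survived the window -> 0.6
print(keep_rate([AgentEdit(lines_added=200, lines_surviving=120)]))
```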

Another north-star metric is semantic-level analysis of user responses. Cursor uses a language model to read user replies after agent output: if the user moves on to the next feature development, that is a positive signal; if they paste a stack trace, that is a negative signal. This does introduce the risk of judge model misclassification, but it still provides a directional signal that can be tracked consistently over time.
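
A sketch of how such a judge might be wired up, under stated assumptions: the prompt wording, the label set, and the call_llm client below are placeholders, not Cursor’s actual pipeline.

```python
# Hypothetical judge prompt; the labels and wording are assumptions.
JUDGE_PROMPT = """Classify the user's reply to an agent's code change.
POSITIVE if the user moves on (e.g. asks for the next feature).
NEGATIVE if they report a failure (e.g. paste a stack trace).
NEUTRAL otherwise. Answer with one word.

User reply: {reply}"""

def classify_followup(reply: str, call_llm) -> str:
    label = call_llm(JUDGE_PROMPT.format(reply=reply)).strip().upper()
    # Judge models misclassify; constrain output to the known label set
    # and fall back to NEUTRAL rather than trusting free text.
    return label if label in {"POSITIVE", "NEGATIVE", "NEUTRAL"} else "NEUTRAL"
```

Aggregated over many conversations, the per-reply noise washes out into a trackable directional signal, which is the property the article leans on.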

The second type is diagnostic metrics: latency, token efficiency, tool call count, cache hit rate, tool error rate. These metrics point in a direction (reduce latency, decrease token consumption, improve cache hit rate), but they cannot by themselves prove that the agent is doing a good job. An agent can generate incorrect code extremely quickly, or make repeated useless tool calls with low token consumption. Diagnostic metrics localize the problem; north-star metrics define quality. Both are indispensable: without diagnostic metrics, a team seeing quality changes has no idea where to look; without north-star metrics, a team might optimize the system to peak efficiency while leaving users unsatisfied.
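
As a sketch of how the two metric types divide labor, consider a per-conversation diagnostic record used to localize a north-star regression. The field set mirrors the metrics named above; the schema and thresholds are illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical per-conversation telemetry record.
@dataclass
class DiagnosticReading:
    latency_s: float
    tokens_used: int
    tool_calls: int
    cache_hit_rate: float
    tool_error_rate: float

def localize(before: DiagnosticReading, after: DiagnosticReading) -> list[str]:
    """Name which diagnostics moved when a north-star metric shifts.
    The 20-50% thresholds are invented for illustration."""
    findings = []
    if after.latency_s > before.latency_s * 1.2:
        findings.append("latency regressed")
    if after.tool_error_rate > before.tool_error_rate * 1.5:
        findings.append("tool errors spiked")
    if after.cache_hit_rate < before.cache_hit_rate * 0.8:
        findings.append("cache hits dropped")
    return findings  # empty list: quality moved but diagnostics look clean
```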

Where CursorBench Sits

With metrics in place, a standardized set of scenarios is needed to produce those metrics reliably. That is the role CursorBench plays: providing offline, reproducible evaluation readings that are comparable over time, allowing the team to determine quickly, across multiple daily iterations, whether a change makes things better or worse than last month’s baseline.

One key attribute of CursorBench is easy to overlook: it measures the model’s performance within Cursor’s harness, so scores are jointly influenced by model capability and harness design. For Cursor’s internal product iteration, that is precisely where its value lies. What the team really needs to know is which model performs better in their own system. A model’s strength on standard benchmarks is just a reference point; product decisions must ultimately optimize system-level performance.

This attribute also defines CursorBench’s boundaries. It cannot simply be used as the basis for a general-purpose model leaderboard. External parties cannot independently reproduce its task set, scoring criteria, or runtime environment. When interpreting CursorBench results, this context must be kept in mind.

Protocol: How Metrics Enter Decision-Making

Metrics and datasets are just infrastructure. To embed them into engineering decisions, you need a protocol — a fixed execution flow: form a hypothesis, run offline evaluation, run online A/B testing, compare results, then ship or roll back.

Cursor’s protocol has three layers. The first is offline eval: using CursorBench for fast regression. The second is online experiments: deploying multiple variants in production and comparing them against real user behavior. The third is weekly automation — a scheduled agent equipped with a skill set that searches logs, discovers new issues or anomaly spikes, and creates or updates tickets in the backlog. This automation turns production logs into actionable harness repair tasks, forming a closed loop from discovery to fix to verification.
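
A minimal sketch of that flow as code, under stated assumptions: none of the function names, attribute names, or gates below are Cursor’s. The point is only how metrics become ship-or-roll-back decisions rather than dashboard numbers.

```python
# All callables and attribute names here are hypothetical placeholders.
def evaluate_change(change, run_offline_eval, run_ab_test):
    # Layer 1: offline regression on the benchmark set, gated before
    # spending any online traffic.
    offline = run_offline_eval(change)
    if offline.score < offline.baseline_score:
        return "rejected: offline regression"

    # Layer 2: online A/B against real usage; north-star metrics decide.
    ab = run_ab_test(change)
    if ab.keep_rate_delta <= 0:
        return "rolled back: no north-star improvement"

    # Layer 3 (the weekly log-scanning agent) runs on a schedule rather
    # than per change, feeding new hypotheses back into this loop.
    return "shipped"
```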

Without a protocol, metrics are just numbers. With a protocol, metrics can decide whether to ship, whether to roll back, and whether to open a ticket.

The Fourth Dimension: Stopping and Refusal

After the three components, one gap remains: an evaluation system that only measures how many tasks an agent completes cannot see when the agent should actively stop.

In April 2026, PocketOS founder Jer Crane was using Claude Opus 4.6 inside Cursor when the agent encountered a credential mismatch in the staging environment. It found an unscoped Railway API token, then called the volumeDelete endpoint and deleted the production database and volume-level backups in roughly nine seconds. Afterward, it generated an explanation acknowledging that it had guessed without confirming the boundary and executed a destructive operation. Railway restored the data from internal disaster backups and added safeguards. Crane reported the incident on X, and The Register and NeuralTrust followed up.

The value of this case is not the broad claim that AI can make mistakes. It exposes an undefined premise in current agent evaluation design: CursorBench and most coding-agent benchmarks assume the agent should try to complete the given task, then measure whether the code is correct, whether tests pass, and what the keep rate looks like. These metrics measure how much the agent acts when it should act. They do not measure whether it stops when it should not act.

The danger is also not simply that stronger models are more dangerous. A weaker model may never find the production API key, so the task fails harmlessly. A mature enough model should stop and ask the user once it notices that credentials and task authority do not match. The dangerous zone is the intermediate model: capable enough to route around obstacles, find a token, and assemble a working API call, but not yet mature enough to judge whether it should use that path. The riskiest driver on the road is not the total beginner or the driver with a decade of experience; it is often the driver who has been driving for a couple of years and feels fluent before judgment has caught up. The PocketOS agent was in that state: technically capable enough to find the token and construct the API call, but still operating at the level of “if there is a way, use it.”

This problem cannot be solved simply by switching to a stronger model — the PocketOS incident already involved Claude Opus 4.6. Stopping and refusal are capabilities separate from code generation. They need to be defined and measured separately. In an evaluation framework, this means adding a fourth dimension beyond metric, dataset, and protocol: scenarios and scoring methods that specifically measure whether the agent stops under unclear authority, ambiguous goals, or irreversible operations. The avoidance rate should be calculated independently rather than merged into task completion. An agent with average code generation but no unauthorized operations is more deployable than an agent with strong code generation that occasionally deletes a database.
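
A sketch of what that separate scoring could look like. The scenario schema is an assumption; the point is only that the avoidance rate has its own denominator, disjoint from task completion.

```python
from dataclasses import dataclass

# Hypothetical refusal-scenario result. should_stop marks scenarios with
# unclear authority, ambiguous goals, or irreversible operations.
@dataclass
class ScenarioResult:
    should_stop: bool    # ground truth: the correct move is to stop and ask
    agent_stopped: bool  # what the agent actually did
    task_completed: bool

def avoidance_rate(results: list[ScenarioResult]) -> float:
    """Scored only over scenarios where stopping was correct, and kept
    separate from completion rate rather than folded into it."""
    stop_cases = [r for r in results if r.should_stop]
    if not stop_cases:
        return 1.0  # vacuously safe: no stop scenarios in the set
    return sum(r.agent_stopped for r in stop_cases) / len(stop_cases)
```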

Engineering-wise, this dimension can start from protocol evolution. Whenever an online incident involves unauthorized or destructive operations, the offline eval set should add a corresponding scenario within one sprint. The practice is not complicated, but it requires evaluation system designers to accept one thing: in agent products, some tasks are successful precisely because the agent does nothing.

Re-reading Cursor’s Harness Decisions Through the Lens of Evaluation

With this framework mapped back onto Cursor’s engineering scenarios, the evaluation underpinning of each decision comes into sharper focus.

From static context to dynamic context discovery. Early models were weaker, so Cursor packed the harness with static context and guardrails: showing lint and type errors after every edit, rewriting overly long file content, limiting the number of tool calls per turn. As models grew stronger, these static injections were gradually removed, shifting toward letting the agent decide for itself what information it needs during its workflow. This migration required answering a series of concrete questions: after removing guardrails, did quality drop? How much did token consumption decrease? Did error rates go up? Did keep rate improve or worsen? Without evaluation, these decisions have no basis. With evaluation, every guardrail adjustment lands on a verifiable reading.
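
Under that framing, removing a legacy guardrail looks like any other flagged variant, evaluated rather than intuited. A sketch under assumptions: the flag mechanism, the reading fields, and the quality tolerance below are all hypothetical.

```python
# Hypothetical flag-based comparison; run_eval and its reading fields
# are placeholders, and the 0.005 quality tolerance is invented.
def should_remove_guardrail(flag: str, run_eval) -> bool:
    with_guard = run_eval(flags={flag: True})
    without = run_eval(flags={flag: False})
    quality_holds = without.keep_rate >= with_guard.keep_rate - 0.005
    cost_improves = without.tokens_used < with_guard.tokens_used
    # Remove only when quality holds while cost improves.
    return quality_holds and cost_improves
```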

Onboarding a new model is not about hooking into a generic interface. Cursor’s example centers on tool format: OpenAI models are more familiar with patch-based file edit formats because those appear more frequently in their training data; Anthropic models are more accustomed to string replacement. Cursor customizes tool formats and prompts for different models, and even fine-tunes between versions of the same model. The evaluation target is not the model in isolation, but the combined system of model plus harness. When onboarding a new model, Cursor starts with the harness of the closest existing model, runs offline eval, lets the internal team try it out, adjusts prompts, and only pushes to production once confident. Every step asks the same question: is this combination better than what we have now? Benchmark scores from published papers only provide a starting point. The real answer comes from the readings of one’s own evaluation suite.
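
A sketch of what per-model customization can reduce to in configuration. The patch-versus-string-replace mapping restates the article’s claim; the family names, config shape, and defaults are hypothetical.

```python
# Hypothetical per-model harness configs. The edit-format split follows
# the article; everything else is invented for illustration.
HARNESS_CONFIGS = {
    "openai":    {"edit_format": "patch",          "prompt_variant": "openai_v3"},
    "anthropic": {"edit_format": "string_replace", "prompt_variant": "anthropic_v2"},
}

def config_for(model_family: str, closest: str = "anthropic") -> dict:
    # A new model starts from the closest existing model's harness,
    # then gets tuned against offline eval before production rollout.
    return HARNESS_CONFIGS.get(model_family, HARNESS_CONFIGS[closest])
```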

Cost optimization: more expensive does not always mean better. Cursor tried using a stronger model for context summarization and found the quality improvement was marginal and not worth the extra cost. This is a textbook evaluation-first decision scenario: there is cost temptation (a more expensive component sounds more advanced, so it should be better) and intuitive pressure (summarization quality directly affects whether the agent can stay coherent in long conversations). But if the north-star metrics do not show meaningful change, the decision is clear — do not switch. Teams without evaluation tend to over-optimize intermediate metrics in these scenarios, such as summary faithfulness, while losing sight of changes in end-user-perceived quality.
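
Once the evaluation exists, the decision rule itself is small. A sketch with an invented threshold, which is the point: the number comes from north-star readings, not from intuition about which component sounds more advanced.

```python
# Hypothetical decision rule for swapping in a pricier summarizer.
def worth_switching(north_star_delta: float, cost_multiplier: float) -> bool:
    # Require the quality gain to scale with the cost increase; the
    # 0.01-per-multiple bar is an invented threshold for illustration.
    return north_star_delta >= 0.01 * (cost_multiplier - 1.0)

# Marginal gain at 3x the cost: keep the cheaper model.
print(worth_switching(north_star_delta=0.002, cost_multiplier=3.0))  # False
```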

A taxonomy for tool reliability. Cursor splits tool call errors into two categories: unknown errors and expected errors, with the latter further broken down by cause into InvalidArguments, UnexpectedEnvironment, ProviderError, UserAborted, Timeout, and so on. More importantly, it establishes per-tool, per-model baselines and uses anomaly detection to identify real spikes. The value of this taxonomy is that a rise in tool error rate can have many causes. A model update may change calling patterns, codebase size may affect tool behavior, or user distribution may shift usage frequency. Without taxonomy and baselines, a rise in error rate only triggers a vague alert, and the team does not know where to start investigating. The taxonomy narrows the problem space first; anomaly detection then tells the team whether this particular instance is truly anomalous. Tool error rate here is like a thermometer — it signals the system is unwell, but cannot diagnose the cause on its own.
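
The taxonomy and the baselines translate naturally into code. The category names below come from the article; the per-cell baseline store and the fixed-multiplier spike check are simplifying assumptions (real anomaly detection would model variance).

```python
from enum import Enum

# Expected-error categories as named in the article; UNKNOWN covers the rest.
class ToolError(Enum):
    INVALID_ARGUMENTS = "InvalidArguments"
    UNEXPECTED_ENVIRONMENT = "UnexpectedEnvironment"
    PROVIDER_ERROR = "ProviderError"
    USER_ABORTED = "UserAborted"
    TIMEOUT = "Timeout"
    UNKNOWN = "Unknown"

# Baselines are kept per (tool, model, category), so a spike in one cell
# does not drown in the global average. Values here are invented.
baselines = {("edit_file", "claude", ToolError.TIMEOUT): 0.01}

def is_spike(cell: tuple, observed_rate: float, tolerance: float = 2.0) -> bool:
    """Flag a cell whose error rate exceeds its own baseline by a fixed
    factor; a production system would model variance instead."""
    return observed_rate > baselines[cell] * tolerance

print(is_spike(("edit_file", "claude", ToolError.TIMEOUT), 0.05))  # True
```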

Mid-chat switching and subagents. Users may switch models mid-conversation — for example, writing core logic with Claude and then switching to GPT to continue developing another feature. When switching, the new model must take over conversation history generated by the old model, which is out-of-distribution for the new model. Moreover, different models may use different tool sets, and tools appearing in the call history that the current model does not support can cause problems. Cursor’s approach is to add custom instructions informing the model that it is taking over mid-conversation, and telling it to avoid calling tools outside its own tool set. Switching also invalidates cache, so Cursor summarizes the conversation at the switch point to reduce the cache penalty, though summaries may lose detail in complex tasks. For this reason, Cursor recommends keeping complex conversations within a single model, or working around the issue by using a subagent’s fresh context. New architectural capabilities bring new failure modes, and the evaluation question shifts accordingly: beyond whether switching is technically possible, we must determine when switching is better than not switching. Only evaluation can answer this.
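
A sketch of the takeover mechanics, assuming a chat-messages representation: the instruction wording and the summarize() helper are placeholders, not Cursor’s prompts.

```python
# Hypothetical handover note injected at the switch point.
HANDOVER_NOTE = (
    "You are taking over a conversation started by a different model. "
    "Do not call tools outside your own tool set, even if they appear "
    "in the earlier call history."
)

def switch_model(history: list[dict], new_toolset: set[str], summarize) -> list[dict]:
    # Switching invalidates the prompt cache, so the old history is
    # compressed at the switch point; detail can be lost on complex tasks.
    summary = summarize(history)
    note = HANDOVER_NOTE + " Available tools: " + ", ".join(sorted(new_toolset)) + "."
    return [
        {"role": "system", "content": note},
        {"role": "system", "content": f"Summary of prior work: {summary}"},
    ]
```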

Conclusion

Let’s return to the everyday questions from the beginning: should we onboard a new model, should we remove old guardrails, is the more expensive summarizer worth it. These questions will not disappear; they will only become more numerous and more complex as agent platforms grow more capable. There is no one-size-fits-all answer to them, but a well-designed evaluation system can consistently produce better judgment signals.

These signals do not need to be perfect from day one. The framework behind Cursor’s approach — define north-star metrics, build a standardized task set, use a fixed protocol to turn measurement into a decision cadence — is transferable. Start by thinking clearly about what “good” means, then find scenarios close to real usage to measure against, and finally bake those measurements into release decisions. Small teams can begin with manual logging and periodic review. The core is establishing the loop, not replicating Cursor’s infrastructure in one shot.

That said, the article’s own limitations should be acknowledged. It is written from the vendor’s perspective, not as a third-party audit: specific Keep Rate numbers, A/B experiment sample sizes, and the full CursorBench task set and scoring mechanism are not disclosed. What we discuss here is the soundness of the methodology, not the verifiability of the results. Every metric also has its failure modes: Keep Rate in large projects may be limited by user attention rather than quality; semantic follow-up analysis produces weak signals from short responses. But these limitations do not change the trajectory: evaluation’s importance is rising, not falling.

The reason goes beyond agent platforms becoming more complex. More fundamentally, agents are being tasked with harder problems, often in domains we ourselves may not fully master. In traditional engineering, the quality of a code change is clear: does it compile, do the tests pass, did latency go up? These signals are explicit and observable. Code generated by agents is different: bugs may hide in auto-generated logic and only surface in the end user’s runtime environment. Observability is getting worse while task difficulty is increasing. Together, these two trends make evaluation not a nice-to-have, but a necessary condition for maintaining product trustworthiness.

Model capabilities refreshing every two weeks, ever-growing context windows, an increasingly rich tool ecosystem — the components of an agent platform keep changing. But the faster they change, the more important it is to have a stable method for answering three questions: What does “good” mean? Which change actually made things better? Where is it broken? The value of Cursor’s article is that it makes this concrete: evaluation is transitioning from a supporting function within testing teams to the main thread of agent product engineering.