Harness Engineering: When Humans Shift from Writing Code to Designing Agent Work Environments

Date: 2026-03-12
Core Sources: OpenAI “Harness engineering” (2026-02-11); Cursor “Towards self-driving codebases” (2026-02-05); Cursor “Scaling long-running autonomous coding” (2026-01-14)
Supplementary Sources: Vercel AGENTS.md empirical study; Anthropic agent autonomy research; OpenAI Codex Skills/Review practices
Related Axioms: A02, A03, A04, A05, A08, A12, T05, M09

1. Overview

In early 2026, OpenAI and Cursor almost simultaneously released reports on their practices in agent-first software development. Ryan Lopopolo of OpenAI described how a three-person team used Codex to generate an internal product of about one million lines of code in five months, with zero lines of manually written code. Wilson Lin of Cursor described building a web browser engine from scratch by running hundreds of agents in parallel, autonomously, for a week.

While these articles come from different companies, products, and tech stacks, they converge on the same conclusion: when AI agents become the primary producers of code, the core work of human engineers undergoes a fundamental shift. This is not a gradual upgrade of “AI-assisted programming” but a paradigm leap. OpenAI has named this new paradigm “harness engineering.”

In an engineering context, “harness” usually refers to a wiring harness or a test fixture. Here, it refers to the entire work environment built around the agent: documentation systems, architectural constraints, feedback loops, verification tools, and observability infrastructure. The output of human engineers is no longer code, but this harness. Code is produced by agents, and humans are responsible for ensuring that agents can produce code reliably.

This survey aims to clarify the core concepts of harness engineering, compare the similarities and differences between the paths taken by OpenAI and Cursor, and map them to our own axioms and practices.

2. OpenAI: An Agent-First Product from Scratch

2.1 Experimental Design

OpenAI’s experiment had an extreme constraint: zero lines of manually written code. From the first commit, all code—including CI configurations, internal tools, design documents, evaluation frameworks, and deployment scripts—was generated by Codex. The team of three engineers eventually expanded to seven, merging about 1,500 PRs in five months, averaging 3.5 per person per day.

This constraint was not to prove that AI can write code, which is already established, but to force a question: when you are forced to influence code quality only by designing the environment, what truly matters?

2.2 Six Core Findings

Finding 1: AGENTS.md should be a directory, not an encyclopedia. Their initial attempt to cram all rules into one large file failed. Context windows are a scarce resource, and too much information is equivalent to no information. The final solution was a roughly 100-line AGENTS.md serving as navigation, pointing to a structured knowledge base under docs/. This knowledge base includes design documents, execution plans, architectural decisions, and quality scores. A later empirical study by Vercel confirmed this: a compressed 8KB AGENTS.md achieved a 100% pass rate in evaluations, while the skills mechanism only reached 79%. Passive context is superior to active retrieval.
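As a concrete illustration, a navigation-style AGENTS.md might look like the sketch below. The paths, section names, and rules are hypothetical, not OpenAI's actual file; the point is the shape: a short map plus a handful of non-negotiables, with everything else living under docs/.

```markdown
# AGENTS.md — navigation, not encyclopedia

Read the relevant docs below before making changes. Keep this file short;
details live in docs/.

## Where to find what
- Architecture and layering rules: docs/architecture.md
- Active execution plans: docs/plans/
- Architectural decision records: docs/decisions/
- Quality scores per domain: docs/quality.md

## Non-negotiables
- Parse, don't validate: convert external data to typed values at boundaries.
- Never add a dependency that points against the layer order in
  docs/architecture.md.
```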

Finding 2: What Codex cannot see does not exist. This is the source of the quote that triggered this survey. Discussions in Google Docs, alignment on Slack, and implicit knowledge in team members’ heads do not exist for the agent. The solution is to push all knowledge into the repository: architectural consensus reached on Slack must become markdown, design decisions must become execution plans, and technical debt must become traceable documentation. This follows the logic of training new employees: someone joining in three months won’t see the Slack discussions from today and can only rely on what is left in the repository.

Finding 3: Agent reviews can replace most human reviews. Humans can still review PRs, but it is no longer required. They had Codex perform a self-review first, then requested reviews from other agents, iterating until all agent reviewers were satisfied. The human role shifted from line-by-line code review to defining review standards and encoding taste preferences in AGENTS.md. A later OpenAI article on scaling code verification mentioned that their system processes over 100,000 external PRs daily, with over 80% of comments receiving positive reactions.

Finding 4: Forced constraints are more effective than micromanaging implementation. They established a strict layered architecture for the agent (Types -> Config -> Repo -> Service -> Runtime -> UI), enforcing dependency directions through custom linters generated by Codex. A key difference was that they required Codex to parse data types at boundaries (“parse, don’t validate”) but did not specify which libraries to use (Codex chose Zod). This “define boundaries, let go of implementation” model allowed the agent to maintain full freedom within constraints.

Finding 5: Throughput changes the merging philosophy. When the PR output of agents far exceeds human attention, traditional blocking merge gates become counterproductive. Waiting is more expensive than correcting errors. They adopted a minimally blocking merge strategy: PR lifecycles are short, and test flakes are handled in subsequent runs rather than blocking progress. The article admits this would be irresponsible in a low-throughput environment but is the right trade-off when agent throughput far exceeds human attention.

Finding 6: Entropy accumulates and requires “garbage collection.” Codex replicates existing patterns in the repository, including suboptimal ones. Initially, the team spent 20% of their time every Friday manually cleaning up “AI slop,” but they soon realized this was unsustainable. They shifted to an automated approach: encoding “golden principles” into the repository and running background Codex tasks periodically to scan for deviations, update quality scores, and open fix PRs. Most of these PRs can be reviewed and auto-merged within a minute. Technical debt is like a high-interest loan; continuous small repayments are more economical than dealing with it all at once after it accumulates.

2.3 Observability as a Lever

A significant investment by OpenAI was making the application itself observable to Codex. They connected to the Chrome DevTools Protocol, allowing Codex to take screenshots, manipulate the DOM, and drive UI flows. They established an independent observability stack (logs, metrics, traces) for each git worktree, which Codex could query using LogQL and PromQL. This meant prompts like “ensure no span in the critical user path exceeds two seconds” became executable.
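A prompt like that bottoms out in a check an agent can actually run. The sketch below is illustrative, not OpenAI's implementation: the span records stand in for the result of a trace query against the per-worktree stack, and the field names and critical-path set are assumptions.

```python
# Turning "no span in the critical user path exceeds two seconds" into an
# executable check. Span names, fields, and the path set are illustrative.
CRITICAL_PATH = {"login", "load_dashboard", "render"}

def slow_spans(spans: list[dict], budget_ms: float = 2000.0) -> list[dict]:
    """Return critical-path spans that exceed the latency budget."""
    return [
        s for s in spans
        if s["name"] in CRITICAL_PATH and s["duration_ms"] > budget_ms
    ]

# Hypothetical query result:
spans = [
    {"name": "login", "duration_ms": 180.0},
    {"name": "load_dashboard", "duration_ms": 2450.0},
    {"name": "background_sync", "duration_ms": 9000.0},  # not on critical path
]
offenders = slow_spans(spans)  # the agent iterates until this list is empty
```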

A single Codex run often lasted more than six hours, typically executing while humans were asleep.

3. Cursor: The Evolution of Multi-Agent Coordination

3.1 Different Sides of the Problem

If OpenAI’s article answers “what humans should do,” Cursor’s two articles answer “how to coordinate a large number of agents to work together.” Cursor set out to answer a basic question: can investing 10x the computation yield 10x the meaningful throughput?

They chose building a web browser engine from scratch (in Rust) as the benchmark task. Hundreds of agents ran in parallel for a week, generating over a million lines of code across more than a thousand files.

3.2 Four Architectural Iterations

The true value of Cursor’s article lies in its honest recording of the failures and lessons from four architectural iterations, rather than just showing the final solution.

First: Equal Self-Coordination (Failure). All agents were equal and coordinated through a shared state file. Each agent checked what others were doing, claimed tasks, and updated status. This is a classic solution in distributed systems but failed quickly in an agent scenario. Agents held locks for too long, forgot to release them, or performed illegal lock operations. Throughput for 20 agents degraded to that of 1-3 agents. A deeper issue was that without a hierarchy, agents became risk-averse, avoiding difficult tasks and making only safe, small changes. No one was responsible for hard problems.

Second: Planner-Executor-Worker-Judge Pipeline (Partial Success). Separating roles brought significant improvement: the Planner created plans, the Executor supervised execution, the Worker focused on specific tasks, and the Judge decided whether to continue. However, this structure was too rigid and was bottlenecked by the slowest Worker. Pre-planning also made it difficult for the system to adjust dynamically when new problems were discovered.

Third: Continuous Executor (Partial Success, then Regression). Removing the independent Planner and having the Executor handle both planning and execution made the system more flexible. However, the Executor began to show pathological behaviors: random sleeping, stopping task generation, writing code itself, and claiming premature completion. The reason was that it was overwhelmed by too many roles (planning, exploration, research, task generation, checking Workers, reviewing code, merging output, and judging completion).

Fourth: Recursive Planner + Independent Worker (Final Solution). The root Planner owns the entire project scope. When it decides its scope can be subdivided, it generates sub-Planners recursively. Workers receive tasks from Planners and work on their own copies of the repository. Upon completion, they write a handoff (including what was done, what was discovered, and any concerns) and submit it to the requesting Planner. Workers are unaware of each other and do not communicate with other Planners.
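The data flowing through this final design can be sketched as two structures: a handoff a Worker returns to the Planner that assigned its task, and a Planner that subdivides its scope into sub-Planners. Field names and the example scopes are illustrative, not Cursor's schema.

```python
# Recursive Planner + independent Worker: the shapes of the messages.
from dataclasses import dataclass, field

@dataclass
class Handoff:
    task: str
    done: str           # what was accomplished
    discovered: str     # anything learned along the way
    concerns: str = ""  # open risks for the Planner to triage

@dataclass
class Planner:
    scope: str
    subplanners: list["Planner"] = field(default_factory=list)
    handoffs: list[Handoff] = field(default_factory=list)

    def subdivide(self, subscopes: list[str]) -> None:
        """Split this Planner's scope into recursively owned sub-scopes."""
        self.subplanners = [Planner(scope=s) for s in subscopes]

    def receive(self, handoff: Handoff) -> None:
        # Workers report only to the Planner that assigned the task;
        # they never talk to each other or to other Planners.
        self.handoffs.append(handoff)

root = Planner(scope="build browser engine")
root.subdivide(["html parsing", "css cascade", "layout"])
root.subplanners[0].receive(
    Handoff(task="tokenizer", done="tokenizer passes spec tests",
            discovered="entity table missing cases", concerns="perf untested")
)
```

The isolation is the point: because each Worker operates on its own repository copy and reports upward only, the coordination surface stays small no matter how many Workers run.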

3.3 Key Insights

Allow a certain error rate in exchange for overall throughput. When they required every commit to be 100% correct, the system fell into severe serialization and slowdown. A small error (an API change, a typo) would stall the entire system as Workers ran outside their scope to fix unrelated things. They found that accepting a small, stable error rate was the better strategy: errors were quickly fixed by other agents, and the system remained in an “imperfect but stable” state. Ultimately, an independent “green branch” may be needed for a final fix pass before release.

Instructions are more important than the harness. A counter-intuitive discovery was that the most important factor in system behavior is not the architectural design but the prompts given to the agents. Vague wording in instructions is amplified infinitely. “Spec implementation” led agents to implement obscure features instead of prioritizing core functionality. “Generate many tasks” produced only a few tasks, while “generate 20-100 tasks” conveyed the true intent. Constraints are more effective than instructions: “no TODOs, no partial implementations” works better than “remember to finish implementations.”

Project architecture affects token throughput. Hundreds of agents compiling simultaneously caused massive disk I/O, becoming a real bottleneck. A later run refactored the repository into multiple independent crates (migrating from a monolith), which significantly reduced compilation wait times and multiplied throughput. This means project structure choices affect not only the human developer experience but also agent efficiency.

Simple systems beat complex systems. They initially tried various solutions from distributed systems and organizational design, but the final system was surprisingly simple. They tried adding an Integrator role for central quality control, but it became a bottleneck as hundreds of Workers had to pass through a single gate to merge code. They eventually removed the Integrator and let the system converge naturally.

4. Cross-Analysis: Convergence and Divergence of the Two Paths

4.1 Common Conclusions

OpenAI and Cursor started from different directions but converged on several common conclusions.

First, the core work of humans is environment design, not code writing. OpenAI expressed this as “designing the environment, specifying intent, and building feedback loops,” while Cursor stated that “architecture and instructions are more important than the harness.” Both found that the human leverage point is not in direct code output but in creating the conditions for agents to work reliably.

Second, knowledge must be versioned, discoverable, and structured. OpenAI emphasized that “what Codex cannot see does not exist,” and Cursor found that “vague wording in instructions is amplified infinitely.” Both solutions involved pushing knowledge into the repository, replacing verbal communication and external tools with markdown and structured documentation.

Third, perfectionism is the enemy of throughput. OpenAI adopted a minimally blocking merge and subsequent fix strategy. Cursor found that requiring 100% correctness stalled the system and that accepting a small, stable error rate was more efficient. Both accepted the trade-off that “correcting errors is cheaper than waiting.”

Fourth, role separation and architectural constraints are prerequisites for scaling. OpenAI enforced constraints through layered architecture and custom linters. Cursor achieved linear scaling through recursive Planner-Worker separation. Both found that without structure, agent groups degrade into a risk-averse, inefficient state.

4.2 Path Differences

The core difference lies in the mode of human participation.

OpenAI’s model is collaboration with continuous human involvement. Three to seven engineers interact with Codex daily, describing tasks through prompts, reviewing (or not reviewing) PRs, and continuously encoding taste and judgment into the repository. Human attention is a scarce resource, but humans remain in the loop. This more closely resembles real-world team usage.

Cursor’s model is autonomous operation after humans set goals. An initial instruction is given (“build a web browser engine”), and the system runs autonomously for a week without human intervention. Human involvement is concentrated before the experiment starts (writing instructions, designing architecture) and after it ends (evaluating results). This more closely resembles a research experiment testing the upper limits of agent autonomy.

Cursor’s experiment also revealed an issue mentioned but not explored in the OpenAI article: the impact of model selection on roles. Cursor found that GPT-5.2 outperformed Opus 4.5 in long-term autonomous operation (the latter tended to stop early or take shortcuts). Different models are suited for different roles: GPT-5.2 is a better Planner, even though GPT-5.1-Codex was specifically trained for coding. This means part of harness engineering is matching different models to different roles.

5. Connection to Our Axiom System

The core concepts of harness engineering have a systematic correspondence with our existing axiom system. This is not a coincidence; our axioms come from independent observations and summaries of the same set of problems in practice.

5.1 A03 (IC -> Manager): The Most Direct Mapping

The core of A03 is that as your scope of responsibility expands, your work shifts from doing things yourself to ensuring others (humans or AI) do them well. Harness engineering is an extreme form of this shift. The engineers on the OpenAI team do not write code at all; their entire work is an AI mapping of the five pillars of management: model selection (hiring), task decomposition and context provision (delegation), documentation and knowledge base (training), methodology guidance (mentoring), and observability and acceptance criteria (acceptance).

The “urge to grab the keyboard” in A03 becomes the “urge to write code manually” in harness engineering. OpenAI treats “no manually-written code” as a core philosophy not because humans can’t write it, but because every time a human writes code directly, they are bypassing the harness and losing an opportunity to improve the environment. This is perfectly consistent with A03’s statement: “When you take over the keyboard, you deprive the AI of the opportunity to learn and improve.”

5.2 A05 (Documentation as Long-Term Memory): Knowledge Versioning in Practice

The core of A05 is that documentation is not just a deliverable but a shared long-term memory system for AI and humans. OpenAI’s practice is an engineering implementation of this concept. They use AGENTS.md as a directory, the docs/ directory as a system of record, execution plans to track complex work, and CI and linters to verify the timeliness and consistency of the knowledge base, even using periodic “doc-gardening” agents to scan for outdated documents.

The view in A05 of “from prompt engineering to context architecture” is specifically reflected in OpenAI’s practice: instead of meticulously crafting every prompt, they build an environment where the agent can “become smart.” OpenAI uses a great analogy: just as you wouldn’t give a new employee a 1,000-page manual, you would give them a short onboarding guide and a map of “where to find what.”

5.3 A02 (Amplifier) + T05 (Cognition is the Asset): Value Transfer

A02 states that AI is a capability amplifier, with the amplification effect proportional to the user’s expertise. T05 states that as the cost of code approaches zero, stable value shifts to cognition. Harness engineering pushes these two axioms to their logical extremes: code is not only low-cost but entirely produced by agents. All human value comes from the cognitive level: defining what is a good architecture, what is correct taste, and what are reasonable trade-offs.

OpenAI’s “golden principles” and Cursor’s “instructions matter more than harness” both say the same thing: the final quality of the system’s output is determined by the human judgment and taste encoded into it. Code is a consumable; cognition is a compounding asset.

5.4 A04 (Reliability is a Management Problem): From Perfection to Fault Tolerance

A04 states that reliability comes from managing uncertainty rather than requiring system perfection. Both OpenAI and Cursor verified this. OpenAI replaced traditional strict gating with minimally blocking merges and subsequent fixes. Cursor found that requiring 100% correctness reduced system efficiency. Both concluded that in an environment where agent throughput far exceeds human attention, correcting errors is more economical than preventing them.

The principle in A04 that “certainty of results is better than certainty of process” has a specific manifestation in harness engineering: OpenAI does not specify which library Codex uses for data validation (process) but requires it to parse data types at boundaries (result). They require invariants, not implementation methods.
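"Parse, don't validate" in miniature: external input is converted into a typed value once, at the boundary, and downstream code receives the type and never re-checks. OpenAI's Codex chose Zod in TypeScript; the Python dataclass version below is an illustrative equivalent, not their code.

```python
# Boundary parsing: invalid data raises here and never crosses inward.
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    value: int

def parse_port(raw: object) -> Port:
    """The one place untyped input becomes a Port, or fails loudly."""
    if not isinstance(raw, int) or not (1 <= raw <= 65535):
        raise ValueError(f"not a valid port: {raw!r}")
    return Port(raw)

def connect(port: Port) -> str:
    # Downstream code trusts the type; no defensive re-validation.
    return f"connecting on {port.value}"
```

The invariant (typed values past the boundary) is what the harness enforces; which parsing library produces them is left to the agent.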

5.5 A08 (Prompt Quality is the Main Lever) + A12 (AI-Native Development Paradigm)

A08 states that in AI-assisted programming, code quality depends on documentation quality. A12 states that AI-native software treats AI as the primary builder and delivers AI-consumable interfaces. Harness engineering is the fusion of A08 and A12 in engineering practice: the entire repository structure is optimized for agent legibility rather than human readability. Comments, documentation, architectural diagrams, and linter error messages are all “prompts” written for the agent.

A detail in the OpenAI article is worth noting: their custom linter error messages are designed to inject remediation instructions into the agent context. This means the linter is not only a constraint mechanism but also a teaching tool. This is an extension of “comment-oriented programming” in A08.

5.6 M09 (Management Paradigm in the AI Era): The Overall Framework

M09 states that the management paradigm in the AI era shifts from process certainty to result certainty, with AI viewed as a team member rather than a tool. Harness engineering is the full engineering implementation of M09. OpenAI’s three management mechanisms (Evaluation First, Cross-check, Documents as Deliverable) all have counterparts in harness engineering: execution plans have acceptance criteria (Evaluation First), agent-to-agent review is Cross-check, and the docs/ directory itself is Documents as Deliverable.

6. Broader Industry Response

Harness engineering is not an isolated concept. In early 2026, several companies discussed similar themes from different perspectives.

Vercel confirmed OpenAI’s theory through an empirical study of AGENTS.md (2026-01-27). They found that passive context (automatically injected AGENTS.md) systematically outperformed active retrieval (skills mechanism) because the former eliminated the agent’s decision-making burden. Vercel also proposed the concept of “self-driving infrastructure,” extending harness engineering from the code layer to the infrastructure layer: agents not only write code but also monitor production environments and automatically generate improvement PRs.

Anthropic provided quantitative data on agent autonomy through a large-scale empirical study (2026-02-18). They found that the 99.9th percentile turn duration for Claude Code grew from 25 minutes to 45 minutes in six months, indicating that the complexity of tasks undertaken by agents is continuously rising. A counter-intuitive finding was that the interruption rate for experienced users actually increased (from 5% to 9%) because they adopted active monitoring rather than passive approval. This is perfectly consistent with the trust spectrum in A04.

A survey of 132 internal engineers by Anthropic in December 2025 also revealed the human side of harness engineering: engineers reported a 50% increase in productivity but also felt uncertain about their roles in a few years. 27% of Claude-assisted work was something they wouldn’t have done otherwise (fixing “papercuts,” exploratory work), showing that AI not only amplifies existing capabilities but also expands the boundaries of work.

The OpenAI Agents SDK team’s practice (2026-03-09) provided a more grounded case. They use Codex to maintain Python and TypeScript SDK repositories, encoding repetitive work (verification, release preparation, integration testing, PR review) into repeatable workflows through repo-local skills, AGENTS.md, and GitHub Actions. The number of merged PRs increased from 316 to 457 in three months. Their experience is that for routine program errors, regressions, and missing tests, Codex is “safe enough in practice” as a required review path.

7. Practical Implications

7.1 What We Are Already Doing

Looking back at our own practices, many elements of harness engineering already exist in our workflow. The axioms and skills in the rules/ directory are a versioned knowledge system. AGENTS.md serves as a navigation file pointing to SOUL.md, USER.md, WORKSPACE.md, and the skills index, consistent with OpenAI’s “directory, not encyclopedia” principle. The Planner-Executor separation and Scratchpad document communication described in our multi-agent blog posts have direct counterparts in Cursor’s final solution. Our practices of axiom A05 (documentation-driven development) and A03 (IC to Manager) are essentially components of harness engineering.

7.2 What We Can Do Further

Several directions from OpenAI and Cursor’s practices are worth adopting.

First is a structured quality scoring system. OpenAI maintains a quality document to score each product domain and architectural layer, tracking quality changes over time. While our survey sessions and axioms cover the thinking level, we lack continuous quantitative tracking of codebase health.

Second is automated knowledge base maintenance. OpenAI has periodic doc-gardening agents to scan for outdated documents and use CI to verify the timeliness and cross-referencing of documentation. Our document updates are currently mostly triggered manually; we could consider introducing similar automated mechanisms.

Third is observability as an agent feedback loop. OpenAI provided Codex with access to Chrome DevTools and a full observability stack (logs, metrics, traces). This makes high-level goals like “ensure startup time is below 800ms” executable for the agent. We can more systematically build observability interfaces for agents in ad hoc jobs.

Fourth is the mechanical enforcement of constraints. OpenAI uses custom linters to enforce architectural invariants, with linter error messages serving as fix instructions for the agent. We can establish similar automated checks for common engineering constraints (naming conventions, file organization, dependency directions).

7.3 What to Watch Out For

The narrative of harness engineering might give the impression that everything can be automated, but both articles include important caveats. OpenAI admits they don’t yet know how architectural consistency evolves over years in a system entirely generated by agents. Cursor admits that multi-agent coordination remains a hard problem and that their system is far from optimal.

More fundamentally, OpenAI’s experiment was conducted on a greenfield project from scratch, where the repository was optimized for agent legibility from day one. Most real-world codebases are legacy systems full of implicit knowledge, informal conventions, and technical debt. Neither article discusses the cost and path of transforming these codebases to be agent-legible.

Another noteworthy signal from Anthropic’s engineer survey is that as AI usage rises, collaboration among colleagues and mentorship are decreasing; Claude has become the first point of contact. If deep skills atrophy from lack of practice while supervising agents requires exactly those skills, a paradox forms. Harness engineering makes human work higher-level but also makes it harder for humans to maintain a feel for the lower levels. This tension remains unresolved.

References

  1. Ryan Lopopolo, “Harness engineering: leveraging Codex in an agent-first world”, OpenAI, 2026-02-11. https://openai.com/index/harness-engineering/
  2. Wilson Lin, “Towards self-driving codebases”, Cursor, 2026-02-05. https://cursor.com/blog/self-driving-codebases
  3. Wilson Lin, “Scaling long-running autonomous coding”, Cursor, 2026-01-14. https://cursor.com/blog/scaling-agents
  4. Vercel, “AGENTS.md outperforms skills in our agent evals”, 2026-01-27.
  5. Anthropic, “Measuring AI agent autonomy in practice”, 2026-02-18.
  6. Anthropic, “How AI is transforming work at Anthropic”, 2025-12-02.
  7. Kazuhiro Sera, “Using skills to accelerate OSS maintenance”, OpenAI Agents SDK, 2026-03-09.
  8. OpenAI, “A Practical Approach to Verifying Code at Scale”, 2025-10.