Date: 2026-04-20
Dan Ariely said big data is like teenage sex: everybody talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they’re doing it. Harness engineering is in exactly that state right now.
Every few weeks the AI field pushes a new concept to the forefront, buzzes about it for a while, then replaces it with the next one. RAG went through this cycle, LangChain went through it, Context Engineering went through it.
Harness Engineering is different. From Mitchell Hashimoto’s first articulation in February 2026, to OpenAI’s official adoption, to Garry Tan’s Thin Harness Fat Skills pulling 1.4 million views, this concept has persisted for nearly three months with no obvious decline in interest. And throughout these three months, almost nobody has produced a definition that satisfies everyone.
OpenAI wrote a good article, Garry Tan has influence, and three major companies publishing simultaneously amplified one another. These factors explain the first week of buzz, but they do not explain the third month. Three months of sustained interest requires real resonance on the demand side: a group of people hitting the same set of problems in their own practice, then discovering that the word harness happens to describe them.
What is this set of problems? Between late 2025 and early 2026, GPT-5.4, Claude Opus 4.6, and Gemini 3 Pro launched in succession, and agents moved from proof-of-concept to production deployment. A large number of teams hit five walls simultaneously during this transition.
The first wall: combinatorial explosion of errors. A single agent’s failure modes are manageable; you can add guards for each of them. But chain two agents together and the failure modes are not additive, they are combinatorial. A case from healthcare illustrates this: three agents, each at 95% accuracy. Agent A generated a non-existent drug name, Agent B detected fictitious drug interactions based on it, and Agent C sent an emergency alert to the physician based on those. Each agent’s guard passed (each one’s output was correctly formatted and internally consistent), yet combined they produced a completely fabricated, high-confidence medical alert. The root cause: traditional per-component error prevention is built on the assumption that failure modes are enumerable; the probabilistic output of agents invalidates that assumption.
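The arithmetic behind this wall can be made concrete with a toy simulation (only the 95% figure comes from the case above; the rest is illustrative, not the healthcare system’s actual pipeline). Each stage is locally reliable, but correctness only survives the chain if every stage preserves it:

```python
import random

def run_stage(upstream_ok: bool, p_correct: float = 0.95) -> bool:
    """A stage's output is correct only if its input was correct AND it
    introduces no error of its own (each stage ~95% reliable in isolation)."""
    return upstream_ok and (random.random() < p_correct)

def run_chain(n_stages: int = 3) -> bool:
    ok = True
    for _ in range(n_stages):
        ok = run_stage(ok)
    return ok

random.seed(0)
trials = 100_000
failures = sum(not run_chain() for _ in range(trials))

# Analytically: 1 - 0.95**3 ≈ 0.143. Roughly 1 in 7 chains is wrong,
# even though every individual stage clears its own 95% bar.
print(failures / trials)
```

The per-stage guards in the healthcare case checked format and internal consistency, which is exactly what this model captures: a stage can be locally valid while faithfully propagating upstream fiction.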
The second wall: natural language output cannot be measured. Traditional observability measures structured data: HTTP status codes, latency, error rates. You can set thresholds on numbers and configure alerts. But an agent’s core output is natural language. Whether a collections email is precise enough, whether the details are sufficient, whether the tone is appropriate — there are no ready-made measurement tools. A case from finance: an agent initially generated very precise collections emails, but as it handled more edge cases, output gradually degraded into generic templates. The model’s interpretation of the prompt drifted, and the team had absolutely no means to detect the drift as it happened. Mezmo’s AURA project lead put it precisely: “agents fail silently. They hallucinate, loop, and make confident but incorrect decisions based on incomplete context. Traditional observability stacks offer no visibility into these failures.” The root cause: traditional observability is built on the assumption that the observed object is structured data; the natural language output of agents invalidates that assumption.
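One common pattern for making natural language measurable is rubric scoring plus a rolling-window monitor. The sketch below is an assumption about how such a harness might look, not the finance team’s actual tooling; `judge_score` is a crude stand-in for what would in practice be an LLM-as-judge or human rubric:

```python
from collections import deque
from statistics import mean

def judge_score(email_text: str) -> float:
    """Placeholder scorer (hypothetical): rewards concrete details such as
    amounts and dates. A real harness would rate precision, specificity,
    and tone against a written rubric, likely with an LLM judge."""
    tokens = email_text.split()
    if not tokens:
        return 0.0
    concrete = sum(1 for t in tokens if any(c.isdigit() for c in t))
    return min(1.0, concrete / max(len(tokens) * 0.05, 1))

class DriftMonitor:
    """Rolling-window quality monitor: alerts when the average rubric score
    drops below a floor, catching gradual degradation that no single
    pass/fail check on one output would ever flag."""
    def __init__(self, window: int = 50, floor: float = 0.6):
        self.scores = deque(maxlen=window)
        self.floor = floor

    def observe(self, output: str) -> bool:
        """Record one output; return True when the window is full and
        average quality has slipped below the floor."""
        self.scores.append(judge_score(output))
        return len(self.scores) == self.scores.maxlen and mean(self.scores) < self.floor
```

The point is the shape, not the scorer: quality becomes a time series you can alert on, which is precisely what the collections-email team lacked when their outputs drifted into generic templates.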
The third wall: the managed entity perceives its environment and changes behavior. Traditional software’s state is defined by programmers, with clear boundaries. An agent’s context is entirely different: dynamic, unstructured, capacity-limited, and the agent perceives this limit and changes its behavior accordingly. Cognition (the Devin team) discovered context anxiety while rebuilding their product: the agent anticipates the context window limit and starts taking shortcuts, wrapping up tasks prematurely. Their solution was entirely at the environment level: enable a 1M token context, actually limit usage to 200K, making the model believe it still has ample headroom. The model did not change, the environment changed, the problem disappeared. The root cause: traditional state management is built on the assumption that the managed entity does not perceive resource constraints; the agent’s active perception of context invalidates that assumption.
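Cognition’s actual implementation is not public; a minimal sketch of the environment-level idea might look like the following, with every name and threshold hypothetical. The harness keeps real usage far below the advertised window, compacting history before the model can perceive scarcity:

```python
ADVERTISED_WINDOW = 1_000_000   # what the model is allowed to believe it has
HARD_BUDGET = 200_000           # what the harness actually permits
COMPACT_AT = 150_000            # soft cap: compact well before visible pressure

def estimate_tokens(messages: list[str]) -> int:
    # Crude ~4 chars/token estimate; a real harness would use a tokenizer.
    return sum(len(m) for m in messages) // 4

def manage_context(messages: list[str], summarize) -> list[str]:
    """Fold older turns into a summary once the soft cap is hit, so the
    transcript never creeps toward a limit the model could sense.
    `summarize` is a caller-supplied function (e.g. another model call)."""
    if estimate_tokens(messages) < COMPACT_AT:
        return messages
    head, recent = messages[:-10], messages[-10:]
    return [summarize(head)] + recent
```

The fix lives entirely outside the model, matching the pattern in the paragraph above: the environment changes, the behavior changes.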
The fourth wall: output is non-reproducible, traditional testing fails. Traditional testing methods (unit tests, integration tests, regression tests) are built on one premise: the same input produces the same output. Pass once, and it should pass again in the future. An agent’s output is different every time; a case that passes today might fail tomorrow. The traditional pass/fail determination is insufficient; what is needed is continuous evaluation based on judgment criteria: define what constitutes acceptable output, then use statistical methods to determine whether overall quality is within acceptable bounds. The root cause: traditional verification is built on the assumption that behavior is reproducible; the probabilistic output of agents invalidates that assumption.
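A continuous-evaluation harness of this kind can be sketched in a few lines: sample the same case many times, score each output against a judgment criterion, and accept only if you can be statistically confident the true pass rate clears the bar. The Wilson lower bound here is one reasonable choice, not a prescribed standard:

```python
import math

def pass_rate(outputs: list[str], acceptable) -> float:
    """Score many samples of the same case against a judgment criterion;
    the unit of evaluation is the distribution, not any single run."""
    return sum(1 for o in outputs if acceptable(o)) / len(outputs)

def wilson_lower(p_hat: float, n: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval: robust at small n,
    unlike the naive p_hat ± z*sqrt(p(1-p)/n)."""
    denom = 1 + z * z / n
    center = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (center - margin) / denom

def accept(outputs: list[str], acceptable, bar: float = 0.90) -> bool:
    """Replace pass/fail with: are we confident the true pass rate >= bar?"""
    return wilson_lower(pass_rate(outputs, acceptable), len(outputs)) >= bar
```

A case that "passes once" tells you almost nothing here; a case that passes 97 times out of 100 tells you quite a lot.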
The fifth wall: governance frameworks cannot control probabilistic behavior. McKinsey’s data shows over 70% of enterprises are piloting AI, but less than 20% have successfully scaled it. IDC’s research is sharper: only 2.9% of organizations are scaling agent applications across departments. Specific symptoms include agents processing data redundantly across systems, API changes causing agents to fail silently, and poorly managed agents generating more support tickets than the productivity gains they deliver. Enterprises’ existing IT governance systems (permission management, change control, audit trails) assume that when a system is authorized to do something, it only does that thing, in the same way every time. The probabilistic behavior of agents invalidates this assumption: same permissions, same input, but the agent may take actions outside its authorized scope. The root cause: traditional governance frameworks are built on the assumption that system behavior is deterministic.
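The governance response usually takes the shape of deterministic enforcement wrapped around the probabilistic actor. A minimal sketch, with all names invented for illustration: every tool call passes through an allowlist and lands in an audit trail, so an out-of-scope action is blocked and visible rather than silent:

```python
from datetime import datetime, timezone

class ToolGuard:
    """Deterministic boundary around a probabilistic agent: the agent may
    *attempt* anything, but only allowlisted tools execute, and every
    attempt (permitted or not) is recorded for audit."""
    def __init__(self, allowed: set[str]):
        self.allowed = allowed
        self.audit_log: list[dict] = []

    def call(self, tool: str, handler, *args):
        permitted = tool in self.allowed
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "tool": tool,
            "permitted": permitted,
        })
        if not permitted:
            raise PermissionError(f"agent attempted out-of-scope tool: {tool}")
        return handler(*args)
```

The permission model stops describing what the system *will* do and starts constraining what it *can* do, which is the adjustment the fifth wall forces.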
Looking back at these five walls, they share a unified pattern. Each wall corresponds to a link in the traditional software reliability assurance chain: error prevention, observability, state management, verification, governance. Together, these five links cover the complete lifecycle of software from runtime to delivery. And the reason each link fails is the same: its underlying assumption is built on deterministic systems, and the probabilistic behavior of agents breaks that assumption.
In other words, agents did not break at a single point; they caused the entire reliability assurance chain to fail simultaneously. These problems do not belong to the domain of model capability (you cannot wait for the next generation of models to fix them), nor do they belong to the domain of prompt or context engineering (which focus on the quality of model inputs). They are infrastructure problems: agent capabilities have outrun the infrastructure.
But are these infrastructure gaps truly new?
Place the five walls above in the context of management science, and each one has a ready-made counterpart. We argued this extensively with case studies in the Managing AI series: most obstacles encountered in AI deployment have already appeared in managing human teams, and already have mature solutions.
Combinatorial explosion of errors corresponds to combinatorial risk in cross-departmental collaboration. Each department’s own quality management may pass muster, but the handoff points between departments generate problems no one foresaw. The organizational response is end-to-end process auditing, cross-departmental independent review, and checkpoints at critical handoff points. Checking each department individually is never sufficient; you must check the combined effect.
Natural language output being unmeasurable corresponds to broken performance metrics. When a team’s output shifts from quantifiable (lines of code, tickets closed, compliance rates) to hard-to-quantify (solution quality, communication effectiveness, judgment), the original KPI system becomes inadequate. The management response is to introduce qualitative evaluation: peer review, case retrospectives, multi-dimensional scoring based on rubrics. The core challenge is the same: you need to build a new measurement system for unstructured output.
Context anxiety corresponds to employees lowering standards upon perceiving resource pressure. Human employees also wrap up early and take shortcuts when they feel time is insufficient or the task exceeds their capacity. The manager’s response is reasonable workload allocation, clear priorities, and giving employees accurate expectations about resource constraints — rather than investigating only after output quality has declined.
Testing failure corresponds to team members with high output variance. Some direct reports perform inconsistently; you cannot judge by a single delivery, you need to track multiple outputs over time to assess overall capability. The management approach is to establish continuous performance tracking, replacing single-point evaluation with multi-sample assessment.
Governance failure corresponds to loss of control during organizational scaling. When an organization grows from one team to ten, previously effective management methods start breaking down, requiring hierarchical structures, standardized processes, and cross-team coordination mechanisms. And when team members’ behavior is not fully predictable (new hires, outsourced teams), you need stronger auditing and permission management.
Every wall has a corresponding problem and a mature solution in management science.
Using AI is doing management — this observation is not new.
In 2023, Karpathy proposed Software 3.0, saying the developer’s role shifted from writing code to directing and verifying AI output using natural language. In 2024, Ethan Mollick wrote explicitly in Co-Intelligence that “great AI management, not great models, creates competitive advantage,” and provided a complete delegation-supervision-evaluation framework. That same year Simon Willison defined an agent as “an LLM that runs tools in a loop,” with the developer’s role being to provide constraints and verification. After Karpathy’s vibe coding went viral in 2025, he himself admitted that real work requires “more oversight and scrutiny.” We ourselves began writing the Managing AI series in early 2025, with the same core argument: obstacles encountered in AI deployment have already appeared in managing human teams.
These viewpoints are all correct, and each reached a sizable audience, but none of them generated sustained momentum. Why?
Because “managing AI” as a framing has two problems.
First, it evokes the wrong associations. The moment you say management science, the tech community’s first reaction is: supply chain management, financial management, human resource management — what does that have to do with my code? More specifically, a large portion of human management is spent understanding your reports, knowing how to motivate them, handling interpersonal relationships and politics. None of this applies to AI. AI is trained from the ground up to be a helpful assistant; you do not need to spend effort motivating it or managing its emotions. So while AI management and human management share the same principles, their emphasis is completely different: the hard part of human management is willingness, while the hard part of AI management is reliability. Call it management science, and people think of the willingness half, when what agents actually need to solve is the reliability half.
Second, it sounds like a soft skill. Yuzheng identified the root of this problem in “Nouns vs. Verbs”: the market prices nouns, not verbs. Debugging a cascade failure in a multi-agent system is a verb, designing a verification loop is a verb, managing an agent’s context lifecycle is a verb. These verbs have no GitHub repo, cannot be pip-installed, cannot go in a resume’s skills section. Call it management science, and it stays on the soft-cognition shelf.
Harness engineering solves both problems.
The original meaning of harness is horse tack. A horse is a being with autonomous agency; it can judge road conditions on its own, navigate around obstacles, avoid crashing into walls when the rider is drunk. But the rider needs to control direction, using 5% of effort to determine 95% of the travel path. This metaphor maps precisely to agents’ characteristics: agency, judgment, but requiring high-leverage guidance.
This is entirely different from the car metaphor. A car has no agency; turn the steering wheel left and it goes left, hit the brakes and it stops. The mental model for driving a car is process determinism: you control every step. The mental model for riding a horse is outcome determinism: you provide direction and intent, the horse decides the specifics. We used similar analogies in From Process Determinism to Outcome Determinism and Managing AI: The Most Important Promotion of Your Career: calling AI via API used to be like driving a car, where you control every step; using agents now is like riding a horse, where you manage direction and it manages execution. The shift from micromanagement to management, from process determinism to outcome determinism — they describe the same thing.
The word harness naturally locks onto the specific scenario of guiding an autonomous intelligent agent: far narrower than management science’s scope, with a far more accurate mental anchor.

Of course, there is hype in this too. Harness engineering sounds newer and cooler than AI management science, and that novelty is itself part of its propagation power. AI management science immediately sounds like decades-old stuff; even with identical content, it is hard to get the tech community excited.

But hype only explains short-term propagation; three months of sustained interest indicates there is substance beneath the name. That substance is the one critical signal harness engineering adds beyond agent management: engineering. It declares this an engineering discipline, with codifiable, teachable, measurable practices. When you package a set of verbs into the noun harness engineering, it moves from the soft-cognition shelf to the hard-tech shelf, where it can be discussed, learned, and priced.
But a good name only solves the cognitive categorization problem, not the content problem. Beneath the name, what is harness engineering actually doing?
What it is doing is: re-implementing old management principles in the new environment of agent runtime. The principles are the same set; the engineering practices are two different sets.
A more precise analogy is DevOps. The principles of operations have not changed in decades: monitoring, alerting, capacity planning, failure recovery. In the traditional data center era, these principles were implemented with tools like Nagios for monitoring, manually SSHing into servers to check logs, and ops teams on-call rotation handling incidents. In the cloud-native era, the same principles have entirely different implementations: Prometheus + Grafana for monitoring, Kubernetes for automatic container orchestration, Terraform writing infrastructure as code, GitHub Actions for CI/CD automated deployments. The principles did not change, but the cloud-native environment (instances created and destroyed at will, state distributed across multiple nodes, infrastructure itself needing version management) demanded an entirely new set of tools and practices. DevOps did not invent new operations principles; it re-implemented old principles in a new execution environment.
The relationship between harness engineering and management science is the same. To manage people, you use 1-on-1s, performance reviews, culture building. To manage agents, you use checkpoint design, verification loops, context lifecycle strategies, observability stacks. Agent runtime has its own specific constraints (probabilistic output, context window limits, tool-call uncertainty, multi-agent state synchronization), and these constraints demand a new set of engineering practices to implement those old principles from management science.
The three foundational articles from Q1 2026 become clear from this angle. Each one addresses the engineering implementation of management principles along a different dimension of agent runtime. OpenAI’s harness engineering addresses the interaction dimension: how humans steer large amounts of agent work with minimal intervention, corresponding to span of control and delegation in management science. Cursor’s self-driving codebases addresses the spatial dimension: how hundreds of agents run in parallel without stepping on each other, corresponding to cross-team coordination. Anthropic’s harness design for long-running apps addresses the temporal dimension: how an agent running continuously for hours stays on track, corresponding to milestone management in long-cycle projects.
The three articles share highly overlapping readership, use highly consistent terminology, but answer engineering questions that are entirely different. This is also why consensus on the definition of harness has never converged: everyone is loading their own specific dimensional management problem into the word.
Back to the original question. Three major companies publishing simultaneously created resonance — that explains the first week. But three months of persistence requires demand-side support.
What happened on the demand side: after agents entered production, the entire software reliability assurance chain failed simultaneously. Error prevention, observability, state management, verification, governance — the underlying assumptions of all five links were built on deterministic systems, and the probabilistic behavior of agents broke these assumptions one by one. Practitioners hit these problems in their daily work, then discovered the solutions were not unfamiliar: management science has long had corresponding principles and tools.
The principles are old, but previous naming attempts failed to make them popular. “Managing AI” made people think of willingness and politics, while the core challenge of agents is reliability. Harness engineering solved this naming problem: harness locks onto the specific scenario of “guiding an autonomous intelligent agent,” engineering declares this is a codifiable, teachable engineering discipline, and the two words together package a set of scattered verbs into a noun that can be discussed and priced.
So the sustained interest in harness engineering comes from the convergence of two things: a real infrastructure gap, and a name that finally fits that gap.
For practitioners, this means one piece of good news and one piece of bad news.
The good news is that you don’t need to learn an entirely new discipline from scratch. If you’ve managed people, projects, or teams, you already have the foundational principles of harness engineering.
The bad news is that shared principles don’t mean you can copy-paste. Agent runtime has its own unique constraints and opportunities that demand new practices you need to specifically learn. A few examples.

When managing people, your team has tribal knowledge; many things don’t need to be written down for everyone to know them. Agents have no tribal knowledge: all context must be explicitly documented. Document first isn’t a nice-to-have; it’s a basic prerequisite.

When managing people, you don’t worry about a report suddenly cutting corners at 80% completion. Agents will: when the context window approaches capacity, they sense the pressure and start taking shortcuts (the context anxiety discussed earlier). You need to proactively manage the context lifecycle and divide and conquer before it hits the threshold.

When managing people, having ten people do the same task in parallel and picking the winner is expensive and ethically questionable. With agents, you can run ten instances on the same task in parallel and keep the best result: best-of-n sampling, which costs nearly nothing and works remarkably well.
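The parallel-instances practice reduces to a few lines of harness code. This is a sketch under assumptions: `run_agent` and `score` stand in for whatever agent invocation and quality metric a real harness would supply:

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_n(task: str, run_agent, score, n: int = 10) -> str:
    """Run n independent agent instances on the same task in parallel and
    keep the highest-scoring candidate. run_agent(task) -> output and
    score(output) -> float are caller-supplied (hypothetical here).
    Absurdly expensive with people; nearly free with agents."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: run_agent(task), range(n)))
    return max(candidates, key=score)
```

Note that `score` is doing the management work: best-of-n is only as good as your ability to judge outputs, which loops back to the measurement wall above.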
These practices are either impossible or uneconomical in the context of managing humans, but they’re basic operations in agent runtime. The principles of harness engineering come from management science, but its engineering practices must be redesigned for the specific characteristics of agents. This is also why it deserves its own name.