Over the past few years, the barrier to using AI has kept dropping.
Early ChatGPT users would carefully study prompt phrasing: try
"take a deep breath," try role-playing, try few-shot
templates. From GPT-4 onward, as long as you made your intent clear, the
model could usually understand. More recently, tools like Claude Code,
Codex, Cursor, and OpenCode wired file reading, file writing, command
execution, error reading, and iterative editing into a ready-made
agentic runtime. Users no longer need to write their own agent loop to
get the model working continuously.
This trend is easy to summarize in one line: the stronger the AI, the less hand-holding it needs. That judgment points in the right direction, but it is too coarse. A more precise account would be: some scaffolding is absorbed by model capability, some is commoditized by product runtimes, and the remaining scaffolding increasingly becomes the team’s own judgment assets.
So the real question has already shifted: which scaffolding can you hand off to a commoditized runtime, and which must you still design yourself? That is the boundary judgment facing technical practitioners today.
The first to recede was prompt technique. In the GPT-3.5 era, a lot of prompt engineering really did feel like folk remedies: write specific phrasings, assign a role, ask the model to "think step by step," use examples to force the output format. These techniques worked at the time because the models were weak at instruction following, format control, and long-context stability.
OpenAI’s own documentation evolution tells this story well. The
GPT-4.1 prompting
guide still emphasizes writing three types of reminders in agent
prompts (persistence, tool-calling, planning) and claims they improved
internal SWE-bench Verified scores by nearly 20%. By the o-series reasoning
best practices, the official guidance instead warns users against
writing "think step by step" for reasoning models, because
those models already have reasoning built in. Old techniques may not
help and could even hurt performance.
The guidance for GPT-5 and GPT-5.5 goes further. The GPT-5 guide documents a case from Cursor: older models needed prompts that encouraged thorough context analysis, but the same prompt applied to GPT-5 caused excessive searching, because GPT-5 already collects context more aggressively. OpenAI’s GPT-5.5 prompt guidance directly recommends shorter, more outcome-oriented prompts: define the goal, constraints, evidence, and final artifact, and leave the implementation path to the model.
Anthropic’s path is similar. Early Claude guidance focused on XML tags, few-shot examples, and role prompting. By the Claude Opus 4.7 prompting best practices, the official guidance started warning users to remove over-prompting. Absolute instructions that were written to trigger tool use on older models now cause over-triggering on newer ones. Anthropic’s Effective context engineering for AI agents summarizes this shift: building language model applications is moving from finding the right wording to answering “what context configuration is most likely to produce the desired behavior from the model.”
Prompts still matter. But they have changed from rhetorical tricks into work contracts: what is the goal, what are the constraints, what evidence is available, and when should the model stop. What is receding is the low-level technique that compensates for model weaknesses. The act of expressing intent itself remains.
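As a concrete illustration, a work contract can be as plain as the sketch below. The task, file paths, and checks are hypothetical; the point is the shape: goal, constraints, evidence, stop condition.

```python
# A hypothetical "work contract" prompt: no role-play, no step-by-step coaching,
# just the goal, the constraints, the evidence, and the stop condition.
WORK_CONTRACT = """\
Goal: make rate_limiter.py reject requests above 100/min per API key.

Constraints:
- Do not change the public signature of check_rate(key: str) -> bool.
- Keep the sliding-window approach; do not switch to a token bucket.

Evidence:
- Failing test: tests/test_rate_limiter.py::test_burst_rejected
- Design note: docs/rate-limiting.md

Stop when:
- All tests in tests/test_rate_limiter.py pass, and
- the linter reports no new issues in rate_limiter.py.
"""
```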
The second layer of change is more important: the agent harness that teams used to build themselves is becoming a standard runtime.
If you call the API directly, there is a lot of dirty work to handle yourself: how to save state, how to compress long text, how to recover from tool call failures, how to retry on malformed output, where to run commands, how to constrain permissions, how to feed test failures back to the model. Many teams think they are building an AI product, but they actually spend most of their time cleaning up long-tail errors for the model.
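A compressed sketch of that dirty work, assuming you call the model API yourself. `call_model` and `run_tool` are placeholders for whatever client and tool layer you actually use; the point is how much of the loop is error handling rather than product logic.

```python
import json

MAX_TURNS = 20
MAX_HISTORY_CHARS = 40_000  # crude context budget


def run_agent(call_model, run_tool, task: str) -> str:
    """Hand-rolled agent loop: the dirty work you own when you skip a runtime.

    call_model(messages) -> str and run_tool(name, args) -> str are placeholders
    for whatever API client and tool layer you actually use.
    """
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        # Crude context compression: drop the oldest turns once the budget is blown.
        while sum(len(m["content"]) for m in messages) > MAX_HISTORY_CHARS and len(messages) > 2:
            messages.pop(1)

        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})

        # Retry on malformed output: the model was asked to reply with one JSON object.
        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            messages.append({"role": "user", "content": "Invalid JSON. Reply again with a single JSON object."})
            continue

        if action.get("type") == "final":
            return action.get("answer", "")

        # Recover from tool failures instead of crashing the loop.
        try:
            result = run_tool(action["tool"], action.get("args", {}))
        except Exception as exc:  # permissions, missing files, timeouts...
            result = f"TOOL ERROR: {exc}"
        messages.append({"role": "user", "content": result})

    return "Gave up: turn limit reached."
```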
Tools like Claude Code, Codex, Cursor, and OpenCode package these general capabilities. Their product differences exist, of course, but from the perspective of the user’s working boundary, the core capabilities are converging: filesystem access, shell execution, tool calling, context compression, diff, lint, test feedback, retry loops, permission confirmation, session state. They increasingly look like a standard class of agentic runtime.
This layer can be called a commodity runtime. “Commodity” here means general enough, cheap enough, and interchangeable enough that most teams have no need to implement it themselves. If the only result of building your own harness is that the model can read files, edit files, run tests, check errors, and keep fixing, then that part is no longer worth building in-house.
OpenAI's Harness engineering article states this trend directly: what Codex cannot see effectively does not exist, so documentation, architectural constraints, lint, eval harnesses, and review loops all need to live in the agent-accessible environment. OpenAI then turned skills, hosted shell, and server-side compaction into agentic primitives. In other words, capabilities that once belonged to a team's custom harness are becoming platform capabilities offered by the model provider.
Anthropic’s harness article shows the same direction. Harness design for long-running application development documents an important case: early long-running application development needed sprint constructs and context resets, but by Opus 4.6, the authors found the model could work continuously for longer periods and removed the sprint construct. Anthropic’s own words: every harness component encodes an assumption about “what the model cannot do on its own,” and those assumptions need to be stress-tested because they expire as the model improves.
The conclusion here is that the middle layer of general-purpose harness is hollowing out. Lightweight custom while-loops, hand-written tool parsers, format retries, and simple context compression will continue to be absorbed by standard runtimes. The barrier to building your own has actually risen: you either use a commoditized runtime directly, or you write heavier, more domain-specific scaffolding for the few non-standard tasks.
Deciding when to delegate to a commoditized runtime and when to build your own scaffolding can start from a rough rule: general execution goes to the runtime; domain-specific judgment stays as scaffolding.
If the task has clear feedback, such as test, lint, build, preview, or diff output that directly tells you whether it worked, prioritize handing it to the commodity runtime. Fixing a bug, refactoring a local module, adding tests, editing documentation, cleaning up repetitive patterns in a repo. There is no need to reinvent the execution layer for these tasks. Having Claude Code, Codex, Cursor, or OpenCode read files, run commands, and check errors costs less than writing your own agent loop.
If the task depends on domain judgment, the picture changes. What counts as good design, what level of risk is acceptable, which metric matters more than another, which internal APIs must never be touched, which historical decisions must not be overturned, which user scenarios have the highest priority. These are things the runtime does not know. They must go into project documentation, skills, AGENTS.md, evals, review rubrics, domain context, and eventually the team’s long-term judgment principles.
To be more concrete, what is suitable for delegation: file reading and writing, command execution, routine tool orchestration, basic error recovery, common code modifications, iteration under existing test coverage. What is suitable for custom scaffolding: domain-specific evals, internal system integration, compliance and permission boundaries, state machines for multi-step tasks, high-risk external actions, long-running asynchronous collaboration, and team-specific quality standards.
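To make the custom-scaffolding side concrete, a domain-specific eval can be a short script that encodes rules the runtime has no way of knowing. The module names and rules below are hypothetical; what matters is that the check expresses the team's judgment, not general execution.

```python
import pathlib
import subprocess
import sys

# Hypothetical domain rules: internal APIs that agent-written code must never touch,
# and a hard requirement that billing changes ship with migration tests.
FORBIDDEN_IMPORTS = ["legacy_billing.core", "internal.auth_bypass"]


def check_diff(repo: pathlib.Path) -> list[str]:
    violations = []
    diff = subprocess.run(
        ["git", "diff", "--unified=0", "main"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    added = [line[1:] for line in diff.splitlines()
             if line.startswith("+") and not line.startswith("+++")]

    for line in added:
        for name in FORBIDDEN_IMPORTS:
            if f"import {name}" in line:
                violations.append(f"forbidden import: {name}")

    touches_billing = "billing/" in diff
    touches_migration_tests = "tests/migrations/" in diff
    if touches_billing and not touches_migration_tests:
        violations.append("billing change without migration tests")
    return violations


if __name__ == "__main__":
    problems = check_diff(pathlib.Path("."))
    for p in problems:
        print(f"EVAL FAIL: {p}")
    sys.exit(1 if problems else 0)
```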
External benchmarks also support this boundary. SWE-bench shows that real-world repository issues often require coordinated changes across functions, classes, and files. SWE-bench Pro directly reproduces experimental results using the SWE-Agent scaffold. Terminal-Bench places tasks inside a real command-line environment with an accompanying execution harness and tests. The SWE-agent Agent Computer Interface documentation also explicitly states that good ACI design significantly improves agent performance.
All this evidence points to the same fact: when a task becomes a long-chain execution, reliability does not come from “the model’s one-shot answer being smarter.” It comes from the runtime, tool interfaces, tests, evals, checkpoints, guardrails, and observability mechanisms. The difference is that the general parts can be bought, while the domain-specific parts still have to be designed.
Delegating execution to a commodity runtime still has a cost. The real price is the control you trade away.
First, you accept the runtime’s context management strategy. How it compresses history, how it decides which file matters, how it determines when to search, how it handles old information in a long session. These are usually opaque. For standard software engineering tasks, these default strategies are efficient. For non-standard tasks, they may squeeze out critical context, making it hard to tell whether a failure comes from the model itself, the prompt, the tools, or the runtime’s context selection.
Second, you accept the runtime’s default mode of agency. Claude Code, Codex, Cursor, and OpenCode all shape a default agency: when to ask the user, when to keep trying, when to run tests, when to consider a task complete. Default agency is valuable because it lets most ordinary tasks run without manual orchestration. But whenever your task has special stopping conditions, special risk boundaries, or special acceptance criteria, the default agency may not be enough.
Third, custom scaffolding has an uphill-cost problem. Model and tool providers will keep optimizing for the usage distribution of mainstream runtimes. Models will keep getting better at standard tool schemas, standard shell environments, and standard code edit workflows. If you build a non-standard interface, you are not just maintaining code; you are swimming against the model’s default training distribution. Unless your custom scaffolding delivers clear advantages, the next round of model and runtime upgrades will erase it.
So the more practical strategy is to place control where it matters most. Filesystem, shell, context compression, and basic retries go to the runtime. Success criteria, domain knowledge, permission boundaries, evaluation methods, and long-term memory stay in your own hands.
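One sketch of what keeping permission boundaries in your own hands can look like while execution stays delegated. The command patterns below are hypothetical, and how such a check gets wired into a given runtime's approval flow differs per tool.

```python
import re

# Hypothetical risk boundary: commands the agent may run freely, commands that
# must always go to a human, and everything else defaults to asking.
AUTO_APPROVE = [r"^pytest\b", r"^ruff\b", r"^git (status|diff|log)\b"]
ALWAYS_BLOCK = [r"\brm -rf\b", r"\bgit push --force\b", r"\bcurl\b.*\|\s*sh\b"]


def decide(command: str) -> str:
    """Return 'allow', 'block', or 'ask' for a command the agent wants to run."""
    if any(re.search(p, command) for p in ALWAYS_BLOCK):
        return "block"
    if any(re.search(p, command) for p in AUTO_APPROVE):
        return "allow"
    return "ask"
```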
Harness Engineering is getting a lot of attention right now. But it may not persist in its current form for long.
LangChain’s history already provides a reference. In the early days, multi-model orchestration, chains, memory, and agent executors all seemed like problems that a framework had to manage. Back then, the models were weak, tool-calling was weak, and product interfaces were immature. Wrapping these things in a framework made real sense. But a few years later, many of the tasks that early frameworks handled have been absorbed by models, APIs, and coding agent runtimes. You do not necessarily need to write a chain or set up a multi-model orchestration framework. Often, stating the goal, context, and acceptance criteria clearly, then letting a ready-made agentic runtime search, execute, and fix on its own, works well enough.
Harness Engineering may be going through the same process. Today it addresses the boundaries that models and runtimes have not yet fully covered: long tasks, parallel agents, asynchronous execution, observability, permissions, and evaluation. OpenAI talks about environment design. Cursor talks about large-scale agent orchestration. Anthropic talks about time scaling for an agent that runs continuously for hours. All three call it harness engineering, but they are solving different boundaries.
This means the real skill in learning Harness Engineering is training your boundary judgment: is this scaffolding compensating for a model weakness, a runtime gap, missing domain knowledge, or missing evaluation criteria? If it is only compensating for a model weakness, prioritize deleting it when the model improves. If it is only filling a gap in the general-purpose runtime, the platform will likely absorb it soon. If it encodes your domain knowledge, risk boundaries, and quality standards, then it is worth maintaining long-term.
A good harness also leaves byproducts: failure samples, evals, review records, error classifications, and context assets. These can in turn drive improvements to the model and product runtime. But that is a supporting mechanism. The main thread is still boundary migration: scaffolding worth building today may become a commoditized runtime’s default capability tomorrow. And as models get stronger, humans will hand them longer, more ambiguous, and higher-risk tasks. New boundaries will keep appearing.
Human work is indeed moving upward. The direction is from execution to judgment; "from coding to prompting" is just one facet of that shift, at the prompt level.
The bottom layer will keep converging on simple, deterministic interfaces: command line, JSON, exit codes, filesystem, AGENTS.md, test scripts, lint rules. The middle runtime layer will keep productizing; Claude Code, Codex, Cursor, and OpenCode will increasingly look like standard working environments. What really creates differentiation is the layer above: context, skills, evals, contracts, domain memory, cognitive frameworks.
Garry Tan’s Thin Harness, Fat Skills points in the same direction: the harness only handles the loop of calling the model, reading and writing files, managing context, and enforcing safety policies. Real capability lives in skill files. Our own practice is closer to this judgment too: the runtime can be outsourced to Claude Code or OpenCode, but skills, axioms, workspace routing, research workflows, and acceptance criteria have to be maintained by the team.
This is also why cognitive frameworks are growing in importance. The better the model gets at execution, the more humans need to be responsible for defining the problem, choosing boundaries, establishing acceptance criteria, and making trade-offs. In the past, you had to write down the process: when A happens, do B; when C happens, do D. Now a more effective approach is often to define the outcome: what conditions the final artifact must satisfy, how to verify it, and how to proceed if it fails. Let the agent explore the path. Set the acceptance criteria yourself.
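Stated as code, "define the outcome" might look like the hypothetical acceptance check below: it verifies the final artifact against your conditions and says nothing about the path the agent should take. The commands and checks are illustrative, not prescriptive.

```python
import subprocess

# Hypothetical acceptance criteria for a finished change: the artifact must
# pass these checks, regardless of how the agent got there.
CHECKS = [
    ("tests pass", ["pytest", "-q"]),
    ("types check", ["mypy", "src/"]),
    ("no lint errors", ["ruff", "check", "src/"]),
]


def accept() -> bool:
    ok = True
    for name, cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        status = "PASS" if result.returncode == 0 else "FAIL"
        print(f"{status}  {name}")
        ok = ok and result.returncode == 0
    return ok


if __name__ == "__main__":
    raise SystemExit(0 if accept() else 1)
```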
The same model, the same runtime, the same tools. Put different cognitive context behind them and the output changes from an action list to real judgment. The controlled experiment in Context infrastructure illustrates this exactly: without a personal judgment framework, the AI outputs correct but mediocre recommendations. After connecting it to long-accumulated judgment principles, it starts producing non-consensus analysis.
The progress of AI has not made the framework world disappear. It has absorbed low-level patches into the model, absorbed general execution into commoditized runtimes, and pushed the truly hard part back to humans: which capabilities to buy, which to build, what standard defines “done,” and on what basis you trust that it got it right.