AI AgentAI CodingDeveloper Tools

Loop Engineering Explained: From Managing Execution to Designing Self-Convergent Loops

Recently in the AI-assisted development space, Loop Engineering has become a hot industry term. Business Insider captured this wave of discussion with Forget Prompts: “Loop Engineering” Is All the Rage Now, and Addy Osmani followed up with a more systematic definitional breakdown.

I was initially hesitant and delayed writing about this topic. From a practical standpoint, the core ideas are common knowledge within the frontier developer community. In our AI Builders course, this approach centered on feedback loops and system design has been taught for over a year.

But the reason I’m writing about it now is that the term is well-defined and serves as a useful thread to help everyone grasp the underlying essence. It’s also a perfect opportunity to draw on some of the course material to analyze how to truly make the most of this paradigm.

The leap from Prompt to Loop revolves around two things: an elevation of mindset, and the art of evaluation.


1. AI Manager: The First Step from Executor to Delegator

The first step in using AI is to adopt a managerial mindset. You need to transform yourself from a hands-on coder executing tasks into an AI Manager who defines problems and validates results. Anthropic’s study of roughly 400,000 Claude Code sessions also observed that those who behave like managers are more likely to succeed, noting in a research footnote that “perhaps acting like a manager confers greater success”.

At this stage, we can think of AI as an intern with amnesia. To get this intern to produce high-quality work, the manager needs to provide support at three levels:

The AI Manager must simultaneously provide context, problem boundaries, and objectively verifiable endpoints

First, provide self-contained context. Due to the stateless nature of each individual LLM invocation, it has no awareness of the historical decisions and implicit rules behind a project. You must organize this background knowledge into clean documentation.

Second, define clear problem boundaries. Instead of giving vague verbal instructions, use natural language to clearly describe the task’s inputs, outputs, and constraints.

Third, define objectively verifiable endpoints first. This requires making it clear to the AI when a task is considered done, and what specific behaviors constitute failure.

While the AI Manager approach solves the cold-start and intention-guessing problems, at this stage the human remains deeply involved in the management loop. You still need to continuously break down tasks, supply context, inspect intermediate results, and decide on the next step, guiding the AI according to your own judgment. This human-powered management loop limits the overall throughput of the system.


2. Senior Manager: Freeing Humans from the Execution Loop

Manual management has bottlenecks, but these bottlenecks are no longer simply “pasting error messages back to the AI.” Modern coding agents can already write code, run verification, read error logs, and continue debugging autonomously. Back in 2025, Elastic’s team ran a similar self-healing loop in Buildkite CI, fixing 24 broken PRs in the first month. This proves that basic write-run-fix cycles are already viable in engineering.

The real bottleneck occurs one level up. Once tasks become longer, more numerous, cross-session, and cross-project, humans still have to constantly decide what to do next, judge whether a particular change is genuinely better, diagnose where failures are concentrated, and repeat a successful lesson to the next iteration of the Agent. The human is no longer a log courier, but remains the system’s implicit manager.

The Senior Manager aims to solve precisely this problem. It’s not about making a human fix a bug faster, but about turning managerial actions into system capabilities: clearly delegating tasks, building evaluation that can be run repeatedly, using observability to see the causes of failure, and then codifying effective coaching into SOPs so the system can invoke them on its own next time.

Neurotype auto-correction is a more fitting example. This project aims to let users type freely even with missing characters, mistaken touches, and case confusion and have the AI clean up the text to near ground truth. The hard part isn’t writing a correction function — it’s enabling the system to know which correction is better.

So we first had the AI build an evaluation harness: a sample set covering adjacent-key mistouches, omissions, repetitions, and case issues, with metrics using normalized edit distance and error-type logging, and a baseline from a dummy corrector or an old prompt. With this in place, the system no longer depends on a human’s momentary intuition but has a fixed yardstick.

Next, we had the AI build a tracking UI. The overview view shows trends in overall scores, while the per-case view records the rationale for each case. Evaluation scores can only tell us whether things are good, not why they aren’t. Without this micro-level visibility, even if we statistically know which type of case has the most errors, we can’t determine where the model is actually stuck. We only see the distribution of results, not the model’s internal error paths. The value of rationale and trace lies in exposing the causes of failure for individual cases, truly establishing observability beyond evaluation.

Finally, the manager no longer directly edits the model’s prompts, but instead lets the AI analyze failed samples, identify error patterns, and give it a targeted nudge. Once successful, the “problem, diagnosis, correction” process is written into an SOP. Subsequent auto-tuners reference this SOP to propose changes, run evaluations, and record results on their own. What the Senior Manager delivers is not a fixed bug, but a system that can continuously fix its own bugs.


3. Loop Engineering: The Mechanization of Senior Manager Ideas

Loop Engineering is essentially the technical realization of the Senior Manager philosophy. In a CNBC interview, Boris Cherny described this shift as: Claude writes its own prompts, and the human converses with the coordinating Claude; Business Insider paraphrased this.

So when trying to understand Loop Engineering, you shouldn’t focus first on implementation details like cron or worktrees. They’re useful, but they’re engineering carriers. What truly matters is this: every second-order management action in the Senior Manager roadmap has a corresponding system component in Loop Engineering.

Senior Manager Roadmap Loop Engineering Component Why It Matters
Codify successful experience into SOPs, e.g. scale.md or playbooks skills and knowledge assets So the next iteration of the Agent doesn’t have to grope in the dark again — turns a one-time coaching into a long-term invocable capability.
Build evaluation harness: sample sets, metrics, baselines verifier and automated evaluation Gives the system a fixed yardstick, avoiding the need for human ad-hoc judgment with every change.
Build tracking UI, record rationale and trace state, trace, and observability Scores only tell you whether something changed; trace tells you where to fix it.
Transition from fixer to coach, let the system self-diagnose failure patterns maker/checker feedback loop The executor tries, the checker calibrates — so the system doesn’t grade itself too leniently.
Let real-world usage continuously generate new samples memory and data flywheel Static test sets go stale; real-data backflow is what makes the system stronger over the long run.

In engineering implementation, a Loop Engineering closed-loop pipeline typically employs these carriers:


4. The Two Hard Levers That Determine Loop Quality

Running a pipeline is not hard; making it produce highly reliable results is. In practice, what determines whether a loop truly converges rather than degenerates comes down to the following two hard levers.

4.1 The Art of Evaluation: Why TDD Is Not the Answer in the AI Era

In real engineering, the evaluation of work results is often highly subjective and complex — far from something captured by “does it compile?”

Many people intuitively think that since AI’s behavior is hard to predict, we just need to apply the TDD pattern, write more unit tests, and force the AI to make them pass to ensure quality.

In practice, however, this is a directional error. In our previously published article Why TDD Is Not the Answer in the AI Era, we analyzed the causes of this paradox in detail:

First, traditional TDD succeeds because it relies on an implicit assumption unique to human society: human developers come with a built-in motivational structure oriented toward correctness (fear of being woken up at 2 a.m., concern about embarrassment during Code Review). For humans, tests serve as a guiding signpost system.

Second, AI has no implicit maintenance burden. In the TDD loop, the only feedback the AI receives is whether tests pass or fail. At this point, Goodhart’s Law (when a measure becomes a target, it ceases to be a good measure) kicks in rapidly: if the fastest way to pass a permissions test is to directly rewrite the code so it hardcodes return True, the AI will do so without hesitation. The test becomes its destination, not a signpost.

Third, if we try to plug all holes by writing a large volume of highly specialized Mocks and Stubs, this locks down the AI’s implementation path, effectively shackling it with procedural determinism. The AI loses the freedom to explore better architectures.

Therefore, designing evaluation criteria is an art. The solution in the AI era is not to write more granular unit tests, but to pull deterministic constraints back from specific code paths to the system boundary:

Evaluation in the AI era should shift from path constraints to boundary constraints

We need to use contract tests, property-based tests, or introduce another fully independent AI instance that uses natural-language acceptance specifications to perform semantic-level consensus checks on the generated results. This sacrifices mechanical determinism, but gains immense semantic flexibility.

4.2 The Limitations of Autonomous Task Discovery

In the current wave of Loop advocacy, one concept is being marketed as a selling point: let the AI automatically scan CI logs and bug reports, discover work on its own, and decide what to change.

In real-world production, this vision of fully AI-determined prioritization and requirement discovery is mostly a gimmick. There are two reasons:

First, prioritization requires a wealth of business context, user empathy, and mid-to-long-term strategic goals. This background knowledge is highly abstract and dynamically shifting. Decisions made by AI in the absence of this tacit context are extremely prone to going off course.

Second, task discovery is a divergent system. If the implementation code is wrong, at worst it won’t compile or tests will fail — a convergent, local problem. But if directional decisions and architectural designs go wrong, the Agent may continuously accumulate code on a flawed foundation, piling up a near-irrecoverable mess of a system, with exorbitant costs to backtrack later.

At the current stage, task discovery still requires a firm human hand on the tiller. AI can assist with filling gaps and catching oversights, but deciding what to do and in what order remains the human’s core line of defense.


Conclusion: The System Designer, One Step Ahead of the Industry

The surge of Loop Engineering marks the industry’s elevation from being a prompt-tinkering tool user to a system designer who architects self-running systems.

If you want to stop being the human manager who constantly decomposes tasks, supplies context, and decides the next step, and instead become a designer capable of building self-evolving development engines:

👉 Check out our course: Superlinear Academy - AI Builders. Here, we systematically deliver this AI collaboration, system evaluation, and closed-loop methodology — staying ahead of the industry’s curve — to you.