
Refactoring a Subsystem with AI: Clearing Tech Debt or Tearing Out Load-Bearing Walls

An engineer at an e-commerce company had been on the backend team for about a year when he took over a warehouse routing submodule in the order fulfillment system. The module wasn’t large, a few thousand lines of Python, but it had been running for years with patches stacked on patches. Which warehouse had stock, which one was closest to the user, which courier was cheaper, how to avoid warehouses that were over capacity: none of these constraints was declared explicitly. They were scattered across a dozen variable assignments and a chain of if-else blocks, held together by implicit conventions between functions. To find out why an order had been routed to a distant warehouse, you had to read from the entry point down to the bottom layer, changing direction at least three times along the way. His description: changing one parameter was like pulling a weed without knowing which underground pipe its roots were wrapped around.

He used AI to rewrite the submodule. Warehouse selection rules were declared explicitly as classes, the routing logic shrank from ninety lines to forty, the code was self-documenting, and every historical regression case passed. He didn’t touch anything outside the module boundary. The change was scoped and testable, and the ugliest part of the old code, the implicit reverse-inference block, was cleaned up. He submitted the PR. It was rejected.

The person who rejected it was a colleague who had been on the system for over two years. The reason fit in one sentence: the change is too big, you didn’t align with me on the design, and I can’t tell how to debug the new version when something goes wrong. The rejection note went on: you treated yourself as a consumer and this submodule as your product, but it runs inside a system I’m responsible for. When something breaks at 2 AM, the person who gets called isn’t you.

Neither of them was a villain or a fool, but both felt the other was being unreasonable. The younger engineer thought: I changed a well-scoped submodule, cleaned up the messiest part, ran every historical regression case without a single failure, and kept the boundary clear. Why is this not acceptable? The senior colleague thought: you took a tool that doesn’t know what happened behind these patches and replaced the internal logic of this submodule entirely. I wasn’t part of any discussion about what you changed, and then you dropped a several-thousand-line diff on me. Why would I sign off on that? His logic held up too: this code wasn’t reviewed by me and wasn’t designed by me, but it runs inside a system I’m responsible for, and when something breaks at 2 AM, the person who gets called is me, not you. My refusal is justified. He had another thought he didn’t say out loud: the code you can write in a day still takes the reviewer a day just to read. Before, a change like this took a week or two, and the team’s budget covered both writing and reviewing. Now you finish in a day, and the team receives one day’s worth of code carrying one or two weeks’ worth of cognitive debt.

The real sore point in this conflict isn’t code quality. What the senior colleague was actually pointing at was this: a design choice hadn’t reached consensus in the team yet, but it had already become several thousand lines of code. What he was trying to say was, I haven’t even had time to articulate what I care about, and you’ve already made the decision for me.

The Hard Part of Engineering Isn’t at the Keyboard

Anyone who has done engineering for a few years knows that the hard part of a system isn’t writing code. It’s deciding, among a set of mutually contradictory goals, which one to protect, which to sacrifice, and to what degree. In a scheduling system: throughput first or latency first? Conserve memory, or cache more to compute less? When a fault happens, do you protect the live service first or data integrity first? Take the same scheduling backbone: an e-commerce team wants maximum throughput and can tolerate occasionally losing a machine or two; a finance team wants every link traceable and would rather be slow than lose anything. One codebase can’t satisfy both preferences. These choices have no standard answer, only context.

AI helps you produce code, but it doesn’t know that your customer SLA was signed for a certain number of minutes, doesn’t know which metric your boss is most afraid of seeing turn red in the monthly review, doesn’t know that last year’s Double 11 full-chain avalanche was caused by a hidden assumption that couldn’t handle burst traffic. This knowledge isn’t in the code repository, it’s in people’s heads. AI can only read what you’ve already written down, and what you’ve written down usually doesn’t include this information. It’s not that you refuse to write it, it’s that these judgments are too granular, too rooted in specific scenarios. No one writes in a design document: “This module cannot lose tasks because we lost one last year and the VP called us out in the all-hands meeting.”

Old code, ugly as it is, may have an incident behind every seemingly redundant branch. It’s a storage format for the team’s collective memory, just not a very readable one. Before refactoring, someone has to translate that memory out and let the team confirm which parts are garbage and which parts are shields.

When Writing Code Was Slow, Slowness Had a Built-In Safeguard

Before, manual coding was slow. That slowness had a built-in benefit: it forced the team to stop and discuss in the middle. A senior engineer doing a major refactor wouldn’t write everything and then notify everyone. He would first write a design document, explaining why the change was needed, what to change, what not to change, and where the risks were. Everyone would argue it out, and only then would the keyboard work begin. The pace of code production was like walking on ice: before each step, you confirmed the surface could bear weight. This pace was annoying, but it guaranteed that every piece of ice had been stepped on, examined, argued over.

Now AI has pushed the cost of implementation to near zero. You no longer stop to confirm the ice is thick enough, because you can outrun the cracking. Code grows at something like ten times the old speed, and disputes turn into diffs at the same rate. But the work of design trade-offs hasn’t decreased at all; there’s just no natural occasion to discuss it anymore. The team’s speed of alignment on design assumptions hasn’t accelerated in step. A decision that used to take two weeks of team discussion can now be turned into several thousand lines of runnable code by one person plus AI over a weekend, then pushed up for everyone to discover in the PR.

The senior colleague’s anxiety has a clear source. A strong senior who joins and does a major refactor also produces unfamiliar code, but he has the credibility from having lived through those incidents, he communicated his intent in advance, he can explain the reasoning behind every trade-off, and he bears the consequences when things go wrong. The junior-plus-AI combination lacks this entire chain of credibility. The code quality might be fine, but the judgment of the person behind the code hasn’t been stress-tested by the team yet.

Half of Old Code Is Incident Memory

Behind this are two deeper problems.

The first is the absence of engineering design. The correctness of a core scheduling algorithm doesn’t depend solely on whether it passes historical cases. Some of the ugly branches in the old code are genuine technical debt: hardcoded shortcuts from a rushed release, experimental logic that never got cleaned up. But some of them are incident memory. That seemingly pointless timeout-retry logic exists because the year before last, a network jitter caused tasks to be dispatched twice, and every machine was instantly saturated that day. When AI reads this code, it sees an if statement. It doesn’t see the day behind that statement. More dangerously, during a refactor, AI might treat this seemingly redundant logic as unnecessary and optimize it away, and no regression test would catch it afterward. The trigger condition is a network anomaly that occurs once every three years, and you can’t write a test case for every anomaly. Tests verify what you know. They can’t verify what you don’t know.
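One way to stop such a branch from looking redundant is to write the incident next to the guard and pin it with a test once someone has explained it. The sketch below is a minimal illustration under stated assumptions, not the module from the story: `Dispatcher`, `dispatch`, `simulate_retry_after_timeout`, and the task ID are hypothetical names, and the incident comment paraphrases the duplicate-dispatch example above.

```python
from dataclasses import dataclass, field


@dataclass
class Dispatcher:
    """Hypothetical dispatcher that refuses to send the same task twice."""

    # IDs of tasks already handed to a worker. A real system would keep this
    # in shared storage; an in-memory set keeps the sketch self-contained.
    dispatched_ids: set = field(default_factory=set)
    sent: list = field(default_factory=list)

    def dispatch(self, task_id: str) -> bool:
        # Incident memory (illustrative): a network jitter once made the
        # caller time out and retry, so tasks went out twice and saturated
        # every machine. On the happy path this check looks redundant.
        if task_id in self.dispatched_ids:
            return False  # duplicate produced by a retry; drop it
        self.dispatched_ids.add(task_id)
        self.sent.append(task_id)
        return True


def simulate_retry_after_timeout(dispatcher: Dispatcher, task_id: str) -> list:
    """Replay the known failure: the same task arrives twice."""
    first = dispatcher.dispatch(task_id)
    second = dispatcher.dispatch(task_id)  # the retry after the timeout
    return [first, second]


d = Dispatcher()
results = simulate_retry_after_timeout(d, "task-42")
print(results, d.sent)
```

A test like this covers only the incident someone remembered to describe. The guard and its comment have to come out of a conversation with the people who were there; the regression suite cannot supply them.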

The code looks the way it does because that’s how it had to be written to block a known production issue. Explicitly modeling resources is the right direction, and building it out does have value. But the judgment of which old behaviors to preserve, which to intentionally change, and which to leave untouched for now because you can’t yet tell which category they fall into, that judgment can’t be outsourced to any person or tool that doesn’t know the context. This has nothing to do with AI capability and nothing to do with who wrote the code. Even the most senior person on the team doing this refactor would first need to sit down with the people who handled those incidents and talk through what happened before touching anything. The only difference is that the senior person knows they need to have those conversations, while the junior with AI assumes they don’t.
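Explicit modeling can also carry the preserve, change, or undecided judgment itself, so the open questions are visible before the review starts. The following is a hedged sketch rather than the engineer's actual rewrite: the `Warehouse` fields, the `Rule` class, the capacity threshold of 1.0, and the nearest-first choice are all assumptions made for illustration. Courier cost is left out to keep it short, and the rationale strings are placeholders for what the team would actually write.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Warehouse:
    name: str
    in_stock: bool
    distance_km: float
    load_ratio: float  # current load divided by capacity (assumed metric)


@dataclass(frozen=True)
class Rule:
    """A hard constraint plus the reason it exists."""
    name: str
    allows: Callable[[Warehouse], bool]
    rationale: str  # incident or design decision behind the rule
    status: str     # "preserved", "changed", or "undecided"


RULES = [
    Rule("has_stock", lambda w: w.in_stock,
         rationale="Cannot ship what is not there.",
         status="preserved"),
    Rule("not_over_capacity", lambda w: w.load_ratio < 1.0,
         rationale="Placeholder: to be filled in by whoever handled the incident.",
         status="undecided"),
]


def choose_warehouse(candidates: List[Warehouse]) -> Optional[Warehouse]:
    """Return the nearest warehouse that passes every rule, or None."""
    eligible = [w for w in candidates if all(r.allows(w) for r in RULES)]
    return min(eligible, key=lambda w: w.distance_km, default=None)


def open_questions() -> List[str]:
    """Names of rules the team has not yet agreed to keep or change."""
    return [r.name for r in RULES if r.status == "undecided"]


warehouses = [
    Warehouse("near_but_full", True, 5.0, 1.2),
    Warehouse("far_with_room", True, 40.0, 0.6),
    Warehouse("nearest_no_stock", False, 2.0, 0.1),
]
picked = choose_warehouse(warehouses)
print(picked.name if picked else None, open_questions())
```

The `status` and `rationale` fields are the part that matters here. A rule still marked undecided is a prompt for a conversation with the people who handled the incident, not something the author or the tool should settle alone.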

The second is the transfer of maintenance risk. A junior with AI can produce in a day the code that used to take a week, but the team’s capacity to review it, understand it, and take over debugging when things break hasn’t increased proportionally. You made an architectural change and captured the efficiency gain, but the downstream maintenance risk may fall on someone else. You are the one making the key design trade-offs, and the people bearing the long-term maintenance consequences may not include you. This isn’t a problem unique to AI. In any organization, the mismatch between who makes decisions and who bears the consequences is a source of friction. AI just amplifies the frequency and magnitude of this mismatch. Before, a major refactor took two weeks, and you’d do maybe two or three a year, each opening a large but infrequent risk window. Now you can push three in a week, each with a wider risk window than before, while your colleagues are still reviewing at the old pace, calibrated for changes that took two weeks to write.

The Design Review That Doesn’t Look at Code

How do you resolve this? Don’t try to prove that AI-written code is better than human-written code. What the other side is objecting to was never code quality in the first place. Proving code quality sidesteps the real problem.

Move the argument up one level. Organize a design review where nobody looks at code. Answer these questions first. Why is this system hard to change? What real damage has the old implicit logic caused? What does the new approach explicitly model, what does it still not model, and why stop there? Which old behaviors does the new version preserve, which does it change, and why those choices? When something goes wrong, how do you trace it, how do you roll back, and who is the first responder? These questions don’t care about indentation, naming, or function length. They care about trade-offs. Trade-offs are the answer the senior colleague has been waiting for.

Write this consensus into a document, then feed that document to AI and let it rewrite according to the design spec. The cost of AI writing code is low to begin with. What you need to protect is the team’s shared understanding of the system. Code can always be rewritten. Since AI makes writing code nearly free, even if the design gets overturned and you redo it from scratch, the extra loss is minimal. Treat this refactor as a prototype, as a learning experiment. The code you spent a day writing with AI got rejected, but it wasn’t wasted. It became the most persuasive prop for the next design review.

This also raises two questions that go further.

Can a junior lead design with AI? Yes, but acknowledge that this is an exoskeleton-assisted capability. AI expands the design space you can explore, and your job is to compress it into a clear judgment. You can explain why you chose option three, why not the other nine, and who bears what cost from this choice. If you can do that, you’re qualified to lead the design. If you can’t do that but you’ve already turned the choice into code, that’s why you got rejected. A designer with an exoskeleton is still a designer, but the exoskeleton itself is not a credential. There’s a trap that’s easy to miss: the process of discussing design with AI gives you a feeling of participation, making you think the conclusions are ones you arrived at yourself. In reality, AI is constantly filling in arguments for you, connecting causes and effects, and pruning branches in the conversation. What you’re doing is more like multiple choice than proof-writing. The mark of leading design isn’t that you picked a plan. It’s that you can, without AI present, independently explain why this plan holds, under what conditions it fails, and what to do when it fails. If you can’t articulate the trade-off reasoning without the conversation log, you’re not the one leading yet.

Do fundamentals still matter? Line-by-line handwriting matters less, but layer-by-layer understanding matters more. Algorithms and data structures have shifted from a construction capability to an evaluation capability. You don’t need to write complex data structures faster than AI, but you need to be able to tell whether its abstractions are excessive, whether complexity is out of control, whether lifecycles are dangerous, whether the scheduling strategy violates the problem’s own constraints. Fundamentals used to be whether you could write. Now they’re whether you can tell when you’re being snowed.

Back to the original scenario. The younger engineer rewrote the code with AI and got rejected, probably feeling wronged: I clearly made the system better. The senior colleague, when he rejected it, probably wasn’t comfortable either: he wasn’t guarding the old ways, he was guarding an incident pathway he knew how to walk. Two people looking at the same code, one seeing what could be better, the other seeing what couldn’t be allowed to get riskier. AI didn’t make either side wrong. It just compressed a disagreement that used to be naturally rate-limited by coding speed into a single weekend where it all burst at once. The solution isn’t in the code. It’s in the design consensus that the two of them haven’t sat down to discuss yet.