Today I used AI to complete a small task. The case turned out to be a particularly good illustration of the practical principles of working with AI, so I wrote this article to share it.
The task itself was to add a summary line to every article on this blog, which helps search engines understand the site's content and improves its ranking (SEO). It looks simple, but it is full of pitfalls: one careless move and the AI gets stuck in loops, produces unreliable results, or becomes cumbersome to use. Below I'll share the five key decisions I made to keep the whole workflow stable and reliable.
Decision 1: Use a Local Coding Agent, Not ChatGPT
The first decision I made was to use Cursor/OpenCode as the platform for the discussion, not ChatGPT. This isn't obvious: the project started with me wanting to do SEO for this website, and intuitively that feels like a chat-style task better suited to ChatGPT. But I insisted on OpenCode anyway. The fundamental reason is friction.
Specifically, the friction shows up in two places. The first is the friction of transferring context. With ChatGPT, I would need to copy and paste my blog's content or even its code into the chat, or have it write code to fetch the articles. In OpenCode, I just use @ to point at the folder where my blog lives, which is far less friction.
The second is the friction of implementation. Suppose the ChatGPT conversation concludes that the site needs a Summary metadata field. To act on that, I would have to paste several rounds of chat history into Cursor/OpenCode and then have another AI modify the articles. If the discussion happens in OpenCode from the start, the conclusion can be implemented the moment it is reached.
So I made this first decision: for almost all tasks, abandon chat-only AI environments and choose agentic environments that can execute. Why does this decision come first? Because it is the difference between the project existing and not existing. If friction is high, we simply won't bother to continue, and the project consumes time while delivering nothing, a pure waste. Only when friction is low enough for the project to keep moving is it meaningful to discuss specific methods and techniques.
Decision 2: Before Starting, Define Success and Provide Tests
The second decision I made was: before letting the AI generate any summaries, have it write a test first. The test does something very simple: it scans every .md file and checks whether it has a summary field. If anything less than 100% of the files have the field, the test fails and prints which files are missing it.
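As a rough illustration, such a coverage check could look like the sketch below. This is a minimal version under my own assumptions: the articles live under blog/content and use YAML front matter with a summary key; the real blog's layout may differ.

```python
from pathlib import Path
import re
import sys

CONTENT_DIR = Path("blog/content")  # assumed location of the articles


def has_summary(md_file: Path) -> bool:
    """Return True if the file's front matter contains a summary field."""
    text = md_file.read_text(encoding="utf-8")
    match = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not match:
        return False  # no front matter at all
    front_matter = match.group(1)
    # Case-insensitive, in case the field is written as Summary or summary.
    return re.search(r"^summary\s*:\s*\S+", front_matter,
                     re.MULTILINE | re.IGNORECASE) is not None


def main() -> None:
    files = sorted(CONTENT_DIR.rglob("*.md"))
    missing = [f for f in files if not has_summary(f)]
    print(f"summary coverage: {len(files) - len(missing)}/{len(files)}")
    if missing:
        print("files missing a summary:")
        for f in missing:
            print(f"  {f}")
        sys.exit(1)  # anything below 100% coverage is a failure


if __name__ == "__main__":
    main()
```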
Why write the test first? Because without it, when the AI says it's done, I have no way of knowing whether it really is. I can spot-check a few articles, but with over 300 of them, spot checks can't cover everything. The end state is that neither I nor the AI knows for sure; we're both engaged in wishful thinking.
With the test, things are different. After the AI finishes a round, a failing test tells it that, say, 20 articles are still uncovered, and it goes back and re-examines exactly those articles. When the test passes, the job is 100% complete. No manual spot-checking, no guessing: everything is certain.
This is the feedback loop we've been emphasizing. Many people using AI fall into a cycle of nudging it to act, finding the result wrong, nudging it again, and concluding that AI is hard to use. The root cause is the failure to establish a feedback mechanism: the AI doesn't know what "done" means, and you don't know how much of the work it has actually completed. That is the core problem to solve first, and a deterministic test is a very effective solution. In fact, once such a test is in place, the next three decisions are just icing on the cake.
So before starting any task, I ask myself: do I, or the AI, have a deterministic way to judge whether the task is complete? If not, build that mechanism first.
Decision 3: Let the Agent Do It, Instead of Writing Programs to Call APIs
The third decision: instead of writing a program that calls an LLM API to generate summaries, I let the coding agent do the work itself.
More detailed reasons are explained in this article. Having AI do summarization sounds like nothing more than calling an API, but think it through and the corner cases pile up: some articles already have summaries and shouldn't get a duplicate, some metadata formats are inconsistent, some fields sit in the wrong position. Write a program to handle all of this and the code becomes complex, debugging costs climb, and progress slows; in the end, much of the effort goes into tweaking how these details are handled.
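To make the contrast concrete, the API-calling route might look roughly like the sketch below. This is hypothetical: the OpenAI Python SDK, the model name, and the prompt are my own assumptions, and every corner case above would still need its own hand-written branch.

```python
from pathlib import Path
from openai import OpenAI  # assumes the OpenAI Python SDK is available

client = OpenAI()


def summarize(article_text: str) -> str:
    """Ask the model for a one-sentence SEO summary of an article."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[
            {"role": "system",
             "content": "Write a one-sentence SEO summary of this blog post."},
            {"role": "user", "content": article_text},
        ],
    )
    return response.choices[0].message.content.strip()


for md_file in Path("blog/content").rglob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    # In practice, each corner case needs its own explicit branch here:
    # skip files that already have a summary, normalize inconsistent
    # front-matter formats, insert the field at the right position, ...
    # which is exactly where the complexity creeps in.
    summary = summarize(text)
    print(f"{md_file}: {summary}")  # front-matter editing logic omitted
```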
The other approach is to assign the task to Cursor/OpenCode directly in natural language: "Go look at XX.md and make sure it has an SEO-oriented summary field in its metadata." Now the entity completing the task is not a mechanical program but an agent with real intelligence and adaptability. It handles situations on its own: skipping files that already have a summary, adjusting formats that are off, and using its own judgment when it hits special cases.
This is the difference between using AI as an agent and using AI as a tool. In the API-calling pattern, you write the program and the AI is just one component: high certainty, low flexibility, and actually slower when the situation gets complex. With agentic AI, certainty moves from the process to the outcome. You only need to state clearly what result you want; the AI figures out the rest with its own initiative and judgment.
So in my workflow, calling APIs is the last resort. Whatever can be handed to agents, I hand to agents.
Decision 4: Use Divide and Conquer to Handle Cognitive Saturation
The fourth decision: rather than assigning one agent all 300 articles at once, I had it open 8 sub-agents, distribute the work among them, and process the articles in parallel.
The reason is context-window saturation. If a single agent processes 300 articles in one go, it may be fine at first, but after a dozen or so articles the context window fills up, and it starts slacking off, skipping articles, or forgetting pitfalls it ran into earlier. Humans are similar: under high cognitive load we get forgetful or start cutting corners.
Another reason is that sub-agents are a natively supported feature of coding agents. I don't need to write concurrency logic, task distribution, or result aggregation myself. This plumbing work is all outsourced. I just need to describe the workflow in a sentence or two.
Many people who use AI don't notice this problem. They don't think about the model's weaknesses or anticipate the pitfalls; they just hand over the task in the most intuitive way. But just as a manager needs to know a subordinate's strengths and weaknesses, we need to recognize that an AI's cognitive resources are sharply limited: the context window is a scarce resource that has to be managed. When the workload is large, quality inevitably drops. So when there is a lot of work, I actively consider splitting it up rather than having one agent carry everything.
How this decision relates to the previous ones: Decision 2 ensures the result is correct (the test passes), Decision 3 ensures the process is flexible (the agent handles corner cases itself), and Decision 4 goes one step further by steering around a guaranteed pitfall, so the processing is both fast and good.
Decision 5: Ensure the Prompt Is Self-Contained and Outcome-Oriented
The fifth decision: when giving the AI instructions, state all the necessary information explicitly (don't expect it to read your mind), and emphasize the acceptance criteria rather than how to do each step.
My prompt was roughly this:
For each .md file under blog/content, write a summary field from an SEO perspective and put it in the metadata. You can use sub-agents for this. First look at a few articles to get a feel, then come up with a prompt and have different sub-agents process different articles. Open 8 agents to work in parallel, each responsible for writing summaries and editing the .md files directly. I also want a test that checks summary coverage; if coverage is below 100%, the test fails. Your goal is to get this test to 100% so that it passes.
Notice I didn't tell it specifically how to write this test program or how to handle various corner cases.
This is where many people get it backwards. When writing instructions for an AI, they spell out every step in detail. That treats the AI as a program and squanders an agent's own initiative. AI is not a yes-man that only follows orders; it has strong judgment and execution ability. We should leverage that initiative while giving it clear enough boundaries.
I'd summarize prompt writing with two principles. First, give enough context; don't expect the AI to read your mind. It doesn't know what the metadata structure looks like, so either supply that information directly or make sure the AI can discover it on its own (here, for example, we gave a concrete path, and it can work out the structure by reading the files). Second, start from outcomes, not processes: tell the AI what you want and let it figure out how. Unless you can predict that the lack of guidance on some point will cause trouble, as with the context-window issue above, there is no need to spell things out in such detail.
This decision and Decision 3 are two sides of the same coin: Decision 3 says hand execution to agents, Decision 5 says write instructions in a form suitable for agents.
Summary: AI Is Leverage
Finally, some thoughts on my experience.
Dictating the instructions to the AI by voice took me about two minutes. The AI then worked for 45 minutes: spinning up 8 sub-agents in parallel, handling various edge cases, writing the test, reworking, getting the test to pass, and committing. I didn't manage it at all during that time. This is leverage: two minutes of my time buying 45 minutes of AI work. Put differently, spending roughly 5% of the time to control 100% of the engineering output.
And today's agentic AI is capable enough to work autonomously for long stretches. We don't need to watch it work. As long as we clearly state the deliverable and the acceptance criteria, we can go do something else. This opens up a new possibility: a scalable agentic workflow. For example, spend two minutes to set Agent A in motion and keep it busy for 45 minutes; during that time, go brief Agents B, C, D... launching multiple AIs to run in parallel. The cognitive load is real, but this is a practical path to 10x productivity on top of a single-agent workflow.
That's the 10x-productivity side. The flip side of this project is what happens when you have the awareness to use AI but pick the wrong methods: discussing in ChatGPT, having no testing mechanism, letting one AI handle everything. Get these decisions wrong and you might struggle for hours to finish, or get stuck in endless loops and never finish at all. Same task, even the same LLM: the difference between knowing how to use it and not, between good decisions and bad, is the difference between working with composure and struggling for nothing, possibly even slower than doing it by hand.