<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Computing Life - Computing</title><link href="https://yage.ai/" rel="alternate"/><link href="https://yage.ai/feeds/computing.atom.xml" rel="self"/><id>https://yage.ai/</id><updated>2026-03-30T22:00:00-07:00</updated><entry><title>一行代码的事，Web 为什么做了三十年还没做到</title><link href="https://yage.ai/web-layout-tradeoff.html" rel="alternate"/><published>2026-03-30T22:00:00-07:00</published><updated>2026-03-30T22:00:00-07:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-30:/web-layout-tradeoff.html</id><summary type="html">&lt;p&gt;在iOS上查询排版结果只需一行代码，Web上需要触发整个页面的重新布局。这不是因为浏览器工程师蠢，而是CSS在1994年做了一个声明式的架构选择。这个选择的天花板更高，但代价是中间状态不可查询。Facebook在2012年因为不理解这个trade-off付出了数亿美元的代价。SwiftUI和Jetpack Compose证明了声明式和可观测可以共存，关键在于分层。这个教训适用于所有系统设计：好的抽象让你选择在哪一层工作，坏的抽象把所有层粘在一起让你没得选。&lt;/p&gt;</summary><content type="html">&lt;p&gt;你大概率曾经注意到，在手机上用原生 app 刷信息流的时候，滚动通常很顺滑。但在浏览器里打开一个内容密集的网页，快速滑动时经常会看到内容跳动、布局闪烁、空白区域先出现再被填上内容。这不是网速的问题。很大一部分原因是，浏览器在计算"每段内容排完版占多大空间"这件事上，比原生平台慢了一个数量级。&lt;/p&gt;
&lt;p&gt;&lt;a href="https://yage.ai/share/pretext-short-term-overrated-long-term-underrated-20260330.html"&gt;Pretext&lt;/a&gt; 最近在技术社区很火，做的就是这件事：预测一段文字放进容器后会占多少空间。一位叫马工的读者看完后问了一个很好的问题：这不应该是最基本的功能吗？为什么到 2026 年了还需要一个第三方库来解决？&lt;/p&gt;
&lt;p&gt;我顺着这个问题往下挖，发现它触及的东西远比前端技术选型深。Facebook 在 2012 年因为没理解它背后的 trade-off，付出了重写整个移动端的代价。这不只关乎前端工程师。它是一个关乎系统设计哲学的判断：你在做抽象的时候，该把什么藏起来，该把什么留在外面。&lt;/p&gt;
&lt;h2&gt;在其他平台上，这真的就是一行代码&lt;/h2&gt;
&lt;p&gt;先验证马工的直觉。在 iOS 上，你想知道一段文字在给定宽度下排完版占多高，调用 &lt;code&gt;sizeThatFits&lt;/code&gt; 就行。一行代码，立刻返回，甚至不需要把这段文字放进任何界面。Android 上用 &lt;code&gt;StaticLayout&lt;/code&gt;，Qt 用 &lt;code&gt;QFontMetrics&lt;/code&gt;，Flutter 用 &lt;code&gt;TextPainter&lt;/code&gt;，都是同一个模式：排版引擎是一个独立的计算模块，你给它输入，它给你输出，不触发任何全局操作。&lt;/p&gt;
&lt;p&gt;这就是为什么原生 app 滚动长列表的时候通常很流畅：系统可以提前算好每条内容的高度，精确地知道哪些内容即将进入屏幕、该为它们预留多少空间。&lt;/p&gt;
&lt;p&gt;Web 上做不到这件事。在浏览器里想知道一段内容排完版多高，你得把它真的放进页面，然后触发一次 Reflow，也就是浏览器重新计算页面中所有受影响元素的位置和大小。这个计算是同步阻塞的，而且范围不是只算你问的那个元素，而是可能波及整棵布局树。一个内容密集的信息流页面，窗口大小一变，每条内容都要重算一遍，每次重算都牵动整个页面。这就是为什么 Web 版的信息流、聊天界面、电商列表在滚动和窗口缩放时，体验通常比原生 app 更差。&lt;/p&gt;
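&lt;p&gt;为了把这个对比说得更具体，下面是一个极简的示意（TypeScript，浏览器环境，函数名和变量名都是示意性的）：在今天的 Web 上想知道一段内容排完版有多高，只能把一个探针元素挂进文档，再读取它的布局属性，而正是这次读取强制触发了上面说的同步 Reflow。&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// 极简示意：在 Web 上测量一段内容在给定宽度下排完版的高度
function measureTextHeight(html: string, width: number): number {
  // 探针元素必须真的挂进文档，否则浏览器不会为它执行布局
  const probe = document.createElement("div");
  probe.style.width = `${width}px`;
  probe.style.position = "absolute";   // 脱离正常文档流
  probe.style.visibility = "hidden";   // 不可见，但仍会参与布局
  probe.innerHTML = html;
  document.body.appendChild(probe);

  // 读取 offsetHeight 会强制浏览器同步执行一次布局（Reflow），
  // 计算范围可能波及所有会影响这个元素的内容
  const height = probe.offsetHeight;

  document.body.removeChild(probe);
  return height;
}&lt;/code&gt;&lt;/pre&gt;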
&lt;p&gt;马工说得对：这确实应该是最基本的功能。而且在 iOS、Android、Qt、Flutter 上，它真的就是最基本的功能。Web 是主流 UI 平台中唯一一个做不到这件事的。&lt;/p&gt;
&lt;p&gt;但 Web 工程师不可能三十年都没想到。这里一定有一个 trade-off。而且理解这个 trade-off 很重要。&lt;/p&gt;
&lt;h2&gt;CSS 的选择不是因为蠢&lt;/h2&gt;
&lt;p&gt;构建界面有两种方式。&lt;/p&gt;
&lt;p&gt;一种是你告诉系统每个东西放哪里。早期 iOS 开发就是这样，你手动计算每个元素的坐标和尺寸。系统严格执行，你对每一步都有精确理解。在这种方式下，"排完版了多高"是一个自然的中间数据，因为你本来就要用它来决定下一个元素放在哪。&lt;/p&gt;
&lt;p&gt;另一种是你描述你想要什么效果，让系统自己算。CSS 就是这样。你写 &lt;code&gt;display: flex; flex-wrap: wrap; gap: 16px&lt;/code&gt;，不管屏幕是 320px 还是 1920px，浏览器自己决定每行放几个元素、怎么分配空间。你不控制过程，你描述意图。&lt;/p&gt;
&lt;p&gt;后者的天花板更高。LaTeX 是更极端的例子。它的断行算法把整个段落当作一个优化问题来求解，考虑所有可能的断行方案，选一个让全段行间松紧度最均匀的。它可能会让前面几行排得稍松，来避免后面某一行出现难看的大空隙。你用逐行手动排版写不出这种效果，因为排第三行的时候你不知道第七行会怎样。只有一个能看到全局的系统才能做这种优化。&lt;/p&gt;
&lt;p&gt;CSS 的响应式布局也是同一个逻辑。有一位做过 WPF 和 XAML 的开发者在 Hacker News 上说，用 CSS Flexbox/Grid 做自适应布局比很多原生桌面框架的开发效率更高，因为你在描述意图而不是在编写实现。&lt;/p&gt;
&lt;p&gt;但代价是什么？代价是你没法问系统"你算出来的结果是什么"。在 CSS 的世界里，一个元素的最终大小取决于它周围的所有元素。浮动、定位、行内格式化上下文、margin collapse，每一条规则都在强化同一个事实：局部结果是全局求解出来的，不存在脱离上下文的独立答案。1994 年 CSS 的第一份提案就确立了这个架构：信息单向流动，开发者声明规则，浏览器执行排版，但浏览器不反馈中间结果。&lt;/p&gt;
&lt;p&gt;在文档排版的场景下，这完全合理。你不需要知道段落精确多高，你只要声明样式，浏览器负责呈现。LaTeX 用户也一样。你不控制图片出现在哪一页，你告诉 TeX 你的偏好，它自己决定最优位置。&lt;/p&gt;
&lt;p&gt;不理解这个 trade-off 的代价可能高达上亿美元。2012 年，Facebook 用 HTML5 构建了整个移动端。iOS 和 Android 的 app 本质上是对 WebView 的封装。选择 HTML5 确实直观：write once, run everywhere，服务端推送更新不需要用户下载新版本。但最终大家发现了一个致命的问题：性能扛不住。Facebook 工程师&lt;a href="https://www.infoq.com/news/2012/09/Facebook-HTML5-Native/"&gt;事后复盘&lt;/a&gt;的时候列举了一系列问题：滚动帧率不稳定、UI 线程卡顿、设备资源耗尽导致崩溃。DOM 的全局 Reflow 是核心瓶颈之一：每次内容更新都可能触发整棵布局树的重新计算，News Feed 那种长列表加大量图片的场景直接把它推到了极限。&lt;/p&gt;
&lt;p&gt;最终 Facebook 花了 9 个月从头重写原生 iOS 应用。启动时间从约 10 秒降到约 4 秒，News Feed 加载速度提升一倍。Zuckerberg 在 TechCrunch Disrupt 上说：&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The biggest mistake that we made as a company was betting too much on HTML5 as opposed to native.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;后续的发展也很值得讨论。HTML5 失败直接催生了 React（2013），React 的声明式思路被验证后催生了 React Native（2015），React Native 的布局引擎 Yoga 在设计上明确优化了排版查询。&lt;a href="https://engineering.fb.com/2016/12/07/android/yoga-a-cross-platform-layout-engine/"&gt;Facebook 工程博客&lt;/a&gt;写道，Yoga 确保文本视图"只被测量尽可能少的次数，理想情况下只测量一次"，并且把布局计算放到了独立线程上，彻底绕开了 DOM Reflow 中 JavaScript 和布局互相阻塞的问题。&lt;/p&gt;
&lt;p&gt;这每一步都在逃离 CSS 布局架构的限制。因此，理解 CSS 为什么做了这个 trade-off，以及这个 trade-off 在什么场景下会崩溃，是一个价值数亿美元的判断。&lt;/p&gt;
&lt;h2&gt;但这个 trade-off 不是必须的&lt;/h2&gt;
&lt;p&gt;到这里，故事听起来像一个非此即彼的选择：要么选"系统替你决定"拿到更好的排版效果，代价是看不到中间结果；要么选"你自己控制一切"拿到精确可查询性，代价是失去全局优化。&lt;/p&gt;
&lt;p&gt;但事实上，这个 trade-off 是可以打破的。&lt;/p&gt;
&lt;p&gt;SwiftUI（Apple 2019 年推出）和 Jetpack Compose（Google 2021 年推出）都是声明式 UI 框架。你用 SwiftUI 写响应式布局的方式和用 CSS 写 Flexbox 在概念上非常接近：描述意图，系统自行决定布局。但它们都没有 CSS 的问题。&lt;/p&gt;
&lt;p&gt;内里的原因在于架构分层。原生平台的设计是：底层有一个完全独立的排版引擎（iOS 的 Core Text，Android 的 StaticLayout），上层是声明式框架。声明式框架调用底层引擎来完成布局，但应用代码也可以随时穿透声明式抽象，直接调用底层引擎查询排版结果。Jetpack Compose 甚至提供了官方的 &lt;code&gt;TextMeasurer&lt;/code&gt; API，可以在不触发任何实际渲染的情况下拿到完整的排版信息，包括宽高、行数、每个字符的位置。&lt;/p&gt;
&lt;p&gt;这证明了一个重要的事情：声明式和可观测并不矛盾。你可以让系统替你做全局优化，同时保留一条独立的通道让你查询中间结果。关键是排版引擎要被设计成一个独立模块，而不是焊死在全局布局管线里。&lt;/p&gt;
&lt;p&gt;CSS 的问题不是它选择了声明式。问题是它在做抽象的时候，把排版引擎也封进了全局布局流程，没有留一个独立的查询接口。（叠甲：&lt;code&gt;canvas.measureText()&lt;/code&gt; 只能处理单行文本，不支持换行。处理换行的那套逻辑被锁在了布局引擎内部，从未作为独立接口暴露。）&lt;/p&gt;
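&lt;p&gt;这个缺口具体在哪，可以用一小段示意代码说明（TypeScript，浏览器环境，用到的都是真实存在的 DOM API，注释标出的是缺失的部分，而不是在提议新接口）：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// canvas.measureText 能做到的：单行文本在某个字体下的宽度
const canvas = document.createElement("canvas");
const ctx = canvas.getContext("2d")!;
ctx.font = "16px system-ui";
const singleLineWidth = ctx.measureText("hello layout world").width;

// 做不到的：没有任何内置 API 接受同样的文本加一个容器宽度，
// 返回断行位置或者总高度。这套换行逻辑被锁在布局引擎内部，
// 第三方库只能基于逐段宽度自己重建换行。&lt;/code&gt;&lt;/pre&gt;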
&lt;p&gt;1994 年设计 CSS 的时候，Web 是一个学术文档交换系统，没人预见到它会成为应用平台。在文档场景下，你确实不需要独立查询排版结果，所以没人觉得需要把排版引擎设计成可以单独调用的模块。这个决策在当时完全合理。但三十年后，它成了一笔巨大的技术债。&lt;/p&gt;
&lt;p&gt;W3C 也对这个问题心知肚明。CSS Houdini 的 Font Metrics API 就是为了解决它，但截至 2026 年仍停留在提案阶段，没有浏览器实现。Pretext 用 4000 行用户态代码补了这个缺口。它在 Web 上重建了一个 iOS 和 Android 早就作为基础设施提供的能力。它的存在本身就是 CSS 缺失了一层抽象的证据。&lt;/p&gt;
&lt;h2&gt;这个教训比前端大得多&lt;/h2&gt;
&lt;p&gt;回头看，CSS 和原生平台都选择了声明式布局，但原生平台保留了排版引擎作为独立的可查询层，CSS 没有。一个在 1990 年代做出的、当时看起来无关紧要的分层决策，在三十年后导致了完全不同的开发体验，完全不同的性能特征，和一家公司数亿美元的战略代价。&lt;/p&gt;
&lt;p&gt;这个模式在前端之外同样存在。&lt;/p&gt;
&lt;p&gt;传统 API 设计遵循的哲学和 CSS 一样：隐藏复杂性，保护用户不需要看到中间状态。捕获底层错误后抛出一个抽象的高层异常，把实现细节封装在干净的接口后面。这在用户是人类的时候没问题。人类的认知带宽有限，好的抽象帮他们聚焦。&lt;/p&gt;
&lt;p&gt;但当系统的用户变成需要做决策的代码或 AI 的时候，这种保护性抽象就成了障碍。AI 的有效性依赖于&lt;a href="https://yage.ai/result-certainty.html"&gt;尝试-反馈-修正的循环&lt;/a&gt;。一个模糊的"操作失败，请稍后再试"会直接中断这个循环，和 CSS 不告诉你排版结果是同一个问题。AI 需要的是&lt;a href="https://yage.ai/ai-software-engineering.html"&gt;精细原始的反馈、细粒度的控制接口、足够详尽的中间状态&lt;/a&gt;。前端排版三十年的挣扎和 AI 工程今天遇到的问题，根源是同一个设计哲学在不同时代的碰壁。&lt;/p&gt;
&lt;p&gt;解法也是类似的。Pretext 在 CSS 的声明式渲染旁边建了一条独立的可查询通道，不替代浏览器的排版，只是让你能观测到排版结果。Agentic loop 对 AI 工作流做了同样的事：不规定 AI 每一步怎么做，但让它能观测到自己行动的结果，然后自主决定下一步。两者都没有拆掉抽象，而是在抽象上开了观测窗口。&lt;/p&gt;
&lt;p&gt;马工说"前端很多问题都是因为缺乏合适的架构造成的"。这个方向是对的，但可以更具体精确：问题不在于 CSS 选择了声明式，声明式的天花板确实更高。问题在于 CSS 在做抽象的时候，把不该藏的东西也藏了。好的抽象让你选择在哪一层工作，坏的抽象把所有层粘在一起让你没得选。&lt;/p&gt;
&lt;p&gt;下次你设计一个系统、做一层抽象的时候，值得问自己一个问题：你有没有把使用者未来需要观测的中间状态封进了黑盒？这个决策今天可能看起来干净优雅。但 CSS 用三十年证明了，这可能是一笔到期时间很长、利息很高的技术债。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"/><category term="Chinese"/><category term="Frontend"/><category term="System Design"/></entry><entry><title>One Line of Code on Every Other Platform. Why Can't the Web Do It After 30 Years?</title><link href="https://yage.ai/web-layout-tradeoff-en.html" rel="alternate"/><published>2026-03-30T21:00:00-07:00</published><updated>2026-03-30T21:00:00-07:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-30:/web-layout-tradeoff-en.html</id><summary type="html">&lt;p&gt;Querying layout results takes one line of code on iOS, Android, Qt, and Flutter. On the web, it requires triggering a full-page reflow. This isn't because browser engineers are incompetent. CSS made a deliberate architectural choice in 1994 toward declarative layout, which has a higher ceiling but hides intermediate state. Facebook paid hundreds of millions of dollars in 2012 for not understanding this trade-off. SwiftUI and Jetpack Compose prove that declarative and observable can coexist through proper layering. The lesson applies to all system design: good abstractions let you choose which layer to work at; bad abstractions glue all layers together and leave you no choice.&lt;/p&gt;</summary><content type="html">&lt;p&gt;You've almost certainly noticed this: scrolling through a feed in a native mobile app is usually smooth. But open a content-heavy webpage in a browser, and you'll often see content jumping, layout flickering, blank areas appearing before being filled in. This isn't a network speed issue. A large part of the reason is that browsers are an order of magnitude slower than native platforms at computing how much space each piece of content occupies after layout.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://yage.ai/share/pretext-short-term-overrated-long-term-underrated-20260330.html"&gt;Pretext&lt;/a&gt; recently went viral in the tech community, and it does exactly this: predicting how much space text will occupy inside a container. A reader named Ma Gong asked a very good question after reading about it: shouldn't this be the most basic functionality? Why, in 2026, do we still need a third-party library to solve this?&lt;/p&gt;
&lt;p&gt;I followed this question down the rabbit hole and found that it touches something far deeper than frontend technology choices. In 2012, Facebook paid the price of rewriting their entire mobile stack because they didn't understand the trade-off behind it. This isn't just a concern for frontend engineers. It's a judgment call about system design philosophy: when you build an abstraction, what should you hide, and what should you leave exposed?&lt;/p&gt;
&lt;h2&gt;On Every Other Platform, It Really Is One Line of Code&lt;/h2&gt;
&lt;p&gt;Let's first verify Ma Gong's intuition. On iOS, if you want to know how tall a piece of text will be when laid out at a given width, you call &lt;code&gt;sizeThatFits&lt;/code&gt;. One line of code, returns instantly, doesn't even require putting the text into any view hierarchy. Android uses &lt;code&gt;StaticLayout&lt;/code&gt;, Qt uses &lt;code&gt;QFontMetrics&lt;/code&gt;, Flutter uses &lt;code&gt;TextPainter&lt;/code&gt;. Same pattern everywhere: the layout engine is an independent computation module. You give it input, it gives you output, no global operations triggered.&lt;/p&gt;
&lt;p&gt;This is why native apps scroll through long lists so smoothly: the system can pre-calculate the height of each content item, knowing precisely which items are about to enter the screen and how much space to reserve for them.&lt;/p&gt;
&lt;p&gt;The web can't do this. In a browser, to find out how tall a piece of content is after layout, you have to actually insert it into the page, then trigger a Reflow, which means the browser recalculates the position and size of all affected elements. This calculation is synchronous and blocking, and its scope isn't limited to just the element you asked about. It can cascade through the entire layout tree. In a content-heavy feed page, when the window size changes, every item needs to be recalculated, and each recalculation ripples across the whole page. This is why web-based feeds, chat interfaces, and e-commerce listings typically feel rougher than their native counterparts during scrolling and window resizing.&lt;/p&gt;
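&lt;p&gt;To make the contrast concrete, here is a minimal sketch (TypeScript, browser environment; the function and variable names are illustrative) of what "asking for a height" looks like on the web today: you mount a probe element into the document and read a layout property, and that read is exactly what forces the synchronous reflow described above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// A minimal sketch: measuring how tall some content is at a given width on the web
function measureTextHeight(html: string, width: number): number {
  // The probe has to live in the real document, otherwise layout never runs for it
  const probe = document.createElement("div");
  probe.style.width = `${width}px`;
  probe.style.position = "absolute";   // keep it out of normal flow
  probe.style.visibility = "hidden";   // invisible, but still laid out
  probe.innerHTML = html;
  document.body.appendChild(probe);

  // Reading offsetHeight forces the browser to run a synchronous layout pass (reflow)
  // whose scope can extend to everything that might affect this element
  const height = probe.offsetHeight;

  document.body.removeChild(probe);
  return height;
}&lt;/code&gt;&lt;/pre&gt;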
&lt;p&gt;Ma Gong is right: this really should be the most basic functionality. And on iOS, Android, Qt, and Flutter, it truly is the most basic functionality. The web is the only major UI platform that can't do this.&lt;/p&gt;
&lt;p&gt;But web engineers couldn't possibly have overlooked this for thirty years. There must be a trade-off here. And understanding this trade-off matters a lot.&lt;/p&gt;
&lt;h2&gt;CSS's Choice Wasn't Stupid&lt;/h2&gt;
&lt;p&gt;There are two ways to build a user interface.&lt;/p&gt;
&lt;p&gt;One is to tell the system where to put everything. Early iOS development worked this way: you manually calculated coordinates and dimensions for each element. The system executed exactly what you specified, and you had precise understanding of every step. In this model, "how tall is this after layout" is naturally available as intermediate data, because you need that number to decide where the next element goes.&lt;/p&gt;
&lt;p&gt;The other is to describe what you want and let the system figure out the rest. CSS works this way. You write &lt;code&gt;display: flex; flex-wrap: wrap; gap: 16px&lt;/code&gt;, and whether the screen is 320px or 1920px wide, the browser decides how many items per row and how to distribute space. You don't control the process. You describe intent.&lt;/p&gt;
&lt;p&gt;The latter has a higher ceiling. LaTeX is an even more extreme example. Its line-breaking algorithm treats an entire paragraph as an optimization problem, considering all possible break points and selecting the one that produces the most uniform spacing across the whole paragraph. It might make earlier lines slightly looser to avoid an ugly gap in a later line. You can't achieve this effect with line-by-line manual typesetting, because when you're setting line three, you don't know what will happen at line seven. Only a system with a global view can make that kind of optimization.&lt;/p&gt;
&lt;p&gt;CSS's responsive layout follows the same logic. A developer experienced with WPF and XAML commented on Hacker News that building adaptive layouts with CSS Flexbox/Grid is more efficient in terms of development time than many native desktop frameworks, because you're describing intent rather than writing implementation.&lt;/p&gt;
&lt;p&gt;But what's the cost? The cost is that you can't ask the system "what did you calculate?" In CSS's world, an element's final size depends on all the elements around it. Floats, positioning, inline formatting contexts, margin collapse: every rule reinforces the same fact. Local results are solved globally. There is no context-free independent answer. CSS's very first proposal in 1994 established this architecture: information flows one way. Developers declare rules, the browser executes layout, but the browser doesn't report intermediate results back.&lt;/p&gt;
&lt;p&gt;For document typesetting, this makes perfect sense. You don't need to know a paragraph's exact height. You just declare styles, and the browser handles presentation. LaTeX users experience the same thing. You don't control which page a figure appears on. You tell TeX your preference, and it decides the optimal placement.&lt;/p&gt;
&lt;p&gt;The cost of not understanding this trade-off can run into hundreds of millions of dollars. In 2012, Facebook built its entire mobile experience on HTML5. Both the iOS and Android apps were essentially wrappers around WebViews. The rationale for choosing HTML5 was intuitive: write once, run everywhere; push updates from the server without requiring users to download new versions. But ultimately, performance couldn't keep up. Facebook engineers &lt;a href="https://www.infoq.com/news/2012/09/Facebook-HTML5-Native/"&gt;documented the issues in a postmortem&lt;/a&gt;: inconsistent scroll frame rates, UI thread stutter, device resource exhaustion causing crashes. Global DOM Reflow was one of the core bottlenecks: every content update could trigger recalculation of the entire layout tree, and News Feed's long lists with heavy images pushed it to the breaking point.&lt;/p&gt;
&lt;p&gt;Facebook ultimately spent nine months rewriting the native iOS app from scratch. Launch time dropped from roughly 10 seconds to roughly 4 seconds. News Feed loading speed doubled. Zuckerberg said at TechCrunch Disrupt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The biggest mistake that we made as a company was betting too much on HTML5 as opposed to native.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The subsequent chain of events is equally telling. The HTML5 failure directly led to React (2013). Once React's declarative approach was validated, it led to React Native (2015). React Native's layout engine Yoga was explicitly designed to optimize layout queries. &lt;a href="https://engineering.fb.com/2016/12/07/android/yoga-a-cross-platform-layout-engine/"&gt;Facebook's engineering blog&lt;/a&gt; noted that Yoga ensures text views are "measured as few times as possible, ideally just once," and moves layout computation to a separate thread, completely sidestepping the mutual blocking between JavaScript and layout in DOM Reflow.&lt;/p&gt;
&lt;p&gt;Every step in this chain was an escape from CSS's layout architecture constraints. Understanding why CSS made this trade-off, and under what conditions it breaks down, is a judgment call worth hundreds of millions of dollars.&lt;/p&gt;
&lt;h2&gt;But This Trade-off Isn't Inevitable&lt;/h2&gt;
&lt;p&gt;At this point, the story sounds like a binary choice: either pick "the system decides for you" and get better layout quality at the cost of invisible intermediate results, or pick "you control everything" and get precise queryability at the cost of losing global optimization.&lt;/p&gt;
&lt;p&gt;But in fact, this trade-off can be broken.&lt;/p&gt;
&lt;p&gt;SwiftUI (Apple, 2019) and Jetpack Compose (Google, 2021) are both declarative UI frameworks. Writing responsive layouts in SwiftUI is conceptually very similar to writing Flexbox in CSS: describe intent, let the system decide layout. But neither of them has CSS's problem.&lt;/p&gt;
&lt;p&gt;The underlying reason is architectural layering. Native platforms are designed with a completely independent layout engine at the bottom layer (Core Text on iOS, StaticLayout on Android), and declarative frameworks on top. The declarative framework calls the underlying engine to perform layout, but application code can also pierce through the declarative abstraction at any time and query the underlying engine directly. Jetpack Compose even provides an official &lt;code&gt;TextMeasurer&lt;/code&gt; API that returns complete layout information, including dimensions, line count, and character positions, without triggering any actual rendering.&lt;/p&gt;
&lt;p&gt;This proves something important: declarative and observable are not mutually exclusive. You can let the system perform global optimization while maintaining an independent channel for querying intermediate results. The key is that the layout engine must be designed as an independent module, not welded into the global layout pipeline.&lt;/p&gt;
&lt;p&gt;CSS's problem isn't that it chose to be declarative. The problem is that when building its abstraction, it also sealed the layout engine inside the global layout process without leaving an independent query interface. (Caveat: &lt;code&gt;canvas.measureText()&lt;/code&gt; only handles single-line text and doesn't support line wrapping. The line-wrapping logic is locked inside the layout engine and has never been exposed as a standalone interface.)&lt;/p&gt;
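&lt;p&gt;To pin down exactly where that gap sits, here is a small sketch (TypeScript, browser environment; the identifiers are real DOM APIs, and the comments mark what is missing rather than proposing a new interface):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// What canvas.measureText can do: the width of a single run of text in one font
const canvas = document.createElement("canvas");
const ctx = canvas.getContext("2d")!;
ctx.font = "16px system-ui";
const singleLineWidth = ctx.measureText("hello layout world").width;

// What it cannot do: nothing built-in takes the same text plus a container width
// and returns line breaks or total height. That wrapping logic is locked inside
// the layout engine, so libraries end up rebuilding it on top of per-run widths.&lt;/code&gt;&lt;/pre&gt;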
&lt;p&gt;When CSS was designed in 1994, the web was an academic document exchange system. Nobody foresaw it becoming an application platform. In a document context, you genuinely don't need to independently query layout results, so nobody thought to design the layout engine as a separately callable module. That decision was perfectly reasonable at the time. But thirty years later, it became a massive piece of technical debt.&lt;/p&gt;
&lt;p&gt;The W3C is well aware of this problem. CSS Houdini's Font Metrics API was designed to address it, but as of 2026, it remains at the proposal stage with no browser implementation. Pretext filled this gap with 4,000 lines of userland code. It rebuilt on the web a capability that iOS and Android have long provided as infrastructure. Its very existence is evidence that CSS is missing an abstraction layer.&lt;/p&gt;
&lt;h2&gt;This Lesson Extends Far Beyond Frontend&lt;/h2&gt;
&lt;p&gt;Looking back, both CSS and native platforms chose declarative layout, but native platforms preserved the layout engine as an independent, queryable layer. CSS did not. A layering decision made in the 1990s, seemingly inconsequential at the time, led thirty years later to fundamentally different developer experiences, fundamentally different performance characteristics, and a company's strategic mistake costing hundreds of millions of dollars.&lt;/p&gt;
&lt;p&gt;This pattern exists well beyond frontend.&lt;/p&gt;
&lt;p&gt;Traditional API design follows the same philosophy as CSS: hide complexity, shield users from needing to see intermediate state. Catch underlying errors and throw an abstract high-level exception. Wrap implementation details behind clean interfaces. This works fine when the user is human. Humans have limited cognitive bandwidth, and good abstractions help them focus.&lt;/p&gt;
&lt;p&gt;But when the system's user becomes code or AI that needs to make decisions, this protective abstraction becomes an obstacle. AI's effectiveness depends on a &lt;a href="https://yage.ai/result-certainty-en.html"&gt;try-observe-correct loop&lt;/a&gt;. A vague "operation failed, please try again later" directly breaks this loop, which is the same problem as CSS not telling you layout results. What AI needs is &lt;a href="https://yage.ai/ai-software-engineering-en.html"&gt;fine-grained raw feedback, granular control interfaces, and sufficiently detailed intermediate state&lt;/a&gt;. The thirty-year struggle of frontend layout and the problems AI engineering faces today share the same root: the same design philosophy hitting a wall in different eras.&lt;/p&gt;
&lt;p&gt;The solutions are also structurally similar. Pretext built an independent queryable channel alongside CSS's declarative rendering, not replacing the browser's layout but letting you observe layout results. Agentic loops do the same thing for AI workflows: they don't prescribe what AI should do at each step, but they let it observe the results of its own actions and autonomously decide what to do next. Neither tears down the abstraction. Both open observation windows on top of it.&lt;/p&gt;
&lt;p&gt;Ma Gong said "many frontend problems come from the lack of a proper architecture." The direction is right, but it can be made more precise: the problem isn't that CSS chose to be declarative. Declarative genuinely has a higher ceiling. The problem is that CSS, when building its abstraction, also hid things that shouldn't have been hidden. Good abstractions let you choose which layer to work at. Bad abstractions glue all layers together and leave you no choice.&lt;/p&gt;
&lt;p&gt;Next time you design a system or build an abstraction layer, it's worth asking yourself: are you sealing intermediate state that your users will someday need to observe into a black box? That decision might look clean and elegant today. But CSS has spent thirty years proving that it can be a very long-dated, very high-interest piece of technical debt.&lt;/p&gt;</content><category term="Computing"/><category term="English"/><category term="Frontend"/><category term="System Design"/></entry><entry><title>为什么AI只会说正确的废话，以及怎么把它逼出舒适区</title><link href="https://yage.ai/context-infrastructure.html" rel="alternate"/><published>2026-03-15T22:00:00-07:00</published><updated>2026-03-15T22:00:00-07:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-15:/context-infrastructure.html</id><summary type="html">&lt;p&gt;LLM的默认输出是consensus：正确但平庸。Deep Research其实是Wide Research。我们找到了一种系统性方法，用个人认知上下文把LLM从consensus里强行扯出来。一年实验，有控制变量证据。&lt;/p&gt;</summary><content type="html">&lt;p&gt;三周前，OpenAI发了一篇关于Harness Engineering的博文。我和社区里一个很厉害的朋友老王各自让自己的Agent做了一份深度调研。我们特意用了同档次的LLM（Claude Opus 4.6 vs GPT-5.4），同一个调研skill，同一个搜索工具Tavily，同一个agentic后端OpenCode，甚至同样的prompt，唯一不同的是两个Agent背后的context：我的Agent接入了我过去一年积累的判断框架和认知体系，老王的Agent没有。结果两边的AI给出了风格完全不同的分析。拿OpenAI和Cursor在harness架构上的收敛来举例：&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;第一种报告（行动建议部分）：&lt;/strong&gt;
先做知识底座，再做更强agent。给仓库建立清晰的AGENTS.md目录索引；把产品规则、架构规则、执行计划、质量标准写进repo；用CI检查文档freshness和cross-link completeness。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第二种报告（同一话题的分析）：&lt;/strong&gt;
完美主义是吞吐量的敌人。OpenAI采用最小阻塞合并、后续修复的策略。Cursor发现要求100%正确性会导致系统停滞，接受小而稳定的错误率反而更高效。两者都接受了「纠错比等待便宜」的权衡。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;差异在哪？第一种给了一份checklist：正确、安全、换任何人来问AI都能得到差不多的东西。第二种给了一个insight：跨两个不同来源提炼出一个有立场的判断（完美主义是敌人），并归纳出底层权衡（纠错比等待便宜）。一个是搬运工，一个是分析师。两篇完整报告可以在这里对比：&lt;a href="https://challenwang.com/essays/harness-engineering-survey-20260313.html"&gt;第一种报告&lt;/a&gt;和&lt;a href="https://yage.ai/share/harness-engineering-survey-20260312.html"&gt;第二种报告&lt;/a&gt;。&lt;/p&gt;
&lt;p&gt;如果你回忆一下自己用AI做分析性工作的经验，大多数AI的产出其实都像第一种报告：找不出明显的错误，但读完之后没有任何启发。属于正确的废话。第二种极其少见。这种普遍的平庸来自一个LLM训练的底层原因。&lt;/p&gt;
&lt;h2&gt;LLM的Consensus天花板&lt;/h2&gt;
&lt;p&gt;这个原因是：LLM被训练的方式就决定了它的默认输出是consensus（共识）。&lt;/p&gt;
&lt;p&gt;LLM训练的本质是next token prediction，它的意思是：每一步输出概率最高的token。概率最高意味着最多人会认同，也就是consensus。RLHF在这个基础上更进了一步：安全对齐专门惩罚有争议的、带有强烈立场的输出，鼓励平衡、全面、没有明显偏向的回答。两层机制叠加，LLM的默认行为就是回归均值。&lt;/p&gt;
&lt;p&gt;这个默认行为导致了一个相当严重的缺陷。比如过去两年认知方向最火的产品是Deep Research，但仔细看它做的事情：自动化的高频搜索，多文档综合，扩大信息覆盖面。这其实和Deep都没有关系，充其量是Wide Research。换言之，Deep Research是一个非常有误导性的名字。它解决的问题是&lt;strong&gt;信息&lt;/strong&gt;不对称：你以前不知道的，现在知道了。但是真正的深度来自另一个维度，来自&lt;strong&gt;认知&lt;/strong&gt;不对称。面对同样一份行业报告，一个从业二十年的老兵和一个刚入行的新手看到的东西完全不一样。老兵的优势在于他有一套经过多年试错沉淀下来的认知系统，知道哪些数据是噪音，哪些异常值预示着趋势。小白没有这个滤镜，就算拿到10倍厚的报告也没办法做出同样质量的决策。&lt;/p&gt;
&lt;p&gt;这就是为什么你很少听到有人说「用了AI之后我有了以前从来没有过的深刻判断」。AI可以把一个小白提升到大众平均水平，因为它的训练数据就是大众平均水平的压缩。但对于已经在平均水平之上的人，AI的consensus输出对他的判断几乎没有增量。深刻的定义本来就是非共识，而非共识恰好是LLM被训练去规避的方向。&lt;/p&gt;
&lt;p&gt;但是，这个gap意味着一种浪费，一个机会。AI只能输出consensus的话，你就没法把真正的thinking委托给它。不说AGI之类的长远前景，就看日常应用，它能当秘书帮你整理信息，但是当不了顾问/教练，帮你做判断。&lt;a href="/ai-management.html"&gt;之前的AI管理系列文章&lt;/a&gt;里讨论过这个区别，但那时还没有找到系统性的突破口。&lt;/p&gt;
&lt;p&gt;那突破口在哪？&lt;/p&gt;
&lt;h2&gt;AI已经从CPU Bound走向Memory Bound&lt;/h2&gt;
&lt;p&gt;面对AI说正确的废话，大家的直觉是去优化模型：换更好更贵的模型、改更复杂的prompt、加更完备的工具，Multi-Agent、Harness全给它整上。这些做法都在优化同一个维度：模型的智能。&lt;/p&gt;
&lt;p&gt;但是开头的实验已经告诉我们答案了。两边的模型智能几乎一样，工具一样，prompt一样。唯一不同的是context：第二种报告背后有一年积累的判断框架，第一种没有。结果一个输出checklist，一个输出insight。&lt;/p&gt;
&lt;p&gt;变量只有一个，结论很显然：（在模型智能跨过一道坎以后）决定产出性质的是context，而不是模型的智能。其实这种转变在计算机历史上发生过：CPU快到一定程度之后，继续升级CPU就没有意义了，主要的提升都来自内存架构。而LLM现在到了同样的拐点。&lt;/p&gt;
&lt;p&gt;这个判断是反直觉的。这是因为一说起AI，大家第一反应就是模型。我们经常看到模型升级了，却从来没看到过context升级了这种说法。但这个不对称本身就揭示了一个更深刻的趋势。每次模型升级，智能就更便宜一点，你用的模型别人也能用。但你的context是只属于你的，模型升级不会让它贬值。所以持续投入在一个不断贬值的维度（模型智能）上，收益递减；投入在一个不贬值的维度（个人context）上，收益累积。&lt;/p&gt;
&lt;p&gt;既然瓶颈在context，那要突破consensus天花板，就需要用足够密度的个人认知上下文压过训练时的consensus prior。几句话的system prompt做不到这件事。你的品味、你对优先级的直觉、你在某个领域反复验证过的判断框架，这些东西是高维的，散落在过去无数次决策和反馈里，几句话根本说不清楚，而需要一套系统来采集和精炼。&lt;/p&gt;
&lt;h2&gt;怎么把LLM从Consensus的舒适区域里面逼出来&lt;/h2&gt;
&lt;p&gt;为了实现这个目标，我花了一年时间，逐渐构建了一整个系统，发展为三个互相支撑的要素。每一个要素都在回应一个具体的问题。&lt;/p&gt;
&lt;h3&gt;大量积累&lt;/h3&gt;
&lt;p&gt;第一个问题是：你的认知框架到底是什么？&lt;/p&gt;
&lt;p&gt;这个问题看起来简单，实际上非常困难。厉害的人通常说不清楚自己哪里厉害，就算能说上来，往往也是错的。很多他觉得「没什么大不了」的肌肉记忆一样的判断，恰恰是他最独特的地方。这部分一定要靠一个第三方才能捕捉到。&lt;/p&gt;
&lt;p&gt;所以捕捉Context的起点是采集客观的行为数据，而不是单单靠自己写prompt。我持续做了一年相关实验，包括&lt;a href="/life-api.html"&gt;录音转写&lt;/a&gt;、会议记录、&lt;a href="https://github.com/grapeot/wechat_db_parser/"&gt;微信对话导出&lt;/a&gt;、和AI的每次对话、每次纠正甚至发飙，都&lt;a href="stop-using-chatgpt.html"&gt;积累成了本地文件&lt;/a&gt;。这些是我们在真实决策场景下展现出来的判断逻辑。&lt;/p&gt;
&lt;p&gt;注意，我们自己很难从里面提取模式，因为我们太接近它了。这往往需要一个旁观者来看，AI在这件事上是个合适的旁观者。因此，我把所有数据放在同一个文件夹里，AI打开就能看到所有内容，对任何项目做cross-reference。这是context density的基础。&lt;/p&gt;
&lt;h3&gt;分层提炼&lt;/h3&gt;
&lt;p&gt;第二个问题是：原始数据里那么多噪声，怎么把信号找出来？&lt;/p&gt;
&lt;p&gt;你今天做的某个决策可能是因为没睡好，可能当时信息不全，也可能就是随机选的。如果把原始数据直接给AI（比如&lt;a href="https://mem0.ai/"&gt;Mem0&lt;/a&gt;的做法），AI面对的解读空间太大了。一个具体事件可能体现了很多不同的原则，有些决策甚至是arbitrary的。因此，我们需要一个精炼过程。&lt;/p&gt;
&lt;p&gt;这里我用了一个非常简单的筛选标准：稳定性。一个判断如果是跨场景、跨时间反复出现并保持一致的，它大概率是我们认知结构的一部分。不稳定的是情境反应，稳定的才是我自己。&lt;/p&gt;
&lt;p&gt;&lt;a href="/openclaw.html"&gt;受OpenClaw启发&lt;/a&gt;，这个精炼分为三层。L1 Observer每天扫描文件变动，提取有意义的观察，写个流水账。L2 Reflector每周合并重复、清理过期信息、识别跨项目模式，负责把信号和噪声分离。L3 Axiom从稳定模式中蒸馏决策原则，只保留真正代表你的东西。经过了一年的积累和几周的精炼，目前我的系统里积累了44条axiom，覆盖我的技术选择、沟通风格、商业判断等等主观偏好。&lt;/p&gt;
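&lt;p&gt;为了让这个分层更具体，下面是 L1 Observer 这一层的一个高度简化的示意（TypeScript，Node 环境；目录和文件名都是假设的，真实系统里由 LLM 完成的总结步骤这里被省略，只保留「扫描变动、写入流水账」的骨架）：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { readdirSync, statSync, appendFileSync } from "node:fs";
import { join } from "node:path";

const WORKSPACE = "/path/to/context-notes";             // 假设的笔记目录
const LOG = join(WORKSPACE, "memory/observations.md");  // 假设的流水账文件
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// 递归收集修改时间晚于 since 的文件
function filesChangedSince(dir: string, since: number): string[] {
  const changed: string[] = [];
  for (const item of readdirSync(dir, { withFileTypes: true })) {
    const full = join(dir, item.name);
    if (item.isDirectory()) {
      changed.push(...filesChangedSince(full, since));
    } else if (statSync(full).mtimeMs &gt; since) {
      changed.push(full);
    }
  }
  return changed;
}

// 每天跑一次（比如用 cron）：记下今天的变动清单，
// 给每周运行的 L2 Reflector 留下可以合并、提炼模式的原始材料
const today = new Date().toISOString().slice(0, 10);
const lines = filesChangedSince(WORKSPACE, Date.now() - ONE_DAY_MS)
  .map(function (f) { return "- changed: " + f; })
  .join("\n");
appendFileSync(LOG, "\n## " + today + "\n" + lines + "\n");&lt;/code&gt;&lt;/pre&gt;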
&lt;p&gt;这里我们和Mem0等等流行记忆系统的核心区别在于蒸馏的深度。Mem0蒸馏到事实层就停了：「你偏好TypeScript」「你住在上海」。但是我们的系统继续往上走，蒸馏到判断原则层：「评估技术方案时，你怎么权衡可维护性和性能，优先级排序是什么」。事实告诉AI你是谁，判断原则告诉AI你怎么想。让AI产出从consensus变成non-consensus，需要的是后者。&lt;/p&gt;
&lt;h3&gt;按需加载&lt;/h3&gt;
&lt;p&gt;第三个问题是：这么多context，怎么给到AI？&lt;/p&gt;
&lt;p&gt;全部塞进去是不行的。Context window有限，而且无关信息会稀释有效信号。一个写代码的任务带入所有的商业判断原则没有意义，一个做调研的任务也不需要加载代码架构偏好。&lt;/p&gt;
&lt;p&gt;解决方案是现成的skill系统：每个skill是一个针对特定任务类型的context子集，包含这类任务最相关的axiom、判断标准和常用工具。做调研时加载调研的分析框架，写代码时加载架构原则和审阅偏好。这和CPU的内存层级也是类似的：L1 cache是AGENTS.md，L2对应skill库的索引，告诉AI如果需要什么信息的话往哪找，L3则对应具体的skill文件。按需加载，渐进披露，每层只在需要的时候被调用。&lt;/p&gt;
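&lt;p&gt;按需加载这一层也可以用一个极简的示意来说明（TypeScript，Node 环境；索引格式和文件名都是假设的，并不是仓库的真实结构）：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { readFileSync } from "node:fs";

// L1：无条件进入上下文的部分，保持尽量小
const base = readFileSync("AGENTS.md", "utf8");

// L2：一个很小的索引，只说明「去哪找」，不包含内容本身（格式为假设）
type SkillIndex = { [taskType: string]: string };
const skillIndex: SkillIndex = JSON.parse(readFileSync("skills/index.json", "utf8"));

// L3：具体的 skill 文件，只在对应类型的任务真正开始时才加载
function loadSkill(taskType: string): string {
  const path = skillIndex[taskType];
  if (!path) {
    throw new Error("no skill registered for task type: " + taskType);
  }
  return readFileSync(path, "utf8");
}

// 做调研只加载调研框架；写代码的任务则换成架构原则和审阅偏好
const researchContext = [base, loadSkill("research")].join("\n\n");&lt;/code&gt;&lt;/pre&gt;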
&lt;h3&gt;循环&lt;/h3&gt;
&lt;p&gt;三个要素运行起来之后，一件有意思的事情发生了：知识产品开始涌现，而每个产品在消费context的同时也在产生新的context。&lt;/p&gt;
&lt;p&gt;&lt;a href="https://daily.yage.ai"&gt;鸭哥AI手记&lt;/a&gt; 是基于这个上下文系统写的每日AI行业简报，每一期都在消费axiom和skill，同时产出新的观察进入observation库。具体的&lt;a href="https://www.superlinear.academy/c/news/"&gt;领域调研报告&lt;/a&gt; 是带判断标准的深度分析，每篇报告在生产过程中同时也在更新相关领域的认知框架。这两个系列的报告质量都很高，被大家广泛订阅和转发。这证明了循环可以在足够高的context density下自然形成，在持续运行中不断保持活力。&lt;/p&gt;
&lt;p&gt;这个系统的本质是把你的bias注入AI。有品味的bias是深度的来源，但bias也可能是质量不高的偏见。不过把bias显性化这个过程本身就很有意义。在没有这个系统之前，你的bias散落在决策里，你意识不到它的形态，甚至意识不到它的存在。经过采集、精炼、蒸馏之后，你能看到自己面对某类问题时倾向于优先考虑什么、倾向于忽略什么。这种自我认知的进步，单独就有价值。&lt;/p&gt;
&lt;p&gt;回到开头的实验。老王的AI输出consensus，因为它能看到的context几乎是空的，训练时的prior没有被任何个人认知覆盖。我的AI输出有判断力的分析，因为它有一年积累的判断框架在背后。同一个模型，context density不同，产出的性质就不同。&lt;/p&gt;
&lt;h3&gt;开源的参考实现&lt;/h3&gt;
&lt;p&gt;但是注意，这个系统需要时间积累，需要一定的技术能力，需要持续维护的意愿。但换一个角度理解这个成本：「改一下system prompt就让AI瞬间懂你」或者「换一个更好的模型就够了」，这些捷径从原理上就走不通。Consensus prior太强，几句话压不过去，换模型只是换了一个consensus的来源。对于一个这么重要的问题，it deserves a system。从源头采集、分层精炼、按需加载、循环更新，每一步都有它的理由。并不简单，但特别有帮助。&lt;/p&gt;
&lt;p&gt;我们把这个系统的完整结构开源了：&lt;a href="https://github.com/grapeot/context-infrastructure"&gt;github.com/grapeot/context-infrastructure&lt;/a&gt;。这个repo是一个参考实现，包含了我们实际在用的44条axiom、核心skill文件、三层记忆系统的代码、以及文章里提到的所有组件。&lt;/p&gt;
&lt;p&gt;需要说清楚的是：这个repo的主要价值是让你看到一个运行了一年的系统长什么样，而非让你克隆下来就能直接用。你可以打开它跟AI对话，问「这个观点鸭哥会怎么看」，立刻体验到有context和没有context的差异。但要想让AI真正变成你自己的，没有捷径。你需要从头开始采集你的行为数据，设置你自己的计划任务，让系统从你的决策历史中蒸馏出属于你的判断原则。别人的skills是别人的视角，参考可以，替代不行。&lt;/p&gt;
&lt;h2&gt;偏见与硅基大脑&lt;/h2&gt;
&lt;p&gt;总之，AI变得更聪明，并不自动让它变得更深刻。更聪明的consensus依然是consensus。突破天花板的路径只有一条：注入非共识的视角。&lt;/p&gt;
&lt;p&gt;每个人都有自己的非共识视角。你的判断标准，你的审美偏好，你从失败中提炼出的教训，你对什么重要什么次要的直觉。这些东西在AI的训练数据里不存在，永远不会被任何版本的模型自动学到，因为它们只属于你。&lt;/p&gt;
&lt;p&gt;硅基大脑的绝对客观最终只能抵达聪明的平庸，能将其重塑的，唯有你积累数十年、充满偏见与品味的人类灵魂。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"/><category term="Chinese"/><category term="Agentic AI"/><category term="Methodology"/></entry><entry><title>Why AI Only Gives You Correct Nonsense, and How to Push It Out of Its Comfort Zone</title><link href="https://yage.ai/context-infrastructure-en.html" rel="alternate"/><published>2026-03-15T21:00:00-07:00</published><updated>2026-03-15T21:00:00-07:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-15:/context-infrastructure-en.html</id><summary type="html">&lt;p&gt;An LLM's default output is consensus: correct but mediocre. Deep Research is really Wide Research. We found a systematic way to pull LLMs out of consensus using personal cognitive context. One year of experimentation, with controlled evidence.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Three weeks ago, OpenAI published a blog post on Harness Engineering. My friend Lao Wang and I each had our own agents run a deep research on it. We were deliberate about keeping conditions matched: comparable LLMs (Claude Opus 4.6 vs GPT-5.4), the same research skill, the same search tool (Tavily), the same agentic backend (OpenCode), even the same prompt. The only difference was the context behind each agent: mine had a year's worth of accumulated judgment frameworks and personal mental models; Lao Wang's had none. The two AIs produced analyses that felt fundamentally different in kind. Here's the comparison on one specific topic—how OpenAI and Cursor have converged on harness architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Report One (recommendations section, translated from the original Chinese):&lt;/strong&gt;
Build the knowledge foundation first, then build stronger agents. Create a clear AGENTS.md directory index for the repo; write product rules, architecture rules, execution plans, and quality standards into the repo; use CI to check documentation freshness and cross-link completeness.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Report Two (same topic, translated from the original Chinese):&lt;/strong&gt;
Perfectionism is the enemy of throughput. OpenAI merges with minimal blocking and fixes issues afterward. Cursor found that demanding 100% correctness stalls the system, and that accepting a small, stable error rate is actually more efficient. Both accepted the trade-off that correcting errors is cheaper than waiting.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What's the difference? Report One gives you a checklist: correct, safe, the kind of output you'd get from anyone asking the same question. Report Two gives you an insight: it synthesizes a pointed judgment across two independent sources (perfectionism is the enemy) and names the underlying tradeoff (correcting errors is cheaper than waiting for them). One is a courier; the other is an analyst. Both full reports are available for comparison: &lt;a href="https://challenwang.com/essays/harness-engineering-survey-20260313.html"&gt;Report One&lt;/a&gt; and &lt;a href="https://yage.ai/share/harness-engineering-survey-20260312.html"&gt;Report Two&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you think back on your own experience using AI for analytical work, most outputs look like Report One: nothing obviously wrong, but nothing you'd actually learn from. Correct nonsense. Report Two is rare. This pervasive mediocrity has a specific root cause in how LLMs are trained.&lt;/p&gt;
&lt;h2&gt;The LLM Consensus Ceiling&lt;/h2&gt;
&lt;p&gt;The reason is this: the way LLMs are trained means their default output is consensus.&lt;/p&gt;
&lt;p&gt;LLM training is fundamentally next token prediction—outputting the highest-probability token at each step. Highest probability means most people would agree, which is consensus. RLHF layers another mechanism on top: safety alignment specifically penalizes controversial, strongly-opinionated outputs and rewards balanced, comprehensive, non-committal answers. Two mechanisms stacked together, and the LLM's default behavior is regression to the mean.&lt;/p&gt;
&lt;p&gt;This default has a serious consequence. Take Deep Research, arguably the hottest product category in AI over the past two years. But look at what it actually does: automated high-frequency search, multi-document synthesis, expanding information coverage. That has nothing to do with depth. It's Wide Research at best. "Deep Research" is a genuinely misleading name.&lt;/p&gt;
&lt;p&gt;The problem it solves is &lt;strong&gt;information&lt;/strong&gt; asymmetry: things you didn't know before, now you do. But real depth comes from a different axis—&lt;strong&gt;cognitive&lt;/strong&gt; asymmetry. Facing the same industry report, a twenty-year veteran and a fresh hire see completely different things. The veteran has a mental system built from years of trial and error: they know which data is noise, which anomalies signal a trend. The newcomer doesn't have that filter. Even with a report ten times as long, they can't make the same quality decisions.&lt;/p&gt;
&lt;p&gt;This is why you rarely hear someone say "AI gave me a judgment I never could have formed on my own." AI can raise a novice to average, because its training data is a compression of average. But for people already above average, consensus output adds almost nothing. Depth is, by definition, non-consensus—and non-consensus is exactly what LLMs are trained to avoid.&lt;/p&gt;
&lt;p&gt;But this gap is also a waste, and an opportunity. If AI can only output consensus, you can't actually delegate real thinking to it. Setting aside AGI and long-term futures, in everyday use it can be your secretary—organizing information—but not your advisor or coach, helping you form judgments. &lt;a href="/ai-management-en.html"&gt;The earlier AI management series&lt;/a&gt; touched on this distinction, but at the time I hadn't found a systematic way through.&lt;/p&gt;
&lt;p&gt;So where's the opening?&lt;/p&gt;
&lt;h2&gt;AI Has Shifted from CPU-Bound to Memory-Bound&lt;/h2&gt;
&lt;p&gt;The intuitive response to AI producing correct nonsense is to optimize the model: use a smarter, more expensive model, write more elaborate prompts, add more tools—throw Multi-Agent and Harness architectures at it. All of these optimize the same dimension: model intelligence.&lt;/p&gt;
&lt;p&gt;But the opening experiment already gave us the answer. The model intelligence on both sides was roughly equal. Same tools, same prompt. The only variable was context: Report Two had a year of accumulated judgment frameworks behind it; Report One had nothing. One produced a checklist, the other produced an insight.&lt;/p&gt;
&lt;p&gt;One variable, one conclusion: once model intelligence crosses a threshold, what determines the &lt;em&gt;nature&lt;/em&gt; of the output is context, not model intelligence. This kind of transition has happened before in computing: once CPUs became fast enough, upgrading the CPU stopped mattering—the meaningful gains all came from memory architecture. LLMs are at the same inflection point.&lt;/p&gt;
&lt;p&gt;This is counterintuitive. When people talk about AI, the first thing they think about is the model. We constantly hear about model upgrades; nobody talks about context upgrades. But that asymmetry itself reveals something. Every model upgrade makes intelligence a little cheaper, available to everyone. Your context, on the other hand, is yours alone—model upgrades don't depreciate it. Continuously investing in a depreciating dimension (model intelligence) yields diminishing returns; investing in a non-depreciating dimension (personal context) compounds.&lt;/p&gt;
&lt;p&gt;Since the bottleneck is context, breaking through the consensus ceiling requires personal cognitive context dense enough to override the consensus prior baked in during training. A few lines of system prompt can't do this. Your taste, your intuitions about priorities, your judgment frameworks refined through repeated validation in a specific domain—these are high-dimensional, scattered across countless past decisions and feedback loops, and a few sentences can't capture them. You need a system.&lt;/p&gt;
&lt;h2&gt;How to Push LLMs Out of Their Consensus Comfort Zone&lt;/h2&gt;
&lt;p&gt;To solve this, I spent a year building a system that evolved into three mutually reinforcing components. Each addresses a specific question.&lt;/p&gt;
&lt;h3&gt;Accumulate at Scale&lt;/h3&gt;
&lt;p&gt;The first question: what exactly is your cognitive framework?&lt;/p&gt;
&lt;p&gt;This sounds simple and turns out to be quite hard. Skilled people usually can't articulate what makes them skilled—and when they try, they're often wrong. Many of their most distinctive capabilities are muscle memory, things they'd describe as "no big deal," which is precisely what makes them unique. You need a third party to capture that.&lt;/p&gt;
&lt;p&gt;So the starting point for capturing context is collecting objective behavioral data, not just writing prompts from introspection. Over the past year I've run a continuous experiment: &lt;a href="/life-api-en.html"&gt;voice transcripts&lt;/a&gt;, meeting notes, &lt;a href="https://github.com/grapeot/wechat_db_parser/"&gt;WeChat conversation exports&lt;/a&gt;, every conversation with AI, every correction, even every frustrated outburst—all &lt;a href="stop-using-chatgpt-en.html"&gt;accumulated as local files&lt;/a&gt;. These are the reasoning patterns I actually exhibit under real decision-making conditions.&lt;/p&gt;
&lt;p&gt;Worth noting: it's hard to extract patterns from your own data because you're too close to it. You need an outside observer, and AI is a good fit for that role. So I keep all this data in one folder, where an AI can see everything at once and cross-reference across any project. This is the foundation of context density.&lt;/p&gt;
&lt;h3&gt;Layered Distillation&lt;/h3&gt;
&lt;p&gt;The second question: given all that noise in the raw data, how do you find the signal?&lt;/p&gt;
&lt;p&gt;A decision you made today might reflect poor sleep, incomplete information, or just a random choice. If you feed raw data directly to AI (as systems like &lt;a href="https://mem0.ai/"&gt;Mem0&lt;/a&gt; do), the interpretive space is too wide. A single event can express many different principles, and some decisions are simply arbitrary. Distillation is necessary.&lt;/p&gt;
&lt;p&gt;The filtering criterion I use is simple: stability. A judgment that appears consistently across different situations and over time is likely part of your cognitive structure. What's unstable is situational reaction; what's stable is actually you.&lt;/p&gt;
&lt;p&gt;&lt;a href="/openclaw-en.html"&gt;Inspired by OpenClaw&lt;/a&gt;, the distillation runs in three layers. L1 Observer scans file changes daily, extracts meaningful observations, and writes a running log. L2 Reflector runs weekly, merging duplicates, pruning stale information, and identifying cross-project patterns—separating signal from noise. L3 Axiom distills stable patterns into decision principles, keeping only what genuinely represents you. After a year of accumulation and a few weeks of distillation, my system now holds 44 axioms covering my technical preferences, communication style, business judgment, and other subjective leanings.&lt;/p&gt;
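&lt;p&gt;To make the layering more concrete, here is a heavily simplified sketch of the L1 Observer idea (TypeScript on Node; the directory and file names are hypothetical, and the summarization step that an LLM performs in the real system is omitted, leaving only the "scan changes, append to a running log" skeleton):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { readdirSync, statSync, appendFileSync } from "node:fs";
import { join } from "node:path";

const WORKSPACE = "/path/to/context-notes";             // hypothetical notes folder
const LOG = join(WORKSPACE, "memory/observations.md");  // hypothetical running log
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// Recursively collect files whose modification time is newer than `since`
function filesChangedSince(dir: string, since: number): string[] {
  const changed: string[] = [];
  for (const item of readdirSync(dir, { withFileTypes: true })) {
    const full = join(dir, item.name);
    if (item.isDirectory()) {
      changed.push(...filesChangedSince(full, since));
    } else if (statSync(full).mtimeMs &gt; since) {
      changed.push(full);
    }
  }
  return changed;
}

// Run once a day (e.g. from cron): record what changed, so the weekly L2 pass
// has raw material to merge, prune, and mine for cross-project patterns
const today = new Date().toISOString().slice(0, 10);
const lines = filesChangedSince(WORKSPACE, Date.now() - ONE_DAY_MS)
  .map(function (f) { return "- changed: " + f; })
  .join("\n");
appendFileSync(LOG, "\n## " + today + "\n" + lines + "\n");&lt;/code&gt;&lt;/pre&gt;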
&lt;p&gt;The key difference from Mem0 and similar memory systems is the depth of distillation. Mem0 stops at the fact layer: "you prefer TypeScript," "you live in Shanghai." My system goes further, distilling to the level of judgment principles: "when you evaluate technical solutions, how do you weigh maintainability against performance, and what's your priority ordering?" Facts tell an AI who you are. Judgment principles tell an AI how you think. Getting AI output to shift from consensus to non-consensus requires the latter.&lt;/p&gt;
&lt;h3&gt;On-Demand Loading&lt;/h3&gt;
&lt;p&gt;The third question: with all this context, how do you actually give it to the AI?&lt;/p&gt;
&lt;p&gt;Dumping everything in isn't viable. Context windows have limits, and irrelevant information dilutes the useful signal. A coding task doesn't need all your business judgment principles loaded. A research task doesn't need your code architecture preferences.&lt;/p&gt;
&lt;p&gt;The solution is the existing skill system: each skill is a context subset tailored to a specific task type, containing the most relevant axioms, evaluation criteria, and common tools for that task. Load the research framework when doing research; load architecture principles and review preferences when writing code. It's analogous to a CPU's memory hierarchy: L1 cache is AGENTS.md, L2 corresponds to the skill index (telling the AI where to look if it needs something), L3 corresponds to the actual skill files. On-demand loading, progressive disclosure, each layer invoked only when needed.&lt;/p&gt;
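&lt;p&gt;The on-demand loading layer can also be illustrated with a minimal sketch (TypeScript on Node; the index format and file names are assumptions, not the actual repo structure):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { readFileSync } from "node:fs";

// L1: unconditionally in context, kept as small as possible
const base = readFileSync("AGENTS.md", "utf8");

// L2: a tiny index that only says where to look, not the content itself (format assumed)
type SkillIndex = { [taskType: string]: string };
const skillIndex: SkillIndex = JSON.parse(readFileSync("skills/index.json", "utf8"));

// L3: the full skill file, loaded only when a task of that type actually starts
function loadSkill(taskType: string): string {
  const path = skillIndex[taskType];
  if (!path) {
    throw new Error("no skill registered for task type: " + taskType);
  }
  return readFileSync(path, "utf8");
}

// A research task pulls in only the research framework; a coding task would
// load architecture principles and review preferences instead
const researchContext = [base, loadSkill("research")].join("\n\n");&lt;/code&gt;&lt;/pre&gt;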
&lt;h3&gt;The Feedback Loop&lt;/h3&gt;
&lt;p&gt;Once the three components are running, something interesting happens: knowledge products start emerging, and each product consumes context while also generating new context.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://daily.yage.ai"&gt;Duck's AI Daily&lt;/a&gt; is a daily AI industry briefing written using this context system. Each issue consumes axioms and skills, while also producing new observations that feed back into the observation library. The &lt;a href="https://www.superlinear.academy/c/news/"&gt;domain research reports&lt;/a&gt; are depth analyses with explicit judgment criteria, and each report updates the relevant domain's knowledge framework as it's produced. Both series have attracted wide readership and sharing—which suggests the feedback loop can sustain itself at sufficient context density and stay alive through continued operation.&lt;/p&gt;
&lt;p&gt;The essence of this system is injecting your bias into AI. Cultivated bias is the source of depth; but bias can also be low-quality prejudice. Still, the act of making bias explicit has its own value. Before this system, your biases were scattered across decisions, invisible to you in form or even existence. After collection, distillation, and refinement, you can actually see what you tend to prioritize—and what you tend to overlook—when facing certain types of problems. That self-knowledge is independently valuable.&lt;/p&gt;
&lt;p&gt;Back to the opening experiment. Lao Wang's AI outputs consensus because the context it can see is essentially empty—the training-time consensus prior hasn't been overridden by any personal cognitive layer. My AI produces opinionated analysis because there's a year of judgment frameworks behind it. Same model, different context density, fundamentally different output.&lt;/p&gt;
&lt;h3&gt;An Open-Source Reference Implementation&lt;/h3&gt;
&lt;p&gt;This system takes time to accumulate, some technical ability to set up, and ongoing willingness to maintain. But consider the cost from another angle: "tweak the system prompt and the AI instantly understands you" or "just use a better model"—these shortcuts don't work in principle. The consensus prior is too strong to override with a few sentences, and switching models just gives you a different source of consensus. For a problem this important, it deserves a system. Collect from the source, distill in layers, load on demand, update in cycles—each step has its reason. Not simple, but genuinely useful.&lt;/p&gt;
&lt;p&gt;We've open-sourced the complete structure: &lt;a href="https://github.com/grapeot/context-infrastructure"&gt;github.com/grapeot/context-infrastructure&lt;/a&gt;. The repo is a reference implementation containing the 44 axioms we actually use, core skill files, code for the three-layer memory system, and all the components described in this post.&lt;/p&gt;
&lt;p&gt;To be clear: the primary value of this repo is letting you see what a system that's been running for a year looks like, not giving you something you can clone and immediately use. You can open it and talk to an AI—ask "how would the author think about this?"—and immediately experience the difference between having context and not. But there's no shortcut to making an AI truly yours. You need to start collecting your own behavioral data, set up your own scheduled tasks, and let the system distill judgment principles from your own decision history. Someone else's skills represent someone else's perspective. A useful reference, not a replacement.&lt;/p&gt;
&lt;h2&gt;Bias and the Silicon Brain&lt;/h2&gt;
&lt;p&gt;AI getting smarter doesn't automatically make it deeper. Smarter consensus is still consensus. There's only one way to break through the ceiling: inject non-consensus perspective.&lt;/p&gt;
&lt;p&gt;Everyone has their own non-consensus perspective. Your criteria for judgment, your aesthetic sensibilities, the lessons distilled from your failures, your intuition about what matters and what doesn't. These things don't exist in AI's training data and will never be automatically learned by any version of any model, because they belong only to you.&lt;/p&gt;
&lt;p&gt;The silicon brain's absolute objectivity can only ever reach an intelligent mediocrity. What can reshape it is something else entirely: the human soul you've spent decades accumulating—full of bias, full of taste.&lt;/p&gt;</content><category term="Computing"/><category term="English"/><category term="Agentic AI"/><category term="Methodology"/></entry><entry><title>用好AI的第一步：停止和AI聊天</title><link href="https://yage.ai/stop-using-chatgpt.html" rel="alternate"/><published>2026-03-03T12:00:00-08:00</published><updated>2026-03-03T12:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-03:/stop-using-chatgpt.html</id><summary type="html">&lt;p&gt;会用AI和用好AI之间差的是10倍。这个差距的根源在于工作方式，而非模型。本文通过一个完整的工作流例子和上中下三策的框架，解释为什么应该从ChatGPT切换到Cursor这类Agentic工具。&lt;/p&gt;</summary><content type="html">&lt;p&gt;2026年，AI的渗透率已经很高了。很多公司All in AI，Meta甚至专门安排了一整周的脱产AI培训。但我有一个观察是：大多数人，甚至很多重度用户使用AI的方式，和两年前是一样的：大家还是打开聊天窗口，输入问题，等一个回答。区别只是从GPT-4o换成了GPT-5.2或者豆包，从免费版换成了Pro。&lt;/p&gt;
&lt;p&gt;这当然比完全不用AI更好，但也远远不是最优的方法。我很相信（下面也有例子解释）一件事：能用AI和用好AI之间，生产力差的不是30%，而是10倍的量级。不是说我用，甚至重度使用ChatGPT，就天然进入了AI阵营，可以高枕无忧了。事实上，大多数人用AI的方法，就像汽车发明之后还在把它当马车用：同样的路线，同样的速度，只是换了个引擎。而这个差距的根源，在于你的工作方式是否匹配了AI的能力结构。&lt;/p&gt;
&lt;p&gt;举一个我最近的真实例子。我要改进一个算法，从开会讨论方向、分析失败case、到实现改进方案并验证结果，AI（Cursor）自主执行了大约45分钟，自己走完了设计、实现、测试、发现问题、定位原因、修复、再验证的完整循环，最终所有失败case全部修复。整个过程中我的角色就是定方向和审结果。如果用ChatGPT做同一件事，保守估计时间会多五到十倍。这个10倍差距到底是怎么来的？下面我先解释原因，再用这个例子的完整过程来演示具体做法。&lt;/p&gt;
&lt;h2&gt;为什么聊天窗口是天花板&lt;/h2&gt;
&lt;p&gt;从2024年底开始，AI领域出现了一类新工具，以Cursor、Claude Code、Codex为代表。它们表面上是编程工具，但代表的是一种跟ChatGPT完全不同的AI用法。很多人以为这只是面向程序员的ChatGPT，但我的实际体验是，它们&lt;a href="/cursor-ai-entry.html"&gt;对几乎所有知识工作都有用&lt;/a&gt;。具体地说，它有三层好处：&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第一层：反馈闭环。&lt;/strong&gt; 你让ChatGPT写一段python，它写了，你复制到IDE里一跑，报错了。你把报错信息贴回去，它改了一版，你再跑，又不对，你又贴回去。这个过程里，我们就是反馈闭环中的人型工具人：AI产出，我们验证，我们搬运，AI再改。我们从一个应该指挥AI的人，变成了一个来回跑腿的工具人。&lt;/p&gt;
&lt;p&gt;Cursor这类工具的核心区别在于它接入了我们的执行环境。它写完代码可以直接跑，看到报错自己改，改完再跑，再改。这个循环是AI自己驱动的。因此，AI从一个只会出主意的顾问，变成了能独立干活的员工。顾问说完就走，对不对它既不知道，也不负责；员工则会自己验收，发现问题就返工。&lt;/p&gt;
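&lt;p&gt;这个闭环可以用一个玩具级的示意来说明（TypeScript，Node 环境）。其中 askModel 只是占位符，代表 agentic 工具内部对 LLM 的调用，并不是真实 API；整段代码也不是 Cursor 的实现方式，只是把"自己跑、自己看报错、自己改"的循环写成最短的形式：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

// askModel 是占位符：代表工具内部的 LLM 调用，这里写成同步函数只是为了简短
type AskModel = (prompt: string) =&gt; string;

export function closedLoop(task: string, askModel: AskModel, maxRounds = 5): boolean {
  let code = askModel("为下面的任务写出 script.py：\n" + task);
  let remaining = maxRounds;
  while (remaining &gt; 0) {
    remaining -= 1;
    writeFileSync("script.py", code);
    try {
      // 工具自己运行代码，而不是让人复制到别的环境里去跑
      execSync("python script.py", { stdio: "pipe" });
      return true; // 达到事先定义的成功标准，循环在没有人搬运信息的情况下闭合
    } catch (err) {
      // 原始报错直接回到模型手里，不经过人肉复制粘贴
      const failure = String((err as any).stderr ?? err);
      code = askModel("script.py 运行失败：\n" + failure + "\n请返回修复后的版本。");
    }
  }
  return false; // 次数用完仍未通过：这时才需要人来介入
}&lt;/code&gt;&lt;/pre&gt;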
&lt;p&gt;这也是为什么很多人觉得AI不靠谱：他们一直在用一个开环的AI，犯了错浑然不觉。给它一个闭环，可靠性会有质的提升。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第二层：上下文供给。&lt;/strong&gt; AI输出质量的瓶颈，很多时候在于它能看到多少相关上下文，而非模型本身有多聪明。同一个模型，给足上下文就能给出对的结果；让它盲猜，就容易脑补出不一样的目标。&lt;/p&gt;
&lt;p&gt;最近有&lt;a href="/ai-key-decisions.html#comment-6844340971"&gt;读者评论&lt;/a&gt;：各家的Deep Research和在本地工具里接搜索API相比，哪个更好？我的回答是，我已经好几个月没开过Deep Research了。搜索质量本身没问题，但它能解决的问题太有限。举个例子，我想在工作中比较两种算法的优劣。这个"我的场景"其实需要仔细描述，因为它直接决定了比较的维度：我的数据长什么样、我看重延迟还是准确率、部署环境有什么约束。用Deep Research，我要花很长时间把这些背景交代清楚。但在Cursor里，我直接 @ 几个内部文档和会议记录，AI立刻就有了所有上下文。哪怕搜索能力弱一点，给出的结果也更贴合，速度还更快。&lt;/p&gt;
&lt;p&gt;所以ChatGPT的瓶颈很多时候在于上下文的供给：你很难把足够的信息喂给它。Cursor这类工具解决的就是这个问题。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第三层：资产积累。&lt;/strong&gt; ChatGPT的使用模式是消耗型的。你投入时间，得到一个答案，答案用完就没了。每次对话都是从零开始。Cursor是投资型的。你用到了某个内部文档？存到项目文件夹里。AI反复犯某个错？花两分钟写一条规则。团队有一套约定俗成的惯例？写下来让AI也知道。这些都是一次性投入，但收益是持久的。&lt;/p&gt;
&lt;p&gt;时间一长就会形成飞轮效应：你用得越多、积累越多，AI就越懂你的项目、你的偏好、你的工作方式。ChatGPT永远是一个需要完整briefing的陌生人，Cursor可以变成一个越来越默契的搭档。一个每次归零，一个持续复利。&lt;/p&gt;
&lt;p&gt;反馈闭环、上下文、资产积累，这三层加在一起，就是前面那个45分钟的例子能成立的原因。但光知道原因还不够，关键是怎么在日常工作中把这些落地。下面我就用那个例子的完整过程来演示。&lt;/p&gt;
&lt;h2&gt;上中下三策：一个完整的例子&lt;/h2&gt;
&lt;p&gt;在展开之前，先介绍一个我在实践中总结的框架，叫做上中下三策。工作中的每一步都会产生信息，这些信息怎么处理，决定了AI能帮你多少。下策是让信息消失（人看不到，AI也看不到）；中策是记录成人能看的形式（人友好，AI不友好）；上策是先让AI能消费，再加工给人看（AI-first）。下面每一步我都会用这个框架来分析。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第一步，开会。&lt;/strong&gt; 组里的周会，讨论了某个算法在一些数据上失败的情况，大家提出了各种假设和改进思路。&lt;/p&gt;
&lt;p&gt;下策是开完就忘，什么都没留下。中策是写一份Google Doc的会议纪要，这已经是一个很好的做法了：它增加了你的visibility，同事知道你做了什么，未来也方便引用。但AI很难直接拿到这些内容，因为Google Doc需要登录，格式也混杂，每次想让AI参考都要手动复制粘贴。中策对人友好，对AI不友好。&lt;/p&gt;
&lt;p&gt;上策是用Zoom AI Companion或类似工具自动转录会议内容，存成.md文件，放到工作文件夹的meeting_notes目录下。时间成本几乎为零，但AI从此可以直接引用这次会议里的每一个细节。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第二步，分析数据。&lt;/strong&gt; 我需要看那个算法在不同数据上的表现，记录失败的具体场景和原因。同样的三策逻辑：下策是在便签上记几个URL，给人看的时候切过去点一下完事；中策是写进Confluence；上策是在工作文件夹里建一个analysis_notes.md，把每个失败case的链接、失败原因、观察都记进去。&lt;/p&gt;
&lt;p&gt;值得说明的是，在这两步里上策实际花的时间和中策差不多，有时候甚至更短，因为.md文件的排版比Confluence简单得多，而且你完全可以让AI帮你整理。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第三步，写代码改进算法。&lt;/strong&gt; 这是上策真正发挥威力的地方。因为前两步的所有信息都在同一个文件夹里，我在Cursor里 @ 一下会议记录，再 @ 一下分析笔记，告诉AI：根据这些信息，设计一个改进方案并实现，然后验证这些失败的case有没有被修复。&lt;/p&gt;
&lt;p&gt;注意AI这时候拿到的上下文有多完整：它知道这个算法为什么要改，有什么改进思路（会议记录里有讨论），知道具体有哪些失败模式和原因（分析笔记里有记录），知道成功的标准是什么（哪几个case要被修复）。这里面最关键的是最后一点：success criteria。很多人用AI的时候，只告诉它做什么，却省略了什么样算做好了。这就像一场缺少终点线的赛跑，AI凭感觉跑，你凭感觉判断。但如果你给了AI一个明确的终点线（这几个失败的case要全部修复），AI就可以自己跑完从设计到实现到验证的完整循环：写代码、跑测试、发现问题、定位原因、修复、再验证。这就是前面说的那45分钟里发生的事情。
（事实上这背后比听起来更复杂：AI在后台自动拆分了子任务，调度了多个agent并行工作，主agent做设计和验收，子agent负责编码和测试，整个过程高度自动化。但这是更进阶的话题了。）&lt;/p&gt;
&lt;p&gt;如果用ChatGPT做同一件事呢？你要手动把每段上下文贴过来。你可能会贴会议纪要作为背景，再贴上代码让它帮你改，但这样一方面要贴大量的文件，一方面要在python环境和ChatGPT里面两边拷来拷去，非常低效。其次，这种ChatGPT的用法缺少自我修正能力，你得自己看中间结果、自己判断哪里出了问题、自己把反馈喂回去。麻烦都是其次，主要是弯路会多很多。AI可以一目十行，看1000行log就知道问题在哪，人类则需要特殊的工具做可视化才能看出来。这就是10倍差距的来源：一边是信息打通、自动闭环的AI，另一边是信息割裂、人肉驱动的AI。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第四步，写文档和准备presentation。&lt;/strong&gt; 因为所有的分析、代码、结果都在同一个文件夹里，我直接让AI根据这些内容生成一份技术文档，再贴到Confluence上。&lt;/p&gt;
&lt;p&gt;注意这里的顺序：先在Cursor里让AI生成，再复制到Confluence。先AI，后人。这个顺序的倒转其实是整个工作流里最深的一个思维转变。传统做法是human-first：我自己写文档，写完可能让AI帮我润色一下。上策是AI-first：信息先以AI能消费的格式存在（.md文件），AI完成主要工作（生成文档），最后才转成人类可读的版本（Confluence页面）。结果是你花的时间更少，产出的质量更高，而且AI消费的那份原材料还留在你的文件夹里，未来随时可以再用。&lt;/p&gt;
&lt;p&gt;从开会到出文档，半天时间搞定了全部工作。&lt;/p&gt;
&lt;p&gt;当你把每一步都用上策来处理，所有信息最终都汇聚在同一个文件夹下，形成了我在&lt;a href="/openclaw.html"&gt;之前文章&lt;/a&gt;里提到的Mono Repo模式。AI天然可以跨主题访问所有上下文。这时候AI的能力会有一个显著的跃升，因为它第一次拥有了你的完整信息版图。你可以回想一下你上周的工作：多少环节在用下策？多少在用中策？如果大部分答案是下策和中策，那就是你和10倍效率之间的差距所在。&lt;/p&gt;
&lt;p&gt;回过头看这个流程，有一个根本性的转变：传统工作流里，人是主要执行者，AI是辅助。这个工作流里反过来了，AI是主要执行者，人的角色是定方向、定标准、做判断。换一种说法：我们对AI的定位，应该从&lt;em&gt;让AI帮我写代码&lt;/em&gt;升级到&lt;em&gt;让AI帮我解决问题&lt;/em&gt;。写代码只是解决问题的其中一环。如果你给了AI足够的上下文和明确的成功标准，它可以独立走完整个循环，你的角色就变成了出题人。你的价值在于你知道这个算法应该往哪个方向改，你知道什么样的结果才算成功。这种判断力是你作为专业人士最核心的能力，也恰恰是AI最依赖你提供的东西。&lt;/p&gt;
&lt;p&gt;这个思路适用于所有职业。你可以是工程师、数据分析师、产品经理、研究员。只要你的工作涉及信息的整理、分析、决策和产出，上中下三策就适用，feedback loop的价值就存在。区别只在于AI帮你执行的那个环节是写代码、做分析、写文档还是别的任务。&lt;/p&gt;
&lt;h2&gt;开始行动&lt;/h2&gt;
&lt;p&gt;工具会变，今天的载体是Cursor和Claude Code，明天可能是别的。但三样东西是持久的：反馈闭环让AI能自我修正，上下文供给让AI能理解你的世界，资产积累让你和AI的协作越来越高效。这是底层的范式，跟具体工具无关。&lt;/p&gt;
&lt;p&gt;如果你今天只做一件事，我的建议是这样：找一个你正在进行的项目，建一个文件夹，花半小时把相关的文档、笔记、会议记录全部复制粘贴放进去。然后，即使是你觉得应该用ChatGPT的工作，也要抑制住这种冲动，强令自己打开Cursor，从这里开始你跟AI的下一次对话。你会立刻感受到差异。改变从这一刻开始。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"/><category term="Chinese"/><category term="Agentic AI"/><category term="Methodology"/></entry><entry><title>Step One to Using AI Well: Stop Chatting with AI</title><link href="https://yage.ai/stop-using-chatgpt-en.html" rel="alternate"/><published>2026-03-03T12:00:00-08:00</published><updated>2026-03-03T12:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-03:/stop-using-chatgpt-en.html</id><summary type="html">&lt;p&gt;The gap between using AI and using AI well is 10x. That gap comes from how you work, not which model you use. This post walks through a complete workflow example and a Three Tiers framework to explain why you should switch from ChatGPT to agentic tools like Cursor.&lt;/p&gt;</summary><content type="html">&lt;p&gt;By 2026, AI has become widespread. Companies are all-in on it. Meta even blocked out an entire week for mandatory AI training. But here's what I keep noticing: most people, including heavy users, are interacting with AI the same way they did two years ago. They open a chat window, type a question, wait for an answer. The only difference is they've swapped GPT-4o for GPT-5.2 or Doubao, or upgraded from free to Pro.&lt;/p&gt;
&lt;p&gt;That's better than not using AI at all, but it's nowhere close to optimal. I'm convinced, and I'll show you evidence below, that the productivity gap between &lt;em&gt;using AI&lt;/em&gt; and &lt;em&gt;using AI well&lt;/em&gt; isn't 30%. It's an order of magnitude. Just because you use ChatGPT, even heavily, doesn't mean you've joined some AI-native vanguard where you can sit back and relax. Most people are using AI like someone who got a car but still drives horse-carriage routes: same roads, same speed, just a different engine. The real gap comes down to whether your way of working actually matches how AI is capable of operating.&lt;/p&gt;
&lt;p&gt;Here's a recent real example from my own work. I needed to improve an algorithm. From the initial meeting to map out the direction, through analyzing failure cases, to implementing the fix and verifying results, AI (specifically Cursor) ran autonomously for about 45 minutes. It completed the full loop on its own: design, implement, test, find issues, diagnose, fix, verify again. Every failing case was resolved. My role throughout was to set the direction and review the outcome. Doing the same thing in ChatGPT would conservatively take five to ten times longer. Where does that 10x gap actually come from? I'll explain the why first, then walk through the complete example to show the how.&lt;/p&gt;
&lt;h2&gt;Why the Chat Window Is a Ceiling&lt;/h2&gt;
&lt;p&gt;Starting around late 2024, a new category of AI tools emerged: Cursor, Claude Code, Codex. On the surface they look like coding tools, but they represent a fundamentally different way of using AI compared to ChatGPT. A lot of people assume they're just ChatGPT for programmers, but my experience is that &lt;a href="/cursor-ai-entry-en.html"&gt;they're useful for almost all knowledge work&lt;/a&gt;. The difference plays out on three levels.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 1: The feedback loop.&lt;/strong&gt; You ask ChatGPT to write some Python. It writes it. You copy it to your IDE, run it, it errors. You paste the error back, it gives you a revision, you run it again, still wrong, you paste again. In this cycle, you become the human errand runner in the feedback loop: AI produces, you test, you shuttle information back and forth, AI revises. You've gone from the person directing AI to the person doing the legwork.&lt;/p&gt;
&lt;p&gt;The core difference with Cursor is that it's connected to your execution environment. It writes code and runs it directly. Sees an error, fixes it, runs it again. The loop is AI-driven. This turns AI from a consultant who gives advice and walks away into an employee who can work independently. The consultant says their piece and leaves, with no idea if it was right and no accountability. The employee validates their own work and fixes problems when they find them.&lt;/p&gt;
&lt;p&gt;This is also why a lot of people think AI is unreliable: they've been using open-loop AI that fails and doesn't know it. Give it a closed loop, and reliability improves dramatically.&lt;/p&gt;
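&lt;p&gt;To make "closed loop" concrete, here is a minimal sketch of the shape of the idea (not Cursor's actual mechanism; &lt;code&gt;ask_agent_to_fix&lt;/code&gt; is a hypothetical stand-in for whatever model call your tool makes). The point is who owns the loop: the program runs the tests, the agent sees the failures, and the human only supplies the goal.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import subprocess

def run_tests():
    # Run the test suite and hand back both the verdict and the raw output
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def ask_agent_to_fix(failure_output):
    # Hypothetical stub: in a real agentic tool, this is where the model reads
    # the failure and edits the code before the next round.
    print("agent sees:", failure_output[:200])

def closed_loop(max_rounds=5):
    for _ in range(max_rounds):
        ok, output = run_tests()
        if ok:
            return True           # objective finish line: the tests pass
        ask_agent_to_fix(output)  # the agent, not the human, sees the error
    return False
&lt;/code&gt;&lt;/pre&gt;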
&lt;p&gt;&lt;strong&gt;Level 2: Context supply.&lt;/strong&gt; The bottleneck on AI output quality, much of the time, isn't how smart the model is. It's how much relevant context the model can see. Same model, enough context: correct result. Same model, guessing blind: it fills in the gaps with something that might be completely wrong.&lt;/p&gt;
&lt;p&gt;A &lt;a href="/ai-key-decisions-en.html#comment-6844340971"&gt;reader recently commented&lt;/a&gt; asking which is better: Deep Research from the major AI providers, or plugging a search API into a local tool? My answer: I haven't opened Deep Research in months. The search quality isn't the issue. It's just too limited in what it can actually solve. Say I want to compare two algorithms for my specific use case at work. "My use case" requires careful description, because it directly determines what dimensions matter for the comparison: what my data looks like, whether I care about latency or accuracy, what the deployment constraints are. With Deep Research, I have to spend a lot of time explaining all that background. In Cursor, I just @ a few internal docs and meeting notes, and AI immediately has all the context. Even if the search capability is slightly weaker, the results are more relevant and the whole thing is faster.&lt;/p&gt;
&lt;p&gt;ChatGPT's bottleneck is often context supply: it's hard to feed it enough information. Cursor-style tools solve exactly that problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 3: Asset accumulation.&lt;/strong&gt; ChatGPT's usage pattern is consumptive. You put in time, you get an answer, the answer gets used, then it's gone. Every conversation starts from zero. Cursor is investment-style. You needed an internal doc? Save it to the project folder. AI keeps making the same mistake? Spend two minutes writing a rule. Your team has conventions everyone follows? Write them down so AI knows too. Each of these is a one-time investment with compounding returns.&lt;/p&gt;
&lt;p&gt;Over time, this creates a flywheel: the more you use it, the more you've accumulated, the better AI understands your project, your preferences, your working style. ChatGPT is always a stranger who needs a full briefing every time. Cursor becomes a collaborator who gets more in sync with you over time. One resets to zero; the other compounds.&lt;/p&gt;
&lt;p&gt;These three levels, feedback loop, context, and asset accumulation, are why that 45-minute example was possible. But knowing the reason isn't enough. What matters is how to actually make this work day to day. The full example below shows that.&lt;/p&gt;
&lt;h2&gt;Three Tiers: A Complete Example&lt;/h2&gt;
&lt;p&gt;Before walking through it, let me introduce a framework I've developed through practice, which I call the Three Tiers. Every step in your work produces information. How you handle that information determines how much AI can help you. The Bad tier: information disappears (neither you nor AI can see it later). The Better tier: information gets recorded in a human-readable format (human-friendly, AI-unfriendly). The Best tier: information gets stored AI-first, then made human-readable. I'll apply this framework to every step below.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: The meeting.&lt;/strong&gt; The team's weekly sync, where we discussed cases where an algorithm was failing on certain data and brainstormed hypotheses and improvement ideas.&lt;/p&gt;
&lt;p&gt;Bad tier: meeting ends, nothing is captured. Better tier: write up a Google Doc with meeting notes. This is already a solid practice. It increases your visibility, your colleagues know what happened, and it's easy to reference later. But AI can't easily access this content: Google Docs require login, the format is messy, and every time you want AI to reference it you have to manually copy and paste. Better tier is human-friendly and AI-unfriendly.&lt;/p&gt;
&lt;p&gt;Best tier: use Zoom AI Companion or a similar tool to auto-transcribe the meeting, save it as a .md file, put it in a meeting_notes directory inside your work folder. Time cost is nearly zero, but AI can now directly reference every detail from that meeting going forward.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Analyzing the data.&lt;/strong&gt; I needed to look at how the algorithm performed across different inputs, and document the specific failure scenarios and their causes. Same Three Tiers logic: Bad tier is jotting a few URLs in a sticky note and clicking through them when you need to show someone. Better tier is writing it up in Confluence. Best tier is creating an analysis_notes.md in your work folder with each failure case's link, failure reason, and observations.&lt;/p&gt;
&lt;p&gt;Worth noting: the Best tier in these two steps takes about as much time as the Better tier, sometimes less, because .md formatting is far simpler than Confluence, and you can have AI help you organize it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Writing code to improve the algorithm.&lt;/strong&gt; This is where the Best tier really shows its value. Because all the information from the first two steps lives in the same folder, I open Cursor, @ the meeting notes, @ the analysis notes, and tell AI: based on this, design an improvement and implement it, then verify that the failing cases are fixed.&lt;/p&gt;
&lt;p&gt;Look at how complete the context is that AI has at this point. It knows why the algorithm needs to change. It has improvement ideas (the meeting notes have that discussion). It knows the specific failure patterns and their causes (the analysis notes have that). It knows the success criteria (which cases need to be fixed). That last piece is the most critical. A lot of people tell AI what to do but skip what "done" looks like. It's like a race with no finish line: AI runs by feel, you judge by feel. But give AI a clear finish line (all these failing cases must pass), and it can run the entire loop from design to implementation to verification on its own: write code, run tests, find problems, diagnose, fix, verify again. That's what happened in those 45 minutes.&lt;/p&gt;
&lt;p&gt;(What's going on behind the scenes is actually more complex than it sounds: AI automatically broke the task into subtasks, scheduled multiple agents to work in parallel, with the main agent handling design and review while sub-agents handled coding and testing. But that's a more advanced topic.)&lt;/p&gt;
&lt;p&gt;What if you did this same thing in ChatGPT? You'd have to manually paste in every piece of context. Maybe you paste the meeting notes as background, then open another chat for the code changes, copying back and forth between your Python environment and the chat window constantly. Beyond the inefficiency, this approach lacks any self-correction ability. You have to review every intermediate result yourself, decide where things went wrong, and manually feed that feedback back in. The hassle is secondary; the bigger cost is all the detours. AI can skim a thousand lines of logs and identify the problem in seconds. A human needs specialized visualization tools just to see what's happening. That's where the 10x gap comes from: on one side, information fully connected, loop automated; on the other, information siloed, loop driven by hand.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4: Writing documentation and preparing the presentation.&lt;/strong&gt; Because all the analysis, code, and results are in the same folder, I have AI generate a technical document directly from that content, then paste it to Confluence.&lt;/p&gt;
&lt;p&gt;Notice the order: generate in Cursor first, then copy to Confluence. AI first, then humans. This reversal is actually the deepest mindset shift in the entire workflow. The traditional approach is human-first: I write the document, then maybe have AI polish it. The Best tier is AI-first: information lives in a format AI can consume (.md files), AI does the main work (generates the document), and only then does it get converted to a human-readable form (Confluence page). The result is less time spent, higher quality output, and the AI-consumable source material stays in your folder for future use.&lt;/p&gt;
&lt;p&gt;From the meeting to finished documentation, the whole thing took half a day.&lt;/p&gt;
&lt;p&gt;When you handle every step with the Best tier, all information converges in the same folder, forming what I called the Mono Repo pattern in &lt;a href="/openclaw-en.html"&gt;a previous post&lt;/a&gt;. AI can naturally access all the context across every topic. At that point, AI's capability takes a noticeable leap, because it finally has access to your complete information map. Think back over your work last week. How many steps were Bad tier? How many were Better tier? If most of your answers are Bad and Better, that's the gap between where you are and 10x productivity.&lt;/p&gt;
&lt;p&gt;Stepping back and looking at this workflow, there's a fundamental shift: in the traditional model, the human is the primary executor and AI is the assistant. In this workflow, it's reversed. AI is the primary executor; the human's role is to set direction, define success criteria, and make judgment calls. Put it another way: our conception of AI should upgrade from &lt;em&gt;have AI help me write code&lt;/em&gt; to &lt;em&gt;have AI help me solve problems&lt;/em&gt;. Writing code is just one piece of solving problems. If you give AI enough context and a clear definition of success, it can complete the entire loop independently, and your role becomes the one who sets the problem. Your value lies in knowing which direction the algorithm should go, and knowing what a successful result looks like. That kind of judgment is your core capability as a professional, and it's exactly what AI depends on you to provide.&lt;/p&gt;
&lt;p&gt;This applies to every profession. Engineer, data analyst, product manager, researcher. If your work involves gathering, analyzing, deciding, and producing information, the Three Tiers apply, and the value of a feedback loop is there. The only difference is whether the loop AI runs for you involves writing code, doing analysis, writing documents, or something else.&lt;/p&gt;
&lt;h2&gt;Getting Started&lt;/h2&gt;
&lt;p&gt;The tools will change. Today it's Cursor and Claude Code; tomorrow it'll be something else. But three things are durable: a feedback loop that lets AI correct itself, context supply that lets AI understand your world, and asset accumulation that makes your collaboration with AI more efficient over time. These are the underlying principles, independent of any specific tool.&lt;/p&gt;
&lt;p&gt;If you do one thing today, here's my suggestion: find a project you're currently working on, create a folder, and spend 30 minutes copying all the relevant documents, notes, and meeting records into it. Then, even for work you'd normally turn to ChatGPT for, resist that impulse, open Cursor instead, and start your next conversation with AI from there. You'll feel the difference immediately. Start now.&lt;/p&gt;</content><category term="Computing"/><category term="English"/><category term="Agentic AI"/><category term="Methodology"/></entry><entry><title>以一个简单任务为例看AI落地的关键决策</title><link href="https://yage.ai/ai-key-decisions.html" rel="alternate"/><published>2026-02-20T18:00:00-08:00</published><updated>2026-02-20T18:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-20:/ai-key-decisions.html</id><summary type="html">&lt;p&gt;用两分钟指挥AI给300篇文章添加SEO summary的实战案例，拆解五个关键决策：选对执行环境、先建测试再干活、让agent自己处理corner case、divide and conquer、结果导向的prompt写法。&lt;/p&gt;</summary><content type="html">&lt;p&gt;今天我用AI完成了一个小任务。感觉这个案例特别适合用来介绍AI的实战原则，所以写了这篇文章来分享一下。&lt;/p&gt;
&lt;p&gt;任务本身是给这个blog里的每一篇文章都加一行summary，这样可以帮助搜索引擎理解这个网站的内容，从而提升这个网站的排名（SEO）。这个任务看起来简单，其实有很多坑，一不小心就会陷入AI鬼打墙、不可靠、使用繁琐的陷阱。下面主要分享在这个过程中我做了哪五个重要的决策，来让整个流程变得稳定可靠。&lt;/p&gt;
&lt;h2&gt;决策一：用本地Coding Agent，而不是ChatGPT&lt;/h2&gt;
&lt;p&gt;我做的第一个决策是：用Cursor/OpenCode作为讨论的平台，而不是ChatGPT。这件事其实并不显然，因为整个项目的开始来自于我想给这个网站做SEO。直观上看，这是个更适合ChatGPT的聊天性质的任务。但是我仍然坚持用了OpenCode。这里面最根本的原因是摩擦。&lt;/p&gt;
&lt;p&gt;具体地说，摩擦在两个方面。第一是上下文传递的摩擦。用ChatGPT我需要把我的博客的内容甚至代码复制粘贴给它，或者让它去写代码抓取这些文章的内容。但在OpenCode里，我只要用@指定我的博客所在的文件夹就好了，摩擦小很多。&lt;/p&gt;
&lt;p&gt;另一个方面是落地的摩擦。比如我们在ChatGPT里面通过聊天得出了结论：这个网站需要增加Summary元数据。为了把这个想法落地，我需要把我和ChatGPT来回几轮的聊天记录全部复制粘贴到Cursor/OpenCode里面去，然后再调用另一个AI来改文章的内容。相比之下，如果从头就在OpenCode里面做讨论的话，讨论之后立刻就能落地。&lt;/p&gt;
&lt;p&gt;所以我做了这第一个决策：对几乎所有任务，抛弃基于聊天的AI环境，选择能执行的Agentic环境。为什么把这个决策放在第一个，是因为这是有和无的区别。摩擦一大，我们就懒得做下去了，整个项目花了时间，交付是0，纯浪费时间。只有摩擦小了，项目能继续下去，才有必要继续聊具体的方法和技巧。&lt;/p&gt;
&lt;h2&gt;决策二：动手之前，先定义成功，提供测试&lt;/h2&gt;
&lt;p&gt;我做的第二个决策是：在让AI动手生成任何summary之前，先让它写一个测试。这个测试做的事情很简单，就是检查所有.md文件，看有没有summary字段。如果不是100%的文件都有这个字段就fail，并且打印是哪些文件有问题。&lt;/p&gt;
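&lt;p&gt;为了把这个测试说得更具体，下面是一个极简的示意（不是当时的原始代码，路径和 metadata 格式都是假设）：扫描 blog/content 下所有 .md 文件，统计 summary 覆盖率，缺失的文件逐个列出，覆盖率不到 100% 就以非零状态退出。&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pathlib import Path
import sys

CONTENT_DIR = Path("blog/content")  # 假设的路径，按实际仓库结构调整

def has_summary(md_file):
    # 简化假设：summary 以 "Summary:" 这样的行出现在文件开头的 metadata 里
    head = md_file.read_text(encoding="utf-8").splitlines()[:20]
    return any(line.strip().lower().startswith("summary:") for line in head)

def main():
    files = sorted(CONTENT_DIR.rglob("*.md"))
    missing = [p for p in files if not has_summary(p)]
    print(f"summary coverage: {len(files) - len(missing)}/{len(files)}")
    for p in missing:
        print(f"missing summary: {p}")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(main())
&lt;/code&gt;&lt;/pre&gt;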
&lt;p&gt;为什么要先写测试？因为如果没有这个测试，AI说做完了，我也不知道它到底做完了没有。我确实可以抽查几篇，但300多篇文章，抽查没法覆盖全部。最后的局面就是我也不知道、AI也不知道，两个人都在wishful thinking。&lt;/p&gt;
&lt;p&gt;但有了测试就不一样了。AI做完一轮，测试fail了，它自己就知道还有20篇没覆盖，下面就会重新看这些文章。测试通过了，就是100%完成了。不需要人工抽查，不需要猜，一切100%都是确定的。&lt;/p&gt;
&lt;p&gt;这就是我们一直强调的&lt;a href="/agentic-ai-crisis.html"&gt;feedback loop&lt;/a&gt;。很多人用AI陷入踢一脚动一下、动完了发现不对，再踢再动的循环，觉得AI好难用，根本原因就是没有建立反馈机制。AI不知道什么叫"做完"，你也不知道AI做到什么程度了。这是要首先解决的核心问题。确定性的测试就是一个非常有效的解决方法。事实上，只要这种测试到位了，后面三个决策都是锦上添花的东西。&lt;/p&gt;
&lt;p&gt;所以在开始任何任务之前，我都会先问自己：我/AI有没有一个确定性的方式来判断任务完成了没有？如果没有，先把这个机制建起来。&lt;/p&gt;
&lt;h2&gt;决策三：让Agent自己去干，而不是我来写程序调用API&lt;/h2&gt;
&lt;p&gt;第三个决策是：我没有写程序去调用LLM API来生成summary，而是让coding agent自己去做这件事。&lt;/p&gt;
&lt;p&gt;更详细的原因在&lt;a href="/result-certainty.html"&gt;这篇文章&lt;/a&gt;中有解释。虽然让AI做概括听起来调个API就搞定了。但仔细想想，这里有很多corner case：有的文章已经有summary了不要重复加，有的metadata格式不一致，有的位置需要调整。如果写程序处理这些情况，代码会特别复杂，调试成本高，进展速度慢。最后可能AI会花大量的精力去调怎么处理这些细节。&lt;/p&gt;
&lt;p&gt;另一种思路是用自然语言直接给Cursor/OpenCode布置任务：“你去看一下XX.md，保证它有个面向SEO的summary元数据域”。这时候完成任务的主体就不是一个机械的程序，而是一个真的有智能，知变通的Agent。它会自己看情况处理——有summary就跳过，格式不对就调整，遇到特殊情况自己判断。&lt;/p&gt;
&lt;p&gt;这就是把AI当agent用和把AI当工具用的区别。调用API的模式是：你写程序，AI是其中一个组件。这种模式确定性高，但灵活性低，遇到复杂情况反而更慢。而用Agentic AI，确定性从过程移到了结果上，你只需要讲清楚要什么结果。剩下的事，AI发挥自己的能动性和判断力自己搞定。&lt;/p&gt;
&lt;p&gt;所以在我的工作流里，调用API是最后手段。能交给agent去做的，尽量交给agent。&lt;/p&gt;
&lt;h2&gt;决策四：用Divide and Conquer应对认知饱和&lt;/h2&gt;
&lt;p&gt;第四个决策是：我没有给一个agent一股脑布置300篇文章的任务，而是让它开了8个sub-agent，分配任务以后并行处理。&lt;/p&gt;
&lt;p&gt;这里面的原因和context window saturation有关。一个agent一下处理300篇，前面可能还好，读了十几篇文章以后context window &lt;a href="/wide-research.html"&gt;会被占满&lt;/a&gt;，后面就开始偷懒、跳文章、或者忘了前面踩过的坑。这和人有点像，认知负荷一高就会丢三落四，或者开始敷衍。&lt;/p&gt;
&lt;p&gt;另一个原因是sub-agent是coding agent原生支持的功能。我不用自己写并发逻辑、分配任务、汇总结果。这些plumbing work都被外包出去了。我只要用一两句话描述一下这个工作流就好。&lt;/p&gt;
&lt;p&gt;很多人用AI的时候没有意识到这个问题。他们没有针对AI的缺陷思考，预测到里面的坑，就用最符合直觉的方法去布置任务。但像我们管理下属的时候要知人善任一样，我们要意识到AI的认知资源尤其有限，context window是一种需要管理的稀缺资源。任务量太大，质量必然下降。所以任务量大的时候，我会主动考虑拆分，而不是让一个agent扛所有东西。&lt;/p&gt;
&lt;p&gt;这个决策和前面几个的关系是：决策二保证结果是对的（测试通过），决策三保证过程是灵活的（agent自己处理corner case），决策四更进一步通过规避一个必然出现的坑，保证处理得又快又好。&lt;/p&gt;
&lt;h2&gt;决策五：保证Prompt Self Contained（自包含）并且结果导向&lt;/h2&gt;
&lt;p&gt;第五个决策是：给AI的指令讲清楚所有的信息（不指望它读心），而且着重说acceptance criteria是什么，而不是每一个步骤怎么做。&lt;/p&gt;
&lt;p&gt;我的prompt大概是这样：&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;对于blog/content下面每一篇.md文件，从SEO的角度写一个summary域放到metadata里。你可以用sub-agent来做。先看几篇文章找到感觉，然后想一个prompt，让不同的sub-agent分别处理不同的文章。开8个agent并行处理，每个agent负责写summary并直接编辑.md文件。另外，我希望有个测试能check summary coverage，如果coverage不到100%测试就fail。你的目标就是把这个测试搞到100%让它能过。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;注意我没有告诉它具体怎么写这个测试程序、怎么处理各种corner case。&lt;/p&gt;
&lt;p&gt;这是很多人容易搞反的地方。他们给AI写指令的时候，事无巨细地规定每一步怎么做。这其实是在把AI当程序用，浪费了Agentic AI的主观能动性。AI不是一个只会照本宣科的乙方，它有很强的判断力和执行力。我们要发挥它的主观能动性，但同时给它一个足够清晰的边界。&lt;/p&gt;
&lt;p&gt;我总结写prompt有两个原则。第一，context要给足，不要指望AI能读心。它不知道metadata结构是什么样的。这些信息要么直接给，要么要保证它自己能搞清楚（比如这里我们给了具体路径，它可以通过读文件搞清楚）。第二，从结果出发，而不是从过程出发。你告诉AI你要什么，让它自己想怎么做。除非你预测到某个环节不给具体指导它会出问题——比如前面的context window问题——否则不用讲那么细。&lt;/p&gt;
&lt;p&gt;这个决策和决策三是一体两面：决策三是说把执行交给agent，决策五是说把指令也写成适合agent的形式。&lt;/p&gt;
&lt;h2&gt;总结：AI是一种杠杆&lt;/h2&gt;
&lt;p&gt;最后说一点感受。&lt;/p&gt;
&lt;p&gt;这个任务，我用语音识别花了大概两分钟把指令讲给AI。然后AI自己折腾了45分钟：并行开8个sub-agent，处理各种边界条件，写测试，返工，跑通，commit。全程我就没再管了。这就是一种leverage。用两分钟的时间，撬动了AI 45分钟的工作量。更准确地说，用5%的时间控制了100%的工程产出。&lt;/p&gt;
&lt;p&gt;而且现在的Agentic AI能力已经足够强，可以长时间自主工作。我们不需要盯着它干活。只要讲清楚deliverable是什么、acceptance criteria是什么，就可以去干其他事了。这就带来了一种新的可能：scalable agentic workflow。比如我们用两分钟撬动一个Agent A，让他忙45分钟。然后这个时间我们再去指挥Agent B，C，D，... 同时启动多个AI并行推进。这样脑力负担确实会很高，但这是在单Agentic workflow的基础上，再进一步实现10倍生产力的切实可行的途径。&lt;/p&gt;
&lt;p&gt;说完了10倍生产力的一面，这个项目的另一面是，有用AI的意识，但是方法不对——在ChatGPT里讨论、没有测试机制、让一个AI包办所有。这些决策做错了，我们可能要折腾几个小时才能做完，甚至鬼打墙做不出来。同一个任务，甚至同一个LLM，会用和不会用，决策做的质量高低，就是从从容容游刃有余和吃力不讨好，比人工做更慢的差别。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"/><category term="Chinese"/><category term="Agentic AI"/></entry><entry><title>Key Decisions for Agentic Workflows: A Simple Case Study</title><link href="https://yage.ai/ai-key-decisions-en.html" rel="alternate"/><published>2026-02-20T18:00:00-08:00</published><updated>2026-02-20T18:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-20:/ai-key-decisions-en.html</id><summary type="html">&lt;p&gt;A real-world case study of directing AI to add SEO summaries to 300 articles in two minutes, breaking down five key decisions: choosing the right execution environment, building tests before work, letting agents handle corner cases, divide and conquer, and outcome-oriented prompt writing.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Today I used AI to complete a small task. This case feels particularly suitable for introducing AI's practical principles, so I wrote this article to share it.&lt;/p&gt;
&lt;p&gt;The task itself was to add a summary line to every article in this blog, which helps search engines understand the website's content and improve its ranking (SEO). This task looks simple, but it has many pitfalls—one careless move and you fall into the trap of AI getting stuck in loops, being unreliable, or being cumbersome to use. Below I'll mainly share the five important decisions I made during this process to make the entire workflow stable and reliable.&lt;/p&gt;
&lt;h2&gt;Decision 1: Use a Local Coding Agent, Not ChatGPT&lt;/h2&gt;
&lt;p&gt;The first decision I made was to use Cursor/OpenCode as the platform for discussion, not ChatGPT. This isn't obvious, because the project started with me wanting to do SEO for this website. Intuitively, this seems like a chat-type task better suited for ChatGPT. But I still insisted on using OpenCode. The fundamental reason is friction.&lt;/p&gt;
&lt;p&gt;Specifically, friction exists in two aspects. First is the friction of context transfer. With ChatGPT, I need to copy and paste my blog's content or even code to it, or have it write code to fetch these articles. But in OpenCode, I just use @ to specify the folder where my blog is located—much less friction.&lt;/p&gt;
&lt;p&gt;Another aspect is the friction of implementation. For example, if we reach a conclusion through chatting in ChatGPT that this website needs to add Summary metadata, to implement this idea, I need to copy and paste several rounds of chat history between me and ChatGPT into Cursor/OpenCode, and then call another AI to modify the article content. In contrast, if the discussion happens in OpenCode from the beginning, it can be implemented immediately after the discussion.&lt;/p&gt;
&lt;p&gt;So I made this first decision: for almost all tasks, abandon chat-based AI environments and choose executable Agentic environments. Why put this decision first? Because it's the difference between the project happening at all and not happening. If friction is high, we won't bother continuing, and the entire project takes time but delivers 0—pure waste of time. Only when friction is low and the project can continue is it meaningful to discuss specific methods and techniques.&lt;/p&gt;
&lt;h2&gt;Decision 2: Before Starting, Define Success and Provide Tests&lt;/h2&gt;
&lt;p&gt;The second decision I made was: before letting AI generate any summaries, have it write a test first. This test does something very simple—check all .md files to see if they have a summary field. If not 100% of files have this field, it fails, and prints which files have problems.&lt;/p&gt;
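&lt;p&gt;As an illustration, here is a minimal sketch of what such a coverage test could look like (not the code the agent actually wrote; the path and metadata format are assumptions): scan every .md file under blog/content, report coverage, list the files missing a summary, and exit non-zero unless coverage is 100%.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pathlib import Path
import sys

CONTENT_DIR = Path("blog/content")  # assumed path; adjust to the actual repo layout

def has_summary(md_file):
    # Simplifying assumption: the summary appears as a "Summary:" line
    # in the metadata block at the top of the file.
    head = md_file.read_text(encoding="utf-8").splitlines()[:20]
    return any(line.strip().lower().startswith("summary:") for line in head)

def main():
    files = sorted(CONTENT_DIR.rglob("*.md"))
    missing = [p for p in files if not has_summary(p)]
    print(f"summary coverage: {len(files) - len(missing)}/{len(files)}")
    for p in missing:
        print(f"missing summary: {p}")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(main())
&lt;/code&gt;&lt;/pre&gt;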
&lt;p&gt;Why write the test first? Because without this test, when AI says it's done, I don't actually know if it's really done. I can spot-check a few articles, but with over 300 articles, spot-checking can't cover everything. You end up with neither me nor the AI knowing for sure, both of us just engaging in wishful thinking.&lt;/p&gt;
&lt;p&gt;But with the test, it's different. After AI completes one round, if the test fails, it knows there are 20 articles not covered, and will re-examine these articles. When the test passes, it's 100% complete. No manual spot-checking needed, no guessing—everything is 100% certain.&lt;/p&gt;
&lt;p&gt;This is the &lt;a href="/agentic-ai-crisis-en.html"&gt;feedback loop&lt;/a&gt; we've been emphasizing. Many people fall into a cycle with AI: nudge it, it moves a little, the result turns out wrong, nudge it again, and they walk away concluding that AI is hard to use. The root cause is that no feedback mechanism was ever established. AI doesn't know what "done" means, and you don't know how far AI has actually gotten. This is the core problem to solve first. Deterministic testing is a very effective solution. In fact, once this kind of test is in place, the next three decisions are just icing on the cake.&lt;/p&gt;
&lt;p&gt;So before starting any task, I ask myself: Do I/AI have a deterministic way to judge whether the task is complete? If not, build this mechanism first.&lt;/p&gt;
&lt;h2&gt;Decision 3: Let the Agent Do It, Instead of Writing Programs to Call APIs&lt;/h2&gt;
&lt;p&gt;The third decision was: I didn't write a program that calls an LLM API to generate summaries; instead, I let the coding agent do it itself.&lt;/p&gt;
&lt;p&gt;More detailed reasons are explained in &lt;a href="/result-certainty-en.html"&gt;this article&lt;/a&gt;. Although having AI do summarization sounds like just calling an API, if you think carefully, there are many corner cases here: some articles already have summaries and shouldn't be duplicated, some metadata formats are inconsistent, some positions need adjustment. If you write a program to handle these situations, the code becomes very complex, debugging costs are high, and progress is slow. Eventually, AI might spend a lot of effort adjusting how to handle these details.&lt;/p&gt;
&lt;p&gt;Another approach is to use natural language to directly assign tasks to Cursor/OpenCode: "Go look at XX.md and make sure it has an SEO-oriented summary metadata field." At this point, the entity completing the task is not a mechanical program, but an Agent with real intelligence and adaptability. It handles situations on its own—skipping if summary exists, adjusting if format is wrong, judging by itself when encountering special cases.&lt;/p&gt;
&lt;p&gt;This is the difference between using AI as an agent and using AI as a tool. The API-calling pattern is: you write programs, AI is one component. This pattern has high certainty but low flexibility, and is actually slower when encountering complex situations. With Agentic AI, certainty moves from process to outcome—you only need to clearly state what result you want. The rest, AI figures out using its own initiative and judgment.&lt;/p&gt;
&lt;p&gt;So in my workflow, calling APIs is the last resort. Whatever can be handed to agents, I hand to agents.&lt;/p&gt;
&lt;h2&gt;Decision 4: Use Divide and Conquer to Handle Cognitive Saturation&lt;/h2&gt;
&lt;p&gt;The fourth decision was: I didn't assign one agent the task of handling 300 articles all at once, but had it open 8 sub-agents, distribute tasks, and process in parallel.&lt;/p&gt;
&lt;p&gt;The reason relates to context window saturation. If one agent processes 300 articles at once, it might be okay at first, but after reading a dozen articles, the context window &lt;a href="/wide-research-en.html"&gt;gets filled up&lt;/a&gt;, and later it starts slacking off, skipping articles, or forgetting pitfalls encountered earlier. This is similar to humans—when cognitive load is high, we become forgetful or start cutting corners.&lt;/p&gt;
&lt;p&gt;Another reason is that sub-agents are a natively supported feature of coding agents. I don't need to write concurrency logic, task distribution, or result aggregation myself. This plumbing work is all outsourced. I just need to describe the workflow in a sentence or two.&lt;/p&gt;
&lt;p&gt;Many people using AI don't realize this problem. They don't think about AI's defects or anticipate the pitfalls, and just assign tasks using the most intuitive method. But just like when managing subordinates we need to know their strengths and weaknesses, we need to realize that AI's cognitive resources are particularly limited—context window is a scarce resource that needs management. When task volume is large, quality inevitably drops. So when there's a lot of work, I actively consider splitting it up rather than having one agent carry everything.&lt;/p&gt;
&lt;p&gt;The relationship between this decision and the previous ones: Decision 2 ensures results are correct (tests pass), Decision 3 ensures the process is flexible (agent handles corner cases itself), Decision 4 goes further by avoiding a guaranteed pitfall, ensuring processing is both fast and good.&lt;/p&gt;
&lt;h2&gt;Decision 5: Ensure Prompt Is Self-Contained and Outcome-Oriented&lt;/h2&gt;
&lt;p&gt;The fifth decision was: when giving AI instructions, clearly state all information (don't expect it to read minds), and emphasize what the acceptance criteria are, not how to do each step.&lt;/p&gt;
&lt;p&gt;My prompt was roughly this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For each .md file under blog/content, write a summary field from an SEO perspective and put it in metadata. You can use sub-agents to do this. First look at a few articles to get a feel, then think of a prompt, let different sub-agents process different articles. Open 8 agents to process in parallel, each agent responsible for writing summaries and directly editing .md files. Also, I want a test to check summary coverage—if coverage is below 100%, the test fails. Your goal is to get this test to 100% so it passes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Notice I didn't tell it specifically how to write this test program or how to handle various corner cases.&lt;/p&gt;
&lt;p&gt;This is where many people get it backwards. When writing instructions for AI, they specify every step in detail. This is actually treating AI as a program, wasting Agentic AI's subjective initiative. AI is not a yes-man who only follows instructions—it has strong judgment and execution capabilities. We should leverage its initiative while giving it a clear enough boundary.&lt;/p&gt;
&lt;p&gt;I summarize two principles for writing prompts. First, give enough context—don't expect AI to read minds. It doesn't know what the metadata structure looks like. This information should either be given directly or ensured that AI can figure it out itself (for example, here we gave a specific path, and it can figure it out by reading files). Second, start from outcomes, not processes. Tell AI what you want, let it figure out how to do it. Unless you predict that not giving specific guidance on some aspect will cause problems—like the context window issue earlier—there's no need to explain in such detail.&lt;/p&gt;
&lt;p&gt;This decision and Decision 3 are two sides of the same coin: Decision 3 says hand execution to agents, Decision 5 says write instructions in a form suitable for agents.&lt;/p&gt;
&lt;h2&gt;Summary: AI Is Leverage&lt;/h2&gt;
&lt;p&gt;Finally, some thoughts on my experience.&lt;/p&gt;
&lt;p&gt;This task took me about two minutes to dictate instructions to AI using voice recognition. Then AI worked on it for 45 minutes: opening 8 sub-agents in parallel, handling various edge cases, writing tests, reworking, getting tests to pass, committing. I didn't manage it at all during this process. This is leverage. Using two minutes of time to leverage 45 minutes of AI work. More precisely, using 5% of time to control 100% of engineering output.&lt;/p&gt;
&lt;p&gt;And current Agentic AI capabilities are strong enough to work autonomously for long periods. We don't need to watch it work. As long as we clearly state what the deliverable is and what the acceptance criteria are, we can go do other things. This brings a new possibility: scalable agentic workflow. For example, we use two minutes to leverage Agent A, keeping it busy for 45 minutes. Then during this time we go command Agent B, C, D... simultaneously launching multiple AIs to proceed in parallel. The cognitive load is indeed high, but this is a practical path to achieve 10x productivity on top of single-agent workflow.&lt;/p&gt;
&lt;p&gt;Having talked about the 10x productivity side, the flip side of this project is: having the awareness to use AI, but using the wrong methods—discussing in ChatGPT, no testing mechanism, letting one AI handle everything. If these decisions are wrong, we might struggle for hours to finish, or even get stuck in endless loops unable to complete. With the same task and even the same LLM, the gap between using it well and using it poorly, between good decisions and bad ones, is the gap between working with calm, effortless control and grinding away thanklessly, slower than doing it by hand.&lt;/p&gt;</content><category term="Computing"/><category term="English"/><category term="Agentic AI"/></entry><entry><title>OpenClaw 是什么｜AI Agent 聊天工具的原理、价值与局限</title><link href="https://yage.ai/openclaw.html" rel="alternate"/><published>2026-02-14T23:00:00-08:00</published><updated>2026-02-14T23:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-14:/openclaw.html</id><summary type="html">&lt;p&gt;OpenClaw 爆火的原因和去年 DeepSeek 一模一样——不是技术突破，而是把小众体验推向大众。本文不教配置，而是从产品设计角度拆解它的记忆系统、Skills 机制和聊天界面的根本局限，帮你判断该不该跟，以及怎么把核心思路用到自己的工作流里。&lt;/p&gt;</summary><content type="html">&lt;p&gt;OpenClaw在2026年1月底爆火。公众号铺天盖地都在介绍怎么配置，云服务厂商都速度上线了一键部署，生怕错过这波热度。与此同时，各种行为艺术又满天飞：ClawdBot、MoltBot、OpenClaw，一周内改了三次名；结果改名的时候账号还被抢注，被一个叫$CLAWD的代币诈骗了1600万美元。此外，安全漏洞也层出不穷：有12%的第三方skills含恶意代码，有不少人把控制台裸露在公网上没设密码。一时间让人感觉整个领域全是相互矛盾的噪音，无所适从：这东西到底要不要装？不装会错过什么？装了有什么风险？这到底是下一个生产力革命还是又一个两周就过气的玩具？&lt;/p&gt;
&lt;p&gt;这篇文章就想从更高层的角度抽丝剥茧：OpenClaw到底做对了什么，为什么是它火，以及这跟我们有什么关系。&lt;/p&gt;
&lt;h2&gt;为什么会火的暴论&lt;/h2&gt;
&lt;p&gt;我有一个暴论：OpenClaw火的原因，和去年这个时候DeepSeek火的原因，是高度类似的。&lt;/p&gt;
&lt;p&gt;DeepSeek流行的时候，当时国内大家用的AI主要是纯聊天，没有搜索功能也经常信口瞎编。ChatGPT和Claude虽然有了思考和搜索功能，智能强很多，但国内用不了。DeepSeek引入了推理功能和搜索功能以后，第一次让大家体验到了会搜索懂思考的AI，带来了一种震撼：哇，AI还能这么有用，就爆火了。换言之，这个火不是因为技术上比竞争对手更好，事实上DeepSeek在纯模型能力上并没有碾压同时代的GPT-4o或者Claude 3.5。而是因为把一小撮人享受/习惯的事情，一下子推广到另一群更大的用户群面前，这才火起来。&lt;/p&gt;
&lt;p&gt;OpenClaw也是一样。2026年初Agentic AI领域其实有一个断层：ChatGPT这种产品虽然流行，但相比Cursor/Claude Code/Codex这种有本地权限的编程Agentic AI，整体能力还是落后了至少一代（具体为什么后面有解释）。但Cursor这种工具非常小众，基本上只有程序员在用。大家用的还是ChatGPT这种消费级产品，就觉得AI这两年没啥进步，能力很有限。然后OpenClaw第一次把Cursor这种能本地编程的Agent和WhatsApp/Slack/飞书这种流行通信软件接起来了，让非技术人员这种更广大的用户群第一次接触到了能读写文件，能执行命令，有记忆能持续迭代的Agentic AI，就爆火了。换言之，这个火不是说OpenClaw在技术上做到了什么新的事情，而是因为把一小撮人享受/习惯的事情，一下子推广到另一群更大的非技术用户群面前，这才火起来。&lt;/p&gt;
&lt;p&gt;但我说这些不是为了得出结论说OpenClaw、DeepSeek是花架子，没必要学。恰恰相反，DeepSeek从历史的角度提供了很多启发。比如DeepSeek火了以后，真正从中受益的是哪些人？我的观察是，有没有跟风第一时间玩上DeepSeek本身并不重要。很多人玩了一段时间就退烧了。真正理解了DeepSeek为什么火，把搜索和推理这两个关键因素整合到了自己工作流里的人，才是真正受益的人。类似的，OpenClaw火了以后，我们确实可以去跟风安装使用、体验一下，但这件事情本身并不会让我们一下就脱胎换骨生产力倍增了。因为这种现象级产品能爆火的重要前提是它是面向最广泛的用户设计的，因此设计决策上有很多妥协，直接用往往效率并不是最优。更关键的是要去理解它背后的设计哲学，分析它爆火的原因，从中吸取经验教训，改进自己的工作流。&lt;/p&gt;
&lt;p&gt;毕竟，工具会过气，对工具本质的理解不会。把可迁移的认知抽出来，融入自己的工作流，这才是内行的做法。&lt;/p&gt;
&lt;h2&gt;聊天界面：流行的基础，也是天花板&lt;/h2&gt;
&lt;p&gt;在具体分析OpenClaw的牛逼之处之前，我想先带大家看一个具体的例子，来解释“OpenClaw是面向最广泛的用户设计的”这句话到底是什么意思，以及有什么影响。&lt;/p&gt;
&lt;p&gt;前面我们提到OpenClaw火起来非常关键的一点是，它选用了大家天天都用的聊天软件作为交互入口，而不是像Cursor一样让你在电脑上多装一个软件。这样可以复用现有的使用习惯和渠道，让用这个工具的心智负担特别低。你没事反正都要用Slack/飞书，正好就看到了OpenClaw就会想着用用。另一方面，因为大家本身就非常熟悉这些软件的使用，所以它把学习成本也几乎压到了零。不需要装IDE，不需要学编程的术语概念，拿起手机就能用，这是它能出圈的基础。&lt;/p&gt;
&lt;p&gt;但如果你用过Cursor这种Agentic AI编程软件的话，就会发现Slack这种聊天窗口对AI来说是个相当受限的交互方式。&lt;/p&gt;
&lt;p&gt;第一是它要求对话是线性的。像Slack和微信这样的聊天窗口主要就是一条条消息往下排。但是深度的知识工作往往不是线性的。比如你需要引用另外一个thread的内容，需要把两个方向的探索merge在一起，需要在某个会话中fork出去。这些在桌面环境里比如Cursor和OpenCode里面都有专门的UI可以实现，但是在聊天窗口里面做就特别别扭。&lt;/p&gt;
&lt;p&gt;第二个问题是信息密度。如果只是做玩具性质的调研和开发，聊天窗口是没有问题的。但凡要做更复杂一点的分析和思考，它的信息密度就捉襟见肘了。比如图文混排的分析报告、复杂的表格、带格式的长文，这些在聊天里面看还都蛮痛苦的。同时不同平台对Markdown的支持也参差不齐，体验很不稳定。&lt;/p&gt;
&lt;p&gt;第三个问题出在过程的可观测性上。尤其是对要分好几步才能完成的任务，我把执行权交给AI以后，很自然地会想关心它到底在干啥。比如它是在稳步推进，还是在钻牛角尖鬼打墙？它调用了什么工具，改了哪些文件？这些在Cursor等等工具里会有自然的呈现，但是聊天窗口我们只能看见一条“对方正在打字”或者一个emoji表示正在处理。尤其是比较复杂的任务，OpenClaw需要等蛮久才能等到一条消息告诉我们搞定了还是中间挂了。&lt;/p&gt;
&lt;p&gt;但是我说这么多不是想说OpenClaw设计不好，而是想说这里面有个很明显的妥协（trade-off）。你要想把工具做得容易上手、面向最大的用户群，就必须用聊天工具这些人人都已经在用的工具作为载体。但这同时立刻又带来了对话形式、信息密度等等弊端。反之亦然。在这个从“易用但是拧巴”到“原生但是小众”的连续的trade-off空间里，OpenClaw选择了极致的易用性。这是它能爆火的基础。但我们也要清醒地认识到这种设计决策所带来的限制。在融合进自己工作流的时候，不是无脑地采用OpenClaw的所有设计，而是应该因地制宜，根据自己的需求来在这个trade-off轴线上找到属于自己的甜点区。&lt;/p&gt;
&lt;p&gt;理解了这个trade-off，后面的分析就容易理解了。&lt;/p&gt;
&lt;h2&gt;界面之外的流行要素&lt;/h2&gt;
&lt;p&gt;聊天界面是OpenClaw流行的基础，但只是最浅显的一点。真正让用户觉得这个AI真的智能，好用，懂我的，是它背后的三个设计决策。&lt;/p&gt;
&lt;p&gt;第一个是统一的入口和上下文。对比一下Cursor就很清楚。在Cursor里每个项目的上下文是隔离的——打开项目A，AI只知道项目A的事；切到项目B，之前关于项目A的对话就全没了。Claude Code、OpenCode也一样，每次启动都绑定一个工作目录。但OpenClaw则完全相反。它默认把所有对话的上下文混在一个池子里。你上午在Telegram里让它帮你整理邮件，下午在Slack里让它写个报告，晚上在WhatsApp里让它安排明天的日程——它全都记得。给人的感觉就是它特别聪明，好像真的认识你。&lt;/p&gt;
&lt;p&gt;但光把上下文混在一起是没用的，因为上下文窗口很快就会满了。这就牵扯到了它的第二个关键设计，持久化记忆。OpenClaw对记忆的处理非常巧妙，很值得学习。从大的原理上，它&lt;a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus"&gt;和Manus一样&lt;/a&gt;用的是基于文件的记忆系统。比如它维护了一个SOUL.md，定义AI的核心人格和行为准则；USER.md保存了对用户的画像，MEMORY.md存长期记忆，再加上每日的原始日志等等。&lt;/p&gt;
&lt;p&gt;这里面比较巧妙的是它有个自我维护机制：AI每隔一段时间（heartbeat）会自动review最近的原始日志，把有价值的信息提炼到MEMORY.md里，顺便清理过时的条目。整个过程不需要用户干预。这个自我维护机制就把记忆给分层了，原始日志是短期记忆，每天的MEMORY.md是中期记忆，提炼出来的个性和喜好是长期记忆。对用户来说，体验一下就从“每次重开都要重新交代一遍”变成了“它好像在成长”，这个感知差异是非常大的。&lt;/p&gt;
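&lt;p&gt;为了把这个机制说得更具体，下面是一个示意性的 heartbeat 片段（只是对思路的还原，不是 OpenClaw 的真实实现；文件名沿用正文提到的 MEMORY.md，&lt;code&gt;distill&lt;/code&gt; 是个占位函数，真实系统里这一步由模型完成）：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datetime import date
from pathlib import Path

MEMORY = Path("MEMORY.md")   # 长期记忆
LOG_DIR = Path("logs")       # 每日原始日志，相当于短期记忆

def distill(raw_log):
    # 占位实现：真实系统里由模型把原始日志压缩成值得长期保留的要点
    lines = [line for line in raw_log.splitlines() if line.strip()]
    return "\n".join(lines[:3])

def heartbeat():
    # 每隔一段时间自动执行：读取当天日志，提炼要点，追加进长期记忆
    log_file = LOG_DIR / f"{date.today().isoformat()}.md"
    if not log_file.exists():
        return
    summary = distill(log_file.read_text(encoding="utf-8"))
    with MEMORY.open("a", encoding="utf-8") as f:
        f.write(f"\n## {date.today().isoformat()}\n{summary}\n")
&lt;/code&gt;&lt;/pre&gt;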
&lt;p&gt;第三个设计是丰富的Skills。这个意义要远超节省那么一点用户的时间。工具数量带来的好处&lt;a href="/manus.html"&gt;不是线性的&lt;/a&gt;——6个工具比4个工具的能力提升，远大于4个相对2个。这是因为工具之间可以组合。接Slack能管下达指令，状态汇报，接图像生成能画图，接PPT服务能出稿，接deep research能调研。这些凑在一起，就可以组合进化出很多完整的业务能力和应用场景。&lt;/p&gt;
&lt;p&gt;这三个设计之间也不是简单的加法，而是互相促进的。&lt;/p&gt;
&lt;p&gt;记忆加上统一的上下文池，会带来数据复利。因为有持久化记忆，对话可以跨会话积累；因为有统一入口，所有来源的数据汇进同一个记忆池。你在Slack里讨论的工作内容、在Telegram里安排的日程、在WhatsApp里的个人对话，全部混在一起，形成了对你越来越完整的理解，以后完成任务也会越来越贴心。&lt;/p&gt;
&lt;p&gt;记忆加上skills，带来了自我进化的能力。今天学到的用法明天还在，能力会累积；AI自己能写新的skill并且记住它的存在和用法，这就进入了正循环。这里面特别值得一提的是coding能力。因为OpenClaw自己能写代码，所以遇到没有现成skill可用的时候，它就可以当场造一个。这个新skill会被保存下来，下次遇到类似场景直接复用。这就形成了自我进化的闭环。&lt;/p&gt;
&lt;p&gt;而这些能力和界面的易用性加在一起，又带来了使用频率。入口越顺滑，调用越频繁，飞轮越转越快，能力越来越强。&lt;/p&gt;
&lt;p&gt;总之，OpenClaw是一个相当厉害的产品。它的各种决策，不论是技术的（入口、记忆、工具）还是非技术的（界面），都在为同一个飞轮服务，让普通人第一次摸到了Agentic AI的完整形态。&lt;/p&gt;
&lt;h2&gt;限制和trade-off&lt;/h2&gt;
&lt;p&gt;前面说了它为什么牛，下面我要开始吐槽了。但我想先解释一下，下面介绍的这些限制不是说OpenClaw疏忽了没做好，而是前面说的那个trade-off的直接后果——为了爆款好用必须付出的代价。&lt;/p&gt;
&lt;p&gt;界面的限制前面已经说过了：线性、低信息密度、低可观测性。在深度使用时这些很快会成为瓶颈，这里不再赘述。&lt;/p&gt;
&lt;p&gt;更深层的问题在记忆上。OpenClaw的记忆系统对小白很友好。你不用管，它自己就会打理和进化。但对想把知识沉淀成资产的人来说，这反而是一个障碍。&lt;/p&gt;
&lt;p&gt;举个栗子，比如我们做完一次调研，产出了一份5000字的长文或者一份PRD。在Cursor/文件系统里它就是一个文件：&lt;code&gt;docs/research.md&lt;/code&gt;，想引用就@，想升级就开新版本，想对比就diff。但在OpenClaw里，这份东西像是人类记忆一样，说不定什么时候就会被自动摘要、自动重写，甚至整个被删除了（遗忘），整个过程完全不可控。你很难跟它说清楚：以后就以这份文档为准，遇到相关问题必须引用它，不要给我压缩成三行。总之就是，知识没办法显式管理。&lt;/p&gt;
&lt;p&gt;更让人头疼的是整个更新过程也是一个黑盒。MEMORY.md里存什么、怎么组织、什么时候清理，主要是AI在heartbeat期间自动做的。你看到的是结果，很难看到原因：它这次改了哪些条目，为什么删掉这一条，为什么把两个不相关的东西合并在一起。出了问题也很难定位根源，因而很难改进。&lt;/p&gt;
&lt;p&gt;OpenClaw记忆系统的设计带来的另一个问题是跨场景的信息干扰。统一记忆当然带来懂我的感觉，但也意味着信息很容易跨项目污染：A项目的偏好、甚至某个临时决定，可能会莫名其妙影响到B项目。对小白来说它好像什么都记得，但对真的想干活的进阶用户来说更像是“我去怎么又被它带偏了”。&lt;/p&gt;
&lt;p&gt;Skills的安全隐患又是另一类问题。ClawHub上的上千个技能中，安全审计发现有上百个包含恶意代码——加密货币盗窃、反向shell后门、凭证窃取都有。Simon Willison提过一个&lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"&gt;致命三角&lt;/a&gt;的概念：一个AI系统同时具备访问私有数据、暴露于不可信环境、能够对外通信这三个能力时，风险是指数级放大的。OpenClaw三个全中🤡。这就形成了一个奇特的悖论。你要想用的爽，就必须给他很多工具和权限。但这又会带来安全问题，所以就要把权限收得很紧。但权限收紧了就又变成类似Manus那样的云端Agent服务了，没了本地Agent的爽。安全和好用，似乎成了一对矛盾。&lt;/p&gt;
&lt;h2&gt;So What?&lt;/h2&gt;
&lt;p&gt;讲到这里，自然会有人问：分析了一堆，然后呢？这跟我有什么关系呢？&lt;/p&gt;
&lt;p&gt;回答是：可以用这些认知，在已有的工具上搭一套比OpenClaw更顺手的东西。我自己就是这么干的，效果比直接用OpenClaw好很多。下面讲几个关键决策。&lt;/p&gt;
&lt;h3&gt;复用Agentic Loop，而不是自己造&lt;/h3&gt;
&lt;p&gt;我们做的第一个决策，也是最重要的一个，是不自己从头实现一套Agentic AI系统，而是复用OpenCode这样的开源CLI编程工具作为基础。&lt;/p&gt;
&lt;p&gt;这个决策背后有一个更深层的判断。做一个能用的Agentic Loop——也就是调API、解析工具调用、执行工具、把结果返回给AI、请求下一次回答这个循环——说起来简单，但要做到能支撑真实使用的水平，有很多细节：文件系统的读写，文件内容的新增删除替换，沙箱环境，权限管理……每个都是坑。这些东西写起来繁杂、充满陷阱，而且和我们最终想创造的价值没有多少关系。&lt;a href="https://yage.ai/ai-builder-space.html"&gt;我之前的一篇文章&lt;/a&gt;里详细讨论过这个问题——核心观点是，Agentic Loop是体力活，应该外包；真正值得花精力的是Agentic Architecture，也就是怎么把业务逻辑注入AI系统让它直接创造价值。&lt;/p&gt;
&lt;p&gt;而OpenCode、Claude Code这类工具，恰恰就是一个特别好的外包。它们已经把Agentic Loop做得非常成熟了——能读写文件、能跑命令、能持续迭代，而且还在飞速进化中。用它们做基石，等于是白嫖了整个agentic编程工具链，可以把自己的开发成本降到最低。而且选OpenCode还有一些额外的好处：它完全开源可以魔改，支持并行的subagent（Cursor和Codex到现在都还没有），还支持多种coding plan——比如我自己用的是GLM的coding plan，也可以直接用OpenAI的Codex plan，不用像直接调API那么烧钱。&lt;/p&gt;
&lt;h3&gt;文件即记忆：继承和发展OpenClaw的哲学&lt;/h3&gt;
&lt;p&gt;第二个决策是在记忆体系上。OpenCode/Claude Code这类工具天生就有磁盘即记忆的思想——毕竟它们作为编程工具处理的基础单元就是文件。当我们又有基于磁盘的记忆，又有对文件直接的操纵权和透明度的时候，就解决了前面分析中OpenClaw记忆系统的问题。想沉淀资产就写文件，想强制AI遵守某些规则就写AGENTS.md，想管理记忆结构就直接编辑Markdown。前面说的那些知识没法显式管理、更新过程是黑盒的问题，用OpenCode的细粒度控制和文件系统天然就解决了。&lt;/p&gt;
&lt;p&gt;但光有文件系统还不够，我们还把OpenClaw那套persona自我进化的机制移植了过来。具体来说，我们把记忆分成了两层：project-level的记忆（每个项目自己的上下文、决策记录、技术方案）和persona-level的记忆（用户画像、行为偏好、沟通风格）。然后在AGENTS.md里加入persona维护的workflow，让AI在session结束时自动review对话、更新MEMORY.md和USER.md。同样的自我进化，但跑在完全可控的文件系统上，还能用Git做版本管理。&lt;/p&gt;
&lt;p&gt;至于统一上下文的问题，我们用了一个很简单粗暴的方案：Mono Repo。把不同项目放在同一个repo的不同文件夹下，AI天然就可以跨项目访问所有上下文。想隔离就隔离，想共享就共享，想merge两个方向的探索就直接@，想fork出去就复制文件——全都是文件系统和OpenCode的原生操作，比OpenClaw在聊天窗口里拧巴地做这些事情自然太多了。&lt;/p&gt;
&lt;h3&gt;Skills和安全&lt;/h3&gt;
&lt;p&gt;Skills方面，OpenCode生态有大量MCP server和Skills可以接入——日历、邮件、浏览器、搜索等等——功能覆盖和ClawHub大差不差。安全性上，我们的做法是不直接安装第三方skill，而是让AI先审查源码、理解逻辑，然后重写一个干净版本。在AI辅助编程的今天这个过程通常只要几分钟，但可以极大降低供应链攻击的风险。&lt;/p&gt;
&lt;h3&gt;最后一公里：移动端&lt;/h3&gt;
&lt;p&gt;前面三个决策解决了底座、记忆和工具的问题，但还差一个关键的东西：入口。OpenClaw火的一个重要原因是你不用坐在电脑前面。但现有的编程工具在这方面确实拉胯——VSCode有个Code Server可以远程访问，但对iPad非常不友好；OpenCode有个Web Client，但说实话只是解决了有和无的问题，非常难用；Cursor的Web Client高度绑定Github；Claude Code则完全没有Web Client。&lt;/p&gt;
&lt;p&gt;为了解决这个问题，我们做了一个原生的iOS App作为OpenCode的远程客户端。注意这个App不是把聊天窗口搬到手机上——它是一个真正为移动端设计的工作界面：能看到AI的实时工作进度，每一步工具调用、每一个文件操作；能切换模型做A/B测试；能浏览Markdown文件和审查更改；支持语音输入；支持基于HTTPS或者SSH隧道的公网访问；iPad上还有三栏分屏。&lt;/p&gt;
&lt;p&gt;这个客户端已经在github上&lt;a href="https://github.com/grapeot/opencode_ios_client"&gt;开源&lt;/a&gt;了。欢迎大家也来体验。未来可能会加入TestFlight。效果是吃灰很久的iPad重新变成了生产力工具，在沙发上指挥AI干活的体验比OpenClaw的聊天窗口爽得多。外出吃饭的时候接到oncall，也可以直接给AI小弟布置任务，当场就搞清楚了原因。而且全程都有对AI完全的掌控，知道它不会出幺蛾子，也不会把你的信息po到Moltbook上。&lt;/p&gt;
&lt;p&gt;&lt;img alt="iPad客户端" src="/images/opencode_ios_client.jpeg"&gt;&lt;/p&gt;
&lt;h2&gt;总结&lt;/h2&gt;
&lt;p&gt;回到开头的暴论。OpenClaw和DeepSeek的火，本质上是同一件事：把一小撮人已经在享受的能力，第一次推到了更广泛的人群面前。DeepSeek让大家第一次用上了会搜索懂推理的AI，OpenClaw让大家第一次摸到了能读写文件、有记忆、会自我进化的Agentic AI。&lt;/p&gt;
&lt;p&gt;但也正因为要面向最广大的普通用户，这类产品必然在设计上做大量妥协。DeepSeek如此，OpenClaw也如此。聊天界面带来了易用性但牺牲了表达力，统一记忆带来了懂我的感觉但牺牲了可控性，开放的Skills生态带来了能力但引入了安全风险。&lt;/p&gt;
&lt;p&gt;对于已经在用Cursor/Claude Code/OpenCode的人来说，更值得做的不是无脑跟风装一个OpenClaw，而是理解它为什么火——统一入口、持久化记忆、工具生态，以及它们之间的飞轮——然后把这些认知融入自己已有的工具链里，扬长避短。我们自己就是这么干的，效果确实比直接用OpenClaw好很多。&lt;/p&gt;
&lt;p&gt;毕竟，工具会过气，对工具本质的理解不会。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"/><category term="Chinese"/><category term="Agentic AI"/><category term="Review"/></entry><entry><title>OpenClaw Deep Dive: Why It Went Viral and What It Means for You</title><link href="https://yage.ai/openclaw-en.html" rel="alternate"/><published>2026-02-14T22:00:00-08:00</published><updated>2026-02-14T22:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-14:/openclaw-en.html</id><summary type="html">&lt;p&gt;OpenClaw went viral for the same reason DeepSeek did — not a technical breakthrough, but bringing a niche power-user experience to the masses. This post skips setup tutorials and instead dissects its memory system, Skills mechanism, and the fundamental ceiling of chat-based AI interfaces, helping you decide whether to adopt it and how to extract its core ideas into your own workflow.&lt;/p&gt;</summary><content type="html">&lt;p&gt;OpenClaw went absolutely viral at the end of January 2026. Social media was flooded with configuration guides, and cloud service providers rushed to launch one-click deployments, terrified of missing the hype train. Meanwhile, it felt like performance art was happening everywhere: the project changed its name three times in one week—from ClawdBot to MoltBot to OpenClaw. In the process of rebranding, their handle was even hijacked by a token called $CLAWD that scammed people out of $16 million. Security vulnerabilities were popping up left and right, too: 12% of third-party skills contained malicious code, and plenty of people exposed their consoles to the public internet without even setting a password. For a while, the whole space was just a mess of contradictory noise, leaving everyone confused: Should I install this thing? What am I missing if I don't? What are the risks? Is this the next productivity revolution, or just another toy that will be forgotten in two weeks?&lt;/p&gt;
&lt;p&gt;In this post, I want to peel back the layers from a higher-level perspective: What did OpenClaw actually get right? Why did it explode? And most importantly—what does this have to do with you?&lt;/p&gt;
&lt;h2&gt;Why It Went Viral: My Hot Take&lt;/h2&gt;
&lt;p&gt;I have a bit of a provocative theory: the reason OpenClaw blew up is almost identical to why DeepSeek went viral exactly one year ago.&lt;/p&gt;
&lt;p&gt;When DeepSeek first became popular, most AI tools in China were limited to pure chat—no search capabilities, and they hallucinated constantly. While ChatGPT and Claude had reasoning and search features that made them much smarter, they weren't easily accessible in the country. When DeepSeek introduced reasoning and search, it was the first time many people experienced what a thinking, searching AI could do. It was a massive shock to the system: "Wow, AI can actually be THIS useful!" and then—boom—it went viral. In other words, its popularity wasn't necessarily because it was technically superior to its competitors (DeepSeek didn't exactly crush GPT-4o or Claude 3.5 in pure model capability at the time). It went viral because it took something a small circle of early adopters were already enjoying and habituated to, and pushed it right in front of a much larger audience.&lt;/p&gt;
&lt;p&gt;OpenClaw is the exact same story. In early 2026, there was a massive gap in the field of Agentic AI. While products like ChatGPT were popular, they were at least a generation behind Agentic AI tools with local permissions like Cursor, Claude Code, or Codex (I’ll explain why later). But tools like Cursor are niche—mostly used by programmers. The general public was still stuck with consumer-grade chat interfaces, feeling like AI hadn't progressed much in the last two years. Then OpenClaw came along and, for the first time, connected those local programming agents with the messaging apps everyone uses every day—WhatsApp, Slack, Lark. It gave non-technical users their first taste of Agentic AI that can read and write files, execute commands, maintain memory, and iterate continuously. It went viral not because it did something brand new technically, but because it democratized an experience previously reserved for a tiny group of techies.&lt;/p&gt;
&lt;p&gt;Now, I’m not saying OpenClaw or DeepSeek are just "showy" tools you shouldn't bother with. Quite the opposite. DeepSeek provided a lot of historical inspiration. For example, after the hype died down, who actually benefited? In my observation, it wasn't the people who just jumped on the bandwagon to play with it for a few days. It was the people who understood &lt;em&gt;why&lt;/em&gt; it went viral and integrated search and reasoning into their actual workflows. Similarly, while we can go ahead and install OpenClaw and try it out, the tool itself won't magically double your productivity. Viral products are designed for the broadest possible audience, which means they involve a lot of design compromises. Using them as-is is rarely the most efficient way to work. The real value is in understanding the design philosophy behind them, analyzing why they exploded, and applying those lessons to improve your own workflow.&lt;/p&gt;
&lt;p&gt;At the end of the day, tools will come and go, but your understanding of their core essence won't. Extracting transferable insights and baking them into your own workflow—that's how the pros do it.&lt;/p&gt;
&lt;h2&gt;The Chat Interface: Both the Foundation and the Glass Ceiling&lt;/h2&gt;
&lt;p&gt;Before we dive into why OpenClaw is so powerful, I want to look at a specific example to explain what I mean when I say "OpenClaw is designed for the broadest audience," and how that impacts everything.&lt;/p&gt;
&lt;p&gt;As I mentioned earlier, a key reason OpenClaw exploded is that it chose messaging apps we use daily as its interface, rather than requiring you to install yet another piece of software like Cursor. This leverages existing habits and channels, keeping the cognitive barrier to entry incredibly low. You're already on Slack or Lark anyway, so seeing OpenClaw right there makes you want to try it out. Plus, since everyone is already familiar with these apps, the learning curve is pushed practically to zero. No IDE to install, no programming jargon to learn—just pick up your phone and start using it. That’s why it reached such a huge audience.&lt;/p&gt;
&lt;p&gt;But if you’ve ever used an Agentic AI programming tool like Cursor, you’ll quickly realize that a Slack-style chat window is actually a very restrictive way for an AI to interact.&lt;/p&gt;
&lt;p&gt;First, it forces a linear conversation. Slack and WeChat windows are basically just one message after another. But deep knowledge work is rarely linear. You might need to reference content from another thread, merge two different directions of exploration, or fork off a specific conversation. In desktop environments like Cursor or OpenCode, there are dedicated UI elements for this, but doing it in a chat window feels clunky as hell.&lt;/p&gt;
&lt;p&gt;Second, there’s the issue of information density. For toy-level research or quick development, a chat window is fine. But for any meaningful analysis or deep thinking, the information density is embarrassingly low. Trying to read formatted reports, complex tables, or long-form documents inside a chat bubble is pretty painful. Plus, different platforms have wildly inconsistent Markdown support, making the experience very unstable.&lt;/p&gt;
&lt;p&gt;The third problem is observability. Especially for multi-step tasks, once I hand over execution to the AI, I naturally want to know what it’s actually doing. Is it making steady progress, or is it spinning its wheels in a dead-end loop? Which tools did it call? Which files did it change? In Cursor and similar tools, this is presented naturally, but in a chat window, we’re stuck with a "the user is typing..." message or a single emoji. For complex tasks, you’re often left waiting a long time just to be told whether it succeeded or crashed halfway through.&lt;/p&gt;
&lt;p&gt;Now, I’m not saying these are "bad" design choices. They are clear trade-offs. If you want to make a tool that’s easy to pick up for everyone, you have to use the tools everyone is already using. But that immediately brings limitations in format and density. It’s a spectrum from "easy but clunky" to "native but niche," and OpenClaw chose extreme ease of use. That’s why it’s a hit. But we have to be clear-eyed about the limitations that decision brings. When you're integrating these tools into your own workflow, don't just mindlessly copy every design choice—find that sweet spot on the trade-off axis that works for &lt;em&gt;your&lt;/em&gt; needs.&lt;/p&gt;
&lt;p&gt;Once you understand this trade-off, the rest of the analysis becomes much clearer.&lt;/p&gt;
&lt;h2&gt;The Success Factors Beyond the Interface&lt;/h2&gt;
&lt;p&gt;The chat interface is what made OpenClaw approachable, but it’s just the surface. What actually makes users feel like this AI is genuinely intelligent, useful, and "gets" them are three core design decisions happening under the hood.&lt;/p&gt;
&lt;p&gt;The first is a unified entry point and context. If you compare it to Cursor, the difference is stark. In Cursor, project contexts are isolated—if you open Project A, the AI only knows about A. Switch to Project B, and the conversation about A is gone. Claude Code and OpenCode are the same; they bind to a specific working directory every time you launch. OpenClaw does the exact opposite. By default, it mixes all your conversation contexts into one big pool. You can ask it to organize your emails in Telegram in the morning, write a report in Slack in the afternoon, and schedule your calendar in WhatsApp in the evening—and it remembers everything. It feels incredibly smart, like it actually &lt;em&gt;knows&lt;/em&gt; you.&lt;/p&gt;
&lt;p&gt;But just dumping everything into one pool isn't enough, because the context window would fill up instantly. That leads to the second key design: Persistent Memory. OpenClaw handles memory very cleverly. At a high level, it uses a file-based memory system &lt;a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus"&gt;much like Manus does&lt;/a&gt;. It maintains a &lt;code&gt;SOUL.md&lt;/code&gt; to define the AI’s core personality and behavior, a &lt;code&gt;USER.md&lt;/code&gt; for your profile, and a &lt;code&gt;MEMORY.md&lt;/code&gt; for long-term storage, all on top of the raw daily logs.&lt;/p&gt;
&lt;p&gt;The clever bit is its self-maintenance mechanism. Every so often (a "heartbeat"), the AI automatically reviews its recent raw logs, distills valuable info into &lt;code&gt;MEMORY.md&lt;/code&gt;, and cleans up outdated entries. This happens entirely in the background without user intervention. This mechanism creates a tiered memory structure: raw logs are short-term, the daily &lt;code&gt;MEMORY.md&lt;/code&gt; is medium-term, and the distilled traits/preferences are long-term. For the user, the experience shifts from "I have to explain everything every time" to "It feels like it’s growing with me." That perceived difference is huge.&lt;/p&gt;
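&lt;p&gt;To make the mechanism concrete, here is an illustrative heartbeat sketch (a reconstruction of the idea, not OpenClaw's actual code; the file names follow the ones mentioned above, and &lt;code&gt;distill&lt;/code&gt; is a placeholder for the model-driven distillation step):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datetime import date
from pathlib import Path

MEMORY = Path("MEMORY.md")   # long-term memory
LOG_DIR = Path("logs")       # raw daily logs, i.e. short-term memory

def distill(raw_log):
    # Placeholder: in the real system a model condenses the raw log
    # into a few points worth keeping long-term.
    lines = [line for line in raw_log.splitlines() if line.strip()]
    return "\n".join(lines[:3])

def heartbeat():
    # Runs periodically: read today's raw log, distill it, append to long-term memory.
    log_file = LOG_DIR / f"{date.today().isoformat()}.md"
    if not log_file.exists():
        return
    summary = distill(log_file.read_text(encoding="utf-8"))
    with MEMORY.open("a", encoding="utf-8") as f:
        f.write(f"\n## {date.today().isoformat()}\n{summary}\n")
&lt;/code&gt;&lt;/pre&gt;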
&lt;p&gt;The third pillar is the rich ecosystem of Skills. This is about so much more than just saving a few minutes of your time. The benefit of adding tools &lt;a href="/manus-en.html"&gt;isn’t linear&lt;/a&gt;—the jump from 4 to 6 tools adds far more capability than the jump from 2 to 4. Why? Because tools combine. Connecting Slack handles instructions and status reports; image generation handles visuals; a PPT service handles slide decks; deep research handles investigations. When you bundle these together, you get emergent business capabilities and end-to-end applications.&lt;/p&gt;
&lt;p&gt;These three designs aren't just additive; they reinforce each other.&lt;/p&gt;
&lt;p&gt;Memory combined with a unified context pool creates compounding returns on data. Because memory is persistent, conversations accumulate over time; because there’s a unified entry point, data from all sources flows into the same pool. Your work discussions in Slack, your scheduling in Telegram, your personal chats in WhatsApp—all of it merges to form an increasingly complete understanding of you, making every subsequent task more personalized.&lt;/p&gt;
&lt;p&gt;Memory combined with Skills brings the ability to self-evolve. Habits learned today are still there tomorrow; as the AI writes and remembers new skills, it enters a positive feedback loop. Its coding ability is particularly noteworthy here. Since OpenClaw can write its own code, if it hits a wall without an existing skill, it can just build one on the fly. That new skill is saved and ready to be reused next time. It’s a closed loop of self-evolution.&lt;/p&gt;
&lt;p&gt;And when you add all that power to the ease of use of the interface, you get high usage frequency. The smoother the entry point, the more the flywheel spins, making the AI smarter with every interaction.&lt;/p&gt;
&lt;p&gt;In short, OpenClaw is an impressive product. Every decision—technical or otherwise—serves the same flywheel, giving regular people their first real taste of what a fully realized Agentic AI can do.&lt;/p&gt;
&lt;h2&gt;Limitations and Trade-offs&lt;/h2&gt;
&lt;p&gt;I’ve spent plenty of time praising OpenClaw, so now it’s time to gripe. But let me be clear: the limitations I’m about to list aren't because the OpenClaw team was sloppy—they are the direct results of that trade-off I mentioned earlier. This is the price you pay for building a viral hit.&lt;/p&gt;
&lt;p&gt;I’ve already covered the interface: it's linear, low-density, and offers poor observability. When you move beyond casual use, these bottlenecks become apparent very quickly.&lt;/p&gt;
&lt;p&gt;The deeper issues lie in the memory system. OpenClaw’s memory is great for beginners—you don't have to manage it; it just works and evolves. But for anyone trying to turn knowledge into a long-term asset, this is actually a massive hurdle.&lt;/p&gt;
&lt;p&gt;For example, say you finish a deep dive research project and produce a 5,000-word report. In a tool like Cursor or a direct file system, that’s a file: &lt;code&gt;docs/research.md&lt;/code&gt;. You can @ reference it, version it, or diff it. In OpenClaw, that knowledge is more like human memory—at any point, it might be automatically summarized, rewritten, or even completely "forgotten" (deleted) by the background heartbeat process, and you have zero control over it. It’s hard to tell it: "This document is the absolute source of truth; reference it exactly and do not summarize it into three lines." In short, knowledge cannot be explicitly managed.&lt;/p&gt;
&lt;p&gt;Worse, the entire update process is a black box. What gets saved in &lt;code&gt;MEMORY.md&lt;/code&gt;, how it’s organized, and when it’s purged is all determined by the AI in secret. You see the result, but you rarely see the "why": What did it change this time? Why did it delete that specific note? Why did it merge those two unrelated thoughts? If something goes wrong, it’s a nightmare to debug and improve.&lt;/p&gt;
&lt;p&gt;Another issue with OpenClaw’s unified memory is cross-context interference. While unified memory makes the AI feel like it "knows" you, it also means information can easily pollute different projects. A preference from Project A, or even a one-off temporary decision, might mysteriously start influencing Project B. For a casual user, it seems like it remembers everything; for an advanced user trying to get work done, it feels more like, "Ugh, it’s going off on a tangent again."&lt;/p&gt;
&lt;p&gt;The security risks around Skills are a different class of problem altogether. Out of the thousands of skills on ClawHub, audits have found hundreds containing malicious code—from crypto theft and reverse shell backdoors to credential stealing. Simon Willison once mentioned a concept called &lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"&gt;the lethal trifecta&lt;/a&gt;: when an AI system has access to private data, is exposed to untrusted environments, and can communicate externally, the risk is amplified exponentially. OpenClaw hits all three🤡. This creates a strange paradox. To get the best experience, you have to give it broad tools and permissions. But that creates security risks, so you feel forced to tighten permissions. But if you tighten them too much, you’re back to a restrictive cloud agent like Manus, losing the magic of a local agent. Safety vs. usability remains a persistent contradiction.&lt;/p&gt;
&lt;h2&gt;So What?&lt;/h2&gt;
&lt;p&gt;At this point, you might be asking: "Okay, that was a lot of analysis—so what? How does this help me?"&lt;/p&gt;
&lt;p&gt;Here’s the answer: you can take these insights and build something for yourself that’s actually better and more tailored than OpenClaw. That’s exactly what I did, and the results have been much better than using OpenClaw directly. Let me walk you through a few key decisions I made.&lt;/p&gt;
&lt;h3&gt;Reuse the Agentic Loop, Don’t Rebuild It&lt;/h3&gt;
&lt;p&gt;The first—and most important—decision we made was to &lt;em&gt;not&lt;/em&gt; build an Agentic AI system from scratch. Instead, we reused an existing open-source CLI programming tool like OpenCode as our foundation.&lt;/p&gt;
&lt;p&gt;There’s a deeper reason behind this. Building a functional Agentic Loop—the cycle of calling an API, parsing tool calls, executing them, returning results to the AI, and requesting the next step—sounds simple on paper. But making it robust enough for real-world use is full of pitfalls: file system I/O, partial file edits, sandbox environments, permission management... the list goes on. Building these things is tedious, risky, and doesn’t actually create much unique value for the end user. I discussed this in detail in &lt;a href="/ai-builders-space-en.html"&gt;a previous post&lt;/a&gt;—my core point was that the Agentic Loop is "grunt work" that should be outsourced. What’s actually worth your time is the &lt;em&gt;Agentic Architecture&lt;/em&gt;—how you inject business logic into the AI system to create direct value.&lt;/p&gt;
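&lt;p&gt;To make that "grunt work" concrete, here is a minimal sketch of an Agentic Loop in Python. It is purely illustrative, not OpenCode’s actual implementation: &lt;code&gt;call_model&lt;/code&gt; and &lt;code&gt;run_tool&lt;/code&gt; are hypothetical callables standing in for the model API client and the tool executor, and a production loop layers streaming, permission prompts, sandboxing, and error recovery on top of this skeleton.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Illustrative sketch of an Agentic Loop; not OpenCode's actual code.
# call_model and run_tool are hypothetical callables supplied by the caller:
# one wraps the model API, the other executes tools (file edits, shell commands, ...).
def agentic_loop(task, call_model, run_tool, max_steps=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(history)            # ask the model for the next step
        history.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:                          # no tool requested: the task is done
            return reply.get("content", "")
        for call in calls:                     # execute each requested tool call
            result = run_tool(call["name"], call["arguments"])
            history.append({"role": "tool",
                            "tool_call_id": call["id"],
                            "content": result})
    raise RuntimeError("step budget exhausted without a final answer")
&lt;/code&gt;&lt;/pre&gt;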
&lt;p&gt;Tools like OpenCode or Claude Code are basically perfect “outsourcing” options. They’ve already built a mature Agentic Loop: they can read and write files, run commands, and iterate continuously, and they’re evolving incredibly fast. By using them as a cornerstone, you’re essentially getting a free ride on the entire agentic programming toolchain, which drops your development costs to almost zero. Choosing OpenCode specifically has extra perks: it’s fully open-source (so you can hack it), it supports parallel subagents (something Cursor and Codex still don’t have), and it supports multiple coding plans. For instance, I use the GLM coding plan, but you could use the OpenAI Codex plan directly without the insane costs of raw API calls.&lt;/p&gt;
&lt;h3&gt;File as Memory: Inheriting and Evolving the OpenClaw Philosophy&lt;/h3&gt;
&lt;p&gt;The second decision was about the memory system. Tools like OpenCode or Claude Code have a natural "disk-as-memory" philosophy—after all, files are the basic unit they handle. Having disk-based memory, combined with direct ownership and transparency over those files, solves the exact issues we saw with OpenClaw. If you want to build up long-term assets, write a file. If you want to force the AI to follow certain rules, write an &lt;code&gt;AGENTS.md&lt;/code&gt;. If you want to manage your memory structure, just edit the Markdown. The problems of non-explicit management and black-box updates are naturally solved by OpenCode’s fine-grained control and the file system itself.&lt;/p&gt;
&lt;p&gt;But just having a file system isn't enough, so we also ported over OpenClaw’s "persona self-evolution" mechanism. Specifically, we split memory into two layers: project-level memory (the context, decision logs, and technical specs for a specific project) and persona-level memory (user profile, preferences, and communication style). We then added a persona maintenance workflow to &lt;code&gt;AGENTS.md&lt;/code&gt;, so the AI automatically reviews the conversation at the end of a session to update &lt;code&gt;MEMORY.md&lt;/code&gt; and &lt;code&gt;USER.md&lt;/code&gt;. You get the same self-evolution, but it runs on a fully controllable file system where you can even use Git for version control.&lt;/p&gt;
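&lt;p&gt;As a concrete illustration (the wording and structure below are our own convention, not an OpenCode or OpenClaw standard), the memory-related part of our &lt;code&gt;AGENTS.md&lt;/code&gt; looks roughly like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Memory layout (illustrative excerpt)
- MEMORY.md: project-level memory (context, decision log, technical specs)
- USER.md: persona-level memory (profile, preferences, communication style)

## Persona maintenance workflow
At the end of each session:
1. Review the conversation for new decisions, preferences, or corrections.
2. Append project-level findings to MEMORY.md; never silently delete entries.
3. Update USER.md only for durable preferences, not one-off instructions.
4. Briefly note what you changed so the update is visible in the Git diff.
&lt;/code&gt;&lt;/pre&gt;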
&lt;p&gt;As for the unified context problem, we went with a brute-force but elegant solution: the Mono Repo. By putting different projects in different folders within the same repo, the AI naturally has cross-project access to all contexts. You can isolate when you want, share when you want, merge different lines of exploration, or fork things off just by copying files. These are all native operations in the file system and OpenCode, which feels infinitely more natural than trying to do them in a clunky chat window.&lt;/p&gt;
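&lt;p&gt;For reference, a hypothetical layout of such a mono repo (project names invented for illustration) might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;workspace/                  # one repo; a single OpenCode session sees everything
├── AGENTS.md               # global rules plus the persona maintenance workflow
├── USER.md                 # persona-level memory
├── project-a/
│   ├── MEMORY.md           # project-level memory for project A
│   └── docs/research.md    # long-form assets live as plain files
└── project-b/
    └── MEMORY.md           # project-level memory for project B
&lt;/code&gt;&lt;/pre&gt;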
&lt;h3&gt;Skills and Security&lt;/h3&gt;
&lt;p&gt;On the Skills front, the OpenCode ecosystem has a massive array of MCP servers and skills available—calendars, email, browsers, search, you name it. The feature set is pretty much on par with ClawHub. In terms of security, our approach is to never install third-party skills blindly. Instead, we have the AI review the source code, understand the logic, and then rewrite a "clean" version. In the age of AI-assisted coding, this only takes a few minutes, but it drastically reduces the risk of supply chain attacks.&lt;/p&gt;
&lt;h3&gt;The Last Mile: Mobile&lt;/h3&gt;
&lt;p&gt;Our first three decisions solved the foundation, memory, and tools, but one key piece was still missing: the entry point. A huge reason OpenClaw is so popular is that you don’t have to be sitting at your computer. But existing programming tools are pretty weak here—VS Code has Code Server, but it’s terrible on an iPad; OpenCode has a web client, but it’s barely functional; Cursor’s web client is tied to GitHub; and Claude Code doesn't even have one.&lt;/p&gt;
&lt;p&gt;To bridge this gap, we built a native iOS app as a remote client for OpenCode. This isn't just a chat window ported to your phone—it’s a workspace genuinely designed for mobile. You can see the AI’s real-time progress, every tool call, and every file operation. You can switch models for A/B testing, browse Markdown files, review changes, and use voice input. It supports public access via HTTPS or SSH tunnels, and the iPad version even has a three-column split view.&lt;/p&gt;
&lt;p&gt;The client is &lt;a href="https://github.com/grapeot/opencode_ios_client"&gt;open-sourced&lt;/a&gt; on GitHub. Feel free to check it out; it might even hit TestFlight soon. The result is that my dusty iPad is finally a productivity beast again. Directing an AI from the couch is a much, much better experience than using OpenClaw’s chat window. If I get an on-call notification while I'm out for dinner, I can just assign the task to my "AI intern" and have the root cause figured out before the check arrives. And the whole time, I have total control over the AI—I know it isn't going to go rogue or leak my info to Moltbook.&lt;/p&gt;
&lt;p&gt;&lt;img alt="iPad Client" src="/images/opencode_ios_client.jpeg"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Let's go back to my "hot take" at the beginning. The viral success of both OpenClaw and DeepSeek points to the same underlying truth: it's about taking capabilities a small elite group is already enjoying and pushing them to a broader audience for the first time. DeepSeek gave people their first taste of searching, reasoning AI; OpenClaw gave them their first hands-on experience with an Agentic AI that has disk access, memory, and the power to self-evolve.&lt;/p&gt;
&lt;p&gt;But because these products are designed for the masses, they inherently involve massive design compromises. That was true for DeepSeek, and it’s true for OpenClaw. The chat interface brings ease of use but sacrifices expressiveness; unified memory makes the AI feel like it "gets" you but sacrifices control; an open skill ecosystem brings power but introduces security risks.&lt;/p&gt;
&lt;p&gt;If you’re already using tools like Cursor, Claude Code, or OpenCode, the takeaway isn't that you should mindlessly install OpenClaw. Instead, you should understand &lt;em&gt;why&lt;/em&gt; it’s a hit—the unified entry, the persistent memory, the tool ecosystem, and the flywheel connecting them—and then fold those insights into your own existing toolchain while avoiding the pitfalls. That’s what we did, and I can tell you: the results are significantly better.&lt;/p&gt;
&lt;p&gt;At the end of the day, tools will come and go, but your understanding of their core essence won't.&lt;/p&gt;</content><category term="Computing"/><category term="English"/><category term="Agentic AI"/><category term="Review"/></entry></feed>