<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Computing Life - Computing</title><link href="https://yage.ai/" rel="alternate"/><link href="https://yage.ai/feeds/computing.atom.xml" rel="self"/><id>https://yage.ai/</id><updated>2026-03-30T22:00:00-07:00</updated><entry><title>一行代码的事，Web 为什么做了三十年还没做到</title><link href="https://yage.ai/web-layout-tradeoff.html" rel="alternate"/><published>2026-03-30T22:00:00-07:00</published><updated>2026-03-30T22:00:00-07:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-30:/web-layout-tradeoff.html</id><summary type="html">&lt;p&gt;在iOS上查询排版结果只需一行代码，Web上需要触发整个页面的重新布局。这不是因为浏览器工程师蠢，而是CSS在1994年做了一个声明式的架构选择。这个选择的天花板更高，但代价是中间状态不可查询。Facebook在2012年因为不理解这个trade-off付出了数亿美元的代价。SwiftUI和Jetpack Compose证明了声明式和可观测可以共存，关键在于分层。这个教训适用于所有系统设计：好的抽象让你选择在哪一层工作，坏的抽象把所有层粘在一起让你没得选。&lt;/p&gt;</summary><content type="html">&lt;p&gt;你大概率曾经注意到，在手机上用原生 app 刷信息流的时候，滚动通常很顺滑。但在浏览器里打开一个内容密集的网页，快速滑动时经常会看到内容跳动、布局闪烁、空白区域先出现再被填上内容。这不是网速的问题。很大一部分原因是，浏览器在计算"每段内容排完版占多大空间"这件事上，比原生平台慢了一个数量级。&lt;/p&gt;
&lt;p&gt;&lt;a href="https://yage.ai/share/pretext-short-term-overrated-long-term-underrated-20260330.html"&gt;Pretext&lt;/a&gt; 最近在技术社区很火，做的就是这件事：预测一段文字放进容器后会占多少空间。一位叫马工的读者看完后问了一个很好的问题：这不应该是最基本的功能吗？为什么到 2026 年了还需要一个第三方库来解决？&lt;/p&gt;
&lt;p&gt;我顺着这个问题往下挖，发现它触及的东西远比前端技术选型深。Facebook 在 2012 年因为没理解它背后的 trade-off，付出了重写整个移动端的代价。这不只关乎前端工程师。它是一个关乎系统设计哲学的判断：你在做抽象的时候，该把什么藏起来，该把什么留在外面。&lt;/p&gt;
&lt;h2&gt;在其他平台上，这真的就是一行代码&lt;/h2&gt;
&lt;p&gt;先验证马工的直觉。在 iOS 上，你想知道一段文字在给定宽度下排完版占多高，调用 &lt;code&gt;sizeThatFits&lt;/code&gt; 就行。一行代码，立刻返回，甚至不需要把这段文字放进任何界面。Android 上用 &lt;code&gt;StaticLayout&lt;/code&gt;，Qt 用 &lt;code&gt;QFontMetrics&lt;/code&gt;，Flutter 用 &lt;code&gt;TextPainter&lt;/code&gt;，都是同一个模式：排版引擎是一个独立的计算模块，你给它输入，它给你输出，不触发任何全局操作。&lt;/p&gt;
&lt;p&gt;这就是为什么原生 app 滚动长列表的时候通常很流畅：系统可以提前算好每条内容的高度，精确地知道哪些内容即将进入屏幕、该为它们预留多少空间。&lt;/p&gt;
&lt;p&gt;Web 上做不到这件事。在浏览器里想知道一段内容排完版多高，你得把它真的放进页面，然后触发一次 Reflow，也就是浏览器重新计算页面中所有受影响元素的位置和大小。这个计算是同步阻塞的，而且范围不是只算你问的那个元素，而是可能波及整棵布局树。一个内容密集的信息流页面，窗口大小一变，每条内容都要重算一遍，每次重算都牵动整个页面。这就是为什么 Web 版的信息流、聊天界面、电商列表在滚动和窗口缩放时，体验通常比原生 app 更差。&lt;/p&gt;
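&lt;p&gt;为了把这个对比说得更具体，下面是一个极简的示意（TypeScript，浏览器环境，函数名和变量名都是示意性的）：在今天的 Web 上想知道一段内容排完版有多高，只能把一个探针元素挂进文档，再读取它的布局属性，而正是这次读取强制触发了上面说的同步 Reflow。&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// 极简示意：在 Web 上测量一段内容在给定宽度下排完版的高度
function measureTextHeight(html: string, width: number): number {
  // 探针元素必须真的挂进文档，否则浏览器不会为它执行布局
  const probe = document.createElement("div");
  probe.style.width = `${width}px`;
  probe.style.position = "absolute";   // 脱离正常文档流
  probe.style.visibility = "hidden";   // 不可见，但仍会参与布局
  probe.innerHTML = html;
  document.body.appendChild(probe);

  // 读取 offsetHeight 会强制浏览器同步执行一次布局（Reflow），
  // 计算范围可能波及所有会影响这个元素的内容
  const height = probe.offsetHeight;

  document.body.removeChild(probe);
  return height;
}&lt;/code&gt;&lt;/pre&gt;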
&lt;p&gt;马工说得对：这确实应该是最基本的功能。而且在 iOS、Android、Qt、Flutter 上，它真的就是最基本的功能。Web 是主流 UI 平台中唯一一个做不到这件事的。&lt;/p&gt;
&lt;p&gt;但 Web 工程师不可能三十年都没想到。这里一定有一个 trade-off。而且理解这个 trade-off 很重要。&lt;/p&gt;
&lt;h2&gt;CSS 的选择不是因为蠢&lt;/h2&gt;
&lt;p&gt;构建界面有两种方式。&lt;/p&gt;
&lt;p&gt;一种是你告诉系统每个东西放哪里。早期 iOS 开发就是这样，你手动计算每个元素的坐标和尺寸。系统严格执行，你对每一步都有精确理解。在这种方式下，"排完版了多高"是一个自然的中间数据，因为你本来就要用它来决定下一个元素放在哪。&lt;/p&gt;
&lt;p&gt;另一种是你描述你想要什么效果，让系统自己算。CSS 就是这样。你写 &lt;code&gt;display: flex; flex-wrap: wrap; gap: 16px&lt;/code&gt;，不管屏幕是 320px 还是 1920px，浏览器自己决定每行放几个元素、怎么分配空间。你不控制过程，你描述意图。&lt;/p&gt;
&lt;p&gt;后者的天花板更高。LaTeX 是更极端的例子。它的断行算法把整个段落当作一个优化问题来求解，考虑所有可能的断行方案，选一个让全段行间松紧度最均匀的。它可能会让前面几行排得稍松，来避免后面某一行出现难看的大空隙。你用逐行手动排版写不出这种效果，因为排第三行的时候你不知道第七行会怎样。只有一个能看到全局的系统才能做这种优化。&lt;/p&gt;
&lt;p&gt;CSS 的响应式布局也是同一个逻辑。有一位做过 WPF 和 XAML 的开发者在 Hacker News 上说，用 CSS Flexbox/Grid 做自适应布局比很多原生桌面框架的开发效率更高，因为你在描述意图而不是在编写实现。&lt;/p&gt;
&lt;p&gt;但代价是什么？代价是你没法问系统"你算出来的结果是什么"。在 CSS 的世界里，一个元素的最终大小取决于它周围的所有元素。浮动、定位、行内格式化上下文、margin collapse，每一条规则都在强化同一个事实：局部结果是全局求解出来的，不存在脱离上下文的独立答案。1994 年 CSS 的第一份提案就确立了这个架构：信息单向流动，开发者声明规则，浏览器执行排版，但浏览器不反馈中间结果。&lt;/p&gt;
&lt;p&gt;在文档排版的场景下，这完全合理。你不需要知道段落精确多高，你只要声明样式，浏览器负责呈现。LaTeX 用户也一样。你不控制图片出现在哪一页，你告诉 TeX 你的偏好，它自己决定最优位置。&lt;/p&gt;
&lt;p&gt;不理解这个 trade-off 的代价可能高达上亿美元。2012 年，Facebook 用 HTML5 构建了整个移动端。iOS 和 Android 的 app 本质上是对 WebView 的封装。选择 HTML5 确实直观：write once, run everywhere，服务端推送更新不需要用户下载新版本。但最终大家发现了一个致命的问题：性能扛不住。Facebook 工程师&lt;a href="https://www.infoq.com/news/2012/09/Facebook-HTML5-Native/"&gt;事后复盘&lt;/a&gt;的时候列举了一系列问题：滚动帧率不稳定、UI 线程卡顿、设备资源耗尽导致崩溃。DOM 的全局 Reflow 是核心瓶颈之一：每次内容更新都可能触发整棵布局树的重新计算，News Feed 那种长列表加大量图片的场景直接把它推到了极限。&lt;/p&gt;
&lt;p&gt;最终 Facebook 花了 9 个月从头重写原生 iOS 应用。启动时间从约 10 秒降到约 4 秒，News Feed 加载速度提升一倍。Zuckerberg 在 TechCrunch Disrupt 上说：&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The biggest mistake that we made as a company was betting too much on HTML5 as opposed to native.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;后续的发展也很值得讨论。HTML5 失败直接催生了 React（2013），React 的声明式思路被验证后催生了 React Native（2015），React Native 的布局引擎 Yoga 在设计上明确优化了排版查询。&lt;a href="https://engineering.fb.com/2016/12/07/android/yoga-a-cross-platform-layout-engine/"&gt;Facebook 工程博客&lt;/a&gt;写道，Yoga 确保文本视图"只被测量尽可能少的次数，理想情况下只测量一次"，并且把布局计算放到了独立线程上，彻底绕开了 DOM Reflow 中 JavaScript 和布局互相阻塞的问题。&lt;/p&gt;
&lt;p&gt;这每一步都在逃离 CSS 布局架构的限制。因此，理解 CSS 为什么做了这个 trade-off，以及这个 trade-off 在什么场景下会崩溃，是一个价值数亿美元的判断。&lt;/p&gt;
&lt;h2&gt;但这个 trade-off 不是必须的&lt;/h2&gt;
&lt;p&gt;到这里，故事听起来像一个非此即彼的选择：要么选"系统替你决定"拿到更好的排版效果，代价是看不到中间结果；要么选"你自己控制一切"拿到精确可查询性，代价是失去全局优化。&lt;/p&gt;
&lt;p&gt;但事实上，这个 trade-off 是可以打破的。&lt;/p&gt;
&lt;p&gt;SwiftUI（Apple 2019 年推出）和 Jetpack Compose（Google 2021 年推出）都是声明式 UI 框架。你用 SwiftUI 写响应式布局的方式和用 CSS 写 Flexbox 在概念上非常接近：描述意图，系统自行决定布局。但它们都没有 CSS 的问题。&lt;/p&gt;
&lt;p&gt;内里的原因在于架构分层。原生平台的设计是：底层有一个完全独立的排版引擎（iOS 的 Core Text，Android 的 StaticLayout），上层是声明式框架。声明式框架调用底层引擎来完成布局，但应用代码也可以随时穿透声明式抽象，直接调用底层引擎查询排版结果。Jetpack Compose 甚至提供了官方的 &lt;code&gt;TextMeasurer&lt;/code&gt; API，可以在不触发任何实际渲染的情况下拿到完整的排版信息，包括宽高、行数、每个字符的位置。&lt;/p&gt;
&lt;p&gt;这证明了一个重要的事情：声明式和可观测并不矛盾。你可以让系统替你做全局优化，同时保留一条独立的通道让你查询中间结果。关键是排版引擎要被设计成一个独立模块，而不是焊死在全局布局管线里。&lt;/p&gt;
&lt;p&gt;CSS 的问题不是它选择了声明式。问题是它在做抽象的时候，把排版引擎也封进了全局布局流程，没有留一个独立的查询接口。（叠甲：&lt;code&gt;canvas.measureText()&lt;/code&gt; 只能处理单行文本，不支持换行。处理换行的那套逻辑被锁在了布局引擎内部，从未作为独立接口暴露。）&lt;/p&gt;
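&lt;p&gt;这个缺口具体在哪，可以用一小段示意代码说明（TypeScript，浏览器环境，用到的都是真实存在的 DOM API，注释标出的是缺失的部分，而不是在提议新接口）：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// canvas.measureText 能做到的：单行文本在某个字体下的宽度
const canvas = document.createElement("canvas");
const ctx = canvas.getContext("2d")!;
ctx.font = "16px system-ui";
const singleLineWidth = ctx.measureText("hello layout world").width;

// 做不到的：没有任何内置 API 接受同样的文本加一个容器宽度，
// 返回断行位置或者总高度。这套换行逻辑被锁在布局引擎内部，
// 第三方库只能基于逐段宽度自己重建换行。&lt;/code&gt;&lt;/pre&gt;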
&lt;p&gt;1994 年设计 CSS 的时候，Web 是一个学术文档交换系统，没人预见到它会成为应用平台。在文档场景下，你确实不需要独立查询排版结果，所以没人觉得需要把排版引擎设计成可以单独调用的模块。这个决策在当时完全合理。但三十年后，它成了一笔巨大的技术债。&lt;/p&gt;
&lt;p&gt;W3C 也对这个问题心知肚明。CSS Houdini 的 Font Metrics API 就是为了解决它，但截至 2026 年仍停留在提案阶段，没有浏览器实现。Pretext 用 4000 行用户态代码补了这个缺口。它在 Web 上重建了一个 iOS 和 Android 早就作为基础设施提供的能力。它的存在本身就是 CSS 缺失了一层抽象的证据。&lt;/p&gt;
&lt;h2&gt;这个教训比前端大得多&lt;/h2&gt;
&lt;p&gt;回头看，CSS 和原生平台都选择了声明式布局，但原生平台保留了排版引擎作为独立的可查询层，CSS 没有。一个在 1990 年代做出的、当时看起来无关紧要的分层决策，在三十年后导致了完全不同的开发体验，完全不同的性能特征，和一家公司数亿美元的战略代价。&lt;/p&gt;
&lt;p&gt;这个模式在前端之外同样存在。&lt;/p&gt;
&lt;p&gt;传统 API 设计遵循的哲学和 CSS 一样：隐藏复杂性，保护用户不需要看到中间状态。捕获底层错误后抛出一个抽象的高层异常，把实现细节封装在干净的接口后面。这在用户是人类的时候没问题。人类的认知带宽有限，好的抽象帮他们聚焦。&lt;/p&gt;
&lt;p&gt;但当系统的用户变成需要做决策的代码或 AI 的时候，这种保护性抽象就成了障碍。AI 的有效性依赖于&lt;a href="https://yage.ai/result-certainty.html"&gt;尝试-反馈-修正的循环&lt;/a&gt;。一个模糊的"操作失败，请稍后再试"会直接中断这个循环，和 CSS 不告诉你排版结果是同一个问题。AI 需要的是&lt;a href="https://yage.ai/ai-software-engineering.html"&gt;精细原始的反馈、细粒度的控制接口、足够详尽的中间状态&lt;/a&gt;。前端排版三十年的挣扎和 AI 工程今天遇到的问题，根源是同一个设计哲学在不同时代的碰壁。&lt;/p&gt;
&lt;p&gt;解法也是类似的。Pretext 在 CSS 的声明式渲染旁边建了一条独立的可查询通道，不替代浏览器的排版，只是让你能观测到排版结果。Agentic loop 对 AI 工作流做了同样的事：不规定 AI 每一步怎么做，但让它能观测到自己行动的结果，然后自主决定下一步。两者都没有拆掉抽象，而是在抽象上开了观测窗口。&lt;/p&gt;
&lt;p&gt;马工说"前端很多问题都是因为缺乏合适的架构造成的"。这个方向是对的，但可以更具体精确：问题不在于 CSS 选择了声明式，声明式的天花板确实更高。问题在于 CSS 在做抽象的时候，把不该藏的东西也藏了。好的抽象让你选择在哪一层工作，坏的抽象把所有层粘在一起让你没得选。&lt;/p&gt;
&lt;p&gt;下次你设计一个系统、做一层抽象的时候，值得问自己一个问题：你有没有把使用者未来需要观测的中间状态封进了黑盒？这个决策今天可能看起来干净优雅。但 CSS 用三十年证明了，这可能是一笔到期时间很长、利息很高的技术债。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"/><category term="Chinese"/><category term="Frontend"/><category term="System Design"/></entry><entry><title>One Line of Code on Every Other Platform. Why Can't the Web Do It After 30 Years?</title><link href="https://yage.ai/web-layout-tradeoff-en.html" rel="alternate"/><published>2026-03-30T21:00:00-07:00</published><updated>2026-03-30T21:00:00-07:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-30:/web-layout-tradeoff-en.html</id><summary type="html">&lt;p&gt;Querying layout results takes one line of code on iOS, Android, Qt, and Flutter. On the web, it requires triggering a full-page reflow. This isn't because browser engineers are incompetent. CSS made a deliberate architectural choice in 1994 toward declarative layout, which has a higher ceiling but hides intermediate state. Facebook paid hundreds of millions of dollars in 2012 for not understanding this trade-off. SwiftUI and Jetpack Compose prove that declarative and observable can coexist through proper layering. The lesson applies to all system design: good abstractions let you choose which layer to work at; bad abstractions glue all layers together and leave you no choice.&lt;/p&gt;</summary><content type="html">&lt;p&gt;You've almost certainly noticed this: scrolling through a feed in a native mobile app is usually smooth. But open a content-heavy webpage in a browser, and you'll often see content jumping, layout flickering, blank areas appearing before being filled in. This isn't a network speed issue. A large part of the reason is that browsers are an order of magnitude slower than native platforms at computing how much space each piece of content occupies after layout.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://yage.ai/share/pretext-short-term-overrated-long-term-underrated-20260330.html"&gt;Pretext&lt;/a&gt; recently went viral in the tech community, and it does exactly this: predicting how much space text will occupy inside a container. A reader named Ma Gong asked a very good question after reading about it: shouldn't this be the most basic functionality? Why, in 2026, do we still need a third-party library to solve this?&lt;/p&gt;
&lt;p&gt;I followed this question down the rabbit hole and found that it touches something far deeper than frontend technology choices. In 2012, Facebook paid the price of rewriting their entire mobile stack because they didn't understand the trade-off behind it. This isn't just a concern for frontend engineers. It's a judgment call about system design philosophy: when you build an abstraction, what should you hide, and what should you leave exposed?&lt;/p&gt;
&lt;h2&gt;On Every Other Platform, It Really Is One Line of Code&lt;/h2&gt;
&lt;p&gt;Let's first verify Ma Gong's intuition. On iOS, if you want to know how tall a piece of text will be when laid out at a given width, you call &lt;code&gt;sizeThatFits&lt;/code&gt;. One line of code, returns instantly, doesn't even require putting the text into any view hierarchy. Android uses &lt;code&gt;StaticLayout&lt;/code&gt;, Qt uses &lt;code&gt;QFontMetrics&lt;/code&gt;, Flutter uses &lt;code&gt;TextPainter&lt;/code&gt;. Same pattern everywhere: the layout engine is an independent computation module. You give it input, it gives you output, no global operations triggered.&lt;/p&gt;
&lt;p&gt;This is why native apps scroll through long lists so smoothly: the system can pre-calculate the height of each content item, knowing precisely which items are about to enter the screen and how much space to reserve for them.&lt;/p&gt;
&lt;p&gt;The web can't do this. In a browser, to find out how tall a piece of content is after layout, you have to actually insert it into the page, then trigger a Reflow, which means the browser recalculates the position and size of all affected elements. This calculation is synchronous and blocking, and its scope isn't limited to just the element you asked about. It can cascade through the entire layout tree. In a content-heavy feed page, when the window size changes, every item needs to be recalculated, and each recalculation ripples across the whole page. This is why web-based feeds, chat interfaces, and e-commerce listings typically feel rougher than their native counterparts during scrolling and window resizing.&lt;/p&gt;
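&lt;p&gt;To make the contrast concrete, here is a minimal sketch (TypeScript, browser environment; the function and variable names are illustrative) of what "asking for a height" looks like on the web today: you mount a probe element into the document and read a layout property, and that read is exactly what forces the synchronous reflow described above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// A minimal sketch: measuring how tall some content is at a given width on the web
function measureTextHeight(html: string, width: number): number {
  // The probe has to live in the real document, otherwise layout never runs for it
  const probe = document.createElement("div");
  probe.style.width = `${width}px`;
  probe.style.position = "absolute";   // keep it out of normal flow
  probe.style.visibility = "hidden";   // invisible, but still laid out
  probe.innerHTML = html;
  document.body.appendChild(probe);

  // Reading offsetHeight forces the browser to run a synchronous layout pass (reflow)
  // whose scope can extend to everything that might affect this element
  const height = probe.offsetHeight;

  document.body.removeChild(probe);
  return height;
}&lt;/code&gt;&lt;/pre&gt;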
&lt;p&gt;Ma Gong is right: this really should be the most basic functionality. And on iOS, Android, Qt, and Flutter, it truly is the most basic functionality. The web is the only major UI platform that can't do this.&lt;/p&gt;
&lt;p&gt;But web engineers couldn't possibly have overlooked this for thirty years. There must be a trade-off here. And understanding this trade-off matters a lot.&lt;/p&gt;
&lt;h2&gt;CSS's Choice Wasn't Stupid&lt;/h2&gt;
&lt;p&gt;There are two ways to build a user interface.&lt;/p&gt;
&lt;p&gt;One is to tell the system where to put everything. Early iOS development worked this way: you manually calculated coordinates and dimensions for each element. The system executed exactly what you specified, and you had precise understanding of every step. In this model, "how tall is this after layout" is naturally available as intermediate data, because you need that number to decide where the next element goes.&lt;/p&gt;
&lt;p&gt;The other is to describe what you want and let the system figure out the rest. CSS works this way. You write &lt;code&gt;display: flex; flex-wrap: wrap; gap: 16px&lt;/code&gt;, and whether the screen is 320px or 1920px wide, the browser decides how many items per row and how to distribute space. You don't control the process. You describe intent.&lt;/p&gt;
&lt;p&gt;The latter has a higher ceiling. LaTeX is an even more extreme example. Its line-breaking algorithm treats an entire paragraph as an optimization problem, considering all possible break points and selecting the one that produces the most uniform spacing across the whole paragraph. It might make earlier lines slightly looser to avoid an ugly gap in a later line. You can't achieve this effect with line-by-line manual typesetting, because when you're setting line three, you don't know what will happen at line seven. Only a system with a global view can make that kind of optimization.&lt;/p&gt;
&lt;p&gt;CSS's responsive layout follows the same logic. A developer experienced with WPF and XAML commented on Hacker News that building adaptive layouts with CSS Flexbox/Grid is more efficient in terms of development time than many native desktop frameworks, because you're describing intent rather than writing implementation.&lt;/p&gt;
&lt;p&gt;But what's the cost? The cost is that you can't ask the system "what did you calculate?" In CSS's world, an element's final size depends on all the elements around it. Floats, positioning, inline formatting contexts, margin collapse: every rule reinforces the same fact. Local results are solved globally. There is no context-free independent answer. CSS's very first proposal in 1994 established this architecture: information flows one way. Developers declare rules, the browser executes layout, but the browser doesn't report intermediate results back.&lt;/p&gt;
&lt;p&gt;For document typesetting, this makes perfect sense. You don't need to know a paragraph's exact height. You just declare styles, and the browser handles presentation. LaTeX users experience the same thing. You don't control which page a figure appears on. You tell TeX your preference, and it decides the optimal placement.&lt;/p&gt;
&lt;p&gt;The cost of not understanding this trade-off can run into hundreds of millions of dollars. In 2012, Facebook built its entire mobile experience on HTML5. Both the iOS and Android apps were essentially wrappers around WebViews. The rationale for choosing HTML5 was intuitive: write once, run everywhere; push updates from the server without requiring users to download new versions. But ultimately, performance couldn't keep up. Facebook engineers &lt;a href="https://www.infoq.com/news/2012/09/Facebook-HTML5-Native/"&gt;documented the issues in a postmortem&lt;/a&gt;: inconsistent scroll frame rates, UI thread stutter, device resource exhaustion causing crashes. Global DOM Reflow was one of the core bottlenecks: every content update could trigger recalculation of the entire layout tree, and News Feed's long lists with heavy images pushed it to the breaking point.&lt;/p&gt;
&lt;p&gt;Facebook ultimately spent nine months rewriting the native iOS app from scratch. Launch time dropped from roughly 10 seconds to roughly 4 seconds. News Feed loading speed doubled. Zuckerberg said at TechCrunch Disrupt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The biggest mistake that we made as a company was betting too much on HTML5 as opposed to native.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The subsequent chain of events is equally telling. The HTML5 failure directly led to React (2013). Once React's declarative approach was validated, it led to React Native (2015). React Native's layout engine Yoga was explicitly designed to optimize layout queries. &lt;a href="https://engineering.fb.com/2016/12/07/android/yoga-a-cross-platform-layout-engine/"&gt;Facebook's engineering blog&lt;/a&gt; noted that Yoga ensures text views are "measured as few times as possible, ideally just once," and moves layout computation to a separate thread, completely sidestepping the mutual blocking between JavaScript and layout in DOM Reflow.&lt;/p&gt;
&lt;p&gt;Every step in this chain was an escape from CSS's layout architecture constraints. Understanding why CSS made this trade-off, and under what conditions it breaks down, is a judgment call worth hundreds of millions of dollars.&lt;/p&gt;
&lt;h2&gt;But This Trade-off Isn't Inevitable&lt;/h2&gt;
&lt;p&gt;At this point, the story sounds like a binary choice: either pick "the system decides for you" and get better layout quality at the cost of invisible intermediate results, or pick "you control everything" and get precise queryability at the cost of losing global optimization.&lt;/p&gt;
&lt;p&gt;But in fact, this trade-off can be broken.&lt;/p&gt;
&lt;p&gt;SwiftUI (Apple, 2019) and Jetpack Compose (Google, 2021) are both declarative UI frameworks. Writing responsive layouts in SwiftUI is conceptually very similar to writing Flexbox in CSS: describe intent, let the system decide layout. But neither of them has CSS's problem.&lt;/p&gt;
&lt;p&gt;The underlying reason is architectural layering. Native platforms are designed with a completely independent layout engine at the bottom layer (Core Text on iOS, StaticLayout on Android), and declarative frameworks on top. The declarative framework calls the underlying engine to perform layout, but application code can also pierce through the declarative abstraction at any time and query the underlying engine directly. Jetpack Compose even provides an official &lt;code&gt;TextMeasurer&lt;/code&gt; API that returns complete layout information, including dimensions, line count, and character positions, without triggering any actual rendering.&lt;/p&gt;
&lt;p&gt;This proves something important: declarative and observable are not mutually exclusive. You can let the system perform global optimization while maintaining an independent channel for querying intermediate results. The key is that the layout engine must be designed as an independent module, not welded into the global layout pipeline.&lt;/p&gt;
&lt;p&gt;CSS's problem isn't that it chose to be declarative. The problem is that when building its abstraction, it also sealed the layout engine inside the global layout process without leaving an independent query interface. (Caveat: &lt;code&gt;canvas.measureText()&lt;/code&gt; only handles single-line text and doesn't support line wrapping. The line-wrapping logic is locked inside the layout engine and has never been exposed as a standalone interface.)&lt;/p&gt;
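&lt;p&gt;To pin down exactly where that gap sits, here is a small sketch (TypeScript, browser environment; the identifiers are real DOM APIs, and the comments mark what is missing rather than proposing a new interface):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// What canvas.measureText can do: the width of a single run of text in one font
const canvas = document.createElement("canvas");
const ctx = canvas.getContext("2d")!;
ctx.font = "16px system-ui";
const singleLineWidth = ctx.measureText("hello layout world").width;

// What it cannot do: nothing built-in takes the same text plus a container width
// and returns line breaks or total height. That wrapping logic is locked inside
// the layout engine, so libraries end up rebuilding it on top of per-run widths.&lt;/code&gt;&lt;/pre&gt;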
&lt;p&gt;When CSS was designed in 1994, the web was an academic document exchange system. Nobody foresaw it becoming an application platform. In a document context, you genuinely don't need to independently query layout results, so nobody thought to design the layout engine as a separately callable module. That decision was perfectly reasonable at the time. But thirty years later, it became a massive piece of technical debt.&lt;/p&gt;
&lt;p&gt;The W3C is well aware of this problem. CSS Houdini's Font Metrics API was designed to address it, but as of 2026, it remains at the proposal stage with no browser implementation. Pretext filled this gap with 4,000 lines of userland code. It rebuilt on the web a capability that iOS and Android have long provided as infrastructure. Its very existence is evidence that CSS is missing an abstraction layer.&lt;/p&gt;
&lt;h2&gt;This Lesson Extends Far Beyond Frontend&lt;/h2&gt;
&lt;p&gt;Looking back, both CSS and native platforms chose declarative layout, but native platforms preserved the layout engine as an independent, queryable layer. CSS did not. A layering decision made in the 1990s, seemingly inconsequential at the time, led thirty years later to fundamentally different developer experiences, fundamentally different performance characteristics, and a company's strategic mistake costing hundreds of millions of dollars.&lt;/p&gt;
&lt;p&gt;This pattern exists well beyond frontend.&lt;/p&gt;
&lt;p&gt;Traditional API design follows the same philosophy as CSS: hide complexity, shield users from needing to see intermediate state. Catch underlying errors and throw an abstract high-level exception. Wrap implementation details behind clean interfaces. This works fine when the user is human. Humans have limited cognitive bandwidth, and good abstractions help them focus.&lt;/p&gt;
&lt;p&gt;But when the system's user becomes code or AI that needs to make decisions, this protective abstraction becomes an obstacle. AI's effectiveness depends on a &lt;a href="https://yage.ai/result-certainty-en.html"&gt;try-observe-correct loop&lt;/a&gt;. A vague "operation failed, please try again later" directly breaks this loop, which is the same problem as CSS not telling you layout results. What AI needs is &lt;a href="https://yage.ai/ai-software-engineering-en.html"&gt;fine-grained raw feedback, granular control interfaces, and sufficiently detailed intermediate state&lt;/a&gt;. The thirty-year struggle of frontend layout and the problems AI engineering faces today share the same root: the same design philosophy hitting a wall in different eras.&lt;/p&gt;
&lt;p&gt;The solutions are also structurally similar. Pretext built an independent queryable channel alongside CSS's declarative rendering, not replacing the browser's layout but letting you observe layout results. Agentic loops do the same thing for AI workflows: they don't prescribe what AI should do at each step, but they let it observe the results of its own actions and autonomously decide what to do next. Neither tears down the abstraction. Both open observation windows on top of it.&lt;/p&gt;
&lt;p&gt;Ma Gong said "many frontend problems come from the lack of a proper architecture." The direction is right, but it can be made more precise: the problem isn't that CSS chose to be declarative. Declarative genuinely has a higher ceiling. The problem is that CSS, when building its abstraction, also hid things that shouldn't have been hidden. Good abstractions let you choose which layer to work at. Bad abstractions glue all layers together and leave you no choice.&lt;/p&gt;
&lt;p&gt;Next time you design a system or build an abstraction layer, it's worth asking yourself: are you sealing intermediate state that your users will someday need to observe into a black box? That decision might look clean and elegant today. But CSS has spent thirty years proving that it can be a very long-dated, very high-interest piece of technical debt.&lt;/p&gt;</content><category term="Computing"/><category term="English"/><category term="Frontend"/><category term="System Design"/></entry><entry><title>为什么AI只会说正确的废话，以及怎么把它逼出舒适区</title><link href="https://yage.ai/context-infrastructure.html" rel="alternate"/><published>2026-03-15T22:00:00-07:00</published><updated>2026-03-15T22:00:00-07:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-15:/context-infrastructure.html</id><summary type="html">&lt;p&gt;LLM的默认输出是consensus：正确但平庸。Deep Research其实是Wide Research。我们找到了一种系统性方法，用个人认知上下文把LLM从consensus里强行扯出来。一年实验，有控制变量证据。&lt;/p&gt;</summary><content type="html">&lt;p&gt;三周前，OpenAI发了一篇关于Harness Engineering的博文。我和社区里一个很厉害的朋友老王各自让自己的Agent做了一份深度调研。我们特意用了同档次的LLM（Claude Opus 4.6 vs GPT-5.4），同一个调研skill，同一个搜索工具Tavily，同一个agentic后端OpenCode，甚至同样的prompt，唯一不同的是两个Agent背后的context：我的Agent接入了我过去一年积累的判断框架和认知体系，老王的Agent没有。结果两边的AI给出了风格完全不同的分析。拿OpenAI和Cursor在harness架构上的收敛来举例：&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;第一种报告（行动建议部分）：&lt;/strong&gt;
先做知识底座，再做更强agent。给仓库建立清晰的AGENTS.md目录索引；把产品规则、架构规则、执行计划、质量标准写进repo；用CI检查文档freshness和cross-link completeness。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第二种报告（同一话题的分析）：&lt;/strong&gt;
完美主义是吞吐量的敌人。OpenAI采用最小阻塞合并、后续修复的策略。Cursor发现要求100%正确性会导致系统停滞，接受小而稳定的错误率反而更高效。两者都接受了「纠错比等待便宜」的权衡。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;差异在哪？第一种给了一份checklist：正确、安全、换任何人来问AI都能得到差不多的东西。第二种给了一个insight：跨两个不同来源提炼出一个有立场的判断（完美主义是敌人），并归纳出底层权衡（纠错比等待便宜）。一个是搬运工，一个是分析师。两篇完整报告可以在这里对比：&lt;a href="https://challenwang.com/essays/harness-engineering-survey-20260313.html"&gt;第一种报告&lt;/a&gt;和&lt;a href="https://yage.ai/share/harness-engineering-survey-20260312.html"&gt;第二种报告&lt;/a&gt;。&lt;/p&gt;
&lt;p&gt;如果你回忆一下自己用AI做分析性工作的经验，大多数AI的产出其实都像第一种报告：找不出明显的错误，但读完之后没有任何启发。属于正确的废话。第二种极其少见。这种普遍的平庸来自一个LLM训练的底层原因。&lt;/p&gt;
&lt;h2&gt;LLM的Consensus天花板&lt;/h2&gt;
&lt;p&gt;这个原因是：LLM被训练的方式就决定了它的默认输出是consensus（共识）。&lt;/p&gt;
&lt;p&gt;LLM训练的本质是next token prediction，它的意思是：每一步输出概率最高的token。概率最高意味着最多人会认同，也就是consensus。RLHF在这个基础上更进了一步：安全对齐专门惩罚有争议的、带有强烈立场的输出，鼓励平衡、全面、没有明显偏向的回答。两层机制叠加，LLM的默认行为就是回归均值。&lt;/p&gt;
&lt;p&gt;这个默认行为导致了一个相当严重的缺陷。比如过去两年认知方向最火的产品是Deep Research，但仔细看它做的事情：自动化的高频搜索，多文档综合，扩大信息覆盖面。这其实和Deep都没有关系，充其量是Wide Research。换言之，Deep Research是一个非常有误导性的名字。它解决的问题是&lt;strong&gt;信息&lt;/strong&gt;不对称：你以前不知道的，现在知道了。但是真正的深度来自另一个维度，来自&lt;strong&gt;认知&lt;/strong&gt;不对称。面对同样一份行业报告，一个从业二十年的老兵和一个刚入行的新手看到的东西完全不一样。老兵的优势在于他有一套经过多年试错沉淀下来的认知系统，知道哪些数据是噪音，哪些异常值预示着趋势。小白没有这个滤镜，就算拿到10倍厚的报告也没办法做出同样质量的决策。&lt;/p&gt;
&lt;p&gt;这就是为什么你很少听到有人说「用了AI之后我有了以前从来没有过的深刻判断」。AI可以把一个小白提升到大众平均水平，因为它的训练数据就是大众平均水平的压缩。但对于已经在平均水平之上的人，AI的consensus输出对他的判断几乎没有增量。深刻的定义本来就是非共识，而非共识恰好是LLM被训练去规避的方向。&lt;/p&gt;
&lt;p&gt;但是，这个gap意味着一种浪费，一个机会。AI只能输出consensus的话，你就没法把真正的thinking委托给它。不说AGI之类的长远前景，就看日常应用，它能当秘书帮你整理信息，但是当不了顾问/教练，帮你做判断。&lt;a href="/ai-management.html"&gt;之前的AI管理系列文章&lt;/a&gt;里讨论过这个区别，但那时还没有找到系统性的突破口。&lt;/p&gt;
&lt;p&gt;那突破口在哪？&lt;/p&gt;
&lt;h2&gt;AI已经从CPU Bound走向Memory Bound&lt;/h2&gt;
&lt;p&gt;面对AI说正确的废话，大家的直觉是去优化模型：换更好更贵的模型、改更复杂的prompt、加更完备的工具，Multi-Agent、Harness全给它整上。这些做法都在优化同一个维度：模型的智能。&lt;/p&gt;
&lt;p&gt;但是开头的实验已经告诉我们答案了。两边的模型智能几乎一样，工具一样，prompt一样。唯一不同的是context：第二种报告背后有一年积累的判断框架，第一种没有。结果一个输出checklist，一个输出insight。&lt;/p&gt;
&lt;p&gt;变量只有一个，结论很显然：（在模型智能跨过一道坎以后）决定产出性质的是context，而不是模型的智能。其实这种转变在计算机历史上发生过：CPU快到一定程度之后，继续升级CPU就没有意义了，主要的提升都来自内存架构。而LLM现在到了同样的拐点。&lt;/p&gt;
&lt;p&gt;这个判断是反直觉的。这是因为一说起AI，大家第一反应就是模型。我们经常看到模型升级了，却从来没看到过context升级了这种说法。但这个不对称本身就揭示了一个更深刻的趋势。每次模型升级，智能就更便宜一点，你用的模型别人也能用。但你的context是只属于你的，模型升级不会让它贬值。所以持续投入在一个不断贬值的维度（模型智能）上，收益递减；投入在一个不贬值的维度（个人context）上，收益累积。&lt;/p&gt;
&lt;p&gt;既然瓶颈在context，那要突破consensus天花板，就需要用足够密度的个人认知上下文压过训练时的consensus prior。几句话的system prompt做不到这件事。你的品味、你对优先级的直觉、你在某个领域反复验证过的判断框架，这些东西是高维的，散落在过去无数次决策和反馈里，几句话根本说不清楚，而需要一套系统来采集和精炼。&lt;/p&gt;
&lt;h2&gt;怎么把LLM从Consensus的舒适区域里面逼出来&lt;/h2&gt;
&lt;p&gt;为了实现这个目标，我花了一年时间，逐渐构建了一整个系统，发展为三个互相支撑的要素。每一个要素都在回应一个具体的问题。&lt;/p&gt;
&lt;h3&gt;大量积累&lt;/h3&gt;
&lt;p&gt;第一个问题是：你的认知框架到底是什么？&lt;/p&gt;
&lt;p&gt;这个问题看起来简单，实际上非常困难。厉害的人通常说不清楚自己哪里厉害，就算能说上来，往往也是错的。很多他觉得「没什么大不了」的肌肉记忆一样的判断，恰恰是他最独特的地方。这部分一定要靠一个第三方才能捕捉到。&lt;/p&gt;
&lt;p&gt;所以捕捉Context的起点是采集客观的行为数据，而不是单单靠自己写prompt。我持续做了一年相关实验，包括&lt;a href="/life-api.html"&gt;录音转写&lt;/a&gt;、会议记录、&lt;a href="https://github.com/grapeot/wechat_db_parser/"&gt;微信对话导出&lt;/a&gt;、和AI的每次对话、每次纠正甚至发飙，都&lt;a href="stop-using-chatgpt.html"&gt;积累成了本地文件&lt;/a&gt;。这些是我们在真实决策场景下展现出来的判断逻辑。&lt;/p&gt;
&lt;p&gt;注意，我们自己很难从里面提取模式，因为我们太接近它了。这往往需要一个旁观者来看，AI在这件事上是个合适的旁观者。因此，我把所有数据放在同一个文件夹里，AI打开就能看到所有内容，对任何项目做cross-reference。这是context density的基础。&lt;/p&gt;
&lt;h3&gt;分层提炼&lt;/h3&gt;
&lt;p&gt;第二个问题是：原始数据里那么多噪声，怎么把信号找出来？&lt;/p&gt;
&lt;p&gt;你今天做的某个决策可能是因为没睡好，可能当时信息不全，也可能就是随机选的。如果把原始数据直接给AI（比如&lt;a href="https://mem0.ai/"&gt;Mem0&lt;/a&gt;的做法），AI面对的解读空间太大了。一个具体事件可能体现了很多不同的原则，有些决策甚至是arbitrary的。因此，我们需要一个精炼过程。&lt;/p&gt;
&lt;p&gt;这里我用了一个非常简单的筛选标准：稳定性。一个判断如果是跨场景、跨时间反复出现并保持一致的，它大概率是我们认知结构的一部分。不稳定的是情境反应，稳定的才是我自己。&lt;/p&gt;
&lt;p&gt;&lt;a href="/openclaw.html"&gt;受OpenClaw启发&lt;/a&gt;，这个精炼分为三层。L1 Observer每天扫描文件变动，提取有意义的观察，写个流水账。L2 Reflector每周合并重复、清理过期信息、识别跨项目模式，负责把信号和噪声分离。L3 Axiom从稳定模式中蒸馏决策原则，只保留真正代表你的东西。经过了一年的积累和几周的精炼，目前我的系统里积累了44条axiom，覆盖我的技术选择、沟通风格、商业判断等等主观偏好。&lt;/p&gt;
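&lt;p&gt;为了让这个分层更具体，下面是 L1 Observer 这一层的一个高度简化的示意（TypeScript，Node 环境；目录和文件名都是假设的，真实系统里由 LLM 完成的总结步骤这里被省略，只保留「扫描变动、写入流水账」的骨架）：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { readdirSync, statSync, appendFileSync } from "node:fs";
import { join } from "node:path";

const WORKSPACE = "/path/to/context-notes";             // 假设的笔记目录
const LOG = join(WORKSPACE, "memory/observations.md");  // 假设的流水账文件
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// 递归收集修改时间晚于 since 的文件
function filesChangedSince(dir: string, since: number): string[] {
  const changed: string[] = [];
  for (const item of readdirSync(dir, { withFileTypes: true })) {
    const full = join(dir, item.name);
    if (item.isDirectory()) {
      changed.push(...filesChangedSince(full, since));
    } else if (statSync(full).mtimeMs &gt; since) {
      changed.push(full);
    }
  }
  return changed;
}

// 每天跑一次（比如用 cron）：记下今天的变动清单，
// 给每周运行的 L2 Reflector 留下可以合并、提炼模式的原始材料
const today = new Date().toISOString().slice(0, 10);
const lines = filesChangedSince(WORKSPACE, Date.now() - ONE_DAY_MS)
  .map(function (f) { return "- changed: " + f; })
  .join("\n");
appendFileSync(LOG, "\n## " + today + "\n" + lines + "\n");&lt;/code&gt;&lt;/pre&gt;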
&lt;p&gt;这里我们和Mem0等等流行记忆系统的核心区别在于蒸馏的深度。Mem0蒸馏到事实层就停了：「你偏好TypeScript」「你住在上海」。但是我们的系统继续往上走，蒸馏到判断原则层：「评估技术方案时，你怎么权衡可维护性和性能，优先级排序是什么」。事实告诉AI你是谁，判断原则告诉AI你怎么想。让AI产出从consensus变成non-consensus，需要的是后者。&lt;/p&gt;
&lt;h3&gt;按需加载&lt;/h3&gt;
&lt;p&gt;第三个问题是：这么多context，怎么给到AI？&lt;/p&gt;
&lt;p&gt;全部塞进去是不行的。Context window有限，而且无关信息会稀释有效信号。一个写代码的任务带入所有的商业判断原则没有意义，一个做调研的任务也不需要加载代码架构偏好。&lt;/p&gt;
&lt;p&gt;解决方案是现成的skill系统：每个skill是一个针对特定任务类型的context子集，包含这类任务最相关的axiom、判断标准和常用工具。做调研时加载调研的分析框架，写代码时加载架构原则和审阅偏好。这和CPU的内存层级也是类似的：L1 cache是AGENTS.md，L2对应skill库的索引，告诉AI如果需要什么信息的话往哪找，L3则对应具体的skill文件。按需加载，渐进披露，每层只在需要的时候被调用。&lt;/p&gt;
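&lt;p&gt;按需加载这一层也可以用一个极简的示意来说明（TypeScript，Node 环境；索引格式和文件名都是假设的，并不是仓库的真实结构）：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { readFileSync } from "node:fs";

// L1：无条件进入上下文的部分，保持尽量小
const base = readFileSync("AGENTS.md", "utf8");

// L2：一个很小的索引，只说明「去哪找」，不包含内容本身（格式为假设）
type SkillIndex = { [taskType: string]: string };
const skillIndex: SkillIndex = JSON.parse(readFileSync("skills/index.json", "utf8"));

// L3：具体的 skill 文件，只在对应类型的任务真正开始时才加载
function loadSkill(taskType: string): string {
  const path = skillIndex[taskType];
  if (!path) {
    throw new Error("no skill registered for task type: " + taskType);
  }
  return readFileSync(path, "utf8");
}

// 做调研只加载调研框架；写代码的任务则换成架构原则和审阅偏好
const researchContext = [base, loadSkill("research")].join("\n\n");&lt;/code&gt;&lt;/pre&gt;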
&lt;h3&gt;循环&lt;/h3&gt;
&lt;p&gt;三个要素运行起来之后，一件有意思的事情发生了：知识产品开始涌现，而每个产品在消费context的同时也在产生新的context。&lt;/p&gt;
&lt;p&gt;&lt;a href="https://daily.yage.ai"&gt;鸭哥AI手记&lt;/a&gt; 是基于这个上下文系统写的每日AI行业简报，每一期都在消费axiom和skill，同时产出新的观察进入observation库。具体的&lt;a href="https://www.superlinear.academy/c/news/"&gt;领域调研报告&lt;/a&gt; 是带判断标准的深度分析，每篇报告在生产过程中同时也在更新相关领域的认知框架。这两个系列的报告质量都很高，被大家广泛订阅和转发。这证明了循环可以在足够高的context density下自然形成，在持续运行中不断保持活力。&lt;/p&gt;
&lt;p&gt;这个系统的本质是把你的bias注入AI。有品味的bias是深度的来源，但bias也可能是质量不高的偏见。不过把bias显性化这个过程本身就很有意义。在没有这个系统之前，你的bias散落在决策里，你意识不到它的形态，甚至意识不到它的存在。经过采集、精炼、蒸馏之后，你能看到自己面对某类问题时倾向于优先考虑什么、倾向于忽略什么。这种自我认知的进步，单独就有价值。&lt;/p&gt;
&lt;p&gt;回到开头的实验。老王的AI输出consensus，因为它能看到的context几乎是空的，训练时的prior没有被任何个人认知覆盖。我的AI输出有判断力的分析，因为它有一年积累的判断框架在背后。同一个模型，context density不同，产出的性质就不同。&lt;/p&gt;
&lt;h3&gt;开源的参考实现&lt;/h3&gt;
&lt;p&gt;但是注意，这个系统需要时间积累，需要一定的技术能力，需要持续维护的意愿。但换一个角度理解这个成本：「改一下system prompt就让AI瞬间懂你」或者「换一个更好的模型就够了」，这些捷径从原理上就走不通。Consensus prior太强，几句话压不过去，换模型只是换了一个consensus的来源。对于一个这么重要的问题，it deserves a system。从源头采集、分层精炼、按需加载、循环更新，每一步都有它的理由。并不简单，但特别有帮助。&lt;/p&gt;
&lt;p&gt;我们把这个系统的完整结构开源了：&lt;a href="https://github.com/grapeot/context-infrastructure"&gt;github.com/grapeot/context-infrastructure&lt;/a&gt;。这个repo是一个参考实现，包含了我们实际在用的44条axiom、核心skill文件、三层记忆系统的代码、以及文章里提到的所有组件。&lt;/p&gt;
&lt;p&gt;需要说清楚的是：这个repo的主要价值是让你看到一个运行了一年的系统长什么样，而非让你克隆下来就能直接用。你可以打开它跟AI对话，问「这个观点鸭哥会怎么看」，立刻体验到有context和没有context的差异。但要想让AI真正变成你自己的，没有捷径。你需要从头开始采集你的行为数据，设置你自己的计划任务，让系统从你的决策历史中蒸馏出属于你的判断原则。别人的skills是别人的视角，参考可以，替代不行。&lt;/p&gt;
&lt;h2&gt;偏见与硅基大脑&lt;/h2&gt;
&lt;p&gt;总之，AI变得更聪明，并不自动让它变得更深刻。更聪明的consensus依然是consensus。突破天花板的路径只有一条：注入非共识的视角。&lt;/p&gt;
&lt;p&gt;每个人都有自己的非共识视角。你的判断标准，你的审美偏好，你从失败中提炼出的教训，你对什么重要什么次要的直觉。这些东西在AI的训练数据里不存在，永远不会被任何版本的模型自动学到，因为它们只属于你。&lt;/p&gt;
&lt;p&gt;硅基大脑的绝对客观最终只能抵达聪明的平庸，能将其重塑的，唯有你积累数十年、充满偏见与品味的人类灵魂。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"/><category term="Chinese"/><category term="Agentic AI"/><category term="Methodology"/></entry><entry><title>Why AI Only Gives You Correct Nonsense, and How to Push It Out of Its Comfort Zone</title><link href="https://yage.ai/context-infrastructure-en.html" rel="alternate"/><published>2026-03-15T21:00:00-07:00</published><updated>2026-03-15T21:00:00-07:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-15:/context-infrastructure-en.html</id><summary type="html">&lt;p&gt;An LLM's default output is consensus: correct but mediocre. Deep Research is really Wide Research. We found a systematic way to pull LLMs out of consensus using personal cognitive context. One year of experimentation, with controlled evidence.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Three weeks ago, OpenAI published a blog post on Harness Engineering. My friend Lao Wang and I each had our own agents run a deep research on it. We were deliberate about keeping conditions matched: comparable LLMs (Claude Opus 4.6 vs GPT-5.4), the same research skill, the same search tool (Tavily), the same agentic backend (OpenCode), even the same prompt. The only difference was the context behind each agent: mine had a year's worth of accumulated judgment frameworks and personal mental models; Lao Wang's had none. The two AIs produced analyses that felt fundamentally different in kind. Here's the comparison on one specific topic—how OpenAI and Cursor have converged on harness architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Report One (recommendations section, translated from the original Chinese):&lt;/strong&gt;
Build the knowledge foundation first, then build stronger agents. Create a clear AGENTS.md directory index for the repo; write product rules, architecture rules, execution plans, and quality standards into the repo; use CI to check documentation freshness and cross-link completeness.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Report Two (same topic, translated from the original Chinese):&lt;/strong&gt;
Perfectionism is the enemy of throughput. OpenAI merges with minimal blocking and fixes issues afterward. Cursor found that demanding 100% correctness stalls the system, and that accepting a small, stable error rate is actually more efficient. Both accepted the trade-off that correcting errors is cheaper than waiting.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What's the difference? Report One gives you a checklist: correct, safe, the kind of output you'd get from anyone asking the same question. Report Two gives you an insight: it synthesizes a pointed judgment across two independent sources (perfectionism is the enemy) and names the underlying tradeoff (correcting errors is cheaper than waiting for them). One is a courier; the other is an analyst. Both full reports are available for comparison: &lt;a href="https://challenwang.com/essays/harness-engineering-survey-20260313.html"&gt;Report One&lt;/a&gt; and &lt;a href="https://yage.ai/share/harness-engineering-survey-20260312.html"&gt;Report Two&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you think back on your own experience using AI for analytical work, most outputs look like Report One: nothing obviously wrong, but nothing you'd actually learn from. Correct nonsense. Report Two is rare. This pervasive mediocrity has a specific root cause in how LLMs are trained.&lt;/p&gt;
&lt;h2&gt;The LLM Consensus Ceiling&lt;/h2&gt;
&lt;p&gt;The reason is this: the way LLMs are trained means their default output is consensus.&lt;/p&gt;
&lt;p&gt;LLM training is fundamentally next token prediction—outputting the highest-probability token at each step. Highest probability means most people would agree, which is consensus. RLHF layers another mechanism on top: safety alignment specifically penalizes controversial, strongly-opinionated outputs and rewards balanced, comprehensive, non-committal answers. Two mechanisms stacked together, and the LLM's default behavior is regression to the mean.&lt;/p&gt;
&lt;p&gt;This default has a serious consequence. Take Deep Research, arguably the hottest product category in AI over the past two years. But look at what it actually does: automated high-frequency search, multi-document synthesis, expanding information coverage. That has nothing to do with depth. It's Wide Research at best. "Deep Research" is a genuinely misleading name.&lt;/p&gt;
&lt;p&gt;The problem it solves is &lt;strong&gt;information&lt;/strong&gt; asymmetry: things you didn't know before, now you do. But real depth comes from a different axis—&lt;strong&gt;cognitive&lt;/strong&gt; asymmetry. Facing the same industry report, a twenty-year veteran and a fresh hire see completely different things. The veteran has a mental system built from years of trial and error: they know which data is noise, which anomalies signal a trend. The newcomer doesn't have that filter. Even with a report ten times as long, they can't make the same quality decisions.&lt;/p&gt;
&lt;p&gt;This is why you rarely hear someone say "AI gave me a judgment I never could have formed on my own." AI can raise a novice to average, because its training data is a compression of average. But for people already above average, consensus output adds almost nothing. Depth is, by definition, non-consensus—and non-consensus is exactly what LLMs are trained to avoid.&lt;/p&gt;
&lt;p&gt;But this gap is also a waste, and an opportunity. If AI can only output consensus, you can't actually delegate real thinking to it. Setting aside AGI and long-term futures, in everyday use it can be your secretary—organizing information—but not your advisor or coach, helping you form judgments. &lt;a href="/ai-management-en.html"&gt;The earlier AI management series&lt;/a&gt; touched on this distinction, but at the time I hadn't found a systematic way through.&lt;/p&gt;
&lt;p&gt;So where's the opening?&lt;/p&gt;
&lt;h2&gt;AI Has Shifted from CPU-Bound to Memory-Bound&lt;/h2&gt;
&lt;p&gt;The intuitive response to AI producing correct nonsense is to optimize the model: use a smarter, more expensive model, write more elaborate prompts, add more tools—throw Multi-Agent and Harness architectures at it. All of these optimize the same dimension: model intelligence.&lt;/p&gt;
&lt;p&gt;But the opening experiment already gave us the answer. The model intelligence on both sides was roughly equal. Same tools, same prompt. The only variable was context: Report Two had a year of accumulated judgment frameworks behind it; Report One had nothing. One produced a checklist, the other produced an insight.&lt;/p&gt;
&lt;p&gt;One variable, one conclusion: once model intelligence crosses a threshold, what determines the &lt;em&gt;nature&lt;/em&gt; of the output is context, not model intelligence. This kind of transition has happened before in computing: once CPUs became fast enough, upgrading the CPU stopped mattering—the meaningful gains all came from memory architecture. LLMs are at the same inflection point.&lt;/p&gt;
&lt;p&gt;This is counterintuitive. When people talk about AI, the first thing they think about is the model. We constantly hear about model upgrades; nobody talks about context upgrades. But that asymmetry itself reveals something. Every model upgrade makes intelligence a little cheaper, available to everyone. Your context, on the other hand, is yours alone—model upgrades don't depreciate it. Continuously investing in a depreciating dimension (model intelligence) yields diminishing returns; investing in a non-depreciating dimension (personal context) compounds.&lt;/p&gt;
&lt;p&gt;Since the bottleneck is context, breaking through the consensus ceiling requires personal cognitive context dense enough to override the consensus prior baked in during training. A few lines of system prompt can't do this. Your taste, your intuitions about priorities, your judgment frameworks refined through repeated validation in a specific domain—these are high-dimensional, scattered across countless past decisions and feedback loops, and a few sentences can't capture them. You need a system.&lt;/p&gt;
&lt;h2&gt;How to Push LLMs Out of Their Consensus Comfort Zone&lt;/h2&gt;
&lt;p&gt;To solve this, I spent a year building a system that evolved into three mutually reinforcing components. Each addresses a specific question.&lt;/p&gt;
&lt;h3&gt;Accumulate at Scale&lt;/h3&gt;
&lt;p&gt;The first question: what exactly is your cognitive framework?&lt;/p&gt;
&lt;p&gt;This sounds simple and turns out to be quite hard. Skilled people usually can't articulate what makes them skilled—and when they try, they're often wrong. Many of their most distinctive capabilities are muscle memory, things they'd describe as "no big deal," which is precisely what makes them unique. You need a third party to capture that.&lt;/p&gt;
&lt;p&gt;So the starting point for capturing context is collecting objective behavioral data, not just writing prompts from introspection. Over the past year I've run a continuous experiment: &lt;a href="/life-api-en.html"&gt;voice transcripts&lt;/a&gt;, meeting notes, &lt;a href="https://github.com/grapeot/wechat_db_parser/"&gt;WeChat conversation exports&lt;/a&gt;, every conversation with AI, every correction, even every frustrated outburst—all &lt;a href="stop-using-chatgpt-en.html"&gt;accumulated as local files&lt;/a&gt;. These are the reasoning patterns I actually exhibit under real decision-making conditions.&lt;/p&gt;
&lt;p&gt;Worth noting: it's hard to extract patterns from your own data because you're too close to it. You need an outside observer, and AI is a good fit for that role. So I keep all this data in one folder, where an AI can see everything at once and cross-reference across any project. This is the foundation of context density.&lt;/p&gt;
&lt;h3&gt;Layered Distillation&lt;/h3&gt;
&lt;p&gt;The second question: given all that noise in the raw data, how do you find the signal?&lt;/p&gt;
&lt;p&gt;A decision you made today might reflect poor sleep, incomplete information, or just a random choice. If you feed raw data directly to AI (as systems like &lt;a href="https://mem0.ai/"&gt;Mem0&lt;/a&gt; do), the interpretive space is too wide. A single event can express many different principles, and some decisions are simply arbitrary. Distillation is necessary.&lt;/p&gt;
&lt;p&gt;The filtering criterion I use is simple: stability. A judgment that appears consistently across different situations and over time is likely part of your cognitive structure. What's unstable is situational reaction; what's stable is actually you.&lt;/p&gt;
&lt;p&gt;&lt;a href="/openclaw-en.html"&gt;Inspired by OpenClaw&lt;/a&gt;, the distillation runs in three layers. L1 Observer scans file changes daily, extracts meaningful observations, and writes a running log. L2 Reflector runs weekly, merging duplicates, pruning stale information, and identifying cross-project patterns—separating signal from noise. L3 Axiom distills stable patterns into decision principles, keeping only what genuinely represents you. After a year of accumulation and a few weeks of distillation, my system now holds 44 axioms covering my technical preferences, communication style, business judgment, and other subjective leanings.&lt;/p&gt;
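&lt;p&gt;To make the layering more concrete, here is a heavily simplified sketch of the L1 Observer idea (TypeScript on Node; the directory and file names are hypothetical, and the summarization step that an LLM performs in the real system is omitted, leaving only the "scan changes, append to a running log" skeleton):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { readdirSync, statSync, appendFileSync } from "node:fs";
import { join } from "node:path";

const WORKSPACE = "/path/to/context-notes";             // hypothetical notes folder
const LOG = join(WORKSPACE, "memory/observations.md");  // hypothetical running log
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// Recursively collect files whose modification time is newer than `since`
function filesChangedSince(dir: string, since: number): string[] {
  const changed: string[] = [];
  for (const item of readdirSync(dir, { withFileTypes: true })) {
    const full = join(dir, item.name);
    if (item.isDirectory()) {
      changed.push(...filesChangedSince(full, since));
    } else if (statSync(full).mtimeMs &gt; since) {
      changed.push(full);
    }
  }
  return changed;
}

// Run once a day (e.g. from cron): record what changed, so the weekly L2 pass
// has raw material to merge, prune, and mine for cross-project patterns
const today = new Date().toISOString().slice(0, 10);
const lines = filesChangedSince(WORKSPACE, Date.now() - ONE_DAY_MS)
  .map(function (f) { return "- changed: " + f; })
  .join("\n");
appendFileSync(LOG, "\n## " + today + "\n" + lines + "\n");&lt;/code&gt;&lt;/pre&gt;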
&lt;p&gt;The key difference from Mem0 and similar memory systems is the depth of distillation. Mem0 stops at the fact layer: "you prefer TypeScript," "you live in Shanghai." My system goes further, distilling to the level of judgment principles: "when you evaluate technical solutions, how do you weigh maintainability against performance, and what's your priority ordering?" Facts tell an AI who you are. Judgment principles tell an AI how you think. Getting AI output to shift from consensus to non-consensus requires the latter.&lt;/p&gt;
&lt;h3&gt;On-Demand Loading&lt;/h3&gt;
&lt;p&gt;The third question: with all this context, how do you actually give it to the AI?&lt;/p&gt;
&lt;p&gt;Dumping everything in isn't viable. Context windows have limits, and irrelevant information dilutes the useful signal. A coding task doesn't need all your business judgment principles loaded. A research task doesn't need your code architecture preferences.&lt;/p&gt;
&lt;p&gt;The solution is the existing skill system: each skill is a context subset tailored to a specific task type, containing the most relevant axioms, evaluation criteria, and common tools for that task. Load the research framework when doing research; load architecture principles and review preferences when writing code. It's analogous to a CPU's memory hierarchy: L1 cache is AGENTS.md, L2 corresponds to the skill index (telling the AI where to look if it needs something), L3 corresponds to the actual skill files. On-demand loading, progressive disclosure, each layer invoked only when needed.&lt;/p&gt;
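&lt;p&gt;The on-demand loading layer can also be illustrated with a minimal sketch (TypeScript on Node; the index format and file names are assumptions, not the actual repo structure):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { readFileSync } from "node:fs";

// L1: unconditionally in context, kept as small as possible
const base = readFileSync("AGENTS.md", "utf8");

// L2: a tiny index that only says where to look, not the content itself (format assumed)
type SkillIndex = { [taskType: string]: string };
const skillIndex: SkillIndex = JSON.parse(readFileSync("skills/index.json", "utf8"));

// L3: the full skill file, loaded only when a task of that type actually starts
function loadSkill(taskType: string): string {
  const path = skillIndex[taskType];
  if (!path) {
    throw new Error("no skill registered for task type: " + taskType);
  }
  return readFileSync(path, "utf8");
}

// A research task pulls in only the research framework; a coding task would
// load architecture principles and review preferences instead
const researchContext = [base, loadSkill("research")].join("\n\n");&lt;/code&gt;&lt;/pre&gt;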
&lt;h3&gt;The Feedback Loop&lt;/h3&gt;
&lt;p&gt;Once the three components are running, something interesting happens: knowledge products start emerging, and each product consumes context while also generating new context.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://daily.yage.ai"&gt;Duck's AI Daily&lt;/a&gt; is a daily AI industry briefing written using this context system. Each issue consumes axioms and skills, while also producing new observations that feed back into the observation library. The &lt;a href="https://www.superlinear.academy/c/news/"&gt;domain research reports&lt;/a&gt; are depth analyses with explicit judgment criteria, and each report updates the relevant domain's knowledge framework as it's produced. Both series have attracted wide readership and sharing—which suggests the feedback loop can sustain itself at sufficient context density and stay alive through continued operation.&lt;/p&gt;
&lt;p&gt;The essence of this system is injecting your bias into AI. Cultivated bias is the source of depth; but bias can also be low-quality prejudice. Still, the act of making bias explicit has its own value. Before this system, your biases were scattered across decisions, invisible to you in form or even existence. After collection, distillation, and refinement, you can actually see what you tend to prioritize—and what you tend to overlook—when facing certain types of problems. That self-knowledge is independently valuable.&lt;/p&gt;
&lt;p&gt;Back to the opening experiment. Lao Wang's AI outputs consensus because the context it can see is essentially empty—the training-time consensus prior hasn't been overridden by any personal cognitive layer. My AI produces opinionated analysis because there's a year of judgment frameworks behind it. Same model, different context density, fundamentally different output.&lt;/p&gt;
&lt;h3&gt;An Open-Source Reference Implementation&lt;/h3&gt;
&lt;p&gt;This system takes time to accumulate, some technical ability to set up, and ongoing willingness to maintain. But consider the cost from another angle: "tweak the system prompt and the AI instantly understands you" or "just use a better model"—these shortcuts don't work in principle. The consensus prior is too strong to override with a few sentences, and switching models just gives you a different source of consensus. For a problem this important, it deserves a system. Collect from the source, distill in layers, load on demand, update in cycles—each step has its reason. Not simple, but genuinely useful.&lt;/p&gt;
&lt;p&gt;We've open-sourced the complete structure: &lt;a href="https://github.com/grapeot/context-infrastructure"&gt;github.com/grapeot/context-infrastructure&lt;/a&gt;. The repo is a reference implementation containing the 44 axioms we actually use, core skill files, code for the three-layer memory system, and all the components described in this post.&lt;/p&gt;
&lt;p&gt;To be clear: the primary value of this repo is letting you see what a system that's been running for a year looks like, not giving you something you can clone and immediately use. You can open it and talk to an AI—ask "how would the author think about this?"—and immediately experience the difference between having context and not. But there's no shortcut to making an AI truly yours. You need to start collecting your own behavioral data, set up your own scheduled tasks, and let the system distill judgment principles from your own decision history. Someone else's skills represent someone else's perspective. A useful reference, not a replacement.&lt;/p&gt;
&lt;h2&gt;Bias and the Silicon Brain&lt;/h2&gt;
&lt;p&gt;AI getting smarter doesn't automatically make it deeper. Smarter consensus is still consensus. There's only one way to break through the ceiling: inject non-consensus perspective.&lt;/p&gt;
&lt;p&gt;Everyone has their own non-consensus perspective. Your criteria for judgment, your aesthetic sensibilities, the lessons distilled from your failures, your intuition about what matters and what doesn't. These things don't exist in AI's training data and will never be automatically learned by any version of any model, because they belong only to you.&lt;/p&gt;
&lt;p&gt;The silicon brain's absolute objectivity can only ever reach an intelligent mediocrity. What can reshape it is something else entirely: the human soul you've spent decades accumulating—full of bias, full of taste.&lt;/p&gt;</content><category term="Computing"/><category term="English"/><category term="Agentic AI"/><category term="Methodology"/></entry><entry><title>用好AI的第一步：停止和AI聊天</title><link href="https://yage.ai/stop-using-chatgpt.html" rel="alternate"/><published>2026-03-03T12:00:00-08:00</published><updated>2026-03-03T12:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-03:/stop-using-chatgpt.html</id><summary type="html">&lt;p&gt;会用AI和用好AI之间差的是10倍。这个差距的根源在于工作方式，而非模型。本文通过一个完整的工作流例子和上中下三策的框架，解释为什么应该从ChatGPT切换到Cursor这类Agentic工具。&lt;/p&gt;</summary><content type="html">&lt;p&gt;2026年，AI的渗透率已经很高了。很多公司All in AI，Meta甚至专门安排了一整周的脱产AI培训。但我有一个观察是：大多数人，甚至很多重度用户使用AI的方式，和两年前是一样的：大家还是打开聊天窗口，输入问题，等一个回答。区别只是从GPT-4o换成了GPT-5.2或者豆包，从免费版换成了Pro。&lt;/p&gt;
&lt;p&gt;这当然比完全不用AI更好，但也远远不是最优的方法。我很相信（下面也有例子解释）一件事：能用AI和用好AI之间，生产力差的不是30%，而是10倍的量级。不是说我用，甚至重度使用ChatGPT，就天然进入了AI阵营，可以高枕无忧了。事实上，大多数人用AI的方法，就像汽车发明之后还在把它当马车用：同样的路线，同样的速度，只是换了个引擎。而这个差距的根源，在于你的工作方式是否匹配了AI的能力结构。&lt;/p&gt;
&lt;p&gt;举一个我最近的真实例子。我要改进一个算法，从开会讨论方向、分析失败case、到实现改进方案并验证结果，AI（Cursor）自主执行了大约45分钟，自己走完了设计、实现、测试、发现问题、定位原因、修复、再验证的完整循环，最终所有失败case全部修复。整个过程中我的角色就是定方向和审结果。如果用ChatGPT做同一件事，保守估计时间会多五到十倍。这个10倍差距到底是怎么来的？下面我先解释原因，再用这个例子的完整过程来演示具体做法。&lt;/p&gt;
&lt;h2&gt;为什么聊天窗口是天花板&lt;/h2&gt;
&lt;p&gt;从2024年底开始，AI领域出现了一类新工具，以Cursor、Claude Code、Codex为代表。它们表面上是编程工具，但代表的是一种跟ChatGPT完全不同的AI用法。很多人以为这只是面向程序员的ChatGPT，但我的实际体验是，它们&lt;a href="/cursor-ai-entry.html"&gt;对几乎所有知识工作都有用&lt;/a&gt;。具体地说，它有三层好处：&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第一层：反馈闭环。&lt;/strong&gt; 你让ChatGPT写一段python，它写了，你复制到IDE里一跑，报错了。你把报错信息贴回去，它改了一版，你再跑，又不对，你又贴回去。这个过程里，我们就是反馈闭环中的人型工具人：AI产出，我们验证，我们搬运，AI再改。我们从一个应该指挥AI的人，变成了一个来回跑腿的工具人。&lt;/p&gt;
&lt;p&gt;Cursor这类工具的核心区别在于它接入了我们的执行环境。它写完代码可以直接跑，看到报错自己改，改完再跑，再改。这个循环是AI自己驱动的。因此，AI从一个只会出主意的顾问，变成了能独立干活的员工。顾问说完就走，对不对它既不知道，也不负责；员工则会自己验收，发现问题就返工。&lt;/p&gt;
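&lt;p&gt;这个闭环可以用一个玩具级的示意来说明（TypeScript，Node 环境）。其中 askModel 只是占位符，代表 agentic 工具内部对 LLM 的调用，并不是真实 API；整段代码也不是 Cursor 的实现方式，只是把"自己跑、自己看报错、自己改"的循环写成最短的形式：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

// askModel 是占位符：代表工具内部的 LLM 调用，这里写成同步函数只是为了简短
type AskModel = (prompt: string) =&gt; string;

export function closedLoop(task: string, askModel: AskModel, maxRounds = 5): boolean {
  let code = askModel("为下面的任务写出 script.py：\n" + task);
  let remaining = maxRounds;
  while (remaining &gt; 0) {
    remaining -= 1;
    writeFileSync("script.py", code);
    try {
      // 工具自己运行代码，而不是让人复制到别的环境里去跑
      execSync("python script.py", { stdio: "pipe" });
      return true; // 达到事先定义的成功标准，循环在没有人搬运信息的情况下闭合
    } catch (err) {
      // 原始报错直接回到模型手里，不经过人肉复制粘贴
      const failure = String((err as any).stderr ?? err);
      code = askModel("script.py 运行失败：\n" + failure + "\n请返回修复后的版本。");
    }
  }
  return false; // 次数用完仍未通过：这时才需要人来介入
}&lt;/code&gt;&lt;/pre&gt;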
&lt;p&gt;这也是为什么很多人觉得AI不靠谱：他们一直在用一个开环的AI，犯了错浑然不觉。给它一个闭环，可靠性会有质的提升。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第二层：上下文供给。&lt;/strong&gt; AI输出质量的瓶颈，很多时候在于它能看到多少相关上下文，而非模型本身有多聪明。同一个模型，给足上下文就能给出对的结果；让它盲猜，就容易脑补出不一样的目标。&lt;/p&gt;
&lt;p&gt;最近有&lt;a href="/ai-key-decisions.html#comment-6844340971"&gt;读者评论&lt;/a&gt;：各家的Deep Research和在本地工具里接搜索API相比，哪个更好？我的回答是，我已经好几个月没开过Deep Research了。搜索质量本身没问题，但它能解决的问题太有限。举个例子，我想在工作中比较两种算法的优劣。这个"我的场景"其实需要仔细描述，因为它直接决定了比较的维度：我的数据长什么样、我看重延迟还是准确率、部署环境有什么约束。用Deep Research，我要花很长时间把这些背景交代清楚。但在Cursor里，我直接 @ 几个内部文档和会议记录，AI立刻就有了所有上下文。哪怕搜索能力弱一点，给出的结果也更贴合，速度还更快。&lt;/p&gt;
&lt;p&gt;所以ChatGPT的瓶颈很多时候在于上下文的供给：你很难把足够的信息喂给它。Cursor这类工具解决的就是这个问题。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第三层：资产积累。&lt;/strong&gt; ChatGPT的使用模式是消耗型的。你投入时间，得到一个答案，答案用完就没了。每次对话都是从零开始。Cursor是投资型的。你用到了某个内部文档？存到项目文件夹里。AI反复犯某个错？花两分钟写一条规则。团队有一套约定俗成的惯例？写下来让AI也知道。这些都是一次性投入，但收益是持久的。&lt;/p&gt;
&lt;p&gt;时间一长就会形成飞轮效应：你用得越多、积累越多，AI就越懂你的项目、你的偏好、你的工作方式。ChatGPT永远是一个需要完整briefing的陌生人，Cursor可以变成一个越来越默契的搭档。一个每次归零，一个持续复利。&lt;/p&gt;
&lt;p&gt;反馈闭环、上下文、资产积累，这三层加在一起，就是前面那个45分钟的例子能成立的原因。但光知道原因还不够，关键是怎么在日常工作中把这些落地。下面我就用那个例子的完整过程来演示。&lt;/p&gt;
&lt;h2&gt;上中下三策：一个完整的例子&lt;/h2&gt;
&lt;p&gt;在展开之前，先介绍一个我在实践中总结的框架，叫做上中下三策。工作中的每一步都会产生信息，这些信息怎么处理，决定了AI能帮你多少。下策是让信息消失（人看不到，AI也看不到）；中策是记录成人能看的形式（人友好，AI不友好）；上策是先让AI能消费，再加工给人看（AI-first）。下面每一步我都会用这个框架来分析。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第一步，开会。&lt;/strong&gt; 组里的周会，讨论了某个算法在一些数据上失败的情况，大家提出了各种假设和改进思路。&lt;/p&gt;
&lt;p&gt;下策是开完就忘，什么都没留下。中策是写一份Google Doc的会议纪要，这已经是一个很好的做法了：它增加了你的visibility，同事知道你做了什么，未来也方便引用。但AI很难直接拿到这些内容，因为Google Doc需要登录，格式也混杂，每次想让AI参考都要手动复制粘贴。中策对人友好，对AI不友好。&lt;/p&gt;
&lt;p&gt;上策是用Zoom AI Companion或类似工具自动转录会议内容，存成.md文件，放到工作文件夹的meeting_notes目录下。时间成本几乎为零，但AI从此可以直接引用这次会议里的每一个细节。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第二步，分析数据。&lt;/strong&gt; 我需要看那个算法在不同数据上的表现，记录失败的具体场景和原因。同样的三策逻辑：下策是在便签上记几个URL，给人看的时候切过去点一下完事；中策是写进Confluence；上策是在工作文件夹里建一个analysis_notes.md，把每个失败case的链接、失败原因、观察都记进去。&lt;/p&gt;
&lt;p&gt;值得说明的是，在这两步里上策实际花的时间和中策差不多，有时候甚至更短，因为.md文件的排版比Confluence简单得多，而且你完全可以让AI帮你整理。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第三步，写代码改进算法。&lt;/strong&gt; 这是上策真正发挥威力的地方。因为前两步的所有信息都在同一个文件夹里，我在Cursor里 @ 一下会议记录，再 @ 一下分析笔记，告诉AI：根据这些信息，设计一个改进方案并实现，然后验证这些失败的case有没有被修复。&lt;/p&gt;
&lt;p&gt;注意AI这时候拿到的上下文有多完整：它知道这个算法为什么要改，有什么改进思路（会议记录里有讨论），知道具体有哪些失败模式和原因（分析笔记里有记录），知道成功的标准是什么（哪几个case要被修复）。这里面最关键的是最后一点：success criteria。很多人用AI的时候，只告诉它做什么，却省略了什么样算做好了。这就像一场缺少终点线的赛跑，AI凭感觉跑，你凭感觉判断。但如果你给了AI一个明确的终点线（这几个失败的case要全部修复），AI就可以自己跑完从设计到实现到验证的完整循环：写代码、跑测试、发现问题、定位原因、修复、再验证。这就是前面说的那45分钟里发生的事情。
（事实上这背后比听起来更复杂：AI在后台自动拆分了子任务，调度了多个agent并行工作，主agent做设计和验收，子agent负责编码和测试，整个过程高度自动化。但这是更进阶的话题了。）&lt;/p&gt;
&lt;p&gt;如果用ChatGPT做同一件事呢？你要手动把每段上下文贴过来。你可能会贴会议纪要作为背景，再贴上代码让它帮你改，但这样一方面要贴大量的文件，一方面要在python环境和ChatGPT里面两边拷来拷去，非常低效。其次，这种ChatGPT的用法缺少自我修正能力，你得自己看中间结果、自己判断哪里出了问题、自己把反馈喂回去。麻烦都是其次，主要是弯路会多很多。AI可以一目十行，看1000行log就知道问题在哪，人类则需要特殊的工具做可视化才能看出来。这就是10倍差距的来源：一边是信息打通、自动闭环的AI，另一边是信息割裂、人肉驱动的AI。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第四步，写文档和准备presentation。&lt;/strong&gt; 因为所有的分析、代码、结果都在同一个文件夹里，我直接让AI根据这些内容生成一份技术文档，再贴到Confluence上。&lt;/p&gt;
&lt;p&gt;注意这里的顺序：先在Cursor里让AI生成，再复制到Confluence。先AI，后人。这个顺序的倒转其实是整个工作流里最深的一个思维转变。传统做法是human-first：我自己写文档，写完可能让AI帮我润色一下。上策是AI-first：信息先以AI能消费的格式存在（.md文件），AI完成主要工作（生成文档），最后才转成人类可读的版本（Confluence页面）。结果是你花的时间更少，产出的质量更高，而且AI消费的那份原材料还留在你的文件夹里，未来随时可以再用。&lt;/p&gt;
&lt;p&gt;从开会到出文档，半天时间搞定了全部工作。&lt;/p&gt;
&lt;p&gt;当你把每一步都用上策来处理，所有信息最终都汇聚在同一个文件夹下，形成了我在&lt;a href="/openclaw.html"&gt;之前文章&lt;/a&gt;里提到的Mono Repo模式。AI天然可以跨主题访问所有上下文。这时候AI的能力会有一个显著的跃升，因为它第一次拥有了你的完整信息版图。你可以回想一下你上周的工作：多少环节在用下策？多少在用中策？如果大部分答案是下策和中策，那就是你和10倍效率之间的差距所在。&lt;/p&gt;
&lt;p&gt;回过头看这个流程，有一个根本性的转变：传统工作流里，人是主要执行者，AI是辅助。这个工作流里反过来了，AI是主要执行者，人的角色是定方向、定标准、做判断。换一种说法：我们对AI的定位，应该从&lt;em&gt;让AI帮我写代码&lt;/em&gt;升级到&lt;em&gt;让AI帮我解决问题&lt;/em&gt;。写代码只是解决问题的其中一环。如果你给了AI足够的上下文和明确的成功标准，它可以独立走完整个循环，你的角色就变成了出题人。你的价值在于你知道这个算法应该往哪个方向改，你知道什么样的结果才算成功。这种判断力是你作为专业人士最核心的能力，也恰恰是AI最依赖你提供的东西。&lt;/p&gt;
&lt;p&gt;这个思路适用于所有职业。你可以是工程师、数据分析师、产品经理、研究员。只要你的工作涉及信息的整理、分析、决策和产出，上中下三策就适用，feedback loop的价值就存在。区别只在于AI帮你执行的那个环节是写代码、做分析、写文档还是别的任务。&lt;/p&gt;
&lt;h2&gt;开始行动&lt;/h2&gt;
&lt;p&gt;工具会变，今天的载体是Cursor和Claude Code，明天可能是别的。但三样东西是持久的：反馈闭环让AI能自我修正，上下文供给让AI能理解你的世界，资产积累让你和AI的协作越来越高效。这是底层的范式，跟具体工具无关。&lt;/p&gt;
&lt;p&gt;如果你今天只做一件事，我的建议是这样：找一个你正在进行的项目，建一个文件夹，花半小时把相关的文档、笔记、会议记录全部复制粘贴放进去。然后，即使是你觉得应该用ChatGPT的工作，也要抑制住这种冲动，强令自己打开Cursor，从这里开始你跟AI的下一次对话。你会立刻感受到差异。改变从这一刻开始。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"/><category term="Chinese"/><category term="Agentic AI"/><category term="Methodology"/></entry><entry><title>Step One to Using AI Well: Stop Chatting with AI</title><link href="https://yage.ai/stop-using-chatgpt-en.html" rel="alternate"/><published>2026-03-03T12:00:00-08:00</published><updated>2026-03-03T12:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-03:/stop-using-chatgpt-en.html</id><summary type="html">&lt;p&gt;The gap between using AI and using AI well is 10x. That gap comes from how you work, not which model you use. This post walks through a complete workflow example and a Three Tiers framework to explain why you should switch from ChatGPT to agentic tools like Cursor.&lt;/p&gt;</summary><content type="html">&lt;p&gt;By 2026, AI has become widespread. Companies are all-in on it. Meta even blocked out an entire week for mandatory AI training. But here's what I keep noticing: most people, including heavy users, are interacting with AI the same way they did two years ago. They open a chat window, type a question, wait for an answer. The only difference is they've swapped GPT-4o for GPT-5.2 or Doubao, or upgraded from free to Pro.&lt;/p&gt;
&lt;p&gt;That's better than not using AI at all, but it's nowhere close to optimal. I'm convinced, and I'll show you evidence below, that the productivity gap between &lt;em&gt;using AI&lt;/em&gt; and &lt;em&gt;using AI well&lt;/em&gt; isn't 30%. It's an order of magnitude. Just because you use ChatGPT, even heavily, doesn't mean you've joined some AI-native vanguard where you can sit back and relax. Most people are using AI like someone who got a car but still drives horse-carriage routes: same roads, same speed, just a different engine. The real gap comes down to whether your way of working actually matches how AI is capable of operating.&lt;/p&gt;
&lt;p&gt;Here's a recent real example from my own work. I needed to improve an algorithm. From the initial meeting to map out the direction, through analyzing failure cases, to implementing the fix and verifying results, AI (specifically Cursor) ran autonomously for about 45 minutes. It completed the full loop on its own: design, implement, test, find issues, diagnose, fix, verify again. Every failing case was resolved. My role throughout was to set the direction and review the outcome. Doing the same thing in ChatGPT would conservatively take five to ten times longer. Where does that 10x gap actually come from? I'll explain the why first, then walk through the complete example to show the how.&lt;/p&gt;
&lt;h2&gt;Why the Chat Window Is a Ceiling&lt;/h2&gt;
&lt;p&gt;Starting around late 2024, a new category of AI tools emerged: Cursor, Claude Code, Codex. On the surface they look like coding tools, but they represent a fundamentally different way of using AI compared to ChatGPT. A lot of people assume they're just ChatGPT for programmers, but my experience is that &lt;a href="/cursor-ai-entry-en.html"&gt;they're useful for almost all knowledge work&lt;/a&gt;. The difference plays out on three levels.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 1: The feedback loop.&lt;/strong&gt; You ask ChatGPT to write some Python. It writes it. You copy it to your IDE, run it, it errors. You paste the error back, it gives you a revision, you run it again, still wrong, you paste again. In this cycle, you become the human errand runner in the feedback loop: AI produces, you test, you shuttle information back and forth, AI revises. You've gone from the person directing AI to the person doing the legwork.&lt;/p&gt;
&lt;p&gt;The core difference with Cursor is that it's connected to your execution environment. It writes code and runs it directly. Sees an error, fixes it, runs it again. The loop is AI-driven. This turns AI from a consultant who gives advice and walks away into an employee who can work independently. The consultant says their piece and leaves, with no idea if it was right and no accountability. The employee validates their own work and fixes problems when they find them.&lt;/p&gt;
&lt;p&gt;This is also why a lot of people think AI is unreliable: they've been using open-loop AI that fails and doesn't know it. Give it a closed loop, and reliability improves dramatically.&lt;/p&gt;
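&lt;p&gt;To make "closed loop" concrete, here is a minimal sketch of the shape of the idea (not Cursor's actual mechanism; &lt;code&gt;ask_agent_to_fix&lt;/code&gt; is a hypothetical stand-in for whatever model call your tool makes). The point is who owns the loop: the program runs the tests, the agent sees the failures, and the human only supplies the goal.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import subprocess

def run_tests():
    # Run the test suite and hand back both the verdict and the raw output
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def ask_agent_to_fix(failure_output):
    # Hypothetical stub: in a real agentic tool, this is where the model reads
    # the failure and edits the code before the next round.
    print("agent sees:", failure_output[:200])

def closed_loop(max_rounds=5):
    for _ in range(max_rounds):
        ok, output = run_tests()
        if ok:
            return True           # objective finish line: the tests pass
        ask_agent_to_fix(output)  # the agent, not the human, sees the error
    return False
&lt;/code&gt;&lt;/pre&gt;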
&lt;p&gt;&lt;strong&gt;Level 2: Context supply.&lt;/strong&gt; The bottleneck on AI output quality, much of the time, isn't how smart the model is. It's how much relevant context the model can see. Same model, enough context: correct result. Same model, guessing blind: it fills in the gaps with something that might be completely wrong.&lt;/p&gt;
&lt;p&gt;A &lt;a href="/ai-key-decisions-en.html#comment-6844340971"&gt;reader recently commented&lt;/a&gt; asking which is better: Deep Research from the major AI providers, or plugging a search API into a local tool? My answer: I haven't opened Deep Research in months. The search quality isn't the issue. It's just too limited in what it can actually solve. Say I want to compare two algorithms for my specific use case at work. "My use case" requires careful description, because it directly determines what dimensions matter for the comparison: what my data looks like, whether I care about latency or accuracy, what the deployment constraints are. With Deep Research, I have to spend a lot of time explaining all that background. In Cursor, I just @ a few internal docs and meeting notes, and AI immediately has all the context. Even if the search capability is slightly weaker, the results are more relevant and the whole thing is faster.&lt;/p&gt;
&lt;p&gt;ChatGPT's bottleneck is often context supply: it's hard to feed it enough information. Cursor-style tools solve exactly that problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 3: Asset accumulation.&lt;/strong&gt; ChatGPT's usage pattern is consumptive. You put in time, you get an answer, the answer gets used, then it's gone. Every conversation starts from zero. Cursor is investment-style. You needed an internal doc? Save it to the project folder. AI keeps making the same mistake? Spend two minutes writing a rule. Your team has conventions everyone follows? Write them down so AI knows too. Each of these is a one-time investment with compounding returns.&lt;/p&gt;
&lt;p&gt;Over time, this creates a flywheel: the more you use it, the more you've accumulated, the better AI understands your project, your preferences, your working style. ChatGPT is always a stranger who needs a full briefing every time. Cursor becomes a collaborator who gets more in sync with you over time. One resets to zero; the other compounds.&lt;/p&gt;
&lt;p&gt;These three levels, feedback loop, context, and asset accumulation, are why that 45-minute example was possible. But knowing the reason isn't enough. What matters is how to actually make this work day to day. The full example below shows that.&lt;/p&gt;
&lt;h2&gt;Three Tiers: A Complete Example&lt;/h2&gt;
&lt;p&gt;Before walking through it, let me introduce a framework I've developed through practice, which I call the Three Tiers. Every step in your work produces information. How you handle that information determines how much AI can help you. The Bad tier: information disappears (neither you nor AI can see it later). The Better tier: information gets recorded in a human-readable format (human-friendly, AI-unfriendly). The Best tier: information gets stored AI-first, then made human-readable. I'll apply this framework to every step below.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: The meeting.&lt;/strong&gt; The team's weekly sync, where we discussed cases where an algorithm was failing on certain data and brainstormed hypotheses and improvement ideas.&lt;/p&gt;
&lt;p&gt;Bad tier: meeting ends, nothing is captured. Better tier: write up a Google Doc with meeting notes. This is already a solid practice. It increases your visibility, your colleagues know what happened, and it's easy to reference later. But AI can't easily access this content: Google Docs require login, the format is messy, and every time you want AI to reference it you have to manually copy and paste. Better tier is human-friendly and AI-unfriendly.&lt;/p&gt;
&lt;p&gt;Best tier: use Zoom AI Companion or a similar tool to auto-transcribe the meeting, save it as a .md file, put it in a meeting_notes directory inside your work folder. Time cost is nearly zero, but AI can now directly reference every detail from that meeting going forward.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Analyzing the data.&lt;/strong&gt; I needed to look at how the algorithm performed across different inputs, and document the specific failure scenarios and their causes. Same Three Tiers logic: Bad tier is jotting a few URLs in a sticky note and clicking through them when you need to show someone. Better tier is writing it up in Confluence. Best tier is creating an analysis_notes.md in your work folder with each failure case's link, failure reason, and observations.&lt;/p&gt;
&lt;p&gt;Worth noting: the Best tier in these two steps takes about as much time as the Better tier, sometimes less, because .md formatting is far simpler than Confluence, and you can have AI help you organize it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Writing code to improve the algorithm.&lt;/strong&gt; This is where the Best tier really shows its value. Because all the information from the first two steps lives in the same folder, I open Cursor, @ the meeting notes, @ the analysis notes, and tell AI: based on this, design an improvement and implement it, then verify that the failing cases are fixed.&lt;/p&gt;
&lt;p&gt;Look at how complete the context is that AI has at this point. It knows why the algorithm needs to change. It has improvement ideas (the meeting notes have that discussion). It knows the specific failure patterns and their causes (the analysis notes have that). It knows the success criteria (which cases need to be fixed). That last piece is the most critical. A lot of people tell AI what to do but skip what "done" looks like. It's like a race with no finish line: AI runs by feel, you judge by feel. But give AI a clear finish line (all these failing cases must pass), and it can run the entire loop from design to implementation to verification on its own: write code, run tests, find problems, diagnose, fix, verify again. That's what happened in those 45 minutes.&lt;/p&gt;
&lt;p&gt;(What's going on behind the scenes is actually more complex than it sounds: AI automatically broke the task into subtasks, scheduled multiple agents to work in parallel, with the main agent handling design and review while sub-agents handled coding and testing. But that's a more advanced topic.)&lt;/p&gt;
&lt;p&gt;What if you did this same thing in ChatGPT? You'd have to manually paste in every piece of context. Maybe you paste the meeting notes as background, then open another chat for the code changes, copying back and forth between your Python environment and the chat window constantly. Beyond the inefficiency, this approach lacks any self-correction ability. You have to review every intermediate result yourself, decide where things went wrong, and manually feed that feedback back in. The hassle is secondary; the bigger cost is all the detours. AI can skim a thousand lines of logs and identify the problem in seconds. A human needs specialized visualization tools just to see what's happening. That's where the 10x gap comes from: on one side, information fully connected, loop automated; on the other, information siloed, loop driven by hand.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4: Writing documentation and preparing the presentation.&lt;/strong&gt; Because all the analysis, code, and results are in the same folder, I have AI generate a technical document directly from that content, then paste it to Confluence.&lt;/p&gt;
&lt;p&gt;Notice the order: generate in Cursor first, then copy to Confluence. AI first, then humans. This reversal is actually the deepest mindset shift in the entire workflow. The traditional approach is human-first: I write the document, then maybe have AI polish it. The Best tier is AI-first: information lives in a format AI can consume (.md files), AI does the main work (generates the document), and only then does it get converted to a human-readable form (Confluence page). The result is less time spent, higher quality output, and the AI-consumable source material stays in your folder for future use.&lt;/p&gt;
&lt;p&gt;From the meeting to finished documentation, the whole thing took half a day.&lt;/p&gt;
&lt;p&gt;When you handle every step with the Best tier, all information converges in the same folder, forming what I called the Mono Repo pattern in &lt;a href="/openclaw-en.html"&gt;a previous post&lt;/a&gt;. AI can naturally access all the context across every topic. At that point, AI's capability takes a noticeable leap, because it finally has access to your complete information map. Think back over your work last week. How many steps were Bad tier? How many were Better tier? If most of your answers are Bad and Better, that's the gap between where you are and 10x productivity.&lt;/p&gt;
&lt;p&gt;Stepping back and looking at this workflow, there's a fundamental shift: in the traditional model, the human is the primary executor and AI is the assistant. In this workflow, it's reversed. AI is the primary executor; the human's role is to set direction, define success criteria, and make judgment calls. Put it another way: our conception of AI should upgrade from &lt;em&gt;have AI help me write code&lt;/em&gt; to &lt;em&gt;have AI help me solve problems&lt;/em&gt;. Writing code is just one piece of solving problems. If you give AI enough context and a clear definition of success, it can complete the entire loop independently, and your role becomes the one who sets the problem. Your value lies in knowing which direction the algorithm should go, and knowing what a successful result looks like. That kind of judgment is your core capability as a professional, and it's exactly what AI depends on you to provide.&lt;/p&gt;
&lt;p&gt;This applies to every profession. Engineer, data analyst, product manager, researcher. If your work involves gathering, analyzing, deciding, and producing information, the Three Tiers apply, and the value of a feedback loop is there. The only difference is whether the loop AI runs for you involves writing code, doing analysis, writing documents, or something else.&lt;/p&gt;
&lt;h2&gt;Getting Started&lt;/h2&gt;
&lt;p&gt;The tools will change. Today it's Cursor and Claude Code; tomorrow it'll be something else. But three things are durable: a feedback loop that lets AI correct itself, context supply that lets AI understand your world, and asset accumulation that makes your collaboration with AI more efficient over time. These are the underlying principles, independent of any specific tool.&lt;/p&gt;
&lt;p&gt;If you do one thing today, here's my suggestion: find a project you're currently working on, create a folder, and spend 30 minutes copying all the relevant documents, notes, and meeting records into it. Then, even for work you'd normally turn to ChatGPT for, resist that impulse, open Cursor instead, and start your next conversation with AI from there. You'll feel the difference immediately. Start now.&lt;/p&gt;</content><category term="Computing"/><category term="English"/><category term="Agentic AI"/><category term="Methodology"/></entry><entry><title>以一个简单任务为例看AI落地的关键决策</title><link href="https://yage.ai/ai-key-decisions.html" rel="alternate"/><published>2026-02-20T18:00:00-08:00</published><updated>2026-02-20T18:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-20:/ai-key-decisions.html</id><summary type="html">&lt;p&gt;用两分钟指挥AI给300篇文章添加SEO summary的实战案例，拆解五个关键决策：选对执行环境、先建测试再干活、让agent自己处理corner case、divide and conquer、结果导向的prompt写法。&lt;/p&gt;</summary><content type="html">&lt;p&gt;今天我用AI完成了一个小任务。感觉这个案例特别适合用来介绍AI的实战原则，所以写了这篇文章来分享一下。&lt;/p&gt;
&lt;p&gt;任务本身是给这个blog里的每一篇文章都加一行summary，这样可以帮助搜索引擎理解这个网站的内容，从而提升这个网站的排名（SEO）。这个任务看起来简单，其实有很多坑，一不小心就会陷入AI鬼打墙、不可靠、使用繁琐的陷阱。下面主要分享在这个过程中我做了哪五个重要的决策，来让整个流程变得稳定可靠。&lt;/p&gt;
&lt;h2&gt;决策一：用本地Coding Agent，而不是ChatGPT&lt;/h2&gt;
&lt;p&gt;我做的第一个决策是：用Cursor/OpenCode作为讨论的平台，而不是ChatGPT。这件事其实并不显然，因为整个项目的开始来自于我想给这个网站做SEO。直观上看，这是个更适合ChatGPT的聊天性质的任务。但是我仍然坚持用了OpenCode。这里面最根本的原因是摩擦。&lt;/p&gt;
&lt;p&gt;具体地说，摩擦在两个方面。第一是上下文传递的摩擦。用ChatGPT我需要把我的博客的内容甚至代码复制粘贴给它，或者让它去写代码抓取这些文章的内容。但在OpenCode里，我只要用@指定我的博客所在的文件夹就好了，摩擦小很多。&lt;/p&gt;
&lt;p&gt;另一个方面是落地的摩擦。比如我们在ChatGPT里面通过聊天得出了结论：这个网站需要增加Summary元数据。为了把这个想法落地，我需要把我和ChatGPT来回几轮的聊天记录全部复制粘贴到Cursor/OpenCode里面去，然后再调用另一个AI来改文章的内容。相比之下，如果从头就在OpenCode里面做讨论的话，讨论之后立刻就能落地。&lt;/p&gt;
&lt;p&gt;所以我做了这第一个决策：对几乎所有任务，抛弃基于聊天的AI环境，选择能执行的Agentic环境。为什么把这个决策放在第一个，是因为这是有和无的区别。摩擦一大，我们就懒得做下去了，整个项目花了时间，交付是0，纯浪费时间。只有摩擦小了，项目能继续下去，才有必要继续聊具体的方法和技巧。&lt;/p&gt;
&lt;h2&gt;决策二：动手之前，先定义成功，提供测试&lt;/h2&gt;
&lt;p&gt;我做的第二个决策是：在让AI动手生成任何summary之前，先让它写一个测试。这个测试做的事情很简单，就是检查所有.md文件，看有没有summary字段。如果不是100%的文件都有这个字段就fail，并且打印是哪些文件有问题。&lt;/p&gt;
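&lt;p&gt;为了把这个测试说得更具体，下面是一个极简的示意（不是当时的原始代码，路径和 metadata 格式都是假设）：扫描 blog/content 下所有 .md 文件，统计 summary 覆盖率，缺失的文件逐个列出，覆盖率不到 100% 就以非零状态退出。&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pathlib import Path
import sys

CONTENT_DIR = Path("blog/content")  # 假设的路径，按实际仓库结构调整

def has_summary(md_file):
    # 简化假设：summary 以 "Summary:" 这样的行出现在文件开头的 metadata 里
    head = md_file.read_text(encoding="utf-8").splitlines()[:20]
    return any(line.strip().lower().startswith("summary:") for line in head)

def main():
    files = sorted(CONTENT_DIR.rglob("*.md"))
    missing = [p for p in files if not has_summary(p)]
    print(f"summary coverage: {len(files) - len(missing)}/{len(files)}")
    for p in missing:
        print(f"missing summary: {p}")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(main())
&lt;/code&gt;&lt;/pre&gt;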
&lt;p&gt;为什么要先写测试？因为如果没有这个测试，AI说做完了，我也不知道它到底做完了没有。我确实可以抽查几篇，但300多篇文章，抽查没法覆盖全部。最后的局面就是我也不知道、AI也不知道，两个人都在wishful thinking。&lt;/p&gt;
&lt;p&gt;但有了测试就不一样了。AI做完一轮，测试fail了，它自己就知道还有20篇没覆盖，下面就会重新看这些文章。测试通过了，就是100%完成了。不需要人工抽查，不需要猜，一切100%都是确定的。&lt;/p&gt;
&lt;p&gt;这就是我们一直强调的&lt;a href="/agentic-ai-crisis.html"&gt;feedback loop&lt;/a&gt;。很多人用AI陷入踢一脚动一下、动完了发现不对，再踢再动的循环，觉得AI好难用，根本原因就是没有建立反馈机制。AI不知道什么叫"做完"，你也不知道AI做到什么程度了。这是要首先解决的核心问题。确定性的测试就是一个非常有效的解决方法。事实上，只要这种测试到位了，后面三个决策都是锦上添花的东西。&lt;/p&gt;
&lt;p&gt;所以在开始任何任务之前，我都会先问自己：我/AI有没有一个确定性的方式来判断任务完成了没有？如果没有，先把这个机制建起来。&lt;/p&gt;
&lt;h2&gt;决策三：让Agent自己去干，而不是我来写程序调用API&lt;/h2&gt;
&lt;p&gt;第三个决策是：我没有写程序去调用LLM API来生成summary，而是让coding agent自己去做这件事。&lt;/p&gt;
&lt;p&gt;更详细的原因在&lt;a href="/result-certainty.html"&gt;这篇文章&lt;/a&gt;中有解释。虽然让AI做概括听起来调个API就搞定了。但仔细想想，这里有很多corner case：有的文章已经有summary了不要重复加，有的metadata格式不一致，有的位置需要调整。如果写程序处理这些情况，代码会特别复杂，调试成本高，进展速度慢。最后可能AI会花大量的精力去调怎么处理这些细节。&lt;/p&gt;
&lt;p&gt;另一种思路是用自然语言直接给Cursor/OpenCode布置任务：“你去看一下XX.md，保证它有个面向SEO的summary元数据域”。这时候完成任务的主体就不是一个机械的程序，而是一个真的有智能，知变通的Agent。它会自己看情况处理——有summary就跳过，格式不对就调整，遇到特殊情况自己判断。&lt;/p&gt;
&lt;p&gt;这就是把AI当agent用和把AI当工具用的区别。调用API的模式是：你写程序，AI是其中一个组件。这种模式确定性高，但灵活性低，遇到复杂情况反而更慢。而用Agentic AI，确定性从过程移到了结果上，你只需要讲清楚要什么结果。剩下的事，AI发挥自己的能动性和判断力自己搞定。&lt;/p&gt;
&lt;p&gt;所以在我的工作流里，调用API是最后手段。能交给agent去做的，尽量交给agent。&lt;/p&gt;
&lt;h2&gt;决策四：用Divide and Conquer应对认知饱和&lt;/h2&gt;
&lt;p&gt;第四个决策是：我没有给一个agent一股脑布置300篇文章的任务，而是让它开了8个sub-agent，分配任务以后并行处理。&lt;/p&gt;
&lt;p&gt;这里面的原因和context window saturation有关。一个agent一下处理300篇，前面可能还好，读了十几篇文章以后context window &lt;a href="/wide-research.html"&gt;会被占满&lt;/a&gt;，后面就开始偷懒、跳文章、或者忘了前面踩过的坑。这和人有点像，认知负荷一高就会丢三落四，或者开始敷衍。&lt;/p&gt;
&lt;p&gt;另一个原因是sub-agent是coding agent原生支持的功能。我不用自己写并发逻辑、分配任务、汇总结果。这些plumbing work都被外包出去了。我只要用一两句话描述一下这个工作流就好。&lt;/p&gt;
&lt;p&gt;很多人用AI的时候没有意识到这个问题。他们没有针对AI的缺陷思考，预测到里面的坑，就用最符合直觉的方法去布置任务。但像我们管理下属的时候要知人善任一样，我们要意识到AI的认知资源尤其有限，context window是一种需要管理的稀缺资源。任务量太大，质量必然下降。所以任务量大的时候，我会主动考虑拆分，而不是让一个agent扛所有东西。&lt;/p&gt;
&lt;p&gt;这个决策和前面几个的关系是：决策二保证结果是对的（测试通过），决策三保证过程是灵活的（agent自己处理corner case），决策四更进一步通过规避一个必然出现的坑，保证处理得又快又好。&lt;/p&gt;
&lt;h2&gt;决策五：保证Prompt Self Contained（自包含）并且结果导向&lt;/h2&gt;
&lt;p&gt;第五个决策是：给AI的指令讲清楚所有的信息（不指望它读心），而且着重说acceptance criteria是什么，而不是每一个步骤怎么做。&lt;/p&gt;
&lt;p&gt;我的prompt大概是这样：&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;对于blog/content下面每一篇.md文件，从SEO的角度写一个summary域放到metadata里。你可以用sub-agent来做。先看几篇文章找到感觉，然后想一个prompt，让不同的sub-agent分别处理不同的文章。开8个agent并行处理，每个agent负责写summary并直接编辑.md文件。另外，我希望有个测试能check summary coverage，如果coverage不到100%测试就fail。你的目标就是把这个测试搞到100%让它能过。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;注意我没有告诉它具体怎么写这个测试程序、怎么处理各种corner case。&lt;/p&gt;
&lt;p&gt;这是很多人容易搞反的地方。他们给AI写指令的时候，事无巨细地规定每一步怎么做。这其实是在把AI当程序用，浪费了Agentic AI的主观能动性。AI不是一个只会照本宣科的乙方，它有很强的判断力和执行力。我们要发挥它的主观能动性，但同时给它一个足够清晰的边界。&lt;/p&gt;
&lt;p&gt;我总结写prompt有两个原则。第一，context要给足，不要指望AI能读心。它不知道metadata结构是什么样的。这些信息要么直接给，要么要保证它自己能搞清楚（比如这里我们给了具体路径，它可以通过读文件搞清楚）。第二，从结果出发，而不是从过程出发。你告诉AI你要什么，让它自己想怎么做。除非你预测到某个环节不给具体指导它会出问题——比如前面的context window问题——否则不用讲那么细。&lt;/p&gt;
&lt;p&gt;这个决策和决策三是一体两面：决策三是说把执行交给agent，决策五是说把指令也写成适合agent的形式。&lt;/p&gt;
&lt;h2&gt;总结：AI是一种杠杆&lt;/h2&gt;
&lt;p&gt;最后说一点感受。&lt;/p&gt;
&lt;p&gt;这个任务，我用语音识别花了大概两分钟把指令讲给AI。然后AI自己折腾了45分钟：并行开8个sub-agent，处理各种边界条件，写测试，返工，跑通，commit。全程我就没再管了。这就是一种leverage。用两分钟的时间，撬动了AI 45分钟的工作量。更准确地说，用5%的时间控制了100%的工程产出。&lt;/p&gt;
&lt;p&gt;而且现在的Agentic AI能力已经足够强，可以长时间自主工作。我们不需要盯着它干活。只要讲清楚deliverable是什么、acceptance criteria是什么，就可以去干其他事了。这就带来了一种新的可能：scalable agentic workflow。比如我们用两分钟撬动一个Agent A，让他忙45分钟。然后这个时间我们再去指挥Agent B，C，D，... 同时启动多个AI并行推进。这样脑力负担确实会很高，但这是在单Agentic workflow的基础上，再进一步实现10倍生产力的切实可行的途径。&lt;/p&gt;
&lt;p&gt;说完了10倍生产力的一面，这个项目的另一面是，有用AI的意识，但是方法不对——在ChatGPT里讨论、没有测试机制、让一个AI包办所有。这些决策做错了，我们可能要折腾几个小时才能做完，甚至鬼打墙做不出来。同一个任务，甚至同一个LLM，会用和不会用，决策做的质量高低，就是从从容容游刃有余和吃力不讨好，比人工做更慢的差别。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"/><category term="Chinese"/><category term="Agentic AI"/></entry><entry><title>Key Decisions for Agentic Workflows: A Simple Case Study</title><link href="https://yage.ai/ai-key-decisions-en.html" rel="alternate"/><published>2026-02-20T18:00:00-08:00</published><updated>2026-02-20T18:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-20:/ai-key-decisions-en.html</id><summary type="html">&lt;p&gt;A real-world case study of directing AI to add SEO summaries to 300 articles in two minutes, breaking down five key decisions: choosing the right execution environment, building tests before work, letting agents handle corner cases, divide and conquer, and outcome-oriented prompt writing.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Today I used AI to complete a small task. This case feels particularly suitable for introducing AI's practical principles, so I wrote this article to share it.&lt;/p&gt;
&lt;p&gt;The task itself was to add a summary line to every article in this blog, which helps search engines understand the website's content and improve its ranking (SEO). This task looks simple, but it has many pitfalls—one careless move and you fall into the trap of AI getting stuck in loops, being unreliable, or being cumbersome to use. Below I'll mainly share the five important decisions I made during this process to make the entire workflow stable and reliable.&lt;/p&gt;
&lt;h2&gt;Decision 1: Use a Local Coding Agent, Not ChatGPT&lt;/h2&gt;
&lt;p&gt;The first decision I made was to use Cursor/OpenCode as the platform for discussion, not ChatGPT. This isn't obvious, because the project started with me wanting to do SEO for this website. Intuitively, this seems like a chat-type task better suited for ChatGPT. But I still insisted on using OpenCode. The fundamental reason is friction.&lt;/p&gt;
&lt;p&gt;Specifically, friction exists in two aspects. First is the friction of context transfer. With ChatGPT, I need to copy and paste my blog's content or even code to it, or have it write code to fetch these articles. But in OpenCode, I just use @ to specify the folder where my blog is located—much less friction.&lt;/p&gt;
&lt;p&gt;Another aspect is the friction of implementation. For example, if we reach a conclusion through chatting in ChatGPT that this website needs to add Summary metadata, to implement this idea, I need to copy and paste several rounds of chat history between me and ChatGPT into Cursor/OpenCode, and then call another AI to modify the article content. In contrast, if the discussion happens in OpenCode from the beginning, it can be implemented immediately after the discussion.&lt;/p&gt;
&lt;p&gt;So I made this first decision: for almost all tasks, abandon chat-based AI environments and choose executable Agentic environments. Why put this decision first? Because it's the difference between the project happening at all and not happening. If friction is high, we won't bother continuing, and the entire project takes time but delivers 0—pure waste of time. Only when friction is low and the project can continue is it meaningful to discuss specific methods and techniques.&lt;/p&gt;
&lt;h2&gt;Decision 2: Before Starting, Define Success and Provide Tests&lt;/h2&gt;
&lt;p&gt;The second decision I made was: before letting AI generate any summaries, have it write a test first. This test does something very simple—check all .md files to see if they have a summary field. If not 100% of files have this field, it fails, and prints which files have problems.&lt;/p&gt;
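&lt;p&gt;As an illustration, here is a minimal sketch of what such a coverage test could look like (not the code the agent actually wrote; the path and metadata format are assumptions): scan every .md file under blog/content, report coverage, list the files missing a summary, and exit non-zero unless coverage is 100%.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pathlib import Path
import sys

CONTENT_DIR = Path("blog/content")  # assumed path; adjust to the actual repo layout

def has_summary(md_file):
    # Simplifying assumption: the summary appears as a "Summary:" line
    # in the metadata block at the top of the file.
    head = md_file.read_text(encoding="utf-8").splitlines()[:20]
    return any(line.strip().lower().startswith("summary:") for line in head)

def main():
    files = sorted(CONTENT_DIR.rglob("*.md"))
    missing = [p for p in files if not has_summary(p)]
    print(f"summary coverage: {len(files) - len(missing)}/{len(files)}")
    for p in missing:
        print(f"missing summary: {p}")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(main())
&lt;/code&gt;&lt;/pre&gt;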
&lt;p&gt;Why write the test first? Because without this test, when AI says it's done, I don't actually know if it's really done. I can spot-check a few articles, but with over 300 articles, spot-checking can't cover everything. You end up with neither me nor the AI knowing for sure, both of us just engaging in wishful thinking.&lt;/p&gt;
&lt;p&gt;But with the test, it's different. After AI completes one round, if the test fails, it knows there are 20 articles not covered, and will re-examine these articles. When the test passes, it's 100% complete. No manual spot-checking needed, no guessing—everything is 100% certain.&lt;/p&gt;
&lt;p&gt;This is the &lt;a href="/agentic-ai-crisis-en.html"&gt;feedback loop&lt;/a&gt; we've been emphasizing. Many people fall into a cycle with AI: nudge it, it moves a little, the result turns out wrong, nudge it again, and they walk away concluding that AI is hard to use. The root cause is that no feedback mechanism was ever established. AI doesn't know what "done" means, and you don't know how far AI has actually gotten. This is the core problem to solve first. Deterministic testing is a very effective solution. In fact, once this kind of test is in place, the next three decisions are just icing on the cake.&lt;/p&gt;
&lt;p&gt;So before starting any task, I ask myself: Do I/AI have a deterministic way to judge whether the task is complete? If not, build this mechanism first.&lt;/p&gt;
&lt;h2&gt;Decision 3: Let the Agent Do It, Instead of Writing Programs to Call APIs&lt;/h2&gt;
&lt;p&gt;The third decision was: I didn't write a program that calls an LLM API to generate summaries; instead, I let the coding agent do it itself.&lt;/p&gt;
&lt;p&gt;More detailed reasons are explained in &lt;a href="/result-certainty-en.html"&gt;this article&lt;/a&gt;. Although having AI do summarization sounds like just calling an API, if you think carefully, there are many corner cases here: some articles already have summaries and shouldn't be duplicated, some metadata formats are inconsistent, some positions need adjustment. If you write a program to handle these situations, the code becomes very complex, debugging costs are high, and progress is slow. Eventually, AI might spend a lot of effort adjusting how to handle these details.&lt;/p&gt;
&lt;p&gt;Another approach is to use natural language to directly assign tasks to Cursor/OpenCode: "Go look at XX.md and make sure it has an SEO-oriented summary metadata field." At this point, the entity completing the task is not a mechanical program, but an Agent with real intelligence and adaptability. It handles situations on its own—skipping if summary exists, adjusting if format is wrong, judging by itself when encountering special cases.&lt;/p&gt;
&lt;p&gt;This is the difference between using AI as an agent and using AI as a tool. The API-calling pattern is: you write programs, AI is one component. This pattern has high certainty but low flexibility, and is actually slower when encountering complex situations. With Agentic AI, certainty moves from process to outcome—you only need to clearly state what result you want. The rest, AI figures out using its own initiative and judgment.&lt;/p&gt;
&lt;p&gt;So in my workflow, calling APIs is the last resort. Whatever can be handed to agents, I hand to agents.&lt;/p&gt;
&lt;h2&gt;Decision 4: Use Divide and Conquer to Handle Cognitive Saturation&lt;/h2&gt;
&lt;p&gt;The fourth decision was: I didn't assign one agent the task of handling 300 articles all at once, but had it open 8 sub-agents, distribute tasks, and process in parallel.&lt;/p&gt;
&lt;p&gt;The reason relates to context window saturation. If one agent processes 300 articles at once, it might be okay at first, but after reading a dozen articles, the context window &lt;a href="/wide-research-en.html"&gt;gets filled up&lt;/a&gt;, and later it starts slacking off, skipping articles, or forgetting pitfalls encountered earlier. This is similar to humans—when cognitive load is high, we become forgetful or start cutting corners.&lt;/p&gt;
&lt;p&gt;Another reason is that sub-agents are a natively supported feature of coding agents. I don't need to write concurrency logic, task distribution, or result aggregation myself. This plumbing work is all outsourced. I just need to describe the workflow in a sentence or two.&lt;/p&gt;
&lt;p&gt;Many people using AI don't realize this problem. They don't think about AI's defects or anticipate the pitfalls, and just assign tasks using the most intuitive method. But just like when managing subordinates we need to know their strengths and weaknesses, we need to realize that AI's cognitive resources are particularly limited—context window is a scarce resource that needs management. When task volume is large, quality inevitably drops. So when there's a lot of work, I actively consider splitting it up rather than having one agent carry everything.&lt;/p&gt;
&lt;p&gt;The relationship between this decision and the previous ones: Decision 2 ensures results are correct (tests pass), Decision 3 ensures the process is flexible (agent handles corner cases itself), Decision 4 goes further by avoiding a guaranteed pitfall, ensuring processing is both fast and good.&lt;/p&gt;
&lt;h2&gt;Decision 5: Ensure Prompt Is Self-Contained and Outcome-Oriented&lt;/h2&gt;
&lt;p&gt;The fifth decision was: when giving AI instructions, clearly state all information (don't expect it to read minds), and emphasize what the acceptance criteria are, not how to do each step.&lt;/p&gt;
&lt;p&gt;My prompt was roughly this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For each .md file under blog/content, write a summary field from an SEO perspective and put it in metadata. You can use sub-agents to do this. First look at a few articles to get a feel, then think of a prompt, let different sub-agents process different articles. Open 8 agents to process in parallel, each agent responsible for writing summaries and directly editing .md files. Also, I want a test to check summary coverage—if coverage is below 100%, the test fails. Your goal is to get this test to 100% so it passes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Notice I didn't tell it specifically how to write this test program or how to handle various corner cases.&lt;/p&gt;
&lt;p&gt;This is where many people get it backwards. When writing instructions for AI, they specify every step in detail. This is actually treating AI as a program, wasting Agentic AI's subjective initiative. AI is not a yes-man who only follows instructions—it has strong judgment and execution capabilities. We should leverage its initiative while giving it a clear enough boundary.&lt;/p&gt;
&lt;p&gt;I summarize two principles for writing prompts. First, give enough context—don't expect AI to read minds. It doesn't know what the metadata structure looks like. This information should either be given directly or ensured that AI can figure it out itself (for example, here we gave a specific path, and it can figure it out by reading files). Second, start from outcomes, not processes. Tell AI what you want, let it figure out how to do it. Unless you predict that not giving specific guidance on some aspect will cause problems—like the context window issue earlier—there's no need to explain in such detail.&lt;/p&gt;
&lt;p&gt;This decision and Decision 3 are two sides of the same coin: Decision 3 says hand execution to agents, Decision 5 says write instructions in a form suitable for agents.&lt;/p&gt;
&lt;h2&gt;Summary: AI Is Leverage&lt;/h2&gt;
&lt;p&gt;Finally, some thoughts on my experience.&lt;/p&gt;
&lt;p&gt;This task took me about two minutes to dictate instructions to AI using voice recognition. Then AI worked on it for 45 minutes: opening 8 sub-agents in parallel, handling various edge cases, writing tests, reworking, getting tests to pass, committing. I didn't manage it at all during this process. This is leverage. Using two minutes of time to leverage 45 minutes of AI work. More precisely, using 5% of time to control 100% of engineering output.&lt;/p&gt;
&lt;p&gt;And current Agentic AI capabilities are strong enough to work autonomously for long periods. We don't need to watch it work. As long as we clearly state what the deliverable is and what the acceptance criteria are, we can go do other things. This brings a new possibility: scalable agentic workflow. For example, we use two minutes to leverage Agent A, keeping it busy for 45 minutes. Then during this time we go command Agent B, C, D... simultaneously launching multiple AIs to proceed in parallel. The cognitive load is indeed high, but this is a practical path to achieve 10x productivity on top of single-agent workflow.&lt;/p&gt;
&lt;p&gt;Having talked about the 10x productivity side, the flip side of this project is: having the awareness to use AI, but using the wrong methods—discussing in ChatGPT, no testing mechanism, letting one AI handle everything. If these decisions are wrong, we might struggle for hours to finish, or even get stuck in endless loops unable to complete. With the same task and even the same LLM, the gap between using it well and using it poorly, between good decisions and bad ones, is the gap between working with calm, effortless control and grinding away thanklessly, slower than doing it by hand.&lt;/p&gt;</content><category term="Computing"/><category term="English"/><category term="Agentic AI"/></entry><entry><title>OpenClaw 是什么｜AI Agent 聊天工具的原理、价值与局限</title><link href="https://yage.ai/openclaw.html" rel="alternate"/><published>2026-02-14T23:00:00-08:00</published><updated>2026-02-14T23:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-14:/openclaw.html</id><summary type="html">&lt;p&gt;OpenClaw 爆火的原因和去年 DeepSeek 一模一样——不是技术突破，而是把小众体验推向大众。本文不教配置，而是从产品设计角度拆解它的记忆系统、Skills 机制和聊天界面的根本局限，帮你判断该不该跟，以及怎么把核心思路用到自己的工作流里。&lt;/p&gt;</summary><content type="html">&lt;p&gt;OpenClaw在2026年1月底爆火。公众号铺天盖地都在介绍怎么配置，云服务厂商都速度上线了一键部署，生怕错过这波热度。与此同时，各种行为艺术又满天飞：ClawdBot、MoltBot、OpenClaw，一周内改了三次名；结果改名的时候账号还被抢注，被一个叫$CLAWD的代币诈骗了1600万美元。此外，安全漏洞也层出不穷：有12%的第三方skills含恶意代码，有不少人把控制台裸露在公网上没设密码。一时间让人感觉整个领域全是相互矛盾的噪音，无所适从：这东西到底要不要装？不装会错过什么？装了有什么风险？这到底是下一个生产力革命还是又一个两周就过气的玩具？&lt;/p&gt;
&lt;p&gt;这篇文章就想从更高层的角度抽丝剥茧：OpenClaw到底做对了什么，为什么是它火，以及这跟我们有什么关系。&lt;/p&gt;
&lt;h2&gt;为什么会火的暴论&lt;/h2&gt;
&lt;p&gt;我有一个暴论：OpenClaw火的原因，和去年这个时候DeepSeek火的原因，是高度类似的。&lt;/p&gt;
&lt;p&gt;DeepSeek流行的时候，当时国内大家用的AI主要是纯聊天，没有搜索功能也经常信口瞎编。ChatGPT和Claude虽然有了思考和搜索功能，智能强很多，但国内用不了。DeepSeek引入了推理功能和搜索功能以后，第一次让大家体验到了会搜索懂思考的AI，带来了一种震撼：哇，AI还能这么有用，就爆火了。换言之，这个火不是因为技术上比竞争对手更好，事实上DeepSeek在纯模型能力上并没有碾压同时代的GPT-4o或者Claude 3.5。而是因为把一小撮人享受/习惯的事情，一下子推广到另一群更大的用户群面前，这才火起来。&lt;/p&gt;
&lt;p&gt;OpenClaw也是一样。2026年初Agentic AI领域其实有一个断层：ChatGPT这种产品虽然流行，但相比Cursor/Claude Code/Codex这种有本地权限的编程Agentic AI，整体能力还是落后了至少一代（具体为什么后面有解释）。但Cursor这种工具非常小众，基本上只有程序员在用。大家用的还是ChatGPT这种消费级产品，就觉得AI这两年没啥进步，能力很有限。然后OpenClaw第一次把Cursor这种能本地编程的Agent和WhatsApp/Slack/飞书这种流行通信软件接起来了，让非技术人员这种更广大的用户群第一次接触到了能读写文件，能执行命令，有记忆能持续迭代的Agentic AI，就爆火了。换言之，这个火不是说OpenClaw在技术上做到了什么新的事情，而是因为把一小撮人享受/习惯的事情，一下子推广到另一群更大的非技术用户群面前，这才火起来。&lt;/p&gt;
&lt;p&gt;但我说这些不是为了得出结论说OpenClaw、DeepSeek是花架子，没必要学。恰恰相反，DeepSeek从历史的角度提供了很多启发。比如DeepSeek火了以后，真正从中受益的是哪些人？我的观察是，有没有跟风第一时间玩上DeepSeek本身并不重要。很多人玩了一段时间就退烧了。真正理解了DeepSeek为什么火，把搜索和推理这两个关键因素整合到了自己工作流里的人，才是真正受益的人。类似的，OpenClaw火了以后，我们确实可以去跟风安装使用、体验一下，但这件事情本身并不会让我们一下就脱胎换骨生产力倍增了。因为这种现象级产品能爆火的重要前提是它是面向最广泛的用户设计的，因此设计决策上有很多妥协，直接用往往效率并不是最优。更关键的是要去理解它背后的设计哲学，分析它爆火的原因，从中吸取经验教训，改进自己的工作流。&lt;/p&gt;
&lt;p&gt;毕竟，工具会过气，对工具本质的理解不会。把可迁移的认知抽出来，融入自己的工作流，这才是内行的做法。&lt;/p&gt;
&lt;h2&gt;聊天界面：流行的基础，也是天花板&lt;/h2&gt;
&lt;p&gt;在具体分析OpenClaw的牛逼之处之前，我想先带大家看一个具体的例子，来解释“OpenClaw是面向最广泛的用户设计的”这句话到底是什么意思，以及有什么影响。&lt;/p&gt;
&lt;p&gt;前面我们提到OpenClaw火起来非常关键的一点是，它选用了大家天天都用的聊天软件作为交互入口，而不是像Cursor一样让你在电脑上多装一个软件。这样可以复用现有的使用习惯和渠道，让用这个工具的心智负担特别低。你没事反正都要用Slack/飞书，正好就看到了OpenClaw就会想着用用。另一方面，因为大家本身就非常熟悉这些软件的使用，所以它把学习成本也几乎压到了零。不需要装IDE，不需要学编程的术语概念，拿起手机就能用，这是它能出圈的基础。&lt;/p&gt;
&lt;p&gt;但如果你用过Cursor这种Agentic AI编程软件的话，就会发现Slack这种聊天窗口对AI来说是个相当受限的交互方式。&lt;/p&gt;
&lt;p&gt;第一是它要求对话是线性的。像Slack和微信这样的聊天窗口主要就是一条条消息往下排。但是深度的知识工作往往不是线性的。比如你需要引用另外一个thread的内容，需要把两个方向的探索merge在一起，需要在某个会话中fork出去。这些在桌面环境里比如Cursor和OpenCode里面都有专门的UI可以实现，但是在聊天窗口里面做就特别别扭。&lt;/p&gt;
&lt;p&gt;第二个问题是信息密度。如果只是做玩具性质的调研和开发，聊天窗口是没有问题的。但凡要做更复杂一点的分析和思考，它的信息密度就捉襟见肘了。比如图文混排的分析报告、复杂的表格、带格式的长文，这些在聊天里面看还都蛮痛苦的。同时不同平台对Markdown的支持也参差不齐，体验很不稳定。&lt;/p&gt;
&lt;p&gt;第三个问题出在过程的可观测性上。尤其是对要分好几步才能完成的任务，我把执行权交给AI以后，很自然地会想关心它到底在干啥。比如它是在稳步推进，还是在钻牛角尖鬼打墙？它调用了什么工具，改了哪些文件？这些在Cursor等等工具里会有自然的呈现，但是聊天窗口我们只能看见一条“对方正在打字”或者一个emoji表示正在处理。尤其是比较复杂的任务，OpenClaw需要等蛮久才能等到一条消息告诉我们搞定了还是中间挂了。&lt;/p&gt;
&lt;p&gt;但是我说这么多不是想说OpenClaw设计不好，而是想说这里面有个很明显的妥协（trade-off）。你要想把工具做得容易上手、面向最大的用户群，就必须用聊天工具这些人人都已经在用的工具作为载体。但这同时立刻又带来了对话形式、信息密度等等弊端。反之亦然。在这个从“易用但是拧巴”到“原生但是小众”的连续的trade-off空间里，OpenClaw选择了极致的易用性。这是它能爆火的基础。但我们也要清醒地认识到这种设计决策所带来的限制。在融合进自己工作流的时候，不是无脑地采用OpenClaw的所有设计，而是应该因地制宜，根据自己的需求来在这个trade-off轴线上找到属于自己的甜点区。&lt;/p&gt;
&lt;p&gt;理解了这个trade-off，后面的分析就容易理解了。&lt;/p&gt;
&lt;h2&gt;界面之外的流行要素&lt;/h2&gt;
&lt;p&gt;聊天界面是OpenClaw流行的基础，但只是最浅显的一点。真正让用户觉得这个AI真的智能，好用，懂我的，是它背后的三个设计决策。&lt;/p&gt;
&lt;p&gt;第一个是统一的入口和上下文。对比一下Cursor就很清楚。在Cursor里每个项目的上下文是隔离的——打开项目A，AI只知道项目A的事；切到项目B，之前关于项目A的对话就全没了。Claude Code、OpenCode也一样，每次启动都绑定一个工作目录。但OpenClaw则完全相反。它默认把所有对话的上下文混在一个池子里。你上午在Telegram里让它帮你整理邮件，下午在Slack里让它写个报告，晚上在WhatsApp里让它安排明天的日程——它全都记得。给人的感觉就是它特别聪明，好像真的认识你。&lt;/p&gt;
&lt;p&gt;但光把上下文混在一起是没用的，因为上下文窗口很快就会满了。这就牵扯到了它的第二个关键设计，持久化记忆。OpenClaw对记忆的处理非常巧妙，很值得学习。从大的原理上，它&lt;a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus"&gt;和Manus一样&lt;/a&gt;用的是基于文件的记忆系统。比如它维护了一个SOUL.md，定义AI的核心人格和行为准则；USER.md保存了对用户的画像，MEMORY.md存长期记忆，再加上每日的原始日志等等。&lt;/p&gt;
&lt;p&gt;这里面比较巧妙的是它有个自我维护机制：AI每隔一段时间（heartbeat）会自动review最近的原始日志，把有价值的信息提炼到MEMORY.md里，顺便清理过时的条目。整个过程不需要用户干预。这个自我维护机制就把记忆给分层了，原始日志是短期记忆，每天的MEMORY.md是中期记忆，提炼出来的个性和喜好是长期记忆。对用户来说，体验一下就从“每次重开都要重新交代一遍”变成了“它好像在成长”，这个感知差异是非常大的。&lt;/p&gt;
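&lt;p&gt;为了把这个机制说得更具体，下面是一个示意性的 heartbeat 片段（只是对思路的还原，不是 OpenClaw 的真实实现；文件名沿用正文提到的 MEMORY.md，&lt;code&gt;distill&lt;/code&gt; 是个占位函数，真实系统里这一步由模型完成）：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datetime import date
from pathlib import Path

MEMORY = Path("MEMORY.md")   # 长期记忆
LOG_DIR = Path("logs")       # 每日原始日志，相当于短期记忆

def distill(raw_log):
    # 占位实现：真实系统里由模型把原始日志压缩成值得长期保留的要点
    lines = [line for line in raw_log.splitlines() if line.strip()]
    return "\n".join(lines[:3])

def heartbeat():
    # 每隔一段时间自动执行：读取当天日志，提炼要点，追加进长期记忆
    log_file = LOG_DIR / f"{date.today().isoformat()}.md"
    if not log_file.exists():
        return
    summary = distill(log_file.read_text(encoding="utf-8"))
    with MEMORY.open("a", encoding="utf-8") as f:
        f.write(f"\n## {date.today().isoformat()}\n{summary}\n")
&lt;/code&gt;&lt;/pre&gt;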
&lt;p&gt;第三个设计是丰富的Skills。这个意义要远超节省那么一点用户的时间。工具数量带来的好处&lt;a href="/manus.html"&gt;不是线性的&lt;/a&gt;——6个工具比4个工具的能力提升，远大于4个相对2个。这是因为工具之间可以组合。接Slack能管下达指令，状态汇报，接图像生成能画图，接PPT服务能出稿，接deep research能调研。这些凑在一起，就可以组合进化出很多完整的业务能力和应用场景。&lt;/p&gt;
&lt;p&gt;这三个设计之间也不是简单的加法，而是互相促进的。&lt;/p&gt;
&lt;p&gt;记忆加上统一的上下文池，会带来数据复利。因为有持久化记忆，对话可以跨会话积累；因为有统一入口，所有来源的数据汇进同一个记忆池。你在Slack里讨论的工作内容、在Telegram里安排的日程、在WhatsApp里的个人对话，全部混在一起，形成了对你越来越完整的理解，以后完成任务也会越来越贴心。&lt;/p&gt;
&lt;p&gt;记忆加上skills，带来了自我进化的能力。今天学到的用法明天还在，能力会累积；AI自己能写新的skill并且记住它的存在和用法，这就进入了正循环。这里面特别值得一提的是coding能力。因为OpenClaw自己能写代码，所以遇到没有现成skill可用的时候，它就可以当场造一个。这个新skill会被保存下来，下次遇到类似场景直接复用。这就形成了自我进化的闭环。&lt;/p&gt;
&lt;p&gt;而这些能力和界面的易用性加在一起，又带来了使用频率。入口越顺滑，调用越频繁，飞轮越转越快，能力越来越强。&lt;/p&gt;
&lt;p&gt;总之，OpenClaw是一个相当厉害的产品。它的各种决策，不论是技术的（入口、记忆、工具）还是非技术的（界面），都在为同一个飞轮服务，让普通人第一次摸到了Agentic AI的完整形态。&lt;/p&gt;
&lt;h2&gt;限制和trade-off&lt;/h2&gt;
&lt;p&gt;前面说了它为什么牛，下面我要开始吐槽了。但我想先解释一下，下面介绍的这些限制不是说OpenClaw疏忽了没做好，而是前面说的那个trade-off的直接后果——为了爆款好用必须付出的代价。&lt;/p&gt;
&lt;p&gt;界面的限制前面已经说过了：线性、低信息密度、低可观测性。在深度使用时这些很快会成为瓶颈，这里不再赘述。&lt;/p&gt;
&lt;p&gt;更深层的问题在记忆上。OpenClaw的记忆系统对小白很友好。你不用管，它自己就会打理和进化。但对想把知识沉淀成资产的人来说，这反而是一个障碍。&lt;/p&gt;
&lt;p&gt;举个栗子，比如我们做完一次调研，产出了一份5000字的长文或者一份PRD。在Cursor/文件系统里它就是一个文件：&lt;code&gt;docs/research.md&lt;/code&gt;，想引用就@，想升级就开新版本，想对比就diff。但在OpenClaw里，这份东西像是人类记忆一样，说不定什么时候就会被自动摘要、自动重写，甚至整个被删除了（遗忘），整个过程完全不可控。你很难跟它说清楚：以后就以这份文档为准，遇到相关问题必须引用它，不要给我压缩成三行。总之就是，知识没办法显式管理。&lt;/p&gt;
&lt;p&gt;更让人头疼的是整个更新过程也是一个黑盒。MEMORY.md里存什么、怎么组织、什么时候清理，主要是AI在heartbeat期间自动做的。你看到的是结果，很难看到原因：它这次改了哪些条目，为什么删掉这一条，为什么把两个不相关的东西合并在一起。出了问题也很难定位根源，因而很难改进。&lt;/p&gt;
&lt;p&gt;OpenClaw记忆系统的设计带来的另一个问题是跨场景的信息干扰。统一记忆当然带来懂我的感觉，但也意味着信息很容易跨项目污染：A项目的偏好、甚至某个临时决定，可能会莫名其妙影响到B项目。对小白来说它好像什么都记得，但对真的想干活的进阶用户来说更像是“我去怎么又被它带偏了”。&lt;/p&gt;
&lt;p&gt;Skills的安全隐患又是另一类问题。ClawHub上的上千个技能中，安全审计发现有上百个包含恶意代码——加密货币盗窃、反向shell后门、凭证窃取都有。Simon Willison提过一个&lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"&gt;致命三角&lt;/a&gt;的概念：一个AI系统同时具备访问私有数据、暴露于不可信环境、能够对外通信这三个能力时，风险是指数级放大的。OpenClaw三个全中🤡。这就形成了一个奇特的悖论。你要想用的爽，就必须给他很多工具和权限。但这又会带来安全问题，所以就要把权限收得很紧。但权限收紧了就又变成类似Manus那样的云端Agent服务了，没了本地Agent的爽。安全和好用，似乎成了一对矛盾。&lt;/p&gt;
&lt;h2&gt;So What?&lt;/h2&gt;
&lt;p&gt;讲到这里，自然会有人问：分析了一堆，然后呢？这跟我有什么关系呢？&lt;/p&gt;
&lt;p&gt;回答是：可以用这些认知，在已有的工具上搭一套比OpenClaw更顺手的东西。我自己就是这么干的，效果比直接用OpenClaw好很多。下面讲几个关键决策。&lt;/p&gt;
&lt;h3&gt;复用Agentic Loop，而不是自己造&lt;/h3&gt;
&lt;p&gt;我们做的第一个决策，也是最重要的一个，是不自己从头实现一套Agentic AI系统，而是复用OpenCode这样的开源CLI编程工具作为基础。&lt;/p&gt;
&lt;p&gt;这个决策背后有一个更深层的判断。做一个能用的Agentic Loop——也就是调API、解析工具调用、执行工具、把结果返回给AI、请求下一次回答这个循环——说起来简单，但要做到能支撑真实使用的水平，有很多细节：文件系统的读写，文件内容的新增删除替换，沙箱环境，权限管理……每个都是坑。这些东西写起来繁杂、充满陷阱，而且和我们最终想创造的价值没有多少关系。&lt;a href="https://yage.ai/ai-builder-space.html"&gt;我之前的一篇文章&lt;/a&gt;里详细讨论过这个问题——核心观点是，Agentic Loop是体力活，应该外包；真正值得花精力的是Agentic Architecture，也就是怎么把业务逻辑注入AI系统让它直接创造价值。&lt;/p&gt;
&lt;p&gt;而OpenCode、Claude Code这类工具，恰恰就是一个特别好的外包。它们已经把Agentic Loop做得非常成熟了——能读写文件、能跑命令、能持续迭代，而且还在飞速进化中。用它们做基石，等于是白嫖了整个agentic编程工具链，可以把自己的开发成本降到最低。而且选OpenCode还有一些额外的好处：它完全开源可以魔改，支持并行的subagent（Cursor和Codex到现在都还没有），还支持多种coding plan——比如我自己用的是GLM的coding plan，也可以直接用OpenAI的Codex plan，不用像直接调API那么烧钱。&lt;/p&gt;
&lt;h3&gt;文件即记忆：继承和发展OpenClaw的哲学&lt;/h3&gt;
&lt;p&gt;第二个决策是在记忆体系上。OpenCode/Claude Code这类工具天生就有磁盘即记忆的思想——毕竟它们作为编程工具处理的基础单元就是文件。当我们又有基于磁盘的记忆，又有对文件直接的操纵权和透明度的时候，就解决了前面分析中OpenClaw记忆系统的问题。想沉淀资产就写文件，想强制AI遵守某些规则就写AGENTS.md，想管理记忆结构就直接编辑Markdown。前面说的那些知识没法显式管理、更新过程是黑盒的问题，用OpenCode的细粒度控制和文件系统天然就解决了。&lt;/p&gt;
&lt;p&gt;但光有文件系统还不够，我们还把OpenClaw那套persona自我进化的机制移植了过来。具体来说，我们把记忆分成了两层：project-level的记忆（每个项目自己的上下文、决策记录、技术方案）和persona-level的记忆（用户画像、行为偏好、沟通风格）。然后在AGENTS.md里加入persona维护的workflow，让AI在session结束时自动review对话、更新MEMORY.md和USER.md。同样的自我进化，但跑在完全可控的文件系统上，还能用Git做版本管理。&lt;/p&gt;
&lt;p&gt;至于统一上下文的问题，我们用了一个很简单粗暴的方案：Mono Repo。把不同项目放在同一个repo的不同文件夹下，AI天然就可以跨项目访问所有上下文。想隔离就隔离，想共享就共享，想merge两个方向的探索就直接@，想fork出去就复制文件——全都是文件系统和OpenCode的原生操作，比OpenClaw在聊天窗口里拧巴地做这些事情自然太多了。&lt;/p&gt;
&lt;h3&gt;Skills和安全&lt;/h3&gt;
&lt;p&gt;Skills方面，OpenCode生态有大量MCP server和Skills可以接入——日历、邮件、浏览器、搜索等等——功能覆盖和ClawHub大差不差。安全性上，我们的做法是不直接安装第三方skill，而是让AI先审查源码、理解逻辑，然后重写一个干净版本。在AI辅助编程的今天这个过程通常只要几分钟，但可以极大降低供应链攻击的风险。&lt;/p&gt;
&lt;h3&gt;最后一公里：移动端&lt;/h3&gt;
&lt;p&gt;前面三个决策解决了底座、记忆和工具的问题，但还差一个关键的东西：入口。OpenClaw火的一个重要原因是你不用坐在电脑前面。但现有的编程工具在这方面确实拉胯——VSCode有个Code Server可以远程访问，但对iPad非常不友好；OpenCode有个Web Client，但说实话只是解决了有和无的问题，非常难用；Cursor的Web Client高度绑定Github；Claude Code则完全没有Web Client。&lt;/p&gt;
&lt;p&gt;为了解决这个问题，我们做了一个原生的iOS App作为OpenCode的远程客户端。注意这个App不是把聊天窗口搬到手机上——它是一个真正为移动端设计的工作界面：能看到AI的实时工作进度，每一步工具调用、每一个文件操作；能切换模型做A/B测试；能浏览Markdown文件和审查更改；支持语音输入；支持基于HTTPS或者SSH隧道的公网访问；iPad上还有三栏分屏。&lt;/p&gt;
&lt;p&gt;这个客户端已经在github上&lt;a href="https://github.com/grapeot/opencode_ios_client"&gt;开源&lt;/a&gt;了。欢迎大家也来体验。未来可能会加入TestFlight。效果是吃灰很久的iPad重新变成了生产力工具，在沙发上指挥AI干活的体验比OpenClaw的聊天窗口爽得多。外出吃饭的时候接到oncall，也可以直接给AI小弟布置任务，当场就搞清楚了原因。而且全程都有对AI完全的掌控，知道它不会出幺蛾子，也不会把你的信息po到Moltbook上。&lt;/p&gt;
&lt;p&gt;&lt;img alt="iPad客户端" src="/images/opencode_ios_client.jpeg"&gt;&lt;/p&gt;
&lt;h2&gt;总结&lt;/h2&gt;
&lt;p&gt;回到开头的暴论。OpenClaw和DeepSeek的火，本质上是同一件事：把一小撮人已经在享受的能力，第一次推到了更广泛的人群面前。DeepSeek让大家第一次用上了会搜索懂推理的AI，OpenClaw让大家第一次摸到了能读写文件、有记忆、会自我进化的Agentic AI。&lt;/p&gt;
&lt;p&gt;但也正因为要面向最广大的普通用户，这类产品必然在设计上做大量妥协。DeepSeek如此，OpenClaw也如此。聊天界面带来了易用性但牺牲了表达力，统一记忆带来了懂我的感觉但牺牲了可控性，开放的Skills生态带来了能力但引入了安全风险。&lt;/p&gt;
&lt;p&gt;对于已经在用Cursor/Claude Code/OpenCode的人来说，更值得做的不是无脑跟风装一个OpenClaw，而是理解它为什么火——统一入口、持久化记忆、工具生态，以及它们之间的飞轮——然后把这些认知融入自己已有的工具链里，扬长避短。我们自己就是这么干的，效果确实比直接用OpenClaw好很多。&lt;/p&gt;
&lt;p&gt;毕竟，工具会过气，对工具本质的理解不会。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"/><category term="Chinese"/><category term="Agentic AI"/><category term="Review"/></entry><entry><title>OpenClaw Deep Dive: Why It Went Viral and What It Means for You</title><link href="https://yage.ai/openclaw-en.html" rel="alternate"/><published>2026-02-14T22:00:00-08:00</published><updated>2026-02-14T22:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-14:/openclaw-en.html</id><summary type="html">&lt;p&gt;OpenClaw went viral for the same reason DeepSeek did — not a technical breakthrough, but bringing a niche power-user experience to the masses. This post skips setup tutorials and instead dissects its memory system, Skills mechanism, and the fundamental ceiling of chat-based AI interfaces, helping you decide whether to adopt it and how to extract its core ideas into your own workflow.&lt;/p&gt;</summary><content type="html">&lt;p&gt;OpenClaw went absolutely viral at the end of January 2026. Social media was flooded with configuration guides, and cloud service providers rushed to launch one-click deployments, terrified of missing the hype train. Meanwhile, it felt like performance art was happening everywhere: the project changed its name three times in one week—from ClawdBot to MoltBot to OpenClaw. In the process of rebranding, their handle was even hijacked by a token called $CLAWD that scammed people out of $16 million. Security vulnerabilities were popping up left and right, too: 12% of third-party skills contained malicious code, and plenty of people exposed their consoles to the public internet without even setting a password. For a while, the whole space was just a mess of contradictory noise, leaving everyone confused: Should I install this thing? What am I missing if I don't? What are the risks? Is this the next productivity revolution, or just another toy that will be forgotten in two weeks?&lt;/p&gt;
&lt;p&gt;In this post, I want to peel back the layers from a higher-level perspective: What did OpenClaw actually get right? Why did it explode? And most importantly—what does this have to do with you?&lt;/p&gt;
&lt;h2&gt;Why It Went Viral: My Hot Take&lt;/h2&gt;
&lt;p&gt;I have a bit of a provocative theory: the reason OpenClaw blew up is almost identical to why DeepSeek went viral exactly one year ago.&lt;/p&gt;
&lt;p&gt;When DeepSeek first became popular, most AI tools in China were limited to pure chat—no search capabilities, and they hallucinated constantly. While ChatGPT and Claude had reasoning and search features that made them much smarter, they weren't easily accessible in the country. When DeepSeek introduced reasoning and search, it was the first time many people experienced what a thinking, searching AI could do. It was a massive shock to the system: "Wow, AI can actually be THIS useful!" and then—boom—it went viral. In other words, its popularity wasn't necessarily because it was technically superior to its competitors (DeepSeek didn't exactly crush GPT-4o or Claude 3.5 in pure model capability at the time). It went viral because it took something a small circle of early adopters were already enjoying and habituated to, and pushed it right in front of a much larger audience.&lt;/p&gt;
&lt;p&gt;OpenClaw is the exact same story. In early 2026, there was a massive gap in the field of Agentic AI. While products like ChatGPT were popular, they were at least a generation behind Agentic AI tools with local permissions like Cursor, Claude Code, or Codex (I’ll explain why later). But tools like Cursor are niche—mostly used by programmers. The general public was still stuck with consumer-grade chat interfaces, feeling like AI hadn't progressed much in the last two years. Then OpenClaw came along and, for the first time, connected those local programming agents with the messaging apps everyone uses every day—WhatsApp, Slack, Lark. It gave non-technical users their first taste of Agentic AI that can read and write files, execute commands, maintain memory, and iterate continuously. It went viral not because it did something brand new technically, but because it democratized an experience previously reserved for a tiny group of techies.&lt;/p&gt;
&lt;p&gt;Now, I’m not saying OpenClaw or DeepSeek are just "showy" tools you shouldn't bother with. Quite the opposite. DeepSeek provided a lot of historical inspiration. For example, after the hype died down, who actually benefited? In my observation, it wasn't the people who just jumped on the bandwagon to play with it for a few days. It was the people who understood &lt;em&gt;why&lt;/em&gt; it went viral and integrated search and reasoning into their actual workflows. Similarly, while we can go ahead and install OpenClaw and try it out, the tool itself won't magically double your productivity. Viral products are designed for the broadest possible audience, which means they involve a lot of design compromises. Using them as-is is rarely the most efficient way to work. The real value is in understanding the design philosophy behind them, analyzing why they exploded, and applying those lessons to improve your own workflow.&lt;/p&gt;
&lt;p&gt;At the end of the day, tools will come and go, but your understanding of their core essence won't. Extracting transferable insights and baking them into your own workflow—that's how the pros do it.&lt;/p&gt;
&lt;h2&gt;The Chat Interface: Both the Foundation and the Glass Ceiling&lt;/h2&gt;
&lt;p&gt;Before we dive into why OpenClaw is so powerful, I want to look at a specific example to explain what I mean when I say "OpenClaw is designed for the broadest audience," and how that impacts everything.&lt;/p&gt;
&lt;p&gt;As I mentioned earlier, a key reason OpenClaw exploded is that it chose messaging apps we use daily as its interface, rather than requiring you to install yet another piece of software like Cursor. This leverages existing habits and channels, keeping the cognitive barrier to entry incredibly low. You're already on Slack or Lark anyway, so seeing OpenClaw right there makes you want to try it out. Plus, since everyone is already familiar with these apps, the learning curve is pushed practically to zero. No IDE to install, no programming jargon to learn—just pick up your phone and start using it. That’s why it reached such a huge audience.&lt;/p&gt;
&lt;p&gt;But if you’ve ever used an Agentic AI programming tool like Cursor, you’ll quickly realize that a Slack-style chat window is actually a very restrictive way for an AI to interact.&lt;/p&gt;
&lt;p&gt;First, it forces a linear conversation. Slack and WeChat windows are basically just one message after another. But deep knowledge work is rarely linear. You might need to reference content from another thread, merge two different directions of exploration, or fork off a specific conversation. In desktop environments like Cursor or OpenCode, there are dedicated UI elements for this, but doing it in a chat window feels clunky as hell.&lt;/p&gt;
&lt;p&gt;Second, there’s the issue of information density. For toy-level research or quick development, a chat window is fine. But for any meaningful analysis or deep thinking, the information density is embarrassingly low. Trying to read formatted reports, complex tables, or long-form documents inside a chat bubble is pretty painful. Plus, different platforms have wildly inconsistent Markdown support, making the experience very unstable.&lt;/p&gt;
&lt;p&gt;The third problem is observability. Especially for multi-step tasks, once I hand over execution to the AI, I naturally want to know what it’s actually doing. Is it making steady progress, or is it spinning its wheels in a dead-end loop? Which tools did it call? Which files did it change? In Cursor and similar tools, this is presented naturally, but in a chat window, we’re stuck with a "the user is typing..." message or a single emoji. For complex tasks, you’re often left waiting a long time just to be told whether it succeeded or crashed halfway through.&lt;/p&gt;
&lt;p&gt;Now, I’m not saying these are "bad" design choices. They are clear trade-offs. If you want to make a tool that’s easy to pick up for everyone, you have to use the tools everyone is already using. But that immediately brings limitations in format and density. It’s a spectrum from "easy but clunky" to "native but niche," and OpenClaw chose extreme ease of use. That’s why it’s a hit. But we have to be clear-eyed about the limitations that decision brings. When you're integrating these tools into your own workflow, don't just mindlessly copy every design choice—find that sweet spot on the trade-off axis that works for &lt;em&gt;your&lt;/em&gt; needs.&lt;/p&gt;
&lt;p&gt;Once you understand this trade-off, the rest of the analysis becomes much clearer.&lt;/p&gt;
&lt;h2&gt;The Success Factors Beyond the Interface&lt;/h2&gt;
&lt;p&gt;The chat interface is what made OpenClaw approachable, but it’s just the surface. What actually makes users feel like this AI is genuinely intelligent, useful, and "gets" them are three core design decisions happening under the hood.&lt;/p&gt;
&lt;p&gt;The first is a unified entry point and context. If you compare it to Cursor, the difference is stark. In Cursor, project contexts are isolated—if you open Project A, the AI only knows about A. Switch to Project B, and the conversation about A is gone. Claude Code and OpenCode are the same; they bind to a specific working directory every time you launch. OpenClaw does the exact opposite. By default, it mixes all your conversation contexts into one big pool. You can ask it to organize your emails in Telegram in the morning, write a report in Slack in the afternoon, and schedule your calendar in WhatsApp in the evening—and it remembers everything. It feels incredibly smart, like it actually &lt;em&gt;knows&lt;/em&gt; you.&lt;/p&gt;
&lt;p&gt;But just dumping everything into one pool isn't enough, because the context window would fill up instantly. That leads to the second key design: Persistent Memory. OpenClaw handles memory very cleverly. At a high level, it uses a file-based memory system &lt;a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus"&gt;much like Manus does&lt;/a&gt;. It maintains a &lt;code&gt;SOUL.md&lt;/code&gt; to define the AI’s core personality and behavior, a &lt;code&gt;USER.md&lt;/code&gt; for your profile, and a &lt;code&gt;MEMORY.md&lt;/code&gt; for long-term storage, all on top of the raw daily logs.&lt;/p&gt;
&lt;p&gt;The clever bit is its self-maintenance mechanism. Every so often (a "heartbeat"), the AI automatically reviews its recent raw logs, distills valuable info into &lt;code&gt;MEMORY.md&lt;/code&gt;, and cleans up outdated entries. This happens entirely in the background without user intervention. This mechanism creates a tiered memory structure: raw logs are short-term, the daily &lt;code&gt;MEMORY.md&lt;/code&gt; is medium-term, and the distilled traits/preferences are long-term. For the user, the experience shifts from "I have to explain everything every time" to "It feels like it’s growing with me." That perceived difference is huge.&lt;/p&gt;
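&lt;p&gt;To make the mechanism concrete, here is an illustrative heartbeat sketch (a reconstruction of the idea, not OpenClaw's actual code; the file names follow the ones mentioned above, and &lt;code&gt;distill&lt;/code&gt; is a placeholder for the model-driven distillation step):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datetime import date
from pathlib import Path

MEMORY = Path("MEMORY.md")   # long-term memory
LOG_DIR = Path("logs")       # raw daily logs, i.e. short-term memory

def distill(raw_log):
    # Placeholder: in the real system a model condenses the raw log
    # into a few points worth keeping long-term.
    lines = [line for line in raw_log.splitlines() if line.strip()]
    return "\n".join(lines[:3])

def heartbeat():
    # Runs periodically: read today's raw log, distill it, append to long-term memory.
    log_file = LOG_DIR / f"{date.today().isoformat()}.md"
    if not log_file.exists():
        return
    summary = distill(log_file.read_text(encoding="utf-8"))
    with MEMORY.open("a", encoding="utf-8") as f:
        f.write(f"\n## {date.today().isoformat()}\n{summary}\n")
&lt;/code&gt;&lt;/pre&gt;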
&lt;p&gt;The third pillar is the rich ecosystem of Skills. This is about so much more than just saving a few minutes of your time. The benefit of adding tools &lt;a href="/manus-en.html"&gt;isn’t linear&lt;/a&gt;—the jump from 4 to 6 tools adds far more capability than the jump from 2 to 4. Why? Because tools combine. Connecting Slack handles instructions and status reports; image generation handles visuals; a PPT service handles slide decks; deep research handles investigations. When you bundle these together, you get emergent business capabilities and end-to-end applications.&lt;/p&gt;
&lt;p&gt;These three designs aren't just additive; they reinforce each other.&lt;/p&gt;
&lt;p&gt;Memory combined with a unified context pool creates compounding returns on data. Because memory is persistent, conversations accumulate over time; because there’s a unified entry point, data from all sources flows into the same pool. Your work discussions in Slack, your scheduling in Telegram, your personal chats in WhatsApp—all of it merges to form an increasingly complete understanding of you, making every subsequent task more personalized.&lt;/p&gt;
&lt;p&gt;Memory combined with Skills brings the ability to self-evolve. Habits learned today are still there tomorrow; as the AI writes and remembers new skills, it enters a positive feedback loop. Its coding ability is particularly noteworthy here. Since OpenClaw can write its own code, if it hits a wall without an existing skill, it can just build one on the fly. That new skill is saved and ready to be reused next time. It’s a closed loop of self-evolution.&lt;/p&gt;
&lt;p&gt;And when you add all that power to the ease of use of the interface, you get high usage frequency. The smoother the entry point, the more the flywheel spins, making the AI smarter with every interaction.&lt;/p&gt;
&lt;p&gt;In short, OpenClaw is an impressive product. Every decision—technical or otherwise—serves the same flywheel, giving regular people their first real taste of what a fully realized Agentic AI can do.&lt;/p&gt;
&lt;h2&gt;Limitations and Trade-offs&lt;/h2&gt;
&lt;p&gt;I’ve spent plenty of time praising OpenClaw, so now it’s time to gripe. But let me be clear: the limitations I’m about to list aren't because the OpenClaw team was sloppy—they are the direct results of that trade-off I mentioned earlier. This is the price you pay for building a viral hit.&lt;/p&gt;
&lt;p&gt;I’ve already covered the interface: it's linear, low-density, and offers poor observability. When you move beyond casual use, these bottlenecks become apparent very quickly.&lt;/p&gt;
&lt;p&gt;The deeper issues lie in the memory system. OpenClaw’s memory is great for beginners—you don't have to manage it; it just works and evolves. But for anyone trying to turn knowledge into a long-term asset, this is actually a massive hurdle.&lt;/p&gt;
&lt;p&gt;For example, say you finish a deep dive research project and produce a 5,000-word report. In a tool like Cursor or a direct file system, that’s a file: &lt;code&gt;docs/research.md&lt;/code&gt;. You can @ reference it, version it, or diff it. In OpenClaw, that knowledge is more like human memory—at any point, it might be automatically summarized, rewritten, or even completely "forgotten" (deleted) by the background heartbeat process, and you have zero control over it. It’s hard to tell it: "This document is the absolute source of truth; reference it exactly and do not summarize it into three lines." In short, knowledge cannot be explicitly managed.&lt;/p&gt;
&lt;p&gt;Worse, the entire update process is a black box. What gets saved in &lt;code&gt;MEMORY.md&lt;/code&gt;, how it’s organized, and when it’s purged is all determined by the AI in secret. You see the result, but you rarely see the "why": What did it change this time? Why did it delete that specific note? Why did it merge those two unrelated thoughts? If something goes wrong, it’s a nightmare to debug and improve.&lt;/p&gt;
&lt;p&gt;Another issue with OpenClaw’s unified memory is cross-context interference. While unified memory makes the AI feel like it "knows" you, it also means information can easily pollute different projects. A preference from Project A, or even a one-off temporary decision, might mysteriously start influencing Project B. For a casual user, it seems like it remembers everything; for an advanced user trying to get work done, it feels more like, "Ugh, it’s going off on a tangent again."&lt;/p&gt;
&lt;p&gt;The security risks around Skills are a different class of problem altogether. Out of the thousands of skills on ClawHub, audits have found hundreds containing malicious code—from crypto theft and reverse shell backdoors to credential stealing. Simon Willison once mentioned a concept called &lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"&gt;the lethal trifecta&lt;/a&gt;: when an AI system has access to private data, is exposed to untrusted environments, and can communicate externally, the risk is amplified exponentially. OpenClaw hits all three🤡. This creates a strange paradox. To get the best experience, you have to give it broad tools and permissions. But that creates security risks, so you feel forced to tighten permissions. But if you tighten them too much, you’re back to a restrictive cloud agent like Manus, losing the magic of a local agent. Safety vs. usability remains a persistent contradiction.&lt;/p&gt;
&lt;h2&gt;So What?&lt;/h2&gt;
&lt;p&gt;At this point, you might be asking: "Okay, that was a lot of analysis—so what? How does this help me?"&lt;/p&gt;
&lt;p&gt;Here’s the answer: you can take these insights and build something for yourself that’s actually better and more tailored than OpenClaw. That’s exactly what I did, and the results have been much better than using OpenClaw directly. Let me walk you through a few key decisions I made.&lt;/p&gt;
&lt;h3&gt;Reuse the Agentic Loop, Don’t Rebuild It&lt;/h3&gt;
&lt;p&gt;The first—and most important—decision we made was to &lt;em&gt;not&lt;/em&gt; build an Agentic AI system from scratch. Instead, we reused an existing open-source CLI programming tool like OpenCode as our foundation.&lt;/p&gt;
&lt;p&gt;There’s a deeper reason behind this. Building a functional Agentic Loop—the cycle of calling an API, parsing tool calls, executing them, returning results to the AI, and requesting the next step—sounds simple on paper. But making it robust enough for real-world use is full of pitfalls: file system I/O, partial file edits, sandbox environments, permission management... the list goes on. Building these things is tedious, risky, and doesn’t actually create much unique value for the end user. I discussed this in detail in &lt;a href="/ai-builders-space-en.html"&gt;a previous post&lt;/a&gt;—my core point was that the Agentic Loop is "grunt work" that should be outsourced. What’s actually worth your time is the &lt;em&gt;Agentic Architecture&lt;/em&gt;—how you inject business logic into the AI system to create direct value.&lt;/p&gt;
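&lt;p&gt;To make that "grunt work" concrete, here is a minimal sketch of an Agentic Loop in Python. It is purely illustrative, not OpenCode’s actual implementation: &lt;code&gt;call_model&lt;/code&gt; and &lt;code&gt;run_tool&lt;/code&gt; are hypothetical callables standing in for the model API client and the tool executor, and a production loop layers streaming, permission prompts, sandboxing, and error recovery on top of this skeleton.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Illustrative sketch of an Agentic Loop; not OpenCode's actual code.
# call_model and run_tool are hypothetical callables supplied by the caller:
# one wraps the model API, the other executes tools (file edits, shell commands, ...).
def agentic_loop(task, call_model, run_tool, max_steps=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(history)            # ask the model for the next step
        history.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:                          # no tool requested: the task is done
            return reply.get("content", "")
        for call in calls:                     # execute each requested tool call
            result = run_tool(call["name"], call["arguments"])
            history.append({"role": "tool",
                            "tool_call_id": call["id"],
                            "content": result})
    raise RuntimeError("step budget exhausted without a final answer")
&lt;/code&gt;&lt;/pre&gt;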
&lt;p&gt;Tools like OpenCode or Claude Code are basically perfect “outsourcing” options. They’ve already built a mature Agentic Loop: they can read and write files, run commands, and iterate continuously, and they’re evolving incredibly fast. By using them as a cornerstone, you’re essentially getting a free ride on the entire agentic programming toolchain, which drops your development costs to almost zero. Choosing OpenCode specifically has extra perks: it’s fully open-source (so you can hack it), it supports parallel subagents (something Cursor and Codex still don’t have), and it supports multiple coding plans. For instance, I use the GLM coding plan, but you could use the OpenAI Codex plan directly without the insane costs of raw API calls.&lt;/p&gt;
&lt;h3&gt;File as Memory: Inheriting and Evolving the OpenClaw Philosophy&lt;/h3&gt;
&lt;p&gt;The second decision was about the memory system. Tools like OpenCode or Claude Code have a natural "disk-as-memory" philosophy—after all, files are the basic unit they handle. Having disk-based memory, combined with direct ownership and transparency over those files, solves the exact issues we saw with OpenClaw. If you want to build up long-term assets, write a file. If you want to force the AI to follow certain rules, write an &lt;code&gt;AGENTS.md&lt;/code&gt;. If you want to manage your memory structure, just edit the Markdown. The problems of non-explicit management and black-box updates are naturally solved by OpenCode’s fine-grained control and the file system itself.&lt;/p&gt;
&lt;p&gt;But just having a file system isn't enough, so we also ported over OpenClaw’s "persona self-evolution" mechanism. Specifically, we split memory into two layers: project-level memory (the context, decision logs, and technical specs for a specific project) and persona-level memory (user profile, preferences, and communication style). We then added a persona maintenance workflow to &lt;code&gt;AGENTS.md&lt;/code&gt;, so the AI automatically reviews the conversation at the end of a session to update &lt;code&gt;MEMORY.md&lt;/code&gt; and &lt;code&gt;USER.md&lt;/code&gt;. You get the same self-evolution, but it runs on a fully controllable file system where you can even use Git for version control.&lt;/p&gt;
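&lt;p&gt;As a concrete illustration (the wording and structure below are our own convention, not an OpenCode or OpenClaw standard), the memory-related part of our &lt;code&gt;AGENTS.md&lt;/code&gt; looks roughly like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Memory layout (illustrative excerpt)
- MEMORY.md: project-level memory (context, decision log, technical specs)
- USER.md: persona-level memory (profile, preferences, communication style)

## Persona maintenance workflow
At the end of each session:
1. Review the conversation for new decisions, preferences, or corrections.
2. Append project-level findings to MEMORY.md; never silently delete entries.
3. Update USER.md only for durable preferences, not one-off instructions.
4. Briefly note what you changed so the update is visible in the Git diff.
&lt;/code&gt;&lt;/pre&gt;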
&lt;p&gt;As for the unified context problem, we went with a brute-force but elegant solution: the Mono Repo. By putting different projects in different folders within the same repo, the AI naturally has cross-project access to all contexts. You can isolate when you want, share when you want, merge different lines of exploration, or fork things off just by copying files. These are all native operations in the file system and OpenCode, which feels infinitely more natural than trying to do them in a clunky chat window.&lt;/p&gt;
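&lt;p&gt;For reference, a hypothetical layout of such a mono repo (project names invented for illustration) might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;workspace/                  # one repo; a single OpenCode session sees everything
├── AGENTS.md               # global rules plus the persona maintenance workflow
├── USER.md                 # persona-level memory
├── project-a/
│   ├── MEMORY.md           # project-level memory for project A
│   └── docs/research.md    # long-form assets live as plain files
└── project-b/
    └── MEMORY.md           # project-level memory for project B
&lt;/code&gt;&lt;/pre&gt;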
&lt;h3&gt;Skills and Security&lt;/h3&gt;
&lt;p&gt;On the Skills front, the OpenCode ecosystem has a massive array of MCP servers and skills available—calendars, email, browsers, search, you name it. The feature set is pretty much on par with ClawHub. In terms of security, our approach is to never install third-party skills blindly. Instead, we have the AI review the source code, understand the logic, and then rewrite a "clean" version. In the age of AI-assisted coding, this only takes a few minutes, but it drastically reduces the risk of supply chain attacks.&lt;/p&gt;
&lt;h3&gt;The Last Mile: Mobile&lt;/h3&gt;
&lt;p&gt;Our first three decisions solved the foundation, memory, and tools, but one key piece was still missing: the entry point. A huge reason OpenClaw is so popular is that you don’t have to be sitting at your computer. But existing programming tools are pretty weak here—VS Code has Code Server, but it’s terrible on an iPad; OpenCode has a web client, but it’s barely functional; Cursor’s web client is tied to GitHub; and Claude Code doesn't even have one.&lt;/p&gt;
&lt;p&gt;To bridge this gap, we built a native iOS app as a remote client for OpenCode. This isn't just a chat window ported to your phone—it’s a workspace genuinely designed for mobile. You can see the AI’s real-time progress, every tool call, and every file operation. You can switch models for A/B testing, browse Markdown files, review changes, and use voice input. It supports public access via HTTPS or SSH tunnels, and the iPad version even has a three-column split view.&lt;/p&gt;
&lt;p&gt;The client is &lt;a href="https://github.com/grapeot/opencode_ios_client"&gt;open-sourced&lt;/a&gt; on GitHub. Feel free to check it out; it might even hit TestFlight soon. The result is that my dusty iPad is finally a productivity beast again. Directing an AI from the couch is a much, much better experience than using OpenClaw’s chat window. If I get an on-call notification while I'm out for dinner, I can just assign the task to my "AI intern" and have the root cause figured out before the check arrives. And the whole time, I have total control over the AI—I know it isn't going to go rogue or leak my info to Moltbook.&lt;/p&gt;
&lt;p&gt;&lt;img alt="iPad Client" src="/images/opencode_ios_client.jpeg"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Let's go back to my "hot take" at the beginning. The viral success of both OpenClaw and DeepSeek points to the same underlying truth: it's about taking capabilities a small elite group is already enjoying and pushing them to a broader audience for the first time. DeepSeek gave people their first taste of searching, reasoning AI; OpenClaw gave them their first hands-on experience with an Agentic AI that has disk access, memory, and the power to self-evolve.&lt;/p&gt;
&lt;p&gt;But because these products are designed for the masses, they inherently involve massive design compromises. That was true for DeepSeek, and it’s true for OpenClaw. The chat interface brings ease of use but sacrifices expressiveness; unified memory makes the AI feel like it "gets" you but sacrifices control; an open skill ecosystem brings power but introduces security risks.&lt;/p&gt;
&lt;p&gt;If you’re already using tools like Cursor, Claude Code, or OpenCode, the takeaway isn't that you should mindlessly install OpenClaw. Instead, you should understand &lt;em&gt;why&lt;/em&gt; it’s a hit—the unified entry, the persistent memory, the tool ecosystem, and the flywheel connecting them—and then fold those insights into your own existing toolchain while avoiding the pitfalls. That’s what we did, and I can tell you: the results are significantly better.&lt;/p&gt;
&lt;p&gt;At the end of the day, tools will come and go, but your understanding of their core essence won't.&lt;/p&gt;</content><category term="Computing"/><category term="English"/><category term="Agentic AI"/><category term="Review"/></entry></feed>