AI Agent · Industry & Competition · AI Coding

The Cognitive Edge Behind Manus and Cursor: Technical Bets and Their Validation

2026-04-28


A Common Verdict

Meta paid $2 billion for Manus. Elon Musk offered Cursor a $60 billion acquisition offer. After these numbers came out, the most common reaction on the Chinese internet boiled down to two claims. First, these are just wrapper products. They use other people’s models under the hood, so what’s the big deal? Second, Zuckerberg and Musk are impulse buyers: Meta missed the AI wave and is overpaying to catch up, and Musk just buys whatever is hot.

The subtext of this verdict is that Manus and Cursor are nothing special, no different in essence from the flood of AI agent tools and AI coding tools on the market. They just had better marketing and better timing.

This article argues that this verdict is wrong. Not slightly wrong, but directionally wrong. Manus and Cursor each hold a cognitive lead over their respective industries by at least one full step, and this lead can be verified through specific technical choices and head-to-head comparisons with competitors. The prices Meta and SpaceX/xAI offered are not impulse purchases. They are how the market priced that cognitive lead.

Manus: Reasoning from First Principles

Manus has been controversial since its launch in March 2025. The most common criticism is that it’s a wrapper: it doesn’t train its own models, uses Claude and Qwen, and just wraps an agent orchestration framework around them. MIT PhD Zengyi Qin’s comment represents one school of thought: this is a good product, but it is not a technical breakthrough.

To understand what Manus got right, the most effective approach is to put it side by side with its contemporaries.

Cognitive Gap 1: No Role-Playing

From 2023 to early 2025, most multi-agent systems were designed by copying human organizational structures. MetaGPT is the canonical example: it divides LLM agents into five roles (product manager, architect, project manager, engineer, QA), each with fixed responsibilities and workflows, executing sequentially like a human software company. This is hat-wearing: each agent is handed a human job title and told to play the part.

The problem with this design starts at its premise. Human societies need specialization because an individual’s cognitive bandwidth is limited; it takes over a decade of training to become a senior product manager or a senior engineer. Division of labor compensates for human cognitive constraints. LLMs are different. Any LLM off the shelf is already a generalist with knowledge across all domains. Telling it in a prompt “you are a senior software engineer” does nothing except restrict its capabilities.

Thinking from first principles leads to an entirely different conclusion: instead of having multiple agents each role-play a human function and collaborate sequentially, each agent should retain its full generalist capability, with division happening only at the task level. Manus’s Wide Research mechanism is the productization of this idea. Its main planner agent decomposes a user request into independent subtasks, then spins up a separate, full-capability Manus instance for each subtask. Each instance has its own independent context window and executes autonomously in a cloud VM sandbox. There are no role labels like “product manager agent” or “engineer agent.” Every sub-agent can plan, execute, and verify.
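To make the contrast with role-playing concrete, here is a minimal sketch of task-level decomposition in the spirit of Wide Research. Everything here is an assumption for illustration: `call_llm`, `plan_subtasks`, and `run_instance` are hypothetical names, not Manus’s API.

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Stand-in for any frontier-model client (Claude, Qwen, ...).
    raise NotImplementedError("plug in a model client here")

def plan_subtasks(request: str) -> list[str]:
    # The planner decomposes by TASK, not by human role: no
    # "product manager agent" or "engineer agent" labels anywhere.
    plan = call_llm(f"Split into independent subtasks, one per line:\n{request}")
    return [line.strip() for line in plan.splitlines() if line.strip()]

def run_instance(subtask: str) -> str:
    # Each subtask gets a fresh, full-capability generalist instance with
    # its own context window (and, in production, its own sandbox VM).
    # The instance plans, executes, and verifies on its own.
    return call_llm(f"Complete this subtask end to end:\n{subtask}")

def wide_research(request: str) -> list[str]:
    subtasks = plan_subtasks(request)
    # Fan out: one generalist instance per subtask, run in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        return list(pool.map(run_instance, subtasks))
```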

This is not a UI-level difference or a product strategy difference. It is a difference in understanding the fundamental nature of LLMs. MetaGPT designed from the shape of human organizations; Manus designed from the capability profile of LLMs. The latter was right and the former was wrong. This judgment was a minority view in March 2025. By 2026, it has become industry consensus: OpenAI’s Codex uses Plan/Spec Mode (a planner analyzes the request, an executor materializes each step in a sandbox), Anthropic’s Claude Code uses orchestrator-worker (a lead agent formulates the plan, sub-agents execute in parallel), and Cursor uses Planner-Worker-Judge. Every major player has converged on architectures divided by function (planning, execution, evaluation), and none of them assign human job titles to their agents.

Manus’s product-level judgment reflects the same cognitive caliber. In March 2025, while most agent products were siloed in vertical domains (research tools could only research, generation tools could only generate), Manus was the first to build an end-to-end pipeline, running from autonomous search to code generation to data visualization in a single flow. This is table stakes for agent products today, but it was a minority bet at the time. I wrote an analysis that week, discussing the compounding effects of Agentic AI across three dimensions: tools, data, and intelligence. Manus was the only product at the time that had realized all three layers of compounding.

Cognitive Gap 2: Creating and Distributing User Generated Software

The software industry has a long-standing supply-demand mismatch: products from professional software companies serve head demand, while a vast long tail of needs goes unaddressed. This parallels the media industry before YouTube: TV networks served head content demand, long-tail content creation needs were ignored, until User Generated Content platforms emerged.

Manus identified this early and made a product decision that seemed unconventional at the time: let users deploy and distribute the applications Manus generates. A user describes a need, Manus auto-generates the frontend, backend, and database, then deploys it to the cloud with one click and returns a shareable link. Getting that far already put it ahead of most contemporaneous agent products. But Manus went one layer further: it provided an API so that deployed applications could call Manus’s own AI capabilities. In other words, users could not only use AI to generate software; the generated software itself could continue using AI.
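A sketch of what “the generated software itself can continue using AI” means in practice. The endpoint, payload shape, and function names below are hypothetical; Manus’s actual API is not reproduced here.

```python
import json
import urllib.request

AGENT_API = "https://api.example.com/v1/agent"   # hypothetical endpoint

def ask_agent(prompt: str, api_key: str) -> str:
    # The deployed app calls back into the agent platform at runtime.
    req = urllib.request.Request(
        AGENT_API,
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["output"]

def handle_ticket(ticket_text: str, api_key: str) -> str:
    # Inside a generated app (say, a support-ticket tool): the software
    # is not a one-off artifact; it keeps consuming AI after deployment.
    return ask_agent(f"Draft a reply to this ticket:\n{ticket_text}", api_key)
```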

This judgment was far from obvious at the time. In March 2025, most AI agent products positioned themselves as tools that help you complete a task, producing reports, code, or slides, and finishing when the task is done. Manus positioned itself as a tool that helps you create a software product that can run continuously and be distributed, with built-in intelligence. These are two fundamentally different product logics. The former treats AI as a one-off productivity tool; the latter treats AI as infrastructure for User Generated Software.

Market response validated this judgment. Manus’s waitlist surpassed 2 million after a public demo, and what excited users most was not just that AI could do research and write code, but that it could deploy the finished product with one click, turning it into a real, usable online product. By the end of 2025, vibe coding and AI app builders had become a $4.7 billion market, and Manus was among the earliest products to build the complete pipeline of creation plus deployment plus intelligence injection.

The cognitive caliber behind this design choice is visible in its completeness of value chain judgment. Most competitors stopped at generation. Manus thought all the way through to distribution and continuous operation. This points to the same root as the first cognitive gap (no hat-wearing): the team reasons from first principles rather than making incremental optimizations on existing product forms.

Results and Responses

Commercial returns directly reflect these insights: in 8 months, Manus reached $100M ARR, processed 147 trillion tokens, and spun up over 80 million virtual machines. Its GAIA Level 3 benchmark score of 57.7% beat OpenAI Deep Research’s 47.6%.

Two common follow-up questions deserve a response.

First, “agent products are everywhere now; Manus is a previous-generation product form that has no direct use for Meta.” This claim is half right. Manus represents the cloud-sandbox agent paradigm, and the mainstream direction in 2026 has shifted to local terminal agents like Claude Code and OpenClaw, and enterprise-integrated agents like Amazon Q. In terms of product generation, Manus’s form is indeed not the latest. But acquisition logic has never been about buying the latest generation of product. What Meta bought is this team’s cognitive caliber, engineering capability, user base, and infrastructure accumulation. Product forms can iterate; a team’s understanding of and practical experience with agent AI does not expire when a new generation of products appears. By February 2026, Meta had already integrated Manus’s agent capabilities into the Ads Manager workflow, demonstrating that Manus’s technical assets found a real landing point within Meta’s product ecosystem.

The context engineering blog post published by the Manus team in July 2025 is more direct evidence. The information density of this post is extremely high, and from it you can directly see that the Manus team’s understanding of agentic AI leads the industry by a full step. The three core principles it proposed (keep the prefix stable; make context append-only; mask tools, don’t remove them) were later widely cited and adopted across the entire harness engineering field. More importantly, the post opens by answering a key technical roadmap question: should you train an end-to-end agentic model based on open-source models, or should you build agents on top of frontier models’ in-context learning capabilities? Manus chose the latter and proved its viability with product results. This judgment was not consensus in mid-2025; by 2026, it had become the industry’s mainstream approach. For a single technical blog post to achieve this level of foresight and influence is itself proof of the team’s cognitive caliber.
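The three principles translate naturally into code. Below is a minimal sketch under assumed data shapes; this is not Manus’s implementation, and `allowed_tools` stands in for whatever logit-masking or tool-choice constraint a given serving layer supports.

```python
SYSTEM_PROMPT = "...fixed system prompt..."   # byte-identical every turn
TOOL_DEFS = [{"name": "browser"}, {"name": "shell"}, {"name": "deploy"}]

history: list[dict] = []                      # append-only event log

def record(event: dict) -> None:
    # Principle 2: make context append-only. Never rewrite past entries,
    # which would invalidate the KV cache from that point onward.
    history.append(event)

def build_request(allowed: set[str]) -> dict:
    return {
        # Principle 1: keep the prefix stable. The system prompt and the
        # full tool list never change, so the cached prefix keeps hitting.
        "system": SYSTEM_PROMPT,
        "tools": TOOL_DEFS,
        "messages": list(history),
        # Principle 3: mask tools, don't remove them. Availability is
        # constrained at decode time instead of by editing the prompt.
        "allowed_tools": sorted(allowed),
    }
```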

Second, “Manus has been a wrapper from start to finish, with no technical substance.” In April 2026, China’s National Development and Reform Commission invoked the first “prohibit and rescind” order in the five-year history of the Foreign Investment Security Review Measures to block the acquisition. If Manus were truly a wrapper product with no core technology, regulators would have had no reason to deploy their strongest legal instrument to protect it. The regulator determined that this company’s core team, R&D capability, training data, and IP constitute national security assets requiring protection. The weight of that determination exceeds any technical benchmark or media debate.

Cursor: The Only Third-Party Player Training Its Own Models

Cursor faces wrapper accusations similar to Manus’s: the underlying models are someone else’s, and all Cursor built is an editor. But Cursor made a judgment that none of its competitors in the same space made, and built a complete technical moat around that judgment.

Cognitive Gap 1: Judging That a Self-Trained Model Is a Product Necessity, Then Delivering It

The core loop of a coding agent is high-frequency tool calls: reading files, writing code, running commands. Each round carries latency, and the cumulative total directly determines product experience. The Cursor team judged early on that in this use case, relying on external frontier-model APIs could not deliver the interactive experience developers expect at the speed and cost it demands, and that a self-trained model was a product-level necessity that could not be bypassed. In Cursor’s own words on their official blog, their goal was to train the smartest model that can support interactive use, keeping developers in their coding flow.
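Back-of-the-envelope arithmetic shows why. The numbers below are illustrative assumptions, not measurements, but they capture the structure of the cost: per-round model latency multiplies across dozens of tool-call rounds.

```python
rounds = 25                # assumed rounds in one multi-step coding task
frontier_latency = 4.0     # assumed seconds per model round over an API
custom_latency = 0.5       # assumed seconds per round for a fast model
tool_time = 0.5            # file reads/writes/test runs, model-independent

for name, latency in [("frontier API", frontier_latency),
                      ("self-trained", custom_latency)]:
    total = rounds * (latency + tool_time)
    print(f"{name}: {total:.1f}s end to end")
# frontier API: 112.5s end to end
# self-trained: 25.0s end to end
```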

A natural question arises here: the earlier section argued that Manus’s use of external model APIs was the right call, so how can training a custom model be a necessity for Cursor? The distinction lies in the core constraints of their respective domains. In Manus’s domain of general-purpose agents, the key differentiation lives in the agent architecture and context engineering layer; capability differences between underlying models are absorbed by the agent framework. Coding is different: latency and cost directly determine product usability. What the two share is precisely this: both made the correct build-vs.-buy judgment by reasoning from the actual constraints of their own domain.

Having committed to this direction, Cursor delivered, and product experience validated the judgment. After the release of Composer 1, I used it to replace Sonnet 4.5 across a large number of projects. In my experience, for roughly 90% of everyday coding tasks (fixing bugs, writing CRUD, refactoring, adding features), Composer 1 and Sonnet 4.5 showed no meaningful difference in completion quality. The share of daily coding tasks that truly require rocket-science-level reasoning is small; most of the time it’s grunt work where capability gaps between models don’t surface. But the speed advantage was overwhelming: for the same task, Sonnet 4.5 required a wait of one to two minutes, while Composer 1 came back in seconds to low tens of seconds. Similar quality, several times faster. In a high-frequency use case, this experience gap is enormous. This is exactly the judgment Cursor made from the start: in coding, the product experience bottleneck is model speed and cost, not the capability ceiling.

In terms of approach, Cursor did not pretrain a model from scratch. Instead, they took an open-source MoE base and ran large-scale RL post-training in a harness that simulates Cursor’s production environment, training tool-call decision-making and response efficiency into the model.
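Cursor has not published this pipeline, so the sketch below is only schematic: a toy environment stands in for the production-like harness, with a reward shaped to favor solving the task in few, cheap tool calls.

```python
import random

class ToyHarness:
    """Stand-in for a simulated production environment (repo + editor)."""
    def __init__(self, budget: int = 8):
        self.steps, self.budget, self.done = 0, budget, False

    def step(self, action: str) -> float:
        self.steps += 1
        solved = action == "run_tests" and random.random() < 0.3
        self.done = solved or self.steps >= self.budget
        # Efficiency-shaped reward: success bonus minus a per-call cost,
        # so the policy learns to reach a passing state in fewer rounds.
        return (10.0 if solved else 0.0) - 0.5

def collect_episode(policy) -> list[tuple[str, float]]:
    env, trajectory = ToyHarness(), []
    while not env.done:
        action = policy()                # one tool call: read/edit/run
        trajectory.append((action, env.step(action)))
    return trajectory

# e.g. policy = lambda: random.choice(["read_file", "edit", "run_tests"])
# Trajectories like these feed a policy-gradient update; scaled up, that
# is the "large-scale RL post-training in a harness" described above.
```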

A common objection comes up here: isn’t that just fine-tuning?

The five-month evolution from Composer 1 to 2 answers this question. Cursor’s training pipeline went through three iterations, each of which was not simple hyperparameter tuning but a methodological upgrade. The Composer 1 and 1.5 phases followed a pure RL route: large-scale post-training on an open-source base. By Composer 1.5, RL compute had scaled 20x, with post-training consuming more compute than the base model’s pretraining itself, while introducing two new trained behaviors: thinking tokens (adaptive reasoning depth) and self-summarization (automatic long-context compression). But they found diminishing marginal returns on the RL-only route: despite the 20x increase in compute, CursorBench improved by only 6.2 points from 1 to 1.5.

With Composer 2, Cursor made a key methodological pivot: adding continued pretraining before RL to improve the quality of the starting point for RL exploration. The base model was switched to Kimi K2.5 (officially confirmed by Moonshot), with continued pretraining followed by RL, and CursorBench jumped 17.1 points in one move. Composer 2’s technical report states explicitly that it is Pareto-optimal on the capability-cost frontier, with inference costs significantly below comparable models. In other words, Cursor’s post-training pipeline did not just slap a fine-tuning layer on a base model and accept a performance discount; it compressed cost and latency while maintaining comparable coding capability.
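Expressed as a pipeline, the pivot is one added stage. The function names and string bookkeeping below are purely illustrative; only the ordering (continued pretraining before RL) reflects what Cursor described.

```python
def continued_pretrain(checkpoint: str, corpus: str) -> str:
    # Next-token training on domain data: changes the prior that the RL
    # stage will explore from, rather than the exploration itself.
    return f"{checkpoint} -> cpt({corpus})"

def rl_post_train(checkpoint: str, harness: str) -> str:
    # Reward-driven training in the production-like harness.
    return f"{checkpoint} -> rl({harness})"

# Composer 1 / 1.5: pure RL on an open-source base.
composer_15 = rl_post_train("open-source MoE base", "cursor harness")

# Composer 2: continued pretraining first, then the same RL stage.
composer_2 = rl_post_train(
    continued_pretrain("Kimi K2.5", "code corpus"), "cursor harness"
)
print(composer_2)   # Kimi K2.5 -> cpt(code corpus) -> rl(cursor harness)
```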

This methodological self-correction has academic backing. ICML 2025 research (SFT Memorizes, RL Generalizes) and Moonshot’s own Kimi K2 technical report both point in the same direction: pretraining establishes priors, RL conducts efficient exploration on those priors, and continued pretraining changes the quality of the starting point. The Cursor team independently discovered this before Composer 2 and shipped it into their product.

Looking back at competitors’ choices: the AI coding tool space has many startups. Cline is an open-source VS Code extension that connects to various third-party models. Trae is ByteDance’s offering, using Claude 3.5 Sonnet and GPT-4o. Windsurf is from Cognition, formerly Codeium. Their product differentiation comes from UI design, workflow orchestration, and pricing strategy, not from model capability itself. Cognition’s SWE-1.5 (the model behind Windsurf) did RL on an open-source base but did not go as far as continued pretraining. Cline and Trae do no model training at all. In the AI coding tool space, only LLM providers’ first-party products (OpenAI’s Codex, Anthropic’s Claude Code) and Cursor have completed the full four-stage pipeline of base model selection, continued pretraining, RL post-training, and product integration. Cursor is the only third-party startup to have done so. These competitors do not lack effort; what they lacked was the judgment that a self-trained model is a product necessity.

Cognitive Gap 2: Harness Engineering Shipped to Product

Cursor’s cognitive lead also shows in its exploration of harness engineering and agent scaling, and these explorations have shipped directly into the product.

In the self-driving codebases experiment published in February 2026, Cursor used a recursive Planner-Worker architecture to run hundreds of agents in parallel, with peak throughput of approximately 1,000 commits/hour, generating over one million lines of Rust code. The problem this experiment addressed was not “how to make a single agent write good code,” but “how to get 10x meaningful throughput from 10x compute.” The framing of this question alone was ahead of the industry.
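A schematic of recursive Planner-Worker fan-out follows; Cursor’s orchestrator is not public, so the names and the splitting heuristic here are invented. The point is structural: planners split work recursively until leaves are small, then many workers run the leaves concurrently, so throughput scales with compute rather than with single-agent speed.

```python
from concurrent.futures import ThreadPoolExecutor

def expand(task: str, depth: int) -> list[str]:
    # Hypothetical recursive planner: split each task 3 ways per level
    # until the depth budget runs out; the leaves go to workers.
    if depth == 0:
        return [task]
    leaves: list[str] = []
    for i in range(3):
        leaves.extend(expand(f"{task} / part {i}", depth - 1))
    return leaves

def work(task: str) -> str:
    # Stand-in for a worker agent producing one commit for its leaf task.
    return f"commit: {task}"

def run(task: str, depth: int = 3, max_workers: int = 64) -> list[str]:
    leaves = expand(task, depth)          # 3^3 = 27 leaves at depth 3
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work, leaves))

# run("port the layout engine to Rust") yields 27 parallel "commits";
# at production scale the same shape runs hundreds of agents at once.
```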

This experiment later sparked controversy. People checked the public repo and found that none of the recent commits compiled, with all GitHub Actions CI runs failing. Reddit and Hacker News erupted with criticism, accusing Cursor of faking results.

This criticism identified a real gap, but the label “faking” is inaccurate. Cursor proactively discussed these issues in the original blog post. The blog states at the outset that the resulting browser, the experiment’s output artifact, is not intended for external use and that code quality imperfections are expected. The body specifically analyzes why the system design must accept a certain error rate: when they demanded 100% correctness per commit, the system’s effective throughput collapsed, agents went out of scope to fix unrelated issues, and multiple agents stepped on each other. The blog also addresses the dependency hallucination problem (agents pulling in dependencies they shouldn’t use) and explains subsequent corrective measures. Cursor did not get caught and then make excuses; they laid out failure patterns for analysis alongside the experimental results.

The more reasonable reading of this episode is: this was a frontier experiment in agent spatial scalability, and Cursor honestly documented the current capability boundaries and failure modes in their blog. The code quality of the experimental output was indeed not good enough, and the public repo’s state was worse than the blog’s language suggested. But the questions the experiment addressed (throughput scaling laws for multi-agent parallelism, error rate control, task allocation strategies) have not been explored at comparable scale by any other player in the industry.

During the same period, Cline was optimizing single-agent permission controls and tool-call flows. Trae was polishing the Builder Mode scaffolding experience. This work all has value, but the questions they were answering and the questions Cursor was answering are on different levels. Cursor was already thinking about spatial scalability of agents while competitors were still optimizing single-agent interaction quality.

Cursor also introduced background agents and parallel task execution in its product early on, surfacing these capabilities directly in the UI. While most AI coding tools were still debating single-turn conversation quality, Cursor was already solving the further-out problem of how agents continue working after the user walks away. This product judgment did not begin appearing in similar explorations from Cline, Trae, and other competitors until mid-2026.

Responding to the Wrapper Thesis

LLM providers’ first-party products (OpenAI Codex, Anthropic Claude Code, Google Gemini Code Assist) have invested far more resources than Cursor, yet Cursor still competes with them head-on at the product level. Fortune magazine’s March 2026 analysis was titled “Cursor’s crossroads,” and the question it explored was not whether Cursor can compete with LLM providers, but how it maintains its edge now that LLM providers have fully entered the arena. A true wrapper product would not be discussed in those terms. SpaceX/xAI’s $60 billion offer is not for an editor skin.

The Common Pattern

Placing Manus and Cursor side by side, the common pattern becomes clear.

First, both teams hold a cognitive lead over their respective industries. On Manus’s side: its understanding of LLM fundamentals (no hat-wearing, Wide Research), its judgment on User Generated Software (the complete pipeline of creation plus deployment plus intelligence injection), and its grasp of end-to-end agent capability composition. On Cursor’s side: its understanding of the post-training pipeline (the complete four-stage chain), its grasp of harness engineering (spatial scalability), and its understanding of agent work modes (background agents and parallel execution). All of these were at least six months to a year ahead of peers. Comparisons with competitors confirm this: the problems these two companies were solving and the problems their competitors were solving during the same period were frequently on different levels. This lead was not achieved by throwing more resources at the problem. It comes from a fundamentally different understanding of AI as a medium.

Second, cognitive lead ultimately converted into outcome differentiation. Manus reached $100M ARR and a leading GAIA benchmark score. Cursor achieved the strongest position in the AI coding tool space outside of LLM providers’ first-party products. Both outcomes were endorsed with real money by top-tier global buyers: Meta at $2 billion, SpaceX/xAI at $60 billion.

Third, both teams faced wrapper accusations. These accusations reflect a taxonomy of technical value that is becoming obsolete. In 2023, whether a company trained its own model was the core criterion for judging an AI company’s technical substance. By 2025, training pipeline design, agent architecture engineering, context engineering methodology, and harness engineering practice had risen to equal or greater importance alongside self-trained models. Evaluating 2025 products with a 2023 framework systematically underestimates the technical depth of companies like these.

The challenges each company faces are also real. Manus’s acquisition has been blocked by the NDRC, creating significant path uncertainty. Cursor faces head-on competition from Claude Code and Codex, and key engineering talent has already begun flowing to xAI. These challenges do not change one fact: the technical judgment these two teams have demonstrated in the agentic AI era places them in the first tier of the entire industry. The prices Zuckerberg and Musk offered are not impulse purchases. They are how the market priced that judgment.