Sequoia recently published two articles worth reading back to back. One is Julien Bek’s Services: The New Software, the other is Jack Dorsey and Roelof Botha’s From Hierarchy to Intelligence. After reading both, one conclusion becomes increasingly clear: many AI technical capabilities have already arrived, but the organizational interfaces, business logic, and evaluation systems have not caught up.
By “organizational interface,” I mean more than reporting lines. It includes who defines objectives, who judges whether results meet the bar, who has the authority to let a system execute, and who bears the consequences when something goes wrong. Model capability keeps advancing and costs keep falling, but other parts move on much slower timelines. Budget ownership still follows legacy departmental lines. Authorization boundaries are hardcoded into contracts and processes. Audit requirements are set by regulators and industry standards. Professional liability sits with licensed individuals. Each of these constraints evolves on its own schedule and will not automatically keep pace just because models get stronger.
These two Sequoia articles are worth reading because they articulate this gap clearly. One addresses the shift in the unit of product; the other addresses the shift in the unit of organization. Read together, they are two sides of the same phenomenon.
Below, I will lay out what each article argues, then discuss their shared implications and the questions that remain open.
Julien Bek’s core thesis: copilot sells a tool; autopilot sells the work itself.
He uses a straightforward example. A company spends $10,000 a year on QuickBooks and $120,000 on an accountant. The next-generation company does not sell a better bookkeeping tool for the finance team to operate. Instead, it delivers the entire workflow — bookkeeping, reconciliation, month-end close, financial statements — as a finished output. The customer is no longer buying a tool; they are buying the outcome: books closed, reports delivered.
The article develops along two dimensions. The first is the distinction between intelligence and judgement. Intelligence refers to work that is complex but ultimately rule-coverable — writing code, translating specifications, medical coding. Judgement involves decisions that require experience and taste, such as deciding which feature to build next. Bek argues the boundary between the two is shifting: today’s judgement becomes tomorrow’s intelligence, but the conversion is gradual.
The second dimension is go-to-market strategy. Bek’s view is that the entry point for autopilot is not replacing internal headcount directly, but replacing existing outsourcing contracts. The logic is straightforward: if a task is already outsourced, the company has already accepted external execution, there is an existing budget line, and the buyer is already paying for outcomes. Replacing an outsourcing vendor is a supplier switch; replacing internal headcount is an organizational change. The former meets far less resistance. He cites a data point: for every dollar enterprises spend on software, they spend six dollars on services. By that ratio, autopilot’s addressable market is far larger than SaaS.
On this basis, Bek provides a market map ranking a dozen-plus vertical sectors by intelligence share and outsourcing ratio: insurance brokerage, accounting and audit, medical billing, tax advisory, legal documentation, managed IT, recruiting, and so on. For each sector, he lists companies already operating in the space — Harvey, WithCoverage, Anterior, among others.
This article makes concrete what many people have only vaguely sensed: the unit of pricing is shifting from seats and features to workflows and outcomes. For anyone building an AI product, this framework forces a reexamination of what exactly you are selling.
Jack Dorsey and Roelof Botha’s article approaches from the organizational side and advances a bolder thesis: hierarchy is a two-thousand-year-old information routing protocol, and AI can directly replace its coordination function.
The article begins with the Roman legion. The Roman military organized soldiers into eight-person squads, centuries, cohorts, and legions, with officers at each level responsible for aggregating information and issuing orders. The core constraint of this organizational form is span of control: the number of direct reports a single person can effectively manage is limited, so deeper hierarchies mean slower information flow. The Prussian General Staff was the prototype of modern middle management. American railroads gave rise to the org chart. Taylor turned the factory into a scientific management laboratory. The Manhattan Project demonstrated the power of cross-functional coordination. McKinsey popularized the matrix organization. For two thousand years, every organizational innovation has been a tradeoff made under the span-of-control constraint.
Dorsey and Botha’s claim is that AI can break this constraint. Their approach is not to give every person a copilot, but to restructure the entire company as an intelligent agent. Using Block as the primary example, they describe a four-layer architecture: capability atoms (payments, lending, card issuance — basic building blocks), a world model (replacing management’s information routing function, encompassing both a company world model and a customer world model), an intelligence layer (combining capabilities for a specific customer at a specific moment), and interfaces (Cash App, Square, and other delivery surfaces).
One detail in the article is worth flagging. Block describes a scenario: a Square merchant’s tax filing deadline is approaching while its Cash App lending limit is about to clear approval. In a traditional company, these two events belong to different departments with no awareness of each other. In Block’s architecture, the intelligence layer recognizes this moment and proactively combines the tax tool with the lending capability, pushing the combined offer to the merchant. No product manager decided to build this feature. The capabilities already existed; the intelligence layer identified the timing and composed them.
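The composition step can be sketched as a small rule over shared state. Everything below is hypothetical: the merchant fields, the capability names, and the triggering rule are illustrative stand-ins, not Block's actual system.

```python
from datetime import date

# Hypothetical capability atoms: each is just a callable the intelligence
# layer can compose. Names are illustrative, not Block's actual APIs.
CAPABILITIES = {
    "tax_filing": lambda m: f"Prepared filing for {m['name']}",
    "working_capital_loan": lambda m: f"Offered ${m['approved_limit']:,} loan to {m['name']}",
}

def intelligence_layer(merchant, today):
    """Scan the shared world model for moments where capabilities combine."""
    offers = []
    days_to_deadline = (merchant["tax_deadline"] - today).days
    # The composed moment: tax deadline near AND lending limit just approved.
    if days_to_deadline <= 30 and merchant["lending_approved"]:
        offers.append(CAPABILITIES["tax_filing"](merchant))
        offers.append(CAPABILITIES["working_capital_loan"](merchant))
    return offers

merchant = {
    "name": "Corner Cafe",
    "tax_deadline": date(2025, 4, 15),
    "lending_approved": True,
    "approved_limit": 20000,
}
print(intelligence_layer(merchant, date(2025, 3, 20)))
```

The point of the sketch is that no product manager wrote a "tax-plus-loan offer" feature; the offer falls out of a rule evaluated against shared state across what would traditionally be two departments' data.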
On the organizational side, they collapse roles into three types: ICs (deep specialists, with the world model providing context that previously only managers possessed), DRIs (cross-domain owners on fixed terms, such as 90-day rotations), and player-coaches (replacing traditional managers by combining hands-on work with mentorship). The article states explicitly: permanent middle management layers are no longer necessary.
The product roadmap is also rewritten. The traditional roadmap has PMs and management prioritizing based on judgement. Block’s approach is to observe where the intelligence layer fails: when it cannot compose a solution, that failure signal becomes the roadmap.
This article is worth reading because it is not making vague claims about AI changing organizations. It provides a concrete architectural description and role definitions. It pulls organizational design back from management philosophy into the domain of information systems design, and it makes many previously abstract discussions concrete enough to actually work through.
Read together, the two articles converge on a common judgement.
Bek describes the product-side shift: the unit of pricing for AI products is moving from tools to work outcomes. Dorsey and Botha describe the organizational-side shift: the coordination mechanism inside companies is moving from human hierarchy to world models and intelligence layers. Both rest on the same premise: intelligence has become cheap enough to substitute for human coordination, execution, and judgement.
Bek reasons from product economics to conclude that autopilot is the larger market. Dorsey reasons from information theory to conclude that hierarchy can be replaced by AI. One is about the interface between company and customer; the other is about the coordination interface within the company. Same direction, different cross-sections.
To make the shorthand explicit: "Dorsey" here refers to Dorsey and Botha's joint argument.
This direction is most likely correct. Model capability is improving, costs are declining, and economic pressure is driving the shift from assistance to execution. But both articles share a common characteristic: they describe this transition as primarily driven by technical capability. The model is strong enough, so you can sell outcomes; the model is strong enough, so you can replace hierarchy.
My view is that what actually determines the speed of this transition is not technical capability alone, but the still-unformed middle layer between technology and organizational interfaces. That is what the following questions are about.
Technically, moving from copilot to autopilot may be a matter of climbing the capability gradient. But at the organizational and institutional level, it is a fundamental shift. The issues involved extend well beyond legal and compliance, although legal and compliance are an obvious part of it.
The move from recommendation to execution is a transfer of liability. In copilot mode, the AI is a tool and the human is responsible for the output. A lawyer uses Harvey to draft a contract; the lawyer signs it. An accountant uses AI to do the books; the CPA stamps them. The chain of responsibility is no different from using Word or Excel. In autopilot mode, the AI is selling outcomes. When an outcome is wrong, the party the customer holds accountable shifts from the person who used the tool to the system that delivered the result. But behind that system, there is no professional qualification, professional insurance, or legal entity in the traditional sense. This gap is barely addressed in either article.
Liability attribution is just the starting point. The deeper issues are the following.
How do you know the AI performs well enough to make decisions? This question sounds simple but is extremely hard to answer. Bek mentions that legal work products are sufficiently standardized that quality can be verified. But there is a wide gap between standardized and verifiable. In accounting, correctness has a clear definition — standards like GAAP and audit criteria provide a reference frame. But in insurance quoting, legal strategy advice, and hiring screening, what counts as “meeting the bar” is itself a judgement call, currently made by experienced professionals. When AI takes over execution, who defines that quality standard, how is it calibrated, and how often is it updated?
How do you build evaluation benchmarks? Quality assessment in many service domains does not have the deterministic pass/fail criteria of software testing. Whether an insurance quote is reasonable depends on understanding the risk. Whether a legal recommendation is sound depends on judgement about the client’s specific situation. Building these benchmarks requires domain expert involvement, extensive historical case review, and systematic analysis of edge cases. This is more akin to work done by traditional consulting firms and industry associations than something an engineering team can complete independently. Bek’s market map ranks verticals by intelligence share but does not discuss the maturity of evaluation systems in each domain. That maturity may be the key variable determining how quickly autopilot can actually be deployed.
This work increasingly resembles management cognition, not software engineering. When we discuss how AI should define execution boundaries, perform quality acceptance, and allocate resources, these questions are highly isomorphic to traditional operations design: defining SLAs, designing escalation processes, establishing sign-off authority. Dorsey and Botha’s article re-describes them in technical language — world model, intelligence layer, failure-driven backlog — but the underlying questions do not disappear because the terminology changed. If you translate Block’s architecture back into traditional management language, it is an automated operational decision system plus a very flat human governance layer. The technical parts (models, APIs, data pipelines) can be engineered, but the governance parts (who sets standards, who handles exceptions, who bears consequences) are fundamentally questions of organizational design and business judgement.
Who ultimately owns business judgement? Bek distinguishes intelligence from judgement and argues that judgement will gradually be converted into data. Dorsey and Botha describe a world where failure signals automatically generate the roadmap. But in actual business environments, many important judgements cannot be reduced to data and rules. In insurance, whether to underwrite a borderline case involves actuarial models, commercial strategy, and client relationships. In law, whether to pursue a particular litigation strategy requires weighing legal risk, time cost, and the client’s business objectives. These judgements are currently made by qualified, insured, and accountable professionals. When AI takes over these judgements, the question is not just legal attribution — it is who confirms the judgement is correct, on what basis, and who bears the downside when the confirmation is wrong.
How generalizable is the Block case? Dorsey and Botha use Block as the core example, but Block has several conditions most companies lack. It is a two-sided transaction platform — Cash App plus Square — with visibility into both buyer and seller on every transaction. Financial transaction data is among the highest signal-to-noise behavioral data available. Block’s world model is built on an exceptionally high-quality data foundation. For companies without comparable high-frequency, structured data, what signals does the world model rely on? If the available signal quality is insufficient, the world model may degrade into a fancier dashboard: it looks like the intelligence layer is coordinating, but humans are still making the critical judgements — just through a different interface. That is not necessarily bad, but it is a far cry from the “company as intelligent agent” vision the article describes.
These problems do not have ready-made answers, but they are not entirely intractable either. Several directions are worth sustained attention.
Evaluation systems may be the real infrastructure gap. The most critical precondition for moving from copilot to autopilot is not model capability — it is having a credible methodology for judging whether model output meets the bar. This evaluation system takes entirely different forms in different domains: accounting has GAAP, medical coding has ICD-10, but evaluation benchmarks for legal advice, management consulting, and hiring decisions are far from mature. Whoever establishes credible evaluation standards in a given vertical first will be closest to capturing that domain’s copilot-to-autopilot entry point. The difficulty here is that it requires deep integration of domain expertise and technical capability, and the results are not flashy: it is neither a stronger model nor a better-looking demo.
Authorization and audit may need to be productized. Today, most AI products define execution boundaries through informal human agreement or vague system prompts. But as AI moves from recommendation to execution, execution boundaries need to become part of the system: configurable, traceable, auditable. Similarly, after AI takes an action, there needs to be a complete record for retrospective review — what it saw, why it acted, what state it changed. Many companies currently treat these capabilities as compliance add-ons, but over the long term, authorization management and audit trails may themselves be capabilities customers are willing to pay for, rather than accessories to autopilot.
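What "productized" boundaries could mean concretely: a minimal sketch, assuming invented policy fields, action names, and log schema. The idea is only that the boundary lives in configuration rather than prompt text, and that every attempted action leaves a record of what it saw and what the rules were at the time.

```python
import json
from datetime import datetime, timezone

# Hypothetical sketch: execution boundaries as configuration, plus an
# append-only audit record for every attempted action.
POLICY = {"max_payment_usd": 5000, "allowed_actions": {"refund", "invoice"}}

audit_log = []

def execute(action, amount_usd, agent_id):
    allowed = (action in POLICY["allowed_actions"]
               and amount_usd <= POLICY["max_payment_usd"])
    audit_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent_id,
        "action": action,
        "amount_usd": amount_usd,
        "decision": "executed" if allowed else "blocked",
        # Snapshot the policy so reviewers see the rules as they were then.
        "policy_snapshot": json.dumps(POLICY, default=list),
    })
    return allowed

execute("refund", 1200, "agent-7")          # within bounds: runs
execute("wire_transfer", 1200, "agent-7")   # not an allowed action: blocked
print([e["decision"] for e in audit_log])
```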
The boundary between intelligence and judgement is dynamic, but moves at vastly different speeds across domains. Bek acknowledges that today’s judgement becomes tomorrow’s intelligence, but does not discuss what conditions this conversion requires. In code, the conversion happens quickly because code has clear correctness criteria — it compiles, tests pass. In law, insurance, and management consulting, the conversion is much slower, for different reasons in each case: law is constrained by case law systems and jurisdictional variation; insurance is constrained by actuarial standards and regulatory approval; management consulting output quality depends heavily on understanding the client’s specific context, with no universal verification baseline. For builders, choosing which domain and which approach to accelerate this conversion may be a more precise product question than “upgrade copilot to autopilot.”
There may be an underestimated transitional form between copilot and autopilot. Not humans handing execution authority directly to AI, and not humans approving every action, but a hybrid mode of AI execution, system constraints, and human oversight. In this mode, AI executes autonomously within well-defined boundaries. Those boundaries are defined by business rule systems rather than system prompts. Anomalies automatically escalate to humans. The human role shifts from item-by-item approval to defining rules and handling exceptions. This may be closer to commercial viability than pure autopilot, and more valuable than pure copilot.
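The hybrid mode has a simple skeleton: a band where the AI acts alone, a band that escalates to a human queue, and a hard boundary beyond which nothing executes. The thresholds and request shape below are assumptions for illustration.

```python
# Hypothetical sketch of the hybrid mode: autonomous execution inside
# rule-defined boundaries, with anomalies escalated to a human queue.
RULES = {"auto_approve_below_usd": 1000, "hard_reject_above_usd": 50000}

human_queue = []

def handle(request):
    amount = request["amount_usd"]
    if amount > RULES["hard_reject_above_usd"]:
        return "rejected"                      # outside any boundary
    if amount < RULES["auto_approve_below_usd"]:
        return "auto_approved"                 # AI acts alone
    human_queue.append(request)                # anomaly band: escalate
    return "escalated"

results = [handle({"amount_usd": a}) for a in (300, 7000, 80000)]
print(results)           # ['auto_approved', 'escalated', 'rejected']
print(len(human_queue))  # one case waiting on a human
```

In this mode the human's job is visible in the code: it is not in `handle` at all, but in setting `RULES` and working the queue.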
Context infrastructure may be the long-term moat. If intelligence continues to commoditize, the truly scarce resource may not be at the model layer but in auditable business context, verifiable execution history, and authorized operational boundaries. Accumulating these assets takes time, domain depth, and deep integration with the customer’s business processes — none of which a competitor can easily replicate. Model capability is a function of compute and capital. These context assets are a function of time and trust.
These two Sequoia articles describe a direction in which AI delivers work outcomes directly and intelligence replaces hierarchical coordination. This direction is most likely correct — economic pressure, model capability, and customer expectations are all pushing it forward.
But for me, the most valuable aspect of these articles is not the completeness of the endgame they envision. It is that they expose what is genuinely hard: when technical capability is already in place, how do organizational interfaces, evaluation systems, liability attribution, and ownership of business judgement keep up?
This is not just a legal and compliance problem, nor just an engineering problem. It involves how evaluation is defined, how benchmarks are built, how execution boundaries are managed, and how ownership of business judgement is allocated. The answers to these questions are more likely to come from domain experience and organizational design than from pure software engineering.
What is worth watching is not when autopilot fully replaces copilot, but which infrastructure layers — evaluation, authorization, audit, context — mature first during this transition, and how our own workflows and product designs should prepare for it.