Date: 2026-03-14
Core Sources: OpenAI "Unrolling the Codex agent loop" (Michael Bolin, 2026-01); OpenAI "Unlocking the Codex harness" (Celia Chen, 2026-02); The Pragmatic Engineer "How Codex is built" (Gergely Orosz, 2026-02)
Supplementary Sources: Zenn Source Code Analysis Series (takiko); ZenML Architecture Case Study; Blake Crosley Architecture Comparison; Ars Technica Technical Report; InfoQ App Server Report; Morph Harness Comparison; Reddit Community Observations
Related: Harness Engineering Survey
By early 2026, three major coding agents—Codex CLI, Claude Code, and Gemini CLI—formed a tripolar landscape. Among them, Claude Code is entirely closed-source, and Gemini CLI is open-sourced in TypeScript but thinly documented. Only Codex CLI pairs fully open-sourced core logic with a series of technical blog posts by the engineers who built it. The repository is at github.com/openai/codex, with 4,547 commits at the time of writing, implemented in Rust under the Apache 2.0 license.
This is more than just “being able to look at the code.” OpenAI engineers Michael Bolin and Celia Chen wrote deep dives on the agent loop and the App Server protocol, respectively. The Pragmatic Engineer interviewed Thibault Sottiaux, the Codex team lead, and Japanese developer takiko published a function-by-function source code analysis on Zenn. Together, these materials make Codex CLI the most scrutinizable and well-documented production-grade AI agent client implementation available.
For teams building or customizing AI agent toolchains, the value of Codex CLI lies not in directly reusing its code (the Rust barrier and OpenAI API binding are limitations) but in the specific, production-proven design decisions it exposes. The tradeoffs behind these decisions are more worth studying than the code itself.
The architecture of Codex CLI can be understood as a four-layer stack, with each layer having clear responsibility boundaries:
The Surface Layer handles access. The TUI (Terminal User Interface), App Server (a JSON-RPC service for IDEs and Web calls), MCP Server (for other agents to call), and SDK (for CI/CD and scripts) are different implementations of the surface layer. They share the same core but differ in interaction modes.
The Session Layer manages state. Thread creation, resumption, forking, archiving, configuration loading and switching, and authentication flows (including ChatGPT OAuth login) are handled here. Celia Chen’s blog refers to this as “the full agent experience beyond the core loop.”
The Core Layer is the agent loop itself, located in
codex-rs/core/. This is the heart of the system: receiving
user input, assembling prompts, calling models, handling tool calls, and
managing the context window. All Codex experiences (CLI, Web, VS Code,
macOS App, JetBrains, Xcode) share this same core.
The Execution Layer is responsible for actual operations: sandbox isolation, shell command execution, file editing, and MCP tool scheduling.
A key feature of this layering is that the core layer does
not know which surface layer it is running in.
codex-rs/core is a pure Rust library that communicates with
the outside world via async channels and an event protocol. The TUI is
one consumer, as are the VS Code plugin and the Web interface. This
decoupling allows OpenAI to quickly build Codex experiences for new
platforms without modifying the core logic.
According to takiko’s source code analysis, the codex-rs
agent loop adopts an event-driven architecture, with two asynchronous
channels serving as its circulatory system.
The Submission channel flows from the client to the
session, passing user operations (Op). When the TUI
receives keyboard input, it packages the content as
Op::UserInput and sends it to this channel.
The Event channel flows from the session to the client, distributing events. Text fragments of AI responses, tool call execution results, and error notifications all reach the TUI (or other clients) through this channel for rendering.
The benefit of this dual-channel design is natural asynchrony: users can continue typing (e.g., to cancel an operation) while the agent is still executing, and the session can execute long-running tool calls without blocking the UI.
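The shape of this dual-channel design can be sketched with ordinary queues and a worker thread. This is a minimal Python sketch, not the actual Rust implementation: the `Op`/`Event` names mirror the types takiko describes, while the dict payloads and echo behavior are purely illustrative.

```python
import queue
import threading

# Simplified stand-ins for the Rust Op / Event types described above.
submissions = queue.Queue()  # Submission channel: client -> session (Op)
events = queue.Queue()       # Event channel: session -> client (Event)

def session_loop():
    """Consumes Ops and emits Events; long work here never blocks the UI."""
    while True:
        op = submissions.get()
        if op["type"] == "shutdown":
            break
        if op["type"] == "user_input":
            # A real session would stream model output here; we just echo.
            events.put({"type": "agent_message_delta",
                        "text": f"echo: {op['text']}"})
            events.put({"type": "task_complete"})

worker = threading.Thread(target=session_loop)
worker.start()

# The client can keep enqueuing (e.g. a cancel Op) while the session works.
submissions.put({"type": "user_input", "text": "hello"})
submissions.put({"type": "shutdown"})
worker.join()

received = []
while not events.empty():
    received.append(events.get())
print(received[-1]["type"])  # task_complete
```

Because the two queues are independent, neither side ever waits on the other to make progress, which is exactly the "natural asynchrony" property described above.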
The agent loop itself is a three-layer nested structure.
The outermost layer is the submission_loop, a persistent
infinite loop that waits for Op and dispatches processing.
It runs until it receives Op::Shutdown. In addition to
UserInput, it handles other operation types like
UserTurn (when a user continues the conversation after the
agent finishes its work).
The middle layer is handler scheduling. When the
submission_loop receives Op::UserInput or
Op::UserTurn, it calls
handlers::user_input_or_turn to start a round of
dialogue.
The innermost layer is the turn loop, the core cycle
of a single round of dialogue. Briefly, it: assembles the prompt → calls
the Responses API (streaming) → processes stream events. If the model
outputs a function_call type event, it executes the
corresponding tool, appends the result back to the prompt, and calls the
API again. If the model outputs a plain text assistant message (without
a tool call), the round is considered complete, and control is handed
back to the user.
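The innermost turn loop reduces to a short control structure. The following Python sketch captures the cycle described above under illustrative assumptions: `call_model` and `execute_tool` are hypothetical stand-ins, and the item shapes only loosely follow the Responses API.

```python
def run_turn(history, call_model, execute_tool):
    """Minimal sketch of the turn loop: keep calling the model, feeding
    tool outputs back into the prompt, until it replies with plain text."""
    while True:
        response = call_model(history)  # full prompt on every call
        history.append(response)
        if response["type"] == "function_call":
            output = execute_tool(response["name"], response["arguments"])
            # The tool result goes back into the prompt for the next call.
            history.append({"type": "function_call_output", "output": output})
        else:
            # A plain assistant message ends the round; control returns
            # to the user.
            return response["text"]

# A scripted fake model: one shell call, then a final answer.
script = iter([
    {"type": "function_call", "name": "shell", "arguments": "ls"},
    {"type": "message", "text": "done"},
])
result = run_turn(
    history=[{"type": "user", "text": "list files"}],
    call_model=lambda h: next(script),
    execute_tool=lambda name, args: f"ran {name} {args}",
)
print(result)  # done
```

Note that nothing in the loop bounds the number of iterations, which is why context management (compaction) has to live inside the loop rather than around it.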
An important feature is that there is no hard limit on the number of inference ↔︎ tool-execution cycles within a single round. An agent can execute dozens or even hundreds of tool calls in one round until it decides the task is complete. This means context window management becomes a core responsibility of the agent loop.
Codex CLI sends HTTP requests to the Responses API to perform
inference. A key design decision is that every request sends the
full conversation history and does not use the
previous_response_id parameter.
Bolin explained this choice in his blog. The Responses API provides
an optional previous_response_id parameter that allows the
server to store dialogue state, so the client only needs to send
increments. Codex does not use it for three reasons: to simplify
implementation complexity on the API provider side, to more easily
support Zero Data Retention (not storing user data on the server), and
to more easily adapt to non-OpenAI providers (any endpoint implementing
the Responses API can be used).
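Concretely, "stateless" means every request carries the whole transcript and `previous_response_id` is simply never set. A sketch of the request shape, where the field names follow the public Responses API but the `build_request` helper is hypothetical:

```python
def build_request(model, instructions, history):
    """Each call resends the entire conversation; the server stores nothing.
    Any endpoint implementing the Responses API shape can serve this."""
    return {
        "model": model,
        "instructions": instructions,   # system prompt layer
        "input": list(history),         # FULL history, not a delta
        "stream": True,
        # deliberately absent: "previous_response_id"
    }

history = [{"role": "user", "content": "fix the failing test"}]
req1 = build_request("gpt-5", "You are Codex CLI...", history)

history.append({"role": "assistant", "content": "Running pytest..."})
history.append({"role": "user", "content": "now add a regression test"})
req2 = build_request("gpt-5", "You are Codex CLI...", history)

# Each later request is strictly larger: the cost of statelessness.
print(len(req1["input"]), len(req2["input"]))  # 1 3
```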
The cost is obvious: as the conversation grows, the number of tokens in each request continues to swell. This directly necessitates a compaction mechanism.
When the token count exceeds a threshold, Codex automatically
triggers compaction. This is implemented in the
codex-rs/core/src/context_manager module. This module is
responsible for pairing tool calls with outputs, truncating oversized
payloads, and then calling a specialized API endpoint to compress the
history. The compressed content is retained in the prompt as an
encrypted content item, from which the model can recover its
“understanding” of previous work.
Early versions required users to manually execute the
/compact command. The current version is fully automated
and transparent to the user.
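The trigger logic is simple to sketch even though the summarization itself is a proprietary endpoint. In this Python sketch the thresholds, token counting, and keep-last-exchange policy are all illustrative assumptions, not the real `context_manager` behavior:

```python
def maybe_compact(history, count_tokens, summarize, limit=200_000, ratio=0.8):
    """If the transcript nears the context limit, replace everything but
    the most recent exchange with a single summary item."""
    used = sum(count_tokens(item) for item in history)
    if used < limit * ratio:
        return history  # still under the soft threshold; no-op
    head, tail = history[:-2], history[-2:]  # keep the latest exchange verbatim
    # In the real system this calls a specialized API endpoint and keeps the
    # result as an encrypted content item; here it is a plain string.
    summary = {"type": "summary", "content": summarize(head)}
    return [summary] + tail

history = [{"type": "message", "content": "x" * 50} for _ in range(10)]
compacted = maybe_compact(
    history,
    count_tokens=lambda item: len(item["content"]),
    summarize=lambda items: f"<{len(items)} items compressed>",
    limit=500, ratio=0.8,  # tiny numbers so the sketch actually triggers
)
print(len(compacted))  # 3
```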
Notably, Ars Technica reported that this compaction mechanism relies
on OpenAI’s proprietary API endpoint. If using --oss mode
with Ollama, this feature may be unavailable. This is an implicit cost
of the stateless design: while theoretically provider-agnostic, certain
advanced features remain coupled with OpenAI’s infrastructure.
Before each model call, Codex assembles a prompt containing multiple layers of information. Understanding this assembly process is key to understanding Codex’s behavior.
ZenML’s case analysis detailed the assembly order. The first to be injected is the system prompt, a hard-coded identity definition in the client that tells the model, “You are Codex CLI, a terminal-based coding assistant.” takiko’s source code analysis reconstructed this prompt, which includes a list of available operations (receiving user prompts, streaming responses, executing shell commands, applying patches, working in a sandbox, etc.).
Next come optional developer instructions from the user's config.toml. Then come cascaded user instructions: the system first reads the global AGENTS.md in the $CODEX_HOME directory, then reads AGENTS.md and AGENTS.override.md at each level along the path from the git root to the current working directory. More specific instructions override more general ones, with the total volume constrained by a 32 KiB default limit. If Skills are configured, their content is also injected at this stage.
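The cascade can be expressed as a small pure function. This Python sketch uses the filenames and 32 KiB budget from the text, but takes an in-memory `files` map instead of touching the real filesystem, and its override ordering is a simplifying assumption:

```python
def cascade_instructions(files, git_root, cwd, budget=32 * 1024):
    """Walk from the git root down to cwd, collecting AGENTS.md and
    AGENTS.override.md at each level. Deeper files come later, so the
    model reads them as overrides; the total is capped by a byte budget.
    `files` maps path -> content so the sketch needs no real filesystem."""
    rel = cwd[len(git_root):].strip("/")
    dirs = [git_root]
    for part in rel.split("/"):
        if part:
            dirs.append(dirs[-1] + "/" + part)

    collected = []
    for d in dirs:
        for name in ("AGENTS.md", "AGENTS.override.md"):
            content = files.get(d + "/" + name)
            if content:
                collected.append(content)

    blob = "\n\n".join(collected)
    return blob[:budget]  # hard cap, mirroring the 32 KiB default limit

files = {
    "/repo/AGENTS.md": "Global: run `make test` before finishing.",
    "/repo/api/AGENTS.md": "API module: never change public signatures.",
}
result = cascade_instructions(files, "/repo", "/repo/api")
print(result.endswith("signatures."))  # True
```

The budget cap falls on the deepest (most specific) instructions last, which is one plausible reading of "more specific overrides more general"; the real priority rules may differ in detail.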
Then comes the tool list. Codex tools fall into
three categories: built-in tools (shell execution, file editing), tools
returned by the API, and tools exposed by MCP servers. All three
categories are listed uniformly in the tools field of the
prompt, and the model autonomously chooses which to use during
inference.
Finally, there is the environmental context (file tree, git status) and the full conversation history.
This cascaded design directly corresponds to a core finding in harness engineering research: “AGENTS.md should be a directory, not an encyclopedia.” Codex’s prompt assembly mechanism, through the 32 KiB limit and path cascading, naturally encourages users to distribute information across the directory structure rather than piling it into one massive file. This enforces harness engineering best practices at the system level.
Codex’s sandbox design is the most fundamental architectural difference between it and Claude Code, and it best reflects their differing design philosophies.
On macOS, Codex uses Seatbelt (Apple’s sandboxing framework, the same technology used for App Store apps) to restrict processes generated by the agent. On Linux, it uses Landlock + seccomp (kernel-level access control and system call filtering). This means restrictions occur at the operating system level, and the agent’s code cannot bypass them through application-layer means.
codex-rs/core/src/protocol.rs defines a
SandboxPermission enum: DiskFullReadAccess
(can read any file), DiskWriteTempOnly (can only write to
temporary directories), NetworkFullAccess (can access the
network), etc. These permissions are combined into three preset modes:
read-only, workspace-write, and
danger-full-access.
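The relationship between individual permissions and preset modes can be sketched as a lookup. The first three enum names below mirror those reported from protocol.rs; `DISK_WRITE_CWD` and the exact composition of each preset are hypothetical illustrations:

```python
from enum import Enum, auto

class SandboxPermission(Enum):
    # First three names mirror codex-rs/core/src/protocol.rs as described
    # in the text; DISK_WRITE_CWD is a hypothetical addition for the sketch.
    DISK_FULL_READ_ACCESS = auto()
    DISK_WRITE_TEMP_ONLY = auto()
    DISK_WRITE_CWD = auto()
    NETWORK_FULL_ACCESS = auto()

# Illustrative guess at how permissions compose into the three presets.
PRESETS = {
    "read-only": {SandboxPermission.DISK_FULL_READ_ACCESS},
    "workspace-write": {
        SandboxPermission.DISK_FULL_READ_ACCESS,
        SandboxPermission.DISK_WRITE_CWD,
    },
    "danger-full-access": set(SandboxPermission),  # everything, no guardrails
}

def allowed(mode, perm):
    return perm in PRESETS[mode]

print(allowed("read-only", SandboxPermission.NETWORK_FULL_ACCESS))  # False
```

The key point is that users pick a preset, not individual permissions: the composition happens inside the client, which is precisely the "low programmability" Crosley's comparison describes.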
Blake Crosley’s architecture comparison article accurately summarized this difference. Codex’s sandbox rejects syscalls at the kernel level, making escape difficult but programmability low (you can only choose preset modes and cannot write custom rules). Claude Code uses application-layer hooks (17 types of lifecycle events), where escape difficulty is medium (hooks and the agent share a process boundary), but programmability is high (hooks can run any bash or Python).
This tradeoff reflects two different security philosophies. In the Pragmatic Engineer interview, Thibault Sottiaux ("Tibo"), the Codex team lead, put it bluntly: “We take a stance with the sandboxing that hurts us in terms of general adoption. However, we do not want to promote something that could be unsafe by default.”
In other words, Codex chose “secure but inflexible,” while Claude Code chose “flexible but requires the user to be responsible for security.” For security-sensitive scenarios (finance, healthcare, government), Codex’s approach is more persuasive. For scenarios requiring highly customized approval logic, Claude Code’s hooks are more powerful.
An analysis article from the OpenCode community (“Building Sandboxes into OpenCode”) pointed out a known weakness in the Codex sandbox: MCP servers are started as child processes outside the sandbox. This means if a malicious MCP server is configured, it can bypass sandbox restrictions. Codex documentation acknowledges this: only built-in Codex tools are protected by the sandbox; MCP tools must ensure their own security.
If the agent loop is the heart of Codex, the App Server is its vascular system. Celia Chen’s blog recounts the background of this component’s birth.
Codex initially only had a TUI. When the team wanted to build a VS Code plugin, they faced a choice: reimplement the agent logic or find a way to reuse the TUI core. They first tried packaging Codex as an MCP server but found that MCP’s request/response semantics were unsuitable for the complex interaction patterns of an agent. An agent needs to stream progress, request user approval, display diffs, and be cancelable mid-task. These were not scenarios considered when the MCP protocol was designed.
Consequently, the team designed a specialized bidirectional JSON-RPC protocol, initially just an “unofficial first version” for VS Code. As the macOS App, JetBrains, and Xcode were integrated, this protocol matured into the formal App Server.
The App Server protocol is built around three primitives:
A Thread is a complete conversation. It can be created, resumed, forked (branched from a certain point), and archived. The event history of a thread is persisted, allowing clients to resume rendering from a breakpoint after reconnecting.
A Turn is a single user request and the subsequent agent work. A Thread contains multiple Turns. Each Turn streams incremental updates while in progress, allowing the client to render progress in real-time.
An Item is the smallest unit of input/output. User messages, agent messages, command executions, file modifications, tool calls, and approval requests are all different types of Items. Each Item has a clear lifecycle (started → in-progress → completed/failed).
The benefit of this three-layer primitive design is that different clients can choose different rendering granularities based on their UI capabilities. VS Code can render an independent UI card for each Item. The TUI can stream all Items together into terminal output. The Web interface can display each Turn as a collapsible task panel. The protocol is the same; the presentation differs.
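The nesting of the three primitives, and the fork operation on threads, can be sketched with a few dataclasses. Field names here are illustrative; the authoritative schema is what the app-server's generation tools emit.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """Smallest unit of input/output."""
    kind: str                 # "user_message", "command_execution", ...
    status: str = "started"   # started -> in-progress -> completed/failed

@dataclass
class Turn:
    """One user request plus the agent work that follows it."""
    items: list = field(default_factory=list)

@dataclass
class Thread:
    """A complete conversation: an ordered list of turns."""
    turns: list = field(default_factory=list)

    def fork(self, at_turn):
        """Branch a new thread from a prefix of this one's history."""
        return Thread(turns=self.turns[:at_turn])

t = Thread()
turn = Turn()
turn.items.append(Item(kind="user_message", status="completed"))
turn.items.append(Item(kind="command_execution", status="in-progress"))
t.turns.append(turn)

branch = t.fork(at_turn=1)
print(len(branch.turns), branch.turns[0].items[1].kind)
```

A client that wants card-per-Item rendering iterates `turn.items`; one that wants a collapsible panel per task renders at the `Turn` level; the data model is identical either way.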
The full source code for the App Server is open-sourced in
codex-rs/app-server/, and the documentation includes schema
generation tools for TypeScript and JSON Schema. This means anyone can
build a Codex client in any language. The protocol supports both STDIO
and Streaming HTTP transport methods—the former for local process
communication and the latter for remote deployment.
This is the most valuable part of Codex’s open-source strategy. It doesn’t matter if the IDE plugins are closed-source because the protocol is open. You can write a Codex client for Emacs, for a proprietary IDE, or even for a Slack bot.
This survey is directly related to the previous harness engineering survey. Harness engineering answers “how humans design working environments for agents,” while this survey answers “how the agent client is implemented internally.” They are two perspectives on the same problem.
In the harness engineering survey, OpenAI’s Ryan Lopopolo described
“environment design” (documentation systems, architectural constraints,
feedback loops, validation tools), all of which eventually enter the
model’s perception through the prompt assembly stage of the agent loop.
AGENTS.md is just a file on disk; the moment it truly takes effect is
when Codex’s context_manager reads it, strings it into the
prompt, and sends it to the Responses API. Understanding the
implementation details of the agent loop helps harness engineers design
their environments more precisely.
For example, knowing the 32 KiB default limit and the priority rules for path cascading allows a harness engineer to make better information architecture decisions: placing global invariants in the git root’s AGENTS.md and module-level details in subdirectories’ AGENTS.md, so the agent automatically obtains the most relevant context without being overwhelmed by a massive global file.
The harness engineering survey mentioned A04 (reliability is a management problem), which has a direct engineering counterpart in the sandbox design. Tibo’s statement that “we take a stance that hurts us in terms of general adoption” essentially means that the price of reliability is flexibility, but it is a price worth paying.
The combination of Codex’s three sandbox modes (read-only /
workspace-write / full-access) and its profile system allows harness
engineers to define different trust levels for different task types. A
careful profile paired with a read-only sandbox is suitable
for code reviews. A fast profile paired with a
workspace-write sandbox is suitable for rapid iteration. This is the
specific engineering implementation of the “trust spectrum” mentioned in
A04.
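A config fragment makes the trust-spectrum idea concrete. The key names below approximate the documented config.toml profile concept and should be verified against the current Codex configuration reference before use:

```toml
# Illustrative ~/.codex/config.toml fragment; key names approximate the
# profile concept described above and may not match the shipped schema.

[profiles.careful]
model = "gpt-5.3-codex"
sandbox_mode = "read-only"
approval_policy = "untrusted"   # ask before anything outside the allowlist

[profiles.fast]
model = "gpt-5.3-codex"
sandbox_mode = "workspace-write"
approval_policy = "on-failure"  # only interrupt when a command fails
```

Switching trust levels is then a single flag, e.g. `codex --profile careful` for a review session.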
The harness engineering survey mentioned “observability as leverage” (OpenAI integrating Chrome DevTools and observability stacks to let Codex autonomously check output quality). The underlying dependency for such capabilities is the extensibility of the App Server protocol. Because the protocol supports any type of Item and custom events, new observability tools can be integrated into the agent loop as MCP servers or custom tools without modifying the core code.
This also explains why OpenAI chose to open-source the App Server protocol while keeping the IDE plugins closed-source: openness at the protocol layer attracts the community to build more harness tools (lint integration, metrics integration, custom reviewers), while the IDE plugin is just one of many possible UI surfaces.
Based on the above analysis, here is an evaluation for teams wanting to build custom toolchains based on Codex CLI.
Highly Customizable Parts:
The content layer of AGENTS.md and Skills offers near-infinite customization space. You can encode your team’s coding standards, architectural decisions, security policies, and review criteria. The implementation of Skills is essentially appending markdown text during the prompt assembly stage, with no complex runtime mechanism and low comprehension cost.
MCP integration can connect to almost any external system. JIRA, Linear, Figma, Datadog, or proprietary internal tools can all be integrated into Codex’s tool system as long as there is an MCP server implementation. Codex itself can also be exposed as an MCP server to other agents, allowing it to be embedded in larger orchestration systems.
The openness of the App Server protocol makes building custom clients entirely feasible. If your team has a proprietary IDE or internal development platform, you can directly interface with this protocol to gain full Codex agent capabilities.
The Config profile system allows for presetting complete configuration snapshots (model selection, sandbox level, approval policy) for different scenarios, switchable with a single command.
Moderately Customizable Parts:
The model can be switched to a local Ollama via the
--oss parameter, but the actual performance depends heavily
on the tool-calling capabilities of the chosen model. As of now, no
open-source model comes close to gpt-5.3-codex or codex-1 in coding
tasks. This means “switching models” is theoretically possible but will
result in a significant drop in experience in practice.
The core logic of the agent loop can be modified by
forking the repository, but the Rust learning curve and compilation
complexity are barriers. Modifying the system prompt, adjusting
compaction thresholds, or adding custom tool handlers are all feasible
but require a deep understanding of codex-rs/core’s
asynchronous architecture.
Minimally Customizable Parts:
Sandbox rules are kernel-level. Adding custom sandbox policies requires understanding the configuration syntax of Seatbelt (macOS) or Landlock/seccomp (Linux), which has a high barrier to entry.
The compaction mechanism relies on OpenAI’s proprietary API endpoint and may fail when switching providers. This is the most obvious gap between the provider-agnostic ideal and actual dependency in the current implementation.
Multi-agent orchestration remains experimental at the CLI level. True production-grade parallel agent coordination requires Codex Cloud capabilities, which are entirely closed-source.
“Does the harness matter? How much difference does it make to run the same model in different clients?” The community has provided empirical answers to this question.
A thread on Reddit’s r/opencodeCLI titled “Opencode vs Codex CLI: Same Prompt, Clearer Output” noted that the author ran the same GPT-5.2 model in both Codex CLI and OpenCode with the same prompt. They found that OpenCode’s output was “explained more clearly” and the dialogue felt “like chatting with Claude or Gemini,” whereas Codex CLI felt “like talking to a robot.” The author added the next day that using OpenCode + GPT-5.2-medium for planning and discussion “genuinely feels like working with Opus, and sometimes even better,” and they “don’t fully understand how OpenCode does it.”
Morph’s benchmarks provided more quantitative data. On the same task, Codex CLI (GPT-5) took 2 minutes and 45 seconds to complete a cross-file refactor, while OpenCode (Claude) took 4 minutes and 20 seconds. However, in a test generation task, OpenCode produced 94 tests compared to Codex’s 73. There is a tradeoff between speed and depth.
Broader industry data also supports the conclusion that “the harness affects results.” In the SWE-bench Verified evaluation, Augment’s Auggie, Cursor, and Claude Code all ran Opus 4.5, but Auggie solved 17 more problems than Claude Code (731 total). Same model, different scaffolding, measurable difference.
Where does the difference come from? The core lies in the differing design philosophies of their system prompts and tool definitions.
Codex CLI’s system prompt is specifically tuned for the GPT-5.x series. It tells the model “You are Codex CLI,” lists precise available operations, and defines constraints for the output format. Tool schema parameter definitions, description texts, and error-handling guidelines are all designed around the behavior patterns of GPT models. Morph’s analysis calls this a “vertical integration play”: system prompts, tool definitions, and context strategies are specifically optimized for the behavioral characteristics of GPT-5.3. This performs best when using OpenAI models, but when you switch to other providers, these optimizations become mismatches.
OpenCode’s system prompt is generic. It does not assume which model is underneath and uses broader descriptions to define agent behavior. The tool schema design is also more standardized. This means it can achieve a “passing grade” on any model but is not as good as a specifically optimized solution on any particular model. A comment on Hacker News accurately summarized this choice: “The great thing about basing a workflow on a tool like OpenCode is that if OpenAI enshittifies Codex, I don’t have to worry about being trapped and can easily pivot to an open source model, or Anthropic via the API.”
The engineering approaches for “how to connect third-party providers” differ significantly between the two.
| Dimension | Codex CLI | OpenCode |
|---|---|---|
| Supported providers | 1 native (OpenAI) + custom | 75+ native |
| Model switching | Define `model_providers` in `config.toml`, specify `base_url` and `wire_api` | Configure provider in `opencode.json` or choose interactively at first launch |
| Wire API | `responses` (default) or `chat` (OpenAI-compatible) | Handled automatically via AI SDK provider adapters |
| Local models | `--oss` mode; requires manual configuration of the Ollama/LM Studio URL | Native Ollama support, simpler configuration |
| OAuth integration | ChatGPT account login, GitHub Copilot token | ChatGPT Plus, GitHub Copilot, Google, and other OAuths |
| Claude access | Requires a LiteLLM proxy for format translation | Native support for an Anthropic API key |
| Compaction | Relies on OpenAI's proprietary API; fails after switching providers | Not dependent on a specific provider |
The key difference is in the concept of the wire_api.
Codex CLI has two wire formats: responses (OpenAI’s
Responses API, full-featured but only supported by OpenAI) and
chat (standard OpenAI Chat Completions API, compatible with
many third-party providers). When you use the chat
wire_api, compaction, some streaming events, and certain tool-calling
features are downgraded. In contrast, OpenCode uses the generic Chat
Completions protocol by design, so it does not suffer from functional
downgrades when switching between different providers.
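In practice, pointing Codex at a third-party endpoint looks roughly like the fragment below. The key names follow the `model_providers` mechanism described above, but treat this as a sketch and verify against the current Codex configuration reference:

```toml
# Illustrative config.toml for an OpenAI-compatible third-party endpoint.

[model_providers.local]
name = "Local Ollama"
base_url = "http://localhost:11434/v1"
wire_api = "chat"   # Chat Completions shape; compaction and some
                    # streaming/tool features degrade on this path

[profiles.offline]
model = "qwen2.5-coder"   # hypothetical local model name
model_provider = "local"
```

The moment `wire_api = "chat"` is set, the client is on the degraded path; there is no configuration that restores `responses`-only features on a non-OpenAI provider.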
In other words, Codex CLI is “best experienced on OpenAI, with functionality lost when switching providers.” OpenCode is “consistent across all providers, but not deeply optimized for any single one.” This aligns with Codex’s overall design philosophy: production-grade, deeply integrated, and extremely optimized for specific scenarios. OpenCode’s philosophy is: community-driven, provider-agnostic, and preferring uniformity over specialization.
At a deeper architectural level, the differences are equally significant.
Codex CLI is a pure client architecture: a Rust binary runs directly on your machine and calls model APIs directly. There is no intermediate server. This means performance is excellent (fast startup, low memory footprint), but it also means all logic must be implemented in the client.
OpenCode is a client-server architecture: a Go backend (Hono HTTP Server) runs locally or remotely, and iOS/Web/Desktop clients connect via REST + SSE. This architecture naturally supports remote sessions (running the agent on a server and seeing results on a phone), session sharing (generating links for others to see your dialogue), and multi-device synchronization.
This architectural difference also explains why OpenCode has “less
downgrade” when switching third-party providers: because the provider
adaptation logic is on the server side, the client is completely
indifferent to which model is being used. In contrast, Codex CLI’s
provider adaptation logic is in the client’s Rust code (selection of
wire_api, assembly of request formats, parsing of streaming
events), so the client must correctly handle format differences when
switching providers.
From a code quality perspective, a static analysis posted on r/codereview found that Codex’s Rust code has roughly one-eighth as many issues per line as comparable TypeScript projects, though OpenCode, the product of a smaller team, has maintained good baseline quality even without linter configurations.
From the implementation of Codex CLI, three design patterns can be distilled and migrated to other agent systems:
Pattern 1: Strict decoupling of core logic and UI
surfaces. codex-rs/core exists as a pure library
and does not care which UI it runs in. This allows OpenAI to support six
surfaces (TUI, VS Code, Web, macOS App, JetBrains, Xcode) with the same
core. If you are building an agent system, making the agent loop a
library rather than an application from day one will significantly lower
the cost of future expansion.
Pattern 2: Stateless API calls + client-side compaction. Giving up server-side state means simpler provider dependencies and better privacy compliance, at the cost of larger request volumes. Compensate for this with automatic client-side compaction. This tradeoff is particularly valuable for tools aiming to be provider-agnostic.
Pattern 3: Protocol design with three-layer primitives (Thread / Turn / Item). This is more suitable for agent interaction patterns than MCP’s request/response semantics. If you need to share agent state across multiple clients, these three layers of primitives are a production-proven level of abstraction.
The previous ten sections attempted to faithfully reconstruct the internal implementation of Codex CLI. This section takes a different perspective: re-examining these facts within the context of broader design decisions. Each of the following tradeoffs is not a “mistake” by Codex, but a “bet” with conditions for success and failure. Understanding these conditions is of more long-term value than knowing the implementation details.
Section 9 described a fact: Codex’s system prompt and tool schema are
specifically tuned for GPT-5.x, leading to functional downgrades and a
“dry dialogue feel” when switching providers. The technical explanation
involves dependencies on wire_api and compaction. But
behind this lies a more fundamental design bet: OpenAI believes
that deeply binding to a top-tier model is more valuable than shallowly
being compatible with all models.
This bet holds true as long as OpenAI models maintain their lead. GPT-5.3-Codex leads Claude Code by nearly 12 percentage points on Terminal-Bench, and Codex’s vertical optimization indeed pays off on its own models. However, the competitive landscape in AI reshuffles every six months. If you choose Codex and build your entire workflow around its AGENTS.md format, Skills system, and App Server protocol, then when the day comes to switch providers, you lose not only technical compatibility but also the implicit rapport accumulated with GPT: the way prompts are phrased, the trigger patterns for tool calls, and the mental model for context management. These things are hard to quantify, but the “clumsiness” upon switching is real.
In contrast, OpenCode’s generic prompt design is not as good as a specifically optimized solution on any particular model, but its “passing grade” is roughly the same across all models. More importantly, the working habits users accumulate in OpenCode are provider-agnostic, with almost zero cognitive migration cost when switching models. In a stage where the basic assumptions of a field are still rapidly changing, this flexibility itself might be the most important capability.
Section 5 detailed Codex’s kernel-level sandbox: Seatbelt, Landlock, seccomp, and three levels of permission modes. This is the ultimate implementation of “process safety,” restricting what the agent can do at the operating system level. In comparison, Claude Code’s hooks system and OpenCode’s more flexible permission models lean more toward “result verification”: not restricting how you do it, but verifying whether the output after you’re done meets the standards.
The difference between these two paths is deeper than it appears. The advantage of process safety is certainty: if the sandbox prohibits network access, the agent absolutely cannot leak data, regardless of how its prompt is injected. But the cost is that you must enumerate all legal operation modes in advance. When an agent needs to do something the sandbox didn’t foresee (e.g., temporarily accessing a new API endpoint to verify a bug), you can only modify the sandbox policy, and the barrier to modifying kernel-level policies is much higher than for application-layer hooks.
From the perspective of actual coding work, most quality issues are detected not by “what the agent was prevented from doing,” but by “whether the code passed lints, tests, and type checks after the agent was done.” A sandbox can prevent catastrophic failures (deleting databases, leaking keys) but cannot prevent logical errors, performance degradation, or style inconsistencies. The latter are precisely the more frequent problems in daily coding. Codex invested heavily in process safety engineering, but in result verification, it relies (by comparison) on the model’s own judgment and the acceptance criteria written by the user in AGENTS.md. Whether this investment allocation is optimal depends on what ranks first in your threat model.
Section 2 mentioned Codex being rewritten from TypeScript to Rust for performance, security, and to remove Node.js dependencies. These benefits are real. But each benefit introduces a new constraint.
Rust optimizes startup speed and memory footprint but also raises the barrier for community participation from “knowing TypeScript” to “knowing async Rust + understanding kernel interfaces for Seatbelt/Landlock.” The number of contributors to Codex on GitHub is far fewer than for OpenCode, partly due to the language barrier. A kernel-level sandbox is harder to escape than application-layer hooks but also harder to customize. The three-layer primitives of the App Server protocol are better for agent interaction than MCP, but any client wanting to connect must understand and implement these three layers of semantics, whereas OpenCode’s REST + SSE interface can be called directly by almost any HTTP client.
This is a classic tension: improvements that make a system stronger, faster, and more secure often simultaneously make the system harder to modify externally. If your role is a user of Codex (using it as designed), these improvements are all good. If your role is a builder (wanting to extract patterns from Codex’s design to build your own system), these improvements instead increase the cost of understanding and migration.
A practical strategy is: you don’t need to fork all of Codex to get its good designs. The cascaded logic of prompt assembly (32 KiB limit + path priority) can be replicated in dozens of lines of code. The stateless + compaction pattern can be implemented in any language. The Thread/Turn/Item protocol design can serve as a reference for designing your own simpler version. Codex’s value as learning material might be greater than its value as a forkable codebase.
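As a concrete illustration of how small that cascade is, here is a minimal sketch of path-priority prompt assembly with a 32 KiB cap. The cap comes from the source; the exact ordering and truncation rules are assumptions and may differ from Codex’s actual behavior:

```python
from pathlib import Path

LIMIT = 32 * 1024  # the 32 KiB cap described in the text

def assemble_instructions(repo_root: Path, cwd: Path, limit: int = LIMIT) -> str:
    """Concatenate AGENTS.md files from repo_root down to cwd, so the most
    specific file appears last (highest priority), then truncate to the
    byte limit. A sketch only: Codex's precise rules may differ in detail."""
    repo_root, cwd = repo_root.resolve(), cwd.resolve()
    # directories on the path from repo_root down to cwd, root first
    chain = [d for d in [cwd, *cwd.parents] if d == repo_root or repo_root in d.parents]
    chain.reverse()
    parts = []
    for directory in chain:
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            parts.append(candidate.read_text(encoding="utf-8"))
    combined = "\n\n".join(parts)
    return combined.encode("utf-8")[:limit].decode("utf-8", errors="ignore")
```

A few dozen lines like these, in any language, capture the pattern without inheriting the Rust codebase around it.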
Section 9.1 cited Reddit observations and SWE-bench data proving that the same model performs differently in different harnesses. This fact is established. But a deeper question is: in a coding agent system, where exactly is the bottleneck that limits output quality?
If the bottleneck is in model intelligence, then harness differences should be minimal. But community observations (same model, different results) and SWE-bench data (same model, Auggie solving 17 more problems than Claude Code) point to the opposite conclusion: the harness has a greater impact on results than most people realize.
If the bottleneck is in the harness, then in which part of the harness? Codex invested heavily in the sandbox, protocol, and performance, but these are the “infrastructure layer” of the harness. The differences felt by Reddit users (“dry dialogue” vs. “like working with Opus”) point to the “interaction layer” of the harness: the phrasing of the system prompt, the design of the tool schema, and the way context is assembled.
This leads to a hypothesis: the current bottleneck of coding agent systems might not be security, performance, or protocol design, but the quality of the fit between the prompt/tool schema and the underlying model. If this hypothesis holds, then Codex’s investment in the infrastructure layer, however admirable as engineering, might contribute less to output quality at the margin than spending the same effort on model adaptation in the interaction layer.
Of course, this is just a hypothesis. Verifying it would require a controlled experiment: running the same model through both Codex and OpenCode harnesses on the same set of coding tasks (ideally a standardized set like SWE-bench) and comparing pass rates and code quality. If the differences are significant and concentrated on specific types of tasks, the bottleneck’s location could be more precisely identified. Current public data is not yet sufficient to make a definitive conclusion, but the direction is worth watching.
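Once per-task results from such an experiment existed, the comparison itself would be a few lines of tallying. The result format and category names below are hypothetical, purely to show the shape of the analysis:

```python
from collections import defaultdict

def compare_harnesses(results_a, results_b):
    """Per-category pass-rate delta (harness A minus harness B) over the
    shared task set. Input format is assumed: {task_id: (category, passed)}.
    A sketch of the controlled comparison described in the text."""
    shared = results_a.keys() & results_b.keys()
    tally = defaultdict(lambda: [0, 0, 0])  # category -> [a_passes, b_passes, n]
    for task in shared:
        category, passed_a = results_a[task]
        _, passed_b = results_b[task]
        counts = tally[category]
        counts[0] += passed_a   # bools count as 0/1
        counts[1] += passed_b
        counts[2] += 1
    return {cat: (a - b) / n for cat, (a, b, n) in tally.items()}
```

Deltas concentrated in particular categories (say, multi-file refactors but not single-function bug fixes) would localize the bottleneck far better than an aggregate score.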
Section 8 evaluated Codex’s customization space. But the deeper the customization, the more important an implicit question becomes: when you need to switch tools, how much of the invested cognition can you take with you?
The AGENTS.md format is portable across tools. The Linux Foundation’s Agentic AI Foundation is already promoting it as an industry standard, and OpenCode, Cursor, Copilot, and Gemini CLI all support reading it. The project specifications, architectural constraints, and code styles you encode in AGENTS.md will work regardless of the tool you switch to. This part of the investment is safe.
However, Skills (Codex’s proprietary markdown + script format), the profile system in config.toml, and the client integration of the App Server protocol are all exclusive to the Codex ecosystem. The more you invest in these layers, the higher the migration cost. Specifically for Skills, the content itself (“how to run a linter,” “how to deploy to staging”) is generic cognition, but its format and injection mechanism are proprietary to Codex. If you write the same knowledge as generic markdown files instead of the Codex Skills format, its portability will be much better, albeit at the loss of automatic injection convenience in Codex.
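The tool-agnostic alternative is simple enough to sketch: keep each skill as a plain markdown file and concatenate them into the prompt yourself. The `docs/skills/` layout and heading format below are assumptions for illustration, not any standard:

```python
from pathlib import Path

def load_skills(skills_dir: Path) -> str:
    """Concatenate plain-markdown skill files into one prompt section.

    Hypothetical replacement for tool-specific skill injection: any
    harness that accepts extra system-prompt text can consume the output.
    """
    sections = []
    for path in sorted(skills_dir.glob("*.md")):  # sorted for stable order
        body = path.read_text(encoding="utf-8")
        sections.append(f"## Skill: {path.stem}\n\n{body}")
    return "\n\n".join(sections)
```

You give up Codex’s automatic, on-demand injection, but the knowledge survives any tool switch unchanged.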
A deeper view: at a stage when the tool ecosystem is still evolving rapidly, what deserves long-term investment is the cognition itself (understanding of the project, definition of code quality, judgment about architecture), not the carrier of that cognition (the proprietary configuration format of a specific tool). AGENTS.md is a good investment not only because it works across tools but because the act of writing it forces you to make implicit knowledge explicit. Even if the AGENTS.md format disappears one day, the understanding you gained while making that knowledge explicit will not. That is the asset that truly compounds.