
From Agent Memory to Agent Filesystem: What the Shift Really Means

If you’ve been watching AI agent infrastructure over the past year, you’ve noticed something strange. More and more teams are talking about file systems — not databases, not vector search, not memory layers, just file systems. Turso shipped AgentFS, a SQLite-backed virtual filesystem for agents. Anthropic published two engineering posts: one on turning MCP tool calls into filesystem code files, another on agent skills as folders plus Markdown. Vercel open-sourced a knowledge agent template whose headline feature is “no vector database, no embedding, no chunking pipeline.” Manus’s widely cited context-engineering post boils down to one trick: treat the filesystem as context.

These events happened in the same window for a reason. They point in one direction: the industry’s approach to agent memory over the past few years may have been missing the real constraint.

Three Generations: From Raw Context to Filesystem

Looking back, the evolution has three phases.

Gen 1: Raw context. Early agents ran everything inside a single context window. For a 50-step tool-calling task, every result piled up in context. Manus reported a 100:1 average input-to-output token ratio. This isn’t just slow — Claude Sonnet’s cached input costs $0.30/MTok while uncached input costs $3.00/MTok, a 10x difference. All tool definitions, history traces, and intermediate results compete for the same window, which fills up fast and gets expensive fast.
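
The cost arithmetic can be sketched with a back-of-the-envelope model. Only the two prices come from the text above; the per-step token count and cache-hit rates below are illustrative assumptions, not measured data.

```python
# Rough input cost of a multi-step agent run, using the cited Claude Sonnet
# prices ($0.30/MTok cached input, $3.00/MTok uncached). Step sizes and
# cache-hit rates are illustrative assumptions.

def run_cost(steps: int, tokens_per_step: int, cache_hit_rate: float,
             cached_price: float = 0.30, uncached_price: float = 3.00) -> float:
    """Total input cost in dollars; prices are per million tokens."""
    total = 0.0
    context = 0
    for _ in range(steps):
        context += tokens_per_step          # context grows every step
        cached = context * cache_hit_rate
        uncached = context - cached
        total += (cached * cached_price + uncached * uncached_price) / 1e6
    return total

# With a growing context, re-reading history dominates: input tokens scale
# quadratically with step count, which is where ratios like 100:1 come from.
no_cache = run_cost(steps=50, tokens_per_step=2000, cache_hit_rate=0.0)
good_cache = run_cost(steps=50, tokens_per_step=2000, cache_hit_rate=0.9)
print(f"no cache: ${no_cache:.2f}, 90% cache hits: ${good_cache:.2f}")
```

The quadratic growth is the point: halving what each step leaves in context saves far more than half the money.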

Gen 2: Memory systems. The industry’s first reaction was “give the agent external memory.” Mem0, MemGPT, Pinecone, ChromaDB and others emerged. The idea: embed conversations or knowledge bases, store them in a vector database, retrieve what’s needed, and stuff it into context. This solved persistence — agents no longer started from scratch every session. But it didn’t solve context economics. Retrieved content still has to fit in the window, retrieval itself has cost and latency, and vector search is fuzzy. When you don’t know whether the retrieved chunk is correct, you’ve lost determinism before the model call even happens.

Gen 3: Filesystem as context. The turning point arrived in the second half of 2025. Manus’s blog put it plainly: “the file system as the ultimate context: unlimited in size, persistent by nature, and directly operable by the agent itself.” Instead of compressing content into context, put it in files, put a file path or URL into context, and let the agent read it when needed. Compression becomes reversible — you drop the content but keep the path, so nothing is permanently lost. This is the same principle as Anthropic’s progressive disclosure: the agent loads only what it needs, when it needs it.
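
A minimal sketch of this “reversible compression” idea: persist a large tool result to a file and keep only a path plus a short preview in context. The names here (`WORKSPACE`, `offload`, `restore`) and the stub format are illustrative assumptions, not any product’s API.

```python
# Reversible compression sketch: content leaves the context window, but the
# path to it stays, so the agent can always read it back.
from pathlib import Path
import hashlib
import tempfile

WORKSPACE = Path(tempfile.mkdtemp(prefix="agent_ws_"))  # assumed scratch dir

def offload(content: str, preview_chars: int = 120) -> str:
    """Persist content; return the small, context-sized stand-in for it."""
    name = hashlib.sha256(content.encode()).hexdigest()[:12] + ".txt"
    path = WORKSPACE / name
    path.write_text(content, encoding="utf-8")
    # Only this short string enters the model's context window.
    return f"[stored at {path}] {content[:preview_chars]}..."

def restore(stub: str) -> str:
    """Recover the full content from the path embedded in the stub."""
    path = stub.split("]")[0].removeprefix("[stored at ").strip()
    return Path(path).read_text(encoding="utf-8")

big_result = "row," * 50_000          # pretend this is a huge tool output
stub = offload(big_result)
assert restore(stub) == big_result    # nothing was permanently lost
```

Contrast this with summarization, which is lossy: once a detail is summarized away, no later step can get it back.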

The core shift is from push to pull. Gen 2 says “I’ll push what I think you need into your context.” Gen 3 says “go find what you need in the filesystem.” This relies on a critical prerequisite — LLMs are already good at filesystem operations. They were trained on coding tasks in sandboxed environments; they already know how to ls, grep, cat, and find.

Why Filesystem, Why Now

Three conditions had to converge for this pattern to work.

First, LLMs’ filesystem fluency is a free inheritance. As Arpit Bhayani summarized on LinkedIn: “models are trained heavily on coding tasks inside sandboxed environments with shells and filesystems. Hence, they get really good at navigating directories, reading files, running shell commands.” Give a non-coding agent a shell and filesystem, and it inherits all of that for free.

Second, context economics is no longer ignorable. Manus reports a 100:1 input-to-output ratio and a 10x cache price difference. Anthropic made the math concrete: loading all MCP tool definitions upfront costs 150,000 tokens; letting the agent discover tools on-demand through the filesystem costs 2,000 tokens — a 98.7% savings. When context goes from “negligible” to “the dominant cost of running an agent,” any architecture that substantially reduces token consumption wins.

Third, progressive disclosure exploits how LLM attention works. Anthropic’s Agent Skills design has three layers: load only name and description at startup, load the full SKILL.md on match, load supporting files only when needed. This is the essence of progressive disclosure — the agent decides how much context it needs, not the developer. Manus’s todo.md recitation is a variant: repeatedly writing objectives to the end of context pushes the plan into the model’s recent attention span, counteracting the lost-in-the-middle effect.
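
A toy loader makes the three layers concrete. The directory layout (one folder per skill, a SKILL.md whose first line is a heading and second line a one-sentence description) follows the convention described above; the function names and exact file format are assumptions for illustration.

```python
# Three-layer progressive disclosure over a skills directory: cheap index
# always loaded, full instructions on match, supporting files on demand.
from pathlib import Path

def load_index(skills_dir: Path) -> list[dict]:
    """Layer 1: only name + description, cheap enough to always load."""
    index = []
    for skill in sorted(skills_dir.iterdir()):
        if not skill.is_dir():
            continue
        lines = (skill / "SKILL.md").read_text().splitlines()
        # Assumed format: first line "# name", second line the description.
        index.append({"name": lines[0].lstrip("# ").strip(),
                      "description": lines[1].strip(),
                      "dir": skill})
    return index

def load_skill(entry: dict) -> str:
    """Layer 2: full SKILL.md, read only when the skill matches the task."""
    return (entry["dir"] / "SKILL.md").read_text()

def load_resource(entry: dict, relpath: str) -> str:
    """Layer 3: supporting files, read only when the skill asks for them."""
    return (entry["dir"] / relpath).read_text()
```

Only layer 1 ever touches startup context; layers 2 and 3 cost tokens only for the one skill the task actually needs.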

Who’s Building What, and Their Design Philosophy

Everyone is moving toward filesystem patterns, but their concrete approaches reveal different judgments about what the real problem is.

Anthropic’s judgment: the problem is tool-calling token waste. Their MCP post (November 2025) proposes keeping MCP as the connection protocol but switching tool invocation from direct function calling to filesystem-based discovery and code execution. The agent lists ./servers/ to find available tools, reads the tool files it needs, then writes TypeScript code to call them. Their insight is not that MCP is bad — it’s that direct tool calling forces every intermediate result through the model’s context window, which is prohibitively expensive for large documents. Keeping intermediate results in the execution environment and logging only necessary output to the model saves massive token counts.
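
The discovery step can be sketched as below. Anthropic’s post has the agent write TypeScript; this Python version keeps the document’s examples in one language. The `./servers/` layout matches the description above, but the file extension and helper names are assumptions.

```python
# Filesystem-based tool discovery: the model sees only names until it
# decides to read a specific tool definition.
from pathlib import Path

def discover_tools(servers_dir: Path) -> dict[str, list[str]]:
    """Equivalent of listing ./servers/: map each server to its tool files.
    Only these names reach the model; definitions are read on demand."""
    return {server.name: sorted(p.name for p in server.glob("*.py"))
            for server in sorted(servers_dir.iterdir()) if server.is_dir()}

def read_tool(servers_dir: Path, server: str, tool: str) -> str:
    """Load one tool definition only when the agent decides to use it."""
    return (servers_dir / server / tool).read_text()
```

The savings come from the asymmetry: the index is a few hundred tokens, while the full set of definitions could be six figures.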

Two subtler aspects of this approach are easy to miss. First, Anthropic explicitly calls out data privacy: data flowing through the execution environment never enters the model’s context, so PII can flow between services without ever appearing in a prompt. Second, the Agent Skills filesystem pattern has been adopted by Cursor, GitHub Copilot, and OpenAI Codex — the filesystem is becoming the portability layer for agent behavior.

Vercel’s judgment: the problem is RAG’s non-determinism. Their knowledge agent template sells “no vector database” as a feature. The agent runs inside a Firecracker sandbox, searching knowledge base files with grep/find/cat. Results are fully deterministic — same question, same files, same grep command, same answer every time. Vercel’s own blog reported a case study: sales call summarization went from $1.00 to $0.25 per call by switching to filesystem-plus-bash.
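
The determinism claim is easy to demonstrate with a minimal retrieval loop in the spirit of the grep/find/cat pattern. This sketch uses Python’s stdlib rather than shelling out to real grep, and the knowledge-base layout (Markdown files under one directory) is an assumption.

```python
# Deterministic retrieval: same files + same pattern = byte-identical
# results on every run, with stable ordering by path and line number.
from pathlib import Path
import re

def grep(kb_dir: Path, pattern: str) -> list[str]:
    """Return matching lines as 'file:line: text', in a stable order."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(kb_dir.rglob("*.md")):
        for i, line in enumerate(path.read_text().splitlines(), start=1):
            if rx.search(line):
                hits.append(f"{path.name}:{i}: {line.strip()}")
    return hits
```

There is no embedding model, no similarity threshold, and no index to drift out of sync — which is exactly the trade Vercel is selling.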

Vercel also honestly disclosed a counterexample. In a benchmark with Braintrust, for structured queries (“who was the highest-value customer last month?”), SQL achieved 100% accuracy while bash-plus-filesystem scored only 53%, using 7x the tokens at 6.5x the cost. The conclusion was hybrid: use grep for exploration, SQL for queries.
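
The hybrid conclusion implies a routing decision somewhere. A toy version is sketched below; the keyword heuristic is purely an illustrative assumption — a real system would more likely let the model itself choose the tool.

```python
# Toy query router for the hybrid approach: aggregation-style questions go
# to SQL, open-ended exploration goes to text search. The heuristic is an
# assumption for illustration only.
import re

STRUCTURED_HINTS = re.compile(
    r"\b(highest|lowest|count|sum|average|total)\b|last month", re.IGNORECASE)

def route(question: str) -> str:
    """Return 'sql' for structured/aggregation questions, else 'grep'."""
    return "sql" if STRUCTURED_HINTS.search(question) else "grep"

assert route("who was the highest-value customer last month?") == "sql"
assert route("what do our docs say about refunds?") == "grep"
```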

Turso/AgentFS’s judgment: filesystem and database shouldn’t be an either/or. Their approach implements a POSIX-compatible virtual filesystem on top of SQLite — what the agent sees as /output/report.pdf is an inode and dentry in a SQLite table. This means agents can simultaneously use the filesystem interface (read file, list directory) and the SQL interface (query the same SQLite file directly). AgentFS also supports FUSE mounting, letting agents run git, grep, and other Unix tools directly on the virtual FS. By GitHub metrics, it’s the most active agent filesystem project (3.1k stars, 755 commits, 58 releases).
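
The dual-interface idea can be illustrated in a few lines: files stored as rows in SQLite, readable both “as files” and via plain SQL against the same store. This is a sketch of the concept only — it is not AgentFS’s actual schema, which models real inodes and dentries.

```python
# One store, two interfaces: filesystem-style reads and SQL queries over
# the same SQLite table. Schema is a toy, not AgentFS's.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, data BLOB, mtime REAL)")

def write_file(path: str, data: bytes, mtime: float) -> None:
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
               (path, data, mtime))

def read_file(path: str) -> bytes:
    row = db.execute("SELECT data FROM files WHERE path = ?",
                     (path,)).fetchone()
    if row is None:
        raise FileNotFoundError(path)
    return row[0]

# Filesystem-style access...
write_file("/output/report.txt", b"q3 numbers", mtime=1700000000.0)
assert read_file("/output/report.txt") == b"q3 numbers"

# ...and SQL-style access against the very same data.
count, = db.execute(
    "SELECT COUNT(*) FROM files WHERE path LIKE '/output/%'").fetchone()
assert count == 1
```

The agent gets the POSIX-shaped interface it was trained on; the developer gets transactions, queries, and snapshots underneath.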

Manus’s judgment: the context window is the central bottleneck of agent architecture. Lance Martin’s deep analysis of Manus details their three-pronged approach: reduce (compact and summarize stale tool results), offload (move content to filesystem references outside context), and isolate (sub-agents with their own context windows). Among these, offloading to the filesystem is the most original innovation — “reversible compression” ensures the agent never permanently loses information by discarding content.

Blind Spots in the Trend

All these approaches share several under-discussed problems.

First, filesystems lack semantic resilience. Call it “path hallucination”: when an agent expects /state/user_prefs.json but the file has moved to /config/users/prefs.json, LLMs tend to pretend the path exists or create a new file rather than systematically searching. In vector memory systems, schema changes don’t break queries — semantic retrieval finds relevant content regardless. Filesystems trade semantic fault tolerance for deterministic addressing.

Second, garbage collection. Vector memory has a natural decay mechanism: old content scores lower in embedding space, new content automatically replaces it. In a filesystem, a written file stays forever until explicitly deleted. If agents constantly write scratchpads, intermediate reasoning steps, and state files to /tmp/agent_workspace/, who cleans up? AgentFS’s snapshot capability partly addresses this (rollback to any historical state), but “what should be forgotten” has no good answer in the filesystem paradigm.

Third, security is underappreciated. The ClawHavoc campaign (January 2026) demonstrated the real risk of malicious skills exploiting filesystem access: attackers harvested ~/.clawdbot/.env, SSH keys, browser passwords, and cryptocurrency wallets through ClawHub. Filesystem plus shell plus code execution equals an enormous attack surface. AgentFS’s copy-on-write isolation prevents agents from damaging the host, but it doesn’t prevent prompt injection from steering an agent to read files it shouldn’t. Simon Willison’s assessment cuts through: “The word safe is doing a lot of work.”

Fourth, what happens to MCP? Anthropic is both MCP’s creator and the primary driver of filesystem-based alternatives. This has created a split in interpretation: Daniel Miessler reads MCP as being demoted to a service directory. Cloudflare’s engineering team independently reached the same conclusion, calling it “Code Mode”. David Mohl disagrees, arguing that “Skills are manuals, MCPs are connectors” — both are needed.

The most likely outcome is differentiation: MCP becomes the control plane (auth, discovery, transport), while filesystem/code APIs become the data plane (execution and interaction). An MCP daemon mounts remote tools as a local filesystem; the agent interacts with the filesystem, and the daemon translates reads and writes back to MCP protocol messages. Anthropic’s own post is clear: “MCP provides a foundational protocol” — they’re adding a filesystem layer on top, not replacing it.
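
The daemon idea reduces to a thin translation shim, sketched below. The `MCPClient` interface here is invented for illustration — it is not the real MCP SDK — and a production daemon would expose an actual FUSE mount rather than Python methods.

```python
# Control plane vs data plane: the agent sees "files" under /tools; the
# shim translates filesystem operations into protocol calls underneath.
class MCPClient:
    """Stand-in for a real MCP connection (auth, discovery, transport)."""
    def list_tools(self) -> list[str]:
        return ["search", "summarize"]
    def call(self, tool: str, arg: str) -> str:
        return f"<{tool} result for {arg!r}>"

class ToolFS:
    """Data plane: path-shaped interface over an MCP connection."""
    def __init__(self, client: MCPClient):
        self.client = client
    def listdir(self, path: str) -> list[str]:
        # Listing /tools is a discovery request in disguise.
        assert path == "/tools"
        return self.client.list_tools()
    def read(self, path: str, arg: str) -> str:
        # Reading /tools/<name> with an argument becomes a protocol call.
        tool = path.removeprefix("/tools/")
        return self.client.call(tool, arg)

fs = ToolFS(MCPClient())
assert "search" in fs.listdir("/tools")
```

The agent never needs to know MCP exists; auth and transport live entirely behind the path namespace.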

What Architecture Wins

Looking at the evidence, the most likely future architecture is layered:

The bottom layer is connectivity (MCP/connectors), handling auth, remote access, and policy enforcement. This is MCP’s real value.

The middle layer is state (SQLite-backed virtual FS), following the AgentFS model. It presents a POSIX interface to agents while persisting data in a structured, queryable, rollback-capable store. Agents get the interface they already know; developers get auditability. The key insight is that filesystem and database are not an either/or — SQLite naturally bridges both.

The top layer is behavior (Skills/filesystem code), expressing how agents should complete specific tasks through code and instruction files. Anthropic’s Agent Skills standard is emerging as the dominant format, already adopted across multiple platforms.

This layered architecture wasn’t designed by any single company. It converged from multiple starting points after iterative refinement. Manus rebuilt five times, each iteration tuning the context-filesystem relationship. Anthropic went from “direct tool calling” to “filesystem-based tool discovery” within months. Vercel started from sandboxed execution and found that code-plus-filesystem covered enough use cases to ship.

The Bitter Lesson Reminder

Manus founder Peak Ji observed that an agent’s harness can limit model progress. If stronger models don’t improve your agent’s performance, the harness is the bottleneck. Every time model capabilities cross a new threshold, the previous harness structure needs re-examination. Manus did five complete rewrites between March and July 2025.

The filesystem is an old abstraction. It’s general enough to have a better chance of surviving multiple model generations than custom tool-calling formats. But it’s not universal. Vercel’s benchmark already shows that structured queries need SQL. LlamaIndex points out that large-scale unstructured retrieval still needs an indexing layer. Verification, security, and garbage collection — problems databases solved decades ago — are being reintroduced in agent filesystem architectures.

The shift from memory to filesystem isn’t about one technology being better than another. It’s about constraint migration: when context cost becomes the binding constraint, any architecture that reduces it wins. The price is losing semantic fault tolerance, introducing new security surfaces, and creating new operational burdens. The next bottleneck will likely be verification and trust — whether filesystem content is accurate, whether agent-generated files are correct, whether cross-session state is consistent. No current agent filesystem product has a complete answer to these questions.