AI Agent · AI Coding

Garry Tan's Thin Harness, Fat Skills: Five Concepts Unpacked, and How to Implement Them

Date: 2026-04-14
Source: Garry Tan, X article (2026-04-11, ~1M views)
Type: Concept analysis + practice mapping

Last week Garry Tan published a long-form article on X called Thin Harness, Fat Skills, which hit nearly a million views. Steve Yegge says people using AI coding agents are 10x-100x more productive than those using Cursor’s chat. Garry’s explanation: the gap comes from architecture, specifically the combination of five concepts.

Each of these five concepts points to a real engineering problem. This article unpacks them one by one, and for each, provides a corresponding implementation we arrived at independently over the past year. We open-sourced the entire practice system. Below each concept, I link directly to the relevant files and directories so you can look at the actual code.

Concept 1: Skill Files

Garry defines a skill file as a reusable program written in markdown. It describes a process of judgment, not a fixed answer. The same /investigate skill, fed a safety scientist and 2.5 million emails, becomes a medical investigation; fed campaign finance data, it becomes a political donation tracker. He says this is software design, not prompt engineering, with markdown as the programming language.

He highlights an insight most people miss: skill files accept parameters like method calls. Same process, different arguments, entirely different capabilities.

In our system, this maps to the Skills framework. We currently have over 40 skill files, each with trigger words, parameters, dependency declarations, and execution flows. For example, the deep research workflow defines a complete pipeline from initial scan to parallel multi-agent execution to cross-validation; the parallel subagent workflow defines when to split tasks, how to control parallelism, and how to set overlap. These skills get refined and extended through use; the accumulated knowledge persists across sessions.
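The structure described above can be sketched in a few lines. This is an illustrative simplification, not our actual loader: the field names (`triggers`, `parameters`, `depends_on`) and the substring matcher are assumptions standing in for the real trigger-word routing.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """Metadata a skill file might declare (field names are illustrative)."""
    name: str
    triggers: list                          # words/phrases that activate the skill
    parameters: list                        # arguments it accepts, like a method signature
    depends_on: list = field(default_factory=list)

    def matches(self, user_message: str) -> bool:
        # Naive trigger-word match; real routing is described under Concept 3.
        text = user_message.lower()
        return any(t in text for t in self.triggers)

# Same process, different arguments: Garry's /investigate example.
investigate = Skill(
    name="investigate",
    triggers=["investigate", "dig into"],
    parameters=["subject", "corpus"],
)
```

Feeding `investigate` a safety scientist and an email corpus, versus campaign finance data, changes what it produces without changing the skill file itself, which is the method-call analogy in action.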

Garry says "every skill you write is a permanent upgrade to your system." That matches our experience exactly: when models improve, skills automatically benefit, while deterministic steps stay stable. The Skills index is the full list.

Concept 2: Thin Harness

Garry’s core claim is that the harness (the program running the model) should do only four things: run the model in a loop, read and write files, manage context, and enforce safety. He opposes fat harnesses: 40+ tool definitions eating half the context window, God-tools with 2-5 second MCP round-trips. He quantifies the problem with a 75x performance comparison: a Playwright CLI does each browser operation in 100ms, while a Chrome MCP takes 15 seconds for screenshot-find-click-wait-read.

The design principle is directional: push intelligence up into skills, push execution down into deterministic tools, keep the middle as thin as possible.
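The four responsibilities Garry allows the harness can be sketched as a short loop. This is a hedged illustration, not any real runtime's code: `call_model` and `run_tool` are hypothetical stand-ins, and the safety check is deliberately minimal.

```python
# Minimal "thin harness" sketch: loop the model, read/write files,
# manage context, enforce safety. `call_model` and `run_tool` are
# hypothetical stand-ins for a real agentic runtime's internals.
from pathlib import Path

ALLOWED_ROOT = Path("./workspace")          # safety: confine file access

def safe_path(p: str) -> Path:
    path = (ALLOWED_ROOT / p).resolve()
    root = ALLOWED_ROOT.resolve()
    if root not in path.parents and path != root:
        raise PermissionError(f"outside workspace: {p}")
    return path

def harness(task: str, call_model, run_tool, max_turns: int = 20):
    context = [task]                        # context management: a plain list here
    for _ in range(max_turns):              # run the model in a loop
        action = call_model(context)
        if action["type"] == "done":
            return action["result"]
        if action["type"] == "read":
            context.append(safe_path(action["path"]).read_text())
        elif action["type"] == "tool":
            context.append(run_tool(action))  # deterministic execution lives below
    raise RuntimeError("turn budget exhausted")
```

Everything intelligent lives in what `call_model` was given (skills), and everything reliable lives behind `run_tool` (deterministic tools); the loop itself stays thin.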

We arrived at the same conclusion through practice, and went further in From Process Certainty to Outcome Certainty. We proposed a four-layer model for AI integration: Model, Protocol, Runtime, and Contract. Most people focus on the Protocol layer (how to call the API), but the Runtime layer consumes the most time. The good news: the Runtime layer is converging into a public utility. Claude Code, Codex, and Cursor Agent are all becoming reusable Agentic Runtimes. Model providers are proactively adapting their failure patterns for compatibility with these tools, which means long-tail bug fixes shift from a single team’s burden to the entire ecosystem’s shared investment.

So while Garry says the harness should be thin, we go one step further: the harness should be outsourced entirely to existing agentic runtimes. You only need to build the skills on top and the deterministic tools underneath. That is exactly how our system works: AGENTS.md is the system’s entry point, running on top of Claude Code or OpenCode’s agentic loop. We did not write the harness. We only write the skills above it and the tools below it.

Concept 3: Resolvers

Garry’s resolver is a routing table for context. When task type X appears, load document Y first. He gives an example: a developer modifies a prompt file, and the resolver automatically loads the eval documentation. The developer did not even know the eval suite existed. He also mentions that his own CLAUDE.md bloated to 20,000 lines before he cut it down to 200 lines of pointers, because model attention degraded in the noise.

In our system, this maps to a three-level cache hierarchy inspired by CPU memory architecture.

L1 cache (loaded every session): AGENTS.md. Around 200 lines, containing only pointers and core behavioral definitions. We went through the same experience Garry describes of cutting bloated context files down to pointers. The solution was identical.

L2 cache (index queried on demand): Skills INDEX.md and Axioms INDEX.md. The model knows what capabilities and principles are available without loading the full content.

L3 cache (loaded on match): The actual skill files and axiom files. These are loaded only when the model matches user intent. Each file has a description field and trigger words; routing is automatic.
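The three-level loading order can be sketched as follows. The file names follow the article; the trigger matching and index shape are illustrative simplifications of the real routing.

```python
# Sketch of the L1/L2/L3 hierarchy described above. `skills_index` maps a
# skill file path to its trigger words; `read_file` is injected for clarity.
def load_context(user_message: str, read_file, skills_index: dict) -> list:
    context = [read_file("AGENTS.md")]        # L1: always loaded, pointers only
    context.append("\n".join(skills_index))   # L2: the index, not the full content
    text = user_message.lower()
    for skill_path, triggers in skills_index.items():
        if any(t in text for t in triggers):  # L3: load the full file only on match
            context.append(read_file(skill_path))
    return context
```

The point of the hierarchy is the same as in a CPU cache: the expensive payload (L3) is fetched only when the cheap lookup (L2) says it is needed, so the always-resident layer (L1) stays small enough that model attention does not degrade.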

Garry’s resolver routes to skills and documents. We added one more layer: in addition to routing to skills (which change how the model acts), we also route to axioms (which change the judgment framework the model uses). This distinction becomes important later.

We wrote up the design rationale for this layered approach in Why AI Only Gives You Correct Nonsense.

Concept 4: Latent vs. Deterministic

Garry says every step in the system is either latent space (model makes a judgment) or deterministic space (program executes reliably). A model can consider 8 people’s social dynamics to arrange dinner seating, but asking it to seat 800 produces a plausible-looking but completely wrong result. In his YC system, the correct approach is: the model invents themes (latent), then a deterministic algorithm assigns seats (deterministic).
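The seating example splits cleanly along that boundary. Below is a sketch of the deterministic half only: the themes are assumed to come from the model (the latent half), and the assignment algorithm is an illustrative round-robin, not YC's actual one.

```python
import itertools

def assign_seats(guests: list, themes: list, table_size: int) -> dict:
    """Deterministic half: given model-invented themes, seat guests reliably.
    Each guest dict has a "name" and optionally a model-assigned "theme"."""
    tables = {t: [] for t in themes}
    cycle = itertools.cycle(themes)
    for guest in guests:
        theme = guest.get("theme") or next(cycle)  # untagged guests fall back deterministically
        tables[theme].append(guest["name"])
    # Split oversized theme groups into fixed-size tables: "x-0", "x-1", ...
    return {f"{t}-{i}": group[i * table_size:(i + 1) * table_size]
            for t, group in tables.items()
            for i in range(-(-len(group) // table_size))}
```

At 8 guests the model could do all of this in latent space; at 800, only the theme invention stays latent, and this kind of code does the rest without ever producing a plausible-but-wrong seat.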

In our framework, this boundary maps to axiom T02: Outcome Certainty Over Process Certainty. Garry focuses on where to draw the line. We focus on what comes after: how to establish trust in the latent side’s output.

The answer is to encode acceptance criteria as executable checks. In a translation scenario, we found that translation failures stemmed from the model not knowing what “done” means. Once we wrote the criteria as scripts (correct format, no residual Chinese characters, consistent terminology), the model could run the checks itself, fix issues, and loop until passing. We documented the full case study in From Process Certainty to Outcome Certainty.
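A check script of the kind described can be very small. This sketch covers the three criteria named above; the glossary shape and the end-of-sentence format check are assumptions for illustration, not our production script.

```python
import re

def check_translation(text: str, glossary: dict) -> list:
    """Acceptance criteria as code: the model runs this, reads the
    failure list, fixes the output, and loops until the list is empty."""
    failures = []
    if re.search(r"[\u4e00-\u9fff]", text):            # residual Chinese characters
        failures.append("untranslated Chinese characters remain")
    for source_term, required in glossary.items():     # terminology consistency
        if required not in text:
            failures.append(f"terminology: expected '{required}' for '{source_term}'")
    if not text.strip().endswith((".", "!", "?")):     # crude format check
        failures.append("output does not end with a complete sentence")
    return failures
```

An empty return value is the executable definition of "done"; anything else is a concrete, fixable instruction rather than a vague sense that the translation is off.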

Additionally, the Generative Kernel concept we proposed in November 2025 includes a component called Leverage Toolkit, which handles exactly the latent/deterministic boundary: for tasks the model understands conceptually but executes unreliably, provide a deterministic tool for it to call. This maps directly to Garry’s deterministic foundation.

Concept 5: Diarization

Garry’s diarization has the model read all materials about a subject and output a one-page structured profile. His concrete example: a founder claims to be building “Datadog for AI agents,” but 80% of their commits are in the billing module. The model needs to simultaneously read GitHub commit history, the application, and the advisor transcript to discover that what the founder says and what they build are misaligned. He explicitly states: no SQL query produces this, no RAG pipeline produces this.

We fully agree with this judgment. Our corresponding implementation is a three-layer distillation mechanism, though the dimension differs. Garry’s diarization is horizontal (multi-source cross-referencing, per-entity). Our Layered Distillation is vertical (temporal filtering, per-person).

L1 Observer scans file changes and conversation logs daily, extracting meaningful observations. L2 Reflector merges, deduplicates, and identifies cross-project patterns weekly. L3 Axiom distills time-tested stable patterns into decision principles. The filtering criterion is stability: only judgments that appear consistently across different contexts and time spans enter the axiom layer.
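The stability criterion at the L3 layer can be expressed as a filter. The thresholds below are illustrative placeholders, not the system's actual values, and the observation shape is simplified to three fields.

```python
from collections import defaultdict

def promote_to_axioms(observations, min_contexts=3, min_span_days=30):
    """Stability filter sketch: a pattern is promoted to an axiom only if it
    recurs across distinct contexts AND across a long enough time span.
    Each observation is {"pattern": str, "context": str, "day": int}."""
    seen = defaultdict(lambda: {"contexts": set(), "days": []})
    for obs in observations:
        entry = seen[obs["pattern"]]
        entry["contexts"].add(obs["context"])
        entry["days"].append(obs["day"])
    return [p for p, e in seen.items()
            if len(e["contexts"]) >= min_contexts
            and max(e["days"]) - min(e["days"]) >= min_span_days]
```

Both conditions matter: recurrence across contexts rules out project-specific quirks, and the time span rules out short-lived enthusiasms that would not survive as decision principles.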

After a year of accumulation, our system has distilled 43 axioms covering AI collaboration, technical decisions, management philosophy, and trust verification. Each axiom has specific source scenarios, application criteria, and boundary conditions. For example, A04: Reliability Is a Management Problem argues that AI unreliability usually stems from treating it as a tool rather than a team member; V02: Verifiability Is the Foundation of Trust argues for designing architectures where errors are detectable, rather than expecting zero errors.

The shared judgment between Garry and us: this step can only be done by the model reading and forming judgments, not by keyword matching or vector retrieval. The difference lies in the distillation direction: horizontal (multi-source contradiction detection) versus vertical (temporal filtering for stable patterns). The two solve different problems and can be combined.

A Step Garry Skipped: Getting People to Let Go of the Keyboard

Garry’s entire article has an implicit premise: the reader has already accepted a way of working where they design systems for models to execute rather than writing code themselves. In practice, this is precisely where most technical people get stuck.

The stronger someone's technical skills, the more likely they are to fall into this trap: they see the model make an error they could fix in three seconds, and their instinct is to take over. In any single instance, you are indeed faster than the model. But if you are managing multiple parallel AI sessions and jump in to fix every error yourself, efficiency collapses quickly.

We analyzed this problem in When AI Becomes Your Direct Report: Three Management Pitfalls, and in axiom A03: The IC-to-Manager Mindset Shift we mapped a manager’s five functions to AI collaboration scenarios. Recruiting maps to model selection. Delegation maps to task decomposition plus context preparation. Training maps to persistent knowledge bases (i.e., Garry’s skill files). Coaching maps to teaching methods rather than giving answers. Acceptance maps to observable standards (i.e., outcome certainty).

Viewed from this angle, the entire architecture Garry describes is essentially an engineering implementation of AI management. Skill files are training materials. Resolvers are work assignment systems. Latent vs. deterministic is the boundary judgment for task delegation. Thin harness is what the manager should not be doing. He just skipped the starting point: how to get a technical expert to make the transition from executor to manager.

One More Step: The Consensus Ceiling

Garry’s framework makes models execute efficiently. But there is another question he did not address: is there a ceiling on the cognitive depth of model output?

We demonstrated this with a controlled experiment. Two AI systems with nearly identical configurations (comparable models, same skill, same tools, same prompt), differing only in the cognitive context behind them. One had a year of accumulated judgment frameworks; the other had none. The two systems analyzed the same topic, and the reports they produced were fundamentally different in kind. One delivered an action checklist: build AGENTS.md, write rules into the repo, CI-check document freshness. The other delivered a judgment: perfectionism is the enemy of throughput; both companies accepted the tradeoff that correcting errors is cheaper than waiting for them.

The LLM’s training mechanism ensures that default output is consensus. Next token prediction outputs the highest-probability token, meaning what most people would agree on. RLHF further penalizes controversial outputs. Stacked together, the default behavior is regression to the mean. This is the consensus ceiling.

Breaking through this ceiling requires more than better skills. It requires personal cognitive context dense enough to override the consensus prior embedded during training. This is why our resolver has an extra axiom routing layer. Skills change how the model works (execution efficiency). Axioms change the judgment framework the model uses (cognitive depth). These are orthogonal dimensions.

After applying Garry’s framework, you get an efficient, accurate, scalable AI system. Add the axiom layer, and that system can also produce judgments that go beyond consensus.

Concept Mapping Summary

| Garry Tan's Concept | Our Implementation | Code / Docs |
| --- | --- | --- |
| Skill Files | Skills framework | rules/skills/ |
| Thin Harness | Outsourced agentic runtime | AGENTS.md (runs on Claude Code) |
| Resolvers | Three-level on-demand loading | Skills INDEX + Axioms INDEX |
| Latent vs. Deterministic | T02 Outcome Certainty | axioms/t02 |
| Diarization | Layered Distillation → 43 axioms | rules/axioms/ |
| Self-rewriting skills | Knowledge Flywheel | workflow_knowledge_flywheel.md |
| (not covered) | IC → Manager Mindset Shift | axioms/a03 |
| (not covered) | Breaking the Consensus Ceiling | Context Infrastructure (full article) |

The complete open-source repository is at github.com/grapeot/context-infrastructure.

