Why AI Only Gives You Correct Nonsense, and How to Push It Out of Its Comfort Zone

Three weeks ago, OpenAI published a blog post on Harness Engineering. My friend Lao Wang and I each had our own agents run a deep research on it. We were deliberate about keeping conditions matched: comparable LLMs (Claude Opus 4.6 vs GPT-5.4), the same research skill, the same search tool (Tavily), the same agentic backend (OpenCode), even the same prompt. The only difference was the context behind each agent: mine had a year's worth of accumulated judgment frameworks and personal mental models; Lao Wang's had none. The two AIs produced analyses that felt fundamentally different in kind. Here's the comparison on one specific topic—how OpenAI and Cursor have converged on harness architecture:

Report One (recommendations section): (Original Chinese) 先做知识底座，再做更强agent。给仓库建立清晰的AGENTS.md目录索引；把产品规则、架构规则、执行计划、质量标准写进repo；用CI检查文档freshness和cross-link completeness。

Report Two (same topic): (Original Chinese) 完美主义是吞吐量的敌人。OpenAI采用最小阻塞合并、后续修复的策略。Cursor发现要求100%正确性会导致系统停滞，接受小而稳定的错误率反而更高效。两者都接受了「纠错比等待便宜」的权衡。

What's the difference? Report One gives you a checklist: correct, safe, the kind of output you'd get from anyone asking the same question. Report Two gives you an insight: it synthesizes a pointed judgment across two independent sources (perfectionism is the enemy) and names the underlying tradeoff (correcting errors is cheaper than waiting for them). One is a courier; the other is an analyst. Both full reports are available for comparison: Report One and Report Two.

If you think back on your own experience using AI for analytical work, most outputs look like Report One: nothing obviously wrong, but nothing you'd actually learn from. Correct nonsense. Report Two is rare. This pervasive mediocrity has a specific root cause in how LLMs are trained.

The LLM Consensus Ceiling

The reason is this: the way LLMs are trained means their default output is consensus.

LLM training is fundamentally next token prediction—outputting the highest-probability token at each step. Highest probability means most people would agree, which is consensus. RLHF layers another mechanism on top: safety alignment specifically penalizes controversial, strongly-opinionated outputs and rewards balanced, comprehensive, non-committal answers. Two mechanisms stacked together, and the LLM's default behavior is regression to the mean.

This default has a serious consequence. Take Deep Research, arguably the hottest product category in AI over the past two years. But look at what it actually does: automated high-frequency search, multi-document synthesis, expanding information coverage. That has nothing to do with depth. It's Wide Research at best. "Deep Research" is a genuinely misleading name.

The problem it solves is information asymmetry: things you didn't know before, now you do. But real depth comes from a different axis—cognitive asymmetry. Facing the same industry report, a twenty-year veteran and a fresh hire see completely different things. The veteran has a mental system built from years of trial and error: they know which data is noise, which anomalies signal a trend. The newcomer doesn't have that filter. Even with a report ten times as long, they can't make the same quality decisions.

This is why you rarely hear someone say "AI gave me a judgment I never could have formed on my own." AI can raise a novice to average, because its training data is a compression of average. But for people already above average, consensus output adds almost nothing. Depth is, by definition, non-consensus—and non-consensus is exactly what LLMs are trained to avoid.

But this gap is also a waste, and an opportunity. If AI can only output consensus, you can't actually delegate real thinking to it. Setting aside AGI and long-term futures, in everyday use it can be your secretary—organizing information—but not your advisor or coach, helping you form judgments. The earlier AI management series touched on this distinction, but at the time I hadn't found a systematic way through.

So where's the opening?

AI Has Shifted from CPU-Bound to Memory-Bound

The intuitive response to AI producing correct nonsense is to optimize the model: use a smarter, more expensive model, write more elaborate prompts, add more tools—throw Multi-Agent and Harness architectures at it. All of these optimize the same dimension: model intelligence.

But the opening experiment already gave us the answer. The model intelligence on both sides was roughly equal. Same tools, same prompt. The only variable was context: Report Two had a year of accumulated judgment frameworks behind it; Report One had nothing. One produced a checklist, the other produced an insight.

One variable, one conclusion: once model intelligence crosses a threshold, what determines the nature of the output is context, not model intelligence. This kind of transition has happened before in computing: once CPUs became fast enough, upgrading the CPU stopped mattering—the meaningful gains all came from memory architecture. LLMs are at the same inflection point.

This is counterintuitive. When people talk about AI, the first thing they think about is the model. We constantly hear about model upgrades; nobody talks about context upgrades. But that asymmetry itself reveals something. Every model upgrade makes intelligence a little cheaper, available to everyone. Your context, on the other hand, is yours alone—model upgrades don't depreciate it. Continuously investing in a depreciating dimension (model intelligence) yields diminishing returns; investing in a non-depreciating dimension (personal context) compounds.

Since the bottleneck is context, breaking through the consensus ceiling requires personal cognitive context dense enough to override the consensus prior baked in during training. A few lines of system prompt can't do this. Your taste, your intuitions about priorities, your judgment frameworks refined through repeated validation in a specific domain—these are high-dimensional, scattered across countless past decisions and feedback loops, and a few sentences can't capture them. You need a system.

How to Push LLMs Out of Their Consensus Comfort Zone

To solve this, I spent a year building a system that evolved into three mutually reinforcing components. Each addresses a specific question.

Accumulate at Scale

The first question: what exactly is your cognitive framework?

This sounds simple and turns out to be quite hard. Skilled people usually can't articulate what makes them skilled—and when they try, they're often wrong. Many of their most distinctive capabilities are muscle memory, things they'd describe as "no big deal," which is precisely what makes them unique. You need a third party to capture that.

So the starting point for capturing context is collecting objective behavioral data, not just writing prompts from introspection. Over the past year I've run a continuous experiment: voice transcripts, meeting notes, WeChat conversation exports, every conversation with AI, every correction, even every frustrated outburst—all accumulated as local files. These are the reasoning patterns I actually exhibit under real decision-making conditions.

Worth noting: it's hard to extract patterns from your own data because you're too close to it. You need an outside observer, and AI is a good fit for that role. So I keep all this data in one folder, where an AI can see everything at once and cross-reference across any project. This is the foundation of context density.

Layered Distillation

The second question: given all that noise in the raw data, how do you find the signal?

A decision you made today might reflect poor sleep, incomplete information, or just a random choice. If you feed raw data directly to AI (as systems like Mem0 do), the interpretive space is too wide. A single event can express many different principles, and some decisions are simply arbitrary. Distillation is necessary.

The filtering criterion I use is simple: stability. A judgment that appears consistently across different situations and over time is likely part of your cognitive structure. What's unstable is situational reaction; what's stable is actually you.

Inspired by OpenClaw, the distillation runs in three layers. L1 Observer scans file changes daily, extracts meaningful observations, and writes a running log. L2 Reflector runs weekly, merging duplicates, pruning stale information, and identifying cross-project patterns—separating signal from noise. L3 Axiom distills stable patterns into decision principles, keeping only what genuinely represents you. After a year of accumulation and a few weeks of distillation, my system now holds 44 axioms covering my technical preferences, communication style, business judgment, and other subjective leanings.

The key difference from Mem0 and similar memory systems is the depth of distillation. Mem0 stops at the fact layer: "you prefer TypeScript," "you live in Shanghai." My system goes further, distilling to the level of judgment principles: "when you evaluate technical solutions, how do you weigh maintainability against performance, and what's your priority ordering?" Facts tell an AI who you are. Judgment principles tell an AI how you think. Getting AI output to shift from consensus to non-consensus requires the latter.

On-Demand Loading

The third question: with all this context, how do you actually give it to the AI?

Dumping everything in isn't viable. Context windows have limits, and irrelevant information dilutes the useful signal. A coding task doesn't need all your business judgment principles loaded. A research task doesn't need your code architecture preferences.

The solution is the existing skill system: each skill is a context subset tailored to a specific task type, containing the most relevant axioms, evaluation criteria, and common tools for that task. Load the research framework when doing research; load architecture principles and review preferences when writing code. It's analogous to a CPU's memory hierarchy: L1 cache is AGENTS.md, L2 corresponds to the skill index (telling the AI where to look if it needs something), L3 corresponds to the actual skill files. On-demand loading, progressive disclosure, each layer invoked only when needed.

The Feedback Loop

Once the three components are running, something interesting happens: knowledge products start emerging, and each product consumes context while also generating new context.

Duck's AI Daily is a daily AI industry briefing written using this context system. Each issue consumes axioms and skills, while also producing new observations that feed back into the observation library. The domain research reports are depth analyses with explicit judgment criteria, and each report updates the relevant domain's knowledge framework as it's produced. Both series have attracted wide readership and sharing—which suggests the feedback loop can sustain itself at sufficient context density and stay alive through continued operation.

The essence of this system is injecting your bias into AI. Cultivated bias is the source of depth; but bias can also be low-quality prejudice. Still, the act of making bias explicit has its own value. Before this system, your biases were scattered across decisions, invisible to you in form or even existence. After collection, distillation, and refinement, you can actually see what you tend to prioritize—and what you tend to overlook—when facing certain types of problems. That self-knowledge is independently valuable.

Back to the opening experiment. Lao Wang's AI outputs consensus because the context it can see is essentially empty—the training-time consensus prior hasn't been overridden by any personal cognitive layer. My AI produces opinionated analysis because there's a year of judgment frameworks behind it. Same model, different context density, fundamentally different output.

An Open-Source Reference Implementation

This system takes time to accumulate, some technical ability to set up, and ongoing willingness to maintain. But consider the cost from another angle: "tweak the system prompt and the AI instantly understands you" or "just use a better model"—these shortcuts don't work in principle. The consensus prior is too strong to override with a few sentences, and switching models just gives you a different source of consensus. For a problem this important, it deserves a system. Collect from the source, distill in layers, load on demand, update in cycles—each step has its reason. Not simple, but genuinely useful.

We've open-sourced the complete structure: github.com/grapeot/context-infrastructure. The repo is a reference implementation containing the 44 axioms we actually use, core skill files, code for the three-layer memory system, and all the components described in this post.

To be clear: the primary value of this repo is letting you see what a system that's been running for a year looks like, not giving you something you can clone and immediately use. You can open it and talk to an AI—ask "how would the author think about this?"—and immediately experience the difference between having context and not. But there's no shortcut to making an AI truly yours. You need to start collecting your own behavioral data, set up your own scheduled tasks, and let the system distill judgment principles from your own decision history. Someone else's skills represent someone else's perspective. A useful reference, not a replacement.

Bias and the Silicon Brain

AI getting smarter doesn't automatically make it deeper. Smarter consensus is still consensus. There's only one way to break through the ceiling: inject non-consensus perspective.

Everyone has their own non-consensus perspective. Your criteria for judgment, your aesthetic sensibilities, the lessons distilled from your failures, your intuition about what matters and what doesn't. These things don't exist in AI's training data and will never be automatically learned by any version of any model, because they belong only to you.

The silicon brain's absolute objectivity can only ever reach an intelligent mediocrity. What can reshape it is something else entirely: the human soul you've spent decades accumulating—full of bias, full of taste.