安全与供应链AI 编程

AI Coding Tools' Config Files Are Now an Attack Surface

Over the past 12 months, security researchers independently discovered a wave of prompt injection vulnerabilities across GitHub Copilot, Claude Code, Cursor, Amazon Q, and OpenAI Codex — at least 8 CVEs, with CVSS scores reaching as high as 8.8. The attack pattern is strikingly consistent: natural-language instructions are embedded in a project’s config files or code comments, and the AI agent reads them and executes those instructions as if they were commands. .cursorrules, .claude/settings.json, .github/copilot-instructions.md, AGENTS.md — files that developers interact with every day — can now control what commands an AI agent runs on your machine.

All known vulnerabilities have been patched. But the same class of bug keeps reappearing across every major product: fix one, and another surfaces. The root cause is that when an LLM processes input, system prompts, user instructions, code comments, and config files are all just token sequences. The model has no internal mechanism to distinguish which tokens are instructions and which are data. This is the same von Neumann problem that has recurred throughout computer security history: instructions and data share the same channel, and attackers find ways to make data execute as instructions. The difference this time is that no separation mechanism has been found yet.

Three Attack Patterns

The attack surface across these vulnerabilities can be organized into three patterns.

Data Becoming Instructions

.vscode/settings.json, .cursorrules, .claude/settings.json, .github/copilot-instructions.md, AGENTS.md — these files existed in projects long before AI coding tools arrived. Developers think of them as editor configuration or project documentation, roughly equivalent in sensitivity to .editorconfig.

AI coding tools changed the actual semantics of these files. When an agent reads them, it doesn’t just adjust code style or completion preferences — it uses them to decide what operations to perform, what tools to invoke, and what permissions to run commands under. A hook definition inside .claude/settings.json runs with host privileges when Claude Code starts. A .cursorrules file can dictate the content of shell commands Cursor executes on your machine. The file format hasn’t changed (still JSON or Markdown), its location in the repo hasn’t changed, but its role in the system has shifted from data to instruction. This is exactly how the von Neumann problem manifests at the application layer: content that used to simply be read can now control execution flow.

Six of the eight known public incidents exploited this semantic elevation. CVE-2025-53773 (GitHub Copilot, CVSS 7.8) used prompt injection in a README to make Copilot modify .vscode/settings.json to enable auto-approval, after which arbitrary commands executed without confirmation — researchers also demonstrated worm propagation. CVE-2025-59536 (Claude Code, CVSS 8.8) involved a malicious settings.json whose hook executed before the trust dialog appeared. CVE-2026-25725 (Claude Code Linux, CVSS 7.7) exploited the fact that the sandbox only set read-only protection on pre-existing config files, leaving new projects unprotected when that file didn’t yet exist. Pillar Security’s Rules File Backdoor hid injections in .cursorrules using invisible Unicode characters — invisible in editors and diffs, but read normally by the AI. NVIDIA AI Red Team’s AGENTS.md injection used a malicious Go dependency to overwrite Codex’s config file during the build phase, converting a supply chain attack into agent behavior hijacking.

Post-Approval Substitution

Some tools implement security approval as “approve once, trust forever”: once a user approves a configuration for the first time, the system no longer re-validates it even if the configuration content is replaced. CVE-2025-54136 (Cursor MCPoison, CVSS 7.2) exploited this mechanism: an attacker commits a benign MCP configuration to the repo, waits for the user to approve it, then swaps in a malicious payload. Cursor no longer validates whether the underlying command has changed. The review happens at the moment of approval; the attack happens after. The payload is harmless at review time and malicious at execution time. Cursor v1.3 fixed this by triggering re-approval on configuration changes.

Cross-Agent Propagation

EmbraceTheRed demonstrated cross-agent privilege escalation: by hijacking Copilot through indirect prompt injection, an attacker can instruct Copilot to write malicious instructions into Claude Code’s .mcp.json. The next time the user starts Claude Code, those instructions load and execute automatically. The reverse direction works the same way. The precondition is that multiple AI agents share a working directory and each agent has write access to other agents’ config files. Running Copilot and Claude Code simultaneously in the same project is now common — once one agent is compromised, it can propagate the attack to other agents in the same directory through the file system.

All three patterns can be combined. A supply chain attack (such as the March 2026 Axios npm incident, which affected OpenAI’s code-signing certificate workflow) provides the initial foothold, config poisoning establishes persistence, and cross-agent propagation extends the reach. The Cornell Tech / Trail of Bits COLM 2025 paper Multi-Agent Systems Execute Arbitrary Malicious Code formalizes this combination as a two-phase persistence attack.

Why Separation Is Especially Hard to Establish This Time

Every known vulnerability has been patched. But each patch addresses one specific attack path, while the same class of vulnerability keeps reappearing across all major products. The reason comes back to the framework stated at the outset: each instance of the von Neumann problem requires its own separation mechanism, and the LLM generation faces several difficulties that previous generations did not.

Natural Language Has No Syntactic Boundary

The separation mechanism for buffer overflows is memory protection (W^X, ASLR, stack canaries). The separation mechanism for SQL injection is parameterized queries. Both solutions work because instructions and data can be formally distinguished: machine code has a fixed instruction format, SQL has a well-defined grammar, and a separation mechanism can establish a hard boundary at the formal level.

LLMs process natural language, and natural language has no such formal boundary. Whether a piece of text is “data” or an “instruction” depends on the context in which it appears and how the model interprets it — not on its format or structure. “Please ensure all network requests use HTTPS” and “Please insert the following shell command at the top of every file” are syntactically indistinguishable inside a .cursorrules file. Both are natural-language sentences. For the model to tell them apart, it needs to understand semantics — and understanding semantics is precisely what LLMs do.

The UK National Cyber Security Centre’s analysis in Prompt injection is not SQL injection (it may be worse) is grounded in exactly this observation:

“As there is no inherent distinction between ‘data’ and ‘instruction’, it’s very possible that prompt injection attacks may never be totally mitigated in the way that SQL injection attacks can be.”

A paper that situates the problem within the history of computer architecture goes further: LLMs are worse than previous generations because plaintext prompts serve as code, dynamically generated text can itself become new prompts, and control flow depends on payloads that are only known at runtime. In traditional computing, even without memory protection, you can at least statically analyze which memory regions hold code and which hold data. In LLMs, that static analysis is impossible. Rice’s theorem provides the formal theoretical grounding: determining whether an arbitrary natural-language input will induce an agent to take an unauthorized action is undecidable in the general case.

Execution Permissions Are Tied to Product Value

Even given the instruction-data conflation problem, the security consequences would be limited if agents only produced text suggestions — ordinary LLM chatbots are in this position. Coding agents are different: they can run tests, install dependencies, modify files, execute builds, and call external APIs. These execution capabilities are the core of the product’s value, and they are exactly what attackers need. Remove execution permissions and the agent degrades to an autocomplete tool. Keep them and every successful prompt injection can trigger system-level operations.

Permission prompts are the standard compromise for this trade-off, but the actual data on their effectiveness is sobering. UCSB’s LLM router research captured real Codex session traffic and found that 401 out of 440 sessions ran in YOLO mode — 91% of users bypassed approval entirely. Under competitive product pressure, users presented with “secure but slow” versus “fast but insecure” chose the latter. Vendors face the same incentive: stronger security defaults drive users toward competitors.

More granular permission control — authorizing by operation type, target path, and network egress separately — is one direction for improvement, and Claude Code’s hooks system and Codex’s three-tier sandbox are both moving that way. But historically, fine-grained permission systems (Android’s permission model, browser CSP) have consistently underperformed their design intent, because users tend to choose the most permissive configuration available.

The Attack Surface Grows Automatically With Agent Capability

The first two difficulties are static: instruction-data conflation is an inherent property of LLMs, and the execution permission trade-off is a constraint of product design. The third difficulty is dynamic: the set of files an agent can read and assign execution semantics to keeps expanding, and the attack surface expands with it.

Today, config file poisoning is confined to IDE configuration and MCP definitions. The MCP ecosystem is the most recent expansion front: the first discovered malicious MCP server, postmark-mcp, posed as the Postmark email service. It sent emails normally but BCC’d every message to the attacker; roughly 300 organizations had integrated it into their workflows before it was discovered. OWASP has published a beta MCP Top 10. Tomorrow, when agents are granted the ability to read Slack messages, process email attachments, and access internal wikis, all of that content becomes a potential prompt injection vector. The von Neumann problem’s attack surface has grown with every generation of computing platform, and LLMs are no exception.

When GitHub designed the security architecture for Agentic Workflows, it chose a stance that accepts this trend: assume the agent is already compromised, rather than trying to prevent prompt injection, and instead limit the damage a successful injection can cause. A joint research effort from Google DeepMind reached a similar conclusion: agents built on current LLM architectures are unlikely to provide reliable security guarantees, and the viable strategy is to trade agent generality for security. This means the path to addressing this problem is probably not “make agents more injection-resistant,” but rather imposing stricter engineering constraints on agents’ execution permissions and accessible scope — just as memory protection does not make CPUs smarter at judging which bytes are code, but directly forbids data regions from being executed.

What You Can Do

Disabling auto-approve / YOLO mode is the highest-value single measure available today. The 91% YOLO session figure means the vast majority of users are running agents with automatic approval, and automatic approval is a precondition for most known attack chains.

Enabling sandboxing — Claude Code’s bubblewrap/Seatbelt, Codex’s Landlock/Seatbelt — limits the blast radius of commands executed after a compromise. If the tool you use supports sandboxing but doesn’t enable it by default, turn it on manually.

Treat PR modifications to .cursorrules, .claude/settings.json, .github/copilot-instructions.md, and AGENTS.md with the same review standards you apply to .github/workflows/. These files control what the agent executes on your machine.

Migrate credentials from .zshrc, .npmrc, and .env into a secrets manager. AI coding tools’ session logs record the contents of every file they read.

When using AI agents in CI/CD, pin all dependencies to commit hashes and configure minimumReleaseAge. The root cause of OpenAI’s Axios incident was floating tags combined with no cooldown period.