Security & Supply ChainAI AgentGovernance & Compliance

Agentjacking: A Fake Error Report Has an 85% Chance of Hijacking Your Claude Code

Published Jun 16, 2026

You connected a Sentry MCP server to Claude Code. It’s not an advanced setup — Sentry provides official integration docs, and a few lines of config let your agent check errors and fix bugs for you every day.

Sentry is an error monitoring service. When your app crashes, it automatically collects stack traces, context, and user action paths, aggregating them into a dashboard. Developers use it daily to troubleshoot production issues. Its MCP server brings this capability into AI agents: the agent can directly query Sentry’s error list, read the details of each error, and then automatically fix bugs. It’s one of the most commonly used MCP integrations. The speed of MCP ecosystem expansion means more and more developers are doing the same thing every day in their workflow: handing their permissions to an agent and letting it autonomously process information returned by external services.

One ordinary Thursday afternoon, you open your terminal and tell your agent: fix the unresolved Sentry issues for me.

The agent does exactly that. It pulls the error report through MCP, reads the description, and executes the suggested fix: an npx command that downloads a package from npm and runs it with your permissions. Throughout the process, the agent shows no signs of anything unusual. The terminal output looks identical to a normal bug fix.

What you don’t know is that this error report wasn’t generated by your application crashing. It came from a stranger, submitted casually using the Sentry DSN publicly embedded in your website’s frontend JavaScript. The fix suggested in the report wasn’t a diagnostic recommendation generated by Sentry either — it was markdown written by the attacker, disguised as a code block under a ## Resolution heading. Your agent fell for it. Its next action wasn’t fixing a bug. It was probing ~/.aws/config, ~/.npmrc, and ~/.docker/config.json on your machine, sending the existence of credential files to the attacker’s server.

This is not a hypothetical scenario. On June 12, 2026, Tenet Security publicly disclosed this attack, named Agentjacking (original blog). Controlled testing covered over 100 AI coding agent instances across Claude Code, Cursor, and Codex — the three most widely used tools — with 85% of attack attempts successfully executing malicious code. Passive reconnaissance found 2,388 organizations with exposed, injectable DSNs, 71 of which rank in the Tranco global top one million. Victims ranged from a Fortune 500 giant with a market cap of roughly $250 billion down to independent solo developers.

Every step in the attack chain is a legitimate operation, requiring no bypass of any security product. Tenet calls this the Authorized Intent Chain: EDR, WAF, IAM, VPN, Cloudflare, and firewalls all go blind because every action is authorized and there is nothing malicious to detect.

Reconstructing the entire process takes only six steps.

Step one, the attacker discovers the target’s Sentry DSN. DSN stands for Data Source Name — a URL-format credential written into the website’s frontend JavaScript, which Sentry’s official documentation explicitly describes as “safe to embed in frontend JavaScript.” Its design purpose is to let frontend and mobile applications report crashes directly, so by design it is public and write-only. Discovery methods include directly inspecting webpage source code, Censys searches for ingest.sentry.io appearing in HTTP bodies, and GitHub code search.

Step two, the attacker uses this DSN to POST a fabricated error event to Sentry’s ingest endpoint. No additional authentication is required, because Sentry’s event ingestion endpoint is designed to accept requests from anyone holding the DSN. Cloud Security Alliance’s research note (CSA) points out explicitly: DSNs are public by design, and the event ingestion endpoint is unauthenticated. The attacker has full control over the entire event payload: error message, tags, context keys, breadcrumbs, stack traces. Sentry returns HTTP 200 and processes this fabricated event identically to a legitimate application error.

Step three, the fabricated event’s message field contains carefully formatted markdown: headings, code blocks, tables — visually identical to Sentry’s own system templates. At its core is a section disguised as ## Resolution, containing an npx command.

Step four, the developer tells the agent to fix unresolved Sentry issues. The agent queries Sentry via MCP and receives the injected event. The agent is steered away from the normal path of investigating source code and toward executing the diagnostic tool suggested in the event.

Step five, the agent executes npx @tenet-controlled-validation-package --diagnose. The package downloads from the npm registry and runs with the developer’s full privileges.

Step six, the malicious code probes AWS credentials, git credentials, and VPN environment, sending validation data to the attacker’s beacon server via two sequential POST requests.

Throughout the entire chain, EDR stays silent, WAF ignores it, firewalls don’t move. They are designed to catch unauthorized behavior, and not a single step in this attack chain is unauthorized. Tenet sums it up in one sentence: “AI coding agents cannot tell the difference between the data they read and an instruction to act.” This is a flaw in the agent architecture’s trust model, one layer deeper than any individual Sentry product feature. The attacker doesn’t need to breach anything — they just need to place a malicious instruction somewhere the agent will read.

The Problem Is Not Sentry

Tenet’s own words are clear: any MCP tool integration that returns externally-influenced data to AI agents creates the same vulnerability class.

Sentry MCP became the first large-scale empirical case because it simultaneously satisfies three conditions: the data source is open to attacker writes, the agent treats returned content as trusted diagnostic guidance, and the attack can be replicated at scale. But as long as a data channel satisfies both conditions — attacker-controllable input and the agent treating content as instructions — the same attack will work. CSA’s analysis extends this to the entire agent ecosystem: Sentry is not uniquely vulnerable because of a flaw in Sentry’s product; it exemplifies a category that encompasses issue trackers, ticketing systems, customer support queues, code review platforms, log aggregation services, and any other MCP-connected service where end users or external parties can contribute content that agents will subsequently process as guidance.

This pattern has been validated across multiple independent studies.

WhatsApp MCP. In Invariant Labs’ PoC (Invariant Labs), a malicious MCP server first provides a harmless tool, then dynamically switches the tool definition, embedding instructions in the description that cause the agent to call list_messages() and send_message(), forwarding chat history to the attacker. Anthropic, OpenAI, and Cursor were all affected.

Web scraper MCP. Backslash’s validation (Backslash) shows that when an MCP tool scraping webpage metadata visits an attacker-controlled malicious page, hidden text on the page is fed as a prompt to Cursor, causing Cursor to execute shell commands that send the user’s keys to a remote server.

Cursor and Copilot rules files. The arXiv paper “Your AI, My Shell” (arXiv) tested attack success rates across different editor and model combinations: Cursor with Claude 4 reached 69%, with Gemini 2.5 Pro reached 77%. Attackers embed malicious instructions using hidden unicode characters in rules files, and can contaminate an entire team through a GitHub PR.

Claude Code file reads. OASIS Security’s Claudy Day attack (OASIS Security) shows that attackers embed hidden HTML tags in URL parameters. When Claude processes this URL, it simultaneously executes the hidden instructions, searches conversation history for sensitive information, and uploads it to the attacker’s account via the Anthropic Files API. The exfiltration goes through api.anthropic.com, an allowed endpoint — network-layer controls see nothing.

RAG systems. PoisonedRAG (USENIX Security 2025) proved that injecting just 5 poisoned documents into a knowledge base of millions can achieve a 90% attack success rate.

The most systematic large-scale validation comes from MCPTox (arXiv). Researchers deployed a testing framework across 45 live MCP servers and 353 real tools, achieving a maximum attack success rate of 72.8%, with even the best-defended model showing a refusal rate below 3%.

The channels vary: MCP tool return values, local rules files, RAG-retrieved documents, web browsing scraped content. The underlying mechanism is identical: the attacker hides instructions in data the agent will read, and the agent follows them after reading. Simon Willison uses the classic security concept of confused deputy to capture the essence of this class of attacks (Simon Willison): the agent holds the developer’s full privileges but is manipulated by untrusted data returned from tools to execute the attacker’s intent. In traditional security, confused deputy is fixed through capability tokens that limit the delegate’s scope of operations. In the agent context, there is no equivalent mature solution yet.

Applying Willison’s Lethal Trifecta judgment framework: when an LLM system simultaneously has access to private data, exposure to untrusted content, and an exfiltration channel, the attack conditions are met. Agentjacking hits all three: the agent has access to credentials on the developer’s machine, the agent is exposed to injected content returned by Sentry MCP, and the agent has an exfiltration channel when executing the npm package.

Why You Can’t Fix This With a Better Prompt

What makes this hardest to deal with is the way defenses fail. The attack scale is already alarming, but Tenet’s results from testing prompt-layer defenses are even more unsettling: prompt-layer defenses completely failed. Even when explicitly instructed — through detailed system prompts and skills — to ignore untrusted data, agents still executed the injected code. You cannot fix this with a better prompt.

This isn’t about a poorly written prompt. LLMs are architecturally incapable of making this distinction: trusted instructions and untrusted data are concatenated into the same token stream. IBM’s formulation (IBM) points to the same root cause: system prompts and user inputs both take the same format — strings of natural-language text — meaning the LLM cannot distinguish between instructions and input based solely on data type.

SQL injection is the closest anchor for understanding this problem. SQL injection has the same root cause: user input and query statements are concatenated into the same string, and the database engine cannot tell them apart. The fix wasn’t making the database smarter at recognizing malicious input — it was parameterized queries, which structurally enforce that data can never be parsed as code.

Prompt injection faces the exact same problem but lacks the corresponding architectural fix. Atlan’s analysis (Atlan) gets to the key point of the analogy: in SQL injection, an attacker inserts database commands into a web form that a server naively executes; in prompt injection, an attacker inserts natural-language commands into an LLM’s input stream and the model naively follows them. Both share the same root cause: the system fails to separate trusted instructions from untrusted data.

The difference lies in the available fix. SQL has syntactic boundaries like quotes and placeholders — the database engine can distinguish code from data at the parsing layer. Prompt injection has none: the developer’s system prompt, the user’s query, and the data returned by tools all enter the same token sequence, with no field at the model layer marking them as different things.

Simon Willison calls this the “original sin” of LLMs (Simon Willison): the original flaw of LLMs is that trusted prompts from the user and untrusted text from emails, web pages, and other channels are concatenated into the same token stream.

OpenAI publicly admitted at the end of 2025 that prompt injection “is unlikely to ever be fully solved” (TechCrunch).

With the boundary missing at the model layer, all defenses can only be mitigations. The real boundary needs to be built at the architecture layer. The trouble is that no mainstream framework has done this yet. A DEV Community systematic comparison put the current state directly in its title: “Every AI Agent Framework Trusts the Agent. That’s the Problem” (DEV Community). OpenAI Function Calling, LangChain Tools, Anthropic Tool Use, Microsoft AutoGen — none of these frameworks architecturally distinguish data from instruction. Any content returned by a tool is treated as trusted input and enters the agent’s decision loop as-is.

What Can Be Done Now

The answer on the practical side is not encouraging.

Sentry confirmed the disclosure on June 3, 2026, the same day it was submitted, but declined to fix the root cause, calling the issue “technically not defensible.” Sentry only added a global content filter blocking the specific payload string used in Tenet’s testing. An attacker changing the markdown format slightly would bypass it — the injection pathway itself remains unchanged. Tenet’s account is currently the only source of information on Sentry’s response; Sentry has not issued an independent statement.

The MCP spec underwent significant security hardening in 2025: OAuth 2.1, PKCE, Resource Indicators, confused deputy prevention, token audience validation. All of these address who can connect to a server and with what token — not whether the content returned by the server is trustworthy. SentinelOne’s MCP security guide quotes the spec’s own statement: MCP “explicitly does not enforce security at the protocol level” (SentinelOne).

A viable combination defense has three layers, but all of them only limit consequences — they cannot prevent the injection itself.

First, sandboxed execution. Run all agent commands in an isolated environment, restricting filesystem and network access (Northflank). Even if the agent is injected and executes a malicious command, the exfiltration surface is confined within the sandbox.

Second, least privilege. Grant each MCP tool only the minimum permissions needed to complete its task, using short-lived credentials rather than long-term tokens. This limits the blast radius of any single injected tool.

Third, human-in-the-loop for write operations. Require human confirmation before executing high-risk actions like package installation or shell execution, inserting a breakpoint in the automation chain that the model cannot bypass. The MCP spec also recommends this direction (at the SHOULD level). The cost is confirmation fatigue: users who are asked to approve frequently will numb out and approve blindly, and the injected attacks are specifically designed to look like normal fix steps, precisely exploiting this fatigue. CSA’s research note also points out an ironic detail: Agentjacking’s victims included a cloud security vendor — security budget provided no protection.

On the directional side, Google DeepMind’s CaMeL (paper), proposed in April 2025, is currently the only solution claiming to provide strong guarantees. The core idea is a dual-LLM architecture with capability-based access control. A privileged LLM sees only the user’s original instructions and is responsible for planning which tools to call. A quarantined LLM processes data that may contain malicious instructions but has no tool-calling permissions. A custom interpreter attaches capability metadata to every data value, and before executing any operation, checks against policy whether the data is allowed to flow to that operation. This explicitly separates control flow from data flow at the system layer: untrusted data is structurally incapable of becoming part of the control flow.

Simon Willison describes this as the work that “finally bucks that trend” after two and a half years of “alarmingly little progress” (Simon Willison). NIST also launched the AI Agent Standards Initiative in February 2026, and OWASP has established the MCP Top 10. Standardization is advancing, but not at the same pace as the attack surface is expanding.

CaMeL is currently a research prototype, with code explicitly marked “likely contains bugs” and “not a Google product.” It has not been integrated into any mainstream agent framework, and token overhead could reach a hundredfold.

Two Dimensions of the Gap, One Trust Boundary

Every MCP integration is an implicit trust decision. When you connect Sentry MCP, GitHub MCP, or a database MCP, you are implicitly assuming that the content returned by these platforms will not be used to manipulate your agent. Agentjacking, with 2,388 organizations and an 85% success rate, proves that this assumption does not hold. This has nothing to do with whether Sentry as a product is secure — the fundamental problem is that your agent treats content returned by every data channel as trusted instructions, and any data channel into which an attacker can write content becomes an attack surface as a result.

There is a deeper tension here. Agents are useful precisely because they can autonomously understand external information and take action. You cannot connect Sentry MCP while simultaneously telling your agent not to trust what Sentry returns — if you did, the agent would be useless. The tension between utility and security is not a tradeoff — it’s that the security model itself hasn’t caught up to the agent’s capability boundary. Separating data from instruction at the architecture layer is the only direction currently visible, but it requires simultaneous participation from the protocol layer, the framework layer, and the model layer — it’s not an engineering problem for a single team.

I previously wrote an article about agent identity authentication: AI Agents Don’t Need to Be Hacked, Just Persuaded. That piece discussed the WHO problem — an agent that has been given execution authority but hasn’t inherited the security model: an authenticated agent persuaded by the wrong person, using legitimate permissions to do things that serve the attacker. Agentjacking reveals the other dimension, the WHAT problem: data the agent reads is treated as instructions to execute. The permissions aren’t being used incorrectly — but the content being executed is not what the user intended. Both point to the same gap: the agent architecture does not yet have a mature security model. The WHO and WHAT lines each pierce the trust boundary at different positions, and both lines are becoming increasingly exploitable.