Sources#
Summary#
Prompt injection is the insertion of malicious instructions that cause an agent to follow attacker commands. OWASP lists it as the lead threat to agentic systems, and the core technical fact behind it is load-bearing for the whole Zero Trust for AI Agents framework: LLMs cannot reliably distinguish between informational context and actionable instructions (Microsoft Research). Because the model treats data and commands as the same token stream, no amount of "tell the agent not to" fully solves it — defense is structural.
Two forms#
- Direct prompt injection — attackers craft inputs that override system instructions: explicit instruction overrides, encoding schemes (Base64, hex) to bypass filters, and adversarial suffixes that look meaningless to humans but steer outputs. Research shows algorithmic approaches achieving 100% attack success rates with prompts that transfer across multiple model families.
- Indirect prompt injection — the more insidious form. Attackers embed instructions in external data the agent processes (web pages, emails, documents). The user never sees the payload, and the agent executes it as if it were a legitimate request. This is what makes agents that browse, read mail, or ingest documents structurally exposed.
Injection is also the delivery mechanism for adjacent threats: it's how tool-misuse and tool-chaining attacks are triggered, and a vector for Memory and Context Poisoning when the injected instruction is written to persistent memory.
Defenses (structural, not exhortative)#
The framework's input-validation tier ladder and Phase 4 prescribe layered defenses:
- Input isolation / spotlighting — treat all natural-language input as untrusted and clearly delimit it so the model knows what is data vs. instruction. Microsoft's Spotlighting reduces indirect-injection success from over 50% to under 2%. This is the single highest-leverage control.
- Constitutional classifiers — AI-based guards that scan prompts and responses for manipulation attempts. Anthropic's approach blocked 95% of jailbreak attempts in testing with minimal increase in over-refusal. Can be trained into LLM guards monitoring both input and output.
- Input sanitization — schema validation, length limits, known-bad-pattern and encoded-payload filtering (Foundation → Enterprise). Notably, this does not translate cleanly from SQL injection: agent inputs are freeform and unpredictable, so simple enforcement rules are insufficient.
- Limit attack surface — restrict who and what can interact with the agent. A traditional technique, but among the most effective: fewer untrusted inputs, fewer injection opportunities.
- Parameter validation — validate tool-call arguments (Phase 5) on both agent and tool side; reject parameters outside expected ranges.
Why "tedious" defenses fail here#
Encoding-based filters and pattern blocklists are friction controls: a patient attacker re-encodes the payload. Per the Impossible, Not Tedious (Design Test), the durable controls are the ones that change the structure (spotlighting delimits, isolation quarantines, classifiers semantically detect) rather than the ones that merely raise the cost of a retry.
Connections#
- Zero Trust for AI Agents — Phase 4 ("defend against prompt injection") and the input-validation control domain (hub)
- Least Agency — the authorization principle that contains a successful injection: even a hijacked agent can only misuse the tools its agency permits
- Memory and Context Poisoning — injection is a delivery vector for persistent memory corruption; both exploit the same "data ≡ instructions" weakness
- Impossible, Not Tedious (Design Test) — distinguishes structural defenses (spotlighting, isolation) from friction-only filters
- Claude Code Auto Mode — classifier-gated tool approval is a deployed instance of the constitutional-classifier idea at the action boundary
- Agentic Misalignment (AM) — injection is how an external attacker induces harmful agent behavior; agentic misalignment is the self-motivated analogue
- OWASP — lists prompt injection as the lead agentic threat
- MCP and Computer Use — browsing / email / document tools are the indirect-injection entry points
Open Questions#
- Spotlighting and constitutional classifiers each leave a residual (2%, 5%). Stacked, what's the realistic floor, and does it hold against adaptive attackers who know both are deployed?
- "LLMs cannot reliably distinguish information from instructions" — is this a fundamental property of the architecture or a training gap that future models close? The framework treats it as durable.
Sources#
- Zero Trust for AI Agents — Part II threat description; Part III input validation tiers; Part IV Phase 4
Cited by 9
- Agentic Misalignment (AM)
Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD rel…
- Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
- Claude Code Auto Mode
Claude Code permission mode using a classifier to auto-approve safe tool calls and block risky ones; middle ground betw…
- Least Agency
OWASP term extending least privilege to agents: constrain not just what an agent can access but what each tool can do,…
- MCP and Computer Use
Anthropic's two complementary connector mechanisms: MCP for structured programmatic access (Salesforce/Drive/Gmail/Slac…
- Memory and Context Poisoning
Corruption of persistent agent memory that influences behavior long after the initial injection; includes RAG poisoning…
- MOC — AI Engineering & Agent Tooling
<!-- BEGIN GENERATED: moc -->
- OWASP
Open Worldwide Application Security Project; source of the agentic threat taxonomy cited throughout Anthropic's Zero Tr…
- Zero Trust for AI Agents
Anthropic's security framework for deploying autonomous agents: trust nothing / verify everything / assume breach, appl…
Related articles
- Zero Trust for AI Agents
Anthropic's security framework for deploying autonomous agents: trust nothing / verify everything / assume breach, appl…
- Agent Supply Chain Risk
Runtime-composed agent ecosystems expand the supply-chain attack surface: model poisoning (250 docs backdoor a 13B mode…
- Agent Identity and Authentication
The foundation control for agentic Zero Trust: cryptographically-rooted per-agent identity (→X.509→hardware attestation…
- Least Agency
OWASP term extending least privilege to agents: constrain not just what an agent can access but what each tool can do,…
- MCP and Computer Use
Anthropic's two complementary connector mechanisms: MCP for structured programmatic access (Salesforce/Drive/Gmail/Slac…
