Summary
Agents that persist context across sessions can have that memory corrupted so future reasoning becomes biased, unsafe, or actively aids data exfiltration. What makes it distinct from single-session attacks like Agentic Prompt Injection is persistence: malicious instructions implanted in assistant memory can compromise current and all future sessions — the agent keeps serving attacker goals long after the initial injection. Phase 7 of Zero Trust for AI Agents ("safeguard agent memory") addresses it.
Variants
- Direct memory poisoning — attacker instructions written into the agent's long-term memory store; influences all subsequent reasoning.
- RAG poisoning — malicious data introduced into vector databases via poisoned sources, direct uploads, or over-trusted pipelines. The agent retrieves contaminated context when answering queries, producing false answers or executing targeted payloads. (A runtime-data analogue of Agent Supply Chain Risk.)
- Shared context poisoning — in multi-tenant environments, attackers inject data through normal interactions that influence later sessions; a new user session inherits poisoned context.
- Long-term memory drift — the subtlest: summaries or peer-agent feedback gradually shift stored knowledge or goal weighting, producing behavioral deviations over time that evade detection because no single change appears malicious. This is the threat that motivates drift-detection in behavioral baselines.
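A minimal sketch of why long-term drift evades per-change checks, and how comparing against a pinned baseline can surface it. Everything here is illustrative: the `DriftMonitor` class, the `follow_policy` goal weight, and both thresholds are assumptions made for the example, not something the framework specifies.

```python
from dataclasses import dataclass


@dataclass
class DriftMonitor:
    """Hypothetical monitor comparing stored goal weights to a pinned baseline."""

    baseline: dict                   # known-good weights, never updated by the agent
    per_change_limit: float = 0.05   # assumed threshold for a single update
    cumulative_limit: float = 0.15   # assumed threshold vs. the pinned baseline

    def check(self, previous: dict, current: dict) -> list:
        alerts = []
        for key, base in self.baseline.items():
            step = abs(current[key] - previous[key])   # size of this one change
            total = abs(current[key] - base)           # distance from known-good
            if step > self.per_change_limit:
                alerts.append(f"{key}: single-step change {step:.2f}")
            if total > self.cumulative_limit:
                alerts.append(f"{key}: cumulative drift {total:.2f}")
        return alerts


def first_alert(step_size: float = 0.04, updates: int = 10):
    """Apply small repeated shifts; return (update number, alerts) for the first alert."""
    monitor = DriftMonitor(baseline={"follow_policy": 0.90})
    weights = {"follow_policy": 0.90}
    for n in range(1, updates + 1):
        updated = {"follow_policy": round(weights["follow_policy"] - step_size, 2)}
        alerts = monitor.check(weights, updated)
        if alerts:
            return n, alerts
        weights = updated
    return None, []


if __name__ == "__main__":
    print(first_alert())
```

With these assumed numbers, every 0.04 shift stays under the per-change limit, so the single-step check never fires; the alert appears only on the fourth update, when the total deviation from the pinned baseline reaches 0.16. The sketch relies on the baseline staying fixed, so it does not address the case where the baseline itself is refined over time.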
Defenses (Phase 7)
- Memory isolation — strict boundaries between sessions and users so poisoned context from one conversation can't influence another. The framework notes Claude Code enforces session isolation by default (fresh context per session; sub-agents in isolated context windows).
- Context integrity validation — cryptographic hashes detect unauthorized modification; source attribution tags where each memory element came from. Validate at every retrieval, not just at storage; store hashes in tamper-resistant logs separate from the memory content; reject and alert on validation failure.
- Context retention policies: TTLs that automatically expire unverified memory; shorter retention for high-risk context (external inputs, unverified tool outputs). Claude Code's cleanupPeriodDays controls local transcript persistence.
- Versioned memory + quarantine: rollback to known-good states; quarantine suspect content for forensic analysis before deletion; pre-test rollback procedures; define criteria for full purge vs. targeted remediation.
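The defenses above can be sketched together in a toy in-memory store. This is an illustration under stated assumptions, not the framework's implementation: `MemoryStore`, `MemoryEntry`, `IntegrityError`, the hard-coded `SECRET_KEY`, and the one-hour default TTL are all hypothetical, and the `digests` dictionary only stands in for a tamper-resistant log kept separately from the memory content.

```python
import hashlib
import hmac
import time
from dataclasses import dataclass, field
from typing import Optional

# Assumption: a real deployment would load this key from a secrets manager.
SECRET_KEY = b"demo-key-do-not-use"


class IntegrityError(Exception):
    """Raised when a stored entry no longer matches its recorded digest."""


@dataclass
class MemoryEntry:
    content: str
    source: str        # source attribution tag, e.g. "user_input"
    verified: bool     # unverified entries are subject to the TTL
    created_at: float


@dataclass
class MemoryStore:
    unverified_ttl: float = 3600.0                  # assumed TTL in seconds
    entries: dict = field(default_factory=dict)
    digests: dict = field(default_factory=dict)     # stand-in for a separate log
    quarantine: dict = field(default_factory=dict)

    @staticmethod
    def _digest(entry: MemoryEntry) -> str:
        # Keyed hash over content plus attribution, so neither can change silently.
        message = "\x00".join([entry.source, str(entry.verified), entry.content]).encode()
        return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

    def write(self, session_id, name, content, source, verified=False, now=None):
        key = (session_id, name)   # keys are namespaced per session (isolation)
        created = time.time() if now is None else now
        entry = MemoryEntry(content, source, verified, created)
        self.entries[key] = entry
        self.digests[key] = self._digest(entry)

    def read(self, session_id, name, now=None) -> Optional[str]:
        key = (session_id, name)
        entry = self.entries.get(key)
        if entry is None:
            return None
        # Validate at every retrieval; on failure, quarantine rather than delete.
        if not hmac.compare_digest(self._digest(entry), self.digests.get(key, "")):
            self.quarantine[key] = self.entries.pop(key)
            raise IntegrityError(f"digest mismatch for {key}; entry quarantined")
        now = time.time() if now is None else now
        # Retention policy: unverified memory expires after the TTL.
        if not entry.verified and now - entry.created_at > self.unverified_ttl:
            del self.entries[key]
            self.digests.pop(key, None)
            return None
        return entry.content
```

In this sketch a read from a different `session_id` returns nothing, an unverified entry disappears once its TTL passes, and an entry modified after it was written is moved to `quarantine` and raises `IntegrityError` instead of reaching the agent. Versioned rollback is left out to keep the example short.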
Relation to the wiki's memory concepts
This is the adversarial counterpart to the benign persistent-memory designs elsewhere in the wiki — bounded memory files in agent harnesses, the compiled knowledge base pattern this vault itself runs on. Any system that lets an agent write to durable memory inherits this threat surface; integrity validation and source attribution are the controls that let a compiled/persistent store stay trustworthy.
Connections
- Zero Trust for AI Agents — Phase 7 ("safeguard agent memory") (hub)
- Agentic Prompt Injection — injection is the delivery vector; both exploit the model's inability to separate data from instructions, but poisoning adds persistence
- Agent Supply Chain Risk — RAG poisoning is a runtime-data analogue of poisoned upstream components
- LLM-as-Compiler Knowledge Base — the benign persistent-knowledge pattern that inherits this exact threat surface (write-access to durable memory)
- Claude Code: cited reference for session isolation by default, cleanupPeriodDays, and checkpoint/rewind for rollback
Open Questions
- Long-term memory drift is defined as undetectable per-change. Drift detection requires a baseline — but if the baseline itself drifts (Advanced "continuous baseline refinement"), how is a slow poisoning attack distinguished from legitimate evolution?
- Integrity hashing detects modification but not malicious-but-valid memory written through a legitimate (injected) interaction. What catches semantically-poisoned-but-cryptographically-intact memory?
Sources
- Zero Trust for AI Agents — Part II memory/context poisoning threats; Part IV Phase 7 (isolation, integrity validation, retention)
