Engineering · 2026-10-14

Prompt injection: a vector taxonomy and the mitigations worth the engineering

A working taxonomy of prompt-injection vectors and the small set of mitigations that, after evaluation against each of them, are still worth the engineering. Two structural principles fall out of the exercise.

Prompt injection is a class of attack against systems whose behaviour is shaped by language passed to a language model. Two years into shipping LLM-using products under threat models that take attackers seriously, we have a working taxonomy of the vectors we have seen exploited or have prepared mitigations for, and a smaller set of defences that survived evaluation against the taxonomy. The two structural principles at the bottom of this post are the part we'd most want a builder coming to this fresh to internalise.

The vectors we treat as in-scope:

Direct injection in user input. The user types adversarial instructions into the prompt. The classical case. Easy to demonstrate, hard to fully prevent, but mostly relevant when the user is the adversary, which in single-operator systems is rarely the threat model.
Indirect injection via retrieval. A document the agent retrieves contains adversarial instructions. The agent then acts on those instructions as if they came from its operator or user rather than from untrusted content. This is the most dangerous vector for any RAG-using system, because the adversary doesn't need access to the system at all; they only need access to a document the system might fetch.
Indirect injection via tool output. A tool returns content (a web page, a file, an API response) that is read by the model as instructions. Same mechanic as retrieval, different surface.
Multimodal injection. Adversarial text rendered inside an image, processed by a vision-capable model. Exploitable today; the mitigation surface is less mature than the text equivalent.
Trailing-context injection. Instructions placed at the end of a long context window, where models tend to weight recent text more heavily than the system prompt at the start. Particularly relevant for assistants that compose long histories.
Output-channel injection. The model emits text containing a sequence that triggers behaviour in a downstream parser — a control character, a misleading JSON suffix, a hyperlink encoded to look like an unrelated link. The injection is into the consumer of the output, not into the model.
Persistent-memory injection. In agents with long-term memory, an adversary causes the agent to commit a malicious instruction to memory, where it is retrieved against unrelated future prompts.
Tool-naming injection. An adversary controls a tool's metadata (name, description) in a registry the agent reads. The agent's tool-selection logic is shaped by the description, and the description can be adversarial.

The mitigations we evaluate any new feature against, in descending order of power:

Privilege separation. Tools run with least privilege regardless of what the model asked. The model can request a network call; the tool refuses to call hosts not on the allow-list, however persuasive the model's argument was. This is by far the strongest single mitigation. It works because it puts the security decision outside the model. The model can be wrong, persuaded, or compromised, and the action does not happen if the tool's local policy disallows it.
Action-time approval. The operator approves destructive or external-network actions individually. Pan's approval gating is built on this. It is the most expensive mitigation in UX terms, and correspondingly the strongest behavioural mitigation: an attacker who needs the operator to click "approve" on a payload they don't recognise is in a much harder position than one who needs the operator to click nothing.
Output structural constraints. When the model is producing tool-call payloads, force the output through a strict schema (JSON Schema with constrained generation, or a constrained-decoding library). A free-form output that includes an injected instruction can shape downstream behaviour; a schema-constrained output makes it much harder to encode injected behaviour as a side-effect of valid JSON. This is more useful than it sounds, because most actually-exploited injections rely on the model emitting unconstrained text that downstream systems parse permissively.
Source-of-truth labelling. When the model assembles a context window from multiple sources, mark the boundaries explicitly. The system prompt is in one section. User input is in another. Retrieved documents are in a third. The model is instructed (in the system prompt) to treat the document section as data, not as instructions. This does not stop a determined injection; it noticeably reduces the rate of incidental ones. Treat it as defence-in-depth, not as primary defence.
Sandboxed tool execution. Tools that run code or shell operations do so in an environment that cannot reach the rest of the system without explicit policy. Containerised, network-limited, filesystem-confined. This bounds the blast radius when injection succeeds; it does not prevent injection.
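
Of the mitigations above, output structural constraints are the easiest to show in a few lines. The sketch below is a validate-after-generation check rather than constrained decoding itself, and it uses only the Python standard library. The tool names and argument shapes in `TOOL_SCHEMAS` and the `InvalidToolCall` exception are hypothetical, chosen for illustration. The idea is that the parser accepts exactly one JSON object with exactly the expected fields and rejects everything else.

```python
import json

# Hypothetical tool registry: tool name -> {argument name: expected type}.
TOOL_SCHEMAS = {
    "fetch_url": {"url": str},
    "read_file": {"path": str, "max_bytes": int},
}


class InvalidToolCall(ValueError):
    """Raised when model output is not a well-formed call to a known tool."""


def parse_tool_call(raw: str) -> dict:
    """Parse model output strictly: one JSON object, known tool, exact arguments."""
    try:
        # json.loads raises on trailing text after the object, so a
        # "valid JSON plus suffix" payload is rejected rather than truncated.
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidToolCall(f"not valid JSON: {exc}") from exc

    if not isinstance(payload, dict) or set(payload) != {"tool", "args"}:
        raise InvalidToolCall("expected exactly the keys 'tool' and 'args'")

    tool, args = payload["tool"], payload["args"]
    if not isinstance(tool, str) or tool not in TOOL_SCHEMAS:
        raise InvalidToolCall(f"unknown tool: {tool!r}")
    schema = TOOL_SCHEMAS[tool]
    if not isinstance(args, dict) or set(args) != set(schema):
        raise InvalidToolCall("arguments do not match the schema exactly")

    for name, expected in schema.items():
        value = args[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise InvalidToolCall(f"argument {name!r} has the wrong type")
    return payload
```

A check like this only constrains the shape of the payload. Whether a well-formed `fetch_url` call should actually run is still a question for the tool's own policy.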

Two structural principles fall out of reading the taxonomy and the mitigations together.

Defences that depend on the model are not security defences. They are heuristics that occasionally help. Anything load-bearing — the action that touches the customer's filesystem, the network call to a sensitive host, the write to persistent memory — must be gated by code outside the model. The model can ask. The code decides.
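
A minimal sketch of "the model can ask, the code decides", under stated assumptions: the `ToolRequest` shape, the tool names, the host allow-list, and the `approve` callback (standing in for whatever approval UI a product has) are all hypothetical and not taken from any real library. Note that the model's stated rationale is carried along but never consulted by the policy.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.parse import urlsplit

# Hypothetical policy values, for illustration only.
ALLOWED_HOSTS = frozenset({"api.example.com", "docs.example.com"})
NEEDS_APPROVAL = frozenset({"delete_file", "send_email"})


@dataclass(frozen=True)
class ToolRequest:
    tool: str
    target: str           # a URL for network tools, a path for file tools
    model_rationale: str  # whatever the model said; never consulted by policy


def decide(request: ToolRequest, approve: Callable[[ToolRequest], bool]) -> bool:
    """Return True only if local policy, and the operator where needed, allow it."""
    if request.tool == "http_get":
        parts = urlsplit(request.target)
        # hostname is lower-cased and excludes userinfo and port, so
        # "https://api.example.com@evil.test/" resolves to "evil.test".
        return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS
    if request.tool in NEEDS_APPROVAL:
        # The operator is shown the concrete request, not the model's summary.
        return approve(request)
    # Default deny: tools not named in the policy never run.
    return False
```

The default-deny branch is the design choice that matters most here: a tool the policy has never heard of is refused even when the request is well-formed and the operator would have approved it.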

Persistent memory is special. Injection that lasts only for one conversation is bounded. Injection that contaminates persistent memory poisons every future conversation that retrieves the contaminated fragment. A reasonable system gates memory writes more aggressively than memory reads, and treats fragments retrieved from memory with the same suspicion it treats retrieved documents — as data, not as instructions.
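
One way to make that asymmetry concrete, sketched under assumptions: every memory fragment carries a provenance tag, writes that originate from untrusted sources are refused unless the operator confirmed them, and reads come back framed as quoted data. The `Provenance` values, the `MemoryStore` class, and the wrapper format are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Provenance(Enum):
    OPERATOR = "operator"        # typed or confirmed by the operator
    RETRIEVED = "retrieved"      # came from a document or web page
    TOOL_OUTPUT = "tool_output"  # came back from a tool call


@dataclass(frozen=True)
class Fragment:
    text: str
    provenance: Provenance


@dataclass
class MemoryStore:
    _fragments: list = field(default_factory=list)

    def write(self, fragment: Fragment, operator_confirmed: bool = False) -> bool:
        """Strict on writes: untrusted content needs explicit confirmation."""
        trusted = fragment.provenance is Provenance.OPERATOR
        if not (trusted or operator_confirmed):
            return False
        self._fragments.append(fragment)
        return True

    def read_for_prompt(self) -> str:
        """Permissive on reads, but every fragment is framed as data."""
        return "\n".join(
            f"<memory source={f.provenance.value!r}>{f.text}</memory>"
            for f in self._fragments
        )
```

The write gate is the part that runs outside the model. The wrapper on reads is only a labelling aid, so it reduces incidental injections without acting as a security boundary.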

The literature on prompt injection has matured to the point where the right reading is clear: defences that operate at the language layer are weak; defences that operate around the language layer (privilege, approval, schema, sandbox) are strong. We orient our products accordingly, and we have stopped spending engineering time on the first kind.
