Engineering · 2026-07-22

A prompt cache that does not leak across tenants

Caching LLM responses pays for itself within a day on most workloads. Doing it correctly in a multi-tenant gateway is a key-derivation problem more than a storage one. Here is the design we'd defend.

Most prompts are repeated. A coding assistant rephrases the same question. An evidence-summarisation endpoint re-summarises the same document overnight. A RAG pipeline retrieves the same passages and asks the same templated question with different fillings.

In a single-tenant deployment, the cache is straightforward. Hash the prompt, key by the hash, return the cached response if present. Add the model identifier to the key so that responses don't survive a model upgrade. Pick a TTL that matches the prompt's expected freshness window. Done.
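As a point of reference, the single-tenant design can be sketched in a few lines. This is a minimal illustration, not a production cache: the names cache_key and TTLCache are hypothetical, the store is an in-process dict, and the length-prefixing is our assumption about how to keep the model identifier and the prompt from running into each other inside the hash.

```python
import hashlib
import time


def cache_key(prompt: str, model_id: str) -> str:
    """Hash the model identifier together with the prompt."""
    h = hashlib.sha256()
    # Length-prefix each field so that ("ab", "c") and ("a", "bc")
    # cannot produce the same byte stream.
    for field in (model_id, prompt):
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


class TTLCache:
    """In-memory cache with one flat TTL (hypothetical, for illustration)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, response)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        return response

    def put(self, key: str, response) -> None:
        self._entries[key] = (self._clock() + self._ttl, response)
```

Because the model identifier is part of the hash input, a model upgrade changes every key and old responses simply stop being found.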

In a multi-tenant gateway, the same design is wrong in two ways.

Cross-tenant response leakage. Tenant A and tenant B issue the same prompt. The cache returns to B the response that was generated for A. Most of the time this is harmless — the prompt was identical, so the response would have been near-identical — but the response may have been shaped by A's system-prompt overrides, A's retrieval context, A's tool-result payloads, none of which are visible in the prompt hash. B receives a response conditioned on A's data. This is the kind of bug a careful synthetic-traffic test catches in staging, and the kind of bug that makes a customer's security review unhappy if it reaches production.

Cache-side timing inference. Even if the cached response itself carries nothing tenant-specific, a shared-key cache lets B infer that A has issued a particular prompt: B's request returns faster (cache hit), so someone has already populated that cache slot. For most prompts this is uninteresting; for prompts containing sensitive identifiers, it is a side-channel.

The fix is a key-derivation change, not a storage change. The cache key is derived from the tenant identifier and the prompt's classification (in addition to the prompt and model fingerprint), so cross-tenant lookups cannot collide and an attacker who can issue calls under one tenant's identity gains no purchase on another tenant's cache.

The partitioning rule we'd argue for: prompts that carry sensitive content or customer data are namespaced per tenant — only that tenant's previous calls can hit. Prompts that are explicitly public, or that the customer has classified as shareable, can hit a global namespace. The classification schema discussed earlier is what determines which side of the boundary a prompt sits on. Reusing the classification as the cache-partitioning predicate keeps the surface coherent: one piece of data, one place it gets read, no opportunity for the cache logic and the routing logic to disagree.

The per-tenant component of the key needs to be unguessable, not just unique: a public identifier would namespace the cache but would not defend against an attacker who can probe it across tenants. The construction that makes the component unguessable is unremarkable; the load-bearing decision is to require an unguessable component at all.
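One way to sketch the key derivation described above is to take an HMAC of the tenant identifier under a gateway-held secret and use it as the namespace. This is an assumption on our part about a reasonable construction, not the only one. The classification labels ("public", "shareable") and the function names are hypothetical stand-ins for whatever the classification schema actually defines.

```python
import hashlib
import hmac

# Hypothetical labels; the real classification schema may differ.
SHAREABLE = frozenset({"public", "shareable"})
GLOBAL_NAMESPACE = b"global"


def tenant_namespace(secret: bytes, tenant_id: str) -> bytes:
    """Unguessable per-tenant namespace: HMAC of the tenant id under a server-side secret."""
    message = b"tenant:" + tenant_id.encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).digest()


def derive_cache_key(secret: bytes, tenant_id: str, classification: str,
                     model_fingerprint: str, prompt: str) -> str:
    # Anything not explicitly shareable falls into the tenant's own namespace,
    # so an unknown classification fails closed.
    if classification in SHAREABLE:
        namespace = GLOBAL_NAMESPACE
    else:
        namespace = tenant_namespace(secret, tenant_id)

    h = hashlib.sha256()
    fields = (
        namespace,
        classification.encode("utf-8"),
        model_fingerprint.encode("utf-8"),
        prompt.encode("utf-8"),
    )
    for field in fields:
        # Length-prefix each field so adjacent fields cannot be confused.
        h.update(len(field).to_bytes(8, "big"))
        h.update(field)
    return h.hexdigest()
```

With this shape, two tenants issuing the same sensitive prompt derive different keys, and a caller who knows another tenant's public identifier still cannot compute that tenant's key without the secret.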

A few details that took us iteration to settle.

The model identifier needs to be more granular than the model name. A surface-level model name like regolo-mistral-large-2 masks point-releases that subtly change output distribution. The cache will happily return old-revision responses against a new revision until the TTL expires, and the resulting drift is hard to debug. We key on (provider_id, model_name, model_revision) where model_revision is whatever fingerprint the provider exposes — an ETag-like header, a version string, or the date-bucket of the response when the provider exposes neither. Coarse, but it forces a cache miss when the upstream changes underneath us, which is the property we want.
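The fallback order for model_revision might look like the following sketch. The function names and the "etag:", "version:" and "date:" prefixes are hypothetical, and a one-day UTC bucket is our assumption about bucket size; the text above only says the bucket is coarse.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def model_revision(etag: Optional[str], version: Optional[str],
                   response_time: datetime) -> str:
    """Pick the most specific fingerprint the provider exposes.

    response_time is expected to be timezone-aware.
    """
    if etag:
        return "etag:" + etag
    if version:
        return "version:" + version
    # Neither is available: fall back to a one-day UTC bucket (assumed granularity).
    return "date:" + response_time.astimezone(timezone.utc).strftime("%Y-%m-%d")


def model_fingerprint(provider_id: str, model_name: str,
                      revision: str) -> Tuple[str, str, str]:
    """The (provider_id, model_name, model_revision) tuple that goes into the cache key."""
    return (provider_id, model_name, revision)
```

The prefixes keep an ETag value from ever being mistaken for a version string with the same text. The date fallback forces a miss at least once per bucket, whether or not the upstream actually changed.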

TTLs should follow the prompt's classification, not a flat default. Prompts marked as ephemeral should not be cached at all. Prompts derived from customer data should get a shorter TTL than purely public prompts, so that revoked customer data drops out of the cache promptly. The classification schema is again the right place to read this off; flat TTLs give up too much freshness on the volatile end and waste storage on the stable end.
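A classification-driven TTL can be as small as a lookup table. In the sketch below the labels and the durations are placeholders, not recommendations; the only properties taken from the text are that ephemeral prompts are never cached and that customer-derived prompts expire sooner than public ones.

```python
from typing import Optional

# Hypothetical labels and durations; None means "do not cache".
TTL_SECONDS = {
    "ephemeral": None,
    "customer_data": 15 * 60,   # short, so revoked data drops out promptly
    "public": 24 * 60 * 60,     # stable content can live longer
}


def ttl_for(classification: str) -> Optional[int]:
    """Return the TTL in seconds, or None if the prompt must not be cached."""
    # Unknown classifications are treated like ephemeral ones: not cached.
    return TTL_SECONDS.get(classification)
```

Treating an unrecognised classification as uncacheable is a design choice on our side: a missing table entry then costs a cache miss rather than an over-long retention.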

Strip provider-bookkeeping fields before caching. Cached responses that include the original tenant's request id, response timestamps, or provider trace headers replay those fields verbatim on cache hit. None of it is high-stakes, but it is observable, and reconstructing those fields per-hit is cheap.
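Stripping on write and rebuilding on read could be sketched as follows. The field names (request_id, created_at, trace_headers) are hypothetical; a real provider response will use its own names, and the list would need to match them.

```python
import copy
import time
import uuid

# Hypothetical names for the per-request bookkeeping fields.
BOOKKEEPING_FIELDS = ("request_id", "created_at", "trace_headers")


def to_cacheable(response: dict) -> dict:
    """Copy the response without the original caller's bookkeeping fields."""
    return {
        key: copy.deepcopy(value)
        for key, value in response.items()
        if key not in BOOKKEEPING_FIELDS
    }


def from_cache(cached: dict) -> dict:
    """Rebuild per-hit fields so nothing from the original request is replayed."""
    fresh = copy.deepcopy(cached)
    fresh["request_id"] = str(uuid.uuid4())
    fresh["created_at"] = time.time()
    return fresh
```

The deep copies mean a caller that mutates a returned response cannot alter the stored entry.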

What we deliberately don't cache. Streaming responses are a tempting target: assemble the stream, replay it as a stream. We tried it. The replay pattern (instant first token, then no inter-token delay) was distinguishable from genuine generation in a way that confused at least one downstream UI we tested against. The cleaner answer is to mark "this would have been a cache hit" in the response metadata and let the calling application decide whether to skip streaming-shaped UI for it.

The cache store itself carries no cross-tenant logic — the partitioning is entirely in the key. That keeps the store dumb, which is the property we want of a cache: the cleverness goes into the key, never into the store. A cache that has to reason about who can read what is a cache that will eventually be wrong about it.
