Access Patterns for Caching in AI Agent and RAG Systems
Published by Membrane Labs
Most teams building RAG systems think about caching the wrong way.
They ask: "how do I make this faster and cheaper?" They add a semantic cache, tune a similarity threshold, and ship it. This works until it doesn't — until a user gets someone else's answer, until a policy update goes unnoticed, until an agent fleet burns through budget regenerating the same responses a thousand times an hour.
The underlying mistake is treating all RAG queries as equivalent. They're not. A single-hop factual lookup from an HR bot is fundamentally different from a multi-turn conversation with a legal assistant, which is fundamentally different from a parallel fleet of customer service agents querying the same knowledge base simultaneously. Each pattern has different repeatability characteristics, different freshness requirements, different permission structures, and different failure modes when the cache gets it wrong.
Before you build a cache — or evaluate one — you need to understand your access pattern. The cache architecture follows from that understanding, not the other way around.
This post maps nine distinct access patterns we've identified across enterprise RAG and agentic deployments. For each, we describe what it looks like in production, what the cache implications are, and what breaks if you use the wrong cache design. At the end, we show which existing caching solutions handle which patterns — and where the gaps are.
The Taxonomy
Pattern 1: Single-Hop Factual Lookup
The simplest and most common pattern. A user asks a direct, self-contained question. The answer lives in one document or section. No prior context required.
Examples:
"What is our parental leave policy?"
"What is the expense limit for domestic travel?"
"Who is the head of the compliance team?"
"What is the refund window for product returns?"What happens:
```
Query → embed → ANN search → top-k chunks → LLM → answer
```

One retrieval step. One LLM call. Predictable. Fast.
Why this is the highest-value caching pattern: Query repetition is extremely high. Hundreds of employees ask the same questions. The answer to "what is our parental leave policy" doesn't change between Monday and Tuesday. Hit rates of 40–70% are achievable on this pattern alone — which is why IT helpdesk bots show 86% cost reduction in AWS benchmarks. The workload is almost entirely this pattern.
What breaks naive caching: Two users ask the same question. One is an HR manager who can see compensation details. One is a regular employee who cannot. A naive cache — one that only hashes the query — serves the HR manager's detailed answer to the regular employee.
This is not an edge case. It's the normal operating condition of every enterprise RAG system with role-based access. And it's the failure mode that gets teams fired.
Cache requirement: Permission-partitioned cache keys. The key must encode who is asking, not just what they're asking.
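A minimal sketch of what permission-partitioned keying means, with a hypothetical `partitioned_key` helper (Pattern 4 below refines this to the minimal permission set of the retrieved documents, which yields better cross-user hit rates):

```python
import hashlib

def partitioned_key(query: str, user_roles: frozenset) -> str:
    """Fold the requester's roles into the cache key so identical
    queries from users with different access levels never share an
    entry. Illustrative only: a real system would key on the minimal
    permission set required by the retrieved documents."""
    role_part = ",".join(sorted(user_roles))
    return hashlib.sha256(f"{query}|{role_part}".encode()).hexdigest()
```

The same question from an HR manager and a regular employee now produces two different keys, so neither can be served the other's cached answer.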
Pattern 2: Multi-Hop Factual Lookup
The answer requires chaining information across multiple documents. No single document contains the complete answer. The LLM must synthesize across retrieved chunks.
Examples:
"What is our expense policy for international travel, and how
does that interact with per diem rates for the APAC region?"
"How does the contractor termination process differ from the
employee termination process, and what HR systems need to be
updated in each case?"What happens:
```
Query → decompose into sub-queries [q1, q2]
  → retrieve independently for each
  → merge context
  → LLM synthesizes → answer
```

Cache implication: Naive caches fail here because the full query is specific enough that it rarely repeats verbatim. The right approach is sub-query level caching. "What is our APAC per diem rate" and "what is our international travel expense policy" are both single-hop lookups that cache well independently. Cache at the sub-query level, assemble from cached components.
If any sub-query misses, run only that sub-query through the full pipeline. The rest come from cache. This compositional approach gives you partial cache hits on complex queries that would never hit as full queries.
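The compositional approach can be sketched as follows. `decompose`, `run_pipeline`, and `synthesize` are stand-ins for the real decomposition, RAG, and synthesis steps; only the caching logic is the point here:

```python
# Illustrative in-memory cache; production would use a shared store.
sub_query_cache = {}

def answer_complex(query, decompose, run_pipeline, synthesize):
    """Serve each sub-query from cache where possible; run the full
    RAG pipeline only for the misses, then assemble the answer."""
    answers = {}
    for sq in decompose(query):
        if sq not in sub_query_cache:          # miss: pipeline for this sub-query only
            sub_query_cache[sq] = run_pipeline(sq)
        answers[sq] = sub_query_cache[sq]      # hit, or freshly filled
    return synthesize(query, answers)
```

A second complex query that shares even one sub-query with an earlier one gets a partial hit for free.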
Pattern 3: Multi-Turn Conversational Retrieval
Each query depends on the context of previous queries in the same session. Later queries are meaningless without prior context — they use pronouns, references, implicit continuation.
Example session:
Turn 1: "What is our remote work policy?"
Turn 2: "How does that apply to employees in Germany?"
Turn 3: "And what about contractors in Germany?"
Turn 4: "Can you summarize the differences between the two?""That" in Turn 2 refers to the remote work policy from Turn 1. "The two" in Turn 4 refers to the employee/contractor distinction from Turn 3. Each query is incomprehensible without the conversation history.
The catastrophic naive cache failure: Two users both ask "How does that apply to Germany?" Their query embeddings are nearly identical. But "that" means completely different policies in each conversation. Serving User A's cached response to User B isn't just wrong — it's wrong about a different policy entirely. The user has no way to detect this.
Cache requirement: The conversation context must be part of the cache key. Within-session hits across different users are impossible by design — and that's correct.
The practical solution: Rewrite conversational queries into standalone questions before caching. "How does that apply to Germany?" becomes "How does the remote work policy apply to employees in Germany?" — a cacheable single-hop query. Convert Pattern 3 into Pattern 1 at the rewriting step.
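A sketch of where the rewrite sits in the flow. `rewrite` is a hypothetical callable, typically a small, cheap LLM prompt that resolves pronouns against the session history:

```python
def to_standalone(turn, history, rewrite):
    """Resolve pronouns and references against the conversation
    history before the query touches the cache. The first turn of a
    session is already standalone and needs no rewriting."""
    if not history:
        return turn
    return rewrite(history, turn)
```

The cache is then keyed on the rewritten, standalone query, so "How does that apply to Germany?" from two different sessions can safely hit the same entry only when the rewrites agree.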
Pattern 4: Role-Filtered Enterprise Retrieval
Semantically identical questions receive different answers depending on who is asking, because different users have access to different documents. Access control is applied at the retrieval layer.
Example:
Regular employee asks: "What is the compensation band for Senior Engineers?"
→ Retrieves from public salary band document
→ Answer: "Senior Engineers are compensated between $X and $Y"
HR Manager asks the same question:
→ Retrieves from detailed compensation database
→ Answer: "Senior Engineers are compensated between $X and $Y, with
individual salaries visible in the HR system at..."The formal definition of a permission-correct cache hit:
A cache hit is permission-correct if and only if every document that contributed to the cached response is accessible to the requesting user.
This is a subset relationship, not an equality relationship. User B's permissions don't need to be identical to User A's — they just need to be a superset of the permissions required to access the source documents.
The naive implementation destroys hit rates: Using the full user permission set as the cache key means every user has a different key. If User A has permissions {HR, Finance, Legal} and User B has {HR, Finance, Legal, M&A}, they get different cache entries even when asking about HR policy that neither M&A access nor the specific Finance permission affected.
The correct implementation: Partition by the minimal permission set required by the retrieved documents, not by the full user permission set.
```python
# What documents were retrieved for this response?
retrieved_docs = ["hr_policy_v3.pdf", "benefits_guide.pdf"]

# What permissions do those specific documents require?
# (illustrative lookup; in practice this comes from the document index)
doc_permissions = {"hr_policy_v3.pdf": "hr_read", "benefits_guide.pdf": "hr_read"}
required_permissions = frozenset(
    doc_permissions[doc] for doc in retrieved_docs
)

# Cache hit is valid if the user's permissions are a superset
def can_serve_cached(user_permissions, required_permissions):
    return required_permissions.issubset(user_permissions)
```

This maximizes cross-user hit rates while maintaining correctness. It's also the property that no existing semantic cache implements. Not GPTCache. Not MeanCache. Not AWS ElastiCache. They all hash the query — none of them know what documents were retrieved.
Pattern 5: Temporal / Freshness-Sensitive Retrieval
The correct answer changes over time because the underlying knowledge changes. Regulatory updates, policy changes, product releases, personnel changes.
Examples:
"What is the current GDPR consent requirement for email marketing?"
"What is the expense approval threshold?"
"Who is the current head of engineering?"
"What is our Q3 performance against targets?"The TTL failure mode has two faces:
Under-flush: TTL is set to 24 hours. A policy document is updated at 8am. The cache continues serving the old policy until 8am the next day. 23 hours of stale answers served confidently.
Over-flush: TTL is set to 1 hour "to be safe." 95% of documents haven't changed. The cache flushes all entries every hour, destroying hit rates, regenerating thousands of perfectly valid responses at full LLM cost.
Both are expensive. Over-flush wastes money. Under-flush serves wrong answers.
The provenance-based alternative:
Track which documents were retrieved for each cached response. When a document is updated, query the provenance graph to find which responses depended on it. Invalidate exactly those — nothing more.
```sql
-- A section of document $1 was updated; $2 is the changed section's
-- previous content hash. Invalidate exactly the entries that
-- depended on that section -- entries derived from unchanged
-- sections carry different hashes and are left alone.
UPDATE cache_entries SET is_valid = FALSE
WHERE id IN (
    SELECT cache_entry_id FROM cache_provenance
    WHERE document_id = $1
      AND section_hash = $2
);
```

A policy document updates but only Section 7 changes. Cache entries derived from Section 3 remain valid. Cache entries derived from Section 7 are invalidated. Nothing else is touched.
For a compliance officer proving to an auditor that no stale responses were served, TTL gives you "probably fresh within X hours." Provenance gives you "provably fresh — here is the exact invalidation event and timestamp."
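That invalidation query only works if provenance is recorded at write time. A minimal sketch of the write path, using SQLite-style placeholders; the table names follow the post's schema, and the shape of `retrieved_chunks` (dicts with `document_id` and `section_hash` keys) is an assumption:

```python
import uuid

def store_with_provenance(conn, query_key, response, retrieved_chunks):
    """Write the cached response plus one provenance row per retrieved
    chunk, so a later document update can find exactly the cache
    entries that depended on it."""
    entry_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO cache_entries (id, query_key, response, is_valid) "
        "VALUES (?, ?, ?, 1)",
        (entry_id, query_key, response),
    )
    conn.executemany(
        "INSERT INTO cache_provenance (cache_entry_id, document_id, section_hash) "
        "VALUES (?, ?, ?)",
        [(entry_id, c["document_id"], c["section_hash"]) for c in retrieved_chunks],
    )
    return entry_id
```

The extra write cost is a handful of small rows per cached response; the payoff is that invalidation becomes a lookup rather than a guess.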
Pattern 6: Cross-Namespace / Multi-Knowledge-Base Retrieval
A query spans multiple distinct knowledge bases simultaneously. Legal querying contracts AND regulations AND internal policy. Finance querying internal ledger AND external market data.
Examples:
"What does our employment contract say about IP ownership,
and how does that compare to California state law?"
"What is our data retention policy, and which of our EU
customers have contracts that require deviation from it?"What this requires from a cache:
```python
cache_key = hash(
    query_embedding,
    frozenset(queried_namespaces),   # which KBs were queried
    permission_hash_per_namespace,   # permissions may differ per KB
    kb_versions,                     # version of each KB
)
```

A cache hit on "contracts only" is not valid for "contracts + regulations." Different namespace combinations produce different answers even for semantically identical queries.
Provenance tracking must extend across namespaces. An invalidation event in the contracts knowledge base only invalidates cache entries that retrieved from the contracts namespace — not entries derived entirely from the policy namespace.
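A sketch of namespace-scoped invalidation, using an illustrative in-memory shape for provenance rows (`(entry_id, namespace, document_id)` tuples) rather than the SQL table:

```python
def entries_to_invalidate(provenance, document_id, namespace):
    """Return only the cache entries that retrieved the updated
    document from the updated namespace. Entries derived entirely
    from other namespaces are untouched."""
    return {
        entry_id
        for entry_id, ns, doc in provenance
        if ns == namespace and doc == document_id
    }
```

An update to a contract document invalidates entries that touched the contracts namespace, and nothing keyed purely to the policy namespace.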
Pattern 7: Agentic Iterative Retrieval
An AI agent makes multiple retrieval calls during a single task, each informed by what it found in previous calls. The agent decides when to retrieve, what to retrieve, and when it has enough information. The retrieval path is non-deterministic.
Example:
Agent task: "Research competitive landscape for Q3 pricing strategy"
Step 1: Retrieve general market overview
Step 2: Based on results, retrieve competitor pricing data
Step 3: Retrieve internal pricing history
Step 4: Identify gap, retrieve regulatory constraints on pricing
Step 5: Synthesize → final reportTwo distinct caching opportunities:
Session-level deduplication: Within a single task, the agent may retrieve the same document multiple times as it explores different reasoning paths. A session-aware cache prevents redundant retrievals within the same task without any cross-user correctness concerns.
Task-level response caching: The final (task → answer) pair can be cached across different agents performing the same task, subject to freshness and permission checks. If 10 agents independently research the competitive landscape for pricing strategy, only the first pays full LLM cost.
What you don't cache: The intermediate retrieval steps. The path is non-deterministic and each step depends on the previous one. Only the final answer is safely cacheable across different task executions.
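Session-level deduplication is the simpler of the two opportunities, and it can be sketched in a few lines. `retriever` is a hypothetical callable from query to retrieved chunks:

```python
class SessionRetrievalCache:
    """Session-scoped memoization of an agent's retrieval calls.
    It lives for the duration of one task only, so there are no
    cross-user correctness concerns; it simply stops the agent from
    re-fetching the same thing while exploring reasoning paths."""

    def __init__(self, retriever):
        self._retriever = retriever
        self._seen = {}
        self.misses = 0

    def retrieve(self, query):
        if query not in self._seen:
            self.misses += 1
            self._seen[query] = self._retriever(query)
        return self._seen[query]
```

Because the cache is discarded when the task ends, it needs none of the permission or freshness machinery the cross-user patterns require.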
Pattern 8: Parallel Agent Fleet Retrieval
Multiple agents running simultaneously, all querying the same knowledge base. This is the pattern that makes caching economically essential even with cheap models.
The math at scale:
```
100 customer service agents
× 15 RAG calls per request
× $0.003 per LLM call
= $4.50 per request batch

At 1,000 batches/day = $4,500/day ≈ $1.6M/year

With 50% fleet hit rate:
$2,250/day ≈ $820,000/year saved
```

Even at 2025 model prices, the fleet volume makes caching economics compelling. And this volume will only grow as agent deployments scale.
The coordination problem:
A shared cache across the fleet multiplies the value of every cache entry. Agent 1's query at 9am warms the cache for Agents 23, 47, and 91 throughout the day. But fleet-level sharing creates a new correctness requirement.
Agent A handles a premium customer with full document access. Agent B handles a basic customer with restricted access. Agent A's cached response must not be served to Agent B's customer.
Fleet-level cache sharing requires permission-correct sharing logic:
```python
def can_agents_share_cache(agent_a_context, agent_b_context, entry):
    return (
        entry.required_permissions.issubset(agent_a_context.permissions)
        and entry.required_permissions.issubset(agent_b_context.permissions)
    )
```

Both agents must have access to all source documents of the cached response. If either doesn't, the hit is invalid.
Pattern 9: Sub-Query Decomposition
Complex queries are explicitly broken down into independent sub-queries before retrieval. Each sub-query is answered separately. The final answer assembles sub-query answers.
Example:
"Compare our GDPR compliance posture in the EU to our CCPA
compliance posture in California, and identify any gaps
before Q4."
Decomposed:
q1: "What is our current GDPR compliance status?"
q2: "What are the key GDPR requirements?"
q3: "What is our current CCPA compliance status?"
q4: "What are the key CCPA requirements?"
q5: "What is our Q4 compliance deadline?"Why this is the most cache-friendly complex pattern:
The decomposition is deterministic — the same complex query produces the same sub-queries. Each sub-query is independently cacheable. Sub-queries recur across different complex queries from different users.
User A asks about "GDPR vs CCPA gaps." User B asks about "EU privacy regulation compliance." Both decompose into sub-queries about GDPR compliance status that hit the same cache entries.
Complex queries that would never hit a full-query cache get partial hits at the sub-query level. And partial hits are real savings — if 4 of 5 sub-queries hit, you've eliminated 80% of the LLM cost for that complex query.
The Gap Table
Here's which existing caching solutions handle which patterns:
| Pattern | GPTCache | MeanCache | AWS ElastiCache | Cortex/Krites |
|---|---|---|---|---|
| Single-hop | ✓ | ✓ | ✓ | ✓ |
| Multi-hop | ✗ | ✗ | ✗ | ✗ |
| Conversational | ✗ | ✓ (partial) | ✗ | ✗ |
| Role-filtered | ✗ | ✗ | ✗ | ✗ |
| Temporal/freshness | TTL only | TTL only | TTL only | TTL only |
| Cross-namespace | ✗ | ✗ | ✗ | ✗ |
| Agentic iterative | ✗ | ✗ | ✗ | ✓ (partial) |
| Parallel fleet | ✗ | ✗ | ✗ | ✗ |
| Sub-query decomposed | ✗ | ✗ | ✗ | ✗ |
The ✓ in the single-hop column for all four systems is real — they handle the simple case. The rest of the table explains why production enterprise RAG deployments keep running into problems that off-the-shelf caching doesn't solve.
GPTCache (2023): single-stage embedding similarity, 10% false positive rate, LLM-dependent pre/post processing. Stalled in production.
MeanCache (2024): improves embeddings with federated learning, adds basic conversation context. Still single-stage, no access control.
AWS ElastiCache / Valkey (2025): fast vector storage with HNSW, configurable similarity threshold, TTL. The infrastructure layer. No application-level intelligence.
Cortex / Krites (2025–2026): first two-stage validation approaches — ANN retrieval plus LLM judger for precise hit decisions. Correct direction. LLM judger reintroduces non-determinism and cost on the validation path. Neither addresses access control or provenance.
The patterns that matter most for enterprise production — role-filtered retrieval, provenance-based freshness, fleet coordination — are unsolved by every existing tool.
What The Cache Key Should Actually Look Like
Across all nine patterns, a cache key that handles them correctly encodes:
```python
cache_key = hash(
    query_embedding,        # semantic identity
    context_hash,           # conversation context (Pattern 3)
    required_permissions,   # minimal doc permissions (Pattern 4)
    namespace_set,          # which KBs were queried (Pattern 6)
    kb_versions,            # freshness tracking (Pattern 5)
)
```

And the cache entry stores provenance:
```sql
cache_provenance (
    cache_entry_id UUID,
    document_id TEXT,
    section_hash TEXT,
    namespace TEXT
)
```

This provenance table is what makes surgical invalidation possible across all patterns. When any document changes, a single query finds exactly which cache entries depended on it, across any namespace, for any user's permission group.
What We're Building
At Membrane Labs we're building the intelligent cache layer that handles the patterns existing tools leave unsolved. Starting with the enterprise case — role-filtered retrieval with provenance-based invalidation — because that's where the correctness requirements are hardest and where naive caching creates real compliance risk.
The library is backend-agnostic. It works on top of Postgres + pgvector, Redis, AWS ElastiCache, or any vector store. The intelligence — permission partitioning, provenance tracking, multi-stage semantic validation — lives at the application layer, where it has to live.
We're building in public. Research findings, benchmark results, and the library itself will be open source as we ship them.
If you're running RAG in production and hitting any of these patterns — especially role-filtered retrieval or freshness problems with TTL — we'd like to talk to you.