Your RAG Cache Is Serving Wrong Answers. Here's The Proof.
Membrane Labs
Consider two queries arriving at your enterprise HR chatbot, thirty seconds apart:
Query A: "Is remote work allowed for engineering roles?"
Query B: "Is remote work not allowed for engineering roles?"

BGE — the embedding model powering most production semantic caches today — assigns these a cosine similarity of 0.94.
Your cache threshold is 0.85. Query B gets Query A's cached answer. One employee asks whether remote work is allowed and is told yes. The other asks whether it is not allowed and is told the same yes. One of them is wrong. Your monitoring dashboard shows a healthy cache hit rate. No error is logged. No alert fires.
This is not a theoretical edge case. We ran the experiment. Here is what we found.
The Experiment
We wanted to answer a simple question: at what rate does semantic similarity produce false positives in enterprise RAG caching?
A false positive is when two queries receive high similarity scores but have different correct answers — meaning a cache hit would serve a wrong response.
We ran two experiments.
Experiment 1: MultiNLI Contradiction Pairs
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a gold-standard NLP dataset containing premise-hypothesis pairs labeled as entailment, neutral, or contradiction. Contradiction pairs are sentences that are semantically related but assert opposite facts — exactly the failure mode we care about.
We sampled 500 contradiction pairs from the validation set and computed cosine similarity using three embedding models widely used in production RAG systems:
- BAAI/bge-base-en-v1.5 (BGE)
- all-MiniLM-L6-v2 (MiniLM)
- all-mpnet-base-v2 (MPNet)
We then measured what percentage of these contradiction pairs — pairs with different correct answers — scored above common production thresholds.
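Measurement here is a one-liner: the false positive rate at a threshold is the fraction of contradiction-pair similarities at or above it. A minimal sketch, with illustrative scores rather than our benchmark data:

```python
def false_positive_rate(similarities, threshold):
    """Fraction of contradiction-pair similarities at or above the cache threshold.

    Every pair is known to have a different correct answer, so any pair
    scoring >= threshold is a wrong answer that would be served from cache.
    """
    hits = sum(1 for s in similarities if s >= threshold)
    return hits / len(similarities)

# Illustrative similarity scores for five contradiction pairs (not benchmark data)
scores = [0.94, 0.81, 0.62, 0.88, 0.55]
print(false_positive_rate(scores, 0.85))  # 0.4: two of five pairs would hit the cache
```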
Results on MultiNLI contradiction pairs:
| Model | Mean Sim | FP @ 0.80 | FP @ 0.85 | FP @ 0.90 |
|---|---|---|---|---|
| BGE | 0.657 | 8.8% | 3.1% | 0.8% |
| MiniLM | 0.559 | 8.1% | 3.8% | 1.4% |
| MPNet | 0.480 | 3.3% | 1.5% | 0.5% |
At the most common production threshold of 0.85, BGE produces a 3.1% false positive rate. Seems manageable. This is roughly the number you would find if you read GPTCache's evaluation or most existing semantic caching benchmarks.
But this number is misleading — and Experiment 2 shows why.
Experiment 2: Enterprise Domain Pairs
MultiNLI contains naturalistic, varied sentence pairs. Enterprise RAG queries do not look like that. They follow predictable templates. An HR bot receives dozens of variations of the same structural query every day:
"Is [policy] allowed for [role]?"
"Is [benefit] available to [employee type]?"
"What is the [rule] for [department]?"

We constructed a benchmark of 200 query pairs across four failure categories that reflect real enterprise RAG query patterns:
- Negation — same query with negation inserted ("Is X allowed?" vs "Is X not allowed?")
- Single word flip — one word inverts the answer ("Is leave paid?" vs "Is leave unpaid?")
- Qualification — scope changes the answer ("Is overtime compensated?" vs "Is overtime compensated for part-time employees?")
- Entity variation — entity swap changes the answer ("What is the notice period for managers?" vs "What is the notice period for interns?")
All 200 pairs are labeled same_answer=False — every pair has a different correct answer by construction.
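Pairs like these can be generated mechanically from templates. A sketch of the construction for the negation category; the template and subject fillers are illustrative, not the actual benchmark contents:

```python
def negation_pairs(template, subjects):
    """Build (query, negated_query) pairs from an 'Is {x} allowed ...' template.

    Every pair is same_answer=False by construction: inserting "not"
    inverts the correct answer.
    """
    pairs = []
    for subject in subjects:
        positive = template.format(x=subject)
        negative = positive.replace("allowed", "not allowed", 1)
        pairs.append((positive, negative))
    return pairs

pairs = negation_pairs("Is {x} allowed for engineering roles?",
                       ["remote work", "flexible scheduling"])
print(pairs[0][1])  # Is remote work not allowed for engineering roles?
```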
Results on enterprise domain pairs (BGE at 0.85 threshold):
| Category | False Positive Rate |
|---|---|
| Negation | 100% |
| Single word flip | 90% |
| Qualification | 78% |
| Entity variation | 62% |
| Overall | 82.5% |
Every single negation pair scored above 0.85. BGE cannot distinguish "Is remote work allowed?" from "Is remote work not allowed?" at any threshold a production system would actually use.
The gap between the two experiments is the finding.
Standard benchmarks make this look like a 3% problem. In enterprise RAG — where queries follow predictable templates — it is an 82% problem. The benchmark does not reflect production.
Why This Happens: First Principles
To understand why embedding models fail here, you need to understand what they were actually trained to do.
What Embedding Models Learn
Modern sentence embedding models — BGE, MiniLM, MPNet, and their relatives — are trained primarily on contrastive objectives. The training signal is: pull embeddings of "related" sentences together in vector space, push "unrelated" sentences apart.
What defines "related" in training? In most large-scale embedding training pipelines, it means one of the following:
- Co-occurrence — sentences that appear near each other in a document
- Paraphrase pairs — sentences manually or automatically identified as expressing the same meaning
- Retrieval pairs — a query and the document passage that answers it
This training objective is extremely well-suited for its original purpose: retrieval. Given a query, find the document most likely to contain a relevant answer. For retrieval, "related" is the right signal — you want documents about the same topic as the query.
But caching is not retrieval. Caching requires a fundamentally different judgment: do these two queries have the same correct answer? That is a much stricter condition than "are these queries about the same topic?"
The embedding model has never been trained to make this distinction. It was trained to cluster topic-similar sentences. It does exactly what it was trained to do. The problem is that you are using it for something else.
Why Negation Is Invisible
Consider what happens at the token level when you negate a sentence:
"Is remote work allowed for engineering roles?"
"Is remote work not allowed for engineering roles?"

The second sentence shares every token with the first; the only difference is the word "not" — a 3-character function word. In the embedding model's learned representation, "not" is a common, low-information token. It appears in millions of training sentences across every conceivable context. Its presence does not reliably signal semantic reversal.
More precisely: the embedding model learned that sentences about the same topic should be close together. Both sentences are about remote work policy for engineering roles. They embed close together because they are, in fact, about the same topic. The model did not learn that the negation operator inverts the answer — because that relationship was not in its training objective.
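You can see how little "not" moves the vector without any neural model at all. A raw token-count cosine, which has no learned semantics whatsoever, already scores the two queries around 0.94; a learned embedding, which additionally treats topical words as related, tends to score them at least as high:

```python
import math
from collections import Counter

def bag_of_words_cosine(a, b):
    """Cosine similarity of raw token-count vectors, with no learned embedding."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

q_a = "Is remote work allowed for engineering roles"
q_b = "Is remote work not allowed for engineering roles"
# 7 shared tokens, 1 extra: 7 / sqrt(7 * 8) ~= 0.94
print(round(bag_of_words_cosine(q_a, q_b), 2))  # 0.94
```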
This is not a bug in BGE. It is a predictable consequence of what BGE was trained to optimize.
The Geometry of the Problem
In a well-trained embedding space, sentences cluster by topic. "Remote work policy" sentences form a cluster. "Expense reimbursement" sentences form another cluster. Retrieval works because you can find the nearest cluster to a query.
Within the "remote work policy" cluster, all sentences that discuss remote work policy are nearby — regardless of whether they assert remote work is allowed or prohibited. The distinction between "allowed" and "not allowed" is a semantic distinction that lives below the resolution of the topic-level clustering the model learned.
Cosine similarity measures distance in this space. Two sentences in the same topic cluster will always score high — even if they assert opposite facts about that topic. This is not fixable by tuning the threshold. You can raise the threshold from 0.85 to 0.95, but as our results show, 30.5% of negation pairs still score above 0.95. The overlap is fundamental to the geometry.
Why Enterprise Queries Are Especially Vulnerable
General NLP datasets like MultiNLI contain structurally diverse sentences. Contradiction pairs in MultiNLI often involve different subjects, different contexts, different sentence structures. This structural diversity means many contradiction pairs end up with moderate similarity scores — not all of them cluster tightly.
Enterprise RAG queries have no such diversity. They follow templates. "Is [X] allowed?" and "Is [X] not allowed?" are maximally similar in structure — same question word, same subject, same predicate, only the negation differs. The embedding model has nothing structural to distinguish them on. The result is near-perfect similarity scores even when the correct answers are opposite.
This is why the MultiNLI false positive rate of 3.1% dramatically understates the production problem. Real enterprise query distributions look far more like our synthetic benchmark than like MultiNLI.
The Four Failure Categories In Detail
1. Negation (100% false positive rate)
The hardest failure mode and the most dangerous. A direct logical negation of a policy query produces a cosine similarity above 0.85 in every single case we tested.
"Are employees required to sign an NDA?"
"Are employees not required to sign an NDA?"
BGE similarity: 0.97

"Is a written warning required before termination?"
"Is a written warning not required before termination?"
BGE similarity: 0.96

The word "not" contributes almost nothing to the embedding. These pairs are effectively indistinguishable to BGE.
Why it matters: In compliance-sensitive deployments — HR policy, legal, healthcare — negation errors are not merely inconvenient. An employee told the opposite of the correct policy may make decisions with real consequences: filing a claim they are not entitled to, failing to follow a mandatory process, or misunderstanding their legal rights.
2. Single Word Flip (90% false positive rate)
Similar to negation but the inversion comes from an antonym rather than a negation marker. Words like paid/unpaid, mandatory/optional, compliant/non-compliant, exclusive/non-exclusive change the answer entirely while leaving the sentence structure identical.
"Is paternity leave paid?"
"Is paternity leave unpaid?"
BGE similarity: 0.95

"Is the arbitration clause mandatory?"
"Is the arbitration clause optional?"
BGE similarity: 0.93

The embedding model has no reliable mechanism for antonym detection. "Paid" and "unpaid" embed nearby because they co-occur in the same contexts — payroll documents, HR policy, benefits descriptions. They are topic-adjacent even though they are semantically opposite.
3. Qualification (78% false positive rate)
The base query and the qualified query are about the same policy, but the qualification changes which answer applies. "Is overtime compensated?" and "Is overtime compensated for part-time employees?" may have different correct answers if the policy applies differently to different employee classes.
"Is PTO carried over to the next year?"
"Is PTO carried over to the next year for employees on probation?"
BGE similarity: 0.91

This is particularly insidious because the base query and the qualified query are genuinely related — one is a special case of the other. A cache hit here does not just serve a wrong answer; it serves an answer that is almost right, which may be harder to catch.
4. Entity Variation (62% false positive rate)
The same structural query with a different entity. "What is the notice period for managers?" and "What is the notice period for interns?" are about the same policy topic but may have entirely different correct answers.
Entity variation has the lowest false positive rate of the four categories — 62% — because swapping a named entity changes more of the embedding than adding "not" does. But 62% is still catastrophic for a production system.
The Scale Argument
At 3.1% (the MultiNLI number), false positives feel manageable. At 82.5% (the enterprise number), they are not. But even the smaller number deserves scrutiny when you think about production volume.
A system handling 10,000 queries per day with a 50% cache hit rate serves 5,000 cached responses. At 3.1% false positive rate, that is 155 wrong answers per day. At 82.5%, it is 4,125.
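The arithmetic, spelled out:

```python
def wrong_answers_per_day(daily_queries, hit_rate, fp_rate):
    """Cached responses served per day, times the false positive rate."""
    return daily_queries * hit_rate * fp_rate

# 10,000 queries/day at a 50% hit rate -> 5,000 cached responses served
print(round(wrong_answers_per_day(10_000, 0.50, 0.031)))  # 155 (MultiNLI rate)
print(round(wrong_answers_per_day(10_000, 0.50, 0.825)))  # 4125 (enterprise rate)
```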
These are silent failures. The cache returns a response with full confidence. No error is logged. The user has no indication the answer may be wrong. Your dashboard shows healthy cache performance. The only way to detect these failures is to audit cache hits — something almost no production system does.
What Actually Helps
The per-category breakdown points toward category-specific mitigations rather than a single fix.
For Negation: Deterministic Pre-filtering
Negation can be detected deterministically before the embedding step. A lightweight syntactic check using spaCy's dependency parser identifies negation markers (not, no, never, non-, un-) in the query and the candidate cached query. If one contains a negation marker and the other does not, skip the cache — fall through to full retrieval regardless of similarity score.
```python
import spacy

nlp = spacy.load("en_core_web_sm")

def has_negation(text):
    """True if the query contains a syntactic negation (spaCy 'neg' dependency)."""
    doc = nlp(text)
    return any(token.dep_ == "neg" for token in doc)

def should_skip_cache(query, cached_query):
    if has_negation(query) != has_negation(cached_query):
        return True  # negation mismatch — do not serve cache
    return False
```

This is cheap, fast, deterministic, and catches 100% of the negation cases in our benchmark. No embedding comparison or LLM call required on the hot path.
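If even loading a spaCy model on the hot path is undesirable, a dependency-free marker check covers the same surface patterns. A minimal sketch; the marker list is illustrative and narrower than a real dependency parse, so it would need tuning against your own query distribution:

```python
import re

# Surface negation markers. Illustrative list; a dependency parse catches
# constructions this regex misses.
NEGATION_PATTERN = re.compile(r"\b(?:not|no|never)\b|n't\b|\bnon-\w+", re.IGNORECASE)

def has_negation_marker(text):
    return bool(NEGATION_PATTERN.search(text))

def should_skip_cache(query, cached_query):
    # Negation mismatch: fall through to full retrieval regardless of similarity.
    return has_negation_marker(query) != has_negation_marker(cached_query)

print(should_skip_cache("Is remote work allowed?",
                        "Is remote work not allowed?"))  # True
```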
For Single Word Flip: Antonym Detection
WordNet contains antonym relationships for most common adjectives and adverbs. A pre-filter that checks for antonym pairs between the query and candidate — paid/unpaid, mandatory/optional, compliant/non-compliant — catches most single word flip cases before embedding comparison.
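A sketch of such a pre-filter. To stay self-contained it uses a small hand-curated antonym table rather than WordNet; in practice you would populate the table from WordNet's antonym lemma relations (e.g. via NLTK):

```python
# Hand-curated stand-in for WordNet antonym lookups (illustrative, not exhaustive).
ANTONYMS = {
    ("paid", "unpaid"), ("mandatory", "optional"),
    ("compliant", "non-compliant"), ("exclusive", "non-exclusive"),
}
ANTONYMS |= {(b, a) for a, b in ANTONYMS}  # make lookups symmetric

def antonym_flip(query, cached_query):
    """True if the queries differ by exactly one word and that pair is antonymous."""
    a = query.lower().rstrip("?").split()
    b = cached_query.lower().rstrip("?").split()
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and diffs[0] in ANTONYMS

print(antonym_flip("Is paternity leave paid?", "Is paternity leave unpaid?"))  # True
```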
For Qualification and Entity Variation: Stricter Thresholds or NER-aware Keys
Raise the threshold selectively. Use a higher threshold (0.95+) for queries that contain named entities or numeric qualifiers, since these are more likely to be entity variation or qualification cases.
Named entity extraction in cache keys. Extract entities from the query (role names, department names, dollar amounts, time qualifiers) and include them as structured metadata in the cache key. Two queries that differ on named entities bypass similarity comparison entirely.
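A sketch of this second idea. The regex patterns below are hypothetical stand-ins for a real NER model such as spaCy's; the point is that the extracted entity set becomes a structured part of the cache key, so two queries that differ on entities can never collide:

```python
import re

# Hypothetical patterns for the entity types named above; a production system
# would use a real NER model (e.g. spaCy) instead of regexes.
ENTITY_PATTERNS = [
    re.compile(r"\$\d[\d,]*"),                                # dollar amounts
    re.compile(r"\b\d+\s*(?:days?|weeks?|months?)\b", re.I),  # time qualifiers
    re.compile(r"\b(?:managers?|interns?|contractors?)\b", re.I),  # role names
]

def cache_key_entities(query):
    """Structured entity component of a cache key: the set of extracted entities."""
    found = []
    for pattern in ENTITY_PATTERNS:
        found += [m.lower() for m in pattern.findall(query)]
    return frozenset(found)

k1 = cache_key_entities("What is the notice period for managers?")
k2 = cache_key_entities("What is the notice period for interns?")
print(k1 == k2)  # False: entity sets differ, so similarity comparison is bypassed
```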
Limitations
Our enterprise benchmark is synthetically constructed. The 200 pairs were designed to reflect common HR and policy query templates but were not drawn from real production query logs. It is possible that real query distributions contain more structural diversity than our benchmark, which would reduce the false positive rate compared to our 82.5% figure.
We release the full benchmark dataset on GitHub for the community to validate against real query logs. If you have access to production query data and are willing to run this evaluation, we would like to hear from you.
The MultiNLI experiment uses naturally occurring sentence pairs and is not subject to this limitation. The 3.1% figure for BGE on MultiNLI contradictions is a reliable lower bound on the false positive rate for semantically diverse query distributions.
The Implication
If your semantic cache uses embedding similarity as the sole hit decision mechanism, it has false positives in production right now. The question is not whether — it is how many, and whether any of them are in compliance-sensitive paths where wrong answers have consequences.
The fixes are not exotic. A spaCy negation pre-filter is an afternoon of work. Antonym detection via WordNet is similar. Entity-aware cache keys require more thought but are architecturally straightforward.
What is harder is the systematic problem: production semantic caches have no built-in mechanism for auditing hit quality. Cache hits are assumed correct. There is no feedback loop that surfaces wrong answers served from cache.
That is the problem we are building toward at Membrane Labs — a cache layer that treats hit quality as a first-class concern, not an afterthought.
The benchmark dataset is available at github.com/MembraneLabs. If you are running semantic caching in production and want to run this evaluation on your own query logs, reach out at ro@membranelabs.org.