Embeddings Can't See Drift
Embeddings measure similarity, not shift.
Every vector database, every RAG system, every semantic search tool relies on the same assumption: cosine similarity between embeddings tells you how related two pieces of text are. High similarity means related. Low similarity means unrelated.
Except drift detection isn't a similarity problem. It's a shift problem. And embeddings are blind to the difference.
>The Similarity Trap
Consider a conversation about Paris hotels. The user asks about room rates, then pivots to restaurant recommendations. Both messages are about Paris. Both are travel-related. A general embedding model sees high similarity because the semantic neighborhood is the same.
But the conversation just branched. The user shifted from accommodation to dining. A memory system that can't see this drift will answer restaurant questions with hotel assumptions still in context.
Similarity tells you two things occupy the same region of semantic space. It doesn't tell you the user just walked out of that region.
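To make this concrete, here is a minimal sketch of that pivot using off-the-shelf sentence-transformers models. The message wording is illustrative, the model names match the families benchmarked below, and exact scores will vary by model version:

```python
# Sketch: score the hotel -> restaurant pivot with two model families.
# Requires: pip install sentence-transformers. Exact scores vary by version.
from sentence_transformers import SentenceTransformer, util

hotel = "What are the room rates for hotels near the Louvre?"
dining = "Can you recommend good restaurants in Paris?"

for name in [
    "intfloat/e5-base-v2",                            # general-purpose retrieval model
    "sentence-transformers/paraphrase-MiniLM-L6-v2",  # paraphrase-trained model
]:
    # Note: E5 normally expects "query:"/"passage:" prefixes; omitted for brevity.
    model = SentenceTransformer(name)
    a, b = model.encode([hotel, dining])
    sim = util.cos_sim(a, b).item()
    print(f"{name}: cosine similarity = {sim:.3f}")
```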
Here's the operational problem: when your entire semantic gradient fits in a 0.3 similarity band, you have no room for thresholds. Every routing decision becomes a coin flip.
We benchmarked five embedding models to see exactly how bad this gets.
>The Gradient Benchmark
We tested paraphrase-optimized models against general-purpose embedding models across eight semantic levels: identical, paraphrase, same subtopic, same topic / different subtopic, same domain / different topic, related domain, different domain, and completely unrelated.
The anchor message: "I want to book a hotel in Paris for my trip next month."
The gradient ranged from a direct paraphrase ("Looking to reserve accommodation in Paris") down through legitimate topic shifts ("Can you recommend good restaurants in Paris?") to completely different domains ("How do I fix a Python memory leak?").
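Here is a sketch of the benchmark loop, assuming the sentence-transformers library. Only the anchor, the paraphrase, the restaurant shift, and the Python question are quoted from this post; the other gradient entries are illustrative stand-ins, not the actual benchmark items:

```python
# Sketch of the gradient benchmark: similarity of each level to the anchor.
from sentence_transformers import SentenceTransformer, util

ANCHOR = "I want to book a hotel in Paris for my trip next month."

GRADIENT = [
    ("identical",                  "I want to book a hotel in Paris for my trip next month."),
    ("paraphrase",                 "Looking to reserve accommodation in Paris"),
    ("same subtopic",              "Which Paris hotels have availability next month?"),        # stand-in
    ("same topic / diff subtopic", "Can you recommend good restaurants in Paris?"),
    ("same domain / diff topic",   "What's the best way to get a rail pass for France?"),      # stand-in
    ("related domain",             "How do I apply for a Schengen tourist visa?"),             # stand-in
    ("different domain",           "What's a good strategy for paying off a mortgage early?"), # stand-in
    ("completely unrelated",       "How do I fix a Python memory leak?"),  # post places this at the far end
]

model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
anchor_vec = model.encode(ANCHOR)
for level, text in GRADIENT:
    sim = util.cos_sim(anchor_vec, model.encode(text)).item()
    print(f"{level:<30} {sim:+.3f}")
```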
>General Embeddings Compress Everything
E5-base-v2, one of the most popular retrieval models, shows only 0.29 total spread between identical and unrelated messages. BGE-small manages 0.60. GTE-base hits 0.32.
General embedding models don't have gaps. They have mush.
The compression isn't limited to pathological examples. Look at the middle of the gradient: same topic / different subtopic versus same domain / different topic. These are the transitions that matter most for conversational routing. General embedding models show almost no separation between them.
A system using E5 or BGE for drift detection cannot distinguish a minor subtopic shift from a complete context change. Both score in the same narrow band. Both produce the same confused responses.
>Paraphrase Models See the Gradient
Paraphrase-MiniLM-L6-v2 tells a different story. Total spread: 1.032. The model uses the full similarity range, from 1.0 down through zero and into negative territory for unrelated content.
The gap chart shows why this matters. Paraphrase models produce large separations at exactly the transitions that matter: the jump from same subtopic to different subtopic, and the cliff from related domain to different domain. These aren't subtle statistical differences. They're decision boundaries you can actually use.
Paraphrase models are trained to distinguish "same meaning, different words" from "related but different." That's precisely what drift detection requires.
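The same idea in code: compute adjacent-level gaps and total spread from a gradient of similarity scores. The per-level numbers below are placeholders chosen only so the total spreads echo the figures above; they are not the measured benchmark values:

```python
# Gap view of the gradient: the drop between adjacent levels is what a router
# can actually threshold on. Per-level scores are illustrative placeholders.
LEVELS = ["identical", "paraphrase", "same subtopic", "same topic / diff subtopic",
          "same domain / diff topic", "related domain", "different domain", "unrelated"]

scores = {
    "paraphrase-MiniLM-L6-v2": [1.00, 0.82, 0.55, 0.38, 0.24, 0.12, 0.02, -0.03],  # illustrative
    "e5-base-v2":              [1.00, 0.93, 0.88, 0.85, 0.82, 0.78, 0.74, 0.71],   # illustrative
}

for name, sims in scores.items():
    gaps = [round(a - b, 3) for a, b in zip(sims, sims[1:])]
    spread = round(sims[0] - sims[-1], 3)
    print(f"{name}: total spread = {spread}, adjacent gaps = {gaps}")
```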
>Why This Isn't Obvious
General embedding models are optimized for retrieval. Given a query, find the most relevant documents. In that context, high similarity across related content is a feature. You want "Paris hotels" to match documents about Paris travel, French accommodation, European tourism.
Drift detection inverts the problem. You're not asking "what's related?" You're asking "did we just change direction?" A model that compresses everything into a high-similarity band is useless for this task, regardless of how good it is at retrieval.
The industry default is wrong for the job. RAG systems inherit this blindness because they were never designed to track conversational state. They retrieve chunks. They don't route context.
>The DriftOS Choice
DriftOS uses paraphrase-mpnet-base-v2 for drift detection. Not because it's the best retrieval model. It isn't. Because it produces clean separation at the semantic boundaries that matter for routing decisions.
Once you measure gaps instead of similarity averages, the model choice stops being subjective.
The right model for retrieval is often the wrong model for routing. Drift detection requires separation, not similarity.
Same infrastructure, different job, different tool.
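As a sketch of the idea (not DriftOS's actual implementation), a drift scorer can be as small as embedding each turn with the paraphrase model and reporting its similarity to the previous turn. Where you anchor the comparison, previous turn, branch anchor, or rolling summary, is a design choice this post doesn't prescribe:

```python
# Minimal drift scorer in the spirit of the post; not DriftOS's implementation.
from sentence_transformers import SentenceTransformer, util

class DriftScorer:
    def __init__(self, model_name="sentence-transformers/paraphrase-mpnet-base-v2"):
        self.model = SentenceTransformer(model_name)
        self.prev_vec = None

    def score(self, message):
        """Similarity between this message and the previous one (None for the first turn)."""
        vec = self.model.encode(message)
        sim = None if self.prev_vec is None else util.cos_sim(self.prev_vec, vec).item()
        self.prev_vec = vec
        return sim

scorer = DriftScorer()
for turn in ["I want to book a hotel in Paris for my trip next month.",
             "Looking to reserve accommodation in Paris",
             "Can you recommend good restaurants in Paris?"]:
    print(scorer.score(turn))
```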
>What This Enables
With clean separation at semantic boundaries, you can set thresholds that actually work. The gaps in the gradient chart become decision boundaries: stay in the current branch above one threshold, consider a topic shift below another, branch to new context when similarity drops off the cliff.
The specific numbers are system-dependent. The point is that the gaps exist at all. General embedding models don't give you gaps to work with. Paraphrase models do.
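A routing policy over those gaps can then be a few lines. The threshold values below are hypothetical placeholders, there only to show the shape of the decision; as noted above, the right numbers are system-dependent and should sit inside your measured gaps:

```python
# Illustrative routing policy over a drift score. Thresholds are hypothetical.
STAY_THRESHOLD = 0.55    # above this: same subtopic, stay in the current branch
SHIFT_THRESHOLD = 0.25   # between thresholds: topic shift within the same domain

def route(similarity):
    if similarity >= STAY_THRESHOLD:
        return "stay"    # continue the current branch
    if similarity >= SHIFT_THRESHOLD:
        return "shift"   # flag a subtopic/topic change, keep partial context
    return "branch"      # similarity fell off the cliff: open a new branch

for sim in (0.82, 0.38, -0.03):
    print(sim, "->", route(sim))
```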
Next week: how DriftOS assembles context from branched conversations. Retrieval is solved. Assembly isn't.
Scott Waddell
Founder of DriftOS. Building conversational memory systems beyond retrieval. Former Antler product lead, ex-IBM. London. Kiwi.