The Free Lunch in Dense Retrieval: Centering Your Embeddings

This post builds on the embedding geometry series. Part 1 covers token collapse. Part 3 validates findings at scale. The core finding from that series: all mxbai/bge/gte embeddings live in a narrow cone, and the cone poisons cosine similarity. This post is about the cheapest possible fix.

Here are the experiment results first:

Discrimination and semantic similarity (does centering help scores?):

Strategy	Anisotropy	SNLI discrimination gap	STS-B Pearson r	Effective dims
Raw (baseline)	0.414	+0.212	0.893	54.7
Subtract mean	~0.000	+0.350 (+65%)	0.895	54.7
Whiten (full 1024 dims)	~0.000	+0.130 (−39%)	0.837	1023
Whiten (top 256 dims)	~0.000	+0.209	0.840	256
Whiten (top 56 dims)	~0.000	+0.321	0.825	56

Retrieval rankings (does centering help top-k recall?):

Strategy	STS-B Recall@1	STS-B Recall@5	STS-B NDCG@10
Raw (baseline)	0.582	0.714	0.669
Subtract mean	0.582	0.717	0.669
Whiten (full 1024 dims)	0.384	0.499	0.461
Whiten (top 256 dims)	0.562	0.687	0.647
Whiten (top 56 dims)	0.511	0.661	0.609

Two very different stories. Centering eliminates anisotropy completely and improves hard-negative discrimination by 65% — but leaves retrieval rankings essentially unchanged. Full whitening actively hurts both. The simple thing wins, but only in score space, not rank space.

This distinction matters for where you apply it.

Why All Dense Embeddings Point the Same Direction

BERT pretraining uses masked language modelling (MLM): predict a masked token from its context. The output head is softmax(W · h) where h is the hidden state and W is the shared token embedding matrix (~30k vocabulary terms).

High-frequency tokens — "the", "a", "is", "of" — appear in almost every training batch. Their embedding vectors in W receive the most gradient signal and establish a dominant direction in the 1024-dim space. Every sentence representation gets pulled slightly toward this direction because predicting common tokens is a constant pressure across all training examples.

Self-attention compounds this: each layer computes a weighted average of value vectors, and mean-pooling averages again. After that many rounds of averaging (24 layers in the BERT-large backbone that these 1024-dim models share, plus pooling), every sentence representation has converged toward the same shared "generic English sentence" direction, regardless of what the sentence actually says.

This is the cone: all embeddings point roughly northeast (in some arbitrary basis), with only small angular deviations encoding semantic content. PCA on mxbai embeddings confirms it — the top eigenvalue explains a disproportionate fraction of variance, and the effective dimensionality (participation ratio) is ~55 out of 1024.

The practical consequence: cosine similarity between two mxbai embeddings is dominated by this shared direction. Even contradictions — "the market surged" vs "the market plunged" — have cosine ~0.62 at the final layer, because both vectors are mostly pointing northeast.
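To make the two measurements concrete, here is a minimal sketch. It assumes "anisotropy" means the average cosine between distinct pairs of embeddings and "effective dims" means the participation ratio of the covariance eigenvalues; the function names and the synthetic data are illustrative stand-ins, not the code behind the tables above.

```python
import numpy as np

def anisotropy(emb: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of rows."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(unit)
    # Exclude the diagonal (each vector's similarity with itself is 1).
    return float((sims.sum() - n) / (n * (n - 1)))

def participation_ratio(emb: np.ndarray) -> float:
    """Effective dimensionality: (sum of eigenvalues)^2 / sum of squared eigenvalues."""
    centered = emb - emb.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())

# Synthetic stand-in for real embeddings: a shared offset plus small noise
# confined to 8 of 64 dimensions.
rng = np.random.default_rng(0)
offset = np.ones(64)
noise = np.zeros((2000, 64))
noise[:, :8] = rng.normal(scale=0.5, size=(2000, 8))
cone = offset + noise

print(anisotropy(cone))            # high: every vector shares the offset
print(participation_ratio(cone))   # close to 8, not 64
```

On real data you would pass the matrix returned by the encoder instead of the synthetic cone.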

Why This Is BM25's Problem, Solved Differently

BM25 (1994) handles the same problem on the sparse side with inverse document frequency, an idea that goes back to Spärck Jones in 1972. In its simplest form:

IDF(term) = log(N / df)

Terms appearing in every document (df ≈ N) get IDF ≈ 0. They contribute nothing to the relevance score. Signal comes from rare, specific terms with high IDF.
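A tiny worked example of the formula, using a made-up three-document corpus. This is the plain log(N / df) form quoted above; BM25 itself uses a smoothed variant, which this sketch does not reproduce.

```python
import math

# Toy corpus, invented for illustration.
docs = [
    "the market surged on strong earnings",
    "the market plunged after the report",
    "the central bank held rates steady",
]

def idf(term: str, corpus: list[str]) -> float:
    """Classic IDF: log(N / df). Assumes the term occurs in at least one document."""
    df = sum(term in doc.split() for doc in corpus)
    return math.log(len(corpus) / df)

print(idf("the", docs))      # 0.0: appears in every document
print(idf("market", docs))   # log(3/2), about 0.405
print(idf("surged", docs))   # log(3), about 1.099
```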

Dense embedding models have no IDF. MLM pretraining implicitly does the opposite — high-frequency tokens dominate the geometry. The embedding space is organised around what's most common, not what's most discriminative.

Centering is the embedding equivalent of IDF: identify the "common direction" (the mean vector, analogous to common terms) and remove it before scoring. What remains is the variation that actually distinguishes sentences.

What Centering Does Geometrically

Every mxbai embedding vector can be decomposed as:

v = μ + (v - μ)

where μ is the corpus mean and (v - μ) is the residual — the part unique to that sentence.

When you compute cosine similarity without centering:

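cos(a, b) = dot(a, b) / (||a|| · ||b||)
          = [dot(μ, μ) + dot(μ, b - μ) + dot(a - μ, μ) + dot(a - μ, b - μ)] / (||a|| · ||b||)
          ≈ dot(μ, μ) / (||a|| · ||b||) + semantic_signal
          = big shared constant + small meaningful signal

The two cross terms, dot(μ, b - μ) and dot(a - μ, μ), average to zero over the corpus because the residuals sum to zero by definition of μ. The dot(μ, μ) term is the same for every pair, and for unit-normalised embeddings so is the denominator. It doesn't help rank pairs relative to each other; it just inflates every cosine score by roughly the same amount and compresses the useful dynamic range.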

After centering:

cos(a - μ, b - μ) = dot(a - μ, b - μ) / (||a - μ|| · ||b - μ||)
                  = only semantic_signal

The shared constant is gone. The 55 effective semantic dimensions now operate without the mean direction drowning them out.

This is why anisotropy drops from 0.414 to ~0.000: after centering, sentence pairs that have nothing in common no longer share the "northeast" component, so their cosine drops to near zero (as it should in a well-behaved embedding space).
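The effect is easy to reproduce on synthetic vectors. The sketch below builds two "unrelated sentences" as a shared component plus independent residuals; the dimension, scales, and the known-in-advance mu are assumptions chosen for illustration, not measurements from mxbai.

```python
import numpy as np

def cosine(x: np.ndarray, y: np.ndarray) -> float:
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
dim = 512
mu = np.full(dim, 0.1)                     # shared component (hypothetical corpus mean)
a = mu + rng.normal(scale=0.05, size=dim)  # two unrelated "sentences":
b = mu + rng.normal(scale=0.05, size=dim)  # independent residuals around mu

# The constant contributed by mu alone: dot(mu, mu) / (||a|| * ||b||).
shared = float(mu @ mu / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(a, b))            # dominated by the shared term
print(shared)                  # nearly the same number
print(cosine(a - mu, b - mu))  # near zero once mu is removed
```

In practice mu is estimated from a sample of embeddings rather than known exactly, but the arithmetic is the same.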

Why Full Whitening Backfires

Whitening goes further than centering: it rescales each principal direction to unit variance. The intent is to make all 1024 dimensions equally active.

The problem: mxbai has ~55 genuine signal dimensions and ~969 near-zero noise dimensions. Centering leaves the noise dimensions alone: they're nearly zero, so their contribution to cosine is nearly zero anyway. Whitening amplifies them to the same scale as the signal dimensions, so 969 noise dimensions suddenly drown out the 55 signal dimensions.

whiten_full: gap = +0.130  (worse than raw's +0.212)

The cone's axis (mean direction) is noise — remove it. The cone's spread contains the signal — preserve it. Centering removes exactly the right thing. Full whitening removes the right thing and then breaks everything else.

Truncated whitening to 56 dimensions (whiten_56) recovers most of the centering benefit (gap = +0.321) because it keeps only the signal dimensions and discards the noise. But it's more complex to implement and loses some of the graded similarity signal that STS-B measures (r drops from 0.895 to 0.825).
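For reference, a minimal sketch of PCA whitening with truncation to the top k components. The function names and the toy data are hypothetical; this is one common way to whiten, not necessarily the exact procedure behind the whiten_56 numbers.

```python
import numpy as np

def fit_whitener(emb: np.ndarray, k: int, eps: float = 1e-8):
    """Return (mean, projection) for PCA whitening onto the top-k components."""
    mu = emb.mean(axis=0)
    cov = np.cov(emb - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]             # indices of the k largest
    # Scale each kept direction by 1/sqrt(eigenvalue) so it has unit variance.
    proj = eigvecs[:, top] / np.sqrt(eigvals[top] + eps)
    return mu, proj

def whiten(x: np.ndarray, mu: np.ndarray, proj: np.ndarray) -> np.ndarray:
    return (x - mu) @ proj

rng = np.random.default_rng(0)
# Synthetic data: 4 strong signal dimensions, 12 weak noise dimensions.
scales = np.array([1.0] * 4 + [0.01] * 12)
emb = rng.normal(size=(5000, 16)) * scales

mu, proj = fit_whitener(emb, k=4)
white = whiten(emb, mu, proj)
print(white.shape)                           # (5000, 4)
print(np.cov(white, rowvar=False).round(2))  # approximately the identity
```

Setting k to the full dimension gives full whitening. In this toy data that multiplies the weak directions by about 100, which is the amplification problem described above.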

For production use: just center.

Computing the Mean Vector

The mean vector needs to be representative of your embedding space. Three approaches:

1. From the model's pretraining distribution (best portability)

The cone's axis is almost entirely a pretraining artifact: it's the same for mxbai, bge-large, and gte-large (their backbone weights are 99.99% identical through layer 12). A mean vector computed on any large English corpus approximates the true cone axis well. Mixedbread could publish a precomputed mean vector alongside the model weights.

2. From your collection (best quality)

Sample 5–10k documents from your actual collection, compute their embeddings, average them. This accounts for domain shift — a medical corpus has a slightly different mean than a news corpus, both shifted from the base pretraining direction.

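import random
from sentence_transformers import SentenceTransformer  # assumed library; model.encode matches its API

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
sample = random.sample(all_documents, min(5000, len(all_documents)))  # all_documents: list of strings
embeddings = model.encode(sample)      # NumPy array, shape: (len(sample), 1024)
mean_vector = embeddings.mean(axis=0)  # shape: (1024,)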

3. Incremental update as collection grows

Maintain a running mean with a constant-time update per insert (O(d) in the embedding dimension, independent of collection size):

# On each new document insert:
mean_vector = mean_vector * (N / (N + 1)) + new_embedding / (N + 1)
N += 1

This keeps the mean current without reprocessing old vectors. The mean stabilises quickly — after ~1k documents it changes by less than 1% per batch.
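The same update wrapped in a small class, as a sketch. RunningMean is a hypothetical helper, and the random stream stands in for real embeddings; the check at the end confirms the incremental result matches a batch mean.

```python
import numpy as np

class RunningMean:
    """Incremental mean of embedding vectors; one O(d) update per insert."""

    def __init__(self, dim: int):
        self.mean = np.zeros(dim)
        self.count = 0

    def update(self, embedding: np.ndarray) -> None:
        self.count += 1
        # Equivalent to mean * (N / (N + 1)) + x / (N + 1), written as a
        # correction step so it also works for the very first insert.
        self.mean += (embedding - self.mean) / self.count

rng = np.random.default_rng(0)
stream = rng.normal(loc=0.3, size=(1000, 8))  # stand-in for incoming embeddings

tracker = RunningMean(dim=8)
for vec in stream:
    tracker.update(vec)

print(np.allclose(tracker.mean, stream.mean(axis=0)))  # True
```

Deletions are not handled here; a real collection would need the reverse update or a periodic recompute.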

How a Vector Database Could Implement This

Currently, centering requires the user to subtract the mean before uploading and before each query. This is a footgun: if only one side is centered, scores are worse than raw. Most users don't know to do it at all.

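A vector database could make this automatic. The API below is hypothetical, sketched in the style of the Qdrant Python client; CenteringConfig and the centering parameter are invented for illustration: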

# Collection creation
client.create_collection(
    "my_collection",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    centering=CenteringConfig(
        mode="auto",              # compute mean from collection
        update_interval="daily",  # recompute periodically
    )
)

# Upload raw embeddings — database stores them as-is
client.upsert("my_collection", points=[...])

# Query with raw embedding — database centers at scoring time
client.search("my_collection", query_vector=raw_query)

The database stores raw vectors and the current mean vector. At scoring time it applies:

score(query, doc) = cosine(query - μ, doc - μ)

This can be rewritten as a modified dot product without recomputing centered vectors:

dot(q - μ, d - μ) = dot(q, d) - dot(q, μ) - dot(μ, d) + dot(μ, μ)

dot(μ, μ) is a precomputed constant. dot(q, μ) is computed once per query (one dot product). dot(μ, d) can be precomputed for every stored document at mean-update time, and the centered norm needed for the cosine follows from it: ||d - μ||² = ||d||² - 2·dot(μ, d) + dot(μ, μ). Net cost: one extra dot product per query and one or two cached floats per document.

No re-indexing. No re-encoding. No model changes.
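A NumPy sketch of that scoring path, checked against explicit centering. The random arrays stand in for stored vectors and a query, and the variable names are illustrative; a real engine would keep the per-document scalars alongside its index rather than in plain arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(loc=0.2, size=(500, 64))   # stand-in for stored raw vectors
query = rng.normal(loc=0.2, size=64)         # stand-in for a raw query vector
mu = docs.mean(axis=0)

# Precomputed at mean-update time: one constant plus per-document scalars.
mu_sq = mu @ mu
doc_dot_mu = docs @ mu                                        # dot(mu, d) per document
doc_sq = np.einsum("ij,ij->i", docs, docs)                    # ||d||^2 per document
doc_centered_norm = np.sqrt(doc_sq - 2 * doc_dot_mu + mu_sq)  # ||d - mu||

# Per query: one extra dot product with mu.
q_dot_mu = query @ mu
q_centered_norm = np.sqrt(query @ query - 2 * q_dot_mu + mu_sq)

# dot(q - mu, d - mu) = dot(q, d) - dot(q, mu) - dot(mu, d) + dot(mu, mu)
numer = docs @ query - q_dot_mu - doc_dot_mu + mu_sq
scores = numer / (q_centered_norm * doc_centered_norm)

# Reference: center explicitly, then take the cosine.
qc, dc = query - mu, docs - mu
reference = (dc @ qc) / (np.linalg.norm(qc) * np.linalg.norm(dc, axis=1))
print(np.allclose(scores, reference))  # True
```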

Where Centering Helps (and Where It Doesn't)

The retrieval experiment clarifies the boundary precisely.

What centering fixes: score calibration. The SNLI discrimination gap measures how well the model separates hard negatives (contradiction pairs) from random cross-pairs. Raw mxbai: +0.212. Centered mxbai: +0.350. That's a 65% improvement in the absolute range of cosine scores. Similarity scores become meaningful numbers: a cosine of 0.85 after centering actually indicates high similarity, whereas before centering it might just mean "two English sentences."

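What centering doesn't fix: top-k ranking. STS-B Recall@1 is 0.582 before and after centering, and Recall@5 moves only from 0.714 to 0.717. The rank ordering of documents is almost entirely preserved. This makes sense: subtracting μ shifts every vector by the same amount, which leaves the Euclidean distances between vectors untouched. Cosine is not translation-invariant, so the ordering can change in principle, but for a fixed query the shift mostly subtracts a near-constant from every document's score, and a near-constant offset doesn't reorder anything.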

This means centering's benefit depends on how you use similarity scores:

Use case	Centering helps?
Top-k ANN retrieval (just need the ranking)	No
Score thresholding (is this pair similar enough?)	Yes — 65% wider dynamic range
Hybrid search score fusion (dense + sparse weighted sum)	Yes — scores are better calibrated
Re-ranking with similarity as a feature	Yes
Clustering with cosine distance	Yes

For production systems doing pure nearest-neighbour search, centering is neutral: it won't improve Recall@1 or Recall@5. For systems that use cosine scores as confidence measures, thresholds, or fusion weights, it's the highest-ROI post-processing step available.

Full whitening, by contrast, damages even the rankings (STS-B Recall@1 drops from 0.582 to 0.384), because it amplifies the ~969 noise dimensions until they overwhelm the ~55 signal dimensions.

The centering vector is portable: compute it once on a background corpus, reuse it across all collections using the same embedding model. If Mixedbread published a precomputed mean for mxbai-embed-large-v1 (1024 float32 values, 4 KB), every user would benefit immediately with zero retraining.

The cone exists because MLM pretraining didn't optimise for isotropy. Centering is the minimal surgical fix: it removes exactly the non-informative component the pretraining introduced, leaving the 55-dimensional semantic signal intact and uncompressed. It's a free lunch — just a more specific one than it first appeared.