How random noise becomes an embedding

One formula turns a blob of random numbers into a point shaped like a real embedding:

normalize( m + V · √Λ · z )

It looks opaque, so let's watch it happen — one piece at a time, in 2D. The maths is the same in 1024 dimensions; the picture is just flatter. We'll build up to what the actual layers of a real model look like.

0 · First, where do embeddings live?

Embeddings are compared by cosine similarity, which depends only on the angle between two vectors, not on their lengths. So length is throwaway, and every vector is really just its tip on the unit circle (a sphere, in 1024-D). That last step, normalize, is just "put the tip on the circle." Learn it first, and it won't be a mystery later.

[Interactive figure: set the angle of B and the length of B; a readout shows cosine(A, B).]
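The claim that length is throwaway can be checked in a few lines. This is a minimal sketch using NumPy; the vectors `a` and `b` and the helper names `normalize` and `cosine` are illustrative choices, not taken from the page.

```python
import numpy as np

def normalize(v):
    """Scale v to unit length, i.e. put its tip on the unit circle."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine of the angle between a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])   # 45 degrees away from a

# Stretching b changes its length but not its angle, so the cosine is unchanged.
print(cosine(a, b), cosine(a, 5.0 * b))

# Once both vectors are normalized, the cosine is just a dot product.
print(np.dot(normalize(a), normalize(b)))
```

Because normalized vectors make cosine a plain dot product, nothing is lost by keeping only the tip on the circle.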

1 · Build one point from pure noise

You just saw where points end up (the circle). Now watch one random point z get there — step through the three operations.

normalize( m + V · √Λ · z )

z → √Λ → V → +m

The knobs (shared with the sandbox below): √Λ₁ stretch, √Λ₂ stretch, V rotate, m shift x, m shift y.
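The same chain can be followed in code, one operation per line. This is a rough sketch in NumPy: the knob values are hypothetical settings picked for illustration, and `rotation` is a helper introduced here to stand in for V in 2D.

```python
import numpy as np

def rotation(theta):
    """2D rotation matrix for angle theta (radians); plays the role of V."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def normalize(v):
    """Put the tip of v on the unit circle."""
    return v / np.linalg.norm(v)

# Hypothetical knob settings, chosen only for illustration.
sqrt_lam = np.array([0.6, 0.15])    # √Λ₁ and √Λ₂ stretch
V = rotation(np.deg2rad(30.0))      # V rotate
m = np.array([0.8, 0.4])            # m shift x, m shift y

rng = np.random.default_rng(0)
z = rng.standard_normal(2)          # one point of pure noise

stretched = sqrt_lam * z            # √Λ · z (a diagonal matrix is an elementwise scale)
rotated = V @ stretched             # V · √Λ · z
shifted = m + rotated               # m + V · √Λ · z
point = normalize(shifted)          # tip on the unit circle

for name, v in [("z", z), ("stretch", stretched), ("rotate", rotated),
                ("shift", shifted), ("normalize", point)]:
    print(f"{name:>9}: {v}")
```

The rotation step never changes a vector's length; only the stretch and the final normalize do.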

2 · Now all the points at once

One point followed the chain. A whole cloud of noise follows the same chain — so the round blob becomes a tilted, off-center ellipse on the circle. Drag the knobs above and watch this react.

This green ellipse, with its size (√Λ), tilt (V) and position (m), is one "blob". Those three ingredients, set by the five knobs, fully describe its shape. Change the knobs above; everything here follows.
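One way to see that the knobs really do describe the blob is to push many noise points through the chain and measure the result. In this sketch (NumPy, with hypothetical knob values and a helper named `sample_blob` introduced here), the cloud's mean should land near m and its covariance near V Λ Vᵀ, both measured before the normalize step.

```python
import numpy as np

def sample_blob(m, V, sqrt_lam, n, rng):
    """Return (points before normalize, points on the unit circle) for n noise samples."""
    z = rng.standard_normal((n, len(m)))   # round blob of noise, one row per point
    x = m + (z * sqrt_lam) @ V.T           # row-wise form of m + V · √Λ · z
    return x, x / np.linalg.norm(x, axis=1, keepdims=True)

# Hypothetical knob settings, chosen only for illustration.
theta = np.deg2rad(30.0)
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
sqrt_lam = np.array([0.6, 0.15])
m = np.array([0.8, 0.4])

x, unit = sample_blob(m, V, sqrt_lam, 50_000, np.random.default_rng(0))

print("mean of cloud:", x.mean(axis=0))                  # close to m
print("covariance of cloud:\n", np.cov(x, rowvar=False))  # close to V Λ Vᵀ
print("V Λ Vᵀ:\n", V @ np.diag(sqrt_lam**2) @ V.T)
```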

3 · Your turn — match the target

If the knobs fully describe a blob, you should be able to recreate one by hand. Here's a hidden target (gray). Tune the knobs until your green blob sits on top of it.

[Interactive figure: a match readout, with sliders for √Λ₁, √Λ₂, V rotate, m shift x and m shift y.]

Slide the five knobs so the green ellipse covers the gray one.
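Matching by hand has an automatic counterpart: PCA reads the knobs straight off a cloud. The sketch below assumes the target is available as raw points before the normalize step, which the page does not state. The true settings and the helper name `fit_blob` are hypothetical.

```python
import numpy as np

def fit_blob(x):
    """Estimate (m, V, sqrt_lam) from points x of shape (n, d) by PCA."""
    m = x.mean(axis=0)
    lam, V = np.linalg.eigh(np.cov(x, rowvar=False))  # eigenvalues come back ascending
    order = np.argsort(lam)[::-1]                     # put the largest stretch first
    return m, V[:, order], np.sqrt(np.clip(lam[order], 0.0, None))

# A hidden "target" blob with hypothetical settings.
rng = np.random.default_rng(1)
theta = np.deg2rad(50.0)
V_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
sqrt_true = np.array([0.5, 0.1])
m_true = np.array([-0.3, 0.9])
target = m_true + (rng.standard_normal((20_000, 2)) * sqrt_true) @ V_true.T

m_hat, V_hat, sqrt_hat = fit_blob(target)

# An eigenvector's sign is arbitrary, so report the long axis as an angle modulo 180 degrees.
tilt = np.rad2deg(np.arctan2(V_hat[1, 0], V_hat[0, 0])) % 180.0

print("m    ", m_hat)
print("√Λ   ", sqrt_hat)
print("tilt ", tilt, "degrees")
```

The recovered values are estimates, so they approach the true settings as the number of points grows rather than matching them exactly.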

4 · One blob isn't enough — so use many

You matched a single tidy blob. But real data curves, and one ellipse can't bend. The fix: chop the cloud into K patches and give each its own √Λ, V, m — the same formula, once per patch. That's "local PCA".

[Interactive figure: slider for the number of patches K.]
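A rough sketch of the patch idea follows. The page does not say how the cloud is chopped up, so this version assumes a plain k-means split. The toy arc data and the helper names `kmeans`, `fit_patches` and `sample` are illustrative only.

```python
import numpy as np

def kmeans(x, k, rng, iters=25):
    """Plain Lloyd's k-means; returns a patch label for each row of x."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old center if a patch empties
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def fit_patches(x, labels):
    """Per-patch PCA: one (m, V, sqrt_lam, weight) tuple per usable patch."""
    patches = []
    for j in np.unique(labels):
        pts = x[labels == j]
        if len(pts) < 2:                     # too few points to estimate a covariance
            continue
        m = pts.mean(axis=0)
        lam, V = np.linalg.eigh(np.cov(pts, rowvar=False))
        patches.append((m, V, np.sqrt(np.clip(lam, 0.0, None)), len(pts) / len(x)))
    return patches

def sample(patches, n, rng):
    """Pick a patch by weight, then apply normalize(m + V · √Λ · z) for that patch."""
    weights = np.array([p[3] for p in patches])
    picks = rng.choice(len(patches), size=n, p=weights / weights.sum())
    out = np.empty((n, len(patches[0][0])))
    for i, j in enumerate(picks):
        m, V, sqrt_lam, _ = patches[j]
        out[i] = m + V @ (sqrt_lam * rng.standard_normal(len(m)))
    return out / np.linalg.norm(out, axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Toy curved data: a noisy arc, which no single ellipse can follow.
t = rng.uniform(0.0, np.pi, 2_000)
radius = 1.0 + 0.05 * rng.standard_normal((2_000, 1))
data = np.column_stack([np.cos(t), np.sin(t)]) * radius

labels = kmeans(data, k=6, rng=rng)
patches = fit_patches(data, labels)
points = sample(patches, 500, rng)
print(len(patches), "patches,", points.shape, "sampled points")
```

Each patch gets its own m, V and √Λ, so the formula itself never changes; only the number of blobs does.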

What the model's real layers look like — same machine, different patch sizes:

That's the whole story. Stretch a ball of noise by √Λ, rotate it by V, shift it by m, and normalize onto the sphere: that's one blob. Real embeddings are many such blobs tiling a curved surface. And the model's own layers are just different settings of the same knobs: the word layer (WE) shrinks every patch to a single dot (a lookup table, where each word is one fixed vector), while the final layer (L24) fattens them into spread-out, overlapping clouds (context has smeared each word out).
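The two extremes can be imitated by turning the stretch knob alone. In this sketch the √Λ values are hypothetical stand-ins, not measurements from any model, and V is left as the identity for brevity.

```python
import numpy as np

def sample_patch(m, sqrt_lam, n, rng):
    """normalize(m + √Λ · z) with V set to the identity, for n noise samples."""
    x = m + sqrt_lam * rng.standard_normal((n, len(m)))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
m = np.array([0.6, 0.8])

# √Λ = 0: the noise is multiplied away, so every sample is the same fixed vector.
dot = sample_patch(m, np.zeros(2), 5, rng)

# Larger √Λ: the same center, now smeared into a cloud.
cloud = sample_patch(m, np.array([0.3, 0.3]), 5, rng)

print(dot)     # five identical rows, all equal to normalize(m)
print(cloud)   # five different points scattered around m
```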