Tokenization is a Compression Codec Nobody Uses That Way

Kumar Shivendu (@KShivendu_)

Looking for a TL;DR? Skip to the key takeaways at the end.
Every vector database has two layers: the vectors (heavily optimized — quantized, compressed, carefully sharded) and the raw text payload (stored verbatim as UTF-8). Nobody talks about the second one.
I ran some benchmarks and found that the payload layer is a 5x compression opportunity sitting in plain sight. The tool to do it — BPE tokenization — is already in every ML stack. It just needs to be pointed at the right problem.
The Forgotten Layer
When you insert a document into Qdrant, Weaviate, or Elasticsearch, you typically store something like:
```json
{
  "id": "doc_123",
  "vector": [0.12, -0.34, ...],
  "payload": {
    "text": "Retrieval-Augmented Generation is a technique..."
  }
}
```
The vector gets quantized, compressed, and indexed carefully. The text field gets written as-is.
Qdrant stores payloads as JSON on RocksDB — no compression by default. Elasticsearch applies LZ4 on stored fields, which is better but still generic byte compression. Neither system knows anything about the structure of natural language text.
That's the gap.
Tokenization is Already Compression
BPE (Byte Pair Encoding) — the tokenizer behind GPT-2, GPT-4, and most modern LLMs — merges frequently co-occurring byte sequences into single tokens. "retrieval" becomes one token. "embeddings" becomes one token. Common phrases get compressed into fewer symbols.
The compression angle is underappreciated. Consider English text:
- Average word length: ~5 characters = 5 bytes (ASCII)
- Add a space: 6 bytes per word
- Average BPE token spans: ~0.77 words (English runs ~1.3 tokens per word)
- Token ID stored as uint16: 2 bytes

6 bytes/word ÷ 1.3 tokens/word ≈ 4.6 bytes/token (raw)
2 bytes/token (uint16)
ratio: ~2.3x
That's before applying any further compression on the token IDs themselves.
The catch: this only works if the tokenizer's vocabulary fits in uint16 (max 65,535). GPT-4's cl100k tokenizer has 100,277 tokens — that needs uint32 (4 bytes), which kills the advantage. GPT-2's r50k tokenizer has 50,257 tokens — fits cleanly in uint16.
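The uint16 boundary is easy to check directly with `struct`; a quick sketch (the two maximum token IDs below are just the vocab sizes minus one):

```python
import struct

r50k_max_id, cl100k_max_id = 50_256, 100_276  # vocab sizes minus one

# r50k IDs fit in uint16: exactly 2 bytes each
packed = struct.pack(">H", r50k_max_id)
assert len(packed) == 2

# cl100k IDs overflow uint16 (max 65,535), forcing 4-byte uint32
overflowed = False
try:
    struct.pack(">H", cl100k_max_id)
except struct.error:
    overflowed = True
assert overflowed
```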
| Tokenizer | Vocab size | ID type | Bytes/token |
|---|---|---|---|
| cl100k (GPT-4) | 100,277 | uint32 | 4 bytes ← no win |
| r50k (GPT-2) | 50,257 | uint16 | 2 bytes ← ~2.4x win |
| mxbai WordPiece | 30,522 | uint16 | 2 bytes ← ~2.4x win |
The Benchmark
I took a ~270-word paragraph about RAG and vector databases (representative of real payload content) and measured every combination I could think of.
```python
import struct
import gzip  # stdlib baseline
import tiktoken, zstandard, constriction  # third-party benchmark deps

enc = tiktoken.get_encoding("r50k_base")
token_ids = enc.encode(text)  # text: the 1,853-byte benchmark paragraph

# Pack as big-endian uint16 (2 bytes/ID; safe because r50k vocab < 65,536)
packed = struct.pack(f">{len(token_ids)}H", *token_ids)
```
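The generic baselines need nothing beyond the standard library. A minimal sketch of the gzip side, on a stand-in paragraph (not the benchmark text):

```python
import gzip

# Stand-in paragraph (not the benchmark doc): some repetition, like real prose
text = ("Retrieval-Augmented Generation combines vector search with a "
        "language model: the retriever fetches documents, the generator "
        "conditions on them. ") * 8
raw = text.encode("utf-8")

compressed = gzip.compress(raw, compresslevel=9)
print(f"{len(raw)} -> {len(compressed)} bytes "
      f"({len(raw) / len(compressed):.1f}x)")
assert gzip.decompress(compressed) == raw  # byte-perfect round trip
```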
Here's what I found:
| Method | Size | Ratio | CER |
|---|---|---|---|
| Raw UTF-8 | 1,853 bytes | 1.0x | 0% |
| gzip -9 | 984 bytes | 1.9x | 0% |
| zstd -22 | 981 bytes | 1.9x | 0% |
| r50k BPE uint16 (raw) | 784 bytes | 2.4x | 0% |
| r50k BPE + zstd | 709 bytes | 2.6x | 0% |
| r50k BPE + ANS | 364 bytes | 5.1x | 0% |
| mxbai WordPiece + ANS | 328 bytes | 5.7x | 20.1%* |
| Entropy lower bound | 323–358 bytes | ~5.5x | — |
*mxbai CER drops to 0% after simple post-processing — more on this below.
The headline: raw BPE uint16 already beats gzip on raw text, with zero further compression. That's just tokenization and packing.
Latency (2,000 runs, 1,853-byte doc)
Compression ratio is only half the story. Here's every component timed separately:
| Operation | Median | p99 |
|---|---|---|
| gzip -9 compress | 29µs | 36µs |
| gzip decompress | 18µs | 25µs |
| zstd -22 compress | 182µs | 267µs |
| zstd decompress | 4µs | 5µs |
| lz4 compress | 24µs | 29µs |
| lz4 decompress | 1µs | 1µs |
And the tokenizer-pipeline components:

| Operation | Median | p99 |
|---|---|---|
| r50k tokenize | 132µs | 208µs |
| r50k detokenize | 6µs | 10µs |
| r50k pack uint16 | 4µs | 5µs |
| r50k + zstd compress | 48µs | 59µs |
| r50k + zstd decompress | 3µs | 4µs |
| r50k ANS encode | 30µs | 42µs |
| r50k ANS decode | 32µs | 41µs |
Full pipelines:
| Pipeline | Median | p99 |
|---|---|---|
| r50k encode (tokenize + ANS) | 156µs | 287µs |
| r50k decode (ANS + detokenize) | 41µs | 53µs |
| mxbai encode (tokenize + ANS) | 750µs | 1,270µs |
| mxbai decode (ANS + detokenize + fix) | 420µs | 603µs |
A few things stand out:
- The tokenizer dominates encode latency, not ANS. r50k tokenization is 132µs; ANS on top adds only 30µs. If you already tokenize documents at ingest time (which most embedding pipelines do), the extra encode cost is just those 30µs.
- Reads are fast. r50k decode — ANS + detokenize — takes 41µs total. Fast enough for any real-time serving path.
- zstd -22 is slow to compress (182µs) because `--ultra` levels trade CPU heavily for ratio. At a lower level (say zstd -3), it would be ~15µs but give up some compression.
- ANS is symmetric — encode and decode are nearly identical (~31µs each), unlike zstd, where decompress (4µs) is 45× faster than compress.
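The component timings above came from repeated runs per operation. A minimal version of that kind of harness (my sketch, not the benchmark's exact code), timing stdlib gzip:

```python
import gzip
import statistics
import time

def time_op(fn, runs=200):
    """Return (median_us, approx_p99_us) over `runs` invocations of fn."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    samples.sort()
    return statistics.median(samples), samples[int(runs * 0.99) - 1]

payload = b"retrieval augmented generation " * 60  # ~1.8 KB, like the test doc
median_us, p99_us = time_op(lambda: gzip.compress(payload, compresslevel=9))
print(f"gzip -9 compress: median {median_us:.0f}us, p99 {p99_us:.0f}us")
```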
What is ANS and Why Does It Matter
zstd operates on bytes. It doesn't know that [0x01, 0x7F] is token ID 383 — it just sees two opaque bytes.
ANS (Asymmetric Numeral Systems) — the entropy-coding family inside zstd's FSE stage, also used in LZFSE and JPEG XL — can be applied directly to token IDs. It knows that token 198 (" the") appears 40× more often than token 47291 ("embeddings"), so it assigns shorter codes to frequent tokens.
The result: ANS on token IDs gets within 6 bytes of the Shannon entropy lower bound — the theoretical minimum for lossless compression.
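The bound itself is easy to compute: with empirical marginal probabilities p_i, the minimum is n·H bits, where H = -Σ p_i log2 p_i. A stdlib sketch on a made-up skewed ID stream:

```python
import math
from collections import Counter

def entropy_lower_bound_bytes(ids):
    """Shannon bound in bytes for coding `ids` from their empirical
    marginal distribution (what ANS with a frequency table approaches)."""
    counts = Counter(ids)
    n = len(ids)
    h_bits = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.ceil(n * h_bits / 8)

# Skewed toy stream: one very frequent ID, a few rare ones
ids = [198] * 40 + [383] * 10 + list(range(10))
print(entropy_lower_bound_bytes(ids), "bytes minimum for", len(ids), "tokens")
```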
```python
import numpy as np
import constriction  # third-party: Rust-backed entropy coders
from collections import Counter

def ans_encode(ids):
    # Empirical frequency model over the token IDs
    counts = Counter(ids)
    vocab = sorted(counts.keys())
    v2i = {v: i for i, v in enumerate(vocab)}
    freqs = np.array([counts[v] for v in vocab], dtype=np.float64)
    freqs /= freqs.sum()
    model = constriction.stream.model.Categorical(freqs, perfect=False)

    # ANS is a stack: encode in reverse so decoding reads out forward
    encoder = constriction.stream.stack.AnsCoder()
    encoder.encode_reverse(
        np.array([v2i[t] for t in ids], dtype=np.int32), model
    )
    # (a real codec would also persist `vocab` and `freqs` for the decoder)
    return encoder.get_compressed().tobytes()
```
The full pipeline:
text (UTF-8)
│
▼
BPE tokenizer (r50k)
│ 50,257-token vocab, fits uint16
▼
[token_id_1, token_id_2, ...] ← 2 bytes each
│
▼
ANS encoder (frequency model over token IDs)
│ assigns short codes to frequent tokens
▼
compressed bytes ← 5x smaller than original text
│
▼ (decode path)
ANS decoder → token IDs → tokenizer.decode → original text
The mxbai Story: Better Compression, Surprising Catch
The mixedbread embedding model (mxbai-embed-large-v1) uses BERT's WordPiece tokenizer with 30,522 tokens. Smaller vocabulary → more common tokens → lower entropy → better compression.
mxbai + ANS reaches 328 bytes vs r50k + ANS at 364 bytes — a 10% compression win. But when I measured latency and recovery, the story got more complicated.
| Tokenizer | CER | CER (ignore case+whitespace) | CER (+ fix punct) |
|---|---|---|---|
| r50k BPE | 0.0% | 0.0% | 0.0% |
| mxbai WordPiece | 20.1% | 6.4% | 0.0% |
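CER here is character error rate: edit distance between the original and recovered text, divided by the original length. The post doesn't show its metric code, so here is a plain Levenshtein-based version (an assumption about the exact definition):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / len(reference)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # DP row: distances against the empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m

# Lowercasing alone costs one substitution per affected character:
print(cer("Qdrant is fast", "qdrant is fast"))
```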
The 20% CER is entirely from two deterministic artifacts of BERT tokenization:
- Lowercasing — `"Qdrant"` → `"qdrant"`, `"RAG"` → `"rag"`
- Punctuation spacing — `"(content)"` → `"( content )"`, `"up-to-date"` → `"up - to - date"`
Neither is information loss at the token level. The information is there — it's just encoded with different whitespace conventions. Three regex rules fix it completely:
```python
import re

def fix_punct(s):
    # Undo WordPiece's detokenization spacing around punctuation
    s = re.sub(r"\( ", "(", s)               # "( content" -> "(content"
    s = re.sub(r" \)", ")", s)               # "content )" -> "content)"
    s = re.sub(r"(\w) - (\w)", r"\1-\2", s)  # "up - to - date" -> "up-to-date"
    return s
```
So mxbai is effectively lossless too, with a small detokenizer. Whether that's acceptable depends on your use case — if you need byte-perfect round trips without post-processing, r50k is the clean choice.
The latency picture makes mxbai harder to justify:
| Tokenizer | Encode | Decode | Compressed size |
|---|---|---|---|
| r50k + ANS | 156µs | 41µs | 364 bytes |
| mxbai + ANS | 750µs | 420µs | 328 bytes |
mxbai is 5× slower on encode and 10× slower on decode — entirely due to the HuggingFace tokenizer's Python overhead. The 10% smaller output (328 vs 364 bytes) doesn't come close to justifying that. For a storage codec, r50k is the practical choice.
On UNK tokens: WordPiece has a [UNK] token that causes real information loss. For standard English text — including code, URLs, numbers, and gibberish ASCII — it essentially never fires; in my tests it fired only on emojis (🚀 → [UNK]) and a few rare Unicode math symbols. So for typical RAG payloads over web or document text, UNK is a non-issue.
Real-World Scale: English Wikipedia
English Wikipedia is ~7M articles, 708 words/article on average — a useful benchmark for RAG-scale corpora.
Extrapolating from the benchmark:
| Method | Per doc | 7M docs | vs raw |
|---|---|---|---|
| Raw UTF-8 | 4,497 bytes | 31.5 GB | 1x |
| zstd -22 | 2,382 bytes | 16.7 GB | 1.9x |
| r50k + zstd | 1,721 bytes | 12.1 GB | 2.6x |
| r50k + ANS | 884 bytes | 6.2 GB | 5.1x |
| mxbai + ANS | 797 bytes | 5.6 GB | 5.6x |
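The extrapolation is straight arithmetic on the per-doc sizes; a sketch that reproduces the table:

```python
DOCS = 7_000_000  # English Wikipedia articles

# Measured bytes per average 708-word article (scaled from the benchmark doc)
per_doc = {
    "raw utf-8":   4497,
    "zstd -22":    2382,
    "r50k + zstd": 1721,
    "r50k + ANS":   884,
    "mxbai + ANS":  797,
}

for method, nbytes in per_doc.items():
    total_gb = nbytes * DOCS / 1e9
    ratio = per_doc["raw utf-8"] / nbytes
    print(f"{method:12s} {total_gb:5.1f} GB  ({ratio:.1f}x)")
```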
At 1B docs the storage gap is more dramatic:
| Method | 1B docs | S3 ($0.02/GB/mo) | Disk ($0.08/GB/mo) |
|---|---|---|---|
| Raw UTF-8 | 4.5 TB | ~$92 | ~$368 |
| r50k + ANS | 0.88 TB | ~$18 | ~$72 |
| Saving | 3.6 TB | ~$74/mo | ~$296/mo |
Disk savings dominate. Vector databases keep hot payloads on ephemeral SSD for fast reads — that's where the compression matters most: roughly $296/month saved on disk alone. The I/O win is the same magnitude: 0.88 TB reads 5× faster than 4.5 TB, which directly reduces query latency when payloads are fetched during retrieval.
S3 savings are smaller in dollar terms but still meaningful at scale. S3 storage is cheap per GB, but egress is not: at ~$0.09/GB, moving 3.6 TB less data saves ~$324 per transfer. For a system that re-indexes monthly or replicates across regions, that adds up fast.
How This Compares to State of the Art
For context, here's where this sits on the broader compression landscape (benchmarked on enwik9, 1GB Wikipedia):
| Method | bits/byte | Ratio vs raw | Practical? |
|---|---|---|---|
| Raw UTF-8 | 8.0 | 1x | — |
| gzip -9 | ~2.6 | 3.1x | Yes |
| zstd -22 | ~2.0 | 4x | Yes |
| r50k + ANS (this post) | ~1.56 | ~5x | Yes |
| zpaq | ~1.4 | 5.7x | Slow |
| fx2-cmix (Hutter Prize record, Oct 2024) | 0.887 | 9x | Extremely slow |
| Chinchilla 70B (DeepMind, 2024) | 0.664 | 12x | Not practical |
The gap between r50k + ANS (~5x) and the Hutter Prize record (~9x) is exactly the signal that lives in conditional token probabilities. ANS here only models marginal frequency — how often each token appears globally. The top compressors model P(token | all previous tokens), which is what LLMs do. That's the remaining 2x on the table.
Getting from 5x to 9x requires a language model. Getting from 1x to 5x just requires a tokenizer and 30 lines of Python.
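That gap can be seen at toy scale: conditioning on the previous symbol lowers entropy whenever the stream has sequential structure. A stdlib sketch, with characters standing in for tokens:

```python
import math
from collections import Counter, defaultdict

def marginal_entropy(seq):
    """Bits/symbol if every symbol is coded from its global frequency."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

def conditional_entropy(seq):
    """Bits/symbol when coding each symbol given the previous one: H(x_t | x_{t-1})."""
    ctx = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        ctx[prev][cur] += 1
    n = len(seq) - 1
    h = 0.0
    for counts in ctx.values():
        total = sum(counts.values())
        h_ctx = -sum(c / total * math.log2(c / total) for c in counts.values())
        h += (total / n) * h_ctx
    return h

text = "the cat sat on the mat because the cat liked the mat " * 4
print(f"marginal:    {marginal_entropy(text):.2f} bits/char")
print(f"conditional: {conditional_entropy(text):.2f} bits/char")
```

The conditional number comes out lower; that difference, at scale and with deep context instead of one previous symbol, is what LLM-based compressors exploit.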
What Needs to Change
Nothing in the current stack prevents this. The pieces exist:
- `tiktoken` is already a dependency in any LLM-adjacent stack
- `constriction` is a Rust-backed library with a Python API, used in ML compression research
- Decoding is fast — ANS is symmetric, O(n) in both directions
What would need to change in a vector DB like Qdrant:
- Accept `encoding: "r50k"` as a payload storage option
- Pack token IDs as uint16, run ANS on the stream
- Decompress transparently on read
The index, vectors, and query pipeline stay identical. This is purely a payload storage optimization.
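The storage change can be sketched as a transparent codec. Everything here is hypothetical (class and method names are mine, not Qdrant's API), with stdlib zlib standing in for the tokenize → uint16 pack → ANS pipeline:

```python
import zlib

class PayloadCodec:
    """Hypothetical transparent payload codec. zlib stands in for
    tokenize -> uint16 pack -> ANS; the interface is the point:
    callers never see compressed bytes."""

    def put(self, store: dict, key: str, text: str) -> None:
        store[key] = zlib.compress(text.encode("utf-8"), 9)

    def get(self, store: dict, key: str) -> str:
        return zlib.decompress(store[key]).decode("utf-8")

store = {}
codec = PayloadCodec()
doc = "Retrieval-Augmented Generation is a technique..."
codec.put(store, "doc_123", doc)
assert codec.get(store, "doc_123") == doc  # same API, smaller bytes at rest
```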
The latency overhead is acceptable for both paths:
- Write (encode): 156µs per doc. Ingest pipelines already spend far more than this on embedding generation (~25ms for a 780-token doc on H100). It's a rounding error.
- Read (decode): 41µs per doc. A typical RAG query returns 10–20 docs — that's 400–800µs of decode overhead on a path that already includes vector search and reranking. Negligible.
Current Proposed
───────────────────────────────────────────────────────
INSERT INSERT
payload: {"text": "..."} payload: {"text": "..."}
│ │
▼ ▼
JSON → RocksDB (raw) tokenize → uint16 IDs → ANS
│
▼
RocksDB (compressed)
READ READ
fetch raw JSON bytes fetch compressed bytes
│ │
▼ ▼
return as-is ANS decode → token IDs → decode
│
▼
return original text
The entire round-trip is transparent to the user. Same API, 5x smaller payloads.
Key Takeaways
- Vector databases optimize vectors aggressively but store raw text payloads verbatim — a 5x compression opportunity.
- BPE tokenization (r50k, vocab=50,257) produces uint16 token IDs — 2 bytes each — that already beat gzip on raw text with zero further compression.
- Adding ANS entropy coding on token IDs brings you to 5x compression using 30 lines of Python and infrastructure already in every ML stack.
- r50k is perfectly lossless (0% CER). mxbai WordPiece gives 5.7x but lowercases text — fixable with simple post-processing, but also 5–10× slower to tokenize.
- Latency is not a concern: encode adds 156µs/doc (vs ~25ms for embedding generation); decode adds 41µs/doc (negligible on a query path that already does vector search + reranking).
- The tokenizer dominates encode latency (132µs), not ANS (30µs). If you tokenize at ingest anyway, the marginal encode cost is just those 30µs.
- The gap between this (~5x) and the Hutter Prize record (~9x) is conditional token probability — i.e., you need a language model. Getting to 5x doesn't.
- At English Wikipedia scale (7M docs), this takes payload storage from 31.5 GB to 6.2 GB. The bigger win is I/O: 5x faster bulk reads, indexing, and replication.
Acknowledgements
The compression benchmarks were run using tiktoken, zstandard, lz4, and constriction. The Hutter Prize numbers are from Matt Mahoney's Large Text Compression Benchmark.
- https://www.mattmahoney.net/dc/text.html
- https://arxiv.org/abs/2309.10668 (Language Modeling is Compression, DeepMind)
- https://bellard.org/nncp/
- https://github.com/bamler-lab/constriction
- https://github.com/openai/tiktoken