Tokenization is a Compression Codec Nobody Uses That Way

Kumar Shivendu (@KShivendu_)

Looking for a TL;DR? Skip to the key takeaways at the end.
Every vector database has two layers: the vectors (heavily optimized — quantized, compressed, carefully sharded) and the raw text payload (stored verbatim as UTF-8). Nobody talks about the second one.
I ran some benchmarks and found that the payload layer is a 5x compression opportunity sitting in plain sight. The tool to do it — BPE tokenization — is already in every ML stack. It just needs to be pointed at the right problem.
The Forgotten Layer
When you insert a document into Qdrant, Weaviate, or Elasticsearch, you typically store something like:
```json
{
  "id": "doc_123",
  "vector": [0.12, -0.34, ...],
  "payload": {
    "text": "Retrieval-Augmented Generation is a technique..."
  }
}
```
The vector gets quantized, compressed, and indexed carefully. The text field gets written as-is.
Qdrant stores payloads as JSON on RocksDB — no compression by default. Elasticsearch applies LZ4 on stored fields, which is better but still generic byte compression. Neither system knows anything about the structure of natural language text.
That's the gap.
Tokenization is Already Compression
BPE (Byte Pair Encoding) — the tokenizer behind GPT-2, GPT-4, and most modern LLMs — merges frequently co-occurring byte sequences into single tokens. "retrieval" becomes one token. "embeddings" becomes one token. Common phrases get compressed into fewer symbols.
The compression angle is underappreciated. Consider English text:
- Average word length: ~5 characters = 5 bytes (ASCII)
- Add a space: 6 bytes per word
- Average BPE token spans: ~0.77 words (English runs ~1.3 tokens per word)
- Token ID stored as uint16: 2 bytes

6 bytes/word ÷ 1.3 tokens/word ≈ 4.6 bytes/token (raw)
2 bytes/token (uint16)
ratio: ~2.3x
That's before applying any further compression on the token IDs themselves.
The catch: this only works if the tokenizer's vocabulary fits in uint16 (max 65,535). GPT-4's cl100k tokenizer has 100,277 tokens — that needs uint32 (4 bytes), which kills the advantage. GPT-2's r50k tokenizer has 50,257 tokens — fits cleanly in uint16.
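The uint16 boundary is easy to check directly with `struct`; a quick sketch (the two maximum token IDs below are just the vocab sizes minus one):

```python
import struct

r50k_max_id, cl100k_max_id = 50_256, 100_276  # vocab sizes minus one

# r50k IDs fit in uint16: exactly 2 bytes each
packed = struct.pack(">H", r50k_max_id)
assert len(packed) == 2

# cl100k IDs overflow uint16 (max 65,535), forcing 4-byte uint32
overflowed = False
try:
    struct.pack(">H", cl100k_max_id)
except struct.error:
    overflowed = True
assert overflowed
```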
| Tokenizer | Vocab size | ID type | Bytes/token |
|---|---|---|---|
| cl100k (GPT-4) | 100,277 | uint32 | 4 bytes ← no win |
| r50k (GPT-2) | 50,257 | uint16 | 2 bytes ← ~2.4x win |
| mxbai WordPiece | 30,522 | uint16 | 2 bytes ← ~2.4x win |
The Benchmark
I took a ~270-word paragraph about RAG and vector databases (representative of real payload content) and measured every combination I could think of.
```python
import struct
import gzip  # stdlib baseline
import tiktoken, zstandard, constriction  # third-party benchmark deps

enc = tiktoken.get_encoding("r50k_base")
token_ids = enc.encode(text)  # text: the 1,853-byte benchmark paragraph

# Pack as big-endian uint16 (2 bytes/ID; safe because r50k vocab < 65,536)
packed = struct.pack(f">{len(token_ids)}H", *token_ids)
```
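The generic baselines need nothing beyond the standard library. A minimal sketch of the gzip side, on a stand-in paragraph (not the benchmark text):

```python
import gzip

# Stand-in paragraph (not the benchmark doc): some repetition, like real prose
text = ("Retrieval-Augmented Generation combines vector search with a "
        "language model: the retriever fetches documents, the generator "
        "conditions on them. ") * 8
raw = text.encode("utf-8")

compressed = gzip.compress(raw, compresslevel=9)
print(f"{len(raw)} -> {len(compressed)} bytes "
      f"({len(raw) / len(compressed):.1f}x)")
assert gzip.decompress(compressed) == raw  # byte-perfect round trip
```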
Here's what I found:
| Method | Size | Ratio | CER |
|---|---|---|---|
| Raw UTF-8 | 1,853 bytes | 1.0x | 0% |
| gzip -9 | 984 bytes | 1.9x | 0% |
| zstd -22 | 981 bytes | 1.9x | 0% |
| r50k BPE uint16 (raw) | 784 bytes | 2.4x | 0% |
| r50k BPE + zstd | 709 bytes | 2.6x | 0% |
| r50k BPE + ANS | 364 bytes | 5.1x | 0% |
| mxbai WordPiece + ANS | 328 bytes | 5.7x | 20.1%* |
| Entropy lower bound | 323–358 bytes | ~5.5x | — |
*mxbai CER drops to 0% after simple post-processing — more on this below.
The headline: raw BPE uint16 already beats gzip on raw text, with zero further compression. That's just tokenization and packing.
Latency (2,000 runs, 1,853-byte doc)
Compression ratio is only half the story. Here's every component timed separately:
| Operation | Median | p99 |
|---|---|---|
| gzip -9 compress | 29µs | 36µs |
| gzip decompress | 18µs | 25µs |
| zstd -22 compress | 182µs | 267µs |
| zstd decompress | 4µs | 5µs |
| lz4 compress | 24µs | 29µs |
| lz4 decompress | 1µs | 1µs |
And the tokenizer-pipeline components:

| Operation | Median | p99 |
|---|---|---|
| r50k tokenize | 132µs | 208µs |
| r50k detokenize | 6µs | 10µs |
| r50k pack uint16 | 4µs | 5µs |
| r50k + zstd compress | 48µs | 59µs |
| r50k + zstd decompress | 3µs | 4µs |
| r50k ANS encode | 30µs | 42µs |
| r50k ANS decode | 32µs | 41µs |
Full pipelines:
| Pipeline | Median | p99 |
|---|---|---|
| r50k encode (tokenize + ANS) | 156µs | 287µs |
| r50k decode (ANS + detokenize) | 41µs | 53µs |
| mxbai encode (tokenize + ANS) | 750µs | 1,270µs |
| mxbai decode (ANS + detokenize + fix) | 420µs | 603µs |
A few things stand out:
- The tokenizer dominates encode latency, not ANS. r50k tokenization is 132µs; ANS on top adds only 30µs. If you already tokenize documents at ingest time (which most embedding pipelines do), the extra encode cost is just those 30µs.
- Reads are fast. r50k decode — ANS + detokenize — takes 41µs total. Fast enough for any real-time serving path.
- zstd -22 is slow to compress (182µs) because `--ultra` levels trade CPU heavily for ratio. At a lower level (say zstd -3), it would be ~15µs but give up some compression.
- ANS is symmetric — encode and decode are nearly identical (~31µs each), unlike zstd, where decompress (4µs) is 45× faster than compress.
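The component timings above came from repeated runs per operation. A minimal version of that kind of harness (my sketch, not the benchmark's exact code), timing stdlib gzip:

```python
import gzip
import statistics
import time

def time_op(fn, runs=200):
    """Return (median_us, approx_p99_us) over `runs` invocations of fn."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    samples.sort()
    return statistics.median(samples), samples[int(runs * 0.99) - 1]

payload = b"retrieval augmented generation " * 60  # ~1.8 KB, like the test doc
median_us, p99_us = time_op(lambda: gzip.compress(payload, compresslevel=9))
print(f"gzip -9 compress: median {median_us:.0f}us, p99 {p99_us:.0f}us")
```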
What is ANS and Why Does It Matter
zstd operates on bytes. It doesn't know that [0x01, 0x7F] is token ID 383 — it just sees two opaque bytes.
ANS (Asymmetric Numeral Systems) — the entropy-coding family inside zstd's FSE stage, also used in LZFSE and JPEG XL — can be applied directly to token IDs. It knows that token 198 (" the") appears 40× more often than token 47291 ("embeddings"), so it assigns shorter codes to frequent tokens.
The result: ANS on token IDs gets within 6 bytes of the Shannon entropy lower bound — the theoretical minimum for lossless compression.
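The bound itself is easy to compute: with empirical marginal probabilities p_i, the minimum is n·H bits, where H = -Σ p_i log2 p_i. A stdlib sketch on a made-up skewed ID stream:

```python
import math
from collections import Counter

def entropy_lower_bound_bytes(ids):
    """Shannon bound in bytes for coding `ids` from their empirical
    marginal distribution (what ANS with a frequency table approaches)."""
    counts = Counter(ids)
    n = len(ids)
    h_bits = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.ceil(n * h_bits / 8)

# Skewed toy stream: one very frequent ID, a few rare ones
ids = [198] * 40 + [383] * 10 + list(range(10))
print(entropy_lower_bound_bytes(ids), "bytes minimum for", len(ids), "tokens")
```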
```python
import numpy as np
import constriction  # third-party: Rust-backed entropy coders
from collections import Counter

def ans_encode(ids):
    # Empirical frequency model over the token IDs
    counts = Counter(ids)
    vocab = sorted(counts.keys())
    v2i = {v: i for i, v in enumerate(vocab)}
    freqs = np.array([counts[v] for v in vocab], dtype=np.float64)
    freqs /= freqs.sum()
    model = constriction.stream.model.Categorical(freqs, perfect=False)

    # ANS is a stack: encode in reverse so decoding reads out forward
    encoder = constriction.stream.stack.AnsCoder()
    encoder.encode_reverse(
        np.array([v2i[t] for t in ids], dtype=np.int32), model
    )
    # (a real codec would also persist `vocab` and `freqs` for the decoder)
    return encoder.get_compressed().tobytes()
```
The full pipeline:
text (UTF-8)
│
▼
BPE tokenizer (r50k)
│ 50,257-token vocab, fits uint16
▼
[token_id_1, token_id_2, ...] ← 2 bytes each
│
▼
ANS encoder (frequency model over token IDs)
│ assigns short codes to frequent tokens
▼
compressed bytes ← 5x smaller than original text
│
▼ (decode path)
ANS decoder → token IDs → tokenizer.decode → original text
The mxbai Story: Better Compression, Surprising Catch
The mixedbread embedding model (mxbai-embed-large-v1) uses BERT's WordPiece tokenizer with 30,522 tokens. Smaller vocabulary → more common tokens → lower entropy → better compression.
mxbai + ANS reaches 328 bytes vs r50k + ANS at 364 bytes — a 10% compression win. But when I measured latency and recovery, the story got more complicated.
| Tokenizer | CER | CER (ignore case+whitespace) | CER (+ fix punct) |
|---|---|---|---|
| r50k BPE | 0.0% | 0.0% | 0.0% |
| mxbai WordPiece | 20.1% | 6.4% | 0.0% |
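CER here is character error rate: edit distance between the original and recovered text, divided by the original length. The post doesn't show its metric code, so here is a plain Levenshtein-based version (an assumption about the exact definition):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / len(reference)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # DP row: distances against the empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m

# Lowercasing alone costs one substitution per affected character:
print(cer("Qdrant is fast", "qdrant is fast"))
```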
The 20% CER is entirely from two deterministic artifacts of BERT tokenization:
- Lowercasing — `"Qdrant"` → `"qdrant"`, `"RAG"` → `"rag"`
- Punctuation spacing — `"(content)"` → `"( content )"`, `"up-to-date"` → `"up - to - date"`
Neither is information loss at the token level. The information is there — it's just encoded with different whitespace conventions. Three regex rules fix it completely:
```python
import re

def fix_punct(s):
    # Undo WordPiece's detokenization spacing around punctuation
    s = re.sub(r"\( ", "(", s)               # "( content" -> "(content"
    s = re.sub(r" \)", ")", s)               # "content )" -> "content)"
    s = re.sub(r"(\w) - (\w)", r"\1-\2", s)  # "up - to - date" -> "up-to-date"
    return s
```
So mxbai is effectively lossless too, with a small detokenizer. Whether that's acceptable depends on your use case — if you need byte-perfect round trips without post-processing, r50k is the clean choice.
The latency picture makes mxbai harder to justify:
| Tokenizer | Encode | Decode | Compressed size |
|---|---|---|---|
| r50k + ANS | 156µs | 41µs | 364 bytes |
| mxbai + ANS | 750µs | 420µs | 328 bytes |
mxbai is 5× slower on encode and 10× slower on decode — entirely due to the HuggingFace tokenizer's Python overhead. The 10% smaller output (328 vs 364 bytes) doesn't come close to justifying that. For a storage codec, r50k is the practical choice.
On UNK tokens: WordPiece has a [UNK] token that causes real information loss. For standard English text — including code, URLs, numbers, and gibberish ASCII — it essentially never fires; in my tests it fired only on emojis (🚀 → [UNK]) and a few rare Unicode math symbols. So for typical RAG payloads over web or document text, UNK is a non-issue.
Real-World Scale: English Wikipedia
English Wikipedia is ~7M articles, 708 words/article on average — a useful benchmark for RAG-scale corpora.
Extrapolating from the benchmark:
| Method | Per doc | 7M docs | vs raw |
|---|---|---|---|
| Raw UTF-8 | 4,497 bytes | 31.5 GB | 1x |
| zstd -22 | 2,382 bytes | 16.7 GB | 1.9x |
| r50k + zstd | 1,721 bytes | 12.1 GB | 2.6x |
| r50k + ANS | 884 bytes | 6.2 GB | 5.1x |
| mxbai + ANS | 797 bytes | 5.6 GB | 5.6x |
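The extrapolation is straight arithmetic on the per-doc sizes; a sketch that reproduces the table:

```python
DOCS = 7_000_000  # English Wikipedia articles

# Measured bytes per average 708-word article (scaled from the benchmark doc)
per_doc = {
    "raw utf-8":   4497,
    "zstd -22":    2382,
    "r50k + zstd": 1721,
    "r50k + ANS":   884,
    "mxbai + ANS":  797,
}

for method, nbytes in per_doc.items():
    total_gb = nbytes * DOCS / 1e9
    ratio = per_doc["raw utf-8"] / nbytes
    print(f"{method:12s} {total_gb:5.1f} GB  ({ratio:.1f}x)")
```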
At 1B docs the storage gap is more dramatic:
| Method | 1B docs | S3 ($0.02/GB/mo) | Disk ($0.08/GB/mo) |
|---|---|---|---|
| Raw UTF-8 | 4.5 TB | ~$92 | ~$368 |
| r50k + ANS | 0.88 TB | ~$18 | ~$72 |
| Saving | 3.6 TB | ~$74/mo | ~$296/mo |
Disk savings dominate. Vector databases keep hot payloads on ephemeral SSD for fast reads — that's where the compression matters most: roughly $296/month saved on disk alone. The I/O win is the same magnitude: 0.88 TB reads 5× faster than 4.5 TB, which directly reduces query latency when payloads are fetched during retrieval.
S3 savings are smaller in dollar terms but still meaningful at scale. S3 storage is cheap per GB, but egress is not: at ~$0.09/GB, moving 3.6 TB less data saves ~$324 per transfer. For a system that re-indexes monthly or replicates across regions, that adds up fast.
How This Compares to State of the Art
For context, here's where this sits on the broader compression landscape (benchmarked on enwik9, 1GB Wikipedia):
| Method | bits/byte | Ratio vs raw | Practical? |
|---|---|---|---|
| Raw UTF-8 | 8.0 | 1x | — |
| gzip -9 | ~2.6 | 3.1x | Yes |
| zstd -22 | ~2.0 | 4x | Yes |
| r50k + ANS (this post) | ~1.56 | ~5x | Yes |
| zpaq | ~1.4 | 5.7x | Slow |
| fx2-cmix (Hutter Prize record, Oct 2024) | 0.887 | 9x | Extremely slow |
| Chinchilla 70B (DeepMind, 2024) | 0.664 | 12x | Not practical |
The gap between r50k + ANS (~5x) and the Hutter Prize record (~9x) is exactly the signal that lives in conditional token probabilities. ANS here only models marginal frequency — how often each token appears globally. The top compressors model P(token | all previous tokens), which is what LLMs do. That's the remaining 2x on the table.
Getting from 5x to 9x requires a language model. Getting from 1x to 5x just requires a tokenizer and 30 lines of Python.
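That gap can be seen at toy scale: conditioning on the previous symbol lowers entropy whenever the stream has sequential structure. A stdlib sketch, with characters standing in for tokens:

```python
import math
from collections import Counter, defaultdict

def marginal_entropy(seq):
    """Bits/symbol if every symbol is coded from its global frequency."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

def conditional_entropy(seq):
    """Bits/symbol when coding each symbol given the previous one: H(x_t | x_{t-1})."""
    ctx = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        ctx[prev][cur] += 1
    n = len(seq) - 1
    h = 0.0
    for counts in ctx.values():
        total = sum(counts.values())
        h_ctx = -sum(c / total * math.log2(c / total) for c in counts.values())
        h += (total / n) * h_ctx
    return h

text = "the cat sat on the mat because the cat liked the mat " * 4
print(f"marginal:    {marginal_entropy(text):.2f} bits/char")
print(f"conditional: {conditional_entropy(text):.2f} bits/char")
```

The conditional number comes out lower; that difference, at scale and with deep context instead of one previous symbol, is what LLM-based compressors exploit.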
What Needs to Change
Nothing in the current stack prevents this. The pieces exist:
- `tiktoken` is already a dependency in any LLM-adjacent stack
- `constriction` is a Rust-backed library with a Python API, used in ML compression research
- Decoding is fast — ANS is symmetric, O(n) in both directions
What would need to change in a vector DB like Qdrant:
- Accept `encoding: "r50k"` as a payload storage option
- Pack token IDs as uint16, run ANS on the stream
- Decompress transparently on read
The index, vectors, and query pipeline stay identical. This is purely a payload storage optimization.
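The storage change can be sketched as a transparent codec. Everything here is hypothetical (class and method names are mine, not Qdrant's API), with stdlib zlib standing in for the tokenize → uint16 pack → ANS pipeline:

```python
import zlib

class PayloadCodec:
    """Hypothetical transparent payload codec. zlib stands in for
    tokenize -> uint16 pack -> ANS; the interface is the point:
    callers never see compressed bytes."""

    def put(self, store: dict, key: str, text: str) -> None:
        store[key] = zlib.compress(text.encode("utf-8"), 9)

    def get(self, store: dict, key: str) -> str:
        return zlib.decompress(store[key]).decode("utf-8")

store = {}
codec = PayloadCodec()
doc = "Retrieval-Augmented Generation is a technique..."
codec.put(store, "doc_123", doc)
assert codec.get(store, "doc_123") == doc  # same API, smaller bytes at rest
```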
The latency overhead is acceptable for both paths:
- Write (encode): 156µs per doc. Ingest pipelines already spend far more than this on embedding generation (~25ms for a 780-token doc on H100). It's a rounding error.
- Read (decode): 41µs per doc. A typical RAG query returns 10–20 docs — that's 400–800µs of decode overhead on a path that already includes vector search and reranking. Negligible.
Current Proposed
───────────────────────────────────────────────────────
INSERT INSERT
payload: {"text": "..."} payload: {"text": "..."}
│ │
▼ ▼
JSON → RocksDB (raw) tokenize → uint16 IDs → ANS
│
▼
RocksDB (compressed)
READ READ
fetch raw JSON bytes fetch compressed bytes
│ │
▼ ▼
return as-is ANS decode → token IDs → decode
│
▼
return original text
The entire round-trip is transparent to the user. Same API, 5x smaller payloads.
Key Takeaways
- Vector databases optimize vectors aggressively but store raw text payloads verbatim — a 5x compression opportunity.
- BPE tokenization (r50k, vocab=50,257) produces uint16 token IDs — 2 bytes each — that already beat gzip on raw text with zero further compression.
- Adding ANS entropy coding on token IDs brings you to 5x compression using 30 lines of Python and infrastructure already in every ML stack.
- r50k is perfectly lossless (0% CER). mxbai WordPiece gives 5.7x but lowercases text — fixable with simple post-processing, but also 5–10× slower to tokenize.
- Latency is not a concern: encode adds 156µs/doc (vs ~25ms for embedding generation); decode adds 41µs/doc (negligible on a query path that already does vector search + reranking).
- The tokenizer dominates encode latency (132µs), not ANS (30µs). If you tokenize at ingest anyway, the marginal encode cost is just those 30µs.
- The gap between this (~5x) and the Hutter Prize record (~9x) is conditional token probability — i.e., you need a language model. Getting to 5x doesn't.
- At English Wikipedia scale (7M docs), this takes payload storage from 31.5 GB to 6.2 GB. The bigger win is I/O: 5x faster bulk reads, indexing, and replication.
Acknowledgements
The compression benchmarks were run using tiktoken, zstandard, lz4, and constriction. The Hutter Prize numbers are from Matt Mahoney's Large Text Compression Benchmark.
- https://www.mattmahoney.net/dc/text.html
- https://arxiv.org/abs/2309.10668 (Language Modeling is Compression, DeepMind)
- https://bellard.org/nncp/
- https://github.com/bamler-lab/constriction
- https://github.com/openai/tiktoken