Tokenization is a Compression Codec Nobody Uses That Way

Looking for TL;DR? Check key takeaways

Every vector search engine (aka vector DB) stores two things for each point: the vector and the payload. Vectors get all the attention: quantization, compression, careful index design. The text in the payload gets LZ4 at best.

This post makes one claim: tokenize that text before compressing it, and you roughly double what generic byte codecs achieve. I'll show the benchmark, the latency cost (small), and what it would take to add this to a real database.

The Forgotten Layer

When you insert a document into Qdrant or Elasticsearch, you typically store something like:

{
  "id": "doc_123",
  "vector": [0.12, -0.34, ...],
  "payload": {
    "text": "Retrieval-Augmented Generation is a technique..."
  }
}

Qdrant stores on-disk payloads in Gridstore (its custom KV engine, replacing RocksDB since v1.13) with LZ4 compression on by default. Elasticsearch packs stored fields into 16 KB blocks and compresses each block with LZ4, so a handful of documents share a compression window. Either way, the codec sees at most a few documents at a time and has no learned model of the language, just whatever byte patterns happen to repeat within the block. That gets you roughly 1.15–2x on text.
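
To see why a small compression window limits a byte codec, here is a minimal sketch using zlib from the Python standard library as a stand-in for LZ4 (an assumption made only so the example has no dependencies). The sample sentence and the 50-copy "block" are artificial; the point is only that the codec gains from repetition inside its window and has little to work with on one short document.

```python
import zlib

def byte_codec_ratio(text: str, level: int = 9) -> float:
    """Raw UTF-8 size divided by compressed size (higher is better)."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw, level))

# Hypothetical sample document, not taken from the benchmark.
doc = (
    "Retrieval-Augmented Generation is a technique that pairs a search "
    "index with a language model. "
)

single_ratio = byte_codec_ratio(doc)       # one short document on its own
block_ratio = byte_codec_ratio(doc * 50)   # artificial block of 50 identical copies

print(f"single: {single_ratio:.2f}x  block: {block_ratio:.2f}x")
```

Real blocks hold different documents rather than identical copies, so their ratios land far closer to the single-document case.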

But there is a component that knows a lot about the structure of natural language, and it's already part of every embedding and language model: the tokenizer.

How It Works

[Figure: two compression paths for the same 1,853-byte reference document (1,853 chars of UTF-8). Byte codec path: text → LZ4 → 1,468 bytes (1.3×). Token + ANS path: text → 392 tokens (r50k tokenizer: GPT-2 BPE, 50,257-token vocab) → ANS with a static table trained on WikiText-103 → 648 bytes (2.9×).]

The top path is what databases do today: store UTF-8 bytes, compress with a generic codec. The bottom path is my proposal: tokenize first, then entropy-code the token IDs.

Step 1: BPE tokenization. BPE (Byte Pair Encoding) starts with raw bytes and repeatedly merges the most frequent adjacent pair into a new token, building up a vocabulary of common subwords. Two of OpenAI's BPE tokenizers are relevant here: r50k (used by GPT-2/GPT-3, 50,257 tokens) and cl100k (used by GPT-3.5/GPT-4, 100,277 tokens). cl100k has a larger vocabulary, so it learns longer merges and produces fewer tokens per document, but each token ID needs 4 bytes (uint32) because the vocabulary exceeds the uint16 maximum of 65,535. r50k fits in 2 bytes (uint16).
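
The merge loop can be sketched in a few lines. This is a toy trainer, not how tiktoken builds r50k or cl100k: production tokenizers pre-split text with a regex and train on far larger corpora. The function name `train_bpe` and the sample string are made up for illustration.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    seq = [bytes([b]) for b in text.encode("utf-8")]  # start from single bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))            # count adjacent pairs
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:                                 # nothing repeats any more
            break
        merges.append((left, right))
        merged, i = [], 0
        while i < len(seq):                           # rewrite seq with the new token
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

sample = "low lower lowest low low"
merges, tokens = train_bpe(sample, num_merges=3)
print(merges)
print(len(sample.encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

Each merge shortens the sequence, and concatenating the tokens always gives back the original bytes, which is why the scheme stays lossless.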

BPE merges frequent byte sequences into single tokens. "language" is one token. "Retrieval" is three (Ret + ri + eval). On average one BPE token covers 3/4 of a word. An average English word is ~5 characters plus a space, so:

6 bytes/word × 3/4 words/token = 4.5 bytes/token (raw text)
2 bytes/token (uint16 IDs)
ratio: ~2.25x

Just packing token IDs into uint16 already compresses, before any actual compression algorithm runs.
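
The back-of-envelope numbers and the packing step can be sketched as follows. The token IDs are hypothetical placeholders rather than real r50k output, and `pack_uint16` / `unpack_uint16` are illustrative names.

```python
import struct

def pack_uint16(ids):
    """Pack token IDs as little-endian uint16; valid only if every ID <= 65,535."""
    return struct.pack(f"<{len(ids)}H", *ids)

def unpack_uint16(blob):
    """Inverse of pack_uint16."""
    return list(struct.unpack(f"<{len(blob) // 2}H", blob))

# Hypothetical token IDs, all below the r50k vocabulary size of 50,257.
ids = [9781, 380, 18206, 12, 12512, 5154, 16588]
blob = pack_uint16(ids)

# The estimate from the text: 6 bytes/word * 3/4 words/token, over 2 bytes/token.
bytes_per_word = 6
words_per_token = 0.75
estimated_ratio = (bytes_per_word * words_per_token) / 2

print(len(blob), "bytes for", len(ids), "tokens; estimated ratio", estimated_ratio)
```

struct.pack raises an error for any ID above 65,535, which is exactly why this fixed-width trick fits r50k but not cl100k.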

Step 2: ANS entropy coding. gzip and LZ4 see opaque bytes and can't know that [0x01, 0x7F] is token 383. An entropy coder working directly on token IDs can give frequent tokens short codes: "the" is 40x more common than "embeddings", so it gets a much shorter code. I used ANS (Asymmetric Numeral Systems, the entropy coder inside zstd) via the constriction library.
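
How frequency turns into code length can be sketched with the ideal cost of a symbol, -log2(p). The counts below are invented to match the 40x ratio mentioned above; they are not measured from any corpus.

```python
import math

def code_length_bits(count: int, total: int) -> float:
    """Ideal entropy-coded length of a symbol that occurs `count` times in `total`."""
    return -math.log2(count / total)

# Hypothetical counts: "the" is 40x more frequent than "embeddings".
total_tokens = 1_000_000
bits_the = code_length_bits(40_000, total_tokens)
bits_embeddings = code_length_bits(1_000, total_tokens)

print(f"the: {bits_the:.2f} bits, embeddings: {bits_embeddings:.2f} bits")
print(f"saving per occurrence: {bits_embeddings - bits_the:.2f} bits")
```

A 40x frequency gap always works out to log2(40), about 5.3 bits, whatever the corpus size. ANS gets close to these ideal lengths without rounding each code to a whole number of bits.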

One design choice matters here: the frequency table is trained once on a large corpus and shared, not built per document. A per-document table compresses better on paper, but the decoder needs the table to reconstruct the document, so you'd have to store it alongside each payload. That's ~900 bytes for a typical 400-token chunk, which erases the entire gain. A shared static table costs nothing per document. It's completely possible to learn this static table on a general corpus and reuse it everywhere.
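
A rough way to see the per-document overhead, under one assumption about the table layout: each distinct token costs 4 bytes (a uint16 ID plus a uint16 count). That layout and the 225-distinct-token chunk are illustrative guesses, not the format measured in this post.

```python
def per_document_table_bytes(token_ids, bytes_per_entry=4):
    """Size of a (token ID, count) table covering only this document's tokens."""
    return len(set(token_ids)) * bytes_per_entry

# Hypothetical 400-token chunk containing 225 distinct tokens.
chunk = list(range(225)) + [0] * 175
table_bytes = per_document_table_bytes(chunk)   # 225 entries * 4 bytes

# ANS output size quoted above for the 392-token reference document.
payload_bytes = 648

print(table_bytes, "bytes of table next to", payload_bytes, "bytes of payload")
```

Under these assumptions the table is larger than the compressed payload it describes, so the shared static table is the only option that keeps the gain.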

The whole thing is ~30 lines of Python:
import constriction, numpy as np, tiktoken

enc   = tiktoken.get_encoding("r50k_base")
VOCAB = 50_257

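# ── one-time setup: train on corpus ──────────────────────────────────────────
# corpus_text: the training corpus as a single str (e.g. the WikiText-103 train split), loaded beforehand
corpus_ids = enc.encode(corpus_text)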
counts = np.bincount(corpus_ids, minlength=VOCAB).astype(np.float64) + 1  # Laplace
probs  = counts / counts.sum()
model  = constriction.stream.model.Categorical(probs, perfect=False)

# ── per-document encode (store this) ─────────────────────────────────────────
def compress(text):
    ids   = enc.encode(text)
    coder = constriction.stream.stack.AnsCoder()
    coder.encode_reverse(np.array(ids, dtype=np.int32), model)
    return len(ids).to_bytes(4, 'big') + coder.get_compressed().tobytes()

# ── per-document decode ───────────────────────────────────────────────────────
def decompress(data):
    n   = int.from_bytes(data[:4], 'big')
    buf = np.frombuffer(data[4:], dtype=np.uint32).copy()
    ids = constriction.stream.stack.AnsCoder(buf).decode(model, n).tolist()
    return enc.decode(ids)
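
For intuition about what the coder does with the frequency table, here is a toy range-variant ANS (rANS) in pure Python. It is a sketch of the technique, not what constriction does internally: it keeps the whole state in one unbounded Python integer instead of streaming fixed-width words, and the frequencies and message are invented.

```python
def build_starts(freqs):
    """Cumulative start of each symbol's slot range; all freqs must be positive ints."""
    starts, total = [], 0
    for f in freqs:
        starts.append(total)
        total += f
    return starts, total

def rans_encode(symbols, freqs):
    starts, total = build_starts(freqs)
    x = 1                                   # initial state
    for s in reversed(symbols):             # encode in reverse so decoding runs forward
        f = freqs[s]
        x = (x // f) * total + starts[s] + (x % f)
    return x

def rans_decode(x, n, freqs):
    starts, total = build_starts(freqs)
    out = []
    for _ in range(n):
        slot = x % total
        # The symbol whose slot range contains `slot` (linear search, fine for a toy).
        s = max(i for i, c in enumerate(starts) if c <= slot)
        out.append(s)
        x = freqs[s] * (x // total) + slot - starts[s]
    return out

freqs = [8, 4, 2, 2]                        # symbol 0 is the most frequent
message = [0, 0, 1, 0, 2, 0, 1, 3]
state = rans_encode(message, freqs)

print(state.bit_length(), "bits for", len(message), "symbols")
```

With these frequencies the ideal cost is 14 bits (1 bit for each symbol 0, 2 bits for each symbol 1, 3 bits for symbols 2 and 3), and the state lands within a couple of bits of that, against 16 bits for a fixed 2 bits per symbol.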

The Benchmark

Measured across 1,773 WikiText-103 test articles (≥30 words). The ANS frequency table was trained on the WikiText-103 train split, a shared static model with no per-document overhead.

| Method | min | median | max |
|---|---|---|---|
| LZ4 (default in Qdrant/ES) | 0.88x | 1.15x | 1.78x |
| gzip -9 | 0.99x | 1.66x | 2.34x |
| zstd -22 | 1.11x | 1.68x | 2.39x |
| brotli q=11 | 1.22x | 2.26x | 3.13x |
| r50k BPE uint16 (raw) | 1.04x | 2.33x | 3.34x |
| cl100k BPE uint32 (raw) | 0.45x | 1.13x | 1.62x |
| zstd dict 112KB (WikiText) | 1.51x | 2.61x | 3.63x |
| r50k BPE + static ANS | 1.68x | 3.37x | 4.30x |
| cl100k BPE + static ANS | 1.68x | 3.30x | 4.14x |

All methods above are lossless (0% character error rate on full round trip).
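
A round-trip check of this kind can be sketched with a plain edit-distance CER. This is an assumed definition (Levenshtein distance divided by original length), since the post does not spell out how CER was computed; zlib stands in for any lossless codec and lowercasing for a lossy transform.

```python
import zlib

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def cer(original: str, restored: str) -> float:
    """Character error rate: edit distance relative to the original length."""
    return edit_distance(original, restored) / max(1, len(original))

original = "Qdrant indexes Wikipedia"
lossless = zlib.decompress(zlib.compress(original.encode("utf-8"))).decode("utf-8")
lossy = original.lower()                    # what a lowercasing tokenizer would return

print(cer(original, lossless), cer(original, lossy))
```

A codec counts as lossless only if the first number is exactly zero for every document.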

Method glossary: what the names mean
  • LZ4: default codec in Qdrant and Elasticsearch. Optimized for speed, not ratio.
  • gzip -9: max compression level, classic deflate. The -9 is the level (1–9).
  • zstd -22: Zstandard at level 22 ("ultra" mode). Much slower to compress than default, same decompression speed.
  • brotli q=11: Google's codec at max quality. Best ratio of the byte codecs, but slow.
  • r50k BPE uint16 (raw): r50k has 50,257 tokens, which fits in uint16 (max 65,535), so 2 bytes per token ID.
  • cl100k BPE uint32 (raw): cl100k has 100,277 tokens, which exceeds uint16 and requires uint32 (4 bytes per token). It produces fewer tokens per document than r50k (larger vocab = longer merges), but 4 bytes × fewer tokens still comes out worse than 2 bytes × more tokens. Hence 1.13x vs 2.33x.
  • zstd dict 112KB: zstd with a 112KB dictionary trained on WikiText-103 (zstd --train). The fairest byte-codec comparison since it also learns from the corpus.
  • r50k / cl100k BPE + static ANS: ANS allocates bits based on token frequency, not a fixed byte width. A common token gets a few bits and a rare one gets 16 or more, regardless of vocab size. So the uint32 penalty that crushes cl100k raw disappears entirely, and both tokenizers land near 3.3x.

Two things stand out. r50k raw uint16 (no compression algorithm at all) already beats every general-purpose byte codec at the median, brotli included (2.33x vs 2.26x); only zstd with a corpus-trained dictionary does better. And r50k + static ANS beats everything by a wide margin. cl100k raw uint32 sits near 1.13x, worse than even gzip, yet cl100k + ANS recovers to 3.30x: the fixed-width storage penalty vanishes once ANS allocates bits by frequency instead of by vocab size.

The fairest competitor is zstd with a corpus-trained dictionary (zstd --train), since it also learns from the corpus. It reaches 2.61x median, which still leaves r50k + static ANS (3.37x) about 29% ahead. The difference is that the token model operates on units of language: "embeddings" is one symbol with a known global frequency, where a byte-level model has to rediscover that pattern from scratch.

What does it cost in latency?

Measured across 200 WikiText-103 articles (50–500 words), static corpus ANS model:

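| Method | enc median | enc p99 | dec median | dec p99 |
|---|---|---|---|---|
| LZ4 | 15µs | 36µs | 1µs | 2µs |
| gzip -9 | 23µs | 44µs | 19µs | 30µs |
| zstd -22 | 68µs | 193µs | 3µs | 6µs |
| zstd dict | 129µs | 326µs | 2µs | 5µs |
| r50k + ANS | 54µs | 129µs | 18µs | 44µs |
| cl100k + ANS | 55µs | 156µs | 18µs | 51µs |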

LZ4, designed entirely for speed, decompresses in 1µs. zstd dict's encoder is slowest at 129µs (best of three dictionary sizes tested: 32KB, 64KB, 112KB), though it decompresses in 2µs. The tokenizer methods sit comfortably in the middle: ~54µs encode (dominated by the tokenizer, not ANS), ~18µs decode.
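
A harness for numbers like these might look like the sketch below. It is not the script behind the table: the best-of-five timing, the nearest-rank p99, the synthetic documents, and the use of zlib as the codec under test are all assumptions.

```python
import statistics
import time
import zlib

def time_once_ns(fn, item):
    start = time.perf_counter_ns()
    fn(item)
    return time.perf_counter_ns() - start

def latency_us(fn, inputs, repeats=5):
    """Return (median, p99) per-call latency in microseconds over `inputs`."""
    samples = []
    for item in inputs:
        # Best of `repeats` runs per input, to damp scheduler and cache noise.
        best_ns = min(time_once_ns(fn, item) for _ in range(repeats))
        samples.append(best_ns / 1_000)
    samples.sort()
    p99 = samples[min(len(samples) - 1, int(0.99 * len(samples)))]
    return statistics.median(samples), p99

# Synthetic stand-ins for the 200 benchmark articles.
docs = [(f"Document {i} about retrieval and language models. " * 20).encode("utf-8")
        for i in range(200)]
blobs = [zlib.compress(d, 9) for d in docs]

enc_median, enc_p99 = latency_us(lambda d: zlib.compress(d, 9), docs)
dec_median, dec_p99 = latency_us(zlib.decompress, blobs)

print(f"encode {enc_median:.1f}/{enc_p99:.1f} µs, decode {dec_median:.1f}/{dec_p99:.1f} µs")
```

Swapping the two callables for the compress and decompress functions of any other codec gives comparable numbers on the same machine.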

Note that today's embedding models typically use different tokenizers (WordPiece, SentencePiece) from the BPE tokenizers LLMs use, so you can't skip the tokenization step by reusing the embedding model's tokens. The real benefit runs the other way: if you store token IDs in LLM format (r50k/cl100k), you avoid re-tokenizing retrieved documents at inference time when feeding them to the LLM, since the token IDs are already there.

Side quest: mxbai's WordPiece tokenizer compresses slightly better but lowercases your text.

The mixedbread embedding model (mxbai-embed-large-v1) uses BERT's WordPiece tokenizer with a 30,522-token vocabulary. Smaller vocabulary means more common tokens, lower entropy, slightly better compression. Across 1,773 WikiText-103 test articles:

| Method | min | median | max |
|---|---|---|---|
| r50k BPE + static ANS | 1.68x | 3.37x | 4.30x |
| mxbai WordPiece + static ANS | 1.63x | 3.56x | 4.54x |

~5% better compression at the median. The problem is character error rate (CER): 80.4% on average, every single article affected. After applying punctuation-spacing fixes (regex for ( x ) → (x), etc.), CER stays at 81.3%, so the fix is noise. The damage is overwhelmingly BERT's lowercasing: "Qdrant" → "qdrant", "Wikipedia" → "wikipedia". WikiText is dense with capitalized words (titles, proper nouns, sentence starts), so virtually every article is corrupted. Case is gone and unrecoverable. mxbai is not a viable lossless codec for real text.

What It Saves at Scale

English Wikipedia is ~7M articles averaging 708 words. Applying the corpus-average ratios:

| Method | Per doc | 7M docs | vs raw |
|---|---|---|---|
| Raw UTF-8 | 4,497 bytes | 31.5 GB | 1x |
| zstd -22 | 2,382 bytes | 16.7 GB | 1.9x |
| r50k + static ANS | 1,342 bytes | 9.4 GB | 3.35x |

At a billion documents that's 1.3 TB instead of 4.5 TB. At $0.08 per GB per month for SSD and $0.02 per GB per month for S3-class object storage:

| Storage tier | Raw UTF-8 | r50k + ANS | Saved |
|---|---|---|---|
| SSD ($0.08/GB) | $360/mo | $107/mo | $253/mo |
| S3 ($0.02/GB) | $90/mo | $27/mo | $63/mo |

Every snapshot, replica, and bulk transfer shrinks by the same 3.35x factor. The I/O win is often worth more than the storage bill: 3.35x fewer bytes means up to 3.35x faster bulk reads, indexing, and replication wherever I/O is the bottleneck.
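
The storage arithmetic above can be reproduced with a few lines. The per-document sizes and prices are the ones quoted in this section; the function name and the use of decimal gigabytes (1 GB = 10^9 bytes) are assumptions.

```python
def monthly_cost_usd(bytes_per_doc: float, num_docs: int, usd_per_gb_month: float) -> float:
    """Monthly storage cost, using decimal gigabytes (1 GB = 1e9 bytes)."""
    gigabytes = bytes_per_doc * num_docs / 1e9
    return gigabytes * usd_per_gb_month

RAW_BYTES, ANS_BYTES = 4_497, 1_342      # per-document sizes from the table above
BILLION = 10**9

for tier, price in [("SSD", 0.08), ("S3", 0.02)]:
    raw = monthly_cost_usd(RAW_BYTES, BILLION, price)
    ans = monthly_cost_usd(ANS_BYTES, BILLION, price)
    print(f"{tier}: raw ${raw:.0f}/mo, r50k + ANS ${ans:.0f}/mo")

wikipedia_gb = RAW_BYTES * 7_000_000 / 1e9   # 7M articles of raw UTF-8
print(f"Wikipedia raw: {wikipedia_gb:.1f} GB")
```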

One scoping note so the claim doesn't overreach: in collections with large embeddings, the vectors dominate total bytes (a 1536-dim float32 vector is ~6 KB, often bigger than the text next to it). Payload compression matters most when payloads are large, vectors are quantized, or text is the point of the collection.

How Far Can This Go?

The static frequency table only knows how often each token appears globally. The next step up is a bigram table, P(token | previous token). A compact 1.4 MB table gets you to about 4.35x; the full bigram table (69 MB) reaches ~4.7x.

How can a probabilistic model recover the exact text?

The probability model only controls how many bits each token gets. The token IDs themselves are always stored and recovered exactly — nothing is approximated.

Take the sequence " language" → " model" → " retrieval":

  • Unigram ANS: " model" appears in 0.3% of all tokens → ~8 bits
  • Bigram ANS: " model" after " language" is very common → maybe ~4 bits

The decoder recovers " model" exactly either way. It just decoded " language" first, so it knows to use P(· | " language") as the ruler for the next token — and gets the exact same ID back.

Think of it as Morse code with a smarter codebook. E gets one dot not because "E" is approximate, but because it's frequent. A better frequency estimate means shorter codes, not lossy data.
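
The unigram-versus-bigram difference can be sketched on a toy sequence. The sentence is invented and split into words rather than BPE tokens, and the counts are unsmoothed, so treat the numbers as an illustration of the mechanism only.

```python
import math
from collections import Counter

def unigram_bits(tokens, target):
    """Ideal code length of `target` under global token frequencies."""
    counts = Counter(tokens)
    return -math.log2(counts[target] / len(tokens))

def bigram_bits(tokens, prev, target):
    """Ideal code length of `target` given that the previous token was `prev`."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_total = sum(c for (a, _), c in pair_counts.items() if a == prev)
    return -math.log2(pair_counts[(prev, target)] / prev_total)

tokens = ("the language model reads the language model and "
          "the retrieval model ranks the language barrier").split()

uni = unigram_bits(tokens, "model")
bi = bigram_bits(tokens, "language", "model")
print(f"unigram: {uni:.2f} bits, bigram after 'language': {bi:.2f} bits")
```

Either way the decoder recovers the same token; the table only changes how many bits that token costs. A real coder would also smooth the counts so that unseen pairs keep a nonzero probability.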

Past that is LM territory. Chinchilla 70B achieves 0.664 bits/byte on Wikipedia (8 / 0.664 ≈ 12× compression) because a language model computes P(token | all previous tokens), and the sharper that estimate, the fewer bits an entropy coder has to spend. The gap between 3.3× and 12× is entirely conditional probability.
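
Bits per byte and compression ratio are two views of the same number, since raw text costs 8 bits per byte. A small sketch of the conversion, with illustrative function names:

```python
def ratio_from_bits_per_byte(bits_per_byte: float) -> float:
    """Compression ratio implied by a bits-per-byte figure."""
    return 8.0 / bits_per_byte

def bits_per_byte_from_ratio(ratio: float) -> float:
    """Bits per byte implied by a compression ratio."""
    return 8.0 / ratio

chinchilla_ratio = ratio_from_bits_per_byte(0.664)   # about 12x
static_ans_bpb = bits_per_byte_from_ratio(3.37)      # about 2.4 bits per byte

print(f"{chinchilla_ratio:.1f}x, {static_ans_bpb:.2f} bits/byte")
```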

| Method | Ratio | Practical? |
|---|---|---|
| BPE + static ANS (this post) | ~3.3× | Yes |
| BPE + bigram table | ~4.35× | Yes (1.4 MB overhead) |
| Chinchilla 70B | ~12× | No |

And because the stored format is just token IDs, swapping the unigram table for a bigram table later is a codec upgrade, not a data migration.

What Would a Database Need to Do?

Nothing exotic. For Qdrant specifically:

  1. Accept encoding: "r50k" as a payload storage option
  2. Tokenize on insert, pack IDs as uint16, run ANS with the shared static table
  3. Decompress transparently on read

The index, vectors, and query pipeline don't change at all.

Current                          Proposed
───────────────────────────────────────────────────────
INSERT                           INSERT
  payload: {"text": "..."}         payload: {"text": "..."}
       │                                  │
       ▼                                  ▼
  JSON → Gridstore (LZ4)         tokenize → uint16 IDs → ANS
                                  Gridstore (ANS)

READ                             READ
  fetch LZ4 JSON bytes             fetch ANS-compressed bytes
       │                                  │
       ▼                                  ▼
  return as-is                   ANS decode → token IDs → decode
                                  return original text

There is one real cost: the vocabulary and frequency table become part of the on-disk format forever, so encodings have to be versioned per collection.
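
One way to make that versioning concrete is to tag stored bytes with a codec ID and version and look up the decoder from a registry. Everything below is hypothetical: the two-byte header, the registry layout, and the use of zlib as a stand-in for tokenizer + ANS are illustration only, not Qdrant's format. A per-collection setting would store the tag once in collection metadata rather than on every payload.

```python
import zlib

# (codec_id, version) -> (encode, decode). zlib stands in for tokenizer + ANS.
CODECS = {
    (1, 1): (
        lambda text: zlib.compress(text.encode("utf-8")),
        lambda blob: zlib.decompress(blob).decode("utf-8"),
    ),
}

def store(text: str, codec_id: int = 1, version: int = 1) -> bytes:
    """Prefix the encoded payload with a two-byte (codec_id, version) header."""
    encode, _ = CODECS[(codec_id, version)]
    return bytes([codec_id, version]) + encode(text)

def load(blob: bytes) -> str:
    """Pick the decoder named in the header; refuse encodings we don't know."""
    key = (blob[0], blob[1])
    if key not in CODECS:
        raise ValueError(f"unknown payload encoding {key[0]}.{key[1]}")
    _, decode = CODECS[key]
    return decode(blob[2:])

stored = store("Retrieval-Augmented Generation is a technique...")
print(load(stored))
```

Old entries stay in the registry for as long as any data written with them exists, which is the sense in which the vocabulary and table become part of the on-disk format.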

There's also a bigger-picture reason to like token-native storage as search and generation converge: keeping UTF-8 blobs in the index while the model reasons in tokens means maintaining two representations of the same text.

Key Takeaways

  • Vector databases apply LZ4 to text payloads (1.15x). Tokenizing first and entropy-coding the token IDs gives 3.35x on average. Lossless, 2.9x better than LZ4.
  • Raw uint16 token IDs alone reach 2.33x at the median, beating every general-purpose byte codec (brotli included) with no compression algorithm at all; only zstd with a trained dictionary does better.
  • The frequency table must be static and shared. Per-document tables cost ~900 bytes each and erase the gain.
  • Latency is a non-issue: encode takes ~54µs at the median (dominated by the tokenizer, which you may already run), and decode takes ~18µs.
  • The same idea stretches to ~4.7x with a bigram table (1.4 MB of table already gets 4.35x). Past ~5x you need an actual language model.
  • The trade-off to respect: the vocabulary becomes part of your on-disk format forever, so version your encodings.

Acknowledgements

The benchmarks use tiktoken, zstandard, lz4, and constriction. The "language modeling is compression" framing comes from DeepMind's 2023 paper "Language Modeling Is Compression".

Citation

If you find this useful, please cite:

@misc{kumar2026tokenization,
  author       = {Kumar Shivendu},
  title        = {Tokenization is a Compression Codec Nobody Uses That Way},
  year         = {2026},
  url          = {https://www.kshivendu.dev/blog/tokenization-compression},
  note         = {Blog post}
}