BM25 and the Inverted Index: The Lexical Retriever Every Hybrid Search Treats as a Black Box

The Half Of Hybrid Search Nobody Explains

When an AI agent searches unstructured content, the searchable surface is the text a perception layer extracted: transcripts from speech, OCR from frames and documents, captions from scenes, detected labels. Over that text, almost every production pipeline runs two retrievers and fuses them: a dense vector retriever for meaning, and a lexical retriever for exact terms. The dense half gets pages of explanation, the embedding model, the index, the geometry. The lexical half gets one line: "and BM25 for keywords."

That asymmetry is a problem, because when retrieval fails on an exact term, when an agent cannot find the clip that says serial "RX-4490B" even though those characters are right there in the OCR text, the bug is almost always in the lexical half, and you cannot debug a black box. BM25 is not a vague "keyword match." It is a specific, well-understood scoring function over a specific data structure, and once you can see the structure and the formula, the failures stop being mysterious and the tuning knobs start making sense.

This guide opens that box. It is the companion to fusion: fusion teaches how to combine the BM25 score with a dense score, this teaches where the BM25 score comes from in the first place.

The Inverted Index: How Text Becomes Searchable At All

Lexical search does not scan documents at query time. Scanning a million transcripts for a word per query would be hopeless. Instead, at ingestion time it builds an inverted index: a map from each term to the list of documents that contain it, with per-document statistics.

term         posting list (doc_id : term_frequency, positions)
---------    ---------------------------------------------------
"overheat"   12:3[5,40,91]   88:1[7]   204:2[12,77]
"serial"     12:1[4]         301:1[2]
"rx-4490b"   12:1[6]

Each entry in a term's posting list records which documents contain the term, how many times (the term frequency, tf), and optionally the positions (needed for phrase queries like "free returns"). To answer a query, the engine looks up only the posting lists for the query terms and walks them, instead of touching documents that share no terms at all. This is why lexical search is fast and why it scales: work is proportional to how many documents contain the query terms, not to corpus size.

Two global statistics get computed at index build time and drive scoring:

Document frequency (df) of a term: how many documents contain it at all. "the" has enormous df; "rx-4490b" has df of 1. This is the raw material for IDF below.

Document length and the average document length across the corpus. Length is what lets the scorer avoid rewarding a long document just for being long.

Everything BM25 does is arithmetic over these stored numbers. There is no model inference at query time, which is part of why lexical retrieval is cheap, interpretable, and a permanent fixture alongside dense search rather than a legacy technique.

From TF-IDF To BM25

The intuition behind lexical scoring predates BM25 and is worth stating cleanly, because BM25 is a careful repair of two flaws in the naive version.

Term frequency (tf). A document that mentions "overheat" three times is more about overheating than one that mentions it once. So score should rise with tf.

Inverse document frequency (idf). A term that appears in almost every document ("the", "video") tells you nothing about relevance, while a rare term ("rx-4490b") is enormously discriminative. So each term's contribution should be weighted by how rare it is across the corpus. The standard form, with N documents and df the document frequency of the term:

idf(term) = ln( 1 + (N - df + 0.5) / (df + 0.5) )

The shape is what matters: as df approaches N (the term is everywhere), idf approaches zero; as df shrinks toward 1 (the term is rare), idf grows large. A rare exact identifier therefore dominates the score, which is precisely the behavior that makes lexical search good at exactly the matches dense search smears away.

Naive TF-IDF just multiplies tf by idf and sums over query terms. It has two well-known defects, and BM25 fixes both.

The BM25 Scoring Formula, Term By Term

BM25 (the name comes from "Best Match 25", the 25th formulation in a research line, also called Okapi BM25) scores a document D against a query Q by summing a per-term contribution over every query term:

score(D, Q) = sum over terms t in Q of

    idf(t) * ( tf(t,D) * (k1 + 1) )
             ---------------------------------------------------
             tf(t,D) + k1 * (1 - b + b * (len(D) / avgdl))

where tf(t,D) is the term's frequency in the document, len(D) is the document length in tokens, avgdl is the average document length, and k1 and b are tunable parameters (defaults around k1 = 1.2, b = 0.75). It looks dense, but each piece is one of the two repairs to TF-IDF.

Repair 1: Term-frequency saturation (the k1 term)

Naive tf grows linearly: a document with tf of 100 scores 100x a document with tf of 1. That is wrong. The first occurrence of "overheat" tells you the document is about overheating; the hundredth tells you almost nothing new. BM25 makes tf saturate.

Look at the fraction with length set aside: tf*(k1+1) / (tf + k1). As tf grows, this curve climbs steeply at first and then flattens toward an asymptote of (k1 + 1). The difference between tf of 1 and tf of 2 is large; the difference between tf of 50 and tf of 51 is nearly nothing.

tf:                 1     2     3     5    10    50
naive tf:           1     2     3     5    10    50
BM25 (k1=1.2):    1.0   1.4   1.6   1.8   2.0   2.2   (saturating toward ~2.2)

k1 controls how fast saturation kicks in. A small k1 saturates almost immediately, so presence matters far more than count, good for short fields like titles or captions where repetition is meaningless. A large k1 keeps the curve closer to linear, letting frequency matter more, occasionally useful for long-form transcripts where genuine repetition signals topicality. k1 = 0 collapses BM25 toward pure presence-or-absence.

Repair 2: Document-length normalization (the b term)

A 10,000-word transcript will naturally contain "overheat" more times than a 200-word caption, purely because it is longer, not because it is more relevant. Without correction, BM25 would reward long documents for their length. The factor (1 - b + b * len(D)/avgdl) in the denominator divides the term frequency by how long the document is relative to the corpus average.

b controls how aggressively length is penalized. At b = 1, length normalization is full: a document twice the average length has its tf effectively halved. At b = 0, length is ignored entirely, longer documents win. The default b = 0.75 is a partial penalty, and it is one of the highest-leverage knobs for mixed-length multimodal corpora, where a single collection holds 5-second caption snippets next to hour-long transcript chunks.

Put the two repairs together and BM25 says: reward rare query terms (idf), let repetition count but with diminishing returns (k1 saturation), and do not let a document win just for being long (b normalization). That is the whole algorithm.

A Worked Example

Corpus of N = 1,000,000 documents, avgdl = 300 tokens. Query: "overheat rx-4490b". Take defaults k1 = 1.2, b = 0.75.

"overheat":  df = 40,000   -> idf = ln(1 + (1e6 - 40000 + 0.5)/(40000 + 0.5)) ~ 3.18
"rx-4490b":  df = 1        -> idf = ln(1 + (1e6 - 1 + 0.5)/(1 + 0.5))         ~ 13.51

The rare identifier carries roughly four times the weight of the common word before frequency is even considered. Now score document 12 (len = 250, so slightly shorter than average): tf("overheat") = 3, tf("rx-4490b") = 1.

length factor = 1 - 0.75 + 0.75 * (250/300) = 0.875

overheat term:  3.18 * (3 * 2.2) / (3 + 1.2 * 0.875)   = 3.18 * 6.6 / 4.05  ~ 5.18
rx-4490b term: 13.51 * (1 * 2.2) / (1 + 1.2 * 0.875)   = 13.51 * 2.2 / 2.05 ~ 14.50

score(doc 12) ~ 19.68

Notice the single occurrence of the rare serial number contributes nearly three times what three occurrences of the common word do. That is BM25 working as intended, and it is exactly the signal a dense retriever cannot produce, because to a dense model "rx-4490b" is an out-of-distribution token string with no useful semantic neighborhood. This is the concrete reason hybrid search keeps a lexical leg.

The Tokenizer Decides Everything Upstream

Here is the part that silently breaks exact-match retrieval more than any formula detail. BM25 matches terms, and a term is whatever the analyzer produced at index time, the tokenizer plus its normalization steps. If the query and the document do not produce the same token, the posting list lookup misses and the score is zero, no matter how perfect the math downstream.

The analyzer typically does some of: lowercasing, splitting on whitespace and punctuation, removing stopwords, and stemming (reducing "running", "runs", "ran" to a root like "run"). Each step helps recall on natural language and can wreck recall on identifiers:

Punctuation splitting can turn "RX-4490B" into "rx" and "4490b", so a query for the exact "RX-4490B" string, if tokenized the same way, still works, but a phrase or exact-match expectation can break depending on configuration.

Stemming is great for "retention" matching "retain" but actively harmful for codes and proper nouns it mangles.

Stopword removal silently drops query terms; a query that is mostly stopwords can end up nearly empty.

The index and the query must use the same analyzer. Index with stemming and query without it and you get systematic misses that look like data problems but are configuration problems.

The practical rule: for fields that hold identifiers, codes, SKUs, and names, you usually want a minimal analyzer (lowercase, no stemming, conservative splitting) or a separate exact-match field, while natural-language fields like transcripts want the full linguistic treatment. A surprising share of "lexical search is broken" reports are an analyzer mismatch, not BM25 at all.

What This Means For An Agent

An agent issues one query and trusts that the right evidence surfaces. BM25 is the leg that guarantees exact terms are not lost, and understanding it changes three things an agent builder controls.

Exact identifiers are BM25's job, and the tokenizer gates it. When an agent must find a serial, a case number, a promo code, or a surname, the dense retriever will smear it and BM25 will nail it, but only if the analyzer tokenizes the identifier consistently at index and query time. If exact-match retrieval is failing, inspect the analyzer before touching the formula.

k1 and b are real levers, not folklore. A mixed-length multimodal corpus (caption snippets beside long transcripts) is the classic case where the default b under- or over-penalizes length and reranking quietly suffers. These are measured settings, tuned against a labeled set, not constants to inherit.

The BM25 score is one input to fusion, and it is unbounded and query-dependent by construction. The rare-term spike you saw above (a contribution of 14.5 from a single token) is exactly why you cannot add BM25 to a bounded cosine score, and why fusion exists as a separate problem. Knowing where the spike comes from is what makes the fusion choices in the companion guide make sense rather than feel arbitrary.

Doing This In Mixpeek

In Mixpeek, lexical retrieval is a feature-search stage over a full-text field, and it lives in the same retriever as dense search so an agent issues one query and gets both. The lexical stage runs BM25 over the inverted index Mixpeek builds on the text field at ingestion, while the dense stage runs ANN over the embedding field, and a fusion stage merges them.

from mixpeek import Mixpeek

client = Mixpeek(api_key="mxp_sk_...")

# A retriever whose lexical leg is BM25 over the transcript text field,
# fused with dense semantic search.
client.retrievers.create(
    namespace="support-video",
    retriever_name="exact_plus_semantic",
    stages=[
        {"stage_type": "feature_search", "stage_id": "lexical",
         "parameters": {"field_name": "transcript_text", "method": "bm25", "limit": 1000}},
        {"stage_type": "feature_search", "stage_id": "dense",
         "parameters": {"field_name": "transcript_embedding", "limit": 1000}},
        {"stage_type": "rank_fusion", "stage_id": "fuse",
         "parameters": {"method": "rrf", "k": 60, "inputs": ["lexical", "dense"]}},
    ],
)

# The serial number is found by the BM25 leg even though it is meaningless
# to the embedding model; the concept is found by the dense leg.
results = client.retrievers.execute(
    retriever_id="exact_plus_semantic",
    inputs={"text": "the clip mentioning serial RX-4490B about overheating"},
)

The thing to internalize is that the lexical stage is not magic keyword matching; it is the inverted index and BM25 formula from this guide, scored over whatever tokens the text field's analyzer produced. If an exact identifier ever fails to surface, the debugging path is: confirm the term is in the field, confirm the analyzer tokenizes it the same way at index and query time, then look at k1, b, and the fusion weight, in that order.

Key Takeaways

1. Lexical search runs over an inverted index: a map from each term to a posting list of documents, term frequencies, and positions, built at ingestion so query work scales with matching documents, not corpus size.

2. IDF weights rare terms heavily and common terms near zero, which is why a single rare identifier can dominate a BM25 score and why lexical search wins exactly where dense search smears exact tokens away.

3. BM25 repairs naive TF-IDF with two mechanisms: term-frequency saturation (the k1 term, so the hundredth occurrence barely beats the first) and document-length normalization (the b term, so long documents do not win just for being long).

4. k1 and b are the tuning knobs: small k1 favors presence over count (good for short fields), b near 0.75 partially penalizes length and is high-leverage for mixed-length multimodal corpora.

5. The tokenizer/analyzer silently decides recall. Index and query must use the same analyzer, and identifier fields usually want minimal tokenization (no stemming) so codes and names survive, most "lexical search is broken" reports are analyzer mismatches.

6. The BM25 score is unbounded and query-dependent, which is precisely why it cannot be added to a bounded cosine score and why fusion is a separate problem handled in the companion guide.

The Half Of Hybrid Search Nobody Explains

The Inverted Index: How Text Becomes Searchable At All

From TF-IDF To BM25

The BM25 Scoring Formula, Term By Term

Repair 1: Term-frequency saturation (the k1 term)

Repair 2: Document-length normalization (the b term)

A Worked Example

The Tokenizer Decides Everything Upstream

What This Means For An Agent

Doing This In Mixpeek

Key Takeaways

Further Reading

Put multimodal search to work

Already have vectors?

Build a Multimodal Search Pipeline

Related guides

Hybrid Search Fusion: How to Combine Dense and Lexical Retrieval Without Breaking Ranking

Semantic Caching: How Agents Skip Work They Have Already Done

Filtered Vector Search: How Agents Combine Similarity with Hard Constraints