The Half Of Hybrid Search Nobody Explains
When an AI agent searches unstructured content, the searchable surface is the text a perception layer extracted: transcripts from speech, OCR from frames and documents, captions from scenes, detected labels. Over that text, almost every production pipeline runs two retrievers and fuses them: a dense vector retriever for meaning, and a lexical retriever for exact terms. The dense half gets pages of explanation: the embedding model, the index, the geometry. The lexical half gets one line: "and BM25 for keywords."
That asymmetry is a problem. When retrieval fails on an exact term, for example when an agent cannot find the clip that says serial "RX-4490B" even though those characters are right there in the OCR text, the bug is almost always in the lexical half, and you cannot debug a black box. BM25 is not a vague "keyword match." It is a specific, well-understood scoring function over a specific data structure. Once you can see the structure and the formula, the failures stop being mysterious and the tuning knobs start making sense.
This guide opens that box. It is the companion to the fusion guide: fusion teaches how to combine the BM25 score with a dense score; this guide teaches where the BM25 score comes from in the first place.
The Inverted Index: How Text Becomes Searchable At All
Lexical search does not scan documents at query time. Scanning a million transcripts for a word on every query would be hopeless. Instead, at ingestion time it builds an inverted index: a map from each term to the list of documents that contain it, with per-document statistics.
term         posting list (doc_id : term_frequency [positions])
-----------  ---------------------------------------------------
"overheat"   12:3[5,40,91]   88:1[7]   204:2[12,77]
"serial"     12:1[4]         301:1[2]
"rx-4490b"   12:1[6]
Each entry in a term's posting list records which documents contain the term, how many times (the term frequency, tf), and optionally the positions (needed for phrase queries like "free returns"). To answer a query, the engine looks up only the posting lists for the query terms and walks them, instead of touching documents that share no terms at all. This is why lexical search is fast and why it scales: work is proportional to how many documents contain the query terms, not to corpus size.
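Two global statistics get computed at index build time and drive scoring: the document frequency (df) of each term, meaning how many of the N documents contain it, and the average document length (avgdl) in tokens across the corpus.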
Everything BM25 does is arithmetic over these stored numbers. There is no model inference at query time, which is part of why lexical retrieval is cheap, interpretable, and a permanent fixture alongside dense search rather than a legacy technique.
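The structure described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular engine stores its index: the `tokenize` and `build_index` helpers and the three toy documents are assumptions made up for the example, and the analyzer is deliberately naive (lowercase plus whitespace split).

```python
from collections import defaultdict


def tokenize(text):
    # Deliberately naive analyzer for illustration: lowercase + whitespace split.
    return text.lower().split()


def build_index(docs):
    """docs maps doc_id -> text. Returns (postings, doc_len, avgdl)."""
    postings = defaultdict(dict)  # term -> {doc_id: [positions]}
    doc_len = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        doc_len[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            postings[term].setdefault(doc_id, []).append(pos)
    avgdl = sum(doc_len.values()) / len(doc_len)
    return postings, doc_len, avgdl


# Toy corpus (hypothetical documents, not real data).
docs = {
    12: "unit serial rx-4490b began to overheat then overheat again",
    88: "fans prevent overheat",
    301: "serial number missing",
}
postings, doc_len, avgdl = build_index(docs)

# Query-time lookup touches only the posting list of the query term.
hits = postings.get("overheat", {})
df = len(hits)            # document frequency: how many documents contain the term
tf_in_12 = len(hits[12])  # term frequency in doc 12 = number of stored positions

print(dict(hits), df, tf_in_12, avgdl)
```

Storing positions makes the term frequency fall out for free as the length of the position list, which is one reason the two are usually kept together in a posting.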
From TF-IDF To BM25
The intuition behind lexical scoring predates BM25 and is worth stating cleanly, because BM25 is a careful repair of two flaws in the naive version.
Term frequency (tf). A document that mentions "overheat" three times is more about overheating than one that mentions it once. So the score should rise with tf.
Inverse document frequency (idf). A term that appears in almost every document ("the", "video") tells you nothing about relevance, while a rare term ("rx-4490b") is enormously discriminative. So each term's contribution should be weighted by how rare it is across the corpus. The standard form, with N documents and df the document frequency of the term:
idf(term) = ln( 1 + (N - df + 0.5) / (df + 0.5) )
The shape is what matters: as df approaches N (the term is everywhere), idf approaches zero; as df shrinks toward 1 (the term is rare), idf grows large. A rare exact identifier therefore dominates the score, which is precisely the behavior that makes lexical search good at exactly the matches dense search smears away.
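To see that shape numerically, here is a small sketch that evaluates the idf form given above across a range of document frequencies. The corpus size of one million and the sampled df values are arbitrary choices for illustration.

```python
import math


def idf(n_docs, df):
    # Same form as the formula in the text: ln(1 + (N - df + 0.5) / (df + 0.5))
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


N = 1_000_000  # assumed corpus size for the illustration
for df in (1, 100, 10_000, 500_000, 1_000_000):
    print(f"df={df:>9,}  idf={idf(N, df):.4f}")
```

The values fall monotonically as df rises, and the `1 +` inside the logarithm keeps idf from going negative even for a term that appears in every document.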
Naive TF-IDF just multiplies tf by idf and sums over query terms. It has two well-known defects, and BM25 fixes both.
The BM25 Scoring Formula, Term By Term
BM25 (the name comes from "Best Match 25", the 25th formulation in a research line, also called Okapi BM25) scores a document D against a query Q by summing a per-term contribution over every query term:
score(D, Q) = sum over terms t in Q of:
    idf(t) * ( tf(t,D) * (k1 + 1) ) / ( tf(t,D) + k1 * (1 - b + b * len(D) / avgdl) )
where tf(t,D) is the term's frequency in the document, len(D) is the document length in tokens, avgdl is the average document length, and k1 and b are tunable parameters (defaults around k1 = 1.2, b = 0.75). It looks dense, but each piece is one of the two repairs to TF-IDF.
Repair 1: Term-frequency saturation (the k1 term)
Naive tf grows linearly: a document with tf of 100 scores 100x a document with tf of 1. That is wrong. The first occurrence of "overheat" tells you the document is about overheating; the hundredth tells you almost nothing new. BM25 makes tf saturate.
Look at the fraction with length set aside: tf*(k1+1) / (tf + k1). As tf grows, this curve climbs steeply at first and then flattens toward an asymptote of (k1 + 1). The difference between tf of 1 and tf of 2 is large; the difference between tf of 50 and tf of 51 is nearly nothing.
tf:             1     2     3     5     10    50
naive tf:       1     2     3     5     10    50
BM25 (k1=1.2):  1.0   1.4   1.6   1.8   2.0   2.1   (saturating toward 2.2)
k1 controls how fast saturation kicks in. A small k1 saturates almost immediately, so presence matters far more than count. That suits short fields like titles or captions, where repetition is meaningless. A large k1 keeps the curve closer to linear and lets frequency matter more, which is occasionally useful for long-form transcripts where genuine repetition signals topicality. At k1 = 0, BM25 collapses to pure presence-or-absence.
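The effect of k1 is easy to check by evaluating the frequency part of the formula on its own. The sketch below is illustrative only: `tf_saturation` is a hypothetical helper name, and the k1 values 0.0 and 5.0 are picked to show the two extremes next to the 1.2 default.

```python
def tf_saturation(tf, k1):
    # Frequency part of BM25 with length normalization set aside:
    # tf * (k1 + 1) / (tf + k1). A term that is absent contributes nothing.
    if tf == 0:
        return 0.0
    return tf * (k1 + 1) / (tf + k1)


for k1 in (0.0, 1.2, 5.0):
    row = [round(tf_saturation(tf, k1), 2) for tf in (1, 2, 3, 5, 10, 50)]
    print(f"k1={k1}: {row}  (ceiling {k1 + 1})")
```

With k1 = 0 every row entry is the same, so only presence counts. With k1 = 5 the curve is still climbing noticeably at tf = 10, so repetition keeps paying off for longer.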
Repair 2: Document-length normalization (the b term)
A 10,000-word transcript will naturally contain "overheat" more times than a 200-word caption, purely because it is longer, not because it is more relevant. Without correction, the score would reward long documents for their length. The factor (1 - b + b * len(D)/avgdl) in the denominator scales k1 by how long the document is relative to the corpus average, which has the same effect as dividing the term frequency by that factor.
b controls how aggressively length is penalized. At b = 1, length normalization is full: a document twice the average length has its tf effectively halved. At b = 0, length is ignored entirely and longer documents tend to win. The default b = 0.75 is a partial penalty, and it is one of the highest-leverage knobs for mixed-length multimodal corpora, where a single collection holds 5-second caption snippets next to hour-long transcript chunks.
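A quick way to build intuition for b is to score a short document and a long one for the same term at several settings. The sketch below assumes a hypothetical 20-token caption with one mention and a hypothetical 6,000-token transcript chunk with four mentions, against an average length of 300. The numbers are invented for illustration.

```python
def bm25_tf(tf, doc_len, avgdl, k1=1.2, b=0.75):
    # Frequency part of BM25 including the length factor from the text.
    length_factor = 1 - b + b * (doc_len / avgdl)
    return tf * (k1 + 1) / (tf + k1 * length_factor)


avgdl = 300
caption = dict(tf=1, doc_len=20)        # short caption, one mention
transcript = dict(tf=4, doc_len=6000)   # long chunk, four mentions

for b in (0.0, 0.75, 1.0):
    c = bm25_tf(avgdl=avgdl, b=b, **caption)
    t = bm25_tf(avgdl=avgdl, b=b, **transcript)
    print(f"b={b}: caption={c:.2f}  transcript={t:.2f}")
```

At b = 0 the transcript outranks the caption on raw count alone. At the default and above, the caption's single mention in 20 tokens outranks four mentions spread across 6,000.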
Put the two repairs together and BM25 says: reward rare query terms (idf), let repetition count but with diminishing returns (k1 saturation), and do not let a document win just for being long (b normalization). That is the whole algorithm.
A Worked Example
Corpus of N = 1,000,000 documents, avgdl = 300 tokens. Query: "overheat rx-4490b". Take defaults k1 = 1.2, b = 0.75.
"overheat": df = 40,000 -> idf = ln(1 + (1e6 - 40000 + 0.5)/(40000 + 0.5)) ~ 3.22
"rx-4490b": df = 1 -> idf = ln(1 + (1e6 - 1 + 0.5)/(1 + 0.5)) ~ 13.41
The rare identifier carries roughly four times the weight of the common word before frequency is even considered. Now score document 12 (len = 250, so slightly shorter than average): tf("overheat") = 3, tf("rx-4490b") = 1.
length factor = 1 - 0.75 + 0.75 * (250/300) = 0.875
overheat term: 3.22 * (3 * 2.2) / (3 + 1.2 * 0.875) = 3.22 * 6.6 / 4.05 ~ 5.25
rx-4490b term: 13.41 * (1 * 2.2) / (1 + 1.2 * 0.875) = 13.41 * 2.2 / 2.05 ~ 14.39
score(doc 12) ~ 19.64
Notice the single occurrence of the rare serial number contributes nearly three times what three occurrences of the common word do. That is BM25 working as intended, and it is exactly the signal a dense retriever cannot produce, because to a dense model "rx-4490b" is an out-of-distribution token string with no useful semantic neighborhood. This is the concrete reason hybrid search keeps a lexical leg.
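Recomputing the example directly from the formula is a useful check on the arithmetic and doubles as a compact reference scorer. This is a sketch under the same inputs as the worked example. The `bm25_score` function and its dictionary-based arguments are assumptions for illustration, not the interface of any particular search engine.

```python
import math


def idf(n_docs, df):
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


def bm25_score(query_terms, tf, df, n_docs, doc_len, avgdl, k1=1.2, b=0.75):
    """tf and df are dicts keyed by term. Terms absent from the document add 0."""
    length_factor = 1 - b + b * (doc_len / avgdl)
    contributions = {}
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # no posting for this term in this document
        contributions[term] = (
            idf(n_docs, df[term]) * f * (k1 + 1) / (f + k1 * length_factor)
        )
    return sum(contributions.values()), contributions


# Inputs taken from the worked example: document 12, query "overheat rx-4490b".
total, parts = bm25_score(
    query_terms=["overheat", "rx-4490b"],
    tf={"overheat": 3, "rx-4490b": 1},
    df={"overheat": 40_000, "rx-4490b": 1},
    n_docs=1_000_000,
    doc_len=250,
    avgdl=300,
)
for term, value in parts.items():
    print(f"{term}: {value:.2f}")
print(f"score(doc 12): {total:.2f}")
```

Returning the per-term contributions alongside the total is a deliberate choice: when an exact identifier fails to rank, the breakdown shows immediately whether its term contributed anything at all.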
The Tokenizer Decides Everything Upstream
Here is the part that silently breaks exact-match retrieval more than any formula detail. BM25 matches terms, and a term is whatever the analyzer (the tokenizer plus its normalization steps) produced at index time. If the query and the document do not produce the same token, the posting list lookup misses and the score is zero, no matter how perfect the math downstream.
The analyzer typically does some of: lowercasing, splitting on whitespace and punctuation, removing stopwords, and stemming (reducing "running" and "runs" to a root like "run"). Each step helps recall on natural language and can wreck recall on identifiers.
The practical rule: for fields that hold identifiers, codes, SKUs, and names, you usually want a minimal analyzer (lowercase, no stemming, conservative splitting) or a separate exact-match field, while natural-language fields like transcripts want the full linguistic treatment. A surprising share of "lexical search is broken" reports are an analyzer mismatch, not BM25 at all.
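An analyzer mismatch is easy to reproduce. The sketch below uses two hypothetical analyzers written for this example (they are not the analyzers of any specific engine): one splits on every non-alphanumeric character, as a natural-language analyzer commonly does, and one splits on whitespace only so hyphenated codes stay whole.

```python
import re


def linguistic_analyzer(text):
    # Lowercase, then split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def identifier_analyzer(text):
    # Lowercase, then split on whitespace only, so "rx-4490b" stays one token.
    return text.lower().split()


doc_text = "Serial RX-4490B overheats"
query = "RX-4490B"

# The field was indexed with the linguistic analyzer: the serial is split in two.
indexed_terms = set(linguistic_analyzer(doc_text))

# Query analyzed differently from the index: its single token is not in the index.
mismatched = set(identifier_analyzer(query))
# Query analyzed the same way as the index: both fragments are found.
matched = set(linguistic_analyzer(query))

print("indexed:", sorted(indexed_terms))
print("mismatched overlap:", mismatched & indexed_terms)
print("matched overlap:", sorted(matched & indexed_terms))
```

Even the matched case finds only the fragments "rx" and "4490b", each of which may be far more common than the full serial. That loss of discriminative power is the argument for a dedicated identifier field with conservative splitting.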
What This Means For An Agent
An agent issues one query and trusts that the right evidence surfaces. BM25 is the leg that guarantees exact terms are not lost, and understanding it changes three things an agent builder controls.
Doing This In Mixpeek
In Mixpeek, lexical retrieval is a feature-search stage over a full-text field, and it lives in the same retriever as dense search so an agent issues one query and gets both. The lexical stage runs BM25 over the inverted index Mixpeek builds on the text field at ingestion, while the dense stage runs ANN over the embedding field, and a fusion stage merges them.
from mixpeek import Mixpeek

client = Mixpeek(api_key="mxp_sk_...")

# A retriever whose lexical leg is BM25 over the transcript text field,
# fused with dense semantic search.
client.retrievers.create(
    namespace="support-video",
    retriever_name="exact_plus_semantic",
    stages=[
        {"stage_type": "feature_search", "stage_id": "lexical",
         "parameters": {"field_name": "transcript_text", "method": "bm25", "limit": 1000}},
        {"stage_type": "feature_search", "stage_id": "dense",
         "parameters": {"field_name": "transcript_embedding", "limit": 1000}},
        {"stage_type": "rank_fusion", "stage_id": "fuse",
         "parameters": {"method": "rrf", "k": 60, "inputs": ["lexical", "dense"]}},
    ],
)

# The serial number is found by the BM25 leg even though it is meaningless
# to the embedding model; the concept is found by the dense leg.
results = client.retrievers.execute(
    retriever_id="exact_plus_semantic",
    inputs={"text": "the clip mentioning serial RX-4490B about overheating"},
)
The thing to internalize is that the lexical stage is not magic keyword matching; it is the inverted index and BM25 formula from this guide, scored over whatever tokens the text field's analyzer produced. If an exact identifier ever fails to surface, the debugging path is: confirm the term is in the field, confirm the analyzer tokenizes it the same way at index and query time, then look at k1, b, and the fusion weight, in that order.
Key Takeaways
1. Lexical search runs over an inverted index: a map from each term to a posting list of documents, term frequencies, and positions, built at ingestion so query work scales with matching documents, not corpus size.
2. IDF weights rare terms heavily and common terms near zero, which is why a single rare identifier can dominate a BM25 score and why lexical search wins exactly where dense search smears exact tokens away.
3. BM25 repairs naive TF-IDF with two mechanisms: term-frequency saturation (the k1 term, so the hundredth occurrence barely beats the first) and document-length normalization (the b term, so long documents do not win just for being long).
4. k1 and b are the tuning knobs: small k1 favors presence over count (good for short fields), b near 0.75 partially penalizes length and is high-leverage for mixed-length multimodal corpora.
5. The tokenizer/analyzer silently decides recall. Index and query must use the same analyzer, and identifier fields usually want minimal tokenization (no stemming) so codes and names survive. Most "lexical search is broken" reports are analyzer mismatches.
6. The BM25 score is unbounded and query-dependent, which is precisely why it cannot be added to a bounded cosine score and why fusion is a separate problem handled in the companion guide.