I Benchmarked 6 Video Embedding Models So You Don't Have To
We tested Gemini, Twelve Labs Marengo, Mixedbread, X-CLIP, SigLIP 2, and InternVideo2 on text-to-video retrieval with graded relevance. The results surprised us.

At Mixpeek, we process video embeddings at scale for our customers' retrieval pipelines. When Google dropped Gemini Embedding 2 — claiming to unify text, image, video, and audio in one embedding space — we needed to know: does it actually work for video retrieval? And how does it compare to purpose-built alternatives?
So we built a benchmark. Not a synthetic one with cherry-picked examples, but a proper IR evaluation with graded relevance, following the same methodology as BEIR and MTEB. The dataset, code, and results are all open-sourced on GitHub: twenty CC0 videos, sixty queries, six models, reproducible results.
Here's what we found.
The Results
| Model | Dims | NDCG@5 | NDCG@10 | R@1 | R@5 | MRR | Latency | Type |
|---|---|---|---|---|---|---|---|---|
| Gemini Embedding 2 | 3072 | 0.697 | 0.769 | 0.200 | 0.717 | 0.896 | 2,458ms | API |
| Marengo 2.7* | 1024 | 0.721 | 0.760 | 0.250 | 0.743 | 1.000 | 18,148ms | API |
| Mixedbread Wholembed v3** | multi-vector | 0.644 | 0.757 | 0.216 | 0.649 | 0.932 | 500ms | API (Stores) |
| X-CLIP Base | 512 | 0.327 | 0.470 | 0.067 | 0.367 | 0.520 | 192ms | Local |
| SigLIP 2 SO400M | 1152 | 0.202 | 0.325 | 0.075 | 0.237 | 0.466 | 636ms | Local |
| InternVideo2 6B | 768 | 0.186 | 0.302 | 0.046 | 0.237 | 0.405 | 24,817ms | Local |
* Marengo results based on 34/60 queries due to Twelve Labs free-tier rate limits.
** Mixedbread results based on 37/60 queries due to rate limiting; uses the Stores API (ColBERT-style late interaction).
What These Metrics Mean (and Why You Should Care)
If you're building any kind of search or retrieval system over video, these numbers tell you how often your users will find what they're looking for. Let me break them down:
NDCG@K (Normalized Discounted Cumulative Gain) is the primary metric. It measures ranking quality with graded relevance — not just "did you find it?" but "did you rank the best match highest?" An NDCG@10 of 0.769 means Gemini gets the ranking mostly right across the top 10 results. An NDCG@10 of 0.302 (InternVideo2) means the ranking is barely better than random.
MRR (Mean Reciprocal Rank) answers: "Where does the first correct result appear?" Marengo's perfect 1.000 means the right video was always ranked #1. Every single query. Gemini's 0.896 means the right answer is usually in the top 2. X-CLIP's 0.520 means you're typically scrolling to position 2-3 before finding something relevant.
Recall@K tells you what fraction of relevant results appear in the top K. At R@5, Marengo retrieves 74% of relevant videos in the top 5 results. InternVideo2 only gets 24%. If you're building a search UI that shows 5 results per page, that's the difference between useful and useless.
Latency is wall-clock time to embed one video. X-CLIP does it in 192ms locally. Marengo takes 18 seconds through their API. That's a 94x difference. For batch processing, this might not matter. For real-time applications, it's a dealbreaker.
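For concreteness, all three metrics fit in a few lines of code. This is a generic illustration of the standard formulas, not the benchmark's actual evaluation code:

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for a ranked list of graded relevances (0/1/2)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(relevances):
    """Reciprocal rank of the first relevant result."""
    for i, rel in enumerate(relevances):
        if rel > 0:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(relevances, k):
    """Fraction of all relevant items that appear in the top k."""
    total = sum(1 for r in relevances if r > 0)
    hits = sum(1 for r in relevances[:k] if r > 0)
    return hits / total if total > 0 else 0.0

# Ranked results for one query: graded relevance of each retrieved video.
# The ideal ordering would be [2, 1, 0, 0, 0], so NDCG@5 is high but not 1.0.
ranking = [2, 0, 1, 0, 0]
print(ndcg_at_k(ranking, 5))   # ~0.95
print(mrr(ranking))            # 1.0 — first result is relevant
print(recall_at_k(ranking, 5)) # 1.0 — both relevant videos in the top 5
```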
The Surprising Takeaways
1. Three API models in a tight race
The top 3 — Gemini, Marengo, and Mixedbread — cluster tightly at NDCG@10 0.757–0.769. Marengo's perfect MRR is genuinely impressive (it never puts the wrong video first), while Mixedbread's ColBERT-style late interaction achieves 0.932 MRR with a very different architecture. But Gemini is close on all metrics and handles text, images, audio, and documents too. For most teams, Gemini's versatility at 7x faster latency probably wins. Mixedbread's Stores approach is interesting — you upload videos once and search via API. No embedding vectors to manage, no vector DB needed.
2. Model size doesn't predict quality
InternVideo2 has 6 billion parameters. X-CLIP has ~150 million. X-CLIP beats InternVideo2 on every single metric by a wide margin (0.470 vs 0.302 NDCG@10). The reason: InternVideo2's Stage2 checkpoint is optimized for multimodal pretraining, not zero-shot retrieval. Architecture and training objective matter more than parameter count.
3. Frame averaging is a dead end
SigLIP 2 is a fantastic image encoder. But sampling 8 frames and averaging their embeddings gives you 0.325 NDCG@10 — barely above InternVideo2's pretrained checkpoint. Video is not a bag of frames. Temporal structure — what happens between frames — carries critical information for retrieval. X-CLIP's cross-frame attention proves this: same number of frames, 1.4x better results.
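The bag-of-frames baseline is exactly as naive as it sounds. A minimal sketch (illustrative only, not our SigLIP adapter; `encode_image` stands in for any image encoder):

```python
import numpy as np

def embed_video_by_frame_averaging(frames, encode_image):
    """Bag-of-frames baseline: embed each frame independently, then mean-pool.

    All temporal ordering is discarded — reversing or shuffling the frames
    yields the identical video embedding, which is why this approach fails.
    """
    frame_embs = np.stack([encode_image(f) for f in frames])  # (n_frames, dim)
    video_emb = frame_embs.mean(axis=0)
    return video_emb / np.linalg.norm(video_emb)  # L2-normalize for cosine search

# Demonstration with toy "frames" and a deterministic toy encoder:
frames = [np.full(4, float(i)) for i in range(8)]
enc = lambda f: np.concatenate([f, f * f])
e1 = embed_video_by_frame_averaging(frames, enc)
e2 = embed_video_by_frame_averaging(frames[::-1], enc)  # played backwards
print(np.allclose(e1, e2))  # True — order doesn't matter, which is the problem
```

A video of someone assembling a sandwich and a video of someone disassembling one produce the same mean-pooled embedding; cross-frame attention (as in X-CLIP) is what lets the model tell them apart.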
4. The API vs. self-hosted gap is 2x
All three API models (Gemini, Marengo, Mixedbread) score 0.75+ NDCG@10. The best open-source model (X-CLIP) scores 0.470, a gap of roughly 1.6x. If you need high-quality video retrieval today, you're paying for an API. The open-source video embedding space is still immature.
What This Means For Your Architecture
If you're building a search product:
Use Gemini Embedding 2. It has the best balance of quality, latency, and cost (free tier covers 1K videos/day). The 3072-dim vectors are large, but you get Matryoshka support — truncate to 768 dims with minimal quality loss. Marengo is slightly better on retrieval but 7x slower and costs $0.033/min.
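Matryoshka truncation is a one-liner: keep the leading dimensions and re-normalize. A hedged sketch of the idea (assuming the model was Matryoshka-trained so the leading coordinates carry most of the signal):

```python
import numpy as np

def truncate_matryoshka(embedding, target_dim=768):
    """Truncate a Matryoshka embedding to its leading dimensions.

    Matryoshka-trained models concentrate information in the leading
    coordinates, so keeping the first `target_dim` values and re-normalizing
    preserves most retrieval quality at a quarter of the storage
    (768 × 4 B = 3 KB/vec vs 3072 × 4 B = 12 KB/vec in float32).
    """
    truncated = np.asarray(embedding, dtype=np.float32)[:target_dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.default_rng(0).standard_normal(3072).astype(np.float32)
small = truncate_matryoshka(full, 768)
print(small.shape)  # (768,)
```

Note that truncation must happen consistently on both the query and document side, and vectors must be re-normalized before cosine search.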
If latency matters more than quality:
X-CLIP runs locally in 192ms on consumer hardware. At 0.470 NDCG@10, it's good enough for recommendation systems, deduplication, or coarse-grained search where you can refine results downstream.
If you're evaluating at scale:
Don't use InternVideo2 for retrieval. Despite the hype, the Stage2 checkpoint isn't designed for zero-shot embedding similarity. If you need an open-source model with >1B params, wait for a contrastive-tuned variant or fine-tune it yourself.
If you're at Mixpeek:
This benchmark directly informs our pipeline. We're integrating Gemini Embedding 2 as a first-class embedding option alongside our existing extractors. The quality-to-latency ratio is unmatched, and the multimodal unification means our customers can search across video, images, and documents with a single model.
Methodology
We want this to be reproducible. Here's exactly what we did:
- Dataset: 20 CC0 videos from Pexels across 5 categories (sports, cooking, nature, urban, technology). All normalized to 640x360, 24fps, 10s max, H.264.
- Queries: 60 text queries with graded relevance (0/1/2), three per video: exact match, partial match, and hard negative (semantically adjacent but wrong domain).
- Metrics: Standard IR evaluation following BEIR/MTEB conventions — NDCG@K with graded relevance as the primary metric.
- Embedding: All vectors L2-normalized. Retrieval by cosine similarity. Random seed fixed at 42.
- Frame-based models: 8 frames uniformly sampled (SigLIP, X-CLIP) or 4 frames (InternVideo2). API models process the full video.
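The retrieval step itself is plain cosine similarity over L2-normalized vectors, which reduces to a dot product. A minimal sketch of the ranking logic (illustrative, not the repo's code):

```python
import numpy as np

def l2_normalize(x):
    """Normalize vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=np.float32)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def rank_videos(query_emb, video_embs):
    """Rank videos by cosine similarity to a text query embedding.

    Both inputs are assumed L2-normalized, so cosine similarity is just
    a dot product. Returns video indices, best match first.
    """
    scores = video_embs @ query_emb   # (n_videos,)
    return np.argsort(-scores)        # descending by similarity

# Toy corpus: 20 "videos", a query embedding close to video 7.
rng = np.random.default_rng(42)
videos = l2_normalize(rng.standard_normal((20, 768)))
query = l2_normalize(videos[7] + 0.01 * rng.standard_normal(768))
print(rank_videos(query, videos)[0])  # 7 — the nearest video ranks first
```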
Hard Negative Performance
We specifically designed queries to confuse models — e.g., "A technician carefully assembling small electronic parts by hand" for a cooking video (both involve precise hand movements). This tests whether models understand semantics or just match visual patterns.
| Model | NDCG@1 | NDCG@5 | MRR |
|---|---|---|---|
| Marengo 2.7* | 1.000 | 0.846 | 1.000 |
| Mixedbread Wholembed v3** | 1.000 | 0.802 | 1.000 |
| Gemini Embedding 2 | 0.800 | 0.790 | 0.900 |
| SigLIP 2 SO400M | 0.400 | 0.236 | 0.556 |
| X-CLIP Base | 0.200 | 0.322 | 0.467 |
| InternVideo2 6B | 0.200 | 0.220 | 0.423 |
Marengo and Mixedbread were never fooled — both achieve perfect NDCG@1 and MRR on hard negatives. Gemini was fooled once. The open-source models were confused frequently. This is arguably the most important test for production retrieval — false positives in search results destroy user trust.
Cost Comparison
| Model | Pricing | Est. Cost / 1K Videos | Vector Storage |
|---|---|---|---|
| Gemini Embedding 2 | Free tier (1K/day) | ~$0 | 12,288 B/vec |
| Marengo 2.7 | $0.033/min | ~$5-15 | 4,096 B/vec |
| Mixedbread Wholembed v3 | Free tier, then per-token | ~$0 | N/A (server-side) |
| X-CLIP Base | Self-hosted | GPU cost only | 2,048 B/vec |
| SigLIP 2 | Self-hosted | GPU cost only | 4,608 B/vec |
| InternVideo2 6B | Self-hosted | GPU cost only | 3,072 B/vec |
Gemini at free tier is almost unfair. For most teams processing <1K videos/day, the cost is literally zero. Marengo's per-minute pricing adds up fast for large video libraries but might be worth it if retrieval quality is your north star metric.
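The vector-storage column follows directly from float32 encoding: bytes per vector = dimensions × 4. At index scale the difference compounds; a quick back-of-envelope (the 1M-video library size is a hypothetical):

```python
# Embedding dimensions from the results table above.
models = {
    "Gemini Embedding 2": 3072,
    "Marengo 2.7": 1024,
    "X-CLIP Base": 512,
    "SigLIP 2 SO400M": 1152,
    "InternVideo2 6B": 768,
}

BYTES_PER_FLOAT32 = 4
N_VIDEOS = 1_000_000  # hypothetical library size

for name, dims in models.items():
    per_vec = dims * BYTES_PER_FLOAT32
    total_gb = per_vec * N_VIDEOS / 1e9
    print(f"{name}: {per_vec} B/vec, {total_gb:.1f} GB per 1M videos")
```

At a million videos, Gemini's raw index is ~12 GB versus ~2 GB for X-CLIP, before any ANN index overhead; Matryoshka truncation to 768 dims brings Gemini down to ~3 GB.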
Reproduce It Yourself
Everything is open source — code, dataset (all 20 videos), and results — in a single repo:
github.com/mixpeek/video-embedding-benchmark
What's in the repo:
- All 20 CC0 videos included directly (~13MB), no separate download step
- `data/queries.json` with 60 graded-relevance queries
- Pre-computed results for all 6 models in `results/`
- Full benchmark + adapter code for every model
- Clone and run `python report.py` to verify our numbers, no API keys needed
```shell
git clone https://github.com/mixpeek/video-embedding-benchmark.git
cd video-embedding-benchmark

# Videos are already in the repo — no download needed
ls data/videos/

# Run individual models
python benchmark.py --model gemini      # needs GEMINI_API_KEY
python benchmark.py --model xclip       # runs locally
python benchmark.py --model siglip      # runs locally
python benchmark.py --model mixedbread  # needs MIXEDBREAD_API_KEY

# Generate comparison report from pre-computed results
python report.py
```
We'll update this post as we complete the remaining Marengo queries and add Amazon Nova Multimodal to the benchmark.
What's Next
This benchmark covers retrieval quality, but that's only one dimension. We're planning to extend it with:
- Longer videos — 30s, 60s, 5min clips to test how models degrade with length
- Domain-specific evaluation — medical, security, retail video datasets
- Cross-modal retrieval — image-to-video, video-to-video search
- Matryoshka dimension scaling — how much quality do you lose at 256d vs 3072d?
If you're working on video search or embeddings and want to collaborate on the benchmark, reach out. We're at mixpeek.com or @mixpeek on GitHub.
Ethan Steininger is the founder of Mixpeek, a multimodal processing platform for video, image, text, and audio understanding at scale.
