As the former lead of MongoDB's Search team, Ethan noticed that the most common problem customers faced was building indexing and search infrastructure on top of their S3 buckets. Mixpeek was born.
We tested Gemini, Twelve Labs Marengo, X-CLIP, SigLIP 2, and InternVideo2 on text-to-video retrieval with graded relevance. The results surprised us.
Google's Gemini Embedding 2 embeds images, PDFs, and text together in a single API call. Here's how we integrated it into Mixpeek's feature extractor pipeline, the production numbers, and where multi-file embedding beats single-chunk approaches.
How we built query preprocessing into Mixpeek's feature_search stage — decompose a 500MB video into chunks, embed in parallel, fuse results. Zero API surface change for callers.
Sports broadcasters cut 4-8 hour editing sessions to 15 minutes using AI video analysis. Learn how to build automated highlight detection, archive search, and performance analytics pipelines for any sport.
We run 20+ ML models in parallel across video, image, and document pipelines. Here's the Ray architecture behind it: custom resource isolation, flexible actor pools, distributed Qdrant writes, and the lessons we learned the hard way.
6,000+ ZIP codes straddle congressional district lines. At ZIP+4 precision, federal, state, and local disclaimer requirements can all apply simultaneously. Here's how multimodal AI solves what static rules engines can't.
Classify text, images, and video into 700+ IAB Content Taxonomy categories using multimodal AI. Learn how it works under the hood and how to extend it for your contextual targeting needs.
Instead of polling with an LLM on a cron schedule, Retriever Alerts evaluate semantic conditions at ingestion time. Vector math instead of inference calls. Event-driven instead of scheduled. Three API calls to set up.
Nurses spend 40% of their time on documentation. MDS coordinators abstract charts for 3-4 hours per assessment. PDPM revenue goes uncaptured. The root cause is that clinical documentation is inherently multimodal — and most tools only handle text.
How semantic chunking improves RAG quality by splitting content at natural boundaries rather than fixed token counts. Covers text, documents, video, and audio.
How agentic retrieval goes beyond traditional RAG by letting AI agents dynamically plan and execute multi-step search strategies with tool calling.
How AI-powered video intelligence extracts structured, searchable information from raw footage — covering scene detection, transcription, face recognition, and temporal indexing.
A clear comparison of keyword, semantic, and hybrid search with practical guidance on when to use each approach in production systems.
A practical guide to building search that works across text, images, video, and audio using shared embedding spaces and retrieval pipelines.
How we built a fast, efficient, and production-ready vision-language model server without Python.