Sync videos from Mux into Mixpeek for automatic multimodal extraction — scene understanding, object detection, face identity, OCR, and transcription. Build visual search retrievers that let users find the exact frame, scene, or spoken word across your entire video library.

Measurable impact from day one

What teams see after connecting Mux to Mixpeek

95%

Less manual review

Teams find the exact frame in seconds instead of scrubbing through hours of footage

Manual indexing steps

Every Mux upload is decomposed and searchable within minutes, no human intervention

40–60%

Lower processing costs

Selective sync filters ensure only relevant assets are indexed, eliminating waste

<200ms

Search latency

Visual, face, transcript, and OCR queries across 10,000+ hours of video

100%

Audit coverage

Every extraction step is logged from Mux ingest to search index, audit-ready out of the box

<4 hrs

Time to first query

Connect Mux, configure filters, and run your first search query in under 4 hours

Finding a scene in your video library

Before

Scrub through footage manually

Rely on hand-entered tags

Hours per search request

No face or speech search

After

Type a natural language query

Auto-extracted visual + audio features

Results in <200ms

Face, transcript, OCR in one query

The Problem

Video platforms store thousands of hours of content, but the footage itself is a black box. Finding a specific scene, verifying talent rights across a library, or searching for on-screen text means scrubbing through videos manually. Metadata is limited to what was entered at upload time — titles and tags that go stale fast. Teams waste hours on manual review that should take seconds.

The Solution

Mixpeek connects directly to Mux via selective sync. When a video lands in Mux, Mixpeek automatically decomposes it into frames and audio segments, then runs multimodal extractors — visual embeddings, object detection, face recognition, OCR, and speech transcription. Every extracted feature is indexed into a retriever so your team can search across scenes, objects, spoken words, and on-screen text from a single query.

Pipeline Architecture

Mux Selective Sync

Webhook + Filters

Videos uploaded to Mux trigger a webhook. Selective sync filters decide which assets flow into Mixpeek based on metadata, passthrough flags, or asset tags.
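
A minimal sketch of how a selective sync filter like this might be evaluated against an incoming webhook payload. The event shape (a `type` field plus a `data` object carrying `passthrough`) follows the general form of Mux webhook events. The `SyncRule` class, the `should_sync` function, the rule fields, and the `tags` list are hypothetical names chosen for illustration, not Mixpeek's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class SyncRule:
    """Hypothetical selective-sync rule; field names are illustrative only."""
    event_types: set = field(default_factory=lambda: {"video.asset.ready"})
    passthrough_prefix: str = ""                      # e.g. only sync assets flagged "index:"
    required_tags: set = field(default_factory=set)   # every tag here must be present


def should_sync(event: dict, rule: SyncRule) -> bool:
    """Return True if the webhook event describes an asset that should be indexed."""
    if event.get("type") not in rule.event_types:
        return False
    data = event.get("data", {})
    passthrough = data.get("passthrough") or ""
    if not passthrough.startswith(rule.passthrough_prefix):
        return False
    # Assumption: asset tags arrive as a list under data["tags"].
    tags = set(data.get("tags", []))
    return rule.required_tags <= tags


rule = SyncRule(passthrough_prefix="index:", required_tags={"licensed"})
event = {
    "type": "video.asset.ready",
    "data": {
        "id": "asset-123",
        "passthrough": "index:campaign-7",
        "tags": ["licensed", "promo"],
    },
}
print(should_sync(event, rule))
```

Filtering before ingest, rather than after extraction, is what keeps unneeded assets from consuming processing time.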

Asset Ingest

Mixpeek Namespace

Filtered Mux assets are pulled into a Mixpeek namespace. RAW video formats (RED R3D, ARRI RAW) are converted via custom plugins before processing.

Multimodal Decomposition

Extractors

Each video is decomposed into frames and audio segments. Extractors run in parallel: visual embeddings, object detection, face identity, OCR, and speech transcription.
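
The parallel fan-out described above can be sketched with Python's standard `concurrent.futures` module. The extractor functions here are placeholders that return dummy payloads; their names, the `segment` dictionary shape, and `run_extractors` are assumptions for illustration and do not reflect how Mixpeek implements or schedules its extractors.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder extractors: each takes a segment dict and returns a feature payload.
def visual_embedding(segment: dict) -> dict:
    return {"kind": "embedding", "segment": segment["id"]}


def object_detection(segment: dict) -> dict:
    return {"kind": "objects", "segment": segment["id"]}


def face_identity(segment: dict) -> dict:
    return {"kind": "faces", "segment": segment["id"]}


def ocr(segment: dict) -> dict:
    return {"kind": "ocr", "segment": segment["id"]}


def transcription(segment: dict) -> dict:
    return {"kind": "transcript", "segment": segment["id"]}


EXTRACTORS = {
    "visual_embedding": visual_embedding,
    "object_detection": object_detection,
    "face_identity": face_identity,
    "ocr": ocr,
    "transcription": transcription,
}


def run_extractors(segment: dict) -> dict:
    """Submit every extractor for one segment at once and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(EXTRACTORS)) as pool:
        futures = {name: pool.submit(fn, segment) for name, fn in EXTRACTORS.items()}
        return {name: future.result() for name, future in futures.items()}


features = run_extractors({"id": "asset-123:frame-300"})
print(sorted(features))
```

Because the extractors do not depend on each other's output, a slow one (transcription, say) does not block the rest from starting.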

Feature Indexing

Collections

Extracted features are stored in Mixpeek collections with full lineage back to the source Mux asset, timestamp, and frame number.
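
As an illustration of what lineage on a stored feature could look like, the sketch below models one record that carries the source asset, timestamp, and frame number. `FeatureRecord`, `frame_at`, and all field names are hypothetical, and the frame number is derived from the timestamp under the assumption of a constant frame rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRecord:
    """One extracted feature plus the lineage needed to trace it back to its source."""
    mux_asset_id: str    # source asset in Mux
    extractor: str       # which extractor produced the feature
    timestamp_s: float   # offset into the video, in seconds
    frame_number: int    # frame index at that offset
    payload: dict        # the extracted feature itself


def frame_at(timestamp_s: float, fps: float) -> int:
    """Frame index for a timestamp, assuming a constant frame rate."""
    return int(timestamp_s * fps)


record = FeatureRecord(
    mux_asset_id="asset-123",
    extractor="ocr",
    timestamp_s=12.5,
    frame_number=frame_at(12.5, fps=24),
    payload={"text": "SCORE 2-1"},
)
print(record.frame_number)
```

Keeping the lineage fields on every record means a search hit can be resolved to a playable position in the original asset without a second lookup.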

Visual Search Retriever

Feature Search + Filters

A retriever combines vector similarity, face identity matching, metadata filters, and full-text search across transcripts and OCR output.
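
To make the combination concrete, here is a small in-memory sketch: metadata, face, and text conditions narrow the candidate set, and cosine similarity ranks what remains. It is a toy under stated assumptions, not Mixpeek's retriever. The record layout, the `search` signature, and the substring-based text match are all invented for illustration, and a production system would use a vector index and a real full-text engine.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query_vector, text=None, filters=None, face_id=None, top_k=5):
    """Filter by metadata, face, and text, then rank survivors by vector similarity."""
    hits = []
    for rec in records:
        if filters and any(rec["metadata"].get(k) != v for k, v in filters.items()):
            continue
        if face_id and face_id not in rec["face_ids"]:
            continue
        searchable = (rec["transcript"] + " " + rec["ocr"]).lower()
        if text and text.lower() not in searchable:
            continue
        hits.append((cosine(query_vector, rec["embedding"]), rec))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [rec["id"] for _, rec in hits[:top_k]]


records = [
    {"id": "a", "embedding": [1.0, 0.0], "metadata": {"library": "sports"},
     "face_ids": ["face-1"], "transcript": "welcome to the final", "ocr": "SCORE 2-1"},
    {"id": "b", "embedding": [0.6, 0.8], "metadata": {"library": "sports"},
     "face_ids": [], "transcript": "halftime interview", "ocr": ""},
    {"id": "c", "embedding": [0.0, 1.0], "metadata": {"library": "news"},
     "face_ids": ["face-1"], "transcript": "final results tonight", "ocr": ""},
]

print(search(records, [1.0, 0.0], filters={"library": "sports"}))
print(search(records, [1.0, 0.0], text="final", filters={"library": "sports"}))
```

Applying the exact-match conditions first keeps the similarity ranking from surfacing visually close frames that fail a hard requirement such as a specific face or library.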

Audit Trail

Batch Processing

Every pipeline step is logged — from Mux webhook receipt through extraction completion — providing full observability and compliance lineage.
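
One way to picture such a trail is as an append-only event log keyed by asset, which can then be checked for gaps. The step names, the `AuditTrail` class, and its methods below are hypothetical and exist only to illustrate the idea; they are not Mixpeek's logging schema.

```python
import json
import time

# Assumed step names, in pipeline order.
PIPELINE_STEPS = ["webhook_received", "asset_ingested", "decomposed", "extracted", "indexed"]


class AuditTrail:
    """Append-only log of pipeline events, one entry per step per asset."""

    def __init__(self):
        self._events = []

    def record(self, asset_id: str, step: str, status: str = "ok") -> None:
        if step not in PIPELINE_STEPS:
            raise ValueError(f"unknown pipeline step: {step}")
        self._events.append(
            {"asset_id": asset_id, "step": step, "status": status, "ts": time.time()}
        )

    def missing_steps(self, asset_id: str) -> list:
        """Steps that have not yet completed successfully for this asset."""
        done = {
            e["step"] for e in self._events
            if e["asset_id"] == asset_id and e["status"] == "ok"
        }
        return [step for step in PIPELINE_STEPS if step not in done]

    def export(self) -> str:
        """Serialize the log as JSON Lines for an external audit store."""
        return "\n".join(json.dumps(e) for e in self._events)


trail = AuditTrail()
for step in ["webhook_received", "asset_ingested", "decomposed"]:
    trail.record("asset-123", step)
print(trail.missing_steps("asset-123"))
```

A gap check like `missing_steps` is what turns a log into evidence of coverage: an asset counts as fully audited only when the list comes back empty.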

Mux Integration Deep Dive

Selective sync lets you control exactly which Mux assets flow into Mixpeek using metadata filters and passthrough flags. When a video is uploaded to Mux with the right metadata, a webhook fires and Mixpeek pulls the asset automatically. RAW formats (RED R3D, ARRI RAW) are converted via custom plugins before extraction. The pipeline decomposes each video into scene compositions, detected objects, recognized faces, on-screen text, and transcribed speech — then indexes everything into a visual search retriever with feature search, face identity, and full-text stages. An audit trail tracks every step from ingest to searchable index.
