Video understanding demonstrates the full warehouse flow: decompose video into scenes, faces, and speech; store features across tiers; and reassemble answers through retrieval pipelines.
How It Works
When you ingest a video, Mixpeek runs a multi-stage pipeline:
- Chunking — Videos are split into segments using scene detection, silence detection, or fixed intervals
- Parallel Extraction — Multiple extractors run concurrently:
  - Transcription: Whisper converts speech to text with timestamps
  - Visual Embeddings: A multimodal model generates embeddings from keyframes
  - Thumbnails: Representative frames are extracted for each segment
- Description & OCR — Gemini generates segment descriptions and extracts on-screen text
- Multi-Vector Indexing — Separate embeddings for transcript and visual content enable hybrid search
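To make the chunking step concrete, here is a minimal sketch of the simplest strategy, fixed-interval segmentation. This is illustrative only, not Mixpeek's implementation; `fixed_interval_chunks` is a hypothetical helper name.

```python
def fixed_interval_chunks(duration: float, interval: float) -> list[tuple[float, float]]:
    """Split a video of `duration` seconds into back-to-back segments
    of at most `interval` seconds each. The final segment may be shorter."""
    chunks = []
    start = 0.0
    while start < duration:
        end = min(start + interval, duration)
        chunks.append((start, end))
        start = end
    return chunks

# A 95-second video at a 30-second interval yields four segments,
# the last one truncated at the video's end.
print(fixed_interval_chunks(95.0, 30.0))
# [(0.0, 30.0), (30.0, 60.0), (60.0, 90.0), (90.0, 95.0)]
```

Scene-detection and silence-based chunking replace the fixed boundary condition with content-aware cut points, but produce the same kind of `(start, end)` segment list.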
At query time, the retriever searches across both visual and transcript embeddings, fusing results to find moments by what’s shown or what’s said.
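The fusion step referenced here (and by the retriever's `"fusion_method": "rrf"` below) is commonly Reciprocal Rank Fusion: each document scores the sum of `1 / (k + rank)` over every ranked list it appears in. A minimal sketch, assuming the standard RRF formula with the conventional constant `k = 60`; the helper name is illustrative, not a Mixpeek internal.

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank),
    where rank is 1-based. Documents appearing high in multiple lists win."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

visual = ["scene_7", "scene_2", "scene_9"]      # ranked by visual similarity
transcript = ["scene_2", "scene_5", "scene_7"]  # ranked by transcript match
# scene_2 ranks high in both lists, so it wins the fused ranking.
print(rrf_fuse([visual, transcript]))
# ['scene_2', 'scene_7', 'scene_5', 'scene_9']
```

RRF is popular for hybrid search precisely because it fuses rankings without needing the two embedding spaces' raw scores to be comparable.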
| Extractor | Outputs |
|---|---|
| `video_extractor@v1` | Scene embeddings, keyframes, timestamps |
| `audio_extractor@v1` | Transcription, speaker diarization |
| `text_extractor@v1` | Text embeddings, OCR from frames |
| `face_extractor@v1` | Face embeddings, bounding boxes |
1. Create a Bucket
POST /v1/buckets

```json
{
  "bucket_name": "video-catalog",
  "schema": {
    "properties": {
      "video_url": { "type": "url", "required": true },
      "title": { "type": "text" },
      "category": { "type": "text" }
    }
  }
}
```
2. Create Collections
For scenes:
POST /v1/collections

```json
{
  "collection_name": "video-scenes",
  "source": { "type": "bucket", "bucket_id": "bkt_videos" },
  "feature_extractor": {
    "feature_extractor_name": "video_extractor",
    "version": "v1",
    "input_mappings": { "video_url": "video_url" },
    "parameters": {
      "scene_detection_threshold": 0.3,
      "keyframe_interval": 30,
      "max_scenes": 100
    },
    "field_passthrough": [
      { "source_path": "title" },
      { "source_path": "category" }
    ]
  }
}
```
For transcripts:
POST /v1/collections

```json
{
  "collection_name": "video-transcripts",
  "source": { "type": "bucket", "bucket_id": "bkt_videos" },
  "feature_extractor": {
    "feature_extractor_name": "audio_extractor",
    "version": "v1",
    "input_mappings": { "audio_url": "video_url" },
    "parameters": {
      "transcription_model": "whisper-large-v3",
      "language": "en",
      "enable_diarization": true
    },
    "field_passthrough": [
      { "source_path": "title" },
      { "source_path": "category" }
    ]
  }
}
```
3. Ingest Videos
POST /v1/buckets/{bucket_id}/objects

```json
{
  "key_prefix": "/marketing/demos",
  "metadata": {
    "title": "Product Launch Q4 2025",
    "category": "marketing"
  },
  "blobs": [
    {
      "property": "video_url",
      "type": "video",
      "url": "s3://my-bucket/demos/product-launch.mp4"
    }
  ]
}
```
4. Process
POST /v1/buckets/{bucket_id}/batches

```json
{ "object_ids": ["obj_video_001"] }
```

POST /v1/buckets/{bucket_id}/batches/{batch_id}/submit
5. Create a Hybrid Retriever
Combine visual and transcript search:
POST /v1/retrievers

```json
{
  "retriever_name": "video-search",
  "collection_ids": ["col_video_scenes", "col_video_transcripts"],
  "input_schema": {
    "properties": {
      "query_text": { "type": "text", "required": true },
      "query_image": { "type": "url" },
      "category": { "type": "text" }
    }
  },
  "stages": [
    {
      "stage_name": "hybrid_search",
      "version": "v1",
      "parameters": {
        "queries": [
          {
            "feature_address": "mixpeek://video_extractor@v1/scene_embedding",
            "input_mapping": { "image": "query_image" },
            "weight": 0.6
          },
          {
            "feature_address": "mixpeek://audio_extractor@v1/transcript_embedding",
            "input_mapping": { "text": "query_text" },
            "weight": 0.4
          }
        ],
        "fusion_method": "rrf",
        "limit": 20
      }
    },
    {
      "stage_name": "filter",
      "version": "v1",
      "parameters": {
        "filters": {
          "field": "metadata.category",
          "operator": "eq",
          "value": "{{inputs.category}}"
        }
      }
    }
  ]
}
```
6. Search
Text query:
POST /v1/retrievers/{retriever_id}/execute

```json
{
  "inputs": {
    "query_text": "someone explaining product features",
    "category": "marketing"
  },
  "limit": 10
}
```
Image query (find similar scenes):
POST /v1/retrievers/{retriever_id}/execute

```json
{
  "inputs": {
    "query_image": "s3://my-bucket/reference-scene.jpg",
    "query_text": "product demonstration"
  },
  "limit": 10
}
```
Moment-Level Search
Filter by timestamp:
POST /v1/retrievers/{retriever_id}/execute

```json
{
  "inputs": { "query_text": "pricing discussion" },
  "filters": {
    "field": "segment_metadata.start_time",
    "operator": "gte",
    "value": 60.0
  }
}
```
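Conceptually, a filter like this walks a dotted field path into the document and applies the named operator to the resolved value. A minimal sketch of that evaluation; the `matches` helper and its operator table are illustrative, not the service's implementation.

```python
# Map operator names to comparison functions; the real API supports more.
OPERATORS = {
    "eq":  lambda field_value, value: field_value == value,
    "gte": lambda field_value, value: field_value >= value,
}

def matches(doc: dict, filt: dict) -> bool:
    """Resolve a dotted field path like 'segment_metadata.start_time',
    then apply the filter's operator to the resolved value."""
    node = doc
    for part in filt["field"].split("."):
        node = node[part]
    return OPERATORS[filt["operator"]](node, filt["value"])

doc = {"segment_metadata": {"start_time": 72.5}}
filt = {"field": "segment_metadata.start_time", "operator": "gte", "value": 60.0}
print(matches(doc, filt))  # True: 72.5 >= 60.0
```

The same mechanism handles the speaker filter below, with `eq` on `metadata.speaker_id` instead of `gte` on a timestamp.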
Speaker-Specific Search
With diarization enabled:
```json
{
  "filters": {
    "field": "metadata.speaker_id",
    "operator": "eq",
    "value": "SPEAKER_001"
  }
}
```
Output Example
Scene document from video_extractor@v1:
```json
{
  "document_id": "doc_scene_123",
  "source_object_id": "obj_video_001",
  "metadata": {
    "title": "Product Launch Q4 2025",
    "scene_index": 3,
    "start_time": 45.2,
    "end_time": 58.7,
    "keyframe_url": "s3://my-bucket/keyframes/scene_003.jpg"
  }
}
```
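The `start_time` and `end_time` fields let a client compute segment durations or build human-readable timestamps for deep links. A small illustrative helper, not part of any Mixpeek SDK:

```python
def to_timestamp(seconds: float) -> str:
    """Format a second offset as HH:MM:SS for display or deep links."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

scene = {"start_time": 45.2, "end_time": 58.7}
print(to_timestamp(scene["start_time"]))                    # 00:00:45
print(round(scene["end_time"] - scene["start_time"], 1))    # 13.5 second scene
```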
Parameters
| Parameter | Effect |
|---|---|
| `scene_detection_threshold` | Lower values produce more scenes (typical range 0.2–0.5) |
| `keyframe_interval` | Seconds between extracted keyframes |
| `max_scenes` | Maximum number of scenes per video |
| `transcription_model` | `whisper-base` (fast) or `whisper-large-v3` (accurate) |
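To see why a lower `scene_detection_threshold` produces more scenes: scene detection typically compares an inter-frame difference score against the threshold and marks a cut wherever the score exceeds it, so lowering the threshold admits more boundaries. A toy sketch under that assumption; the difference scores below are made up for illustration.

```python
def detect_scene_cuts(frame_diffs: list[float], threshold: float) -> list[int]:
    """Return indices of frames whose difference from the previous frame
    exceeds the threshold, i.e. candidate scene boundaries."""
    return [i for i, diff in enumerate(frame_diffs) if diff > threshold]

diffs = [0.05, 0.45, 0.10, 0.25, 0.60]   # invented per-frame change scores
print(detect_scene_cuts(diffs, 0.3))      # [1, 4] -> two cuts
print(detect_scene_cuts(diffs, 0.2))      # [1, 3, 4] -> lower threshold, more cuts
```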