Multimodal Embeddings API
Embeddings are the atoms of the multimodal data warehouse: the features that multi-stage retrieval pipelines query across. Each file is decomposed into dense vectors, and pipelines compose filter, search, rerank, and enrich stages on top to deliver precise results.
What Are Multimodal Embeddings?
Embeddings are dense vector representations that capture the semantic meaning of content. Multimodal embeddings extend this concept across data types, mapping text, images, video, and audio into a shared vector space where similarity reflects meaning.
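In a shared vector space, "similarity reflects meaning" is usually measured with cosine similarity. A minimal, self-contained sketch, using toy 3-dimensional vectors standing in for real model outputs:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors; in practice these come from an embedding model
query_vec = [0.9, 0.1, 0.0]
photo_of_dog = [0.8, 0.2, 0.1]
spreadsheet = [0.0, 0.1, 0.9]

print(cosine_similarity(query_vec, photo_of_dog))  # high: similar meaning
print(cosine_similarity(query_vec, spreadsheet))   # low: unrelated content
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.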
Text Embeddings
Generate dense vector representations of text using models like E5, BGE, and multilingual transformers. Capture semantic meaning for search, classification, and clustering.
Image Embeddings
Encode images into vectors using vision models like CLIP, SigLIP, and domain-specific vision transformers. Enable visual search and image-text matching.
Video Embeddings
Generate embeddings at the frame and scene level from video content. Search within videos by visual content, spoken words, and on-screen text.
Audio Embeddings
Embed audio content including speech, music, and environmental sounds. Combine with transcript embeddings for comprehensive audio understanding.
Supported Embedding Models
Choose from a curated set of production-grade models, or bring your own.
| Model | Modalities | Dimensions | Best For |
|---|---|---|---|
| CLIP ViT-L/14 | Image, Text | 768 | Cross-modal image-text retrieval |
| SigLIP SO400M | Image, Text | 1152 | High-accuracy visual search and classification |
| E5-Large-v2 | Text | 1024 | English text retrieval and semantic search |
| BGE-M3 | Text | 1024 | Multilingual text with dense + sparse vectors |
| DINOv2 | Image | 768 | Visual feature extraction without text alignment |
| Whisper + E5 | Audio | 1024 | Speech content retrieval via transcription embedding |
How It Works
From raw content to searchable embeddings in four steps.
Choose Models
Select embedding models for each modality from Mixpeek's model library, or register your own custom models as feature extractors.
Ingest Content
Upload files to an S3-compatible bucket or send them through the API. Mixpeek automatically routes each file to the appropriate embedding model.
Generate Vectors
Models run on distributed GPU infrastructure, producing embedding vectors for each piece of content with automatic batching and error handling.
Index & Search
Embeddings are stored in Qdrant for fast approximate nearest-neighbor search. Build retrieval pipelines that query across all modalities.
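Conceptually, the search step reduces to nearest-neighbor lookup over stored vectors. A brute-force sketch of what an ANN index accelerates (toy data; Qdrant performs this approximately and at scale):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical index: document id -> embedding vector
index = {
    "doc-1": [0.9, 0.1, 0.0],
    "doc-2": [0.1, 0.9, 0.0],
    "doc-3": [0.7, 0.3, 0.1],
}

def search(query_vec, k=2):
    # Score every stored vector and return the top-k ids by similarity
    scored = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

print(search([1.0, 0.0, 0.0]))  # ['doc-1', 'doc-3']
```

Exhaustive scoring is O(n) per query; approximate indexes trade a small amount of recall for sub-linear lookup, which is what makes retrieval over millions of vectors practical.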
Use Cases
Embeddings power a wide range of AI applications across industries.
Semantic Search
Search by meaning rather than keywords. Find relevant content even when the query uses different terminology than the source material.
Cross-Modal Retrieval
Query in one modality and retrieve results in another. Search video with text, find images with audio descriptions, or match documents to visual content.
Duplicate Detection
Identify near-duplicate content across your corpus by comparing embedding similarity. Works across modalities and detects semantic duplicates, not just pixel-identical copies.
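The idea above can be sketched as a pairwise similarity check. The vectors and the 0.95 threshold here are illustrative; in practice the threshold is tuned per model and domain:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy embeddings; a real corpus would hold model-generated vectors
vectors = {
    "asset-a": [1.0, 0.0],
    "asset-b": [0.99, 0.14],  # nearly identical direction to asset-a
    "asset-c": [0.0, 1.0],
}

def near_duplicates(vectors, threshold=0.95):
    # Compare every pair; pairs above the threshold are likely duplicates
    ids = list(vectors)
    return [
        (ids[i], ids[j])
        for i in range(len(ids))
        for j in range(i + 1, len(ids))
        if cosine(vectors[ids[i]], vectors[ids[j]]) >= threshold
    ]

print(near_duplicates(vectors))  # [('asset-a', 'asset-b')]
```

Pairwise comparison is quadratic; at corpus scale the same check is typically run through the vector index rather than all-pairs.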
Content Classification
Classify content into categories using embedding similarity to reference examples. Enable zero-shot classification without collecting labeled training data for each category.
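Zero-shot classification by reference similarity can be sketched as "assign the label whose reference embedding is nearest". The category names and vectors below are hypothetical:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical reference embeddings, one per category
references = {
    "invoice":  [0.9, 0.1, 0.0],
    "photo":    [0.0, 0.9, 0.3],
    "contract": [0.7, 0.0, 0.6],
}

def classify(vec):
    # Assign the category whose reference embedding is most similar
    return max(references, key=lambda label: cosine(vec, references[label]))

print(classify([0.85, 0.05, 0.1]))  # 'invoice'
```

Adding a category is just adding a reference embedding; no retraining or labeled dataset is required.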
Recommendation Systems
Build content recommendations by finding embeddings similar to user interaction history. Works across content types for multimodal recommendation.
RAG Applications
Power retrieval-augmented generation by embedding your knowledge base and retrieving relevant context for LLM prompts across text, images, and documents.
Simple API Integration
Generate and search embeddings across modalities with a few lines of code.
```python
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Generate embeddings for a text query
text_embedding = client.embed.text(
    model="e5-large-v2",
    input="quarterly revenue growth in emerging markets"
)

# Generate embeddings for an image
image_embedding = client.embed.image(
    model="clip-vit-l-14",
    input="s3://product-images/catalog/item-4821.jpg"
)

# Search across all modalities with text
results = client.retrievers.search(
    retriever_id="multimodal-index",
    queries=[
        {
            "type": "text",
            "value": "product packaging with sustainability labels",
            "modalities": ["text", "image", "video"]
        }
    ],
    limit=20
)

# Compare embedding similarity directly
similarity = client.embed.compare(
    embedding_a=text_embedding.vector,
    embedding_b=image_embedding.vector,
    metric="cosine"
)
print(f"Cross-modal similarity: {similarity:.4f}")
```

Frequently Asked Questions
What are multimodal embeddings?
Multimodal embeddings are vector representations that capture the semantic meaning of content across different data types -- text, images, video, and audio. By mapping diverse content into a shared vector space, embeddings enable similarity search, cross-modal retrieval, and AI applications that understand meaning rather than just matching keywords or pixels.
What embedding models does Mixpeek support?
Mixpeek supports a range of embedding models for different modalities: CLIP and SigLIP for vision-language alignment, E5 and BGE for text embedding, DINOv2 for visual features, and Whisper-based pipelines for audio content. You can also register custom models through the plugin system to use proprietary or fine-tuned models.
Can I use my own embedding models with Mixpeek?
Yes. Mixpeek's plugin system lets you register custom feature extractors that call any model endpoint. Define the input/output schema including vector dimensions, and Mixpeek handles orchestration, batching, retries, and indexing. This works with HuggingFace models, custom PyTorch endpoints, or any HTTP-based inference service.
How do cross-modal embeddings work?
Cross-modal embeddings are produced by models trained with contrastive learning objectives that align representations from different modalities. For example, CLIP and SigLIP learn to place matching image-text pairs close together in the same vector space. This means a text query vector can be compared directly against image vectors to find visually matching content, enabling cross-modal retrieval.
What vector dimensions does Mixpeek support?
Mixpeek supports arbitrary embedding dimensions -- whatever your models produce. Common dimensions include 384, 512, 768, 1024, and 1152 depending on the model. The system stores vectors in Qdrant, which supports dense, sparse, and multi-vector representations with configurable distance metrics.
How does Mixpeek handle embedding generation at scale?
Mixpeek uses Ray for distributed model inference across GPU workers. When you trigger batch processing, the engine distributes embedding generation across available compute with automatic batching, load balancing, and fault recovery. This handles millions of documents with progress tracking and configurable concurrency limits.
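The batching step can be sketched in isolation: split a large input list into fixed-size batches that workers then process independently. This is a pure-Python illustration, not the Ray implementation:

```python
def batched(items, batch_size):
    # Yield consecutive fixed-size slices of the input;
    # the final batch may be smaller than batch_size
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = [f"doc-{i}" for i in range(10)]
batches = list(batched(docs, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch becomes an independent unit of work, which is what makes load balancing and per-batch retry on failure straightforward.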
Can I store multiple embedding types per document?
Yes. Mixpeek supports named vectors in Qdrant, allowing you to store multiple embedding representations per document. For example, a document can have both a text embedding and a visual embedding (from a document page image), and your retrieval pipeline can query either or both.
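The named-vector idea can be sketched in plain Python: each document carries several embeddings keyed by name, and a query targets one vector space at a time. The document ids and vector names here are hypothetical:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Each document stores multiple named embeddings
documents = {
    "report-1": {"text": [0.9, 0.1], "image": [0.2, 0.8]},
    "report-2": {"text": [0.1, 0.9], "image": [0.9, 0.2]},
}

def search(query_vec, vector_name):
    # Query a single named vector space and return the best match
    return max(documents, key=lambda d: cosine(query_vec, documents[d][vector_name]))

print(search([1.0, 0.0], vector_name="text"))   # 'report-1'
print(search([1.0, 0.0], vector_name="image"))  # 'report-2'
```

The same query vector can rank documents differently depending on which named space it is compared against, which is why storing both representations is useful.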
How do I choose the right embedding model for my use case?
The choice depends on your modalities and use case. For text-only search, E5 or BGE models offer strong performance. For cross-modal image-text retrieval, CLIP or SigLIP is recommended. For visual-only similarity, DINOv2 provides excellent features. Mixpeek makes it easy to test multiple models by creating separate collections with different extractors and comparing retrieval quality.
