Multimodal Embeddings API
Embeddings are the atoms of the multimodal data warehouse: the features that multi-stage retrieval pipelines query across. Each file is decomposed into dense vectors, and pipelines compose filter, search, rerank, and enrich stages on top to deliver precise results.
What Are Multimodal Embeddings?
Embeddings are dense vector representations that capture the semantic meaning of content. Multimodal embeddings extend this concept across data types, mapping text, images, video, and audio into a shared vector space where similarity reflects meaning.
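In a shared vector space, "similarity reflects meaning" is usually measured with cosine similarity. A minimal, self-contained sketch, using toy 3-dimensional vectors standing in for real model outputs:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors; in practice these come from an embedding model
query_vec = [0.9, 0.1, 0.0]
photo_of_dog = [0.8, 0.2, 0.1]
spreadsheet = [0.0, 0.1, 0.9]

print(cosine_similarity(query_vec, photo_of_dog))  # high: similar meaning
print(cosine_similarity(query_vec, spreadsheet))   # low: unrelated content
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.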
Text Embeddings
Generate dense vector representations of text using models like E5, BGE, and multilingual transformers. Capture semantic meaning for search, classification, and clustering.
Image Embeddings
Encode images into vectors using vision models like CLIP, SigLIP, and domain-specific vision transformers. Enable visual search and image-text matching.
Video Embeddings
Generate embeddings at the frame and scene level from video content. Search within videos by visual content, spoken words, and on-screen text.
Audio Embeddings
Embed audio content including speech, music, and environmental sounds. Combine with transcript embeddings for comprehensive audio understanding.
Supported Embedding Models
Choose from a curated set of production-grade models, or bring your own.
| Model | Modalities | Dimensions | Best For |
|---|---|---|---|
| CLIP ViT-L/14 | Image, Text | 768 | Cross-modal image-text retrieval |
| SigLIP SO400M | Image, Text | 1152 | High-accuracy visual search and classification |
| E5-Large-v2 | Text | 1024 | English text retrieval and semantic search |
| BGE-M3 | Text | 1024 | Multilingual text with dense + sparse vectors |
| DINOv2 | Image | 768 | Visual feature extraction without text alignment |
| Whisper + E5 | Audio | 1024 | Speech content retrieval via transcription embedding |
How It Works
From raw content to searchable embeddings in four steps.
Choose Models
Select embedding models for each modality from Mixpeek's model library, or register your own custom models as feature extractors.
Ingest Content
Upload files to an S3-compatible bucket or send them through the API. Mixpeek automatically routes each file to the appropriate embedding model.
Generate Vectors
Models run on distributed GPU infrastructure, producing embedding vectors for each piece of content with automatic batching and error handling.
Index & Search
Embeddings are stored in Qdrant for fast approximate nearest-neighbor search. Build retrieval pipelines that query across all modalities.
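Conceptually, the search step reduces to nearest-neighbor lookup over stored vectors. A brute-force sketch of what an ANN index accelerates (toy data; Qdrant performs this approximately and at scale):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical index: document id -> embedding vector
index = {
    "doc-1": [0.9, 0.1, 0.0],
    "doc-2": [0.1, 0.9, 0.0],
    "doc-3": [0.7, 0.3, 0.1],
}

def search(query_vec, k=2):
    # Score every stored vector and return the top-k ids by similarity
    scored = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

print(search([1.0, 0.0, 0.0]))  # ['doc-1', 'doc-3']
```

Exhaustive scoring is O(n) per query; approximate indexes trade a small amount of recall for sub-linear lookup, which is what makes retrieval over millions of vectors practical.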
Use Cases
Embeddings power a wide range of AI applications across industries.
Semantic Search
Search by meaning rather than keywords. Find relevant content even when the query uses different terminology than the source material.
Cross-Modal Retrieval
Query in one modality and retrieve results in another. Search video with text, find images with audio descriptions, or match documents to visual content.
Duplicate Detection
Identify near-duplicate content across your corpus by comparing embedding similarity. Works across modalities and detects semantic duplicates, not just pixel-identical copies.
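The idea above can be sketched as a pairwise similarity check. The vectors and the 0.95 threshold here are illustrative; in practice the threshold is tuned per model and domain:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy embeddings; a real corpus would hold model-generated vectors
vectors = {
    "asset-a": [1.0, 0.0],
    "asset-b": [0.99, 0.14],  # nearly identical direction to asset-a
    "asset-c": [0.0, 1.0],
}

def near_duplicates(vectors, threshold=0.95):
    # Compare every pair; pairs above the threshold are likely duplicates
    ids = list(vectors)
    return [
        (ids[i], ids[j])
        for i in range(len(ids))
        for j in range(i + 1, len(ids))
        if cosine(vectors[ids[i]], vectors[ids[j]]) >= threshold
    ]

print(near_duplicates(vectors))  # [('asset-a', 'asset-b')]
```

Pairwise comparison is quadratic; at corpus scale the same check is typically run through the vector index rather than all-pairs.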
Content Classification
Classify content into categories using embedding similarity to reference examples. Enable zero-shot classification without collecting labeled training data for each category.
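Zero-shot classification by reference similarity can be sketched as "assign the label whose reference embedding is nearest". The category names and vectors below are hypothetical:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical reference embeddings, one per category
references = {
    "invoice":  [0.9, 0.1, 0.0],
    "photo":    [0.0, 0.9, 0.3],
    "contract": [0.7, 0.0, 0.6],
}

def classify(vec):
    # Assign the category whose reference embedding is most similar
    return max(references, key=lambda label: cosine(vec, references[label]))

print(classify([0.85, 0.05, 0.1]))  # 'invoice'
```

Adding a category is just adding a reference embedding; no retraining or labeled dataset is required.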
Recommendation Systems
Build content recommendations by finding embeddings similar to user interaction history. Works across content types for multimodal recommendation.
RAG Applications
Power retrieval-augmented generation by embedding your knowledge base and retrieving relevant context for LLM prompts across text, images, and documents.
Simple API Integration
Generate and search embeddings across modalities with a few lines of code.
```python
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Generate embeddings for a text query
text_embedding = client.embed.text(
    model="e5-large-v2",
    input="quarterly revenue growth in emerging markets"
)

# Generate embeddings for an image
image_embedding = client.embed.image(
    model="clip-vit-l-14",
    input="s3://product-images/catalog/item-4821.jpg"
)

# Search across all modalities with text
results = client.retrievers.search(
    retriever_id="multimodal-index",
    queries=[
        {
            "type": "text",
            "value": "product packaging with sustainability labels",
            "modalities": ["text", "image", "video"]
        }
    ],
    limit=20
)

# Compare embedding similarity directly
similarity = client.embed.compare(
    embedding_a=text_embedding.vector,
    embedding_b=image_embedding.vector,
    metric="cosine"
)
print(f"Cross-modal similarity: {similarity:.4f}")
```

Frequently Asked Questions
What are multimodal embeddings?
Multimodal embeddings are vector representations that capture the semantic meaning of content across different data types -- text, images, video, and audio. By mapping diverse content into a shared vector space, embeddings enable similarity search, cross-modal retrieval, and AI applications that understand meaning rather than just matching keywords or pixels.
What embedding models does Mixpeek support?
Mixpeek supports a range of embedding models for different modalities: CLIP and SigLIP for vision-language alignment, E5 and BGE for text embedding, DINOv2 for visual features, and Whisper-based pipelines for audio content. You can also register custom models through the plugin system to use proprietary or fine-tuned models.
Can I use my own embedding models with Mixpeek?
Yes. Mixpeek's plugin system lets you register custom feature extractors that call any model endpoint. Define the input/output schema including vector dimensions, and Mixpeek handles orchestration, batching, retries, and indexing. This works with HuggingFace models, custom PyTorch endpoints, or any HTTP-based inference service.
How do cross-modal embeddings work?
Cross-modal embeddings are produced by models trained with contrastive learning objectives that align representations from different modalities. For example, CLIP and SigLIP learn to place matching image-text pairs close together in the same vector space. This means a text query vector can be compared directly against image vectors to find visually matching content, enabling cross-modal retrieval.
What vector dimensions does Mixpeek support?
Mixpeek supports arbitrary embedding dimensions -- whatever your models produce. Common dimensions include 384, 512, 768, 1024, and 1152 depending on the model. The system stores vectors in Qdrant, which supports dense, sparse, and multi-vector representations with configurable distance metrics.
How does Mixpeek handle embedding generation at scale?
Mixpeek uses Ray for distributed model inference across GPU workers. When you trigger batch processing, the engine distributes embedding generation across available compute with automatic batching, load balancing, and fault recovery. This handles millions of documents with progress tracking and configurable concurrency limits.
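The batching step can be sketched in isolation: split a large input list into fixed-size batches that workers then process independently. This is a pure-Python illustration, not the Ray implementation:

```python
def batched(items, batch_size):
    # Yield consecutive fixed-size slices of the input;
    # the final batch may be smaller than batch_size
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = [f"doc-{i}" for i in range(10)]
batches = list(batched(docs, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch becomes an independent unit of work, which is what makes load balancing and per-batch retry on failure straightforward.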
Can I store multiple embedding types per document?
Yes. Mixpeek supports named vectors in Qdrant, allowing you to store multiple embedding representations per document. For example, a document can have both a text embedding and a visual embedding (from a document page image), and your retrieval pipeline can query either or both.
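The named-vector idea can be sketched in plain Python: each document carries several embeddings keyed by name, and a query targets one vector space at a time. The document ids and vector names here are hypothetical:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Each document stores multiple named embeddings
documents = {
    "report-1": {"text": [0.9, 0.1], "image": [0.2, 0.8]},
    "report-2": {"text": [0.1, 0.9], "image": [0.9, 0.2]},
}

def search(query_vec, vector_name):
    # Query a single named vector space and return the best match
    return max(documents, key=lambda d: cosine(query_vec, documents[d][vector_name]))

print(search([1.0, 0.0], vector_name="text"))   # 'report-1'
print(search([1.0, 0.0], vector_name="image"))  # 'report-2'
```

The same query vector can rank documents differently depending on which named space it is compared against, which is why storing both representations is useful.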
How do I choose the right embedding model for my use case?
The choice depends on your modalities and use case. For text-only search, E5 or BGE models offer strong performance. For cross-modal image-text retrieval, CLIP or SigLIP is recommended. For visual-only similarity, DINOv2 provides excellent features. Mixpeek makes it easy to test multiple models by creating separate collections with different extractors and comparing retrieval quality.
