
    CLIP-ViT-bigG-14-laion2B-39B-b160k

    by laion

    Open-source CLIP trained on 2B image-text pairs at giant scale

    890K downloads/month
    1.8B parameters
    Identifiers
    Model ID
    laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
    Feature URI
    mixpeek://image_extractor@v1/laion_openclip_bigG_v1

    Overview

    OpenCLIP is the open-source reproduction of CLIP by the LAION/ML Foundations community. This ViT-bigG/14 variant was trained on LAION-2B (2 billion image-text pairs), achieving up to 85.4% ImageNet zero-shot accuracy — surpassing OpenAI's original CLIP.

    On Mixpeek, OpenCLIP provides the highest-accuracy open-weight visual embeddings for text-to-image and image-to-image retrieval at scale.
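Because text and image embeddings live in one shared vector space, cross-modal retrieval reduces to ranking by cosine similarity. A minimal sketch of that scoring step (toy 4-dim vectors stand in for the model's 1280-dim output; the file names are hypothetical):

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Toy 4-dim embeddings standing in for OpenCLIP's 1280-dim vectors.
const textQuery = [0.9, 0.1, 0.0, 0.4];
const images: Record<string, number[]> = {
  "red-sneaker.jpg": [0.8, 0.2, 0.1, 0.5],
  "blue-couch.jpg": [0.1, 0.9, 0.7, 0.0],
};

// Rank candidate images by similarity to the text query, highest first.
const ranked = Object.entries(images)
  .map(([id, v]) => ({ id, score: cosine(textQuery, v) }))
  .sort((a, b) => b.score - a.score);
```

In production the ranking runs inside a vector index rather than a loop, but the score being optimized is the same.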

    Architecture

    Vision Transformer (ViT-bigG/14) with ~1.8B vision parameters, trained with contrastive learning on the LAION-2B dataset for a total of 39B samples seen. Produces 1280-dimensional embeddings projected into a shared vision-text space.
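The contrastive objective can be sketched as a temperature-scaled softmax over text-image similarities: each text should assign most probability mass to its paired image. The numbers below are toy stand-ins; the logit scale of 100 approximates CLIP's learned temperature:

```typescript
// Numerically stable softmax over a vector of logits.
function softmax(logits: number[]): number[] {
  const m = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Cosine similarities of one text embedding to 3 candidate images
// (index 0 is the true pair), scaled by a learned temperature.
const sims = [0.31, 0.12, 0.05];
const logitScale = 100;
const probs = softmax(sims.map(s => s * logitScale));
// After scaling, the matching pair dominates the distribution.
```

Training minimizes cross-entropy against the true pairing in both directions (text-to-image and image-to-text).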

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Ingest an image and embed it with the OpenCLIP ViT-bigG/14 extractor.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/product.jpg" },
      feature_extractors: [{
        name: "image_embedding",
        version: "v1",
        params: { model_id: "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k" }
      }]
    });

    Capabilities

    • 85.4% ImageNet zero-shot accuracy
    • Trained on 2B open image-text pairs
    • 1280-dimensional dense embeddings
    • Strongest open-weight CLIP variant
    • Supports both ViT and ConvNeXt backbones

    Use Cases on Mixpeek

    • Large-scale visual search with maximum accuracy
    • Cross-modal retrieval across image and text
    • E-commerce product similarity and discovery
    • Foundation model for downstream vision-language tasks

    Specification

    Framework: HF
    Organization: laion
    Feature: Visual Embeddings
    Output: 1280-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 1.8B
    License: MIT
    Downloads/mo: 890K
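For vector search, embeddings are commonly L2-normalized before indexing so that a plain dot product equals cosine similarity, which many vector stores compute fastest. A minimal sketch of that preprocessing step (a toy 2-dim vector stands in for the 1280-dim output):

```typescript
// L2-normalize an embedding so that dot product == cosine similarity.
function l2normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map(x => x / norm);
}

const raw = [3, 4];              // toy vector; real output is 1280-dim
const unit = l2normalize(raw);   // unit-length version of raw
```

Whether normalization is needed depends on the index's distance metric; with a cosine metric the store typically normalizes internally.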

    Research Paper

    Reproducible Scaling Laws for Contrastive Language-Image Learning

    arxiv.org

    Build a pipeline with CLIP-ViT-bigG-14-laion2B-39B-b160k

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
