dinov3-large
by facebook
Next-generation self-supervised vision model with Gram anchoring and 6.7B scaling
facebook/dinov3-large
mixpeek://image_extractor@v1/facebook_dinov3_large_v1
Overview
DINOv3 is Meta AI's successor to DINOv2, introducing Gram anchoring to counter the degradation of dense features over long training schedules. It scales up to 6.7B parameters (ViT-7B) and is trained on 1.7B web images plus 493M satellite images, making it one of the most versatile vision foundation models available.
On Mixpeek, DINOv3 delivers state-of-the-art visual features for tasks ranging from classification and segmentation to satellite/aerial imagery analysis, all without fine-tuning.
Architecture
Vision Transformer with patch size 16, scaling from ViT-S (21M params) to ViT-7B (6.7B params). Gram anchoring stabilizes dense features during extended training, and the large teacher is also distilled into smaller ViT and ConvNeXt backbones. Supports flexible input resolution and post-hoc text alignment.
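Because the patch size is 16, the number of dense feature tokens grows with input resolution. A minimal sketch of that arithmetic (illustrative helper, not part of any SDK; assumes the resolution is divisible by the patch size, as with DINOv3's flexible-resolution inference):

```typescript
// Compute the dense feature grid for a ViT with a given patch size.
function patchGrid(height: number, width: number, patchSize = 16) {
  const rows = Math.floor(height / patchSize);
  const cols = Math.floor(width / patchSize);
  return { rows, cols, tokens: rows * cols };
}

// At 224x224, a 16-pixel patch yields a 14x14 grid of 196 patch tokens.
console.log(patchGrid(224, 224)); // { rows: 14, cols: 14, tokens: 196 }
// Doubling the resolution quadruples the dense tokens (28x28 = 784).
console.log(patchGrid(448, 448));
```

This is why higher-resolution inputs give finer dense features for segmentation at a quadratic cost in tokens.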
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest an image and embed it with DINOv3-large.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/satellite.tiff" },
  feature_extractors: [{
    name: "image_embedding",
    version: "v1",
    params: { model_id: "facebook/dinov3-large" }
  }]
});

Capabilities
- Gram anchoring for stable dense feature training
- Scales up to 6.7B parameters (ViT-7B)
- Trained on 1.7B web + 493M satellite images
- ViT and ConvNeXt backbone variants
- Multi-domain: natural images and satellite/aerial imagery
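Embeddings produced by an extractor like this are typically compared with cosine similarity. A self-contained sketch of that comparison (independent of the Mixpeek SDK; the vectors are toy values, not real DINOv3 outputs):

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 4-dimensional vectors standing in for image embeddings.
const query = [0.1, 0.3, 0.5, 0.2];
const candidate = [0.1, 0.3, 0.5, 0.2];
console.log(cosineSimilarity(query, candidate)); // ≈ 1 for identical vectors
```

Scores near 1 indicate visually similar images; retrieval stages rank candidates by this score.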
Use Cases on Mixpeek
Specification
Research Paper
DINOv3
arxiv.org
Build a pipeline with dinov3-large
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
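A combined pipeline might be declared like this. The overall shape, the second extractor name, and the retrieval fields are illustrative assumptions, not the documented Mixpeek schema; only the `image_embedding` entry mirrors the ingest call above:

```typescript
// Hypothetical pipeline definition pairing DINOv3 with a second
// extractor and a retrieval stage for end-to-end search.
const pipeline = {
  collection_id: "my-collection",
  feature_extractors: [
    {
      name: "image_embedding",
      version: "v1",
      params: { model_id: "facebook/dinov3-large" }
    },
    // Hypothetical second extractor running in the same pass.
    { name: "text_extractor", version: "v1", params: {} }
  ],
  // Retrieval stage reusing the DINOv3 embeddings for nearest-neighbor search.
  retrieval: { stage: "knn", metric: "cosine", top_k: 10 }
};

console.log(pipeline.feature_extractors.length); // 2
```

Running both extractors in one ingest pass avoids re-fetching the source asset for each feature.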