dinov2-large
by facebook
Self-supervised vision foundation model producing all-purpose visual features
facebook/dinov2-large
mixpeek://image_extractor@v1/facebook_dinov2_large_v1
Overview
DINOv2 is a self-supervised vision foundation model from Meta AI that learns robust visual features without any labels. Trained on a curated dataset of 142M images (LVD-142M) using a combination of DINO and iBOT objectives, it produces dense features that work across image distributions and tasks without fine-tuning.
On Mixpeek, DINOv2 provides high-quality visual embeddings for similarity search, classification, and dense prediction tasks. Its features are especially strong for fine-grained visual understanding.
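Similarity search over embeddings like these usually ranks items by cosine similarity. The following is a minimal sketch, assuming the embeddings are already available as plain number arrays. The helper names `cosineSimilarity` and `rankBySimilarity` are illustrative and are not part of the Mixpeek SDK.

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Illustrative helper; not a Mixpeek SDK function.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same dimensionality");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat similarity with a zero vector as 0 to avoid dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return candidate indices ordered from most to least similar to the query.
function rankBySimilarity(query: number[], candidates: number[][]): number[] {
  return candidates
    .map((c, index) => ({ index, score: cosineSimilarity(query, c) }))
    .sort((x, y) => y.score - x.score)
    .map((r) => r.index);
}
```

In practice a vector index would replace the linear scan in `rankBySimilarity`; the sketch only shows the scoring rule.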
Architecture
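Vision Transformer (ViT-L/14) with 24 layers, a 1024-dimensional hidden size, and 16 attention heads, operating on 14×14-pixel patches. Trained via self-distillation from a 1B-parameter ViT-g teacher. Register tokens, which fix attention artifacts in feature maps, are provided by the separately released with-registers DINOv2 checkpoints rather than by facebook/dinov2-large itself.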
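To make the "/14" in ViT-L/14 concrete, the sketch below computes how many patch tokens an image yields. It assumes the image has already been resized so that both sides are multiples of 14; the function name `tokenShape` is hypothetical, and the count covers only the patch tokens plus one class token.

```typescript
const PATCH_SIZE = 14;   // side length of one patch in pixels (ViT-L/14)
const HIDDEN_DIM = 1024; // feature size of each token

interface TokenShape {
  rows: number;    // patches along the height
  cols: number;    // patches along the width
  patches: number; // rows * cols
  tokens: number;  // patches plus one class token
  dim: number;     // per-token feature size
}

// Hypothetical helper: derive the token grid for a pre-resized image.
function tokenShape(width: number, height: number): TokenShape {
  if (width <= 0 || height <= 0) {
    throw new Error("Image dimensions must be positive");
  }
  if (width % PATCH_SIZE !== 0 || height % PATCH_SIZE !== 0) {
    throw new Error(`Both sides must be multiples of ${PATCH_SIZE}`);
  }
  const cols = width / PATCH_SIZE;
  const rows = height / PATCH_SIZE;
  const patches = rows * cols;
  return { rows, cols, patches, tokens: patches + 1, dim: HIDDEN_DIM };
}
```

For a 224×224 input this gives a 16×16 grid, that is 256 patch tokens of 1024 dimensions each.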
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/image.jpg" },
  feature_extractors: [{
    name: "image_embedding",
    version: "v1",
    params: { model_id: "facebook/dinov2-large" }
  }]
});
Capabilities
- Self-supervised visual features without any labels
- 1024-dimensional dense embeddings per patch
- Linear-probe classification: up to 87.1% ImageNet accuracy, reported for the larger ViT-g variant
- Strong on depth estimation, segmentation, retrieval
- Register tokens for clean dense feature maps (in the with-registers DINOv2 variants)
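Per-patch embeddings can be reduced to a single image-level vector when only a global descriptor is needed. One common option is mean pooling, sketched below. Whether Mixpeek returns the class token, pooled patches, or both is not stated here, so the input layout (an array of per-patch vectors) and the name `meanPool` are assumptions for illustration.

```typescript
// Average a list of per-patch feature vectors into one vector.
// Assumed input layout: patchFeatures[p][d] is dimension d of patch p.
function meanPool(patchFeatures: number[][]): number[] {
  if (patchFeatures.length === 0) {
    throw new Error("Need at least one patch feature vector");
  }
  const dim = patchFeatures[0].length;
  const pooled: number[] = new Array<number>(dim).fill(0);
  for (const patch of patchFeatures) {
    if (patch.length !== dim) {
      throw new Error("All patch vectors must have the same length");
    }
    for (let d = 0; d < dim; d++) {
      pooled[d] += patch[d];
    }
  }
  // Divide the per-dimension sums by the number of patches.
  return pooled.map((v) => v / patchFeatures.length);
}
```

For dinov2-large each inner vector would have 1024 entries; the pooled result has the same length and can be compared with any standard vector similarity measure.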
Research Paper
DINOv2: Learning Robust Visual Features without Supervision
arxiv.org
Build a pipeline with dinov2-large
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.