    HF | Visual Embeddings | Apache 2.0

    dinov2-large

    by facebook

    Self-supervised vision foundation model producing all-purpose visual features

    2.8M downloads/month
    300M parameters
    Identifiers
    Model ID
    facebook/dinov2-large
    Feature URI
    mixpeek://image_extractor@v1/facebook_dinov2_large_v1
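
The Feature URI above appears to follow a `mixpeek://<extractor>@<version>/<feature>` layout. A minimal sketch of splitting it into parts, assuming that layout holds. The helper `parseFeatureUri` and the field names `extractor`, `version`, and `feature` are hypothetical, inferred from the single example on this page and not taken from the Mixpeek SDK:

```typescript
// Hypothetical helper: splits a Mixpeek-style feature URI into its parts.
// The layout "mixpeek://<extractor>@<version>/<feature>" is an assumption
// inferred from one example, not an official specification.
interface FeatureUriParts {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUriParts {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parts = parseFeatureUri(
  "mixpeek://image_extractor@v1/facebook_dinov2_large_v1"
);
console.log(parts);
```

Failing loudly on an unrecognized URI keeps a mistyped identifier from silently selecting the wrong extractor.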

    Overview

    DINOv2 is a self-supervised vision foundation model from Meta AI that learns robust visual features without any labels. Trained on a curated dataset of 142M images (LVD-142M) using a combination of DINO and iBOT objectives, it produces dense features that work across image distributions and tasks without fine-tuning.

    On Mixpeek, DINOv2 provides high-quality visual embeddings for similarity search, classification, and dense prediction tasks. Its features are especially strong for fine-grained visual understanding.
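
Similarity search over embeddings usually comes down to ranking stored vectors by cosine similarity to a query vector. The sketch below shows that ranking on toy 3-dimensional vectors. It illustrates the general technique only: the names `cosineSimilarity`, `topK`, and `IndexedImage` are hypothetical, and Mixpeek's own retriever may work differently.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // define similarity to a zero vector as 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical record type: an image ID plus its stored embedding.
interface IndexedImage {
  id: string;
  embedding: number[];
}

// Brute-force top-k: score every stored image, sort descending, keep k.
function topK(query: number[], index: IndexedImage[], k: number) {
  return index
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dim vectors stand in for real model embeddings.
const index: IndexedImage[] = [
  { id: "cat-1", embedding: [1, 0, 0] },
  { id: "cat-2", embedding: [0.9, 0.1, 0] },
  { id: "car-1", embedding: [0, 0, 1] },
];
const results = topK([1, 0, 0.1], index, 2);
console.log(results.map((r) => r.id)); // ["cat-1", "cat-2"]
```

Brute-force scoring is fine for small collections. Larger libraries typically rely on an approximate nearest-neighbor index instead.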

    Architecture

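    Vision Transformer (ViT-L/14) with 24 layers, a 1024-dim hidden size, and 16 attention heads, operating on 14×14-pixel patches. Trained via self-distillation from a 1B-parameter ViT-g teacher. Register tokens, which remove attention artifacts from feature maps, belong to the separate with-registers DINOv2 checkpoints and are not part of facebook/dinov2-large.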
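
To make the architecture numbers concrete, the sketch below works out how many tokens the model emits for one image and how wide each attention head is. The patch size (14), hidden size (1024), and head count (16) come from the description above. The 224×224 input resolution and the single class token with no extra tokens are assumptions for illustration, and `tokenCount` is a hypothetical helper.

```typescript
// Tokens produced by a ViT for a square image: one per patch plus one class token.
function tokenCount(imageSize: number, patchSize: number): { patches: number; total: number } {
  if (imageSize % patchSize !== 0) {
    throw new Error("Image size must be a multiple of the patch size");
  }
  const perSide = imageSize / patchSize;
  const patches = perSide * perSide;
  return { patches, total: patches + 1 }; // +1 for the class (CLS) token
}

const hiddenSize = 1024; // from the architecture description
const numHeads = 16;     // from the architecture description
const patchSize = 14;    // the "/14" in ViT-L/14
const imageSize = 224;   // assumed input resolution

const { patches, total } = tokenCount(imageSize, patchSize);
const headDim = hiddenSize / numHeads;
const floatsPerImage = total * hiddenSize;

console.log(patches);        // 256 patch tokens (16 x 16 grid)
console.log(total);          // 257 tokens including CLS
console.log(headDim);        // 64 dimensions per attention head
console.log(floatsPerImage); // 263168 values if every token is kept
```

Keeping every token is only needed for dense tasks such as segmentation. Retrieval usually stores a single pooled vector per image.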

    Mixpeek SDK Integration

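    import { Mixpeek } from "mixpeek";
    
    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Ingest one image and embed it with DINOv2-large.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/image.jpg" },
      feature_extractors: [{
        name: "image_embedding",
        version: "v1",
        // Selects the Hugging Face checkpoint used for embedding.
        params: { model_id: "facebook/dinov2-large" }
      }]
    });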

    Capabilities

    • Self-supervised visual features without any labels
    • 1024-dimensional dense embeddings per patch
    • Linear-probe ImageNet accuracy of 87.1%, reported for the larger ViT-g variant rather than this ViT-L checkpoint
    • Strong on depth estimation, segmentation, retrieval
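    • Cleaner dense feature maps are available from the separate with-registers DINOv2 checkpoints; facebook/dinov2-large itself has no register tokens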
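
Per-patch embeddings are often reduced to a single image-level vector before indexing. The sketch below shows mean pooling followed by L2 normalization on toy 2-dimensional patches. This is one common choice, not necessarily the pooling Mixpeek applies, and `meanPool` and `l2Normalize` are hypothetical helper names.

```typescript
// Average a list of patch embeddings into one vector of the same dimension.
function meanPool(patchEmbeddings: number[][]): number[] {
  if (patchEmbeddings.length === 0) throw new Error("No patch embeddings");
  const dim = patchEmbeddings[0].length;
  const pooled = new Array<number>(dim).fill(0);
  for (const patch of patchEmbeddings) {
    if (patch.length !== dim) throw new Error("Inconsistent embedding dimensions");
    for (let i = 0; i < dim; i++) {
      pooled[i] += patch[i] / patchEmbeddings.length;
    }
  }
  return pooled;
}

// Scale a vector to unit length so dot products equal cosine similarity.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Toy input: two 2-dim "patches" whose mean is [3, 4].
const patchEmbeddings = [
  [3, 0],
  [3, 8],
];
const imageVector = l2Normalize(meanPool(patchEmbeddings));
console.log(imageVector); // approximately [0.6, 0.8]
```

With real model output, the same functions would take one 1024-dimensional array per patch.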

    Use Cases on Mixpeek

    • Visual similarity search across image and video libraries
    • Fine-grained product matching and deduplication
    • Dense feature extraction for segmentation and depth estimation
    • Domain-agnostic visual representation for downstream models

    Specification

    Framework: HF
    Organization: facebook
    Feature: Visual Embeddings
    Output: 1024-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 300M
    License: Apache 2.0
    Downloads/mo: 2.8M

    Research Paper

    DINOv2: Learning Robust Visual Features without Supervision

    arxiv.org

    Build a pipeline with dinov2-large

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
