    HF · Visual Embeddings · Apache 2.0

    siglip2-giant-opt-patch16-384

    by google

    Multilingual vision-language encoder with dense features and localization

    1.2M downloads/month
    1B params
    Identifiers
    Model ID
    google/siglip2-giant-opt-patch16-384
    Feature URI
    mixpeek://image_extractor@v1/google_siglip2_giant_v1

    Overview

    SigLIP 2 extends the sigmoid contrastive objective into a unified training recipe that adds captioning-based pretraining, self-supervised losses, and online data curation. The result is stronger vision-language encoders with significantly improved localization and dense feature quality.

    On Mixpeek, SigLIP 2 provides the strongest zero-shot visual embeddings from Google, achieving 85.0% ImageNet accuracy at the giant scale. Its improved spatial understanding makes it ideal for tasks requiring localization alongside retrieval.

    Architecture

    Vision Transformer (ViT-g) with ~1B parameters at 384px resolution. Combines sigmoid contrastive loss with captioning, self-distillation, and masked prediction objectives. Supports multi-resolution and native aspect ratio inputs.
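    The sigmoid contrastive loss mentioned above can be sketched in a few lines. This is an illustrative toy, not the training code; the logit scale `t` and bias `b` are learned scalars in the real model, and the defaults here are assumptions for demonstration:

```typescript
// Sigmoid contrastive loss (SigLIP-style sketch). Each image-text pair
// (i, j) is an independent binary classification: matched pairs (i === j)
// are positives, all other pairings are negatives.

function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

function sigmoidLoss(
  imgEmb: number[][], // N unit-normalized image embeddings
  txtEmb: number[][], // N unit-normalized text embeddings, paired by index
  t = 10,             // logit scale (learned in practice; assumed value)
  b = -10             // logit bias (learned in practice; assumed value)
): number {
  const n = imgEmb.length;
  let loss = 0;
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      const label = i === j ? 1 : -1; // positive vs. negative pair
      const logit = t * dot(imgEmb[i], txtEmb[j]) + b;
      // -log(sigmoid(label * logit)), written as softplus for clarity
      loss += Math.log(1 + Math.exp(-label * logit));
    }
  }
  return loss / n; // average over images
}
```

    Unlike a softmax contrastive loss, no normalization over the whole batch is needed, which is what lets SigLIP train efficiently at large batch sizes.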

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    // Authenticate with your Mixpeek API key
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Ingest an image and embed it with SigLIP 2 giant
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/image.jpg" },
      feature_extractors: [{
        name: "image_embedding",
        version: "v1",
        params: { model_id: "google/siglip2-giant-opt-patch16-384" }
      }]
    });

    Capabilities

    • 85.0% ImageNet zero-shot accuracy (ViT-g, 384px)
    • Strong localization and dense spatial features
    • Multilingual understanding with de-biasing
    • Multi-resolution and native aspect ratio support
    • Excellent VLM backbone (PaLI, Gemini)

    Use Cases on Mixpeek

    Cross-modal search with multilingual text queries
    Visual grounding and localization tasks
    High-accuracy zero-shot visual classification
    Foundation encoder for vision-language applications
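    Zero-shot classification with these embeddings reduces to a cosine-similarity argmax. A minimal sketch, assuming the image and the class prompts ("a photo of a cat", ...) have already been embedded (e.g. via the extractor above); the labels and vectors are made up for illustration:

```typescript
// Zero-shot classification: pick the class prompt whose embedding is
// most similar (by cosine) to the image embedding.

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  return dot / (Math.hypot(...a) * Math.hypot(...b));
}

function classify(
  imageEmb: number[],
  classEmbs: Record<string, number[]> // label -> prompt embedding
): string {
  let best = "";
  let bestScore = -Infinity;
  for (const [label, emb] of Object.entries(classEmbs)) {
    const score = cosine(imageEmb, emb);
    if (score > bestScore) { bestScore = score; best = label; }
  }
  return best;
}
```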

    Specification

    Framework: HF
    Organization: google
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 1B
    License: Apache 2.0
    Downloads/mo: 1.2M
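    The vector-search retrieval stage listed above amounts to ranking stored embeddings by similarity to a query embedding. A brute-force sketch of that ranking logic (production systems use an approximate nearest-neighbor index instead; the `Doc` shape here is an assumption for illustration):

```typescript
// Rank stored document embeddings by cosine similarity to a query
// embedding and return the top-k ids.

interface Doc { id: string; emb: number[]; }

function cosineSim(a: number[], b: number[]): number {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  return dot / (Math.hypot(...a) * Math.hypot(...b));
}

function topK(query: number[], docs: Doc[], k: number): string[] {
  return docs
    .map(d => ({ id: d.id, score: cosineSim(query, d.emb) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(d => d.id);
}
```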

    Research Paper

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    arxiv.org

    Build a pipeline with siglip2-giant-opt-patch16-384

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
