siglip2-giant-opt-patch16-384
by google
Multilingual vision-language encoder with dense features and localization
Overview
SigLIP 2 combines the sigmoid contrastive objective with captioning-based pretraining, self-supervised losses, and online data curation in a unified training recipe. The result is a stronger vision-language encoder with significantly improved localization and dense feature quality.
On Mixpeek, SigLIP 2 provides the strongest zero-shot visual embeddings from Google, achieving 85.0% ImageNet accuracy at the giant scale. Its improved spatial understanding makes it ideal for tasks requiring localization alongside retrieval.
Architecture
Vision Transformer (ViT-g) with ~1B parameters at 384px resolution. Combines sigmoid contrastive loss with captioning, self-distillation, and masked prediction objectives. Supports multi-resolution and native aspect ratio inputs.
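The sigmoid contrastive objective mentioned above treats every image-text pair in a batch as an independent binary classification rather than a batch-wide softmax. A minimal sketch (toy embeddings; `t` and `b` stand in for the learned temperature and bias scalars of the real model):

```typescript
// Sketch of the sigmoid contrastive objective used by SigLIP / SigLIP 2.
// Assumes unit-normalized image embeddings `imgs` and text embeddings `txts`.

function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// Each (i, j) pair is a binary classification: label +1 on the diagonal
// (matching image-text pair), -1 everywhere else.
function sigmoidContrastiveLoss(
  imgs: number[][],
  txts: number[][],
  t: number, // learned temperature (placeholder value in examples)
  b: number  // learned bias (placeholder value in examples)
): number {
  let loss = 0;
  for (let i = 0; i < imgs.length; i++) {
    for (let j = 0; j < txts.length; j++) {
      const z = i === j ? 1 : -1;
      loss += -Math.log(sigmoid(z * (t * dot(imgs[i], txts[j]) + b)));
    }
  }
  return loss / imgs.length;
}
```

Because each pair is scored independently, the loss decomposes over pairs, which is what lets SigLIP-style training scale batch size without an all-pairs softmax normalization.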
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/image.jpg" },
  feature_extractors: [{
    name: "image_embedding",
    version: "v1",
    params: { model_id: "google/siglip2-giant-opt-patch16-384" }
  }]
});

Capabilities
- 85.0% ImageNet zero-shot accuracy (ViT-g, 384px)
- Strong localization and dense spatial features
- Multilingual understanding with de-biasing
- Multi-resolution and native aspect ratio support
- Excellent VLM backbone (PaLI, Gemini)
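The zero-shot accuracy figure above comes from nearest-neighbor matching between image embeddings and text embeddings of label prompts. A hedged sketch with toy placeholder vectors (real embeddings come from the model's vision tower for images and text tower for prompts such as "a photo of a {label}"):

```typescript
// Zero-shot classification over precomputed embeddings.
// The vectors used here are toy placeholders, not real SigLIP 2 outputs.

function cosine(a: number[], b: number[]): number {
  const d = a.reduce((s, v, i) => s + v * b[i], 0);
  const na = Math.sqrt(a.reduce((s, v) => s + v * v, 0));
  const nb = Math.sqrt(b.reduce((s, v) => s + v * v, 0));
  return d / (na * nb);
}

// Pick the label whose prompt embedding is closest to the image embedding.
function classify(imageEmb: number[], labelEmbs: Map<string, number[]>): string {
  let best = "";
  let bestScore = -Infinity;
  for (const [label, emb] of labelEmbs) {
    const score = cosine(imageEmb, emb);
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}
```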
Use Cases on Mixpeek
Specification
Research Paper
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
arxiv.org

Build a pipeline with siglip2-giant-opt-patch16-384
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
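Combining extractors can be sketched by extending the ingest payload from the SDK integration section above with a second entry. The `text_extraction` name below is illustrative, not a confirmed extractor name; substitute extractors available in your account:

```typescript
type FeatureExtractor = {
  name: string;
  version: string;
  params?: { model_id: string };
};

// Hypothetical multi-extractor payload: SigLIP 2 embeddings plus a second
// (illustrative) extractor processed in one pipeline. Pass this object to
// mx.collections.ingest(...) as in the SDK example above.
const ingestRequest = {
  collection_id: "my-collection",
  source: { url: "https://example.com/image.jpg" },
  feature_extractors: [
    {
      name: "image_embedding",
      version: "v1",
      params: { model_id: "google/siglip2-giant-opt-patch16-384" }
    },
    {
      name: "text_extraction", // illustrative second extractor
      version: "v1"
    }
  ] as FeatureExtractor[]
};
```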
Open Pipeline Builder