siglip2-giant-opt-patch16-384
by google
Multilingual vision-language encoder with dense features and localization
Overview
SigLIP 2 combines the sigmoid contrastive objective with captioning-based pretraining, self-supervised losses, and online data curation in a unified training recipe. The result is a stronger vision-language encoder with significantly improved localization and dense feature quality.
On Mixpeek, SigLIP 2 provides the strongest zero-shot visual embeddings from Google, achieving 85.0% ImageNet accuracy at the giant scale. Its improved spatial understanding makes it ideal for tasks requiring localization alongside retrieval.
Architecture
Vision Transformer (ViT-g) with ~1B parameters at 384px resolution. Combines sigmoid contrastive loss with captioning, self-distillation, and masked prediction objectives. Supports multi-resolution and native aspect ratio inputs.
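The sigmoid contrastive objective mentioned above treats every image-text pair in a batch as an independent binary classification rather than a batch-wide softmax. A minimal sketch (toy embeddings; `t` and `b` stand in for the learned temperature and bias scalars of the real model):

```typescript
// Sketch of the sigmoid contrastive objective used by SigLIP / SigLIP 2.
// Assumes unit-normalized image embeddings `imgs` and text embeddings `txts`.

function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// Each (i, j) pair is a binary classification: label +1 on the diagonal
// (matching image-text pair), -1 everywhere else.
function sigmoidContrastiveLoss(
  imgs: number[][],
  txts: number[][],
  t: number, // learned temperature (placeholder value in examples)
  b: number  // learned bias (placeholder value in examples)
): number {
  let loss = 0;
  for (let i = 0; i < imgs.length; i++) {
    for (let j = 0; j < txts.length; j++) {
      const z = i === j ? 1 : -1;
      loss += -Math.log(sigmoid(z * (t * dot(imgs[i], txts[j]) + b)));
    }
  }
  return loss / imgs.length;
}
```

Because each pair is scored independently, the loss decomposes over pairs, which is what lets SigLIP-style training scale batch size without an all-pairs softmax normalization.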
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/image.jpg" },
  feature_extractors: [{
    name: "image_embedding",
    version: "v1",
    params: { model_id: "google/siglip2-giant-opt-patch16-384" }
  }]
});

Capabilities
- 85.0% ImageNet zero-shot accuracy (ViT-g, 384px)
- Strong localization and dense spatial features
- Multilingual understanding with de-biasing
- Multi-resolution and native aspect ratio support
- Excellent VLM backbone (PaLI, Gemini)
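The zero-shot accuracy figure above comes from nearest-neighbor matching between image embeddings and text embeddings of label prompts. A hedged sketch with toy placeholder vectors (real embeddings come from the model's vision tower for images and text tower for prompts such as "a photo of a {label}"):

```typescript
// Zero-shot classification over precomputed embeddings.
// The vectors used here are toy placeholders, not real SigLIP 2 outputs.

function cosine(a: number[], b: number[]): number {
  const d = a.reduce((s, v, i) => s + v * b[i], 0);
  const na = Math.sqrt(a.reduce((s, v) => s + v * v, 0));
  const nb = Math.sqrt(b.reduce((s, v) => s + v * v, 0));
  return d / (na * nb);
}

// Pick the label whose prompt embedding is closest to the image embedding.
function classify(imageEmb: number[], labelEmbs: Map<string, number[]>): string {
  let best = "";
  let bestScore = -Infinity;
  for (const [label, emb] of labelEmbs) {
    const score = cosine(imageEmb, emb);
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}
```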
Use Cases on Mixpeek
Specification
Research Paper
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
arxiv.org

Build a pipeline with siglip2-giant-opt-patch16-384
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
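Combining extractors can be sketched by extending the ingest payload from the SDK integration section above with a second entry. The `text_extraction` name below is illustrative, not a confirmed extractor name; substitute extractors available in your account:

```typescript
type FeatureExtractor = {
  name: string;
  version: string;
  params?: { model_id: string };
};

// Hypothetical multi-extractor payload: SigLIP 2 embeddings plus a second
// (illustrative) extractor processed in one pipeline. Pass this object to
// mx.collections.ingest(...) as in the SDK example above.
const ingestRequest = {
  collection_id: "my-collection",
  source: { url: "https://example.com/image.jpg" },
  feature_extractors: [
    {
      name: "image_embedding",
      version: "v1",
      params: { model_id: "google/siglip2-giant-opt-patch16-384" }
    },
    {
      name: "text_extraction", // illustrative second extractor
      version: "v1"
    }
  ] as FeatureExtractor[]
};
```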
Open Pipeline Builder