    HF · Scene Captioning · Gemma

    paligemma2-3b-mix-448

    by google

    Versatile 3B vision-language model for captioning, VQA, OCR, and detection

    1.8M downloads/month
    3B params
    Identifiers
    Model ID: google/paligemma2-3b-mix-448
    Feature URI: mixpeek://image_extractor@v1/google_paligemma2_3b_v1

    Overview

    PaliGemma 2 is Google DeepMind's updated vision-language model combining a SigLIP vision encoder with a Gemma 2 language model. The 3B-mix-448 variant is fine-tuned on a diverse mixture of 30+ academic tasks at 448x448 resolution, making it ready to use out of the box for captioning, OCR, visual question answering, object detection, and segmentation.

    On Mixpeek, PaliGemma 2 3B serves as a lightweight visual understanding model for structured extraction tasks. Because it is fine-tuned on a mixture of tasks, it handles work ranging from document OCR to scene captioning without additional training.

    Architecture

    The model pairs a SigLIP vision encoder (ViT-So400m) with a Gemma 2 2B language model. The vision encoder converts a 448x448 image into visual tokens, which are concatenated with the text tokens and passed to the language model. The mix checkpoint is fine-tuned on a mixture of 30+ tasks, and a task-specific text prefix selects which task to perform.
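To make the "images into visual tokens" step concrete, here is a minimal sketch of the token arithmetic. It assumes the SigLIP So400m encoder uses 14x14-pixel patches and emits one token per patch, as described in the PaliGemma papers; the function names are hypothetical and not part of any SDK.

```typescript
// Assumption: 14x14-pixel patches, one visual token per patch.
const PATCH_SIZE = 14;

// Number of visual tokens produced for a square image of the given side length.
function visualTokenCount(imageSide: number, patchSize: number = PATCH_SIZE): number {
  if (imageSide % patchSize !== 0) {
    throw new Error(`image side ${imageSide} is not a multiple of patch size ${patchSize}`);
  }
  const patchesPerSide = imageSide / patchSize;
  return patchesPerSide * patchesPerSide;
}

// Hypothetical helper: length of the sequence the language model sees
// once the visual tokens are concatenated with the text tokens.
function sequenceLength(imageSide: number, textTokens: number): number {
  return visualTokenCount(imageSide) + textTokens;
}

const tokens448 = visualTokenCount(448); // 32 patches per side, 32 * 32 = 1024 tokens
```

Under these assumptions a 448x448 input yields 1024 visual tokens, four times the 256 tokens of a 224x224 input, which is why the higher resolution helps on text-heavy images but costs more compute per image.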

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Create a client; replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one image and run the scene_description extractor with PaliGemma 2.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/document.jpg" },
      feature_extractors: [{
        name: "scene_description",
        version: "v1",
        params: {
          model_id: "google/paligemma2-3b-mix-448"
        }
      }]
    });
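For a catalog of images, the same request body can be generated per URL. The sketch below only builds the payloads, mirroring the shape used in the snippet above; the `IngestRequest` type and `buildIngestRequest` helper are hypothetical names introduced for illustration, not SDK exports.

```typescript
// Hypothetical type mirroring the request shape shown in the SDK snippet.
interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: Array<{
    name: string;
    version: string;
    params: { model_id: string };
  }>;
}

const PALIGEMMA_MODEL_ID = "google/paligemma2-3b-mix-448";

// Build one ingest payload for a single image URL.
function buildIngestRequest(collectionId: string, url: string): IngestRequest {
  return {
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      {
        name: "scene_description",
        version: "v1",
        params: { model_id: PALIGEMMA_MODEL_ID },
      },
    ],
  };
}

// Example: one payload per image in a small catalog.
const requests: IngestRequest[] = [
  "https://example.com/a.jpg",
  "https://example.com/b.jpg",
].map((url) => buildIngestRequest("my-collection", url));
```

Each element of `requests` could then be passed to `mx.collections.ingest(...)` in the same way as the single-image call. Keeping payload construction in a pure function makes it easy to unit-test without network access.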

    Capabilities

    • Multi-task fine-tuning: captioning, VQA, OCR, detection, segmentation
    • 448x448 input resolution for detailed visual understanding
    • Strong performance on text-heavy visual tasks (DocVQA, TextVQA)
    • Fine-tuned on a mixture of 30+ academic tasks, usable out of the box
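The multi-task behavior listed above is driven by a short text prefix that tells the model which task to run. The sketch below builds such prefixes following the prompt conventions of the upstream PaliGemma model card (for example `caption en`, `ocr`, `answer en <question>`, `detect <a> ; <b>`, `segment <a>`). This is an assumption about the underlying model only: the page does not say whether Mixpeek exposes a prompt parameter, and the `Task` type and `buildPrompt` function are hypothetical.

```typescript
// Hypothetical task descriptions; not part of the Mixpeek SDK.
type Task =
  | { kind: "caption"; lang: string }
  | { kind: "ocr" }
  | { kind: "vqa"; lang: string; question: string }
  | { kind: "detect"; objects: string[] }
  | { kind: "segment"; object: string };

// Map a task to the text prefix assumed from the upstream PaliGemma model card.
function buildPrompt(task: Task): string {
  switch (task.kind) {
    case "caption":
      return `caption ${task.lang}`;
    case "ocr":
      return "ocr";
    case "vqa":
      return `answer ${task.lang} ${task.question}`;
    case "detect":
      // Multiple detection targets are separated by " ; ".
      return `detect ${task.objects.join(" ; ")}`;
    case "segment":
      return `segment ${task.object}`;
  }
}

const captionPrompt = buildPrompt({ kind: "caption", lang: "en" });
const detectPrompt = buildPrompt({ kind: "detect", objects: ["car", "person"] });
```

Modeling tasks as a discriminated union lets the compiler check that every task kind has a prefix, so adding a new task without a matching case is a type error rather than a silent runtime bug.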

    Use Cases on Mixpeek

    • Multi-task visual feature extraction in a single compact model pass
    • Document OCR and visual Q&A for mixed-layout content
    • Lightweight scene captioning for large image catalogs

    Benchmarks

    Dataset        | Metric   | Score | Source
    COCO Captions  | CIDEr    | 141.9 | Steiner et al., 2024 (PaliGemma 2 paper)
    VQAv2          | Accuracy | 83.2% | Steiner et al., 2024 (PaliGemma 2 paper)
    TextVQA (448)  | Accuracy | ~73%  | Steiner et al., 2024 (PaliGemma 2 paper)

    Performance

    Input Size: 448×448 px
    GPU Latency: ~20 ms / image (A100)
    GPU Throughput: ~50 images/sec (A100)
    GPU Memory: ~6.2 GB (bf16)
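These figures can be sanity-checked and turned into a rough capacity estimate. The sketch below uses only the numbers listed above plus two assumptions: images are processed one at a time on a single A100, and bf16 stores each parameter in 2 bytes. The function names are hypothetical, and real throughput depends on batching, I/O, and preprocessing.

```typescript
// Figures taken from the Performance table above.
const LATENCY_MS = 20;
const THROUGHPUT_IPS = 50;

// Assumptions: nominal 3B parameters, 2 bytes per parameter in bf16.
const PARAMS = 3e9;
const BYTES_PER_PARAM_BF16 = 2;

// Throughput implied by per-image latency when images run sequentially.
function impliedThroughput(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Hours needed to process a catalog at a steady images-per-second rate.
function hoursForCatalog(images: number, imagesPerSec: number = THROUGHPUT_IPS): number {
  return images / imagesPerSec / 3600;
}

// Memory needed for the weights alone, in GB (10^9 bytes).
function weightMemoryGB(params: number, bytesPerParam: number): number {
  return (params * bytesPerParam) / 1e9;
}

const ips = impliedThroughput(LATENCY_MS);                     // 50 images/sec
const hours = hoursForCatalog(1800000);                        // 10 hours for 1.8M images
const weightsGB = weightMemoryGB(PARAMS, BYTES_PER_PARAM_BF16); // 6 GB
```

The listed latency and throughput are consistent with each other (1000 / 20 = 50), and the weights alone account for about 6 GB of the ~6.2 GB listed, so the remainder is presumably runtime overhead or a parameter count slightly above the nominal 3B.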

    Specification

    Framework: HF
    Organization: google
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 3B
    License: Gemma
    Downloads/mo: 1.8M

    Research Paper

    PaliGemma 2: A Family of Versatile VLMs for Transfer

    arxiv.org

    Build a pipeline with paligemma2-3b-mix-448

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
