paligemma2-3b-mix-448

by google

Versatile 3B vision-language model for captioning, VQA, OCR, and detection

1.8M downloads/month

3B params


Identifiers

Model ID

google/paligemma2-3b-mix-448

Feature URI

mixpeek://image_extractor@v1/google_paligemma2_3b_v1

Overview

PaliGemma 2 is Google DeepMind's updated vision-language model combining a SigLIP vision encoder with a Gemma 2 language model. The 3B-mix-448 variant is fine-tuned on a diverse mixture of 30+ academic tasks at 448x448 resolution, making it ready to use out of the box for captioning, OCR, visual question answering, object detection, and segmentation.

On Mixpeek, PaliGemma 2 3B is a lightweight visual understanding model suited to structured extraction tasks. Because the mix checkpoint is already fine-tuned on diverse tasks, it handles workloads ranging from document OCR to scene captioning without additional training.

Architecture

A SigLIP vision encoder (ViT-So400m) is paired with a Gemma 2 2B language model. The vision encoder processes 448x448 images into visual tokens that are concatenated with text tokens for the language model. The mix checkpoint is fine-tuned on a mixture of 30+ tasks, and a task-specific prefix in the prompt selects which task the model performs.
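To make the token budget concrete, the sketch below estimates how many visual tokens a square input produces. It assumes the SigLIP So400m encoder splits the image into 14x14 pixel patches and emits one token per patch; the patch size and the helper names `visualTokenCount` and `sequenceLength` are illustrative assumptions, not values taken from this page.

```typescript
// Assumption: the SigLIP So400m encoder uses 14x14 pixel patches,
// one visual token per patch. Not stated on this page.
const PATCH_SIZE = 14;

// Number of visual tokens for a square image of the given side length.
function visualTokenCount(resolution: number, patchSize: number = PATCH_SIZE): number {
  if (!Number.isInteger(resolution) || resolution <= 0 || resolution % patchSize !== 0) {
    throw new Error(`resolution must be a positive multiple of ${patchSize}`);
  }
  const patchesPerSide = resolution / patchSize;
  return patchesPerSide * patchesPerSide;
}

// Total sequence length seen by the language model: visual tokens
// concatenated with the text (prompt) tokens.
function sequenceLength(resolution: number, textTokens: number): number {
  return visualTokenCount(resolution) + textTokens;
}

console.log(visualTokenCount(448)); // 1024
console.log(sequenceLength(448, 16)); // 1040
```

Under these assumptions, the 448x448 variant spends four times as many visual tokens per image as a 224x224 input would, which is the cost side of the higher resolution.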

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_description",
    version: "v1",
    parameters: { model_id: "google/paligemma2-3b-mix-448" },
  },
});

Capabilities

Multi-task fine-tuning: captioning, VQA, OCR, detection, segmentation
448x448 input resolution for detailed visual understanding
Strong performance on text-heavy visual tasks (DocVQA, TextVQA)
Fine-tuned on a mixture of 30+ academic tasks, usable out of the box
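Each of the tasks listed above is selected through the text prompt. The following is a minimal sketch of a prompt builder, assuming the prefix conventions commonly documented for PaliGemma mix checkpoints (`caption <lang>`, `ocr`, `answer <lang> <question>`, `detect <object>`, `segment <object>`). The `Task` type and `buildPrompt` helper are hypothetical and are not part of the Mixpeek SDK; how Mixpeek passes prompts to the model is not described here.

```typescript
// Hypothetical task description; not a Mixpeek SDK type.
type Task =
  | { kind: "caption"; lang: string }
  | { kind: "ocr" }
  | { kind: "vqa"; lang: string; question: string }
  | { kind: "detect"; objects: string[] }
  | { kind: "segment"; objects: string[] };

// Build the text prompt for a task. The prefixes are an assumption
// based on commonly documented PaliGemma conventions.
function buildPrompt(task: Task): string {
  switch (task.kind) {
    case "caption":
      return `caption ${task.lang}`;
    case "ocr":
      return "ocr";
    case "vqa":
      return `answer ${task.lang} ${task.question}`;
    case "detect":
      // Multiple targets are separated by " ; ".
      return `detect ${task.objects.join(" ; ")}`;
    case "segment":
      return `segment ${task.objects.join(" ; ")}`;
  }
}

console.log(buildPrompt({ kind: "caption", lang: "en" })); // caption en
console.log(buildPrompt({ kind: "detect", objects: ["cat", "dog"] })); // detect cat ; dog
```

Modeling the tasks as a discriminated union keeps the set of prefixes in one place, so an unsupported task fails at compile time rather than producing an unrecognized prompt.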

Use Cases on Mixpeek

Multi-task visual feature extraction in a single compact model pass

Document OCR and visual Q&A for mixed-layout content

Lightweight scene captioning for large image catalogs
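For extraction pipelines that consume detection results, the raw model output usually needs to be converted to pixel coordinates. The sketch below assumes the commonly documented PaliGemma detection format: four `<locNNNN>` tokens per box in the order y_min, x_min, y_max, x_max, each a bin index on a 1024-step grid, followed by a label, with boxes separated by ` ; `. The `Box` interface and `parseDetections` helper are hypothetical; Mixpeek may already return parsed boxes, in which case this step is unnecessary.

```typescript
// Hypothetical result shape; not a Mixpeek SDK type.
interface Box {
  label: string;
  yMin: number;
  xMin: number;
  yMax: number;
  xMax: number;
}

// Parse "<loc....><loc....><loc....><loc....> label" groups and scale
// them to pixel coordinates. Assumes y_min, x_min, y_max, x_max order
// and a 1024-bin coordinate grid.
function parseDetections(output: string, width: number, height: number): Box[] {
  const pattern = /<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)/g;
  const boxes: Box[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(output)) !== null) {
    const [yMin, xMin, yMax, xMax] = m.slice(1, 5).map((v) => Number(v) / 1024);
    boxes.push({
      label: m[5].trim(),
      yMin: yMin * height,
      xMin: xMin * width,
      yMax: yMax * height,
      xMax: xMax * width,
    });
  }
  return boxes;
}

const sample = "<loc0256><loc0128><loc0768><loc0512> cat ; <loc0000><loc0000><loc0512><loc0512> dog";
console.log(parseDetections(sample, 2048, 1024));
```

Coordinates are scaled against the original image dimensions rather than 448x448, since the location tokens are normalized and independent of the model's input resolution.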

Benchmarks

Dataset	Metric	Score	Source
COCO Captions	CIDEr	141.9	Steiner et al., 2024 — PaliGemma 2 paper
VQAv2	Accuracy	83.2%	Steiner et al., 2024 — PaliGemma 2 paper
TextVQA (448)	Accuracy	~73%	Steiner et al., 2024 — PaliGemma 2 paper