paligemma2-3b-mix-448
by google
Versatile 3B vision-language model for captioning, VQA, OCR, and detection
google/paligemma2-3b-mix-448
mixpeek://image_extractor@v1/google_paligemma2_3b_v1
Overview
PaliGemma 2 is Google DeepMind's updated vision-language model combining a SigLIP vision encoder with a Gemma 2 language model. The 3B-mix-448 variant is fine-tuned on a diverse mixture of 30+ academic tasks at 448x448 resolution, making it ready to use out of the box for captioning, OCR, visual question answering, object detection, and segmentation.
On Mixpeek, PaliGemma 2 3B is a lightweight visual understanding model suited to structured extraction tasks. Because it is fine-tuned on a diverse task mixture, it handles work ranging from document OCR to scene captioning without additional training.
Architecture
The model pairs a SigLIP vision encoder (ViT-So400m) with a Gemma 2 2B language model. The vision encoder converts each 448x448 image into visual tokens, which are concatenated with the text tokens and passed to the language model. The checkpoint is fine-tuned on a mixture of 30+ tasks, and a task-specific prefix in the prompt selects the task.
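The relationship between input resolution and visual token count, and the role of task prefixes, can be sketched in a few lines. This is a minimal sketch, assuming the 14-pixel patch size of the So400m/14 SigLIP encoder and the prefix strings given in the public PaliGemma documentation. `visualTokenCount`, `Task`, and `buildPrompt` are hypothetical names used only for illustration and are not part of any SDK.

```typescript
// Assumption: the SigLIP So400m/14 encoder splits the image into 14x14 pixel patches.
const PATCH_SIZE = 14;

// Number of visual tokens produced for a square input image.
function visualTokenCount(resolution: number, patchSize: number = PATCH_SIZE): number {
  if (resolution % patchSize !== 0) {
    throw new Error(`resolution ${resolution} is not a multiple of ${patchSize}`);
  }
  const patchesPerSide = resolution / patchSize;
  return patchesPerSide * patchesPerSide;
}

// Assumption: prefix strings follow the public PaliGemma prompt conventions.
type Task =
  | { kind: "caption"; lang: string }
  | { kind: "ocr" }
  | { kind: "answer"; lang: string; question: string }
  | { kind: "detect"; objects: string[] }
  | { kind: "segment"; object: string };

// Build the text prompt that selects a task for the language model.
function buildPrompt(task: Task): string {
  switch (task.kind) {
    case "caption":
      return `caption ${task.lang}`;
    case "ocr":
      return "ocr";
    case "answer":
      return `answer ${task.lang} ${task.question}`;
    case "detect":
      return `detect ${task.objects.join(" ; ")}`;
    case "segment":
      return `segment ${task.object}`;
  }
}

console.log(visualTokenCount(448)); // 1024 (32 x 32 patches)
console.log(buildPrompt({ kind: "detect", objects: ["cat", "dog"] })); // "detect cat ; dog"
```

Under these assumptions a 448x448 image contributes 1024 visual tokens before any text token is added, which is worth keeping in mind when estimating latency against lower-resolution variants.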
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your Mixpeek API key.
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest one image and run the PaliGemma 2 extractor on it.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/document.jpg" },
  feature_extractors: [
    {
      name: "scene_description",
      version: "v1",
      params: { model_id: "google/paligemma2-3b-mix-448" },
    },
  ],
});
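When several images share the same extractor settings, it can help to build the request objects separately from the SDK call. The sketch below mirrors only the field names used in the ingest call above. The `FeatureExtractor` and `IngestRequest` interfaces and the `buildIngestRequests` helper are hypothetical and have not been checked against the SDK's own type definitions.

```typescript
// Assumption: field names are copied from the ingest call above,
// not from the SDK's published types.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const MODEL_ID = "google/paligemma2-3b-mix-448";

// Hypothetical helper: one request object per image URL, all using the same extractor.
function buildIngestRequests(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      { name: "scene_description", version: "v1", params: { model_id: MODEL_ID } },
    ],
  }));
}

const requests = buildIngestRequests("my-collection", [
  "https://example.com/document.jpg",
  "https://example.com/receipt.jpg",
]);
console.log(JSON.stringify(requests[0], null, 2));
```

Each element of `requests` has the same shape as the argument passed to `mx.collections.ingest` above, so the objects could be sent one at a time in a loop.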
Capabilities
- Multi-task fine-tuning: captioning, VQA, OCR, detection, segmentation
- 448x448 input resolution for detailed visual understanding
- Strong performance on text-heavy visual tasks (DocVQA, TextVQA)
- Mixture of 30+ academic tasks supported out of the box
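For the detection capability, PaliGemma-style models return boxes as location tokens in the generated text, not as a structured object. The sketch below assumes the publicly documented convention: four `<locNNNN>` tokens per box in the order y_min, x_min, y_max, x_max, each on a 0 to 1023 grid, followed by a label, with boxes separated by `;`. How Mixpeek surfaces detection results is not covered here, and `Box` and `parseDetections` are hypothetical names.

```typescript
interface Box {
  label: string;
  // Pixel coordinates in the original image.
  xMin: number;
  yMin: number;
  xMax: number;
  yMax: number;
}

// Convert one captured 4-digit location value to a fraction of the image side.
function toFraction(value: string | undefined): number {
  return Number(value ?? "0") / 1024;
}

// Parse raw detection text such as "<loc0256><loc0128><loc0768><loc0896> cat".
function parseDetections(output: string, width: number, height: number): Box[] {
  const pattern = /<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)/g;
  const boxes: Box[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(output)) !== null) {
    boxes.push({
      label: (match[5] ?? "").trim(),
      // Token order is y_min, x_min, y_max, x_max.
      yMin: toFraction(match[1]) * height,
      xMin: toFraction(match[2]) * width,
      yMax: toFraction(match[3]) * height,
      xMax: toFraction(match[4]) * width,
    });
  }
  return boxes;
}

const sample = "<loc0256><loc0128><loc0768><loc0896> cat ; <loc0000><loc0000><loc0512><loc0512> dog";
console.log(parseDetections(sample, 448, 448));
```

Scaling by the original image width and height, not by 448, keeps the boxes aligned with the source image even though the model sees a resized copy.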
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| COCO Captions | CIDEr | 141.9 | Steiner et al., 2024 — PaliGemma 2 paper |
| VQAv2 | Accuracy | 83.2% | Steiner et al., 2024 — PaliGemma 2 paper |
| TextVQA (448) | Accuracy | ~73% | Steiner et al., 2024 — PaliGemma 2 paper |
Research Paper
PaliGemma 2: A Family of Versatile VLMs for Transfer
arxiv.org
Build a pipeline with paligemma2-3b-mix-448
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.