gemma-4-4b-it
by google
Instruction-tuned 4B multimodal model with text, image, and audio input
google/gemma-4-4b-it
mixpeek://image_extractor@v1/google_gemma4_4b_v1
Overview
Gemma 4 4B IT is Google DeepMind's instruction-tuned multimodal model from the Gemma 4 family, optimized for following complex instructions across text, image, and audio inputs. With a 128K token context window and Apache 2.0 licensing, it brings frontier-class instruction following to a compact 4B form factor.
On Mixpeek, Gemma 4 4B IT serves as a versatile instruction-following backbone for structured extraction tasks, answering questions about visual content, and generating structured metadata from multimodal inputs.
Architecture
Decoder-only transformer with hybrid attention that interleaves local sliding-window layers with full global-attention layers. Multimodal inputs (text, image, audio) are supported through integrated encoders. Per-Layer Embeddings (PLE) are used for efficient parameter utilization. The final attention layer is always global.
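The interleaving described above can be pictured with a small sketch. The layer count, the local-to-global ratio, and the window size below are illustrative assumptions and not published Gemma 4 values; `buildLayerPattern` and `canAttend` are hypothetical helper names. The only property taken from the description is that local and global layers alternate and that the last layer is global.

```typescript
// Illustrative sketch only: the ratio, layer count, and window size are
// assumptions, not values from the Gemma 4 documentation.
type AttentionKind = "local" | "global";

// Build a per-layer schedule with `localPerGlobal` local layers before each
// global layer, forcing the final layer to be global.
function buildLayerPattern(numLayers: number, localPerGlobal: number): AttentionKind[] {
  const layers: AttentionKind[] = [];
  for (let i = 0; i < numLayers; i++) {
    const isGlobalSlot = (i + 1) % (localPerGlobal + 1) === 0;
    const isLast = i === numLayers - 1;
    layers.push(isGlobalSlot || isLast ? "global" : "local");
  }
  return layers;
}

// Causal attention rule: a global layer sees every earlier token, while a
// local layer sees only the most recent `window` tokens (itself included).
function canAttend(kind: AttentionKind, query: number, key: number, window: number): boolean {
  if (key > query) return false; // never attend to future positions
  return kind === "global" ? true : query - key < window;
}

const pattern = buildLayerPattern(8, 2);
console.log(pattern.join(" "));
```

Local layers keep attention cost bounded by the window size, while the periodic global layers let information travel across the whole context.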
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/image.jpg" },
  feature_extractors: [
    {
      name: "scene_description",
      version: "v1",
      params: {
        model_id: "google/gemma-4-4b-it"
      }
    }
  ]
});
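When ingesting many assets, it may help to build the request object in one place before passing it to the SDK. The sketch below only mirrors the field names in the snippet above; the `IngestRequest` type and the `buildIngestRequest` helper are hypothetical and not part of the Mixpeek SDK, and the URL check is an assumed convenience rather than a documented requirement.

```typescript
// Hypothetical types that mirror the fields used in the snippet above.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

// Hypothetical helper: assembles the payload without calling the SDK.
function buildIngestRequest(
  collectionId: string,
  url: string,
  modelId: string = "google/gemma-4-4b-it"
): IngestRequest {
  // Assumed sanity check: reject anything that is not an http(s) URL.
  if (!/^https?:\/\//.test(url)) {
    throw new Error(`Unsupported source URL: ${url}`);
  }
  return {
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      { name: "scene_description", version: "v1", params: { model_id: modelId } },
    ],
  };
}

const request = buildIngestRequest("my-collection", "https://example.com/image.jpg");
console.log(JSON.stringify(request, null, 2));
```

The resulting object has the same shape as the argument passed to `mx.collections.ingest` in the snippet above, so it could be handed to that call directly, assuming the SDK accepts exactly those fields.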
Capabilities
- Instruction-tuned for complex multi-step task following
- Multimodal input: text, image, and audio understanding
- 128K token context window
- Built-in thinking mode for chain-of-thought reasoning
- Apache 2.0 license for commercial deployment
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| MMLU Pro | Accuracy | ~55% | Gemma 4 technical report |
| AIME 2026 | Accuracy | 42.5% | Gemma 4 technical report |
Performance
Specification
Research Paper
Gemma 4 model overview
arxiv.org
Build a pipeline with gemma-4-4b-it
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.