gemma-4-4b-it

by google

Instruction-tuned 4B multimodal model with text, image, and audio input

1.9M downloads/month

4B params

Identifiers

Model ID

google/gemma-4-4b-it

Feature URI

mixpeek://image_extractor@v1/google_gemma4_4b_v1
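
The feature URI above appears to follow a `mixpeek://<extractor>@<version>/<feature>` layout. As a minimal sketch, assuming that layout holds in general, the URI can be split into its parts as shown below. The `FeatureUri` type, its field names, and the `parseFeatureUri` helper are hypothetical and inferred only from this one example; they are not part of a documented Mixpeek schema.

```typescript
// Hypothetical shape: field names are assumptions inferred from the single
// URI shown on this page, not a documented Mixpeek schema.
interface FeatureUri {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "google_gemma4_4b_v1"
}

// Split a URI of the assumed form mixpeek://<extractor>@<version>/<feature>.
// Throws if the string does not match that form.
function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Not a recognizable feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}
```

Calling `parseFeatureUri("mixpeek://image_extractor@v1/google_gemma4_4b_v1")` under these assumptions yields the extractor `image_extractor`, the version `v1`, and the feature `google_gemma4_4b_v1`.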

Overview

Gemma 4 4B IT is Google DeepMind's instruction-tuned multimodal model from the Gemma 4 family, optimized for following complex instructions across text, image, and audio inputs. It offers a 128K-token context window, is released under the Apache 2.0 license, and aims to deliver strong instruction following in a compact 4B-parameter model.

On Mixpeek, Gemma 4 4B IT serves as an instruction-following backbone for structured extraction, question answering over visual content, and generation of structured metadata from multimodal inputs.

Architecture

Decoder-only transformer with hybrid attention that interleaves local sliding-window layers with full global attention layers. The model accepts multimodal inputs (text, image, audio) through integrated encoders and uses Per-Layer Embeddings (PLE) for efficient parameter utilization. The final attention layer is always global.
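
To make the interleaving concrete, here is a minimal sketch of how local and global layers could be laid out and how their attention masks differ. The ratio of local to global layers (`localPerGlobal`), the window size, and both function names are illustrative assumptions, since the description above does not state them. The mask is also simplified to plain causal text attention and ignores any special handling of image or audio tokens.

```typescript
type AttentionKind = "local" | "global";

// Assign an attention kind to each layer: `localPerGlobal` local layers
// followed by one global layer, repeating. The last layer is forced to be
// global, matching the "final attention layer is always global" rule.
// NOTE: `localPerGlobal` is an assumed parameter, not a published value.
function layerPattern(numLayers: number, localPerGlobal: number): AttentionKind[] {
  const kinds: AttentionKind[] = [];
  for (let i = 0; i < numLayers; i++) {
    const closesGroup = (i + 1) % (localPerGlobal + 1) === 0;
    const isLast = i === numLayers - 1;
    kinds.push(closesGroup || isLast ? "global" : "local");
  }
  return kinds;
}

// Build a causal mask where mask[q][k] is true when query position q may
// attend to key position k. Global layers see every earlier position;
// local layers see only the most recent `window` positions (including q).
function attentionMask(seqLen: number, kind: AttentionKind, window: number): boolean[][] {
  const mask: boolean[][] = [];
  for (let q = 0; q < seqLen; q++) {
    const row: boolean[] = [];
    for (let k = 0; k < seqLen; k++) {
      const causal = k <= q;
      const inWindow = q - k < window;
      row.push(kind === "global" ? causal : causal && inWindow);
    }
    mask.push(row);
  }
  return mask;
}
```

With six layers and three local layers per global layer, `layerPattern(6, 3)` gives three local layers, one global layer, one local layer, and a final global layer. The local mask keeps per-layer attention cost bounded by the window size, while the periodic global layers let information travel across the full context.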

Mixpeek SDK Integration

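import { Mixpeek } from "mixpeek";

// Create a client; replace "API_KEY" with your own Mixpeek API key.
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest an image by URL and run a feature extractor that uses Gemma 4 4B IT.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/image.jpg" },
  feature_extractors: [{
    name: "scene_description",
    version: "v1",
    params: {
      // Model ID as listed under Identifiers.
      model_id: "google/gemma-4-4b-it"
    }
  }]
});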
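
When ingesting many images, it can help to build the request objects separately from the SDK call so they can be inspected or tested first. The sketch below assumes the request shape shown in the example above. The `IngestRequest` and `FeatureExtractorConfig` types and the `buildIngestRequests` helper are hypothetical and are not exported by the Mixpeek SDK.

```typescript
// Illustrative types that mirror the object literal in the example above.
// They are assumptions, not types provided by the Mixpeek SDK.
interface FeatureExtractorConfig {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractorConfig[];
}

// Build one ingest request per image URL, all using the same extractor
// settings. No network calls are made here.
function buildIngestRequests(
  collectionId: string,
  urls: string[],
  modelId: string = "google/gemma-4-4b-it"
): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      { name: "scene_description", version: "v1", params: { model_id: modelId } },
    ],
  }));
}
```

Each returned object could then be passed to `mx.collections.ingest`, assuming that method accepts the shape used in the example above.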