    HF · Scene Captioning · Apache 2.0

    gemma-4-4b-it

    by google

    Instruction-tuned 4B multimodal model with text, image, and audio input

    1.9M downloads/month
    4B params
    Identifiers
    Model ID
    google/gemma-4-4b-it
    Feature URI
    mixpeek://image_extractor@v1/google_gemma4_4b_v1
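
The Feature URI above appears to pack three pieces of information into one string. As a minimal sketch, assuming the layout is `mixpeek://<extractor>@<version>/<feature>` (inferred from this single example, not from documented Mixpeek behavior), it could be split like this. The `FeatureUri` type and `parseFeatureUri` helper are hypothetical names introduced for illustration.

```typescript
// Hypothetical shape inferred from one example URI; not an official Mixpeek type.
interface FeatureUri {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "google_gemma4_4b_v1"
}

// Assumed layout: mixpeek://<extractor>@<version>/<feature>
const FEATURE_URI_PATTERN = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/;

// Returns the parsed parts, or null when the string does not fit the assumed layout.
function parseFeatureUri(uri: string): FeatureUri | null {
  const match = FEATURE_URI_PATTERN.exec(uri);
  if (match === null) {
    return null;
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parsedUri = parseFeatureUri("mixpeek://image_extractor@v1/google_gemma4_4b_v1");
console.log(parsedUri);
```

Returning `null` instead of throwing keeps the helper safe to use on URIs that follow a different layout than the one assumed here.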

    Overview

    Gemma 4 4B IT is Google DeepMind's instruction-tuned multimodal model from the Gemma 4 family, optimized for following complex instructions across text, image, and audio inputs. With a 128K token context window and Apache 2.0 licensing, it brings frontier-class instruction following to a compact 4B form factor.

    On Mixpeek, Gemma 4 4B IT serves as a versatile instruction-following backbone for structured extraction tasks, answering questions about visual content, and generating structured metadata from multimodal inputs.

    Architecture

    Decoder-only transformer with hybrid attention that interleaves local sliding-window layers with full global-attention layers. It supports multimodal inputs (text, image, audio) through integrated encoders and uses Per-Layer Embeddings (PLE) for efficient parameter utilization. The final attention layer is always global.
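
To make the interleaving concrete, here is a small sketch of how a layer schedule and the matching causal attention masks could be built. The ratio of local to global layers (`localPerGlobal`) and the window size are hypothetical values chosen for illustration; the page does not state the real ones. The only constraint taken from the description is that the last layer is global.

```typescript
type LayerKind = "local" | "global";

// Builds a schedule with `localPerGlobal` local layers before each global layer.
// The ratio is an illustrative assumption; the final layer is always forced to global.
function layerPattern(numLayers: number, localPerGlobal: number): LayerKind[] {
  const kinds: LayerKind[] = [];
  for (let i = 0; i < numLayers; i++) {
    const isGlobalSlot = (i + 1) % (localPerGlobal + 1) === 0;
    const isLast = i === numLayers - 1;
    kinds.push(isGlobalSlot || isLast ? "global" : "local");
  }
  return kinds;
}

// mask[q][k] is true when query position q may attend to key position k.
// Global layers see every earlier position; local layers see only the last `window` positions.
function causalMask(seqLen: number, kind: LayerKind, window: number): boolean[][] {
  const mask: boolean[][] = [];
  for (let q = 0; q < seqLen; q++) {
    const row: boolean[] = [];
    for (let k = 0; k < seqLen; k++) {
      const causal = k <= q;
      const inWindow = kind === "global" || q - k < window;
      row.push(causal && inWindow);
    }
    mask.push(row);
  }
  return mask;
}

console.log(layerPattern(8, 2).join(" "));
console.log(causalMask(4, "local", 2));
```

The practical effect is that local layers keep attention cost bounded by the window size, while the periodic global layers let information travel across the whole context.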

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/image.jpg" },
      feature_extractors: [{
        name: "scene_description",
        version: "v1",
        params: {
          model_id: "google/gemma-4-4b-it"
        }
      }]
    });
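
When ingesting many images, it can help to build the request objects separately from the SDK call. The sketch below only mirrors the object literal shown in the snippet above; the `IngestRequest` and `FeatureExtractorConfig` types and the `buildCaptionRequests` helper are hypothetical, not official SDK types, and the real SDK may accept additional fields.

```typescript
// Shapes copied from the object literal in the snippet above; assumptions, not SDK types.
interface FeatureExtractorConfig {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractorConfig[];
}

// Builds one request per URL, all using the same captioning extractor and model.
function buildCaptionRequests(
  collectionId: string,
  urls: string[],
  modelId: string = "google/gemma-4-4b-it"
): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      { name: "scene_description", version: "v1", params: { model_id: modelId } },
    ],
  }));
}

const requests = buildCaptionRequests("my-collection", [
  "https://example.com/image.jpg",
  "https://example.com/other.jpg",
]);
console.log(JSON.stringify(requests[0], null, 2));
```

Each element of `requests` has the same shape as the argument passed to `mx.collections.ingest` in the snippet above, so it could presumably be passed to that call in a loop.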

    Capabilities

    • Instruction-tuned for complex multi-step task following
    • Multimodal input: text, image, and audio understanding
    • 128K token context window
    • Built-in thinking mode for chain-of-thought reasoning
    • Apache 2.0 license for commercial deployment

    Use Cases on Mixpeek

    • Structured metadata extraction from images and video frames via natural language instructions
    • Visual question answering across document and media collections
    • Instruction-driven content analysis for automated tagging and classification pipelines

    Benchmarks

    Dataset      Metric      Score    Source
    MMLU Pro     Accuracy    ~55%     Gemma 4 technical report
    AIME 2026    Accuracy    42.5%    Gemma 4 technical report

    Performance

    Input Size: Text + 224×224 px images + audio
    GPU Latency: ~22 ms / image (A100)
    GPU Throughput: ~45 images/sec (A100)
    GPU Memory: ~3.2 GB (bf16)
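
The latency and throughput figures are consistent with each other: 1000 ms divided by ~22 ms per image gives roughly 45 images per second. The sketch below reproduces that arithmetic and uses it for a rough batch-time estimate. It assumes sequential processing on a single A100 at the quoted rate and ignores network, decoding, and queueing overhead, so treat the result as a lower bound rather than a prediction. The function names are illustrative.

```typescript
// Figures quoted in the Performance section (approximate, A100).
const LATENCY_MS_PER_IMAGE = 22;
const QUOTED_IMAGES_PER_SEC = 45;

// Converts a per-image latency in milliseconds to images per second.
function throughputFromLatency(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Rough processing time in seconds for a batch at a fixed rate.
// Assumes one GPU working sequentially with no other overhead.
function estimateSeconds(imageCount: number, imagesPerSec: number): number {
  return imageCount / imagesPerSec;
}

const derivedRate = throughputFromLatency(LATENCY_MS_PER_IMAGE);
console.log(derivedRate.toFixed(1)); // about 45.5 images/sec
console.log(estimateSeconds(9000, QUOTED_IMAGES_PER_SEC)); // 200 seconds
```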

    Specification

    Framework: HF
    Organization: google
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 4B
    License: Apache 2.0
    Downloads/mo: 1.9M

    Research Paper

    Gemma 4 model overview (arxiv.org)

    Build a pipeline with gemma-4-4b-it

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
