    HF · Scene Captioning · Apache 2.0

    Qwen3-VL-8B-Instruct

    by Qwen

    8B vision-language model with 262K context and strong visual reasoning

    2.8M downloads/month
    8.77B params
    Identifiers
    Model ID
    Qwen/Qwen3-VL-8B-Instruct
    Feature URI
    mixpeek://image_extractor@v1/qwen3_vl_8b_v1
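
The Feature URI above appears to follow a `mixpeek://<extractor>@<version>/<feature>` layout. The sketch below splits it into its parts. That layout is an assumption inferred from this single example, and `parseFeatureUri` and `FeatureUri` are hypothetical names, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI layout (scheme://extractor@version/feature) is
// inferred from the one example on this page, not from a published spec.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  // Capture groups: extractor name, version tag, feature name.
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}

const parsed = parseFeatureUri("mixpeek://image_extractor@v1/qwen3_vl_8b_v1");
console.log(parsed.extractor, parsed.version, parsed.feature);
```

Parsing the URI once and passing the structured result around avoids scattering string splits through pipeline code.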

    Overview

    Qwen3-VL-8B-Instruct is an instruction-tuned vision-language model from Alibaba's Qwen team. It pairs an 8B-parameter dense language model with a 400M-parameter SigLIP-2 vision encoder and handles text, image, and video input. The native context window is 262K tokens, extensible to ~1M tokens, and the model surpasses models three times its size on key benchmarks.

    On Mixpeek, Qwen3-VL-8B powers visual understanding tasks such as scene captioning, document analysis, and video comprehension. It fits workloads that need detailed visual reasoning without the cost of running a 30B+ model.

    Architecture

    Three-module architecture: a 400M SigLIP-2 SO vision encoder, two-layer MLP mergers, and a dense 8B Qwen3 LLM backbone. DeepStack adapters inject multi-level vision-encoder features into the early LLM layers, and interleaved MRoPE provides positional encoding across text, image, and video tokens.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "scene_description",
        version: "v1",
        params: {
          model_id: "Qwen/Qwen3-VL-8B-Instruct"
        }
      }]
    });
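
When the same extractor configuration is reused across many sources, it can help to build the request object in one place. The sketch below mirrors only the fields shown in the snippet above. The `IngestRequest` and `FeatureExtractorConfig` types and the `buildSceneCaptionRequest` helper are hypothetical; the real SDK may export its own request types with different or additional fields.

```typescript
// Hypothetical types mirroring the fields in the snippet above; they are
// assumptions for illustration, not types exported by the Mixpeek SDK.
interface FeatureExtractorConfig {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractorConfig[];
}

const QWEN3_VL_MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct";

function buildSceneCaptionRequest(collectionId: string, sourceUrl: string): IngestRequest {
  // Fail early on anything that is not an http(s) URL.
  if (!/^https?:\/\//.test(sourceUrl)) {
    throw new Error(`Unsupported source URL: ${sourceUrl}`);
  }
  return {
    collection_id: collectionId,
    source: { url: sourceUrl },
    feature_extractors: [
      { name: "scene_description", version: "v1", params: { model_id: QWEN3_VL_MODEL_ID } },
    ],
  };
}

const request = buildSceneCaptionRequest("my-collection", "https://example.com/video.mp4");
console.log(JSON.stringify(request, null, 2));
```

The resulting object has the same shape as the literal passed to `mx.collections.ingest` in the snippet above.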

    Capabilities

    • Text, image, and video understanding in a single model
    • 262K token context window (extensible to ~1M via YaRN)
    • Strong spatial perception and visual reasoning
    • GUI interaction and visual agent capabilities
    • 96.1% accuracy on DocVQA

    Use Cases on Mixpeek

    • Rich scene description for video archives with detailed spatial and temporal reasoning
    • Document visual Q&A for scanned forms, invoices, and mixed-layout content
    • Video understanding across long-form content with fine-grained temporal search

    Benchmarks

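    | Dataset       | Metric   | Score | Source                    |
    |---------------|----------|-------|---------------------------|
    | DocVQA (test) | Accuracy | 96.1% | Qwen3-VL technical report |
    | OCRBench      | Accuracy | 89.6% | Qwen3-VL technical report |
    | MMBench-V1.1  | Accuracy | 85.0% | Qwen3-VL technical report |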

    Performance

    Input Size: Text + variable-resolution images/video
    GPU Latency: ~55 ms / image (A100)
    GPU Throughput: ~18 images/sec (A100)
    GPU Memory: ~17 GB (bf16)
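
The performance figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers from this page (8.77B parameters, ~55 ms per image). The formulas are rough estimates that ignore activations, KV cache, and batching, and the helper names are hypothetical.

```typescript
// Figures taken from this page; the formulas are back-of-the-envelope
// estimates, not measurements.
const PARAMS = 8.77e9;          // parameter count
const BYTES_PER_PARAM_BF16 = 2; // bf16 stores each weight in 16 bits
const LATENCY_MS_PER_IMAGE = 55;

// Weight memory in decimal gigabytes (excludes activations and KV cache).
function weightMemoryGB(params: number, bytesPerParam: number): number {
  return (params * bytesPerParam) / 1e9;
}

// Sequential throughput implied by a per-image latency.
function imagesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Hypothetical planning helper: seconds to caption `frames` frames one at a time.
function captioningSeconds(frames: number, latencyMs: number): number {
  return (frames * latencyMs) / 1000;
}

console.log(weightMemoryGB(PARAMS, BYTES_PER_PARAM_BF16).toFixed(2)); // "17.54"
console.log(imagesPerSecond(LATENCY_MS_PER_IMAGE).toFixed(1));        // "18.2"
console.log(captioningSeconds(3600, LATENCY_MS_PER_IMAGE));           // 198
```

The bf16 weights alone come to about 17.5 GB, in line with the ~17 GB listed, and 55 ms per image implies roughly 18 images per second. At that rate, an hour of video sampled at one frame per second (3,600 frames) would take a little over three minutes on a single GPU under these assumptions.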

    Specification

    Framework: HF
    Organization: Qwen
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 8.77B
    License: Apache 2.0
    Downloads/mo: 2.8M

    Research Paper

    Qwen3-VL Technical Report

    arxiv.org

    Build a pipeline with Qwen3-VL-8B-Instruct

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.