Qwen3-VL-8B-Instruct
by Qwen
8B vision-language model with 262K context and strong visual reasoning
Qwen/Qwen3-VL-8B-Instruct
mixpeek://image_extractor@v1/qwen3_vl_8b_v1
Overview
Qwen3-VL-8B-Instruct is Alibaba's instruction-tuned vision-language model. It combines an 8B-parameter dense language model with a 400M-parameter SigLIP-2 vision encoder and supports text, image, and video understanding. The native context window is 262K tokens, extensible to ~1M tokens, and its reported performance surpasses models 3x its size on key benchmarks.
On Mixpeek, Qwen3-VL-8B powers visual understanding tasks such as scene captioning, document analysis, and video comprehension. It fits workloads that need detailed visual reasoning without the cost of running a 30B+ model.
Architecture
Multimodal architecture built on the dense Qwen3 8B language model. The LLM backbone is augmented with a 400M SigLIP-2 SO vision encoder, two-layer MLP mergers, and DeepStack adapters for multimodal and video capabilities.
Mixpeek SDK Integration
    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [
        {
          name: "scene_description",
          version: "v1",
          params: {
            model_id: "Qwen/Qwen3-VL-8B-Instruct"
          }
        }
      ]
    });
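The ingest payload above can also be assembled by a small typed helper, which makes it easier to reuse the same extractor settings across sources. The sketch below is illustrative only: the `IngestRequest` and `FeatureExtractor` types and the `buildIngestRequest` function are hypothetical names, not part of the Mixpeek SDK, and the field names are simply copied from the snippet above.

```typescript
// Hypothetical types that mirror the fields used in the ingest snippet.
// They are assumptions for illustration, not Mixpeek SDK definitions.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const QWEN3_VL_8B = "Qwen/Qwen3-VL-8B-Instruct";

// Build the request object for one source URL (hypothetical helper).
function buildIngestRequest(
  collectionId: string,
  url: string,
  modelId: string = QWEN3_VL_8B
): IngestRequest {
  // Reject anything that is not an http(s) URL before sending it anywhere.
  if (!/^https?:\/\//.test(url)) {
    throw new Error("source url must start with http:// or https://");
  }
  return {
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      { name: "scene_description", version: "v1", params: { model_id: modelId } },
    ],
  };
}

// Same values as the snippet above.
const request = buildIngestRequest("my-collection", "https://example.com/video.mp4");
```

The resulting `request` object has the same shape as the literal passed to `mx.collections.ingest` in the snippet above.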
Capabilities
- Text, image, and video understanding in a single model
- 262K token context window (extensible to ~1M via YaRN)
- Strong spatial perception and visual reasoning
- GUI interaction and visual agent capabilities
- 96.1% accuracy on DocVQA
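
To make the context-window figures above concrete, here is a minimal sketch that classifies an input by token count. It rests on two assumptions: "262K" is read as 262,144 tokens (2^18), and "~1M" is approximated as 1,000,000 tokens, since the page gives only a rough figure. The `contextTier` function and the tier names are hypothetical.

```typescript
// Assumption: 262K means 262,144 tokens (2^18).
const NATIVE_CONTEXT_TOKENS = 262_144;
// Assumption: "~1M" is treated as a round 1,000,000 tokens.
const EXTENDED_CONTEXT_TOKENS = 1_000_000;

type ContextTier = "native" | "extended" | "too_long";

// Classify a prompt by how much context it needs (hypothetical helper).
function contextTier(tokenCount: number): ContextTier {
  if (tokenCount < 0 || Math.floor(tokenCount) !== tokenCount) {
    throw new RangeError("tokenCount must be a non-negative integer");
  }
  if (tokenCount <= NATIVE_CONTEXT_TOKENS) return "native"; // fits as-is
  if (tokenCount <= EXTENDED_CONTEXT_TOKENS) return "extended"; // needs YaRN extension
  return "too_long"; // split or summarize the input first
}

const tier = contextTier(300_000); // "extended"
```

Token counts for images and video frames depend on resolution and sampling, so the count passed in would have to come from the model's own tokenizer and processor.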
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| DocVQA (test) | Accuracy | 96.1% | Qwen3-VL technical report |
| OCRBench | Accuracy | 89.6% | Qwen3-VL technical report |
| MMBench-V1.1 | Accuracy | 85.0% | Qwen3-VL technical report |
Research Paper
Qwen3-VL Technical Report
arxiv.org
Build a pipeline with Qwen3-VL-8B-Instruct
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.