HF | Scene Captioning | Apache 2.0

Qwen3-VL-8B-Instruct

by Qwen

8B vision-language model with 262K context and strong visual reasoning

2.8M downloads/month

8.77B parameters

Identifiers

Model ID

Qwen/Qwen3-VL-8B-Instruct

Feature URI

mixpeek://image_extractor@v1/qwen3_vl_8b_v1
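
The Feature URI above appears to pack three pieces of information into one string: an extractor name, a version, and a feature name. The sketch below splits it apart. The `extractor@version/feature` layout is inferred from this single example rather than from a published specification, and `parseFeatureUri` is a hypothetical helper, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: splits a Mixpeek feature URI into its parts.
// The mixpeek://extractor@version/feature layout is an assumption inferred
// from the one example on this page.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parsed = parseFeatureUri("mixpeek://image_extractor@v1/qwen3_vl_8b_v1");
console.log(parsed.extractor); // image_extractor
console.log(parsed.version);   // v1
console.log(parsed.feature);   // qwen3_vl_8b_v1
```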

Overview

Qwen3-VL-8B-Instruct is Alibaba's instruction-tuned vision-language model. It combines an 8B-parameter dense language model with a 400M SigLIP-2 vision encoder and handles text, image, and video understanding. Its native context window is 262K tokens, extensible to roughly 1M tokens, and it is reported to outperform models three times its size on key benchmarks.

On Mixpeek, Qwen3-VL-8B powers rich visual understanding tasks, including scene captioning, document analysis, and video comprehension, for cases where you need detailed visual reasoning without the cost of running a 30B+ model.

Architecture

Qwen3-VL-8B-Instruct pairs a dense 8B Qwen3 LLM backbone with a 400M SigLIP-2 SO vision encoder. Two-layer MLP mergers project visual tokens into the language model's embedding space, and DeepStack adapters feed features from multiple vision-encoder layers into the LLM to support fine-grained image and video understanding.

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_description",
    version: "v1",
    parameters: { model_id: "Qwen/Qwen3-VL-8B-Instruct" },
  },
});
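
If you create collections for several models, it can help to build the request body in one place and validate it before calling the SDK. The sketch below reuses the field names from the snippet above and makes no SDK calls. `buildCollectionRequest` and the `CollectionRequest` type are hypothetical names introduced for illustration, and the validation rules are assumptions rather than documented Mixpeek requirements.

```typescript
// Field names mirror the snippet above. buildCollectionRequest is a
// hypothetical helper and is not part of the Mixpeek SDK.
interface CollectionRequest {
  namespace_id: string;
  collection_name: string;
  source: { type: "bucket"; bucket_ids: string[] };
  feature_extractor: {
    feature_extractor_name: string;
    version: string;
    parameters: { model_id: string };
  };
}

function buildCollectionRequest(
  namespaceId: string,
  collectionName: string,
  bucketIds: string[],
  modelId: string,
): CollectionRequest {
  // Hugging Face model IDs take the form "organization/model-name".
  if (!/^[^\/\s]+\/[^\/\s]+$/.test(modelId)) {
    throw new Error(`Expected an "org/name" model ID, got: ${modelId}`);
  }
  // Assumption: a bucket-backed collection needs at least one bucket.
  if (bucketIds.length === 0) {
    throw new Error("At least one bucket ID is required");
  }
  return {
    namespace_id: namespaceId,
    collection_name: collectionName,
    source: { type: "bucket", bucket_ids: bucketIds },
    feature_extractor: {
      feature_extractor_name: "scene_description",
      version: "v1",
      parameters: { model_id: modelId },
    },
  };
}

const collectionRequest = buildCollectionRequest(
  "my-namespace",
  "my-collection",
  ["bkt_your_bucket"],
  "Qwen/Qwen3-VL-8B-Instruct",
);
console.log(JSON.stringify(collectionRequest, null, 2));
```

The resulting object has the same shape as the argument passed to `mx.collections.create` in the snippet above.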

Capabilities

Text, image, and video understanding in a single model
262K token context window (extensible to ~1M via YaRN)
Strong spatial perception and visual reasoning
GUI interaction and visual agent capabilities
96.1% accuracy on DocVQA
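
The context-window figures in the list above follow from simple arithmetic, sketched here. The scaling factor of 4 is an assumption chosen because it lands on the "~1M" figure; check the model card for the recommended YaRN settings before relying on it.

```typescript
// "262K" is the rounded form of 256 * 1024 tokens.
const nativeContextTokens = 256 * 1024;

// Assumption: a YaRN scaling factor of 4, picked because it reproduces
// the "~1M" figure quoted on this page.
const assumedYarnFactor = 4;
const extendedContextTokens = nativeContextTokens * assumedYarnFactor;

console.log(nativeContextTokens);   // 262144
console.log(extendedContextTokens); // 1048576
```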

Use Cases on Mixpeek

Rich scene description for video archives with detailed spatial and temporal reasoning

Document visual Q&A for scanned forms, invoices, and mixed-layout content

Video understanding across long-form content with fine-grained temporal search

Benchmarks

Dataset	Metric	Score	Source
DocVQA (test)	Accuracy	96.1%	Qwen3-VL technical report
OCRBench	Accuracy	89.6%	Qwen3-VL technical report
MMBench-V1.1	Accuracy	85.0%	Qwen3-VL technical report

Performance

Input Size: Text + variable-resolution images/video

GPU Latency: ~55 ms / image (A100)

GPU Throughput: ~18 images/sec (A100)

GPU Memory: ~17 GB (bf16)
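
As a sanity check, the memory and throughput figures above can be roughly reproduced from the parameter count and the per-image latency. This is a back-of-the-envelope sketch, assuming the memory figure covers model weights only (no KV cache or activations) and that throughput is measured one image at a time.

```typescript
// Weight memory: parameter count times bytes per parameter.
const paramCount = 8.77e9;   // from the spec on this page
const bytesPerParam = 2;     // bf16 stores each weight in 2 bytes
const weightBytes = paramCount * bytesPerParam;
const weightGB = weightBytes / 1e9;        // decimal gigabytes
const weightGiB = weightBytes / 2 ** 30;   // binary gibibytes

// Throughput at batch size 1 is the inverse of per-image latency.
const latencyMs = 55;
const imagesPerSecond = 1000 / latencyMs;

console.log(weightGB.toFixed(1));        // 17.5
console.log(weightGiB.toFixed(1));       // 16.3
console.log(imagesPerSecond.toFixed(1)); // 18.2
```

Both results sit close to the quoted ~17 GB and ~18 images/sec. Real deployments need extra headroom for the KV cache and activations, especially with long contexts.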

Common Pipeline Companions

Qwen/Qwen3-Embedding-4B

Text embedding for transcript search

openai/whisper-large-v3

Audio transcription for video pipelines

Specification

Framework: HF

Organization: Qwen

Feature: Scene Captioning

Output: text

Modalities: video, image

Retriever: Semantic Search

Parameters: 8.77B

License: Apache 2.0

Downloads/mo: 2.8M

Research Paper

Qwen3-VL Technical Report

Build a pipeline with Qwen3-VL-8B-Instruct

Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.

Alternative Models

Salesforce/blip2-opt-2.7b

Scene Captioning

microsoft/Florence-2-large

Scene Captioning

google/paligemma2-3b-mix-448

Scene Captioning

google/gemma-4-E2B-it

Scene Captioning