    HF · Transcription · Apache 2.0

    Qwen3-ASR-1.7B

    by Qwen

    State-of-the-art open-source ASR for 52 languages with streaming and offline modes

    320K downloads/month
    1.7B params
    Identifiers

    Model ID: Qwen/Qwen3-ASR-1.7B
    Feature URI: mixpeek://transcription@v1/qwen3_asr_1b_v1
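The Feature URI above appears to encode an extractor name, a version, and a model slug. The following is a minimal sketch of splitting it into those parts. The `parseFeatureUri` helper and the `FeatureUri` field names are hypothetical, and the `extractor@version/slug` layout is inferred from this one example rather than from a published Mixpeek specification.

```typescript
// Hypothetical helper: the URI layout (extractor@version/slug) is inferred
// from the single example on this page, not from a documented Mixpeek spec.
interface FeatureUri {
  extractor: string; // e.g. "transcription"
  version: string;   // e.g. "v1"
  slug: string;      // e.g. "qwen3_asr_1b_v1"
}

function parseFeatureUri(uri: string): FeatureUri | null {
  // Capture the text before "@", between "@" and "/", and after "/".
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) return null;
  const [, extractor, version, slug] = match;
  return { extractor, version, slug };
}

const parsed = parseFeatureUri("mixpeek://transcription@v1/qwen3_asr_1b_v1");
console.log(parsed);
```

Under this reading, the `extractor` and `version` parts line up with the `name` and `version` fields passed to `feature_extractors` in the SDK call.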

    Overview

    Qwen3-ASR-1.7B is Alibaba's flagship open-source speech recognition model, supporting 52 languages and dialects. It pairs a 300M-parameter AuT audio encoder with a Qwen3-1.7B decoder. Alibaba reports state-of-the-art accuracy among open-source ASR models and results competitive with the strongest proprietary APIs and with OpenAI's Whisper large v3.

    On Mixpeek, Qwen3-ASR powers multilingual transcription pipelines that need language coverage beyond European languages. Its dual-mode architecture supports streaming inference on 1-8 second chunks as well as offline processing of long recordings, so the same model serves both real-time and batch workloads.

    Architecture

    The AuT audio encoder (300M parameters, attention-encoder-decoder, hidden size 1024) compresses audio 8x into 12.5 Hz representations, which the Qwen3-1.7B decoder turns into text. A dynamic flash attention window of 1 to 8 seconds enables both streaming and offline inference. The model is listed at 1.7B parameters in total.
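To make the compression figures concrete, here is a small sketch of how many encoder output frames a given attention window would produce at 12.5 Hz. The 100 Hz input frame rate is only what 8x compression to 12.5 Hz implies, not a number stated on this page, and rounding up partial frames is an assumption.

```typescript
// Figures from the architecture description: 8x compression, 12.5 Hz output.
const OUTPUT_RATE_HZ = 12.5;
const COMPRESSION = 8;
// Implied input frame rate (an inference, not stated on the page): 12.5 * 8 = 100 Hz.
const INPUT_RATE_HZ = OUTPUT_RATE_HZ * COMPRESSION;

// Encoder output frames for a given audio duration.
// Rounding up a partial frame is an assumption made for this sketch.
function encoderFrames(seconds: number): number {
  return Math.ceil(seconds * OUTPUT_RATE_HZ);
}

console.log(`implied input rate: ${INPUT_RATE_HZ} Hz`);
for (const windowSeconds of [1, 8, 60]) {
  console.log(`${windowSeconds}s -> ${encoderFrames(windowSeconds)} frames`);
}
// prints 13, 100, and 750 frames for 1 s, 8 s, and 60 s
```

At this rate the largest 8-second window hands the decoder about 100 audio frames at a time.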

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and transcribe it with Qwen3-ASR-1.7B.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/multilingual-video.mp4" },
      feature_extractors: [{
        name: "transcription",
        version: "v1",
        params: {
          model_id: "Qwen/Qwen3-ASR-1.7B"
        }
      }]
    });

    Capabilities

    • 52 languages and dialects with automatic language detection
    • 1.63% WER on LibriSpeech Clean (offline mode)
    • Streaming inference with 1-8 second dynamic chunks
    • Timestamp prediction for word-level alignment
    • Competitive with the strongest proprietary ASR APIs
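As an illustration of the 1-8 second streaming chunks listed above, the sketch below splits a buffer of PCM samples into fixed-length chunks within that range. `chunkAudio` is a hypothetical client-side helper, not part of the Mixpeek SDK or the Qwen3-ASR API, and the 16 kHz sample rate is assumed purely for the example.

```typescript
// Hypothetical helper, not part of the Mixpeek SDK or the Qwen3-ASR API.
const MIN_CHUNK_S = 1;
const MAX_CHUNK_S = 8;

function chunkAudio(
  samples: Float32Array,
  sampleRate: number,
  chunkSeconds: number
): Float32Array[] {
  // Clamp the requested chunk length to the 1-8 s range described above.
  const seconds = Math.min(MAX_CHUNK_S, Math.max(MIN_CHUNK_S, chunkSeconds));
  const chunkSize = Math.round(seconds * sampleRate);
  if (chunkSize <= 0) throw new Error("sampleRate must be positive");

  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    // subarray returns a view, so no audio data is copied.
    chunks.push(samples.subarray(start, start + chunkSize));
  }
  return chunks;
}

// 20 s of silence at 16 kHz (sample rate assumed for illustration).
const audio = new Float32Array(20 * 16000);
const chunks = chunkAudio(audio, 16000, 8);
console.log(chunks.map((c) => c.length / 16000)); // chunk durations: [8, 8, 4]
```

The final chunk is simply shorter than the rest. Whether a real streaming client should pad or merge a short tail depends on the serving setup and is not specified here.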

    Use Cases on Mixpeek

    • Global multilingual video transcription spanning 52 languages for international content libraries
    • Streaming ASR for live captioning and real-time translation pipelines
    • Batch transcription of music, speech, and song content with language identification

    Benchmarks

    Dataset | Metric | Score | Source
    LibriSpeech Clean (offline) | WER | 1.63% | Alibaba Technical Report, Jan 2026
    LibriSpeech Other (offline) | WER | 3.38% | Alibaba Technical Report, Jan 2026
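For readers unfamiliar with the metric, word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below computes it with a word-level edit distance. It is a generic illustration, not the evaluation code behind the scores above, and it skips the text normalization (casing, punctuation) that published benchmarks typically apply.

```typescript
// WER = (substitutions + deletions + insertions) / reference word count,
// computed here with a word-level Levenshtein distance.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter((w) => w.length > 0);
  const hyp = hypothesis.trim().split(/\s+/).filter((w) => w.length > 0);
  if (ref.length === 0) throw new Error("reference must contain at least one word");

  // prev[j] = edit distance between the reference prefix processed so far
  // and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion (reference word missing from hypothesis)
        curr[j - 1] + 1,    // insertion (extra hypothesis word)
        prev[j - 1] + cost  // match or substitution
      ));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution in six reference words: 1/6, roughly 0.167.
console.log(wordErrorRate("the cat sat on the mat", "the cat sat on a mat"));
```

On this scale, a WER of 1.63% corresponds to fewer than two word errors per hundred reference words.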

    Performance

    Input Size: Variable-length audio (streaming: 1-8s chunks; offline: unlimited)
    GPU Latency: ~4s / minute of audio (A100, offline)
    GPU Throughput: ~15x realtime (A100)
    GPU Memory: ~5 GB
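The latency and throughput figures describe the same rate: at roughly 15x realtime, one minute of audio takes about 60 / 15 = 4 seconds. A back-of-the-envelope estimator, assuming throughput stays constant for longer recordings (which the page does not claim), might look like this. `estimateProcessingSeconds` is a hypothetical helper.

```typescript
// Approximate figure from the table above: ~15x realtime on an A100 (offline).
const REALTIME_FACTOR = 15;

// Estimated wall-clock seconds to transcribe `audioSeconds` of audio,
// assuming throughput stays constant (an assumption, not a guarantee).
function estimateProcessingSeconds(
  audioSeconds: number,
  realtimeFactor: number = REALTIME_FACTOR
): number {
  return audioSeconds / realtimeFactor;
}

console.log(estimateProcessingSeconds(60));   // one minute of audio -> 4
console.log(estimateProcessingSeconds(3600)); // one hour of audio -> 240
```

Real numbers will vary with hardware, batching, and audio content, so treat the result as a planning estimate only.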

    Specification

    Framework: HF
    Organization: Qwen
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: 1.7B
    License: Apache 2.0
    Downloads/mo: 320K

    Research Paper

    Qwen3-ASR Technical Report

    arxiv.org

    Build a pipeline with Qwen3-ASR-1.7B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
