    HF · Transcription · MIT

    distil-large-v3

    by distil-whisper

    6x faster speech recognition distilled from Whisper Large v3

    4.8M downloads/month
    756M params
    Identifiers
    Model ID
    distil-whisper/distil-large-v3
    Feature URI
    mixpeek://transcription@v1/distilwhisper_large_v3

    Overview

    Distil-Whisper Large v3 is a knowledge-distilled variant of OpenAI's Whisper Large v3. It stays within 1% word error rate (WER) of the teacher model while running 6.3x faster. The distillation process copies the full encoder and keeps a subset of maximally spaced decoder layers, which reduces the parameter count by 51% without significant quality loss.

    On Mixpeek, Distil-Whisper is the recommended transcription model for high-throughput pipelines where you need to process large audio and video libraries quickly while maintaining near-Whisper-level accuracy.

    Architecture

    Encoder-decoder Transformer. The encoder is copied in full from Whisper Large v3 and frozen during training. The decoder keeps a subset of the teacher's decoder layers, initialized from maximally spaced positions in the teacher's decoder stack. The student is trained via knowledge distillation on audio pseudo-labeled by the teacher.
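
The "maximally spaced" initialization described above can be illustrated with a small index-selection helper. This is a minimal sketch, not the authors' training code: the function name `maximallySpacedLayers`, the rounding rule, and the example layer counts (32 teacher layers, 2 student layers) are assumptions made for illustration and are not stated on this page.

```typescript
// Sketch: choose which teacher decoder layers a smaller student would copy,
// spreading the picks evenly from the first layer to the last.
// The function name and the rounding rule are illustrative assumptions.
function maximallySpacedLayers(teacherLayers: number, studentLayers: number): number[] {
  if (!Number.isInteger(teacherLayers) || !Number.isInteger(studentLayers)) {
    throw new RangeError("layer counts must be integers");
  }
  if (studentLayers < 2 || studentLayers > teacherLayers) {
    throw new RangeError("studentLayers must be between 2 and teacherLayers");
  }
  // Distance between consecutive picks, measured in teacher layer indices.
  const step = (teacherLayers - 1) / (studentLayers - 1);
  const picks: number[] = [];
  for (let i = 0; i < studentLayers; i++) {
    picks.push(Math.round(i * step));
  }
  return picks;
}

// Assumed example sizes: a 32-layer teacher decoder and a 2-layer student.
const twoLayerPicks = maximallySpacedLayers(32, 2); // first and last layer
const fourLayerPicks = maximallySpacedLayers(32, 4);
console.log(twoLayerPicks, fourLayerPicks);
```

With two student layers the helper always returns the first and last teacher layers, which is the widest spacing available.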

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "audio_transcription",
        version: "v1",
        params: {
          model_id: "distil-whisper/distil-large-v3"
        }
      }]
    });
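
For libraries with many files, it may help to build the request objects separately from the SDK call. The sketch below mirrors only the fields shown in the snippet above. The interfaces `IngestRequest` and `FeatureExtractor` and the helper `buildTranscriptionRequests` are hypothetical names introduced here, not types or functions exported by the Mixpeek SDK.

```typescript
// Hypothetical types that mirror the fields used in the snippet above.
// They are NOT official Mixpeek SDK types.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const DISTIL_WHISPER_MODEL_ID = "distil-whisper/distil-large-v3";

// Build one ingest request per media URL, all using the same extractor config.
function buildTranscriptionRequests(collectionId: string, urls: string[]): IngestRequest[] {
  if (collectionId.trim() === "") {
    throw new Error("collectionId must not be empty");
  }
  const requests: IngestRequest[] = [];
  for (const url of urls) {
    requests.push({
      collection_id: collectionId,
      source: { url: url },
      feature_extractors: [{
        name: "audio_transcription",
        version: "v1",
        params: { model_id: DISTIL_WHISPER_MODEL_ID }
      }]
    });
  }
  return requests;
}

const ingestRequests = buildTranscriptionRequests("my-collection", [
  "https://example.com/a.mp4",
  "https://example.com/b.mp4"
]);
// Each element could then be passed to mx.collections.ingest(...) as in the snippet above.
console.log(ingestRequests.length);
```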

    Capabilities

    • 6.3x faster than Whisper Large v3
    • Within 1% WER of the teacher on long-form audio
    • 51% fewer parameters than Whisper Large v3
    • Word-level timestamps and language detection
    • Robust to background noise and accents

    Use Cases on Mixpeek

    • High-throughput transcription of large video archives where speed is critical
    • Real-time subtitle generation for live streaming pipelines
    • Cost-efficient batch processing of audio content at scale
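
For the subtitle use case, timestamped transcript segments can be converted to the SubRip (SRT) format with a few lines of code. This is a sketch under stated assumptions: the `Segment` shape (start and end in seconds plus text) is hypothetical and is not the documented Mixpeek output schema, so the field names would need to be adapted to the actual response.

```typescript
// Hypothetical segment shape; adapt the field names to the real transcript output.
interface Segment {
  start: number; // seconds from the beginning of the media
  end: number;   // seconds from the beginning of the media
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function zeroPad(value: number, width: number): string {
  let s = String(value);
  while (s.length < width) {
    s = "0" + s;
  }
  return s;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  return zeroPad(h, 2) + ":" + zeroPad(m, 2) + ":" + zeroPad(s, 2) + "," + zeroPad(ms, 3);
}

// Emit numbered SRT cues separated by blank lines.
function toSrt(segments: Segment[]): string {
  return segments
    .map(function (seg, i) {
      return (i + 1) + "\n" + toSrtTime(seg.start) + " --> " + toSrtTime(seg.end) + "\n" + seg.text.trim() + "\n";
    })
    .join("\n");
}

console.log(toSrt([
  { start: 0, end: 2.5, text: "Hello" },
  { start: 2.5, end: 4, text: "world" }
]));
```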

    Benchmarks

    Dataset                     | Metric    | Score                  | Source
    LibriSpeech (test-clean)    | WER       | ~2.1%                  | Gandhi et al., 2023 (within 1% of Whisper Large v3)
    OOD short-form (4 datasets) | Avg WER   | Within 1.5% of teacher | Distil-Whisper model card
    Long-form (sequential)      | WER delta | < 1% vs Large v3       | Distil-Whisper model card
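
The WER figures above count word-level substitutions, deletions, and insertions against a reference transcript, divided by the number of reference words. The sketch below computes that ratio with a standard edit-distance table. The function names are illustrative, and the lowercase-and-split tokenizer is a simplifying assumption; published benchmarks typically apply a fuller text normalizer before scoring.

```typescript
// Simplified tokenizer (an assumption): lowercase and split on whitespace.
function tokenizeWords(text: string): string[] {
  const trimmed = text.trim().toLowerCase();
  return trimmed === "" ? [] : trimmed.split(/\s+/);
}

// WER = (substitutions + deletions + insertions) / number of reference words,
// computed as the word-level Levenshtein distance over the reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenizeWords(reference);
  const hyp = tokenizeWords(hypothesis);
  if (ref.length === 0) {
    throw new RangeError("reference must contain at least one word");
  }
  // prev[j] holds the distance between the processed reference prefix
  // and the first j hypothesis words.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitutionCost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,                   // deletion: reference word missing from hypothesis
        curr[j - 1] + 1,               // insertion: extra word in hypothesis
        prev[j - 1] + substitutionCost // match or substitution
      ));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One deleted word out of six reference words.
console.log(wordErrorRate("the cat sat on the mat", "the cat sat on mat"));
```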

    Performance

    Input Size: 30s audio chunks
    GPU Latency: ~50ms / 30s chunk (A100)
    GPU Throughput: ~35× realtime (A100)
    GPU Memory: ~1.6 GB
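
The chunk size and throughput figures above can be turned into a rough capacity estimate. This is a back-of-the-envelope sketch: it uses only the 30s chunk size and the ~35× realtime figure quoted above, the function names are illustrative, and real pipelines add overhead (download, decoding, queuing) that it ignores.

```typescript
const CHUNK_SECONDS = 30;   // input size quoted above
const REALTIME_FACTOR = 35; // approximate A100 throughput quoted above

// Number of 30s chunks needed to cover a recording of the given length.
function chunkCount(audioSeconds: number): number {
  if (audioSeconds < 0) {
    throw new RangeError("audioSeconds must be non-negative");
  }
  return Math.ceil(audioSeconds / CHUNK_SECONDS);
}

// Estimated GPU seconds needed to transcribe the audio at a given realtime factor.
function estimateGpuSeconds(audioSeconds: number, realtimeFactor: number = REALTIME_FACTOR): number {
  if (audioSeconds < 0 || realtimeFactor <= 0) {
    throw new RangeError("audioSeconds must be >= 0 and realtimeFactor must be > 0");
  }
  return audioSeconds / realtimeFactor;
}

// Example: a one-hour recording.
const oneHourSeconds = 3600;
console.log(chunkCount(oneHourSeconds), "chunks");
console.log(estimateGpuSeconds(oneHourSeconds).toFixed(1), "GPU seconds (estimate)");
```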

    756M params — 6.3x faster than Whisper Large v3 with near-identical accuracy

    Specification

    Framework: HF
    Organization: distil-whisper
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: 756M
    License: MIT
    Downloads/mo: 4.8M

    Research Paper

    Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling

    arxiv.org

    Build a pipeline with distil-large-v3

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
