distil-large-v3

by distil-whisper

6x faster speech recognition distilled from Whisper Large v3

4.8M downloads/month

756M params


Identifiers

Model ID

distil-whisper/distil-large-v3

Feature URI

mixpeek://transcription@v1/distilwhisper_large_v3

Overview

Distil-Whisper Large v3 is a knowledge-distilled variant of OpenAI's Whisper Large v3. It stays within 1% word error rate (WER) of the teacher model on long-form audio while running 6.3x faster. The distillation process copies the full encoder and keeps a subset of maximally spaced decoder layers, which cuts the parameter count by 51% (to 756M) without significant quality loss.

On Mixpeek, Distil-Whisper is the recommended transcription model for high-throughput pipelines that need to process large audio and video libraries quickly while keeping accuracy close to Whisper Large v3.

Architecture

Encoder-decoder Transformer. The encoder is copied in full from Whisper Large v3 and kept frozen during training. The decoder keeps only a subset of the teacher's decoder layers, initialized from maximally spaced layers of the teacher. The model is trained via knowledge distillation on pseudo-labeled audio data, that is, audio paired with transcripts generated by the teacher.
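The "maximally spaced" layer choice can be illustrated with a small sketch. The helper below is hypothetical (it is not part of Distil-Whisper or the Mixpeek SDK) and only shows one reasonable way to spread a fixed number of student layers evenly across a teacher's decoder stack. The layer counts in the usage lines (a 32-layer teacher decoder, 2 or 4 student layers) are assumptions chosen for illustration.

```typescript
// Hypothetical helper, for illustration only: pick `studentLayers` indices
// out of `teacherLayers` (0-based) so that the picks are spread as evenly
// as possible, always including the first and last teacher layer when more
// than one layer is kept.
function maximallySpacedLayers(teacherLayers: number, studentLayers: number): number[] {
  if (!Number.isInteger(teacherLayers) || !Number.isInteger(studentLayers)) {
    throw new RangeError("layer counts must be integers");
  }
  if (studentLayers < 1 || studentLayers > teacherLayers) {
    throw new RangeError("studentLayers must be between 1 and teacherLayers");
  }
  if (studentLayers === 1) {
    // Arbitrary choice for this sketch: keep the first layer.
    return [0];
  }
  // Ideal (fractional) distance between consecutive picks.
  const step = (teacherLayers - 1) / (studentLayers - 1);
  // Round each ideal position to the nearest real layer index.
  return Array.from({ length: studentLayers }, (_, i) => Math.round(i * step));
}

// Assumed layer counts, for illustration.
console.log(maximallySpacedLayers(32, 2)); // [ 0, 31 ]
console.log(maximallySpacedLayers(32, 4)); // [ 0, 10, 21, 31 ]
```

With two student layers the selection collapses to the first and last teacher layer, which is the most widely spaced pair available.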

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "audio_transcription",
    version: "v1",
    parameters: { model_id: "distil-whisper/distil-large-v3" },
  },
});

Capabilities

6.3x faster than Whisper Large v3
Within 1% WER of the teacher on long-form audio
51% fewer parameters than Whisper Large v3
Word-level timestamps and language detection
Robust to background noise and accents
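As a rough sanity check on the figures above, the sketch below works through the arithmetic. It uses only the numbers stated on this page (756M parameters, 51% fewer parameters, 6.3x speedup). The function names are hypothetical, and the time estimate assumes the speedup carries over unchanged to your hardware and batch settings, which may not hold in practice.

```typescript
// Figures quoted on this page.
const STUDENT_PARAMS_M = 756; // distilled model size, in millions of parameters
const PARAM_REDUCTION = 0.51; // fraction of teacher parameters removed
const SPEEDUP = 6.3;          // speed relative to Whisper Large v3

// Teacher size implied by the student size and the stated reduction:
// student = teacher * (1 - reduction)  =>  teacher = student / (1 - reduction)
function impliedTeacherParamsM(studentParamsM: number, reduction: number): number {
  return studentParamsM / (1 - reduction);
}

// Hypothetical estimate: if the teacher needs `teacherHours` of compute for a
// job, the student needs roughly teacherHours / speedup under the same setup.
function estimatedStudentHours(teacherHours: number, speedup: number): number {
  return teacherHours / speedup;
}

const teacherParamsM = impliedTeacherParamsM(STUDENT_PARAMS_M, PARAM_REDUCTION);
console.log(`Implied teacher size: ~${teacherParamsM.toFixed(0)}M parameters`);

const hours = estimatedStudentHours(100, SPEEDUP);
console.log(`A 100-hour teacher job would take ~${hours.toFixed(1)} hours`);
```

The implied teacher size comes out at roughly 1.5 billion parameters, so the 756M and 51% figures are consistent with each other.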

Use Cases on Mixpeek

High-throughput transcription of large video archives where speed is critical

Real-time subtitle generation for live streaming pipelines

Cost-efficient batch processing of audio content at scale

Benchmarks

Dataset	Metric	Score	Source
LibriSpeech (test-clean)	WER	~2.1%	Gandhi et al., 2023 — within 1% of Whisper Large v3
OOD short-form (4 datasets)	Avg WER	Within 1.5% of teacher	Distil-Whisper model card
Long-form (sequential)	WER delta	< 1% vs Large v3	Distil-Whisper model card
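For readers unfamiliar with the metric in the table, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that computation, not the evaluation code behind the scores above. The tokenizer here (lowercasing and whitespace splitting) is a simplifying assumption; published benchmarks typically apply a more thorough text normalizer first.

```typescript
// Simplified tokenizer (an assumption of this sketch): lowercase, split on whitespace.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// WER = (substitutions + deletions + insertions) / number of reference words,
// computed with a word-level Levenshtein distance using two rolling rows.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new RangeError("reference must contain at least one word");
  }
  // prev[j] = edits needed to turn the first i-1 reference words
  // into the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i]; // i deletions against an empty hypothesis
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion of a reference word
        curr[j - 1] + 1,    // insertion of a hypothesis word
        prev[j - 1] + cost, // match or substitution
      ));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
const wer = wordErrorRate("the cat sat on the mat", "the cat sit on mat");
console.log(`WER: ${(wer * 100).toFixed(1)}%`); // WER: 33.3%
```

Under this definition, a "within 1%" claim is most naturally read as a difference of at most one WER percentage point between the student and teacher scores on the same test set.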