    NeMo · Transcription · CC-BY-4.0

    parakeet-tdt-0.6b-v3

    by nvidia

    600M multilingual ASR with 25-language support and automatic language detection

    420K downloads/month
    600M parameters

    Identifiers

    Model ID: nvidia/parakeet-tdt-0.6b-v3
    Feature URI: mixpeek://transcription@v1/nvidia_parakeet_tdt_v3
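
    The Feature URI appears to pack the extractor name, its version, and a model slug into one string. The sketch below splits it into those parts. The layout (feature@version/model) is inferred from the single example on this page, not from a published grammar, and parseFeatureUri is a hypothetical helper, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI layout (feature@version/model) is an
// assumption inferred from one example, not a documented format.
interface FeatureUri {
  feature: string; // e.g. "transcription"
  version: string; // e.g. "v1"
  model: string;   // e.g. "nvidia_parakeet_tdt_v3"
}

function parseFeatureUri(uri: string): FeatureUri {
  // scheme://feature@version/model
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  return {
    feature: String(match[1]),
    version: String(match[2]),
    model: String(match[3]),
  };
}
```

    Under this reading, the feature and version parts line up with the name and version fields passed to feature_extractors in the SDK call further down.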

    Overview

    Parakeet TDT 0.6B v3 is NVIDIA's multilingual speech-to-text model, built on the FastConformer-TDT architecture and trained on over 670,000 hours of audio from NVIDIA's Granary dataset. It extends the English-only v2 to 25 European languages with automatic language detection. It achieves a 6.34% average WER on the Hugging Face Open ASR Leaderboard while keeping throughput among the highest of any multilingual model.

    On Mixpeek, Parakeet TDT powers cost-efficient multilingual transcription pipelines for workloads that need Whisper-class accuracy at lower compute cost. Its 600M parameters and FastConformer architecture deliver roughly 20x realtime throughput on an A100, which suits batch processing of large audio and video archives across European languages.

    Architecture

    FastConformer encoder with Token-and-Duration Transducer (TDT) decoder. 600M parameters. Uses a unified SentencePiece tokenizer with 8,192-token vocabulary. Supports audio up to 3 hours via local attention mode. Automatic language identification across 25 languages.
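
    Recordings longer than the 3-hour limit would need to be split before transcription. The following is a minimal sketch of how segment boundaries could be planned. The function name, the Segment shape, and the optional overlap are illustrative assumptions, and the sketch only computes time ranges (it does not cut audio).

```typescript
// 3-hour ceiling taken from the architecture notes above.
const MAX_SEGMENT_SECONDS = 3 * 60 * 60;

interface Segment {
  startSec: number;
  endSec: number;
}

// Hypothetical planner: splits a recording of totalSec seconds into
// consecutive ranges no longer than maxSec, optionally overlapping so
// that words at a boundary appear in both neighbouring segments.
function planSegments(
  totalSec: number,
  maxSec: number = MAX_SEGMENT_SECONDS,
  overlapSec: number = 0,
): Segment[] {
  if (maxSec <= 0 || overlapSec < 0 || overlapSec >= maxSec) {
    throw new Error("require maxSec > 0 and 0 <= overlapSec < maxSec");
  }
  const segments: Segment[] = [];
  let start = 0;
  while (start < totalSec) {
    const end = Math.min(start + maxSec, totalSec);
    segments.push({ startSec: start, endSec: end });
    if (end === totalSec) break;
    start = end - overlapSec; // step back to create the overlap
  }
  return segments;
}
```

    For a 4-hour recording with no overlap this yields two ranges, 0 to 10,800 s and 10,800 to 14,400 s. Transcripts from overlapping segments would still need de-duplication at the seams, which is not shown here.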

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and transcribe it with Parakeet TDT 0.6B v3.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/interview.mp4" },
      feature_extractors: [{
        name: "transcription",
        version: "v1",
        params: {
          model_id: "nvidia/parakeet-tdt-0.6b-v3"
        }
      }]
    });

    Capabilities

    • 25 European languages with automatic detection
    • 1.93% WER on LibriSpeech test-clean
    • 6.34% average WER on Open ASR Leaderboard
    • Audio up to 3 hours via local attention mode
    • Word-level timestamps included
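
    Word-level timestamps make it straightforward to derive subtitle files. The sketch below groups timed words into SRT cues. The TimedWord shape (word, start, and end in seconds) is an assumption for illustration and the actual transcription output format may differ.

```typescript
// Assumed shape of one timed word; not the documented output schema.
interface TimedWord {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
function formatSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Groups words into cues of at most maxWordsPerCue words each.
function wordsToSrt(words: TimedWord[], maxWordsPerCue: number = 8): string {
  if (maxWordsPerCue < 1) throw new Error("maxWordsPerCue must be >= 1");
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += maxWordsPerCue) {
    const group = words.slice(i, i + maxWordsPerCue);
    const first = group[0];
    const last = group[group.length - 1];
    if (first === undefined || last === undefined) continue;
    const text = group.map((w) => w.word).join(" ");
    const range = `${formatSrtTime(first.start)} --> ${formatSrtTime(last.end)}`;
    cues.push(`${cues.length + 1}\n${range}\n${text}`);
  }
  return cues.length > 0 ? cues.join("\n\n") + "\n" : "";
}
```

    Fixed-size grouping is the simplest choice. A production captioner would more likely break cues on pauses or punctuation.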

    Use Cases on Mixpeek

    • Multilingual video transcription for European content libraries at scale
    • Batch audio processing of podcasts and meetings across 25 languages
    • Cost-efficient ASR pipeline replacing Whisper for European language content

    Benchmarks

    Dataset                      Metric   Score   Source
    LibriSpeech test-clean       WER      1.93%   NVIDIA, 2025 (Model Card)
    LibriSpeech test-other       WER      3.59%   NVIDIA, 2025 (Model Card)
    Open ASR Leaderboard (avg)   WER      6.34%   NVIDIA, 2025 (Model Card)
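
    WER (word error rate) is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of reference words. The sketch below computes it with a word-level edit distance. Its normalization (lowercasing and whitespace splitting) is a simplifying assumption, whereas published benchmarks apply more thorough text normalization, so scores from this function are not directly comparable to the figures above.

```typescript
// Simplified tokenizer (assumption): lowercase, split on whitespace.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with a two-row Levenshtein distance over words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }
  // prev[j] = distance between the first i-1 ref words and first j hyp words
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion of a reference word
        curr[j - 1] + 1,    // insertion of a hypothesis word
        prev[j - 1] + cost, // match or substitution
      ));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

    For example, against the reference "the cat sat on the mat", the hypothesis "the cat sit on mat" has one substitution and one deletion, giving 2 / 6, or about 33% WER.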

    Performance

    Input Size: Variable-length audio (up to 3 hours)
    GPU Latency: ~3s / minute of audio (A100)
    GPU Throughput: ~20x realtime (A100)
    GPU Memory: ~2.5 GB
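
    The latency and throughput rows describe the same rate: 60 s of audio in about 3 s is 60 / 3 = 20x realtime. A rough capacity estimate for an archive follows directly from that. The helper below is a back-of-the-envelope sketch that assumes a single A100 sustaining the quoted rate and ignores decoding, I/O, and queueing overhead.

```typescript
// ~20x realtime on an A100, from the figures above (approximate).
const REALTIME_FACTOR = 20;

// Estimated GPU seconds needed to transcribe audioSeconds of audio.
function estimateGpuSeconds(
  audioSeconds: number,
  realtimeFactor: number = REALTIME_FACTOR,
): number {
  if (audioSeconds < 0 || realtimeFactor <= 0) {
    throw new Error("require audioSeconds >= 0 and realtimeFactor > 0");
  }
  return audioSeconds / realtimeFactor;
}

// Example: a 1,000-hour archive.
const archiveGpuHours = estimateGpuSeconds(1000 * 3600) / 3600;
console.log(`${archiveGpuHours} GPU-hours`);
```

    At this rate a 1,000-hour archive works out to about 50 GPU-hours. Actual throughput depends on batch size and audio length, so treat the result as an order-of-magnitude figure.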

    Specification

    Framework: NeMo
    Organization: nvidia
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: 600M
    License: CC-BY-4.0
    Downloads/mo: 420K

    Research Paper

    Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient Multilingual ASR (arxiv.org)

    Build a pipeline with parakeet-tdt-0.6b-v3

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
