    HF | Transcription | MIT

    moonshine-streaming-medium

    by usefulsensors

    245M-parameter streaming ASR with 107ms latency; matches Whisper Large V3 accuracy with 6x fewer parameters

    180K downloads/month
    245M parameters
    Identifiers
    Model ID
    usefulsensors/moonshine-streaming-medium
    Feature URI
    mixpeek://transcription@v1/moonshine_streaming_medium_v1
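
    The Feature URI above appears to follow a `mixpeek://<feature>@<version>/<extractor>` pattern. The sketch below splits such a string into its parts. The pattern is inferred from this single example, and `parseFeatureUri` is a hypothetical helper, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI grammar is inferred from one example,
// not taken from Mixpeek documentation.
interface FeatureUri {
  feature: string;   // e.g. "transcription"
  version: string;   // e.g. "v1"
  extractor: string; // e.g. "moonshine_streaming_medium_v1"
}

function parseFeatureUri(uri: string): FeatureUri | null {
  // Assumed shape: mixpeek://<feature>@<version>/<extractor>
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) return null;
  const [, feature, version, extractor] = match;
  return { feature, version, extractor };
}

const parsedUri = parseFeatureUri(
  "mixpeek://transcription@v1/moonshine_streaming_medium_v1"
);
console.log(parsedUri);
```

    Returning `null` for strings that do not match keeps the caller in charge of deciding whether an unrecognized URI is an error.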

    Overview

    Moonshine Streaming Medium is a 245M-parameter automatic speech recognition model designed for real-time, low-latency streaming on edge-class hardware. It pairs a lightweight 50Hz audio frontend with a sliding-window Transformer encoder that uses bounded local attention and no positional embeddings (an "ergodic" encoder), while an adapter injects positional information before a standard autoregressive decoder.
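
    Bounded local attention means each encoder frame attends only to frames inside a fixed window around it, so the work per frame stays constant however long the stream runs. A minimal sketch of such a mask follows. The window sizes are illustrative assumptions, not the model's actual configuration, and `slidingWindowMask` is a hypothetical name.

```typescript
// Build a boolean mask where mask[i][j] is true if frame i may attend to frame j.
// `left` and `right` are illustrative window sizes, not the model's real values.
function slidingWindowMask(
  numFrames: number,
  left: number,
  right: number
): boolean[][] {
  const mask: boolean[][] = [];
  for (let i = 0; i < numFrames; i++) {
    const row: boolean[] = [];
    for (let j = 0; j < numFrames; j++) {
      // Frame i sees at most `left` past frames, itself, and `right` future frames.
      row.push(j >= i - left && j <= i + right);
    }
    mask.push(row);
  }
  return mask;
}

// Print each row as a string of 1s and 0s for inspection.
const demoMask = slidingWindowMask(6, 2, 1);
for (const row of demoMask) {
  console.log(row.map((allowed) => (allowed ? "1" : "0")).join(""));
}
```

    Away from the edges, every row has the same `left + 1 + right` allowed positions, shifted by one frame. The rule depends only on the offset `j - i`, which is why an encoder built this way can do without absolute positional embeddings.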

    Trained on roughly 300K hours of speech data, the model achieves transcription quality on par with Whisper Large V3 while running at 107ms latency on a MacBook Pro and using 6x fewer parameters. On Mixpeek, Moonshine Streaming provides a fast, lightweight alternative to Whisper for English ASR pipelines where latency and compute cost matter more than multilingual support.

    Architecture

    Lightweight 50Hz audio frontend + sliding-window Transformer encoder with bounded local attention and no positional embeddings (ergodic encoder). Adapter layer injects positional information before autoregressive decoder. 245M total parameters. Trained on ~300K hours of speech data.
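
    A 50Hz frontend emits one feature frame for every 20ms of audio. The short sketch below makes that arithmetic explicit. The function names are hypothetical, and only the 50Hz rate comes from the description above.

```typescript
// Frame rate taken from the architecture description above.
const FRAME_RATE_HZ = 50;

// Duration of audio covered by one frontend frame, in milliseconds.
function frameDurationMs(frameRateHz: number = FRAME_RATE_HZ): number {
  return 1000 / frameRateHz;
}

// Number of whole frames produced for a clip of the given length.
function framesForDuration(
  seconds: number,
  frameRateHz: number = FRAME_RATE_HZ
): number {
  return Math.floor(seconds * frameRateHz);
}

console.log(frameDurationMs());     // 20
console.log(framesForDuration(10)); // 500
```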

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and run the transcription extractor with this model.
    await mx.collections.ingest({
      collection_id: "live-content",
      source: { url: "https://example.com/livestream.mp4" },
      feature_extractors: [{
        feature: "transcription",
        model: "usefulsensors/moonshine-streaming-medium"
      }]
    });

    Capabilities

    • 107ms streaming latency on consumer hardware
    • Accuracy matching Whisper Large V3 at 6x fewer params
    • Ergodic encoder for unbounded-length streaming
    • Optimized for edge and on-device deployment
    • 245M parameters — fits on mobile and embedded hardware

    Use Cases on Mixpeek

    • Real-time video transcription: stream captions during live content ingestion
    • Edge ASR pipelines: transcribe audio on-device before uploading to Mixpeek
    • Low-latency content indexing: process audio streams with minimal delay for near-real-time search
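
    For the captioning use case, timestamped transcript segments can be turned into WebVTT cues. The sketch below assumes a simple `Segment` shape with start and end times in seconds. That shape is hypothetical, and the actual transcription output schema on Mixpeek may differ.

```typescript
// Hypothetical segment shape; the real output schema may differ.
interface Segment {
  start: number; // seconds from the beginning of the stream
  end: number;   // seconds from the beginning of the stream
  text: string;
}

// Format seconds as a WebVTT timestamp: hh:mm:ss.ttt
function toVttTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number): string =>
    String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Join segments into a WebVTT document: a header, then one cue per segment.
function toWebVtt(segments: Segment[]): string {
  const cues = segments.map(
    (seg) =>
      `${toVttTimestamp(seg.start)} --> ${toVttTimestamp(seg.end)}\n${seg.text}`
  );
  return ["WEBVTT", ...cues].join("\n\n") + "\n";
}

console.log(
  toWebVtt([
    { start: 0, end: 1.25, text: "Welcome to the stream." },
    { start: 1.25, end: 3.0, text: "Captions appear as audio arrives." },
  ])
);
```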

    Benchmarks

    Dataset | Metric | Score | Source
    LibriSpeech (clean) | WER | ~3.0% | Useful Sensors, 2026 (arXiv:2602.12241)
    Edge latency (MacBook Pro) | Latency | 107ms | Useful Sensors, 2026 (arXiv:2602.12241)
    vs Whisper Large V3 | Params ratio | 6x smaller, comparable WER | Useful Sensors, 2026 (arXiv:2602.12241)
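
    Word error rate (WER) is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A WER of about 3.0% therefore means roughly 3 errors per 100 reference words. The sketch below is a generic implementation for illustration, not the evaluation code behind the figures above. Published evaluations usually also normalize casing and punctuation first, which this sketch leaves out.

```typescript
// Word error rate: (substitutions + deletions + insertions) / reference word count.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter((w) => w.length > 0);
  const hyp = hypothesis.trim().split(/\s+/).filter((w) => w.length > 0);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] holds the edit distance between the reference words processed
  // so far and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitutionCost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1,                   // deletion of a reference word
          curr[j - 1] + 1,               // insertion of a hypothesis word
          prev[j - 1] + substitutionCost // match or substitution
        )
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
console.log(wordErrorRate("the cat sat on the mat", "the cat sit on mat"));
```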

    Performance

    Input Size: Streaming audio (unbounded length)
    GPU Latency: ~50ms / chunk (A100)
    CPU Latency: ~107ms / chunk (MacBook Pro)
    GPU Throughput: ~20x real-time (A100)
    GPU Memory: ~0.8 GB
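
    A throughput of about 20x real-time means one hour of audio takes roughly three minutes to process. The sketch below shows the arithmetic. The function name is hypothetical, and the 20x figure is the approximate A100 value listed above, so results are rough estimates rather than guarantees.

```typescript
// Approximate speedup over real time, from the "~20x real-time (A100)" figure above.
const GPU_SPEEDUP = 20;

// Estimate wall-clock processing time for a given amount of audio.
function estimateProcessingSeconds(
  audioSeconds: number,
  speedup: number = GPU_SPEEDUP
): number {
  if (speedup <= 0) {
    throw new Error("speedup must be positive");
  }
  return audioSeconds / speedup;
}

console.log(estimateProcessingSeconds(3600)); // 180 seconds for one hour of audio
```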

    Specification

    Framework: HF
    Organization: usefulsensors
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: 245M
    License: MIT
    Downloads/mo: 180K

    Research Paper

    Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications

    arxiv.org

    Build a pipeline with moonshine-streaming-medium

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
