moonshine-streaming-medium

by usefulsensors

245M streaming ASR with 107ms latency — beats Whisper Large V3 at 6x fewer parameters

180Kdl/month

245Mparams

HuggingFace Run on your data, free

Identifiers

Model ID

usefulsensors/moonshine-streaming-medium

Feature URI

mixpeek://transcription@v1/moonshine_streaming_medium_v1

Overview

Moonshine Streaming Medium is a 245M-parameter automatic speech recognition model designed for real-time, low-latency streaming on edge-class hardware. It pairs a lightweight 50Hz audio frontend with a sliding-window Transformer encoder that uses bounded local attention and no positional embeddings (an "ergodic" encoder), while an adapter injects positional information before a standard autoregressive decoder.

Trained on roughly 300K hours of speech data, the model achieves transcription quality on par with Whisper Large V3 while running at 107ms latency on a MacBook Pro and using 6x fewer parameters. On Mixpeek, Moonshine Streaming provides a fast, lightweight alternative to Whisper for English ASR pipelines where latency and compute cost matter more than multilingual support.

Architecture

Lightweight 50Hz audio frontend + sliding-window Transformer encoder with bounded local attention and no positional embeddings (ergodic encoder). Adapter layer injects positional information before autoregressive decoder. 245M total parameters. Trained on ~300K hours of speech data.

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "transcription",
    version: "v1",
    parameters: { model_id: "usefulsensors/moonshine-streaming-medium" },
  },
});

Capabilities

107ms streaming latency on consumer hardware
Accuracy matching Whisper Large V3 at 6x fewer params
Ergodic encoder for unbounded-length streaming
Optimized for edge and on-device deployment
245M parameters — fits on mobile and embedded hardware

Use Cases on Mixpeek

Real-time video transcription: stream captions during live content ingestion

Edge ASR pipelines: transcribe audio on-device before uploading to Mixpeek

Low-latency content indexing: process audio streams with minimal delay for near-real-time search

Benchmarks

Dataset	Metric	Score	Source
LibriSpeech (clean)	WER	~3.0%	Useful Sensors, 2026 — arxiv:2602.12241
Edge latency (MacBook Pro)	Latency	107ms	Useful Sensors, 2026 — arxiv:2602.12241
vs Whisper Large V3	Params ratio	6x smaller, comparable WER	Useful Sensors, 2026 — arxiv:2602.12241