moonshine-streaming-medium
by usefulsensors
245M-parameter streaming ASR with 107ms latency, matching Whisper Large V3 accuracy with 6x fewer parameters
usefulsensors/moonshine-streaming-medium
mixpeek://transcription@v1/moonshine_streaming_medium_v1
Overview
Moonshine Streaming Medium is a 245M-parameter automatic speech recognition model designed for real-time, low-latency streaming on edge-class hardware. It pairs a lightweight 50Hz audio frontend with a sliding-window Transformer encoder that uses bounded local attention and no positional embeddings (an "ergodic" encoder), while an adapter injects positional information before a standard autoregressive decoder.
Trained on roughly 300K hours of speech data, the model achieves transcription quality on par with Whisper Large V3 while running at 107ms latency on a MacBook Pro and using 6x fewer parameters. On Mixpeek, Moonshine Streaming provides a fast, lightweight alternative to Whisper for English ASR pipelines where latency and compute cost matter more than multilingual support.
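The "6x fewer parameters" figure can be checked with a line of arithmetic. The sketch below assumes a Whisper Large V3 size of about 1,550M parameters, a number taken from OpenAI's published model listing rather than from this page; the function name is illustrative.

```typescript
// Parameter counts in millions. The Whisper figure is an assumption taken
// from OpenAI's published model listing, not from this page.
const MOONSHINE_STREAMING_MEDIUM_PARAMS_M = 245;
const WHISPER_LARGE_V3_PARAMS_M = 1550;

/** How many times larger the first model is than the second. */
function parameterRatio(largerM: number, smallerM: number): number {
  return largerM / smallerM;
}

const ratio = parameterRatio(
  WHISPER_LARGE_V3_PARAMS_M,
  MOONSHINE_STREAMING_MEDIUM_PARAMS_M
);
console.log(`Whisper Large V3 is about ${ratio.toFixed(1)}x larger`);
// prints "Whisper Large V3 is about 6.3x larger"
```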
Architecture
A lightweight 50Hz audio frontend feeds a sliding-window Transformer encoder with bounded local attention and no positional embeddings (the ergodic encoder). An adapter layer injects positional information before the autoregressive decoder. The model has 245M parameters in total and was trained on ~300K hours of speech data.
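To make "bounded local attention" concrete, here is a minimal sketch of a sliding-window attention mask over 50Hz frames. The window sizes (4 frames of left context, 1 of right context) are illustrative assumptions, not values from the Moonshine paper, and the function names are hypothetical.

```typescript
// Sketch of a bounded local-attention mask. The window sizes used below are
// illustrative assumptions, not values from the Moonshine paper.

const FRAME_RATE_HZ = 50; // frontend rate stated in the text (20 ms per frame)

/** Number of encoder frames produced for `seconds` of audio at 50 Hz. */
function framesForDuration(seconds: number): number {
  return Math.round(seconds * FRAME_RATE_HZ);
}

/**
 * mask[i][j] is true when frame i may attend to frame j, i.e. when j lies
 * within `left` frames before i or `right` frames after i.
 */
function localAttentionMask(
  numFrames: number,
  left: number,
  right: number
): boolean[][] {
  const mask: boolean[][] = [];
  for (let i = 0; i < numFrames; i++) {
    const row: boolean[] = [];
    for (let j = 0; j < numFrames; j++) {
      row.push(j >= i - left && j <= i + right);
    }
    mask.push(row);
  }
  return mask;
}

// 0.2 s of audio -> 10 frames; each frame sees at most 4 + 1 + 1 = 6 frames.
const attentionMask = localAttentionMask(framesForDuration(0.2), 4, 1);
console.log(
  attentionMask
    .map((row) => row.map((allowed) => (allowed ? "x" : ".")).join(""))
    .join("\n")
);
```

Because each row depends only on the offset `j - i` and never on an absolute position, the same pattern repeats at every frame, which is why the per-frame cost stays constant however long the stream runs.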
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "live-content",
  source: { url: "https://example.com/livestream.mp4" },
  feature_extractors: [
    {
      feature: "transcription",
      model: "usefulsensors/moonshine-streaming-medium"
    }
  ]
});
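When several sources share the same transcription settings, the request object passed to `collections.ingest` above can be built by a small helper. The sketch below is hypothetical: `buildTranscriptionIngest` and the two interfaces are not part of the Mixpeek SDK, they only mirror the field names shown in the snippet, and the helper never calls the SDK itself.

```typescript
// Hypothetical helper: builds the same request shape the snippet above passes
// to mx.collections.ingest. The types mirror that snippet and are not taken
// from the Mixpeek SDK's own type definitions.

interface FeatureExtractor {
  feature: string;
  model: string;
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const MOONSHINE_MODEL_ID = "usefulsensors/moonshine-streaming-medium";

function buildTranscriptionIngest(
  collectionId: string,
  sourceUrl: string
): IngestRequest {
  // Reject anything that is not an http(s) URL before building the request.
  if (!/^https?:\/\//.test(sourceUrl)) {
    throw new Error(`source URL must start with http:// or https://: ${sourceUrl}`);
  }
  return {
    collection_id: collectionId,
    source: { url: sourceUrl },
    feature_extractors: [
      { feature: "transcription", model: MOONSHINE_MODEL_ID },
    ],
  };
}

const ingestRequest = buildTranscriptionIngest(
  "live-content",
  "https://example.com/livestream.mp4"
);
console.log(JSON.stringify(ingestRequest, null, 2));
```

The resulting object could then be passed as the argument to `mx.collections.ingest(...)` exactly as in the snippet above.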
Capabilities
- 107ms streaming latency on consumer hardware
- Accuracy matching Whisper Large V3 at 6x fewer params
- Ergodic encoder for unbounded-length streaming
- Optimized for edge and on-device deployment
- 245M parameters — fits on mobile and embedded hardware
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| LibriSpeech (clean) | WER | ~3.0% | Useful Sensors, 2026 — arxiv:2602.12241 |
| Edge latency (MacBook Pro) | Latency | 107ms | Useful Sensors, 2026 — arxiv:2602.12241 |
| vs Whisper Large V3 | Params ratio | 6x smaller, comparable WER | Useful Sensors, 2026 — arxiv:2602.12241 |
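The WER column above is word error rate: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch follows, assuming lowercase whitespace tokenization; published benchmarks usually apply heavier text normalization, so this will not reproduce the table's figures exactly. The function names are illustrative.

```typescript
/** Lowercase and split on whitespace; real evaluations normalize more heavily. */
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((word) => word.length > 0);
}

/** Word-level Levenshtein distance between two token arrays. */
function editDistance(ref: string[], hyp: string[]): number {
  // prev[j] = distance between the first i-1 ref words and the first j hyp words
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i]; // deleting all i ref words to reach an empty hyp
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length];
}

/** WER = (substitutions + deletions + insertions) / reference word count. */
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }
  return editDistance(ref, tokenize(hypothesis)) / ref.length;
}

// One substitution (fox -> box) and one insertion (over) against 5 reference words.
const wer = wordErrorRate(
  "the quick brown fox jumps",
  "the quick brown box jumps over"
);
console.log(`WER = ${(wer * 100).toFixed(1)}%`); // prints "WER = 40.0%"
```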
Research Paper
Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
arxiv.org
Build a pipeline with moonshine-streaming-medium
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.