WAVE-7B
by tsinghua-ee
Unified audio-visual embeddings for text, audio, silent video, and synchronized clips
tsinghua-ee/WAVE-7B
mixpeek://audio_extractor@v1/tsinghua_wave_7b_v1
Overview
WAVE-7B is a Qwen2.5-Omni-based embedding model for unified audio-visual retrieval. It maps text, audio, silent video, and synchronized audio-video inputs into a shared representation space, and its embeddings are prompt-aware, so retrieval can be tailored to a specific instruction.
On Mixpeek, WAVE is a strong candidate when agents need to search multimodal observations where sound and motion both matter. A support agent can retrieve the clip where a machine squeals before stopping; a media agent can find a scene by its crowd sound and camera motion; an inspection agent can search for audiovisual anomalies without relying on transcripts alone.
Architecture
WAVE-7B is a 7B-class multimodal embedding model built on Qwen2.5-Omni, with hierarchical feature fusion and a dual audio encoder that covers both speech and environmental sound. It is trained with multimodal, multitask contrastive objectives across text, audio, video, and audio-video pairs.
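To make "contrastive objectives" concrete, here is a minimal sketch of a generic symmetric InfoNCE loss over a batch of paired embeddings (for example, an audio clip and its caption). This is a textbook formulation, not WAVE's training code: the function names, the temperature of 0.07, and the toy vectors are all assumptions chosen for illustration.

```typescript
// Generic symmetric InfoNCE loss. Illustrative only: NOT taken from the WAVE
// codebase; names and the default temperature are assumptions.
type Vec = number[];

function dot(a: Vec, b: Vec): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// Scale a vector to unit length so dot products become cosine similarities.
function normalize(v: Vec): Vec {
  const norm = Math.sqrt(dot(v, v)) || 1;
  return v.map((x) => x / norm);
}

// Cross-entropy of one row of logits against the target index,
// using a numerically stable log-sum-exp.
function rowLoss(logits: number[], target: number): number {
  const max = Math.max(...logits);
  const lse = max + Math.log(logits.reduce((acc, l) => acc + Math.exp(l - max), 0));
  return lse - logits[target];
}

// queries[i] and keys[i] form a positive pair; every other key in the batch
// acts as a negative for queries[i], and vice versa.
function infoNCE(queries: Vec[], keys: Vec[], temperature = 0.07): number {
  if (queries.length === 0 || queries.length !== keys.length) {
    throw new Error("queries and keys must be non-empty and the same length");
  }
  const q = queries.map(normalize);
  const k = keys.map(normalize);
  const n = q.length;
  let total = 0;
  for (let i = 0; i < n; i++) {
    const queryToKeys = k.map((kj) => dot(q[i], kj) / temperature);
    const keyToQueries = q.map((qj) => dot(qj, k[i]) / temperature);
    total += rowLoss(queryToKeys, i) + rowLoss(keyToQueries, i);
  }
  return total / (2 * n);
}

// Toy batch: correctly aligned pairs should score a much lower loss
// than the same pairs with the keys swapped.
const toyQueries: Vec[] = [[1, 0], [0, 1]];
const alignedLoss = infoNCE(toyQueries, [[1, 0], [0, 1]]);
const shuffledLoss = infoNCE(toyQueries, [[0, 1], [1, 0]]);
```

Minimizing a loss of this shape pulls matching pairs together and pushes mismatched pairs apart, which is what lets different modalities share one embedding space.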
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "audio_embeddings",
    version: "v1",
    parameters: { model_id: "tsinghua-ee/WAVE-7B" },
  },
});

Capabilities
- Any-to-any retrieval across text, audio, video, and audio-video clips
- Prompt-aware embeddings for task-specific search
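- Strong audio and audiovisual retrieval performance (scores of 44.2 on AudioCaps audio retrieval and 25.0 on VGGSound audio-video retrieval, per the WAVE model card)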
- Apache 2.0 license
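Any-to-any retrieval works because every modality lands in the same vector space, so a query from one modality can be ranked against items from any other by vector similarity. The sketch below shows that ranking step with cosine similarity. It assumes the embeddings have already been computed; the `IndexedItem` shape, the item ids, and the 3-dimensional toy vectors are hypothetical and do not reflect the Mixpeek API or real WAVE-7B outputs.

```typescript
// Hypothetical record shape for an already-embedded item (not a Mixpeek type).
interface IndexedItem {
  id: string;
  modality: "text" | "audio" | "video" | "audio-video";
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let ab = 0;
  let aa = 0;
  let bb = 0;
  for (let i = 0; i < a.length; i++) {
    ab += a[i] * b[i];
    aa += a[i] * a[i];
    bb += b[i] * b[i];
  }
  const denom = Math.sqrt(aa) * Math.sqrt(bb);
  return denom === 0 ? 0 : ab / denom;
}

// Score every item against the query and return the topK best matches,
// regardless of which modality each item came from.
function rankBySimilarity(query: number[], items: IndexedItem[], topK = 3) {
  return items
    .map((item) => ({
      id: item.id,
      modality: item.modality,
      score: cosine(query, item.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy index with made-up 3-d vectors standing in for real embeddings.
const toyIndex: IndexedItem[] = [
  { id: "clip_machine_squeal", modality: "audio-video", embedding: [0.9, 0.1, 0.0] },
  { id: "clip_crowd_cheer", modality: "audio", embedding: [0.1, 0.9, 0.1] },
  { id: "clip_silent_pan", modality: "video", embedding: [0.0, 0.2, 0.9] },
];

// Stand-in for the embedding of a text query such as "machine squeals before stopping".
const toyQuery = [0.8, 0.2, 0.1];
const hits = rankBySimilarity(toyQuery, toyIndex, 2);
console.log(hits.map((h) => `${h.id} (${h.modality})`));
```

In a managed setup the vector store performs this ranking at scale; the brute-force loop here only illustrates why a single query vector can match clips of any modality.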
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| MMEB-v2-video | Overall | 59.9 | WAVE model card |
| AudioCaps | Audio retrieval | 44.2 | WAVE model card |
| VGGSound | Audio-video retrieval | 25.0 | WAVE model card |
Research Paper
WAVE: Learning Unified and Versatile Audio-Visual Embeddings with Multimodal LLM
arxiv.org
Build a pipeline with WAVE-7B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.