    HF · Audio Embeddings · apache-2.0

    WAVE-7B

    by tsinghua-ee

    Unified audio-visual embeddings for text, audio, silent video, and synchronized clips

    230 downloads/month
    5 likes
    7B params

    Identifiers

    Model ID: tsinghua-ee/WAVE-7B
    Feature URI: mixpeek://audio_extractor@v1/tsinghua_wave_7b_v1

    Overview

    WAVE-7B is a Qwen2.5-Omni-based embedding model for unified audio-visual retrieval. It maps text, audio, silent video, and synchronized audio-video inputs into one shared representation space, with prompt-aware embeddings that condition on the retrieval instruction.

    On Mixpeek, WAVE is a strong candidate when agents need to search multimodal observations where sound and motion both matter. A support agent can retrieve the clip where a machine squeals before stopping; a media agent can find a scene by its crowd sound and camera motion; an inspection agent can search for audiovisual anomalies without relying on transcripts alone.
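
Because all modalities land in one representation space, cross-modal search reduces to comparing vectors. The sketch below illustrates that idea with toy 3-dimensional vectors. It does not call the model or the Mixpeek SDK; the `Embedded` type, the clip ids, and the choice of cosine similarity as the scoring function are assumptions made for illustration, not details confirmed by this page.

```typescript
// Hypothetical types and data: nothing here calls the Mixpeek SDK or the model.
type Embedded = {
  id: string;
  modality: "text" | "audio" | "video" | "audio-video";
  vector: number[];
};

// Cosine similarity between two vectors of equal dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every item against the query and keep the topK highest scores.
function rankBySimilarity(
  query: number[],
  items: Embedded[],
  topK: number,
): { id: string; score: number }[] {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy vectors standing in for real embeddings of clips in different modalities.
const clips: Embedded[] = [
  { id: "clip-squeal", modality: "audio-video", vector: [0.9, 0.1, 0.0] },
  { id: "clip-crowd", modality: "audio", vector: [0.1, 0.9, 0.2] },
  { id: "clip-silent-pan", modality: "video", vector: [0.0, 0.2, 0.9] },
];
const textQuery = [1.0, 0.2, 0.0]; // stands in for an embedded text query
const results = rankBySimilarity(textQuery, clips, 2);
console.log(results);
```

In a managed deployment the vector store would normally perform this ranking; the point here is only that a text query and an audio-video clip can be scored against each other directly.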

    Architecture

    WAVE-7B is a 7B-class multimodal embedding model built on Qwen2.5-Omni, with hierarchical feature fusion and a dual audio encoder that covers both speech and environmental sound. It is trained with multimodal, multitask contrastive objectives across text, audio, video, and audio-video pairs.
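
To make "contrastive objectives" concrete, here is a minimal sketch of a generic symmetric InfoNCE loss over a batch of paired embeddings. It is a common textbook formulation, not the loss used to train WAVE: this page does not give the exact objective, and the temperature value and function names below are assumptions.

```typescript
// Generic symmetric InfoNCE (illustrative only; not WAVE's documented training loss).
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Scale a vector to unit length so dot products become cosine similarities.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(dot(v, v)) || 1;
  return v.map((x) => x / norm);
}

// Cross-entropy of one row of logits against its target index,
// using the log-sum-exp trick for numerical stability.
function rowLoss(logits: number[], target: number): number {
  const max = Math.max(...logits);
  const logSumExp = max + Math.log(logits.reduce((s, z) => s + Math.exp(z - max), 0));
  return logSumExp - logits[target];
}

// a[i] and b[i] are a matched pair (for example an audio clip and its caption);
// every other b[j] in the batch acts as a negative for a[i], and vice versa.
function symmetricInfoNCE(a: number[][], b: number[][], temperature = 0.07): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("need equal, non-empty batches");
  }
  const A = a.map(l2Normalize);
  const B = b.map(l2Normalize);
  const n = A.length;
  let total = 0;
  for (let i = 0; i < n; i++) {
    const aToB = B.map((bj) => dot(A[i], bj) / temperature);
    const bToA = A.map((aj) => dot(B[i], aj) / temperature);
    total += rowLoss(aToB, i) + rowLoss(bToA, i);
  }
  return total / (2 * n); // mean over both directions
}

// Toy batch: aligned pairs should give a much lower loss than swapped pairs.
const audio = [[1, 0], [0, 1]];
const textAligned = [[1, 0], [0, 1]];
const textSwapped = [[0, 1], [1, 0]];
const alignedLoss = symmetricInfoNCE(audio, textAligned);
const swappedLoss = symmetricInfoNCE(audio, textSwapped);
console.log({ alignedLoss, swappedLoss });
```

Minimizing a loss of this shape pulls matched pairs together and pushes mismatched pairs apart, which is what makes nearest-neighbor search across modalities meaningful.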

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Managed: create a collection over a bucket; Mixpeek runs this model's extractor
    const collection = await mx.collections.create({
      namespace_id: "my-namespace",
      collection_name: "my-collection",
      source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
      feature_extractor: {
        feature_extractor_name: "audio_embeddings",
        version: "v1",
        parameters: { model_id: "tsinghua-ee/WAVE-7B" },
      },
    });

    Capabilities

    • Any-to-any retrieval across text, audio, video, and audio-video clips
    • Prompt-aware embeddings for task-specific search
    • Strong audio and audiovisual retrieval performance (AudioCaps 44.2, VGGSound 25.0; see Benchmarks)
    • Apache 2.0 license

    Use Cases on Mixpeek

    • Search surveillance, robotics, or inspection footage by audiovisual events
    • Find media clips from natural-language descriptions of sound and motion
    • Build agent perception memory across microphones and cameras
    • Retrieve audio-only evidence and video-only evidence with one model family

    Benchmarks

    Dataset         Metric                  Score   Source
    MMEB-v2-video   Overall                 59.9    WAVE model card
    AudioCaps       Audio retrieval         44.2    WAVE model card
    VGGSound        Audio-video retrieval   25.0    WAVE model card

    Performance

    Input Size: Text, audio, silent video, or synchronized audio-video
    Embedding Dim: Model dependent
    GPU Latency: Input dependent
    GPU Throughput: Batch by clip for best throughput
    GPU Memory: ~15 GB plus serving overhead
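
As a rough sanity check on the memory figure, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes exactly 7e9 parameters and 16-bit weights; neither the precise count nor the serving precision is stated on this page, so treat the result as an order-of-magnitude estimate rather than a measured number.

```typescript
// Back-of-the-envelope weight memory: parameters x bytes per parameter.
// Assumptions (not from this page): exactly 7e9 parameters, 16-bit weights.
function weightMemoryGB(paramCount: number, bytesPerParam: number): number {
  return (paramCount * bytesPerParam) / 1e9; // decimal gigabytes
}

const PARAMS = 7e9;
const fp16GB = weightMemoryGB(PARAMS, 2); // 16-bit weights
const fp32GB = weightMemoryGB(PARAMS, 4); // 32-bit weights, for comparison
console.log(`16-bit weights: ${fp16GB} GB, 32-bit weights: ${fp32GB} GB`);
```

Under these assumptions the weights alone come to about 14 GB, which is in the same range as the ~15 GB listed above. Activations, batching, and runtime buffers come on top of that, which is presumably what "serving overhead" refers to.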

    Specification

    Framework: HF
    Organization: tsinghua-ee
    Feature: Audio Embeddings
    Output: 512-dim vector
    Modalities: video, audio
    Retriever: Audio Similarity
    Parameters: 7B
    License: apache-2.0
    Downloads/mo: 230
    Likes: 5

    Research Paper

    WAVE: Learning Unified and Versatile Audio-Visual Embeddings with Multimodal LLM

    arxiv.org

    Build a pipeline with WAVE-7B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.