    Technical Deep Dive

    Your ad library already has a taxonomy. You just haven't discovered it yet.

    Mixpeek clusters every ad into composition families, pacing patterns, hook archetypes, and performance tiers — then builds a searchable index so you never go clip-by-clip again.

    The Insight

    Manual tagging doesn't scale.
    The data already knows what it is.

    When you have thousands of ad creatives, no team can tag every clip by composition, pacing, hook style, and audience fit. But that structure already exists in the data — latent in pixel patterns, edit rhythms, and performance curves. The trick is to let the embeddings surface it.

    Instead of imposing a taxonomy, Mixpeek discovers one. We embed each ad along four axes (visual composition, pacing, hook and script, and performance), using models like CLIP ViT-L/14, DINOv2-Large, and SigLIP2-Giant for the visual axis, then cluster the resulting vector space to find natural groupings. The taxonomy that emerges reflects how your library is actually structured, not what someone guessed it should be.

    The Pipeline

    Ingest → Embed → Cluster → Label

    Every ad passes through a four-stage pipeline. By the end, each clip belongs to a cluster with a human-readable description and keyword set — searchable without ever watching the footage.

    mixpeek — taxonomy discovery pipeline
    Ingest: raw video, images, performance CSVs

    1. Multi-axis embedding

    Each ad is decomposed into four independent vector representations. Visual features come from DINOv2-Large (self-supervised, no labels needed) and CLIP ViT-L/14 (bridging vision to language). Pacing is extracted from frame-level temporal analysis. Hook and script are embedded via Nomic Embed v2 or BGE-M3. Performance metrics form a structured feature vector.
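
The text does not say how the four per-axis vectors are fused before clustering. One common option, shown here as a minimal sketch, is to L2-normalize each axis and concatenate them with per-axis weights. The function name `combine_axes`, the toy vectors, and the weights are illustrative assumptions, not Mixpeek's method.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def combine_axes(axes, weights):
    """Concatenate per-axis embeddings after normalizing and weighting each.

    axes:    dict of axis name -> list of floats
    weights: dict of axis name -> float (hypothetical tuning knobs)
    """
    combined = []
    for name in sorted(axes):  # fixed order so every ad shares one layout
        unit = l2_normalize(axes[name])
        combined.extend(weights.get(name, 1.0) * x for x in unit)
    return combined

# Toy 2-dimensional vectors stand in for real model outputs.
ad = {
    "visual": [3.0, 4.0],
    "pacing": [0.0, 2.0],
    "hook_script": [1.0, 0.0],
    "performance": [2.5, 0.0],
}
weights = {"visual": 1.0, "pacing": 0.5, "hook_script": 1.0, "performance": 0.5}
vector = combine_axes(ad, weights)
print(len(vector))
```

Normalizing first keeps an axis with large raw magnitudes (for example, performance metrics) from dominating the distance computation; the weights then set each axis's influence explicitly.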

    See also: ColPali v1.3 for patch-level visual retrieval

    2. Hierarchical clustering

    We use density-based clustering (HDBSCAN) over the combined embedding space, indexed by FAISS for billion-scale nearest-neighbor search. Clusters form naturally — no fixed k required. The hierarchy is real: broad families split into sub-clusters, which split into micro-patterns.
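
HDBSCAN does not cluster on raw distances. It first converts them to mutual reachability distances, which push points in sparse regions away from everything else. The sketch below computes only that quantity in plain Python; the full algorithm also builds a minimum spanning tree and condenses the resulting hierarchy, which a library implementation would handle. Taking the core distance as the distance to the k-th nearest other point is a convention assumed here.

```python
import math

def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest other point."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def mutual_reachability(points, i, j, k):
    """max(core_k(i), core_k(j), d(i, j)): the distance HDBSCAN clusters on."""
    return max(
        core_distance(points, i, k),
        core_distance(points, j, k),
        math.dist(points[i], points[j]),
    )

# Three nearby points and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
print(mutual_reachability(points, 0, 1, k=2))  # two points in the dense group
print(mutual_reachability(points, 0, 3, k=2))  # dense point vs. outlier
```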

    Paper: McInnes et al., 2017 — HDBSCAN

    3. Centroid labeling

    Each cluster needs a description good enough to search against. We sample representative clips from the centroid — not just the nearest single point, but a weighted sampling that converges on the true cluster average. Then a vision-language model like Florence-2 or Qwen3-VL-8B synthesizes a human-readable description plus keywords.

    Fallback: BLIP-2 for lighter-weight captioning

    4. Index construction

    The labeled clusters become a searchable index. Each cluster's description embedding, keyword set, and centroid vector are stored in a Mixpeek Vector Store. A single query checks the index, not every clip — reducing millions of comparisons to hundreds.
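
The cluster-first lookup described above can be sketched as a coarse-to-fine search: rank the centroids, then score only the clips inside the best-matching clusters. This is an in-memory illustration; the `centroids` and `members` layout and the `n_probe` parameter are assumptions, not the Mixpeek Vector Store API.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_search(query, centroids, members, n_probe=1, top_k=2):
    """Rank clusters by centroid similarity, then rank clips inside the best ones."""
    ranked = sorted(centroids, key=lambda cid: cosine(query, centroids[cid]), reverse=True)
    candidates = []
    for cid in ranked[:n_probe]:  # only the top clusters are scanned
        for clip_id, vec in members[cid].items():
            candidates.append((cosine(query, vec), clip_id, cid))
    candidates.sort(reverse=True)
    return candidates[:top_k]

centroids = {"ugc_closeup": [1.0, 0.0], "studio_reveal": [0.0, 1.0]}
members = {
    "ugc_closeup": {"clip_a": [0.9, 0.1], "clip_b": [0.8, 0.3]},
    "studio_reveal": {"clip_c": [0.1, 0.9]},
}
hits = two_stage_search([1.0, 0.05], centroids, members)
print([(clip_id, cluster) for _, clip_id, cluster in hits])
```

Raising `n_probe` trades speed for recall: more clusters are scanned, so a clip sitting near a cluster boundary is less likely to be missed.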

    Paper: Johnson et al., 2019 — FAISS

    What Emerges

    A taxonomy you didn't design.
    But one that's actually real.

    When you cluster 50,000 ads across four axes, families emerge: UGC close-up with fast pacing, studio product shots with slow reveals, testimonial-led hooks with problem-agitate-solve arcs. These aren't categories you defined — they're patterns the data proved exist.

    mixpeek — discovered taxonomy browser
    Ad Library: 12,847 ads → 284 clusters → 4 families
    ASMR texture close-up: 892
    Before/after transformation: 734
    Routine walkthrough: 621
    Product unboxing: 600

    The hierarchy is navigable. A top-level cluster like "UGC Beauty" might contain sub-clusters for ASMR texture close-ups, before/after transformations, and routine walkthroughs. Each level has its own centroid description and keyword set.

    This matters because search quality depends on the granularity of the taxonomy. A query for "organic-feeling skincare hook with product close-up" should match a specific sub-cluster, not a broad family. The deeper the tree, the more precise the retrieval.

    Representation Quality

    The description is the index.
    Get it wrong, retrieval fails.

    A cluster centroid is just a point in vector space. To make it searchable by humans, you need a text description that faithfully represents everything in the cluster. Not just the closest item — the full spread.

    Sampling strategies for centroid description

    Naive

    Take the single nearest clip to the centroid. Fast, but not representative — it captures one mode of the cluster and misses edge cases.

    Weighted random

    Sample clips weighted by distance from centroid. Better coverage, but can over-represent dense regions and miss sparse tails.

    Convergent

    Iteratively sample and refine the description until the average distance between the description embedding and random samples stops changing, that is, converges to a constant. This is the most expensive strategy, but the most representative of the three.

    "You continuously sample until it asymptotically gets towards equal to the average. Check the difference between your representation and a random sample — and slowly modify it until that goes to constant."

    The convergent approach uses vision-language models like Florence-2 and Qwen3-VL-8B to synthesize descriptions, then measures the cosine similarity between the description embedding and randomly sampled cluster members. When the similarity metric stabilizes, the description is representative.
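
The stopping test in the convergent approach can be sketched as follows: keep drawing random cluster members, track the running mean similarity to the description embedding, and stop once that mean stops moving. This shows only the stabilization check, not the refinement step in which a vision-language model rewrites the description. The function name, `tol`, and `patience` are assumptions made for illustration.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def running_similarity(description_vec, cluster_vecs, tol=1e-3, patience=20,
                       max_draws=5000, seed=0):
    """Sample members until the running mean similarity stabilizes.

    Returns (running_mean, draws_used, converged).
    """
    rng = random.Random(seed)
    total, mean, calm = 0.0, 0.0, 0
    for n in range(1, max_draws + 1):
        total += cosine(description_vec, rng.choice(cluster_vecs))
        new_mean = total / n
        # Count consecutive draws in which the mean moved by less than tol.
        calm = calm + 1 if n > 1 and abs(new_mean - mean) < tol else 0
        mean = new_mean
        if calm >= patience:
            return mean, n, True
    return mean, max_draws, False

cluster = [[1.0, 0.1], [0.9, 0.2], [1.0, -0.1], [0.8, 0.0]]
mean, draws, converged = running_similarity([1.0, 0.0], cluster)
print(converged, draws, round(mean, 3))
```

A stable but low mean would indicate a description that is consistent yet poorly aligned with the cluster, so in practice the converged value matters as much as the fact of convergence.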

    Benchmarking

    Can you find a clip without watching it?

    The acid test for any index: pick a random clip, generate a description, search for it using the taxonomy. If you find it reliably, the index works. If you don't, the cluster descriptions aren't descriptive enough.

    mixpeek — index quality benchmark
    Columns: Clip | Generated Query | Matched Cluster | Score | Hit

    The retrieval test

    1. Select a random clip from the library
    2. Generate multiple natural-language descriptions of it (using Qwen3-VL-8B)
    3. Search the taxonomy index using each description
    4. Measure: does the correct cluster appear in top-k results?
    5. Repeat across a statistically significant sample
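
Step 4 of the test above is a recall-at-k measurement. A minimal sketch, assuming each trial is recorded as the clip's true cluster ID plus the ranked cluster IDs returned for one generated query (the record format and cluster IDs are hypothetical):

```python
def recall_at_k(trials, k=5):
    """Fraction of trials whose true cluster appears in the top-k results.

    trials: list of (true_cluster_id, ranked_cluster_ids) pairs.
    """
    if not trials:
        raise ValueError("need at least one trial")
    hits = sum(1 for truth, ranked in trials if truth in ranked[:k])
    return hits / len(trials)

trials = [
    ("c12", ["c12", "c40", "c07"]),
    ("c40", ["c07", "c12", "c40"]),
    ("c07", ["c12", "c40", "c99"]),
    ("c99", ["c99"]),
]
print(recall_at_k(trials, k=2))  # 0.5
print(recall_at_k(trials, k=3))  # 0.75
```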

    What "good" looks like

    The search isn't pure semantic — it's a combination of the description embedding (via BGE-M3 or Nomic Embed v2), the visual vector from CLIP ViT-L/14, and structured metadata (tags, keywords, performance features). Multiple signals compound.
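
One simple way to combine these signals is a weighted sum of per-signal scores. The sketch below fuses text similarity, visual similarity, and keyword overlap; the field names, the Jaccard measure for keywords, and the weights are all assumptions chosen for illustration, not a description of how Mixpeek scores results.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Keyword overlap: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_score(query, cluster, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of text, visual, and metadata signals (weights are hypothetical)."""
    w_text, w_visual, w_meta = weights
    return (
        w_text * cosine(query["text_vec"], cluster["text_vec"])
        + w_visual * cosine(query["visual_vec"], cluster["visual_vec"])
        + w_meta * jaccard(query["keywords"], cluster["keywords"])
    )

query = {"text_vec": [1.0, 0.0], "visual_vec": [0.0, 1.0], "keywords": ["ugc", "skincare"]}
close = {"text_vec": [1.0, 0.0], "visual_vec": [0.0, 1.0], "keywords": ["ugc", "skincare", "asmr"]}
far = {"text_vec": [0.0, 1.0], "visual_vec": [1.0, 0.0], "keywords": ["studio"]}
print(round(hybrid_score(query, close), 3), hybrid_score(query, far))
```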

    Target: 95%+ recall at k=5 across randomized test queries. When recall drops below threshold, the cluster descriptions need more annotation — but annotating an existing cluster is trivial compared to tagging clips individually.

    The Decision

    When you can't find it, you generate it.
    That costs 10× more.

    Every query either matches an existing clip or doesn't. A miss means generating new creative — filming, editing, or AI synthesis. The quality of the index directly determines how often you find vs. how often you pay to create.

    Cost per query decision (based on 10,000 production queries)

    Outcome               Share   Cost/clip   Action
    Found in library      87%     $0.003      Reuse existing clip
    Found with AI edit    8%      $0.12       Minor AI modification
    Must generate new     5%      $2.40       Full AI generation or filming

    Blended cost per query: $0.13
    Without taxonomy (flat search): $0.89
    Savings: 85% reduction
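
The blended figure is the share-weighted average of the three per-clip costs, and the savings compare it against the flat-search cost. The arithmetic, using only the numbers quoted above:

```python
# (outcome, share of queries, cost per clip in USD), as quoted in the table
tiers = [
    ("found_in_library", 0.87, 0.003),
    ("found_with_ai_edit", 0.08, 0.12),
    ("must_generate_new", 0.05, 2.40),
]

blended = sum(share * cost for _, share, cost in tiers)
flat_search_cost = 0.89
savings = 1 - blended / flat_search_cost

print(f"blended: ${blended:.2f}, savings: {savings:.0%}")
# prints "blended: $0.13, savings: 85%"
```

Almost all of the blended cost ($0.12 of $0.13) comes from the 5% of queries that require generation, which is why recall dominates the economics.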

    This is why indexing quality isn't academic: it's a cost function. Every percentage point of recall translates directly to production savings. A 90% recall index means 10% of queries trigger expensive generation. A 98% recall index cuts that to 2%, an 80% reduction in generation cost.

    And because cluster annotation is incremental (you can always add more descriptions, keywords, and sample points to an existing cluster), the ROI of improving the index compounds over time. The library gets more searchable with every iteration — and every new ad that enters the system enriches the taxonomy.

    Models & Research

    The models that power discovery

    Every stage of the pipeline maps to specific models. Mixpeek orchestrates these through Feature Extractors and Retrievers, so you configure which models run at each stage.

    Visual Embedding

    Encode composition, color, framing, and visual style into dense vectors.

    CLIP ViT-L/14: Vision-language bridge
    DINOv2-Large: Self-supervised visual features
    SigLIP2-Giant: Contrastive image-text alignment
    ColPali v1.3: Patch-level visual retrieval
    Text Embedding

    Embed hook scripts, CTA language, and cluster descriptions for search.

    BGE-M3: Multilingual dense retrieval
    Nomic Embed v2: MoE text embeddings
    Cluster Labeling

    Generate human-readable descriptions and keyword sets for each cluster.

    Florence-2: Detailed visual captioning
    Qwen3-VL-8B: Vision-language understanding
    BLIP-2: Efficient image-to-text
    Indexing & Search

    Store and search cluster vectors at billion scale.

    FAISS: GPU-accelerated vector search


    In Practice

    From taxonomy to production query

    Once the taxonomy is discovered, it feeds directly into Mixpeek's retrieval layer. A Creative DNA query doesn't scan every clip — it traverses the taxonomy tree, matches against cluster descriptions, then retrieves the best candidates within matching clusters.

    mixpeek — taxonomy-aware retrieval
    from mixpeek import Mixpeek
    
    mx = Mixpeek(api_key="YOUR_KEY")
    
    # Step 1: Discover taxonomies from your ad library
    taxonomy = mx.clusters.create(
        collection="ad_library",
        features=["visual_embedding", "pacing", "hook_script", "performance"],
        method="hdbscan",
        min_cluster_size=15,
        label_model="florence-2-large"    # auto-generate cluster descriptions
    )
    
    # Step 2: Browse what emerged
    for cluster in taxonomy.clusters:
        print(f"{cluster.id}: {cluster.label}")
        print(f"  Keywords: {cluster.keywords}")
        print(f"  Size: {cluster.count} ads")
        print(f"  Avg ROAS: {cluster.metadata['avg_roas']:.1f}x")
    
    # Step 3: Search using the taxonomy
    results = mx.retrievers.search(
        collection="ad_library",
        query="fast-paced UGC hook with product close-up, cold fintech audience",
        taxonomy_id=taxonomy.id,         # searches clusters first, then clips
        top_k=10,
        filters={"performance.roas_7d": {"$gte": 2.5}}
    )
    
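    # Step 4: Reuse strong matches; generate only when nothing clears the threshold
    REUSE_THRESHOLD = 0.85
    matches = list(results)
    strong = [r for r in matches if r.score > REUSE_THRESHOLD]
    for r in strong:
        print(f"✓ Reuse: {r.asset_id} (score: {r.score:.2f})")
    if not strong:
        best = max((r.score for r in matches), default=0.0)
        print(f"✗ Generate: no strong match (best: {best:.2f})")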

    The cluster ID becomes a first-class input. Instead of running a flat vector search across millions of clips, the retriever traverses the taxonomy: it matches your query to cluster descriptions first, then searches within the top-matching clusters. Comparing against hundreds of cluster descriptions instead of millions of clips cuts latency sharply, and restricting candidates to matching clusters improves precision.

    And because the taxonomy is discovered, not defined, it evolves as your library grows. New ads get assigned to existing clusters or spawn new ones. The system learns what kinds of creative you produce — and what kinds you don't yet have.
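
Assigning a new ad to an existing cluster, or flagging it as the seed of a new one, can be sketched as a nearest-centroid check with a similarity floor. The `min_similarity` threshold and the function name are assumptions; a production system would more likely re-run or incrementally update the clustering.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_to_cluster(ad_vec, centroids, min_similarity=0.8):
    """Return the best-matching cluster ID, or None if nothing is close enough."""
    best_id, best_sim = None, -1.0
    for cid, centroid in centroids.items():
        sim = cosine(ad_vec, centroid)
        if sim > best_sim:
            best_id, best_sim = cid, sim
    return best_id if best_sim >= min_similarity else None

centroids = {"ugc_closeup": [1.0, 0.0], "studio_reveal": [0.0, 1.0]}
print(assign_to_cluster([0.9, 0.1], centroids))  # close to an existing centroid
print(assign_to_cluster([0.7, 0.7], centroids))  # between clusters: candidate for a new one
```

Ads that return `None` are the interesting ones: they mark creative the library does not yet have a family for.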

    See the taxonomy in your library

    20 minutes. Your ads. Real clustering. We'll show you what families already exist in your library.