Technical Deep Dive

Your ad library already has a taxonomy.
You just haven't discovered it yet.

Mixpeek clusters every ad into composition families, pacing patterns, hook archetypes, and performance tiers — then builds a searchable index so you never go clip-by-clip again.


The Insight

Manual tagging doesn't scale.
The data already knows what it is.

When you have thousands of ad creatives, no team can tag every clip by composition, pacing, hook style, and audience fit. But that structure already exists in the data — latent in pixel patterns, edit rhythms, and performance curves. The trick is to let the embeddings surface it.

Instead of imposing a taxonomy, Mixpeek discovers one. We embed each ad along four axes (visual composition, pacing, hook and script, and performance), using models like CLIP ViT-L/14, DINOv2-Large, and SigLIP2-Giant for the visual axis, then cluster the resulting vector space to find natural groupings. The taxonomy that emerges is the real taxonomy of your library, not what someone guessed it should be.

The Pipeline

Ingest → Embed → Cluster → Label

Every ad passes through a four-stage pipeline. By the end, each clip belongs to a cluster with a human-readable description and keyword set — searchable without ever watching the footage.

Ingest: raw video, images, performance CSVs

1. Multi-axis embedding

Each ad is decomposed into four independent vector representations. Visual features come from DINOv2-Large (self-supervised, no labels needed) and CLIP ViT-L/14 (bridging vision to language). Pacing is extracted from frame-level temporal analysis. Hook and script are embedded via Nomic Embed v2 or BGE-M3. Performance metrics form a structured feature vector.

See also: ColPali v1.3 for patch-level visual retrieval
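The text above does not say how the four per-axis vectors are merged before clustering. As a minimal sketch, assuming a plain weighted concatenation of unit-normalized vectors, the step could look like the following. The function names, the axis weights, and the toy dimensions are all hypothetical and are not part of the Mixpeek API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def combine_axes(axes, weights):
    """Concatenate unit-normalized per-axis vectors, each scaled by its weight.

    axes:    dict of axis name -> list of floats (dimensions may differ per axis)
    weights: dict of axis name -> float (hypothetical tuning knobs)
    """
    combined = []
    for name in sorted(axes):  # fixed order so every ad gets the same layout
        scale = weights.get(name, 1.0)
        combined.extend(scale * x for x in l2_normalize(axes[name]))
    return combined

# Toy vectors standing in for real model outputs (assumption: tiny dimensions).
ad = {
    "visual": [3.0, 4.0],
    "pacing": [0.0, 2.0],
    "hook_script": [1.0, 0.0],
    "performance": [5.0, 0.0, 0.0],
}
weights = {"visual": 1.0, "pacing": 0.5, "hook_script": 1.0, "performance": 0.5}
vector = combine_axes(ad, weights)
```

Normalizing each axis first keeps a high-dimensional or large-magnitude axis from dominating distances. The weights then make each axis's influence on the clustering an explicit choice.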

2. Hierarchical clustering

We use density-based clustering (HDBSCAN) over the combined embedding space, indexed by FAISS for billion-scale nearest-neighbor search. Clusters form naturally — no fixed k required. The hierarchy is real: broad families split into sub-clusters, which split into micro-patterns.

Paper: McInnes et al., 2017 — HDBSCAN
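To see why density-based clustering needs no fixed k, it helps to look at the distance HDBSCAN builds on: mutual reachability. The sketch below computes it by brute force on toy 2-D points. It illustrates only that one ingredient, not the full algorithm, and it uses the convention that a point's core distance excludes the point itself (libraries differ on this detail).

```python
import math

def core_distance(points, i, k):
    """Distance from point i to its k-th nearest neighbour (excluding itself)."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def mutual_reachability(points, i, j, k):
    """max(core(i), core(j), d(i, j)): points in sparse regions are pushed apart."""
    return max(core_distance(points, i, k),
               core_distance(points, j, k),
               math.dist(points[i], points[j]))

# Three tightly packed ads and one outlier in a toy 2-D embedding space.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0)]
dense_pair = mutual_reachability(points, 0, 1, k=2)    # both points sit in the dense group
outlier_pair = mutual_reachability(points, 2, 3, k=2)  # one point is the outlier
```

Pairs inside a dense region keep roughly their original distance, while any pair involving a sparse point is stretched to that point's core distance. Clusters then come out as the dense regions that persist as the distance threshold varies, rather than as a preset number of groups.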

3. Centroid labeling

Each cluster needs a description good enough to search against. We sample representative clips from the centroid — not just the nearest single point, but a weighted sampling that converges on the true cluster average. Then a vision-language model like Florence-2 or Qwen3-VL-8B synthesizes a human-readable description plus keywords.

Fallback: BLIP-2 for lighter-weight captioning

4. Index construction

The labeled clusters become a searchable index. Each cluster's description embedding, keyword set, and centroid vector are stored in a Mixpeek Vector Store. A single query checks the index, not every clip — reducing millions of comparisons to hundreds.

Paper: Johnson et al., 2019 — FAISS

What Emerges

A taxonomy you didn't design.
But one that's actually real.

When you cluster 50,000 ads across four axes, families emerge: UGC close-up with fast pacing, studio product shots with slow reveals, testimonial-led hooks with problem-agitate-solve arcs. These aren't categories you defined — they're patterns the data proved exist.

mixpeek — discovered taxonomy browser

Ad Library: 12,847 ads → 284 clusters → 4 families

ASMR texture close-up: 892
Before/after transformation: 734
Routine walkthrough: 621
Product unboxing: 600

The hierarchy is navigable. A top-level cluster like "UGC Beauty" might contain sub-clusters for ASMR texture close-ups, before/after transformations, and routine walkthroughs. Each level has its own centroid description and keyword set.

This matters because search quality depends on the granularity of the taxonomy. A query for "organic-feeling skincare hook with product close-up" should match a specific sub-cluster, not a broad family. The deeper the tree, the more precise the retrieval.

Representation Quality

The description is the index.
Get it wrong, retrieval fails.

A cluster centroid is just a point in vector space. To make it searchable by humans, you need a text description that faithfully represents everything in the cluster. Not just the closest item — the full spread.

Sampling strategies for centroid description

Naive

Take the single nearest clip to the centroid. Fast, but not representative — it captures one mode of the cluster and misses edge cases.

Weighted random

Sample clips with probability weighted by their distance from the centroid, so closer clips are drawn more often. Better coverage, but can over-represent dense regions and miss sparse tails.

Convergent

Iteratively sample and refine the description until the average distance between the description embedding and random samples stops changing, that is, until it settles at a stable value rather than necessarily reaching zero. The most expensive option, but the most representative of the three.

"You continuously sample until it asymptotically gets towards equal to the average. Check the difference between your representation and a random sample — and slowly modify it until that goes to constant."

The convergent approach uses vision-language models like Florence-2 and Qwen3-VL-8B to synthesize descriptions, then measures the cosine similarity between the description embedding and randomly sampled cluster members. When the similarity metric stabilizes, the description is representative.
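The stopping rule can be sketched without any model calls. In the toy version below, the "description" is a running mean of sampled member embeddings, a stand-in for the text a vision-language model would rewrite on each round. The function name, batch size, tolerance, and fixed probe set are all assumptions made for illustration.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def converge_description(members, batch=3, probe_size=5, tol=1e-3,
                         max_rounds=50, seed=0):
    """Refine a description vector until its fit to a random probe set stabilizes."""
    rng = random.Random(seed)
    probe = rng.sample(members, k=probe_size)   # fixed random sample to score against
    total = [0.0] * len(members[0])             # running sum of sampled embeddings
    seen = 0
    prev_score = None
    for round_no in range(1, max_rounds + 1):
        for vec in rng.choices(members, k=batch):
            total = [t + v for t, v in zip(total, vec)]
            seen += 1
        description = [t / seen for t in total]  # stand-in for the VLM's rewritten text
        score = sum(cosine(description, p) for p in probe) / probe_size
        if prev_score is not None and abs(score - prev_score) < tol:
            return description, score, round_no  # similarity stopped moving
        prev_score = score
    return description, prev_score, max_rounds

# Toy cluster: eleven 2-D embeddings spread around the direction (1, 0).
members = [[1.0, 0.1 * i] for i in range(-5, 6)]
description, score, rounds = converge_description(members)
```

The loop stops when the similarity between rounds changes by less than `tol`, not when it reaches a particular value. A spread-out cluster will plateau at a lower similarity than a tight one, and that plateau is itself a useful signal of how well any single description can cover the cluster.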

Benchmarking

Can you find a clip without watching it?

The acid test for any index: pick a random clip, generate a description, search for it using the taxonomy. If you find it reliably, the index works. If you don't, the cluster descriptions aren't descriptive enough.


The retrieval test

1. Select a random clip from the library
2. Generate multiple natural-language descriptions of it (using Qwen3-VL-8B)
3. Search the taxonomy index using each description
4. Measure: does the correct cluster appear in top-k results?
5. Repeat across a statistically significant sample
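The measurement in the steps above is recall at k. A minimal sketch, assuming each trial has already been reduced to the clip's true cluster ID plus the ranked cluster IDs the search returned (the IDs below are made up):

```python
def recall_at_k(trials, k=5):
    """Fraction of trials whose true cluster appears in the top-k results.

    trials: list of (true_cluster_id, ranked_cluster_ids) pairs.
    """
    if not trials:
        return 0.0
    hits = sum(1 for truth, ranked in trials if truth in ranked[:k])
    return hits / len(trials)

trials = [
    ("c12", ["c12", "c07", "c33"]),  # hit at rank 1
    ("c07", ["c33", "c12", "c07"]),  # hit at rank 3
    ("c33", ["c01", "c02", "c03"]),  # miss
    ("c01", ["c02", "c01", "c09"]),  # hit at rank 2
]
print(recall_at_k(trials, k=1))  # 0.25
print(recall_at_k(trials, k=3))  # 0.75
```

Reporting recall at several values of k shows whether misses are near misses (the right cluster ranks just outside the cutoff) or outright failures of the cluster description.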

What "good" looks like

The search isn't pure semantic — it's a combination of the description embedding (via BGE-M3 or Nomic Embed v2), the visual vector from CLIP ViT-L/14, and structured metadata (tags, keywords, performance features). Multiple signals compound.
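How the signals are combined is not specified here. One common pattern, shown purely as an assumption, is a weighted sum of the text similarity, the visual similarity, and a keyword-overlap score. The weights and function names below are hypothetical.

```python
def keyword_overlap(query_terms, cluster_keywords):
    """Jaccard overlap between query terms and a cluster's keyword set."""
    q, c = set(query_terms), set(cluster_keywords)
    return len(q & c) / len(q | c) if q | c else 0.0

def fused_score(text_sim, visual_sim, query_terms, cluster_keywords,
                weights=(0.5, 0.3, 0.2)):
    """Weighted sum of text similarity, visual similarity, and keyword overlap."""
    w_text, w_visual, w_meta = weights
    return (w_text * text_sim
            + w_visual * visual_sim
            + w_meta * keyword_overlap(query_terms, cluster_keywords))

score = fused_score(
    text_sim=0.8,
    visual_sim=0.6,
    query_terms=["ugc", "skincare", "close-up"],
    cluster_keywords=["skincare", "close-up", "asmr"],
)
print(round(score, 2))  # 0.68
```

With a linear fusion like this, a cluster that scores moderately on all three signals can outrank one that scores highly on only one of them.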

Target: 95%+ recall at k=5 across randomized test queries. When recall drops below threshold, the cluster descriptions need more annotation — but annotating an existing cluster is trivial compared to tagging clips individually.

The Decision

When you can't find it, you generate it.
That costs 10× more.

Every query either matches an existing clip or doesn't. A miss means generating new creative — filming, editing, or AI synthesis. The quality of the index directly determines how often you find vs. how often you pay to create.

Cost per query decision (based on 10,000 production queries)

Outcome | Share of queries | Cost per clip | Action
Found in library | 87% | $0.003 | Reuse existing clip
Found with AI edit | 8% | $0.12 | Minor AI modification
Must generate new | 5% | $2.40 | Full AI generation or filming

Blended cost per query: $0.13
Without taxonomy (flat search): $0.89
Savings: 85% reduction
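The blended figure is a share-weighted average of the three per-clip costs, and the savings figure compares it with the flat-search cost. The arithmetic, using only the numbers quoted above:

```python
# (outcome, share of queries, cost per clip in dollars), as quoted above
tiers = [
    ("found in library", 0.87, 0.003),
    ("found with AI edit", 0.08, 0.12),
    ("must generate new", 0.05, 2.40),
]

blended = sum(share * cost for _, share, cost in tiers)
flat_search_cost = 0.89
savings = 1 - blended / flat_search_cost

print(f"blended: ${blended:.2f}")  # blended: $0.13
print(f"savings: {savings:.0%}")   # savings: 85%
```

Nearly all of the blended cost comes from the 5% of queries that require generation (0.05 × $2.40 = $0.12), which is why shrinking that share matters more than making reuse cheaper.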

This is why indexing quality isn't academic: it's a cost function. Every percentage point of recall translates directly to production savings. A 90% recall index means 10% of queries trigger expensive generation. A 98% recall index means only 2% do, cutting that generation cost by 80%.
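The recall comparison can be checked with a few lines of arithmetic. The sketch below makes one simplifying assumption: every miss goes straight to full generation at the $2.40 per clip quoted earlier, with the AI-edit tier ignored. The function name is hypothetical.

```python
def generation_spend(recall, queries=10_000, cost_per_generation=2.40):
    """Dollars spent generating new creative for the queries the index misses."""
    misses = round(queries * (1 - recall))
    return misses * cost_per_generation

spend_90 = generation_spend(0.90)    # 1,000 misses
spend_98 = generation_spend(0.98)    # 200 misses
reduction = 1 - spend_98 / spend_90

print(f"{spend_90:.0f} -> {spend_98:.0f} ({reduction:.0%} lower)")  # 2400 -> 480 (80% lower)
```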

And because cluster annotation is incremental (you can always add more descriptions, keywords, and sample points to an existing cluster), the ROI of improving the index compounds over time. The library gets more searchable with every iteration — and every new ad that enters the system enriches the taxonomy.

Models & Research

The models that power discovery

Every stage of the pipeline maps to specific models. Mixpeek orchestrates these through Feature Extractors and Retrievers, so you configure which models run at each stage.

Visual Embedding

Encode composition, color, framing, and visual style into dense vectors.

CLIP ViT-L/14: Vision-language bridge
DINOv2-Large: Self-supervised visual features
SigLIP2-Giant: Contrastive image-text alignment
ColPali v1.3: Patch-level visual retrieval

Text Embedding

Embed hook scripts, CTA language, and cluster descriptions for search.

BGE-M3: Multilingual dense retrieval
Nomic Embed v2: MoE text embeddings

Cluster Labeling

Generate human-readable descriptions and keyword sets for each cluster.

Florence-2: Detailed visual captioning
Qwen3-VL-8B: Vision-language understanding
BLIP-2: Efficient image-to-text

Indexing & Search

Store and search cluster vectors at billion scale.

FAISS: GPU-accelerated vector search

Key papers

Billion-scale similarity search with GPUs

Johnson et al., 2019 (Meta AI)

The FAISS library — billion-scale vector similarity search with GPU-accelerated clustering.

Learning Transferable Visual Models From Natural Language Supervision

Radford et al., 2021 (OpenAI)

CLIP — the contrastive model that bridges visual and textual embedding spaces.

DINOv2: Learning Robust Visual Features without Supervision

Oquab et al., 2023 (Meta AI)

Self-supervised vision features that cluster visual similarity without any labels.

HDBSCAN: Hierarchical Density-Based Clustering

McInnes et al., 2017

Density-based clustering that discovers natural groupings without requiring a fixed k.

ColPali: Efficient Document Retrieval with Vision Language Models

Faysse et al., 2024

Late-interaction retrieval over visual patches — relevant to multi-modal taxonomy matching.

In Practice

From taxonomy to production query

Once the taxonomy is discovered, it feeds directly into Mixpeek's retrieval layer. A Creative DNA query doesn't scan every clip — it traverses the taxonomy tree, matches against cluster descriptions, then retrieves the best candidates within matching clusters.

mixpeek — taxonomy-aware retrieval

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_KEY")

# Step 1: Discover taxonomies from your ad library
taxonomy = mx.clusters.create(
    collection="ad_library",
    features=["visual_embedding", "pacing", "hook_script", "performance"],
    method="hdbscan",
    min_cluster_size=15,
    label_model="florence-2-large"    # auto-generate cluster descriptions
)

# Step 2: Browse what emerged
for cluster in taxonomy.clusters:
    print(f"{cluster.id}: {cluster.label}")
    print(f"  Keywords: {cluster.keywords}")
    print(f"  Size: {cluster.count} ads")
    print(f"  Avg ROAS: {cluster.metadata['avg_roas']:.1f}x")

# Step 3: Search using the taxonomy
results = mx.retrievers.search(
    collection="ad_library",
    query="fast-paced UGC hook with product close-up, cold fintech audience",
    taxonomy_id=taxonomy.id,         # searches clusters first, then clips
    top_k=10,
    filters={"performance.roas_7d": {"$gte": 2.5}}
)

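# Step 4: Decide once, from the best match, whether to reuse or generate
best = max(results, key=lambda r: r.score, default=None)
if best is not None and best.score > 0.85:
    print(f"✓ Reuse: {best.asset_id} (score: {best.score:.2f})")
elif best is not None:
    print(f"✗ Generate: no strong match (best: {best.score:.2f})")
else:
    print("✗ Generate: no matches returned")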

The cluster ID becomes a first-class input. Instead of running a flat vector search across millions of clips, the retriever traverses the taxonomy — matching your query to cluster descriptions first, then searching within the top-matching clusters. This reduces latency by orders of magnitude while improving precision.

And because the taxonomy is discovered, not defined, it evolves as your library grows. New ads get assigned to existing clusters or spawn new ones. The system learns what kinds of creative you produce — and what kinds you don't yet have.
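The "assign or spawn" decision for a new ad can be sketched as a nearest-centroid lookup with a similarity floor. This is an assumption about one reasonable way to do it, not a description of Mixpeek's implementation. The 0.8 threshold, the cluster IDs, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_cluster(vector, centroids, min_similarity=0.8):
    """Return the id of the closest centroid, or None if nothing is close enough."""
    best_id, best_sim = None, -1.0
    for cluster_id, centroid in centroids.items():
        sim = cosine(vector, centroid)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id if best_sim >= min_similarity else None

centroids = {"asmr_closeup": [1.0, 0.0], "unboxing": [0.0, 1.0]}
print(assign_cluster([0.9, 0.2], centroids))  # asmr_closeup
print(assign_cluster([0.7, 0.7], centroids))  # None
```

A `None` result marks an ad that fits no existing cluster. Collecting those ads and re-clustering them periodically is one way new clusters could appear, and a growing pile of unassigned ads signals creative the taxonomy does not yet cover.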

See the taxonomy in your library

20 minutes. Your ads. Real clustering. We'll show you what families already exist in your library.
