Your ad library already has a taxonomy.
You just haven't discovered it yet.
Mixpeek clusters every ad into composition families, pacing patterns, hook archetypes, and performance tiers — then builds a searchable index so you never go clip-by-clip again.
Manual tagging doesn't scale.
The data already knows what it is.
When you have thousands of ad creatives, no team can tag every clip by composition, pacing, hook style, and audience fit. But that structure already exists in the data — latent in pixel patterns, edit rhythms, and performance curves. The trick is to let the embeddings surface it.
Instead of imposing a taxonomy, Mixpeek discovers one. We embed each ad along four axes (visual composition, pacing, hook and script, and performance), using models like CLIP ViT-L/14, DINOv2-Large, and SigLIP2-Giant for the visual axis, then cluster the resulting vector space to find natural groupings. The taxonomy that emerges reflects what is actually in your library, not what someone guessed it should be.
Ingest → Embed → Cluster → Label
Every ad passes through a four-stage pipeline. By the end, each clip belongs to a cluster with a human-readable description and keyword set — searchable without ever watching the footage.
1. Multi-axis embedding
Each ad is decomposed into four independent vector representations. Visual features come from DINOv2-Large (self-supervised, no labels needed) and CLIP ViT-L/14 (bridging vision to language). Pacing is extracted from frame-level temporal analysis. Hook and script are embedded via Nomic Embed v2 or BGE-M3. Performance metrics form a structured feature vector.
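As a rough illustration of how four per-axis vectors could become one clustering input, the sketch below L2-normalizes each axis, scales it by a weight, and concatenates the results. The axis names, the weights, and the concatenation scheme itself are assumptions made for this example, not a description of Mixpeek's internals.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def combine_axes(axes, weights):
    """Normalize each axis vector, scale it by its weight, and concatenate."""
    combined = []
    for name in sorted(axes):  # fixed order so every ad gets the same layout
        w = weights.get(name, 1.0)
        combined.extend(w * x for x in l2_normalize(axes[name]))
    return combined

# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
ad = {
    "visual": [3.0, 4.0],
    "pacing": [0.0, 2.0],
    "hook_script": [1.0, 0.0],
    "performance": [0.0, 0.0],
}
weights = {"visual": 1.0, "pacing": 0.5, "hook_script": 1.0, "performance": 1.0}
vector = combine_axes(ad, weights)
```

Normalizing before concatenation keeps one axis from dominating distances just because its raw values are larger; the weights then make the relative influence of each axis an explicit choice.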
See also: ColPali v1.3 for patch-level visual retrieval
2. Hierarchical clustering
We use density-based clustering (HDBSCAN) over the combined embedding space, indexed by FAISS for billion-scale nearest-neighbor search. Clusters form naturally — no fixed k required. The hierarchy is real: broad families split into sub-clusters, which split into micro-patterns.
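To make "no fixed k" concrete, here is a minimal sketch of plain DBSCAN, a simpler, flat relative of HDBSCAN. It is a stand-in, not HDBSCAN itself: it uses one fixed radius (`eps`) and produces no hierarchy, whereas HDBSCAN varies the density threshold and extracts clusters from the resulting tree. The points and parameter values are made up for the example.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding

    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The number of clusters is never passed in: two dense groups are found and the isolated point is labeled noise. For real workloads, a maintained HDBSCAN implementation is the better choice.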
3. Centroid labeling
Each cluster needs a description good enough to search against. We sample representative clips around the centroid: not just the single nearest point, but a weighted sample whose average converges on the cluster average. Then a vision-language model like Florence-2 or Qwen3-VL-8B synthesizes a human-readable description plus keywords.
Fallback: BLIP-2 for lighter-weight captioning
4. Index construction
The labeled clusters become a searchable index. Each cluster's description embedding, keyword set, and centroid vector are stored in a Mixpeek Vector Store. A single query checks the index, not every clip — reducing millions of comparisons to hundreds.
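The saving from checking the index first can be sketched as a two-stage search: rank cluster centroids against the query, then score only the clips inside the best-matching clusters. The in-memory dictionary layout, the cluster names, and the function `two_stage_search` are hypothetical and only illustrate the idea; a production system would use a vector store rather than Python lists.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_search(query, clusters, top_clusters=1, top_k=2):
    """clusters: {cluster_id: {"centroid": vec, "clips": {clip_id: vec}}}"""
    # Stage 1: rank clusters by centroid similarity to the query.
    ranked = sorted(
        clusters,
        key=lambda cid: cosine(query, clusters[cid]["centroid"]),
        reverse=True,
    )
    # Stage 2: score only the clips inside the best-matching clusters.
    candidates = []
    for cid in ranked[:top_clusters]:
        for clip_id, vec in clusters[cid]["clips"].items():
            candidates.append((cosine(query, vec), clip_id))
    candidates.sort(reverse=True)
    return [clip_id for _, clip_id in candidates[:top_k]]

index = {
    "ugc_closeup": {
        "centroid": [1.0, 0.1],
        "clips": {"ad_1": [1.0, 0.0], "ad_2": [0.9, 0.3]},
    },
    "studio_reveal": {
        "centroid": [0.1, 1.0],
        "clips": {"ad_3": [0.0, 1.0], "ad_4": [0.2, 0.9]},
    },
}
hits = two_stage_search([1.0, 0.05], index)
```

With `top_clusters=1`, the query is compared against two centroids and two clips instead of all four clips. The gap widens as clusters grow, at the cost of missing a clip whose cluster was not selected in stage 1.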
Paper: Johnson et al., 2019 — FAISS
A taxonomy you didn't design.
But one that's actually real.
When you cluster 50,000 ads across four axes, families emerge: UGC close-up with fast pacing, studio product shots with slow reveals, testimonial-led hooks with problem-agitate-solve arcs. These aren't categories you defined — they're patterns the data proved exist.
The hierarchy is navigable. A top-level cluster like "UGC Beauty" might contain sub-clusters for ASMR texture close-ups, before/after transformations, and routine walkthroughs. Each level has its own centroid description and keyword set.
This matters because search quality depends on the granularity of the taxonomy. A query for "organic-feeling skincare hook with product close-up" should match a specific sub-cluster, not a broad family. The deeper the tree, the more precise the retrieval.
The description is the index.
Get it wrong, retrieval fails.
A cluster centroid is just a point in vector space. To make it searchable by humans, you need a text description that faithfully represents everything in the cluster. Not just the closest item — the full spread.
Sampling strategies for centroid description
- Take the single nearest clip to the centroid. Fast, but not representative: it captures one mode of the cluster and misses edge cases.
- Sample clips weighted by distance from the centroid. Better coverage, but it can over-represent dense regions and miss sparse tails.
- Iteratively sample and refine the description until the average distance between the description embedding and random samples stops changing. The most expensive option, but the most representative of the three.
"You continuously sample until it asymptotically gets towards equal to the average. Check the difference between your representation and a random sample — and slowly modify it until that goes to constant."
The convergent approach uses vision-language models like Florence-2 and Qwen3-VL-8B to synthesize descriptions, then measures the cosine similarity between the description embedding and randomly sampled cluster members. When the similarity metric stabilizes, the description is representative.
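The stopping rule in the convergent approach can be sketched as follows: keep drawing random cluster members, track the running mean cosine similarity to the description embedding, and stop once that mean has moved less than a tolerance for several draws in a row. The function name, the tolerance, and the patience value are assumptions for illustration. The sketch covers only the measurement step; rewriting the description with a vision-language model is out of scope here.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def sample_until_stable(description_vec, members, tol=1e-3, patience=5,
                        max_samples=1000, seed=0):
    """Draw random members until the running mean similarity stops moving.

    Returns (mean_similarity, samples_drawn).
    """
    rng = random.Random(seed)
    total, mean, stable = 0.0, 0.0, 0
    for n in range(1, max_samples + 1):
        total += cosine(description_vec, rng.choice(members))
        new_mean = total / n
        # Count consecutive draws where the running mean barely changed.
        stable = stable + 1 if n > 1 and abs(new_mean - mean) < tol else 0
        mean = new_mean
        if stable >= patience:
            return mean, n
    return mean, max_samples  # budget exhausted before the mean settled

# Toy cluster: three member embeddings and one description embedding.
members = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
mean_sim, draws = sample_until_stable([1.0, 0.0], members)
```

A low but stable mean means the description sits consistently far from the cluster's members, which is the signal to revise the description and measure again.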
Can you find a clip without watching it?
The acid test for any index: pick a random clip, generate a description, search for it using the taxonomy. If you find it reliably, the index works. If you don't, the cluster descriptions aren't descriptive enough.
The retrieval test
- Select a random clip from the library
- Generate multiple natural-language descriptions of it (using Qwen3-VL-8B)
- Search the taxonomy index using each description
- Measure: does the correct cluster appear in top-k results?
- Repeat across a statistically significant sample
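The measurement step in the list above can be sketched as a recall-at-k function. The trial data is invented for the example: each trial pairs the cluster a clip truly belongs to with the ranked cluster IDs a search returned for one of its descriptions.

```python
def recall_at_k(trials, k=5):
    """Fraction of trials whose true cluster appears in the top-k results.

    trials: list of (true_cluster_id, ranked_cluster_ids) pairs.
    """
    if not trials:
        return 0.0
    hits = sum(1 for truth, ranked in trials if truth in ranked[:k])
    return hits / len(trials)

trials = [
    ("c7", ["c7", "c2", "c9"]),  # hit at rank 1
    ("c3", ["c1", "c3", "c4"]),  # hit at rank 2
    ("c5", ["c8", "c2", "c6"]),  # miss
    ("c1", ["c4", "c9", "c1"]),  # hit at rank 3
]
recall_top1 = recall_at_k(trials, k=1)
recall_top3 = recall_at_k(trials, k=3)
```

Four trials are far too few to conclude anything; the point is only the shape of the computation.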
What "good" looks like
The search isn't pure semantic — it's a combination of the description embedding (via BGE-M3 or Nomic Embed v2), the visual vector from CLIP ViT-L/14, and structured metadata (tags, keywords, performance features). Multiple signals compound.
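One simple way to combine several signals is a weighted sum, sketched below with text similarity, visual similarity, and keyword overlap. The weights and the use of Jaccard overlap for keywords are assumptions chosen for the example; the actual fusion method is not specified here.

```python
def jaccard(a, b):
    """Keyword overlap: shared keywords divided by all distinct keywords."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def fused_score(text_sim, visual_sim, query_keywords, cluster_keywords,
                weights=(0.5, 0.3, 0.2)):
    """Weighted sum of description, visual, and keyword signals."""
    w_text, w_visual, w_meta = weights
    return (
        w_text * text_sim
        + w_visual * visual_sim
        + w_meta * jaccard(query_keywords, cluster_keywords)
    )

score = fused_score(
    text_sim=0.8,
    visual_sim=0.6,
    query_keywords=["ugc", "skincare"],
    cluster_keywords=["ugc", "skincare", "close-up", "asmr"],
)
```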
Target: 95%+ recall at k=5 across randomized test queries. When recall drops below threshold, the cluster descriptions need more annotation — but annotating an existing cluster is trivial compared to tagging clips individually.
When you can't find it, you generate it.
That costs 10× more.
Every query either matches an existing clip or doesn't. A miss means generating new creative — filming, editing, or AI synthesis. The quality of the index directly determines how often you find vs. how often you pay to create.
This is why indexing quality isn't academic: it's a cost function. Every percentage point of recall translates directly to production savings. A 90% recall index means 10% of queries trigger expensive generation. A 98% recall index drops that to 2%, cutting generation cost by 80%.
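The arithmetic behind those recall figures can be checked in a few lines. The sketch assumes every miss triggers exactly one generation at a fixed cost; the query volume and unit cost are placeholder values.

```python
def expected_generation_cost(queries, recall, cost_per_generation):
    """Expected spend on new creative when every miss triggers generation."""
    return queries * (1.0 - recall) * cost_per_generation

baseline = expected_generation_cost(1000, 0.90, 1.0)  # about 100 misses
improved = expected_generation_cost(1000, 0.98, 1.0)  # about 20 misses
savings = 1.0 - improved / baseline                   # relative cost reduction
```

Because the saving is relative to the miss rate rather than to recall itself, each additional point of recall is worth proportionally more as the index improves.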
And because cluster annotation is incremental (you can always add more descriptions, keywords, and sample points to an existing cluster), the ROI of improving the index compounds over time. The library gets more searchable with every iteration — and every new ad that enters the system enriches the taxonomy.
The models that power discovery
Every stage of the pipeline maps to specific models. Mixpeek orchestrates these through Feature Extractors and Retrievers, so you configure which models run at each stage.
- Encode composition, color, framing, and visual style into dense vectors.
- Embed hook scripts, CTA language, and cluster descriptions for search.
- Generate human-readable descriptions and keyword sets for each cluster.
- Store and search cluster vectors at billion scale.
Key papers
Billion-scale similarity search with GPUs
Johnson et al., 2019 (Meta AI)
The FAISS library — billion-scale vector similarity search with GPU-accelerated clustering.
Learning Transferable Visual Models From Natural Language Supervision
Radford et al., 2021 (OpenAI)
CLIP — the contrastive model that bridges visual and textual embedding spaces.
DINOv2: Learning Robust Visual Features without Supervision
Oquab et al., 2023 (Meta AI)
Self-supervised vision features that cluster visual similarity without any labels.
hdbscan: Hierarchical density based clustering
McInnes et al., 2017
Density-based clustering that discovers natural groupings without requiring a fixed k.
ColPali: Efficient Document Retrieval with Vision Language Models
Faysse et al., 2024
Late-interaction retrieval over visual patches — relevant to multi-modal taxonomy matching.
From taxonomy to production query
Once the taxonomy is discovered, it feeds directly into Mixpeek's retrieval layer. A Creative DNA query doesn't scan every clip — it traverses the taxonomy tree, matches against cluster descriptions, then retrieves the best candidates within matching clusters.
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_KEY")

# Step 1: Discover taxonomies from your ad library
taxonomy = mx.clusters.create(
    collection="ad_library",
    features=["visual_embedding", "pacing", "hook_script", "performance"],
    method="hdbscan",
    min_cluster_size=15,
    label_model="florence-2-large",  # auto-generate cluster descriptions
)

# Step 2: Browse what emerged
for cluster in taxonomy.clusters:
    print(f"{cluster.id}: {cluster.label}")
    print(f"  Keywords: {cluster.keywords}")
    print(f"  Size: {cluster.count} ads")
    print(f"  Avg ROAS: {cluster.metadata['avg_roas']:.1f}x")

# Step 3: Search using the taxonomy
results = mx.retrievers.search(
    collection="ad_library",
    query="fast-paced UGC hook with product close-up, cold fintech audience",
    taxonomy_id=taxonomy.id,  # searches clusters first, then clips
    top_k=10,
    filters={"performance.roas_7d": {"$gte": 2.5}},
)

# Step 4: Check if you need to generate or can reuse
for result in results:
    if result.score > 0.85:
        print(f"✓ Reuse: {result.asset_id} (score: {result.score:.2f})")
    else:
        print(f"✗ Generate: no strong match (best: {result.score:.2f})")

The cluster ID becomes a first-class input. Instead of running a flat vector search across millions of clips, the retriever traverses the taxonomy: it matches your query to cluster descriptions first, then searches within the top-matching clusters. This reduces latency by orders of magnitude while improving precision.
And because the taxonomy is discovered, not defined, it evolves as your library grows. New ads get assigned to existing clusters or spawn new ones. The system learns what kinds of creative you produce — and what kinds you don't yet have.
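The "assign or spawn" decision for a new ad can be sketched as a nearest-centroid lookup with a distance cutoff. The cutoff value, the cluster names, and the function `assign_cluster` are hypothetical; in practice the threshold would be tuned per library, and ads that fit nowhere would be re-clustered in batches rather than one at a time.

```python
import math

def assign_cluster(vec, centroids, max_distance=0.5):
    """Return the nearest cluster id, or None if nothing is close enough."""
    best_id, best_dist = None, float("inf")
    for cid, centroid in centroids.items():
        d = math.dist(vec, centroid)
        if d < best_dist:
            best_id, best_dist = cid, d
    return best_id if best_dist <= max_distance else None

centroids = {"ugc_beauty": [0.0, 0.0], "studio_product": [5.0, 5.0]}
known = assign_cluster([0.1, 0.2], centroids)   # close to an existing centroid
novel = assign_cluster([2.5, 2.5], centroids)   # far from every centroid
```

A `None` result is the useful signal: enough of them in the same region of the vector space suggests a kind of creative the taxonomy does not yet cover.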
See the taxonomy in your library
20 minutes. Your ads. Real clustering. We'll show you what families already exist in your library.