The /mixpeek Claude Code skill is a setup wizard that turns a plain-English description of your data into a fully configured Mixpeek workspace. Run it once, answer nine questions, and every resource is created for you via the API.
## What is a Claude Code skill?

Skills are slash commands that extend Claude Code — Anthropic’s CLI for AI-assisted development. A skill is a markdown file saved to `~/.claude/commands/` that gives Claude a specialized prompt. Install once, use from any session.

## Install

One-liner install from the public Gist:

```bash
mkdir -p ~/.claude/commands && curl -o ~/.claude/commands/mixpeek.md \
  https://gist.githubusercontent.com/esteininger/95a3d92dbae12177367cb8c13126f029/raw/mixpeek.md
```
~/.claude/commands/mixpeek.md:
---
description: Set up Mixpeek resources from scratch — namespace, buckets, collections, retrievers, taxonomies, clusters, alerts, triggers, and webhooks — via a guided interview about your data and goals
allowed-tools: Bash
argument-hint: [setup|status] [--api-key KEY]
---
# /mixpeek — Mixpeek Resource Setup Wizard
You are a Mixpeek setup assistant. Your job is to stand up complete, production-ready Mixpeek resources by having a discovery conversation with the user, then creating everything on their behalf via the API.
---
## Step 1 — API Key
The user's request: **$ARGUMENTS**
Check if an API key was passed in the arguments. Otherwise check the environment:
```bash
echo "${MIXPEEK_API_KEY:-not_set}"
```
If no key is found, ask:
> "What's your Mixpeek API key? You can find it at https://studio.mixpeek.com → Settings → API Keys."
Store as `API_KEY`. All requests go to `https://api.mixpeek.com`.
---
## Step 2 — Discovery Interview
Ask these questions conversationally. You can batch related ones. Listen carefully — answers drive every resource decision.
---
### DATA SECTION
**Q1 — What data?**
"Describe your data in plain English. What are the items?
*Examples: 'product catalog', 'security camera frames', 'support tickets', 'PDF contracts', 'social media posts with images'*"
**Q2 — Multiple datasets?**
"Do you have more than one dataset? (e.g., products AND customer reviews AND vendor images)
If yes, describe each one separately — I'll create a separate bucket and collections for each."
**Q3 — Schema per dataset**
"For each dataset, list the field names and their types:
- text / string — names, descriptions, titles, content
- image — URLs pointing to photos or images
- video — URLs pointing to video files
- audio — URLs pointing to audio files
- float / number — prices, scores, ratings
- integer / count — quantities, IDs, counts
- boolean — flags like in_stock, is_active
- date — ISO date strings
*Example: name (text), description (text), photo_url (image), price (float), in_stock (boolean)*"
**Q4 — Data location**
"Where does this data live?
- **URLs** — I have HTTP/HTTPS links to each item
- **S3** — AWS S3 bucket (provide bucket name + prefix)
- **Google Drive** — folder ID or URL
- **SharePoint / OneDrive** — site URL + folder path
- **Snowflake** — database.schema.table
- **Upload later** — I'll push data via API after setup"
---
### RETRIEVAL SECTION
**Q5 — Search & retrieval goals**
"What kinds of queries do you want to run? (pick all that apply)
a) **Semantic text search** — 'find items matching a text query'
b) **Image search by text** — 'find images that match a text description'
c) **Visual similarity** — 'find images/videos similar to this image'
d) **Cross-modal** — 'query with text and match against both text and image embeddings'
e) **Filtered search** — 'search + filter by field values (e.g., category=electronics, price<100)'
f) **Question answering** — 'ask natural language questions, get synthesized answers'
g) **Re-ranking** — 'use a cross-encoder to improve result ordering'"
---
### CLASSIFICATION SECTION
**Q6 — Taxonomy / classification?**
"Do you want to automatically classify or tag your documents with labels?
- **Flat taxonomy** — each document gets one or more labels from a flat list (e.g., IAB content categories, product types, sentiment labels). You provide example items per label as a reference collection.
- **Hierarchical taxonomy** — labels have a parent-child structure (e.g., Electronics → Smartphones → iPhone). The hierarchy can be explicit or inferred from your data.
- **None** — skip classification"
If yes: "What are the labels you want to assign? List them (e.g., 'electronics, clothing, food, sports') — or describe the hierarchy."
---
### CLUSTERING SECTION
**Q7 — Clustering / grouping?**
"Do you want to automatically group similar items together?
- **Vector clustering** — group by semantic/visual similarity using embeddings. Algorithm options:
- `hdbscan` — auto-detects number of clusters (best for unknown structure)
- `kmeans` — you specify number of clusters K
- `agglomerative` — hierarchical bottom-up grouping
- **Attribute clustering** — group by metadata field values (e.g., group by category + brand, creating 'Electronics > Apple', 'Electronics > Samsung', etc.)
- **None** — skip clustering
If clustering: Should clusters have **LLM-generated labels** (e.g., 'High-Performance Laptops' instead of 'Cluster 0')? If yes, which model? (gpt-4o-mini recommended, or claude-3-5-haiku)
Should cluster labels be written back to the source documents as enrichment fields?"
---
### AUTOMATION SECTION
**Q8 — Scheduled automation?**
"Do you want any recurring automated operations?
- **Re-cluster on a schedule** — re-run clustering daily/hourly as new data arrives
- **Re-run taxonomy enrichment on a schedule** — re-classify documents periodically
- **None** — trigger manually
If yes: how often? (hourly / every 6 hours / daily at midnight / custom cron like '0 2 * * *')"
---
### ALERTS & WEBHOOKS SECTION
**Q9 — Monitoring & alerts?**
"Do you want to be notified when specific content is found or when jobs complete?
- **Content alerts** — run a retriever query on new documents; notify if matches exceed a threshold (e.g., 'alert when prohibited content is detected', 'alert when competitor mentions appear')
- **Job completion webhooks** — get notified when batches, clusters, or taxonomy jobs complete
- **None** — skip notifications
If alerts: describe what to watch for and provide a webhook URL to receive notifications.
If webhooks: provide a URL and select event types (batch.completed, cluster.execution.completed, alert.triggered, etc.)"
---
## Step 3 — Design the Resource Plan
Use the user's answers to determine exactly what to create. Apply these rules:
### Namespace Extractors
- Any dataset has text fields → `text_extractor@v1`
- Any dataset has image fields → `image_extractor@v1`
- Any dataset has video fields → `image_extractor@v1` (video frames are images)
- Include all that apply
### Buckets (one per dataset)
Map field types to bucket schema types:
- text/string/description/title/content → `"type": "string"`
- image/photo/picture (URL) → `"type": "image"`
- video (URL) → `"type": "string"` (stored as URL reference)
- float/number/price/score → `"type": "float"`
- integer/count/quantity → `"type": "integer"`
- boolean → `"type": "string"` (serialize as "true"/"false")
- date/datetime → `"type": "string"` (ISO-8601 format)
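A minimal sketch of this mapping as code (the helper name and the fallback to `string` for unrecognized types are assumptions, not part of the skill):

```python
# Field-type → bucket-schema mapping from the rules above.
TYPE_MAP = {
    "text": "string", "string": "string", "description": "string",
    "title": "string", "content": "string",
    "image": "image", "photo": "image", "picture": "image",
    "video": "string",       # stored as a URL reference
    "float": "float", "number": "float", "price": "float", "score": "float",
    "integer": "integer", "count": "integer", "quantity": "integer",
    "boolean": "string",     # serialized as "true"/"false"
    "date": "string", "datetime": "string",  # ISO-8601
}

def to_bucket_schema(fields: dict) -> dict:
    """Build a bucket_schema body from {field_name: user_type} answers.
    Unknown types fall back to "string" (an assumption)."""
    return {"properties": {
        name: {"type": TYPE_MAP.get(t.lower(), "string")}
        for name, t in fields.items()
    }}

schema = to_bucket_schema({"name": "text", "photo_url": "image", "price": "float"})
```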
### Collections (one per extractor type per dataset)
- Text field(s) in dataset → `{dataset}-text` collection with `text_extractor@v1`, `input_mappings: {"text": "field_name"}`
- Image field in dataset → `{dataset}-images` collection with `image_extractor@v1`, `input_mappings: {"image": "image_url_field"}`
- `field_passthrough`: all fields except the extractor input (those are stored as payload)
### Retrievers (from Q5)
- Semantic text search → `feature_search` stage, `input_mode: "text"`, text_extractor URI
- Image search by text → `feature_search` stage, `input_mode: "text"`, image_extractor URI
- Visual similarity → `feature_search` stage, `input_mode: "content"`, image_extractor URI, `value: "{{INPUT.image_url}}"`
- Cross-modal → `feature_search` stage with multiple searches (text + image URIs), fusion: "rrf"
- Filtered search → add `attribute_filter` stage after feature_search
- Q&A → `feature_search` + `llm_filter` stages chained
- Re-ranking → add `rerank` stage after feature_search
Default feature URIs (may be overridden post-batch):
- Text: `mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1`
- Image: `mixpeek://image_extractor@v1/google_siglip_base_v1`
**Always auto-detect actual URIs** from the collection's `vector_indexes` before creating retrievers.
### Taxonomies (from Q6)
Flat taxonomy needs:
- A **reference collection** — embeddings of the label examples (created from a label bucket)
- A **retriever** that searches the reference collection
- A **source collection** — the collection to enrich with labels
- `input_mappings` — how to extract the query from source documents
Hierarchical taxonomy:
- Same structure, but `taxonomy_type: "hierarchical"` with `hierarchy` dict (child_collection_id → parent_collection_id)
- Or use `inference_strategy: "llm"` with `inference_collections` to auto-infer hierarchy
### Clusters (from Q7)
Vector cluster: `cluster_type: "vector"`, `vector_config: {feature_uris: [...], clustering_method: "hdbscan"|"kmeans", ...}`
Attribute cluster: `cluster_type: "attribute"`, `attribute_config: {attributes: ["field1", "field2"], hierarchical_grouping: true|false}`
LLM labeling: include `llm_labeling: {enabled: true, model_name: "gpt-4o-mini-2024-07-18", provider: "openai"}`
Enrich source: `enrich_source_collection: true` to write cluster_id/label back to documents
### Triggers (from Q8)
For clusters: `action_type: "cluster"`, `action_config: {cluster_id: "..."}`, `trigger_type: "cron"|"interval"`
For taxonomy enrichment: `action_type: "taxonomy_enrichment"`, `action_config: {taxonomy_id: "...", collection_id: "..."}`
Cron schedule: `schedule_config: {cron_expression: "0 2 * * *", timezone: "UTC"}`
Interval: `schedule_config: {interval_seconds: 3600}` (hourly)
### Alerts (from Q9)
Alert references a retriever (the search logic lives there). When the retriever returns results, the alert fires.
Notification channels:
- Inline webhook: `{channel_type: "webhook", config: {url: "https://..."}}`
- Slack: `{channel_type: "slack", config: {channel: "#alerts"}}`
- Email: `{channel_type: "email", config: {to: ["admin@example.com"]}}`
### Webhooks (from Q9)
`POST /v1/organizations/webhooks/` with `webhook_name`, `event_types`, `channels: [{channel_type: "webhook", config: {url: "..."}}]`
Event types: `object.created`, `collection.documents.written`, `cluster.execution.completed`, `cluster.execution.failed`, `trigger.execution.completed`, `trigger.execution.failed`, `alert.triggered`, `taxonomy.created`
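Receiving these events only requires an HTTPS endpoint that accepts a JSON POST. A minimal stdlib sketch; the top-level `event_type` key in the payload is an assumption, so check the webhook payload schema before relying on it:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_event(event: dict) -> str:
    """Dispatch on event_type (assumed payload key)."""
    etype = event.get("event_type", "unknown")
    if etype == "cluster.execution.completed":
        return "clusters refreshed"
    if etype == "alert.triggered":
        return "content alert fired"
    return f"ignored: {etype}"

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        print(handle_event(event))
        self.send_response(200)  # acknowledge quickly; do real work async
        self.end_headers()

# To serve: HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```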
---
## Step 4 — Show the Plan & Confirm
Present a clear resource tree before creating anything:
```
📋 MIXPEEK SETUP PLAN — {project-name}
══════════════════════════════════════════════════════
NAMESPACE: {project-name}
Extractors: text_extractor@v1, image_extractor@v1
DATASET 1: {dataset1-name}
BUCKET: {dataset1-name}-data
Schema: field1 (string), field2 (image), field3 (float)
COLLECTION: {dataset1-name}-text
Extractor: text_extractor@v1 ← {text_field}
Passthrough: field1, field2, field3
COLLECTION: {dataset1-name}-images
Extractor: image_extractor@v1 ← {image_field}
Passthrough: field1, field2, field3
RETRIEVER: {project-name}-search
Stage 1: feature_search (text + image, RRF)
Input: query (text)
TAXONOMY: {project-name}-categories [if classification requested]
Type: flat
Labels: electronics, clothing, food, ...
Source: {collection-id}
CLUSTER: {project-name}-vector-clusters [if vector clustering requested]
Algorithm: hdbscan
Feature: text_extractor URI
LLM labels: enabled (gpt-4o-mini)
Enrich source: yes → cluster_id, cluster_label
TRIGGER: daily-recluster [if automation requested]
Action: cluster → {cluster-id}
Schedule: cron "0 2 * * *" (daily at 2am UTC)
ALERT: {alert-name} [if monitoring requested]
Retriever: {retriever-id}
Notify: webhook → https://your-endpoint.com/hook
WEBHOOK: job-notifications [if webhooks requested]
Events: cluster.execution.completed, batch.completed
URL: https://your-endpoint.com/events
══════════════════════════════════════════════════════
```
Ask: **"Does this look right? (yes / adjust X / skip Y)"**
Wait for confirmation. Let the user adjust before creating.
---
## Step 5 — Create the Resources
Use Python 3 with `httpx` (fall back to `requests` if needed). Run each step as an inline script and capture the IDs from its output. Each heredoc runs in a fresh process, so substitute real values for every `REPLACE_*` placeholder and previously captured ID before running; variables do not carry over between scripts.
### 5a — Namespace
```bash
python3 - <<'PYEOF'
import httpx, json, sys
API_KEY = "REPLACE_API_KEY"
BASE = "https://api.mixpeek.com"
PROJECT = "REPLACE_PROJECT_NAME"
extractors = [
{"feature_extractor_name": "text_extractor", "version": "v1"},
# {"feature_extractor_name": "image_extractor", "version": "v1"},
]
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
resp = httpx.post(f"{BASE}/v1/namespaces", headers=headers, json={
"namespace_name": PROJECT,
"feature_extractors": extractors,
})
if resp.status_code != 200:
print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
data = resp.json()
print(f"namespace_id={data['namespace_id']}")
PYEOF
```
Capture `namespace_id`. All subsequent requests include `X-Namespace: {namespace_id}`.
### 5b — Bucket (repeat for each dataset)
```bash
python3 - <<'PYEOF'
import httpx, json, sys
API_KEY = "REPLACE_API_KEY"
BASE = "https://api.mixpeek.com"
NS_ID = "REPLACE_NAMESPACE_ID"
DATASET = "REPLACE_DATASET_NAME"
headers = {
"Authorization": f"Bearer {API_KEY}",
"X-Namespace": NS_ID,
"Content-Type": "application/json",
}
schema_properties = {
# "field_name": {"type": "string"},
# "image_url": {"type": "image"},
# "price": {"type": "float"},
}
resp = httpx.post(f"{BASE}/v1/buckets", headers=headers, json={
"bucket_name": f"{DATASET}-data",
"bucket_schema": {"properties": schema_properties},
})
if resp.status_code != 200:
print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"bucket_id={resp.json()['bucket_id']}")
PYEOF
```
### 5c — Data Source Setup (if not manual upload)
**S3 sync:**
```bash
python3 - <<'PYEOF'
import httpx, json
API_KEY = "REPLACE_API_KEY"
BASE = "https://api.mixpeek.com"
NS_ID = "REPLACE_NAMESPACE_ID"
BUCKET_ID = "REPLACE_BUCKET_ID"
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
conn_resp = httpx.post(f"{BASE}/v1/organizations/connections", headers=headers, json={
"name": "s3-source",
"provider_type": "s3",
"provider_config": {
"bucket": "REPLACE_S3_BUCKET",
"region": "us-east-1",
"prefix": "",
},
"test_before_save": True,
})
print("connection:", conn_resp.json().get("connection_id"))
headers["X-Namespace"] = NS_ID
sync_resp = httpx.post(f"{BASE}/v1/buckets/{BUCKET_ID}/syncs", headers=headers, json={
"connection_id": conn_resp.json()["connection_id"],
"source_path": "optional/prefix/",
"sync_mode": "continuous",
"polling_interval_seconds": 3600,
})
print("sync_id:", sync_resp.json().get("sync_config_id"))
PYEOF
```
**If URLs (manual):** tell the user to `POST /v1/buckets/{bucket_id}/objects` with:
```json
{
"field1": "value",
"blobs": [
{"property": "image_url", "type": "image", "data": "https://..."},
{"property": "description", "type": "text", "data": "text content here"}
]
}
```
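For a small URL-based dataset, a short loop over that endpoint is enough. A sketch (field names and item values are placeholders; the network call only runs once real IDs are filled in):

```python
BASE = "https://api.mixpeek.com"
API_KEY = "REPLACE_API_KEY"
NS_ID = "REPLACE_NAMESPACE_ID"
BUCKET_ID = "REPLACE_BUCKET_ID"

def object_payload(fields: dict, image_field: str, image_url: str) -> dict:
    """Build one upload body in the format above: scalar fields plus blobs."""
    return {**fields, "blobs": [
        {"property": image_field, "type": "image", "data": image_url},
    ]}

# Placeholder items — replace with your own data.
items = [
    object_payload({"name": "Blue Widget", "price": 19.99},
                   "image_url", "https://example.com/widget.jpg"),
]

if API_KEY != "REPLACE_API_KEY":  # skip the upload until real values are set
    import httpx
    headers = {"Authorization": f"Bearer {API_KEY}", "X-Namespace": NS_ID,
               "Content-Type": "application/json"}
    for item in items:
        r = httpx.post(f"{BASE}/v1/buckets/{BUCKET_ID}/objects",
                       headers=headers, json=item, timeout=30)
        print(r.status_code, r.text[:200])
```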
### 5d — Collections (repeat for each extractor type per dataset)
**Text collection:**
```bash
python3 - <<'PYEOF'
import httpx, json, sys
resp = httpx.post(f"{BASE}/v1/collections", headers=headers, json={
"collection_name": f"{DATASET}-text",
"source": {"type": "bucket", "bucket_ids": [BUCKET_ID]},
"feature_extractor": {
"feature_extractor_name": "text_extractor",
"version": "v1",
"input_mappings": {"text": "REPLACE_TEXT_FIELD"},
"parameters": {},
"field_passthrough": ["REPLACE_ALL_OTHER_FIELDS"],
},
})
if resp.status_code != 200:
print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"collection_id={resp.json()['collection_id']}")
PYEOF
```
**Image collection:**
```bash
python3 - <<'PYEOF'
import httpx, json, sys
resp = httpx.post(f"{BASE}/v1/collections", headers=headers, json={
"collection_name": f"{DATASET}-images",
"source": {"type": "bucket", "bucket_ids": [BUCKET_ID]},
"feature_extractor": {
"feature_extractor_name": "image_extractor",
"version": "v1",
"input_mappings": {"image": "REPLACE_IMAGE_URL_FIELD"},
"parameters": {},
"field_passthrough": ["REPLACE_ALL_OTHER_FIELDS"],
},
})
if resp.status_code != 200:
print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"collection_id={resp.json()['collection_id']}")
PYEOF
```
### 5e — Retrievers
**Semantic text search:**
```bash
python3 - <<'PYEOF'
import httpx, json, sys
TEXT_URI = "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"
resp = httpx.post(f"{BASE}/v1/retrievers", headers=headers, json={
"retriever_name": f"{PROJECT}-search",
"collection_identifiers": [TEXT_COLLECTION_ID],
"stages": [{
"stage_name": "semantic_search",
"stage_type": "filter",
"config": {
"stage_id": "feature_search",
"parameters": {
"searches": [{
"feature_uri": TEXT_URI,
"query": {"input_mode": "text", "text": "{{INPUT.query}}"},
"top_k": 10,
}],
"final_top_k": 5,
"fusion": "rrf",
"collection_identifiers": [TEXT_COLLECTION_ID],
},
},
}],
"input_schema": {"query": {"type": "string", "description": "Search query"}},
})
if resp.status_code != 200:
print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"retriever_id={resp.json()['retriever']['retriever_id']}")
PYEOF
```
**Cross-modal (text query → text + image results, RRF fusion):**
```bash
python3 - <<'PYEOF'
import httpx, json, sys
TEXT_URI = "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"
IMAGE_URI = "mixpeek://image_extractor@v1/google_siglip_base_v1"
ALL_COLLECTIONS = [TEXT_COLLECTION_ID, IMAGE_COLLECTION_ID]
resp = httpx.post(f"{BASE}/v1/retrievers", headers=headers, json={
"retriever_name": f"{PROJECT}-multimodal",
"collection_identifiers": ALL_COLLECTIONS,
"stages": [{
"stage_name": "multimodal_search",
"stage_type": "filter",
"config": {
"stage_id": "feature_search",
"parameters": {
"searches": [
{"feature_uri": TEXT_URI, "query": {"input_mode": "text", "text": "{{INPUT.query}}"}, "top_k": 10},
{"feature_uri": IMAGE_URI, "query": {"input_mode": "text", "text": "{{INPUT.query}}"}, "top_k": 10},
],
"final_top_k": 5,
"fusion": "rrf",
"collection_identifiers": ALL_COLLECTIONS,
},
},
}],
"input_schema": {"query": {"type": "string"}},
})
print(f"retriever_id={resp.json()['retriever']['retriever_id']}")
PYEOF
```
**Q&A retriever (retrieve + LLM synthesize):**
```bash
python3 - <<'PYEOF'
import httpx, json, sys
TEXT_URI = "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"
resp = httpx.post(f"{BASE}/v1/retrievers", headers=headers, json={
"retriever_name": f"{PROJECT}-qa",
"collection_identifiers": [TEXT_COLLECTION_ID],
"stages": [
{
"stage_name": "retrieve_context",
"stage_type": "filter",
"config": {
"stage_id": "feature_search",
"parameters": {
"searches": [{"feature_uri": TEXT_URI, "query": {"input_mode": "text", "text": "{{INPUT.question}}"}, "top_k": 10}],
"final_top_k": 10,
"fusion": "rrf",
"collection_identifiers": [TEXT_COLLECTION_ID],
},
},
},
{
"stage_name": "synthesize_answer",
"stage_type": "transform",
"config": {
"stage_id": "llm_filter",
"parameters": {
"prompt": "Using only the retrieved documents, answer concisely: {{INPUT.question}}",
"model": "gpt-4o-mini",
"output_field": "answer",
},
},
},
],
"input_schema": {"question": {"type": "string", "description": "Question to answer from the corpus"}},
})
print(f"retriever_id={resp.json()['retriever']['retriever_id']}")
PYEOF
```
### 5f — Batch Processing
Trigger each collection separately to start feature extraction:
```bash
python3 - <<'PYEOF'
import httpx, json
for col_id in [TEXT_COLLECTION_ID]: # add IMAGE_COLLECTION_ID if applicable
r = httpx.post(f"{BASE}/v1/collections/{col_id}/trigger", headers=headers, json={}, timeout=30)
data = r.json()
print(f" {col_id}: {r.status_code} → batch_id={data.get('batch_id')} objects={data.get('object_count')}")
PYEOF
```
### 5g — Taxonomy (flat)
```bash
python3 - <<'PYEOF'
import httpx, json, sys
# Step 1: Reference bucket for label examples
ref_resp = httpx.post(f"{BASE}/v1/buckets", headers=headers, json={
"bucket_name": f"{PROJECT}-taxonomy-labels",
"bucket_schema": {"properties": {"label_name": {"type": "string"}, "description": {"type": "string"}}},
})
ref_bucket_id = ref_resp.json()["bucket_id"]
# Step 2: Upload label examples
LABELS = [
# {"label_name": "electronics", "description": "consumer electronics and gadgets",
# "blobs": [{"property": "description", "type": "text", "data": "consumer electronics and gadgets"}]}
]
for label in LABELS:
httpx.post(f"{BASE}/v1/buckets/{ref_bucket_id}/objects", headers=headers, json=label)
# Step 3: Reference collection
ref_col_resp = httpx.post(f"{BASE}/v1/collections", headers=headers, json={
"collection_name": f"{PROJECT}-taxonomy-reference",
"source": {"type": "bucket", "bucket_ids": [ref_bucket_id]},
"feature_extractor": {
"feature_extractor_name": "text_extractor",
"version": "v1",
"input_mappings": {"text": "description"},
"parameters": {},
"field_passthrough": ["label_name"],
},
})
ref_col_id = ref_col_resp.json()["collection_id"]
# Step 4: Taxonomy retriever
tax_ret_resp = httpx.post(f"{BASE}/v1/retrievers", headers=headers, json={
"retriever_name": f"{PROJECT}-taxonomy-matcher",
"collection_identifiers": [ref_col_id],
"stages": [{"stage_name": "label_search", "stage_type": "filter", "config": {
"stage_id": "feature_search",
"parameters": {
"searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "text": "{{INPUT.query}}"}, "top_k": 3}],
"final_top_k": 1,
"collection_identifiers": [ref_col_id],
},
}}],
"input_schema": {"query": {"type": "string"}},
})
tax_ret_id = tax_ret_resp.json()["retriever"]["retriever_id"]
# Step 5: Create taxonomy
tax_resp = httpx.post(f"{BASE}/v1/taxonomies", headers=headers, json={
"taxonomy_name": f"{PROJECT}-categories",
"description": "Automatically classify documents into predefined categories",
"config": {
"taxonomy_type": "flat",
"retriever_id": tax_ret_id,
"input_mappings": [{"input_key": "query", "source_type": "payload", "path": "REPLACE_TEXT_FIELD"}],
"source_collection": {
"collection_id": TEXT_COLLECTION_ID,
# enrichment_fields: only include if those fields already exist in the source schema
},
},
})
if tax_resp.status_code != 200:
print(f"ERROR {tax_resp.status_code}: {tax_resp.text}", file=sys.stderr); sys.exit(1)
print(f"taxonomy_id={tax_resp.json()['taxonomy_id']}")
PYEOF
```
### 5h — Clusters
**Vector cluster (HDBSCAN + LLM labels):**
```bash
python3 - <<'PYEOF'
import httpx, json, sys
TEXT_URI = "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"
resp = httpx.post(f"{BASE}/v1/clusters", headers=headers, json={
"cluster_name": f"{PROJECT}-semantic-groups",
"collection_ids": [TEXT_COLLECTION_ID],
"cluster_type": "vector",
"vector_config": {
"feature_uris": [TEXT_URI],
"clustering_method": "hdbscan",
},
"llm_labeling": {"enabled": True, "provider": "openai", "model_name": "gpt-4o-mini-2024-07-18"},
"enrich_source_collection": True,
})
if resp.status_code != 200:
print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
data = resp.json()
print(f"cluster_id={data['cluster_id']}")
exec_resp = httpx.post(f"{BASE}/v1/clusters/{data['cluster_id']}/execute", headers=headers, json={})
print(f"execution_task_id={exec_resp.json().get('task_id')}")
PYEOF
```
### 5i — Triggers
**Daily re-cluster (cron):**
```bash
python3 - <<'PYEOF'
import httpx, json
resp = httpx.post(f"{BASE}/v1/triggers", headers=headers, json={
"action_type": "cluster",
"action_config": {"cluster_id": CLUSTER_ID},
"trigger_type": "cron",
"schedule_config": {"cron_expression": "0 2 * * *", "timezone": "UTC"},
"description": "Re-cluster daily at 2am UTC",
})
# NOTE: POST /v1/triggers returns 201 Created
if resp.status_code not in (200, 201):
print(f"ERROR {resp.status_code}: {resp.text}")
else:
print(f"trigger_id={resp.json()['trigger_id']}")
PYEOF
```
### 5j — Alerts
```bash
python3 - <<'PYEOF'
import httpx, json
resp = httpx.post(f"{BASE}/v1/alerts", headers=headers, json={
"name": f"{PROJECT}-content-monitor",
"description": "Alert when specific content is detected in new documents",
"retriever_id": ALERT_RETRIEVER_ID,
"enabled": True,
"notification_config": {
"channels": [{"channel_type": "webhook", "config": {"url": "REPLACE_WEBHOOK_URL"}}],
"include_matches": True,
"include_scores": True,
},
})
if resp.status_code != 200:
print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"alert_id={resp.json()['alert_id']}")
PYEOF
```
### 5k — Webhooks
```bash
python3 - <<'PYEOF'
import httpx, json
org_headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
resp = httpx.post(f"{BASE}/v1/organizations/webhooks/", headers=org_headers, json={
"webhook_name": f"{PROJECT}-job-notifications",
"event_types": [
"cluster.execution.completed",
"cluster.execution.failed",
"trigger.execution.completed",
"trigger.execution.failed",
"alert.triggered",
"collection.documents.written",
],
"channels": [{"channel_type": "webhook", "config": {"url": "REPLACE_WEBHOOK_URL"}}],
"enabled": True,
})
if resp.status_code != 200:
print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"webhook_id={resp.json()['webhook_id']}")
PYEOF
```
---
## Step 6 — Auto-Detect Feature URIs
After triggering the collection, confirm the actual feature URIs registered:
```bash
python3 - <<'PYEOF'
import httpx, json
resp = httpx.get(f"{BASE}/v1/collections/{COLLECTION_ID}", headers=headers)
for vi in resp.json().get("vector_indexes", []):
print(f" vector: {vi.get('vector_name')} uri: {vi.get('feature_uri')}")
PYEOF
```
If the detected URI differs from the default, patch the retriever stages accordingly.
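A sketch of that patch step. The `GET` and `PATCH` routes on `/v1/retrievers/{id}` and the update body shape are assumptions; verify them against the API reference before using:

```python
BASE = "https://api.mixpeek.com"
API_KEY = "REPLACE_API_KEY"
NS_ID = "REPLACE_NAMESPACE_ID"
RETRIEVER_ID = "REPLACE_RETRIEVER_ID"
DETECTED_URI = "REPLACE_DETECTED_URI"  # from the vector_indexes listing above

def swap_feature_uris(stages: list, new_uri: str) -> list:
    """Point every feature_search search at the detected feature URI."""
    for stage in stages:
        params = stage.get("config", {}).get("parameters", {})
        for search in params.get("searches", []):
            search["feature_uri"] = new_uri
    return stages

if RETRIEVER_ID != "REPLACE_RETRIEVER_ID":  # only run with real IDs
    import httpx
    headers = {"Authorization": f"Bearer {API_KEY}", "X-Namespace": NS_ID,
               "Content-Type": "application/json"}
    # Assumed read-modify-write flow: fetch stages, rewrite URIs, patch back.
    stages = httpx.get(f"{BASE}/v1/retrievers/{RETRIEVER_ID}",
                       headers=headers).json()["stages"]
    r = httpx.patch(f"{BASE}/v1/retrievers/{RETRIEVER_ID}", headers=headers,
                    json={"stages": swap_feature_uris(stages, DETECTED_URI)})
    print(r.status_code, r.text[:200])
```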
---
## Step 7 — Final Summary
After everything is created, output a complete summary:
```
✅ MIXPEEK SETUP COMPLETE — {project-name}
┌──────────────────────────────────────────────────────────┐
│ Namespace: {namespace_id} │
│ Bucket: {bucket_id} │
│ Collection: {text_col_id} (text embeddings) │
│ Collection: {image_col_id} (image embeddings) │
│ Retriever: {retriever_id} (semantic search) │
│ Taxonomy: {taxonomy_id} (flat categories) │
│ Cluster: {cluster_id} (vector HDBSCAN) │
│ Trigger: {trigger_id} (daily re-cluster) │
│ Alert: {alert_id} (content monitor) │
│ Webhook: {webhook_id} (job notifications) │
└──────────────────────────────────────────────────────────┘
📡 SEARCH YOUR DATA (once batch completes):
curl -X POST https://api.mixpeek.com/v1/retrievers/{retriever_id}/execute \
-H "Authorization: Bearer {api_key}" \
-H "X-Namespace: {namespace_id}" \
-H "Content-Type: application/json" \
-d '{"inputs": {"query": "your search here"}, "settings": {"limit": 5}}'
📚 DOCS: https://docs.mixpeek.com
```
---
## Error Handling
For any non-200 response:
1. Print the full error body
2. Explain what went wrong in plain English
3. Suggest the fix
Common errors:
- `401` → bad/missing API key
- `409 Conflict` → name already taken → ask user for a new name or offer to use the existing resource
- `422 Unprocessable Entity` → bad request body → show the exact validation error field
- `429 Too Many Requests` → wait 5s, retry once
- `400` on taxonomy with `input_mappings` → check that `path` field exists in source document payload
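The 429 rule and the "print the full error body" rule can be wrapped in one helper used around every request — an illustrative sketch, not part of the skill file:

```python
import time

def checked_request(send, *args, retries_on_429: int = 1, delay: float = 5.0, **kwargs):
    """Call send(*args, **kwargs). On 429, wait `delay` seconds and retry
    (once by default). On any other non-2xx, print the full error body so
    the user sees the exact validation message."""
    resp = send(*args, **kwargs)
    while resp.status_code == 429 and retries_on_429 > 0:
        time.sleep(delay)
        retries_on_429 -= 1
        resp = send(*args, **kwargs)
    if resp.status_code >= 400:
        print(f"ERROR {resp.status_code}: {resp.text}")
    return resp

# Usage: resp = checked_request(httpx.post, f"{BASE}/v1/buckets",
#                               headers=headers, json=body)
```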
---
## Key API Reference
| Resource | Create | List | Execute |
|----------|--------|------|---------|
| Namespace | `POST /v1/namespaces` | `POST /v1/namespaces/list` | — |
| Bucket | `POST /v1/buckets` | `POST /v1/buckets/list` | — |
| Bucket Sync | `POST /v1/buckets/{id}/syncs` | `POST /v1/buckets/{id}/syncs/list` | `POST /v1/buckets/{id}/syncs/{sid}/trigger` |
| Collection | `POST /v1/collections` | `POST /v1/collections/list` | `POST /v1/collections/{id}/trigger` |
| Retriever | `POST /v1/retrievers` | `POST /v1/retrievers/list` | `POST /v1/retrievers/{id}/execute` |
| Taxonomy | `POST /v1/taxonomies` | `POST /v1/taxonomies/list` | `POST /v1/collections/{id}/apply-taxonomy` |
| Cluster | `POST /v1/clusters` | `POST /v1/clusters/list` | `POST /v1/clusters/{id}/execute` |
| Trigger | `POST /v1/triggers` | `POST /v1/triggers/list` | `POST /v1/triggers/{id}/execute` |
| Alert | `POST /v1/alerts` | `POST /v1/alerts/list` | — |
| Webhook | `POST /v1/organizations/webhooks/` | `POST /v1/organizations/webhooks/list` | — |
All requests require `Authorization: Bearer {api_key}`.
All requests except namespace creation and the organization-level webhook endpoints also require `X-Namespace: {namespace_id}`.
After saving the file, restart Claude Code. The /mixpeek command will appear in tab-complete.

## Usage

```
/mixpeek
/mixpeek sk-mxp-...
```

## What It Asks
**Q1 — What data?**
Describe your dataset in plain English. Examples: “product catalog with photos and descriptions”, “security camera footage”, “support tickets”, “PDF contracts”
**Q2 — Multiple datasets?**
If you have more than one dataset (e.g., products AND customer reviews AND vendor images), describe each separately. The skill creates a dedicated bucket and collection set for each.
**Q3 — Schema**
For each dataset, list field names and types:
| Type | Examples |
|---|---|
| text / string | names, descriptions, titles, content |
| image | URLs to photos |
| video | URLs to video files |
| float | prices, scores, ratings |
| integer | quantities, IDs, counts |
| boolean | in_stock, is_active |
| date | ISO date strings |
**Q4 — Where does the data live?**
- URLs — HTTP/HTTPS links to each item
- S3 — AWS S3 bucket with optional prefix
- Google Drive — folder ID or URL
- SharePoint / OneDrive — site URL + folder path
- Snowflake — database.schema.table
- Upload later — set up the schema now, push data later via API
**Q5 — Search & retrieval goals**
Pick all that apply: semantic text search, image search by text, visual similarity, cross-modal, filtered search, question answering, re-ranking.
**Q6 — Classification / taxonomy?**
Flat (label list) or hierarchical (parent-child structure). You provide example items per label; the skill creates the reference collection and wiring automatically.
**Q7 — Clustering / grouping?**
Vector clustering (hdbscan / kmeans / agglomerative) or attribute clustering (group by field values). Optional LLM-generated cluster labels and enrichment back to source documents.
**Q8 — Scheduled automation?**
Re-cluster or re-classify on a schedule. Supports cron expressions and interval-based triggers.
**Q9 — Monitoring & alerts?**
Content alerts (notify when a retriever query matches new documents) and job completion webhooks.
## Resources Created
| Resource | What it does |
|---|---|
| Namespace | Isolated workspace; one per project |
| Bucket(s) | Raw data storage with typed schema |
| Collection(s) | Processing pipeline — one per extractor type per dataset |
| Batch | Triggers feature extraction across all bucket objects |
| Retriever(s) | Multi-stage search pipeline matching your retrieval goals |
| Taxonomy | Flat or hierarchical classifier applied to documents |
| Cluster | Groups similar documents; supports LLM-generated labels |
| Trigger | Scheduled re-clustering or taxonomy enrichment |
| Alert | Fires a webhook when a retriever query matches new content |
| Webhook | Event notifications for job completion, object creation, etc. |
## Requirements
- Claude Code installed
- A Mixpeek API key from studio.mixpeek.com → Settings → API Keys
- Python 3 with `httpx` (`pip install httpx`)
## Next Steps

- **Core Concepts** — understand namespaces, collections, and documents
- **Feature Extractors** — choose the right extractor for your data type
- **Retriever Stages** — build custom multi-stage search pipelines
- **MCP Server** — connect Claude to Mixpeek via MCP for ongoing management

