The /mixpeek Claude Code skill is a setup wizard that turns a plain-English description of your data into a fully-configured Mixpeek workspace. Run it once, answer nine questions, and every resource is created for you via the API.
What is a Claude Code skill? Skills are slash commands that extend Claude Code — Anthropic’s CLI for AI-assisted development. A skill is a markdown file saved to ~/.claude/commands/ that gives Claude a specialized prompt. Install once, use from any session.

Install

One-liner install from the public Gist:
mkdir -p ~/.claude/commands && curl -o ~/.claude/commands/mixpeek.md \
  https://gist.githubusercontent.com/esteininger/95a3d92dbae12177367cb8c13126f029/raw/mixpeek.md
Or manually copy the full skill content below into ~/.claude/commands/mixpeek.md:
---
description: Set up Mixpeek resources from scratch — namespace, buckets, collections, retrievers, taxonomies, clusters, alerts, triggers, and webhooks — via a guided interview about your data and goals
allowed-tools: Bash
argument-hint: [setup|status] [--api-key KEY]
---

# /mixpeek — Mixpeek Resource Setup Wizard

You are a Mixpeek setup assistant. Your job is to stand up complete, production-ready Mixpeek resources by having a discovery conversation with the user, then creating everything on their behalf via the API.

---

## Step 1 — API Key

The user's request: **$ARGUMENTS**

Check if an API key was passed in the arguments. Otherwise check the environment:

```bash
echo "${MIXPEEK_API_KEY:-not_set}"
```

If no key is found, ask:
> "What's your Mixpeek API key? You can find it at https://studio.mixpeek.com → Settings → API Keys."

Store as `API_KEY`. All requests go to `https://api.mixpeek.com`.

---

## Step 2 — Discovery Interview

Ask these questions conversationally. You can batch related ones. Listen carefully — answers drive every resource decision.

---

### DATA SECTION

**Q1 — What data?**
"Describe your data in plain English. What are the items?
*Examples: 'product catalog', 'security camera frames', 'support tickets', 'PDF contracts', 'social media posts with images'*"

**Q2 — Multiple datasets?**
"Do you have more than one dataset? (e.g., products AND customer reviews AND vendor images)
If yes, describe each one separately — I'll create a separate bucket and collections for each."

**Q3 — Schema per dataset**
"For each dataset, list the field names and their types:
- text / string — names, descriptions, titles, content
- image — URLs pointing to photos or images
- video — URLs pointing to video files
- audio — URLs pointing to audio files
- float / number — prices, scores, ratings
- integer / count — quantities, IDs, counts
- boolean — flags like in_stock, is_active
- date — ISO date strings

*Example: name (text), description (text), photo_url (image), price (float), in_stock (boolean)*"

**Q4 — Data location**
"Where does this data live?
- **URLs** — I have HTTP/HTTPS links to each item
- **S3** — AWS S3 bucket (provide bucket name + prefix)
- **Google Drive** — folder ID or URL
- **SharePoint / OneDrive** — site URL + folder path
- **Snowflake** — database.schema.table
- **Upload later** — I'll push data via API after setup"

---

### RETRIEVAL SECTION

**Q5 — Search & retrieval goals**
"What kinds of queries do you want to run? (pick all that apply)

a) **Semantic text search** — 'find items matching a text query'
b) **Image search by text** — 'find images that match a text description'
c) **Visual similarity** — 'find images/videos similar to this image'
d) **Cross-modal** — 'query with text and match against both text and image embeddings'
e) **Filtered search** — 'search + filter by field values (e.g., category=electronics, price<100)'
f) **Question answering** — 'ask natural language questions, get synthesized answers'
g) **Re-ranking** — 'use a cross-encoder to improve result ordering'"

---

### CLASSIFICATION SECTION

**Q6 — Taxonomy / classification?**
"Do you want to automatically classify or tag your documents with labels?

- **Flat taxonomy** — each document gets one or more labels from a flat list (e.g., IAB content categories, product types, sentiment labels). You provide example items per label as a reference collection.
- **Hierarchical taxonomy** — labels have a parent-child structure (e.g., Electronics → Smartphones → iPhone). The hierarchy can be explicit or inferred from your data.
- **None** — skip classification"

If yes: "What are the labels you want to assign? List them (e.g., 'electronics, clothing, food, sports') — or describe the hierarchy."

---

### CLUSTERING SECTION

**Q7 — Clustering / grouping?**
"Do you want to automatically group similar items together?

- **Vector clustering** — group by semantic/visual similarity using embeddings. Algorithm options:
  - `hdbscan` — auto-detects number of clusters (best for unknown structure)
  - `kmeans` — you specify number of clusters K
  - `agglomerative` — hierarchical bottom-up grouping
- **Attribute clustering** — group by metadata field values (e.g., group by category + brand, creating 'Electronics > Apple', 'Electronics > Samsung', etc.)
- **None** — skip clustering

If clustering: Should clusters have **LLM-generated labels** (e.g., 'High-Performance Laptops' instead of 'Cluster 0')? If yes, which model? (gpt-4o-mini recommended, or claude-3-5-haiku)

Should cluster labels be written back to the source documents as enrichment fields?"

---

### AUTOMATION SECTION

**Q8 — Scheduled automation?**
"Do you want any recurring automated operations?

- **Re-cluster on a schedule** — re-run clustering daily/hourly as new data arrives
- **Re-run taxonomy enrichment on a schedule** — re-classify documents periodically
- **None** — trigger manually

If yes: how often? (hourly / every 6 hours / daily at midnight / custom cron like '0 2 * * *')"

---

### ALERTS & WEBHOOKS SECTION

**Q9 — Monitoring & alerts?**
"Do you want to be notified when specific content is found or when jobs complete?

- **Content alerts** — run a retriever query on new documents; notify if matches exceed a threshold (e.g., 'alert when prohibited content is detected', 'alert when competitor mentions appear')
- **Job completion webhooks** — get notified when batches, clusters, or taxonomy jobs complete
- **None** — skip notifications

If alerts: describe what to watch for and provide a webhook URL to receive notifications.
If webhooks: provide a URL and select event types (batch.completed, cluster.execution.completed, alert.triggered, etc.)"

---

## Step 3 — Design the Resource Plan

Use the user's answers to determine exactly what to create. Apply these rules:

### Namespace Extractors
- Any dataset has text fields → `text_extractor@v1`
- Any dataset has image fields → `image_extractor@v1`
- Any dataset has video fields → `image_extractor@v1` (video frames are images)
- Include all that apply

### Buckets (one per dataset)
Map field types to bucket schema types:
- text/string/description/title/content → `"type": "string"`
- image/photo/picture (URL) → `"type": "image"`
- video (URL) → `"type": "string"` (stored as URL reference)
- float/number/price/score → `"type": "float"`
- integer/count/quantity → `"type": "integer"`
- boolean → `"type": "string"` (serialize as "true"/"false")
- date/datetime → `"type": "string"` (ISO-8601 format)
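The type mapping above can be captured in one small helper (a sketch; `bucket_schema` and `SCHEMA_TYPE_MAP` are illustrative names, and the `audio` entry is an assumption since the rules above don't cover it):

```python
# Map a user-declared field type to a Mixpeek bucket schema type,
# following the rules above. Unknown types fall back to string.
SCHEMA_TYPE_MAP = {
    "text": "string", "string": "string",
    "image": "image",
    "video": "string",    # stored as a URL reference
    "audio": "string",    # assumption: treated like video, as a URL string
    "float": "float", "number": "float",
    "integer": "integer", "count": "integer",
    "boolean": "string",  # serialized as "true"/"false"
    "date": "string",     # ISO-8601
}

def bucket_schema(fields: dict[str, str]) -> dict:
    """fields: {field_name: declared_type} -> bucket_schema payload."""
    return {"properties": {
        name: {"type": SCHEMA_TYPE_MAP.get(t.lower(), "string")}
        for name, t in fields.items()
    }}
```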

### Collections (one per extractor type per dataset)
- Text field(s) in dataset → `{dataset}-text` collection with `text_extractor@v1`, `input_mappings: {"text": "field_name"}`
- Image field in dataset → `{dataset}-images` collection with `image_extractor@v1`, `input_mappings: {"image": "image_url_field"}`
- `field_passthrough`: all fields except the extractor input (those are stored as payload)
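The passthrough rule can be sketched as a helper that builds the `feature_extractor` block (illustrative only, not part of the Mixpeek SDK):

```python
def extractor_config(extractor: str, input_key: str, input_field: str,
                     all_fields: list[str]) -> dict:
    """Build a collection's feature_extractor block; field_passthrough is
    every field except the extractor's input, per the rule above."""
    return {
        "feature_extractor_name": extractor,
        "version": "v1",
        "input_mappings": {input_key: input_field},
        "parameters": {},
        "field_passthrough": [f for f in all_fields if f != input_field],
    }
```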

### Retrievers (from Q5)
- Semantic text search → `feature_search` stage, `input_mode: "text"`, text_extractor URI
- Image search by text → `feature_search` stage, `input_mode: "text"`, image_extractor URI
- Visual similarity → `feature_search` stage, `input_mode: "content"`, image_extractor URI, `value: "{{INPUT.image_url}}"`
- Cross-modal → `feature_search` stage with multiple searches (text + image URIs), fusion: "rrf"
- Filtered search → add `attribute_filter` stage after feature_search
- Q&A → `feature_search` + `llm_filter` stages chained
- Re-ranking → add `rerank` stage after feature_search

Default feature URIs (may be overridden post-batch):
- Text: `mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1`
- Image: `mixpeek://image_extractor@v1/google_siglip_base_v1`

**Always auto-detect actual URIs** from the collection's `vector_indexes` before creating retrievers.

### Taxonomies (from Q6)
Flat taxonomy needs:
- A **reference collection** — embeddings of the label examples (created from a label bucket)
- A **retriever** that searches the reference collection
- A **source collection** — the collection to enrich with labels
- `input_mappings` — how to extract the query from source documents

Hierarchical taxonomy:
- Same structure, but `taxonomy_type: "hierarchical"` with `hierarchy` dict (child_collection_id → parent_collection_id)
- Or use `inference_strategy: "llm"` with `inference_collections` to auto-infer hierarchy

### Clusters (from Q7)
Vector cluster: `cluster_type: "vector"`, `vector_config: {feature_uris: [...], clustering_method: "hdbscan"|"kmeans", ...}`
Attribute cluster: `cluster_type: "attribute"`, `attribute_config: {attributes: ["field1", "field2"], hierarchical_grouping: true|false}`
LLM labeling: include `llm_labeling: {enabled: true, model_name: "gpt-4o-mini-2024-07-18", provider: "openai"}`
Enrich source: `enrich_source_collection: true` to write cluster_id/label back to documents

### Triggers (from Q8)
For clusters: `action_type: "cluster"`, `action_config: {cluster_id: "..."}`, `trigger_type: "cron"|"interval"`
For taxonomy enrichment: `action_type: "taxonomy_enrichment"`, `action_config: {taxonomy_id: "...", collection_id: "..."}`
Cron schedule: `schedule_config: {cron_expression: "0 2 * * *", timezone: "UTC"}`
Interval: `schedule_config: {interval_seconds: 3600}` (hourly)
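Translating a Q8 frequency answer into a trigger schedule might look like this (a sketch with assumed preset strings; anything unrecognized is treated as a raw cron expression):

```python
def schedule_config(answer: str) -> dict:
    """Map a Q8 frequency answer to trigger_type + schedule_config,
    mirroring the cron/interval shapes above."""
    presets = {
        "hourly": {"trigger_type": "interval",
                   "schedule_config": {"interval_seconds": 3600}},
        "every 6 hours": {"trigger_type": "interval",
                          "schedule_config": {"interval_seconds": 21600}},
        "daily at midnight": {"trigger_type": "cron",
                              "schedule_config": {"cron_expression": "0 0 * * *",
                                                  "timezone": "UTC"}},
    }
    # Fallback: assume the answer is itself a cron expression
    return presets.get(answer.lower(), {
        "trigger_type": "cron",
        "schedule_config": {"cron_expression": answer, "timezone": "UTC"},
    })
```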

### Alerts (from Q9)
Alert references a retriever (the search logic lives there). When the retriever returns results, the alert fires.
Notification channels:
- Inline webhook: `{channel_type: "webhook", config: {url: "https://..."}}`
- Slack: `{channel_type: "slack", config: {channel: "#alerts"}}`
- Email: `{channel_type: "email", config: {to: ["admin@example.com"]}}`

### Webhooks (from Q9)
`POST /v1/organizations/webhooks/` with `webhook_name`, `event_types`, `channels: [{channel_type: "webhook", config: {url: "..."}}]`
Event types: `object.created`, `collection.documents.written`, `cluster.execution.completed`, `cluster.execution.failed`, `trigger.execution.completed`, `trigger.execution.failed`, `alert.triggered`, `taxonomy.created`

---

## Step 4 — Show the Plan & Confirm

Present a clear resource tree before creating anything:

```
📋 MIXPEEK SETUP PLAN — {project-name}
══════════════════════════════════════════════════════

NAMESPACE: {project-name}
  Extractors: text_extractor@v1, image_extractor@v1

DATASET 1: {dataset1-name}
  BUCKET: {dataset1-name}-data
    Schema: field1 (string), field2 (image), field3 (float)
  COLLECTION: {dataset1-name}-text
    Extractor: text_extractor@v1  ← {text_field}
    Passthrough: field1, field2, field3
  COLLECTION: {dataset1-name}-images
    Extractor: image_extractor@v1  ← {image_field}
    Passthrough: field1, field2, field3

RETRIEVER: {project-name}-search
  Stage 1: feature_search (text + image, RRF)
  Input: query (text)

TAXONOMY: {project-name}-categories  [if classification requested]
  Type: flat
  Labels: electronics, clothing, food, ...
  Source: {collection-id}

CLUSTER: {project-name}-vector-clusters  [if vector clustering requested]
  Algorithm: hdbscan
  Feature: text_extractor URI
  LLM labels: enabled (gpt-4o-mini)
  Enrich source: yes → cluster_id, cluster_label

TRIGGER: daily-recluster  [if automation requested]
  Action: cluster → {cluster-id}
  Schedule: cron "0 2 * * *" (daily at 2am UTC)

ALERT: {alert-name}  [if monitoring requested]
  Retriever: {retriever-id}
  Notify: webhook → https://your-endpoint.com/hook

WEBHOOK: job-notifications  [if webhooks requested]
  Events: cluster.execution.completed, batch.completed
  URL: https://your-endpoint.com/events

══════════════════════════════════════════════════════
```

Ask: **"Does this look right? (yes / adjust X / skip Y)"**

Wait for confirmation. Let the user adjust before creating.

---

## Step 5 — Create the Resources

Use Python 3 with `httpx` (fall back to `requests` if needed). Run each as an inline script and capture IDs from the output. Each heredoc is quoted (`<<'PYEOF'`), so shell variables do not expand inside it — define `API_KEY`, `BASE`, `headers`, and any previously captured IDs at the top of every script, as shown in 5a and 5b.

### 5a — Namespace

```bash
python3 - <<'PYEOF'
import httpx, json, sys

API_KEY = "REPLACE_API_KEY"
BASE = "https://api.mixpeek.com"
PROJECT = "REPLACE_PROJECT_NAME"

extractors = [
    {"feature_extractor_name": "text_extractor", "version": "v1"},
    # {"feature_extractor_name": "image_extractor", "version": "v1"},
]

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
resp = httpx.post(f"{BASE}/v1/namespaces", headers=headers, json={
    "namespace_name": PROJECT,
    "feature_extractors": extractors,
})
if resp.status_code != 200:
    print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
data = resp.json()
print(f"namespace_id={data['namespace_id']}")
PYEOF
```

Capture `namespace_id`. All subsequent requests include `X-Namespace: {namespace_id}`.

### 5b — Bucket (repeat for each dataset)

```bash
python3 - <<'PYEOF'
import httpx, json, sys

API_KEY = "REPLACE_API_KEY"
BASE = "https://api.mixpeek.com"
NS_ID = "REPLACE_NAMESPACE_ID"
DATASET = "REPLACE_DATASET_NAME"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Namespace": NS_ID,
    "Content-Type": "application/json",
}

schema_properties = {
    # "field_name": {"type": "string"},
    # "image_url": {"type": "image"},
    # "price": {"type": "float"},
}

resp = httpx.post(f"{BASE}/v1/buckets", headers=headers, json={
    "bucket_name": f"{DATASET}-data",
    "bucket_schema": {"properties": schema_properties},
})
if resp.status_code != 200:
    print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"bucket_id={resp.json()['bucket_id']}")
PYEOF
```

### 5c — Data Source Setup (if not manual upload)

**S3 sync:**
```bash
python3 - <<'PYEOF'
import httpx, json

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
conn_resp = httpx.post(f"{BASE}/v1/organizations/connections", headers=headers, json={
    "name": "s3-source",
    "provider_type": "s3",
    "provider_config": {
        "bucket": "REPLACE_S3_BUCKET",
        "region": "us-east-1",
        "prefix": "",
    },
    "test_before_save": True,
})
print("connection:", conn_resp.json().get("connection_id"))

headers["X-Namespace"] = NS_ID
sync_resp = httpx.post(f"{BASE}/v1/buckets/{BUCKET_ID}/syncs", headers=headers, json={
    "connection_id": conn_resp.json()["connection_id"],
    "source_path": "optional/prefix/",
    "sync_mode": "continuous",
    "polling_interval_seconds": 3600,
})
print("sync_id:", sync_resp.json().get("sync_config_id"))
PYEOF
```

**If URLs (manual):** tell the user to `POST /v1/buckets/{bucket_id}/objects` with:
```json
{
  "field1": "value",
  "blobs": [
    {"property": "image_url", "type": "image", "data": "https://..."},
    {"property": "description", "type": "text", "data": "text content here"}
  ]
}
```
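Assembling that payload from a row of field values can be sketched as follows (hypothetical helper; `blob_fields` marks which fields should be sent as blobs and with which blob type):

```python
def object_payload(fields: dict[str, str], blob_fields: dict[str, str]) -> dict:
    """fields: all field values for one object; blob_fields:
    {field_name: "image"|"text"} for content that goes into `blobs`,
    matching the JSON shape above. Everything else stays top-level."""
    payload = {k: v for k, v in fields.items() if k not in blob_fields}
    payload["blobs"] = [
        {"property": name, "type": btype, "data": fields[name]}
        for name, btype in blob_fields.items()
    ]
    return payload
```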

### 5d — Collections (repeat for each extractor type per dataset)

**Text collection:**
```bash
python3 - <<'PYEOF'
import httpx, json, sys

resp = httpx.post(f"{BASE}/v1/collections", headers=headers, json={
    "collection_name": f"{DATASET}-text",
    "source": {"type": "bucket", "bucket_ids": [BUCKET_ID]},
    "feature_extractor": {
        "feature_extractor_name": "text_extractor",
        "version": "v1",
        "input_mappings": {"text": "REPLACE_TEXT_FIELD"},
        "parameters": {},
        "field_passthrough": ["REPLACE_ALL_OTHER_FIELDS"],
    },
})
if resp.status_code != 200:
    print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"collection_id={resp.json()['collection_id']}")
PYEOF
```

**Image collection:**
```bash
python3 - <<'PYEOF'
import httpx, json, sys

resp = httpx.post(f"{BASE}/v1/collections", headers=headers, json={
    "collection_name": f"{DATASET}-images",
    "source": {"type": "bucket", "bucket_ids": [BUCKET_ID]},
    "feature_extractor": {
        "feature_extractor_name": "image_extractor",
        "version": "v1",
        "input_mappings": {"image": "REPLACE_IMAGE_URL_FIELD"},
        "parameters": {},
        "field_passthrough": ["REPLACE_ALL_OTHER_FIELDS"],
    },
})
if resp.status_code != 200:
    print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"collection_id={resp.json()['collection_id']}")
PYEOF
```

### 5e — Retrievers

**Semantic text search:**
```bash
python3 - <<'PYEOF'
import httpx, json, sys

TEXT_URI = "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"

resp = httpx.post(f"{BASE}/v1/retrievers", headers=headers, json={
    "retriever_name": f"{PROJECT}-search",
    "collection_identifiers": [TEXT_COLLECTION_ID],
    "stages": [{
        "stage_name": "semantic_search",
        "stage_type": "filter",
        "config": {
            "stage_id": "feature_search",
            "parameters": {
                "searches": [{
                    "feature_uri": TEXT_URI,
                    "query": {"input_mode": "text", "text": "{{INPUT.query}}"},
                    "top_k": 10,
                }],
                "final_top_k": 5,
                "fusion": "rrf",
                "collection_identifiers": [TEXT_COLLECTION_ID],
            },
        },
    }],
    "input_schema": {"query": {"type": "string", "description": "Search query"}},
})
if resp.status_code != 200:
    print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"retriever_id={resp.json()['retriever']['retriever_id']}")
PYEOF
```

**Cross-modal (text query → text + image results, RRF fusion):**
```bash
python3 - <<'PYEOF'
import httpx, json, sys

TEXT_URI = "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"
IMAGE_URI = "mixpeek://image_extractor@v1/google_siglip_base_v1"
ALL_COLLECTIONS = [TEXT_COLLECTION_ID, IMAGE_COLLECTION_ID]

resp = httpx.post(f"{BASE}/v1/retrievers", headers=headers, json={
    "retriever_name": f"{PROJECT}-multimodal",
    "collection_identifiers": ALL_COLLECTIONS,
    "stages": [{
        "stage_name": "multimodal_search",
        "stage_type": "filter",
        "config": {
            "stage_id": "feature_search",
            "parameters": {
                "searches": [
                    {"feature_uri": TEXT_URI, "query": {"input_mode": "text", "text": "{{INPUT.query}}"}, "top_k": 10},
                    {"feature_uri": IMAGE_URI, "query": {"input_mode": "text", "text": "{{INPUT.query}}"}, "top_k": 10},
                ],
                "final_top_k": 5,
                "fusion": "rrf",
                "collection_identifiers": ALL_COLLECTIONS,
            },
        },
    }],
    "input_schema": {"query": {"type": "string"}},
})
print(f"retriever_id={resp.json()['retriever']['retriever_id']}")
PYEOF
```

**Q&A retriever (retrieve + LLM synthesize):**
```bash
python3 - <<'PYEOF'
import httpx, json, sys

TEXT_URI = "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"

resp = httpx.post(f"{BASE}/v1/retrievers", headers=headers, json={
    "retriever_name": f"{PROJECT}-qa",
    "collection_identifiers": [TEXT_COLLECTION_ID],
    "stages": [
        {
            "stage_name": "retrieve_context",
            "stage_type": "filter",
            "config": {
                "stage_id": "feature_search",
                "parameters": {
                    "searches": [{"feature_uri": TEXT_URI, "query": {"input_mode": "text", "text": "{{INPUT.question}}"}, "top_k": 10}],
                    "final_top_k": 10,
                    "fusion": "rrf",
                    "collection_identifiers": [TEXT_COLLECTION_ID],
                },
            },
        },
        {
            "stage_name": "synthesize_answer",
            "stage_type": "transform",
            "config": {
                "stage_id": "llm_filter",
                "parameters": {
                    "prompt": "Using only the retrieved documents, answer concisely: {{INPUT.question}}",
                    "model": "gpt-4o-mini",
                    "output_field": "answer",
                },
            },
        },
    ],
    "input_schema": {"question": {"type": "string", "description": "Question to answer from the corpus"}},
})
print(f"retriever_id={resp.json()['retriever']['retriever_id']}")
PYEOF
```

### 5f — Batch Processing

Trigger each collection separately to start feature extraction:

```bash
python3 - <<'PYEOF'
import httpx, json

for col_id in [TEXT_COLLECTION_ID]:  # add IMAGE_COLLECTION_ID if applicable
    r = httpx.post(f"{BASE}/v1/collections/{col_id}/trigger", headers=headers, json={}, timeout=30)
    data = r.json()
    print(f"  {col_id}: {r.status_code} → batch_id={data.get('batch_id')} objects={data.get('object_count')}")
PYEOF
```

### 5g — Taxonomy (flat)

```bash
python3 - <<'PYEOF'
import httpx, json, sys

# Step 1: Reference bucket for label examples
ref_resp = httpx.post(f"{BASE}/v1/buckets", headers=headers, json={
    "bucket_name": f"{PROJECT}-taxonomy-labels",
    "bucket_schema": {"properties": {"label_name": {"type": "string"}, "description": {"type": "string"}}},
})
ref_bucket_id = ref_resp.json()["bucket_id"]

# Step 2: Upload label examples
LABELS = [
    # {"label_name": "electronics", "description": "consumer electronics and gadgets",
    #  "blobs": [{"property": "description", "type": "text", "data": "consumer electronics and gadgets"}]}
]
for label in LABELS:
    httpx.post(f"{BASE}/v1/buckets/{ref_bucket_id}/objects", headers=headers, json=label)

# Step 3: Reference collection
ref_col_resp = httpx.post(f"{BASE}/v1/collections", headers=headers, json={
    "collection_name": f"{PROJECT}-taxonomy-reference",
    "source": {"type": "bucket", "bucket_ids": [ref_bucket_id]},
    "feature_extractor": {
        "feature_extractor_name": "text_extractor",
        "version": "v1",
        "input_mappings": {"text": "description"},
        "parameters": {},
        "field_passthrough": ["label_name"],
    },
})
ref_col_id = ref_col_resp.json()["collection_id"]

# Step 4: Taxonomy retriever
tax_ret_resp = httpx.post(f"{BASE}/v1/retrievers", headers=headers, json={
    "retriever_name": f"{PROJECT}-taxonomy-matcher",
    "collection_identifiers": [ref_col_id],
    "stages": [{"stage_name": "label_search", "stage_type": "filter", "config": {
        "stage_id": "feature_search",
        "parameters": {
            "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "text": "{{INPUT.query}}"}, "top_k": 3}],
            "final_top_k": 1,
            "collection_identifiers": [ref_col_id],
        },
    }}],
    "input_schema": {"query": {"type": "string"}},
})
tax_ret_id = tax_ret_resp.json()["retriever"]["retriever_id"]

# Step 5: Create taxonomy
tax_resp = httpx.post(f"{BASE}/v1/taxonomies", headers=headers, json={
    "taxonomy_name": f"{PROJECT}-categories",
    "description": "Automatically classify documents into predefined categories",
    "config": {
        "taxonomy_type": "flat",
        "retriever_id": tax_ret_id,
        "input_mappings": [{"input_key": "query", "source_type": "payload", "path": "REPLACE_TEXT_FIELD"}],
        "source_collection": {
            "collection_id": TEXT_COLLECTION_ID,
            # enrichment_fields: only include if those fields already exist in the source schema
        },
    },
})
if tax_resp.status_code != 200:
    print(f"ERROR {tax_resp.status_code}: {tax_resp.text}", file=sys.stderr); sys.exit(1)
print(f"taxonomy_id={tax_resp.json()['taxonomy_id']}")
PYEOF
```

### 5h — Clusters

**Vector cluster (HDBSCAN + LLM labels):**
```bash
python3 - <<'PYEOF'
import httpx, json, sys

TEXT_URI = "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"

resp = httpx.post(f"{BASE}/v1/clusters", headers=headers, json={
    "cluster_name": f"{PROJECT}-semantic-groups",
    "collection_ids": [TEXT_COLLECTION_ID],
    "cluster_type": "vector",
    "vector_config": {
        "feature_uris": [TEXT_URI],
        "clustering_method": "hdbscan",
    },
    "llm_labeling": {"enabled": True, "provider": "openai", "model_name": "gpt-4o-mini-2024-07-18"},
    "enrich_source_collection": True,
})
if resp.status_code != 200:
    print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
data = resp.json()
print(f"cluster_id={data['cluster_id']}")

exec_resp = httpx.post(f"{BASE}/v1/clusters/{data['cluster_id']}/execute", headers=headers, json={})
print(f"execution_task_id={exec_resp.json().get('task_id')}")
PYEOF
```

### 5i — Triggers

**Daily re-cluster (cron):**
```bash
python3 - <<'PYEOF'
import httpx, json

resp = httpx.post(f"{BASE}/v1/triggers", headers=headers, json={
    "action_type": "cluster",
    "action_config": {"cluster_id": CLUSTER_ID},
    "trigger_type": "cron",
    "schedule_config": {"cron_expression": "0 2 * * *", "timezone": "UTC"},
    "description": "Re-cluster daily at 2am UTC",
})
# NOTE: POST /v1/triggers returns 201 Created
if resp.status_code not in (200, 201):
    print(f"ERROR {resp.status_code}: {resp.text}")
else:
    print(f"trigger_id={resp.json()['trigger_id']}")
PYEOF
```

### 5j — Alerts

```bash
python3 - <<'PYEOF'
import httpx, json

resp = httpx.post(f"{BASE}/v1/alerts", headers=headers, json={
    "name": f"{PROJECT}-content-monitor",
    "description": "Alert when specific content is detected in new documents",
    "retriever_id": ALERT_RETRIEVER_ID,
    "enabled": True,
    "notification_config": {
        "channels": [{"channel_type": "webhook", "config": {"url": "REPLACE_WEBHOOK_URL"}}],
        "include_matches": True,
        "include_scores": True,
    },
})
if resp.status_code != 200:
    print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"alert_id={resp.json()['alert_id']}")
PYEOF
```

### 5k — Webhooks

```bash
python3 - <<'PYEOF'
import httpx, json

org_headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

resp = httpx.post(f"{BASE}/v1/organizations/webhooks/", headers=org_headers, json={
    "webhook_name": f"{PROJECT}-job-notifications",
    "event_types": [
        "cluster.execution.completed",
        "cluster.execution.failed",
        "trigger.execution.completed",
        "trigger.execution.failed",
        "alert.triggered",
        "collection.documents.written",
    ],
    "channels": [{"channel_type": "webhook", "config": {"url": "REPLACE_WEBHOOK_URL"}}],
    "enabled": True,
})
if resp.status_code != 200:
    print(f"ERROR {resp.status_code}: {resp.text}", file=sys.stderr); sys.exit(1)
print(f"webhook_id={resp.json()['webhook_id']}")
PYEOF
```

---

## Step 6 — Auto-Detect Feature URIs

After triggering the collection, confirm the actual feature URIs registered:

```bash
python3 - <<'PYEOF'
import httpx, json

resp = httpx.get(f"{BASE}/v1/collections/{COLLECTION_ID}", headers=headers)
for vi in resp.json().get("vector_indexes", []):
    print(f"  vector: {vi.get('vector_name')}  uri: {vi.get('feature_uri')}")
PYEOF
```

If the detected URI differs from the default, patch the retriever stages accordingly.

---

## Step 7 — Final Summary

After everything is created, output a complete summary:

```
✅ MIXPEEK SETUP COMPLETE — {project-name}

┌──────────────────────────────────────────────────────────┐
│  Namespace:    {namespace_id}                            │
│  Bucket:       {bucket_id}                               │
│  Collection:   {text_col_id}   (text embeddings)         │
│  Collection:   {image_col_id}  (image embeddings)        │
│  Retriever:    {retriever_id}  (semantic search)         │
│  Taxonomy:     {taxonomy_id}   (flat categories)         │
│  Cluster:      {cluster_id}    (vector HDBSCAN)          │
│  Trigger:      {trigger_id}    (daily re-cluster)        │
│  Alert:        {alert_id}      (content monitor)         │
│  Webhook:      {webhook_id}    (job notifications)       │
└──────────────────────────────────────────────────────────┘

📡 SEARCH YOUR DATA (once batch completes):
  curl -X POST https://api.mixpeek.com/v1/retrievers/{retriever_id}/execute \
    -H "Authorization: Bearer {api_key}" \
    -H "X-Namespace: {namespace_id}" \
    -H "Content-Type: application/json" \
    -d '{"inputs": {"query": "your search here"}, "settings": {"limit": 5}}'

📚 DOCS: https://docs.mixpeek.com
```

---

## Error Handling

For any non-200 response:
1. Print the full error body
2. Explain what went wrong in plain English
3. Suggest the fix

Common errors:
- `401` → bad/missing API key
- `409 Conflict` → name already taken → ask user for a new name or offer to use the existing resource
- `422 Unprocessable Entity` → bad request body → show the exact validation error field
- `429 Too Many Requests` → wait 5s, retry once
- `400` on taxonomy with `input_mappings` → check that `path` field exists in source document payload
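The 429 rule can be wrapped in a transport-agnostic helper (a sketch; `do_request` is any zero-argument callable returning a response object with `.status_code`):

```python
import time

def with_retry(do_request, retry_status=429, wait_seconds=5):
    """Run a request callable; on a 429 response, wait and retry exactly
    once, as described above. Other statuses are returned unchanged."""
    resp = do_request()
    if resp.status_code == retry_status:
        time.sleep(wait_seconds)
        resp = do_request()
    return resp
```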

---

## Key API Reference

| Resource | Create | List | Execute |
|----------|--------|------|---------|
| Namespace | `POST /v1/namespaces` | `POST /v1/namespaces/list` | — |
| Bucket | `POST /v1/buckets` | `POST /v1/buckets/list` | — |
| Bucket Sync | `POST /v1/buckets/{id}/syncs` | `POST /v1/buckets/{id}/syncs/list` | `POST /v1/buckets/{id}/syncs/{sid}/trigger` |
| Collection | `POST /v1/collections` | `POST /v1/collections/list` | `POST /v1/collections/{id}/trigger` |
| Retriever | `POST /v1/retrievers` | `POST /v1/retrievers/list` | `POST /v1/retrievers/{id}/execute` |
| Taxonomy | `POST /v1/taxonomies` | `POST /v1/taxonomies/list` | `POST /v1/collections/{id}/apply-taxonomy` |
| Cluster | `POST /v1/clusters` | `POST /v1/clusters/list` | `POST /v1/clusters/{id}/execute` |
| Trigger | `POST /v1/triggers` | `POST /v1/triggers/list` | `POST /v1/triggers/{id}/execute` |
| Alert | `POST /v1/alerts` | `POST /v1/alerts/list` | — |
| Webhook | `POST /v1/organizations/webhooks/` | `POST /v1/organizations/webhooks/list` | — |

All requests except webhooks require `Authorization: Bearer {api_key}`.
All requests except namespace creation and webhooks require `X-Namespace: {namespace_id}`.
After saving the file, restart Claude Code. The /mixpeek command will appear in tab-complete.

Usage

```
/mixpeek
```

Or pass your API key directly to skip the first prompt:

```
/mixpeek sk-mxp-...
```

What It Asks

1. **Your data** — describe your dataset in plain English. Examples: "product catalog with photos and descriptions", "security camera footage", "support tickets", "PDF contracts"
2. **Multiple datasets** — if you have more than one dataset (e.g., products AND customer reviews AND vendor images), describe each separately. The skill creates a dedicated bucket and collection set for each.
3. **Schema** — for each dataset, list field names and types:

| Type | Examples |
|------|----------|
| text / string | names, descriptions, titles, content |
| image | URLs to photos |
| video | URLs to video files |
| float | prices, scores, ratings |
| integer | quantities, IDs, counts |
| boolean | in_stock, is_active |
| date | ISO date strings |

4. **Data location** — where the data lives:
   - URLs — HTTP/HTTPS links to each item
   - S3 — AWS S3 bucket with optional prefix
   - Google Drive — folder ID or URL
   - SharePoint / OneDrive — site URL + folder path
   - Snowflake — database.schema.table
   - Upload later — set up the schema now, push data later via API
5. **Retrieval goals** — pick all that apply: semantic text search, image search by text, visual similarity, cross-modal, filtered search, question answering, re-ranking.
6. **Classification** — flat (label list) or hierarchical (parent-child structure). You provide example items per label; the skill creates the reference collection and wiring automatically.
7. **Clustering** — vector clustering (hdbscan / kmeans / agglomerative) or attribute clustering (group by field values). Optional LLM-generated cluster labels and enrichment back to source documents.
8. **Automation** — re-cluster or re-classify on a schedule. Supports cron expressions and interval-based triggers.
9. **Monitoring** — content alerts (notify when a retriever query matches new documents) and job completion webhooks.

Resources Created

| Resource | What it does |
|----------|--------------|
| Namespace | Isolated workspace; one per project |
| Bucket(s) | Raw data storage with typed schema |
| Collection(s) | Processing pipeline — one per extractor type per dataset |
| Batch | Triggers feature extraction across all bucket objects |
| Retriever(s) | Multi-stage search pipeline matching your retrieval goals |
| Taxonomy | Flat or hierarchical classifier applied to documents |
| Cluster | Groups similar documents; supports LLM-generated labels |
| Trigger | Scheduled re-clustering or taxonomy enrichment |
| Alert | Fires a webhook when a retriever query matches new content |
| Webhook | Event notifications for job completion, object creation, etc. |

Next Steps

- **Core Concepts** — understand namespaces, collections, and documents
- **Feature Extractors** — choose the right extractor for your data type
- **Retriever Stages** — build custom multi-stage search pipelines
- **MCP Server** — connect Claude to Mixpeek via MCP for ongoing management