Buckets are the ingestion layer of the warehouse: raw files land here and are organized before the Engine decomposes them into documents and features. Each bucket enforces a JSON schema describing the blobs you expect to ingest (text, image, audio, video, json, binary).

Create a Bucket

curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "product-catalog",
    "description": "E-commerce product data",
    "schema": {
      "properties": {
        "product_text": { "type": "text", "required": true },
        "hero_image": { "type": "image" },
        "spec_sheet": { "type": "json" }
      }
    }
  }'
Response fields:
  • bucket_id
  • schema with validation metadata
  • object_count
  • created_at
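As a sketch, you can capture the bucket_id for later calls. The response body below is a hypothetical example shaped after the fields listed above (real responses also include the full schema with validation metadata):

```shell
# Hypothetical response body for illustration only
RESPONSE='{"bucket_id":"bkt_123","object_count":0,"created_at":"2025-01-01T00:00:00Z"}'

# Extract bucket_id with sed (use jq in practice if it is available)
BUCKET_ID=$(printf '%s' "$RESPONSE" | sed -n 's/.*"bucket_id":"\([^"]*\)".*/\1/p')
echo "$BUCKET_ID"   # bkt_123
```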

Bucket Schema

  • Uses a lightweight JSON schema subset (type, required, enum, description).
  • Validates each object’s blobs before storing metadata.
  • Helps collections map input fields to feature extractor targets.
Example schema fragment:
{
  "properties": {
    "transcript": {
      "type": "text",
      "description": "Full podcast transcript",
      "required": true
    },
    "audio_file": {
      "type": "audio",
      "required": true
    }
  }
}
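Validation itself happens server-side at ingestion time. As an illustrative sketch only (the object payload and the error message below are hypothetical, not the API's actual format), an object missing a blob the schema marks required: true would be rejected:

```shell
# Hypothetical object payload missing the required "audio_file" blob
OBJECT='{"transcript":"full podcast transcript text"}'

# Naive presence check for each field the schema marks required: true
for FIELD in transcript audio_file; do
  if ! printf '%s' "$OBJECT" | grep -q "\"$FIELD\""; then
    echo "missing required blob: $FIELD"
  fi
done
# prints: missing required blob: audio_file
```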

Manage Buckets

  • Get bucket: GET /v1/buckets/{bucket_id}
  • List buckets: POST /v1/buckets/list (supports filters, sort, pagination)
  • Delete bucket: DELETE /v1/buckets/{bucket_id} (removes objects and blobs)
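The endpoints above can be called the same way as the create example. This is a sketch: the bucket id is a placeholder, and the "limit"/"offset" pagination fields in the list body are assumptions, not confirmed parameter names.

```shell
BUCKET_ID="bkt_123"  # hypothetical id returned by the create call

# Get one bucket
curl -sS "$MP_API_URL/v1/buckets/$BUCKET_ID" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"

# List buckets ("limit"/"offset" are assumed pagination fields)
curl -sS -X POST "$MP_API_URL/v1/buckets/list" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{"limit": 10, "offset": 0}'

# Delete a bucket (also removes its objects and blobs)
curl -sS -X DELETE "$MP_API_URL/v1/buckets/$BUCKET_ID" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
```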
Buckets are strictly namespace-scoped: the same bucket name can exist in different namespaces without conflict.

Bucket vs Collection

| Aspect     | Bucket                          | Collection                                 |
| ---------- | ------------------------------- | ------------------------------------------ |
| Purpose    | Raw input registry              | Processed documents + features             |
| Schema     | Blob validation                 | Output schema (deterministic)              |
| Storage    | MongoDB (metadata) + S3 (blobs) | MongoDB (metadata) + MVS (vectors/payloads)|
| Processing | None                            | Runs feature extractors via Engine         |

Best Practices

  • One bucket per data domain (products, support tickets, surveillance footage).
  • Keep schemas coarse; collections can slice the data differently downstream.
  • Use key_prefix in objects to group files (e.g., /2025/01/).
  • Leverage metadata for later filtering (set tags at ingestion time).
Buckets give you a reliable staging area for multimodal data—clean separation before you branch into multiple collection-specific processing pipelines.