Batches let you process many objects in one asynchronous job. They store the list of object IDs, generate extractor manifests, and provide a task handle so you can monitor progress.
  1. Create batch: supply object IDs (or create an empty batch and add objects later).
  2. Submit batch: the API flattens manifests into per-extractor Parquet artifacts and writes them to S3.
  3. Engine processes: Ray pollers pick up the batch, execute extractors tier-by-tier, and write documents to MVS.
  4. Webhook & cache updates: the Engine emits webhook events, Celery Beat invalidates caches, and collection schemas update.

Create a Batch

curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "batch_name": "products-2025-10-28",
    "object_ids": ["obj_abc", "obj_def"]
  }'
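The create call can also be wrapped in a small client helper. The sketch below takes the HTTP POST function as an argument so it can be exercised without a live API; the endpoint path mirrors the curl example above, while the helper and argument names are illustrative (auth headers are omitted for brevity).

```python
import json

def create_batch(post, api_url, bucket_id, batch_name, object_ids):
    """Create a batch with an initial set of object IDs.

    `post` is any callable(url, headers, body) -> dict, so tests or dry
    runs can inject a stub instead of a real HTTP client.
    """
    url = f"{api_url}/v1/buckets/{bucket_id}/batches"
    body = {"batch_name": batch_name, "object_ids": list(object_ids)}
    return post(url, {"Content-Type": "application/json"}, json.dumps(body))
```

In production code, `post` would wrap your HTTP library of choice and attach the Authorization and X-Namespace headers shown in the curl example.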
Add more objects later:
curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_xyz"] }'

Submit for Processing

curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>/submit" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "include_processing_history": true }'
  • include_processing_history=true records each enrichment operation in internal_metadata.processing_history.
  • Response contains a task_id; poll /v1/tasks/{task_id} or the batch resource directly.
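The returned task_id can be polled in a simple loop until a terminal status (COMPLETED or FAILED) is reached. A minimal sketch, with the status fetch injected so the loop is testable; in practice `fetch_status` would wrap GET /v1/tasks/{task_id}.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}

def wait_for_task(fetch_status, interval_s=5.0, timeout_s=3600.0, sleep=time.sleep):
    """Poll until the task reaches a terminal status or the timeout expires.

    `fetch_status` is any callable returning the current status string.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"task still {status} after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```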

Lifecycle & Status

  • DRAFT – Created but not submitted
  • QUEUED – Submitted; waiting for poller pickup
  • PROCESSING – Ray job running feature extractors
  • COMPLETED – All extractors finished successfully
  • FAILED – Extractors or Ray job failed (see error_message)
Status updates synchronize to both the batch resource and the associated task.
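A client can sanity-check the status changes it observes against this lifecycle. The transition map below is inferred from the table above (it is not a documented API contract), and FAILED is assumed reachable from QUEUED as well as PROCESSING.

```python
# Transitions inferred from the lifecycle table; not an official contract.
ALLOWED_TRANSITIONS = {
    "DRAFT": {"QUEUED"},
    "QUEUED": {"PROCESSING", "FAILED"},
    "PROCESSING": {"COMPLETED", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}

def is_valid_transition(old, new):
    """True if `old` -> `new` is an expected lifecycle move."""
    return new in ALLOWED_TRANSITIONS.get(old, set())
```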

Under the Hood

  1. API writes manifest metadata to MongoDB and extractor row artifacts to S3.
  2. A Ray poller queries MongoDB every 5 seconds for newly submitted batches.
  3. Poller submits a Ray job with manifest details.
  4. Worker downloads artifacts, runs extractors in dependency tiers, and writes documents to MVS/MongoDB.
  5. QdrantBatchProcessor emits webhook events and updates collection index signatures.
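Steps 2–3 amount to a poll-and-submit loop. A sketch of one poller tick with storage and job submission injected; the callable names (`find_queued`, `submit_job`, `mark_processing`) are illustrative, not the Engine's actual internals.

```python
def poll_once(find_queued, submit_job, mark_processing):
    """One poller tick: submit a Ray job for each queued batch.

    All three callables are injected; in the real Engine these would be
    MongoDB queries and Ray job submissions.
    """
    submitted = []
    for batch in find_queued():
        job_id = submit_job(batch["manifest"])
        mark_processing(batch["batch_id"], job_id)
        submitted.append(batch["batch_id"])
    return submitted
```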

Monitoring

Real-time progress

Poll the batch endpoint to get live progress:
curl -sS "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
The response includes a progress object with real-time processing metrics:
  • progress.processed – Number of objects processed so far
  • progress.total – Total objects in the batch
  • progress.percent – Completion percentage
  • progress.items_per_second – Current processing throughput
  • progress.eta_seconds – Estimated seconds until completion
  • progress.current_stage – Current pipeline stage (name, index, total)
  • progress.errors – Number of processing errors
  • status_message – Human-readable summary (e.g. “Processing 21,611/46,160 objects (46.8%) · 1.4 items/sec · ~4h 57m remaining”)
  • health – Overall batch health (healthy, degraded, unhealthy)
  • estimated_completion – ISO 8601 timestamp of estimated completion
The status_message field provides a one-line summary suitable for dashboards and logging. For programmatic monitoring, use the progress object fields directly.
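A dashboard can also derive its own one-line summary from the raw progress fields. A sketch of that computation, assuming the fields described above; the server's exact formatting (rounding, ETA averaging) may differ.

```python
def format_status_message(processed, total, items_per_second):
    """Build a one-line summary from raw progress fields."""
    percent = 100.0 * processed / total if total else 0.0
    remaining = (total - processed) / items_per_second if items_per_second else float("inf")
    hours, rest = divmod(int(remaining), 3600)
    minutes = rest // 60
    eta = f"{hours}h {minutes}m" if hours else f"{minutes}m"
    return (f"Processing {processed:,}/{total:,} objects ({percent:.1f}%) "
            f"· {items_per_second:.1f} items/sec · ~{eta} remaining")
```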

Other monitoring options

  • GET /v1/tasks/<task_id> – track task-level progress (Redis TTL ≈ 24h).
  • GET /v1/buckets/<bucket_id>/batches/<batch_id>/health – batch health check.
  • Webhook events (collection.documents.written) notify you when documents land.
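On the receiving side, a webhook consumer typically routes on the event type. A minimal dispatcher sketch; only the collection.documents.written event name comes from this page, while the registry pattern and the event payload shape (a "type" key, a "batch_id" key) are assumptions for illustration.

```python
def make_dispatcher():
    """Return (on, dispatch): a registration decorator and a router."""
    handlers = {}

    def on(event_type):
        def register(fn):
            handlers[event_type] = fn
            return fn
        return register

    def dispatch(event):
        # Route on the event's type; unknown types are ignored.
        handler = handlers.get(event.get("type"))
        return handler(event) if handler else None

    return on, dispatch
```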

Scaling Tips

  • Chunk large imports into batches of 1k–10k objects to keep pollers responsive.
  • Parallelize submissions—pollers handle multiple batches concurrently.
  • Use namespaces to isolate environments; pollers are namespace-aware.
  • Retry safely—batch submission and task polling are idempotent.
  • Pipeline scheduling—use Celery Beat or your orchestrator to submit batches on a cron schedule.
Batching keeps ingestion resilient and scalable—separate raw uploads from heavy compute, then let the Engine take over on its own schedule.
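The chunking tip above is a one-liner to implement: split a large object-ID list into fixed-size batches before creating them. A sketch, with the default size picked from the 1k–10k range recommended above.

```python
def chunk_object_ids(object_ids, size=5000):
    """Split object IDs into batch-sized chunks (1k-10k keeps pollers responsive)."""
    return [object_ids[i:i + size] for i in range(0, len(object_ids), size)]
```

Each resulting chunk would then be passed as the object_ids of one create-batch call.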