Batches let you process many objects in one asynchronous job. They store the list of object IDs, generate extractor manifests, and provide a task handle so you can monitor progress.
  1. Create batch: supply object IDs (or create an empty batch and add objects later).
  2. Submit batch: the API flattens manifests into per-extractor Parquet artifacts and writes them to S3.
  3. Engine processes: Ray pollers pick up the batch, execute extractors tier-by-tier, and write documents to MVS.
  4. Webhook & cache updates: the Engine emits webhook events, Celery Beat invalidates caches, and collection schemas update.

Create a Batch

curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "batch_name": "products-2025-10-28",
    "object_ids": ["obj_abc", "obj_def"]
  }'
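The create call can also be wrapped in a small client helper. The sketch below takes the HTTP POST function as an argument so it can be exercised without a live API; the endpoint path mirrors the curl example above, while the helper and argument names are illustrative (auth headers are omitted for brevity).

```python
import json

def create_batch(post, api_url, bucket_id, batch_name, object_ids):
    """Create a batch with an initial set of object IDs.

    `post` is any callable(url, headers, body) -> dict, so tests or dry
    runs can inject a stub instead of a real HTTP client.
    """
    url = f"{api_url}/v1/buckets/{bucket_id}/batches"
    body = {"batch_name": batch_name, "object_ids": list(object_ids)}
    return post(url, {"Content-Type": "application/json"}, json.dumps(body))
```

In production code, `post` would wrap your HTTP library of choice and attach the Authorization and X-Namespace headers shown in the curl example.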
Add more objects later:
curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_xyz"] }'

Submit for Processing

curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>/submit" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "include_processing_history": true }'
  • include_processing_history=true records each enrichment operation in internal_metadata.processing_history.
  • Response contains a task_id; poll /v1/tasks/{task_id} or the batch resource directly.
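The returned task_id can be polled in a simple loop until a terminal status (COMPLETED or FAILED) is reached. A minimal sketch, with the status fetch injected so the loop is testable; in practice `fetch_status` would wrap GET /v1/tasks/{task_id}.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}

def wait_for_task(fetch_status, interval_s=5.0, timeout_s=3600.0, sleep=time.sleep):
    """Poll until the task reaches a terminal status or the timeout expires.

    `fetch_status` is any callable returning the current status string.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"task still {status} after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```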

Lifecycle & Status

  • DRAFT – Created but not submitted
  • QUEUED – Submitted; waiting for poller pickup
  • PROCESSING – Ray job running feature extractors
  • COMPLETED – All extractors finished successfully
  • FAILED – Extractors or Ray job failed (see error_message)
Status updates synchronize to both the batch resource and the associated task.
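A client can sanity-check the status changes it observes against this lifecycle. The transition map below is inferred from the table above (it is not a documented API contract), and FAILED is assumed reachable from QUEUED as well as PROCESSING.

```python
# Transitions inferred from the lifecycle table; not an official contract.
ALLOWED_TRANSITIONS = {
    "DRAFT": {"QUEUED"},
    "QUEUED": {"PROCESSING", "FAILED"},
    "PROCESSING": {"COMPLETED", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}

def is_valid_transition(old, new):
    """True if `old` -> `new` is an expected lifecycle move."""
    return new in ALLOWED_TRANSITIONS.get(old, set())
```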

Under the Hood

  1. API writes manifest metadata to MongoDB and extractor row artifacts to S3.
  2. A Ray poller queries MongoDB every 5 seconds for newly submitted batches.
  3. Poller submits a Ray job with manifest details.
  4. Worker downloads artifacts, runs extractors in dependency tiers, and writes documents to MVS/MongoDB.
  5. QdrantBatchProcessor emits webhook events and updates collection index signatures.
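Steps 2–3 amount to a poll-and-submit loop. A sketch of one poller tick with storage and job submission injected; the callable names (`find_queued`, `submit_job`, `mark_processing`) are illustrative, not the Engine's actual internals.

```python
def poll_once(find_queued, submit_job, mark_processing):
    """One poller tick: submit a Ray job for each queued batch.

    All three callables are injected; in the real Engine these would be
    MongoDB queries and Ray job submissions.
    """
    submitted = []
    for batch in find_queued():
        job_id = submit_job(batch["manifest"])
        mark_processing(batch["batch_id"], job_id)
        submitted.append(batch["batch_id"])
    return submitted
```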

Monitoring

Real-time progress

Poll the batch endpoint to get live progress:
curl -sS "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
The response includes a progress object with real-time processing metrics:
  • progress.processed – Number of objects processed so far
  • progress.total – Total objects in the batch
  • progress.percent – Completion percentage
  • progress.items_per_second – Current processing throughput
  • progress.eta_seconds – Estimated seconds until completion
  • progress.current_stage – Current pipeline stage (name, index, total)
  • progress.errors – Number of processing errors
  • status_message – Human-readable summary (e.g. “Processing 21,611/46,160 objects (46.8%) · 1.4 items/sec · ~4h 57m remaining”)
  • health – Overall batch health (healthy, degraded, unhealthy)
  • estimated_completion – ISO 8601 timestamp of estimated completion
The status_message field provides a one-line summary suitable for dashboards and logging. For programmatic monitoring, use the progress object fields directly.
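A dashboard can also derive its own one-line summary from the raw progress fields. A sketch of that computation, assuming the fields described above; the server's exact formatting (rounding, ETA averaging) may differ.

```python
def format_status_message(processed, total, items_per_second):
    """Build a one-line summary from raw progress fields."""
    percent = 100.0 * processed / total if total else 0.0
    remaining = (total - processed) / items_per_second if items_per_second else float("inf")
    hours, rest = divmod(int(remaining), 3600)
    minutes = rest // 60
    eta = f"{hours}h {minutes}m" if hours else f"{minutes}m"
    return (f"Processing {processed:,}/{total:,} objects ({percent:.1f}%) "
            f"· {items_per_second:.1f} items/sec · ~{eta} remaining")
```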

Other monitoring options

  • GET /v1/tasks/<task_id> – track task-level progress (Redis TTL ≈ 24h).
  • GET /v1/buckets/<bucket_id>/batches/<batch_id>/health – batch health check.
  • Webhook events (collection.documents.written) notify you when documents land.
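On the receiving side, a webhook consumer typically routes on the event type. A minimal dispatcher sketch; only the collection.documents.written event name comes from this page, while the registry pattern and the event payload shape (a "type" key, a "batch_id" key) are assumptions for illustration.

```python
def make_dispatcher():
    """Return (on, dispatch): a registration decorator and a router."""
    handlers = {}

    def on(event_type):
        def register(fn):
            handlers[event_type] = fn
            return fn
        return register

    def dispatch(event):
        # Route on the event's type; unknown types are ignored.
        handler = handlers.get(event.get("type"))
        return handler(event) if handler else None

    return on, dispatch
```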

Scaling Tips

  • Chunk large imports into batches of 1k–10k objects to keep pollers responsive.
  • Parallelize submissions—pollers handle multiple batches concurrently.
  • Use namespaces to isolate environments; pollers are namespace-aware.
  • Retry safely—batch submission and task polling are idempotent.
  • Pipeline scheduling—use Celery Beat or your orchestrator to submit batches on a cron schedule.
Batching keeps ingestion resilient and scalable—separate raw uploads from heavy compute, then let the Engine take over on its own schedule.
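The chunking tip above is a one-liner to implement: split a large object-ID list into fixed-size batches before creating them. A sketch, with the default size picked from the 1k–10k range recommended above.

```python
def chunk_object_ids(object_ids, size=5000):
    """Split object IDs into batch-sized chunks (1k-10k keeps pollers responsive)."""
    return [object_ids[i:i + size] for i in range(0, len(object_ids), size)]
```

Each resulting chunk would then be passed as the object_ids of one create-batch call.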