Document intelligence uses the warehouse’s Decompose layer to extract structure, text, and layout from PDFs and scanned documents, then makes them queryable through multi-stage retrieval.
How It Works
When you ingest a document, Mixpeek runs a multi-stage pipeline:
- Content Extraction — Text extraction from native PDFs, OCR fallback for scanned pages
- Hierarchical Chunking — Documents split into pages, sections, or paragraphs with parent-child relationships
- Semantic Extraction — Document type detection, section classification, and metadata inference
- Multi-Vector Embeddings — Separate embeddings for titles, summaries, and full text
- Indexing — Chunks stored with metadata for filtered vector search
At query time, the retriever searches across embeddings and joins results from multiple collections (text, tables, entities) back to their source documents.
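The hierarchical chunking step above can be sketched as follows. This is a simplified illustration, not Mixpeek's implementation; the chunk fields (`chunk_id`, `parent_id`, `level`) are assumed names used only for this example:

```python
# Sketch of hierarchical chunking: each page becomes a parent chunk,
# and each paragraph on that page becomes a child chunk linked by parent_id.
# Field names here are illustrative, not Mixpeek's actual schema.

def chunk_document(pages: list[str]) -> list[dict]:
    chunks = []
    for page_no, page_text in enumerate(pages, start=1):
        page_id = f"page_{page_no}"
        # Parent chunk at the page level
        chunks.append({"chunk_id": page_id, "parent_id": None,
                       "level": "page", "text": page_text})
        # Child chunks at the paragraph level
        paragraphs = [p for p in page_text.split("\n\n") if p.strip()]
        for para_no, para in enumerate(paragraphs):
            chunks.append({"chunk_id": f"{page_id}_para_{para_no}",
                           "parent_id": page_id,
                           "level": "paragraph", "text": para})
    return chunks

chunks = chunk_document(["Intro.\n\nScope of work.", "Termination clause."])
```

The parent-child links are what later let a retriever match on a paragraph but return (or reassemble) the page or document it came from.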
| Extractor | Use For |
|---|---|
| `pdf_extractor@v1` | Native PDF text, metadata, page chunking |
| `document_extractor@v1` | OCR for scanned docs, layout detection |
| `table_extractor@v1` | Table detection and cell extraction |
| `text_extractor@v1` | Text embeddings, NER, summarization |
1. Create a Bucket
```
POST /v1/buckets
{
  "bucket_name": "contracts",
  "schema": {
    "properties": {
      "document_url": { "type": "url", "required": true },
      "document_type": { "type": "text" },
      "contract_date": { "type": "datetime" }
    }
  }
}
```
2. Create Collections
For text extraction:
```
POST /v1/collections
{
  "collection_name": "contracts-text",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "pdf_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "chunk_strategy": "page",
      "enable_ocr_fallback": true
    },
    "field_passthrough": [
      { "source_path": "document_type" },
      { "source_path": "contract_date" }
    ]
  }
}
```
For tables:
```
POST /v1/collections
{
  "collection_name": "contracts-tables",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "table_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "output_format": "json",
      "min_confidence": 0.7
    }
  }
}
```
3. Ingest Documents
```
POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/2025/agreements",
  "metadata": {
    "document_type": "vendor_agreement",
    "contract_date": "2025-01-15T00:00:00Z"
  },
  "blobs": [
    {
      "property": "document_url",
      "type": "document",
      "url": "s3://my-bucket/contracts/vendor-001.pdf"
    }
  ]
}
```
4. Process
```
POST /v1/buckets/{bucket_id}/batches
{ "object_ids": ["obj_001", "obj_002"] }

POST /v1/buckets/{bucket_id}/batches/{batch_id}/submit
```
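Under the hood this is two plain HTTP calls: create the batch, then submit it. A minimal sketch that builds the request tuples (method, URL, body) without sending anything; the base URL is a placeholder, not a real Mixpeek endpoint:

```python
# Build the two batch-processing requests as (method, url, body) tuples.
# BASE is a placeholder; no network calls are made in this sketch.
BASE = "https://api.example.com/v1"

def batch_requests(bucket_id: str, object_ids: list[str], batch_id: str):
    create = ("POST", f"{BASE}/buckets/{bucket_id}/batches",
              {"object_ids": object_ids})
    submit = ("POST", f"{BASE}/buckets/{bucket_id}/batches/{batch_id}/submit",
              None)  # submit takes no body
    return [create, submit]

reqs = batch_requests("bkt_contracts", ["obj_001", "obj_002"], "bat_123")
```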
5. Create a Retriever
```
POST /v1/retrievers
{
  "retriever_name": "contract-search",
  "collection_ids": ["col_contracts_text", "col_contracts_tables"],
  "input_schema": {
    "properties": {
      "query": { "type": "text", "required": true },
      "document_type": { "type": "text" }
    }
  },
  "stages": [
    {
      "stage_name": "filter",
      "version": "v1",
      "parameters": {
        "filters": {
          "field": "metadata.document_type",
          "operator": "eq",
          "value": "{{inputs.document_type}}"
        }
      }
    },
    {
      "stage_name": "knn_search",
      "version": "v1",
      "parameters": {
        "feature_address": "mixpeek://pdf_extractor@v1/text_embedding",
        "input_mapping": { "text": "query" },
        "limit": 50
      }
    }
  ]
}
```
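At execution time, the `{{inputs.document_type}}` placeholder in the filter stage is replaced with the caller's input. A rough sketch of how that kind of placeholder resolution could work; only the `{{inputs.*}}` syntax comes from the example above, the resolver itself is hypothetical:

```python
import re

# Recursively resolve "{{inputs.<name>}}" placeholders in stage parameters
# against the retriever's inputs. Illustrative only.
def resolve_templates(params, inputs):
    if isinstance(params, dict):
        return {k: resolve_templates(v, inputs) for k, v in params.items()}
    if isinstance(params, list):
        return [resolve_templates(v, inputs) for v in params]
    if isinstance(params, str):
        m = re.fullmatch(r"\{\{inputs\.(\w+)\}\}", params)
        if m:
            return inputs.get(m.group(1))
    return params

filters = {"field": "metadata.document_type", "operator": "eq",
           "value": "{{inputs.document_type}}"}
resolved = resolve_templates(filters, {"document_type": "vendor_agreement"})
```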
6. Query
```
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query": "termination clauses with 30-day notice",
    "document_type": "vendor_agreement"
  },
  "limit": 10
}
```
Named Entity Recognition
Enable NER to extract entities like dates, amounts, and names:
```
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "parameters": {
      "enable_ner": true,
      "entity_types": ["PERSON", "ORG", "DATE", "MONEY"]
    }
  }
}
```
Filter by entity:
```
{
  "filters": {
    "field": "metadata.entities.ORG",
    "operator": "contains",
    "value": "Acme Corp"
  }
}
```
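Conceptually, the `contains` filter walks the dotted field path and checks membership in the extracted entity list. A simplified, hypothetical evaluator (the metadata layout is assumed for illustration):

```python
# Evaluate a filter like {"field": "metadata.entities.ORG", ...}
# against one chunk's metadata. Simplified; not Mixpeek's engine.
def matches(metadata: dict, filt: dict) -> bool:
    value = {"metadata": metadata}
    for part in filt["field"].split("."):  # walk the dotted path
        value = value.get(part, {})
    if filt["operator"] == "contains":
        return filt["value"] in value
    if filt["operator"] == "eq":
        return value == filt["value"]
    return False

meta = {"entities": {"ORG": ["Acme Corp", "Globex"], "MONEY": ["$10,000"]}}
filt = {"field": "metadata.entities.ORG", "operator": "contains", "value": "Acme Corp"}
```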
Multi-Page Assembly
Retrieve all pages from a document using lineage:
```
GET /v1/documents/{document_id}/lineage
```
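Lineage results can then be stitched back into full-document text on the client side. A sketch assuming each lineage entry exposes `level`, `page_number`, and `text` (an illustrative shape, not the documented response):

```python
# Reassemble a document from its lineage: keep page-level chunks,
# order them by page number, and join their text.
def assemble(lineage_chunks: list[dict]) -> str:
    pages = [c for c in lineage_chunks if c.get("level") == "page"]
    pages.sort(key=lambda c: c["page_number"])
    return "\n\n".join(c["text"] for c in pages)

lineage = [
    {"level": "page", "page_number": 2, "text": "Page two."},
    {"level": "paragraph", "page_number": 2, "text": "Page two."},
    {"level": "page", "page_number": 1, "text": "Page one."},
]
full_text = assemble(lineage)
```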