    Document Classification Pipeline

    Classify documents into custom business categories using layout-aware extraction and taxonomy enrichment. Handles invoices, contracts, reports, forms, and correspondence by analyzing both textual content and visual document structure.

    text
    image
    Multi-Stage
    from mixpeek import Mixpeek

    client = Mixpeek(api_key="YOUR_API_KEY")

    # Create taxonomy for document types
    taxonomy = client.taxonomies.create(
        namespace_id="ns_your_namespace",
        name="document_types",
        taxonomy_type="hierarchical",
        hierarchy=[
            {"node_id": "invoice", "collection_id": "col_invoice_examples"},
            {"node_id": "contract", "collection_id": "col_contract_examples"},
            {"node_id": "report", "collection_id": "col_report_examples"},
            {"node_id": "form", "collection_id": "col_form_examples"},
            {"node_id": "correspondence", "collection_id": "col_letter_examples"},
        ],
    )

    # Create document collection with layout extraction
    collection = client.collections.create(
        namespace_id="ns_your_namespace",
        name="incoming_documents",
        extractors=["document-graph-extractor", "text-extractor"],
    )

    # Apply taxonomy for automatic classification
    client.collections.apply_taxonomy(
        collection_id="col_incoming_documents",
        taxonomy_id=taxonomy["taxonomy_id"],
    )

    # Upload documents for classification
    client.buckets.upload(bucket_id="bkt_docs", url="s3://your-bucket/incoming/")

    # Query classified documents
    docs = client.documents.list(
        collection_id="col_incoming_documents",
        filters={"taxonomy_enrichment.category": "invoice"},
    )
    print(f"Found {len(docs['results'])} invoices")
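
To build intuition for what example-based taxonomy classification might look like, here is a minimal, self-contained sketch. It is not Mixpeek's implementation: it assumes each taxonomy node is represented by the average embedding of its example collection and that a document is assigned to the node with the highest cosine similarity. The `classify` helper, the `EXAMPLES` data, and the 3-dimensional vectors are all hypothetical and exist only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classify(doc_vec, examples_by_node):
    """Return (node_id, score) for the node whose example centroid is closest."""
    scores = {
        node: cosine(doc_vec, centroid(vecs))
        for node, vecs in examples_by_node.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy embeddings standing in for the example collections
EXAMPLES = {
    "invoice": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "contract": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
    "report": [[0.1, 0.1, 0.9], [0.2, 0.0, 0.8]],
}

label, score = classify([0.7, 0.2, 0.1], EXAMPLES)
print(label, round(score, 3))
```

In this toy setup the incoming vector lands closest to the invoice examples. A real pipeline would use embeddings produced by the extractors and would likely apply a confidence threshold before assigning a category.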

    Feature Extractors

    Retriever Stages

    aggregate

    Compute aggregations (COUNT, SUM, AVG, etc.) on pipeline results
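
As a rough illustration of what an aggregation over pipeline results computes, the sketch below groups rows by a field and derives COUNT, SUM, and AVG in plain Python. It does not call the Mixpeek API; the `aggregate` function, the row shape, and the `page_count` field are assumptions made for this example.

```python
from collections import defaultdict


def aggregate(rows, group_by, field):
    """Group rows by `group_by` and compute count, sum, and avg of `field`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[field])
    return {
        key: {"count": len(vals), "sum": sum(vals), "avg": sum(vals) / len(vals)}
        for key, vals in groups.items()
    }


# Hypothetical classified results
ROWS = [
    {"category": "invoice", "page_count": 2},
    {"category": "invoice", "page_count": 4},
    {"category": "contract", "page_count": 12},
]

summary = aggregate(ROWS, "category", "page_count")
for category, stats in summary.items():
    print(category, stats)
```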

    reduce

    Use Cases Using This Recipe

    Advanced
    12 min

    Clinical NLP at Scale

    Extract structured intelligence from clinical notes, pathology reports, and medical records

    94% F1 on medical NER benchmarks

    Entity extraction accuracy

    Who It's For

    Healthcare IT teams, clinical informatics departments, and health systems processing thousands of clinical documents daily