Overview

Mixpeek and Databricks occupy different layers of the data stack. Mixpeek ingests unstructured multimodal files and extracts structured features, embeddings, and classifications. Databricks provides the lakehouse platform — Delta Lake for storage, Unity Catalog for governance, and integrated ML for training and serving models. Together, they give you a complete path from raw media files to governed, analytics-ready data.

Mixpeek

Ingests unstructured files, extracts features (embeddings, transcripts, classifications, metadata), and powers multimodal retrieval.

Databricks

Stores structured outputs in Delta tables, enforces governance via Unity Catalog, and runs ML training and serving at scale.

Architecture

                        Mixpeek                              Databricks
               +-----------------------+            +------------------------+
               |                       |            |                        |
  Files -----> |  Buckets & Collections|            |   Delta Lake Tables    |
  (images,     |                       |            |                        |
   video,      |  Decompose files into |  export    |  - classifications     |
   audio,      |  features:            | ---------> |  - extracted metadata  |
   PDFs)       |   - embeddings        |            |  - taxonomy labels     |
               |   - transcripts       |            |  - document payloads   |
               |   - classifications   |            |                        |
               |   - metadata          |  enrich    |  Unity Catalog governs |
               |                       | <--------- |  all tables. Databricks|
               |  Retrieval & Search   |            |  ML retrains models.   |
               +-----------------------+            +------------------------+

Use Cases

Write extracted features as Delta tables

After Mixpeek processes your files, export the structured outputs — transcripts, object detections, taxonomy labels, metadata — as Delta tables. This makes them queryable with Spark SQL, joinable with your existing business data, and available to any tool in the Databricks ecosystem.
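
As a rough illustration of the "joinable with your existing business data" point, the sketch below builds a join between the exported documents table and a business table. The `assets` table and its `campaign` column are hypothetical stand-ins for whatever business data you already hold; only `source_url` and `content_type` come from the Quick Start schema on this page.

```python
def build_join_query(documents_table, business_table):
    """Return SQL that counts exported documents per campaign and content type.

    Table names are interpolated directly, so pass trusted identifiers only.
    `campaign` is a hypothetical column on the business table.
    """
    return f"""
        SELECT a.campaign, d.content_type, COUNT(*) AS n_documents
        FROM {documents_table} AS d
        JOIN {business_table} AS a
          ON d.source_url = a.source_url
        GROUP BY a.campaign, d.content_type
        ORDER BY a.campaign, d.content_type
    """


def documents_per_campaign(cursor, documents_table, business_table):
    """Run the join on any DB-API cursor and return all result rows."""
    cursor.execute(build_join_query(documents_table, business_table))
    return cursor.fetchall()
```

With a Databricks SQL cursor, the call might look like `documents_per_campaign(cursor, "mixpeek_catalog.default.documents", "mixpeek_catalog.default.assets")`, where the second table name is again an assumption.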

Use Unity Catalog for governance

Unity Catalog provides fine-grained access control, lineage tracking, and audit logging for all data assets. Once Mixpeek outputs land in Delta tables, Unity Catalog governs who can access them and how they flow through your organization.
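
A minimal sketch of what granting read-only access could look like once the outputs are in a Delta table. It only builds the Unity Catalog `GRANT` statements; the principal name `analysts` in the usage note is a hypothetical group, and the catalog, schema, and table names are taken from the Quick Start example.

```python
def read_only_grants(catalog, schema, table, principal):
    """Build the GRANT statements a principal needs to read one table.

    Unity Catalog requires USE CATALOG and USE SCHEMA on the parents
    in addition to SELECT on the table itself.
    """
    quoted = f"`{principal}`"  # backticks allow names with spaces or symbols
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {quoted}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {quoted}",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {quoted}",
    ]
```

Each statement can then be passed to `cursor.execute(...)`, for example `for stmt in read_only_grants("mixpeek_catalog", "default", "documents", "analysts"): cursor.execute(stmt)`.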

Combine Mixpeek retrieval with Databricks ML

Use Mixpeek to power real-time multimodal search and retrieval. Feed the same structured features into Databricks ML for batch training — fine-tune classifiers, build recommendation models, or run large-scale analytics on extracted content.
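
One way the batch-training side might start is by turning exported rows into train and test sets. The sketch below assumes each row is a dict with `document_id`, `embedding` (a list of floats), and `label`; the last two column names are hypothetical and depend on what you export. Splitting on a hash of the document ID keeps a document in the same split across retraining runs.

```python
import hashlib


def split_bucket(document_id, test_fraction=0.2):
    """Deterministically assign a document to 'train' or 'test'."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if value < test_fraction else "train"


def build_training_sets(rows, test_fraction=0.2):
    """Group rows into {'train': (X, y), 'test': (X, y)}.

    Rows without a label or without an embedding are skipped.
    """
    sets = {"train": ([], []), "test": ([], [])}
    for row in rows:
        if row.get("label") is None or not row.get("embedding"):
            continue
        features, labels = sets[split_bucket(row["document_id"], test_fraction)]
        features.append(list(row["embedding"]))
        labels.append(row["label"])
    return sets
```

The resulting lists can be handed to whichever training library you run on Databricks; nothing here is specific to one framework.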

Quick Start

Export Mixpeek document metadata to a Delta table using the Mixpeek Python SDK and the Databricks SQL Connector.

1. Install dependencies

pip install mixpeek databricks-sql-connector

2. List documents from Mixpeek

from mixpeek import Mixpeek

client = Mixpeek(api_key="your-api-key")

# List documents from a collection
documents = client.collections.documents.list(
    collection_id="your-collection-id",
    page_size=100
)
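
The call above requests a single page of 100 documents. How the SDK exposes further pages is not covered on this page, so the following is a generic sketch only: `fetch_page` is a hypothetical callable you would write around the SDK, taking a 1-based page number and a page size and returning a list of documents.

```python
def iter_all_documents(fetch_page, page_size=100):
    """Yield documents from successive pages until one comes back short.

    `fetch_page(page, page_size)` is a hypothetical wrapper around the
    SDK call; it must return a list (empty when there are no more pages).
    """
    page = 1
    while True:
        batch = fetch_page(page, page_size)
        if not batch:
            return
        yield from batch
        if len(batch) < page_size:
            return  # a short page means this was the last one
        page += 1
```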

3. Write to a Delta table via Databricks SQL

from databricks import sql
import json

connection = sql.connect(
    server_hostname="YOUR_WORKSPACE.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/YOUR_WAREHOUSE_ID",
    access_token="YOUR_ACCESS_TOKEN"
)

cursor = connection.cursor()

# Create table if it does not exist
cursor.execute("""
    CREATE TABLE IF NOT EXISTS mixpeek_catalog.default.documents (
        document_id STRING,
        source_url STRING,
        content_type STRING,
        metadata STRING,
        created_at TIMESTAMP
    )
""")

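# Insert each document. Named markers (:name) are the connector's native
# parameter style in version 3 and later; positional %s markers are not.
for doc in documents:
    cursor.execute(
        """
        INSERT INTO mixpeek_catalog.default.documents
            (document_id, source_url, content_type, metadata, created_at)
        VALUES (:document_id, :source_url, :content_type, :metadata, :created_at)
        """,
        {
            "document_id": doc.get("document_id"),
            # Tolerate a missing or null "source" field
            "source_url": (doc.get("source") or {}).get("url"),
            "content_type": doc.get("content_type"),
            # Store arbitrary metadata as a JSON string
            "metadata": json.dumps(doc.get("metadata") or {}),
            "created_at": doc.get("created_at"),
        },
    )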

connection.commit()
cursor.close()
connection.close()

For production workloads, write Mixpeek outputs to cloud storage (S3 or ADLS) and use Databricks Auto Loader to incrementally ingest new files into Delta tables.
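
The production path could be sketched roughly as follows. `write_jsonl` writes documents as JSON Lines to a local path, which you would then upload to your bucket by whatever means you already use. `start_auto_loader` shows an Auto Loader stream reading that location; it only runs inside Databricks, where a `spark` session is available. All paths and the function names themselves are assumptions for illustration.

```python
import json
from pathlib import Path


def write_jsonl(documents, path):
    """Write one JSON object per line; returns the path written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps(doc, sort_keys=True) + "\n")
    return path


def start_auto_loader(spark, source_dir, schema_dir, checkpoint_dir, table):
    """Incrementally load new JSON files from source_dir into a Delta table.

    Intended for a Databricks notebook or job; `spark` is the session
    Databricks provides. Not called anywhere in this sketch.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_dir)
        .load(source_dir)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_dir)
        .trigger(availableNow=True)  # process what is there, then stop
        .toTable(table)
    )
```

Because Auto Loader tracks which files it has already processed through the checkpoint, rerunning the job picks up only files added since the last run.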

When to Use Each

Capability                                                  | Mixpeek                   | Databricks
------------------------------------------------------------+---------------------------+----------------------------------
Ingest unstructured files (video, images, audio, PDFs)      | Yes                       | No
Extract features (embeddings, transcripts, classifications) | Yes                       | No
Multimodal semantic search                                  | Yes                       | No
Structured SQL analytics                                    | No                        | Yes (Spark SQL)
Data governance and lineage                                 | Document-level ACL        | Unity Catalog
ML model training and serving                               | No                        | Yes (MLflow, Model Serving)
Streaming ingestion                                         | Webhooks + batch triggers | Structured Streaming, Auto Loader

Mixpeek handles everything before the data is structured. Databricks handles everything after. Use both to bridge the gap between raw multimodal files and governed, ML-ready data.

  • Taxonomies — classify content and export labels
  • SQL Lookup Stage — query external databases from retriever pipelines
  • API Call Stage — call external APIs during retrieval
  • Webhooks — trigger Databricks jobs when Mixpeek processing completes
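
For the webhook item, a receiver could trigger a Databricks job through the Jobs API `run-now` endpoint. The sketch below only builds the HTTP request; the host, token, and job ID are placeholders, and the shape of the Mixpeek webhook payload is not described on this page, so extracting an idempotency token from it is left to you.

```python
import json
import urllib.request


def build_run_now_request(host, token, job_id, idempotency_token=None):
    """Build a POST request for the Databricks Jobs API run-now endpoint.

    Passing an idempotency_token (for example, an ID taken from the
    webhook event, if it carries one) lets Databricks ignore duplicate
    deliveries of the same event.
    """
    body = {"job_id": job_id}
    if idempotency_token is not None:
        body["idempotency_token"] = idempotency_token
    return urllib.request.Request(
        url=f"https://{host}/api/2.1/jobs/run-now",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(build_run_now_request(...))` inside your webhook handler, with error handling suited to your framework.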