Migrate from Elasticsearch

Elasticsearch is a search engine built on keyword matching and BM25. Over time, you may have bolted on vector search (dense_vector), external embedding generation, and custom pipelines to handle multimodal content. Mixpeek unifies all of this: feature extraction, tiered storage, and multi-stage retrieval in a single API. This guide walks you through migrating your search workload from Elasticsearch to Mixpeek.

Why Migrate

Elasticsearch started as a keyword search engine. Vector search, embedding generation, and multimodal processing are additions you configure and maintain yourself. Mixpeek was built from the ground up as a multimodal data warehouse where feature extraction, storage tiering, and multi-stage retrieval are native primitives, not plugins.

Elasticsearch	Mixpeek
Keyword search (BM25) with bolted-on vector search	Native semantic, keyword, and hybrid search in one pipeline
You generate and manage embeddings externally	Feature extractors handle embedding generation automatically
Single storage tier (hot-warm-cold requires manual ILM)	Automatic tiered storage: active, cold, archive, up to 90% savings
Complex DSL for combining query types	Multi-stage retriever pipelines: chain stages declaratively
Ingest pipelines for basic transforms	Collections with ML-powered feature extractors (CLIP, Whisper, LayoutLM)

Concept Mapping

Elasticsearch	Mixpeek	Notes
Index	Namespace	Top-level container for your data
Document	Document	Mixpeek documents contain extracted features, metadata, and source lineage
Mapping / Schema	Collection + Feature Extractor	Collections define what features to extract; schema is derived automatically
DSL query	Retriever (with stages)	Stages are like query clauses, composable and ordered
`bool` query (must/should/filter)	Multi-stage pipeline	Each clause becomes a stage: search, filter, boost, rerank
Aggregation	Reduce stages / Taxonomies	Group, classify, and summarize results
Ingest pipeline	Collection + Feature Extractors	Mixpeek pipelines extract ML features, not just field transforms
Analyzer (tokenizer + filters)	Feature Extractor configuration	Extractors handle tokenization, embedding, and structured extraction
ILM (Index Lifecycle Management)	Automatic storage tiering	Active, cold, and archive tiers managed by the platform

Migration Steps

Create a Namespace

Replace your Elasticsearch index with a Mixpeek namespace.

curl -X POST https://api.mixpeek.com/v1/namespaces \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespace_name": "knowledge-base"
  }'

Replace Index Mappings with Collections

Instead of defining field types and analyzers, create a collection with a feature extractor that matches your content. The Elasticsearch mapping is shown first, followed by the equivalent Mixpeek collection.

PUT /knowledge-base
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" },
      "body": { "type": "text", "analyzer": "english" },
      "embedding": { "type": "dense_vector", "dims": 768 },
      "category": { "type": "keyword" },
      "published_at": { "type": "date" }
    }
  }
}

from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Collection handles schema, embedding, and indexing
collection = client.collections.create(
    collection_name="articles",
    feature_extractor={
        "feature_extractor_name": "multimodal",
        "version": "v1"
    },
    namespace="knowledge-base"
)

You do not need to define field types or manage embedding dimensions. The feature extractor handles all of this based on your content.
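Before discarding the mapping, it can help to note which fields the extractor replaces and which you may still want as filterable metadata. The sketch below is illustrative only: `split_mapping` is a hypothetical helper, and treating `text` and `dense_vector` fields as extractor-managed (with everything else kept as metadata) is an assumption for this example, not documented Mixpeek behavior.

```python
from typing import Any

# Assumption for this sketch: analyzed text and stored vectors are the
# field types whose job a feature extractor takes over.
EXTRACTOR_MANAGED_TYPES = {"text", "dense_vector"}


def split_mapping(mapping: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Return (extractor_managed, metadata) field names from an ES mapping."""
    properties = mapping.get("mappings", {}).get("properties", {})
    managed: list[str] = []
    metadata: list[str] = []
    for name, spec in properties.items():
        if spec.get("type") in EXTRACTOR_MANAGED_TYPES:
            managed.append(name)
        else:
            metadata.append(name)
    return managed, metadata


# The mapping from this guide, as a Python dict
es_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "english"},
            "body": {"type": "text", "analyzer": "english"},
            "embedding": {"type": "dense_vector", "dims": 768},
            "category": {"type": "keyword"},
            "published_at": {"type": "date"},
        }
    }
}

managed, metadata = split_mapping(es_mapping)
print("Replaced by the extractor:", managed)
print("Candidate metadata fields:", metadata)
```

Here `category` and `published_at` come out as metadata candidates, which matches the fields the search example later filters on.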

Replace Ingest Pipelines with Feature Extraction

Elasticsearch ingest pipelines handle basic field transforms. Mixpeek collections run ML models on your content: generating embeddings, extracting entities, transcribing audio, and more.

# Upload source files - the collection handles everything
client.objects.create(
    bucket_id="your-bucket-id",
    key_prefix="/articles",
    blobs=[
        {"property": "content", "data": "s3://your-bucket/article-001.pdf"}
    ],
    namespace="knowledge-base"
)

Do not try to bulk-import your Elasticsearch documents or vectors. Re-ingest your source files so the pipeline can extract multi-layered features and build proper lineage.
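Re-ingesting usually means looping over a list of source file locations. As a minimal sketch, the hypothetical helper `build_object_requests` below builds one set of keyword arguments per file in the same shape as the `client.objects.create` call above. The second file name is made up for illustration, and the helper only prepares the payloads; it does not call the API.

```python
from typing import Any


def build_object_requests(
    source_uris: list[str],
    bucket_id: str,
    namespace: str,
    key_prefix: str = "/articles",
) -> list[dict[str, Any]]:
    """Build one keyword-argument dict per source file (hypothetical helper)."""
    return [
        {
            "bucket_id": bucket_id,
            "key_prefix": key_prefix,
            "blobs": [{"property": "content", "data": uri}],
            "namespace": namespace,
        }
        for uri in source_uris
    ]


# Placeholder URIs; replace with the locations of your own source files
uris = [
    "s3://your-bucket/article-001.pdf",
    "s3://your-bucket/article-002.pdf",
]
object_requests = build_object_requests(uris, "your-bucket-id", "knowledge-base")

for kwargs in object_requests:
    # With a configured client this would become:
    #   client.objects.create(**kwargs)
    print(kwargs["blobs"][0]["data"])
```

Keeping payload construction separate from the API call makes it easy to dry-run a migration and inspect exactly what would be uploaded.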

Translate DSL Queries to Retriever Stages

Elasticsearch’s query DSL maps naturally to Mixpeek retriever stages. Each DSL clause becomes a stage in the pipeline. The Elasticsearch query is shown first, followed by the equivalent Mixpeek retriever call.

POST /knowledge-base/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "knn": {
            "field": "embedding",
            "query_vector": [0.1, 0.2, ...],
            "num_candidates": 50,
            "k": 20
          }
        }
      ],
      "filter": [
        { "term": { "category": "engineering" } },
        { "range": { "published_at": { "gte": "2025-01-01" } } }
      ]
    }
  },
  "size": 10
}

# One call - embedding, search, and filtering handled
results = client.retrievers.execute(
    retriever_id="article-search",
    inputs={
        "query": "distributed systems architecture",
        "category": "engineering"
    },
    limit=10,
    namespace="knowledge-base"
)
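The clause-to-stage mapping can be sketched as a small translation function. This is an illustration under stated assumptions, not Mixpeek's API: `filter_clause_to_stages` is a hypothetical helper, the stage shape copies the `attribute_filter` stage used in this guide, and only the `eq` operator appears in the guide. The range operator names (`gt`, `gte`, `lt`, `lte`) are assumed and should be checked against the retriever reference. Note that the retriever call above passes only `category`; the Elasticsearch `range` filter on `published_at` would need its own stage.

```python
from typing import Any

# Assumed operator names for range bounds; only "eq" is shown in this guide.
RANGE_OPERATORS = {"gt": "gt", "gte": "gte", "lt": "lt", "lte": "lte"}


def make_filter_stage(field: str, operator: str, value: Any) -> dict[str, Any]:
    """Build one attribute_filter stage in the shape used by this guide."""
    return {
        "stage_name": f"filter_{field}",
        "stage_type": "filter",
        "config": {
            "stage_id": "attribute_filter",
            "parameters": {"field": field, "operator": operator, "value": value},
        },
    }


def filter_clause_to_stages(clause: dict[str, Any]) -> list[dict[str, Any]]:
    """Translate one ES `term` or `range` filter clause into filter stages."""
    if "term" in clause:
        return [make_filter_stage(f, "eq", v) for f, v in clause["term"].items()]
    if "range" in clause:
        return [
            make_filter_stage(field, RANGE_OPERATORS[es_op], value)
            for field, bounds in clause["range"].items()
            for es_op, value in bounds.items()
        ]
    raise ValueError(f"unsupported clause type: {list(clause)}")


# The filter clauses from the Elasticsearch query in this guide
es_filters = [
    {"term": {"category": "engineering"}},
    {"range": {"published_at": {"gte": "2025-01-01"}}},
]
stages = [stage for clause in es_filters for stage in filter_clause_to_stages(clause)]
for stage in stages:
    print(stage["stage_name"], stage["config"]["parameters"])
```

The sketch emits literal values; in a saved retriever you would typically swap them for `{{INPUT.*}}` templates so the same pipeline serves different queries.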

Build Multi-Stage Retriever Pipelines

Define the `article-search` retriever that the call above executes. It chains stages together, replacing complex DSL queries with a declarative pipeline.

curl -X POST https://api.mixpeek.com/v1/retrievers \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "X-Namespace: knowledge-base" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "article-search",
    "stages": [
      {
        "stage_name": "hybrid_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "fusion": "rrf",
            "final_top_k": 50,
            "searches": [
              {
                "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 50
              },
              {
                "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "lexical": true,
                "top_k": 50
              }
            ]
          }
        }
      },
      {
        "stage_name": "filter_category",
        "stage_type": "filter",
        "config": {
          "stage_id": "attribute_filter",
          "parameters": {
            "field": "category",
            "operator": "eq",
            "value": "{{INPUT.category}}"
          }
        }
      },
      {
        "stage_name": "rerank",
        "stage_type": "sort",
        "config": {
          "stage_id": "rerank",
          "parameters": {
            "top_k": 10
          }
        }
      }
    ]
  }'

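Notice the hybrid approach: the first stage runs a semantic search and a lexical search over the same feature, then fuses the two result lists with reciprocal rank fusion (`"fusion": "rrf"`). The later stages filter by category and rerank the top results. No need to manually tune BM25 weights against vector scores.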
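To see why rank fusion avoids score tuning, here is a minimal sketch of reciprocal rank fusion (RRF) as it is commonly defined: each document scores the sum of `1 / (k + rank)` across the lists it appears in. The constant `k = 60` is the value commonly used in the literature; the constant and tie-breaking Mixpeek applies internally are not stated in this guide, so treat both as assumptions. The document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["a", "b", "c"]  # hypothetical vector-search ranking
lexical = ["b", "d", "a"]   # hypothetical keyword-search ranking

fused = reciprocal_rank_fusion([semantic, lexical])
print(fused)  # ['b', 'a', 'd', 'c']
```

Only ranks enter the formula, so BM25 scores and vector similarities never need to be put on a common scale. Documents found by both searches (`a` and `b`) rise above those found by one.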

Test and Verify

Execute your retriever and compare results against your Elasticsearch baseline.

curl -X POST https://api.mixpeek.com/v1/retrievers/{retriever_id}/execute \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "X-Namespace: knowledge-base" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "query": "distributed systems architecture",
      "category": "engineering"
    },
    "limit": 10
  }'
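One simple way to quantify the comparison is top-k overlap between the two result lists. The sketch below assumes you can reduce each result to a key shared by both systems, such as the source file URI, since re-ingested documents may not keep their Elasticsearch IDs. `overlap_at_k` is a hypothetical helper and the IDs are invented.

```python
def overlap_at_k(baseline: list[str], candidate: list[str], k: int = 10) -> float:
    """Fraction of the baseline's top-k keys that also appear in the candidate's top-k."""
    top_baseline = set(baseline[:k])
    if not top_baseline:
        return 0.0
    return len(top_baseline & set(candidate[:k])) / len(top_baseline)


# Hypothetical shared keys for the same query in each system
es_top = ["doc-1", "doc-2", "doc-3", "doc-4"]
mixpeek_top = ["doc-2", "doc-1", "doc-5", "doc-3"]

score = overlap_at_k(es_top, mixpeek_top, k=4)
print(f"overlap@4 = {score:.2f}")  # overlap@4 = 0.75
```

Low overlap is not necessarily a regression, because semantic retrieval can surface relevant documents that keyword matching missed. Use the number to pick queries worth reviewing by hand.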

What You Gain

Capability	Elasticsearch	Mixpeek
Hybrid search	Manual BM25 + kNN score tuning	Semantic and keyword searches fused with RRF, then reranked
Feature extraction	External embedding generation, custom ingest pipelines	Built-in ML extractors: CLIP, Whisper, LayoutLM, and more
Multimodal	Text-native; images/video require custom plugins	Native support for video, audio, images, and documents in the same namespace
Storage tiering	ILM policies you configure and maintain	Automatic tiering: active, cold, archive, managed by the platform
Infrastructure	Clusters, shards, replicas, JVM tuning	Fully managed API, no cluster operations
Lineage	Documents disconnected from source files	Trace any result back through document, object, and source file
Multi-stage pipelines	Complex nested DSL	Declarative stage pipelines: search, filter, rerank, enrich

Next Steps

Quickstart

Get Mixpeek running in 10 minutes

Feature Extractors

Learn about automatic feature extraction

Retrievers

Build multi-stage retrieval pipelines

Core Concepts

Understand the data model
