Elasticsearch is a powerful search engine built on keyword matching and BM25. Over time, you may have bolted on vector search (dense_vector), external embedding generation, and custom pipelines to handle multimodal content. Mixpeek unifies all of this: feature extraction, tiered storage, and multi-stage retrieval in a single API. This guide walks you through migrating your search workload from Elasticsearch to Mixpeek.

Why Migrate

Elasticsearch started as a keyword search engine. Vector search, embedding generation, and multimodal processing are additions you configure and maintain yourself. Mixpeek was built from the ground up as a multimodal data warehouse where feature extraction, storage tiering, and multi-stage retrieval are native primitives, not plugins.
Elasticsearch | Mixpeek
Keyword search (BM25) with bolted-on vector search | Native semantic, keyword, and hybrid search in one pipeline
You generate and manage embeddings externally | Feature extractors handle embedding generation automatically
Single storage tier (hot-warm-cold requires manual ILM) | Automatic tiered storage: active, cold, archive, up to 90% savings
Complex DSL for combining query types | Multi-stage retriever pipelines: chain stages declaratively
Ingest pipelines for basic transforms | Collections with ML-powered feature extractors (CLIP, Whisper, LayoutLM)

Concept Mapping

Elasticsearch | Mixpeek | Notes
Index | Namespace | Top-level container for your data
Document | Document | Mixpeek documents contain extracted features, metadata, and source lineage
Mapping / Schema | Collection + Feature Extractor | Collections define what features to extract; schema is derived automatically
DSL query | Retriever (with stages) | Stages are like query clauses, composable and ordered
bool query (must/should/filter) | Multi-stage pipeline | Each clause becomes a stage: search, filter, boost, rerank
Aggregation | Reduce stages / Taxonomies | Group, classify, and summarize results
Ingest pipeline | Collection + Feature Extractors | Mixpeek pipelines extract ML features, not just field transforms
Analyzer (tokenizer + filters) | Feature Extractor configuration | Extractors handle tokenization, embedding, and structured extraction
ILM (Index Lifecycle Management) | Automatic storage tiering | Hot, cold, archive managed by the platform

Migration Steps

1. Create a Namespace

Replace your Elasticsearch index with a Mixpeek namespace.
curl -X POST https://api.mixpeek.com/v1/namespaces \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespace_name": "knowledge-base"
  }'
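For scripted migrations, the same request can be assembled in Python. The following is a minimal sketch using only the standard library; the endpoint, headers, and payload are copied from the curl example above, while the function name `build_namespace_request` is a hypothetical helper introduced for illustration. The request is built but not sent.

```python
import json
import urllib.request

# Base URL taken from the curl example in this guide.
API_BASE = "https://api.mixpeek.com/v1"


def build_namespace_request(api_key: str, namespace_name: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a namespace."""
    body = json.dumps({"namespace_name": namespace_name}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/namespaces",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_namespace_request("YOUR_API_KEY", "knowledge-base")
# To actually send it: urllib.request.urlopen(request)
```

Keeping construction separate from sending makes it easy to log or inspect the request before it touches a live account.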

2. Replace Index Mappings with Collections

Instead of defining field types and analyzers, create a collection with a feature extractor that matches your content. For reference, this is the kind of Elasticsearch mapping the collection replaces:
PUT /knowledge-base
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" },
      "body": { "type": "text", "analyzer": "english" },
      "embedding": { "type": "dense_vector", "dims": 768 },
      "category": { "type": "keyword" },
      "published_at": { "type": "date" }
    }
  }
}
You do not need to define field types or manage embedding dimensions. The feature extractor handles all of this based on your content.
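Before retiring a mapping, it can help to take stock of what it contains. The sketch below is one possible way to do that, not part of either product: it sorts the fields of the mapping above into text content, stored vectors, and remaining metadata. The helper name `audit_mapping` and the three groupings are assumptions made for this illustration.

```python
# Field-type groupings are an assumption made for this sketch.
TEXT_TYPES = {"text"}
VECTOR_TYPES = {"dense_vector"}


def audit_mapping(mapping: dict) -> dict:
    """Group the fields of an Elasticsearch mapping by their role in a migration."""
    properties = mapping.get("mappings", {}).get("properties", {})
    report = {"content": [], "vectors": [], "metadata": []}
    for name, spec in properties.items():
        field_type = spec.get("type")
        if field_type in TEXT_TYPES:
            report["content"].append(name)    # text the extractor will process
        elif field_type in VECTOR_TYPES:
            report["vectors"].append(name)    # regenerated during re-ingestion
        else:
            report["metadata"].append(name)   # candidates for filtering
    return report


es_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "english"},
            "body": {"type": "text", "analyzer": "english"},
            "embedding": {"type": "dense_vector", "dims": 768},
            "category": {"type": "keyword"},
            "published_at": {"type": "date"},
        }
    }
}

report = audit_mapping(es_mapping)
```

For this mapping, `report["vectors"]` contains only `embedding`, the field you no longer manage yourself, while `category` and `published_at` land under `metadata`.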

3. Replace Ingest Pipelines with Feature Extraction

Elasticsearch ingest pipelines handle basic field transforms. Mixpeek collections run ML models on your content: generating embeddings, extracting entities, transcribing audio, and more.
# Upload source files - the collection handles everything
# `client` is assumed to be an already-authenticated Mixpeek SDK client
client.objects.create(
    bucket_id="your-bucket-id",
    key_prefix="/articles",
    blobs=[
        {"property": "content", "url": "s3://your-bucket/article-001.pdf"}
    ],
    namespace="knowledge-base"
)
Do not try to bulk-import your Elasticsearch documents or vectors. Re-ingest your source files so the pipeline can extract multi-layered features and build proper lineage.
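Re-ingesting usually means looping over many source files. A minimal sketch of that loop follows, assuming each file becomes one object with a single `content` blob, as in the call above. The helper `build_object_requests` and the second file name are hypothetical; the argument names mirror the `client.objects.create` example and nothing else about the SDK is assumed.

```python
def build_object_requests(urls, bucket_id, namespace, key_prefix="/articles"):
    """Return one keyword-argument dict per source file, shaped like the call above."""
    requests = []
    for url in urls:
        requests.append({
            "bucket_id": bucket_id,
            "key_prefix": key_prefix,
            "blobs": [{"property": "content", "url": url}],
            "namespace": namespace,
        })
    return requests


# The second URL is a made-up example file.
source_files = [
    "s3://your-bucket/article-001.pdf",
    "s3://your-bucket/article-002.pdf",
]
batch = build_object_requests(source_files, "your-bucket-id", "knowledge-base")

# With an authenticated client, each entry could then be submitted:
# for kwargs in batch:
#     client.objects.create(**kwargs)
```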

4. Translate DSL Queries to Retriever Stages

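Elasticsearch’s query DSL maps naturally to Mixpeek retriever stages. Each DSL clause becomes a stage in the pipeline. Take the following Elasticsearch query as the starting point; step 5 defines the retriever that replaces it.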
POST /knowledge-base/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "knn": {
            "field": "embedding",
            "query_vector": [0.1, 0.2, ...],
            "num_candidates": 50,
            "k": 20
          }
        }
      ],
      "filter": [
        { "term": { "category": "engineering" } },
        { "range": { "published_at": { "gte": "2025-01-01" } } }
      ]
    }
  },
  "size": 10
}
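The clause-to-stage mapping can be sketched as a small translation function. This is an illustration under stated assumptions, not an official converter: the stage types and config keys are the ones used in the retriever definition in step 5, the function name `translate_bool_query` is hypothetical, and using `num_candidates` as the stage's `top_k` is a guess at a reasonable default. Clauses with no counterpart shown in this guide (such as `range`) are returned untranslated instead of being guessed at.

```python
def translate_bool_query(es_query: dict) -> tuple[list, list]:
    """Map a bool query to (retriever stages, clauses left untranslated)."""
    bool_clause = es_query.get("query", {}).get("bool", {})
    stages, untranslated = [], []

    for clause in bool_clause.get("must", []):
        if "knn" in clause:
            # The query vector is dropped: the stage takes the query text instead.
            stages.append({
                "type": "semantic_search",
                "config": {
                    "query": "{{INPUT.query}}",
                    "top_k": clause["knn"].get("num_candidates", 50),
                },
            })
        else:
            untranslated.append(clause)

    filters = {}
    for clause in bool_clause.get("filter", []):
        if "term" in clause:
            for field in clause["term"]:
                filters[field] = "{{INPUT." + field + "}}"
        else:
            untranslated.append(clause)
    if filters:
        stages.append({"type": "attribute_filter", "config": {"filters": filters}})

    return stages, untranslated


es_query = {
    "query": {
        "bool": {
            "must": [{"knn": {"field": "embedding",
                              "query_vector": [0.1, 0.2],  # shortened placeholder
                              "num_candidates": 50, "k": 20}}],
            "filter": [
                {"term": {"category": "engineering"}},
                {"range": {"published_at": {"gte": "2025-01-01"}}},
            ],
        }
    },
    "size": 10,
}

stages, untranslated = translate_bool_query(es_query)
```

Here `stages` holds a `semantic_search` stage and an `attribute_filter` stage, and `untranslated` holds the `range` clause, which needs a manual decision.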

5. Build Multi-Stage Retriever Pipelines

Define a retriever that chains stages together. This replaces complex DSL queries with a declarative pipeline.
curl -X POST https://api.mixpeek.com/v1/retrievers \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "X-Namespace: knowledge-base" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "article-search",
    "stages": [
      {
        "type": "semantic_search",
        "config": {
          "query": "{{INPUT.query}}",
          "top_k": 50
        }
      },
      {
        "type": "keyword_search",
        "config": {
          "query": "{{INPUT.query}}",
          "top_k": 50
        }
      },
      {
        "type": "attribute_filter",
        "config": {
          "filters": {
            "category": "{{INPUT.category}}"
          }
        }
      },
      {
        "type": "rerank",
        "config": {
          "top_k": 10
        }
      }
    ]
  }'
Notice the hybrid approach: semantic search and keyword search run as separate stages, then results are combined through reranking. No need to manually tune BM25 weights against vector scores.
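To see why combining ranked lists can avoid score tuning, consider reciprocal rank fusion (RRF), a common rank-based merging technique. This guide does not say which method Mixpeek's rerank stage uses, so the sketch below is a stand-in chosen for illustration only. It merges two hypothetical result lists using ranks alone, so BM25 scores and vector similarities never need to be put on a common scale. The document IDs are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_k=10):
    """Merge ranked ID lists: each appearance adds 1 / (k + rank) to an ID's score."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_k]


semantic_hits = ["doc-3", "doc-1", "doc-7"]   # hypothetical semantic results
keyword_hits = ["doc-1", "doc-9", "doc-3"]    # hypothetical keyword results

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Documents that appear in both lists (`doc-1`, `doc-3`) rise to the top, ahead of documents found by only one method.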

6. Test and Verify

Execute your retriever and compare results against your Elasticsearch baseline.
curl -X POST https://api.mixpeek.com/v1/retrievers/{retriever_id}/execute \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "X-Namespace: knowledge-base" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "query": "distributed systems architecture",
      "category": "engineering"
    },
    "limit": 10
  }'
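One simple way to quantify the comparison is top-k overlap: the fraction of the Elasticsearch baseline's top results that also appear in the retriever's top results. The sketch below assumes you can map results from both systems to a shared identifier, such as the source file URL; the function name and the sample IDs are hypothetical.

```python
def overlap_at_k(baseline_ids, candidate_ids, k=10):
    """Fraction of the baseline's top-k IDs that also appear in the candidate's top-k."""
    baseline_top = set(baseline_ids[:k])
    if not baseline_top:
        return 0.0
    candidate_top = set(candidate_ids[:k])
    return len(baseline_top & candidate_top) / len(baseline_top)


# Hypothetical result IDs for the same query in both systems.
es_ids = ["a", "b", "c", "d"]
mixpeek_ids = ["b", "a", "e", "c"]

score = overlap_at_k(es_ids, mixpeek_ids, k=4)  # 3 of 4 baseline IDs are shared
```

A low overlap is not necessarily a regression, since semantic stages may surface relevant documents the baseline missed. Treat it as a prompt to inspect the differing results by hand.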

What You Gain

Capability | Elasticsearch | Mixpeek
Hybrid search | Manual BM25 + kNN score tuning | Semantic and keyword stages with automatic reranking
Feature extraction | External embedding generation, custom ingest pipelines | Built-in ML extractors: CLIP, Whisper, LayoutLM, and more
Multimodal | Text-native; images/video require custom plugins | Native support for video, audio, images, and documents in the same namespace
Storage tiering | ILM policies you configure and maintain | Automatic tiering: active, cold, archive, managed by the platform
No infrastructure | Clusters, shards, replicas, JVM tuning | Fully managed API, no cluster operations
Lineage | Documents disconnected from source files | Trace any result back through document, object, and source file
Multi-stage pipelines | Complex nested DSL | Declarative stage pipelines: search, filter, rerank, enrich

Next Steps

Quickstart: Get Mixpeek running in 10 minutes
Feature Extractors: Learn about automatic feature extraction
Retrievers: Build multi-stage retrieval pipelines
Core Concepts: Understand the data model