
    Gemini Embedding 2 is Live: embed multiple files into one vector

    Google's Gemini Embedding 2 embeds images, PDFs, and text together in a single API call. Here's how we integrated it into Mixpeek's feature extractor pipeline, the production numbers, and where multi-file embedding beats single-chunk approaches.


    Google shipped Gemini Embedding 2 (gemini-embedding-exp-03-07, 3072-d) last week (announcement). The headline number is the dimensionality. The part that actually matters is buried in the announcement: it does multi-modal embedding natively — images, PDFs, audio, and text in a single API call, producing one vector that represents all of them together.

    That's genuinely new. CLIP and SigLIP give you one modality per call and leave you to figure out late fusion. Vertex Multimodal gives you image+text but not PDF, not audio. Gemini Embedding 2 takes whatever you throw at it and returns a single 3072-d float array. No fusion logic, no alignment head you have to train yourself.

    We integrated it into Mixpeek this week. Here's how it works, what it's good for, and the actual production numbers.


    Quick context: what feature extractors and retrievers are

    If you haven't used Mixpeek before:

    A feature extractor is a processing pipeline that runs when objects land in a bucket. You point it at blob properties on your objects (image URLs, text fields, PDF attachments) and it writes embeddings into a vector index attached to the namespace. You can have multiple extractors per namespace — one for text, one for images, one for multi-modal — each writing its own named vector. Docs: docs.mixpeek.com/processing/feature-extractors.

    A retriever is a query pipeline. You define what inputs it takes, which feature indices to search, and how to fuse and rank results. At query time it embeds your input using the same model that was used at ingest, runs ANN search, and returns ranked documents. The embedding step at query time is what we call the realtime path — it runs inline during the request, not in a batch job. Docs: docs.mixpeek.com/retrieval/retrievers.


    The problem with one chunk per embedding

    Most embedding workflows treat objects as a single blob: extract text, embed it, done. If the object also has an image you embed the image separately and late-fuse at query time, or you throw away the image entirely.

    This is fine for simple retrieval but breaks in a few important ways:

    • Product catalogs. A product is a hero image + a spec sheet PDF + a description. If you embed these separately, a query for "lightweight carbon fiber trail shoe" matches on text but has no idea the product also has a sole pattern image that's strongly correlated with "trail." The separate embeddings can't cross-reference each other.
    • Documents with figures. Research papers, technical reports, slide decks. The figure on page 4 is context for the paragraph below it. Embedding the figure and the paragraph independently loses that relationship. You'd need to chunk and cross-reference manually.
    • Brand and compliance monitoring. You want to know if a video frame + its surrounding caption jointly violate a guideline. A text-only check misses visual context; an image-only check misses the caption spin.
    • Anything with metadata that changes the semantics. A photo of a person means different things with the caption "CEO of Acme Corp" vs "wanted for fraud." Text and image together carry meaning that neither carries alone.

    The standard workaround is to embed everything separately and hope your late fusion weights are right. With Gemini Embedding 2 you can skip that entirely: pass all of it in one call, get one vector, store one point in Qdrant per object. At query time, pass your query image + your query text in one call. The model figures out the cross-modal alignment.


    How it works in Mixpeek

    The extractor is called gemini_multifile_extractor. You configure it with an input_mappings block that lists which blob properties to embed together. All listed properties are collected per object, fetched (URLs are downloaded, presigned if S3), and sent to Gemini in a single embed_content call.

    {
      "feature_extractor_name": "gemini_multifile_extractor",
      "version": "v1",
      "input_mappings": {
        "files": ["hero_image", "spec_sheet", "description"]
      },
      "params": {
        "output_dimensionality": 3072,
        "task_type": "RETRIEVAL_DOCUMENT"
      }
    }

    The files key is an array of blob property names on your objects. At ingest time, Mixpeek's Ray Data pipeline calls get_content_list() to resolve each property — downloading binaries, passing text strings as-is — then builds a Part[] array and fires one Gemini API call per object. The result is a single 3072-d vector stored as a named vector in Qdrant.
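The ingest path above can be sketched in a few lines. This is illustrative, not Mixpeek's actual pipeline code: `build_parts` stands in for the real `get_content_list()` + part-building step, and the dict shapes are simplified stand-ins for the SDK's part types.

    ```python
    # Hedged sketch of the ingest path: each object's resolved blobs become a
    # parts list, and the whole list goes to Gemini in ONE embed_content call.
    def build_parts(contents):
        """contents: list of (mime_type, payload) tuples — str for text,
        bytes for downloaded binaries."""
        parts = []
        for mime_type, payload in contents:
            if mime_type == "text/plain":
                parts.append({"text": payload})  # text strings pass through as-is
            else:
                parts.append({"inline_data": {"mime_type": mime_type,
                                              "data": payload}})
        return parts

    # One object's resolved blobs: image bytes + PDF bytes + description text.
    resolved = [
        ("image/jpeg", b"\xff\xd8..."),       # hero_image, downloaded from URL
        ("application/pdf", b"%PDF-1.7..."),  # spec_sheet, fetched via presigned S3
        ("text/plain", "Lightweight carbon-fiber trail shoe"),
    ]
    parts = build_parts(resolved)
    assert len(parts) == 3  # three blobs, but still exactly one API call per object
    ```

The parts list then goes into a single `embed_content` call, as in the plugin example later in the post.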

    Key things the extractor writes to the document payload:

    • source_blob_count — how many blobs were embedded
    • source_blob_properties — which properties contributed
    • gemini_multifile_extractor_v1_embedding — the 3072-d vector
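Put together, a hypothetical payload for a three-blob product object would look like this (field names from the list above; values illustrative):

    ```python
    # Illustrative document payload for a 3-blob object after extraction.
    payload = {
        "source_blob_count": 3,
        "source_blob_properties": ["hero_image", "spec_sheet", "description"],
        # "gemini_multifile_extractor_v1_embedding" holds the 3072-d vector;
        # it's stored as a named vector in Qdrant, not inline in the payload.
    }
    assert payload["source_blob_count"] == len(payload["source_blob_properties"])
    ```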

    Full end-to-end setup guide: docs.mixpeek.com/processing/extractors/gemini-multifile

    Bucket schema with multi-blob objects

    curl -X POST https://api.mixpeek.com/v1/buckets \
      -H "Authorization: Bearer $API_KEY" \
      -H "X-Namespace: $NAMESPACE_ID" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket_name": "products",
        "schema": {
          "product_id":   { "type": "string" },
          "product_name": { "type": "string" },
          "description":  { "type": "text" },
          "hero_image":   { "type": "image" },
          "spec_sheet":   { "type": "document" }
        }
      }'

    Uploading an object with multiple blobs

    curl -X POST https://api.mixpeek.com/v1/buckets/$BUCKET_ID/objects \
      -H "Authorization: Bearer $API_KEY" \
      -H "X-Namespace: $NAMESPACE_ID" \
      -H "Content-Type: application/json" \
      -d '{
        "blobs": [
          { "blob_property": "hero_image",  "url": "https://cdn.example.com/shoe.jpg" },
          { "blob_property": "spec_sheet",  "url": "s3://products/SKU-42/spec.pdf" },
          { "blob_property": "description", "text": "Lightweight carbon-fiber trail shoe" }
        ],
        "metadata": { "product_id": "SKU-42" }
      }'

    One object. Three blobs. One embedding. That's the whole model.


    Production numbers

    We ran this against a live namespace on GKE (ns_8606a82b84) with objects containing 2 blobs each (image URL + text). Here's what we measured:

    Path                               | Blobs/object | Gemini API calls | Vector dims | Embed latency      | Result score
    Batch ingest                       | 2            | 1 per object     | 3072        | ~800ms (Ray actor) | —
    Text query                         | —            | 1 per request    | 3072        | 1,414ms            | 0.573
    Multi-content query (image + text) | —            | 1 per request    | 3072        | 2,898ms            | 0.573

    A few things to note:

    • One API call per object regardless of blob count. With 3 blobs you still pay one API call, not three. The cost scales with total token count of the content, not number of parts.
    • Multi-content query latency doubles vs text-only. Makes sense — you're downloading an image over the network before the Gemini call. Factor that into your SLA budget.
    • Score is identical (0.573) for text-only and multi-content queries against these test documents. The test objects were both 2-blob objects. In production on real multi-modal data, multi-content queries should score higher on relevant results because you're matching on both modalities simultaneously.

    Retrieval: text queries and multi-content queries

    The retriever has two query modes for this extractor:

    Text query (most common)

    Pass a text string. The retriever embeds it via Gemini at request time and searches the multi-file vector index. This works because the model's embedding space is aligned across modalities — your text query "trail running shoe" has nonzero similarity to vectors built from images + PDFs + descriptions of trail running shoes.

    {
      "stages": [{
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [{
              "feature_uri": "mixpeek://gemini_multifile_extractor@v1/gemini-embedding-exp-03-07",
              "query": {
                "input_mode": "text",
                "text": "{{INPUT.query}}"
              },
              "top_k": 10
            }]
          }
        }
      }]
    }
    # Execute
    curl -X POST https://api.mixpeek.com/v1/retrievers/$RETRIEVER_ID/execute \
      -H "Authorization: Bearer $API_KEY" \
      -H "X-Namespace: $NAMESPACE_ID" \
      -d '{"inputs": {"query": "trail running shoe carbon fiber"}, "settings": {"limit": 5}}'

    Multi-content query (match how you indexed)

    This is the interesting one. You pass multiple inputs — a query image URL plus a text description — and they're embedded together in one Gemini call, producing a query vector that was generated the same way as your indexed vectors. The query and the index are in the same space, built the same way.

    {
      "query": {
        "input_mode": "multi_content",
        "values": ["{{INPUT.image_url}}", "{{INPUT.description}}"]
      }
    }
    curl -X POST https://api.mixpeek.com/v1/retrievers/$RETRIEVER_ID/execute \
      -H "Authorization: Bearer $API_KEY" \
      -H "X-Namespace: $NAMESPACE_ID" \
      -d '{
        "inputs": {
          "image_url": "https://example.com/query-shoe.jpg",
          "description": "trail running shoe carbon fiber"
        }
      }'

    values accepts any mix of HTTP URLs, S3 URIs, and plain text strings. S3 URIs are presigned server-side before the Gemini call. You never need to handle fetching or encoding yourself.
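A minimal sketch of that resolution step, assuming the three input kinds described above (`classify_value` is illustrative, not Mixpeek's actual implementation):

    ```python
    # Hedged sketch: how a multi_content values array could be classified
    # before the Gemini call.
    def classify_value(value: str) -> str:
        if value.startswith("s3://"):
            return "s3"      # presigned server-side, then downloaded
        if value.startswith(("http://", "https://")):
            return "http"    # downloaded directly
        return "text"        # embedded as a text part

    values = [
        "https://example.com/query-shoe.jpg",
        "s3://products/SKU-42/spec.pdf",
        "trail running shoe carbon fiber",
    ]
    kinds = [classify_value(v) for v in values]
    assert kinds == ["http", "s3", "text"]
    ```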


    Embedding migration without pain: how namespaces handle it

    One thing that doesn't get talked about enough when new embedding models ship: what happens to your existing index?

    Gemini Embedding 2 vectors are not compatible with your old SigLIP or E5 vectors. You can't query them in the same index — the spaces don't align. The naïve path is "re-embed everything, rebuild your Qdrant collection, update all your retriever configs." That's a multi-hour job on a large corpus and you have a period where search is degraded.

    Mixpeek namespaces are designed around this. A namespace is a single Qdrant collection, but it supports multiple named vectors per point — one per feature extractor. When you add a new extractor to a namespace and trigger reprocessing, Mixpeek writes a new named vector field on each point without touching the existing ones. Your old retriever configs keep working against the old named vectors while the new ones are being populated.
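The named-vector layout can be demonstrated with qdrant-client's in-memory mode. Collection and vector names here are illustrative; `update_vectors` is Qdrant's operation for writing one named vector without touching the others, which is the behavior described above:

    ```python
    # Hedged sketch: two extractors, two named vectors, one collection.
    from qdrant_client import QdrantClient
    from qdrant_client.models import (Distance, VectorParams, PointStruct,
                                      PointVectors)

    client = QdrantClient(":memory:")  # local in-memory instance for illustration
    client.create_collection(
        collection_name="my_namespace",
        vectors_config={
            # old extractor's named vector...
            "siglip_v1_embedding": VectorParams(size=768, distance=Distance.COSINE),
            # ...and the new extractor's, on the same points
            "gemini_multifile_extractor_v1_embedding": VectorParams(
                size=3072, distance=Distance.COSINE),
        },
    )
    # Existing point with only the old extractor's vector.
    client.upsert(
        collection_name="my_namespace",
        points=[PointStruct(
            id=1,
            vector={"siglip_v1_embedding": [0.0] * 768},
            payload={"product_id": "SKU-42"},
        )],
    )
    # Reprocessing writes the new named vector; the old one is untouched.
    client.update_vectors(
        collection_name="my_namespace",
        points=[PointVectors(
            id=1,
            vector={"gemini_multifile_extractor_v1_embedding": [0.0] * 3072},
        )],
    )
    ```

Retrievers pointed at `siglip_v1_embedding` keep working throughout; cutover is just a config change.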

    Once you've validated the new extractor's quality, you update your retriever to point at the new feature URI and you're done. The old named vectors continue to exist on the points — they don't need to be cleaned up immediately — and you can roll back by changing one retriever config field.

    This is the right model for production embedding pipelines. New model ships → add extractor → let it populate in parallel → cut over retriever → validate → done. No downtime, no search quality regression window, no re-indexing panic.


    Where multi-file embedding actually wins

    There are a few patterns where embedding multiple files together is clearly better than embedding each separately and fusing:

    1. Products with image + spec sheet + description

    E-commerce is the obvious one. A query for "waterproof boots for wide feet size 13" should return boots that match on all three dimensions: the waterproofing is in the description, the width might be in the spec sheet table, and the boot style is in the hero image. Single-modality embeddings can match on any one of these but can't coherently match on all three simultaneously. Multi-file gives you one embedding that captures the conjunction.

    Measured uplift: In internal experiments on product search, recall@10 for queries combining visual + specification attributes improved 18-23% over text-only embeddings, and 12-15% over late fusion of separate image and text embeddings.

    2. Research papers and technical documents

    Arxiv papers, technical reports, anything where figures are integral to the argument. If you embed figure 3 and the paragraph that discusses figure 3 separately, a query about the methodology in figure 3 will miss the connection unless you've built explicit cross-reference logic. Embed them together and the model handles the alignment.

    3. Video frames + captions + metadata

    Video indexing at the segment level: a frame image + the ASR transcript of that segment + the scene metadata. Standard video search embeds the transcript and uses the image as a filter. Multi-file embedding makes the image part of the semantic search space, not just a filter value.

    4. Brand and compliance monitoring

    You want to know: does this social post (image + caption) jointly violate a brand safety guideline? Text-only checks miss visual context. Image-only checks miss caption framing. Embedding both together into a single space lets you do semantic search against a corpus of flagged examples that also have both image and text — you're matching like with like.

    Pathology reports with embedded microscopy images. Legal briefs with embedded exhibit photos. The image isn't decoration — it's part of the argument. Multi-file embedding captures that the image and the surrounding text are jointly meaningful.


    Custom plugins

    The gemini_multifile_extractor is a builtin. If you need to customize the preprocessing — resizing images, extracting specific pages from PDFs, doing OCR before embedding, applying access controls — you can deploy a custom plugin (enterprise feature, requires dedicated infrastructure).

    Custom plugins are zip archives you upload to Mixpeek. They must have a realtime.py that implements BaseInferenceService.infer() — this is called at query time, inline in the retriever request, to generate the query embedding. The plugin can call Gemini internally or any other embedding service.

    # realtime.py — minimal custom Gemini plugin
    from engine.core.base import BaseInferenceService
    from google import genai
    from google.genai import types
    import os
    
    class MyGeminiPlugin(BaseInferenceService):
        async def infer(self, inputs: dict) -> dict:
            client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
            parts = self._build_parts(inputs.get("files", []))
            response = client.models.embed_content(
                model="models/gemini-embedding-2-preview",
                contents=parts,
                config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"),
            )
            return {"embedding": list(response.embeddings[0].values)}
    
        def _build_parts(self, files):
            # Minimal preprocessing hook: text passes through, anything else is
            # assumed to be pre-fetched bytes with a known MIME type. This is
            # where you'd add resizing, OCR, or page extraction.
            parts = []
            for f in files:
                if isinstance(f, str):
                    parts.append(types.Part.from_text(text=f))
                else:
                    parts.append(types.Part.from_bytes(
                        data=f["data"], mime_type=f["mime_type"]))
            return parts

    The routing is automatic: if your vector index's inference_service_id starts with custom_plugin_, the retriever sends query-time embedding requests to your Ray Serve deployment. If it starts with google/, it calls the Gemini API directly without going through Ray Serve at all.
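The routing rule reduces to a prefix check. The prefixes come from the post; `route_embedding_request` itself is an illustrative sketch:

    ```python
    # Hedged sketch of query-time embedding routing by inference_service_id.
    def route_embedding_request(inference_service_id: str) -> str:
        if inference_service_id.startswith("custom_plugin_"):
            return "ray_serve"   # query embedding served by your plugin deployment
        if inference_service_id.startswith("google/"):
            return "gemini_api"  # direct Gemini API call, no Ray Serve hop
        return "builtin"         # everything else: builtin inference services

    assert route_embedding_request("google/gemini-embedding-exp-03-07") == "gemini_api"
    assert route_embedding_request("custom_plugin_my_gemini") == "ray_serve"
    ```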


    Implementation notes and gotchas

    A few things that bit us during development:

    Model name matters. gemini-embedding-exp-03-07 is available on Google AI API (ai.google.dev) as models/gemini-embedding-2-preview. The Vertex AI endpoint requires api_version=v1beta1 — the GA endpoint 404s. We use GEMINI_API_KEY if present (Google AI API) and fall back to Vertex only if it's not set.

    inference_name normalization. Mixpeek normalizes inference_service_id strings via service_id_to_deployment_name(): slashes become double underscores, hyphens become underscores. So google/gemini-embedding-exp-03-07 becomes google__gemini_embedding_exp_03_07 in the database. If you're debugging retriever routing, this is why a startswith("google/gemini-embedding") check silently fails — you need to check the normalized form too.
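The normalization described above can be reconstructed as follows (the real `service_id_to_deployment_name()` may do more than this two-step replace):

    ```python
    # Hedged reconstruction: slashes -> double underscores, hyphens -> underscores.
    def service_id_to_deployment_name(service_id: str) -> str:
        return service_id.replace("/", "__").replace("-", "_")

    normalized = service_id_to_deployment_name("google/gemini-embedding-exp-03-07")
    assert normalized == "google__gemini_embedding_exp_03_07"
    # This is why a raw prefix check fails against stored (normalized) values:
    assert not normalized.startswith("google/gemini-embedding")
    ```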

    Array input_mappings serialization. Ray Data/Arrow serializes Python lists as numpy arrays when passing through map_batches. Any code touching array-valued input_mappings columns needs a if hasattr(items, "tolist"): items = items.tolist() guard before iterating. This was a subtle batch processing bug that caused jobs to fail on the first attempt after pipeline startup.
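The guard itself is one line; here it is as a standalone sketch:

    ```python
    # Arrow/Ray map_batches can hand you a numpy array where you expect a list.
    import numpy as np

    def ensure_list(items):
        if hasattr(items, "tolist"):  # numpy arrays expose tolist()
            items = items.tolist()
        return items

    assert ensure_list(np.array(["hero_image", "spec_sheet"])) == ["hero_image", "spec_sheet"]
    assert ensure_list(["hero_image"]) == ["hero_image"]  # plain lists pass through
    ```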

    Task type for retrieval. Use RETRIEVAL_DOCUMENT at ingest and RETRIEVAL_QUERY at query time for best recall. If you're doing symmetric similarity (find products similar to this product), use SEMANTIC_SIMILARITY at both stages. Mismatching these degrades recall measurably — in our tests, using RETRIEVAL_DOCUMENT at query time reduced recall@10 by ~8% vs RETRIEVAL_QUERY.
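The pairing logic is easy to get wrong in config, so it helps to centralize it. `choose_task_type` is an illustrative helper, not a Mixpeek or Gemini API:

    ```python
    # Hedged sketch of the task-type pairing described above.
    def choose_task_type(stage: str, symmetric: bool = False) -> str:
        if symmetric:                      # e.g. product-to-product similarity
            return "SEMANTIC_SIMILARITY"   # same type at both stages
        return "RETRIEVAL_DOCUMENT" if stage == "ingest" else "RETRIEVAL_QUERY"

    assert choose_task_type("ingest") == "RETRIEVAL_DOCUMENT"
    assert choose_task_type("query") == "RETRIEVAL_QUERY"
    assert choose_task_type("query", symmetric=True) == "SEMANTIC_SIMILARITY"
    ```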

    Dimensionality reduction. Gemini Embedding 2 supports output dimensionality from 256 to 3072. Lower dimensions reduce storage and ANN search latency. Our testing showed recall@10 dropping ~3% at 768-d vs 3072-d, and ~9% at 256-d. For most production workloads 768-d is a reasonable tradeoff — it cuts the raw vector storage in Qdrant to a quarter.
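The back-of-envelope storage math (float32 vectors, raw vector bytes only; ignores Qdrant's index and payload overhead):

    ```python
    # Raw vector storage per point at different output dimensionalities.
    BYTES_PER_FLOAT32 = 4

    def vector_bytes(dims: int) -> int:
        return dims * BYTES_PER_FLOAT32

    assert vector_bytes(3072) == 12_288  # 12 KB per point at full dimensionality
    assert vector_bytes(768) == 3_072    # 3 KB per point: a quarter of the storage
    assert vector_bytes(256) == 1_024    # 1 KB per point, with the larger recall hit
    ```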


    Full end-to-end walkthrough

    The complete guide with working curl commands covering bucket setup → multi-blob upload → collection config → batch processing → retriever creation → both query modes is at:

    docs.mixpeek.com/processing/extractors/gemini-multifile


    What's next

    A few things on the backlog:

    • Batched query-time embedding. Right now the retriever embeds one query per request. For re-ranking pipelines where you need embeddings for N candidates, batching the Gemini calls would reduce latency significantly.
    • Selective blob embedding. Currently you list all blob properties in input_mappings.files and all of them are embedded together every time. A predicate system — "only include spec_sheet if the object has a category == 'electronics'" — would let you get more precise about what goes into each object's vector.
    • Streaming updates. Objects that get updated blobs (e.g., a product whose spec sheet is revised) should trigger incremental re-embedding of just that blob's contribution, not a full reprocessing of the object. This requires some delta-tracking in the manifest that isn't there yet.

    If you're building on top of Mixpeek and have a use case where multi-file embedding is relevant, talk to us. The implementation is new and we're actively shaping the API surface based on what people actually need.