YOLO-World-L
by AILab-CVC
Real-time open-vocabulary object detection with text prompts
mixpeek://image_extractor@v1/tencent_yoloworld_large_v1
Overview
YOLO-World extends the YOLO detector family with open-vocabulary detection via vision-language modeling. Users specify objects to detect with text prompts; the model finds them zero-shot at real-time speeds (52 FPS on V100).
On Mixpeek, YOLO-World enables detecting arbitrary objects in video and images using natural language, without retraining for each new category.
Architecture
A YOLO backbone paired with a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN). Training uses a region-text contrastive loss; at inference, a prompt-then-detect paradigm re-parameterizes the user's vocabulary into model weights, so text encoding happens once up front rather than on every frame.
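The core of open-vocabulary detection is scoring each detected region's embedding against the embeddings of the text prompts. The sketch below illustrates that matching step with cosine similarity; the embedding values, dimensions, and function names are illustrative assumptions, not YOLO-World's actual internals.

```typescript
// Illustrative sketch of region-text matching: each region embedding is
// scored against the prompt-text embeddings by cosine similarity, and the
// best-matching prompt becomes the region's label. Toy 3-d embeddings are
// used here; real models use much higher-dimensional vectors.

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function matchRegion(
  regionEmbedding: number[],
  vocabulary: { label: string; embedding: number[] }[]
): { label: string; score: number } {
  let best = { label: "", score: -Infinity };
  for (const entry of vocabulary) {
    const score = cosine(regionEmbedding, entry.embedding);
    if (score > best.score) best = { label: entry.label, score };
  }
  return best;
}

// Toy vocabulary: in the prompt-then-detect paradigm these text embeddings
// are computed once and baked into the model before inference.
const vocab = [
  { label: "dog", embedding: [0.9, 0.1, 0.0] },
  { label: "bicycle", embedding: [0.0, 0.8, 0.6] },
];
const region = [0.85, 0.15, 0.05]; // embedding of one detected region
console.log(matchRegion(region, vocab).label); // "dog"
```

Because the vocabulary embeddings are fixed ahead of time, per-frame work reduces to the similarity scoring above, which is what makes real-time open-vocabulary detection feasible.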
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";
const mx = new Mixpeek({ apiKey: "API_KEY" });
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [{
    name: "object_detection",
    version: "v1",
    params: { model_id: "AILab-CVC/YOLO-World-L" }
  }]
});

Capabilities
- Open-vocabulary detection with text prompts
- 52 FPS on V100 (real-time)
- 35.4 AP on LVIS zero-shot
- Supports image-prompted detection
- ONNX and TFLite INT8 export
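Downstream code typically thresholds and groups the detections a run produces. The helper below sketches that step; the `Detection` shape (label, score, `[x, y, w, h]` box) is an assumption for illustration and may not match the exact schema Mixpeek's extractor returns.

```typescript
// Minimal post-processing sketch, assuming each detection carries a label,
// a confidence score, and an [x, y, width, height] bounding box. The real
// extractor output schema may differ.

interface Detection {
  label: string;
  score: number;
  bbox: [number, number, number, number]; // x, y, width, height
}

// Keep only detections above a confidence threshold, grouped by label.
function filterByConfidence(
  detections: Detection[],
  threshold: number
): Map<string, Detection[]> {
  const byLabel = new Map<string, Detection[]>();
  for (const d of detections) {
    if (d.score < threshold) continue;
    const bucket = byLabel.get(d.label) ?? [];
    bucket.push(d);
    byLabel.set(d.label, bucket);
  }
  return byLabel;
}

// Example: two prompted classes, one low-confidence duplicate discarded.
const detections: Detection[] = [
  { label: "forklift", score: 0.91, bbox: [10, 20, 50, 80] },
  { label: "forklift", score: 0.32, bbox: [200, 40, 60, 90] },
  { label: "hard hat", score: 0.77, bbox: [15, 5, 20, 20] },
];
const kept = filterByConfidence(detections, 0.5);
console.log(kept.get("forklift")?.length); // 1
```

A threshold around 0.3 to 0.5 is a common starting point for open-vocabulary detectors, since zero-shot scores tend to run lower than those of closed-set models.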
Specification
Research Paper
YOLO-World: Real-Time Open-Vocabulary Object Detection
arxiv.org

Build a pipeline with YOLO-World-L
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
Open Pipeline Builder