owlvit-large-patch14
by google
Simple open-vocabulary object detection with Vision Transformers
google/owlvit-large-patch14
Overview
OWL-ViT transfers image-text pre-trained models to open-vocabulary object detection using a standard ViT with minimal modifications. It supports both text-conditioned zero-shot detection and one-shot image-conditioned detection.
On Mixpeek, OWL-ViT provides a detection model that scales cleanly: accuracy improves consistently with larger pre-trained backbones and more pre-training data.
Architecture
Plain Vision Transformer (ViT-L/14) pre-trained with contrastive image-text learning, then fine-tuned end-to-end for detection. No detection-specific backbone changes needed.
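The detection head can be sketched as follows: each ViT output token predicts one bounding box plus a class embedding, and a query embedding (from the text encoder) is scored against every class embedding. This is a minimal illustration with toy numbers and made-up embedding values, not the actual model weights or dimensions.

```typescript
// Sketch of OWL-ViT's per-token detection head (toy data, assumed shapes).
type Box = { cx: number; cy: number; w: number; h: number };

function dot(a: number[], b: number[]): number {
  return a.reduce((s, ai, i) => s + ai * b[i], 0);
}

// Toy per-token class embeddings (the real model has one per image patch,
// at a much higher dimensionality).
const classEmbeddings: number[][] = [
  [0.9, 0.1, 0.0], // token 0
  [0.1, 0.9, 0.0], // token 1
];
// Toy per-token box predictions in normalized image coordinates.
const boxes: Box[] = [
  { cx: 0.25, cy: 0.25, w: 0.2, h: 0.2 },
  { cx: 0.75, cy: 0.75, w: 0.3, h: 0.3 },
];

// Hypothetical embedding of a text query such as "a photo of a dog".
const queryEmbedding = [0.0, 1.0, 0.0];

// Score every predicted box against the query; keep the best match.
const scores = classEmbeddings.map((e) => dot(e, queryEmbedding));
const best = scores.indexOf(Math.max(...scores));
console.log(boxes[best], scores[best]);
```

In practice a score threshold is applied per query rather than taking a single best box, so one text prompt can match multiple objects in the image.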
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/image.jpg" },
  feature_extractors: [{
    name: "object_detection",
    version: "v1",
    params: { model_id: "google/owlvit-large-patch14" }
  }]
});

Capabilities
- Zero-shot text-conditioned object detection
- One-shot image-conditioned detection
- Consistent scaling with model and data size
- Standard ViT architecture, minimal modifications
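One-shot image-conditioned detection can be sketched the same way: instead of a text embedding, the query vector is the class embedding of an exemplar object taken from a query image, and target-image tokens are ranked by similarity to it. The embedding values below are toy assumptions for illustration only.

```typescript
// One-shot image-conditioned detection (sketch with toy embeddings).
function cosine(a: number[], b: number[]): number {
  const d = a.reduce((s, ai, i) => s + ai * b[i], 0);
  return d / (Math.hypot(...a) * Math.hypot(...b));
}

// Class embedding of the exemplar object cropped from the query image.
const exemplar = [0.8, 0.2, 0.1];

// Per-token class embeddings extracted from the target image.
const targetEmbeddings: number[][] = [
  [0.79, 0.21, 0.08], // close to the exemplar
  [0.05, 0.1, 0.95],  // unrelated object
];

// Rank target tokens by similarity to the exemplar; the best one is the match.
const sims = targetEmbeddings.map((e) => cosine(e, exemplar));
const match = sims.indexOf(Math.max(...sims));
console.log(match, sims[match]);
```

This is why no text prompt is needed for one-shot detection: the query image itself supplies the embedding being matched.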
Use Cases on Mixpeek
Specification
Research Paper
Simple Open-Vocabulary Object Detection with Vision Transformers
arxiv.org
Build a pipeline with owlvit-large-patch14
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
Open Pipeline Builder