    owlvit-large-patch14

    by google

    Simple open-vocabulary object detection with Vision Transformers

    580K downloads/month
    ~300M parameters
    Identifiers

    Model ID: google/owlvit-large-patch14
    Feature URI: mixpeek://image_extractor@v1/google_owlvit_large_v1
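
    The Feature URI appears to follow the shape mixpeek://<extractor>@<version>/<feature>. The sketch below splits it into those parts. The component names and the parseFeatureUri helper are assumptions made for illustration, not a documented Mixpeek schema or SDK function.

```typescript
// Hypothetical helper: the component names are an interpretation of the
// URI's visible shape, not a documented Mixpeek schema.
interface FeatureUri {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "google_owlvit_large_v1"
}

function parseFeatureUri(uri: string): FeatureUri {
  // Assumed shape: mixpeek://<extractor>@<version>/<feature>
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Not a Mixpeek feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parsedUri = parseFeatureUri(
  "mixpeek://image_extractor@v1/google_owlvit_large_v1"
);
console.log(parsedUri.extractor, parsedUri.version, parsedUri.feature);
```

    A regular expression is used here rather than the built-in URL class, which would read the part before "@" as a username and the part after it as a host.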

    Overview

    OWL-ViT transfers image-text pre-trained models to open-vocabulary object detection using a standard ViT with minimal modifications. It supports both text-conditioned zero-shot detection and one-shot image-conditioned detection.

    On Mixpeek, OWL-ViT serves as a simple detection model whose accuracy improves consistently with larger pre-trained backbones and more training data.

    Architecture

    A plain Vision Transformer (ViT-L/14) is pre-trained with contrastive image-text learning, then fine-tuned end-to-end for detection. No detection-specific changes to the backbone are needed.
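
    As a rough illustration of how open-vocabulary matching can work on top of such a backbone, the sketch below scores each image token embedding against each query embedding with cosine similarity and keeps the best query per token. A query may come from text (zero-shot) or from a reference image (one-shot); the scoring step is the same. This is a simplified toy, not the model's implementation: the vectors are made up, the matchTokens helper is hypothetical, and the fixed scale stands in for parameters the real model learns.

```typescript
// Toy sketch of open-vocabulary scoring. Each image token carries an
// embedding, and each query (text or reference image) is an embedding
// in the same space.
type Vec = number[];

function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

interface TokenMatch {
  token: number; // index of the image token
  query: number; // index of the best-matching query
  score: number; // sigmoid of the scaled cosine similarity
}

// For every image token, pick the best-matching query and keep the pair
// if its score clears the threshold. `scale` is a hypothetical constant.
function matchTokens(
  tokens: Vec[],
  queries: Vec[],
  threshold = 0.9,
  scale = 5
): TokenMatch[] {
  const matches: TokenMatch[] = [];
  tokens.forEach((t, token) => {
    let best = { query: -1, score: -Infinity };
    queries.forEach((q, query) => {
      const score = sigmoid(scale * cosine(t, q));
      if (score > best.score) best = { query, score };
    });
    if (best.score >= threshold) matches.push({ token, ...best });
  });
  return matches;
}

// Made-up 2-D embeddings: tokens 0 and 1 align with queries 0 and 1,
// while token 2 points away from both queries and is dropped.
const demoTokens: Vec[] = [[1, 0], [0, 1], [-1, -1]];
const demoQueries: Vec[] = [[1, 0], [0, 1]];
console.log(matchTokens(demoTokens, demoQueries));
```

    In a full detector each kept token would also contribute a predicted bounding box; the sketch covers only the label-matching half.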

    Mixpeek SDK Integration

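    import { Mixpeek } from "mixpeek";
    
    // "API_KEY" is a placeholder; substitute your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Ingest one image and run the object detection extractor on it,
    // selecting OWL-ViT by its model ID.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/image.jpg" },
      feature_extractors: [{
        name: "object_detection",
        version: "v1",
        params: { model_id: "google/owlvit-large-patch14" }
      }]
    });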

    Capabilities

    • Zero-shot text-conditioned object detection
    • One-shot image-conditioned detection
    • Consistent scaling with model and data size
    • Standard ViT architecture, minimal modifications

    Use Cases on Mixpeek

    • Detecting objects from text descriptions in images and video
    • One-shot detection using a reference image
    • Scalable visual search with text queries

    Specification

    Framework: Hugging Face (HF)
    Organization: google
    Feature: Object Detection
    Output: bbox + label
    Modalities: video, image
    Retriever: Object Filter
    Parameters: ~300M
    License: Apache 2.0
    Downloads/mo: 580K
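
    To make the "bbox + label" output and the Object Filter retriever more concrete, the sketch below models a detection record and filters a list of detections by label and confidence. The field names, the score field, the bounding-box layout, and the filterDetections helper are all assumptions for illustration; the actual Mixpeek response schema may differ.

```typescript
// Hypothetical detection record: field names are illustrative and not
// taken from Mixpeek's schema. The spec lists only "bbox + label";
// the score field is an added assumption.
interface Detection {
  label: string;
  score: number; // assumed confidence in [0, 1]
  bbox: [number, number, number, number]; // assumed [xMin, yMin, xMax, yMax]
}

// Keep detections whose label is in `labels` (case-insensitive) and whose
// score is at least `minScore`, ordered from most to least confident.
function filterDetections(
  detections: Detection[],
  labels: string[],
  minScore = 0.3
): Detection[] {
  const wanted = new Set(labels.map((l) => l.toLowerCase()));
  return detections
    .filter((d) => wanted.has(d.label.toLowerCase()) && d.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

// Made-up sample data.
const sampleDetections: Detection[] = [
  { label: "dog", score: 0.42, bbox: [10, 20, 110, 220] },
  { label: "Dog", score: 0.91, bbox: [200, 40, 380, 300] },
  { label: "cat", score: 0.88, bbox: [50, 60, 150, 160] },
  { label: "dog", score: 0.12, bbox: [0, 0, 30, 30] },
];
console.log(filterDetections(sampleDetections, ["dog"]));
```

    Sorting by score after filtering keeps the most confident matches first, which is usually the order a search result list wants.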

    Research Paper

    Simple Open-Vocabulary Object Detection with Vision Transformers (arxiv.org)

    Build a pipeline with owlvit-large-patch14

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
