vjepa2-vitl-fpc64-256
by facebook
Self-supervised video encoder for retrieval, classification, and VLM perception
facebook/vjepa2-vitl-fpc64-256
mixpeek://video_extractor@v1/facebook_vjepa2_vitl_fpc64_256_v1
Overview
V-JEPA 2 is Meta FAIR's video representation model trained with a joint embedding predictive architecture. Instead of treating video as independent frames, it learns representations that preserve temporal structure, motion, and object dynamics.
On Mixpeek, V-JEPA 2 is useful as a video feature extractor before retrieval or classification. It gives agents and search systems a compact representation of what happens over time, not just what appears in a sampled keyframe.
Architecture
Vision Transformer video encoder. The ViT-L FPC64 checkpoint samples 64 frames and exposes get_vision_features through Transformers. It can also encode still images by repeating the image across the expected frame dimension.
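The frame handling described above can be sketched in a few lines. This is a minimal illustration, not the model's actual preprocessing: the helper names sampleFrameIndices and repeatImageAsClip are hypothetical, and evenly spaced sampling is an assumption about how 64 frames might be chosen from a longer clip.

```typescript
// Hypothetical helper: pick `numFrames` evenly spaced frame indices from a clip.
// Even spacing is an assumption; the real processor may sample differently.
function sampleFrameIndices(totalFrames: number, numFrames: number = 64): number[] {
  if (totalFrames < 1 || numFrames < 1) {
    throw new Error("totalFrames and numFrames must be positive");
  }
  // Spread indices from the first frame (0) to the last (totalFrames - 1).
  const step = numFrames > 1 ? (totalFrames - 1) / (numFrames - 1) : 0;
  const indices: number[] = [];
  for (let i = 0; i < numFrames; i++) {
    indices.push(Math.round(i * step));
  }
  return indices;
}

// Hypothetical helper: turn a still image into a clip by repeating it
// along the frame dimension, as the text describes for image inputs.
function repeatImageAsClip<T>(image: T, numFrames: number = 64): T[] {
  const clip: T[] = [];
  for (let i = 0; i < numFrames; i++) {
    clip.push(image);
  }
  return clip;
}
```

For a clip shorter than 64 frames, this sketch simply repeats some indices, which mirrors the still-image case where every position holds the same frame.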
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "video_embedding",
    version: "v1",
    parameters: { model_id: "facebook/vjepa2-vitl-fpc64-256" },
  },
});
Capabilities
- Video feature extraction from 64-frame clips
- Temporal representation for retrieval and classification
- Can serve as a video encoder for downstream VLMs
- MIT license
Performance
Use V-JEPA 2 as the video feature stage, then rerank the results with captions or transcripts when precision matters.
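The two-stage pattern above can be sketched as follows. This is an illustrative outline under stated assumptions, not Mixpeek's retrieval API: the Candidate shape, the cosine shortlist, and the simple term-overlap reranker are all hypothetical stand-ins for whatever embedding store and text scorer a real pipeline uses.

```typescript
// Hypothetical record: a video clip with its embedding and transcript.
interface Candidate {
  id: string;
  embedding: number[];
  transcript: string;
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// Cosine similarity; returns 0 when either vector has zero norm.
function cosine(a: number[], b: number[]): number {
  const denom = Math.sqrt(dot(a, a) * dot(b, b));
  return denom === 0 ? 0 : dot(a, b) / denom;
}

// Fraction of query terms that appear in the text (a deliberately simple
// stand-in for a real caption or transcript reranker).
function termOverlap(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const words = text.toLowerCase().split(/\W+/);
  const hits = terms.filter((t) => words.indexOf(t) !== -1).length;
  return hits / terms.length;
}

// Stage 1: shortlist the top-k candidates by embedding similarity.
// Stage 2: reorder the shortlist by transcript overlap with the query text.
function retrieveThenRerank(
  queryEmbedding: number[],
  queryText: string,
  candidates: Candidate[],
  k: number = 10
): Candidate[] {
  const shortlist = candidates
    .slice()
    .sort((x, y) => cosine(queryEmbedding, y.embedding) - cosine(queryEmbedding, x.embedding))
    .slice(0, k);
  return shortlist
    .map((c) => ({ c, score: termOverlap(queryText, c.transcript) }))
    .sort((x, y) => y.score - x.score)
    .map((entry) => entry.c);
}
```

Keeping the shortlist small means the text-based pass only runs over clips the video embedding already considers relevant, which is where the precision gain is expected to come from.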
Research Paper
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
arxiv.org
Build a pipeline with vjepa2-vitl-fpc64-256
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.