Guides
Vendor-neutral, engineer-written guides to the concepts behind multimodal AI — perception, retrieval, embeddings, and the infrastructure agents use to see, hear, and search unstructured data. Learn the idea first; then see how Mixpeek applies it.
79 guides across 14 topics
Semantic Caching: How Agents Skip Work They Have Already Done
A vendor-neutral guide to caching by meaning instead of by exact string. Covers why hash-based caches almost never hit on agent traffic, how a semantic cache is really a tiny vector index of query embeddings, the similarity-threshold precision/recall tradeoff that makes or breaks it, the failure modes (false hits, staleness, negation and entity flips), invalidation strategies, and how to cache retrieval results and tool calls, not just answers, for agents that fan out many near-duplicate queries.
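As a rough illustration of the "tiny vector index of query embeddings" idea, the sketch below stores (embedding, value) pairs and returns a cached value only when the best cosine similarity clears a threshold. The class name `SemanticCache`, the 0.95 default threshold, and the hand-written toy vectors are assumptions made for the example; a real system would get embeddings from an embedding model and would tune the threshold against observed false hits.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical minimal cache keyed by meaning (embedding), not by exact string."""

    def __init__(self, threshold=0.95):
        # Higher threshold: fewer false hits, but fewer hits overall.
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_value)

    def put(self, embedding, value):
        self.entries.append((list(embedding), value))

    def get(self, embedding):
        """Return the value of the nearest stored query, or None on a miss."""
        best_score, best_value = -1.0, None
        for stored, value in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_value = score, value
        return best_value if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.0], "cached answer")
near_duplicate = cache.get([0.99, 0.05, 0.0])  # close in meaning: hit
unrelated = cache.get([0.0, 1.0, 0.0])         # far away: miss
```

The linear scan is only reasonable for small caches; the same lookup could sit behind any approximate nearest-neighbor index.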
Efficient Attention: How Models Read Hour-Long Video and Book-Length Documents
A vendor-neutral guide to the attention tricks that make million-token multimodal context possible. Covers why dense attention hits a quadratic wall, sliding-window and block-sparse patterns (Longformer, BigBird), learned/dynamic sparsity like MiniMax Sparse Attention, linear and state-space attention (the QK^T V reordering, Mamba-style recurrence), the hybrid full-plus-sparse layer stack, and why efficient attention and retrieval are complements, not rivals, for agents that have to watch whole videos or read whole corpora.
Optical Context Compression: Reading Documents as Images, Not Text
A vendor-neutral guide to the 2026 idea that a picture of text can be cheaper than the text itself. Covers optical 2D mapping, the DeepEncoder architecture (SAM window attention, a 16x bridge compressor, CLIP global attention), the compression-vs-precision curve (about 97% decoding precision at under 10x compression, ~60% at 20x), the OmniDocBench results that made people notice, and why vision-token compression matters for agents that have to read long documents.
Multi-Object Tracking: How Agents Follow Objects Across Video Frames
A vendor-neutral guide to tracking-by-detection — motion prediction with a Kalman filter, data association via IoU and the Hungarian algorithm, ByteTrack's low-score recovery, appearance re-identification, and the ID-switch problem — the pipeline that turns per-frame detections into stable per-object tracks an agent can reason over across time.
Monocular Depth Estimation: How Models Infer 3D From a Single Image
A vendor-neutral guide to inferring depth from one 2D image — why the problem is ill-posed, the pictorial cues models learn, relative vs metric depth and scale ambiguity, how self-supervision trains depth without ground truth, and why a depth channel lets an agent reason about scene geometry it otherwise can't see.
Instance-Level Visual Matching: Finding the Same Object, Not Just Similar Ones
A vendor-neutral guide to geometric visual matching — keypoint detection, local descriptors, descriptor matching, and RANSAC geometric verification — the pipeline an agent uses to confirm two images contain the *same* physical object or scene, which a similarity embedding cannot decide on its own.
Face Recognition and Identity Clustering: How Agents Recognize and Group People in Video
A vendor-neutral walk through the face pipeline an agent uses to answer 'who is in this footage?' and 'find every clip with this person' — detection, alignment, metric-learned embeddings (ArcFace's angular margin), verification vs identification, and the unsupervised identity-clustering problem.
Reasoning Rerankers: How Listwise LLM Rerankers Reorder Retrieval Results
How listwise LLM rerankers (RankGPT-style) and reasoning rerankers (Qwen3-Reranker, Nemotron-style) reorder candidate sets by generating a permutation rather than scoring documents independently, why considering the whole list at once captures signals pointwise cross-encoders miss, the sliding-window strategy, positional bias and its fixes, distillation into cheap rerankers, and budget-aware per-query reranker selection for agents.
Retrieval Feedback Loops: Learning to Rank from Clicks, Outcomes, and Agent Interactions
How a ranked list becomes a hypothesis that interactions test, why naive 'clicked = relevant' is wrong, and how click models, counterfactual learning-to-rank, and online reranking close the loop so agentic search gets better from its own outcomes.
Matryoshka Representation Learning: Nested Embeddings for Adaptive Multimodal Retrieval
How a single embedding model can produce vectors that stay useful when truncated to fewer dimensions, and how AI agents exploit nested embeddings to run fast, coarse shortlists followed by precise full-dimension reranks over huge unstructured corpora.
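The coarse-then-precise pattern can be sketched in a few lines: shortlist with the first few dimensions of each vector, then rerank only the shortlist with the full vector. The function names, the toy 4-dimensional vectors, and the parameter values are assumptions for illustration; truncation is only meaningful for embeddings that were actually trained with a Matryoshka-style objective.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def truncate(v, dims):
    """Keep the first `dims` dimensions and re-normalize the prefix."""
    return normalize(v[:dims])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, corpus, coarse_dims, shortlist_size, top_k):
    """corpus is a list of (doc_id, full_vector) pairs."""
    # Stage 1: cheap shortlist using only the leading dimensions.
    q_coarse = truncate(query, coarse_dims)
    shortlist = sorted(
        corpus,
        key=lambda item: dot(q_coarse, truncate(item[1], coarse_dims)),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: precise rerank of the shortlist using every dimension.
    q_full = normalize(query)
    reranked = sorted(
        shortlist,
        key=lambda item: dot(q_full, normalize(item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in reranked[:top_k]]


corpus = [
    ("a", [1.0, 0.0, 0.0, 0.0]),
    ("b", [0.9, 0.1, 0.4, 0.0]),
    ("c", [0.0, 1.0, 0.0, 0.0]),
]
results = two_stage_search(
    [1.0, 0.0, 0.1, 0.0], corpus, coarse_dims=2, shortlist_size=2, top_k=2
)
```

In practice the truncated vectors would be precomputed and indexed rather than sliced per query, as they are here for brevity.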
Filtered Vector Search: How Agents Combine Similarity with Hard Constraints
Almost every agentic query is a vector search plus a constraint: 'clips from campaign X after May', 'images of red cars in the EU bucket'. This guide explains the three filtering strategies (pre-filter, post-filter, in-place predicate-aware traversal), why each one silently breaks recall or latency at different selectivities, and how a query planner picks between them.
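A toy brute-force sketch of the first two strategies shows how post-filtering can silently lose recall when the filter is selective. The item layout (`id`, `vec`, `meta`), the function names, and the `overfetch` parameter are assumptions for the example; in-place predicate-aware traversal depends on the index structure and is not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pre_filter_search(query, items, predicate, k):
    """Apply the constraint first, then rank only the survivors."""
    candidates = [it for it in items if predicate(it["meta"])]
    candidates.sort(key=lambda it: dot(query, it["vec"]), reverse=True)
    return [it["id"] for it in candidates[:k]]


def post_filter_search(query, items, predicate, k, overfetch=1):
    """Rank everything, keep the top k * overfetch, then apply the constraint."""
    ranked = sorted(items, key=lambda it: dot(query, it["vec"]), reverse=True)
    top = ranked[: k * overfetch]
    return [it["id"] for it in top if predicate(it["meta"])][:k]


items = [
    {"id": 1, "vec": [1.0, 0.0], "meta": {"region": "US"}},
    {"id": 2, "vec": [0.9, 0.1], "meta": {"region": "US"}},
    {"id": 3, "vec": [0.5, 0.5], "meta": {"region": "EU"}},
    {"id": 4, "vec": [0.1, 0.9], "meta": {"region": "EU"}},
]


def in_eu(meta):
    return meta["region"] == "EU"


query = [1.0, 0.0]
pre = pre_filter_search(query, items, in_eu, k=2)
post_naive = post_filter_search(query, items, in_eu, k=2)
post_overfetched = post_filter_search(query, items, in_eu, k=2, overfetch=2)
```

Here the naive post-filter returns nothing because the two nearest vectors both fail the constraint, while overfetching recovers the matches at the cost of scoring and filtering more candidates.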
How Vision-Language Models Fuse Image and Text Tokens
A VLM is the component that lets an agent actually see: it turns pixels into tokens an LLM can reason over alongside words. This guide opens up the architecture: how a vision encoder produces patch features, how a projector or resampler turns them into language tokens, and the real fusion strategies (prefix concatenation, cross-attention, Q-Former resampling) that decide whether your agent reads a frame accurately or hallucinates over it.
From concept to production
These guides explain how multimodal perception and retrieval actually work. Mixpeek is the platform that runs them — point it at your storage and get back relevant, timestamped results.