Guides

Vendor-neutral, engineer-written guides to the concepts behind multimodal AI — perception, retrieval, embeddings, and the infrastructure agents use to see, hear, and search unstructured data. Learn the idea first; then see how Mixpeek applies it.

79 guides across 14 topics

Retrieval

15 min read

Semantic Caching: How Agents Skip Work They Have Already Done

A vendor-neutral guide to caching by meaning instead of by exact string. Covers why hash-based caches almost never hit on agent traffic, how a semantic cache is really a tiny vector index of query embeddings, the similarity-threshold precision/recall tradeoff that makes or breaks it, the failure modes (false hits, staleness, negation and entity flips), invalidation strategies, and how to cache retrieval results and tool calls, not just answers, for agents that fan out many near-duplicate queries.

Semantic Cache

Retrieval

Agent Infrastructure

Jul 2026

Architecture

17 min read

Efficient Attention: How Models Read Hour-Long Video and Book-Length Documents

A vendor-neutral guide to the attention tricks that make million-token multimodal context possible. Covers why dense attention hits a quadratic wall, sliding-window and block-sparse patterns (Longformer, BigBird), learned/dynamic sparsity like MiniMax Sparse Attention, linear and state-space attention (the QK^T V reordering, Mamba-style recurrence), the hybrid full-plus-sparse layer stack, and why efficient attention and retrieval are complements, not rivals, for agents that have to watch whole videos or read whole corpora.

Sparse Attention

Linear Attention

Long Context

Jul 2026

Document Understanding

16 min read

Optical Context Compression: Reading Documents as Images, Not Text

A vendor-neutral guide to the 2026 idea that a picture of text can be cheaper than the text itself. Covers optical 2D mapping, the DeepEncoder architecture (SAM window attention, a 16x bridge compressor, CLIP global attention), the compression-vs-precision curve (about 97% decoding precision at under 10x compression, falling to ~60% at 20x), the OmniDocBench results that made people notice, and why vision-token compression matters for agents that have to read long documents.

Optical Compression

Vision Tokens

DeepSeek-OCR

Jul 2026

Perception

16 min read

Multi-Object Tracking: How Agents Follow Objects Across Video Frames

A vendor-neutral guide to tracking-by-detection — motion prediction with a Kalman filter, data association via IoU and the Hungarian algorithm, ByteTrack's low-score recovery, appearance re-identification, and the ID-switch problem — the pipeline that turns per-frame detections into stable per-object tracks an agent can reason over across time.

Multi-Object Tracking

ByteTrack

Kalman Filter

Jul 2026

Perception

15 min read

Monocular Depth Estimation: How Models Infer 3D From a Single Image

A vendor-neutral guide to inferring depth from one 2D image — why the problem is ill-posed, the pictorial cues models learn, relative vs metric depth and scale ambiguity, how self-supervision trains depth without ground truth, and why a depth channel lets an agent reason about scene geometry it otherwise can't see.

Depth Estimation

Monocular Depth

Depth Anything

Jul 2026

Perception

16 min read

Instance-Level Visual Matching: Finding the Same Object, Not Just Similar Ones

A vendor-neutral guide to geometric visual matching — keypoint detection, local descriptors, descriptor matching, and RANSAC geometric verification — the pipeline an agent uses to confirm two images contain the *same* physical object or scene, which a similarity embedding cannot decide on its own.

Keypoint Matching

SuperPoint

RANSAC

Jun 2026

Perception

17 min read

Face Recognition and Identity Clustering: How Agents Recognize and Group People in Video

A vendor-neutral walk through the face pipeline an agent uses to answer 'who is in this footage?' and 'find every clip with this person' — detection, alignment, metric-learned embeddings (ArcFace's angular margin), verification vs identification, and the unsupervised identity-clustering problem.

Face Recognition

ArcFace

Metric Learning

Jun 2026

Retrieval

20 min read

Reasoning Rerankers: How Listwise LLM Rerankers Reorder Retrieval Results

How listwise LLM rerankers (RankGPT-style) and reasoning rerankers (Qwen3-Reranker, Nemotron-style) reorder candidate sets by generating a permutation rather than scoring documents independently, why considering the whole list at once captures signals pointwise cross-encoders miss, the sliding-window strategy, positional bias and its fixes, distillation into cheap rerankers, and budget-aware per-query reranker selection for agents.

Listwise Reranking

LLM Reranker

RankGPT

Jun 2026

Retrieval

21 min read

Retrieval Feedback Loops: Learning to Rank from Clicks, Outcomes, and Agent Interactions

How a ranked list becomes a hypothesis that interactions test, why naive 'clicked = relevant' is wrong, and how click models, counterfactual learning-to-rank, and online reranking close the loop so agentic search gets better from its own outcomes.

Feedback Loops

Learning to Rank

Click Models

Jun 2026

Embeddings

20 min read

Matryoshka Representation Learning: Nested Embeddings for Adaptive Multimodal Retrieval

How a single embedding model can produce vectors that stay useful when truncated to fewer dimensions, and how AI agents exploit nested embeddings to run fast coarse shortlists then precise full-dimension reranks over huge unstructured corpora.

Matryoshka

Nested Embeddings

Adaptive Retrieval

Jun 2026

Retrieval

21 min read

Filtered Vector Search: How Agents Combine Similarity with Hard Constraints

Almost every agentic query is a vector search plus a constraint: 'clips from campaign X after May', 'images of red cars in the EU bucket'. This guide explains the three filtering strategies (pre-filter, post-filter, in-place predicate-aware traversal), why each one silently breaks recall or latency at different selectivities, and how a query planner picks between them.

Filtered Search

Vector Search

HNSW

Jun 2026

Agent Perception

18 min read

How Vision-Language Models Fuse Image and Text Tokens

A VLM is the component that lets an agent actually see: it turns pixels into tokens an LLM can reason over alongside words. This guide opens up the architecture: how a vision encoder produces patch features, how a projector or resampler turns them into tokens the language model can consume, and the real fusion strategies (prefix concatenation, cross-attention, Q-Former resampling) that decide whether your agent reads a frame accurately or hallucinates over it.

Vision-Language Models

VLM

Multimodal Fusion

Jun 2026

From concept to production

These guides explain how multimodal perception and retrieval actually work. Mixpeek is the platform that runs them — point it at your storage and get back relevant, timestamped results.
