The embedding step turns each chunk into a vector. ContextFlow does it locally with all-MiniLM-L6-v2: 384 dimensions, no API, no per-call cost, no data leaving the machine. For document retrieval, that’s not a compromise; it’s the right default.
// 01 — SMALL MODEL, LOCAL
all-MiniLM-L6-v2 produces 384-dimensional vectors and runs comfortably on CPU. Compared with a large hosted model, you give up some precision that retrieval mostly doesn’t need. In exchange you get zero API cost, zero rate limits, no data sent to a third party, and full offline operation. For a private corpus of academic PDFs, “nothing leaves the box” is a feature you can’t buy back later.
// 02 — L2 NORMALIZATION
Embeddings are L2-normalized, which makes cosine similarity reduce to a dot product. Once every vector has length 1, the cosine of the angle between two vectors (the measure of their semantic closeness) is just their dot product: cheaper to compute, and exactly what the vector store wants for a cosine collection. A small math choice that makes every query faster.
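To make the identity concrete: cosine similarity is cos θ = (a · b) / (‖a‖ ‖b‖), so when ‖a‖ = ‖b‖ = 1 the denominator disappears. The sketch below checks that numerically with NumPy. The random vectors are stand-ins for real embeddings, and the helper names l2_normalize and cosine_similarity are illustrative, not taken from ContextFlow.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length; eps guards against division by zero."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Full cosine formula, kept only to compare against the shortcut."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Assumption: random data standing in for four 384-dimensional chunk embeddings.
rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 384))
unit = l2_normalize(raw)

full = cosine_similarity(raw[0], raw[1])   # divides by both norms
shortcut = float(unit[0] @ unit[1])        # plain dot product on unit vectors

# Scoring one query against every stored vector is a single matrix-vector product.
scores = unit @ unit[0]
```

Here full and shortcut agree to floating-point precision, and scores holds the cosine similarity of the first vector against all four rows, with no norms computed at query time.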
// 03 — BATCHED THROUGH A PROTOCOL
All of a document’s chunks are encoded in a single batched encode() call. The model processes the whole batch in parallel, which is far faster than encoding chunks one at a time. And the embedder sits behind a Protocol:
from typing import Protocol, runtime_checkable
import numpy as np

@runtime_checkable
class Embedder(Protocol):
    def encode(self, texts: list[str]) -> np.ndarray: ...
Anything with an encode() method satisfies it. Swapping MiniLM for OpenAI, Cohere, or Ollama is one new class and zero changes anywhere else. (That pattern gets its own concept post.)
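A minimal sketch of what such a swap could look like. HashingEmbedder and embed_document are hypothetical names invented for this example, not part of ContextFlow; the hashing class produces deterministic pseudo-embeddings with no model at all, the kind of stand-in that might be useful in tests. Note that it never inherits from Embedder: having a matching encode() is enough.

```python
import hashlib
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class Embedder(Protocol):
    def encode(self, texts: list[str]) -> np.ndarray: ...


class HashingEmbedder:
    """Hypothetical stand-in: deterministic unit vectors derived from a text hash."""

    def __init__(self, dim: int = 384) -> None:
        self.dim = dim

    def encode(self, texts: list[str]) -> np.ndarray:
        rows = []
        for text in texts:
            # Seed a generator from the text so the same text always maps to the same vector.
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            seed = int.from_bytes(digest[:8], "big")
            rows.append(np.random.default_rng(seed).normal(size=self.dim))
        vectors = np.array(rows, dtype=np.float32).reshape(len(texts), self.dim)
        # L2-normalize each row, matching what the real embedder is described as doing.
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / np.maximum(norms, 1e-12)


def embed_document(chunks: list[str], embedder: Embedder) -> np.ndarray:
    """Encode every chunk of one document in a single batched call."""
    return embedder.encode(chunks)


embedder = HashingEmbedder()
vectors = embed_document(["first chunk", "second chunk", "third chunk"], embedder)
```

Because the Protocol is @runtime_checkable, isinstance(embedder, Embedder) returns True here. That check only confirms an encode attribute exists; it does not validate the signature or return type, so a static type checker remains the stronger guarantee.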
TAKEAWAYS
- For retrieval, a small local model is usually right. The precision gap rarely matters, and you gain zero API cost, zero rate limits, and data that never leaves the machine.
- L2-normalize your vectors so cosine similarity becomes a dot product: faster queries, and what cosine vector stores expect.
- Hide the model behind a Protocol. The embedder you start with is rarely the one you finish with.
NEXT
- Build log 05: deterministic chunk IDs: idempotency as structure, not a flag.
