Chunking With Provenance: 512 Chars, 64 Overlap, Full Lineage

Splitting text so retrieval stays accurate and every result can cite its source.

Embeddings work on chunks, not whole documents, so how you split the text directly shapes what the search can find. ContextFlow chunks with a structure-aware splitter and stamps every chunk with where it came from, so results are both accurate and citable.

// 01 — STRUCTURE-AWARE SPLITTING

A naive splitter cuts every N characters, slicing through sentences and entities. ContextFlow uses RecursiveCharacterTextSplitter with a hierarchy of separators. It tries to break on the most natural boundary available, in order:

separators = ["\n\n", "\n", ". ", "! ", "? ", " ", ""]
chunk_size = 512    # ~2–3 academic sentences
chunk_overlap = 64  # 12.5%

It prefers paragraph breaks, then line breaks, then sentence ends, and falls back to spaces, and finally to a hard character cut, only when nothing coarser fits. Wherever the text allows, chunks land on meaningful boundaries instead of mid-thought.
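The separator priority described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation and not ContextFlow's code: the function name `recursive_split` is hypothetical, and overlap is left out to keep the fallback logic visible.

```python
SEPARATORS = ["\n\n", "\n", ". ", "! ", "? ", " ", ""]


def recursive_split(text, chunk_size=512, separators=SEPARATORS):
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper: tries the coarsest separator first and only
    recurses to finer ones for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # Pick the first (coarsest) non-empty separator that occurs in the text.
    sep, finer = "", []
    for i, candidate in enumerate(separators):
        if candidate and candidate in text:
            sep, finer = candidate, separators[i + 1:]
            break

    # Last resort: no separator applies, so cut at fixed character offsets.
    if not sep:
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    # Keep each separator attached to the piece before it so no text is lost.
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece  # greedily pack pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

Because separators stay attached to the preceding piece, joining the returned chunks reproduces the input exactly, which makes the sketch easy to check against real text.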

// 02 — WHY OVERLAP

Each chunk starts by repeating up to 64 characters from the end of the previous one. That overlap preserves cross-boundary context: a named entity or clause that straddles a split, provided it is shorter than the overlap, still appears whole in at least one of the two chunks. Without overlap, a concept that falls on a boundary is cut in two, and neither chunk's embedding represents it fully.
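A small experiment makes the effect of overlap concrete. The sketch below uses a plain fixed-size sliding window rather than the recursive splitter, so the boundary position is predictable; `sliding_chunks` is a hypothetical helper, and the sample text is made up for illustration.

```python
def sliding_chunks(text, chunk_size=512, chunk_overlap=64):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# The name starts at offset 506, so it straddles the 512-character boundary.
text = "x" * 506 + "Marie Curie" + "y" * 500

without = sliding_chunks(text, chunk_overlap=0)
with_overlap = sliding_chunks(text, chunk_overlap=64)

print(any("Marie Curie" in c for c in without))       # False: the name is cut in two
print(any("Marie Curie" in c for c in with_overlap))  # True: the second chunk holds it whole
```

The guarantee only covers spans shorter than the overlap. A phrase longer than 64 characters that lands on a boundary can still be split.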

// 03 — PROVENANCE ON EVERY CHUNK

Every chunk carries metadata: source, page_number, chunk_index, char_count. This is what lets a search result say “page 14 of paper.pdf” instead of just handing back text. Provenance is what makes retrieval trustworthy: you can follow any answer back to the exact page it came from and verify it.
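One way to attach that metadata is a small record type created at split time. This is a minimal sketch under assumptions: the field names come from the list above, but the `Chunk` class, `chunk_pages`, and `cite` are hypothetical, `chunk_index` is assumed to run across the whole document, and a fixed-size cut stands in for the real splitter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str        # file the chunk came from, e.g. "paper.pdf"
    page_number: int   # 1-based page the chunk was cut from
    chunk_index: int   # assumed: running index across the whole document
    char_count: int    # length of the chunk text


def chunk_pages(source, pages, chunk_size=512):
    """Split each page separately so every chunk maps to exactly one page."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for start in range(0, len(page_text), chunk_size):
            piece = page_text[start:start + chunk_size]
            chunks.append(
                Chunk(
                    text=piece,
                    source=source,
                    page_number=page_number,
                    chunk_index=len(chunks),
                    char_count=len(piece),
                )
            )
    return chunks


def cite(chunk):
    """Render a human-readable citation from a chunk's provenance."""
    return f"page {chunk.page_number} of {chunk.source}"
```

Splitting page by page is the design choice that keeps `page_number` unambiguous: a chunk never spans two pages, so a citation always points at a single place to verify.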

ANTHONY PENA · @FROGWEBP
I build data systems and write about everything around them: the architecture, the failures, and what each one teaches me. Documenting in public since 2021: the process, not just the result.
