A retrieval system that returns text without saying where it’s from is asking you to take its word for it. In RAG (retrieval-augmented generation), that’s not good enough, because the whole point is grounding answers in sources you can check.
What provenance is
Provenance is the metadata that travels with every chunk: which document, which page, which position. In ContextFlow each chunk carries source, page_number, chunk_index, char_count. When a chunk comes back as a search result, it can say “page 14 of tonal-languages.pdf” instead of just handing over a paragraph.
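As a minimal sketch of what that metadata might look like in Python: the four field names come from the list above, but the Chunk class, its text field, and the cite() helper are assumptions made for illustration, not ContextFlow's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record; not ContextFlow's actual class."""
    text: str
    source: str        # document the chunk came from
    page_number: int   # page within that document
    chunk_index: int   # position of the chunk within the document
    char_count: int    # length of the chunk text in characters

    def cite(self) -> str:
        """Render the provenance as a human-readable citation."""
        return f"page {self.page_number} of {self.source}"


# Illustrative sample text, not taken from a real document.
sample_text = "Tone distinguishes otherwise identical syllables."
chunk = Chunk(
    text=sample_text,
    source="tonal-languages.pdf",
    page_number=14,
    chunk_index=3,
    char_count=len(sample_text),
)
print(chunk.cite())
```

Making the record immutable (frozen=True) is one way to make it harder for a later pipeline stage to drop or overwrite the origin by accident.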
Why it’s non-negotiable
Three reasons:
- Verification. A reader (or a downstream LLM) can follow the citation to the original and confirm the answer is real, not hallucinated.
- Trust. “Here’s the answer, and here’s exactly where it’s from” is a fundamentally different claim than “here’s some text.” One is checkable; the other is faith.
- Debugging. When retrieval returns something irrelevant, provenance tells you which document and page the bad result came from, so you can open the source and see why it matched.
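The debugging case can be sketched in a few lines. This assumes search results arrive as dicts that carry the source and page_number fields; the result shape, the sample data, and the tally_origins helper are all hypothetical.

```python
from collections import Counter

# Hypothetical search results; only the provenance fields matter here.
results = [
    {"source": "tonal-languages.pdf", "page_number": 14},
    {"source": "style-guide.pdf", "page_number": 2},
    {"source": "style-guide.pdf", "page_number": 2},
]


def tally_origins(results):
    """Count how many results came from each (source, page) pair."""
    return Counter((r["source"], r["page_number"]) for r in results)


origins = tally_origins(results)
for (source, page), count in origins.most_common():
    print(f"{count} result(s) from page {page} of {source}")
```

If one page dominates a set of irrelevant results, that page is the first place to look. Without the provenance fields there would be nothing to group by.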
Carry it from the start
Provenance can’t be reconstructed from a chunk after the fact. Once a chunk is embedded without its page number, nothing in the text or the vector records it; short of re-processing the original document, that information is gone. It has to be attached at chunking time and preserved through embedding, storage, and retrieval. It’s cheap to carry and effectively impossible to recover, so attach it early and never drop it.
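Here is a rough sketch of attaching provenance at chunking time. It assumes the document has already been split into per-page text, and it uses naive fixed-size character splitting purely to keep the example short; real chunkers usually split on sentence or token boundaries. The chunk_pages function, its max_chars parameter, and the choice to number chunk_index across the whole document are assumptions, not ContextFlow's implementation.

```python
def chunk_pages(source, pages, max_chars=200):
    """Split each page's text into fixed-size chunks, attaching provenance.

    `pages` is a list of page texts in order; page numbers start at 1.
    """
    chunks = []
    chunk_index = 0  # runs across the whole document, not per page
    for page_number, page_text in enumerate(pages, start=1):
        for start in range(0, len(page_text), max_chars):
            text = page_text[start:start + max_chars]
            chunks.append({
                "text": text,
                "source": source,
                "page_number": page_number,
                "chunk_index": chunk_index,
                "char_count": len(text),
            })
            chunk_index += 1
    return chunks


pages = ["a" * 250, "b" * 100]  # stand-in page texts
chunks = chunk_pages("example.pdf", pages)
for c in chunks:
    print(c["source"], c["page_number"], c["chunk_index"], c["char_count"])
```

The page number is only known inside the loop over pages. Once the chunks leave this function as bare strings, that knowledge is lost, which is why the metadata is written into the same record as the text and should be stored alongside the embedding rather than in a separate step.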
Takeaway
Every chunk should know its origin (source, page, position) and carry it all the way to the result. Provenance is what turns retrieved text into a citation, and citation is what makes a RAG system trustworthy instead of merely plausible. Attach it at chunking time; you can’t add it later.
