
Semantic Search Over 100 PDFs: Meaning, Not Ctrl+F

Ask a question in plain English; get ranked answers with page provenance in under 2 seconds.

Ctrl+F finds keywords. It can’t find an idea that is worded differently from your query. ContextFlow is an ETL pipeline that makes a document library searchable by meaning: drop in 100 PDFs, ask “what phonological features distinguish tonal languages?”, and get ranked passages from across the whole corpus in under two seconds, each with a similarity score and the exact page it came from.

This is the first entry in its codex: what it does, and why it’s built for the hard case.

// 01 — THE GAP

Keyword search matches strings. Semantic search matches concepts, by turning text into vectors whose distance reflects meaning. Two passages about the same idea land near each other in vector space even if they share no words. That’s the difference between “find the document containing this phrase” and “find what’s relevant to this question.”
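
To make "distance reflects meaning" concrete, here is a minimal sketch of cosine similarity ranking. The three-dimensional vectors and passage names are made up for illustration; real embeddings come from a model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings: the query and the first passage point
# in nearly the same direction, the second passage does not.
query = [0.9, 0.1, 0.0]
passages = {
    "tone_contrast": [0.8, 0.2, 0.1],
    "unrelated": [0.0, 0.1, 0.9],
}

# Rank passage names by similarity to the query, best match first.
ranked = sorted(
    passages,
    key=lambda name: cosine_similarity(query, passages[name]),
    reverse=True,
)
```

The ranking depends only on the direction of the vectors, not on which words produced them, which is why two passages with no shared vocabulary can still score close together.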

// 02 — THE PIPELINE

Five stages, extract → transform → load → log → query:

PDFs → extract (stream pages, fix encoding)
     → transform (clean unicode · chunk 512/64 · embed 384-dim)
     → load (deterministic IDs · upsert into ChromaDB)
     → log (one SQLite row per run)
     → query (cosine similarity → ranked results + provenance)
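
The "chunk 512/64" step can be sketched as a sliding window. This assumes 512 is the chunk size and 64 is the overlap between neighbouring chunks, and it counts characters for simplicity; the source does not say whether the real pipeline counts characters or tokens, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, size=512, overlap=64):
    """Split text into windows of `size` that share `overlap` units with the next one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# A 1000-character sample: windows start at offsets 0, 448, and 896.
sample = "".join(chr(97 + i % 26) for i in range(1000))
chunks = chunk_text(sample)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.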

It’s fully idempotent: re-running on the same document upserts rather than duplicates. It also runs entirely locally: the embedding model, the vector store, and the audit log all live on your machine, at zero API cost.
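
A minimal sketch of why deterministic IDs make the load step idempotent, with one log row per run. Everything named here is an assumption for illustration: the ID scheme (a hash of source, page, and chunk index), the `runs` table layout, and the plain dict standing in for the vector store. With ChromaDB, the dict write would be a `collection.upsert(ids=..., documents=...)` call instead.

```python
import hashlib
import sqlite3

def chunk_id(source, page, index):
    """Stable ID derived from where the chunk came from (hypothetical scheme)."""
    key = f"{source}:{page}:{index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]

def load(store, log, source, chunks):
    """Upsert chunks into `store` and record one log row for this run."""
    for page, index, text in chunks:
        # The same (source, page, index) always maps to the same key,
        # so a re-run overwrites the entry instead of adding a new one.
        store[chunk_id(source, page, index)] = text
    log.execute(
        "INSERT INTO runs (source, chunk_count) VALUES (?, ?)",
        (source, len(chunks)),
    )
    log.commit()

store = {}  # stand-in for the vector store, not the real ChromaDB client
log = sqlite3.connect(":memory:")
log.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, source TEXT, chunk_count INTEGER)"
)

chunks = [
    (1, 0, "Tone is lexically contrastive."),
    (1, 1, "Pitch distinguishes otherwise identical words."),
]
load(store, log, "paper.pdf", chunks)
load(store, log, "paper.pdf", chunks)  # second run over the same document

run_count = log.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
```

After two runs the store still holds two entries while the log holds two rows: the data is deduplicated, the audit trail is not.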

// 03 — BUILT FOR THE HARD CASE

The target isn’t clean corporate PDFs. It’s multilingual academic papers compiled from LaTeX: linguistics texts full of IPA symbols, accented characters, ligatures, and the UTF-8 edge cases that quietly corrupt naive pipelines. Getting those right is most of the work, and it’s where the next few entries go. Validated end-to-end on real arXiv papers: 56 pages, 651 chunks, top-1 similarity 0.802, 60/60 tests green.
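
One way such text can go wrong, sketched with Python's standard `unicodedata` module. This is an assumed cleaning strategy, not necessarily the one ContextFlow uses: expand Latin ligatures explicitly, then apply NFC so accents compose without touching IPA modifier letters. `clean_text` and the `LIGATURES` table are hypothetical names.

```python
import unicodedata

# Latin presentation-form ligatures that PDF fonts often emit as single code points.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

def clean_text(text):
    """Expand ligatures, then compose base letters with their combining accents."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return unicodedata.normalize("NFC", text)

ligature_word = "\ufb01eld"   # "field" spelled with the single-code-point fi ligature
decomposed = "cafe\u0301"     # "e" followed by a combining acute accent
aspirated = "p\u02b0"         # IPA aspirated p: "p" plus modifier letter small h

cleaned = [clean_text(s) for s in (ligature_word, decomposed, aspirated)]

# For contrast: NFKC would also fix the ligature, but it folds the IPA
# modifier letter into a plain "h", silently changing the transcription.
nfkc_aspirated = unicodedata.normalize("NFKC", aspirated)
```

The design choice is to keep normalization conservative: a blanket compatibility fold is shorter to write, but it rewrites symbols that carry meaning in phonetic transcription.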

ANTHONY PENA · @FROGWEBP
I build data systems and write about everything around them: the architecture, the failures, and what each one teaches me. Documenting in public since 2021: the process, not just the result.
