Deterministic Chunk IDs: Idempotency as Structure, Not a Flag

Re-ingest a document into ContextFlow and nothing duplicates. The chunks update in place. There’s no “already processed?” check, no dedup pass. Idempotency is built into how chunk IDs are generated, so it can’t be forgotten.

// 01 — THE ID

A chunk’s ID is a hash of its identity, meaning its source, page, and position:

import hashlib

key = f"{source}::p{page_number}::c{chunk_index}"  # identity: source, page, position
chunk_id = hashlib.sha256(key.encode()).hexdigest()[:16]  # first 16 hex chars (64 bits)

The same document, chunked the same way, always produces the same IDs. The ID isn’t assigned; it’s derived, so a re-processed chunk arrives carrying the identity it had last time.
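As a minimal sketch of that property, the derivation can be wrapped in a function and called twice. The function name make_chunk_id and the file name handbook.pdf are illustrative assumptions, not names from ContextFlow.

```python
import hashlib


def make_chunk_id(source: str, page_number: int, chunk_index: int) -> str:
    """Derive a stable 16-hex-character ID from a chunk's identity."""
    key = f"{source}::p{page_number}::c{chunk_index}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]


# Two independent runs over the same chunk yield the same ID.
first_run = make_chunk_id("handbook.pdf", 3, 0)
second_run = make_chunk_id("handbook.pdf", 3, 0)

# A different position on the same page yields a different ID.
neighbor = make_chunk_id("handbook.pdf", 3, 1)

print(first_run == second_run)  # True
print(first_run == neighbor)    # False
```

Nothing is stored between the two calls. The second run arrives at the same ID because the inputs are the same, not because it looked anything up.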

// 02 — UPSERT, NOT ADD

The load step uses collection.upsert(), not add(). Combined with stable IDs, re-running on an already-indexed document overwrites those exact chunk IDs instead of inserting new rows. A document you ingest five times occupies the same space as one you ingest once.
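The effect can be sketched without a real vector store. InMemoryCollection below is a hypothetical stand-in whose upsert() overwrites by ID; the ingest helper, the page layout, and the file name are also assumptions made for illustration.

```python
import hashlib


def make_chunk_id(source, page_number, chunk_index):
    key = f"{source}::p{page_number}::c{chunk_index}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]


class InMemoryCollection:
    """Hypothetical stand-in for a collection keyed by chunk ID."""

    def __init__(self):
        self._rows = {}

    def upsert(self, ids, documents):
        # Insert new IDs; overwrite the row for any ID that already exists.
        for chunk_id, text in zip(ids, documents):
            self._rows[chunk_id] = text

    def count(self):
        return len(self._rows)


def ingest(collection, source, pages):
    """Load every chunk of a document; pages is a list of lists of chunk text."""
    ids, documents = [], []
    for page_number, chunks in enumerate(pages, start=1):
        for chunk_index, text in enumerate(chunks):
            ids.append(make_chunk_id(source, page_number, chunk_index))
            documents.append(text)
    collection.upsert(ids=ids, documents=documents)


collection = InMemoryCollection()
pages = [["intro", "scope"], ["details"]]  # two pages, three chunks

ingest(collection, "handbook.pdf", pages)
count_after_one = collection.count()

for _ in range(4):  # four more runs, five in total
    ingest(collection, "handbook.pdf", pages)
count_after_five = collection.count()

print(count_after_one, count_after_five)  # 3 3
```

The ingest function never asks whether the document was seen before. The row count stays flat only because each write lands on an ID that already exists.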

// 03 — WHY STRUCTURAL BEATS A FLAG

You could get idempotency with a tracking table (“have I seen this file?”), but that’s a check you can forget, get wrong, or race. Deriving the ID from the chunk’s identity (source, page, position) makes duplication impossible by construction: there’s no code path that creates a second copy, because the second copy would have the same ID as the first and upsert onto it. The guarantee lives in the data model, not in a conditional someone has to remember to write.
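The contrast can be made concrete with a toy comparison. Everything here is a hypothetical sketch: the check parameter simulates a caller that forgets the guard, and plain Python containers stand in for the tracking table and the store.

```python
import hashlib


def make_chunk_id(source, page_number, chunk_index):
    key = f"{source}::p{page_number}::c{chunk_index}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]


# Flag-based: rows are appended, and a separate set tracks seen files.
seen_files = set()
flag_rows = []


def load_with_flag(source, chunks, check=True):
    # Correctness depends on every caller running this check.
    if check and source in seen_files:
        return
    seen_files.add(source)
    for text in chunks:
        flag_rows.append((source, text))


# Structural: rows are keyed by the derived ID, so a rewrite overwrites.
keyed_rows = {}


def load_structural(source, page_number, chunks):
    for chunk_index, text in enumerate(chunks):
        keyed_rows[make_chunk_id(source, page_number, chunk_index)] = text


chunks = ["intro", "scope"]

load_with_flag("handbook.pdf", chunks)
load_with_flag("handbook.pdf", chunks, check=False)  # a path that skips the guard

load_structural("handbook.pdf", 1, chunks)
load_structural("handbook.pdf", 1, chunks)  # there is no guard to skip

print(len(flag_rows))   # 4
print(len(keyed_rows))  # 2
```

The flag-based loader is correct only while every caller goes through the check. The structural loader has no equivalent of check=False to misuse.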

TAKEAWAYS

Derive IDs from what a chunk is (source, page, position), not from insertion order. A derived ID makes re-processing recognize itself.
upsert + deterministic IDs = idempotent ingestion with no dedup logic and no tracking table.
Prefer guarantees that live in structure over guarantees that live in a flag. The flag can be bypassed; the structure can’t.

Next up, build log 06: the Airflow DAG that orchestrates the whole cycle.
