DOC: contextflo
STATUS: ● PUBLISHED
SYSTEM CONTEXTFLOW

The Airflow DAG: Validate, Extract, Transform, Load, Notify

Orchestration with the right retries per task, and chunks that don't go through XCom.

Cover image — The Airflow DAG: Validate, Extract, Transform, Load, Notify

Run from the CLI, ContextFlow is a script. Run from Airflow, it’s an orchestrated pipeline with retries, run history, and a failure that fails loudly. The DAG is five tasks, each with a deliberate retry policy.

// 01 — THE TASK GRAPH

validate_inputs >> extract >> transform >> load >> notify
max_active_runs = 1   # one model load at a time on CPU
retries = 2
retry_exponential_backoff = True

max_active_runs = 1 prevents two runs from loading the embedding model onto the CPU simultaneously and starving each other.

// 02 — RETRIES THAT MATCH REALITY

Not every task should retry the same way:

The retry count encodes a belief about why a task fails. Retrying a deterministic failure is superstition.

// 03 — THE XCOM FILE-PATH PATTERN

Airflow’s XCom passes data between tasks through its metadata database, capped around 48 KB. Embedded chunks are ~2.5 MB, far too big for this (this caused a real, silent bug; its own anomaly log). The fix: transform writes chunks to a timestamped JSON file on the shared volume and passes only the file path through XCom; load reads the file and deletes it. XCom carries a pointer, never the payload.

TAKEAWAYS

NEXT

@frogwebp brand mark
ANTHONY PENA · @FROGWEBP
I build data systems and write about everything around them, the architecture, the failures, what each one teaches me. Documenting in public since 2021: the process, not just the result.

// NEWSLETTER — THE BUILD LOG SIGNAL

When I ship something or learn something worth keeping, it lands here first — build logs, concepts, and the honest process behind them. Come along; no spam, leave anytime.