An idempotent operation is one you can run any number of times and end up with the same result as running it once. In data engineering, this is the property that makes pipelines safe to retry, re-run, and recover. It is often what separates pipelines that fail gracefully from ones that corrupt data when they fail.
Why idempotency matters
Pipelines fail. The network blips, a dependency is slow, the cluster dies mid-run. If your response to failure is “just run it again,” you need the re-run to not double-count, double-insert, or compound the state the first run left behind.
Non-idempotent pipeline: fail halfway through → retry → events processed twice → metrics inflated.
Idempotent pipeline: fail halfway through → retry → same result as if the first attempt completed cleanly.
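To make the contrast concrete, here is a minimal sketch of both cases in plain Python. The event list, the simulated crash, and the names run, run_with_retry, appended, and keyed are all hypothetical and used only for illustration. The first attempt dies after two events and the retry replays the whole batch.

```python
# Hypothetical data: (event_id, amount) pairs for one batch.
events = [("e1", 10), ("e2", 20), ("e3", 30)]


def run(write, fail_after=None):
    """Write events in order; optionally crash before event number `fail_after`."""
    for i, (event_id, amount) in enumerate(events):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("cluster died mid-run")
        write(event_id, amount)


def run_with_retry(write):
    """The first attempt dies after two events; the retry replays everything."""
    try:
        run(write, fail_after=2)
    except RuntimeError:
        run(write)


# Non-idempotent sink: every write appends, so replayed events count twice.
appended = []
run_with_retry(lambda event_id, amount: appended.append(amount))

# Idempotent sink: writes are keyed by event_id, so a replay overwrites in place.
keyed = {}
run_with_retry(lambda event_id, amount: keyed.__setitem__(event_id, amount))

print(sum(appended))        # 90: e1 and e2 were counted twice
print(sum(keyed.values()))  # 60: same as one clean run
```

The only difference between the two sinks is whether a write has an identity. Once it does, replaying it is harmless.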
The four techniques
1. Upsert, not insert. INSERT ... ON CONFLICT DO UPDATE (PostgreSQL and SQLite syntax) replaces “add a row” with “ensure this row exists with these values.” It relies on a unique constraint on the key to detect the conflict. Processing the same event twice produces one row, not two.
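2. Deterministic IDs. If the row’s identity is derived from stable fields of the event (e.g., MD5(event_id || '|' || date || '|' || type)), the upsert key is stable across runs. The delimiter keeps different field combinations from concatenating to the same string. Leave run-specific values such as load timestamps out of the key. The same event then always maps to the same row.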
3. Delete-then-insert for aggregates. If you’re re-materializing a partition or a day’s aggregates, delete the target window first and recompute it from scratch. Do both steps in one transaction, so a failure between the delete and the insert cannot leave the window empty. The result is always exactly what the source data says, regardless of how many times you run.
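4. Status columns with idempotent transitions. pending → processing → done. Each transition applies only if the row is still in the expected state, so a re-run of a “done” row is a no-op. SELECT ... FOR UPDATE SKIP LOCKED (in PostgreSQL) lets concurrent workers claim pending rows without two of them taking the same row at the same time.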
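The four techniques can be sketched together using Python's built-in sqlite3 module. This is an illustration under stated assumptions, not a reference implementation: the tables (events, daily_totals, jobs), the helper names, and the sample batch are hypothetical. SQLite has no FOR UPDATE SKIP LOCKED, so technique 4 is shown as a guarded UPDATE that only fires from the expected state; it does not demonstrate concurrent workers.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (row_id TEXT PRIMARY KEY, day TEXT,
                         event_type TEXT, amount INTEGER);
    CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total INTEGER);
    CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO jobs VALUES (1, 'pending');
""")


def row_id(event_id, day, event_type):
    """Technique 2: derive the key from the event's own fields, with a
    delimiter so ('ab', 'c') and ('a', 'bc') cannot produce the same input."""
    return hashlib.md5(f"{event_id}|{day}|{event_type}".encode()).hexdigest()


def load_events(batch):
    """Technique 1: upsert, so replaying a batch leaves one row per event."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            """INSERT INTO events (row_id, day, event_type, amount)
               VALUES (?, ?, ?, ?)
               ON CONFLICT(row_id) DO UPDATE SET amount = excluded.amount""",
            [(row_id(e, d, t), d, t, a) for e, d, t, a in batch],
        )


def rebuild_day(day):
    """Technique 3: delete the target window and recompute it in one transaction."""
    with conn:
        conn.execute("DELETE FROM daily_totals WHERE day = ?", (day,))
        conn.execute(
            """INSERT INTO daily_totals (day, total)
               SELECT day, SUM(amount) FROM events WHERE day = ? GROUP BY day""",
            (day,),
        )


def advance(job_id, old, new):
    """Technique 4: move a job only if it is still in the expected state."""
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status = ? WHERE job_id = ? AND status = ?",
            (new, job_id, old),
        )
    return cur.rowcount == 1  # False means the transition was a no-op


# Hypothetical batch: (event_id, day, event_type, amount).
batch = [("e1", "2024-01-01", "click", 10), ("e2", "2024-01-01", "click", 20)]
for _ in range(2):  # simulate a full re-run of the pipeline
    load_events(batch)
    rebuild_day("2024-01-01")
    advance(1, "pending", "processing")
    advance(1, "processing", "done")

event_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
totals = conn.execute("SELECT day, total FROM daily_totals").fetchall()
job_status = conn.execute("SELECT status FROM jobs WHERE job_id = 1").fetchone()[0]
print(event_count, totals, job_status)  # 2 [('2024-01-01', 30)] done
```

The second pass through the loop changes nothing: the upsert rewrites the same two rows, the rebuild produces the same total, and both status transitions match zero rows.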
What idempotency is not
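Idempotency is not the same as exactly-once delivery. Exactly-once delivery is a guarantee from the messaging system that each message reaches the consumer once, with no duplicates. It is hard to provide across failures and retries, and many systems offer at-least-once delivery instead. Idempotency is a property of your consumer that makes duplicates harmless even when the broker delivers them.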
Build idempotent consumers; treat exactly-once delivery as a nice-to-have.
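As a rough sketch of what an idempotent consumer can look like, the hypothetical TotalsConsumer below remembers which message IDs it has already applied and ignores repeats. The class, the message IDs, and the delivery list are assumptions made for illustration. A real consumer would need to persist the seen IDs together with the side effect, ideally in the same transaction, which this in-memory version does not attempt.

```python
class TotalsConsumer:
    """Hypothetical consumer that applies each message ID at most once."""

    def __init__(self):
        self.seen = set()  # IDs of messages already applied
        self.total = 0

    def handle(self, message_id, amount):
        """Apply the message unless it was applied before; report which happened."""
        if message_id in self.seen:
            return False  # duplicate delivery: ignore it
        self.seen.add(message_id)
        self.total += amount
        return True


# Simulated at-least-once delivery: m2 arrives twice, e.g. after a lost ack.
deliveries = [("m1", 5), ("m2", 7), ("m2", 7), ("m3", 1)]
consumer = TotalsConsumer()
applied = [consumer.handle(mid, amt) for mid, amt in deliveries]

print(applied)         # [True, True, False, True]
print(consumer.total)  # 13
```

The increment itself is not idempotent, but guarding it with the message ID makes the consumer as a whole idempotent, whatever the broker does.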
Takeaway
For every pipeline stage, ask: “what happens if this runs twice?” If the answer is “the same thing as running once,” you have an idempotent pipeline. If the answer is “I hope it doesn’t,” you have a fragile one.
