dim_geography was supposed to hold one row per location. Instead it slowly multiplied: two, three, several rows for what should have been the same place. The unique constraint that was meant to prevent this was working exactly as SQL specifies. That was the problem.
// 01 — THE SETUP
Geography is deduplicated on (country_code, region, city) with a unique constraint, and upsert_geo() relies on that constraint to return the existing geo_id on conflict. Plenty of events have a country but no region or city, for example (US, NULL, NULL).
// 02 — THE SYMPTOM
Two events both with (country='US', region=NULL, city=NULL) produced two rows. Over time, all-NULL-subdivision locations piled up, and upsert_geo() could return different geo_id values for what was semantically the same place, quietly fragmenting any analysis grouped by geography.
// 03 — THE CULPRIT
Standard SQL treats NULL as not equal to NULL for uniqueness. Two rows of (US, NULL, NULL) both pass the unique check, because NULL = NULL is not true. It’s NULL. The constraint never fires, so the upsert inserts a fresh row every time instead of conflicting onto the existing one.
// 04 — THE FIX
PostgreSQL 15 added the exact knob for this: NULLS NOT DISTINCT, which makes NULL equal to NULL for constraint purposes:
UNIQUE NULLS NOT DISTINCT (country_code, region, city)
Now (US, NULL, NULL) conflicts with itself, the upsert returns the existing row, and one place is one row again.
TAKEAWAYS
- By default, unique constraints treat every
NULLas distinct, so rows withNULLs never conflict. IfNULLis a legitimate, recurring value in your dedup key, your “unique” rows aren’t. NULLS NOT DISTINCT(PostgreSQL 15+) makesNULLequalNULLfor uniqueness: the right tool when absent subdivisions are normal data.- “The constraint is working as specified” and “the constraint is doing what I want” are different claims. Know the spec.
NEXT
- Anomaly log: one user, every tenant’s retention: a multi-tenant leak.
