DOC: spectrum-a
STATUS: ● PUBLISHED
SYSTEM SPECTRUM

A/B Experiments With Wilson Score Intervals in SQL

Computing statistically sound confidence intervals for conversion rates without leaving PostgreSQL.

Cover image — A/B Experiments With Wilson Score Intervals in SQL

A/B experiment results are conversion rates with uncertainty. Showing “control: 12%, treatment: 13.4%” without uncertainty is misleading, because that delta might be noise. Spectrum computes Wilson score confidence intervals directly in SQL. This is build log 06.

// 01 — WHY WILSON SCORE, NOT WALD

The naive confidence interval for a proportion is the Wald interval: p ± Z * sqrt(p*(1-p)/n). It breaks at the edges: with very low or very high conversion rates, or with small samples, it produces intervals below 0% or above 100%. Wilson score is better behaved in those regions and is the standard for A/B testing tools.

The formula:

center   = (successes + Z²/2) / (n + Z²)
half     = Z * sqrt(n * p * (1-p) + Z²/4) / (n + Z²)
lower    = center - half
upper    = center + half

where Z = 1.96 for 95% confidence and p = successes / n.

// 02 — THE SQL IMPLEMENTATION

CREATE MATERIALIZED VIEW analytics.mv_experiment_results AS
WITH
z AS (SELECT 1.96 AS val),
counts AS (
    SELECT
        e.experiment_id,
        e.variant,
        e.account_id,
        count(*) AS n,
        count(*) FILTER (WHERE e.converted) AS successes
    FROM warehouse.fact_experiments e
    GROUP BY e.experiment_id, e.variant, e.account_id
),
wilson AS (
    SELECT
        experiment_id,
        variant,
        account_id,
        n,
        successes,
        successes::FLOAT / NULLIF(n, 0) AS conversion_rate,
        -- Wilson interval components
        (successes + (SELECT val FROM z)^2 / 2.0)
            / NULLIF(n + (SELECT val FROM z)^2, 0)                              AS center,
        (SELECT val FROM z) * sqrt(
            n * (successes::FLOAT / NULLIF(n,0))
              * (1 - successes::FLOAT / NULLIF(n,0))
              + (SELECT val FROM z)^2 / 4.0
        ) / NULLIF(n + (SELECT val FROM z)^2, 0)                               AS half_width
    FROM counts
)
SELECT
    experiment_id,
    variant,
    account_id,
    n,
    successes,
    conversion_rate,
    GREATEST(0, center - half_width) AS ci_lower,
    LEAST(1,   center + half_width) AS ci_upper
FROM wilson
WITH NO DATA;

GREATEST(0, ...) and LEAST(1, ...) clamp the bounds to valid probability range.

// 03 — READING THE RESULTS IN GRAFANA

Grafana reads ci_lower, ci_upper, and conversion_rate per variant. The dashboard renders them as error bars: if the control and treatment intervals don’t overlap, the difference is significant at 95% confidence. No statistical library required. The warehouse exports ready-to-visualize numbers.

// 04 — WHAT’S NOT HERE

Statistical significance is not the same as practical significance. A 0.1% conversion lift that’s “significant” with 100,000 users per arm is not necessarily worth acting on. Spectrum shows the intervals; the human decides whether the delta matters for the product.

Spectrum also doesn’t handle sequential testing (running an experiment and peeking at results repeatedly without adjustment). For now, experiments are evaluated once at their end date.

NEXT

@frogwebp brand mark
ANTHONY PENA · @FROGWEBP
I build data systems and write about everything around them, the architecture, the failures, what each one teaches me. Documenting in public since 2021: the process, not just the result.

// NEWSLETTER — THE BUILD LOG SIGNAL

When I ship something or learn something worth keeping, it lands here first — build logs, concepts, and the honest process behind them. Come along; no spam, leave anytime.