DOC: spectrum-s
STATUS: ● PUBLISHED
SYSTEM SPECTRUM

Sessionization in Pure SQL

Grouping raw events into sessions with a 30-minute inactivity gap, no Python required.

Cover image — Sessionization in Pure SQL

A “session” is a group of events from the same user with no gap longer than 30 minutes. Raw events don’t carry session IDs, so the pipeline has to derive them. This is build log 07: the SQL sessionization pattern, and why it’s cleaner than doing it in Python.

// 01 — THE GOAL

Given a stream of events per user sorted by occurred_at, assign a session_id such that:

// 02 — THE WINDOW FUNCTION APPROACH

The classic pattern uses LAG() to detect session boundaries and a cumulative sum to assign IDs:

WITH lagged AS (
    SELECT
        user_id,
        account_id,
        event_id,
        occurred_at,
        LAG(occurred_at) OVER (
            PARTITION BY user_id, account_id
            ORDER BY occurred_at
        ) AS prev_occurred_at
    FROM warehouse.fact_events
),
boundaries AS (
    SELECT
        *,
        CASE
            WHEN prev_occurred_at IS NULL
              OR occurred_at - prev_occurred_at > INTERVAL '30 minutes'
            THEN 1 ELSE 0
        END AS is_new_session
    FROM lagged
),
sessions AS (
    SELECT
        *,
        SUM(is_new_session) OVER (
            PARTITION BY user_id, account_id
            ORDER BY occurred_at
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS session_seq
    FROM boundaries
)
SELECT
    user_id,
    account_id,
    occurred_at,
    event_id,
    MD5(user_id::TEXT || account_id::TEXT || session_seq::TEXT) AS session_id
FROM sessions;

LAG() retrieves the previous timestamp. If the gap exceeds 30 minutes (or it’s the first event), is_new_session = 1. SUM(is_new_session) accumulates: each session boundary increments the counter, grouping all events between boundaries into the same session sequence number. The final MD5(...) turns the per-user sequence number into a stable session ID.

// 03 — WHERE THIS RUNS

Sessionization in Spectrum is a materialized view, not a pipeline step. Sessions are derived analytically, not stored per-event:

CREATE MATERIALIZED VIEW analytics.mv_sessions AS
-- the full query above, without WITH NO DATA so it populates immediately
...;

CREATE UNIQUE INDEX ON analytics.mv_sessions (event_id);

Grafana reads session counts, average session lengths, and events-per-session directly from this view. It’s refreshed concurrently alongside the other analytics views.

// 04 — WHY SQL, NOT PYTHON

The Python equivalent of this is a loop over events sorted by user and time, maintaining a “current session” variable and flushing when the gap exceeds 30 minutes. It works but moves data from the database, iterates in Python, then writes results back. For millions of events, that round-trip is expensive.

The SQL version runs entirely inside PostgreSQL: no data moves, no Python loop, no intermediate serialization. Window functions are optimized for exactly this pattern, and the database already has the events indexed by (user_id, account_id, occurred_at) for exactly this access pattern.

If the logic became complex enough to require custom business rules that can’t be expressed in SQL, Python would be the right tool. At “30-minute inactivity window,” SQL is simpler and faster.

NEXT

The Spectrum build series continues with anomaly logs: bugs I found by reading the dashboards this system generates.

@frogwebp brand mark
ANTHONY PENA · @FROGWEBP
I build data systems and write about everything around them, the architecture, the failures, what each one teaches me. Documenting in public since 2021: the process, not just the result.

// NEWSLETTER — THE BUILD LOG SIGNAL

When I ship something or learn something worth keeping, it lands here first — build logs, concepts, and the honest process behind them. Come along; no spam, leave anytime.