Sessionization in Pure SQL

A “session” is a group of events from the same user with no gap longer than 30 minutes. Raw events don’t carry session IDs, so the pipeline has to derive them. This is build log 07: the SQL sessionization pattern, and why it’s cleaner than doing it in Python.

// 01 — THE GOAL

Given a stream of events per user sorted by occurred_at, assign a session_id such that:

Events within 30 minutes of each other share a session ID.
A new session starts whenever the gap to the previous event exceeds 30 minutes.

// 02 — THE WINDOW FUNCTION APPROACH

The classic pattern uses LAG() to detect session boundaries and a cumulative sum to assign IDs:

WITH lagged AS (
    SELECT
        user_id,
        account_id,
        event_id,
        occurred_at,
        LAG(occurred_at) OVER (
            PARTITION BY user_id, account_id
            ORDER BY occurred_at
        ) AS prev_occurred_at
    FROM warehouse.fact_events
),
boundaries AS (
    SELECT
        *,
        CASE
            WHEN prev_occurred_at IS NULL
              OR occurred_at - prev_occurred_at > INTERVAL '30 minutes'
            THEN 1 ELSE 0
        END AS is_new_session
    FROM lagged
),
sessions AS (
    SELECT
        *,
        SUM(is_new_session) OVER (
            PARTITION BY user_id, account_id
            ORDER BY occurred_at
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS session_seq
    FROM boundaries
)
SELECT
    user_id,
    account_id,
    occurred_at,
    event_id,
    MD5(user_id::TEXT || account_id::TEXT || session_seq::TEXT) AS session_id
FROM sessions;

LAG() retrieves the previous timestamp. If the gap exceeds 30 minutes (or it’s the first event), is_new_session = 1. SUM(is_new_session) accumulates: each session boundary increments the counter, grouping all events between boundaries into the same session sequence number. The final MD5(...) turns the per-user sequence number into a stable session ID.

// 03 — WHERE THIS RUNS

Sessionization in Spectrum is a materialized view, not a pipeline step. Sessions are derived analytically, not stored per-event:

CREATE MATERIALIZED VIEW analytics.mv_sessions AS
-- the full query above, without WITH NO DATA so it populates immediately
...;

CREATE UNIQUE INDEX ON analytics.mv_sessions (event_id);

Grafana reads session counts, average session lengths, and events-per-session directly from this view. It’s refreshed concurrently alongside the other analytics views.

// 04 — WHY SQL, NOT PYTHON

The Python equivalent of this is a loop over events sorted by user and time, maintaining a “current session” variable and flushing when the gap exceeds 30 minutes. It works but moves data from the database, iterates in Python, then writes results back. For millions of events, that round-trip is expensive.

The SQL version runs entirely inside PostgreSQL: no data moves, no Python loop, no intermediate serialization. Window functions are optimized for exactly this pattern, and the database already has the events indexed by (user_id, account_id, occurred_at) for exactly this access pattern.

If the logic became complex enough to require custom business rules that can’t be expressed in SQL, Python would be the right tool. At “30-minute inactivity window,” SQL is simpler and faster.

The Spectrum build series continues with anomaly logs: bugs I found by reading the dashboards this system generates.

// 01 — THE GOAL

// 02 — THE WINDOW FUNCTION APPROACH

// 03 — WHERE THIS RUNS

// 04 — WHY SQL, NOT PYTHON

NEXT