A “session” is a group of events from the same user with no gap longer than 30 minutes. Raw events don’t carry session IDs, so the pipeline has to derive them. This is build log 07: the SQL sessionization pattern, and why it’s cleaner than doing it in Python.
// 01 — THE GOAL
Given a stream of events per user sorted by occurred_at, assign a session_id such that:
- Events within 30 minutes of each other share a session ID.
- A new session starts whenever the gap to the previous event exceeds 30 minutes.
// 02 — THE WINDOW FUNCTION APPROACH
The classic pattern uses LAG() to detect session boundaries and a cumulative sum to assign IDs:
WITH lagged AS (
SELECT
user_id,
account_id,
event_id,
occurred_at,
LAG(occurred_at) OVER (
PARTITION BY user_id, account_id
ORDER BY occurred_at
) AS prev_occurred_at
FROM warehouse.fact_events
),
boundaries AS (
SELECT
*,
CASE
WHEN prev_occurred_at IS NULL
OR occurred_at - prev_occurred_at > INTERVAL '30 minutes'
THEN 1 ELSE 0
END AS is_new_session
FROM lagged
),
sessions AS (
SELECT
*,
SUM(is_new_session) OVER (
PARTITION BY user_id, account_id
ORDER BY occurred_at
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_seq
FROM boundaries
)
SELECT
user_id,
account_id,
occurred_at,
event_id,
MD5(user_id::TEXT || account_id::TEXT || session_seq::TEXT) AS session_id
FROM sessions;
LAG() retrieves the previous timestamp. If the gap exceeds 30 minutes (or it’s the first event), is_new_session = 1. SUM(is_new_session) accumulates: each session boundary increments the counter, grouping all events between boundaries into the same session sequence number. The final MD5(...) turns the per-user sequence number into a stable session ID.
// 03 — WHERE THIS RUNS
Sessionization in Spectrum is a materialized view, not a pipeline step. Sessions are derived analytically, not stored per-event:
CREATE MATERIALIZED VIEW analytics.mv_sessions AS
-- the full query above, without WITH NO DATA so it populates immediately
...;
CREATE UNIQUE INDEX ON analytics.mv_sessions (event_id);
Grafana reads session counts, average session lengths, and events-per-session directly from this view. It’s refreshed concurrently alongside the other analytics views.
// 04 — WHY SQL, NOT PYTHON
The Python equivalent of this is a loop over events sorted by user and time, maintaining a “current session” variable and flushing when the gap exceeds 30 minutes. It works but moves data from the database, iterates in Python, then writes results back. For millions of events, that round-trip is expensive.
The SQL version runs entirely inside PostgreSQL: no data moves, no Python loop, no intermediate serialization. Window functions are optimized for exactly this pattern, and the database already has the events indexed by (user_id, account_id, occurred_at) for exactly this access pattern.
If the logic became complex enough to require custom business rules that can’t be expressed in SQL, Python would be the right tool. At “30-minute inactivity window,” SQL is simpler and faster.
NEXT
The Spectrum build series continues with anomaly logs: bugs I found by reading the dashboards this system generates.
