Activity Schema: The Data Model That Fixed My QA Debugging
What if your entire analytics warehouse was one table? No foreign keys. No 15-table joins. Just events, entities, and time.
That’s the premise of Activity Schema - and after using it on an internal project, I’m convinced it’s the right model for event-driven domains. I built it for a QA team debugging event sequences across 4+ codebases and a million daily users - and it delivered: it caught sequence bugs that would have taken hours to find with traditional joins, sped up debugging cycles, and made systematic anomaly detection part of the daily workflow.
This post isn’t a tutorial. It’s a mental model. I want you to understand why this approach works, when it makes sense, and what tradeoffs you’re signing up for.
When Events Misbehave
The use case where Activity Schema proved itself was debugging a VOD player. Different app versions were emitting events in invalid order - playback starting before player initialization, buffering sessions without end events. Telemetry chaos.
The scale made traditional debugging impractical: 4+ separate codebases (web, mobile, TV, partner extensions), each with multiple active app versions, roughly 1M DAU, playback heartbeats every minute, and 10 distinct player activities. Correlating sequences across this volume meant joining massive tables repeatedly just to answer basic questions.
To answer “which versions have broken event sequences?”, you need to compare timestamps across event types for the same session. In a traditional schema, that means joining the playback_events table to player_events to buffering_events, matching on playback session IDs, and praying your timestamps are consistent.
It works. Until someone asks “show me all sessions where buffering ended but never started.” Now you need outer joins. Then “group by app version and count the anomalies.” Now you’re writing window functions across joined tables.
The query complexity isn’t the real problem. The real problem is that every new question requires a new mental model of how the tables relate.
One Table, Temporal Joins
Activity Schema takes a different approach. Instead of normalizing events into separate tables by type, you store everything in a single stream:
| ts | activity | entity | features |
|---|---|---|---|
| 2025-06-04 10:00:01 | player_ready | session_123 | {app_version: “2.1.0”} |
| 2025-06-04 10:00:03 | video_start | session_123 | {asset_id: “movie_456”} |
| 2025-06-04 10:00:15 | buffer_start | session_123 | {buffer_id: “buf_789”} |
Every row is an activity performed by an entity at a specific time, with optional features as context.
The magic happens when you query. Instead of foreign key joins, you use temporal joins - self-joins based on time relationships.
Business question: “For every user who converted to paid, what was the last video they watched before subscribing, and how many videos total?”
| Metric | Star Schema | Activity Schema |
|---|---|---|
| Tables | 4 | 1 |
| JOINs | 5 | 2 |
| Lines | 30 | 27 |
| Mental model | “How do these tables relate?” | “What’s my cohort? What do I append?” |
Same answer, different mental load. The Activity Schema query follows one repeatable pattern. The star schema requires mapping four tables and their relationships before you write a single line.
Sample output (identical from both approaches):
| user_id | converted_at | plan_type | last_video_title | videos_watched_before |
|---|---|---|---|---|
| user_042 | 2025-06-01 14:32 | paid | Breaking Bad S01E01 | 12 |
| user_087 | 2025-06-02 09:15 | paid | The Office Pilot | 3 |
| user_156 | 2025-06-02 18:45 | paid | Planet Earth II | 27 |
| user_203 | 2025-06-03 11:20 | paid | NULL | 0 |
User 203 converted without watching any videos first - a referral or direct signup worth investigating.
Star Schema SQL (30 lines, 4 tables, 5 JOINs)
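A rough sketch of the shape this query takes, using hypothetical tables users, subscriptions, video_views, and videos (not the original query):

```sql
-- Sketch only: the table and column names are assumptions,
-- not the original 30-line query.
WITH conversions AS (
    SELECT u.user_id, s.subscribed_at AS converted_at, s.plan_type
    FROM users u
    JOIN subscriptions s ON s.user_id = u.user_id
    WHERE s.plan_type = 'paid'
),
watches AS (
    SELECT c.user_id, c.converted_at, c.plan_type,
           v.title,
           ROW_NUMBER() OVER (
               PARTITION BY c.user_id ORDER BY vv.watched_at DESC
           ) AS rn,
           COUNT(vv.video_id) OVER (PARTITION BY c.user_id) AS videos_watched_before
    FROM conversions c
    LEFT JOIN video_views vv
           ON vv.user_id = c.user_id AND vv.watched_at < c.converted_at
    LEFT JOIN videos v ON v.video_id = vv.video_id
)
SELECT user_id, converted_at, plan_type,
       title AS last_video_title, videos_watched_before
FROM watches
WHERE rn = 1;
```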
Activity Schema SQL (27 lines, 1 table, 2 JOINs)
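And the same question against the single stream - sketched assuming the table is called activity_stream, the activities are named converted and watched_video, and the feature keys are plan_type and title:

```sql
-- Sketch only: activity names and feature keys are assumptions.
WITH cohort AS (                                   -- 1. define the cohort
    SELECT ts AS converted_at, entity AS user_id,
           json_extract_string(features, '$.plan_type') AS plan_type
    FROM activity_stream
    WHERE activity = 'converted'
),
appended AS (                                      -- 2. append prior video activity
    SELECT c.user_id, c.converted_at, c.plan_type,
           json_extract_string(a.features, '$.title') AS title,
           ROW_NUMBER() OVER (
               PARTITION BY c.user_id, c.converted_at ORDER BY a.ts DESC
           ) AS rn,
           COUNT(a.ts) OVER (PARTITION BY c.user_id, c.converted_at) AS videos_watched_before
    FROM cohort c
    LEFT JOIN activity_stream a
           ON a.entity = c.user_id
          AND a.activity = 'watched_video'
          AND a.ts < c.converted_at
)
SELECT user_id, converted_at, plan_type,
       title AS last_video_title, videos_watched_before
FROM appended
WHERE rn = 1;
```

Note the two-step shape - define the cohort, then append - which is exactly the blueprint the rest of this post builds on.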
Temporal Joins
Temporal joins replace foreign keys with time-based relationships. Every query follows the same pattern:
- Define a cohort - the primary activity that sets your row count
- Append related data - join other activities based on when they happened relative to the cohort
The cohort is your anchor. Append operations add columns, never rows. This eliminates fan traps (the row multiplication you get from one-to-many joins) by design.
The Twelve Relationships
There are exactly 12 ways to relate activities in time, organized into 3 families × 4 temporal positions:
| | Ever | Before | After | In Between |
|---|---|---|---|---|
| First | First ever append | First append before | First append after | First append between cohorts |
| Last | Last ever append | Last append before | Last append after | Last append between cohorts |
| Aggregate | Aggregate all ever | Aggregate before | Aggregate after | Aggregate between cohorts |
Temporal positions:
- Ever - all occurrences regardless of cohort timing (static)
- Before - occurrences before the cohort activity
- After - occurrences after the cohort activity
- In Between - occurrences between cohort activity and the next cohort activity
Families:
- First - earliest occurrence (use ORDER BY ts ASC)
- Last - most recent occurrence (use ORDER BY ts DESC)
- Aggregate - count, sum, average, or other aggregation across all matching occurrences
Once you internalize these 12 patterns, you can express almost any analytics question.
Query Blueprints
Every temporal join follows the same structure. The variations come from two choices:
- Window ordering: ASC for First, DESC for Last
- Time filter: none for Ever, < c.ts for Before, > c.ts for After, bounded for In Between
Here’s the First/Last pattern - the foundation for 8 of the 12 relationships:
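A minimal sketch of that template in DuckDB syntax; the table name activity_stream and both activity names are placeholders:

```sql
-- Generic First/Last temporal join (sketch).
WITH cohort AS (
    SELECT ts, entity
    FROM activity_stream
    WHERE activity = 'cohort_activity'            -- the anchor
),
joined AS (
    SELECT c.ts, c.entity,
           a.ts       AS append_ts,
           a.features AS append_features,
           ROW_NUMBER() OVER (
               PARTITION BY c.entity, c.ts
               ORDER BY a.ts DESC                  -- DESC = Last, ASC = First
           ) AS rn
    FROM cohort c
    LEFT JOIN activity_stream a
           ON a.entity = c.entity
          AND a.activity = 'append_activity'       -- what to append
          AND a.ts < c.ts                          -- Before; use > c.ts for After, drop for Ever
)
SELECT ts, entity, append_ts, append_features
FROM joined
WHERE rn = 1;
```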
Aggregate and In Between patterns follow similar logic: GROUP BY handles the aggregation, and the LEAD() window function bounds the In Between window.
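A sketch of that combination - LEAD() bounds each cohort row by the next cohort occurrence, and GROUP BY does the counting (names are again placeholders):

```sql
-- Aggregate In Between (sketch): count append activities between
-- consecutive cohort occurrences for the same entity.
WITH cohort AS (
    SELECT entity, ts,
           LEAD(ts) OVER (PARTITION BY entity ORDER BY ts) AS next_ts
    FROM activity_stream
    WHERE activity = 'cohort_activity'
)
SELECT c.entity, c.ts,
       COUNT(a.ts) AS occurrences_in_between       -- swap for SUM/AVG as needed
FROM cohort c
LEFT JOIN activity_stream a
       ON a.entity = c.entity
      AND a.activity = 'append_activity'
      AND a.ts > c.ts
      AND (c.next_ts IS NULL OR a.ts < c.next_ts)
GROUP BY c.entity, c.ts;
```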
Get the SQL Cheat Sheet
All 12 temporal join patterns as runnable DuckDB examples. Dialect reference included for BigQuery, Snowflake, and Redshift.
Relationships in Practice
Below are three detailed examples - one from each family. The patterns adapt to all 12 relationships by changing the window ordering and time filter.
Last Before (Last Family)
For every cohort activity, append the most recent occurrence of another activity that happened before it.
Example question: “For every video start, what was the last screen the user viewed?”
Raw activity stream:
| ts | activity | entity | features |
|---|---|---|---|
| 10:00:00 | screen_view | session_1 | {screen: “browse”} |
| 10:00:45 | screen_view | session_1 | {screen: “details”} |
| 10:01:00 | video_start | session_1 | {asset: “movie_A”} |
| 10:05:00 | screen_view | session_1 | {screen: “browse”} |
| 10:05:30 | video_start | session_1 | {asset: “movie_B”} |
After temporal join (cohort: video_start, append: screen_view, last before):
| ts | activity | entity | asset | last_screen_before |
|---|---|---|---|---|
| 10:01:00 | video_start | session_1 | movie_A | details |
| 10:05:30 | video_start | session_1 | movie_B | browse |
Each video start gets the screen that was viewed immediately before it - different values for each cohort row.
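A sketch of the query behind that result, assuming the stream table is called activity_stream and the feature keys shown above:

```sql
-- Last Before (sketch): latest screen_view strictly before each video_start.
WITH cohort AS (
    SELECT ts, entity,
           json_extract_string(features, '$.asset') AS asset
    FROM activity_stream
    WHERE activity = 'video_start'
),
joined AS (
    SELECT c.ts, c.entity, c.asset,
           json_extract_string(s.features, '$.screen') AS last_screen_before,
           ROW_NUMBER() OVER (PARTITION BY c.entity, c.ts ORDER BY s.ts DESC) AS rn
    FROM cohort c
    LEFT JOIN activity_stream s
           ON s.entity = c.entity
          AND s.activity = 'screen_view'
          AND s.ts < c.ts
)
SELECT ts, entity, asset, last_screen_before
FROM joined
WHERE rn = 1
ORDER BY ts;
```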
First After (First Family)
For every cohort activity, append the earliest occurrence of another activity that happened after it.
Example question: “For every player_ready event, how long until video actually started?”
Raw activity stream:
| ts | activity | entity | features |
|---|---|---|---|
| 10:00:00 | player_ready | playback_1 | {version: “2.1”} |
| 10:00:03 | video_start | playback_1 | {} |
| 10:10:00 | player_ready | playback_2 | {version: “2.1”} |
| 10:10:08 | video_start | playback_2 | {} |
| 10:20:00 | player_ready | playback_3 | {version: “2.0”} |
After temporal join (cohort: player_ready, append: video_start, first after):
| ts | activity | entity | version | video_start_ts | time_to_start |
|---|---|---|---|---|---|
| 10:00:00 | player_ready | playback_1 | 2.1 | 10:00:03 | 3 sec |
| 10:10:00 | player_ready | playback_2 | 2.1 | 10:10:08 | 8 sec |
| 10:20:00 | player_ready | playback_3 | 2.0 | NULL | NULL |
The NULL reveals playback_3 never started - a QA signal worth investigating.
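The corresponding sketch - same structure as before, but the window is ordered ascending and the time filter flips to after the cohort:

```sql
-- First After (sketch): earliest video_start strictly after each player_ready.
WITH cohort AS (
    SELECT ts, entity,
           json_extract_string(features, '$.version') AS version
    FROM activity_stream
    WHERE activity = 'player_ready'
),
joined AS (
    SELECT c.ts, c.entity, c.version,
           v.ts AS video_start_ts,
           ROW_NUMBER() OVER (PARTITION BY c.entity, c.ts ORDER BY v.ts ASC) AS rn
    FROM cohort c
    LEFT JOIN activity_stream v
           ON v.entity = c.entity
          AND v.activity = 'video_start'
          AND v.ts > c.ts
)
SELECT ts, entity, version, video_start_ts,
       video_start_ts - ts AS time_to_start        -- NULL when playback never began
FROM joined
WHERE rn = 1
ORDER BY ts;
```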
Aggregate In Between (Aggregate Family)
Count, sum, or average activities that occurred between two points.
Example question: “For every completed delivery, how many status updates happened between pickup and dropoff?”
Raw activity stream:
| ts | activity | entity | features |
|---|---|---|---|
| 08:00 | pickup | order_1 | {} |
| 08:15 | status_update | order_1 | {status: “in_transit”} |
| 08:45 | status_update | order_1 | {status: “nearby”} |
| 09:00 | dropoff | order_1 | {} |
| 10:00 | pickup | order_2 | {} |
| 11:30 | dropoff | order_2 | {} |
After temporal join (cohort: dropoff, bounded by pickup, aggregate: count status_update):
| ts | activity | entity | pickup_ts | updates_during_delivery |
|---|---|---|---|---|
| 09:00 | dropoff | order_1 | 08:00 | 2 |
| 11:30 | dropoff | order_2 | 10:00 | 0 |
Order 2 had no status updates between pickup and dropoff - maybe worth flagging.
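Sketched as a query: anchor on dropoff, find the pickup that precedes it, then count status updates inside that window (table name assumed as before):

```sql
-- Aggregate In Between (sketch): status updates between pickup and dropoff.
WITH dropoffs AS (
    SELECT ts AS dropoff_ts, entity
    FROM activity_stream
    WHERE activity = 'dropoff'
),
bounded AS (
    SELECT d.dropoff_ts, d.entity,
           MAX(p.ts) AS pickup_ts                  -- last pickup before the dropoff
    FROM dropoffs d
    JOIN activity_stream p
         ON p.entity = d.entity
        AND p.activity = 'pickup'
        AND p.ts < d.dropoff_ts
    GROUP BY d.dropoff_ts, d.entity
)
SELECT b.dropoff_ts AS ts, b.entity, b.pickup_ts,
       COUNT(u.ts) AS updates_during_delivery
FROM bounded b
LEFT JOIN activity_stream u
       ON u.entity = b.entity
      AND u.activity = 'status_update'
      AND u.ts > b.pickup_ts
      AND u.ts < b.dropoff_ts
GROUP BY b.dropoff_ts, b.entity, b.pickup_ts
ORDER BY b.dropoff_ts;
```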
Adapting Patterns
The remaining 9 relationships follow the same structure:
- Ever variants (First Ever, Last Ever, Aggregate All Ever): Remove the time filter entirely. The result is static across all cohort rows for the same entity.
- Before variants: Add AND a.ts < c.ts to get occurrences before the cohort.
- After variants: Add AND a.ts > c.ts to get occurrences after the cohort.
- In Between variants: Use LEAD() to find the next cohort occurrence and bound the window.
Swap ORDER BY ts ASC for First, ORDER BY ts DESC for Last.
Building Intuition
The mental shift takes time. You stop thinking “which tables do I join?” and start asking:
- What’s my anchor? (cohort activity)
- What do I need to know? (append activity)
- Which family? (First, Last, or Aggregate)
- Which temporal position? (Ever, Before, After, or In Between)
For the VOD player debugging case:
- Cohort: video_start events
- Append: player_ready
- Family: First (we want the earliest)
- Position: Before (must happen before video starts)
- If the join returns NULL, you found your bug - video started without player initialization (see the sketch below).
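Sketched as a query, grouped by app version so broken releases surface immediately (this assumes video_start also carries an app_version feature):

```sql
-- Sequence check (sketch): video_start rows with no player_ready before them.
WITH starts AS (
    SELECT ts, entity,
           json_extract_string(features, '$.app_version') AS app_version
    FROM activity_stream
    WHERE activity = 'video_start'
),
checked AS (
    SELECT s.app_version, s.entity, s.ts,
           r.ts AS player_ready_ts,
           ROW_NUMBER() OVER (PARTITION BY s.entity, s.ts ORDER BY r.ts ASC) AS rn
    FROM starts s
    LEFT JOIN activity_stream r
           ON r.entity = s.entity
          AND r.activity = 'player_ready'
          AND r.ts < s.ts
)
SELECT app_version,
       COUNT(*) AS starts_without_player_ready
FROM checked
WHERE rn = 1 AND player_ready_ts IS NULL
GROUP BY app_version
ORDER BY starts_without_player_ready DESC;
```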
Who Benefits Most
Activity Schema isn’t universally better. It shines in specific contexts:
Compliance and QA
This is where I saw it work firsthand. Validating event sequences becomes explicit. “Find sessions where video_start happened before player_ready” is a temporal join that returns NULLs on the append side. What used to be multi-hour debugging sessions turned into five-minute queries.
Regulatory rules often require proving order: “approval happened before execution”, “consent collected before data processing”. The join returns NULL when the sequence is violated - compliance checks become queries, not audits.
Customer 360 / User Journey Teams
“What did user X do before churning?” becomes a single self-join instead of a multi-table expedition. Customer timelines live in one place. Dynamic cohorts based on behavioral sequences - “users who viewed pricing 3+ times before signing up” vs “impulse converters” - are aggregate temporal joins that feed directly into marketing systems or personalization engines.
Feature Engineering for ML
Temporal relationships map directly to ML features: “time since last X”, “count of Y before Z”, “first value of A after cohort”. Instead of bespoke feature pipelines per model, you have a standard vocabulary. A churn model and a fraud model use the same append patterns - just different activities.
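A sketch of what that looks like: two behavioral features per cohort row, built from the same append vocabulary (the activity names subscription_cancelled, support_ticket, and video_start are placeholders):

```sql
-- Feature engineering (sketch): "time since last X" and "count of Y before Z".
WITH cohort AS (
    SELECT ts, entity
    FROM activity_stream
    WHERE activity = 'subscription_cancelled'
)
SELECT c.entity,
       c.ts AS cancelled_at,
       date_diff('day',
                 MAX(s.ts) FILTER (WHERE s.activity = 'support_ticket'),
                 c.ts) AS days_since_last_support_ticket,
       COUNT(s.ts) FILTER (WHERE s.activity = 'video_start')
           AS videos_started_before_cancel
FROM cohort c
LEFT JOIN activity_stream s
       ON s.entity = c.entity
      AND s.ts < c.ts
GROUP BY c.entity, c.ts;
```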
The value isn’t the single table - it’s that temporal relationships become composable building blocks for any downstream system that needs behavioral context.
The Tradeoffs
I shipped this internally but the project ended before broader adoption. Here’s what I learned about the tradeoffs:
You need a documented vocabulary
The 12 temporal relationships are powerful, but your team needs to internalize them. “Last before” vs “first after” is intuitive once you get it - but there’s a learning curve. Documentation and examples become essential.
Beyond the relationships, you also need consistent, documented activity names. Is it video_start or videoStarted or playback_started? Without a canonical list of activities and their meanings, queries become guesswork. The schema is simple, but governance is not optional.
Features JSON can get bloated
Stuffing contextual data into a JSON column is flexible but dangerous without discipline. Every producer adding “just one more field” leads to bloated payloads with inconsistent keys across activity types. Query syntax varies by warehouse (DuckDB’s json_extract_string vs BigQuery’s JSON_VALUE vs Snowflake’s GET_PATH). You need clear guidelines on what belongs in features vs what deserves its own activity.
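For a sense of the divergence, here is the same feature lookup sketched in three dialects (illustrative; assumes a features column holding an app_version key):

```sql
-- DuckDB (features stored as JSON)
SELECT json_extract_string(features, '$.app_version') AS app_version FROM activity_stream;

-- BigQuery (features stored as JSON or STRING)
SELECT JSON_VALUE(features, '$.app_version') AS app_version FROM activity_stream;

-- Snowflake (features stored as VARIANT)
SELECT GET_PATH(features, 'app_version')::string AS app_version FROM activity_stream;
```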
Not for every workload
OLTP? No. Real-time dashboards with sub-second latency? Probably not. Activity Schema is optimized for analytical queries on incremental, historical event streams. Know your access patterns.
Incremental updates need thought
Late-arriving events, retroactive corrections, backfills - these are solvable but require explicit handling. The “single source of truth” promise holds only if your ingestion pipeline maintains it.
Where It Didn’t Fit
Not every evaluation ended with adoption. On a different project, we considered Activity Schema for marketing segmentation - the data team was spending 50% of sprint capacity (75% in peak sprints) on repetitive ad-hoc segment requests, blocking development work and creating bottlenecks for campaigns.
Activity Schema seemed promising: behavioral cohorts are exactly what temporal joins excel at. But we rejected it for three reasons:
The learning curve was too steep for this team. The temporal join vocabulary takes time to internalize. For a team already underwater with requests, adding a new mental model wasn’t realistic. They needed relief now, not in three months.
Columnar storage inefficiency. The features JSON column fights columnar storage. Modern warehouses optimize for typed columns they can compress and scan efficiently - not nested JSON that requires parsing at query time. The syntax bloat of JSON_VALUE(features, '$.field') across every query adds friction and runtime overhead.
Cross-entity analytics require multiple stream joins. Segmentation often combines user-level, account-level, and organization-level attributes. Activity Schema’s single-entity-per-stream assumption meant we’d need to join multiple activity streams anyway - erasing the “one table” simplicity.
We’re now evaluating star schema versus OBT (One Big Table) for this use case.
When to Consider This
Ask yourself these questions:
Is your data fundamentally event-based?
If you’re modeling entities that do things over time - users, sessions, orders, devices - Activity Schema maps naturally. If you’re modeling static reference data or slowly-changing dimensions, stick with traditional approaches.
Are your analysts struggling with joins?
When simple questions require complex multi-table queries, the schema is fighting the questions. Activity Schema realigns the data model with how questions are actually asked.
Do you need flexible, ad-hoc analysis?
Star schemas optimize for known queries. Activity Schema optimizes for exploratory analysis where the questions aren’t predetermined. The tradeoff: you need temporal join literacy instead of ERD literacy.
Can you invest in the vocabulary?
This isn’t a drop-in replacement. Your team needs training, documentation, and practice. If you’re moving fast with a small team, the overhead might not pay off. If you’re building a platform for many analysts, the investment compounds.
Are you solving an internal problem first?
Activity Schema works well as a QA or debugging tool before committing to it for customer-facing analytics. Start with an internal use case - event sequence validation, anomaly detection, root cause analysis - where the stakes are lower and learning is faster. Once your team builds intuition, broader adoption becomes natural.
Closing Thoughts
The internal project where I used Activity Schema ended before we could expand it beyond QA. But what I saw convinced me: for a team debugging event sequences daily, the model worked. Queries that used to require mental gymnastics became mechanical. The vocabulary took a few days to internalize - then it clicked.
If I were starting a greenfield analytics project today, I’d seriously consider Activity Schema for the core event layer. Not because it’s simpler in every way - the temporal join vocabulary has a learning curve - but because it aligns the data model with how we actually ask questions about user behavior.
If you’re drowning in joins, debugging event sequences, or building a Customer 360, consider giving it a shot. Start with a single internal use case. Build intuition before going all-in.
The spec lives at activityschema.com. The ideas are simple enough to fit on one page - which might be the strongest argument for trying it.
Don't Rewrite These From Memory
12 copy-paste SQL patterns with generic templates and worked examples for each.