Group 3 - Checklist Level 2¶
dbt Level Up¶
Start on this checklist once you have completed Checklist Level 1.
In this level you will apply the following skills:
- Revenue mart - fix the grain bug and allocate spend by impression share
- Snapshots - SCD Type 2 for campaign budget changes
- Singular tests - custom SQL assertions for revenue correctness
- Documentation - YAML for staging and mart models
Work through the steps in order. Expand a hint only after you've had a genuine attempt - the struggle is where the learning happens!
Target shape: revenue_by_content¶
You'll build the mart up step by step. Refer back to this table whenever you need to know what column comes from where, or what the final grain should look like.
Grain: one row per (campaign_id, content_id, impression_date).
| Column | Source | What it is |
|---|---|---|
campaign_id |
fct_ad_impressions |
Foreign key to stg_ads__campaigns |
content_id |
fct_ad_impressions |
The content item the impression ran against |
impression_date |
fct_ad_impressions |
Day the impressions were served |
impressions_count |
fct_ad_impressions |
Impressions for this (campaign, content, date) |
campaign_type |
stg_ads__campaigns |
Drives the commission rate lookup |
spend_dollars |
stg_ads__spend |
Gross daily campaign spend |
net_spend_dollars |
stg_ads__spend |
Spend minus platform fees |
impression_share |
derived | impressions_count / total_campaign_impressions for the day |
allocated_spend_dollars |
derived | impression_share × spend_dollars |
allocated_net_spend_dollars |
derived | impression_share × net_spend_dollars |
commission_rate |
commission_lookup seed |
MediaPulse's cut for this campaign type |
mediapulse_revenue_dollars |
derived | allocated_net_spend_dollars × commission_rate |
Step 1 - Get the grain right in revenue_by_content.sql¶
- Step complete
Open models/marts/revenue/revenue_by_content.sql. There's an existing model - you'll work from it, not throw it away. The model already has spend and campaigns CTEs you'll keep using in later steps; what's broken is the final aggregation, which groups at campaign_id and loses the per-content breakdown the mart name promises.
For this step, focus on just one thing: getting the grain right. Add an impressions CTE that pulls from fct_ad_impressions, then replace the broken campaign_revenue CTE with a final CTE at the impression grain (campaign_id, content_id, impression_date).
Leave the spend and campaigns CTEs in place; you'll use them for enrichment in Step 2.
Hint: Why the existing aggregation is wrong
The model groups by campaign_id, campaign_name, campaign_type, advertiser_id - there's no content_id in the GROUP BY. You get one row per campaign, not one row per content item per campaign per day. The mart is named "by content" but doesn't actually break down by content.
Hint: Skeleton
with spend as (
select * from {{ ref('stg_ads__spend') }}
),
campaigns as (
select * from {{ ref('stg_ads__campaigns') }}
),
impressions as (
select
campaign_id,
content_id,
impression_date,
impressions_count
from {{ ref('fct_ad_impressions') }}
),
final as (
select
campaign_id,
content_id,
impression_date,
impressions_count
from impressions
)
select * from final
Run it and confirm the row count matches fct_ad_impressions:
Step 2 - Enrich with only what's needed downstream¶
- Step complete
Now bring in the additional columns the downstream logic actually requires - and nothing more. Use the spend and campaigns CTEs that are already at the top of the file (no need to re-declare them). Resist the temptation to pull campaign_name, advertiser_id, or anything else "just in case" - check the target shape table above if you're unsure whether a column will be needed.
What downstream needs are concrete?
- The commission lookup (a later step) joins
commission_lookuponcampaign_type. So you needcampaign_typefromcampaigns. - The allocation calculation (also later) multiplies impression share by gross and net spend. So you need
spend_dollarsandnet_spend_dollarsfromspend, joined on(campaign_id, spend_date = impression_date).
Add an enrichment CTE between impressions and final, and pass those columns through. The grain should not change.
Hint: Enriched skeleton
with spend as (
select * from {{ ref('stg_ads__spend') }}
),
campaigns as (
select * from {{ ref('stg_ads__campaigns') }}
),
impressions as (
select
campaign_id,
content_id,
impression_date,
impressions_count
from {{ ref('fct_ad_impressions') }}
),
enriched as (
select
i.campaign_id,
i.content_id,
i.impression_date,
i.impressions_count,
c.campaign_type,
s.spend_dollars,
s.net_spend_dollars
from impressions i
inner join campaigns c
using (campaign_id)
inner join spend s
on i.campaign_id = s.campaign_id
and i.impression_date = s.spend_date
),
final as (
select
campaign_id,
content_id,
impression_date,
impressions_count,
campaign_type,
spend_dollars,
net_spend_dollars
from enriched
)
select * from final
Re-run the model and confirm the row count is unchanged from Step 1 - if it dropped, your join is filtering rows it shouldn't.
Step 3 - Compute the derived metrics and finalize the mart¶
- Step complete
Now layer the math on top of the enriched CTE so the mart matches the target shape table. There are three derivations to add, in order:
impression_share- what fraction of a campaign's total impressions on a given day belongs to this content item? Compute it fromimpressions_countdivided by the total across all content for the same(campaign_id, impression_date). Watch out for divide-by-zero.allocated_spend_dollarsandallocated_net_spend_dollars- multiply the share by gross and net spend for the row's(campaign_id, impression_date).commission_rateandmediapulse_revenue_dollars- bring incommission_ratefrom thecommission_lookupseed by joining oncampaign_type, then multiply byallocated_net_spend_dollars. Use aleft joinso a missing commission rate surfaces as a NULL revenue row rather than silently dropping the content.
After this step, the row count should still match Step 1 and the column list should match the target shape table exactly.
Hint: Computing impression_share with a window function
The "total campaign impressions per day" can be computed without a separate aggregation CTE - a window function does it in one pass:
Hint: Full model
The complete revenue_by_content.sql after this step:
with spend as (
select * from {{ ref('stg_ads__spend') }}
),
campaigns as (
select * from {{ ref('stg_ads__campaigns') }}
),
impressions as (
select
campaign_id,
content_id,
impression_date,
impressions_count
from {{ ref('fct_ad_impressions') }}
),
enriched as (
select
i.campaign_id,
i.content_id,
i.impression_date,
i.impressions_count,
c.campaign_type,
s.spend_dollars,
s.net_spend_dollars
from impressions i
inner join campaigns c
using (campaign_id)
inner join spend s
on i.campaign_id = s.campaign_id
and i.impression_date = s.spend_date
),
with_share as (
select
*,
impressions_count
/ nullif(sum(impressions_count) over (
partition by campaign_id, impression_date
), 0) as impression_share
from enriched
),
allocated as (
select
*,
impression_share * spend_dollars as allocated_spend_dollars,
impression_share * net_spend_dollars as allocated_net_spend_dollars
from with_share
),
with_commission as (
select
a.*,
cl.commission_rate,
a.allocated_net_spend_dollars * cl.commission_rate as mediapulse_revenue_dollars
from allocated a
left join {{ ref('commission_lookup') }} cl using (campaign_type)
)
select
campaign_id,
content_id,
impression_date,
impressions_count,
campaign_type,
spend_dollars,
net_spend_dollars,
impression_share,
allocated_spend_dollars,
allocated_net_spend_dollars,
commission_rate,
mediapulse_revenue_dollars
from with_commission
After running, spot-check a single (campaign_id, impression_date) and confirm the impression shares for that combination sum to roughly 1.0, and that the allocated spend rows sum to the campaign's gross spend for the day.
Step 4 - Reason about why we need a snapshot¶
- Step complete
Before writing any code, take 5-10 minutes with your group to work through the questions below and write your answers down - you'll feed them straight into Step 5 to make concrete config choices.
-
The state-overwrite problem.
stg_ads__campaignsis rebuilt on every run fromads.campaigns. If an advertiser bumped theirbudget_centsfrom500000to1000000last week, what doesstg_ads__campaignssay the budget was yesterday? What did the source actually say yesterday? -
What you lose. Name two business questions about a campaign you can only answer if you've kept a history of past values - and would silently get the wrong answer for if you just used
stg_ads__campaigns. (Think pacing reports, audit trails, mid-flight budget changes.) -
Which columns move? Open
seeds/raw_ads__campaigns.csv(orselect * from ads.campaigns limit 5) and look at the columns:campaign_id, advertiser_id, campaign_name, campaign_type, start_date, end_date, budget_cents. Which of these are realistically going to change after a campaign launches? Which ones are essentially static? -
Update marker. dbt's
timestampsnapshot strategy needs a single column it can use as the "this row was updated at" marker. Looking at the column list above, is there one? (...there isn't. Note that down - it'll force a strategy choice in Step 5.) -
(Optional, if time permits) Snapshots implement Slowly Changing Dimension Type 2. Skim the dbt snapshots docs for what
dbt_valid_from,dbt_valid_to, anddbt_scd_idmean. Sketch out what the rows for a single campaign look like after dbt detects a budget change.
Hint: The core problem in one sentence
stg_ads__campaigns overwrites itself on every run, so the previous values disappear forever. Any analytics question about "what was true on date X?" silently uses today's values instead - producing answers that look reasonable but are wrong (e.g. a campaign appears to be delivering nicely "under budget", because the budget was actually bumped last Tuesday, not because spend was controlled).
Hint: What check strategy looks like, conceptually
Without an updated_at column, dbt can't know whether a row changed unless you tell it which columns to compare. The check strategy takes a list of columns and creates a new SCD row whenever any of those columns differ from the last snapshotted version. The cost is that you have to maintain that column list as the schema evolves.
Step 5 - Create the snapshot¶
- Step complete
Now turn the answers from Step 4 into config. Create snapshots/snap_ads__campaigns.yml. Three things you've effectively decided already:
unique_key- which column identifies a campaign across versions (your answer to Step 4 question 1's "what is a campaign?").strategy- your answer to Step 4 question 4 should rule outtimestamp. Usecheckinstead.check_cols- the list of columns from Step 4 question 3 that actually move.
Hint: Snapshot skeleton
snapshots:
- name: <string>
relation: ref() | source()
config:
database: <string>
schema: <string>
unique_key: <column_name_or_expression>
strategy: timestamp | check
updated_at: <column_name> # only needed with timestamp strategy
check_cols: [<column_name>] | all # only needed with check strategy
Run it:
Then query the snapshot table - dbt will have added dbt_scd_id, dbt_valid_from, dbt_valid_to, and dbt_updated_at. On this first run every row has dbt_valid_to is null (it's the current version of each campaign). The history would start to accumulate the moment a watched column changed upstream - in production the snapshot job runs on a schedule, and any subsequent change to budget_cents, end_date, or campaign_type would close out the existing row (dbt_valid_to = now()) and insert a new one (dbt_valid_from = now(), dbt_valid_to = null).
Step 6 - Write a singular test: revenue is never null¶
- Step complete
Create tests/assert_revenue_not_null.sql. Every row in revenue_by_content represents spend MediaPulse has allocated to content, so every row should have a derivable mediapulse_revenue_dollars value. A NULL there means we've allocated spend we can't attribute revenue to.
Hint: When does a dbt test pass?
A test passes when the query returns zero rows - any rows returned are failures.
Hint
Run it:
Step 7 - Write singular test: no negative spend¶
- Step complete
Create tests/assert_no_negative_spend.sql. Negative values in stg_ads__spend indicate a data pipeline issue upstream.
Hint
select spend_id, spend_date, spend_dollars
from {{ ref('stg_ads__spend') }}
where spend_dollars < 0
or net_spend_dollars < 0
Run it:
Step 8 - Add YAML for staging models and the mart¶
- Step complete
Create models/staging/ads/_ads__models.yml and models/marts/revenue/_revenue__models.yml. Document columns and add tests.
Include at minimum:
- Primary keys - on every model (
stg_ads__campaigns,stg_ads__spend,stg_ads__impressions,fct_ad_impressions), addnot_nullanduniquetests on the PK column (campaign_id,spend_id,impression_id,impression_id). Forrevenue_by_content, the PK is the composite(campaign_id, content_id, impression_date)- cover that with adbt_utils.unique_combination_of_columnstest. - Not-null on the columns the mart depends on - at minimum the foreign keys and the columns used in joins or measures:
campaign_id,content_id,impression_date,impressions_count,campaign_type,spend_dollars,net_spend_dollars. A NULL in any of these silently breaks downstream joins or measures. - A
relationshipstest onfct_ad_impressions.campaign_id→stg_ads__campaigns.campaign_id. - A
relationshipstest onstg_ads__campaigns.campaign_type→commission_lookup.campaign_type. Every campaign type that appears in the data needs a commission rate downstream - this test enforces that. If it fails, don't just silence it; query the failing rows and work out why the lookup is missing them. - An
accepted_valuestest oncampaign_typeinstg_ads__campaigns. Level 3 Step 2 will reconfigure this test - so make sure it exists.
Figuring out the accepted values
You need to decide what values to put in the test. Two reasonable approaches:
- Query
stg_ads__campaignsand pull the distinctcampaign_typevalues. - Look at
seeds/commission_lookup.csv- the seed already enumerates the campaign types the platform supports.
The two should agree. If they don't, that's itself a finding worth noting - and the relationships test above will tell you which values are out of sync.
Hint: Relationships test
Step 9 - Run dbt build --select +revenue_by_content¶
- Step complete
This builds the full lineage - sources → staging → incremental fact → mart - and runs all tests. Fix any failures.
Hint: If assert_revenue_not_null fails
Look at the failing rows and check the campaign_type column. The most likely cause is a campaign_type value that exists in stg_ads__campaigns but isn't in seeds/commission_lookup.csv, so the left join in revenue_by_content produces a NULL commission_rate (and therefore a NULL mediapulse_revenue_dollars). The relationships test you added in Step 8 (stg_ads__campaigns.campaign_type → commission_lookup.campaign_type) should fail at the same time and point at the same root cause.
Done?
You've built the revenue spine of the MediaPulse data platform - a correct-grain mart, a campaign budget snapshot, and custom SQL assertions that verify revenue integrity.
Now head to Level 3 to configure your tests with severity, where clauses, and dbt_expectations guardrails on your revenue models!