Group 2 - Checklist Level 1¶
dbt Fundamentals + Bug Fixing¶
Start here. The first four steps are about reading critically and fixing what's broken - resist the urge to skip straight to building. Understanding why a bug exists is more valuable than the fix itself.
In this level you will:
- Read and fix existing staging models with real production bugs
- Create a seed for a reference lookup table
- Build a cross-domain mart combining news and podcast content
Work through the steps in order. Expand a hint only after you've had a genuine attempt - the struggle is where the learning happens!
Step 1 - Add tests to stg_news__articles.sql¶
- Step complete
In the _news__models.yml ...
- what is the primary key of the
stg_news__articlesmodel? - run
dbt test --select stg_news__articles- what tests are already being executed? - which built-in test is missing for a primary key? Add it and rerun the test command.
Hint: Tests on source columns
tables:
- name: table_name
identifier: table_name
columns:
- name: column_name
tests:
- not_null
- unique
Run source tests:
Once you have rerun the command, you should see some errors - find the code that is being executed to run this test in the command history -> Debug logs. Can you see why it has errored?
Step 2 - Audit stg_news__articles.sql¶
- Step complete
Open models/staging/news/stg_news__articles.sql and read it carefully. Query the raw source to see if there are any data inconsistencies that should be treated in the staging model.
Look particularly at the article_id - is this unique? Use a window function to check.
Does the staging model handle this? What happens downstream if it doesn't?
Hint: Use a window function to identify duplicates
Run the following:
What is happening?Why are there duplicates?
The raw articles table contains duplicate article_id values - articles get republished with a new updated_at timestamp.
Better yet... you can add a unique test to test uniqueness.
Fix needed!
The current staging model selects * without deduplication. This means any downstream model joining on article_id will fan out and produce inflated row counts.
You will apply this fix in the following step.
Step 3 - Fix stg_news__articles.sql¶
- Step complete
Apply the deduplication fix. Keep only the most recent row per article_id (highest updated_at).
Hint: Adapt the window function from step 1
Remember the code used to check for duplicates? How could this be adapted to keep only the most up to date version of the article (i.e. the one most recently (re)published)?
with window_article as (
select
*,
count(*) over (partition by article_id) as cnt_article_id
from news.articles
)
select * from window_article
where cnt_article_id > 1
order by article_id, published_at
After the fix, re-run the row count check against the staged model and confirm article_id is now unique.
Really stuck? Check this SQL here:
Here is a useful code snippets you can adapt for your needs. Read it through and make sure you an understand each part.
After the fix, re-run the row count check against the staged model and confirm article_id is now unique.
Step 4 - Check for unique combination of columns¶
- Step complete
We have seen that the article_id in the raw data is not unique. However, new articles should only be added to the source when they were updated with a new updated_at timestamp. We can verify this by adding another data test on the uniqueness of the combination.
Add the dbt_utils package to packages.yml:
Run the following command in the CLI to load the packages.
Now add a test for the unique combination of article_id and updated_at to the source in _news__sources.yml.
Hint: unique_combination_of_columns test
Add the test at the table level (not column level) in your .yml file:
Step 5 - Create seeds/category_mapping.csv¶
- Step complete
Articles and podcast episodes both have a category column, but the values don't match what the end users expect. They have created the following mapping file for you:
category,category_group
politics,news_and_current_affairs
technology,tech_and_science
sport,sport_and_health
entertainment,arts_and_culture
Place this file at seeds/category_mapping.csv.
Hint: Adding a seed config
In dbt_project.yml, you can configure the seed's schema and column types:
seeds:
mediapulse:
+schema: schema_name # choose where the seeds should land in the warehouse
category_mapping:
+column_types:
column_name: # add a data type, eg varchar(50)
Then load it:
Step 6 - Build content_performance.sql¶
- Step complete
Create models/marts/content/content_performance.sql. This mart should:
- Pull all articles from
stg_news__articles - Pull all episodes from
stg_podcasts__episodes UNION ALLthe two after normalising to a common schema- Join the result to
category_mapping(your seed) to getcategory_group
Check the existing model first
Open models/marts/content/content_performance.sql. The existing model uses a JOIN between articles and episodes - why is this the incorrect approach?
Hint: Why the model is wrong
The model does:
This produces a cross-join of every article in a category with every episode in the same category - exactly the row explosion the dedup fix was meant to prevent elsewhere. Articles and episodes are separate content items; they should be stacked, not joined.
Hint: The better approach
What you're building: A single unified content table that combines articles and podcast episodes, then enriches it with normalised category labels.
- Pull all articles from stg_news__articles
- Pull all episodes from stg_podcasts__episodes
- UNION ALL the two after normalising to a common schema
- Join the result to category_mapping (your seed) to get normalised_category and category_group
Match the steps to build out the following CTEs
CTEs 1 & 2 - articles and episodes
Pull from each staging model and rename columns into a shared schema that works for both content types. For columns that only apply to one content type, explicitly fill the other with a placeholder. Add a hardcoded column to identify which platform each row came from.
Note: Episodes don't have a
categorycolumn — it lives onstg_podcasts__shows. Joinstg_podcasts__showsinto theepisodesCTE onshow_idto bring it in.
CTE 3 - combined
Stack both CTEs into one dataset, keeping all rows from both.
CTE 4 - with_category
Join to the category_mapping model. Use coalesce to return the mapped category where available, falling back to the raw value if not.
Really stuck? See some example SQL:
Adapt the following SQL code to use a better combination for episodes and articles:
with articles as (
select
article_id as content_id,
article_title as content_title,
published_at,
category as raw_category, 'news' as platform,
word_count as content_length_units, -- words for articles null as duration_seconds
from {{ ref('stg_news__articles') }}
),
episodes as (
select
-- ✏️ normalize columns that episodes have
-- in common to match the above
-- ✏️ add columns that don't exist to match
-- the schema above
duration_seconds
from {{ ref('stg_podcasts__episodes') }}
),
combined as (
select * from articles
union all
select * from episodes
),
with_category as (
select
-- ✏️ add the category group and
-- coalesce the category with raw_category
from -- select the correct CTE
left join -- add the seed reference
using -- which column should you join on?
)
select * from with_category
Step 7 - BONUS: Run dbt build --select +content_performance¶
- Step complete
This builds the entire upstream lineage of your mart plus the mart itself, then runs all tests.
Fix any failures before moving on. A test failure is information - read the error message, query the failing rows, understand why.
Done?
You've fixed two real production bugs and built a cross-domain content mart that unifies news and podcast data. Nice work.
Now head to Level 2 to add snapshots, improve test coverage, and document your models!