
Group 4 - Checklist Level 1

dbt Project Audit

This is an audit-first track. Your job is to find what's broken or missing in the MediaPulse project - then fix it. The other groups built features; you build the safety net.

In this level you will:

  • Install audit packages - dbt-project-evaluator and dbt-codegen
  • Run the evaluator and surface violations
  • Triage findings by risk level
  • Fix the highest-priority violations
  • Fill documentation gaps with codegen

Work through the steps in order. Document your decisions as you go - you'll present findings at 16:00 on Day 2.


Step 1 - Add packages and run dbt deps


Add the three packages to packages.yml:

packages:
  - package: dbt-labs/dbt_project_evaluator
    version: [">=0.8.0", "<1.0.0"]
  - package: dbt-labs/dbt_codegen
    version: [">=0.12.0", "<1.0.0"]
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<1.0.0"]
Then install them:

dbt deps
Hint: If you get version conflicts

Check dbt_project.yml for require-dbt-version. If it's pinned to an old version, some packages may not install. You may need to update the dbt version requirement or pin package versions lower. Run dbt --version to check what's installed.
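For example, a compatible version range in dbt_project.yml might look like the sketch below. The exact bounds are an assumption here; match them to whatever dbt --version reports in your environment.

```yaml
# dbt_project.yml
# Allow any dbt 1.x release; narrow or widen the range to match
# the version your environment actually runs.
require-dbt-version: [">=1.5.0", "<2.0.0"]
```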


Step 2 - Run dbt-project-evaluator

dbt build --select package:dbt_project_evaluator

This runs a set of models that analyse your project metadata. Query the results tables:

-- Which models have no tests?
select * from dbt_project_evaluator.fct_missing_primary_key_tests;

-- Which models have no descriptions?
select * from dbt_project_evaluator.fct_undocumented_models;

-- Which source columns are untested?
select * from dbt_project_evaluator.fct_missing_source_tests;

-- Are there any direct source references in marts? (should go through staging)
select * from dbt_project_evaluator.fct_direct_join_to_source;
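Before triaging row by row, it can help to see the overall shape of the problem. A sketch that counts violations per check, assuming the same schema and table names used in the queries above (exact table availability depends on your evaluator version):

```sql
-- Rough violation count per check, worst first
select 'missing_primary_key_tests' as check_name, count(*) as violations
from dbt_project_evaluator.fct_missing_primary_key_tests
union all
select 'undocumented_models', count(*)
from dbt_project_evaluator.fct_undocumented_models
union all
select 'missing_source_tests', count(*)
from dbt_project_evaluator.fct_missing_source_tests
union all
select 'direct_join_to_source', count(*)
from dbt_project_evaluator.fct_direct_join_to_source
order by violations desc;
```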
Hint: Navigating evaluator output

The full list of evaluator models is in the dbt-project-evaluator docs. Key models to check first:

  • fct_missing_primary_key_tests - models with no unique + not_null test combo on a key column
  • fct_undocumented_models - models with no description
  • fct_direct_join_to_source - marts that join directly to raw sources, bypassing staging
  • fct_model_naming_conventions - models that don't follow stg_, int_, fct_, dim_ conventions
  • fct_missing_source_tests - source columns with no tests at all

Create a simple tracker (a shared doc or whiteboard) categorising findings as: Must fix, Should fix, Won't fix / by design.


Step 3 - Triage and prioritise findings


For each evaluator violation, decide:

  • Must fix before prod: data correctness risk, missing tests on key columns, direct source references in marts
  • Should fix: documentation gaps, naming convention violations
  • Won't fix / acceptable: architectural decisions made consciously (e.g., a mart that intentionally queries a source for performance)

Record your triage decisions with a brief rationale - you'll present this.
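For findings you have consciously accepted as "won't fix", you can keep the evaluator run green by skipping that check with dbt's standard node selection syntax. A sketch, using fct_model_naming_conventions as an illustrative example of an excluded check:

```shell
# Re-run the evaluator while skipping a check triaged as "won't fix".
# --exclude uses standard dbt node selection, so any evaluator model
# name can be excluded the same way.
dbt build --select package:dbt_project_evaluator --exclude fct_model_naming_conventions
```

Record any exclusions alongside your triage rationale so the next team knows the check was skipped deliberately, not missed.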

Hint: What usually matters most

In practice, the highest-risk findings are:

  1. Missing unique + not_null tests on primary keys (silent duplicates)
  2. Marts joining directly to raw sources (no lineage visibility, no staging quality gate)
  3. Models with zero documentation (onboarding and debugging pain)

Naming conventions are important for consistency but rarely cause production incidents on their own.


Step 4 - Fix the highest-priority violations


Work through your Must fix list. Typical fixes:

  • Add missing primary key tests to YAML files
  • Add descriptions to models and columns
  • Refactor any direct-source-to-mart joins to go through staging models
# Re-run after fixes to confirm the violation count drops
dbt build --select package:dbt_project_evaluator
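A typical primary key fix is a unique + not_null pair on the key column in the model's YAML. A sketch using the stg_streaming__watch_events model from this project (the file path and description text are illustrative):

```yaml
# e.g. models/staging/streaming/_stg_streaming__models.yml
version: 2

models:
  - name: stg_streaming__watch_events
    columns:
      - name: watch_event_sk
        description: "Surrogate key for a watch event."
        tests:
          - unique      # catches silent duplicates
          - not_null    # catches missing keys
```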

Step 5 - Use dbt-codegen to fill documentation gaps


For any models that lack YAML documentation, use codegen to generate a starter:

# Generate source YAML (useful if other groups left sources undocumented)
dbt run-operation generate_source \
  --args '{"schema_name": "streaming", "database_name": "your_db"}'

# Generate model YAML for a single model
dbt run-operation generate_model_yaml \
  --args '{"model_names": ["stg_streaming__watch_events"]}'

# Generate model YAML for several models at once
dbt run-operation generate_model_yaml \
  --args '{"model_names": ["stg_ads__campaigns", "stg_ads__spend", "stg_ads__impressions"]}'

Paste the output into the appropriate YAML files, then fill in the descriptions. Codegen gives you the column list; you provide the meaning.

Hint: What codegen output looks like
version: 2

models:
  - name: stg_streaming__watch_events
    description: ""         # <-- fill this in
    columns:
      - name: watch_event_sk
        description: ""     # <-- fill these in
      - name: event_id
        description: ""
      - name: user_id
        description: ""
      # ... etc

The column list is introspected from the built relation in your warehouse, so run the model first (dbt run) before generating its YAML - otherwise codegen has nothing to inspect.
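A filled-in version might read like the sketch below. The descriptions here are illustrative placeholders, not the project's actual definitions - write what each field means in your data:

```yaml
version: 2

models:
  - name: stg_streaming__watch_events
    description: "One row per watch event from the raw streaming feed."
    columns:
      - name: watch_event_sk
        description: "Surrogate key for the watch event."
      - name: event_id
        description: "Source-system identifier for the event."
      - name: user_id
        description: "Identifier of the user who watched."
```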


Step 6 - BONUS: Deep dive on a specific evaluator finding


Pick the evaluator finding that surprised you most and dig into it:

  1. Which models are affected?
  2. What's the concrete risk - what could go wrong in production?
  3. Is the fix straightforward or does it require refactoring?

Write a short paragraph (3–5 sentences) describing your finding. You'll use this in your presentation.


Done?

You've installed the audit toolchain, surfaced violations, triaged them by risk, and fixed the highest-priority issues. The project is materially safer than when you started.

Now head to Level 2 to add statistical guardrails, model contracts, design the CI/CD pipeline, and prepare your presentation!