Common Mistakes in DBT That Can Lead to Performance and Operational Issues

DBT (Data Build Tool) is a powerful framework for data transformation, but like any powerful tool, small missteps can lead to big issues — especially at scale. In this post, I’ll walk you through three common pitfalls I’ve seen in

DBT (Data Build Tool) is a powerful framework for data transformation, but like any powerful tool, small missteps can lead to big issues — especially at scale. In this post, I’ll walk you through three common pitfalls I’ve seen in production DBT projects that can cause performance degradation, operational issues, or incorrect lineage tracking — and how to avoid them.

. . .

🚨 Mistake 1: Misusing DELETE+INSERT When unique_key Is Not Truly Unique

What happens: When using the delete+insert materialization strategy in an incremental model, DBT generates a DELETE query that joins the source and target tables on the unique_key. If that key isn’t truly unique, you end up with a many-to-many join—and potentially a massive performance hit.

Problematic scenario:

Let’s say you’re building a gold_model that refreshes data based on tx_date:

				
					{{ config(
    materialized='incremental',
    materialization_strategy='delete+insert',
    unique_key='tx_date'
) }}

SELECT * FROM {{ ref('silver_model') }}sq
				
			

This delete executes into something like:

				
					DELETE FROM gold_model d
USING gold_model__dbt_tmp tmp
WHERE d.tx_date = tmp.tx_date;
				
			

If your tx_date exists on 100,000+ records in both tables, this join explodes, causing long runtimes and high resource usage.

A better way: Use append with a pre_hook

Instead of relying on DBT’s default strategy, handle the deletion yourself:

				
					{{ config(
    materialized='incremental',
    materialization_strategy='append',
    pre_hook="""
        DELETE FROM gold_model
        WHERE tx_date IN (SELECT DISTINCT tx_date FROM {{ ref('silver_model') }})
    """
) }}

SELECT * FROM {{ ref('silver_model') }}sq
				
			

This avoids a complex join and deletes only the necessary records in a much simpler and faster query.

. . .

🚨 Mistake 2: Using  materialized='table' Without Handling Data Types Properly

What happens: When you use materialized='table', DBT issues a CREATE OR REPLACE query each time, which drops and recreates the table. If the data types change between runs (e.g., numeric to string), you risk schema drift and broken joins.

Example:

First run:

				
					{{ config(materialized='table') }}

SELECT 1 AS id
				
			

Second run:

				
					{{ config(materialized='table') }}

SELECT 'a' AS id
				
			

Now, the id column flips from INTEGER to STRING. That inconsistency can break downstream dependencies or cause unexpected casting behavior, like truncating leading zeros in stringified numbers.

Fix it with explicit casting:

Its is always preferable to cast your columns explicitly, especially when dealing with source systems that aren’t strict about data types and materialized as table.

				
					{{ config(materialized='table') }}

SELECT CAST(1 AS VARCHAR(10)) AS id
				
			
. . .

🚨 Mistake 3: Conditional ref() Logic Can Break Lineage in the Manifest

What happens: When you write models that use ref() conditionally based on variables, DBT may not correctly register all dependencies in the manifest.json. This leads to incomplete lineage in the DBT docs or in your CI/CD workflows.

Example with an issue:

				
					{{ config(materialized='table') }}

{% if var('country') == 'IND' %}
    SELECT * FROM {{ ref('logic_for_ind') }}
{% elif var('country') == 'USA' %}
    SELECT * FROM {{ ref('logic_for_usa') }}
{% endif %}
				
			

When running:

				
					dbt compile - vars '{"country": "USA"}' - select gold_model
				
			

The manifest.json may miss the parent models (logic_for_indlogic_for_usa) entirely, because the code path wasn’t executed. This may lead to lineage issues speacilly when manifest is used to maintain lineages such as with Airflow using Astronomer cosmos package.

parent_map in manifest.json


Suggested fix: Use --depends_on at the top:

				
					-- depends_on: [{{ ref('logic_for_ind') }}, {{ ref('logic_for_usa') }}]

{{ config(materialized='table') }}

{% if var('country') == 'IND' %}
    SELECT * FROM {{ ref('logic_for_ind') }}
{% elif var('country') == 'USA' %}
    SELECT * FROM {{ ref('logic_for_usa') }}
{% endif %}
				
			

This ensures the compiler is aware of all possible dependencies, even if they aren’t used in a particular run.

parent_map in manifest.json after forcing dependency


DBT Reference: https://docs.getdbt.com/reference/dbt-jinja-functions/ref#forcing-dependencies

. . .

🔚 Wrapping Up

Small decisions in DBT can have big consequences — especially when you’re working at scale. Here’s a quick recap:

🔁 Mistake: Using DELETE+INSERT with a non-unique unique_key
💥 Impact: Triggers a many-to-many join during deletion, causing serious performance issues, especially with large datasets
✅ Solution: Switch to the append strategy and use a pre_hook to delete using a simplified IN clause for better performance and scalability

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

📦 Mistake: Using materialized='table' without consistent data types
💥 Impact: Causes schema drift due to CREATE OR REPLACE behavior, which can break joins or drop leading zeros in IDs
✅ Solution: Always explicitly cast your columns to the correct data types to ensure consistency across runs

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

🔀 Mistake: Conditional ref() usage without declaring dependencies
💥 Impact: Leads to incomplete lineage in manifest.json, breaking documentation and dependency tracking
✅ Solution: Use --depends_on to declare all potential references, even if they aren’t used in a specific run

Leave a Comment

Your email address will not be published. Required fields are marked *