Case Study

Cost-Efficient Data Engineering

Optimizing Snowflake Costs for High-Volume Airflow-DBT Pipelines
Author – Arpan Shah

Overview

A client was experiencing high Snowflake costs due to inefficiencies in their DBT jobs and warehouse configurations. With more than 500 daily DBT jobs running over approximately 13 hours, and up to 39 jobs executing in parallel, optimizing performance while reducing costs was a critical need.

Through a combination of warehouse right-sizing, query optimizations, and improved execution monitoring, we helped the client achieve 17% cost savings without compromising performance.

17% – Overall Cost Reduction
40% – Warehouse Cost Reduction
30% – Query Performance Improvement

Challenges Identified

High Snowflake Warehouse Costs

Warehouses were overprovisioned for the workload, leading to unnecessary expenditure.

Idle Compute Costs

Warehouses were not being suspended efficiently, leading to cost leakage.

Long-Running Queries

Inefficient SQL queries were increasing execution times and resource consumption.

Ineffective Model Design

Some DBT models were running as full refresh instead of incremental processing, leading to unnecessary recomputation.

Actions Taken to Reduce Costs

Warehouse Optimization

Optimizing warehouse configurations for better cost efficiency

  • Reduced Warehouse Size: Changed from Medium to Small for better cost efficiency.
  • Enabled Auto-Suspend: Configured warehouses to auto-suspend after inactivity, preventing idle compute costs.
  • Right-Sized Warehouses for Different Workloads: Optimized warehouse usage based on workload concurrency needs (see the sketch below).
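
A minimal sketch of these warehouse changes, assuming a warehouse named TRANSFORM_WH (the name, size, and auto-suspend threshold are illustrative):

ALTER WAREHOUSE TRANSFORM_WH SET
  WAREHOUSE_SIZE = 'SMALL'   -- right-sized down from MEDIUM
  AUTO_SUSPEND = 60          -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE;        -- resume automatically when new queries arrive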

Identifying Costly Queries

Tracking and optimizing long-running queries

  • Tracked Long-Running Queries: Used Snowflake’s account_usage.query_history view along with query tags on DBT models (see the tagging sketch below).
  • Implemented Query Profiling: Analyzed execution plans to find bottlenecks and inefficient joins.

Example Query to Identify Long-Running Queries:

SELECT
    query_id,
    warehouse_name,
    total_elapsed_time / 1000 AS runtime_sec,
    execution_status,
    user_name,
    start_time,
    end_time,
    bytes_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= 'YYYY-MM-DD HH:MM:SS'  -- DBT start time
  AND end_time   <= 'YYYY-MM-DD HH:MM:SS'  -- DBT end time
ORDER BY start_time;
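
To tie query_history rows back to individual DBT models, a query tag can be set in the model configuration (dbt-snowflake supports a query_tag model config) and then filtered on. A minimal sketch; the tag value is illustrative:

-- In the DBT model file: tag every query this model issues
{{ config(query_tag = 'dbt_daily_run') }}

-- In Snowflake: isolate the tagged DBT queries and rank them by runtime
SELECT query_id, query_tag, warehouse_name, total_elapsed_time / 1000 AS runtime_sec
FROM snowflake.account_usage.query_history
WHERE query_tag = 'dbt_daily_run'
ORDER BY runtime_sec DESC;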

DBT Model Optimization

Improving DBT model efficiency and performance

  • Converted Full Refresh Models to Incremental: Reduced unnecessary recomputation and improved efficiency (see the incremental model sketch after this list).
  • Optimized SQL Joins: Ensured joins were correctly partitioned and clustered where applicable; with proper joins and partitioning, processing time dropped significantly.
  • Split Large Models into Temporary Tables and Intermediate Models: Reduced memory consumption and improved execution times. One model depended on 19 tables, all referenced in a single query; we split the SQL into 5 intermediate DBT models and 1 final model that consumes them.
  • Used Materialized Views: For frequently accessed aggregated data; Snowflake serves materialized views as precomputed results, reducing query runtime.
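
A minimal sketch of the full-refresh-to-incremental conversion, assuming a hypothetical events model with an id key and an updated_at timestamp:

-- models/staging/stg_events.sql (hypothetical incremental DBT model)
{{ config(materialized = 'incremental', unique_key = 'id') }}

SELECT id, event_type, payload, updated_at
FROM {{ source('raw', 'events') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what is already in the target table
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

On the first run (or with --full-refresh) the model rebuilds in full; on every subsequent run only new or changed rows are processed.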

Cost Analysis and Performance Monitoring

Tracking and analyzing cost patterns

Identified Cost Patterns in DBT and Snowflake

  • Used run_results.json and the DBT Cloud API to analyze model execution times.
  • Determined cost distribution and peak usage times.
  • Differentiated between DBT inefficiencies and Snowflake configuration issues.

Comparing DBT execution time against Snowflake cost distinguished two scenarios: cost spread evenly across the day versus cost peaking during DBT runs.
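
A sketch of how this hourly cost distribution can be profiled, assuming access to the ACCOUNT_USAGE schema (the seven-day lookback is illustrative):

SELECT
    warehouse_name,
    DATE_TRUNC('hour', start_time) AS usage_hour,
    SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY usage_hour, warehouse_name;

Plotting credits per hour against the DBT run window makes it clear whether cost is spread across the day or concentrated during DBT runs.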
Additional Optimization Strategies Implemented

1. Warehouse Scaling & Configuration

  • Avoided Over-Provisioning: Ensured warehouses matched workload requirements.
  • Multi-Cluster Warehouses for High Concurrency: Used selectively for peak workloads.

2. Caching & Query Optimization

  • Leveraged Snowflake’s Result Cache: Minimized redundant computations.
  • Implemented Clustering Keys: To optimize large table scans (see the clustering sketch below).
  • Used Materialized Views: To improve performance of frequently queried data.

3. Concurrency & Load Management

  • Managed Query Concurrency: Reduced peak parallel DBT queries from 39 to 25, smoothing warehouse load.
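
For the clustering-key change, a minimal sketch on a hypothetical large fact table clustered by its date column (table and column names are illustrative):

-- Cluster a large, frequently range-scanned table on its date column
ALTER TABLE analytics.fct_orders CLUSTER BY (order_date);

-- Inspect how well micro-partitions are clustered on that key
SELECT SYSTEM$CLUSTERING_INFORMATION('analytics.fct_orders', '(order_date)');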

Results & Impact

By implementing these strategies, we were able to achieve the following improvements:

Reduced Warehouse Costs by 40%

By right-sizing warehouses from Medium to Small and matching compute to workload requirements.

Improved Query Performance by 30%

Through optimized SQL and incremental models.

Eliminated Idle Costs

By enforcing warehouse auto-suspend policies.

Reduction in DBT Total Runtime

Reduced from 13 hours to 9 hours by optimizing joins and clustering.

Enhanced Execution Monitoring

To proactively detect cost-intensive queries.

Query Concurrency Reduction

Reduced max parallel queries from 39 to 25.

These optimizations allowed the client to maintain their high-volume Airflow-DBT workflows efficiently while reducing their Snowflake expenses by 17%.

Conclusion

By leveraging query optimization, warehouse right-sizing, and execution monitoring, we successfully helped our client reduce daily Snowflake costs by 17% while maintaining performance. This case study demonstrates how small changes in DBT and Snowflake configurations can lead to substantial cost savings and efficiency gains.

For organizations looking to optimize their Snowflake costs, a structured cost analysis and optimization strategy is key to achieving long-term savings.