Six-hour data loads. Unsustainable compute costs. Development cycles measured in days instead of hours.
When you're working with massive data tables, standard approaches break down fast.
We recently tackled a challenge that many companies face: a source table with 12.5 billion rows that was too expensive to load daily and too slow to work with during development. The solution we developed transformed both the economics and the operational reality of working with massive datasets.
Why Standard dbt Pipelines Break at Billion-Row Scale
Standard data pipelines offer two options: load everything or load recent data. This binary choice creates real business problems when your data scales to billions of rows:
- High compute costs: Full data refreshes can consume thousands of dollars in warehouse credits for a single run.
- Slow development cycles: Schema changes or testing require full reloads, turning quick iterations into multi-hour waits.
- Risk of errors: Accidentally triggering a full refresh can overwhelm your warehouse and spike costs unexpectedly.
- Limited flexibility: You can't easily backfill historical data without manual date calculations and custom scripts.
For companies managing customer data at scale, especially in e-commerce, SaaS, or media industries, these limitations directly impact your ability to make data-driven decisions quickly.
A Three-Mode Incremental Pattern for Large dbt Models
Instead of the standard "all or nothing" approach, we designed a flexible pattern that adapts to different operational needs. Think of it as having three gears for your data pipeline, each optimized for specific situations.
Mode 1: Safe Full-Refresh Development With a One-Day Default
The challenge: During development, you need to test changes without accidentally loading billions of rows.
Our solution: Full refresh mode defaults to loading just one day of data unless you explicitly override it. This guardrail prevents costly accidents while still allowing controlled historical loads when needed.
Business impact: Developers can iterate freely without fear of triggering expensive warehouse operations. What used to be risky now becomes routine.
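In dbt, this guardrail can be expressed as a filter that only applies on full refreshes. A minimal sketch, assuming a hypothetical `event_date` column and a `full_refresh_start_date` variable (these names are illustrative, not from the actual implementation):

```sql
-- Inside the model body, after the select/from clauses
{% if not is_incremental() %}
  -- Full-refresh guardrail: load only one day by default.
  -- Override for a controlled historical load with:
  --   dbt run --full-refresh --vars '{"full_refresh_start_date": "2024-01-01"}'
  where event_date >= coalesce(
      try_to_date(nullif('{{ var("full_refresh_start_date", "") }}', '')),
      dateadd('day', -1, current_date)
  )
{% endif %}
```

Because the default is a single day, an accidental `--full-refresh` rebuilds the table structure but scans only a sliver of the source, not all 12.5 billion rows.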
Mode 2: Incremental Daily Loads With a Three-Day Lookback Window
The challenge: Daily operations need to capture new data plus any late-arriving records, but not at the cost of reprocessing everything.
Our solution: Incremental mode loads recent data with a configurable lookback window. The default three-day window catches late arrivals while keeping daily runs lean and cost-effective.
Business impact: Daily runs complete in 30 minutes instead of 6+ hours, transforming what was unsustainable into routine operations while maintaining data completeness.
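The lookback window is a one-line filter in the incremental branch of the model. A sketch, again with illustrative column and variable names, assuming a `unique_key` is configured so reloaded rows overwrite rather than duplicate:

```sql
{% if is_incremental() %}
  -- Daily incremental: reload only the last N days to catch late-arriving
  -- records. Defaults to 3; tune per run with --vars '{"lookback_days": 7}'.
  where event_date >= dateadd('day', -{{ var('lookback_days', 3) }}, current_date)
{% endif %}
```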
Mode 3: Progressive Backfill Mode for Historical Data
The challenge: You need to gradually build up historical data without overwhelming your warehouse or manually calculating date ranges.
Our solution: Backfill mode automatically finds the earliest data in your table and loads the next chunk going backward in time. Run the same command repeatedly to progressively build history at your own pace.
Business impact: Complete control over compute costs while building historical data. No complex scripts, no manual date math, no risk of overloading your warehouse.
How Progressive Backfilling Works in dbt Incremental Models
The most innovative aspect of this approach is how it handles historical data loading. Instead of choosing between "load everything" (expensive and risky) or complex date calculations (error-prone and manual), we built intelligence into the pipeline itself.
The system queries your existing data to understand what you already have, then automatically determines the next chunk to load. Run it once, and it loads three months of history. Run it again, and it loads the next three months before that. Keep running until you have all the history you need.
This progressive approach means you can:
- Size chunks appropriately for your warehouse capacity
- Load history during low-usage periods to manage costs
- Stop and resume historical loading without losing your place
- Maintain full visibility into what's being loaded and when
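The "query your existing data, then load the next chunk" step can be sketched with dbt's built-in `run_query`. This is a simplified illustration under assumed names (`event_date`, a `backfill` flag, a `chunk_months` variable), not the production implementation:

```sql
{% if var('backfill', false) and is_incremental() and execute %}
  {# Find the oldest date already loaded, then target the chunk before it. #}
  {% set min_loaded = run_query(
      "select min(event_date) from " ~ this
  ).columns[0].values()[0] %}
  -- Load the next chunk going backward in time (default: 3 months)
  where event_date < '{{ min_loaded }}'
    and event_date >= dateadd('month', -{{ var('chunk_months', 3) }},
                              to_date('{{ min_loaded }}'))
{% endif %}
```

Each `dbt run --vars '{"backfill": true}'` then loads the next chunk further back in history, which is why the process can be stopped and resumed without losing its place: the table itself records how far the backfill has progressed.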
Results: 6-Hour Refreshes Reduced to 30-Minute Incremental Runs
The impact of this approach extended beyond just technical improvements:
- Dramatic speed improvement: What took 6+ hours for full refreshes now completes in 30-minute incremental runs, enabling same-day analytics instead of next-day reporting.
- Sustainable operations: An unsustainable daily workflow became manageable, with compute costs kept in check through precise control over data loading.
- Development velocity: Schema changes and testing that previously took hours now complete in minutes, accelerating feature development and iteration cycles.
- Operational confidence: Development teams work freely without fear of accidentally triggering expensive operations, and operations teams have precise control over when and how historical data loads.
- Flexibility: Different scenarios—daily operations, historical backfills, data recovery—each have optimized approaches instead of forcing everything through the same pipeline.
When to Use a Three-Mode Incremental Pattern for dbt Models
This pattern delivers the most value for companies experiencing specific challenges:
- Large data volumes: If your source tables measure in hundreds of millions or billions of rows, standard approaches become impractical and expensive.
- Cost-sensitive environments: When warehouse compute costs are a significant line item, the roughly 90% reduction in daily run time (6+ hours down to 30 minutes) makes a real difference to your bottom line.
- Frequent schema changes: If your data models are still evolving, you need the ability to iterate quickly without expensive full refreshes.
- Historical data needs: When you need to gradually build up years of historical data without overwhelming your infrastructure.
- Growth-stage scaling: As you transition from millions to billions of rows, you need patterns that scale gracefully without requiring complete rebuilds.
Why Operational Flexibility Matters as Much as Query Performance
This approach represents a broader principle in data architecture: operational flexibility matters as much as technical performance. Having multiple operational modes within a single model gives teams the control they need to balance speed, cost, and risk based on the situation.
Too often, data pipelines are built with only one scenario in mind—typically daily operations. But real-world data teams face diverse needs: development and testing, historical backfills, data recovery, cost optimization, and more. Building flexibility into your architecture from the start pays dividends across all these scenarios.
Why Scale-Aware Pipeline Patterns Are Essential as Data Volumes Grow
As data volumes continue to grow, patterns like this become increasingly essential. The gap between "it works with small data" and "it works at scale" is widening, and companies that don't adapt their approaches will face escalating costs and slowing velocity.
We've encoded patterns like this into our methodology and tooling, allowing us to deliver production-ready architectures that handle scale gracefully from day one. What used to require weeks of custom development now comes as proven, battle-tested patterns ready to deploy.
The future of data engineering isn't just about faster queries or bigger warehouses—it's about smarter operational patterns that give teams control over the tradeoffs between speed, cost, and flexibility.
Implementation Notes: dbt Incremental Materialization on Snowflake
This three-mode incremental pattern is built using dbt and Snowflake, though the core concepts apply to any modern data warehouse. The pattern uses dbt's incremental materialization with custom logic that adapts behavior based on operational context.
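Putting the three modes together, a condensed model skeleton might look like the following. All table, column, and variable names here are illustrative assumptions; the production implementation is tuned per client and is not reproduced in this post:

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='sync_all_columns'
) }}

select
    event_id,
    event_date,
    payload
from {{ source('raw', 'events') }}

{% if not is_incremental() %}
  -- Mode 1: full refresh, guarded to one day unless explicitly overridden
  where event_date >= coalesce(
      try_to_date(nullif('{{ var("full_refresh_start_date", "") }}', '')),
      dateadd('day', -1, current_date)
  )
{% elif var('backfill', false) and execute %}
  -- Mode 3: load the next historical chunk behind the oldest loaded date
  {% set min_loaded = run_query("select min(event_date) from " ~ this)
       .columns[0].values()[0] %}
  where event_date < '{{ min_loaded }}'
    and event_date >= dateadd('month', -{{ var('chunk_months', 3) }},
                              to_date('{{ min_loaded }}'))
{% else %}
  -- Mode 2: daily incremental with a late-arrival lookback window
  where event_date >= dateadd('day', -{{ var('lookback_days', 3) }}, current_date)
{% endif %}
```

The key design choice is that a single model owns all three behaviors, selected by run context (`--full-refresh`) and variables (`--vars`), so there is one source of truth for the transformation logic rather than parallel pipelines per scenario.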
We've successfully deployed variations of this pattern across multiple clients handling billions of rows in e-commerce, SaaS, and media industries. Each implementation is tuned to specific data characteristics and business requirements, but the core pattern remains consistent.
About Mammoth Growth
Mammoth Growth specializes in building production-ready data architectures for growth-stage companies. We combine deep expertise in modern data tools like Snowflake and dbt with proven methodologies refined across 900+ successful projects. Our approach delivers enterprise-quality data solutions at unprecedented speed, helping companies transform fragmented data into reliable insights that drive business growth.
