Backfills in Data and Machine Learning: A Primer

sryza · on June 7, 2023

In the life of a data engineer, perhaps nothing inspires more dread or awe than the backfill. Flub a backfill, and you'll have bad data, angry stakeholders, and a fat cloud bill. Nail a backfill, and you're rewarded with months or years of pristine data (until the next backfill). Backfills are a central piece of the data engineering skillset, a fiery furnace where all the individual challenges of the profession melt together into one big challenge.

Backfills frequently go wrong. They accidentally target the wrong data. Or they fail, and it’s too difficult to pick up from the middle, so they get restarted from the beginning. Or they leave data in an inconsistent state: the records in related tables don’t match up. And the stakes are high: restarting a large backfill from the beginning can mean spending an extra $10k on cloud compute and living with out-of-date data for days.

In this post, we'll survey backfills: what they are, why we need them, what makes them difficult, and how to deal with that difficulty.

Backfills often go hand-in-hand with partitions, an approach to data management that can help make backfills dramatically simpler and more sane. This post will also delve into how to use partitions to avoid many of backfilling’s main pitfalls.