Backfill replays your pipeline logic across a date range (or another set of partition keys) that has already passed. Use it after fixing bugs, adding columns, or onboarding a new destination that needs history.

Concepts

A backfill is still a pipeline run, but the orchestration supplies bounded partitions instead of only “latest” state.
  • Source nodes read each slice (for example one day of events) according to parameters you pass.
  • Destination nodes must handle overwrite, merge, or append semantics you designed.
  • Downstream schedules may need pausing so backfill and incremental loads do not fight for locks.
Backfill does not magically change retention in object storage or warehouses; ensure upstream data still exists for the range you request.
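The bounded-partition idea above can be sketched as a generator that enumerates each daily slice the orchestration would hand to source nodes. This is a minimal illustration, not any specific tool's API; `date_partitions` is a hypothetical helper.

```python
from datetime import date, timedelta

def date_partitions(start: date, end: date):
    """Yield each daily partition key in the inclusive backfill range."""
    day = start
    while day <= end:
        yield day.isoformat()
        day += timedelta(days=1)

# Each slice becomes one bounded run handed to the source nodes.
slices = list(date_partitions(date(2024, 1, 1), date(2024, 1, 3)))
# slices == ['2024-01-01', '2024-01-02', '2024-01-03']
```

Before launching, check that every key this generator yields still has upstream data within your retention window.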

Configuring date ranges

1. Choose boundaries. Pick inclusive start and end partition values (often YYYY-MM-DD). Align them to how the source is partitioned.
2. Set run parameters. Map range tokens to the pipeline variables (start_date, end_date, hours, etc.) your nodes reference.
3. Select environment. Run backfills in staging first when volumes are large or logic recently changed.
4. Launch. Start from Orchestration → Backfill (or the pipeline action menu). Confirm the estimated cost if the UI surfaces projections.
Best for nightly batch warehouses partitioned by dt.
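Step 2 above amounts to building a small parameter mapping. The sketch below assumes a hypothetical `build_run_params` helper and variable names (`start_date`, `end_date`); your pipeline's node code decides which names actually matter.

```python
def build_run_params(start_date: str, end_date: str, extra=None):
    """Assemble the variables a backfill run passes to pipeline nodes.

    Nodes reference these names directly, e.g. a source query filtering
    WHERE dt BETWEEN :start_date AND :end_date.
    """
    params = {"start_date": start_date, "end_date": end_date}
    if extra:
        params.update(extra)  # e.g. target environment, hour granularity
    return params

run_params = build_run_params("2024-01-01", "2024-01-31", {"target_env": "staging"})
```

Keeping the mapping in one place makes it easy to rerun the same range against staging and then production with only `extra` changed.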

Incremental vs full strategies

  • Incremental backfill. When to use: reprocess only missing or corrected partitions. Risk: must trust watermark metadata; bugs can silently skip slices.
  • Full table rebuild. When to use: schema overhaul or corrupted dimension. Risk: highest load; requires a maintenance window.
  • Merge / upsert. When to use: idempotent writes keyed by business ID. Risk: depends on warehouse merge performance and locks.
For incremental models, add assertions (row count floors, null rate checks) per slice so a silent skip does not mark success.
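A per-slice assertion can be as simple as the check below. The thresholds and the `validate_slice` helper are illustrative assumptions; wire the equivalent into whatever test hook your pipeline runs after each slice.

```python
def validate_slice(rows, min_rows, max_null_rate, key):
    """Fail loudly when a slice breaches its row-count floor or null-rate ceiling."""
    if len(rows) < min_rows:
        raise ValueError(f"slice has {len(rows)} rows, floor is {min_rows}")
    null_rate = (sum(r.get(key) is None for r in rows) / len(rows)) if rows else 0.0
    if null_rate > max_null_rate:
        raise ValueError(f"null rate {null_rate:.1%} exceeds ceiling {max_null_rate:.1%}")

# A slice that silently skipped its source would fail the floor check
# instead of being marked successful.
validate_slice([{"id": 1}, {"id": 2}], min_rows=1, max_null_rate=0.5, key="id")
```

Raising (rather than logging) matters: the orchestrator should mark the slice failed so a resume can target exactly the bad partitions.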
Full backfills often pair with temporary tables and atomic swap patterns to keep production readers consistent mid-run.
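The temporary-table-plus-swap pattern looks roughly like this. SQLite is used here only so the sketch is self-contained; real warehouses have their own swap primitives (and differ in how transactional their DDL is), and all table names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit; we manage the swap transaction by hand

conn.execute("CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO fact_sales VALUES (1, 9.99)")

# 1. Rebuild full history into a staging table; readers keep using fact_sales.
conn.execute("CREATE TABLE fact_sales__rebuild (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO fact_sales__rebuild VALUES (?, ?)",
                 [(1, 10.0), (2, 5.5)])

# 2. Atomic swap: readers never observe a half-built table.
conn.execute("BEGIN")
conn.execute("ALTER TABLE fact_sales RENAME TO fact_sales__old")
conn.execute("ALTER TABLE fact_sales__rebuild RENAME TO fact_sales")
conn.execute("DROP TABLE fact_sales__old")
conn.execute("COMMIT")
```

The key property is that step 1 can run for hours without affecting readers, while step 2 is near-instant because it only touches metadata.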

Monitoring backfill progress

During execution, watch:
  • Completed vs remaining partitions in the run detail view
  • Per-slice duration trends (slowdown hints at skewed keys or hot partitions)
  • Warehouse slot usage and retry counts
Parallelism that works for nightly incremental loads may throttle sources during backfill. Cap concurrency to respect API quotas and DBA limits.
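Capping concurrency can be done with a bounded worker pool. The cap value and the `backfill_slice` stub below are placeholders; size the pool to your source's quotas, not to what the warehouse can absorb.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_SLICES = 4  # hypothetical cap; size to API quotas and DBA limits

def backfill_slice(partition: str) -> str:
    # Placeholder for one bounded pipeline run over a single partition.
    return f"done:{partition}"

partitions = [f"2024-01-{day:02d}" for day in range(1, 11)]
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_SLICES) as pool:
    # At most MAX_CONCURRENT_SLICES slices are in flight at any moment.
    results = list(pool.map(backfill_slice, partitions))
```

The same cap should apply whether slices run in-process or as separate orchestrated runs; what matters is the limit on simultaneous reads against the source.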
Cancel oversized jobs from the run page; document whether partial partitions were committed so you can resume safely.
Backfill upstream facts before dimensions when foreign keys must exist, or rely on DAG ordering in an external orchestrator.
Pause CDC consumers if they compete for the same destination table during full rebuilds.
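The DAG-ordering tip above is what an external orchestrator computes for you; the standard library can sketch it. Table names here are illustrative, and `graphlib` requires Python 3.9+.

```python
from graphlib import TopologicalSorter

# Each table lists the tables that must finish backfilling before it starts.
deps = {
    "dim_customer": {"raw_events"},
    "fact_orders": {"raw_events", "dim_customer"},
}
order = list(TopologicalSorter(deps).static_order())
# raw_events is guaranteed to appear before both downstream tables
```

Running slices table-by-table in this order ensures every foreign key a downstream table needs already exists for the partition being loaded.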

Schedules

Configure ongoing incremental loads once the backfill completes.

Run history

Inspect slice-level status and logs.