Feature pipelines are where most ML projects silently fail. Not at training time — at rerun time. A pipeline that works once but produces different results on a second run is dangerous because you won't know it happened until your model starts drifting.
This post documents the snapshot-based idempotency pattern I ended up using for production feature pipelines on Databricks. I'll describe the problem, the naive approach that doesn't work, and the version that does.
The problem: reruns that corrupt
Most feature engineering pipelines look like this:
- Read raw events from source tables
- Apply transformations (aggregations, joins, normalizations)
- Write the result to a feature table
Simple. But what happens on a rerun?
If you append to the feature table without checking what's already there, you get duplicates. If you overwrite the whole table, you destroy historical snapshots that a training job may be reading at that very moment. If you use MERGE INTO without careful key design, you get partial updates that look correct but aren't.
The real issue is that these pipelines are not idempotent: running them twice with the same input produces a different output than running them once.
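To make "idempotent" concrete, here is a minimal in-memory sketch, plain Python with a list of rows standing in for the feature table. The function names run_naive and run_idempotent are invented for this illustration, not part of any real pipeline:

```python
# Toy feature table: a list of (entity_id, snapshot_date, value) rows.

def run_naive(table, snapshot_date, rows):
    """Blind append: a rerun duplicates the snapshot."""
    table.extend((eid, snapshot_date, val) for eid, val in rows)

def run_idempotent(table, snapshot_date, rows):
    """Delete-then-write: a rerun converges to the same state."""
    table[:] = [r for r in table if r[1] != snapshot_date]  # delete snapshot
    table.extend((eid, snapshot_date, val) for eid, val in rows)

features = [("a", 1.0), ("b", 2.0)]

naive = []
run_naive(naive, "2026-04-01", features)
run_naive(naive, "2026-04-01", features)      # rerun
print(len(naive))                             # 4 rows: the snapshot is duplicated

safe = []
run_idempotent(safe, "2026-04-01", features)
run_idempotent(safe, "2026-04-01", features)  # rerun
print(len(safe))                              # 2 rows: identical to a single run
```

Running twice and running once leave `safe` in the same state; that equality is the definition the rest of this post builds on.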
The snapshot pattern
The fix is to make snapshot identity explicit. Every run of the pipeline corresponds to exactly one snapshot, identified by a snapshot_date (or run_id if you're working with non-time-based cuts). The contract becomes:
For a given snapshot_date, the pipeline produces exactly the same rows every time it runs.
This means:
- Delete before write: before writing features for a given snapshot_date, delete all existing rows with that key.
- Write atomically: use a single INSERT INTO (or df.write.mode("append") in PySpark) after the delete. Wrap both in a transaction or rely on Delta's built-in ACID guarantees.
- Never update in place: if you need to fix a historical snapshot, re-run the pipeline for that date, which triggers the same delete-then-write sequence.
In PySpark on Databricks:
```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

SNAPSHOT_DATE = "2026-04-01"
FEATURE_TABLE = "catalog.schema.features"

# 1. Delete any existing rows for this snapshot
delta_table = DeltaTable.forName(spark, FEATURE_TABLE)
delta_table.delete(f"snapshot_date = '{SNAPSHOT_DATE}'")

# 2. Compute features
features_df = (
    spark.table("catalog.schema.raw_events")
    .filter(f"event_date < '{SNAPSHOT_DATE}'")
    .groupBy("entity_id")
    .agg(...)
    .withColumn("snapshot_date", lit(SNAPSHOT_DATE))
)

# 3. Append (each Delta write is an atomic commit)
features_df.write.format("delta").mode("append").saveAsTable(FEATURE_TABLE)
```
Why not MERGE?
MERGE INTO (upsert) is tempting because it handles the "insert if not exists, update if exists" case in one operation. The problem is that it requires a key that uniquely identifies a row, and in feature tables the key is usually (entity_id, snapshot_date) — but if your pipeline has bugs that generate duplicate entity_id rows for a snapshot, MERGE will silently pick one and discard the other, or fail with a non-determinism error.
Delete-then-append is more explicit. Duplicates within a snapshot surface as errors immediately. The invariant is easy to test: COUNT(*) WHERE snapshot_date = X should always equal the number of entities in that snapshot.
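That invariant is cheap to assert against a materialized snapshot. A hypothetical check, where the helper name assert_unique_entities and the row shape are made up for this sketch, might look like:

```python
from collections import Counter

def assert_unique_entities(rows, snapshot_date):
    """Fail loudly if a snapshot contains duplicate entity_ids.

    `rows` is a list of (entity_id, snapshot_date) pairs. With
    delete-then-append, duplicates can only come from the current
    run's own output, so this catches the bug at write time.
    """
    counts = Counter(eid for eid, d in rows if d == snapshot_date)
    dupes = [eid for eid, n in counts.items() if n > 1]
    if dupes:
        raise ValueError(f"duplicate entity_ids in {snapshot_date}: {dupes}")
    return len(counts)  # entity count == row count for the snapshot

# A clean snapshot passes; a duplicated entity_id raises ValueError.
clean = [("a", "2026-04-01"), ("b", "2026-04-01"), ("a", "2026-03-01")]
assert_unique_entities(clean, "2026-04-01")  # returns 2
```

The same comparison is a one-line query in SQL: COUNT(*) versus COUNT(DISTINCT entity_id) for the snapshot.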
Partitioning by snapshot_date
For this pattern to be efficient, the feature table must be partitioned by snapshot_date. Without partitioning, the DELETE scans the entire table. With partitioning, it's a metadata-only operation on Delta Lake.
```python
features_df.write \
    .format("delta") \
    .mode("append") \
    .partitionBy("snapshot_date") \
    .saveAsTable(FEATURE_TABLE)
```
Specify the partitioning when the table is first created. If the table already exists without it, you'll need to rebuild it: there's no in-place repartition for Delta tables on most Databricks runtimes.
Making reruns safe for downstream training jobs
The last concern is concurrent reads. If a training job is reading snapshot_date = '2026-04-01' while you're running a delete-then-write for the same date, does it see a consistent view?
Yes — Delta Lake's MVCC (multi-version concurrency control) guarantees snapshot isolation. The training job reads from the version of the table at query start time. Your delete and rewrite create a new version; the training job's read is unaffected.
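A stripped-down model of what snapshot isolation buys you, in pure Python. VersionedTable is an invented stand-in for Delta's transaction log, not a real API; each commit here replaces the full table contents, mimicking a delete-then-write rerun:

```python
class VersionedTable:
    """Each commit appends a new immutable version; readers pin one."""

    def __init__(self):
        self.versions = [[]]               # version 0: empty table

    def commit(self, rows):
        self.versions.append(list(rows))   # new version; old ones untouched

    def latest(self):
        return len(self.versions) - 1

    def read(self, version):
        return self.versions[version]      # a version's rows never change

table = VersionedTable()
table.commit([("a", "2026-04-01", 1.0)])   # initial write -> version 1

# A training job starts reading: it pins the version current at query start.
pinned = table.latest()

# Meanwhile, a rerun deletes and rewrites the snapshot -> version 2.
table.commit([("a", "2026-04-01", 1.5)])

# The training job's view is unaffected: it still reads version 1.
print(table.read(pinned))                  # the original rows, not the rewrite
```

Delta's real mechanism is richer (per-file add/remove actions in a log), but the reader-side guarantee is the same: a query sees one consistent version, no matter what writers do meanwhile.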
This is the main reason to use Delta Lake over plain Parquet for feature tables: you get ACID transactions and time-travel for free, which makes the delete-then-write pattern safe.
Summary
- Make snapshot_date (or equivalent) an explicit, first-class column, not an implicit property of the run
- Partition by snapshot_date for efficient deletes
- Use delete-then-append instead of MERGE or blind overwrites
- Delta Lake's snapshot isolation makes this safe for concurrent readers
This pattern adds a small amount of write overhead (the delete step) but eliminates an entire class of correctness bugs. For production feature pipelines where training jobs rely on historical snapshots, that trade-off is non-negotiable.