DP-750 Certification Practice Question #47

Question

A data engineer writes the following PySpark cell in a notebook to incrementally ingest JSON files that land in an Azure Data Lake Storage container into a Unity Catalog managed table. The container is registered as a Unity Catalog external location.

```python
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/orders")
  .load("abfss://landing@acct.dfs.core.windows.net/orders/")
  .writeStream
  .option("checkpointLocation", "/Volumes/main/raw/_chk/orders")
  .toTable("main.raw.orders")
)
```

The engineer runs the cell repeatedly during development. Each run reprocesses the **same** historical files and re-inserts duplicate rows. Which statement correctly explains Auto Loader's default behavior and the cause of the observed duplicates?

Accepted Answer

Auto Loader provides the `cloudFiles` Structured Streaming source and tracks which files it has already ingested by persisting their metadata in a RocksDB key-value store inside the checkpoint location. With the default `cloudFiles.allowOverwrites = false`, each file is processed exactly once. Duplicates therefore indicate the durable checkpoint state was not preserved across runs (a fresh/temporary checkpoint or a deleted checkpoint directory), which makes Auto Loader rediscover and reprocess the historical files. Setting `allowOverwrites=true` (option C) is the opposite of the fix.

More DP-750 practice questions