Question 47
A data engineer writes the following PySpark cell in a notebook to incrementally ingest JSON files that land in an Azure Data Lake Storage container into a Unity Catalog managed table. The container is registered as a Unity Catalog external location.

```python
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/orders")
    .load("abfss://[email protected]/orders/")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_chk/orders")
    .toTable("main.raw.orders")
)
```

The engineer runs the cell repeatedly during development. Each run reprocesses the **same** historical files and re-inserts duplicate rows. Which statement correctly explains Auto Loader's default behavior and the cause of the observed duplicates?
- A. Auto Loader has no state, so it always reprocesses every file in the directory; you must add a `WHERE` filter on a timestamp column to deduplicate.
- B. By default Auto Loader processes each file exactly once by tracking discovered file paths in a RocksDB key-value store at the **checkpoint location**; the duplicates occur because each run used a *new* (temporary) checkpoint instead of the persistent `checkpointLocation`, or because the checkpoint was deleted between runs.
- C. Auto Loader requires `cloudFiles.allowOverwrites` set to `true` to avoid duplicates; without it, every file is processed twice.
- D. The `toTable()` sink does not support exactly-once semantics, so duplicates are expected with Auto Loader; switch to `COPY INTO` instead.
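
When weighing the options, it can help to model how checkpoint-based file tracking behaves in general. The sketch below is a toy, pure-Python illustration and not Auto Loader's implementation: the `ingest` function, the `seen_files.json` state file, and the sample file names are all hypothetical, and a JSON file stands in for whatever store a real streaming engine keeps in its checkpoint. It only contrasts reusing one checkpoint directory with starting from an empty one on every run.

```python
import json
import tempfile
from pathlib import Path


def ingest(source_files, checkpoint_dir, sink):
    """Toy model of checkpointed file ingestion (NOT Auto Loader itself).

    Paths already recorded in the checkpoint are skipped; new paths are
    appended to the sink and then recorded in the checkpoint.
    """
    # Hypothetical state file; a stand-in for a real checkpoint store.
    state_file = Path(checkpoint_dir) / "seen_files.json"
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()

    for path in source_files:
        if path in seen:
            continue  # already ingested under this checkpoint
        sink.append(path)
        seen.add(path)

    state_file.write_text(json.dumps(sorted(seen)))


files = ["orders/001.json", "orders/002.json"]  # hypothetical file names

# Case 1: the same checkpoint directory is reused across runs.
sink_persistent = []
with tempfile.TemporaryDirectory() as chk:
    ingest(files, chk, sink_persistent)
    ingest(files, chk, sink_persistent)  # second run finds nothing new

# Case 2: every run starts with a fresh, empty checkpoint directory.
sink_fresh = []
for _ in range(2):
    with tempfile.TemporaryDirectory() as chk:
        ingest(files, chk, sink_fresh)  # no memory of the previous run

print(len(sink_persistent))  # 2
print(len(sink_fresh))       # 4
```

In this model, whether a rerun inserts the same files again depends entirely on whether the state from the previous run is still present at the checkpoint path.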