DP-750 Certification Practice Question #35

Question

A retail company drops new sales export files into an Azure Data Lake Storage Gen2 container governed by a Unity Catalog external location. The upstream system writes **millions of small files per day**, the schema occasionally gains new columns, and you must guarantee that every file is processed **exactly once** without reprocessing files that were already ingested. The team wants the ingestion job to scale to billions of files over time and to evolve the schema without manual intervention.

Which extraction strategy and source configuration should you implement?

```python
# Candidate pattern under evaluation
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schema")
  .load("abfss://sales@acct.dfs.core.windows.net/incoming/"))
```

Accepted Answer

Auto Loader's `cloudFiles` source processes new files incrementally and idempotently. It persists the metadata of each discovered file in a RocksDB key-value store inside the stream's checkpoint location, which is what provides the exactly-once guarantee and lets the stream resume where it left off after a failure. The `cloudFiles.schemaLocation` option enables schema inference and evolution by giving Auto Loader a place to track the inferred schema across runs. Microsoft's documentation recommends `COPY INTO` when you expect thousands of files over time and Auto Loader once you expect millions or more, which is exactly this scenario. Full extraction (A, D) reprocesses data that was already ingested, so it violates the exactly-once and incremental requirements.
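The candidate pattern in the question shows only the read side, while the exactly-once guarantee depends on the checkpoint configured on the write side. The following is a minimal sketch of how the full pipeline might look, not a definitive implementation. The helper names (`auto_loader_options`, `start_sales_ingest`), the checkpoint path, and the target table are hypothetical and do not come from the question. Setting `cloudFiles.schemaEvolutionMode` to `addNewColumns` and `mergeSchema` to `true` on the sink is one common way to let new columns flow through, on the assumption that the target is a Delta table.

```python
def auto_loader_options(schema_location, file_format="json"):
    """Build the Auto Loader (cloudFiles) reader options as a plain dict."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader tracks the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Add newly seen columns to the schema. In this mode the stream stops
        # when it meets a new column and picks it up on restart.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }


def start_sales_ingest(spark, source_path, schema_location,
                       checkpoint_location, target_table):
    """Start an incremental ingest of new files only (hypothetical helper)."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(schema_location))
        .load(source_path)
    )
    return (
        stream.writeStream
        # File-discovery state lives here; reusing the same checkpoint is
        # what prevents already-ingested files from being reprocessed.
        .option("checkpointLocation", checkpoint_location)
        # Assumes a Delta sink: let new columns be added to the target table.
        .option("mergeSchema", "true")
        # Process everything available now, then stop (batch-style schedule).
        .trigger(availableNow=True)
        .toTable(target_table)
    )


# Example call on a Databricks cluster (checkpoint path and table name are
# hypothetical placeholders):
# query = start_sales_ingest(
#     spark,
#     "abfss://sales@acct.dfs.core.windows.net/incoming/",
#     "/Volumes/main/raw/_schema",
#     "/Volumes/main/raw/_checkpoints/sales",
#     "main.raw.sales",
# )
# query.awaitTermination()
```

Because `addNewColumns` stops the stream when a new column first appears, a sketch like this would normally run inside a job configured to restart automatically, so that schema changes need no manual intervention.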
