Question 35
A retail company drops new sales export files into an Azure Data Lake Storage Gen2 container governed by a Unity Catalog external location. The upstream system writes **millions of small files per day**, the schema occasionally gains new columns, and you must guarantee that every file is processed **exactly once** without reprocessing files that were already ingested. The team wants the ingestion job to scale to billions of files over time and to evolve the schema without manual intervention.

Which extraction strategy and source configuration should you implement?

```python
# Candidate pattern under evaluation
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schema")
    .load("abfss://[email protected]/incoming/"))
```

- A. Full extraction on a daily schedule using `spark.read.format("json")` over the entire directory, overwriting the target table each run.
- B. Incremental extraction with Auto Loader (`cloudFiles`) reading JSON, using `cloudFiles.schemaLocation` for schema inference and evolution and the RocksDB checkpoint for exactly-once tracking.
- C. Incremental extraction with `COPY INTO` from JSON, scheduled hourly, relying on its load history to skip processed files.
- D. Full extraction with `CONVERT TO DELTA` over the raw Parquet directory once per day.
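
The candidate snippet in the question shows only the read side of the stream. As a rough sketch of how such a pattern might be completed end to end, the code below adds a write side with a checkpoint location, which is where Auto Loader keeps its record of discovered files. The function names, the target table, and the checkpoint path are hypothetical and do not come from the question. The `addNewColumns` evolution mode and the `mergeSchema` write option are assumptions about how the team would want new columns handled.

```python
def auto_loader_options(schema_location):
    """Build the Auto Loader reader options as a plain dict (hypothetical helper)."""
    return {
        "cloudFiles.format": "json",
        # Where the inferred schema and its later versions are persisted.
        "cloudFiles.schemaLocation": schema_location,
        # Assumption: new columns should be added to the schema automatically.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }


def start_ingest(spark, source_path, target_table, checkpoint_path, schema_location):
    """Start an incremental Auto Loader stream (hypothetical helper).

    `spark` is the active SparkSession on a Databricks cluster; all paths and
    the table name are supplied by the caller.
    """
    reader = spark.readStream.format("cloudFiles")
    for key, value in auto_loader_options(schema_location).items():
        reader = reader.option(key, value)
    df = reader.load(source_path)

    return (
        df.writeStream
        # The checkpoint holds the stream's file-tracking state, so files
        # already ingested are skipped on the next run.
        .option("checkpointLocation", checkpoint_path)
        # Let the Delta target accept columns added by schema evolution.
        .option("mergeSchema", "true")
        # Process everything currently available, then stop (batch-style runs).
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

On a cluster this could be invoked as `start_ingest(spark, source_path, "main.raw.sales_exports", "/Volumes/main/raw/_checkpoints/sales", "/Volumes/main/raw/_schema")`, where the table name and checkpoint path are again placeholders. With `addNewColumns`, the stream is expected to stop when it first encounters a new column and pick up the evolved schema on restart, so the job would typically be configured to retry automatically.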