DP-203 Certification Practice Question #285

Question

You have an Azure subscription that contains an Azure Data Lake Storage Gen2 container named Container1 and an Azure Synapse Analytics workspace named Workspace1.

Workspace1 contains multiple Apache Spark jobs that reference a large dataset in Container1.

You need to optimize the run times of the jobs.

What should you do?

Accepted Answer

The optimization target is multiple Spark jobs repeatedly referencing a large shared dataset. The most generally effective step is to cache the dataset so Spark can reuse it from memory rather than repeatedly scanning the source in ADLS Gen2. The other options either conflict with the platform design or address narrower scenarios than the broad repeated-read pattern described.

More DP-203 practice questions