DP-750 Certification Practice Question #90

Question

A daily job joins a 3 TB `fact_orders` Delta table to a small 40 MB `dim_country` lookup table and then aggregates. In the Spark UI you find:

- The join stage reports significant **disk spill** and a very large **Shuffle Read** on both sides of a sort-merge join.
- A few post-shuffle partitions are far larger than the rest (the join key is skewed).
- Adaptive Query Execution (AQE) has been turned off in this workspace by a legacy `spark.databricks.optimizer.adaptive.enabled false` setting.

You must reduce shuffle and spill and balance the skewed partitions. Which **three** actions are appropriate? (Choose THREE.)

Accepted Answer

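Re-enabling AQE (A) restores runtime partition coalescing and automatic skew-join splitting, which makes it the recommended first move when skew and oversized shuffle partitions appear. Broadcasting the 40 MB lookup (B) replaces the sort-merge join with a broadcast hash join, so the 3 TB fact table is no longer shuffled and sorted just to be joined against `dim_country`. The lookup is above Spark's default 10 MB `spark.sql.autoBroadcastJoinThreshold`, which is likely why the planner did not broadcast it on its own, so an explicit hint or a higher threshold is needed. Repartitioning on the join key (C) is intended to rebalance the post-shuffle partitions and relieve the spill caused by the hot ones. Because hashing on the raw key still sends every row of a hot key to the same partition, it works best alongside AQE's skew handling or a salted key. The rejected options all reduce parallelism or hurt the engine: `coalesce(1)` and `spark.sql.shuffle.partitions=1` collapse everything into a single task (more spill, no parallelism), and disabling Photon removes vectorized acceleration without touching the shuffle problem.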
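As a rough illustration, actions A and B could look like the PySpark-style sketch below. It assumes an existing `spark` session and two DataFrames for the tables. The join column name `country_code` and both helper function names are hypothetical and do not come from the question. The first config key is the one named in the question, and the remaining three are the standard Spark AQE settings.

```python
# Sketch only. Assumptions: `spark` is an existing SparkSession, the two
# arguments to join_with_broadcast are DataFrames, and the join column
# name "country_code" is hypothetical.

AQE_SETTINGS = {
    # Undo the legacy workspace override named in the question.
    "spark.databricks.optimizer.adaptive.enabled": "true",
    # Standard Spark AQE switches, set explicitly for clarity.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def enable_aqe(spark):
    """Action A: turn AQE, partition coalescing, and skew-join handling back on."""
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)


def join_with_broadcast(fact_orders, dim_country, key="country_code"):
    """Action B: hint the small lookup for broadcast so the large side
    is not shuffled for the join."""
    return fact_orders.join(dim_country.hint("broadcast"), on=key, how="inner")
```

In a notebook this might be called as `enable_aqe(spark)` followed by `join_with_broadcast(spark.table("fact_orders"), spark.table("dim_country"))`. Calling `.explain()` on the result should then show a `BroadcastHashJoin` where the plan previously had a `SortMergeJoin`.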
