DP-750 Certification Practice Question #56

Question

A bronze ingestion job loads clickstream events into a DataFrame `df` with columns `user_id`, `session_id`, `country`, and `revenue`. Profiling reveals three quality problems you must fix before writing to the silver layer:

1. Some rows are exact duplicates across **all** columns and must be collapsed to a single row.
2. The `country` column has `NULL` values that should be replaced with the literal string `"UNKNOWN"`.
3. The `revenue` column has `NULL` values that should be replaced with `0`.

You want the most idiomatic PySpark expression that performs all three fixes in one chained transformation. Which code is correct?

```python
# Option to choose
result = (
    df
    .
    .
)
```

Accepted Answer

Exact duplicate rows across every column are removed with `dropDuplicates()` called with no `subset` argument. Nulls are replaced with a different value per column by passing a dictionary to `na.fill` (equivalently `fillna`), where keys are column names and values are the replacements, so `country` gets `"UNKNOWN"` and `revenue` gets `0`. The other options either delete the null-bearing rows (`na.drop`/`dropna`), deduplicate on a subset of columns instead of all of them, or attempt a single scalar fill, which cannot satisfy the mixed string/numeric requirement because a scalar fill only touches columns whose type matches the scalar.
