Question 56
A bronze ingestion job loads clickstream events into a DataFrame `df` with columns `user_id`, `session_id`, `country`, and `revenue`. Profiling reveals three quality problems you must fix before writing to the silver layer:

1. Some rows are exact duplicates across **all** columns and must be collapsed to a single row.
2. The `country` column has `NULL` values that should be replaced with the literal string `"UNKNOWN"`.
3. The `revenue` column has `NULL` values that should be replaced with `0`.

You want the most idiomatic PySpark expression that performs all three fixes in one chained transformation. Which code is correct?

```python
# Option to choose
result = (
    df
    .<STEP_1>
    .<STEP_2>
)
```

- A `df.dropDuplicates().na.fill({"country": "UNKNOWN", "revenue": 0})`
- B `df.distinct().na.drop(subset=["country", "revenue"])`
- C `df.dropDuplicates(["user_id"]).na.fill("UNKNOWN")`
- D `df.na.fill({"country": "UNKNOWN", "revenue": 0}).dropDuplicates(["session_id"])`
- E `df.dropna().fillna({"country": "UNKNOWN", "revenue": 0})`
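
To reason about the three requirements without a Spark cluster, they can be modeled in plain Python on a handful of sample rows. This is only a sketch: `drop_exact_duplicates` and `fill_nulls` are hypothetical helpers that imitate deduplicating on all columns and filling NULLs per column, the sample data is invented for illustration, and `None` stands in for `NULL`.

```python
# Plain-Python model of the three fixes. Not PySpark: the helper names and
# the sample rows below are assumptions made for illustration only.

COLUMNS = ("user_id", "session_id", "country", "revenue")


def drop_exact_duplicates(rows):
    """Collapse rows that match on every column, keeping the first one seen."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[col] for col in COLUMNS)  # compare all columns
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_nulls(rows, defaults):
    """Replace None with a per-column default; other columns are untouched."""
    return [
        {
            col: defaults[col] if value is None and col in defaults else value
            for col, value in row.items()
        }
        for row in rows
    ]


events = [
    {"user_id": 1, "session_id": "s1", "country": "IN", "revenue": 10.0},
    {"user_id": 1, "session_id": "s1", "country": "IN", "revenue": 10.0},  # exact duplicate
    {"user_id": 1, "session_id": "s2", "country": None, "revenue": 5.0},   # NULL country
    {"user_id": 2, "session_id": "s3", "country": "US", "revenue": None},  # NULL revenue
]

cleaned = fill_nulls(
    drop_exact_duplicates(events),
    {"country": "UNKNOWN", "revenue": 0},
)

for row in cleaned:
    print(row)
```

On this sample, deduplicating on `user_id` alone would also discard the `s2` row, because it shares `user_id` 1 with the `s1` row, and dropping rows that contain NULLs would discard the last two rows instead of repairing them. Both outcomes violate the stated requirements.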