DP-750 Practice Questions — Page 6

Question 51

You are building a **Lakeflow Spark Declarative Pipeline** that ingests JSON files from a Unity Catalog volume into a bronze **streaming table** using Auto Loader. You author the following SQL:

```sql

CREATE OR REFRESH STREAMING TABLE bronze_orders

AS SELECT * FROM STREAM read_files(

'/Volumes/main/raw/landing/orders',

format => 'json',

inferColumnTypes => true

);

```

Which **three** statements are true about this declarative streaming-table ingestion pattern? (Choose THREE.)

A.The `STREAM` keyword and the `read_files` table-valued function together invoke Auto Loader (`cloudFiles`) functionality for incremental file ingestion.
B.When `read_files` is used inside a `CREATE OR REFRESH STREAMING TABLE` statement in a pipeline, Lakeflow automatically configures and manages the checkpoint and schema locations.
C.You must manually call `.option("checkpointLocation", ...)` and `.writeStream` for this streaming table, because pipelines do not manage triggers or checkpoints.
D.To load files with Auto Loader in a Unity Catalog-enabled pipeline, the source path must be governed by a Unity Catalog external location (or a volume).
E.Streaming tables are designed for append-only sources; new rows are inserted into the target with each pipeline update.
F.`read_files` can only be used in batch materialized views and is not supported for streaming tables.

Question 52

Open question ↗

You must stand up incremental file ingestion from an Azure Data Lake Storage container into a Unity Catalog managed table using Auto Loader, run as a triggered batch job. Order the configuration and authoring steps into the correct sequence to produce a governed, fault-tolerant ingestion job.

Drag each step into the correct order.

```mermaid

flowchart TD

A["Slot 1"] --> B["Slot 2"] --> C["Slot 3"] --> D["Slot 4"] --> E["Slot 5"]

subgraph TILES["Steps (unordered)"]

T1["Create a Unity Catalog external location over the ADLS container and grant READ FILES to the executing identity"]

T2["Define spark.readStream.format('cloudFiles') with cloudFiles.format and a persistent cloudFiles.schemaLocation"]

T3["Configure the writeStream with a unique checkpointLocation and .trigger(availableNow=True)"]

T4["Start the query with .toTable('main.bronze.orders')"]

T5["Schedule the notebook as a Lakeflow Job so the stream restarts and processes only newly arrived files"]

end

```

Question 53

Open question ↗

A data engineering team schedules the following SQL to run **every hour** on a Databricks SQL warehouse to load newly arrived JSON booking files from a Unity Catalog volume into an existing Delta table:

```sql

COPY INTO main.sales.bookings_target

FROM '/Volumes/main/sales/landing/bookings'

FILEFORMAT = JSON

FORMAT_OPTIONS ('mergeSchema' = 'true')

COPY_OPTIONS ('mergeSchema' = 'true');

```

**Proposed solution:** The team relies on `COPY INTO` alone — with no external bookkeeping table, no manual file-tracking, and no MERGE deduplication logic — to guarantee that re-running the scheduled statement does not reload files that were already ingested, so each source file's rows are inserted exactly once even across repeated runs.

Does this solution meet the goal?

A.Yes
B.No

Question 54

Open question ↗

You are a data engineer onboarding a new `bronze.iot.sensor_readings` Delta table into Unity Catalog. Before writing transformation logic, your lead asks you to produce, in a single notebook cell, a profile that returns **summary statistics for numeric, string, and date columns AND histograms of the value distributions for every column** so the team can spot skew, high null fractions, and high-cardinality keys at a glance.

You read the table into a DataFrame named `df`. You want the option that computes the full profile (including value-distribution histograms) over the DataFrame, not just count/mean/stddev/min/max.

```python

df = spark.read.table("bronze.iot.sensor_readings")

# ??? produce a full data profile with histograms

```

Which approach should you use?

A.`df.describe().show()`
B.`df.summary().show()`
C.`dbutils.data.summarize(df)`
D.`df.printSchema()`
E.`df.count()`

Question 55

Open question ↗

You are designing the schema for a `silver.finance.transactions` managed table in Unity Catalog. The business has these requirements for three columns:

- `amount` — monetary value that must store an **exact** base-10 value with no floating-point rounding error, up to 18 total digits with 2 decimal places.

- `event_ts` — the instant an event occurred, which **must be normalized and stored in UTC** and re-rendered in the reader's session local time zone.

- `customer_id` — a whole-number surrogate key that can exceed 2,147,483,647 (the 4-byte integer maximum).

You must pick the data types that satisfy each requirement with the smallest correct type. Which `CREATE TABLE` column definitions are correct?

```sql

CREATE TABLE silver.finance.transactions (

amount <TYPE_1>,

event_ts <TYPE_2>,

customer_id <TYPE_3>

);

```

A.`amount DOUBLE`, `event_ts TIMESTAMP_NTZ`, `customer_id INT`
B.`amount DECIMAL(18,2)`, `event_ts TIMESTAMP`, `customer_id BIGINT`
C.`amount FLOAT`, `event_ts DATE`, `customer_id BIGINT`
D.`amount DECIMAL(18,2)`, `event_ts TIMESTAMP_NTZ`, `customer_id INT`
E.`amount DECIMAL(2,18)`, `event_ts TIMESTAMP`, `customer_id BIGINT`

Question 56

Open question ↗

A bronze ingestion job loads clickstream events into a DataFrame `df` with columns `user_id`, `session_id`, `country`, and `revenue`. Profiling reveals three quality problems you must fix before writing to the silver layer:

1. Some rows are exact duplicates across **all** columns and must be collapsed to a single row.

2. The `country` column has `NULL` values that should be replaced with the literal string `"UNKNOWN"`.

3. The `revenue` column has `NULL` values that should be replaced with `0`.

You want the most idiomatic PySpark expression that performs all three fixes in one chained transformation. Which code is correct?

```python

# Option to choose

result = (

.<STEP_1>

.<STEP_2>

)

```

A.`df.dropDuplicates().na.fill({"country": "UNKNOWN", "revenue": 0})`
B.`df.distinct().na.drop(subset=["country", "revenue"])`
C.`df.dropDuplicates(["user_id"]).na.fill("UNKNOWN")`
D.`df.na.fill({"country": "UNKNOWN", "revenue": 0}).dropDuplicates(["session_id"])`
E.`df.dropna().fillna({"country": "UNKNOWN", "revenue": 0})`

Question 57

Open question ↗

You must build a silver aggregate that answers: **"For fulfilled orders only, how many orders exist per order priority, sorted from most to fewest orders?"** The source is a DataFrame `df_order` with columns `o_orderkey`, `o_orderstatus`, and `o_orderpriority`. A fulfilled order has `o_orderstatus == "F"`.

You want a single chained PySpark expression that filters to fulfilled orders, groups by priority, counts the orders, names the count column `n_orders`, and sorts descending by `n_orders`.

```python

from pyspark.sql.functions import col, count

df_chained = (

df_order

# filter -> group -> aggregate -> sort

)

```

Which expression produces the required result?

Question 58

Open question ↗

You are consolidating two Delta tables in Unity Catalog that have **identical schemas**: `silver.crm.customers_eu` and `silver.crm.customers_us`. For four separate downstream deliverables, you must choose the correct Spark SQL set operator or join for each requirement. All result sets must be **deduplicated** (DISTINCT) unless a join is required.

HOTSPOT — For each requirement, select the correct operation from its dropdown.

```mermaid

flowchart TD

subgraph Requirements

R1["R1: All customers from BOTH tables combined into one deduplicated list"]

R2["R2: Only customers that appear in BOTH tables (overlap)"]

R3["R3: Customers in customers_eu that are NOT in customers_us"]

R4["R4: Enrich orders with each order's customer attributes by matching customer_id"]

end

subgraph Dropdowns["Operation per row"]

D1["UNION / INTERSECT / EXCEPT / JOIN"]

D2["UNION / INTERSECT / EXCEPT / JOIN"]

D3["UNION / INTERSECT / EXCEPT / JOIN"]

D4["UNION / INTERSECT / EXCEPT / JOIN"]

end

R1 --> D1

R2 --> D2

R3 --> D3

R4 --> D4

```

Question 59

Open question ↗

You maintain a `silver.sales.monthly_revenue` table with one row per `(product, month, revenue)`. Analysts need a **wide** report with one row per `product` and a separate column per month (`Jan`, `Feb`, `Mar`) holding the summed revenue. Later, a downstream ML pipeline needs the wide table reshaped **back to long** format with one row per `(product, month, revenue)`.

```sql

-- Wide report (pivot)

SELECT * FROM silver.sales.monthly_revenue

PIVOT (

SUM(revenue) FOR month IN ('Jan', 'Feb', 'Mar')

);

```

Which TWO statements about these reshaping transformations are correct? (Choose TWO.)

A.The `PIVOT` clause rotates unique values of the `month` column into separate columns and requires an aggregate expression (such as `SUM(revenue)`) for the pivoted cells.
B.The `UNPIVOT` clause (or PySpark `unpivot`/`melt`) reverses a pivot by rotating the `Jan`/`Feb`/`Mar` columns back into rows, producing a name column and a value column.
C.`UNPIVOT` fully restores the original pre-pivot DataFrame including the aggregation, because unpivot reverses `SUM()`.
D.By default, `UNPIVOT` includes rows where the value column is `NULL`.
E.Denormalizing by joining a dimension table into the fact table is a valid way to widen a table for reporting, trading storage for fewer joins at query time.

Question 60

Open question ↗

Your team loads several Delta tables in Unity Catalog. For each load scenario below, you must choose the single most appropriate write strategy. The available tiles are Delta/Spark write operations.

DRAG-AND-DROP — Match each load scenario (slot) to the correct operation (tile). Each tile is used exactly once.

```mermaid

flowchart LR

subgraph Tiles["Operation Tiles"]

T1["MERGE INTO ... WHEN MATCHED UPDATE / WHEN NOT MATCHED INSERT"]

T2["MERGE INTO ... WHEN NOT MATCHED THEN INSERT (insert-only)"]

T3["INSERT INTO (append)"]

T4["INSERT OVERWRITE"]

end

subgraph Slots["Load Scenarios"]

S1["S1: CDC feed of customer changes — update existing rows by key, insert new ones (upsert)"]

S2["S2: Append-only event logs that may contain duplicate uniqueId values to skip"]

S3["S3: Bronze raw ingest — always add new immutable records, never update"]

S4["S4: Fully replace a small static lookup table on every run"]

end

S1 -.-> T1

S2 -.-> T2

S3 -.-> T3

S4 -.-> T4

```