DP-750 Practice Questions — Page 7

Question 61

You are a data engineer building a Lakeflow Spark Declarative Pipeline that lands a `silver.transactions` streaming table in Unity Catalog. The business defines three row-level data quality rules and wants the pipeline to **drop** any record that violates **any** rule, while still emitting per-rule pass/fail metrics to the pipeline event log:

- **Nullability:** `transaction_id` must never be null.

- **Cardinality / domain:** `status` must be exactly one of `'PENDING'`, `'SETTLED'`, or `'REVERSED'`.

- **Range:** `amount` must be greater than 0 and at most 50000.

You author the dataset with grouped expectations so a single decorator applies all three checks with one collective action:

```python
from pyspark import pipelines as dp

# Rule name -> SQL boolean condition that a valid row must satisfy.
rules = {
    "valid_id": "transaction_id IS NOT NULL",
    "valid_status": "status IN ('PENDING','SETTLED','REVERSED')",
    "valid_amount": "amount > 0 AND amount <= 50000",
}


@dp.table(name="silver_transactions")
@dp.expect_all_or_drop(rules)
def silver_transactions():
    # `spark` is the session that the pipeline runtime provides.
    return spark.readStream.table("bronze.transactions_raw")
```

In the **Data quality** tab, for each rule you must select the SINGLE check category that the rule's SQL condition primarily implements.

```mermaid
flowchart TD
    R1["Rule valid_id: transaction_id IS NOT NULL"] --> D1{{"Check category?"}}
    R2["Rule valid_status: status IN ('PENDING','SETTLED','REVERSED')"] --> D2{{"Check category?"}}
    R3["Rule valid_amount: amount > 0 AND amount <= 50000"] --> D3{{"Check category?"}}
    D1 -.options.-> O1["Nullability / Cardinality / Range"]
    D2 -.options.-> O2["Nullability / Cardinality / Range"]
    D3 -.options.-> O3["Nullability / Cardinality / Range"]
```
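
The drop-on-any-violation behavior with per-rule metrics described above can be sketched in plain Python, outside any pipeline. This is only an illustrative model, not how Lakeflow evaluates expectations: the predicates are hand-translated from the three SQL conditions, and `apply_rules` and the sample rows are hypothetical.

```python
from typing import Callable

Row = dict

# Hand-written Python equivalents of the three SQL conditions (assumption:
# SQL NULL is modeled as None, and a NULL comparison counts as a failure).
RULES: dict[str, Callable[[Row], bool]] = {
    "valid_id": lambda r: r.get("transaction_id") is not None,
    "valid_status": lambda r: r.get("status") in {"PENDING", "SETTLED", "REVERSED"},
    "valid_amount": lambda r: r.get("amount") is not None and 0 < r["amount"] <= 50000,
}


def apply_rules(rows: list[Row]) -> tuple[list[Row], dict[str, dict[str, int]]]:
    """Keep rows that pass every rule and count passes/failures per rule."""
    metrics = {name: {"passed": 0, "failed": 0} for name in RULES}
    kept = []
    for row in rows:
        results = {name: check(row) for name, check in RULES.items()}
        for name, ok in results.items():
            metrics[name]["passed" if ok else "failed"] += 1
        if all(results.values()):
            kept.append(row)
    return kept, metrics


sample = [
    {"transaction_id": "t1", "status": "SETTLED", "amount": 120.0},
    {"transaction_id": None, "status": "PENDING", "amount": 10.0},
    {"transaction_id": "t3", "status": "UNKNOWN", "amount": 60000.0},
]
kept, metrics = apply_rules(sample)
```

In this sample only the first row survives, while each rule still reports its own pass and fail counts, including for rows that were dropped because of a different rule.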

Question 62

A Unity Catalog Delta table `finance.ledger` already contains millions of rows. The governance team requires a **hard, enforced** data quality rule: the `posting_year` column (an `INT`) must only ever contain four-digit years between 1900 and 2100. Any future `INSERT` or `MERGE` that would write a value outside that range must fail the transaction at write time, and the existing rows must be validated when the rule is added.

You are choosing how to implement this rule directly on the table (not in a Lakeflow pipeline). You run:

```sql
ALTER TABLE finance.ledger
  ADD CONSTRAINT validPostingYear
  CHECK (posting_year >= 1900 AND posting_year <= 2100);
```

Which statement most accurately describes the behavior of this command on Azure Databricks?

A. The `CHECK` constraint is informational only and is not enforced; it merely helps the query optimizer, so out-of-range inserts still succeed.
B. The `CHECK` constraint is an enforced constraint; Databricks first verifies that all existing rows satisfy it before adding it, and afterward any write that violates it fails the transaction with an error.
C. The command fails because `CHECK` constraints can only be declared in the `CREATE TABLE` statement, never added with `ALTER TABLE`.
D. The constraint is enforced only for streaming writes via Auto Loader; batch `INSERT` statements bypass it.
E. Adding the constraint silently drops any existing rows that violate the condition, then enforces it for new writes.
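
The two requirements in the scenario (existing rows are validated when the rule is added, and later out-of-range writes fail as a whole) can be modeled with a small in-memory table. This is a toy sketch in Python, not Delta Lake code; the `Ledger` class, its methods, and the sample rows are hypothetical.

```python
class ConstraintViolation(Exception):
    """Raised when a row breaks a registered check."""


class Ledger:
    """Toy in-memory table with named, enforced row checks (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.checks = {}

    def add_check(self, name, predicate):
        # Validate existing rows first; refuse to register the check otherwise.
        bad = [r for r in self.rows if not predicate(r)]
        if bad:
            raise ConstraintViolation(f"{len(bad)} existing row(s) violate {name}")
        self.checks[name] = predicate

    def insert(self, new_rows):
        new_rows = list(new_rows)
        # All-or-nothing: check the whole batch before appending any row.
        for row in new_rows:
            for name, predicate in self.checks.items():
                if not predicate(row):
                    raise ConstraintViolation(f"row {row} violates {name}")
        self.rows.extend(new_rows)


ledger = Ledger([{"posting_year": 1999}, {"posting_year": 2024}])
ledger.add_check("validPostingYear", lambda r: 1900 <= r["posting_year"] <= 2100)
ledger.insert([{"posting_year": 2025}])

try:
    ledger.insert([{"posting_year": 2026}, {"posting_year": 3024}])
    rejected = False
except ConstraintViolation:
    rejected = True
```

The second insert is rejected as a unit, so the valid 2026 row in the same batch is not written either.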

Question 63

An upstream SaaS vendor frequently adds new optional columns to the JSON files it drops into a Unity Catalog volume. You ingest these files into the Delta table `bronze.events` with Auto Loader. Today, when a new column appears, your nightly ingestion run fails with a schema-mismatch error because Delta Lake enforces the schema on write and rejects columns that are not already in the target table.

The requirement is: **new (additive) columns from the source must be automatically appended to the `bronze.events` schema and their data persisted** — without manually altering the table and without losing data — while existing column types remain unchanged.

You have this Auto Loader write:

```python
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/events")
    .load("/Volumes/main/bronze/landing/events")
    .writeStream
    # <-- option goes here
    .option("checkpointLocation", "/Volumes/main/bronze/_ckpt/events")
    .toTable("bronze.events")
)
```

Which option should you add to the **write** so additive source columns are merged into the Delta target schema?

A. `.option("overwriteSchema", "true")`
B. `.option("mergeSchema", "true")`
C. `.option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")`
D. `.option("cloudFiles.inferColumnTypes", "true")`
E. `.option("ignoreChanges", "true")`
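
To make the requirement concrete, the sketch below models what "additive" means for a schema: new columns are appended, existing columns keep their types, and a type change is refused. It is plain Python over dictionaries rather than Delta Lake code, and `merge_additive` and the column names are hypothetical.

```python
def merge_additive(target: dict[str, str], incoming: dict[str, str]) -> dict[str, str]:
    """Return the target schema plus any columns that exist only in incoming."""
    merged = dict(target)  # existing columns keep their order and types
    for column, dtype in incoming.items():
        if column not in merged:
            merged[column] = dtype  # additive change: append the new column
        elif merged[column] != dtype:
            # A type change is not additive, so this sketch refuses it.
            raise ValueError(f"type change for {column}: {merged[column]} -> {dtype}")
    return merged


target_schema = {"event_id": "string", "ts": "timestamp"}
incoming_schema = {"event_id": "string", "ts": "timestamp", "campaign": "string"}
merged_schema = merge_additive(target_schema, incoming_schema)
```

Here `merged_schema` ends with the new `campaign` column, and `target_schema` itself is left unmodified.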

Question 64

You maintain a Lakeflow Spark Declarative Pipeline that produces the `silver.payments` streaming table for a regulated finance workload. The data steward gives you a strict rule: a payment record with a **null `account_id` is never acceptable** — if even one such record arrives, the pipeline update must **stop immediately and atomically roll back the table update** so no partial, dirty data is committed, forcing manual investigation of the upstream source before reprocessing.

Three candidate SQL expectation clauses are below:

```sql
-- Option 1
CONSTRAINT valid_account EXPECT (account_id IS NOT NULL)

-- Option 2
CONSTRAINT valid_account EXPECT (account_id IS NOT NULL) ON VIOLATION DROP ROW

-- Option 3
CONSTRAINT valid_account EXPECT (account_id IS NOT NULL) ON VIOLATION FAIL UPDATE
```

Which clause meets the steward's requirement?

A. Option 1 — `EXPECT (account_id IS NOT NULL)` (warn). It keeps invalid rows but flags them in metrics so the team can react.
B. Option 2 — `EXPECT ... ON VIOLATION DROP ROW`. It discards null-`account_id` rows so they never reach the target.
C. Option 3 — `EXPECT ... ON VIOLATION FAIL UPDATE`. It stops the update on the first invalid record and atomically rolls back the transaction.
D. None of these; expectations cannot halt a pipeline, so you must add a Delta `NOT NULL` constraint to the target table instead.
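
The three behaviors named in the options (keep and count, discard, or abandon the whole update) can be compared side by side in a small model. This is a hedged sketch in plain Python, not the pipeline engine: `run_update`, `UpdateFailed`, and the sample rows are hypothetical, and the "fail" branch checks the whole batch up front rather than stopping at the first bad record.

```python
class UpdateFailed(Exception):
    """Signals that the whole update must be abandoned."""


def run_update(target, batch, predicate, on_violation="warn"):
    """Apply one expectation to a batch and return (new_target, failed_count).

    on_violation: "warn" keeps invalid rows, "drop" discards them, and
    "fail" raises before anything is written, leaving target untouched.
    """
    invalid = [row for row in batch if not predicate(row)]
    if on_violation == "fail" and invalid:
        raise UpdateFailed(f"{len(invalid)} invalid record(s); nothing committed")
    if on_violation == "drop":
        batch = [row for row in batch if predicate(row)]
    # Build a new list so the caller's target is never partially modified.
    return target + batch, len(invalid)


def has_account(row):
    return row["account_id"] is not None


batch = [{"account_id": "a1"}, {"account_id": None}]
committed = [{"account_id": "a0"}]

warned, warn_failures = run_update(committed, batch, has_account, "warn")
dropped, drop_failures = run_update(committed, batch, has_account, "drop")
try:
    run_update(committed, batch, has_account, "fail")
    failed = False
except UpdateFailed:
    failed = True
```

With this batch, "warn" yields three rows, "drop" yields two, and "fail" raises while `committed` still holds only its original row.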

Question 65

A retail analytics team operates a Lakeflow Spark Declarative Pipeline (Python) that builds `silver.orders` from a bronze streaming source. Their data-quality SLA has three tiers, and they want each tier handled by the appropriate Lakeflow **expectation action**:

1. **Fatal tier** — `order_id` must never be null. If any null `order_id` arrives, the update must halt immediately and roll back so nothing dirty is committed.

2. **Cleanse tier** — Rows where `order_total` is negative are known-bad noise from a buggy source; they should be silently discarded before the table is written, but the pipeline must keep running.

3. **Observe tier** — Rows where `currency` is not `'USD'` are allowed through (downstream converts them), but the team wants the count of such rows tracked over time in the event log so they can watch the trend.

You must also reuse the **same dictionary of multiple checks with one collective fail action** for the fatal tier across several datasets.

Which expectation operators correctly satisfy these requirements? (Choose THREE.)

A. Use `@dp.expect_or_fail("valid_order_id", "order_id IS NOT NULL")` (or `@dp.expect_all_or_fail({...})` to group several fatal checks) for the fatal tier.
B. Use `@dp.expect_or_drop("non_negative_total", "order_total >= 0")` for the cleanse tier.
C. Use `@dp.expect("usd_only", "currency = 'USD'")` (warn) for the observe tier so rows are retained and pass/fail counts are logged.
D. Use `@dp.expect_or_fail("non_negative_total", "order_total >= 0")` for the cleanse tier.
E. Use `@dp.expect_or_drop("usd_only", "currency = 'USD'")` for the observe tier.
F. Add a Delta `CHECK (order_id IS NOT NULL)` enforced constraint instead of any expectation to satisfy the fatal tier.

Question 66

A retail data engineering team is implementing a lakehouse on Azure Databricks using the medallion architecture and Unity Catalog. Raw JSON order events land in an external volume and must flow through cleansing, deduplication, and enrichment before being aggregated for the executive sales dashboard.

You are designing the order of operations for a single end-to-end transformation. You must place each processing step in the correct sequence so that data quality improves incrementally from raw ingestion to business-ready aggregates, following Databricks' recommended medallion pattern.

Drag each processing step on the left to the correct position in the pipeline sequence on the right. Each step is used exactly once.

```mermaid
flowchart LR
    subgraph Tiles["Processing steps (tiles)"]
        T1["Ingest raw JSON with Auto Loader, append-only, minimal transformation"]
        T2["Pre-compute daily revenue aggregates as a materialized view"]
        T3["Deduplicate, enforce schema, and join to dimension tables"]
        T4["Serve curated data marts to the BI dashboard"]
    end
    subgraph Slots["Pipeline sequence (slots)"]
        S1["Slot 1 — Bronze"]
        S2["Slot 2 — Silver"]
        S3["Slot 3 — Gold"]
        S4["Slot 4 — Consumption"]
    end
    S1 --> S2 --> S3 --> S4
```
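
The scenario's flow (cleansing and deduplication, then enrichment, then aggregation) can be illustrated with a few plain Python functions. This is a toy sketch rather than Databricks code; the sample events, the `customers` lookup, and the function names are all hypothetical.

```python
from collections import defaultdict

raw_events = [  # hypothetical raw order events, including one duplicate
    {"order_id": 1, "customer_id": "c1", "day": "2024-05-01", "total": 40.0},
    {"order_id": 1, "customer_id": "c1", "day": "2024-05-01", "total": 40.0},
    {"order_id": 2, "customer_id": "c2", "day": "2024-05-01", "total": 60.0},
    {"order_id": 3, "customer_id": "c1", "day": "2024-05-02", "total": 25.0},
]
customers = {"c1": "EMEA", "c2": "APAC"}  # hypothetical dimension table


def cleanse(events):
    """Deduplicate on order_id, keeping the first occurrence."""
    seen, out = set(), []
    for e in events:
        if e["order_id"] not in seen:
            seen.add(e["order_id"])
            out.append(e)
    return out


def enrich(events, dim):
    """Join each order to its customer's region."""
    return [{**e, "region": dim[e["customer_id"]]} for e in events]


def daily_revenue(events):
    """Aggregate order totals per day."""
    totals = defaultdict(float)
    for e in events:
        totals[e["day"]] += e["total"]
    return dict(totals)


revenue = daily_revenue(enrich(cleanse(raw_events), customers))
naive_revenue = daily_revenue(raw_events)  # aggregating before deduplication
```

Aggregating the raw events directly double-counts order 1, so `naive_revenue` overstates the first day, which is why the cleansing step has to come before the aggregate.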

Question 67

A data engineering team must build an incremental ETL workload that ingests CDC events from a Kafka topic into bronze streaming tables, applies SCD Type 2 logic in silver, and refreshes a set of gold materialized views. The team wants the platform to automatically resolve the execution order of the datasets, retry transient failures starting at the most granular unit, and reduce the amount of hand-written Spark and Structured Streaming orchestration code.

They are deciding between two implementation approaches within Lakeflow:

- Author the logic across several Databricks notebooks and orchestrate them as **notebook tasks** in a Lakeflow Job, wiring up the task dependencies and retry logic manually.

- Author the logic as a **Lakeflow Spark Declarative Pipeline (SDP)** using streaming tables and materialized views.

Which approach best meets the requirements with the least custom orchestration code?

A. Use notebook tasks in a Lakeflow Job and manually define the task DAG and per-task retry policies.
B. Use a Lakeflow Spark Declarative Pipeline (SDP) so dataset dependencies and execution order are resolved automatically and transient failures are retried first at the Spark task, then at the flow, then at the pipeline level.
C. Use a single notebook task that runs all logic sequentially with a try/except block around each stage.
D. Use a Python script task that calls the Jobs REST API to chain notebooks together.
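
"Automatically resolving the execution order of the datasets" amounts to a topological sort over what each dataset reads. The sketch below shows that idea with Python's standard `graphlib`; it is not how any Lakeflow component is implemented, and the dataset names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dataset names; each maps to the datasets it reads from.
reads_from = {
    "bronze_cdc": set(),
    "silver_customers_scd2": {"bronze_cdc"},
    "gold_active_customers": {"silver_customers_scd2"},
    "gold_churn_by_month": {"silver_customers_scd2"},
}

# static_order() yields every dataset after all of its inputs.
order = list(TopologicalSorter(reads_from).static_order())
position = {name: i for i, name in enumerate(order)}
```

The order is derived purely from the declared inputs, so adding a dataset only requires stating what it reads, not where it sits in the sequence.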

Question 68

You are designing a multi-task Lakeflow Job with the following four tasks:

- `ingest` — a notebook task that loads raw files into a bronze table.

- `transform` — a notebook task that builds the silver table.

- `validate_quality` — a notebook task that runs data-quality checks and may fail when bad data is detected.

- `cleanup` — a notebook task that must always run to release temporary resources, regardless of whether the upstream tasks succeed, fail, or are skipped.

You configure `cleanup` to depend on `validate_quality`. Which **Run if** dependency condition must you set on the `cleanup` task so that it runs even when `validate_quality` fails or is canceled?

A. All succeeded
B. None failed
C. All done
D. At least one succeeded
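
As an analogy only, the "must always run" requirement on `cleanup` resembles a `finally` block in ordinary Python. The sketch below is not Lakeflow Jobs configuration; `run_job` and its stand-in steps are hypothetical.

```python
def run_job(validation_should_fail: bool) -> list[str]:
    """Run four stand-in steps and return the names of those that executed."""
    executed = []
    try:
        executed.append("ingest")
        executed.append("transform")
        if validation_should_fail:
            raise ValueError("bad data detected")
        executed.append("validate_quality")
    except ValueError:
        # The failure is recorded here; it does not prevent cleanup below.
        executed.append("validate_quality:failed")
    finally:
        # Runs on success and on failure alike.
        executed.append("cleanup")
    return executed


ok_run = run_job(validation_should_fail=False)
failed_run = run_job(validation_should_fail=True)
```

Both runs end with `cleanup`, which is the property the dependency condition on the real task has to guarantee.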

Question 69

A nightly Lakeflow Job ingests data from a partner REST API into a bronze Delta table. The API is occasionally unavailable for a few seconds, causing the `ingest` notebook task to fail with a transient connection error. Today the task fails roughly once a week, and on those mornings the on-call engineer manually re-runs the job.

A Structured Streaming workload in the same task also relies on automatic schema evolution, which, according to the Databricks documentation, assumes that the job is configured with retries so that the environment is reset and the stream can proceed.

You want the task to recover automatically from these transient errors without manual intervention, while keeping all other failure-handling behavior unchanged. What is the most appropriate change?

A. Wrap every cell of the notebook in a broad `try/except` that swallows all exceptions and returns success.
B. Add a task-level retry policy to the `ingest` task so it restarts up to a configured number of times on failure.
C. Convert the job trigger to continuous mode so the workload runs nonstop and never has to be re-run.
D. Increase the cluster size so the API call completes faster and never times out.
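
Recovering from a transient error "without manual intervention, while keeping all other failure-handling behavior unchanged" generally means a bounded retry that still surfaces every other failure. The sketch below shows that pattern in plain Python; it is not the Jobs retry feature, and `call_with_retries`, `TransientError`, and `flaky_api` are hypothetical names.

```python
import time


class TransientError(Exception):
    """Stand-in for a short-lived connection failure."""


def call_with_retries(fn, max_retries=3, delay_seconds=0.0):
    """Call fn, retrying only TransientError up to max_retries extra times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure
            time.sleep(delay_seconds)


calls = {"count": 0}


def flaky_api():
    """Fails twice, then succeeds (hypothetical partner API)."""
    calls["count"] += 1
    if calls["count"] <= 2:
        raise TransientError("connection reset")
    return {"rows": 42}


result = call_with_retries(flaky_api, max_retries=3)
```

Any exception other than `TransientError` propagates on the first attempt, so genuine bugs are not hidden the way a catch-all `try/except` would hide them.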

Question 70

You are building a Lakeflow Job composed of four notebook tasks that must honor these precedence constraints:

- `ingest_orders` and `ingest_customers` have no upstream dependencies and can run in parallel.

- `build_silver` must run only after **both** `ingest_orders` and `ingest_customers` have completed successfully.

- `publish_gold` must run only after `build_silver` succeeds.

You must wire the **Depends on** field of each task so the resulting Directed Acyclic Graph (DAG) enforces the precedence above and runs the two ingestion tasks in parallel.

Drag each task tile into the slot that specifies its correct **Depends on** configuration.

```mermaid
flowchart TD
    IO["ingest_orders"] --> BS["build_silver"]
    IC["ingest_customers"] --> BS
    BS --> PG["publish_gold"]
    subgraph Slots["Depends on configuration (slots)"]
        Sa["Slot A — Depends on: (none)"]
        Sb["Slot B — Depends on: (none)"]
        Sc["Slot C — Depends on: ingest_orders, ingest_customers (Run if: All succeeded)"]
        Sd["Slot D — Depends on: build_silver (Run if: All succeeded)"]
    end
```
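
One way to check that a set of **Depends on** values yields the intended parallelism is to group the tasks into waves that could start together. The sketch below does this with Python's standard `graphlib`; it is an illustration of the DAG in the question, not a Lakeflow Jobs API, and it assumes every task succeeds.

```python
from graphlib import TopologicalSorter

# Task name -> the tasks listed in its "Depends on" field.
depends_on = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "build_silver": {"ingest_orders", "ingest_customers"},
    "publish_gold": {"build_silver"},
}

sorter = TopologicalSorter(depends_on)
sorter.prepare()  # also raises CycleError if the graph is not acyclic

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)  # assume every task in the wave succeeded
```

The first wave contains both ingestion tasks, followed by `build_silver` and then `publish_gold`, each alone in its wave.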