DP-750 Practice Questions — Page 5

Question 41

An auditor asks two separate things about a Unity Catalog managed Delta table `orders`:

1. "Show me the **entire table exactly as it was 3 days ago**, so I can compare a full snapshot to today."

2. "Going forward, give downstream ETL a way to read **only the row-level inserts, updates, and deletes** that occurred since the last run, without scanning the whole table."

You must satisfy requirement 1 using existing data and configure the table to satisfy requirement 2 from now on. Which combination should you implement?

```sql
-- Requirement 1: full snapshot of the table as of 3 days ago
SELECT * FROM orders <CLAUSE_1>;

-- Requirement 2: enable row-level change consumption going forward
ALTER TABLE orders <CLAUSE_2>;
```

A. Requirement 1: `TIMESTAMP AS OF date_sub(current_date(), 3)` (time travel); Requirement 2: `SET TBLPROPERTIES (delta.enableChangeDataFeed = true)`.
B. Requirement 1: enable change data feed and read `_change_type`; Requirement 2: `VERSION AS OF 0`.
C. Requirement 1: `RESTORE TABLE orders TO VERSION AS OF 0`; Requirement 2: run `VACUUM orders RETAIN 0 HOURS`.
D. Requirement 1: `DESCRIBE HISTORY orders`; Requirement 2: `OPTIMIZE orders ZORDER BY (order_id)`.

Question 42

You are tuning a large, fast-growing Unity Catalog Delta table. Queries frequently filter on a **high-cardinality** `customer_id` column, the query patterns change over time, and the table receives **frequent small `MERGE`/`UPDATE`/`DELETE`** operations from an upsert pipeline. You want to avoid the small-file problems of partitioning, avoid having to re-specify columns on every `OPTIMIZE`, and accelerate row-level modifications.

Which **two** actions best meet these goals? (Choose TWO.)

A. Enable **liquid clustering** with `CLUSTER BY (customer_id)`, which supports high-cardinality keys and lets you redefine clustering keys later without rewriting existing data.
B. Enable **deletion vectors** (`delta.enableDeletionVectors = true`) so `MERGE`, `UPDATE`, and `DELETE` mark rows as removed without rewriting whole Parquet files.
C. Use **Z-ordering** (`OPTIMIZE ... ZORDER BY (customer_id)`) and combine it on the same table with liquid clustering for additive benefits.
D. Use **Hive-style partitioning** on `customer_id` to physically isolate each customer's data.
E. Disable deletion vectors and rely on full-file rewrites to keep older non-DV readers compatible.

Question 43

Match each scenario on the left to the most appropriate table/layout approach on the right. Each tile is used exactly once.

```mermaid
flowchart LR
    subgraph Scenarios
        A["1. New gold table; want lowest cost,\nauto-optimization (predictive optimization,\nauto-compaction), latest UC features"]
        B["2. External Spark/Trino writers manage the\ndata files in cloud storage outside Databricks;\nDatabricks only governs access"]
        C["3. Large new fact table queried by changing,\nhigh-cardinality filters; want self-tuning\nlayout with no re-OPTIMIZE column re-spec"]
        D["4. Migrating non-Delta Parquet data already\nin storage during a Hive-to-UC upgrade\nwithout moving the files"]
    end
    subgraph Tiles
        T1["Unity Catalog managed Delta table"]
        T2["External table (data lifecycle\nmanaged outside Databricks)"]
        T3["Liquid clustering (CLUSTER BY)"]
        T4["External Parquet table\n(register in place, no move)"]
    end
```

Question 44

A Unity Catalog Delta table named `sales` is frequently queried with predicates on a **high-cardinality** `transaction_id` column, and analysts complain that these point-lookup queries scan too many files. The data engineer is **not** using partitioning or liquid clustering on the table.

**Proposed solution:** Run `OPTIMIZE sales ZORDER BY (transaction_id)` so that related values are colocated, file-level min/max statistics tighten, and the engine can skip more files for queries that filter on `transaction_id`.

Does this solution meet the goal of improving query performance for filters on the high-cardinality column?

A. Yes
B. No
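
The file-skipping mechanism named in the proposed solution can be illustrated with a small plain-Python model. This is only a sketch under stated assumptions: each "file" is a list of `transaction_id` values, colocating values on a single column is approximated by a plain sort, and the helper names (`files_scanned`, `chunk`) are hypothetical. It is not how Delta Lake implements Z-ordering or data skipping.

```python
import random

def files_scanned(files, target):
    """Count files whose [min, max] range could contain the target value."""
    return sum(1 for f in files if min(f) <= target <= max(f))

def chunk(values, size):
    """Split a list of values into fixed-size 'files'."""
    return [values[i:i + size] for i in range(0, len(values), size)]

random.seed(0)
ids = list(range(1000))          # hypothetical transaction_id values
shuffled = ids[:]
random.shuffle(shuffled)

# Arrival-order layout: every file holds values from the whole id range.
unclustered = chunk(shuffled, 100)
# Colocated layout (assumption: modeled as a sort on the filter column).
clustered = chunk(sorted(shuffled), 100)

scanned_unclustered = files_scanned(unclustered, 437)
scanned_clustered = files_scanned(clustered, 437)
print(scanned_unclustered, scanned_clustered)
```

In the colocated layout the per-file min/max ranges do not overlap, so a point lookup can rule out every file except the one whose range contains the value.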

Question 45

A retail analytics team must ingest data from a **Salesforce** SaaS application and a **SQL Server** transactional database into Unity Catalog managed tables. The requirements are:

- The Salesforce ingestion can run on a schedule (every 4 hours is acceptable), and the team wants Databricks to handle incremental reads, schema evolution, and SCD Type 2 history automatically.

- The SQL Server data must be ingested **continuously** using change data capture so that row-level inserts, updates, and deletes are reflected in near real time.

- Both pipelines must run on **serverless compute**, store credentials as Unity Catalog securable objects, and be governed end-to-end by Unity Catalog (including source lineage).

- The team does not want to write or maintain custom Spark code for either source.

Which approach satisfies all requirements with the least custom code?

A. Build two custom Structured Streaming notebooks that use the Salesforce REST API and the SQL Server JDBC driver, and schedule them as Lakeflow Jobs on classic all-purpose compute.
B. Use **Lakeflow Connect managed connectors**: a SaaS connector for Salesforce and a database CDC connector for SQL Server, each creating a managed ingestion pipeline governed by Unity Catalog on serverless compute.
C. Use `COPY INTO` for both sources, pointing at staging files exported nightly by an external ETL tool into a Unity Catalog volume.
D. Use Auto Loader (`cloudFiles`) to read Salesforce and SQL Server data directly, since Auto Loader supports incremental reads from any source.

Question 46

You are designing several one-time and recurring load operations into Unity Catalog. For each requirement, you must select the single most appropriate command from a dropdown. The available options for every row are: **COPY INTO**, **CREATE TABLE AS SELECT (CTAS)**, and **CREATE OR REPLACE TABLE AS SELECT (CRAS)**.

```mermaid
flowchart TD
    subgraph HOTSPOT["Select the command for each requirement"]
        R1["Row 1: Incrementally and idempotently load thousands of new JSON files that arrive over time into an existing Delta table; already-loaded files must be skipped on re-runs"] -->|dropdown| D1["[ COPY INTO | CTAS | CRAS ]"]
        R2["Row 2: Create a brand-new managed Delta table in one statement by querying an existing Hive metastore table (full migration, no incremental reload needed)"] -->|dropdown| D2["[ COPY INTO | CTAS | CRAS ]"]
        R3["Row 3: Fully overwrite an existing reporting table's schema and data each night from a SELECT, atomically replacing it while keeping the same table name and history"] -->|dropdown| D3["[ COPY INTO | CTAS | CRAS ]"]
    end
```

Question 47

A data engineer writes the following PySpark cell in a notebook to incrementally ingest JSON files that land in an Azure Data Lake Storage container into a Unity Catalog managed table. The container is registered as a Unity Catalog external location.

```python
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/orders")
    .load("abfss://[email protected]/orders/")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_chk/orders")
    .toTable("main.raw.orders")
)
```

The engineer runs the cell repeatedly during development. Each run reprocesses the **same** historical files and re-inserts duplicate rows. Which statement correctly explains Auto Loader's default behavior and the cause of the observed duplicates?

A. Auto Loader has no state, so it always reprocesses every file in the directory; you must add a `WHERE` filter on a timestamp column to deduplicate.
B. By default, Auto Loader processes each file exactly once by tracking discovered file paths in a RocksDB key-value store at the **checkpoint location**; the duplicates occur because each run used a *new* (temporary) checkpoint instead of the persistent `checkpointLocation`, or because the checkpoint was deleted between runs.
C. Auto Loader requires `cloudFiles.allowOverwrites` set to `true` to avoid duplicates; without it, every file is processed twice.
D. The `toTable()` sink does not support exactly-once semantics, so duplicates are expected with Auto Loader; switch to `COPY INTO` instead.
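
To reason about the scenario, it can help to model path-based file tracking in plain Python. The sketch below is a conceptual model only: a Python `set` stands in for persisted ingestion state, and `run_ingest` and the file names are hypothetical. It does not reproduce Auto Loader's actual implementation.

```python
def run_ingest(landed_files, seen_paths):
    """Process only files not yet recorded in seen_paths, then record them."""
    processed = []
    for path in landed_files:
        if path not in seen_paths:
            processed.append(path)
            seen_paths.add(path)
    return processed

landed = ["orders/1.json", "orders/2.json"]   # hypothetical landed files

# Same state object reused across runs: the second run finds nothing new.
persistent_state = set()
first_run = run_ingest(landed, persistent_state)
second_run = run_ingest(landed, persistent_state)

# Fresh state on a later run: every file looks new and is processed again.
fresh_state_run = run_ingest(landed, set())

print(first_run, second_run, fresh_state_run)
```

The model shows why the lifetime of the tracking state, rather than the file listing itself, decides whether a rerun ingests the same files again.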

Question 48

A bronze streaming table `customers_cdc_clean` contains change data capture records emitted by Debezium from a MySQL `customers` table. Each record has columns `id`, `operation` (`INSERT`, `UPDATE`, or `DELETE`), `operation_date`, and the customer attributes. You must materialize a `customers` streaming table in a Lakeflow Spark Declarative Pipeline that:

- Upserts inserts and updates keyed on `id`.

- Deletes a row from the target when `operation = "DELETE"`.

- Resolves out-of-order events using `operation_date`.

- Keeps only the current version of each record (no history).

Which SQL flow definition correctly implements these requirements?
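
The four requirements describe "keep only the latest version per key" semantics. As a hedged illustration, the plain-Python sketch below applies those rules to an in-memory list of events. The function `apply_cdc`, the `name` attribute, and the sample data are hypothetical, and ISO date strings are assumed for `operation_date` so that string comparison matches chronological order. It models the target-table outcome only, not the pipeline syntax the question asks for.

```python
def apply_cdc(events):
    """Apply CDC events keyed on id, keeping only the current row per key."""
    target = {}      # id -> current row (no history kept)
    last_seen = {}   # id -> latest operation_date applied, deletes included
    for event in events:
        key, seq = event["id"], event["operation_date"]
        # Skip events older than what was already applied for this key.
        if key in last_seen and seq <= last_seen[key]:
            continue
        last_seen[key] = seq
        if event["operation"] == "DELETE":
            target.pop(key, None)
        else:  # INSERT and UPDATE are both treated as an upsert
            target[key] = {k: v for k, v in event.items()
                           if k not in ("operation", "operation_date")}
    return target

events = [
    {"id": 1, "operation": "INSERT", "operation_date": "2024-01-01", "name": "Ann"},
    {"id": 1, "operation": "UPDATE", "operation_date": "2024-01-03", "name": "Anna"},
    {"id": 1, "operation": "UPDATE", "operation_date": "2024-01-02", "name": "Annie"},  # arrives late
    {"id": 2, "operation": "INSERT", "operation_date": "2024-01-01", "name": "Bob"},
    {"id": 2, "operation": "DELETE", "operation_date": "2024-01-04", "name": None},
]

customers = apply_cdc(events)
print(customers)
```

Tracking the last applied `operation_date` per key, including for deletes, is what stops a late-arriving update from overwriting newer data or resurrecting a deleted row.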

Question 49

You are authoring a notebook-based Spark Structured Streaming job that incrementally reads from a Delta source table, enriches each micro-batch, and writes the result to a Unity Catalog Delta table with exactly-once, fault-tolerant guarantees. Arrange the code building blocks into the correct execution order to form a single valid streaming query.

Drag each tile into the correct slot in the pipeline sequence.

```mermaid
flowchart LR
    S1["Slot 1"] --> S2["Slot 2"] --> S3["Slot 3"] --> S4["Slot 4"] --> S5["Slot 5"]
    subgraph TILES["Available tiles (unordered)"]
        T1["Tile: .writeStream"]
        T2["Tile: spark.readStream.table('main.bronze.events')"]
        T3["Tile: .option('checkpointLocation', '/Volumes/main/raw/_chk/events')"]
        T4["Tile: .withColumn('ingest_ts', current_timestamp())"]
        T5["Tile: .toTable('main.silver.events')"]
    end
```

Question 50

Your organization streams telemetry into an **Azure Event Hubs** namespace and you must ingest it into a Unity Catalog Delta table on Azure Databricks using Structured Streaming. The Event Hubs namespace exposes the Kafka-compatible endpoint on port `9093`, and you will authenticate with SASL/SSL using the namespace connection string. You write the following PySpark code:

```python
EH_NAMESPACE = "myns.servicebus.windows.net:9093"
CONN_STR = dbutils.secrets.get("eh", "connstr")
EH_SASL = (f'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
           f'required username="$ConnectionString" password="{CONN_STR}";')

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", EH_NAMESPACE)
      .option("subscribe", "telemetry")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config", EH_SASL)
      .option("startingOffsets", "latest")
      .load())
```

Which statement is correct about ingesting Azure Event Hubs data this way?

A. Azure Event Hubs cannot be read by the Spark Kafka connector; you must install the third-party Azure Event Hubs Spark connector JVM library and use `.format("eventhubs")`.
B. Because Event Hubs exposes a **Kafka-compatible endpoint**, the built-in Spark Structured Streaming **Kafka connector** can read it using `.format("kafka")` with `kafka.bootstrap.servers` set to the namespace `:9093`, the topic in `subscribe`, and SASL_SSL/PLAIN authentication; the returned `key`/`value` are binary and must be cast.
C. The `subscribe` and `security.protocol` options must be prefixed with `eventhubs.` instead of `kafka.` for Event Hubs to work.
D. Structured Streaming reads from Kafka/Event Hubs return rows already parsed into the source schema, so no casting of the `value` column is required.
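
The Spark Kafka source exposes `key` and `value` as binary columns, so a payload has to be decoded before its fields can be used. The plain-Python sketch below shows that decoding step on raw bytes. It assumes a UTF-8 JSON payload, which the question does not specify, and `decode_record` and the field names are hypothetical. In PySpark the first step would typically be a cast of `value` to a string followed by parsing.

```python
import json

def decode_record(key, value):
    """Decode a raw (key, value) byte pair into a string key and a dict payload."""
    decoded_key = key.decode("utf-8") if key is not None else None
    payload = json.loads(value.decode("utf-8"))  # assumption: UTF-8 JSON body
    return decoded_key, payload

# Hypothetical telemetry message as it would arrive from the broker: raw bytes.
raw_key = b"device-42"
raw_value = b'{"device_id": "device-42", "temp_c": 21.5}'

key, payload = decode_record(raw_key, raw_value)
print(key, payload["temp_c"])
```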