FreeExamDumps.in

DP-750 Practice Questions — Page 5

An auditor asks two separate things about a Unity Catalog managed Delta table `orders`:

1. "Show me the **entire table exactly as it was 3 days ago**, so I can compare a full snapshot to today."

2. "Going forward, give downstream ETL a way to read **only the row-level inserts, updates, and deletes** that occurred since the last run, without scanning the whole table."

You must satisfy requirement 1 using existing data and configure the table to satisfy requirement 2 from now on. Which combination should you implement?

```sql
-- Requirement 1: full snapshot of the table as of 3 days ago
SELECT * FROM orders <CLAUSE_1>;

-- Requirement 2: enable row-level change consumption going forward
ALTER TABLE orders <CLAUSE_2>;
```

  • A. Requirement 1: `TIMESTAMP AS OF date_sub(current_date(), 3)` (time travel); Requirement 2: `SET TBLPROPERTIES (delta.enableChangeDataFeed = true)`.
  • B. Requirement 1: enable change data feed and read `_change_type`; Requirement 2: `VERSION AS OF 0`.
  • C. Requirement 1: `RESTORE TABLE orders TO VERSION AS OF 0`; Requirement 2: run `VACUUM orders RETAIN 0 HOURS`.
  • D. Requirement 1: `DESCRIBE HISTORY orders`; Requirement 2: `OPTIMIZE orders ZORDER BY (order_id)`.
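
The difference between a full point-in-time snapshot and a row-level change stream can be sketched with a toy model in plain Python. This is only an illustration under stated assumptions: `ToyVersionedTable`, `snapshot_as_of`, and `changes_since` are hypothetical names, not Delta Lake APIs, and the toy records only `insert`, `update_postimage`, and `delete` change types (it omits update pre-images).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ToyVersionedTable:
    """Hypothetical in-memory stand-in for a versioned table (not a Delta API)."""
    commits: list = field(default_factory=list)  # (timestamp, snapshot, changes)

    def commit(self, ts, upserts=None, deletes=()):
        # Start from the latest snapshot and apply this commit's row changes.
        rows = dict(self.commits[-1][1]) if self.commits else {}
        changes = []
        for key, value in (upserts or {}).items():
            kind = "update_postimage" if key in rows else "insert"
            changes.append((kind, key, value))
            rows[key] = value
        for key in deletes:
            if key in rows:
                changes.append(("delete", key, rows.pop(key)))
        self.commits.append((ts, rows, changes))

    def snapshot_as_of(self, ts):
        """Time-travel style read: the whole table at the last commit at or before ts."""
        eligible = [snap for commit_ts, snap, _ in self.commits if commit_ts <= ts]
        if not eligible:
            raise ValueError("no version exists at or before the requested timestamp")
        return dict(eligible[-1])

    def changes_since(self, version):
        """Change-feed style read: only row changes from commits after `version`."""
        return [c for _, _, chg in self.commits[version + 1:] for c in chg]


now = datetime(2024, 1, 10)
t = ToyVersionedTable()
t.commit(now - timedelta(days=5), upserts={1: "new", 2: "new"})         # version 0
t.commit(now - timedelta(days=1), upserts={1: "shipped"}, deletes=[2])  # version 1

old = t.snapshot_as_of(now - timedelta(days=3))  # full table as it was 3 days ago
delta = t.changes_since(0)                       # only what changed after version 0
```

In the toy, `old` is a complete copy of the table at version 0, while `delta` holds just the two row-level changes made afterwards, which is the distinction the two requirements draw.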

You are tuning a large, fast-growing Unity Catalog Delta table. Queries frequently filter on a **high-cardinality** `customer_id` column, the query patterns change over time, and the table receives **frequent small `MERGE`/`UPDATE`/`DELETE`** operations from an upsert pipeline. You want to avoid the small-file problems of partitioning, avoid having to re-specify columns on every `OPTIMIZE`, and accelerate row-level modifications.

Which **two** actions best meet these goals? (Choose TWO.)

  • A. Enable **liquid clustering** with `CLUSTER BY (customer_id)`, which supports high-cardinality keys and lets you redefine clustering keys later without rewriting existing data.
  • B. Enable **deletion vectors** (`delta.enableDeletionVectors = true`) so `MERGE`, `UPDATE`, and `DELETE` mark rows as removed without rewriting whole Parquet files.
  • C. Use **Z-ordering** (`OPTIMIZE ... ZORDER BY (customer_id)`) and combine it on the same table with liquid clustering for additive benefits.
  • D. Use **Hive-style partitioning** on `customer_id` to physically isolate each customer's data.
  • E. Disable deletion vectors and rely on full-file rewrites to keep older non-DV readers compatible.
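
To make the "mark rows as removed without rewriting whole files" idea concrete, here is a minimal sketch in plain Python. It assumes a hypothetical `ToyFile` class and two hypothetical helper functions; none of this is the Delta Lake implementation, it only contrasts the amount of data rewritten by each strategy.

```python
class ToyFile:
    """Hypothetical immutable data file plus an optional deletion vector."""

    def __init__(self, rows):
        self.rows = tuple(rows)   # never modified after creation
        self.deleted = set()      # row positions marked as removed

    def live_rows(self):
        return [r for i, r in enumerate(self.rows) if i not in self.deleted]


def delete_with_rewrite(f, predicate):
    """Copy-on-write: build a whole new file without the matching rows."""
    kept = [r for r in f.rows if not predicate(r)]
    return ToyFile(kept), len(kept)   # second value: rows physically rewritten


def delete_with_vector(f, predicate):
    """Record the positions of matching rows; the data file itself is untouched."""
    f.deleted |= {i for i, r in enumerate(f.rows) if predicate(r)}
    return f, 0                       # nothing is rewritten


rows = [{"customer_id": i} for i in range(1000)]


def is_target(row):
    return row["customer_id"] == 42


rewritten_file, rewritten_count = delete_with_rewrite(ToyFile(rows), is_target)
dv_file, dv_count = delete_with_vector(ToyFile(rows), is_target)
```

Both strategies expose the same 999 live rows to a reader, but the rewrite path copies 999 rows to remove one, while the vector path only records a single position.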

Match each scenario on the left to the most appropriate table/layout approach on the right. Each tile is used exactly once.

```mermaid
flowchart LR
    subgraph Scenarios
        A["1. New gold table; want lowest cost,\nauto-optimization (predictive optimization,\nauto-compaction), latest UC features"]
        B["2. External Spark/Trino writers manage the\ndata files in cloud storage outside Databricks;\nDatabricks only governs access"]
        C["3. Large new fact table queried by changing,\nhigh-cardinality filters; want self-tuning\nlayout with no re-OPTIMIZE column re-spec"]
        D["4. Migrating non-Delta Parquet data already\nin storage during a Hive-to-UC upgrade\nwithout moving the files"]
    end
    subgraph Tiles
        T1["Unity Catalog managed Delta table"]
        T2["External table (data lifecycle\nmanaged outside Databricks)"]
        T3["Liquid clustering (CLUSTER BY)"]
        T4["External Parquet table\n(register in place, no move)"]
    end
```

A Unity Catalog Delta table `sales` is frequently queried with predicates on a **high-cardinality** `transaction_id` column, and analysts complain that these point-lookup queries scan too many files. The data engineer is **not** using partitioning or liquid clustering on the table.

**Proposed solution:** Run `OPTIMIZE sales ZORDER BY (transaction_id)` so that related values are colocated, file-level min/max statistics tighten, and the engine can skip more files for queries that filter on `transaction_id`.

Does this solution meet the goal of improving query performance for filters on the high-cardinality column?

  • A. Yes
  • B. No
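
The link between colocating values and skipping files through min/max statistics can be illustrated with a small simulation in plain Python. This is a sketch under stated assumptions: sorting on a single key stands in for colocation (real Z-ordering uses a space-filling curve and can cover several columns), and `write_files` and `files_to_scan` are hypothetical helpers, not engine APIs.

```python
import random


def write_files(values, rows_per_file):
    """Split values into fixed-size files and record per-file (min, max) stats."""
    files = [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]
    return [(min(f), max(f)) for f in files]


def files_to_scan(stats, key):
    """A file can be skipped only when key lies outside its [min, max] range."""
    return sum(1 for lo, hi in stats if lo <= key <= hi)


random.seed(0)
ids = list(range(10_000))
shuffled = ids[:]
random.shuffle(shuffled)

# Arrival order: every file holds a random mix, so ranges are wide and overlap.
unclustered = write_files(shuffled, 100)
# Colocated order: each file holds a narrow, non-overlapping range of ids.
clustered = write_files(sorted(shuffled), 100)

scan_unclustered = files_to_scan(unclustered, 4242)
scan_clustered = files_to_scan(clustered, 4242)
```

With colocated data, exactly one file's range contains the lookup key; with randomly distributed data, nearly every file's range does, so almost nothing can be skipped.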

A retail analytics team must ingest data from a **Salesforce** SaaS application and a **SQL Server** transactional database into Unity Catalog managed tables. The requirements are:

- The Salesforce ingestion can run on a schedule (every 4 hours is acceptable), and the team wants Databricks to handle incremental reads, schema evolution, and SCD Type 2 history automatically.

- The SQL Server data must be ingested **continuously** using change data capture so that row-level inserts, updates, and deletes are reflected in near real time.

- Both pipelines must run on **serverless compute**, store credentials as Unity Catalog securable objects, and be governed end-to-end by Unity Catalog (including source lineage).

- The team does not want to write or maintain custom Spark code for either source.

Which approach satisfies all requirements with the least custom code?

  • A. Build two custom Structured Streaming notebooks that use the Salesforce REST API and the SQL Server JDBC driver, and schedule them as Lakeflow Jobs on classic all-purpose compute.
  • B. Use **Lakeflow Connect managed connectors**: a SaaS connector for Salesforce and a database (CDC) connector for SQL Server, each creating a managed ingestion pipeline governed by Unity Catalog on serverless compute.
  • C. Use `COPY INTO` for both sources, pointing at staging files exported nightly by an external ETL tool into a Unity Catalog volume.
  • D. Use Auto Loader (`cloudFiles`) to read Salesforce and SQL Server data directly, since Auto Loader supports incremental reads from any source.

You are designing several one-time and recurring load operations into Unity Catalog. For each requirement, you must select the single most appropriate command from a dropdown. The available options for every row are: **COPY INTO**, **CREATE TABLE AS SELECT (CTAS)**, and **CREATE OR REPLACE TABLE AS SELECT (CRAS)**.

```mermaid
flowchart TD
    subgraph HOTSPOT["Select the command for each requirement"]
        R1["Row 1: Incrementally and idempotently load thousands of new<br/>JSON files that arrive over time into an existing Delta table;<br/>already-loaded files must be skipped on re-runs"] -->|dropdown| D1["[ COPY INTO | CTAS | CRAS ]"]
        R2["Row 2: Create a brand-new managed Delta table in one statement<br/>by querying an existing Hive metastore table (full migration,<br/>no incremental reload needed)"] -->|dropdown| D2["[ COPY INTO | CTAS | CRAS ]"]
        R3["Row 3: Fully overwrite an existing reporting table's schema<br/>and data each night from a SELECT, atomically replacing it<br/>while keeping the same table name and history"] -->|dropdown| D3["[ COPY INTO | CTAS | CRAS ]"]
    end
```

A data engineer writes the following PySpark cell in a notebook to incrementally ingest JSON files that land in an Azure Data Lake Storage container into a Unity Catalog managed table. The container is registered as a Unity Catalog external location.

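```python
# Incrementally ingest JSON files with Auto Loader into a Unity Catalog table.
# <container> and <storage-account-host> are placeholders; the source does not give them.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/orders")
    .load("abfss://<container>@<storage-account-host>/orders/")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_chk/orders")
    .toTable("main.raw.orders")
)
```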

The engineer runs the cell repeatedly during development. Each run reprocesses the **same** historical files and re-inserts duplicate rows. Which statement correctly explains Auto Loader's default behavior and the cause of the observed duplicates?

  • A. Auto Loader has no state, so it always reprocesses every file in the directory; you must add a `WHERE` filter on a timestamp column to deduplicate.
  • B. By default, Auto Loader processes each file exactly once by tracking discovered file paths in a RocksDB key-value store at the **checkpoint location**; the duplicates occur because each run used a *new* (temporary) checkpoint instead of the persistent `checkpointLocation`, or because the checkpoint was deleted between runs.
  • C. Auto Loader requires `cloudFiles.allowOverwrites` set to `true` to avoid duplicates; without it, every file is processed twice.
  • D. The `toTable()` sink does not support exactly-once semantics, so duplicates are expected with Auto Loader; switch to `COPY INTO` instead.
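
How persistent file-tracking state interacts with repeated runs can be sketched in plain Python. This is a toy model only: `run_ingest` is a hypothetical function, and a small JSON file stands in for whatever state store a real ingestion engine keeps at its checkpoint location.

```python
import json
import tempfile
from pathlib import Path


def run_ingest(source_files, checkpoint_dir, sink):
    """Process only files not yet recorded in the checkpoint, then record them."""
    state_path = Path(checkpoint_dir) / "seen_files.json"
    seen = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    new_files = [f for f in source_files if f not in seen]
    sink.extend(new_files)  # stand-in for writing the files' rows to a table
    state_path.write_text(json.dumps(sorted(seen | set(new_files))))
    return len(new_files)


files = ["orders/a.json", "orders/b.json"]

# Same checkpoint reused across runs: the second run finds nothing new.
sink_persistent = []
with tempfile.TemporaryDirectory() as chk:
    first = run_ingest(files, chk, sink_persistent)
    second = run_ingest(files, chk, sink_persistent)

# Fresh checkpoint on every run: every file looks new, so output is duplicated.
sink_fresh = []
for _ in range(2):
    with tempfile.TemporaryDirectory() as chk:
        run_ingest(files, chk, sink_fresh)
```

In the toy, reusing the checkpoint yields two ingested files in total, while starting from an empty checkpoint each time yields four, which mirrors the duplicate-row symptom described in the question.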

A bronze streaming table `customers_cdc_clean` contains change data capture records emitted by Debezium from a MySQL `customers` table. Each record has columns `id`, `operation` (`INSERT`, `UPDATE`, or `DELETE`), `operation_date`, and the customer attributes. You must materialize a `customers` streaming table in a Lakeflow Spark Declarative Pipeline that:

- Upserts inserts and updates keyed on `id`.

- Deletes a row from the target when `operation = "DELETE"`.

- Resolves out-of-order events using `operation_date`.

- Keeps only the current version of each record (no history).

Which SQL flow definition correctly implements these requirements?
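
The four requirements above (upsert by key, delete on a flag, order by a sequence column, keep only current state) can be sketched as plain Python to show the intended semantics. This is not pipeline SQL and not a Databricks API: `apply_changes` is a hypothetical function, the sample events are invented for illustration, and integers stand in for `operation_date` values.

```python
def apply_changes(events, key="id", sequence_by="operation_date"):
    """Toy 'current version only' apply: keep the latest state per key."""
    target = {}    # key -> current row
    last_seq = {}  # key -> highest sequence value applied so far (deletes included)
    for ev in events:
        k, seq = ev[key], ev[sequence_by]
        if k in last_seq and seq <= last_seq[k]:
            continue  # late or duplicate event: an equal or newer one was already applied
        last_seq[k] = seq
        if ev["operation"] == "DELETE":
            target.pop(k, None)
        else:  # INSERT and UPDATE are both treated as upserts
            target[k] = {c: v for c, v in ev.items()
                         if c not in ("operation", sequence_by)}
    return target


events = [
    {"id": 1, "operation": "INSERT", "operation_date": 1, "name": "Ann"},
    {"id": 2, "operation": "INSERT", "operation_date": 1, "name": "Bob"},
    {"id": 1, "operation": "UPDATE", "operation_date": 3, "name": "Anne"},
    {"id": 1, "operation": "UPDATE", "operation_date": 2, "name": "Anna"},   # arrives late
    {"id": 2, "operation": "DELETE", "operation_date": 4, "name": None},
    {"id": 2, "operation": "UPDATE", "operation_date": 3, "name": "Bobby"},  # arrives late
]
customers = apply_changes(events)
```

Remembering the last applied sequence value even for deleted keys is the design choice that stops a late-arriving update from resurrecting a row that a newer delete already removed.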

You are authoring a notebook-based Spark Structured Streaming job that incrementally reads from a Delta source table, enriches each micro-batch, and writes the result to a Unity Catalog Delta table with exactly-once, fault-tolerant guarantees. Arrange the code building blocks into the correct execution order to form a single valid streaming query.

Drag each tile into the correct slot in the pipeline sequence.

```mermaid
flowchart LR
    S1["Slot 1"] --> S2["Slot 2"] --> S3["Slot 3"] --> S4["Slot 4"] --> S5["Slot 5"]
    subgraph TILES["Available tiles (unordered)"]
        T1["Tile: .writeStream"]
        T2["Tile: spark.readStream.table('main.bronze.events')"]
        T3["Tile: .option('checkpointLocation', '/Volumes/main/raw/_chk/events')"]
        T4["Tile: .withColumn('ingest_ts', current_timestamp())"]
        T5["Tile: .toTable('main.silver.events')"]
    end
```

Your organization streams telemetry into an **Azure Event Hubs** namespace and you must ingest it into a Unity Catalog Delta table on Azure Databricks using Structured Streaming. The Event Hubs namespace exposes the Kafka-compatible endpoint on port `9093`, and you will authenticate with SASL/SSL using the namespace connection string. You write the following PySpark code:

```python
EH_NAMESPACE = "myns.servicebus.windows.net:9093"
CONN_STR = dbutils.secrets.get("eh", "connstr")
EH_SASL = (f'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
           f'required username="$ConnectionString" password="{CONN_STR}";')

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", EH_NAMESPACE)
    .option("subscribe", "telemetry")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", EH_SASL)
    .option("startingOffsets", "latest")
    .load())
```

Which statement is correct about ingesting Azure Event Hubs data this way?

  • A. Azure Event Hubs cannot be read by the Spark Kafka connector; you must install the third-party Azure Event Hubs Spark connector JVM library and use `.format("eventhubs")`.
  • B. Because Event Hubs exposes a **Kafka-compatible endpoint**, the built-in Spark Structured Streaming **Kafka connector** can read it using `.format("kafka")` with `kafka.bootstrap.servers` set to the namespace endpoint on port `9093`, the topic in `subscribe`, and SASL_SSL/PLAIN authentication; the returned `key`/`value` are binary and must be cast.
  • C. The `subscribe` and `security.protocol` options must be prefixed with `eventhubs.` instead of `kafka.` for Event Hubs to work.
  • D. Structured Streaming reads from Kafka/Event Hubs return rows already parsed into the source schema, so no casting of the `value` column is required.
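
The options above turn partly on whether the `key` and `value` columns arrive as raw bytes. A minimal plain-Python sketch of what decoding such a record involves is shown below; it assumes the payload is UTF-8 encoded JSON, and the function name `decode_record` and the sample telemetry fields are hypothetical. In Spark, the corresponding step is typically a cast of `value` to a string followed by `from_json`.

```python
import json


def decode_record(key, value):
    """Turn raw key/value bytes into a string key and a parsed JSON payload."""
    key_str = key.decode("utf-8") if key is not None else None
    payload = json.loads(value.decode("utf-8"))
    return key_str, payload


# Hypothetical raw record, shaped like the binary key/value pair of one message.
raw_key = b"device-17"
raw_value = b'{"device_id": "device-17", "temp_c": 21.5}'

decoded_key, telemetry = decode_record(raw_key, raw_value)
```

Until this decoding happens, the payload is an opaque byte string, so no field such as `temp_c` can be referenced by name.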