DP-750 Practice Questions — Page 4

Question 31

A downstream Power BI report that consumes `analytics.gold.revenue_summary` started showing incorrect totals after a change to an upstream table. Before deleting a suspect column in `analytics.silver.transactions`, a data engineer must perform an impact analysis to identify every downstream table, job, notebook, and dashboard that depends on that column, and trace the upstream sources that feed `revenue_summary`. All queries ran on Azure Databricks compute attached to the Unity Catalog metastore.

Which approach correctly uses Unity Catalog data lineage to satisfy this requirement?

A.Lineage must be enabled and captured manually for each table; the engineer has to run `ANALYZE TABLE ... COMPUTE LINEAGE` first, because Unity Catalog does not capture lineage automatically.
B.In Catalog Explorer, open the table's **Lineage** tab and **See Lineage Graph** to view upstream/downstream tables, click a column for column-level lineage, and filter the lineage details by notebooks, jobs, pipelines, queries, and dashboards — lineage is captured automatically down to the column level.
C.Catalog Explorer lineage shows only table-to-table relationships; to see which jobs, notebooks, or dashboards consume a table, the engineer must instead parse the Delta transaction log of each object manually.
D.Lineage is only available through the REST API and cannot be visualized; Catalog Explorer has no lineage view.

Question 32

Open question ↗

A security team must monitor account activity across all workspaces attached to a Unity Catalog metastore. They want to query audit events with SQL (including which user ran each notebook command text), retain the data for analytics, and they specifically need account-level events such as Unity Catalog lineage tracking that are NOT delivered through Azure Monitor diagnostic settings.

```mermaid

flowchart TB

subgraph CFG["Audit logging configuration choices"]

A1["Where do queryable audit events live?"]

A2["What must be enabled so command_text appears for notebook/SQL commands?"]

A3["What is the table path for the audit log system table?"]

A4["How long is the free retention of the audit log system table?"]

end

```

For each requirement, select the correct option from the dropdown.

Question 33

Open question ↗

You are the data provider on a Unity Catalog-enabled Azure Databricks workspace. A partner organization also runs on Azure Databricks with a Unity Catalog-enabled workspace. Security wants the most secure sharing model with the least credential management overhead, and they also want to share a Unity Catalog volume and a notebook in addition to tables.

Which **two** statements correctly describe how to choose and secure the sharing model for this partner? (Choose TWO.)

A.Use Databricks-to-Databricks Delta Sharing: the recipient does not need a bearer token, and identity verification, authentication, and auditing are managed by Delta Sharing and the Databricks platform.
B.Only open sharing can share Unity Catalog volumes and notebook files, so you must distribute a long-lived bearer token to the partner.
C.With Databricks-to-Databricks sharing you request the recipient's Unity Catalog metastore **sharing identifier** and use it to establish the secure connection, so no token-based credentials are required.
D.Open Delta Sharing should be preferred here because it is inherently more secure than Databricks-to-Databricks sharing for cross-organization data exchange.
E.Databricks-to-Databricks sharing requires you to generate and rotate a credential file for the recipient, just like open sharing.
F.Because both organizations are on the same Unity Catalog metastore, Delta Sharing is unnecessary and you should grant the partner direct privileges instead.

Question 34

Open question ↗

You are a metastore admin acting as the data provider. You must securely share the table `sales.gold.orders` with a partner who already has a Unity Catalog-enabled Azure Databricks workspace (Databricks-to-Databricks sharing). Drag the configuration steps into the correct order, from first to last.

```mermaid

flowchart TB

subgraph TILES["Steps (unordered)"]

T1["Create a share object in your Unity Catalog metastore: CREATE SHARE sales_share"]

T2["Enable Delta Sharing on the Unity Catalog metastore"]

T3["Grant the recipient access to the share: GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_co"]

T4["Add the table to the share: ALTER SHARE sales_share ADD TABLE sales.gold.orders"]

T5["Create a recipient using the partner's metastore sharing identifier: CREATE RECIPIENT partner_co USING ID '...'"]

end

subgraph SLOTS["Correct order"]

S1["Step 1"]

S2["Step 2"]

S3["Step 3"]

S4["Step 4"]

S5["Step 5"]

end

```

Question 35

Open question ↗

A retail company drops new sales export files into an Azure Data Lake Storage Gen2 container governed by a Unity Catalog external location. The upstream system writes **millions of small files per day**, the schema occasionally gains new columns, and you must guarantee that every file is processed **exactly once** without reprocessing files that were already ingested. The team wants the ingestion job to scale to billions of files over time and to evolve the schema without manual intervention.

Which extraction strategy and source configuration should you implement?

```python

# Candidate pattern under evaluation

(spark.readStream.format("cloudFiles")

.option("cloudFiles.format", "json")

.option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schema")

.load("abfss://[email protected]/incoming/"))

```

A.Full extraction on a daily schedule using `spark.read.format("json")` over the entire directory, overwriting the target table each run.
B.Incremental extraction with Auto Loader (`cloudFiles`) reading JSON, using `cloudFiles.schemaLocation` for schema inference and evolution and the RocksDB checkpoint for exactly-once tracking.
C.Incremental extraction with `COPY INTO` from JSON, scheduled hourly, relying on its load history to skip processed files.
D.Full extraction with `CONVERT TO DELTA` over the raw Parquet directory once per day.

Question 36

Open question ↗

You are designing ingestion into a Unity Catalog lakehouse for three sources. For each scenario, choose the **most appropriate ingestion approach** from the dropdown. The goal is to minimize long-term operational maintenance while keeping data fresh and governed by Unity Catalog.

- **Scenario 1:** Ingest **Salesforce** opportunity and account objects with low-code setup, automatic incremental reads, and SCD type 2 history tracking, governed by Unity Catalog on serverless compute.

- **Scenario 2:** Ingest **new JSON files arriving in an ADLS Gen2 external location**, where you need full control over complex transformations during ingestion using PySpark and Auto Loader.

- **Scenario 3:** A team already has an existing enterprise **Azure Data Factory** orchestration and only needs to land copied files into storage, with Databricks reading them afterward.

```mermaid

flowchart LR

S1["Scenario 1:\nSalesforce SaaS,\nlow-code, SCD2"] --> D1{Dropdown 1}

S2["Scenario 2:\nADLS JSON,\ncustom PySpark transforms"] --> D2{Dropdown 2}

S3["Scenario 3:\nExisting ADF copy\norchestration"] --> D3{Dropdown 3}

D1 -. options .-> O["Lakeflow Connect managed connector\n|\nAuto Loader in a notebook / pipeline\n|\nAzure Data Factory copy + Databricks read"]

D2 -. options .-> O

D3 -. options .-> O

```

Question 37

Open question ↗

A fraud-detection team needs new transaction events ingested into a Unity Catalog table with **end-to-end latency of a few seconds**, processing only newly arrived records and recovering automatically from failures. A second reporting team consumes the same data but only needs an aggregate refreshed **once an hour** with always-accurate, fully-consistent batch results.

You are building both flows with Lakeflow Spark Declarative Pipelines. Which combination of objects and pipeline mode should you choose to satisfy the fraud-detection requirement specifically?

```sql

-- Fraud-detection ingestion (latency: seconds)

CREATE OR REFRESH STREAMING TABLE transactions_bronze

AS SELECT * FROM STREAM read_files(

'abfss://[email protected]/txn/',

format => 'json'

);

```

A.A streaming table in a pipeline running in **continuous** mode.
B.A materialized view in a pipeline running in **triggered** mode.
C.A streaming table in a pipeline running in **triggered** mode scheduled every 4 hours.
D.A materialized view in a pipeline running in **continuous** mode with a 2-hour trigger interval.

Question 38

Open question ↗

You are creating a new gold-layer fact table in a Unity Catalog catalog. Requirements:

- ACID transactions with `MERGE`, `UPDATE`, and `DELETE` support.

- Time travel and rollback for accidental data corrections.

- Change data feed so downstream pipelines can read only row-level changes.

- The table must be a **Unity Catalog managed table** that benefits from automatic optimization (predictive optimization, auto-compaction) and the broadest set of platform features.

Which table format should you choose for this managed table?

```sql

CREATE TABLE finance.gold.sales_fact (

sale_id BIGINT,

customer_id BIGINT,

amount DECIMAL(18,2),

sale_ts TIMESTAMP

)

USING <FORMAT>

TBLPROPERTIES (delta.enableChangeDataFeed = true);

```

A.CSV
B.Parquet
C.Delta Lake
D.JSON

Question 39

Open question ↗

A data engineer is creating a new Delta table named `events` in Unity Catalog. The table is expected to hold about **300 GB** of data. The previous Hive-based design partitioned the data by `event_date` **and** by high-cardinality `user_id`, which produced hundreds of thousands of tiny files and slow queries. Most analytical queries filter on `event_date` and `country`.

What should you do to optimize the data layout and avoid over-partitioning for this new table?

```sql

-- Proposed table definition (choose the correct layout strategy)

CREATE TABLE analytics.events (

event_id BIGINT,

user_id BIGINT,

country STRING,

event_date DATE,

payload STRING

)

<LAYOUT_STRATEGY>;

```

A.Keep Hive-style partitioning but partition by `event_date` and `user_id` to maximize partition pruning.
B.Partition by `user_id` only, because it has the highest cardinality and therefore creates the most partitions.
C.Do not partition the table; instead enable liquid clustering with `CLUSTER BY (event_date, country)` so the layout can evolve and avoids small-file/over-partitioning problems.
D.Partition by `event_date` and additionally Z-order by `country` and `user_id` on the same column set used for partitioning.

Question 40

Open question ↗

You are modeling a `dim_customer` dimension in Unity Catalog from a CDC feed using Lakeflow Spark Declarative Pipelines `AUTO CDC ... INTO`. The business has these requirements:

- Auditors and analysts must be able to reconstruct **what a customer's address and tier were at any point in time** (full historical record).

- Each historical version must carry validity columns so the **current** row is identifiable (active record has a null end timestamp).

- Out-of-order CDC events must be sequenced correctly by an event timestamp.

Which **two** statements correctly describe the SCD design you should implement? (Choose TWO.)

A.Use SCD Type 2 (`STORED AS SCD TYPE 2`), which keeps all versions with `__START_AT`/`__END_AT` validity columns and marks the active row with a null end value.
B.Use SCD Type 1, because it preserves the complete history of every change to the customer record.
C.Use the `SEQUENCE BY` (or `_sequence_by`) clause so out-of-order CDC records are applied in the correct order.
D.Use SCD Type 1 with change data feed disabled, because Type 1 inherently stores every prior version.
E.Use SCD Type 3, which is the only type that can store the full unbounded history of all past values.