DP-750 Practice Questions — Page 8

You are authoring a Lakeflow Spark Declarative Pipeline in Python. The pipeline defines a bronze streaming table `orders_raw`, a silver streaming table `orders_clean`, and a gold materialized view `daily_revenue`. You did **not** write any explicit orchestration code to specify which dataset runs before another.

```python
import dlt
from pyspark.sql.functions import col, sum as _sum


@dlt.table(name="orders_raw")
def orders_raw():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/sales/landing/orders"))


@dlt.table(name="orders_clean")
def orders_clean():
    return (dlt.read_stream("orders_raw")
            .filter(col("amount") > 0))


@dlt.table(name="daily_revenue")
def daily_revenue():
    return (dlt.read("orders_clean")
            .groupBy("order_date")
            .agg(_sum("amount").alias("revenue")))
```

How does Lakeflow Spark Declarative Pipelines determine the execution order of these three datasets?

  • A. You must add a `depends_on` parameter to each `@dlt.table` decorator to declare the order explicitly.
  • B. SDP automatically infers the dependency graph from the `dlt.read`/`dlt.read_stream` references between datasets and runs flows in the correct order with maximum parallelism.
  • C. Datasets execute in the top-to-bottom order they appear in the notebook source file.
  • D. You must create a separate Lakeflow Job with notebook tasks to sequence the three datasets.
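
As a rough illustration of how read references between datasets can determine a run order, the sketch below builds a dependency graph by hand from the three definitions above and sorts it with Python's standard `graphlib`. The `reads` mapping and the `execution_order` helper are hypothetical names written for this example; this is a simplified model, not the pipeline engine's actual implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical, hand-written mapping: dataset -> datasets it reads from.
# It mirrors the dlt.read / dlt.read_stream calls in the snippet above and is
# deliberately listed in reverse source order.
reads = {
    "daily_revenue": {"orders_clean"},
    "orders_clean": {"orders_raw"},
    "orders_raw": set(),  # reads landing files, no upstream dataset
}


def execution_order(reads):
    """Return an order in which every dataset comes after the datasets it reads."""
    return list(TopologicalSorter(reads).static_order())


order = execution_order(reads)
print(order)
```

Because the order is derived from the references alone, listing the datasets in a different order in `reads` does not change the result.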

A new data engineer must create their first Lakeflow Job in a Unity Catalog-enabled, serverless-enabled Azure Databricks workspace. The job's first unit of work is to run an existing notebook at the path `/Workspace/etl/ingest_bronze`.

Using the workspace UI, which sequence of actions correctly creates the job with this initial notebook task?

  • A. Click **Jobs & Pipelines** → **Create** → **Job**, choose the **Notebook** tile, enter a task name, set the notebook **Path**, then click **Create task**.
  • B. Click **Catalog** → **Create table** → attach a notebook, which automatically becomes a scheduled job.
  • C. Open the notebook, run all cells once, and Databricks automatically registers it as a recurring job.
  • D. Click **Compute** → **Create cluster**, then attach the notebook to the cluster to register it as a job.

You administer three Lakeflow Jobs, each with a different latency and cost requirement. For each job you must choose the single most appropriate trigger type from the **Schedules & Triggers** panel.

Requirements:

- **Job 1 (Nightly reporting):** Must run once every day at 02:00, regardless of whether new data has arrived. Cost should be minimized; compute should not run between executions.

- **Job 2 (Irregular file drops):** A partner uploads files to a Unity Catalog external location at unpredictable times. The job should run only when new files appear, and a polling schedule would waste compute.

- **Job 3 (Always-on streaming):** Must keep processing with the lowest possible latency by starting a new run as soon as the previous run completes or fails.

For each job, select the correct trigger type.

```mermaid
flowchart TD
    J1["Job 1: Nightly reporting at 02:00"] --> D1{"Trigger type?<br/>Scheduled / File arrival / Continuous"}
    J2["Job 2: Irregular partner file drops"] --> D2{"Trigger type?<br/>Scheduled / File arrival / Continuous"}
    J3["Job 3: Always-on, lowest latency"] --> D3{"Trigger type?<br/>Scheduled / File arrival / Continuous"}
```
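
For orientation, the sketch below shows what Jobs API-style settings for the three trigger families can look like, written as Python dictionaries. It is one plausible mapping under stated assumptions, not a definitive configuration: the `UTC` time zone, the storage URL, and the `trigger_kind` helper are hypothetical and do not come from the question.

```python
# Illustrative job settings fragments, one per trigger family.
job_triggers = {
    "job_1_nightly_reporting": {
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",  # 02:00:00 every day
            "timezone_id": "UTC",  # hypothetical: the question names no time zone
            "pause_status": "UNPAUSED",
        }
    },
    "job_2_irregular_file_drops": {
        "trigger": {
            "file_arrival": {
                # Hypothetical external location URL.
                "url": "abfss://landing@examplestorage.dfs.core.windows.net/partner/"
            },
            "pause_status": "UNPAUSED",
        }
    },
    "job_3_always_on": {
        "continuous": {"pause_status": "UNPAUSED"}
    },
}


def trigger_kind(settings):
    """Name the trigger family used by a settings fragment."""
    if "continuous" in settings:
        return "Continuous"
    if "file_arrival" in settings.get("trigger", {}):
        return "File arrival"
    if "schedule" in settings:
        return "Scheduled"
    return "Manual"


for job_name, settings in job_triggers.items():
    print(job_name, "->", trigger_kind(settings))
```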

You are scheduling a Lakeflow Job using the **Advanced** schedule type with the **Show Cron Syntax** option enabled. Databricks uses Quartz cron syntax, where the six fields are, in order: `seconds minutes hours day-of-month month day-of-week`, and `?` marks an unspecified day field.

The job must run **every day at exactly 09:46:00** in the `America/Los_Angeles` time zone. You define the schedule in JSON for the Databricks CLI:

```json
{
  "schedule": {
    "quartz_cron_expression": "<CRON>",
    "timezone_id": "America/Los_Angeles",
    "pause_status": "UNPAUSED"
  },
  "max_concurrent_runs": 1
}
```

Which value should replace `<CRON>` so the job runs daily at 09:46:00 local time?

  • A. `0 46 9 * * ?`
  • B. `46 9 0 * * ?`
  • C. `0 9 46 * * *`
  • D. `9 46 0 ? * *`
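
The field order stated in the question can be checked mechanically. The helper below is a minimal sketch that labels each position of a six-field Quartz expression; `label_quartz_fields` is a hypothetical name, the sample expression is an unrelated example (14:30:00 every day), and the optional seventh year field that Quartz also accepts is not handled.

```python
QUARTZ_FIELDS = (
    "seconds",
    "minutes",
    "hours",
    "day_of_month",
    "month",
    "day_of_week",
)


def label_quartz_fields(expression):
    """Map each whitespace-separated field of a six-field Quartz expression to its name."""
    parts = expression.split()
    if len(parts) != len(QUARTZ_FIELDS):
        raise ValueError(f"expected {len(QUARTZ_FIELDS)} fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))


# Unrelated sample expression: 14:30:00 every day.
labeled = label_quartz_fields("0 30 14 * * ?")
for name, value in labeled.items():
    print(f"{name:>12}: {value}")
```

Running each answer option through the same helper shows which value lands in the seconds, minutes, and hours positions.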

Your team's operations runbook for a production Lakeflow Job requires that the on-call engineer be notified in the following situations:

1. Whenever a run **fails**, so they can investigate immediately.

2. Whenever a run **exceeds its expected duration**, so they can detect slow-running jobs before an SLA breach.

3. Whenever a run **completes successfully**, so the downstream consumer team knows fresh data is available.

You open the **Job notifications** section of the **Job details** panel and click **Add notification**. Which notification event types must you select to satisfy all three requirements? **(Choose THREE.)**

  • A. Failure
  • B. Duration warning
  • C. Success
  • D. Start
  • E. Streaming backlog
  • F. Cluster termination

Your team runs a nightly Lakeflow Job named `nightly_sales_etl` that ingests data from an external REST API into a Delta table. The ingestion task occasionally fails because the upstream API returns transient HTTP 503 errors that resolve themselves within a minute or two. Currently the task has **no retry policy configured**, so a single 503 error fails the entire run and pages the on-call engineer.

You must change the task configuration so that transient API failures are automatically tolerated, while still ensuring a genuinely broken task does not retry forever and a hung task is eventually killed. You add a retry policy to the task with the following intent:

```json
{
  "task_key": "ingest_api_data",
  "max_retries": 3,
  "min_retry_interval_millis": 60000,
  "timeout_seconds": 1800,
  "notebook_task": {
    "notebook_path": "/Shared/etl/ingest_api_data"
  }
}
```

Which statement correctly describes how Lakeflow Jobs applies this configuration when the task fails with a transient error?

  • A. The task is restarted up to 3 times after a failure, the retry waits at least 60 seconds between the start of the failed run and the next retry, and the 30-minute `timeout_seconds` applies independently to each retry run.
  • B. The task is restarted exactly once because `max_retries` is capped at 1 for notebook tasks, and the timeout applies only to the original run.
  • C. The task retries indefinitely with exponential backoff because any value of `max_retries` greater than 0 enables continuous-mode backoff, and `timeout_seconds` is ignored.
  • D. The retry policy is rejected at deployment because a task cannot define both `timeout_seconds` and `max_retries`; you must choose one or the other.
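
To reason about how a bounded retry count, a minimum retry interval, and a per-attempt timeout interact, the sketch below simulates a task timeline in plain Python. It is a simplified model under stated assumptions, not the Jobs service's implementation: the interval is measured from the start of the failed attempt, the timeout is applied to each attempt separately, and a timed-out attempt is assumed not to be retried. `simulate_task` and the sample durations are hypothetical.

```python
def simulate_task(attempt_outcomes, max_retries, min_retry_interval_s, timeout_s):
    """Simulate attempts of one task.

    attempt_outcomes: list of (duration_s, succeeded) tuples, one per attempt.
    Returns a list of (start_s, end_s, status) tuples for the attempts that ran.
    """
    timeline = []
    start = 0.0
    for attempt, (duration, succeeded) in enumerate(attempt_outcomes):
        if attempt > max_retries:  # original attempt plus at most max_retries retries
            break
        timed_out = duration > timeout_s
        end = start + min(duration, timeout_s)  # timeout caps each attempt separately
        if timed_out:
            status = "TIMED_OUT"
        elif succeeded:
            status = "SUCCESS"
        else:
            status = "FAILED"
        timeline.append((start, end, status))
        if status != "FAILED":
            break  # success ends the task; timeouts are assumed not to be retried
        # Next attempt starts once the failed attempt has ended and at least
        # min_retry_interval_s has passed since the failed attempt started.
        start = max(end, start + min_retry_interval_s)
    return timeline


# Two transient failures followed by a success, using the values from the JSON above.
timeline = simulate_task(
    [(20, False), (90, False), (45, True)],
    max_retries=3,
    min_retry_interval_s=60,
    timeout_s=1800,
)
for start, end, status in timeline:
    print(f"{start:>6.0f}s -> {end:>6.0f}s  {status}")
```

In this model a task that never succeeds stops after four attempts in total (one original run plus three retries), and a hung attempt is cut off at 1800 seconds.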

A production Lakeflow Job orchestrates five tasks that run in sequence: `bronze_ingest` → `silver_clean` → `gold_aggregate` → `publish_dashboard` → `notify_team`. During last night's run, `bronze_ingest`, `silver_clean`, and `gold_aggregate` all completed successfully, each writing to a Delta table and each consuming 40 minutes of compute, but `publish_dashboard` failed because a downstream BI warehouse was temporarily offline. As a result, `notify_team` was skipped.

The warehouse is now back online. You need to complete the failed run **without recomputing the three expensive upstream tasks** and while preserving the matrix view history of the original run. You open the **Job run details** page for the failed run.

Which action satisfies the requirement most efficiently?

  • A. Click **Run now** to trigger a brand-new run of the entire job; all five tasks re-execute from `bronze_ingest`.
  • B. Click **Repair run**, which re-runs only the unsuccessful task `publish_dashboard` and its dependent task `notify_team`, keeping the successful results of the upstream tasks.
  • C. Edit the job to delete `bronze_ingest`, `silver_clean`, and `gold_aggregate`, then click **Run now** so that only the remaining tasks execute.
  • D. Clone the job, remove the three completed tasks from the clone, and run the clone; this avoids recomputation and keeps the original history intact.
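
The scenario's task graph can be used to work out which tasks still need to run. The sketch below is a simplified model, not the Jobs service's logic: it selects every task that did not succeed, plus anything downstream of such a task. The `depends_on` and `status` dictionaries encode the run described above, and `tasks_needing_rerun` is a hypothetical helper.

```python
# Task -> upstream tasks, following the five-step chain in the scenario.
depends_on = {
    "bronze_ingest": [],
    "silver_clean": ["bronze_ingest"],
    "gold_aggregate": ["silver_clean"],
    "publish_dashboard": ["gold_aggregate"],
    "notify_team": ["publish_dashboard"],
}

# Result of last night's run, as described in the scenario.
status = {
    "bronze_ingest": "SUCCESS",
    "silver_clean": "SUCCESS",
    "gold_aggregate": "SUCCESS",
    "publish_dashboard": "FAILED",
    "notify_team": "SKIPPED",
}


def tasks_needing_rerun(depends_on, status):
    """Return tasks that did not succeed, plus every task downstream of them."""
    rerun = {task for task, state in status.items() if state != "SUCCESS"}
    changed = True
    while changed:  # propagate to dependents until nothing new is added
        changed = False
        for task, upstream in depends_on.items():
            if task not in rerun and any(u in rerun for u in upstream):
                rerun.add(task)
                changed = True
    # Keep the original task order for readability.
    return [task for task in depends_on if task in rerun]


rerun = tasks_needing_rerun(depends_on, status)
print(rerun)
```

The three upstream tasks are left out of the result, so their 40-minute computations would not be repeated in this model.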

A data engineering team of six developers is adopting Azure Databricks Git folders to version-control their notebooks and Python files against a shared GitHub repository. They want to follow Databricks' recommended collaboration best practices so that developers do not interfere with each other's work and so that beginners avoid rewriting shared history.

The team lead proposes a workflow and asks you to validate it. Which set of practices aligns with Databricks' documented recommendations for collaborating in Git folders?

  • A. All six developers clone the repository into a single shared Git folder and take turns committing; for combining branches, they always use `git rebase` because it produces a linear history.
  • B. Each developer creates their own Git folder mapped to the repository under their user folder, works on their own development branch, and for beginners uses **merge** rather than **rebase** to combine branches because merge does not require force-pushing or rewriting commit history.
  • C. Developers commit directly to the `main` branch from their Git folders, relying on pull-with-conflict-resolution to reconcile everyone's simultaneous changes at push time.
  • D. Each developer uses a personal Git folder, but all developers commit to a single shared feature branch and resolve conflicts by always selecting **Take all incoming changes** to avoid manual review.

A developer is implementing a new feature in an Azure Databricks Git folder cloned from a shared GitHub repository. She must complete the full branch-to-merge lifecycle using Databricks' documented workflow: create an isolated branch, push her work, get it reviewed, and reconcile a conflict that arises when she pulls upstream changes that another engineer merged first.

For each workflow stage in the diagram below, select the action that correctly completes the recommended Databricks Git folders workflow.

```mermaid
flowchart TD
    S1["Stage 1: Begin isolated work"] --> D1{{"Dropdown 1"}}
    S2["Stage 2: Save and share work to remote"] --> D2{{"Dropdown 2"}}
    S3["Stage 3: Get the change into the default branch"] --> D3{{"Dropdown 3"}}
    S4["Stage 4: A conflict appears after Pull"] --> D4{{"Dropdown 4"}}
    D1 -.options.- O1["Create Branch in the Git dialog / Detach HEAD / Hard reset"]
    D2 -.options.- O2["Commit & Push / Abort merge / Delete Git folder"]
    D3 -.options.- O3["Create a pull request in the Git provider and merge / Force a branch switch / Run a hard reset"]
    D4 -.options.- O4["Resolve merge conflicts in the Git folders UI, then Continue / Switch branches to discard / Re-clone immediately"]
```

Your team is building a CI/CD process for a Lakeflow project deployed with Databricks Asset Bundles. You are defining the automated testing strategy and want to map each test type to the right tool and scope so that the pipeline catches defects at the cheapest possible layer before promotion to production.

The team agrees on these layers:

- **Unit tests** — validate isolated transformation logic (a single function or table) with mock data, fast, no production data.

- **Integration tests** — validate that workflows and data pipelines run end-to-end against real Databricks compute.

Which TWO statements correctly describe Databricks-recommended testing practices for these layers? (Choose TWO.)

  • A. Unit tests for Python business logic should be implemented with a framework such as `pytest`, and for Lakeflow Spark Declarative Pipelines you can write Python unit tests that run a subset of the pipeline against a fully isolated catalog using mock data and validate results with standard `pytest` assertions.
  • B. Integration tests for workflows and data pipelines should be run after deployment, for example using the Databricks CLI `databricks bundle run` to execute the deployed job/pipeline against real compute, and tools such as `chispa` can validate Spark DataFrames.
  • C. Unit tests must always be executed against the production Unity Catalog so that they exercise the exact data the job will see at run time.
  • D. `databricks bundle validate` replaces the need for unit and integration tests because it confirms the business logic of every transformation is correct.
  • E. Integration tests cannot be automated in CI/CD and must be performed manually by a reviewer in the workspace UI before each release.
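
As a minimal sketch of the unit-test layer described above, the example below tests one isolated transformation rule with mock data and plain assertions, in the style `pytest` would collect. It assumes the rule has been factored into a plain Python function so that no Spark session or production data is needed; `keep_positive_amounts` and the mock rows are hypothetical.

```python
def keep_positive_amounts(orders):
    """Hypothetical transformation: keep only orders whose amount is greater than zero."""
    return [order for order in orders if order["amount"] > 0]


def test_keep_positive_amounts_drops_zero_and_negative():
    # Mock data only: no production tables are read.
    mock_orders = [
        {"order_id": 1, "amount": 25.0},
        {"order_id": 2, "amount": 0.0},
        {"order_id": 3, "amount": -4.5},
    ]
    result = keep_positive_amounts(mock_orders)
    assert [order["order_id"] for order in result] == [1]


# pytest would discover the test function by its name; it is also called
# directly here so the file runs as a standalone script.
test_keep_positive_amounts_drops_zero_and_negative()
print("unit test passed")
```

A test like this runs in milliseconds on a laptop or CI runner, which is what makes it the cheapest layer at which to catch a logic defect.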