DP-750 Practice Questions — Page 9

Question 81

You are packaging a Lakeflow ETL project as a Databricks Asset Bundle (now part of Declarative Automation Bundles) so it can be source-controlled and deployed through CI/CD. A teammate sends you the proposed project layout:

```
my_etl_project/
├── databricks.yml
├── src/
│   └── ingest.py
└── resources/
    └── etl_job.yml
```

And the root configuration file:

```yaml
bundle:
  name: my_etl_project

include:
  - resources/*.yml

targets:
  dev:
    default: true
```

Which statement correctly describes the requirements and behavior of this bundle configuration file?

A. A bundle must contain exactly one configuration file named `databricks.yml` at the root of the project; it defines the required `bundle` name and can reference other configuration files (such as `resources/etl_job.yml`) through the `include` mapping.
B. A bundle may contain any number of `databricks.yml` files in any folder; the Databricks CLI merges all of them at deploy time regardless of location.
C. Resource definitions such as jobs and pipelines must be written inline in `databricks.yml`; the `include` mapping is only for Python wheel artifacts and cannot reference YAML resource files.
D. The `targets` mapping is optional metadata only; without a `--target` flag the CLI deploys to every workspace listed in the user's `.databrickscfg` profile.
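
To make the `include` pattern concrete, the following sketch lists which files under the project root a glob such as `resources/*.yml` would match. It uses Python's `pathlib` globbing as a stand-in, and the helper name `matched_includes` is hypothetical; it is an illustration of the pattern, not the Databricks CLI's own implementation.

```python
from pathlib import Path
import tempfile


def matched_includes(root: Path, patterns: list[str]) -> list[str]:
    """Return file paths (relative to root) matched by the glob patterns, sorted."""
    matches = set()
    for pattern in patterns:
        for path in root.glob(pattern):
            if path.is_file():
                matches.add(path.relative_to(root).as_posix())
    return sorted(matches)


if __name__ == "__main__":
    # Recreate the layout from the question in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "my_etl_project"
        (root / "src").mkdir(parents=True)
        (root / "resources").mkdir()
        for rel in ("databricks.yml", "src/ingest.py", "resources/etl_job.yml"):
            (root / rel).write_text("")
        print(matched_includes(root, ["resources/*.yml"]))
        # prints ['resources/etl_job.yml']
```

Only the file under `resources/` with a `.yml` extension is picked up; `src/ingest.py` and the root `databricks.yml` do not match the pattern.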

Question 82

You maintain a single Databricks Asset Bundle that must deploy the same Lakeflow job to two environments that point at different all-purpose cluster IDs and different catalogs. You define a custom variable for the cluster ID and override it per target:

```yaml
bundle:
  name: sales_pipeline

variables:
  cluster_id:
    description: The all-purpose cluster used by the job
    default: 0000-000000-default

targets:
  dev:
    default: true
    variables:
      cluster_id: 1234-567890-abcde123
  prod:
    variables:
      cluster_id: 2345-678901-bcdef234
```

In a CI pipeline you run the deploy for production as:

```bash
BUNDLE_VAR_cluster_id=9999-111111-override databricks bundle deploy --target prod
```

Which `cluster_id` value does the Databricks CLI use for the production deployment, and why?

A. `2345-678901-bcdef234`, because per-target `variables` mappings always take precedence over environment variables.
B. `9999-111111-override`, because an environment variable prefixed with `BUNDLE_VAR_` has higher precedence than the value defined in the target's `variables` mapping.
C. `0000-000000-default`, because the presence of a `BUNDLE_VAR_` environment variable forces the CLI to ignore both target and command-line values and fall back to the declared default.
D. The deploy fails with a conflict error because a variable cannot be set both in a `targets.variables` mapping and through a `BUNDLE_VAR_` environment variable.
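
The shell form `NAME=value command` used in the CI step sets the variable for that single command only. As a rough sketch of the same invocation from a Python-based CI script, the hypothetical helper `deploy_invocation` below builds the argument list and a copied environment; it deliberately executes nothing and makes no claim about how the CLI resolves the variable.

```python
import os


def deploy_invocation(target: str, overrides: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Build argv and environment for a `databricks bundle deploy` call.

    Each override is exposed as BUNDLE_VAR_<name>, mirroring the shell
    prefix assignment shown in the question. Nothing is executed here.
    """
    env = dict(os.environ)  # copy, so the parent environment stays untouched
    for name, value in overrides.items():
        env[f"BUNDLE_VAR_{name}"] = value
    argv = ["databricks", "bundle", "deploy", "--target", target]
    return argv, env


if __name__ == "__main__":
    argv, env = deploy_invocation("prod", {"cluster_id": "9999-111111-override"})
    print(argv)
    print(env["BUNDLE_VAR_cluster_id"])
    # To actually run it (requires the Databricks CLI on PATH):
    #   import subprocess
    #   subprocess.run(argv, env=env, check=True)
```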

Question 83

You have authored a Databricks Asset Bundle locally with a `databricks.yml` that defines a `dev` target and a Lakeflow job named `sample_job`. You have already configured the Databricks CLI (v0.218.0+) with OAuth U2M authentication to your workspace. From the bundle project root you want to:

1. Confirm the configuration is syntactically valid before deploying.

2. Deploy the job and its source files to the `dev` target workspace.

3. Trigger an actual run of the deployed job from the command line.

Which sequence of Databricks CLI commands accomplishes these three steps in order?

Question 84

Your organization deploys Databricks Asset Bundles from a GitHub Actions CI pipeline. On every merge to `main`, the pipeline must authenticate non-interactively to the Azure Databricks workspace, validate the bundle, and deploy it to the `prod` target. A draft workflow uses the Databricks-provided `databricks/setup-cli` action and then runs `bundle` commands.

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle validate -t prod
        env:
          DATABRICKS_TOKEN: ${{ secrets.SP_TOKEN }}
      - run: databricks bundle deploy -t prod
        env:
          DATABRICKS_TOKEN: ${{ secrets.SP_TOKEN }}
```

Which TWO statements describe Databricks-recommended practices for this CI deployment? (Choose TWO.)

A. The pipeline should authenticate as a Databricks service principal (for example, using an OAuth machine-to-machine token or workload identity federation) rather than an interactive user, so that automated deploys do not depend on a human's credentials.
B. You should run `databricks bundle validate` before `databricks bundle deploy` so that misconfigurations in `databricks.yml` are caught early in the pipeline, before any resources are created in the workspace.
C. Service principals cannot be used for automated bundle deployment; the workflow must store a personal access token belonging to a workspace admin in `secrets.SP_TOKEN`.
D. Because bundles are declarative, you must edit each job and pipeline manually in the workspace UI after deployment to point it at the production cluster; the bundle cannot set environment-specific values.
E. Workload identity federation is discouraged for CI/CD because it requires storing a long-lived Databricks secret in the repository, which is less secure than a static token.

Question 85

Your team uses an Azure Databricks Git folder under `/Workspace/Users/<email>/` to develop notebooks against a shared GitHub repository. A developer finishes a notebook, commits it, and pushes it to a feature branch. After the pull request is merged into `main`, a manager claims that the merged notebook is now automatically live in the production workspace.

**Proposed solution:** Committing and pushing notebooks from a user-level Git folder, and merging the pull request into the `main` branch on the Git provider, automatically deploys those notebooks into the production Databricks workspace with no further action required.

**Does this solution meet the goal?**

A. Yes
B. No

Question 86

You are the platform engineer for the Contoso lakehouse on Azure Databricks. Finance has asked you to produce a chargeback report that shows the total DBU consumption attributed to the `cost_center` of `analytics-eu` across all classic compute and serverless jobs over the last 30 days.

You have already enforced a custom cluster tag named `cost_center` through a compute policy, and the tag also propagates to serverless usage through a serverless usage policy. You need to query the appropriate Unity Catalog system table so the report attributes usage to that custom tag.

Which query returns the correct chargeback figure?

```sql
-- Option A
SELECT sku_name, usage_unit, SUM(usage_quantity) AS usage
FROM system.billing.usage
WHERE custom_tags['cost_center'] = 'analytics-eu'
  AND usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY sku_name, usage_unit;
```

```sql
-- Option B
SELECT SUM(usage_quantity) AS usage
FROM system.compute.clusters
WHERE tags['cost_center'] = 'analytics-eu';
```

```sql
-- Option C
SELECT SUM(EstimatedCost) AS usage
FROM system.access.audit
WHERE ServiceName = 'clusters'
  AND cost_center = 'analytics-eu';
```

```sql
-- Option D
SELECT SUM(usage_quantity) AS usage
FROM system.billing.list_prices
WHERE custom_tags['cost_center'] = 'analytics-eu';
```

A. Option A: query `system.billing.usage`, filter on `custom_tags['cost_center']`
B. Option B: query `system.compute.clusters`, filter on `tags['cost_center']`
C. Option C: query `system.access.audit`, filter on a `cost_center` column
D. Option D: query `system.billing.list_prices`, filter on `custom_tags['cost_center']`

Question 87

You operate several Lakeflow Jobs on Azure Databricks. For each operational situation below, you must choose the correct action in the Lakeflow Jobs UI/API. The diagram shows each situation mapped to a dropdown of possible actions.

```mermaid
flowchart TD
    subgraph Situations
        S1["A multi-task job failed on task 'transform'. Upstream tasks succeeded. You fixed the notebook and want to re-run only the failed and dependent tasks, reusing successful results."]
        S2["A continuous job hit consecutive failures and is now in exponential-backoff state. You fixed the cause and want to reset the retry period and immediately start a new run."]
        S3["You need to launch a one-off execution of a scheduled job right now, outside its normal trigger, with the current settings."]
        S4["A single-task job is hung consuming compute and producing nothing useful. You must terminate the in-flight run immediately."]
    end
    subgraph Actions
        A1["Repair run"]
        A2["Restart run"]
        A3["Run now"]
        A4["Cancel / Stop run"]
    end
    S1 -.dropdown.-> A1
    S2 -.dropdown.-> A2
    S3 -.dropdown.-> A3
    S4 -.dropdown.-> A4
```

For each situation (S1–S4), select the single correct action.

Question 88

A nightly ETL job on a classic all-purpose cluster (Databricks Runtime 15.4 LTS, Photon enabled) runs a large `GROUP BY` aggregation over a 2 TB Delta table. The job is slow. In the compute **Metrics** tab you observe:

- CPU utilization on the workers is pinned near 100% for the duration of the stage.

- The Spark UI shows the aggregation stage has **only 8 tasks**, while the cluster has 64 worker cores available.

- Memory utilization is moderate (no spill is reported) and there are no failed tasks.

You must reduce the wall-clock time of the aggregation stage by improving parallelism, without changing the data or adding nodes.

Which action most directly resolves the bottleneck?

A. Increase the number of shuffle partitions (for example, set `spark.sql.shuffle.partitions` to a value such as 128–256, or `auto`) so the aggregation runs across more tasks and uses all 64 cores.
B. Disable Photon to free CPU cycles for the JVM, because Photon is consuming the CPU headroom needed by the aggregation.
C. Increase `spark.executor.memory` to 32 GB per executor to give the aggregation more heap.
D. Switch the worker nodes to memory-optimized instance types and reduce the number of cores per executor to one.
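
The gap between the stage's task count and the available cores in the scenario can be put into numbers. The sketch below assumes each task occupies exactly one core (Spark's default `spark.task.cpus` of 1); the helper name `stage_parallelism` is hypothetical and the second call uses an arbitrary task count purely for comparison.

```python
import math


def stage_parallelism(num_tasks: int, total_cores: int) -> dict:
    """Summarize how well a stage's task count fills the available cores."""
    busy_cores = min(num_tasks, total_cores)
    return {
        "busy_cores": busy_cores,
        "idle_cores": total_cores - busy_cores,
        "core_utilization": busy_cores / total_cores,
        # Scheduling "waves" needed if every task takes one core slot.
        "waves": math.ceil(num_tasks / total_cores),
    }


if __name__ == "__main__":
    # The stage from the question: 8 tasks on 64 worker cores.
    print(stage_parallelism(num_tasks=8, total_cores=64))
    # An arbitrary larger task count on the same cluster, for comparison.
    print(stage_parallelism(num_tasks=256, total_cores=64))
```

With 8 tasks, at most 8 of the 64 cores can work on the stage at once, so the remaining 56 have nothing to schedule.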

Question 89

A join-heavy job intermittently has one very long stage. You open the Spark UI, drill into the longest stage, and read the **Summary Metrics** for the stage's tasks:

```text
          Duration (task)   Shuffle Read Size
min:      12 s              45 MB
25th:     14 s              48 MB
median:   15 s              50 MB
75th:     16 s              52 MB
max:      9.4 min           6.1 GB   <-- a single task
```

There is no spill reported for the stage. You must correctly classify the root cause from these metrics so you can apply the right remediation.

A. Data skew: one partition (and its task) holds far more data than the others, shown by the **Max** duration and shuffle-read size being far above the 75th percentile.
B. Disk spill: the stage is moving data from memory to disk, shown by the long Max task.
C. Small-file problem: the stage is reading too many tiny input files, shown by the uniform median task times.
D. Driver bottleneck: a `collect()` is pulling the result to the driver, shown by the high Max task duration.
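
One way to quantify how far the Max row sits from a typical task is the ratio of max to median for each metric. The sketch below plugs in the values from the Summary Metrics table; it assumes 1 GB = 1024 MB for the shuffle-read conversion, and the helper name `max_to_median` is hypothetical.

```python
def max_to_median(max_value: float, median_value: float) -> float:
    """Ratio of the largest observation to the median; 1.0 means no outlier."""
    if median_value <= 0:
        raise ValueError("median must be positive")
    return max_value / median_value


if __name__ == "__main__":
    # Task duration, converted to seconds.
    duration_ratio = max_to_median(9.4 * 60, 15)
    # Shuffle read size, converted to MB (assumes 1 GB = 1024 MB).
    shuffle_ratio = max_to_median(6.1 * 1024, 50)
    print(f"duration max/median:     {duration_ratio:.1f}x")
    print(f"shuffle read max/median: {shuffle_ratio:.1f}x")
```

For comparison, the 75th-percentile rows are within about 7% of the median for duration (16 s vs 15 s) and 4% for shuffle read (52 MB vs 50 MB).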

Question 90

A daily job joins a 3 TB `fact_orders` Delta table to a small 40 MB `dim_country` lookup table and then aggregates. In the Spark UI you find:

- The join stage reports significant **disk spill** and a very large **Shuffle Read** on both sides of a sort-merge join.

- A few post-shuffle partitions are far larger than the rest (the join key is skewed).

- AQE has been turned off in this workspace by a legacy `spark.databricks.optimizer.adaptive.enabled false` setting.

You must reduce shuffle and spill and balance the skewed partitions. Which **three** actions are appropriate? (Choose THREE.)

A. Re-enable Adaptive Query Execution (`spark.databricks.optimizer.adaptive.enabled true`) so Spark can coalesce post-shuffle partitions and apply skew-join handling at runtime.
B. Force a broadcast hash join of the 40 MB `dim_country` table (for example with a `BROADCAST` hint) to eliminate the shuffle of the large fact table against the lookup.
C. Use `repartition()` on the skewed join key to redistribute the fact data into more, balanced partitions before the join.
D. Replace the join with `coalesce(1)` on the fact DataFrame to reduce the number of output partitions and therefore the shuffle volume.
E. Disable Photon so the sort-merge join falls back to row-based execution.
F. Set `spark.sql.shuffle.partitions` to 1 so the entire shuffle happens in a single task.