DP-750 Certification Practice Question #71

Question

You are authoring a Lakeflow Spark Declarative Pipeline in Python. The pipeline defines a bronze streaming table `orders_raw`, a silver streaming table `orders_clean`, and a gold materialized view `daily_revenue`. You did **not** write any explicit orchestration code to specify which dataset runs before another.

```python
import dlt
from pyspark.sql.functions import col, sum as _sum

# Bronze: a streaming read (Auto Loader) makes this a streaming table.
@dlt.table(name="orders_raw")
def orders_raw():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/sales/landing/orders"))

# Silver: reads orders_raw as a stream, so it is also a streaming table.
@dlt.table(name="orders_clean")
def orders_clean():
    return (dlt.read_stream("orders_raw")
            .filter(col("amount") > 0))

# Gold: a batch read of orders_clean makes this a materialized view.
@dlt.table(name="daily_revenue")
def daily_revenue():
    return (dlt.read("orders_clean")
            .groupBy("order_date")
            .agg(_sum("amount").alias("revenue")))
```

How does Lakeflow Spark Declarative Pipelines determine the execution order of these three datasets?

Accepted Answer

In a Lakeflow Spark Declarative Pipeline you describe *what* each dataset is, not *how* to schedule it. Lakeflow Spark Declarative Pipelines (SDP) resolves the `dlt.read_stream("orders_raw")` and `dlt.read("orders_clean")` references to build the dependency graph automatically: `orders_raw` feeds `orders_clean`, which feeds `daily_revenue`. It then runs the flows in topological order, with as much parallelism as the dependencies allow. There is no `depends_on` decorator parameter (A), execution order is not determined by source-file position (C), and you do not need a separate Lakeflow Job to sequence datasets within one pipeline (D), because the pipeline itself is the unit of orchestration.
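
To make "topological order with parallelism" concrete, here is a minimal sketch in plain Python using the standard library's `graphlib`. It is not how SDP is implemented internally. The `deps` map and the `execution_waves` helper are hypothetical names introduced only for illustration, and the dependencies are written out by hand rather than parsed from pipeline code.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map for the question's pipeline:
# each dataset maps to the set of datasets it reads from.
deps = {
    "orders_raw": set(),
    "orders_clean": {"orders_raw"},
    "daily_revenue": {"orders_clean"},
}

def execution_waves(dependencies):
    """Group datasets into waves; members of one wave have no
    dependencies on each other and could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable now
        waves.append(ready)
        sorter.done(*ready)  # unblock downstream datasets
    return waves

waves = execution_waves(deps)
print(waves)  # [['orders_raw'], ['orders_clean'], ['daily_revenue']]
```

Because the three datasets form a straight chain, each wave holds exactly one dataset. Reordering the entries in `deps` does not change the result, which mirrors why source-file position is irrelevant: the order comes from the references, not from where the functions are written.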
