DP-750 Certification Practice Question #54

Question

You are a data engineer onboarding a new `bronze.iot.sensor_readings` Delta table into Unity Catalog. Before writing transformation logic, your lead asks you to produce, in a single notebook cell, a profile that returns **summary statistics for numeric, string, and date columns AND histograms of the value distributions for every column** so the team can spot skew, high null fractions, and high-cardinality keys at a glance.

You read the table into a DataFrame named `df`. You want the option that computes the full profile (including value-distribution histograms) over the DataFrame, not just count/mean/stddev/min/max.

```python
df = spark.read.table("bronze.iot.sensor_readings")
# ??? produce a full data profile with histograms
```

Which approach should you use?

Accepted Answer

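`dbutils.data.summarize(df)` is the programmatic equivalent of clicking **+ > Data Profile** on a `display(df)` output. It analyzes the complete DataFrame and renders summary statistics for numeric, string, and date columns **and** histograms of the value distributions for every column, which is exactly the combination the scenario requires. Because it scans all the data, it can be expensive on very large tables; by default (`precise=False`) some statistics are approximated to reduce run time. `describe()` and `summary()` are lightweight exploratory helpers limited to scalar statistics (`describe()` returns count, mean, stddev, min, and max; `summary()` adds approximate quartiles) with no distribution histograms, and the remaining options return only schema or a row count.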
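
As a rough illustration of the answer above, the sketch below wraps the call in a small helper. The name `profile_dataframe` and its return labels are hypothetical and not part of any Databricks API. `dbutils` is only predefined inside Databricks notebooks, so the helper takes it as an argument and falls back to `df.summary()` (scalar statistics only) when it is absent.

```python
def profile_dataframe(df, dbutils=None, precise=False):
    """Hypothetical helper: render the full data profile when dbutils is available.

    Returns a short label naming the path taken, so the caller can tell
    whether it got the full profile or only the scalar fallback.
    """
    if dbutils is not None:
        # Renders summary statistics and per-column histograms in the cell output.
        # precise=False (the default) allows some statistics to be approximated.
        dbutils.data.summarize(df, precise=precise)
        return "data_profile"
    # Fallback outside Databricks: scalar statistics only, no histograms.
    df.summary().show()
    return "summary_only"


# In a Databricks notebook, where `spark` and `dbutils` are predefined:
# df = spark.read.table("bronze.iot.sensor_readings")
# profile_dataframe(df, dbutils)
```

In a notebook cell, calling `dbutils.data.summarize(df)` directly is enough. The wrapper only makes the contrast with `summary()` explicit and keeps the code importable outside Databricks.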
