DP-750 Certification Practice Question #39

Question

A data engineer is creating a new Delta table named `events` in Unity Catalog. The table is expected to hold about **300 GB** of data. The previous Hive-based design partitioned the data by `event_date` **and** by high-cardinality `user_id`, which produced hundreds of thousands of tiny files and slow queries. Most analytical queries filter on `event_date` and `country`.

What should you do to optimize the data layout and avoid over-partitioning for this new table?

```sql
-- Proposed table definition (choose the correct layout strategy)
CREATE TABLE analytics.events (
  event_id BIGINT,
  user_id  BIGINT,
  country  STRING,
  event_date DATE,
  payload STRING
)
<LAYOUT_STRATEGY>;
```

Accepted Answer

The correct layout is liquid clustering on `event_date` and `country`, with no partitioning. At about 300 GB the table is well under 1 TB, and Databricks recommends against partitioning tables below that size, and against partitioning on any column unless each partition would hold at least about 1 GB. The earlier design over-partitioned on high-cardinality `user_id`, which is what produced the many tiny files and slow scans. Liquid clustering replaces both partitioning and Z-ordering: it is self-tuning, skew-resistant, supports high-cardinality columns, and lets you redefine the clustering keys later without rewriting existing data. You also cannot Z-order on a partition column, which eliminates option D.
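As a rough sketch of what this answer looks like in practice, the placeholder in the question's table definition could be filled with a `CLUSTER BY` clause. This assumes a Databricks runtime that supports liquid clustering and that `analytics` resolves to a schema in the current Unity Catalog catalog. The later `ALTER TABLE` statement and its key choice are illustrative only and are not part of the original question.

```sql
-- Liquid clustering on the two columns most queries filter on.
-- No PARTITIONED BY clause: the table is far below the ~1 TB guideline.
CREATE TABLE analytics.events (
  event_id   BIGINT,
  user_id    BIGINT,
  country    STRING,
  event_date DATE,
  payload    STRING
)
CLUSTER BY (event_date, country);

-- Hypothetical later change: if query patterns shift, the clustering keys
-- can be redefined in place. Existing files are not rewritten by this statement.
ALTER TABLE analytics.events CLUSTER BY (country, user_id);

-- OPTIMIZE incrementally clusters data according to the current keys.
OPTIMIZE analytics.events;
```

Note that `CLUSTER BY` cannot be combined with `PARTITIONED BY` or `ZORDER BY` on the same table, which is consistent with the answer treating liquid clustering as a replacement for both.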
