Normalization vs Denormalization in BigQuery for the PDE Exam

GCP Study Hub
December 26, 2025

Of all the BigQuery design questions that show up on the Professional Data Engineer exam, the normalization versus denormalization tradeoff is one of the most consistently tested. It looks like a simple database theory question on the surface, but the exam wants you to know something more specific. BigQuery is not a traditional relational database, and the default modeling instinct you brought from Postgres or MySQL is sometimes wrong here. Sometimes it is still right. Knowing which scenario calls for which approach is what separates a correct answer from a confident-but-wrong one.

Let me walk through how I think about this.

What normalization actually means

Normalization is the process of structuring a database to reduce redundancy and improve data integrity by splitting data into smaller, related tables. Each piece of data lives in exactly one place. If you have an orders table that repeats customer names and product names across every row, you would pull those out into a customers table and a products table, then reference them by ID from the orders table.
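
A minimal sketch of what that normalized layout could look like, with illustrative dataset, table, and column names:

CREATE TABLE mydataset.customers (
  customer_id INT64,
  name STRING,
  email STRING
);

CREATE TABLE mydataset.products (
  product_id INT64,
  name STRING
);

CREATE TABLE mydataset.orders (
  order_id INT64,
  customer_id INT64,  -- points at customers, no copied name
  product_id INT64,   -- points at products
  order_ts TIMESTAMP,
  amount NUMERIC
);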

The payoff is clean. You update a customer's email in one row, and every query that joins back through customer_id sees the new value. You eliminate the class of bugs where the same fact is stored in two places and gets out of sync.

The cost is that queries that need information from multiple tables have to JOIN those tables back together at read time. On large datasets, JOINs are expensive. That is the central tension.
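
Concretely, a report over the sketch tables above has to reassemble the rows it split apart:

SELECT
  o.order_id,
  c.name AS customer_name,
  p.name AS product_name,
  o.amount
FROM
  mydataset.orders AS o
JOIN
  mydataset.customers AS c ON o.customer_id = c.customer_id
JOIN
  mydataset.products AS p ON o.product_id = p.product_id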

What denormalization does about it

Denormalization reverses some of that normalization. You combine related tables into one, accepting redundancy in exchange for not having to JOIN at query time. If your workload reads the data far more often than it updates the data, the math usually favors denormalization.

The classic warning about denormalization is data redundancy. If the same customer name appears in a million order rows, updating that customer name now means rewriting a million rows. For transactional systems where updates are frequent, that is a real problem. For analytical systems where data is mostly appended and rarely updated in place, it is much less of a problem.
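
Here is the same tradeoff in a denormalized sketch, again with illustrative names. The customer name is copied into every order row, so correcting it means a DML statement that rewrites every matching row:

CREATE TABLE mydataset.orders_denorm (
  order_id INT64,
  customer_id INT64,
  customer_name STRING,  -- copied into every order row
  product_name STRING,   -- copied into every order row
  order_ts TIMESTAMP,
  amount NUMERIC
);

-- One renamed customer now touches every order row that carries the old value
UPDATE mydataset.orders_denorm
SET customer_name = 'New Name Inc'
WHERE customer_id = 42;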

Why BigQuery leans denormalized

BigQuery is built for analytics, and it actively encourages denormalization. Two reasons matter for the exam.

First, BigQuery uses columnar storage. When you query a wide denormalized table, BigQuery only scans the columns you reference in the query. The extra columns you would never have packed into a normalized schema cost you almost nothing at query time. This is fundamentally different from a row-oriented database, where wider tables hurt.
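
A rough illustration, reusing the denormalized sketch table from earlier: the query below is billed only for the two columns it references, no matter how many other columns the table carries.

-- Only order_ts and amount are scanned; the other columns cost nothing here
SELECT
  order_ts,
  amount
FROM
  mydataset.orders_denorm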

Second, BigQuery supports nested and repeated fields: a STRUCT (shown as RECORD in the schema) nests related fields inside a row, and an ARRAY of STRUCTs repeats them. Instead of creating a separate child table and joining to it, you can nest the related rows directly inside the parent row. The data lives together physically and gets scanned together at query time. When you need to flatten the nested structure for analysis, you use UNNEST.
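
As a sketch of what a nested, repeated column looks like in a table definition (illustrative names, loosely shaped like the forecast example below):

CREATE TABLE mydataset.forecasts (
  geography GEOGRAPHY,
  creation_time TIMESTAMP,
  forecast ARRAY<STRUCT<
    hours INT64,
    time TIMESTAMP,
    temperature_2m_above_ground FLOAT64
  >>  -- the repeated child rows live inside the parent row
);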

A weather forecast table is a clean example. Each row represents one geographic point, and inside that row, the forecast column holds a nested array of hourly forecasts. To pull the forecast ten hours out across all locations, you write:

SELECT
  geography,
  forecast.time AS forecast_time,
  forecast.temperature_2m_above_ground AS temperature
FROM
  `bigquery-public-data.noaa_global_forecast_system.NOAA_GFS0P25`,
  UNNEST(forecast) AS forecast
WHERE
  forecast.hours = 10

No JOIN. Everything is in one place. BigQuery scans only the columns the query references. That is the pattern the exam wants you to recognize.

When normalization still wins in BigQuery

Here is the part that trips people up. BigQuery prefers denormalization, but the exam will hand you scenarios where normalization is actually the correct answer, and you have to know how to spot them.

The key signal is update frequency and data integrity. If the question describes a small dimension that is frequently updated, and a large fact table that references it, normalization usually wins. The reason is that denormalizing the dimension into every fact row means every update to the dimension has to rewrite every matching fact row. That gets expensive fast.

Use normalization when:

  • Data integrity and consistent updates matter more than query speed
  • A small reference table is updated often and joined to a much larger table
  • The cost of rewriting denormalized rows on update would exceed the cost of the JOIN at read time

Use denormalization when:

  • Query performance is the primary concern, which is true for most analytics workloads
  • The system is read-heavy and updates are rare
  • The data has a natural hierarchical structure that nested fields can capture cleanly

A scenario the Professional Data Engineer exam likes

A retail company has a stores table with store details like store_id, name, and location. They also have a transaction logs table with millions of transactions, each referencing a store_id. The company updates store details frequently and wants those updates to be efficient while keeping the data consistent.

How should you model this in BigQuery?

The right answer is to keep the data normalized. Store details live in the stores table and the transaction logs table references them by store_id. The instinct to denormalize because BigQuery prefers it leads you to the wrong answer here. If you denormalized store details into every transaction row, every name change or location update would require rewriting millions of rows. The cost of maintaining that data integrity outweighs the cost of the JOIN at query time.
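
A sketch of the query you would run against that normalized layout, with hypothetical table names. Updating a store is then a single-row change in the stores table, and every transaction picks up the new values through the join:

SELECT
  t.transaction_id,
  t.amount,
  s.name AS store_name,
  s.location
FROM
  mydataset.transaction_logs AS t
JOIN
  mydataset.stores AS s
  ON t.store_id = s.store_id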

This is the pattern. When you see a small dimension table that gets updated frequently and a large fact table that references it, normalization is the answer regardless of which database you are in.

How I'd answer these on the exam

Read the question for two signals. First, what is the read-to-write ratio? Mostly reads with rare updates points to denormalization. Frequent updates to shared data points to normalization. Second, what is the shape of the data? Hierarchical or naturally nested data fits denormalization with nested and repeated fields. A small dimension joined to a large fact table favors keeping them separate.

The Professional Data Engineer exam is testing whether you understand that BigQuery's default preference for denormalization is a guideline, not a rule. The right schema depends on the workload.

My Professional Data Engineer course covers BigQuery schema design, nested and repeated fields, UNNEST patterns, and the full set of modeling tradeoffs you need to know for the exam.
