Normalization vs Denormalization in BigQuery for the PCA Exam

GCP Study Hub
Ben Makansi
January 18, 2026

Most engineers learn database design on traditional relational systems, where normalization is the default. BigQuery flips that intuition. For the Professional Cloud Architect exam, you need to understand both directions of the tradeoff and know when BigQuery wants you to denormalize.

What Normalization Actually Does

Normalization is the process of structuring a database to reduce redundancy and improve data integrity by dividing data into smaller, related tables. The goal is that each piece of data is stored in only one place.

Picture a single table with order_id, customer_name, customer_email, product_name, and price. The same customer name and email appear across many orders. The same product name appears across many orders. That repetition makes updates error-prone: if a customer changes their email, you have to update every row they appear in.

Normalization breaks this apart. You create a Customers table with customer_id, name, and email, storing each customer once. You create a Products table with product_id, product_name, and price. You restructure the original data into an Orders table containing order_id, customer_id, product_id, and order_date, with foreign keys linking back to the other two tables.

Now each piece of data lives in exactly one place. Update a customer's email once and every order references the new value automatically.
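The normalized layout above can be sketched as BigQuery DDL. The table and column names come from the example; the `shop` dataset name is a placeholder, and the column types are illustrative.

```sql
-- Hypothetical dataset name `shop`; types are illustrative.
CREATE TABLE shop.Customers (
  customer_id INT64,
  name        STRING,
  email       STRING
);

CREATE TABLE shop.Products (
  product_id   INT64,
  product_name STRING,
  price        NUMERIC
);

-- Each order row references a customer and a product by ID.
CREATE TABLE shop.Orders (
  order_id    INT64,
  customer_id INT64,  -- logical foreign key to Customers
  product_id  INT64,  -- logical foreign key to Products
  order_date  DATE
);
```

Note that BigQuery does not enforce foreign key constraints; the relationships here are logical, maintained by your pipelines rather than the engine.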

The Cost of Normalization

The tradeoff is query performance. When data is split across multiple related tables, any query that needs to combine information across them requires JOIN operations. JOINs are expensive, especially on large datasets. The more tables involved and the larger they get, the more compute the database has to spend stitching rows back together at query time.

For transactional systems where you are doing many small writes and care deeply about consistency, that cost is worth it. For analytical systems doing massive scans across hundreds of millions of rows, it becomes a real bottleneck.
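To make the cost concrete: with the Customers, Products, and Orders tables from the example (the `shop` dataset name is a placeholder), even a simple order report needs two JOINs. A sketch:

```sql
-- Every read pays the cost of stitching three tables back together.
SELECT
  o.order_id,
  c.name,
  p.product_name,
  p.price
FROM shop.Orders AS o
JOIN shop.Customers AS c
  ON o.customer_id = c.customer_id
JOIN shop.Products AS p
  ON o.product_id = p.product_id
WHERE o.order_date >= '2025-01-01';
```

At transactional scale this is cheap. Across hundreds of millions of order rows, the engine spends real compute matching keys on every query.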

Denormalization

Denormalization is the process of partially reversing normalization by combining related tables back into one. The goal is to reduce the number of tables involved in a query so you avoid expensive JOINs.

Instead of querying multiple tables and joining them at runtime, you store related information together in fewer tables, sometimes a single table. This trades data redundancy for query speed. You accept that the customer's email might be repeated across many order rows because the cost of updating it is lower than the cost of joining on every read.

Why BigQuery Pushes Toward Denormalization

BigQuery actively encourages denormalization for both cost and performance reasons. There are two key features that make this work.

The first is columnar storage. BigQuery stores data column by column rather than row by row. When you query a denormalized table, BigQuery only scans the columns you actually reference in the query. A wide denormalized table with fifty columns is not penalized when your query only touches three of them. This is the opposite of how a traditional row-based database behaves, where wide tables are expensive to scan.
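One way to see this is through bytes billed. On-demand BigQuery pricing charges by bytes scanned, and only the columns a query references count. A sketch against a hypothetical fifty-column table:

```sql
-- `analytics.wide_events` is a hypothetical 50-column table.
-- Columnar storage means only the three referenced columns
-- are read and billed; the other 47 are never touched.
SELECT
  event_timestamp,
  user_id,
  event_name
FROM analytics.wide_events
WHERE event_name = 'purchase';
```

The flip side is that SELECT * reads every column, so wide denormalized tables reward explicit column lists.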

The second is nested and repeated fields. BigQuery lets you store complex data structures inside a single table using the STRUCT type (shown as RECORD in the schema UI) for nested fields and the ARRAY type for repeated ones. You can nest a whole table-like structure inside a single column of another table. When you need to flatten that nested data for analysis, you use the UNNEST function.

This is the feature that makes BigQuery genuinely different from most relational systems. You are not limited to a flat row model. You can keep related data physically grouped within the same row, scan it efficiently with columnar reads, and avoid the JOINs that would otherwise be required.

What a Denormalized BigQuery Table Looks Like

Consider the NOAA_GFS0P25 table in the bigquery-public-data.noaa_global_forecast_system public dataset, which stores weather forecasts for geographic points. Each row represents one location, identified by latitude and longitude. Inside each row, a nested forecast column stores many forecasted hours and their corresponding temperature values, all attached to that single row.

Instead of splitting locations into one table and forecasts into another and joining them, BigQuery stores the forecast as a nested structure inside the location row. To query it, you UNNEST the nested array.

SELECT
  geography,
  forecast.time AS forecast_time,
  forecast.temperature_2m_above_ground AS temperature
FROM
  `bigquery-public-data.noaa_global_forecast_system.NOAA_GFS0P25`,
  UNNEST(forecast) AS forecast
WHERE
  forecast.hours = 10

The UNNEST takes the nested forecast array and expands it into individual rows, one per forecasted hour, which you can then filter and project against. No JOIN is required. All the relevant data was physically colocated in the same row, and BigQuery's columnar storage scans only the columns you reference.

When Normalization Still Wins

Denormalization is the default in BigQuery, but it is not universal. There are still cases where normalization is the right call.

Choose normalization when data integrity and update efficiency are more important than query speed. If you have a system where the same fact appears thousands of times and gets updated often, denormalization forces you to update every copy. Normalizing means you update one row and every reference picks up the change automatically. For systems where consistent, accurate data matters more than fast reads, normalization is still the right structure.

Choose denormalization when query performance is the primary concern. This is most analytical workloads. Read-heavy systems with large datasets, where JOINs would be a bottleneck, are exactly where BigQuery's columnar storage and nested fields shine.

BigQuery does support traditional JOIN operations, and you should not feel obligated to denormalize everything. But when you have a choice, denormalize. The platform is built to reward it.

What the PCA Exam Expects

For the Professional Cloud Architect exam, the points worth holding onto are these. BigQuery encourages denormalization because columnar storage scans denormalized data efficiently. Nested and repeated fields, accessed via RECORD/STRUCT types and UNNEST, let you avoid JOINs while keeping schemas expressive. Normalization is still appropriate when integrity and update efficiency matter more than query speed. Denormalization is appropriate when query performance dominates, which describes most analytical use cases.

If you see a question framing a BigQuery analytics workload and asking how to model the data, default to denormalization with nested fields unless the question is explicitly about consistency or frequent updates.

My Professional Cloud Architect course covers normalization vs denormalization in BigQuery alongside the rest of the storage and analytics material.
