Regions and Zones on GCP for the PDE Exam

GCP Study Hub
July 8, 2025

If you are preparing for the Professional Data Engineer exam, regions and zones are one of those topics that look simple on the surface and then quietly show up in half the scenario questions. Where you put your data, where you run your jobs, and how those choices interact with cost, latency, and availability is something the exam keeps coming back to. I want to walk through how I think about it, because once the mental model clicks the questions get a lot easier.

The basics: regions and zones

Google Cloud runs data centers all over the world. Those data centers are grouped into regions, which are large geographic areas, and each region is divided into multiple zones. A zone is essentially one independent data center, or sometimes a small cluster of them, that operates separately from the other zones in its region. So us-east4 is a region in Northern Virginia, and inside it you have us-east4-a, us-east4-b, and us-east4-c as its zones.
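The naming convention itself encodes the hierarchy: a zone name is just its region name plus a single-letter suffix. A tiny helper (my own study aid, not anything official) makes the relationship concrete:

```python
def region_of(zone: str) -> str:
    """Derive the region from a zone name, e.g. 'us-east4-a' -> 'us-east4'.

    GCP zone names are the region name plus a hyphen and a one-letter
    zone suffix, so stripping the last hyphen-separated piece gives
    the region.
    """
    region, _, _suffix = zone.rpartition("-")
    return region

print(region_of("us-east4-a"))      # us-east4
print(region_of("europe-west1-b"))  # europe-west1
```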

The reason this structure matters is twofold. First, resources inside the same region can communicate quickly and cheaply with each other. Second, because zones are designed to fail independently, spreading a workload across multiple zones means that if one zone has an outage the others keep running. That is the foundation of fault tolerance on GCP.

Why this shows up so much on the PDE exam

The Professional Data Engineer exam loves scenarios where you have to weigh three things at once: latency, cost, and availability. Regions and zones are the lever that controls all three. Put data far from your compute and you pay for egress and wait longer. Put everything in one zone and you save money but lose redundancy if that zone fails. Put data in a multi-region and you get strong durability but you pay more and lose some latency benefits for single-region compute.

The exam will rarely ask you to memorize which region is where. It will ask you to make a tradeoff, and the right answer almost always depends on knowing whether a given service is zonal, regional, or multi-regional.

How GCP data services map to regions and zones

This is the part I would actually commit to memory. Different services live at different scopes, and that scope determines what kind of failure they can survive on their own.

  • Cloud Storage buckets can be regional, dual-region, or multi-regional. You pick this at bucket creation and it controls replication and pricing.
  • BigQuery datasets can be regional or multi-regional. The location is set when you create the dataset and it cannot be changed afterward.
  • Cloud SQL instances are zonal. The primary lives in one zone, and high availability means configuring a standby instance in a different zone that the primary can fail over to.
  • Dataproc clusters are zonal. The cluster nodes all live in a single zone, which is something to plan around if you care about availability of long-running jobs.
  • Compute Engine instances are zonal too. That is why managed instance groups and load balancers exist, to spread VMs across zones so one zone failure does not take the whole service down.
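One way to drill this list is to encode it as a lookup table. The table and helper below are my own study-aid sketch of the scopes listed above, not an official API:

```python
# Location scope of common GCP data services, per the list above.
# A single instance of a "zonal" service does not survive a zone outage
# on its own; it needs cross-zone redundancy (replicas, instance groups).
SERVICE_SCOPE = {
    "Cloud Storage":  "regional, dual-region, or multi-region (chosen at bucket creation)",
    "BigQuery":       "regional or multi-region (fixed at dataset creation)",
    "Cloud SQL":      "zonal (standby in another zone for HA)",
    "Dataproc":       "zonal (all cluster nodes in one zone)",
    "Compute Engine": "zonal (spread via managed instance groups)",
}

def survives_zone_outage(service: str) -> bool:
    """True if a default deployment of the service outlives a zone failure."""
    return not SERVICE_SCOPE[service].startswith("zonal")

print(survives_zone_outage("Cloud SQL"))  # False
print(survives_zone_outage("BigQuery"))   # True
```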

When you see a Professional Data Engineer exam question that says "the Cloud SQL instance is unavailable" or "the Dataproc job failed when the zone went offline", the implicit answer is almost always related to the fact that these services are zonal and need cross-zone redundancy to survive a zone outage.

Egress, latency, and the regional gotchas

One of the easiest ways to lose money on GCP is to move data between regions without realizing it. Inter-region traffic is billed as egress, and at scale this can dwarf the cost of the compute or storage itself. If your BigQuery dataset is in us-central1 and your Dataflow job is running in europe-west1, you are paying for every byte that crosses the Atlantic, and you are also waiting on it.
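To get an intuition for how fast this adds up, here is a back-of-the-envelope estimator. The per-GiB rate is deliberately a parameter: inter-region egress rates differ by route (intra-continent vs intercontinental) and change over time, so the $0.08/GiB in the example is a hypothetical figure, not a quote from the pricing page:

```python
def egress_cost_usd(bytes_moved: int, rate_per_gib_usd: float) -> float:
    """Estimate inter-region egress cost for a given byte volume.

    rate_per_gib_usd is a parameter on purpose: the actual rate depends
    on the source/destination pair and current pricing, so look it up.
    """
    gib = bytes_moved / 2**30
    return gib * rate_per_gib_usd

# e.g. a Dataflow job that shuffles 5 TiB/day across the Atlantic,
# at a hypothetical $0.08/GiB:
daily = egress_cost_usd(5 * 2**40, 0.08)
print(f"${daily:,.2f}/day, ${daily * 30:,.2f}/month")
# $409.60/day, $12,288.00/month
```

At that scale the egress line item can indeed dwarf the compute bill, which is why co-locating data and compute is usually the first optimization to check.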

The rule I keep in my head is simple. Keep your data and your compute in the same region whenever you can. If you cannot, at least know why you cannot and what it is costing you. BigQuery in particular will refuse to query across region boundaries, so a dataset in EU and a dataset in US cannot be joined directly without first copying one of them.

Multi-region and dual-region: when to use them

Multi-region storage is for data that needs to survive a regional outage or be read with low latency from many places. Think a global application that serves users on multiple continents, or compliance setups that require geographic redundancy. Dual-region is a middle ground: you pick two specific regions, which keeps data predictably close to your compute while still giving you geo-redundancy. Both cost more than regional storage, but for critical pipelines the durability can be worth it.
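The paragraph above is really a small decision rule: pick the cheapest location type that meets the requirement. A toy sketch of that logic (my own framing, not a Google recommendation):

```python
def pick_gcs_location(must_survive_region_outage: bool,
                      serves_multiple_continents: bool) -> str:
    """Toy decision rule for a Cloud Storage location type, encoding the
    tradeoffs above: cheapest option that satisfies the requirement."""
    if serves_multiple_continents:
        return "multi-region"   # low-latency reads from many places
    if must_survive_region_outage:
        return "dual-region"    # geo-redundant, and you choose both regions
    return "regional"           # cheapest; co-locate with your compute

print(pick_gcs_location(False, False))  # regional
print(pick_gcs_location(True, False))   # dual-region
print(pick_gcs_location(True, True))    # multi-region
```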

For BigQuery, multi-region locations like US and EU are popular because they replicate data across regions for durability without any work on your part. But again, you cannot move a dataset to a different location after creation, so make this decision up front.

Default region and zone

You can assign a default region and zone for your GCP project, which is useful for keeping deployments consistent and avoiding accidental cross-region traffic. It will not stop you from explicitly deploying to a different location, but it lowers the chance of a typo putting your VM in a region you did not intend.

What I would actually study

For the Professional Data Engineer exam, I would not try to memorize every region code. I would make sure I can answer three questions quickly for any data service: is it zonal, regional, or multi-regional, what does a zone failure do to it, and what does cross-region traffic cost. If you have those three things locked in, most of the location-based scenario questions answer themselves.

My Professional Data Engineer course covers regions, zones, and the location scopes of every major GCP data service, plus the tradeoffs the exam tests on cost, latency, and fault tolerance.
