Cloud Availability for the PDE Exam: Measuring and Designing for Uptime

GCP Study Hub
July 5, 2025

When a Google Cloud business scenario uses the phrase "high availability," you are being told something specific about the solution architecture. The Professional Data Engineer exam leans on this term a lot, and the candidates who do best are the ones who can translate "high availability" into concrete service choices and configurations. In this article I want to walk through what availability actually means in a cloud data context, how it gets measured, and how to spot it as a requirement in an exam scenario.

What availability actually means

Availability refers to the ability of a system to remain operational and accessible, even during failures or disruptions. That disruption could be a zonal outage, a regional event like a natural disaster, a hardware failure underneath a managed service, or a software issue that takes a node offline. A system that is "highly available" is one that keeps serving requests through those events, usually because there is redundancy built in somewhere.

In a data engineering context, availability often comes up around three layers. First, the storage layer, where the question is whether your data is reachable. Second, the compute or query layer, where the question is whether your jobs and queries can run. Third, the ingestion layer, where the question is whether you can keep accepting new records when something fails upstream or downstream.

Why it matters for mission-critical systems

The reason availability shows up so often on the Professional Data Engineer exam is that it is one of the cleanest ways to frame a business requirement. A business that runs a financial trading platform, a healthcare records system, or a global ecommerce checkout flow cannot tolerate downtime the same way an internal analytics dashboard can. Downtime in those systems costs revenue, damages reputation, and in regulated industries can create real compliance problems.

When you see language like "mission-critical," "cannot afford downtime," "global users," or "must remain operational during a regional outage," the exam is telling you that availability is the primary lens for the question. The right answer is usually whichever service or configuration provides the strongest uptime guarantee, even if it costs more or is more complex to set up.

How availability gets measured

Availability is almost always expressed as an uptime percentage over a given period, typically a year. The higher the percentage, the less downtime the system is allowed. Here is the math you should have memorized going into the exam:

  • 99.5% uptime allows roughly 1.83 days of downtime per year.
  • 99.9% uptime (three nines) allows roughly 8.76 hours per year.
  • 99.99% uptime (four nines) allows roughly 52.56 minutes per year.
  • 99.999% uptime (five nines) allows roughly 5.26 minutes per year.

You do not need to derive these on the spot, but you should recognize that each extra nine is roughly a tenfold drop in allowed downtime. That is why moving from four nines to five nines is genuinely hard. It usually requires multi-region replication, automated failover, and a managed service that handles the heavy lifting for you.
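The downtime figures above fall straight out of the uptime percentage. As a sanity check, here is a short Python snippet (my own illustration, assuming a 365-day year, which matches the rounded numbers above) that reproduces them:

```python
# Allowed annual downtime implied by an uptime SLA percentage.
# Assumes a 365-day year, matching the rounded figures in the list above.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by the given uptime SLA."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

for uptime in (99.5, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(uptime)
    # Print in the most readable unit: days, hours, or minutes.
    if minutes >= 60 * 24:
        print(f"{uptime}% -> {minutes / (60 * 24):.2f} days/year")
    elif minutes >= 60:
        print(f"{uptime}% -> {minutes / 60:.2f} hours/year")
    else:
        print(f"{uptime}% -> {minutes:.2f} minutes/year")
```

Run it once and the tenfold pattern is obvious: each extra nine shrinks the allowed downtime budget by a factor of ten.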

Service Level Agreements

The number you see on a service's documentation page is not just marketing. It is backed by a Service Level Agreement, or SLA, which is a contractual guarantee from Google Cloud about how much uptime a service will deliver. If the service falls short, the SLA typically defines credits or compensation. SLAs differ by service and by configuration. A regional configuration of a service often has a different SLA than a multi-region or global one, and a single-zone deployment usually has the weakest guarantee of all.

The exam will sometimes test this distinction directly. If a question says the business needs the highest possible availability, the answer is rarely the single-zone or single-region option, even when those are cheaper. The answer is the configuration whose SLA matches the requirement.

A concrete example: Cloud Spanner

Cloud Spanner is the example I always come back to when explaining five nines on Google Cloud. In its multi-region configuration, Spanner guarantees 99.999% global availability, which works out to about five minutes of downtime per year. That is why Spanner is so often the right answer when a scenario describes a global, strongly consistent transactional workload that absolutely cannot go down. It is also why Spanner is not always the right answer when the scenario only requires regional availability or has cost as a stronger constraint. The exam wants you to match the SLA to the requirement, not to pick the most impressive service every time.

How to spot availability questions on the exam

Here is the pattern I look for. When you see a scenario that mentions high availability, global users, regional outages, or any phrasing about staying operational during failures, immediately filter the answer choices through an SLA lens. Ask yourself which option offers the strongest uptime guarantee that still satisfies the other constraints in the question. Read replicas, failover replicas, multi-region storage classes, and managed services with built-in redundancy will keep coming up across the Professional Data Engineer exam, and the language pattern is fairly consistent once you have seen a few questions.
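That filtering step can be sketched in code. This is a toy illustration of the "SLA lens," not an official decision tool: the option names, SLA percentages, and relative costs below are placeholder assumptions for the sketch (only the five-nines multi-region figure echoes the Spanner number discussed above; always check the current Google Cloud SLA pages for real values).

```python
# Sketch of the "SLA lens": keep only the options whose uptime guarantee
# meets the scenario's requirement, then prefer the cheapest survivor.
# SLA and cost values are illustrative placeholders, not official figures.

options = [
    # (name, uptime SLA %, relative cost)
    ("Single-zone deployment",       99.5,   1),
    ("Regional with failover",       99.95,  2),
    ("Multi-region managed service", 99.999, 4),  # e.g. a Spanner-style config
]

def pick(required_uptime: float) -> str:
    """Cheapest option whose SLA still meets the stated requirement."""
    viable = [o for o in options if o[1] >= required_uptime]
    if not viable:
        raise ValueError("no option meets the requirement")
    return min(viable, key=lambda o: o[2])[0]

print(pick(99.9))    # a regional configuration is enough here
print(pick(99.999))  # only the multi-region option qualifies
```

The point of the sketch is the order of operations: filter by SLA first, then apply cost and complexity as tie-breakers, which is exactly how the exam expects you to reason.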

Availability is one of those topics that looks simple on the surface but quietly shapes a meaningful share of the exam. Knowing the uptime math, knowing which services are built for which tier of availability, and recognizing the language patterns will make those questions much faster to answer.

My Professional Data Engineer course covers availability tiers, SLAs, and the specific uptime guarantees of services like Cloud Spanner, Cloud SQL, BigQuery, and Cloud Storage that show up on the exam.
