Monitoring Cloud Spanner Health and Data Ingestion for the PDE Exam

GCP Study Hub
May 10, 2026

Cloud Spanner shows up on the Professional Data Engineer exam in two flavors. The first is design: when to reach for Spanner instead of Bigtable or Cloud SQL. The second is operations: once Spanner is running, how to tell whether it is healthy, and how to ingest data into it without tipping it over. This article is about the second flavor, which is where I see candidates lose the most points, because the answers hinge on specific numeric thresholds and one or two named tools.

The CPU thresholds you need to memorize

Spanner health monitoring lives in Cloud Monitoring. To read the metrics and dashboards you need at least the roles/monitoring.viewer role on the project. That is the floor; anything more restrictive and you cannot see the dashboards at all. Creating alerting policies against those metrics requires a broader role, such as roles/monitoring.editor.

The numbers Google publishes for healthy CPU usage differ depending on whether the instance is single-region or multi-region, and the Professional Data Engineer exam likes to test the distinction. For a single-region instance, high-priority CPU usage should stay at or below 65%, and the 24-hour smoothed aggregate CPU usage should stay under 90%. For a multi-region instance, the high-priority threshold drops to 45%, while the 24-hour smoothed aggregate stays at 90%.

The reason multi-region runs cooler is replication overhead. A multi-region instance is paying CPU cost on writes that single-region does not pay, so you need more headroom to absorb the same query load. If you forget which is which on exam day, anchor on the idea that multi-region needs more spare capacity, so its threshold has to be lower.
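If you want these thresholds enforced rather than eyeballed, you can codify one as a Cloud Monitoring alerting policy. Here is a minimal sketch using the google-cloud-monitoring Python client; the project id is a placeholder, and the metric reports utilization as a fraction, so confirm the unit in Metrics Explorer before trusting the 0.65.

```python
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Spanner high-priority CPU > 65% (single-region)",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="high-priority CPU above threshold",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'resource.type = "spanner_instance" AND '
                    'metric.type = "spanner.googleapis.com/instance/cpu/utilization_by_priority" AND '
                    'metric.labels.priority = "high"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=0.65,       # 0.45 for a multi-region instance
                duration={"seconds": 600},  # sustained for 10 minutes
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period={"seconds": 300},
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
                    )
                ],
            ),
        )
    ],
)

client.create_alert_policy(name="projects/my-project", alert_policy=policy)
```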

When any of these thresholds is exceeded, the action is the same: add nodes. Spanner is a horizontally scaled database; you do not vertically resize a node, you provision more of them. If a question describes sustained high CPU and asks what to do, the answer is to scale out the instance, not to change machine types or move workload off.
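Scaling out is a single update to the instance's node count. A minimal sketch with the google-cloud-spanner Python client, with placeholder project and instance ids:

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")
instance = client.instance("my-instance")
instance.reload()              # fetch the current config, including node_count

instance.node_count += 2       # scale out, never up
operation = instance.update()  # returns a long-running operation
operation.result(timeout=300)  # block until the resize completes
print(f"Instance now at {instance.node_count} nodes")
```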

Watching ingestion with used_bytes

The exam asks ingestion questions in a very specific shape. You have a pipeline writing data into Spanner, and something looks wrong. Which metric do you watch to confirm data is actually landing?

The answer is the rate of change in instance/storage/used_bytes. That metric tracks how much storage the instance is consuming, and during a healthy ingest the curve climbs steadily, meaning its rate of change stays roughly constant. A sudden flattening or drop in that rate is the signal that ingestion has stalled. It could be a stuck Dataflow worker, a quota issue, a backpressure problem upstream, or an error rate that has spiked so high that nothing is being committed.
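This check is scriptable. Here is a sketch that pulls the last hour of the metric with the google-cloud-monitoring client and computes the growth rate client-side (used_bytes is a gauge, so the server-side rate aligners do not apply); the project and instance ids are placeholders:

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())

series_list = client.list_time_series(
    request={
        "name": "projects/my-project",
        "filter": (
            'resource.type = "spanner_instance" AND '
            'resource.labels.instance_id = "my-instance" AND '
            'metric.type = "spanner.googleapis.com/instance/storage/used_bytes"'
        ),
        "interval": monitoring_v3.TimeInterval(
            start_time={"seconds": now - 3600},
            end_time={"seconds": now},
        ),
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in series_list:
    # Gauge metric: compute the derivative from successive samples.
    points = sorted(series.points, key=lambda p: p.interval.end_time)
    for prev, cur in zip(points, points[1:]):
        dt = cur.interval.end_time.timestamp() - prev.interval.end_time.timestamp()
        rate = (cur.value.int64_value - prev.value.int64_value) / dt
        print(f"{cur.interval.end_time}  {rate:,.0f} bytes/sec")
```

A healthy ingest prints a steady positive rate; a stall shows up as the rate collapsing toward zero while the pipeline still reports itself as running.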

I like used_bytes as an indicator because it cannot be faked by a healthy-looking write RPC count. Writes can be retried, throttled, or accepted at the RPC layer and still fail to commit, and your client-side metrics will not always tell you. But bytes on disk only go up when commits succeed. If the storage curve flatlines and your job is still running, you have a real problem.

Hot keys and the Key Visualizer

The other operational topic the Professional Data Engineer exam likes is hot keys. Spanner shards data by primary key range, and if your writes concentrate on a narrow range of keys, you create a hot spot that pins one server while the rest of the instance sits idle. CPU looks fine in aggregate, latency on the affected operations is awful, and adding nodes does not help because the heat is on a key range, not the whole cluster.

The tool for diagnosing this is Key Visualizer for Spanner. It renders a heatmap with key ranges on one axis and time on the other, with brightness encoding access intensity. Hot key ranges show up as bright horizontal bands. If a question describes uneven latency with low overall CPU and asks how to find the cause, Key Visualizer is the answer.

The monotonic key anti-pattern

The most common cause of a hot spot is a primary key that increases monotonically. Timestamps, auto-increment IDs, and anything else where new rows always land at the high end of the range will park every write on the same server. This is the canonical anti-pattern, and the exam tests it directly.

The fixes are about distributing writes across the keyspace. Common options include:

  • Hash the key by prefixing it with a hash of a high-cardinality field, so new rows scatter.
  • Reverse a timestamp or use a UUID instead of a sequence, so adjacent inserts do not target adjacent keys.
  • Use bit-reversed sequences, which Spanner supports natively for the auto-generated identity case.

If a question mentions a primary key like created_at or an integer counter and asks why throughput is capped, the answer is monotonic-key hot spotting, and the remediation is one of the three above.
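To make the first and third fixes concrete, here is a minimal sketch; the table, column names, and ids are all hypothetical. The DDL uses the GoogleSQL identity syntax Spanner documents for bit-reversed generation:

```python
import hashlib

from google.cloud import spanner


def scattered_key(user_id: str, created_at_micros: int) -> str:
    """Fix 1: prefix the key with a short hash of a high-cardinality field."""
    prefix = hashlib.sha256(user_id.encode()).hexdigest()[:4]
    return f"{prefix}#{user_id}#{created_at_micros}"


# Fix 3: let Spanner generate bit-reversed identity values natively.
client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("my-database")
operation = database.update_ddl([
    """
    CREATE TABLE Events (
      EventId INT64 GENERATED BY DEFAULT AS IDENTITY (BIT_REVERSED_POSITIVE),
      Payload STRING(MAX),
    ) PRIMARY KEY (EventId)
    """
])
operation.result(timeout=300)
```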

Batch ingestion patterns

For bulk loads into Spanner the recommended path is Dataflow, usually via the Google-provided JDBC to Spanner, Avro to Spanner, or Text Files to Spanner templates. These templates handle partitioning, retries, and mutation batching so a single hot writer does not tank the instance. For ongoing change capture, Spanner change streams plus a Dataflow pipeline is the standard pairing.
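Launching one of these templates is scriptable as well. Below is a sketch for the Text Files to Spanner classic template via the google-cloud-dataflow-client library; the bucket, manifest path, and ids are placeholders, and the template's exact parameter names are worth confirming against the current template documentation:

```python
from google.cloud import dataflow_v1beta3

client = dataflow_v1beta3.TemplatesServiceClient()
response = client.launch_template(
    request=dataflow_v1beta3.LaunchTemplateRequest(
        project_id="my-project",
        location="us-central1",
        gcs_path="gs://dataflow-templates/latest/GCS_Text_to_Cloud_Spanner",
        launch_parameters=dataflow_v1beta3.LaunchTemplateParameters(
            job_name="spanner-bulk-load",
            parameters={
                "instanceId": "my-instance",
                "databaseId": "my-database",
                "importManifest": "gs://my-bucket/import-manifest.json",
            },
        ),
    )
)
print(response.job.id)
```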

The exam pattern is simple. Bulk load into Spanner. Pick the tool. The answer is Dataflow with a Spanner template, not a custom client looping through inserts.

My Professional Data Engineer course covers Spanner monitoring, hot key diagnosis, and the ingestion patterns above with the exact thresholds and tool names the exam expects.
