Pub/Sub Subscriber Health Metrics for the PDE Exam

September 7, 2025

Pub/Sub is one of the most reliable messaging systems on Google Cloud, but reliability at the system level does not guarantee that your subscriber is doing its job. A subscriber can crash, hit a bug, or simply get overwhelmed by a traffic spike, and when that happens you want to know about it before downstream pipelines stall. On the Professional Data Engineer exam, Google expects you to know exactly which Cloud Monitoring metrics to watch to catch a sick subscriber early.

There are three metrics worth memorizing, and together they tell you almost everything you need to know about whether a subscription is healthy.

The three subscriber health metrics

The Professional Data Engineer guide focuses on these three signals in Cloud Monitoring:

Total number of messages in the Pub/Sub queue for a subscription
Oldest unacknowledged message age, surfaced as the metric subscription/oldest_unacked_message_age
Number of unacknowledged messages, surfaced as the metric subscription/num_undelivered_messages

Each one catches a different failure mode. Looking at them together is what makes the diagnosis clean.

Why total messages in the queue matters

The first thing I look at is the total number of messages sitting in the queue for a given subscription. If that number suddenly spikes, or trends upward when it normally sits flat, something is off. Either publishers are producing more than usual, or the subscriber has slowed down or stopped pulling. A sustained anomaly here is your first sign that the subscriber is not keeping up with the topic.

This metric alone can be misleading though. A queue can grow because of a legitimate burst of traffic that the subscriber will work through in seconds. That is why you pair it with the next metric.

Why the oldest unacknowledged message tells the real story

The oldest unacknowledged message age is the one that separates a momentary spike from a real outage. If the queue is large but all the messages are seconds old, you probably just had a publish burst and the subscriber is chewing through it. If the oldest unacknowledged message is from ten minutes ago, or an hour ago, the subscriber is not making progress on the backlog at all. That points to a crashed worker, a deadlock, a poison message that keeps getting redelivered, or code that throws before it can call ack.

I think of queue depth as a volume signal and oldest unacked age as a liveness signal. You need both.
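The volume-versus-liveness reading above can be sketched as a small decision function. This is only an illustration: the SubscriptionSnapshot type, the diagnose function, the label strings, and both default thresholds are hypothetical, and real thresholds depend on your traffic and SLA.

```python
from dataclasses import dataclass


@dataclass
class SubscriptionSnapshot:
    """One reading of the two signals for a subscription (hypothetical type)."""
    backlog: int                  # queue depth, e.g. num_undelivered_messages
    oldest_unacked_age_s: float   # age of the oldest unacked message, seconds


def diagnose(snapshot: SubscriptionSnapshot,
             backlog_threshold: int = 10_000,
             age_threshold_s: float = 300.0) -> str:
    """Combine the volume signal and the liveness signal into one label.

    The thresholds are placeholder assumptions, not recommended values.
    """
    deep = snapshot.backlog > backlog_threshold
    stale = snapshot.oldest_unacked_age_s > age_threshold_s

    if deep and stale:
        # Large backlog that is not draining: crashed or deadlocked subscriber.
        return "subscriber_stalled"
    if deep:
        # Large backlog of fresh messages: most likely a publish burst.
        return "publish_burst"
    if stale:
        # Small backlog but something old: a message that never gets acked.
        return "stuck_message"
    return "healthy"
```

For example, diagnose(SubscriptionSnapshot(backlog=50_000, oldest_unacked_age_s=4.0)) returns "publish_burst", while the same backlog with an age of 3,600 seconds returns "subscriber_stalled".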

num_undelivered_messages, the metric Google actually names

The third metric is the one to commit to memory by its exact name, because the Professional Data Engineer exam is fond of asking for the specific Cloud Monitoring metric. It is subscription/num_undelivered_messages (in full, pubsub.googleapis.com/subscription/num_undelivered_messages). This counts the messages that Pub/Sub has accepted from publishers but that your subscriber has not yet acknowledged. A rising num_undelivered_messages means a backlog is building, a falling one means the subscriber is catching up, and a flat non-zero value usually means the subscriber is processing at roughly the publish rate: it is keeping pace but never draining the backlog.

If the exam gives you a scenario where a team needs to know whether their Pub/Sub subscriber is behind, subscription/num_undelivered_messages is almost always the right answer.
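As a rough illustration of the rising, falling, and flat readings of num_undelivered_messages, here is a minimal sketch that classifies a window of samples. The backlog_trend function, its tolerance parameter, and the label strings are hypothetical. It compares only the first and last samples, which is a simplification; a real alert would more likely use a rate-of-change or regression over the window.

```python
from typing import Sequence


def backlog_trend(samples: Sequence[int], tolerance: float = 0.05) -> str:
    """Classify a window of num_undelivered_messages samples.

    `samples` is ordered oldest to newest. `tolerance` is the relative
    change treated as noise (an assumed value, tune it for your traffic).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to see a trend")

    first, last = samples[0], samples[-1]
    # Avoid dividing by zero when the window starts with an empty backlog.
    baseline = max(first, 1)
    change = (last - first) / baseline

    if change > tolerance:
        return "rising"        # backlog is building
    if change < -tolerance:
        return "falling"       # subscriber is catching up
    # No meaningful change: either keeping pace with a backlog, or empty.
    return "flat_nonzero" if last > 0 else "empty"
```

With samples [100, 150, 220] this returns "rising", and with [1000, 1010, 1005] it returns "flat_nonzero", the case where the subscriber keeps pace but never drains the backlog.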

How I'd wire these up in practice

In Cloud Monitoring I would set alerts on all three:

An alert on queue depth crossing a threshold that is well above normal traffic
An alert on oldest unacked age exceeding what your SLA tolerates, often a few minutes
An alert on num_undelivered_messages trending upward over a rolling window

The combination is what matters. A single metric will lie to you. All three together rarely do.
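To make the alerting idea concrete, here is a sketch that builds an alert policy body as a plain Python dictionary, in the shape I understand the Cloud Monitoring AlertPolicy resource to use. Treat the field names as something to verify against the current API reference before relying on them. The helper names, the subscription ID, and every threshold and duration are placeholder assumptions. It covers only the two threshold alerts; the "trending upward" alert would need a rate-of-change condition that is not shown here.

```python
import json

# Metric type prefix for Pub/Sub subscription metrics in Cloud Monitoring.
METRIC_PREFIX = "pubsub.googleapis.com/subscription/"


def threshold_condition(metric: str, subscription_id: str,
                        threshold: float, duration_s: int) -> dict:
    """Build one 'value stays above threshold for duration' condition."""
    metric_filter = (
        f'metric.type="{METRIC_PREFIX}{metric}" '
        f'AND resource.type="pubsub_subscription" '
        f'AND resource.labels.subscription_id="{subscription_id}"'
    )
    return {
        "displayName": f"{metric} above {threshold} on {subscription_id}",
        "conditionThreshold": {
            "filter": metric_filter,
            "comparison": "COMPARISON_GT",
            "thresholdValue": threshold,
            "duration": f"{duration_s}s",
            "aggregations": [
                {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MAX"}
            ],
        },
    }


def build_policy(subscription_id: str) -> dict:
    """Combine backlog-size and backlog-age conditions into one policy.

    Thresholds below are placeholders, not recommendations.
    """
    return {
        "displayName": f"Pub/Sub subscriber health: {subscription_id}",
        "combiner": "OR",  # fire when either condition is met
        "conditions": [
            threshold_condition("num_undelivered_messages",
                                subscription_id, 10_000, 300),
            threshold_condition("oldest_unacked_message_age",
                                subscription_id, 300, 120),
        ],
    }
```

Calling json.dumps(build_policy("orders-sub"), indent=2) gives a JSON body you could review by hand; "orders-sub" is a made-up subscription ID.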

What the exam tends to ask

Questions in this area usually fall into one of three shapes. The first is a straight recall question that hands you a Pub/Sub subscriber problem and asks which metric to monitor, with num_undelivered_messages as the correct option and distractors like publisher-side metrics or generic CPU usage. The second is a diagnosis scenario where the queue is large but messages are recent, and you need to recognize that this is not a subscriber problem. The third is the inverse where queue size looks fine but the oldest unacked message is old, which often points to a single subscriber pulling but failing to ack a specific message.

Keep the three metrics and their roles straight and you can answer almost any Pub/Sub monitoring question on the Professional Data Engineer exam in under a minute.

My Professional Data Engineer course covers Pub/Sub monitoring and the rest of the streaming and messaging surface area you need for exam day.