Pub/Sub Common Ingestion Pattern for the PDE Exam

GCP Study Hub
September 11, 2025

If you sit the Professional Data Engineer exam without a mental picture of how Pub/Sub, Dataflow, and the storage layer fit together, a chunk of the scenario questions will feel harder than they need to be. Google leans on one canonical ingestion pattern across the test, and once you can sketch it on a napkin, you stop guessing on the architecture questions and start picking answers by elimination.

This is the pattern I want to walk you through. It is worth memorizing, not because Google asks you to draw it, but because almost every streaming scenario on the Professional Data Engineer exam maps back to some variation of it.

The shape of the pipeline

The pattern has three logical stages, left to right.

  • Pub/Sub sits on the left. It is the front door. Data comes in from publishers and Pub/Sub stores and buffers it so nothing is lost while downstream systems catch up.
  • Dataflow sits in the middle. It pulls from Pub/Sub and applies transformations: cleaning, enrichment, windowing, whatever the pipeline needs. Dataflow handles both streaming and batch, so the same engine works for real-time and periodic processing.
  • The storage layer sits on the right, and this is where the routing logic shows up. Dataflow writes to different destinations depending on the shape of the data.

The three destinations you need to know cold:

  • Cloud Storage for unstructured data. Files, logs, images, video, anything you would treat as a blob.
  • BigQuery for relational data that needs SQL analytics. This is the warehouse endpoint, and it is the most common right-hand side answer on the exam.
  • Bigtable for NoSQL, time series, and IoT data. Low latency, high throughput, wide-column. When a question mentions sensor data, telemetry, or single-digit millisecond reads at scale, the answer is almost always Bigtable.
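
To make that shape concrete, here is a minimal sketch of the middle and right of the diagram in Beam's Python SDK, with BigQuery as the sink. The project, subscription, bucket, and table names are placeholders, and the pipeline assumes the BigQuery table already exists:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        streaming=True,                      # Pub/Sub is unbounded, so run in streaming mode
        runner="DataflowRunner",
        project="my-project",                # placeholder project ID
        region="us-central1",
        temp_location="gs://my-bucket/tmp",  # placeholder staging bucket
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/events-sub")
            | "ParseJson" >> beam.Map(json.loads)  # bytes in, dict out
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:analytics.events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        )

Swap the final write and you have the other two variants of the pattern: beam.io.WriteToText for a Cloud Storage landing zone, or a Bigtable write for the operational store (sketched later in this post).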

Why Pub/Sub is the entry point

Pub/Sub does two jobs in this pattern. The obvious one is decoupling: publishers do not need to know who is reading or how fast. The less obvious one is durability under load. If your Dataflow job hits a hot spot or a downstream sink slows down, Pub/Sub holds messages until things drain (unacknowledged messages are retained for up to seven days). That buffer is the reason the pattern survives traffic spikes without dropping data.
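
The publisher side is deliberately thin. For intuition, here is roughly all a producer has to do, using the google-cloud-pubsub client with placeholder project and topic names:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "sensor-events")  # placeholders

    # publish() returns a future; result() blocks until Pub/Sub has durably
    # stored the message and assigned it an ID. From that point the message
    # is retained until a subscriber acknowledges it.
    future = publisher.publish(topic_path, data=b'{"device_id": "d-42", "temp_c": 21.5}')
    print(future.result())

The publisher never learns who consumes the message or when; that is the decoupling doing its job.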

On the exam, if a scenario describes high-volume real-time ingestion from many sources, the first hop is Pub/Sub. If the answer set has Kafka, Pub/Sub is still usually the right Google-native pick unless the scenario explicitly says they want to keep Kafka.

Why Dataflow is in the middle

Dataflow is the transformation layer. It is built on Apache Beam, which means the same pipeline code can run in streaming mode reading from Pub/Sub or in batch mode reading from Cloud Storage. That flexibility is why Google keeps putting it in the middle of every reference architecture.
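
Here is a sketch of what that duality looks like in code, again with placeholder names; the point is that everything downstream of the read is identical in both modes:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    streaming = True  # flip to False to backfill from files instead

    with beam.Pipeline(options=PipelineOptions(streaming=streaming)) as p:
        if streaming:
            # Unbounded source: live events arriving on Pub/Sub (as bytes).
            lines = (p
                     | "Read" >> beam.io.ReadFromPubSub(
                         subscription="projects/my-project/subscriptions/events-sub")
                     | "Decode" >> beam.Map(lambda b: b.decode("utf-8")))
        else:
            # Bounded source: the same events landed as files in Cloud Storage.
            lines = p | "Read" >> beam.io.ReadFromText("gs://my-bucket/events/*.json")

        # Everything from here down is shared between streaming and batch.
        cleaned = lines | "DropBlanks" >> beam.Filter(lambda line: line.strip() != "")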

If a question asks where you would clean, deduplicate, enrich, or window streaming data on its way to BigQuery, the answer is Dataflow. Not BigQuery scheduled queries, not Cloud Functions, not Dataproc. Dataflow.
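
As a toy illustration of the dedup-and-window step, here is a runnable example with fabricated timestamps standing in for the event times a real stream would carry:

    import apache_beam as beam
    from apache_beam.transforms import window

    with beam.Pipeline() as p:  # DirectRunner, for local experimentation
        (
            p
            | "Create" >> beam.Create([
                ("d-1", 21.5), ("d-1", 21.5), ("d-2", 19.0),  # note the duplicate
            ])
            # In a real pipeline the timestamps come from Pub/Sub; here we
            # fake them so windowing has something to work with.
            | "Stamp" >> beam.Map(lambda kv: window.TimestampedValue(kv, 0))
            | "Dedup" >> beam.Distinct()
            | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60-second windows
            | "CountPerDevice" >> beam.combiners.Count.PerKey()
            | "Print" >> beam.Map(print)  # ('d-1', 1) and ('d-2', 1)
        )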

How to pick the right sink

This is where I see candidates lose points. The sink choice is driven by the data shape and the access pattern, not by the volume.

  • If the consumer is an analyst running SQL, the sink is BigQuery.
  • If the consumer is an application doing key-based lookups with strict latency budgets, the sink is Bigtable.
  • If the consumer needs raw files, or you are landing data before further processing, the sink is Cloud Storage.

Time series and IoT are the two phrases that should trigger Bigtable in your head immediately. If a scenario talks about millions of devices writing temperature readings every second, you are not putting that in BigQuery as the primary store. You can stream to both (that is a real pattern, sketched below), but the operational store is Bigtable.
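
Here is a sketch of that dual-write, assuming placeholder resource names and a pre-created Bigtable table with a readings column family. Note the row key: device ID plus timestamp is the classic Bigtable time-series key design:

    import json

    import apache_beam as beam
    from apache_beam.io.gcp.bigtableio import WriteToBigTable
    from apache_beam.options.pipeline_options import PipelineOptions
    from google.cloud.bigtable.row import DirectRow

    def to_bigtable_row(event):
        # device_id#timestamp keeps one device's readings contiguous while
        # spreading writes across devices, which avoids hotspotting.
        row = DirectRow(row_key=f"{event['device_id']}#{event['ts']}".encode())
        row.set_cell("readings", b"temp_c", str(event["temp_c"]).encode())
        return row

    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        events = (p
                  | beam.io.ReadFromPubSub(
                      subscription="projects/my-project/subscriptions/sensor-sub")
                  | beam.Map(json.loads))

        # Operational store: low-latency key lookups for applications.
        (events
         | beam.Map(to_bigtable_row)
         | WriteToBigTable(project_id="my-project",
                           instance_id="iot-instance",
                           table_id="readings"))

        # Analytical store: the same stream, queryable with SQL.
        events | beam.io.WriteToBigQuery(
            "my-project:analytics.readings",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)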

What to memorize before exam day

Three things will get you most of the way:

  • The left-to-right order: Pub/Sub, then Dataflow, then the storage layer.
  • The three storage destinations and which data shape goes to each.
  • The fact that this pattern works for both streaming and batch, because Dataflow handles both.

When a Professional Data Engineer exam question gives you a streaming scenario with four boxes and arrows, you can almost always identify the right architecture by checking it against this pattern and eliminating anything that does not match.

My Professional Data Engineer course covers this ingestion pattern alongside each service in it, so you can recognize the variations Google throws at you on test day.
