
If you sit the Professional Data Engineer exam without a mental picture of how Pub/Sub, Dataflow, and the storage layer fit together, a chunk of the scenario questions will feel harder than they need to be. Google leans on one canonical ingestion pattern across the test, and once you can sketch it on a napkin, you stop guessing on the architecture questions and start picking answers by elimination.
This is the pattern I want to walk you through. It is worth memorizing, not because Google asks you to draw it, but because almost every streaming scenario on the Professional Data Engineer exam maps back to some variation of it.
The pattern has three logical stages, left to right:

- Ingest: Pub/Sub receives events from producers and buffers them durably.
- Transform: Dataflow reads from Pub/Sub and cleans, enriches, and windows the stream.
- Store: the pipeline writes to whichever sink matches the access pattern.

The three destinations you need to know cold:

- BigQuery for analytics and anything queried with SQL.
- Bigtable for low-latency reads and writes at scale, especially time series and IoT.
- Cloud Storage for raw archives and batch reprocessing.
Pub/Sub does two jobs in this pattern. The obvious one is decoupling: publishers do not need to know who is reading or how fast. The less obvious one is durability under load: if your Dataflow job hits a hot spot or a downstream sink slows down, Pub/Sub holds messages until things drain. That buffer is why the pattern survives traffic spikes without dropping data.
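To make the first hop concrete, here is a minimal publisher sketch using the google-cloud-pubsub Python client. The project, topic, and message contents are placeholders, not anything the exam prescribes:

```python
from google.cloud import pubsub_v1

# Placeholder names; substitute your own project and topic.
PROJECT_ID = "my-project"
TOPIC_ID = "sensor-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# Publishers fire and forget: Pub/Sub persists the message, and
# subscribers (e.g. a Dataflow job) pull at their own pace.
future = publisher.publish(
    topic_path,
    data=b'{"device_id": "thermo-42", "temperature": 21.7}',
    origin="factory-floor",  # optional attribute, handy for filtering
)
print(future.result())  # blocks until the server acks with a message ID
```

The point of the sketch is the shape, not the details: the publisher never names a consumer, which is exactly the decoupling the paragraph above describes.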
On the exam, if a scenario describes high-volume real-time ingestion from many sources, the first hop is Pub/Sub. If the answer set includes Kafka, Pub/Sub is still usually the right Google-native pick unless the scenario explicitly says the team wants to keep its existing Kafka deployment.
Dataflow is the transformation layer. It runs Apache Beam pipelines, which means the same pipeline code can run in streaming mode reading from Pub/Sub or in batch mode reading from Cloud Storage. That flexibility is why Google keeps putting it in the middle of every reference architecture.
If a question asks where you would clean, deduplicate, enrich, or window streaming data on its way to BigQuery, the answer is Dataflow. Not BigQuery scheduled queries, not Cloud Functions, not Dataproc. Dataflow.
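Here is a sketch of what "window and aggregate on the way to BigQuery" looks like in Beam's Python SDK. The subscription, table, and field names are hypothetical; the transforms are the standard Beam APIs:

```python
import json

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical resource names; replace with your own.
SUBSCRIPTION = "projects/my-project/subscriptions/sensor-events-sub"
TABLE = "my-project:analytics.device_temps_1min"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
        | "KeyByDevice" >> beam.Map(lambda r: (r["device_id"], float(r["temperature"])))
        | "Window1Min" >> beam.WindowInto(beam.window.FixedWindows(60))
        | "MeanPerDevice" >> beam.combiners.Mean.PerKey()
        | "ToRow" >> beam.Map(lambda kv: {"device_id": kv[0], "avg_temp": kv[1]})
        | "Write" >> WriteToBigQuery(
            TABLE,
            schema="device_id:STRING,avg_temp:FLOAT",
            write_disposition=BigQueryDisposition.WRITE_APPEND,
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```

Because this is Beam, swapping ReadFromPubSub for beam.io.ReadFromText against a Cloud Storage path (and dropping the streaming option) turns the same transforms into the batch mode mentioned above.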
The third stage, the sink, is where I see candidates lose points. The sink choice is driven by the shape of the data and the access pattern, not by the volume.
Time series and IoT are the two phrases that should trigger Bigtable in your head immediately. If a scenario talks about millions of devices writing temperature readings every second, you are not putting that in BigQuery as the primary store. You can stream to both, and that is a real pattern, but the operational store is Bigtable.
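On the Bigtable side, the detail worth internalizing is the row key: device ID first so one device's readings sit contiguously, with a timestamp component for ordering. Here is a minimal sketch using the google-cloud-bigtable client, assuming hypothetical instance and table names and an existing "metrics" column family:

```python
import datetime

from google.cloud import bigtable

# Hypothetical names; substitute your own resources.
client = bigtable.Client(project="my-project")
instance = client.instance("iot-instance")
table = instance.table("device-readings")

def write_reading(device_id: str, temperature: float) -> None:
    now = datetime.datetime.now(datetime.timezone.utc)
    # Row key: device ID first so one device's readings are contiguous;
    # a reversed timestamp puts the newest reading first in a scan.
    reverse_ts = 10**13 - int(now.timestamp() * 1000)
    row_key = f"{device_id}#{reverse_ts}".encode("utf-8")

    row = table.direct_row(row_key)
    row.set_cell("metrics", "temperature", str(temperature).encode("utf-8"), timestamp=now)
    row.commit()

write_reading("thermo-42", 21.7)
```

The reversed timestamp is one common convention, not the only one; what matters for the exam is recognizing that key design, not SQL, is how you make Bigtable reads fast.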
Three things will get you most of the way:

- The first hop for high-volume streaming ingestion is Pub/Sub.
- The transformation layer in the middle is Dataflow.
- The sink follows the access pattern: BigQuery for SQL analytics, Bigtable for time series and IoT, Cloud Storage for archives.
When a Professional Data Engineer exam question gives you a streaming scenario with four boxes and arrows, you can almost always identify the right architecture by checking it against this pattern and eliminating anything that does not match.
My Professional Data Engineer course covers this ingestion pattern alongside each service in it, so you can recognize the variations Google throws at you on test day.