
One of the patterns that shows up over and over on the Professional Data Engineer exam is Dataflow sitting in the middle of a pipeline, with Cloud Storage on one side and Pub/Sub on the other. If you understand how those three services hand data off to each other, you can answer a surprising number of integration questions without overthinking them. In this post I want to walk through the connections I drill into candidates when they prep with me, because the exam loves to test whether you reach for the right service at the right point in the pipeline.
Dataflow is the managed Apache Beam service for batch and streaming transformations. On its own it does not store anything. It pulls data in, applies transformations, and writes results out. That means every Dataflow job needs a source and a sink, and the choice of source and sink is what most exam questions actually hinge on.
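To make that shape concrete, here is a minimal sketch using the Apache Beam Python SDK. The bucket paths and the transform are placeholders; the point is the structure, not the specific source and sink.

```python
import apache_beam as beam

# The shape of every Dataflow job: a source, one or more transforms, a sink.
# Dataflow itself stores nothing; swapping the Read and Write transforms
# changes where the data comes from and where it lands.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Source" >> beam.io.ReadFromText("gs://example-bucket/input/*.txt")
        | "Transform" >> beam.Map(str.strip)
        | "Sink" >> beam.io.WriteToText("gs://example-bucket/output/part")
    )
```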
Google groups the integrations into two buckets. The first is native integrations, which include Cloud Storage, Pub/Sub, and BigQuery. These are the services Dataflow is built to talk to directly. The second bucket is connectors, which cover services like Bigtable and Apache Kafka. The native versus connector distinction is worth remembering, because if you see a question about a first-class Dataflow integration, the answer is almost always going to involve GCS, Pub/Sub, or BigQuery.
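To see what the connector bucket looks like in practice, here is a hedged sketch of a Kafka read in the Beam Python SDK. The broker address and topic are placeholders, and ReadFromKafka is a cross-language transform, so it needs a runner that can expand it (Dataflow can).

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka  # a connector, not a native integration

# Reading from Kafka means pulling in an explicit connector transform,
# unlike GCS, Pub/Sub, or BigQuery, which Dataflow talks to directly.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker.example.com:9092"},
            topics=["events"])
        # Elements arrive as (key, value) pairs; keep the value here.
        | "Values" >> beam.Map(lambda record: record[1])
        | "Print" >> beam.Map(print)
    )
```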
If a question asks where you should land raw data for a batch pipeline to read, Cloud Storage is the default answer unless the data has a more specific home. I tell candidates studying for the Professional Data Engineer exam to treat GCS as the fallback any time the data shape does not obviously demand BigQuery, Bigtable, or a relational database.
The reason this pairing works so well is that Dataflow can read directly from GCS objects and write directly back into a bucket. For a batch job, the typical flow looks like this:

1. Raw files land in a Cloud Storage bucket, often as a scheduled drop.
2. A batch Dataflow job reads those objects straight from the bucket and applies its transformations.
3. The job writes the results back to Cloud Storage, or on to another sink if the data has a more specific home.
This is the bread and butter of batch processing on Google Cloud. The exam will sometimes give you a scenario with a nightly file drop and ask what to do with it. If the data is large and the processing is bulk, Cloud Storage as input plus Dataflow as the processor is the answer you should reach for first.
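Here is a minimal sketch of that nightly file-drop case in the Beam Python SDK. The bucket names, the three-column CSV layout, and the filter rule are all placeholders for illustration.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    _ = (
        pipeline
        # Source: the raw files that landed in the bucket overnight.
        | "ReadDrop" >> beam.io.ReadFromText(
            "gs://example-landing-bucket/drops/2024-01-01/*.csv",
            skip_header_lines=1)
        # Transform: split each line and keep only rows with the expected columns.
        | "SplitRows" >> beam.Map(lambda line: line.split(","))
        | "DropBadRows" >> beam.Filter(lambda cols: len(cols) == 3)
        | "Rejoin" >> beam.Map(",".join)
        # Sink: the cleaned output goes back into Cloud Storage for downstream jobs.
        | "WriteClean" >> beam.io.WriteToText(
            "gs://example-curated-bucket/clean/2024-01-01/part",
            file_name_suffix=".csv")
    )
```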
The other half of the pattern is real-time ingestion, and that is where Pub/Sub enters. Pub/Sub serves as the entry point for data collection. It accepts high volumes of messages, buffers them, and holds them durably until a consumer picks them up and acknowledges them. Dataflow is that consumer.
Here is the canonical streaming pipeline that the Professional Data Engineer exam expects you to know cold:

1. Producers publish events to a Pub/Sub topic, which ingests and buffers the messages.
2. A streaming Dataflow job consumes the messages and applies the transformations.
3. Dataflow writes the results out to a storage service that matches the data type.
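As a sketch, here is what that pipeline can look like in the Beam Python SDK. I am assuming JSON click events with a page field, an existing topic, and an existing BigQuery table; all of those names are placeholders.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    _ = (
        pipeline
        # 1. Pub/Sub ingests and buffers the raw events.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/example-project/topics/clickstream")
        # 2. Dataflow transforms the stream: decode, window, aggregate.
        | "Decode" >> beam.Map(lambda raw: json.loads(raw.decode("utf-8")))
        | "Window" >> beam.WindowInto(FixedWindows(60))
        | "KeyByPage" >> beam.Map(lambda event: (event["page"], 1))
        | "CountPerPage" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"page": kv[0], "views": kv[1]})
        # 3. Structured rows go to a sink that matches the data type: BigQuery.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "example-project:analytics.page_views_per_minute",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```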
That last routing step is where the exam likes to trip people up, because the destination depends entirely on the data type:

- Unstructured data goes to Cloud Storage.
- Structured, relational data meant for SQL analysis goes to BigQuery.
- NoSQL or time-series data that needs fast reads and writes goes to Bigtable.
When you see a streaming question with a fan-out across multiple sinks, that mental map of unstructured to GCS, relational to BigQuery, and NoSQL or time series to Bigtable is what you want to anchor on.
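Here is a hedged sketch of that fan-out. I am assuming each message carries a kind field that tells the pipeline which sink it belongs in; the project, bucket, and table names are placeholders; the streaming file write uses fileio.WriteToFiles, which handles unbounded input on recent Beam SDKs; and the Bigtable branch is left as a placeholder, since its writes go through the bigtableio connector.

```python
import json

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def route_by_type(element, num_partitions):
    # Hypothetical routing rule: the producer tags each message with a "kind" field.
    kind = json.loads(element).get("kind")
    return {"unstructured": 0, "relational": 1, "timeseries": 2}.get(kind, 0)


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    messages = (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/example-project/subscriptions/events-sub")
        | "Decode" >> beam.Map(lambda raw: raw.decode("utf-8"))
    )

    # One input stream, three outputs: the fan-out the scenario describes.
    branches = messages | "RouteByType" >> beam.Partition(route_by_type, 3)
    unstructured, relational, timeseries = branches[0], branches[1], branches[2]

    # Relational rows go to BigQuery via streaming inserts; the table is
    # assumed to exist already, so no schema is supplied here.
    _ = (
        relational
        | "ParseRows" >> beam.Map(json.loads)
        | "ToBigQuery" >> beam.io.WriteToBigQuery(
            "example-project:analytics.events",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )

    # Unstructured payloads land in Cloud Storage. Streaming file writes need
    # non-global windows, hence the fixed window before the write.
    _ = (
        unstructured
        | "WindowForFiles" >> beam.WindowInto(FixedWindows(300))
        | "ToGCS" >> fileio.WriteToFiles(path="gs://example-bucket/raw/")
    )

    # Time-series records would head to Bigtable through the bigtableio
    # connector (WriteToBigTable); that branch is omitted in this sketch.
```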
The Professional Data Engineer exam rarely asks you to define what Dataflow is. It puts you in a scenario and asks which services you should compose to solve a problem. A few things I watch for:

- Whether the source is a file drop in Cloud Storage (a batch job) or a message stream arriving through Pub/Sub (a streaming job).
- Whether the scenario leans on a native integration (GCS, Pub/Sub, BigQuery) or needs a connector such as Bigtable or Kafka.
- Whether the proposed sink actually matches the data type at the end of the pipeline.
The pattern to internalize is simple. Pub/Sub at the edge, Dataflow in the middle, and a storage layer at the end that matches the data type. For batch jobs, swap Pub/Sub for Cloud Storage at the front. Keep the same three roles in mind (ingestion, transformation, storage) and the exam questions get a lot easier to parse.
My Professional Data Engineer course covers the full set of Dataflow integrations along with the streaming and batch patterns you need to recognize on exam day.