
One of the patterns that shows up over and over on the Professional Data Engineer exam is Dataflow sitting in the middle of a pipeline, with Cloud Storage on one side and Pub/Sub on the other. If you understand how those three services hand data off to each other, you can answer a surprising number of integration questions without overthinking them. In this post I want to walk through the connections I drill into candidates when they prep with me, because the exam loves to test whether you reach for the right service at the right point in the pipeline.
Dataflow is the managed Apache Beam service for batch and streaming transformations. On its own it does not store anything. It pulls data in, applies transformations, and writes results out. That means every Dataflow job needs a source and a sink, and the choice of source and sink is what most exam questions actually hinge on.
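To make that shape concrete, here is a minimal sketch using the Apache Beam Python SDK. The bucket paths and the transform are placeholders; the point is the structure, not the specific source and sink.

```python
import apache_beam as beam

# The shape of every Dataflow job: a source, one or more transforms, a sink.
# Dataflow itself stores nothing; swapping the Read and Write transforms
# changes where the data comes from and where it lands.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Source" >> beam.io.ReadFromText("gs://example-bucket/input/*.txt")
        | "Transform" >> beam.Map(str.strip)
        | "Sink" >> beam.io.WriteToText("gs://example-bucket/output/part")
    )
```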
Google groups the integrations into two buckets. The first is native integrations, which include Cloud Storage, Pub/Sub, and BigQuery. These are the services Dataflow is built to talk to directly. The second bucket is connectors, which cover services like Bigtable and Apache Kafka. The native versus connector distinction is worth remembering, because if you see a question about a first-class Dataflow integration, the answer is almost always going to involve GCS, Pub/Sub, or BigQuery.
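To see what the connector bucket looks like in practice, here is a hedged sketch of a Kafka read in the Beam Python SDK. The broker address and topic are placeholders, and ReadFromKafka is a cross-language transform, so it needs a runner that can expand it (Dataflow can).

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka  # a connector, not a native integration

# Reading from Kafka means pulling in an explicit connector transform,
# unlike GCS, Pub/Sub, or BigQuery, which Dataflow talks to directly.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker.example.com:9092"},
            topics=["events"])
        # Elements arrive as (key, value) pairs; keep the value here.
        | "Values" >> beam.Map(lambda record: record[1])
        | "Print" >> beam.Map(print)
    )
```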
If a question asks where you should land raw data for a batch pipeline to read, Cloud Storage is the default answer unless the data has a more specific home. I tell candidates studying for the Professional Data Engineer exam to treat GCS as the fallback any time the data shape does not obviously demand BigQuery, Bigtable, or a relational database.
The reason this pairing works so well is that Dataflow can read directly from GCS objects and write directly back into a bucket. For a batch job, the typical flow looks like this:

1. Raw files land in a Cloud Storage bucket, often as a scheduled drop.
2. A batch Dataflow job reads those objects straight from the bucket and applies its transformations.
3. The job writes the results back to Cloud Storage, or on to another sink if the data has a more specific home.
This is the bread and butter of batch processing on Google Cloud. The exam will sometimes give you a scenario with a nightly file drop and ask what to do with it. If the data is large and the processing is bulk, Cloud Storage as input plus Dataflow as the processor is the answer you should reach for first.
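Here is a minimal sketch of that nightly file-drop case in the Beam Python SDK. The bucket names, the three-column CSV layout, and the filter rule are all placeholders for illustration.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    _ = (
        pipeline
        # Source: the raw files that landed in the bucket overnight.
        | "ReadDrop" >> beam.io.ReadFromText(
            "gs://example-landing-bucket/drops/2024-01-01/*.csv",
            skip_header_lines=1)
        # Transform: split each line and keep only rows with the expected columns.
        | "SplitRows" >> beam.Map(lambda line: line.split(","))
        | "DropBadRows" >> beam.Filter(lambda cols: len(cols) == 3)
        | "Rejoin" >> beam.Map(",".join)
        # Sink: the cleaned output goes back into Cloud Storage for downstream jobs.
        | "WriteClean" >> beam.io.WriteToText(
            "gs://example-curated-bucket/clean/2024-01-01/part",
            file_name_suffix=".csv")
    )
```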
The other half of the pattern is real-time ingestion, and that is where Pub/Sub enters. Pub/Sub serves as the entry point for data collection. It accepts high volumes of messages, buffers them, and holds them durably until a consumer picks them up and acknowledges them. Dataflow is that consumer.
Here is the canonical streaming pipeline that the Professional Data Engineer exam expects you to know cold:

1. Producers publish events to a Pub/Sub topic, which ingests and buffers the messages.
2. A streaming Dataflow job consumes the messages and applies the transformations.
3. Dataflow writes the results out to a storage service that matches the data type.
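As a sketch, here is what that pipeline can look like in the Beam Python SDK. I am assuming JSON click events with a page field, an existing topic, and an existing BigQuery table; all of those names are placeholders.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    _ = (
        pipeline
        # 1. Pub/Sub ingests and buffers the raw events.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/example-project/topics/clickstream")
        # 2. Dataflow transforms the stream: decode, window, aggregate.
        | "Decode" >> beam.Map(lambda raw: json.loads(raw.decode("utf-8")))
        | "Window" >> beam.WindowInto(FixedWindows(60))
        | "KeyByPage" >> beam.Map(lambda event: (event["page"], 1))
        | "CountPerPage" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"page": kv[0], "views": kv[1]})
        # 3. Structured rows go to a sink that matches the data type: BigQuery.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "example-project:analytics.page_views_per_minute",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```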
That last routing step is where the exam likes to trip people up, because the destination depends entirely on the data type:

- Unstructured data goes to Cloud Storage.
- Structured, relational data meant for SQL analysis goes to BigQuery.
- NoSQL or time-series data that needs fast reads and writes goes to Bigtable.
When you see a streaming question with a fan-out across multiple sinks, that mental map of unstructured to GCS, relational to BigQuery, and NoSQL or time series to Bigtable is what you want to anchor on.
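Here is a hedged sketch of that fan-out. I am assuming each message carries a kind field that tells the pipeline which sink it belongs in; the project, bucket, and table names are placeholders; the streaming file write uses fileio.WriteToFiles, which handles unbounded input on recent Beam SDKs; and the Bigtable branch is left as a placeholder, since its writes go through the bigtableio connector.

```python
import json

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def route_by_type(element, num_partitions):
    # Hypothetical routing rule: the producer tags each message with a "kind" field.
    kind = json.loads(element).get("kind")
    return {"unstructured": 0, "relational": 1, "timeseries": 2}.get(kind, 0)


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    messages = (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/example-project/subscriptions/events-sub")
        | "Decode" >> beam.Map(lambda raw: raw.decode("utf-8"))
    )

    # One input stream, three outputs: the fan-out the scenario describes.
    branches = messages | "RouteByType" >> beam.Partition(route_by_type, 3)
    unstructured, relational, timeseries = branches[0], branches[1], branches[2]

    # Relational rows go to BigQuery via streaming inserts; the table is
    # assumed to exist already, so no schema is supplied here.
    _ = (
        relational
        | "ParseRows" >> beam.Map(json.loads)
        | "ToBigQuery" >> beam.io.WriteToBigQuery(
            "example-project:analytics.events",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )

    # Unstructured payloads land in Cloud Storage. Streaming file writes need
    # non-global windows, hence the fixed window before the write.
    _ = (
        unstructured
        | "WindowForFiles" >> beam.WindowInto(FixedWindows(300))
        | "ToGCS" >> fileio.WriteToFiles(path="gs://example-bucket/raw/")
    )

    # Time-series records would head to Bigtable through the bigtableio
    # connector (WriteToBigTable); that branch is omitted in this sketch.
```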
The Professional Data Engineer exam rarely asks you to define what Dataflow is. It puts you in a scenario and asks which services you should compose to solve a problem. A few things I watch for:

- Whether the source is a file drop in Cloud Storage (a batch job) or a message stream arriving through Pub/Sub (a streaming job).
- Whether the scenario leans on a native integration (GCS, Pub/Sub, BigQuery) or needs a connector such as Bigtable or Kafka.
- Whether the proposed sink actually matches the data type at the end of the pipeline.
The pattern to internalize is simple. Pub/Sub at the edge, Dataflow in the middle, and a storage layer at the end that matches the data type. For batch jobs, swap Pub/Sub for Cloud Storage at the front. Keep the same three roles in mind (ingestion, transformation, storage) and the exam questions get a lot easier to parse.
My Professional Data Engineer course covers the full set of Dataflow integrations along with the streaming and batch patterns you need to recognize on exam day.