Cloud Dataflow Overview for the PDE Exam: Unified Batch and Streaming

GCP Study Hub
September 15, 2025

If you are studying for the Google Cloud Professional Data Engineer exam, Cloud Dataflow is one of those services you have to know cold. It shows up across pipeline design, streaming questions, fraud detection scenarios, and any time the exam wants to test whether you understand the difference between Apache Beam and the managed runner that executes it. In this article I want to walk through what Dataflow is, the problem it was built to solve, and the key facts the exam expects you to recall.

The problem Dataflow was built to solve

For a long time, data teams had to maintain two completely separate stacks. One pipeline handled batch workloads, which meant large volumes of historical data processed on a schedule and optimized for accuracy and completeness. A second pipeline handled streaming, which meant continuous, low-latency processing of events as they arrived. Each pipeline used different tools, different code, and often different teams.

That split caused real pain. Comparing recent activity with historical data was awkward because you had to sync data between two systems designed for different purposes. One pipeline was optimized for speed, the other for accuracy, and neither could give you both at once. Scaling and operating two parallel stacks also added overhead that most organizations did not want to carry.

The classic example, and one worth keeping in mind for the Professional Data Engineer exam, is fraud detection on credit card transactions. To catch fraud well, you need to compare a transaction happening right now against a long history of that cardholder's behavior. With two pipelines, that comparison is slow and inconsistent. With a unified approach, it becomes a single, continuous evaluation.

What Cloud Dataflow actually is

Cloud Dataflow is Google Cloud's fully managed service for running data processing pipelines. It handles batch and streaming workloads in a single model, which is the headline feature you want anchored in your head before the exam.

Under the hood, Dataflow is the managed runner for Apache Beam. Beam is the open source, unified programming model, and the name itself comes from combining Batch and strEAM. When you write a pipeline in Beam, you describe the transformations once and then choose where to run the pipeline. On Google Cloud, that execution happens on Dataflow.
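
To make the "describe once, choose the runner later" idea concrete, here is a minimal sketch of a Beam pipeline in the Python SDK. The bucket paths and the CSV column layout are hypothetical placeholders; the point is that the transform graph is written once and the execution environment is picked at launch time.

```python
# Minimal Apache Beam pipeline (Python SDK). Paths and the CSV layout are
# hypothetical placeholders used only to illustrate the programming model.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # With no runner specified, Beam defaults to the local DirectRunner.
    # Passing --runner=DataflowRunner (plus project/region options) at launch
    # executes the exact same pipeline on Cloud Dataflow.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/transactions.csv")
            | "ParseAmount" >> beam.Map(lambda line: float(line.split(",")[2]))
            | "SumAmounts" >> beam.CombineGlobally(sum)
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/total")
        )


if __name__ == "__main__":
    run()
```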

Here are the properties that consistently show up in exam questions:

  • Unified batch and streaming: one programming model and one service handles both. You do not need separate code or separate infrastructure.
  • Serverless and no-ops: you do not provision or manage clusters. Google handles the underlying workers.
  • Autoscaling: Dataflow scales workers up and down based on the workload, so you are not stuck paying for idle capacity or hitting bottlenecks during spikes.
  • Based on Apache Beam: portable pipelines, written in Java, Python, or Go, that can target other runners if you ever need to move them (the launch sketch after this list shows the Dataflow-specific options).
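
As a rough sketch of what "same code, different runner" and autoscaling look like in practice, these are the kinds of pipeline options you might pass when submitting the pipeline above to Dataflow. The project, region, bucket, and worker counts are hypothetical values, not recommendations.

```python
# Launching the same pipeline on Cloud Dataflow instead of locally: only the
# options change, not the transform code. All values below are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

dataflow_options = PipelineOptions(
    runner="DataflowRunner",              # managed execution on Cloud Dataflow
    project="my-gcp-project",             # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/temp",  # staging/temp files for the job
    max_num_workers=10,                   # upper bound for autoscaling
    autoscaling_algorithm="THROUGHPUT_BASED",
)
```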

If you spot a question that asks for a serverless, autoscaling service to handle both streaming and batch with the same pipeline code, Dataflow is almost always the right answer.

How Dataflow fits into a Google Cloud data architecture

Another thing the Professional Data Engineer exam loves to test is how services integrate. Dataflow has native integrations with the core data services on Google Cloud:

  • Cloud Storage as a source or sink for batch files and exports.
  • Pub/Sub as the streaming source for event ingestion.
  • BigQuery as the analytical destination for processed data.

On top of that, there are connectors for Bigtable and Apache Kafka, which is useful if you are ingesting from an existing Kafka cluster or writing low-latency results into Bigtable for serving.

The reference pattern you should be very comfortable with looks like this: events land in Pub/Sub, Dataflow reads from the Pub/Sub subscription, applies transformations and windowing, and writes results into BigQuery or Bigtable. For batch jobs, swap Pub/Sub for Cloud Storage as the source. That single mental model covers a large share of the streaming and batch questions on the exam.
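A hedged sketch of that streaming pattern in the Beam Python SDK is below. The subscription name, BigQuery table, message fields, and one-minute window size are all made-up placeholders; a real job would also be launched with Dataflow options like the ones shown earlier.

```python
# Sketch of the canonical streaming pattern: Pub/Sub -> Dataflow -> BigQuery.
# Subscription, table, and message layout are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # run as a streaming job

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/transactions-sub")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 1-minute fixed windows
        | "KeyByCard" >> beam.Map(lambda tx: (tx["card_id"], tx["amount"]))
        | "SumPerCard" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"card_id": kv[0], "total_amount": kv[1]})
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:fraud.windowed_totals",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```

For a batch run, the ReadFromPubSub step would be replaced with a Cloud Storage source such as ReadFromText, while the downstream transforms stay the same.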

Back to the fraud detection example

With Dataflow, the fraud detection scenario collapses into one pipeline. New transactions flow in through Pub/Sub and get processed in real time, so suspicious activity can be flagged within seconds. The same pipeline, or a sibling batch job using the same Beam code, processes the historical transaction archive to maintain a baseline of normal behavior. Because both sides run on the same model, you can continuously evaluate live activity against historical patterns without syncing two different stacks.
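To illustrate the "same Beam code" point, here is a sketch in which one shared transform is reused by a streaming path and a batch path. The scoring rule, field names, subscription, and file paths are invented for illustration; this is not a real fraud model.

```python
# Illustrative only: the same Beam transform reused by a streaming pipeline
# (Pub/Sub source) and a batch pipeline (Cloud Storage source).
import json

import apache_beam as beam


class ScoreTransactions(beam.PTransform):
    """Shared business logic: parse a transaction and attach a naive risk flag."""

    def expand(self, pcoll):
        return (
            pcoll
            | "Parse" >> beam.Map(json.loads)
            | "Flag" >> beam.Map(lambda tx: {**tx, "suspicious": tx["amount"] > 1000})
        )


def build_streaming(p):
    # Real-time path: flag events within seconds of arrival.
    return (
        p
        | beam.io.ReadFromPubSub(subscription="projects/my-project/subscriptions/tx-sub")
        | beam.Map(lambda msg: msg.decode("utf-8"))
        | ScoreTransactions()
    )


def build_batch(p):
    # Historical path: the identical transform over the archive in Cloud Storage.
    return (
        p
        | beam.io.ReadFromText("gs://my-bucket/archive/*.json")
        | ScoreTransactions()
    )
```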

That is the architectural payoff Google wants you to internalize. Dataflow replaces two systems with one, reduces operational complexity, and makes it realistic to compare real-time and historical data in the same evaluation.

What to remember for the PDE exam

For the Professional Data Engineer exam, a few facts about Dataflow are worth memorizing as a tight bundle:

  • Dataflow is the managed runner for Apache Beam, and Beam stands for Batch + strEAM.
  • It handles both batch and streaming with the same programming model.
  • It is serverless, no-ops, and autoscaling.
  • It integrates natively with Cloud Storage, Pub/Sub, and BigQuery, with connectors for Bigtable and Kafka.
  • The canonical streaming pattern is Pub/Sub to Dataflow to BigQuery.

If a question hands you a scenario with low-latency event processing, a need to unify batch and streaming, or a fraud-style comparison of real-time and historical data, Dataflow should be the first service you reach for.

My Professional Data Engineer course covers Cloud Dataflow in depth, including the Apache Beam programming model, windowing, and the streaming and batch reference architectures you need for the exam.
