
When I was first learning Google Cloud's data services, the question that kept coming up was simple. If I have a stream of events arriving in real time and a pile of historical data sitting in storage, how do I run them through the same logic without building two separate systems? Cloud Dataflow is Google's answer, and for the Professional Cloud Architect exam it shows up as the default service whenever a question mentions both batch and streaming in the same architecture.
This article walks through what Dataflow actually is, the historical problem it was built to solve, and the integration points you should recognize on the exam.
Until fairly recently, organizations that wanted to process data at scale ran two separate pipelines, a split often called the Lambda architecture. One was a batch pipeline, optimized for accuracy and throughput, that ran on a schedule against historical data. The other was a streaming pipeline, optimized for latency, that processed events as they arrived. The two pipelines used different tools, different code, and often different teams.
That split caused real architectural pain. Whenever you needed to compare recent activity against historical context, you had to reconcile data from both systems. The classic example is fraud detection on credit card transactions. The streaming side flags a suspicious charge in the moment. The batch side knows the cardholder's spending pattern over the last three years. Joining those two views means syncing data across systems, accepting some lag, and accepting that your real-time view and your historical view will occasionally disagree.
On top of the data inconsistency, you paid an operational tax. Two pipelines meant two sets of failures to debug, two scaling stories to manage, and two codebases to evolve. For an architect, that is the kind of complexity that compounds across a platform.
Dataflow is Google Cloud's fully managed service for running data processing pipelines, and the headline feature is that a single pipeline can handle both batch and streaming workloads. You write the transformation logic once, and Dataflow runs it against bounded historical data or unbounded live event streams using the same code.
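To make that concrete, here is a minimal sketch in the Beam Python SDK of one scoring function applied to both kinds of input. The bucket, topic, and scoring rule are hypothetical placeholders, not a prescription, and the sinks are omitted for brevity:

```python
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


def score(record):
    """Toy shared logic: parse a CSV line and attach a crude risk flag."""
    txn_id, amount = record.split(",")[:2]
    return {"id": txn_id, "amount": float(amount), "risky": float(amount) > 1000}


# Batch: the same function applied to bounded historical files.
with beam.Pipeline() as p:
    (p
     | "ReadHistory" >> ReadFromText("gs://my-bucket/transactions/*.csv")  # hypothetical bucket
     | "ScoreHistory" >> beam.Map(score))

# Streaming: the same function applied to an unbounded Pub/Sub stream.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "ReadLive" >> ReadFromPubSub(topic="projects/my-project/topics/txns")  # hypothetical topic
     | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
     | "ScoreLive" >> beam.Map(score))
```

The transformation logic lives in one function; only the source changes between the bounded and unbounded cases.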
Under the hood, Dataflow is Google's managed implementation of Apache Beam, which is an open source unified programming model. The name Beam is itself a contraction of Batch and strEAM, which tells you exactly what the project was designed to unify. By running on Beam, Dataflow gives you portable pipeline code. The same Beam pipeline can in principle run on other Beam runners, though on Google Cloud you get the managed Dataflow runner with full integration into the rest of the platform.
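To make the portability point concrete, here is a minimal sketch of how the runner is selected in the Beam Python SDK. Swapping the runner option is the only change between a local test run and a managed Dataflow run; the project and bucket names are hypothetical:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# During development, run the pipeline locally on the open source DirectRunner.
local = PipelineOptions(runner="DirectRunner")

# In production, the identical pipeline code runs on the managed Dataflow runner.
prod = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # hypothetical project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # hypothetical staging bucket
)
```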
A few characteristics matter for the Professional Cloud Architect exam:

- It is serverless. You do not provision clusters or manage workers; you submit a pipeline and Google runs it.
- It autoscales. Dataflow adds and removes workers as the volume of data changes (sketched below).
- It is built on open source Apache Beam, so pipeline code is portable rather than locked to Google's runner.
- One pipeline serves both batch and streaming, which eliminates the dual-pipeline problem described above.
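Here is roughly what the scaling knobs look like when launching a job with the Beam Python SDK; every resource name and the worker cap are hypothetical values for illustration:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                      # hypothetical
    region="us-central1",
    temp_location="gs://my-bucket/tmp",        # hypothetical
    autoscaling_algorithm="THROUGHPUT_BASED",  # scale workers with load (typically the default)
    max_num_workers=20,                        # upper bound on autoscaling
)
```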
The pattern of serverless plus autoscaling shows up across Google Cloud's flagship services. BigQuery, Cloud Run, and Dataflow all share that profile, and Cloud Run (built on Knative) and Dataflow (built on Beam) add the open source foundation. The exam tends to lean on those services whenever a question describes a managed, scalable workload without operational overhead.
Dataflow's value on Google Cloud comes partly from how cleanly it slots into the rest of the data ecosystem. The native integrations you should know are:

- Pub/Sub, the standard ingestion point for streaming events feeding a Dataflow pipeline.
- BigQuery, the usual destination for processed data headed into analytics, and a source of historical data on the batch side.
- Cloud Storage, the typical source and sink for batch files.
Beyond the native integrations, Dataflow ships with connectors for Bigtable and Apache Kafka. Bigtable is the destination when you need low-latency reads on processed data at very high throughput. Kafka matters when you are migrating an existing on-premises or hybrid streaming architecture and want to keep Kafka as the message bus while moving the processing layer to Google Cloud.
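If the Kafka migration scenario comes up, the processing side might look like this sketch using Beam's Kafka connector in the Python SDK (a cross-language transform that wraps the Java Kafka I/O, so it needs a Java runtime available); the broker address and topic name are hypothetical:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "ReadKafka" >> ReadFromKafka(
         consumer_config={"bootstrap.servers": "broker-1:9092"},  # hypothetical brokers
         topics=["transactions"],                                 # hypothetical topic
     )
     # The connector emits (key, value) pairs as bytes.
     | "Values" >> beam.Map(lambda kv: kv[1].decode("utf-8")))
```

Kafka stays in place as the message bus; only the consumers move to a managed runner.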
With Dataflow, the fraud detection scenario collapses into a single pipeline. New transactions arrive on a Pub/Sub topic and feed the streaming side of the pipeline. Historical transaction data sits in Cloud Storage or BigQuery and feeds the batch side. The same Beam transformations apply scoring logic to both streams, write enriched events to BigQuery for analysis, and emit alerts when a transaction looks anomalous against the historical baseline.
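Under hypothetical names, here is a sketch of that single pipeline in the Beam Python SDK. The historical baselines enter the streaming branch as a bounded side input, and the anomaly rule (five times the historical average) is a toy stand-in for real scoring logic:

```python
import json

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions


class ScoreAgainstBaseline(beam.DoFn):
    """Compare a live transaction with the cardholder's historical average."""

    def process(self, txn, baselines):
        avg = baselines.get(txn["card_id"], 0.0)
        yield {**txn, "suspicious": avg > 0 and txn["amount"] > 5 * avg}


options = PipelineOptions(streaming=True)  # plus project/region/runner in a real job

with beam.Pipeline(options=options) as p:
    # Batch side: per-card spending averages exported from the historical store.
    baselines = (
        p
        | "ReadHistory" >> beam.io.ReadFromText("gs://my-bucket/baselines.csv")  # hypothetical
        | "ParseHistory" >> beam.Map(
            lambda line: (line.split(",")[0], float(line.split(",")[1])))
    )

    # Streaming side: live transactions from Pub/Sub, scored against the baseline.
    (p
     | "ReadLive" >> ReadFromPubSub(topic="projects/my-project/topics/txns")  # hypothetical
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Score" >> beam.ParDo(
         ScoreAgainstBaseline(), baselines=beam.pvalue.AsDict(baselines))
     | "Write" >> WriteToBigQuery(
         "my-project:fraud.scored_transactions",  # hypothetical table
         schema="card_id:STRING,amount:FLOAT,suspicious:BOOLEAN"))
```

Whether the baseline arrives as a side input, a join, or a stateful lookup is a design choice; the sketch uses a side input because it keeps the example small.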
Architecturally, that means one codebase, one set of operational concerns, and a consistent view of what counts as suspicious activity across both real-time and historical data. That is the kind of simplification the Professional Cloud Architect exam expects you to recognize when a scenario describes parallel batch and streaming requirements.
For the Professional Cloud Architect exam, the high-value points on Dataflow are:

- It is the unified batch and streaming service: one pipeline, one codebase for both kinds of data.
- It is serverless and autoscaling, with no clusters to manage.
- It runs Apache Beam, so pipeline code stays portable across runners.
- It integrates natively with Pub/Sub, BigQuery, and Cloud Storage, and ships connectors for Bigtable and Kafka.
If a scenario mentions unified batch and streaming, real-time analytics terminating in BigQuery, or migration of a Kafka-based pipeline into Google Cloud, Dataflow is almost always the right answer.
My Professional Cloud Architect course covers Dataflow alongside the rest of the messaging and pipelines material.