
When I was first learning Google Cloud's data services, the question that kept coming up was simple. If I have a stream of events arriving in real time and a pile of historical data sitting in storage, how do I run them through the same logic without building two separate systems? Cloud Dataflow is Google's answer, and for the Professional Cloud Architect exam it shows up as the default service whenever a question mentions both batch and streaming in the same architecture.
This article walks through what Dataflow actually is, the historical problem it was built to solve, and the integration points you should recognize on the exam.
Until fairly recently, organizations that wanted to process data at scale ran two separate pipelines, a split often called the Lambda architecture. One was a batch pipeline, optimized for accuracy and throughput, that ran on a schedule against historical data. The other was a streaming pipeline, optimized for latency, that processed events as they arrived. The two pipelines used different tools, different code, and often different teams.
That split caused real architectural pain. Whenever you needed to compare recent activity against historical context, you had to reconcile data from both systems. The classic example is fraud detection on credit card transactions. The streaming side flags a suspicious charge in the moment. The batch side knows the cardholder's spending pattern over the last three years. Joining those two views means syncing data across systems, accepting some lag, and accepting that your real-time view and your historical view will occasionally disagree.
On top of the data inconsistency, you paid an operational tax. Two pipelines meant two sets of failures to debug, two scaling stories to manage, and two codebases to evolve. For an architect, that is the kind of complexity that compounds across a platform.
Dataflow is Google Cloud's fully managed service for running data processing pipelines, and the headline feature is that a single pipeline can handle both batch and streaming workloads. You write the transformation logic once, and Dataflow runs it against bounded historical data or unbounded live event streams using the same code.
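To make that concrete, here is a minimal sketch in the Beam Python SDK of one scoring function applied to both kinds of input. The bucket, topic, and scoring rule are hypothetical placeholders, not a prescription, and the sinks are omitted for brevity:

```python
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


def score(record):
    """Toy shared logic: parse a CSV line and attach a crude risk flag."""
    txn_id, amount = record.split(",")[:2]
    return {"id": txn_id, "amount": float(amount), "risky": float(amount) > 1000}


# Batch: the same function applied to bounded historical files.
with beam.Pipeline() as p:
    (p
     | "ReadHistory" >> ReadFromText("gs://my-bucket/transactions/*.csv")  # hypothetical bucket
     | "ScoreHistory" >> beam.Map(score))

# Streaming: the same function applied to an unbounded Pub/Sub stream.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "ReadLive" >> ReadFromPubSub(topic="projects/my-project/topics/txns")  # hypothetical topic
     | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
     | "ScoreLive" >> beam.Map(score))
```

The transformation logic lives in one function; only the source changes between the bounded and unbounded cases.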
Under the hood, Dataflow is Google's managed implementation of Apache Beam, which is an open source unified programming model. The name Beam is itself a contraction of Batch and strEAM, which tells you exactly what the project was designed to unify. By running on Beam, Dataflow gives you portable pipeline code. The same Beam pipeline can in principle run on other Beam runners, though on Google Cloud you get the managed Dataflow runner with full integration into the rest of the platform.
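To make the portability point concrete, here is a minimal sketch of how the runner is selected in the Beam Python SDK. Swapping the runner option is the only change between a local test run and a managed Dataflow run; the project and bucket names are hypothetical:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# During development, run the pipeline locally on the open source DirectRunner.
local = PipelineOptions(runner="DirectRunner")

# In production, the identical pipeline code runs on the managed Dataflow runner.
prod = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # hypothetical project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # hypothetical staging bucket
)
```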
A few characteristics matter for the Professional Cloud Architect exam:

- It is serverless. You do not provision clusters or manage workers; you submit a pipeline and Google runs it.
- It autoscales. Dataflow adds and removes workers as the volume of data changes (sketched below).
- It is built on open source Apache Beam, so pipeline code is portable rather than locked to Google's runner.
- One pipeline serves both batch and streaming, which eliminates the dual-pipeline problem described above.
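Here is roughly what the scaling knobs look like when launching a job with the Beam Python SDK; every resource name and the worker cap are hypothetical values for illustration:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                      # hypothetical
    region="us-central1",
    temp_location="gs://my-bucket/tmp",        # hypothetical
    autoscaling_algorithm="THROUGHPUT_BASED",  # scale workers with load (typically the default)
    max_num_workers=20,                        # upper bound on autoscaling
)
```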
The pattern of serverless plus autoscaling shows up across Google Cloud's flagship services. BigQuery, Cloud Run, and Dataflow all share that profile, and Cloud Run (built on Knative) and Dataflow (built on Beam) add the open source foundation. The exam tends to lean on those services whenever a question describes a managed, scalable workload without operational overhead.
Dataflow's value on Google Cloud comes partly from how cleanly it slots into the rest of the data ecosystem. The native integrations you should know are:

- Pub/Sub, the standard ingestion point for streaming events feeding a Dataflow pipeline.
- BigQuery, the usual destination for processed data headed into analytics, and a source of historical data on the batch side.
- Cloud Storage, the typical source and sink for batch files.
Beyond the native integrations, Dataflow ships with connectors for Bigtable and Apache Kafka. Bigtable is the destination when you need low-latency reads on processed data at very high throughput. Kafka matters when you are migrating an existing on-premises or hybrid streaming architecture and want to keep Kafka as the message bus while moving the processing layer to Google Cloud.
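If the Kafka migration scenario comes up, the processing side might look like this sketch using Beam's Kafka connector in the Python SDK (a cross-language transform that wraps the Java Kafka I/O, so it needs a Java runtime available); the broker address and topic name are hypothetical:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "ReadKafka" >> ReadFromKafka(
         consumer_config={"bootstrap.servers": "broker-1:9092"},  # hypothetical brokers
         topics=["transactions"],                                 # hypothetical topic
     )
     # The connector emits (key, value) pairs as bytes.
     | "Values" >> beam.Map(lambda kv: kv[1].decode("utf-8")))
```

Kafka stays in place as the message bus; only the consumers move to a managed runner.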
With Dataflow, the fraud detection scenario collapses into a single pipeline. New transactions arrive on a Pub/Sub topic and feed the streaming side of the pipeline. Historical transaction data sits in Cloud Storage or BigQuery and feeds the batch side. The same Beam transformations apply scoring logic to both streams, write enriched events to BigQuery for analysis, and emit alerts when a transaction looks anomalous against the historical baseline.
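Under hypothetical names, here is a sketch of that single pipeline in the Beam Python SDK. The historical baselines enter the streaming branch as a bounded side input, and the anomaly rule (five times the historical average) is a toy stand-in for real scoring logic:

```python
import json

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions


class ScoreAgainstBaseline(beam.DoFn):
    """Compare a live transaction with the cardholder's historical average."""

    def process(self, txn, baselines):
        avg = baselines.get(txn["card_id"], 0.0)
        yield {**txn, "suspicious": avg > 0 and txn["amount"] > 5 * avg}


options = PipelineOptions(streaming=True)  # plus project/region/runner in a real job

with beam.Pipeline(options=options) as p:
    # Batch side: per-card spending averages exported from the historical store.
    baselines = (
        p
        | "ReadHistory" >> beam.io.ReadFromText("gs://my-bucket/baselines.csv")  # hypothetical
        | "ParseHistory" >> beam.Map(
            lambda line: (line.split(",")[0], float(line.split(",")[1])))
    )

    # Streaming side: live transactions from Pub/Sub, scored against the baseline.
    (p
     | "ReadLive" >> ReadFromPubSub(topic="projects/my-project/topics/txns")  # hypothetical
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Score" >> beam.ParDo(
         ScoreAgainstBaseline(), baselines=beam.pvalue.AsDict(baselines))
     | "Write" >> WriteToBigQuery(
         "my-project:fraud.scored_transactions",  # hypothetical table
         schema="card_id:STRING,amount:FLOAT,suspicious:BOOLEAN"))
```

Whether the baseline arrives as a side input, a join, or a stateful lookup is a design choice; the sketch uses a side input because it keeps the example small.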
Architecturally, that means one codebase, one set of operational concerns, and a consistent view of what counts as suspicious activity across both real-time and historical data. That is the kind of simplification the Professional Cloud Architect exam expects you to recognize when a scenario describes parallel batch and streaming requirements.
For the Professional Cloud Architect exam, the high-value points on Dataflow are:

- It is the unified batch and streaming service: one pipeline, one codebase for both kinds of data.
- It is serverless and autoscaling, with no clusters to manage.
- It runs Apache Beam, so pipeline code stays portable across runners.
- It integrates natively with Pub/Sub, BigQuery, and Cloud Storage, and ships connectors for Bigtable and Kafka.
If a scenario mentions unified batch and streaming, real-time analytics terminating in BigQuery, or migration of a Kafka-based pipeline into Google Cloud, Dataflow is almost always the right answer.
My Professional Cloud Architect course covers Dataflow alongside the rest of the messaging and pipelines material.