Cloud Dataflow for the ACE Exam: Batch and Streaming in One Pipeline

Ben Makansi
April 16, 2026

This article covers what Cloud Dataflow is, the unified batch-and-streaming model it provides, the integrations that show up most often on the Associate Cloud Engineer exam, and the scenarios where Dataflow is the expected answer.

It does not cover Apache Beam programming details, Dataflow templates, custom transforms, or the deeper performance tuning options. The ACE exam tests Dataflow at a conceptual level, and that is what I am focused on here.

What Dataflow actually is

Cloud Dataflow is GCP's managed data processing service. It runs pipelines that process large volumes of data, either as one-time batch jobs or as continuously running streaming jobs. It is auto-scaling and serverless. You write the pipeline and submit it. Dataflow figures out how many workers to use and manages the execution.

Under the hood, Dataflow is Google's managed implementation of Apache Beam. Beam is the open-source programming model for unified batch and streaming pipelines. The name Beam itself comes from combining "batch" and "stream." If you have existing Beam code, you can run it on Dataflow by selecting the Dataflow runner, typically without changing the pipeline code itself.

The unified batch and streaming model

This is the most important thing to understand about Dataflow, and it is the source of most exam questions. Historically, organizations had to maintain two separate data processing pipelines. One for batch (large historical datasets, processed all at once) and one for streaming (real-time events, processed continuously). Two codebases, two operational stacks, two sets of monitoring.

Dataflow eliminates that split. The same pipeline code can handle both. You write one pipeline and run it in batch mode against a historical dataset, or run it in streaming mode against a real-time event source. The transformations, the aggregations, the windowing - all of it works the same way in both modes.

The practical value is that you do not have to pick between batch and streaming up front, and you do not have to rewrite your pipeline if your needs change. Real-time fraud detection that uses both historical baselines and incoming transactions can run as one Dataflow job, not two pipelines stitched together.
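The unified model is easier to picture with a sketch. The snippet below is plain Python, not real Beam code: it simulates the core idea that one transformation, written once, runs unchanged against a bounded historical dataset (batch) and a lazily consumed event source (streaming). The names `enrich` and `run_pipeline` are illustrative, not Beam APIs.

```python
from typing import Iterable, Iterator

def enrich(event: dict) -> dict:
    """The transformation logic -- written once, used in both modes."""
    return {**event, "amount_usd": event["amount_cents"] / 100}

def run_pipeline(events: Iterable[dict]) -> Iterator[dict]:
    """Same code path whether `events` is a finite list (batch)
    or a generator that yields events as they arrive (streaming)."""
    for event in events:
        yield enrich(event)

# Batch mode: a historical dataset processed all at once.
historical = [{"amount_cents": 1250}, {"amount_cents": 99}]
batch_results = list(run_pipeline(historical))

# Streaming mode: events arrive one at a time; the pipeline
# consumes them lazily, with no change to the transformation.
def live_feed():
    yield {"amount_cents": 500}

stream_results = list(run_pipeline(live_feed()))
```

In real Beam, the split between bounded and unbounded sources is handled by the SDK and the runner, but the principle is the same: the transformation code does not care which mode it is running in.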

The integrations that matter

Dataflow integrates natively with three other GCP services, and these come up on the Associate Cloud Engineer exam constantly.

Pub/Sub is the most common input for streaming pipelines. Events get published to a Pub/Sub topic. Dataflow consumes from the topic and processes them in real time. The Pub/Sub plus Dataflow combination is the canonical streaming architecture on GCP, and it is what the exam expects when a scenario describes real-time event processing.

BigQuery is the most common output. Dataflow processes the data and writes results to a BigQuery table. From there, analysts or downstream systems can query the data. Pub/Sub to Dataflow to BigQuery is essentially the standard streaming analytics pipeline on GCP.

Cloud Storage is the most common input or output for batch pipelines. Files land in a GCS bucket. Dataflow reads them, processes them, and writes results back to GCS or to BigQuery.
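To make the Pub/Sub to Dataflow to BigQuery shape concrete, here is a toy in-memory simulation. These are stand-ins, not the real google-cloud client APIs: a deque plays the Pub/Sub topic, a list plays the BigQuery table, and one function in the middle plays the Dataflow processing step.

```python
from collections import deque

topic = deque()       # stand-in for a Pub/Sub topic: a buffer of events
bigquery_table = []   # stand-in for a BigQuery table: rows written by the pipeline

def publish(event):
    """Producer side: events get published to the topic."""
    topic.append(event)

def dataflow_step(event):
    """The processing layer between Pub/Sub and BigQuery."""
    return {"user": event["user"], "is_active": event["clicks"] > 0}

publish({"user": "a", "clicks": 3})
publish({"user": "b", "clicks": 1})

# Dataflow consumes from the topic, processes each event,
# and writes the results to BigQuery.
while topic:
    bigquery_table.append(dataflow_step(topic.popleft()))
```

The point of the simulation is the shape, not the code: Pub/Sub buffers, Dataflow transforms, BigQuery stores. Recognizing that three-stage shape is what the exam rewards.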

How the ACE exam tests this

Dataflow shows up on the ACE exam in a few consistent patterns.

The first is unified batch and streaming. A scenario describes a team that needs to process both historical data and incoming real-time data with one pipeline. The answer is Dataflow. The unified model is the differentiator.

The second is the streaming analytics pipeline. A scenario describes events flowing into Pub/Sub that need real-time processing before landing in BigQuery. The answer is Dataflow as the processing layer between Pub/Sub and BigQuery.

The third is the Apache Beam framing. A scenario mentions Apache Beam by name, or describes a Beam pipeline that the team wants to run as a managed service on GCP. The answer is Dataflow, because it is Google's managed Beam runtime.

The fourth is the batch-only data processing pattern. A scenario describes a one-time job that needs to transform a large dataset in Cloud Storage. Dataflow can do this, but on the ACE exam Dataproc (managed Hadoop and Spark) is sometimes the better answer for that scenario, especially when the scenario mentions existing Hadoop or Spark code. Dataflow wins for new pipelines and for anything streaming-related.

If a question mentions unified batch and streaming, Apache Beam, real-time event processing from Pub/Sub, or auto-scaling serverless data processing, think Dataflow.
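Purely as a study aid, the keyword triggers above can be captured in a small lookup function. This is a mnemonic, not a decision procedure; real exam questions require the full scenario context, and the keyword lists here are my own summary of the patterns.

```python
def pick_service(scenario: str) -> str:
    """Map ACE-style scenario keywords to the expected answer.
    A study aid only -- real questions need full context."""
    s = scenario.lower()
    # Existing Hadoop or Spark code is the classic Dataproc signal.
    if "hadoop" in s or "spark" in s:
        return "Dataproc"
    dataflow_signals = (
        "apache beam",
        "unified batch and streaming",
        "real-time",
        "pub/sub",
        "serverless data processing",
    )
    if any(signal in s for signal in dataflow_signals):
        return "Dataflow"
    return "needs more context"

pick_service("run an Apache Beam pipeline as a managed service")
```

Note the ordering: the Hadoop/Spark check comes first, mirroring the fourth pattern above, where existing Hadoop or Spark code tips a batch scenario toward Dataproc even though Dataflow could also do the job.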

What Dataflow is not

For exam clarity, a few quick distinctions. Dataflow is not Pub/Sub - Pub/Sub is the messaging buffer, Dataflow is the processing engine. Dataflow is not BigQuery - BigQuery is the data warehouse, Dataflow is the pipeline that fills it. Dataflow is not Dataproc - Dataproc is the managed Hadoop and Spark service, useful when you already have Hadoop or Spark code and want to lift it into GCP. Dataflow is for new pipelines built on Beam.

The bottom line

Cloud Dataflow is GCP's managed data processing service for batch and streaming pipelines. It is built on Apache Beam, which unifies batch and streaming under a single programming model. It auto-scales, is serverless, and integrates natively with Pub/Sub, BigQuery, and Cloud Storage. The ACE exam tests it in scenarios involving unified batch-and-streaming, real-time event processing from Pub/Sub, or Apache Beam.

For the Associate Cloud Engineer exam, recognize the integrations and the unified model. That covers most of what gets tested.

My Associate Cloud Engineer course covers Dataflow alongside Pub/Sub and BigQuery in the data services section, with the connections between them mapped to the kinds of streaming and batch scenarios the ACE exam asks about.
