Batch vs Streaming Data Processing for the PDE Exam

GCP Study Hub
June 8, 2025

One of the foundational distinctions on the Professional Data Engineer exam is the split between batch and streaming data processing. It sounds simple once you have seen it a few times, but the exam loves to test whether you can pick the right approach for a given scenario, and whether you understand the tradeoffs that come with each. I want to walk through how I think about this distinction when I am studying, and how I expect candidates to reason about it on test day.

What batch processing actually means

Batch processing means data is collected first and then processed in large, scheduled chunks. The schedule can be a time interval like hourly, daily, or weekly. It can also be triggered when a threshold or condition is met, or even kicked off manually when someone decides it is time to run the job. The key idea is that the data sits and accumulates, and then a job runs against the whole set at once.

You will sometimes see batch data referred to as bounded data, because the set is defined and finite before processing begins. The classic use cases are financial reporting, inventory management, and customer billing. None of those workloads require instant insight. They require a clean, complete picture of what happened over a defined window.
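To make that concrete, here is a minimal batch sketch using the Apache Beam Python SDK, which is the programming model behind Dataflow. The bucket path, file layout, and field names are placeholders I made up for illustration; the point is that the input is a finite, already-complete set when the job starts.

    # A minimal batch sketch with the Apache Beam Python SDK. The bucket path,
    # file layout, and field names are hypothetical placeholders.
    import apache_beam as beam

    def parse_line(line):
        # Assume CSV rows shaped like: customer_id,amount
        customer_id, amount = line.split(",")
        return customer_id, float(amount)

    with beam.Pipeline() as pipeline:
        (
            pipeline
            # Bounded input: yesterday's files are complete before the job starts.
            | "ReadDailyFiles" >> beam.io.ReadFromText(
                "gs://example-bucket/billing/2025-06-07/*.csv")
            | "Parse" >> beam.Map(parse_line)
            # One pass over the whole finite set, then write the summary.
            | "SumPerCustomer" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]:.2f}")
            | "WriteReport" >> beam.io.WriteToText(
                "gs://example-bucket/reports/daily_revenue")
        )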

What streaming processing actually means

Streaming processing handles data continuously as it arrives, either in real time or near real time. There is no waiting around for a job to fire. The processing happens immediately on each event, or on small windows of events, as they come in.

Streaming data is often called unbounded data, because it keeps flowing and there is no defined end. The use cases that make streaming worth the trouble all share a need for rapid response. Fraud detection wants to flag a suspicious transaction before it clears. IoT sensor pipelines have devices that emit data constantly and need it processed as it lands. System monitoring needs insights quickly enough to react to problems before they cascade.
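Here is the streaming counterpart, again as a rough Apache Beam Python sketch. The Pub/Sub topic and message format are hypothetical; what matters is that the source is unbounded and the pipeline works on small windows of events as they arrive rather than waiting for a complete set.

    # A minimal streaming sketch with the Apache Beam Python SDK. The Pub/Sub
    # topic and message format are hypothetical; the windowing is illustrative.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # Unbounded input: messages keep arriving and there is no defined end.
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/sensor-events")
            | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "KeyByDevice" >> beam.Map(lambda event: (event["device_id"], 1))
            # Work on small windows of events as they arrive, not a full set.
            | "OneMinuteWindows" >> beam.WindowInto(window.FixedWindows(60))
            | "CountPerDevice" >> beam.CombinePerKey(sum)
            | "Emit" >> beam.Map(print)  # stand-in for a monitoring or alerting sink
        )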

The tradeoffs you need to know

The exam will not ask you to recite definitions. It will give you a scenario and expect you to weigh the tradeoffs. Here is how I frame them.

Batch advantages:

  • Efficient for large volumes of data, because you are processing a defined chunk in one pass
  • Simpler infrastructure requirements compared to a continuous pipeline
  • Easier to reason about correctness when the input set is finite

Batch drawbacks:

  • There is a delay in insights, because you wait for data to accumulate before processing
  • Not appropriate when decisions need to happen in seconds or minutes

Streaming advantages:

  • Lower latency, which enables quick decision making on fresh data
  • Fits naturally with event-driven systems and continuous data sources

Streaming drawbacks:

  • Resource-intensive and requires more robust infrastructure to handle real-time flow
  • More operational complexity around late data, out-of-order events, and windowing
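That last drawback is worth seeing in code. A streaming pipeline has to decide how long to wait for late data, what to do when it shows up, and how results accumulate across re-fires, none of which a batch job over a bounded set ever has to answer. The sketch below uses the Apache Beam Python SDK, and the specific values are illustrative, not a recommendation.

    # A sketch of the windowing and late-data decisions a streaming pipeline
    # forces on you (Apache Beam Python SDK); the values here are illustrative.
    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    late_tolerant_window = beam.WindowInto(
        window.FixedWindows(60),            # one-minute event-time windows
        trigger=trigger.AfterWatermark(
            late=trigger.AfterCount(1)      # re-fire each time late elements arrive
        ),
        allowed_lateness=300,               # keep window state for 5 extra minutes
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
    )
    # This transform would replace a plain WindowInto step in a streaming pipeline.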

How to pick on the exam

When a question gives you a scenario, I run through three considerations before I commit to an answer.

Data volume. How much data is being processed, and how does it arrive? Large historical sets that land in cloud storage on a schedule point toward batch. Continuous high-frequency events from many producers point toward streaming.

Speed of analysis. How quickly do the insights need to be available? If the business needs to react in seconds, streaming is the answer. If a daily or weekly report is acceptable, batch is usually cheaper and simpler.

Infrastructure. Does the scenario suggest the team has the resources to run a streaming pipeline, or are they optimizing for cost and simplicity? Streaming costs more to run and operate, and the exam sometimes hints at that constraint.

The hybrid approach

Real systems do not always pick one or the other. A common pattern is to run streaming for real-time alerts on urgent signals, and batch for in-depth reporting on the accumulated data later. You will see this called a lambda architecture in some materials, and the Professional Data Engineer exam expects you to know that combining both is often the right answer.

An example I keep in mind is a retail platform that uses streaming to detect fraud as transactions happen, and batch jobs overnight to produce reconciliation reports and revenue summaries. Same underlying data, two processing paths, each serving the latency profile that fits the use case.
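Here is a rough sketch of what the streaming side of that retail example could look like, assuming Beam on Dataflow with a made-up Pub/Sub topic, BigQuery table, and fraud threshold: one branch raises alerts within seconds, the other lands the raw events so the overnight batch jobs have a complete set to report on.

    # A rough sketch of the hybrid pattern: one Pub/Sub stream feeds a low-latency
    # alerting branch while the raw events land in BigQuery for overnight batch
    # reporting. Topic, table, and the fraud threshold are made-up placeholders.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        events = (
            pipeline
            | "ReadTransactions" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/transactions")
            | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        )

        # Streaming path: flag suspicious transactions within seconds.
        (
            events
            | "FilterSuspicious" >> beam.Filter(lambda t: t["amount"] > 10_000)
            | "Alert" >> beam.Map(print)  # stand-in for a real alerting sink
        )

        # Landing path: persist everything so overnight batch jobs can build
        # reconciliation reports and revenue summaries from a complete set.
        (
            events
            | "WriteRaw" >> beam.io.WriteToBigQuery(
                "example-project:payments.raw_transactions",
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )

The batch half could then be as simple as a scheduled query or an overnight Dataflow or Dataproc job over that raw table. The exam cares that you recognize two paths with different latency profiles serving the same data, not the exact tooling.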

The trend, and what it means for the exam

Streaming has been growing as more workloads demand real-time insight. That does not mean batch is going away. Batch is still the right tool for many analytical workloads where freshness is measured in hours or days rather than seconds. The exam reflects this. You will see questions that test whether you can resist over-engineering a batch problem into a streaming pipeline just because streaming sounds more modern.

When you see a scenario about fraud detection, IoT telemetry, or live monitoring, think streaming. When you see scheduled financial reporting, billing runs, or end-of-day inventory reconciliation, think batch. When the question mixes urgent alerts with longer historical analysis, think hybrid.

My Professional Data Engineer course covers how this batch versus streaming distinction maps onto the specific Google Cloud services you will see on the exam, including Dataflow, Pub/Sub, BigQuery, and Dataproc.
