
One of the foundational distinctions on the Professional Data Engineer exam is the split between batch and streaming data processing. It sounds simple once you have seen it a few times, but the exam loves to test whether you can pick the right approach for a given scenario, and whether you understand the tradeoffs that come with each. I want to walk through how I think about this distinction when I am studying, and how I expect candidates to reason about it on test day.
Batch processing means data is collected first and then processed in large, scheduled chunks. The schedule can be a time interval like hourly, daily, or weekly. It can also be triggered when a threshold or condition is met, or even kicked off manually when someone decides it is time to run the job. The key idea is that the data sits and accumulates, and then a job runs against the whole set at once.
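To make the batch model concrete, here is a minimal Python sketch (the data and function names are my own, purely illustrative): events accumulate first, then a single scheduled job processes the entire bounded set at once.

```python
from datetime import date

# Toy events that have accumulated in storage since the last run.
events = [
    {"day": date(2024, 1, 1), "amount": 120.0},
    {"day": date(2024, 1, 1), "amount": 80.0},
    {"day": date(2024, 1, 2), "amount": 200.0},
]

def run_daily_batch(events):
    """Process the whole bounded set in one pass, e.g. on a nightly schedule."""
    totals = {}
    for event in events:
        totals[event["day"]] = totals.get(event["day"], 0.0) + event["amount"]
    return totals

print(run_daily_batch(events))
# {datetime.date(2024, 1, 1): 200.0, datetime.date(2024, 1, 2): 200.0}
```

Nothing happens until the job fires; the freshness of the output is bounded by the schedule, which is exactly the tradeoff the exam probes.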
You will sometimes see batch data referred to as bounded data, because the set is defined and finite before processing begins. The classic use cases are financial reporting, inventory management, and customer billing. None of those workloads require instant insight. They require a clean, complete picture of what happened over a defined window.
Stream processing handles data continuously as it arrives, in real time or near real time. There is no waiting around for a scheduled job to fire. The processing happens immediately on each event, or on small windows of events, as they come in.
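A contrasting sketch, again with hypothetical names: the handler runs once per event as the source yields it, and nothing accumulates waiting for a schedule.

```python
def transaction_source():
    """Stand-in for an unbounded source such as a message queue."""
    for txn in [
        {"id": "t1", "amount": 40.0},
        {"id": "t2", "amount": 2500.0},
        {"id": "t3", "amount": 15.0},
    ]:
        yield txn

def process_stream(source, handler):
    """Handle each event the moment it arrives; no batching up front."""
    for event in source:
        handler(event)

# Flag any transaction over an arbitrary threshold the instant it appears.
alerts = []
process_stream(
    transaction_source(),
    lambda txn: alerts.append(txn["id"]) if txn["amount"] > 1000 else None,
)
print(alerts)  # ['t2']
```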
Streaming data is often called unbounded data, because it keeps flowing and there is no defined end. The use cases that make streaming worth the trouble all share a need for rapid response. Fraud detection wants to flag a suspicious transaction before it clears. IoT sensor pipelines have devices that emit data constantly and need it processed as it lands. System monitoring needs insights quickly enough to react to problems before they cascade.
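Processing "small windows of events" typically means fixed (tumbling) windows keyed by event time. A toy sketch, with my own function name, grouping timestamped sensor readings into 60-second windows:

```python
def tumbling_windows(events, width_seconds):
    """Assign each (timestamp, value) pair to a fixed, non-overlapping window."""
    windows = {}
    for ts, value in events:
        window_start = (ts // width_seconds) * width_seconds
        windows.setdefault(window_start, []).append(value)
    return windows

# Readings at 0s, 42s, and 75s fall into the [0, 60) and [60, 120) windows.
readings = [(0, 21.5), (42, 21.7), (75, 22.0)]
print(tumbling_windows(readings, 60))  # {0: [21.5, 21.7], 60: [22.0]}
```

Real streaming engines also handle late and out-of-order data, which this sketch ignores; the windowing idea is the same.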
The exam will not ask you to recite definitions. It will give you a scenario and expect you to weigh the tradeoffs. Here is how I frame them.
Batch advantages:
- Simpler and cheaper to build and operate; the job runs on a schedule against a complete, bounded data set.
- Easy recovery: a failed job can simply be rerun over the same input.
- Efficient for large volumes, since data is processed in bulk rather than event by event.

Batch drawbacks:
- High latency; insights are only as fresh as the last completed run.
- Unsuitable when the business needs to react in seconds or minutes.

Streaming advantages:
- Low latency; events are processed as they arrive, enabling immediate action.
- A natural fit for continuous, unbounded sources like sensors and transaction feeds.

Streaming drawbacks:
- More complex to design and operate, with concerns like windowing and late-arriving data.
- Typically more expensive, because the pipeline runs continuously rather than on a schedule.
When a question gives you a scenario, I run through three considerations before I commit to an answer.
Data volume. How much data is being processed, and in what shape does it arrive? Large historical sets that land in cloud storage on a schedule point toward batch. Continuous high-frequency events from many producers point toward streaming.
Speed of analysis. How quickly do the insights need to be available? If the business needs to react in seconds, streaming is the answer. If a daily or weekly report is acceptable, batch is usually cheaper and simpler.
Infrastructure. Does the scenario suggest the team has the resources to run a streaming pipeline, or are they optimizing for cost and simplicity? Streaming costs more to run and operate, and the exam sometimes hints at that constraint.
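The three considerations above can be collapsed into a toy decision function. The names and thresholds here are my own and purely illustrative, not anything the exam defines:

```python
def recommend_approach(latency_need_seconds, arrives_continuously,
                       streaming_budget_ok=True):
    """Toy heuristic mirroring the three checks: speed, data shape, infrastructure."""
    if latency_need_seconds <= 60 and arrives_continuously:
        # Seconds-level reaction on a continuous feed points to streaming,
        # provided the team can afford to run and operate the pipeline.
        return "streaming" if streaming_budget_ok else "reconsider requirements"
    # Scheduled, bounded data with relaxed freshness points to batch.
    return "batch"

print(recommend_approach(5, True))       # streaming
print(recommend_approach(86400, False))  # batch
```

Real scenarios are messier, but on the exam this ordering of questions, latency first, then data shape, then operational cost, resolves most batch-versus-streaming choices.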
Real systems do not always pick one or the other. A common pattern is to run streaming for real-time alerts on urgent signals, and batch for in-depth reporting on the accumulated data later. You will see this called a lambda architecture in some materials, and the Professional Data Engineer exam expects you to know that combining both is often the right answer.
An example I keep in mind is a retail platform that uses streaming to detect fraud as transactions happen, and batch jobs overnight to produce reconciliation reports and revenue summaries. Same underlying data, two processing paths, each serving the latency profile that fits the use case.
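That retail example can be sketched as one dataset feeding two paths (all names and thresholds here are hypothetical): a streaming path that reacts per transaction, and a batch path that reports over everything accumulated.

```python
def hybrid_day(transactions, fraud_threshold=5000.0):
    """Same data, two processing paths with different latency profiles."""
    archive = []  # batch path: everything lands in storage for later
    alerts = []   # streaming path: urgent signals handled immediately
    for txn in transactions:
        if txn["amount"] > fraud_threshold:
            alerts.append(txn["id"])  # real-time fraud alert
        archive.append(txn)
    # Overnight batch job over the accumulated, now-bounded set.
    revenue = sum(txn["amount"] for txn in archive)
    return alerts, revenue

txns = [{"id": "a", "amount": 99.0},
        {"id": "b", "amount": 7200.0},
        {"id": "c", "amount": 301.0}]
print(hybrid_day(txns))  # (['b'], 7600.0)
```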
Streaming has been growing as more workloads demand real-time insight. That does not mean batch is going away. Batch is still the right tool for many analytical workloads where freshness is measured in hours or days rather than seconds. The exam reflects this. You will see questions that test whether you can resist over-engineering a batch problem into a streaming pipeline just because streaming sounds more modern.
When you see a scenario about fraud detection, IoT telemetry, or live monitoring, think streaming. When you see scheduled financial reporting, billing runs, or end-of-day inventory reconciliation, think batch. When the question mixes urgent alerts with longer historical analysis, think hybrid.
My Professional Data Engineer course covers how this batch versus streaming distinction maps onto the specific Google Cloud services you will see on the exam, including Dataflow, Pub/Sub, BigQuery, and Dataproc.