ETL and Data Preparation in BigQuery for the PDE Exam

GCP Study Hub
December 10, 2025

When I coach candidates for the Google Cloud Professional Data Engineer exam, the ETL and data preparation questions tend to trip people up not because the tools are obscure, but because the exam wants you to pick the right tool for the job. Two services come up over and over when the topic is preparing data and loading it into BigQuery: Dataflow and Dataprep by Trifacta. They solve overlapping problems in very different ways, and the Professional Data Engineer exam loves to probe whether you understand the difference.

In this article I want to walk through how I think about these two services on exam day, what kinds of scenarios point you toward each one, and how they fit into the broader ETL picture with BigQuery as the destination.

Why BigQuery sits at the center

Before I talk about the tools, it helps to remember why so many ETL questions on the Professional Data Engineer exam end in BigQuery. BigQuery is the default analytics destination on Google Cloud. It is serverless, separates storage from compute, and integrates natively with almost every ingestion and transformation service in the catalog. When the exam describes a pipeline that ends with analysts running SQL, dashboards rendering from a warehouse, or ML features being prepared at scale, the destination is almost always BigQuery.

That framing matters because the question is rarely where the data is going. The question is how you get it there cleanly, on schedule, and at the right cost.

Dataflow: the workhorse for pipelines

Dataflow is Google Cloud's managed service for running Apache Beam pipelines. I describe it as the workhorse because it handles the heavy ETL scenarios: large-scale transformations, streaming ingestion from Pub/Sub, batch jobs that need to fan out across hundreds of workers, and pipelines where you need real code rather than a UI.

Things that point to Dataflow on the exam:

  • The scenario mentions large-scale or high-volume data transformation.
  • The pipeline is streaming, often from Pub/Sub, and lands in BigQuery in near real time.
  • The team needs custom transformation logic in Java or Python.
  • The same pipeline needs to handle both batch and streaming with one codebase. Beam's unified model is the giveaway.
  • The question emphasizes autoscaling and managed infrastructure for ETL.

When I see those signals, I pick Dataflow and move on. The output sink in these scenarios is almost always BigQuery, written through the BigQuery I/O connector that Beam provides out of the box.
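To make that concrete, here is a minimal sketch of what such a pipeline looks like in Beam's Python SDK: read from Pub/Sub, clean each event, and write rows to BigQuery through the built-in connector. The project, topic, table, and field names are placeholders, not anything from a real scenario; the parsing logic is illustrative.

```python
import json


def parse_event(raw: str) -> dict:
    """Clean one raw JSON event into a row matching the BigQuery schema.

    This is the per-element transform a Beam pipeline applies before
    writing to BigQuery. Field names here are illustrative.
    """
    event = json.loads(raw)
    return {
        "user_id": event["user_id"],
        "amount_usd": round(float(event["amount"]), 2),
        "event_time": event["timestamp"],
    }


def run():
    # Requires apache-beam[gcp]; resource names below are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    opts = PipelineOptions(streaming=True)
    with beam.Pipeline(options=opts) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/events")
            | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
            | "Parse" >> beam.Map(parse_event)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:analytics.events",
                schema="user_id:STRING,amount_usd:FLOAT,event_time:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )
```

The point for exam purposes is the shape, not the details: custom code in Python, a streaming source, autoscaled managed execution, and BigQuery as the sink via the I/O connector.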

Dataprep by Trifacta: the no-code option

Dataprep is the other side of the coin. It is a no-code, visual interface for cleaning and transforming data before loading it into BigQuery. Instead of writing Beam code, a user clicks through the data, applies suggested transformations, and builds a recipe that Dataprep then executes (under the hood, it actually runs as a Dataflow job, but the user never has to think about that).

Things that point to Dataprep on the exam:

  • The user is a business analyst, data analyst, or someone who is explicitly described as not a developer.
  • The requirement is to explore, profile, or visually clean data.
  • The scenario emphasizes no code or self-service data preparation.
  • The dataset is messy and the team wants suggested transformations rather than writing them from scratch.
  • The output still goes to BigQuery, but the prep work is the focus.

If the question describes a small or mid-size cleanup job done by a non-engineer, with the results loaded into BigQuery, Dataprep is the answer. If the question describes a production-grade pipeline written by data engineers and running 24/7, it is Dataflow.

The trap question

The trap I see most often on the Professional Data Engineer exam looks like this: a scenario describes an analyst who needs to clean data and load it into BigQuery, and one of the answers is Dataflow. It is tempting to pick Dataflow because it is the more general-purpose tool, but the analyst persona is the signal. Dataprep exists precisely for that user, and the exam expects you to match the tool to the user.

The reverse trap also shows up. A scenario describes engineers building a streaming pipeline at high volume, and Dataprep appears as an option. It is wrong for the same reason: Dataprep is not built for production streaming workloads even though it runs on Dataflow underneath.

How I drill this for the exam

When I review ETL questions, I read the scenario looking for three things: who is doing the work, how big the workload is, and whether streaming is involved. Those three signals will resolve almost every Dataflow vs Dataprep question you see. Add BigQuery as the assumed destination and you have a quick decision tree that holds up under exam time pressure.
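That decision tree is simple enough to write down. This is just the heuristic above encoded as a function for drilling purposes; the signal names are my own labels, not exam terminology.

```python
def pick_tool(persona: str, streaming: bool, high_volume: bool) -> str:
    """Exam heuristic: who is doing the work, how big it is, and
    whether streaming is involved. BigQuery is the assumed destination
    either way, so only the transformation tool is in question."""
    # Streaming, scale, or an engineering team all point to Dataflow.
    if streaming or high_volume or persona == "engineer":
        return "Dataflow"
    # Small or mid-size, self-service prep by a non-engineer: Dataprep.
    return "Dataprep"


# An analyst visually cleaning a modest dataset.
pick_tool("analyst", streaming=False, high_volume=False)   # "Dataprep"
# Engineers running a high-volume streaming pipeline.
pick_tool("engineer", streaming=True, high_volume=True)    # "Dataflow"
```

Like any mnemonic, it oversimplifies, but it resolves both trap questions described above in the right direction.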

My Professional Data Engineer course covers ETL and data preparation, the Dataflow vs Dataprep decision, and how both services land data into BigQuery in production.
