ETL vs ELT for the PDE Exam: When Each Approach Wins

GCP Study Hub
June 29, 2025

One of the first conceptual splits the Professional Data Engineer exam asks you to reason about is ETL versus ELT. The acronyms differ by one letter, but the order of those letters changes which Google Cloud services you reach for, how you pay for storage, and how quickly downstream teams get their hands on data. I want to walk through how I think about the two patterns when a scenario question lands in front of me, and how to pick a winner without overthinking it.

The integration problem both patterns solve

The starting point is the same for both. Source data lives in operational databases, SaaS APIs, files dropped into buckets, and application logs. The shape is inconsistent, the schemas drift, and analytics teams need it centralized in a warehouse or lake. Extract, Transform, Load and Extract, Load, Transform are two ordering choices for that centralization. Both extract from sources, both end with usable data in a target system, and they differ on one question: do you transform the data before it lands, or after?

That single ordering decision drives almost every tradeoff you will see on the Professional Data Engineer exam, so I anchor my thinking there before I start matching services.

ETL: transform first, then load

ETL is the traditional pattern. You pull data out of the source, run it through a transformation layer where you clean fields, apply business rules, join lookups, and aggregate where needed, and only then load the curated result into the warehouse. The warehouse never sees the raw rows. It only sees the modeled output.

This approach makes sense in a few situations:

  • On-prem warehouses with limited compute. If your target system charges heavily for query time or cannot scale transformations cheaply, you do not want it grinding through dirty data.
  • Heavy upstream transformation. When the raw data needs significant cleaning, masking, or business logic before anyone should look at it, doing that work in flight keeps the warehouse tidy.
  • Sensitive data that should never land raw. PII redaction or tokenization is often easier to enforce when the warehouse is downstream of the transform step.

On Google Cloud, an ETL pipeline often looks like Cloud Storage or a source database feeding Dataflow, which applies the transformations before writing curated tables into BigQuery. Dataproc serves the same role when teams prefer Spark. The transform layer is the star, and the warehouse is just the destination.
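To make that ordering concrete, here is a minimal sketch of the pattern using the Apache Beam Python SDK, which is what Dataflow executes. The bucket, table, and field names are hypothetical placeholders, and it assumes the curated BigQuery table already exists.

```python
# Minimal ETL sketch with the Apache Beam Python SDK (runnable on Dataflow).
# Bucket, table, and field names below are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def clean_and_mask(record):
    """Apply business rules and redact PII before the warehouse sees the row."""
    row = json.loads(record)
    row["email"] = "REDACTED"                        # mask PII in flight
    row["amount"] = round(float(row["amount"]), 2)   # enforce a business rule
    return row


with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | "Extract" >> beam.io.ReadFromText("gs://example-bucket/raw/orders-*.json")
        | "Transform" >> beam.Map(clean_and_mask)
        | "Load" >> beam.io.WriteToBigQuery(
            "example-project:curated.orders",
            # Assumes the curated table exists; only modeled rows ever land.
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

Notice that BigQuery only ever receives the masked, cleaned rows. The raw records never land, which is exactly the property the sensitive-data scenarios are testing for.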

ELT: load first, then transform

ELT flips the order. You pull data out of the source and land it, raw, into the target system. Transformation happens later, inside the warehouse or lake, using its own compute. The warehouse is no longer just a destination. It is also the transformation engine.

ELT dominates cloud-native architectures for three reasons.

First, storage is cheap. Keeping multiple copies of raw data, in different formats, for different teams, used to be a budget conversation. In the cloud it is a rounding error. Marketing can have its slice, finance can have a different aggregation, and the raw layer keeps sitting there for audit and replay.

Second, ingestion is faster. The data is usable for analysts the moment it lands. You do not have to wait for a transformation job to finish before the warehouse has anything to show. Transformations happen on demand against the raw tables.

Third, modern warehouses are built for it. BigQuery is the canonical example. Its separation of storage and compute, combined with SQL-based transformation tools, makes load-first pipelines the natural choice. You ingest with Storage Transfer Service, the BigQuery Data Transfer Service, Datastream for change data capture, or Dataflow in append mode, and then you transform with scheduled queries, Dataform, or dbt running on top of BigQuery.
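As a rough sketch of the load-first half, here is what landing a raw file into BigQuery can look like with the google-cloud-bigquery Python client. The project, dataset, and bucket names are hypothetical; the point is that nothing is modeled before the load.

```python
# Minimal ELT ingestion sketch using the google-cloud-bigquery client:
# land the raw file first, transform later inside the warehouse.
# Project, dataset, and bucket names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # accept the raw schema as-is; model it later in SQL
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/raw/orders-2025-06-29.json",
    "example-project.raw.orders",
    job_config=job_config,
)
load_job.result()  # raw rows are queryable the moment this job completes
```

The transformation cost is deferred: analysts can query the raw table immediately, and all modeling runs later on BigQuery's own compute.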

How to pick on an exam question

When a Professional Data Engineer scenario question hands me a pipeline to design, I read for a few specific signals.

  • Is the target BigQuery or a data lake on Cloud Storage? Both lean toward ELT. The cloud warehouse can handle the transformation workload, and storage is cheap enough to keep raw layers.
  • Does the question stress fast ingestion or near-real-time availability? ELT wins because you skip the upstream transform step before data is visible.
  • Are there strict pre-load requirements like redaction, format conversion, or schema enforcement? ETL with Dataflow doing the heavy lifting is the safer pick.
  • Does the scenario mention multiple downstream consumers with different needs? ELT, because the raw layer can be transformed differently per consumer without rerunning extraction.
  • Is the existing environment on-prem or compute-constrained? ETL is usually the answer because you cannot lean on cheap, scalable cloud compute for transforms.

The exam will not always say "use ELT" directly. It will describe a team that wants fast ingestion, multiple analytical use cases, and a BigQuery target, and you have to recognize that as an ELT pattern. Or it will describe a regulated data flow that needs masking before it touches the warehouse, and that is ETL.

One thing not to overthink

You will sometimes see ETL and ELT framed as a moral choice, where ELT is "modern" and ETL is "legacy". That framing leaks into questions, but it is not how I would answer. Both are valid patterns and both come up on the exam. The right answer is the one that matches the constraints in the scenario, not the one that feels more current.

The pattern I see most often in Google Cloud architectures is a hybrid. Raw data lands in BigQuery or Cloud Storage through an ELT-style ingestion, and then Dataform or scheduled queries transform it into curated marts that look very much like the output of an ETL job. The ordering is ELT, but the final consumer-facing tables follow ETL hygiene. Knowing both patterns lets you read those hybrid designs correctly.
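As an illustration of that second half, here is a hedged sketch of the in-warehouse transform step, run here through the Python client. The same SQL could just as well live in a Dataform workflow or a scheduled query; all table and column names are hypothetical.

```python
# Sketch of the hybrid pattern's transform step: raw data already landed
# ELT-style, and this query builds a curated mart inside BigQuery.
# Table and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

curated_mart_sql = """
CREATE OR REPLACE TABLE curated.daily_order_totals AS
SELECT
  DATE(order_timestamp) AS order_date,
  region,
  SUM(amount) AS total_amount
FROM raw.orders
GROUP BY order_date, region
"""

client.query(curated_mart_sql).result()  # curated mart now follows ETL hygiene
```

Running something like this on a schedule is what turns a one-off query into the curated layer of the hybrid design.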

My Professional Data Engineer course covers ETL and ELT pipeline design on Google Cloud, including which services match each pattern and how to read scenario questions for the signal that picks the winner.
