Apache Big Data Tools and Their GCP Equivalents for the PDE Exam

GCP Study Hub
June 20, 2025

One pattern shows up again and again on the Google Cloud Professional Data Engineer exam: a company is running some Apache tool on-prem, and the question asks which Google Cloud product they should migrate to. If you know the mapping cold, these questions take ten seconds. If you don't, you can talk yourself into the wrong answer by overthinking the use case.

So in this post I want to walk through the Apache-to-GCP equivalents I drill into my Professional Data Engineer students, why these mappings exist, and the small distinctions that the exam likes to test.

Why the Apache ecosystem matters for the PDE exam

Big data tooling grew out of a real problem. As datasets got larger, traditional single-machine systems couldn't keep up, and a wave of open-source projects emerged to handle the scale. Most of them ended up under the Apache Software Foundation, which is why so many of the names you see on the exam start with the word "Apache".

Three principles shaped almost all of them:

  • Distributed and parallel processing: split a big job into smaller tasks that run across many machines at once.
  • Scalability: add more nodes (horizontal) or beefier nodes (vertical) as data volume grows.
  • Fault tolerance: assume nodes will fail and keep the job running anyway.

GCP didn't reinvent any of these wheels. Instead, Google built managed services that take the same open-source frameworks and run them with autoscaling, security, and performance tuning handled for you. That's the shape of the migration questions on the exam: the customer wants the open-source tool's capabilities without the operational overhead of running it.

The six mappings to memorize

Here are the six pairings that show up most often on the Professional Data Engineer exam.

Apache Kafka maps to Pub/Sub

Kafka is the workhorse of real-time event streaming. If a customer is running Kafka on-prem to move events between producers and consumers, the GCP equivalent is Pub/Sub, a fully managed messaging service that scales automatically and doesn't require you to provision brokers or manage partitions.

Where the exam likes to trip people up: Kafka guarantees ordering within a partition, while Pub/Sub makes no ordering guarantee by default. If a question emphasizes strict ordering, Pub/Sub still works, but you need to enable ordering keys, as in the sketch below.
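
Here's a minimal sketch of what that looks like with the Python client library. The project, topic, and key names are made up, and the subscription on the receiving end must also have message ordering enabled:

    from google.cloud import pubsub_v1

    # Ordering must be switched on at the publisher; it is off by default.
    publisher = pubsub_v1.PublisherClient(
        publisher_options=pubsub_v1.types.PublisherOptions(
            enable_message_ordering=True
        )
    )
    topic_path = publisher.topic_path("my-project", "orders")  # hypothetical names

    # Messages that share an ordering key are delivered to subscribers in
    # publish order, provided the subscription also enables ordering.
    for event in [b"created", b"paid", b"shipped"]:
        future = publisher.publish(topic_path, event, ordering_key="order-123")
        future.result()  # block until the publish succeeds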

Apache Beam maps to Dataflow

Beam is a programming model for writing batch and stream pipelines with the same code. Dataflow is Google's managed runner for Beam. You write your pipeline in the Beam SDK, submit it to Dataflow, and Google handles worker autoscaling, the shuffle service, and the rest of the execution details for both batch and streaming runs.

If a question mentions "unified batch and stream" or shows Beam SDK code, the answer is Dataflow.
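
For context, here's roughly what that Beam SDK code looks like in Python. The bucket paths are hypothetical; the key point is that the identical pipeline runs locally with the DirectRunner or on Dataflow just by changing command-line options:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Pass --runner=DataflowRunner --project=... --region=... --temp_location=...
    # on the command line to execute this same pipeline on Dataflow.
    options = PipelineOptions()

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/events/*.json")
            | "Count" >> beam.combiners.Count.Globally()
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/count")
        )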

Apache Hadoop and Apache Spark map to Dataproc

Hadoop (with HDFS and MapReduce) and Spark are both used for large-scale data processing, and on GCP they share a single managed home: Dataproc. The Dataproc value prop is that you can spin up a Hadoop or Spark cluster in about 90 seconds, run your job, and tear it down. That ephemeral cluster pattern is exam gold because it lets you treat compute as disposable.

If a customer is lifting an existing Spark or Hadoop workload to GCP with minimal code changes, Dataproc is the answer. If they're willing to rewrite, Dataflow is often a better long-term fit.
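
To make the "minimal code changes" point concrete, here's a sketch of a PySpark job that would run on Dataproc essentially as-is. The bucket paths are hypothetical; in practice the main change from an on-prem job is swapping hdfs:// paths for gs:// ones:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count").getOrCreate()

    # Dataproc clusters read Cloud Storage natively through the gs:// connector,
    # so HDFS paths usually become gs:// paths with no other changes.
    lines = spark.read.text("gs://my-bucket/input/*.txt")
    counts = lines.groupBy("value").count()
    counts.write.csv("gs://my-bucket/output/")

    spark.stop()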

Apache Airflow maps to Cloud Composer

Airflow is the standard for orchestrating data pipelines as DAGs. Cloud Composer is the managed Airflow service on GCP. Same DAG code, same operators, but Google runs the scheduler, web server, and workers for you on GKE under the hood.

If a question mentions DAGs, Python orchestration code, or a customer migrating an Airflow setup, Cloud Composer is the answer.
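
As a sketch of why the migration is so direct, here's a minimal Airflow DAG. The DAG and task names are made up, and the same file works whether it lives on a self-managed Airflow instance or in a Cloud Composer environment's DAGs folder:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_export",            # hypothetical pipeline name
        start_date=datetime(2025, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = BashOperator(task_id="load", bash_command="echo loading")

        # Identical dependency syntax on Airflow and Composer: extract, then load.
        extract >> load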

Apache HBase maps to Bigtable

HBase is a distributed, wide-column, real-time NoSQL database modeled on Google's original Bigtable paper. So it's a clean loop: HBase migrates to Bigtable, and Bigtable even exposes an HBase-compatible API so you often don't need to rewrite client code.

The exam loves the combination of high-throughput writes, low-latency reads, and time-series or IoT workloads. That cluster of clues points to Bigtable.
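
Here's a rough sketch of a time-series write using the Python Bigtable client (an HBase shop would more likely keep its existing Java code against the HBase-compatible API). The project, instance, table, and column family names are all hypothetical:

    from datetime import datetime, timezone

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("iot-instance").table("sensor_readings")

    # Row keys like device#timestamp keep a device's recent readings
    # contiguous, the classic Bigtable time-series schema pattern.
    row = table.direct_row(b"device-42#20250620T120000")
    row.set_cell(
        "metrics",            # column family
        b"temperature",       # column qualifier
        b"21.7",              # cell values are raw bytes
        timestamp=datetime.now(timezone.utc),
    )
    row.commit()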

Apache Hive maps to BigQuery

Hive sits on top of Hadoop and lets you run SQL-like queries over data stored in HDFS. The GCP equivalent is BigQuery, a serverless, fully managed data warehouse with a powerful SQL dialect and separate storage and compute.

BigQuery is more than a Hive replacement, but for migration questions where the customer is running Hive on a Hadoop cluster and wants SQL analytics without managing infrastructure, BigQuery is the target.
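
For contrast with running HiveQL on a cluster you manage, here's a sketch of the serverless workflow with the Python BigQuery client. The dataset and table names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    query = """
        SELECT device_id, AVG(temperature) AS avg_temp
        FROM `my-project.telemetry.sensor_readings`
        GROUP BY device_id
        ORDER BY avg_temp DESC
        LIMIT 10
    """

    # No cluster to size, start, or tear down; BigQuery allocates compute
    # per query and, on-demand, bills by bytes scanned.
    for row in client.query(query).result():
        print(row.device_id, row.avg_temp)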

How to recognize these questions on the exam

The migration questions tend to follow a pattern. You'll see a sentence or two of context about the customer's current stack, almost always naming the Apache tool, followed by a constraint like "with minimal operational overhead" or "managed by Google". The right answer is usually the direct mapping from the list above.

The trap answers fall into two buckets. Either they swap two GCP products that sound similar (Dataflow vs Dataproc, Bigtable vs BigQuery), or they offer a technically possible but operationally heavier choice, like running Hadoop on Compute Engine VMs instead of using Dataproc. When in doubt, pick the most managed option that still matches the workload.

My Professional Data Engineer course covers each of these Apache-to-GCP mappings with side-by-side architecture diagrams and the specific migration patterns the exam tests.
