
One of the things the Professional Data Engineer exam likes to test is whether you understand that Dataproc is not stuck working only with Cloud Storage or HDFS. If your data lives in BigQuery or Bigtable, you do not need to move it to run a Spark, Hive, or Hadoop job against it. Google provides specialized connectors that let Dataproc reach into those services directly, and knowing which connector applies to which engine is the kind of detail the exam rewards.
In this article I will walk through the BigQuery connectors, the two ways to wire Dataproc up to Bigtable, and a worked exam-style question that ties the connector story back to Cloud Storage and HDFS. If you are studying for the Professional Data Engineer certification, this is one of those topics where memorizing the names of the connectors pays off directly.
Dataproc has three connectors that let it talk to BigQuery. The reason there are three is that Dataproc runs three different processing engines (Spark, Hadoop MapReduce, and Hive), and each one needs its own bridge:

- The Spark-BigQuery Connector, for Spark jobs
- The Hadoop-BigQuery Connector, for Hadoop MapReduce jobs
- The Hive-BigQuery Connector, for Hive queries
The pattern is consistent across all three: the connector sits between the engine and BigQuery, so jobs read and write BigQuery tables directly rather than going through an export-and-copy step. You should reach for these connectors if you already have datasets in BigQuery, or if you are migrating something like Apache Impala data directly into BigQuery. Moving the compute to where the data is, instead of copying the data around, is the more cloud-native approach.
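To make the Spark side of that pattern concrete, here is a minimal sketch. The `spark.read.format("bigquery")` data-source name comes from the Spark-BigQuery Connector; the project, dataset, and table names are placeholders I made up, and only the small table-reference helper actually runs here. The Spark call assumes a Dataproc cluster where the connector jar is available, which is the case on standard Dataproc images.

```python
def bq_table_ref(project: str, dataset: str, table: str) -> str:
    # The Spark-BigQuery Connector accepts table references in
    # "project.dataset.table" form.
    return f"{project}.{dataset}.{table}"


def read_bigquery_table(table_ref: str):
    # Sketch only: assumes a Dataproc cluster where the Spark-BigQuery
    # Connector is on the classpath, so no data is copied out of
    # BigQuery ahead of time.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-read").getOrCreate()
    return spark.read.format("bigquery").option("table", table_ref).load()
```

The point of the helper is the shape of the reference: the job addresses the BigQuery table in place, the same way it would name a file path in other formats.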
For Bigtable, there are two options instead of three, and they map cleanly onto how you want to interact with the data:

- The HBase Client, if you want to work with Bigtable through the familiar HBase API
- The Bigtable-Spark Connector, if you want to run Spark jobs against Bigtable
As with the BigQuery side, the advice is the same. If you already have data in Bigtable, or you are migrating Apache HBase data into Bigtable, these connectors are the right answer rather than copying the data into Cloud Storage or HDFS first.
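As a concrete sketch of the Spark path, the open-source spark-bigtable connector exposes Bigtable as a Spark data source. The catalog shape and option names below are assumptions based on that connector's published examples and should be checked against its current docs; the project, instance, and table IDs are placeholders, and only the catalog-building helper actually executes here.

```python
import json


def bigtable_catalog(table_name: str) -> str:
    # Minimal catalog mapping a Bigtable table to Spark columns.
    # Assumption: this shape follows the spark-bigtable connector's
    # examples; real schemas would list more columns per column family.
    return json.dumps({
        "table": {"name": table_name},
        "rowkey": "key",
        "columns": {
            "key": {"cf": "rowkey", "col": "key", "type": "string"},
        },
    })


def read_bigtable(project_id: str, instance_id: str, table_name: str):
    # Sketch only: assumes a Dataproc cluster with the Bigtable-Spark
    # connector jar available. Option names are taken from the
    # connector's examples, not something to quote as exam gospel.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    return (spark.read.format("bigtable")
            .option("catalog", bigtable_catalog(table_name))
            .option("spark.bigtable.project.id", project_id)
            .option("spark.bigtable.instance.id", instance_id)
            .load())
```

Either way, the data stays in Bigtable and the job comes to it, which is the same cloud-native posture as the BigQuery connectors.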
Here is the kind of scenario you can expect to see on the Professional Data Engineer exam. You are planning to move on-prem Hive data to Dataproc on Google Cloud. The Hive files have already been uploaded to a Cloud Storage bucket, but some of the data also needs to be stored in Dataproc's HDFS. What are two methods you can use within Dataproc to achieve this?
The two correct methods are:

- Use the Cloud Storage Connector, so Dataproc jobs read the Hive files directly from the bucket as if the bucket were part of HDFS.
- Copy the files from the bucket onto the cluster's master node, then load them into HDFS from there.
The reason this question matters is that it forces you to remember that Dataproc has multiple paths to data. The Cloud Storage Connector lets Dataproc treat a bucket as if it were HDFS, so you do not need to copy data into the cluster to use it. But when you genuinely need HDFS performance, copying to the master node and into HDFS is still on the table. The exam wants to see that you understand both, and that you do not default to one strategy when the other is a better fit.
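The Cloud Storage Connector half of that answer can be sketched in PySpark. Reading a `gs://` URI directly works because Dataproc ships the connector; the bucket and path below are placeholders, and only the URI helper actually runs here.

```python
def gcs_uri(bucket: str, path: str) -> str:
    # With the Cloud Storage Connector, Dataproc jobs address bucket
    # objects by gs:// URI, the same way they would an HDFS path.
    return f"gs://{bucket}/{path.lstrip('/')}"


def read_hive_files_from_bucket(uri: str):
    # Sketch only: on a Dataproc cluster the Cloud Storage Connector is
    # preinstalled, so Spark can load straight from the bucket without
    # first copying the files into HDFS.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    return spark.read.text(uri)
```

The second method from the question, copying to the master node and into HDFS, trades that convenience for local disk performance when a job genuinely needs it.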
If you take one thing away from this topic, let it be the mapping. Spark talks to BigQuery through the Spark-BigQuery Connector. Hadoop talks to BigQuery through the Hadoop-BigQuery Connector. Hive talks to BigQuery through the Hive-BigQuery Connector. Bigtable is reached through either the HBase Client or the Bigtable-Spark Connector. And Cloud Storage is reached through the Cloud Storage Connector, which is what makes the cloud-native pattern of leaving data in a bucket workable in the first place.
My Professional Data Engineer course covers Dataproc connectors for BigQuery and Bigtable, along with the Cloud Storage and HDFS storage decisions they sit alongside.