Dataproc Connectors: BigQuery and Bigtable for the PDE Exam

GCP Study Hub
October 27, 2025

One of the things the Professional Data Engineer exam likes to test is whether you understand that Dataproc is not stuck working only with Cloud Storage or HDFS. If your data lives in BigQuery or Bigtable, you do not need to move it to run a Spark, Hive, or Hadoop job against it. Google provides specialized connectors that let Dataproc reach into those services directly, and knowing which connector applies to which engine is the kind of detail the exam rewards.

In this article I will walk through the BigQuery connectors, the two ways to wire Dataproc up to Bigtable, and a worked exam-style question that ties the connector story back to Cloud Storage and HDFS. If you are studying for the Professional Data Engineer certification, this is one of those topics where memorizing the names of the connectors pays off directly.

The three BigQuery connectors for Dataproc

Dataproc has three connectors that let it talk to BigQuery. The reason there are three is that Dataproc runs three different processing engines (Spark, Hadoop MapReduce, and Hive) and each one needs its own bridge.

  • Spark-BigQuery Connector: enables integration between Dataproc Spark jobs and BigQuery. The canonical use case is running a Spark ML job in Dataproc on data that lives in BigQuery, reading and writing through the connector instead of staging the data somewhere else first (see the sketch after this list).
  • Hadoop-BigQuery Connector: lets Hadoop mappers and reducers interact with BigQuery tables. It works by providing simplified versions of the InputFormat and OutputFormat classes, so a MapReduce job can treat BigQuery as a source or sink without custom plumbing.
  • Hive-BigQuery Connector: ships a Storage Handler that lets Apache Hive interact directly with BigQuery tables using HiveQL syntax. If you already write Hive queries, you do not need to learn anything new to query BigQuery through this connector.
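
To make the Spark case concrete, here is a minimal PySpark sketch of the Spark-BigQuery Connector. It assumes the connector jar is available on the cluster (recent Dataproc images bundle it; otherwise you would pass it with --jars at submit time), and the project, dataset, and bucket names are placeholders, not anything from a real environment.

```python
# Minimal sketch: read a BigQuery table into Spark, transform it, write it back.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-connector-demo").getOrCreate()

# Read a BigQuery table directly into a Spark DataFrame.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.source_table")  # hypothetical table
    .load()
)

# Any ordinary Spark transformation works from here.
summary = df.groupBy("some_column").count()

# Write the result back to BigQuery. The indirect write path stages data
# in a Cloud Storage bucket first, so a temporary bucket is required.
(
    summary.write.format("bigquery")
    .option("table", "my-project.my_dataset.result_table")  # hypothetical table
    .option("temporaryGcsBucket", "my-staging-bucket")      # hypothetical bucket
    .mode("overwrite")
    .save()
)
```

Notice that no data is staged by hand anywhere: the connector handles moving bytes between BigQuery and the Spark executors.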

The pattern is consistent across all three. Reach for these connectors when you already have datasets in BigQuery, or when you are migrating something like Apache Impala data directly into BigQuery. Moving the compute to where the data is, instead of copying the data around, is the more cloud-native approach.

Connecting Dataproc to Bigtable

For Bigtable, there are two options instead of three, and they map cleanly onto how you want to interact with the data.

  • HBase Client: connects your Dataproc cluster to Bigtable through an HBase interface. You can use the HBase Shell from your terminal or the HBase client APIs, and you interact with the Bigtable instance using familiar HBase commands. This is the right choice when you are already working with HBase or when you need low-level access to tables and rows.
  • Bigtable-Spark Connector: lets Dataproc Spark jobs interact directly with Bigtable. You get Spark's distributed processing layered on top of Bigtable's scalability, which is a strong fit for large-scale workloads like Spark ML jobs running on Bigtable data (see the sketch after this list).
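
Here is a rough PySpark sketch of what the Bigtable-Spark Connector's DataFrame API looks like, based on the open-source Bigtable Spark connector. Option names can vary by connector version, the connector jar is assumed to be on the cluster, and the project, instance, table, and column names below are all placeholders. The catalog JSON maps DataFrame columns onto a Bigtable row key and column families.

```python
# Rough sketch: read Bigtable rows into a Spark DataFrame via the connector.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigtable-connector-demo").getOrCreate()

# Describes how DataFrame columns map onto the Bigtable table layout:
# "id" is the row key, "score" lives in a "metrics" column family.
catalog = json.dumps({
    "table": {"name": "my_table"},                               # hypothetical table
    "rowkey": "id",
    "columns": {
        "id":    {"cf": "rowkey",  "col": "id",    "type": "string"},
        "score": {"cf": "metrics", "col": "score", "type": "long"},
    },
})

df = (
    spark.read.format("bigtable")
    .option("catalog", catalog)
    .option("spark.bigtable.project.id", "my-project")    # hypothetical project
    .option("spark.bigtable.instance.id", "my-instance")  # hypothetical instance
    .load()
)

# From here it is an ordinary DataFrame, so Spark's full engine applies.
df.filter(df.score > 100).show()
```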

As with the BigQuery side, the advice is the same. If you already have data in Bigtable, or you are migrating Apache HBase data into Bigtable, these connectors are the right answer rather than copying the data into Cloud Storage or HDFS first.

An exam-style question that ties it all together

Here is the kind of scenario you can expect to see on the Professional Data Engineer exam. You are planning to move on-prem Hive data to Dataproc on Google Cloud. The Hive files have already been uploaded to a Cloud Storage bucket, but some of the data also needs to be stored in Dataproc's HDFS. What are two methods you can use within Dataproc to achieve this?

The two correct methods are:

  • Use the Cloud Storage Connector to access the Hive tables where they are. Define them as external Hive tables, and replicate them to HDFS. This keeps the data in its original Cloud Storage location while still making it available in HDFS for jobs that need it there.
  • Move the Hive data to the cluster's master node, copy it into HDFS with the hadoop command-line utility, and then mount it from HDFS. This is the more direct, traditional approach when you want low-latency, in-cluster access.

The reason this question matters is that it forces you to remember that Dataproc has multiple paths to data. The Cloud Storage Connector lets Dataproc treat a bucket as if it were HDFS, so you do not need to copy data into the cluster to use it. But when you genuinely need HDFS performance, copying to the master node and into HDFS is still on the table. The exam wants to see that you understand both, and that you do not default to one strategy when the other is a better fit.
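
A short PySpark sketch makes the two paths tangible. On Dataproc the Cloud Storage Connector is preinstalled, so gs:// paths work anywhere an hdfs:// path would. The bucket, directory, and table names here are placeholders for illustration only.

```python
# Sketch of both access paths from the exam scenario.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("gcs-vs-hdfs-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# Path 1: leave the data in Cloud Storage and define an external Hive table
# over it. No copy happens; the connector reads the bucket on demand.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_external (id STRING, amount DOUBLE)
    STORED AS PARQUET
    LOCATION 'gs://my-migration-bucket/hive/sales/'
""")

# Path 2: read data that was already copied into the cluster's HDFS
# (for example with `hadoop fs -put`) when in-cluster performance matters.
hdfs_df = spark.read.parquet("hdfs:///data/hive/sales/")

# Both paths yield ordinary DataFrames from here on.
spark.sql("SELECT COUNT(*) FROM sales_external").show()
hdfs_df.groupBy("id").sum("amount").show()
```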

What to remember for the exam

If you take one thing away from this topic, let it be the mapping. Spark talks to BigQuery through the Spark-BigQuery Connector. Hadoop talks to BigQuery through the Hadoop-BigQuery Connector. Hive talks to BigQuery through the Hive-BigQuery Connector. Bigtable is reached through either the HBase Client or the Bigtable-Spark Connector. And Cloud Storage is reached through the Cloud Storage Connector, which is what makes the cloud-native pattern of leaving data in a bucket workable in the first place.

My Professional Data Engineer course covers Dataproc connectors for BigQuery and Bigtable, along with the Cloud Storage and HDFS storage decisions they sit alongside.
