Loading Data into BigQuery for the Professional Cloud Database Engineer Exam

GCP Study Hub
June 10, 2026

BigQuery offers several ways to get data in, and the right one depends on whether you want a manual approach, a programmatic one, or a fully automated pipeline. The Professional Cloud Database Engineer exam tends to test whether you can match an ingestion requirement to the appropriate method, so it helps to know what each path is for rather than just that it exists. The choices span one-off console uploads, scriptable command-line loads, a high-throughput API that handles both batch and streaming, a managed transfer service for external sources, and integrations with other Google Cloud services for more involved pipelines.

Console uploads and supported sources

The most common entry point is the BigQuery console, which provides a graphical interface for one-off tasks. You can use it to upload local files from your machine, or to pull from storage services like Cloud Storage, Google Drive, and Bigtable. It also supports external cloud sources, including Azure Blob Storage and AWS S3. The console accepts a specific set of file types: CSV, JSONL (newline-delimited JSON), Avro, Parquet, and ORC. This is usually the method to reach for with smaller datasets, or when you are just starting to explore a new data source and want to see it land in a table quickly.

The bq load command

For more repeatable workflows, the bq load command loads files from your terminal, whether they sit on your local machine or in Cloud Storage. Using the command line is generally faster for repetitive tasks and large data operations than manual uploads through the browser, and it is easily scriptable. That makes it a better fit when you need to automate part of your data loading process rather than click through the console each time.
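
As a rough sketch of what scripting a load could look like, the snippet below assembles a bq load invocation in Python. The dataset, table, and bucket names are hypothetical placeholders, and actually running the command assumes the bq tool is installed and authenticated.

```python
import shlex


def build_bq_load_command(table, source_uri, source_format="CSV", replace=False):
    """Assemble the argument list for a `bq load` invocation."""
    args = ["bq", "load", f"--source_format={source_format}"]
    if source_format == "CSV":
        # Skip the header row and let BigQuery infer the schema.
        args += ["--skip_leading_rows=1", "--autodetect"]
    if replace:
        # Overwrite the destination table instead of appending to it.
        args.append("--replace")
    args += [table, source_uri]
    return args


# Hypothetical dataset, table, and bucket names.
command = build_bq_load_command("sales_ds.orders", "gs://example-bucket/orders.csv")
print(shlex.join(command))
# To execute it for real: subprocess.run(command, check=True)
```

Keeping the command as a list rather than a single string makes it straightforward to hand to subprocess without shell quoting problems, and the same function can be called in a loop over many files.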

The BigQuery Storage Write API

When you need to handle high-velocity data, the BigQuery Storage Write API is the preferred method. It combines batch and streaming ingestion into a single interface, so you do not have to choose a separate mechanism depending on how the data arrives. It also supports exactly-once delivery semantics: when writes go to an application-created stream with explicit offsets, a retried append is not written twice, so you do not have to deduplicate records downstream, even at large scale. (The default stream trades this for simplicity and is at-least-once.) If a question describes streaming ingestion with strict correctness requirements, this is the path the exam is usually pointing at.
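
To make the exactly-once idea concrete, here is a toy in-memory model of how offset-checked appends let a writer retry safely. It is not the real client library: the class and method names are invented for illustration, and the real API reports an already-written offset as an error rather than returning a string.

```python
class ToyStream:
    """In-memory stand-in for a write stream that checks append offsets."""

    def __init__(self):
        self.rows = []

    def append(self, offset, batch):
        """Append `batch` only if `offset` is the next unwritten position."""
        if offset < len(self.rows):
            # Data at this offset was already written, so the retry is ignored.
            return "ALREADY_EXISTS"
        if offset > len(self.rows):
            # A gap means an earlier batch never arrived.
            return "OUT_OF_RANGE"
        self.rows.extend(batch)
        return "OK"


stream = ToyStream()
print(stream.append(0, ["a", "b"]))  # first attempt
print(stream.append(0, ["a", "b"]))  # retry after a lost acknowledgement
print(stream.append(2, ["c"]))       # next batch
print(stream.rows)
```

The retry at offset 0 is rejected instead of duplicating rows, which is the property that lets a client resend after a timeout without worrying about double writes.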

The BigQuery Data Transfer Service

For external sources that you want to keep synchronized without writing custom code, there is the BigQuery Data Transfer Service. It is a fully managed solution that automates data ingestion on a scheduled basis. It connects natively to Google services like Google Ads and YouTube, to SaaS applications such as Salesforce, and to other cloud providers. The point of it is to keep BigQuery datasets in step with your business applications on a recurring schedule, rather than moving data once by hand.
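
A scheduled transfer is essentially a small piece of configuration: a source, a destination dataset, and a schedule. The sketch below builds such a configuration as a plain dictionary for a Cloud Storage source. The field and parameter names follow the transfer config resource as I understand it and should be checked against the current documentation; the dataset, table, and bucket names are hypothetical.

```python
def build_transfer_config(dataset_id, table, source_path, schedule="every 24 hours"):
    """Return a dict describing a scheduled Cloud Storage transfer."""
    return {
        "display_name": f"Scheduled load into {table}",
        # Assumed data source identifier for Cloud Storage transfers.
        "data_source_id": "google_cloud_storage",
        "destination_dataset_id": dataset_id,
        "schedule": schedule,
        # Assumed parameter names for this data source; verify before use.
        "params": {
            "data_path_template": source_path,
            "destination_table_name_template": table,
            "file_format": "CSV",
        },
    }


# Hypothetical dataset, table, and bucket names.
config = build_transfer_config("sales_ds", "orders", "gs://example-bucket/orders/*.csv")
print(config["schedule"])
```

With the google-cloud-bigquery-datatransfer package installed, a dictionary like this would typically be passed as the transfer_config argument of DataTransferServiceClient.create_transfer_config, alongside a parent project path.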

Integrations with other Google Cloud services

Beyond these direct methods, BigQuery integrates with other Google Cloud services for more complex pipelines. Datastream supports continuous ingestion, Dataflow handles batch or stream processing, the managed service for Apache Spark (Dataproc) runs Spark jobs, Pub/Sub carries real-time messaging, and Cloud Data Fusion provides visual pipeline building. These come into play when the work involves more than landing files, such as transforming data in flight or coordinating several systems.

CDC and replication into BigQuery

One ingestion requirement worth singling out is change data capture, or CDC, and replication into BigQuery from another database. CDC means capturing the changes in a source database and reflecting them in the destination, so that BigQuery stays synchronized with the operational system. Datastream is the service designed for these replication scenarios, and it keeps changes flowing into the data warehouse with minimal latency.

Two patterns tend to show up. In the first, you start with a source database where the primary transactions occur, and Datastream pulls the change logs and streams them directly into BigQuery. The goal is to keep BigQuery updated in near real time by continuously applying incremental updates, so analytical queries run against the most current version of the operational data.
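
The idea of continuously applying incremental updates can be illustrated with a few lines of Python. This is a conceptual sketch only: the event shape (op, key, row) is invented for the example and is not Datastream's actual record format, and the replica is a dictionary standing in for a BigQuery table.

```python
def apply_change(replica, event):
    """Apply one change event to a replica keyed by primary key."""
    key = event["key"]
    if event["op"] == "DELETE":
        replica.pop(key, None)
    else:
        # INSERT and UPDATE both upsert the latest row image.
        replica[key] = event["row"]
    return replica


# Hypothetical change log captured from the source database, in commit order.
events = [
    {"op": "INSERT", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "INSERT", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "UPDATE", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "DELETE", "key": 2},
]

replica = {}
for event in events:
    apply_change(replica, event)
print(replica)
```

After the four events are replayed, the replica holds only row 1 in its latest state, which mirrors what an analytical query would see once the changes have been applied.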

The second pattern adds steps after Datastream so that the data can be transformed before it loads. It again begins with a source database and uses Datastream to capture the change events, but the data is first written to Cloud Storage. Once it lands in a bucket, Dataflow picks up the files and processes them before they reach BigQuery. This middle stage is used when the data needs extra processing first, such as cleaning, filtering, or reformatting records. It is also the approach to use when combining multiple sources, because landing the data in storage and processing it with Dataflow lets you merge data from various systems into a common schema by the time it reaches BigQuery.
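
The kind of work that middle stage does can be sketched in plain Python. The example below cleans, filters, and reformats records from two sources into one schema. It stands in for the logic a Dataflow pipeline might run and does not use the Dataflow or Apache Beam APIs; the source names and field names are hypothetical.

```python
def normalize_crm(record):
    """Map a record from a hypothetical CRM source to the common schema."""
    return {
        "customer_id": str(record["CustomerId"]),
        "email": record["Email"].strip().lower(),
    }


def normalize_billing(record):
    """Map a record from a hypothetical billing source to the common schema."""
    return {
        "customer_id": str(record["cust_id"]),
        "email": record["contact_email"].strip().lower(),
    }


def transform(records, normalizer):
    """Reformat each record and drop rows that have no email address."""
    rows = []
    for record in records:
        row = normalizer(record)
        if row["email"]:
            rows.append(row)
    return rows


crm = [{"CustomerId": 7, "Email": " Ada@Example.com "}]
billing = [
    {"cust_id": "8", "contact_email": "bob@example.com"},
    {"cust_id": "9", "contact_email": "  "},
]

merged = transform(crm, normalize_crm) + transform(billing, normalize_billing)
print(merged)
```

Each source gets its own small mapping function, so adding another system means writing one more normalizer rather than changing the shared filtering step.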

For the Professional Cloud Database Engineer exam, the useful distinction is which method fits the requirement in front of you. Console uploads and bq load cover manual and scriptable loads, the Storage Write API covers high-throughput streaming with exactly-once delivery, the Data Transfer Service covers scheduled syncs from external sources, and Datastream covers CDC, with Dataflow added when the replicated data needs transformation along the way.
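
As a study aid, the mapping in that summary can be written down as a small lookup table. The requirement labels are my own shorthand rather than official exam wording.

```python
# Requirement labels are informal shorthand, not official terminology.
INGESTION_METHODS = {
    "one-off manual upload": "BigQuery console",
    "scriptable file load": "bq load",
    "high-throughput streaming": "BigQuery Storage Write API",
    "scheduled sync from external source": "BigQuery Data Transfer Service",
    "cdc replication": "Datastream",
    "cdc replication with transformation": "Datastream, Cloud Storage, and Dataflow",
}


def pick_method(requirement):
    """Look up the ingestion method that matches a requirement label."""
    label = requirement.strip().lower()
    return INGESTION_METHODS.get(label, "no direct match")


print(pick_method("CDC replication"))
print(pick_method("Scheduled sync from external source"))
```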

