BigQuery Best Practices for the PDE Exam: Query, Ingest, and Storage

GCP Study Hub
January 25, 2026

BigQuery best practices are one of the highest-yield study areas for the Professional Data Engineer exam. The exam loves to put you in a scenario where a query is too slow, a bill is too high, or a pipeline is wasting money on streaming inserts, and then ask you to pick the option that fixes it without breaking anything else. If you internalize a handful of patterns across query design, ingestion, and storage, a surprising number of questions answer themselves.

Here is how I think about each layer when I am preparing candidates for the Professional Data Engineer exam.

Query design: scan less, compute less

BigQuery charges you for bytes scanned on on-demand queries, so the cheapest query is the one that touches the fewest columns and the fewest rows. That single sentence drives most of the query best practices you need for the exam.

  • Never use SELECT *. Specify the columns you actually need. BigQuery is columnar, so listing fewer columns directly reduces bytes scanned and the cost on the bill. If you see SELECT * in an answer choice, it is almost always a distractor.
  • Check the query size before you run it. The BigQuery UI shows a byte estimate before execution. Use it. On the exam, an answer that mentions estimating cost or using a dry run is usually a green flag.
  • Filter early with WHERE. A WHERE clause lets BigQuery discard rows as early as possible, and when it hits a partitioning or clustering column it prunes data before it is even read. LIMIT and HAVING do not help: LIMIT only restricts the rows returned, and HAVING filters after aggregation, so neither reduces the bytes scanned or the cost. This is a classic trap, and the exam will absolutely test it.
  • Develop on small subsets first. Establish the logic of a query on a sample before unleashing it on a multi-terabyte table. The exam version of this is any scenario where someone is iterating on logic against a huge production table and racking up cost.
  • Prefer an explicit INNER JOIN over a WHERE-style join. Listing tables in FROM and relying on WHERE to relate them can degenerate into a cross join that is filtered after the fact, producing far more row combinations and far more computation than an explicit INNER JOIN with an ON condition. The optimizer handles the explicit join more efficiently.
  • Save intermediate results. If multiple downstream queries hit the same expensive transformation, materialize that step into an intermediate table and query the intermediate table. You pay for the computation once instead of every time. Several of these patterns are sketched in the example after this list.
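
A minimal sketch of those patterns together, assuming a hypothetical sales.transactions table with customer_id, amount, and purchase_date columns:

    -- Anti-pattern: SELECT * with LIMIT still bills for every column in the table,
    -- because LIMIT does not reduce bytes scanned.
    -- SELECT * FROM sales.transactions LIMIT 100;

    -- Better: name only the columns you need and filter with WHERE.
    -- Check the byte estimate in the UI, or run the bq CLI with --dry_run, before executing.
    SELECT customer_id, amount
    FROM sales.transactions
    WHERE purchase_date BETWEEN '2026-01-01' AND '2026-01-31';

    -- Materialize an expensive transformation once, then point downstream queries at it.
    CREATE OR REPLACE TABLE sales.monthly_totals AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales.transactions
    WHERE purchase_date BETWEEN '2026-01-01' AND '2026-01-31'
    GROUP BY customer_id;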

On top of that, partitioning and clustering belong in your mental toolkit. Partitioning a table by ingestion time or by a date column lets WHERE clauses prune entire partitions, which is the single biggest cost lever in BigQuery. Clustering on commonly filtered columns sorts data within partitions so range and equality filters scan even less. When the exam describes a query that filters on a date and a customer ID against a giant table, the right answer is usually partition on the date, cluster on the customer ID.
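
As a sketch, the DDL for that pattern looks like the following; the dataset, table, and column names are hypothetical:

    CREATE TABLE sales.transactions (
      customer_id STRING,
      amount NUMERIC,
      purchase_date DATE
    )
    PARTITION BY purchase_date
    CLUSTER BY customer_id;

A query that filters on purchase_date then reads only the matching partitions, and the customer_id filter scans even less within each one.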

Ingestion: batch when you can, stream when you must

The ingestion best practices come down to one principle the exam hammers on: streaming ingest is expensive.

Streaming inserts are the right call when you genuinely need data to be queryable within seconds of arrival. They are the wrong call for nightly ETL, hourly rollups, or anything where a few minutes of latency is acceptable. Batch loads into BigQuery are free. If a scenario describes a workload that ingests at a regular cadence and does not need second-level freshness, batch loading is the cost-optimized answer.

File format matters too. Loading Avro or Parquet from Cloud Storage is faster and cheaper than loading CSV or JSON, because these binary formats carry their schema with them, compress well, and can be read in parallel, so BigQuery does not need to parse text or infer and coerce types the way it does for CSV and JSON. If the exam offers a choice between formats for a batch load, Avro or Parquet is usually correct.
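
A minimal batch-load sketch using the LOAD DATA statement, assuming Parquet files already sit in a Cloud Storage bucket; the bucket path and table name are hypothetical:

    LOAD DATA INTO analytics.daily_events
    FROM FILES (
      format = 'PARQUET',
      uris = ['gs://example-bucket/events/2026-01-25/*.parquet']
    );

The same load could run as a bq load job or a scheduled transfer; the point is that none of these routes incur streaming-insert charges.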

One more ingestion-level decision is shape. Nested and repeated fields, which is to say denormalized tables, reduce the need for joins at query time. BigQuery is built for this. If you are tempted to normalize a schema across many small tables, ask whether the resulting joins will outweigh the storage savings. The Professional Data Engineer exam frequently rewards the denormalized answer.
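
For example, a denormalized orders table might carry line items as a repeated STRUCT, so queries read them with UNNEST instead of joining to a separate table. A sketch with hypothetical names:

    CREATE TABLE sales.orders (
      order_id STRING,
      customer_id STRING,
      order_date DATE,
      line_items ARRAY<STRUCT<sku STRING, quantity INT64, price NUMERIC>>
    );

    SELECT o.order_id, item.sku, item.quantity * item.price AS line_total
    FROM sales.orders AS o, UNNEST(o.line_items) AS item
    WHERE o.order_date = '2026-01-25';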

Storage: partition, expire, and let long-term storage do its job

BigQuery storage has two automatic tiers. Data in a table or partition that has not been modified for 90 consecutive days drops to long-term storage pricing, which is roughly half the cost of active storage. You do not have to do anything to get this discount, but you can lose it by touching a table unnecessarily, because any modification resets the 90-day clock for that table or partition. If a question describes archival data being rewritten on a schedule and asks why storage costs are not dropping, the answer is that the rewrites keep resetting the clock.
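
One way to see whether the discount is actually kicking in is per-partition metadata. A sketch, assuming the dataset exposes the INFORMATION_SCHEMA.PARTITIONS view and using a hypothetical table name:

    SELECT partition_id, storage_tier, last_modified_time
    FROM sales.INFORMATION_SCHEMA.PARTITIONS
    WHERE table_name = 'transactions'
    ORDER BY last_modified_time;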

  • Use partitioning to control storage too, not just query cost. Smaller partitions let you apply targeted expiration policies and let long-term storage pricing kick in partition by partition.
  • Set expiration on datasets, tables, and partitions for data with a known shelf life. Expirations delete data automatically and keep you from paying to store rows nobody is going to query. A DDL sketch follows this list.
  • Balance more, smaller tables against join cost. Partitioning and splitting can optimize storage, but if it forces you into expensive joins, the savings evaporate. The exam likes scenarios where someone over-normalized and query performance tanked.
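
A couple of expiration sketches; the table, dataset, and retention values are hypothetical:

    -- Drop partitions automatically 90 days after their partitioning date.
    ALTER TABLE sales.transactions
    SET OPTIONS (partition_expiration_days = 90);

    -- Default expiration for any new table created in a staging dataset.
    ALTER SCHEMA staging
    SET OPTIONS (default_table_expiration_days = 30);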

If you can answer three questions for any BigQuery scenario the exam throws at you (which columns am I scanning, am I filtering before the scan, and is this data shaped to avoid joins?), you will get the vast majority of BigQuery questions right.

My Professional Data Engineer course covers BigQuery best practices in depth alongside the rest of the storage, processing, and machine learning topics the exam tests, with worked examples for the query, ingest, and storage patterns described here.
