Bigtable Garbage Collection Policies for the PDE Exam

GCP Study Hub
February 12, 2026

Bigtable garbage collection is one of those topics that looks tiny on paper but generates a steady stream of Professional Data Engineer exam questions. The exam loves scenarios where a team is wrestling with storage costs, version history, or compliance-driven retention, and the right answer almost always comes down to picking the correct garbage collection rule on the correct column family. I want to walk through what these policies actually do, how they combine, and the specific angles the PDE exam tends to test.

What garbage collection actually means in Bigtable

Every write to a Bigtable cell creates a new version of that cell, timestamped at the moment of the write. Bigtable does not overwrite the previous value by default. It stacks versions, and each version sits there consuming storage until something removes it. Garbage collection is the mechanism Bigtable uses to automatically delete old or unwanted cell versions so your tables do not balloon over time.

Two facts about garbage collection trip people up on the Professional Data Engineer exam, so commit them to memory. First, garbage collection rules are defined at the column family level. You do not set them per row, per cell, or per table. You attach them to a column family, and every column inside that family inherits the rule. Second, even though the rule lives at the column family level, it is applied at the cell level. Each individual cell is evaluated against the rule, and only the versions that violate it get deleted.
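To make the scoping concrete, here is how a policy is attached in practice with the `cbt` CLI. The project, instance, table, and column family names are hypothetical; note that the policy target is always a column family, never a table or row.

```shell
# Hypothetical project/instance/table/family names.
# Attach a garbage collection policy to a column family:
cbt -project=my-project -instance=my-instance \
    setgcpolicy sensor-readings temperature maxversions=3

# List the table's column families to inspect their current policies:
cbt -project=my-project -instance=my-instance ls sensor-readings
```

Every column in the `temperature` family now inherits the rule, and each cell in those columns is evaluated against it individually.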

The two policy types

Bigtable gives you two building blocks. Max versions retains only the specified number of versions for each cell. If you set max versions to 3, Bigtable keeps the three most recent versions of every cell in that column family and deletes anything older. This is the rule you reach for when the most recent values are what matter and history beyond a fixed depth is noise. A sensor reporting current temperature where you only care about the last few readings is a classic fit.

Max age retains data based on how old each cell version is. If you set max age to 30 days, any cell version with a timestamp older than 30 days becomes eligible for deletion. This is the rule for time-bounded retention. Clickstream events that must roll off after 90 days, session data that becomes meaningless after a week, compliance windows that mandate a hard cutoff, all of that maps to max age.
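The two rules can be sketched as pure functions over a cell's version list. This is a simplified model of the semantics, not Bigtable's actual implementation; versions are represented as (timestamp, value) pairs sorted newest-first.

```python
from datetime import datetime, timedelta, timezone

# Simplified model of per-cell garbage collection (not Bigtable's real code).
# A cell's versions are (timestamp, value) pairs sorted newest-first.

def apply_max_versions(versions, n):
    """Max versions: keep only the n most recent versions of the cell."""
    return versions[:n]

def apply_max_age(versions, max_age, now):
    """Max age: keep only versions whose timestamp is within max_age of now."""
    return [(ts, v) for ts, v in versions if now - ts <= max_age]

now = datetime(2026, 2, 12, tzinfo=timezone.utc)
versions = [
    (now - timedelta(days=1), "21.4"),
    (now - timedelta(days=10), "20.9"),
    (now - timedelta(days=45), "19.7"),
    (now - timedelta(days=90), "18.2"),
]

print(len(apply_max_versions(versions, 3)))                   # 3 newest survive
print(len(apply_max_age(versions, timedelta(days=30), now)))  # 2 are within 30 days
```

Run against the same version list, the two rules keep different survivors: max versions 3 drops only the 90-day-old reading, while max age 30 days drops everything older than a month regardless of count.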

Combining rules with intersection and union

This is where the exam likes to get sneaky. You are not limited to one rule per column family. You can combine max versions and max age into a compound policy using either an intersection or a union, and the difference between them changes which cell versions survive.

An intersection means a version is deleted only when it violates all of the rules. If you set an intersection of max versions 5 and max age 30 days, a cell version must be both beyond the fifth most recent and older than 30 days before Bigtable removes it. The intersection is the more conservative option. It tends to retain more data, which is what you want when you need at least N versions available no matter how old, and you also want to keep recent versions even if they exceed N.

A union means a version is deleted as soon as it violates any of the rules. A union of max versions 5 and max age 30 days deletes a cell version the moment it falls outside the top five most recent or crosses the 30-day threshold, whichever happens first. The union is the more aggressive option. It is the right answer when you want a hard ceiling on both dimensions, like keeping at most five versions and no version older than a month.

When a PDE exam question describes a retention requirement, read the wording carefully. Phrases like "keep at least the last N versions even if older than X" point to an intersection. Phrases like "never store more than N versions and never store anything older than X" point to a union.
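The survival logic for compound policies can be sketched the same way. Again a simplified model, not Bigtable's implementation: a version survives an intersection unless it violates every rule, and survives a union only if it violates no rule.

```python
from datetime import datetime, timedelta, timezone

# Simplified model of compound GC policies (not Bigtable's real code).

def violates_max_versions(index, n):
    return index >= n  # index 0 is the newest version of the cell

def violates_max_age(ts, max_age, now):
    return now - ts > max_age

def intersection_keep(versions, n, max_age, now):
    """Delete only versions that break BOTH rules (conservative)."""
    return [(ts, v) for i, (ts, v) in enumerate(versions)
            if not (violates_max_versions(i, n)
                    and violates_max_age(ts, max_age, now))]

def union_keep(versions, n, max_age, now):
    """Delete versions that break EITHER rule (aggressive)."""
    return [(ts, v) for i, (ts, v) in enumerate(versions)
            if not (violates_max_versions(i, n)
                    or violates_max_age(ts, max_age, now))]

now = datetime(2026, 2, 12, tzinfo=timezone.utc)
days = [1, 5, 40, 50, 60, 70, 80]  # ages of seven versions, newest first
versions = [(now - timedelta(days=d), f"v{i}") for i, d in enumerate(days)]

# Intersection of maxversions=5 and maxage=30d: only versions that are BOTH
# beyond the top five AND older than 30 days go (indexes 5 and 6).
print(len(intersection_keep(versions, 5, timedelta(days=30), now)))  # 5

# Union: anything beyond the top five OR older than 30 days goes, which
# leaves only the two versions younger than 30 days.
print(len(union_keep(versions, 5, timedelta(days=30), now)))  # 2
```

Same seven versions, same two rules, and the intersection keeps five while the union keeps two. That gap is exactly what the exam's wording is probing.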

When garbage collection actually runs

A subtle point the exam sometimes probes is timing. Garbage collection in Bigtable is not instantaneous. Cell versions that violate the policy become eligible for deletion, but they are physically removed during background compactions. That means a cell version may still be readable for some time after it technically expired under the rule. If a question asks whether you can rely on garbage collection for hard, immediate deletion, the answer is no. For strict guarantees you have to issue explicit deletes or filter reads by timestamp.
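When a hard guarantee is required, the enforcement moves to read time. The sketch below post-filters versions in pure Python for illustration; with the real client you would attach a timestamp-range read filter so expired-but-not-yet-compacted versions never reach the application.

```python
from datetime import datetime, timedelta, timezone

# GC is lazy: expired versions may remain readable until compaction runs.
# For a hard retention guarantee, filter at read time. Pure-Python sketch;
# in practice a timestamp-range read filter does this server-side.

def read_with_retention(versions, max_age, now):
    """Drop any version past the retention window, even if GC hasn't run."""
    cutoff = now - max_age
    return [(ts, v) for ts, v in versions if ts >= cutoff]

now = datetime(2026, 2, 12, tzinfo=timezone.utc)
versions = [
    (now - timedelta(days=10), "fresh"),
    (now - timedelta(days=35), "expired-but-not-yet-compacted"),
]

# Only the 10-day-old version comes back; the 35-day-old one is
# suppressed even though compaction has not physically removed it.
print(read_with_retention(versions, timedelta(days=30), now))
```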

The other implication is that storage savings from a new policy show up gradually. If you tighten a max age from 90 days to 30 days, you will not see your table size drop overnight. Compactions will work through the eligible versions over the following hours or days.

How the PDE exam frames these

Most Professional Data Engineer questions on this topic put you in a scenario where storage costs are creeping up, or a team needs to enforce a retention window, or you are designing a schema for time-series data. The question stem will describe the access pattern and the retention requirement, and you pick the column family configuration that satisfies it. Watch for questions that try to make you set retention at the table or row level; those are distractors. The correct answer always sets the policy on a column family.

Also watch for questions that mix garbage collection with cell timestamps. If a scenario says writes carry custom timestamps from upstream events, remember that max age measures against the cell's timestamp, not against ingestion time. A backfill of historical data with old timestamps will be eligible for garbage collection the moment it lands.
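The backfill point is easy to verify with a few lines. A simplified eligibility check, not Bigtable's implementation: the comparison uses the cell's timestamp, so a write ingested moments ago but stamped a year in the past is already past a 30-day max age.

```python
from datetime import datetime, timedelta, timezone

# Max age is evaluated against the CELL's timestamp, not ingestion time.
# Simplified eligibility check, not Bigtable's real code.

def eligible_under_max_age(cell_timestamp, max_age, now):
    return now - cell_timestamp > max_age

now = datetime(2026, 2, 12, tzinfo=timezone.utc)
ingested_at = now                           # written moments ago...
cell_timestamp = now - timedelta(days=365)  # ...with an old upstream timestamp

# Eligible for deletion immediately, despite being freshly ingested.
print(eligible_under_max_age(cell_timestamp, timedelta(days=30), now))  # True
```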

My Professional Data Engineer course covers Bigtable garbage collection, schema design, and the full storage and retention surface area the exam tests, with worked scenarios for the trickier intersection and union cases.
