
When I prep candidates for the Professional Data Engineer exam, the Cloud DLP section is one of the places people lose easy points. The exam loves to test whether you can pick the right de-identification method for a specific business need, and the choices look similar on the surface. Masking, tokenization, bucketing, format-preserving encryption, and pseudonymization all hide sensitive data, but each one preserves a different property of the original value. The exam wants you to match the property the downstream system needs to the method that preserves it.
In this post I want to walk through the data protection methods Cloud DLP gives you and the scenarios where each one is the right call.
Masking replaces part of a value with a fixed character, usually an asterisk, while leaving the rest of the value visible. The goal is to protect the sensitive portion but keep enough of the original around for reference, analytics, or troubleshooting.
The canonical examples on the exam look like this:
- John Doe becomes J*** D**
- johndoe@example.com becomes j*****e@example.com
- 1234 5678 9012 3456 becomes **** **** **** 3456
- 987-65-4321 becomes ***-**-4321

Notice that the email keeps the domain intact, the credit card keeps the last four digits, and the SSN keeps the last four. That partial visibility is the whole point. A support agent can confirm a caller by reading the last four of their card without ever seeing the full number, and an analyst can still group rows by domain without seeing usernames. If the exam describes a scenario where someone needs to verify a value but not see it, masking is almost always the answer.
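The credit card and SSN examples above can be reproduced with a few lines of plain Python. This is an illustrative sketch of the masking behavior, not the Cloud DLP API itself (in Cloud DLP you would configure `characterMaskConfig` on a de-identify request); the function name and its parameters are my own.

```python
def mask_value(value: str, keep_last: int = 4, mask_char: str = "*",
               ignore: str = " -@.") -> str:
    """Mask every maskable character except the trailing keep_last,
    leaving separator characters (spaces, dashes, etc.) visible."""
    # Indices of characters that should actually be masked (not separators).
    maskable = [i for i, ch in enumerate(value) if ch not in ignore]
    to_mask = set(maskable[:-keep_last] if keep_last else maskable)
    return "".join(mask_char if i in to_mask else ch
                   for i, ch in enumerate(value))

print(mask_value("1234 5678 9012 3456"))  # **** **** **** 3456
print(mask_value("987-65-4321"))          # ***-**-4321
```

The separators stay put, which is why the masked output is still recognizable as a card number or SSN to a human reader.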
Tokenization swaps a sensitive value for a surrogate token that has no mathematical relationship to the original. Cloud DLP offers a couple of flavors, and the distinction shows up on the Professional Data Engineer exam.
Deterministic tokenization maps a given input to the same token every time, so equality joins and group-bys across tables still line up. Cryptographic tokenization keeps a key that lets authorized users recover the original value later. If the question asks about preserving the ability to join tables on a sensitive field, lean toward deterministic. If the question asks about re-identifying a record later under controlled access, lean toward cryptographic.
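Here is a stdlib-only sketch of why deterministic tokenization preserves joins: the same input plus the same key always yields the same token. This is illustrative only; in Cloud DLP the key would be a Cloud KMS-wrapped key and the transformation would be configured via the de-identify API, and the function name here is my own.

```python
import hashlib
import hmac

KEY = b"demo-key"  # in Cloud DLP this would be a KMS-wrapped crypto key

def deterministic_token(value: str, key: bytes = KEY) -> str:
    """Same input + same key -> same token, so equality joins survive."""
    digest = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    return f"TOKEN({digest[:12]})"

# The same SSN tokenizes identically in two different tables,
# so a join on the tokenized column still matches.
orders = {"987-65-4321": "order-17"}
tokenized_orders = {deterministic_token(ssn): v for ssn, v in orders.items()}
print(deterministic_token("987-65-4321") in tokenized_orders)  # True
```

Note the trade-off: determinism is exactly what makes the token joinable, but it also means identical inputs are linkable across datasets, which is sometimes the property you are trying to destroy.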
Bucketing replaces a precise value with a range or category. It is the method to reach for when the analyst needs the shape of the distribution but does not need individual values.
Two common patterns:

- Fixed-size bucketing: numeric values grouped into equal ranges, like ages 20-29, 30-39, 40-49.
- Custom bucketing: hand-defined ranges or categories, like salary bands of low, mid, and high.

On the exam, bucketing tends to show up in healthcare or HR scenarios where regulators want the data de-identified but the data scientists still need to compute aggregates.
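The fixed-size pattern can be sketched in a few lines. This mirrors the behavior of fixed-size bucketing conceptually but is not the Cloud DLP API; the function and its signature are my own for illustration.

```python
def fixed_size_bucket(value: float, lower: float, upper: float,
                      size: float) -> str:
    """Replace a precise value with the range it falls into.
    Values outside [lower, upper) collapse into open-ended buckets."""
    if value < lower:
        return f"<{lower:g}"
    if value >= upper:
        return f">={upper:g}"
    start = lower + size * ((value - lower) // size)
    return f"{start:g}-{start + size:g}"

print(fixed_size_bucket(34, lower=20, upper=60, size=10))  # 30-40
print(fixed_size_bucket(72, lower=20, upper=60, size=10))  # >=60
```

An analyst can still histogram ages or compute counts per band, but no row reveals an exact age.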
Format-preserving encryption, or FPE, is the method I see candidates underestimate the most. FPE encrypts a value but keeps the original format. A 16-digit credit card stays a 16-digit string. An email keeps the shape of its local part, the at sign, and the domain structure. A US SSN keeps the three-two-four digit pattern.
The example values look like this:
- John Doe becomes Qelj Tir
- johndoe@example.com becomes xylepto@domain.org
- 1234 5678 9012 3456 becomes 8763 2190 4321 6589
- 987-65-4321 becomes 234-12-7890

The reason FPE matters is that legacy and downstream systems often validate the shape of a field. A payment processing pipeline that expects 16 digits will reject a token like tok_a91f... outright, but it will happily accept an FPE-encrypted card number because the format is identical. FPE is the right answer when the exam tells you that the data has to flow through systems that enforce a schema or format, and the data does not need to be human-readable for troubleshooting.
The trade-off is that FPE values are not meaningful to a human reader. If the downstream consumer is a person reconciling records, masking is usually better. If the downstream consumer is a system with strict input validation, FPE wins.
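The schema-validation point is easy to demonstrate. The sketch below plays the role of a downstream system with strict input validation, using the article's own example values; it does not implement FPE itself (real FPE uses the FF1/FFX constructions, which Cloud DLP handles for you), and the validator is a hypothetical stand-in.

```python
import re

# A downstream system that validates the *shape* of a card number,
# not its content: four groups of four digits.
CARD_16 = re.compile(r"^\d{4} \d{4} \d{4} \d{4}$")

def downstream_accepts(value: str) -> bool:
    return bool(CARD_16.match(value))

print(downstream_accepts("8763 2190 4321 6589"))  # True: FPE output, same shape
print(downstream_accepts("tok_a91f3c"))           # False: opaque token, rejected
```

The FPE-encrypted value sails through because it is indistinguishable in shape from a real card number; the opaque token fails validation before it ever reaches business logic.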
Pseudonymization is the umbrella term for replacing direct identifiers with artificial ones so a record can no longer be tied to an individual without an additional key. Deterministic tokenization and cryptographic tokenization are both forms of pseudonymization. The reason to learn the term separately is that compliance regimes like GDPR call it out explicitly, and Professional Data Engineer questions sometimes use the regulatory vocabulary instead of the implementation vocabulary.
If a question references GDPR and asks for the de-identification approach that still allows controlled re-identification, pseudonymization, implemented via Cloud DLP cryptographic tokens, is the framing the exam expects.
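To make the "controlled re-identification" idea concrete, here is a toy token-vault sketch: tokens replace the identifier, and getting the original back requires access to the vault, which is the "additional key" the GDPR definition hinges on. This is purely illustrative; Cloud DLP implements this with KMS-wrapped keys rather than an in-memory dictionary, and the class and method names are my own.

```python
import hashlib
import hmac

class TokenVault:
    """Toy pseudonymization: re-identification requires vault access."""

    def __init__(self, key: bytes):
        self._key = key
        self._vault: dict[str, str] = {}

    def pseudonymize(self, value: str) -> str:
        token = "PID_" + hmac.new(self._key, value.encode(),
                                  hashlib.sha256).hexdigest()[:10]
        self._vault[token] = value  # stored under controlled access
        return token

    def reidentify(self, token: str) -> str:
        # Only callers with access to the vault can reverse a token.
        return self._vault[token]

vault = TokenVault(b"demo-key")
token = vault.pseudonymize("John Doe")
print(vault.reidentify(token))  # John Doe
```

The records themselves carry only tokens; without the vault (or key), they can no longer be tied to an individual, which is exactly the property the regulation names.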
When a Professional Data Engineer scenario hands you a sensitive field, ask three questions in order:

1. Does a human need to see part of the value to verify it? If yes, masking.
2. Does a system need to join on the field or re-identify records later? If yes, tokenization: deterministic for joins, cryptographic for re-identification.
3. Does a downstream system validate the field's format or schema? If yes, format-preserving encryption.
If none of those apply and the analyst only needs aggregates, bucketing or date shifting is the cleanest fit.
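One way to make that decision order mechanical is to encode it as a function. This is a hypothetical study aid, not an official rubric, and the parameter names are my own shorthand for the scenario cues discussed above.

```python
def pick_method(human_verifies: bool, needs_join_or_reid: bool,
                format_enforced: bool) -> str:
    """Apply the decision questions in order; first match wins."""
    if human_verifies:
        return "masking"
    if needs_join_or_reid:
        return "tokenization"
    if format_enforced:
        return "format-preserving encryption"
    return "bucketing or date shifting"

# Support agent reads the last four digits to a caller -> masking.
print(pick_method(True, False, False))   # masking
# Data flows through a schema-enforcing payment pipeline -> FPE.
print(pick_method(False, False, True))   # format-preserving encryption
```

The ordering matters: a scenario can hint at several properties, and the exam expects you to satisfy the most constraining requirement first.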
My Professional Data Engineer course covers Cloud DLP end to end, including the data protection methods above, how DLP integrates with Cloud Storage, BigQuery, Dataflow, and Pub/Sub for at-rest and in-transit protection, and the exam-style decision patterns you need to choose the right method under time pressure.