
When the Professional Data Engineer exam asks how to keep personally identifiable information out of a BigQuery export, or how to scan a Cloud Storage bucket for credit card numbers before sharing it with a partner team, the answer is almost always the same service. Cloud DLP, now branded as Sensitive Data Protection, is the discovery and redaction layer GCP gives you for sensitive data. It shows up in scenario questions across security, governance, and pipeline design, and it pays to know exactly what it does and where it fits.
I want to walk through how I think about DLP on the exam, what each capability actually does, and which patterns Google likes to test.
Cloud DLP is a managed service for discovering, classifying, and protecting sensitive data. It automates four jobs that you would otherwise have to write yourself: recognizing sensitive values inside text or structured data, running risk analysis against datasets, redacting matches, and de-identifying records so they can be used downstream without exposing the original values.
The service is built around two core ideas. InfoTypes are the patterns DLP knows how to detect. Inspection and de-identification are the two operating modes: you either ask DLP what is in the data, or you ask it to transform the data so the sensitive bits are masked, tokenized, or removed.
DLP integrates natively with Cloud Storage, BigQuery, and Datastore, and it exposes a REST API so you can call it from Dataflow, Cloud Functions, or any application that handles text. On the Professional Data Engineer exam, treat it as the default answer whenever a question mentions PII, PHI, GDPR, HIPAA, or any flavor of regulated data inside GCP.
The first thing DLP does well is tell you what sensitive data you have and where it lives. You point it at a Cloud Storage bucket or a BigQuery table, configure an inspection job, and DLP returns findings grouped by infoType, with locations, likelihood scores, and counts.
This matters for exam scenarios that describe an organization inheriting a data lake of unknown content, or a team that needs to prove to auditors that no Social Security numbers are sitting in a public bucket. The right answer is not to write a custom regex pipeline. It is to schedule a DLP inspection job and let the built-in detectors do the work.
Likelihood scores, which range from VERY_UNLIKELY to VERY_LIKELY, let you tune false positives. A credit card detector matching a 16-digit string with a valid Luhn checksum returns VERY_LIKELY. A loose number that just happens to be 16 digits returns something lower. You filter on likelihood when you build automation around the findings.
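Here is a minimal sketch of that pattern using the google-cloud-dlp Python client. The project ID and sample text are placeholders; the point is the `min_likelihood` filter in the InspectConfig:

```python
# pip install google-cloud-dlp
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

# Only return findings at LIKELY or above to cut false positives.
inspect_config = {
    "info_types": [{"name": "CREDIT_CARD_NUMBER"}, {"name": "EMAIL_ADDRESS"}],
    "min_likelihood": dlp_v2.Likelihood.LIKELY,
    "include_quote": True,  # echo the matched text back in each finding
}

item = {"value": "Card 4111 1111 1111 1111, contact jane@example.com"}

response = client.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood, finding.quote)
```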
InfoTypes are where most exam questions pin down whether you actually know the service. DLP ships with more than 150 built-in detectors. The ones worth recognizing on sight include CREDIT_CARD_NUMBER, EMAIL_ADDRESS, PHONE_NUMBER, US_SOCIAL_SECURITY_NUMBER, PERSON_NAME, and IP_ADDRESS.
You can also define custom infoTypes. Three flavors exist. Regex-based detectors let you match an internal employee ID pattern. Dictionary detectors let you supply a word list. Stored detectors let you build a large dictionary backed by a Cloud Storage file. When a question gives you a non-standard identifier specific to a company, the answer is a custom infoType, not a built-in one.
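A sketch of the first two flavors, regex and inline dictionary; the infoType names, pattern, and word list here are hypothetical examples, not anything DLP ships with:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

inspect_config = {
    "custom_info_types": [
        {   # regex detector for a hypothetical internal employee ID format
            "info_type": {"name": "EMPLOYEE_ID"},
            "regex": {"pattern": r"EMP-\d{6}"},
            "likelihood": dlp_v2.Likelihood.LIKELY,
        },
        {   # dictionary detector built from an inline word list
            "info_type": {"name": "PROJECT_CODENAME"},
            "dictionary": {"word_list": {"words": ["aquila", "borealis"]}},
        },
    ]
}

item = {"value": "Ticket filed by EMP-004521 for project aquila"}
response = client.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```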
Inspection is the read-only mode. You define an InspectConfig with the infoTypes you care about, a minimum likelihood, and optional inspection rules, such as hotword rules that adjust likelihood based on nearby context and exclusion rules that suppress known false positives. You point it at a storage location and DLP writes findings to a BigQuery table, to Pub/Sub, or to Security Command Center.
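A sketch of a one-off storage inspection job that scans a BigQuery table and saves findings to another table; all project, dataset, and table names are placeholders:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

inspect_job = {
    "inspect_config": {
        "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    },
    "storage_config": {
        # Scan an existing BigQuery table in place.
        "big_query_options": {
            "table_reference": {
                "project_id": "my-project",
                "dataset_id": "raw",
                "table_id": "customers",
            }
        }
    },
    "actions": [
        {   # write findings to a BigQuery table for later analysis
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "my-project",
                        "dataset_id": "dlp_results",
                        "table_id": "findings",
                    }
                }
            }
        }
    ],
}

job = client.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print(job.name)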
For scheduled scans, you wrap this in a job trigger that fires on a schedule or whenever new objects land in a bucket. A common Professional Data Engineer pattern is a trigger that scans every new file uploaded to a landing bucket, publishes findings to Pub/Sub, and routes anything sensitive to a quarantine bucket through a Cloud Function. Recognize that flow when you see it.
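The trigger half of that flow looks roughly like this; bucket, topic, and project names are placeholders, and the Cloud Function doing the quarantine move is a separate Pub/Sub subscriber not shown here:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

job_trigger = {
    "inspect_job": {
        "inspect_config": {"info_types": [{"name": "CREDIT_CARD_NUMBER"}]},
        "storage_config": {
            "cloud_storage_options": {
                "file_set": {"url": "gs://landing-bucket/**"}  # placeholder bucket
            },
            # Only scan objects added since the last run.
            "timespan_config": {"enable_auto_population_of_timespan_config": True},
        },
        "actions": [
            # Notify a Pub/Sub topic when each job completes; a Cloud Function
            # subscribed to it can route flagged files to a quarantine bucket.
            {"pub_sub": {"topic": "projects/my-project/topics/dlp-findings"}}
        ],
    },
    # Run at most once every 24 hours (the minimum recurrence DLP allows).
    "triggers": [{"schedule": {"recurrence_period_duration": {"seconds": 86400}}}],
    "status": dlp_v2.JobTrigger.Status.HEALTHY,
}

trigger = client.create_job_trigger(
    request={"parent": parent, "job_trigger": job_trigger}
)
print(trigger.name)
```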
De-identification is the write mode. Instead of returning findings, DLP transforms the data so sensitive values are replaced. The main transformation types are worth memorizing: redaction, which deletes the match; character masking, which overwrites characters with a fixed symbol; replacement, which swaps the value for a fixed string or the infoType name; bucketing, which generalizes values into ranges; crypto-based tokenization, including deterministic encryption and format-preserving encryption, which map the same input to the same token; and date shifting, which moves dates by a random but consistent offset per record.
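A minimal sketch of the write mode using character masking, the simplest transformation; project ID and sample text are placeholders:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "CREDIT_CARD_NUMBER"}],
                "primitive_transformation": {
                    # Overwrite every character of the match with '#'.
                    "character_mask_config": {"masking_character": "#"}
                },
            }
        ]
    }
}

item = {"value": "Payment with card 4111 1111 1111 1111"}
response = client.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": [{"name": "CREDIT_CARD_NUMBER"}]},
        "item": item,
    }
)
print(response.item.value)  # card number replaced by '#' characters
```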
Format-preserving encryption and date shifting are the two that get tested most often because they support analytics on the de-identified output. If a question asks how to share a dataset with an analytics team while preserving referential integrity on a sensitive column, the answer is FPE with a wrapped key from Cloud KMS. If it asks how to support longitudinal medical analysis without exposing real dates, the answer is date shifting.
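A sketch of the FPE pattern, assuming you have already created a KMS key and used it to wrap a random data encryption key; the resource names and the wrapped-key bytes below are placeholders:

```python
import base64
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

# Placeholders: a KMS key you control and the data key it wrapped.
kms_key_name = "projects/my-project/locations/global/keyRings/dlp/cryptoKeys/fpe-key"
wrapped_key = base64.b64decode("CiQA...")  # placeholder base64 wrapped key

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        "crypto_key": {
                            "kms_wrapped": {
                                "wrapped_key": wrapped_key,
                                "crypto_key_name": kms_key_name,
                            }
                        },
                        # Numeric alphabet keeps the token the same length and
                        # character set as the input, and the same input always
                        # yields the same token, so joins still work.
                        "common_alphabet": dlp_v2.CryptoReplaceFfxFpeConfig.FfxCommonNativeAlphabet.NUMERIC,
                    }
                },
            }
        ]
    }
}

item = {"value": "SSN 372819127"}
response = client.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}]},
        "item": item,
    }
)
print(response.item.value)  # the SSN is replaced by a consistent numeric token
```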
The Professional Data Engineer exam rarely names DLP outright in the question stem. It describes a symptom and expects you to map it to the service. Watch for these signals: a request to find or remove sensitive data without writing custom code, a compliance reference like GDPR or HIPAA, a need to share data with an external party safely, or a structured data column that must be tokenized but still joinable. All of these route to Cloud DLP.
My Professional Data Engineer course covers Cloud DLP alongside the rest of the security and governance domain, with worked scenarios for inspection jobs, de-identification templates, and the integration patterns Google likes to test.