
When the Professional Data Engineer exam asks how to keep personally identifiable information out of a BigQuery export, or how to scan a Cloud Storage bucket for credit card numbers before sharing it with a partner team, the answer is almost always the same service. Cloud DLP, now branded as Sensitive Data Protection, is the discovery and redaction layer GCP gives you for sensitive data. It shows up in scenario questions across security, governance, and pipeline design, and it pays to know exactly what it does and where it fits.
I want to walk through how I think about DLP on the exam, what each capability actually does, and which patterns Google likes to test.
Cloud DLP is a managed service for discovering, classifying, and protecting sensitive data. It automates four jobs that you would otherwise have to write yourself: recognizing sensitive values inside text or structured data, running risk analysis against datasets, redacting matches, and de-identifying records so they can be used downstream without exposing the original values.
The service is built around two core ideas. InfoTypes are the patterns DLP knows how to detect. Inspection and de-identification are the two operating modes: you either ask DLP what is in the data, or you ask it to transform the data so the sensitive bits are masked, tokenized, or removed.
DLP integrates natively with Cloud Storage, BigQuery, and Datastore, and it exposes a REST API so you can call it from Dataflow, Cloud Functions, or any application that handles text. On the Professional Data Engineer exam, treat it as the default answer whenever a question mentions PII, PHI, GDPR, HIPAA, or any flavor of regulated data inside GCP.
The first thing DLP does well is tell you what sensitive data you have and where it lives. You point it at a Cloud Storage bucket or a BigQuery table, configure an inspection job, and DLP returns findings grouped by infoType, with locations, likelihood scores, and counts.
This matters for exam scenarios that describe an organization inheriting a data lake of unknown content, or a team that needs to prove to auditors that no Social Security numbers are sitting in a public bucket. The right answer is not to write a custom regex pipeline. It is to schedule a DLP inspection job and let the built-in detectors do the work.
Likelihood scores, which range from VERY_UNLIKELY to VERY_LIKELY, let you tune false positives. A credit card detector matching a 16-digit string with a valid Luhn checksum returns VERY_LIKELY. A loose number that just happens to be 16 digits returns something lower. You filter on likelihood when you build automation around the findings.
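Here is a minimal sketch of that pattern using the google-cloud-dlp Python client. The project ID and sample text are placeholders; the point is the `min_likelihood` filter in the InspectConfig:

```python
# pip install google-cloud-dlp
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

# Only return findings at LIKELY or above to cut false positives.
inspect_config = {
    "info_types": [{"name": "CREDIT_CARD_NUMBER"}, {"name": "EMAIL_ADDRESS"}],
    "min_likelihood": dlp_v2.Likelihood.LIKELY,
    "include_quote": True,  # echo the matched text back in each finding
}

item = {"value": "Card 4111 1111 1111 1111, contact jane@example.com"}

response = client.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood, finding.quote)
```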
InfoTypes are where most exam questions pin down whether you actually know the service. DLP ships with more than 150 built-in detectors. The ones worth recognizing on sight include CREDIT_CARD_NUMBER, EMAIL_ADDRESS, PHONE_NUMBER, US_SOCIAL_SECURITY_NUMBER, PERSON_NAME, and IP_ADDRESS.
You can also define custom infoTypes. Three flavors exist. Regex-based detectors let you match an internal employee ID pattern. Dictionary detectors let you supply a word list. Stored detectors let you build a large dictionary backed by a Cloud Storage file. When a question gives you a non-standard identifier specific to a company, the answer is a custom infoType, not a built-in one.
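A sketch of the first two flavors, regex and inline dictionary; the infoType names, pattern, and word list here are hypothetical examples, not anything DLP ships with:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

inspect_config = {
    "custom_info_types": [
        {   # regex detector for a hypothetical internal employee ID format
            "info_type": {"name": "EMPLOYEE_ID"},
            "regex": {"pattern": r"EMP-\d{6}"},
            "likelihood": dlp_v2.Likelihood.LIKELY,
        },
        {   # dictionary detector built from an inline word list
            "info_type": {"name": "PROJECT_CODENAME"},
            "dictionary": {"word_list": {"words": ["aquila", "borealis"]}},
        },
    ]
}

item = {"value": "Ticket filed by EMP-004521 for project aquila"}
response = client.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```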
Inspection is the read-only mode. You define an InspectConfig with the infoTypes you care about, a minimum likelihood, and optional inspection rules, such as hotword rules that adjust likelihood based on nearby context and exclusion rules that suppress known false positives. You point it at a storage location and DLP writes findings to a BigQuery table, to Pub/Sub, or to Security Command Center.
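A sketch of a one-off storage inspection job that scans a BigQuery table and saves findings to another table; all project, dataset, and table names are placeholders:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

inspect_job = {
    "inspect_config": {
        "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    },
    "storage_config": {
        # Scan an existing BigQuery table in place.
        "big_query_options": {
            "table_reference": {
                "project_id": "my-project",
                "dataset_id": "raw",
                "table_id": "customers",
            }
        }
    },
    "actions": [
        {   # write findings to a BigQuery table for later analysis
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "my-project",
                        "dataset_id": "dlp_results",
                        "table_id": "findings",
                    }
                }
            }
        }
    ],
}

job = client.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print(job.name)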
For scheduled scans, you wrap this in a job trigger that fires on a schedule or whenever new objects land in a bucket. A common Professional Data Engineer pattern is a trigger that scans every new file uploaded to a landing bucket, publishes findings to Pub/Sub, and routes anything sensitive to a quarantine bucket through a Cloud Function. Recognize that flow when you see it.
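The trigger half of that flow looks roughly like this; bucket, topic, and project names are placeholders, and the Cloud Function doing the quarantine move is a separate Pub/Sub subscriber not shown here:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

job_trigger = {
    "inspect_job": {
        "inspect_config": {"info_types": [{"name": "CREDIT_CARD_NUMBER"}]},
        "storage_config": {
            "cloud_storage_options": {
                "file_set": {"url": "gs://landing-bucket/**"}  # placeholder bucket
            },
            # Only scan objects added since the last run.
            "timespan_config": {"enable_auto_population_of_timespan_config": True},
        },
        "actions": [
            # Notify a Pub/Sub topic when each job completes; a Cloud Function
            # subscribed to it can route flagged files to a quarantine bucket.
            {"pub_sub": {"topic": "projects/my-project/topics/dlp-findings"}}
        ],
    },
    # Run at most once every 24 hours (the minimum recurrence DLP allows).
    "triggers": [{"schedule": {"recurrence_period_duration": {"seconds": 86400}}}],
    "status": dlp_v2.JobTrigger.Status.HEALTHY,
}

trigger = client.create_job_trigger(
    request={"parent": parent, "job_trigger": job_trigger}
)
print(trigger.name)
```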
De-identification is the write mode. Instead of returning findings, DLP transforms the data so sensitive values are replaced. The main transformation types are worth memorizing: redaction, which deletes the match; character masking, which overwrites characters with a fixed symbol; replacement, which swaps the value for a fixed string or the infoType name; bucketing, which generalizes values into ranges; crypto-based tokenization, including deterministic encryption and format-preserving encryption, which map the same input to the same token; and date shifting, which moves dates by a random but consistent offset per record.
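A minimal sketch of the write mode using character masking, the simplest transformation; project ID and sample text are placeholders:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "CREDIT_CARD_NUMBER"}],
                "primitive_transformation": {
                    # Overwrite every character of the match with '#'.
                    "character_mask_config": {"masking_character": "#"}
                },
            }
        ]
    }
}

item = {"value": "Payment with card 4111 1111 1111 1111"}
response = client.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": [{"name": "CREDIT_CARD_NUMBER"}]},
        "item": item,
    }
)
print(response.item.value)  # card number replaced by '#' characters
```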
Format-preserving encryption and date shifting are the two that get tested most often because they support analytics on the de-identified output. If a question asks how to share a dataset with an analytics team while preserving referential integrity on a sensitive column, the answer is FPE with a wrapped key from Cloud KMS. If it asks how to support longitudinal medical analysis without exposing real dates, the answer is date shifting.
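A sketch of the FPE pattern, assuming you have already created a KMS key and used it to wrap a random data encryption key; the resource names and the wrapped-key bytes below are placeholders:

```python
import base64
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

# Placeholders: a KMS key you control and the data key it wrapped.
kms_key_name = "projects/my-project/locations/global/keyRings/dlp/cryptoKeys/fpe-key"
wrapped_key = base64.b64decode("CiQA...")  # placeholder base64 wrapped key

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        "crypto_key": {
                            "kms_wrapped": {
                                "wrapped_key": wrapped_key,
                                "crypto_key_name": kms_key_name,
                            }
                        },
                        # Numeric alphabet keeps the token the same length and
                        # character set as the input, and the same input always
                        # yields the same token, so joins still work.
                        "common_alphabet": dlp_v2.CryptoReplaceFfxFpeConfig.FfxCommonNativeAlphabet.NUMERIC,
                    }
                },
            }
        ]
    }
}

item = {"value": "SSN 372819127"}
response = client.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}]},
        "item": item,
    }
)
print(response.item.value)  # the SSN is replaced by a consistent numeric token
```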
The Professional Data Engineer exam rarely names DLP outright in the question stem. It describes a symptom and expects you to map it to the service. Watch for these signals: a request to find or remove sensitive data without writing custom code, a compliance reference like GDPR or HIPAA, a need to share data with an external party safely, or a structured data column that must be tokenized but still joinable. All of these route to Cloud DLP.
My Professional Data Engineer course covers Cloud DLP alongside the rest of the security and governance domain, with worked scenarios for inspection jobs, de-identification templates, and the integration patterns Google likes to test.