Cloud DLP Integration with GCP Services for the PDE Exam

GCP Study Hub
May 1, 2026

Cloud DLP questions on the Professional Data Engineer exam almost never ask you to recite infoTypes from memory. They ask you where DLP fits in a pipeline. Is the data sitting in a bucket? Streaming through Pub/Sub? Landing in BigQuery before analysts query it? The right answer depends on whether the data is at rest or in transit, and which service holds it at the moment you need to protect it.

I want to walk through the integration story service by service, because that is how the exam frames sensitive data handling. If you can match the workload to the DLP integration pattern, the question writes itself.

BigQuery: column inspection and in-place de-identification

BigQuery is the integration point that comes up most often on the Professional Data Engineer exam. DLP scans BigQuery tables column by column, identifying which columns contain PII, PHI, payment data, or other regulated content. You point an inspection job at a dataset, DLP profiles the columns, and you get back a structured report of what was found and where.

The pattern you want in your head: profile first, then act. After DLP flags the sensitive columns, you have two main options. You can run a de-identification job that writes a transformed copy of the table with masking, tokenization, or format-preserving encryption applied to the flagged columns. Or you can apply column-level access controls and policy tags through Dataplex so analysts only see what they are authorized to see. Both are valid PDE answers depending on the scenario. If the question emphasizes downstream analytics on protected data, lean toward de-identified copies. If the question emphasizes governance and least-privilege access to the original data, lean toward policy tags.
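As a concrete sketch of the "profile first" step, here is roughly what a DLP inspection job over a BigQuery table looks like with the Python client, writing findings to a separate BigQuery table. The project, dataset, table, and infoType names are placeholders chosen for illustration, not values the exam expects.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project

inspect_job = {
    # Scan one BigQuery table (placeholder names).
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": "my-project",
                "dataset_id": "analytics",
                "table_id": "customers",
            }
        }
    },
    # What counts as sensitive for this scan.
    "inspect_config": {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "CREDIT_CARD_NUMBER"},
        ],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    },
    # Persist findings so you can decide whether to de-identify or tag columns.
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "my-project",
                        "dataset_id": "dlp_findings",
                        "table_id": "customers_scan",
                    }
                }
            }
        }
    ],
}

job = dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print(f"Started inspection job: {job.name}")
```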

Cloud Storage: scheduled inspection of unstructured objects

For data at rest in Cloud Storage, DLP scans objects directly. This covers CSV, JSON, Avro, and plain text files, and it extends to PDFs and images through OCR. The exam-relevant pattern is scheduled scans. You configure a recurring inspection trigger that points at a bucket or prefix, and DLP profiles new objects as they land.

The output goes to a BigQuery findings table or to Pub/Sub, depending on how you wire it. From there you can drive automated responses. A common Professional Data Engineer scenario: raw files arrive in a landing bucket, a scheduled DLP scan inspects them, findings publish to Pub/Sub, and a Cloud Function moves files with sensitive content to a quarantine bucket while clean files continue downstream.
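A minimal sketch of that kind of recurring trigger, assuming a daily schedule, a placeholder bucket path, and a Pub/Sub notification topic that downstream automation (such as the quarantine function) would listen on:

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project

job_trigger = {
    "display_name": "landing-bucket-daily-scan",
    "inspect_job": {
        "storage_config": {
            "cloud_storage_options": {
                # Placeholder landing bucket and prefix.
                "file_set": {"url": "gs://my-landing-bucket/raw/**"}
            },
            # Only look at objects added since the previous run.
            "timespan_config": {"enable_auto_population_of_timespan_config": True},
        },
        "inspect_config": {
            "info_types": [
                {"name": "EMAIL_ADDRESS"},
                {"name": "US_SOCIAL_SECURITY_NUMBER"},
            ],
        },
        "actions": [
            # Notify downstream automation when each scan finishes.
            {"pub_sub": {"topic": "projects/my-project/topics/dlp-scan-complete"}}
        ],
    },
    "triggers": [
        # Re-run once a day.
        {"schedule": {"recurrence_period_duration": {"seconds": 86400}}}
    ],
    "status": dlp_v2.JobTrigger.Status.HEALTHY,
}

trigger = dlp.create_job_trigger(request={"parent": parent, "job_trigger": job_trigger})
print(f"Created trigger: {trigger.name}")
```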

Dataflow and Datastream: in-flight masking in ETL

For data in transit through ETL, DLP plugs into Dataflow. The Google-provided Dataflow templates for DLP let you call the inspection or de-identification API on records as they pass through the pipeline, so sensitive fields are masked or tokenized before the data lands in its destination. This is the pattern to reach for when the exam says something like "before the data is stored or processed further." You are protecting data at the moment of ingestion, not after.
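The Google-provided templates do this wiring for you; a hand-rolled equivalent of the per-record call, sketched here as a Beam DoFn that masks email addresses and phone numbers in string elements, looks roughly like this:

```python
import apache_beam as beam
from google.cloud import dlp_v2


class MaskSensitiveFields(beam.DoFn):
    """Calls the DLP de-identify API on each record before it is written out."""

    def __init__(self, project):
        self.project = project  # placeholder project ID passed in at pipeline build time

    def setup(self):
        # One DLP client per worker, created when the worker starts.
        self.dlp = dlp_v2.DlpServiceClient()

    def process(self, element):
        response = self.dlp.deidentify_content(
            request={
                "parent": f"projects/{self.project}/locations/global",
                "inspect_config": {
                    "info_types": [
                        {"name": "EMAIL_ADDRESS"},
                        {"name": "PHONE_NUMBER"},
                    ]
                },
                # Mask every finding with '#' characters.
                "deidentify_config": {
                    "info_type_transformations": {
                        "transformations": [
                            {
                                "primitive_transformation": {
                                    "character_mask_config": {"masking_character": "#"}
                                }
                            }
                        ]
                    }
                },
                "item": {"value": element},
            }
        )
        yield response.item.value
```

In a real pipeline you would typically batch records into each API call rather than calling DLP once per element, to keep request volume and latency manageable.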

The same idea applies to Datastream CDC streams replicating from Oracle, MySQL, or PostgreSQL into BigQuery or Cloud Storage. You stage the change events through Dataflow, apply the DLP transforms there, and write the protected output. That keeps raw PII from ever being materialized in your analytics environment.

Pub/Sub: streaming, event-driven scanning

Pub/Sub is where event-driven scanning lives. If sensitive content might arrive in messages from upstream systems you do not fully control, you can set up a subscriber that calls DLP on each message, redacts or tokenizes findings, and republishes to a clean topic. The original raw topic can be locked down with strict IAM so only the scanning subscriber sees the unredacted payload.
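A sketch of that scanning subscriber, assuming plain-text payloads and placeholder project, subscription, and topic names; here findings are simply redacted before republishing, but tokenization works the same way:

```python
from google.cloud import dlp_v2, pubsub_v1

PROJECT = "my-project"       # placeholder
RAW_SUB = "raw-events-sub"   # locked-down subscription on the raw topic
CLEAN_TOPIC = "clean-events" # topic downstream consumers read from

dlp = dlp_v2.DlpServiceClient()
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
clean_topic_path = publisher.topic_path(PROJECT, CLEAN_TOPIC)


def callback(message):
    # Redact findings in the payload, then republish to the clean topic.
    response = dlp.deidentify_content(
        request={
            "parent": f"projects/{PROJECT}/locations/global",
            "inspect_config": {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]
            },
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        # Remove each finding entirely from the text.
                        {"primitive_transformation": {"redact_config": {}}}
                    ]
                }
            },
            "item": {"value": message.data.decode("utf-8")},
        }
    )
    publisher.publish(clean_topic_path, response.item.value.encode("utf-8"))
    message.ack()


subscription_path = subscriber.subscription_path(PROJECT, RAW_SUB)
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull.result()  # block and process messages until cancelled
```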

The PDE exam frames this as protecting streaming data in real time. The keyword to listen for is "continuous" or "event-driven." That should pull your mind toward Pub/Sub plus DLP rather than scheduled bucket scans.

Vertex AI: training data preparation

When you are preparing training data for a model on Vertex AI, you do not want raw PII baked into the model weights or surfaced in predictions. The integration pattern here is to run a DLP de-identification job over the training corpus before it goes to Vertex AI. Names, emails, phone numbers, and account identifiers get tokenized or replaced with surrogate values, and the training set retains its analytical shape without carrying the raw identifiers forward.
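One way to run that pass, sketched below with placeholder names, is a reusable de-identification template that replaces each finding with its infoType name; in practice you would map the second call over the whole corpus (from a Dataflow job, for example) rather than one string at a time.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project

# Reusable template: each finding becomes its infoType name,
# e.g. "call me at 555-0100" becomes "call me at [PHONE_NUMBER]".
template = dlp.create_deidentify_template(
    request={
        "parent": parent,
        "deidentify_template": {
            "display_name": "training-corpus-deid",
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {
                            "primitive_transformation": {
                                "replace_with_info_type_config": {}
                            }
                        }
                    ]
                }
            },
        },
    }
)

# Apply the template to one transcript before it joins the training set.
response = dlp.deidentify_content(
    request={
        "parent": parent,
        "deidentify_template_name": template.name,
        "inspect_config": {
            "info_types": [
                {"name": "PERSON_NAME"},
                {"name": "EMAIL_ADDRESS"},
                {"name": "PHONE_NUMBER"},
            ]
        },
        "item": {"value": "Customer Jane Doe asked us to email jane@example.com."},
    }
)
print(response.item.value)
```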

For text generation use cases, this matters even more. A model trained on unredacted support transcripts can memorize and regurgitate customer data. The PDE exam will sometimes pose this as a privacy or compliance scenario, and DLP preprocessing is the expected answer.

Dataplex: sensitivity discovery at the lake level

Dataplex pulls all of this together at the data lake layer. Dataplex data profiling can invoke DLP to scan the assets across your lake, including BigQuery datasets and Cloud Storage buckets, and surface the sensitivity findings as part of the asset catalog. From there you can attach policy tags to BigQuery columns automatically based on what DLP found.

This is the integration you reach for when the exam asks how to manage sensitive data classification at scale across a lake. Manual per-table inspection does not fit. Dataplex plus DLP gives you continuous, lake-wide discovery with governance attached.

How to read these questions on the exam

When a Professional Data Engineer question mentions sensitive data, my first move is to identify the state of the data. At rest in a bucket points to Cloud Storage plus scheduled DLP scans. At rest in BigQuery points to column inspection plus de-identification or policy tags. In transit through ETL points to Dataflow plus DLP transforms. In transit through messaging points to Pub/Sub plus a scanning subscriber. Training data points to a DLP pass before Vertex AI. Lake-wide governance points to Dataplex.

If you can map the workload to one of those six patterns, you have the answer.

My Professional Data Engineer course covers Cloud DLP integration patterns alongside the rest of the security and governance topics the exam tests, with worked examples for BigQuery, Dataflow, Pub/Sub, and Dataplex scenarios.
