Data Security with LLMs for the Generative AI Leader Exam

GCP Study Hub
Ben Makansi
November 22, 2025

Most of the security questions on the Generative AI Leader exam come down to one idea: when you feed data into an AI system, you are introducing exposure vectors that traditional security models were not designed to handle. Sensitive information can end up embedded in model weights, surfaced in outputs, or sent to an external API without adequate controls. The perimeter is harder to define, and the stakes are higher.

Google Cloud frames data security in the generative AI era as maintaining absolute control over your information assets. That phrasing is deliberately strong. The exam wants you to recognize three foundational approaches for getting there, plus the two Google Cloud tools that implement them in practice.

De-identification

De-identification removes or masks personally identifiable information before data ever reaches the model. Rather than hoping the model will not reproduce sensitive details, you strip them out at the source. Names, ID numbers, health data, and financial identifiers never enter the pipeline, so the model never has access to them in the first place.

For the exam, Google Cloud points to two tools as the primary implementations of de-identification:

  • Sensitive Data Protection
  • Data Loss Prevention API (DLP API)

If a Generative AI Leader question asks which Google Cloud service handles de-identification of sensitive fields before they reach an LLM, those are the two names to recognize.
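To make the idea concrete, here is a minimal sketch of calling the DLP API's deidentify_content method to mask sensitive fields before a string is sent on to an LLM. The project ID is a placeholder, and the info types selected are illustrative assumptions, not an exam-mandated list.

```python
# Minimal sketch: de-identify text with the Cloud DLP API before it
# reaches an LLM prompt. Assumes the google-cloud-dlp client library
# is installed and credentials are configured.
from google.cloud import dlp_v2

def deidentify_before_prompt(project_id: str, text: str) -> str:
    client = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"

    # Detect common identifiers; this list is illustrative only.
    inspect_config = {
        "info_types": [
            {"name": "PERSON_NAME"},
            {"name": "EMAIL_ADDRESS"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
        ]
    }

    # Mask every detected value with '#' characters.
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "character_mask_config": {"masking_character": "#"}
                    }
                }
            ]
        }
    }

    response = client.deidentify_content(
        request={
            "parent": parent,
            "inspect_config": inspect_config,
            "deidentify_config": deidentify_config,
            "item": {"value": text},
        }
    )
    return response.item.value

# e.g. "My SSN is 123-45-6789" -> "My SSN is ###########"
```

The point of the pattern is the ordering: the transformation runs before the model call, so the raw identifiers never appear in the prompt at all.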

Data minimization

Minimizing data collection takes a complementary stance: do not collect what you do not need. Every piece of sensitive data that enters your AI pipeline is a liability. If a task can be accomplished with less data, or with aggregated rather than individual-level data, that is the safer path. The exam treats this as an architectural principle, not a tooling question.

Federated learning

Federated learning is the architectural option of the three. Instead of centralizing data in one place for training, the model learns from data where it lives, across distributed devices or systems, without the data ever leaving its source. The insights travel. The sensitive data stays put. On the exam, federated learning is the answer when a scenario rules out moving data across boundaries but still needs a model trained on that data.
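The mechanics are easier to see in miniature. The sketch below is a toy federated averaging loop in plain Python with NumPy; the clients, data, and update rule are hypothetical stand-ins to illustrate the principle, not a Google Cloud API.

```python
# Illustrative federated averaging: each client computes a model
# update on its own data, and only the updates (never the raw data)
# travel to the server for aggregation. Purely conceptual.
import numpy as np

def local_update(weights: np.ndarray, local_data: np.ndarray,
                 lr: float = 0.1) -> np.ndarray:
    """One gradient step on data that never leaves the client.

    Toy objective: fit the mean of the local data, so the gradient
    of the squared error is (weights - local mean).
    """
    gradient = weights - local_data.mean(axis=0)
    return weights - lr * gradient

def federated_round(global_weights: np.ndarray,
                    client_datasets: list[np.ndarray]) -> np.ndarray:
    # Each client trains locally; only weight vectors are shared.
    client_weights = [local_update(global_weights.copy(), data)
                      for data in client_datasets]
    # The server averages the updates it receives.
    return np.mean(client_weights, axis=0)

# Three clients, each holding data the server never sees.
clients = [np.random.randn(100, 4) + i for i in range(3)]
weights = np.zeros(4)
for _ in range(50):
    weights = federated_round(weights, clients)
print(weights)  # converges toward the mean across all clients
```

Notice what crosses the boundary in each round: a small weight vector, not the hundred rows each client holds. That asymmetry is the whole appeal.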

Masking versus substitution

Inside de-identification itself, the exam expects you to know two specific techniques. The example Google Cloud uses is a record with a name like John Doe, an email, and an SSN. What happens to each field is the point.

  • Masking hides parts of the original value while keeping its structure recognizable. An SSN with only the last four digits visible is masking.
  • Synthetic substitution replaces the value with realistic but entirely fictional data. A real name swapped for a fake name that preserves the data format is substitution.

The one-line distinction worth memorizing: masking obscures, substitution replaces. Both ensure the model processes data without ever seeing the actual personal identifiers behind it.
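A small sketch makes the contrast tangible. The two functions below are hand-rolled stand-ins for the point being made, not the actual Sensitive Data Protection transformations; in a real pipeline you would configure these in the DLP API rather than writing them yourself.

```python
# Illustrative contrast between masking and synthetic substitution,
# using the John Doe record from the example above. Hand-rolled for
# clarity; not the DLP API's own transformations.
import random

def mask_ssn(ssn: str) -> str:
    """Masking: obscure the value but keep its structure visible."""
    # Keep only the last four digits: 123-45-6789 -> ###-##-6789
    return "###-##-" + ssn[-4:]

def substitute_name(name: str) -> str:
    """Substitution: replace the value with realistic fictional data."""
    fake_names = ["Alex Rivera", "Sam Chen", "Priya Patel"]
    return random.choice(fake_names)  # same format, fictional person

record = {"name": "John Doe", "ssn": "123-45-6789"}
safe = {"name": substitute_name(record["name"]),
        "ssn": mask_ssn(record["ssn"])}
print(safe)  # e.g. {'name': 'Sam Chen', 'ssn': '###-##-6789'}
```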

What to take into the exam

From this section, the Generative AI Leader exam expects you to walk in knowing four things:

  1. The three foundational approaches: de-identification, data minimization, federated learning.
  2. Sensitive Data Protection and the Data Loss Prevention API as Google Cloud's tools for de-identification.
  3. The masking versus synthetic substitution distinction.
  4. That federated learning keeps data at its source rather than centralizing it.

My Generative AI Leader course covers data security with LLMs alongside the rest of the foundational material on the exam.
