
Data Catalog is one of those services on the Professional Data Engineer exam that looks simple on the surface and then bites you with a question about policy tag taxonomies or tag visibility roles. I want to walk through the parts that actually show up, in roughly the order I think about them when I see a Data Catalog question on a practice test.
Before going any further, a quick framing note. Data Catalog was rolled into Dataplex as part of Google Cloud's data governance consolidation, and you may see it referred to as Dataplex Universal Catalog in newer documentation. The exam guide and most question banks still use the Data Catalog name, so that is the terminology I will stick with here. The concepts, the API surface, and the IAM roles you need to know are the same either way.
Data Catalog is a fully managed, scalable metadata management service. The word that matters there is metadata. It is not where your data lives. It is where the descriptions of your data live, indexed in one place and made searchable across every project and dataset in your organization.
If you take only one thing into the exam about Data Catalog, take this: it is a centralized metadata repository. You go to Data Catalog when you want to find data, describe data, classify data, or apply governance rules to data that physically lives in BigQuery, Cloud Storage, Pub/Sub, Bigtable, Cloud SQL, Dataproc, or Dataflow.
The three features the exam expects you to know are metadata management, data discovery and search, and tagging with policy tags and tag templates. I will take each in turn.
Every data service on Google Cloud emits its own metadata. BigQuery has table schemas. Cloud Storage has object metadata. Pub/Sub has topic configurations. Bigtable has column families. Dataproc tracks cluster metrics. Dataflow has pipeline templates. On their own, those descriptions are scattered across services, each with its own API and its own UI.
Data Catalog pulls all of that into one place automatically. BigQuery, Pub/Sub, and Cloud Storage assets are cataloged with no setup on your part. The moment a new BigQuery table is created in your org, it shows up in Data Catalog with its schema. That automatic, no-configuration cataloging is a detail the exam likes to probe in scenario questions where someone asks how to inventory data without writing ingestion code.
Once metadata is centralized, discovery becomes a search problem. Data Catalog gives you a search interface, both in the console and via API, that queries across every cataloged asset in your organization.
The exam loves regulatory compliance framings here. A common scenario is something like: an auditor needs a list, within the hour, of every dataset that contains personally identifiable information. The right answer is to search Data Catalog for assets tagged with the appropriate PII classification, not to write a script that crawls every project's BigQuery datasets one by one. Whenever a Professional Data Engineer question mentions a compliance audit, a discovery request, or a need to find data across many projects fast, Data Catalog is usually the answer.
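To make the audit scenario concrete, here is a minimal sketch of what that search looks like as a request body shaped like the Data Catalog `catalog.search` REST method. The project IDs and the `pii_classification` template ID are placeholders I made up for illustration, not real resources.

```python
import json

# Hypothetical search request: find every asset carrying a tag from an
# assumed "pii_classification" tag template, scoped to two placeholder
# projects. The tag: qualifier is Data Catalog's search syntax for
# "has a tag from this template".
search_request = {
    "scope": {
        # Projects the auditor cares about (placeholders).
        "includeProjectIds": ["analytics-prod", "marketing-prod"],
    },
    "query": "tag:pii_classification",
}

print(json.dumps(search_request, indent=2))
```

The point is that one query replaces a crawl across every project's BigQuery API, which is exactly the distinction the exam is testing.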
Tag templates let you define structured custom metadata. Think of a template as a schema for a tag. You might create a template called data_governance with fields for owner, retention period, sensitivity level, and environment. Then you attach instances of that tag to specific assets, filling in the values per asset.
Examples of the kind of tagging the exam expects you to recognize:

- a BigQuery sales table tagged from the data_governance template with an owner, a retention period, and a sensitivity level
- a Cloud Storage bucket of raw exports tagged with its environment and retention period
- a Pub/Sub topic tagged with its owner and sensitivity level so downstream consumers know who is responsible for it
Tag templates are how you get consistent governance metadata across a heterogeneous stack. That consistency is what makes a data mesh architecture work in practice, and the exam connects tag templates to data mesh framings on more than one question.
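The data_governance template described above can be sketched as payloads shaped like the Data Catalog REST `TagTemplate` and `Tag` resources. The field names come from the text; the exact resource paths and enum spellings here are assumptions for illustration.

```python
# Hedged sketch of a tag template: a schema for a tag, with typed fields.
tag_template = {
    "displayName": "Data governance",
    "fields": {
        "owner": {"type": {"primitiveType": "STRING"}},
        "retention_days": {"type": {"primitiveType": "DOUBLE"}},
        "sensitivity_level": {
            "type": {
                "enumType": {
                    "allowedValues": [
                        {"displayName": "high"},
                        {"displayName": "medium"},
                        {"displayName": "low"},
                    ]
                }
            }
        },
        "environment": {"type": {"primitiveType": "STRING"}},
    },
}

# A tag is an instance of the template attached to one asset, with the
# values filled in per asset. The template path is a placeholder.
tag = {
    "template": "projects/my-project/locations/us/tagTemplates/data_governance",
    "fields": {
        "owner": {"stringValue": "analytics-team"},
        "retention_days": {"doubleValue": 365},
        "sensitivity_level": {"enumValue": {"displayName": "high"}},
        "environment": {"stringValue": "prod"},
    },
}
```

Because every asset's tag is constrained by the same template schema, the governance metadata stays consistent no matter which service the asset lives in.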
This is the part of Data Catalog that I see candidates underestimate going into the Professional Data Engineer exam. Policy tags are different from regular tag template tags. They are organized into hierarchical taxonomies, and they are used by BigQuery and BigLake to enforce column-level security.
A taxonomy might look like a tree with a root of PII, branching into high_sensitivity (national ID, financial account) and medium_sensitivity (email, phone). You attach policy tags to individual BigQuery columns. You then grant the Fine-Grained Reader IAM role (roles/datacatalog.categoryFineGrainedReader) on specific policy tags to specific principals. A user who queries a table without that role on a tagged column is denied access to that column, even if they otherwise have read access to the table.
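The taxonomy just described can be sketched as nested dicts. In the real Policy Tag Manager API, child tags reference their parent by resource name rather than nesting, so treat this purely as a picture of the hierarchy.

```python
# The PII taxonomy from the text, display names only. Resource names are
# assigned by the service, so none appear here.
taxonomy = {
    "displayName": "PII",
    "policyTags": [
        {
            "displayName": "high_sensitivity",
            "children": [
                {"displayName": "national_id"},
                {"displayName": "financial_account"},
            ],
        },
        {
            "displayName": "medium_sensitivity",
            "children": [
                {"displayName": "email"},
                {"displayName": "phone"},
            ],
        },
    ],
}
```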
If a Professional Data Engineer question asks how to restrict access to a single sensitive column without splitting the table or creating views, the answer is a policy tag taxonomy applied to that column.
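Attaching a policy tag to a single column can be expressed in BigQuery DDL via the column-level policy_tags option. The sketch below wraps the DDL in a Python string; the dataset, table, and the taxonomy and policy tag resource names are placeholders.

```python
# Minimal sketch: column-level security on one sensitive column, no table
# split, no views. Only principals holding Fine-Grained Reader on the
# referenced policy tag can read national_id.
ddl = """
CREATE TABLE governed.customers (
  customer_id STRING,
  national_id STRING OPTIONS (
    policy_tags = ['projects/my-project/locations/us/taxonomies/123/policyTags/456']
  )
)
"""

print(ddl)
```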
This is the gotcha. Data Catalog tags have two visibility settings, public and private, and the names are misleading.
Public visibility does not mean the tag is exposed to the internet. It means that any user who already has asset-level metadata permissions on the underlying resource can also see the Data Catalog tag. If a user has bigquery.metadataViewer on a dataset, public-visibility tags on that dataset are visible to them automatically.
Private visibility means tag access is restricted to users who hold the datacatalog.tagTemplateViewer role specifically. Having bigquery.metadataViewer alone is not enough. The exam tests this directly with scenarios where a user can see a table but cannot see its tags, and the right answer is to grant the Tag Template Viewer role.
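For the scenario where a user can see the table but not its tags, the fix is an IAM binding on the tag template. A sketch of that binding, with a placeholder principal:

```python
# The role ID for Tag Template Viewer is roles/datacatalog.tagTemplateViewer.
# Note that this is granted in addition to whatever asset-level role the
# user already holds (e.g. bigquery.metadataViewer), which by itself only
# surfaces public-visibility tags.
binding = {
    "role": "roles/datacatalog.tagTemplateViewer",
    "members": ["user:auditor@example.com"],
}
```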
My Professional Data Engineer course covers Data Catalog end to end, including how policy tag taxonomies plug into BigQuery column-level security, how tag templates power data mesh governance, and the Dataplex consolidation context you will run into in newer Google Cloud documentation.