
Networking questions on the Professional Data Engineer exam tend to feel out of place at first. You came to study BigQuery and Dataflow, and suddenly a question asks why a Dataproc worker cannot reach Cloud SQL. The answer is almost always a firewall rule. Cloud Firewall sits in front of every VM-backed service in Google Cloud, which includes a lot of the data stack, so understanding how its rules evaluate is core exam territory.
This article walks through what I want a Professional Data Engineer candidate to have in their head before exam day: how VPC firewall rules are structured, how priority resolves conflicts, when hierarchical firewall policies override them, and how identity-based rules using service accounts give you a cleaner alternative to IP-based access control for data pipelines.
Cloud Firewall lets you define rules that control network traffic to and from resources inside a VPC. The most visible target is VMs, both standalone VMs and the VMs that sit underneath services like Dataproc clusters. The rules also apply to load balancers, GKE clusters, and Cloud SQL instances, which is why a misconfigured firewall can break a pipeline that does not look VM-shaped on the surface.
Every rule covers one direction. Ingress rules control incoming traffic. Egress rules control outgoing traffic. A common exam trap is a scenario where ingress looks fine and the candidate forgets to check egress on the source side.
Every VPC firewall rule has the same shape: a direction (ingress or egress), a priority, an action (allow or deny), and a match condition (targets plus sources or destinations, protocols, and ports). If you can recite these four pieces, you can answer most of the basic questions on the Professional Data Engineer exam.
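As a concrete sketch, the four pieces map directly onto flags of gcloud compute firewall-rules create. The rule name, network, and source range below are placeholders, not values from any real environment:

```shell
# Hypothetical rule: direction, priority, action, and match are each
# one flag. Name, network, and CIDR range are placeholders.
gcloud compute firewall-rules create allow-ssh-from-corp \
  --network=default \
  --direction=INGRESS \
  --priority=1000 \
  --action=ALLOW \
  --rules=tcp:22 \
  --source-ranges=203.0.113.0/24
```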
Priority is where most exam ambiguity hides. If a rule with priority 100 says allow TCP from a given range, and a rule with priority 200 says deny all SSH, the priority 100 rule wins for any TCP traffic that matches its range, even if some of that traffic happens to be SSH. The rule with the lower number is evaluated first, and the first match short-circuits the decision.
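The scenario above can be reproduced with two rules. The names and the trusted CIDR range here are illustrative placeholders:

```shell
# Priority 100: allow all TCP from a trusted range. Evaluated first.
gcloud compute firewall-rules create allow-trusted-tcp \
  --network=default --direction=INGRESS --priority=100 \
  --action=ALLOW --rules=tcp --source-ranges=10.128.0.0/20

# Priority 200: deny SSH from anywhere. Never consulted for traffic
# that already matched the priority-100 rule, which is why SSH from
# 10.128.0.0/20 still gets through.
gcloud compute firewall-rules create deny-ssh \
  --network=default --direction=INGRESS --priority=200 \
  --action=DENY --rules=tcp:22 --source-ranges=0.0.0.0/0
```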
VPC firewall rules live at the network level. That is fine for one project but becomes painful at scale, which is where hierarchical firewall policies come in. You attach these policies at the organization or folder level, and they evaluate before any VPC firewall rule inside a project below that node.
The exam-relevant property: a hierarchical policy can enforce a deny that a project owner cannot override with a VPC rule, and it can enforce an allow the same way. There is also a goto_next action that lets the hierarchical policy defer the decision to the next level down. This is how a security team locks down org-wide egress to known data exfiltration targets while still letting individual project owners write the rules that govern their own VPCs.
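A minimal sketch of that pattern at the organization level follows. The organization ID, policy name, and IP ranges are placeholders, and the exact flag spellings are worth verifying against current gcloud documentation:

```shell
# Create a firewall policy at the organization level (ORG_ID is a placeholder).
gcloud compute firewall-policies create \
  --organization=ORG_ID --short-name=org-egress-guardrails

# Deny egress to a known exfiltration range. A project admin cannot
# override this with a VPC firewall rule.
gcloud compute firewall-policies rules create 100 \
  --firewall-policy=org-egress-guardrails --organization=ORG_ID \
  --direction=EGRESS --action=deny \
  --dest-ip-ranges=198.51.100.0/24 --layer4-configs=all

# Everything else falls through to the VPC rules in each project.
gcloud compute firewall-policies rules create 1000 \
  --firewall-policy=org-egress-guardrails --organization=ORG_ID \
  --direction=EGRESS --action=goto_next \
  --dest-ip-ranges=0.0.0.0/0 --layer4-configs=all

# Attach the policy to the organization node.
gcloud compute firewall-policies associations create \
  --firewall-policy=org-egress-guardrails --organization=ORG_ID
```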
If an exam scenario describes a centralized security team that must guarantee a deny rule cannot be removed by a project admin, hierarchical firewall policies are the answer. A plain VPC firewall rule does not have that property.
This is the section that catches more candidates off guard than any other. Traditional firewall rules use IP ranges. Identity-based rules use the service account attached to the source or target resource instead. The match happens on identity, not on network position.
Picture three resources. A Cloud Run service runs as Service Account A. A VM Instance 1 runs as Service Account B. A VM Instance 2 runs as Service Account C. If you write a rule that says deny when source is Service Account B and target is Service Account C, then Instance 1 cannot reach Instance 2 even if their IPs are in the same range. The Cloud Run service is unaffected because Service Account A does not appear in the rule.
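The deny in that scenario can be written directly against the two service accounts. The rule name, project ID, and service account emails below are placeholders:

```shell
# Deny all traffic from workloads running as Service Account B to
# workloads running as Service Account C, regardless of IP addresses.
# The Cloud Run service under Service Account A matches neither side,
# so it is unaffected.
gcloud compute firewall-rules create deny-b-to-c \
  --network=default --direction=INGRESS --priority=900 \
  --action=DENY --rules=all \
  --source-service-accounts=sa-b@my-project.iam.gserviceaccount.com \
  --target-service-accounts=sa-c@my-project.iam.gserviceaccount.com
```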
For data pipelines this is much cleaner than IP-based control. Dataflow workers, Dataproc clusters, and Composer environments all run under service accounts. If you want to say "this pipeline can read from this Cloud SQL instance and no other workload can", you write the rule against the worker's service account rather than chasing ephemeral worker IPs.
Network tags work the same way structurally, but tags are not identities and are not protected by IAM. A user with edit rights on a VM can attach a tag and grant themselves access. Service-account-based rules require the iam.serviceAccountUser IAM role before anyone can run a workload as that service account, which makes them the right answer when an exam question emphasizes least privilege or zero trust.
One detail worth memorizing for the Professional Data Engineer exam: firewall rule logging is disabled by default. If a question describes a team that cannot tell whether a rule is being hit, the fix is to enable logging on that rule. From the console it is the Logs toggle on the rule's config page. From the command line it is:
gcloud compute firewall-rules update RULE_NAME --enable-logging

Enable it on the rules that matter rather than blanket-enabling across the project, because the log volume on a busy rule adds up fast.
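Once logging is on, rule hits land in Cloud Logging. A sketch of pulling recent entries follows; the project ID is a placeholder, and the log-name filter is worth double-checking against current documentation:

```shell
# Firewall rule logs are written to the compute.googleapis.com/firewall
# log. PROJECT_ID is a placeholder.
gcloud logging read \
  'logName="projects/PROJECT_ID/logs/compute.googleapis.com%2Ffirewall"' \
  --project=PROJECT_ID --limit=10
```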
For the Professional Data Engineer exam, the firewall-related muscle memory I want you to have is: priority is numeric and lower-wins, hierarchical firewall policies override VPC rules from above, identity-based rules using service accounts are the right tool for pipeline-to-pipeline access control, and logging starts off. If you can map a scenario to one of those four facts in under thirty seconds you will not lose points on the networking questions.
My Professional Data Engineer course covers Cloud Firewall, hierarchical policies, and the identity-based access patterns that show up in the data pipeline scenarios on the exam.