Dataflow Networking for the PDE Exam: Ports, Firewall Rules, Internal IPs

GCP Study Hub
October 2, 2025

Dataflow networking is one of those topics that looks dry on paper and then shows up on the Professional Data Engineer exam as a scenario where a pipeline silently fails to start, workers cannot talk to each other, or a job runs fine in one project and breaks in another. The fix is almost always at the network layer, and the answer usually comes down to a small set of ports and a single VPC setting. I want to walk through what you actually need to know for the exam: the ports Dataflow uses, the firewall rules that have to permit them, and how internal-IP-only workers stay connected to services like Pub/Sub and BigQuery.

The three ports Dataflow workers depend on

Every Dataflow job, whether streaming or batch, relies on a handful of TCP ports. If any one of them is blocked at the VPC firewall, the symptoms range from a job that never leaves the queued state to a streaming pipeline that hangs partway through processing.

  • TCP 443 (HTTPS): used by all Dataflow jobs. Workers reach the Dataflow service control plane on this port for job management, API requests, and metadata exchange. Because it is HTTPS, the traffic is encrypted in transit.
  • TCP 12345: used by streaming jobs that are not running on Streaming Engine. This is the worker-to-worker data channel for streaming pipelines.
  • TCP 12346: the same role as 12345, but for batch jobs that are not using Dataflow Shuffle. Worker-to-worker traffic for batch pipelines flows here.

The pattern to remember is that 443 is universal, and 12345 and 12346 only matter when you are not using the managed Shuffle or Streaming Engine backends. If your pipeline uses Streaming Engine for a streaming job or Dataflow Shuffle for a batch job, the inter-worker traffic moves off your VPC and onto Google-managed infrastructure, and those two ports stop being part of the equation. That distinction shows up in exam questions where the scenario tells you a job is using Streaming Engine and asks what ports are still relevant. The answer narrows to 443.
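As a concrete sketch of that distinction, this is roughly how you might launch a streaming job with Streaming Engine enabled, using the Google-provided Pub/Sub-to-BigQuery template; the job name, project, topic, and table below are placeholders, not values from this article:

    # Minimal sketch: launch a streaming template job with Streaming Engine.
    # With --enable-streaming-engine, streaming shuffle traffic moves onto
    # Google-managed infrastructure, so TCP 12345 drops out of the picture
    # and only TCP 443 remains relevant at the firewall.
    gcloud dataflow jobs run demo-stream-job \
        --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
        --region us-central1 \
        --enable-streaming-engine \
        --parameters inputTopic=projects/my-project/topics/events,outputTableSpec=my-project:demo.events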

Firewall rules in a custom VPC

The default VPC tends to ship with permissive rules that let Dataflow work without much thought. The moment you move a job into a custom VPC or a shared VPC, you have to be explicit about firewall rules. Dataflow needs both ingress and egress allowed on the ports it uses, and the rules need to apply to the workers themselves.

For a pipeline running in a custom VPC, the rules should permit the following (a gcloud sketch follows the list):

  • TCP 443 ingress and egress for job management, API calls, and metadata exchange with the control plane.
  • TCP 12345 and 12346 ingress and egress for worker-to-worker communication when Dataflow Shuffle and Streaming Engine are not in use.
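
Here is a minimal sketch of those rules in gcloud, assuming a custom VPC named my-custom-vpc and relying on the dataflow network tag that the service attaches to worker VMs; the rule names are placeholders:

    # Worker-to-worker traffic on the streaming (12345) and batch (12346) ports.
    # Dataflow attaches the "dataflow" network tag to its worker VMs.
    gcloud compute firewall-rules create allow-dataflow-internal \
        --network my-custom-vpc \
        --direction ingress \
        --action allow \
        --source-tags dataflow \
        --target-tags dataflow \
        --rules tcp:12345-12346

    # Egress on 443 to the Dataflow control plane and Google APIs. Many VPCs
    # still have the implied allow-all egress rule, which already covers this.
    gcloud compute firewall-rules create allow-dataflow-egress \
        --network my-custom-vpc \
        --direction egress \
        --action allow \
        --target-tags dataflow \
        --destination-ranges 0.0.0.0/0 \
        --rules tcp:443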

If those ports are blocked, workers cannot reach the control plane or each other. The pipeline will not start, or it will start and then stall. On the exam, you should expect to see a scenario where a Dataflow job in a custom VPC fails to launch, and the question asks you to pick the corrective action. The right answer almost always involves adding firewall rules that allow ingress and egress on the necessary ports. If you see distractors about adjusting IAM permissions or restarting the job, treat them with suspicion when the symptoms point at network failures.

Internal IPs only and Private Google Access

The other half of Dataflow networking on the Professional Data Engineer exam is the internal-IP-only configuration. By default, Dataflow workers get external IP addresses. For a more secure setup, you can configure a job to run with internal IPs only so that no worker is reachable from the public internet and no traffic leaves the VPC over a public path.

The catch is that workers still need to talk to Google Cloud services like Pub/Sub, BigQuery, and Cloud Storage. Those services live outside the VPC, and an internal-IP-only worker has no public path to reach them. The solution is Private Google Access, a subnet-level setting that lets resources without external IPs reach Google APIs and services over Google's internal network.
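
Enabling it is a single subnet update; the subnet and region names in this sketch are placeholders:

    # Let VMs without external IPs in this subnet reach Google APIs
    # (Pub/Sub, BigQuery, Cloud Storage) over Google's internal network.
    gcloud compute networks subnets update dataflow-subnet \
        --region us-central1 \
        --enable-private-ip-google-access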

The configuration is straightforward; a launch example follows the list:

  • Place the Dataflow workers in a subnet inside your VPC.
  • Enable Private Google Access on that subnet.
  • Set the job to use internal IPs only when you launch it.
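
Putting those steps into a command, again with placeholder names and the same template as earlier, the launch might look like this:

    # Launch with internal IPs only. --disable-public-ips keeps workers off
    # the public internet; --subnetwork places them in the PGA-enabled subnet.
    gcloud dataflow jobs run private-stream-job \
        --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
        --region us-central1 \
        --subnetwork regions/us-central1/subnetworks/dataflow-subnet \
        --disable-public-ips \
        --parameters inputTopic=projects/my-project/topics/events,outputTableSpec=my-project:demo.events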

With that in place, your workers operate without external IPs, traffic to Google services takes the private path, and your firewall rules still need to allow the ports above so workers can reach the control plane and coordinate with each other.

If the exam gives you a scenario where a security team requires no public IPs on Dataflow workers, and the job needs to read from Pub/Sub and write to BigQuery, the answer is to enable Private Google Access on the subnet and launch the pipeline with the internal-IP-only flag. Picking Cloud NAT or a VPN as the primary fix is usually wrong here, because Private Google Access is the purpose-built option for reaching Google APIs.

How this lands on the exam

The Professional Data Engineer exam likes to test Dataflow networking through symptoms. You read a description of a pipeline that hangs or fails, and you have to map the behavior back to a missing firewall rule, a blocked port, or a misconfigured subnet. If you remember that 443 is always in play, that 12345 and 12346 matter only without Shuffle or Streaming Engine, and that internal-IP-only workers need Private Google Access on the subnet, you can answer most of these without guessing.

My Professional Data Engineer course covers Dataflow networking, Shuffle and Streaming Engine, and the rest of the Dataflow exam surface in depth.
