
Dataflow networking is one of those topics that looks dry on paper and then shows up on the Professional Data Engineer exam as a scenario where a pipeline silently fails to start, workers cannot talk to each other, or a job runs fine in one project and breaks in another. The fix is almost always at the network layer, and the answer usually comes down to a small set of ports and a single VPC setting. I want to walk through what you actually need to know for the exam: the ports Dataflow uses, the firewall rules that have to permit them, and how internal-IP-only workers stay connected to services like Pub/Sub and BigQuery.
Every Dataflow job, whether streaming or batch, relies on a handful of TCP ports. If any one of them is blocked at the VPC firewall, the symptoms range from a job that never leaves the queued state to a streaming pipeline that hangs partway through processing.
The pattern to remember is that 443 is universal, and 12345 and 12346 only matter when you are not using the managed Shuffle or Streaming Engine backends. If your pipeline uses Streaming Engine for a streaming job or Dataflow Shuffle for a batch job, the inter-worker traffic moves off your VPC and onto Google-managed infrastructure, and those two ports stop being part of the equation. That distinction shows up in exam questions where the scenario tells you a job is using Streaming Engine and asks what ports are still relevant. The answer narrows to 443.
The default VPC tends to ship with permissive rules that let Dataflow work without much thought. The moment you move a job into a custom VPC or a shared VPC, you have to be explicit about firewall rules. Dataflow needs both ingress and egress allowed on the ports it uses, and the rules need to apply to the workers themselves.
For a pipeline running in a custom VPC, the rules should permit:

- Ingress on TCP 12345 and 12346 between Dataflow worker VMs, for inter-worker traffic when the job does not use Streaming Engine or Dataflow Shuffle
- Egress on TCP 12345 and 12346 between those same workers, since both directions of inter-worker traffic must be open
- Egress on TCP 443, so workers can reach the Dataflow service and other Google APIs
If those ports are blocked, workers cannot reach the control plane or each other. The pipeline will not start, or it will start and then stall. On the exam, you should expect to see a scenario where a Dataflow job in a custom VPC fails to launch, and the question asks you to pick the corrective action. The right answer almost always involves adding firewall rules that allow ingress and egress on the necessary ports. If you see distractors about adjusting IAM permissions or restarting the job, treat them with suspicion when the symptoms point at network failures.
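As a rough sketch, the firewall rules for a custom VPC could be created with gcloud along these lines. The network name and the subnet CIDR here are placeholders; the `dataflow` tag is the network tag Dataflow applies to its worker VMs, but verify the tag and ranges for your own setup:

```shell
# Allow inter-worker traffic on the Dataflow ports (only needed when the
# job is not using Streaming Engine or Dataflow Shuffle).
gcloud compute firewall-rules create allow-dataflow-internal-ingress \
  --network=my-custom-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:12345-12346 \
  --source-tags=dataflow \
  --target-tags=dataflow

# Allow workers to send on the same ports, plus 443 for the control
# plane and Google APIs. The destination range is a placeholder for
# your subnet's CIDR.
gcloud compute firewall-rules create allow-dataflow-egress \
  --network=my-custom-vpc \
  --direction=EGRESS \
  --action=ALLOW \
  --rules=tcp:12345-12346,tcp:443 \
  --destination-ranges=10.128.0.0/20 \
  --target-tags=dataflow
```

If your VPC already has a permissive egress rule (the implied allow-egress default), the second rule may be unnecessary; it matters when an explicit deny-egress policy is in place.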
The other half of Dataflow networking on the Professional Data Engineer exam is the internal-IP-only configuration. By default, Dataflow workers get external IP addresses. For a more secure setup, you can configure a job to run with internal IPs only so that no worker is reachable from the public internet and no traffic leaves the VPC over a public path.
The catch is that workers still need to talk to Google Cloud services like Pub/Sub, BigQuery, and Cloud Storage. Those services live outside the VPC, and an internal-IP-only worker has no public path to reach them. The solution is Private Google Access, a subnet-level setting that lets resources without external IPs reach Google APIs and services over Google's internal network.
The configuration is straightforward:

1. Enable Private Google Access on the subnet where the workers run
2. Launch the job with the internal-IP-only option so workers receive no external IPs
3. Keep the firewall rules for the Dataflow ports in place
With that in place, your workers operate without external IPs, traffic to Google services takes the private path, and your firewall rules still need to allow the ports above so workers can manage the job and coordinate with each other.
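A minimal sketch of the two commands involved, assuming a subnet named my-subnet in us-central1 and a Python pipeline (the Java equivalent of the last flag is --usePublicIps=false):

```shell
# Step 1: turn on Private Google Access for the workers' subnet.
gcloud compute networks subnets update my-subnet \
  --region=us-central1 \
  --enable-private-ip-google-access

# Step 2: launch the pipeline with internal IPs only, pinned to that subnet.
python pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --subnetwork=regions/us-central1/subnetworks/my-subnet \
  --no_use_public_ips
```

The subnetwork must be specified explicitly here, because Private Google Access is a per-subnet setting and the job has to land on a subnet where it is enabled.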
If the exam gives you a scenario where a security team requires no public IPs on Dataflow workers, and the job needs to read from Pub/Sub and write to BigQuery, the answer is to enable Private Google Access on the subnet and launch the pipeline with the internal-IP-only flag. Picking a Cloud NAT or a VPN setup as the primary fix is usually the wrong answer here, because Private Google Access is the purpose-built option for Google APIs.
The Professional Data Engineer exam likes to test Dataflow networking through symptoms. You read a description of a pipeline that hangs or fails, and you have to map the behavior back to a missing firewall rule, a blocked port, or a misconfigured subnet. If you remember that 443 is always in play, that 12345 and 12346 matter only without Shuffle or Streaming Engine, and that internal-IP-only workers need Private Google Access on the subnet, you can answer most of these without guessing.
My Professional Data Engineer course covers Dataflow networking, Shuffle and Streaming Engine, and the rest of the Dataflow exam surface in depth.