
Fusion is one of those Dataflow concepts that quietly shows up on the Professional Data Engineer exam, and it catches candidates off guard because the name sounds like a feature you would want rather than a problem you need to fix. In most cases fusion is helpful, but when it goes wrong your pipeline drags along on a fraction of the workers it could be using. I want to walk through what fusion is, how to recognize it, and the three techniques the exam expects you to know for preventing it.
When you submit a Dataflow pipeline, the service looks at your transforms and decides how to execute them. Sometimes it combines several pipeline steps into a single execution stage. If you have a filter step followed by a transform step followed by a summarize step, Dataflow may decide to run all three together inside one fused stage rather than executing them as separate stages. This combination is called fusion.
Most of the time this is a good thing. Running steps together cuts down on the overhead of moving data between stages, which makes the pipeline faster and cheaper. The trouble starts when fusion blocks parallelization.
The scenario you need to remember for the exam is the one where a pipeline step takes a small dataset and expands it into a much larger one. Picture a transform that reads a few hundred records and emits millions of derived records from them. If Dataflow fuses this expansion step with whatever comes after it, the service ends up treating the downstream work as if it were still operating on the small input. It does not redistribute the expanded data across more workers, because from the fused stage's perspective the input was small.
The result is a pipeline that runs slowly and only uses a small number of workers out of the maximum you allowed. That mismatch between active and idle workers is the key indicator of fusion. If you look at your job in the Dataflow monitoring interface and see most workers sitting idle while one or two grind through what should be a parallel workload, fusion is almost certainly the cause.
The Professional Data Engineer exam expects you to know three specific strategies for breaking fusion. Each one inserts a deliberate boundary that forces Dataflow to materialize the intermediate data and redistribute it across workers before the next stage runs:

1. GroupByKey and ungroup. Assign a key to each element, apply a GroupByKey, then immediately flatten the groups back into individual elements. The shuffle behind GroupByKey acts as the boundary.
2. Side input. Feed the intermediate data into the next step as a side input rather than as its main input. Dataflow materializes side inputs, so the downstream step starts from a fresh, redistributable collection.
3. Reshuffle. Apply the Reshuffle transform, which exists precisely to redistribute elements across workers and prevent fusion.
The exam tends to phrase fusion questions as a troubleshooting scenario. You will see a description of a pipeline that is running slowly and only using a fraction of the available workers, often after a step that expands the data. The question will ask what is happening or what you would do about it. The answer is to recognize the symptom as fusion and pick the option that introduces one of the three boundaries above.
Watch for distractors that sound plausible but address the wrong problem. Increasing the maximum number of workers will not help, because the workers are already available and just sitting idle. Switching machine types will not help either. The bottleneck is the way Dataflow planned the execution graph, not the resources you gave it.
Fusion is Dataflow's optimization to combine pipeline steps into one execution stage. It usually helps, but it hurts when a step expands a small dataset into a much larger one and the next step needs to parallelize. The symptom is a pipeline that runs slowly with most workers idle. The three fixes you need to know for the Professional Data Engineer exam are GroupByKey plus ungroup, side input, and reshuffle. Reshuffle is the option I reach for most often because it is purpose-built for the job.
My Professional Data Engineer course covers Dataflow fusion alongside the other Dataflow troubleshooting topics that tend to show up on exam day.