
Datastream is one of those services that shows up on the Professional Data Engineer exam in a very specific scenario shape, and once you recognize that shape, the question almost answers itself. The setup is usually some flavor of "we have a transactional database running somewhere (often on-prem, often Oracle), and we need its data to land in BigQuery or Cloud Storage in near real time without hammering the source." That is the Datastream wheelhouse, and in this article I want to walk through what the service actually does, which sources and destinations it supports, and how to spot it on the exam.
Datastream is a fully managed, serverless change data capture and replication service. Change data capture, or CDC, is the technique of reading a database's transaction log and streaming each insert, update, and delete downstream as it happens, rather than running batch queries against the source tables. That distinction matters because CDC is what lets you replicate a busy production database without putting analytical load on it. The transaction log is already being written, so reading from it costs the source database almost nothing extra.
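To make that concrete, here is a minimal sketch of what log-based CDC looks like from the consumer's side. The event shape and field names are invented for the illustration, not Datastream's actual output format; the point is that each row-level change arrives as a discrete event that a downstream system can apply in order.

```python
# Illustrative only: a hand-rolled CDC consumer applying row-level change
# events to an in-memory replica. The event fields are made up for this
# example; Datastream's real output schema differs.
from typing import Any

change_events: list[dict[str, Any]] = [
    {"op": "INSERT", "table": "orders", "key": 101, "row": {"id": 101, "status": "NEW", "total": 42.50}},
    {"op": "UPDATE", "table": "orders", "key": 101, "row": {"id": 101, "status": "SHIPPED", "total": 42.50}},
    {"op": "DELETE", "table": "orders", "key": 101, "row": None},
]

# The "replica" is just a dict of primary key -> latest row state.
replica: dict[int, dict[str, Any]] = {}

for event in change_events:
    if event["op"] in ("INSERT", "UPDATE"):
        # Upsert: the latest change event wins for each primary key.
        replica[event["key"]] = event["row"]
    elif event["op"] == "DELETE":
        replica.pop(event["key"], None)

print(replica)  # {} -- the row was inserted, updated, then deleted
```

Everything in that loop happens downstream of the source database. The source only had to write its transaction log, which it was doing anyway.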
A few framing points that come up on the Professional Data Engineer exam. First, Datastream is serverless. You do not provision clusters, you do not size workers, and you do not manage the runtime. You configure a stream from a source to a destination and Google Cloud handles the scaling. Second, it is fully managed in the sense that retries, ordering, and schema drift handling are baked in rather than something you have to bolt on with custom code. Third, Datastream used to live inside Cloud Data Fusion as a feature, but it is now its own standalone product. If you see older study material that lumps the two together, that history is worth knowing, but for the exam treat Datastream as a separate service.
The classic use case is replicating an on-prem operational database into Google Cloud so that analytics teams can query a near real time copy without touching the source. Picture a company running Oracle on-prem for their order management system. They want analysts to run BigQuery dashboards against fresh order data, but they cannot point those dashboards directly at Oracle because the operational workload cannot tolerate the extra query load, and the firewall and licensing situation makes federated access painful. Datastream sits between the two, reads the Oracle redo logs, and streams the changes into BigQuery continuously.
Datastream supports four database sources, and this list is exactly the kind of detail the exam likes to test. The supported sources are Oracle, PostgreSQL, MySQL, and SQL Server. If a scenario describes any of those four databases as the origin of the data, and the requirement is CDC or low-latency replication into Google Cloud, Datastream is almost certainly the intended answer. If the scenario uses a different database, say MongoDB or Spanner, then Datastream is the wrong choice and you should look at a different replication path.
This source list is also useful as a process-of-elimination tool. The exam likes to put Datastream next to Database Migration Service, Dataflow, and Pub/Sub in the answer choices. If the question is about ongoing replication with minimal source impact from one of those four supported relational engines, Datastream beats Dataflow (which would require you to build a custom CDC pipeline) and it beats Pub/Sub (which is a messaging substrate, not a CDC tool). Database Migration Service is a closer cousin, but it is oriented toward migrations, typically homogeneous moves into Cloud SQL or AlloyDB, whereas Datastream is built for continuous replication into analytics destinations.
On the destination side, the two answers worth memorizing are BigQuery and Cloud Storage. Streaming into BigQuery is the more common pattern for the analytics use case because it gives you a queryable, near real time replica of your source tables that downstream dashboards and SQL workloads can hit directly. Streaming into Cloud Storage is the pattern when you want a raw landing zone, often as JSON or Avro files, that you can then process further with Dataflow, load into other systems, or hold as a durable archive of the change history.
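Once the stream is landing in BigQuery, the replica is just a table you query like any other. Here is a minimal sketch using the google-cloud-bigquery client; the project, dataset, and table names are hypothetical placeholders for whatever Datastream is keeping in sync.

```python
# Minimal sketch: querying a Datastream-maintained replica table in BigQuery.
# The project, dataset, and table names here are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT status, COUNT(*) AS order_count
    FROM `my-analytics-project.oracle_replica.orders`
    WHERE order_date >= CURRENT_DATE()
    GROUP BY status
"""

# The dashboard hits BigQuery; the Oracle source never sees this query.
for row in client.query(query).result():
    print(row.status, row.order_count)
```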
One phrase that often appears in Datastream questions is something like "with minimal impact on the source database" or "without affecting production performance." That language is a strong signal. CDC against the transaction log is fundamentally lighter on a source than running scheduled extracts, because you are reading sequential log entries rather than scanning tables. When the exam emphasizes low source impact alongside continuous replication, it is pointing at Datastream.
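For contrast, this is roughly what the batch-extract alternative looks like: a scheduled job that repeatedly queries the source with a watermark. Every poll is a real query the production database has to plan and execute, which is exactly the load those exam scenarios are trying to avoid. The schema and watermark column are invented for the illustration, with sqlite3 standing in for the source purely to keep the sketch runnable.

```python
# Illustrative watermark-based batch extract -- the pattern CDC replaces.
# Every poll runs a real query against the source's tables; sqlite3 stands
# in for the production database only to keep the example self-contained.
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "NEW", "2024-01-01T10:00:00"), (2, "SHIPPED", "2024-01-01T11:00:00")],
)

last_watermark = "2024-01-01T10:30:00"

# The scheduled extract: a query the source must execute on every poll,
# typically needing an index on updated_at to avoid a full table scan.
rows = source.execute(
    "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
    (last_watermark,),
).fetchall()

print(rows)  # only row 2 is newer than the watermark; deletes are missed entirely
```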
Another phrase to watch for is "low-latency analytical replica." That is essentially the product pitch in five words. If a question describes a need to keep a BigQuery copy of an Oracle, MySQL, PostgreSQL, or SQL Server database synchronized within seconds to minutes of the source, with no custom code and no operational overhead on the source side, Datastream is the service being described.
My playbook is three quick checks. One, is the source one of Oracle, PostgreSQL, MySQL, or SQL Server? Two, is the destination BigQuery or Cloud Storage? Three, does the scenario describe ongoing replication or change capture rather than a one-time migration? If all three are yes, pick Datastream. If the source is outside that list, or the destination is Cloud SQL or AlloyDB, or the workload is a one-shot move, look elsewhere in your answer set.
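If it helps to internalize the checklist, here is the same logic written as a tiny function. The scenario fields are invented for the example; this is a study aid, not anything you would run in production.

```python
# The three-check playbook as a function. The scenario dict is invented
# purely for illustration.
DATASTREAM_SOURCES = {"Oracle", "PostgreSQL", "MySQL", "SQL Server"}
DATASTREAM_DESTINATIONS = {"BigQuery", "Cloud Storage"}

def points_to_datastream(scenario: dict) -> bool:
    """Return True when all three exam signals line up."""
    return (
        scenario["source"] in DATASTREAM_SOURCES
        and scenario["destination"] in DATASTREAM_DESTINATIONS
        and scenario["workload"] == "continuous replication"  # not a one-time move
    )

print(points_to_datastream({
    "source": "Oracle",
    "destination": "BigQuery",
    "workload": "continuous replication",
}))  # True

print(points_to_datastream({
    "source": "MySQL",
    "destination": "Cloud SQL",
    "workload": "one-time migration",
}))  # False -- look at Database Migration Service instead
```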
My Professional Data Engineer course covers Datastream alongside the other replication and ingestion services you need to differentiate on exam day, including Database Migration Service, Dataflow templates, and Pub/Sub, so you can walk into the test knowing exactly which one a scenario is asking for.