
Google Datastream advances change data capture in the cloud

At Google's Data Cloud Summit, a new service was released in preview that enables users to capture data from outside sources and bring it into Google's cloud data and analytics services.

Among the early users of Google's new Datastream change data capture service is grocery store chain Schnuck Markets, which operates 111 stores across Missouri, Illinois, Indiana and Wisconsin.

In a breakout session at the tech giant's Data Cloud Summit on May 26, Caleb Carr, principal technologist at Schnuck Markets, detailed the St. Louis-based company's use case for Datastream, which Google introduced at the virtual conference.

Carr noted that Schnuck has been using Google Cloud Platform for several years, for cloud storage as well as the BigQuery data warehouse. One of the challenges his team faced was ensuring that Schnuck's operational data from its on-premises Oracle database environment was available quickly and reliably for analytics workloads running in BigQuery.

Carr's team's initial approach was a batch job that synchronized data at periodic intervals.

"The batch nature of the process caused delays in replication and we weren't making decisions at the kind of speed we wanted," Carr said.

The batch approach also had an impact on Schnuck's network, as large volumes of data were being synchronized at different times. The company also needed dedicated staff to manage the process.

Caleb Carr, principal technologist at Schnuck Markets, outlined how the Midwestern grocery chain is making use of the newly announced Google Datastream change data capture technology.

Getting data from Oracle to Google

With Datastream, Carr said, getting up and running was straightforward. The first step was to enable Oracle's LogMiner, which exposes the changes recorded in the database's redo logs. Datastream was then able to pick up those changes from the Oracle database via a bastion host into a Google Virtual Private Cloud instance; the bastion host provides a secured entry point into Google Cloud.
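Carr did not go into the database-side details, but preparing an Oracle source for log-based capture generally involves confirming the database is in ARCHIVELOG mode and enabling supplemental logging so the redo logs carry full change records. The following sketch, using the python-oracledb driver, shows roughly what those prerequisite steps look like; the connection details, privileged account and database-level supplemental logging statement are illustrative assumptions, not Schnuck's actual configuration.

```python
# Minimal sketch: checking and enabling the Oracle-side prerequisites that
# log-based CDC tools typically rely on. Connection details and the use of
# a SYSDBA account are placeholder assumptions for illustration only.
import oracledb

conn = oracledb.connect(
    user="admin_user",                 # hypothetical privileged account
    password="change_me",
    dsn="oracle-host:1521/ORCLPDB",    # hypothetical host and service name
    mode=oracledb.AUTH_MODE_SYSDBA,
)
cur = conn.cursor()

# 1. Confirm the database keeps archived redo logs (required for LogMiner).
cur.execute("SELECT log_mode FROM v$database")
log_mode, = cur.fetchone()
print(f"log_mode = {log_mode}")        # expect 'ARCHIVELOG'

# 2. Turn on supplemental logging so the redo logs carry enough column data
#    for a log reader to reconstruct each change.
cur.execute("ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS")

conn.close()
```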

"One of the clear values of Datastream is the real-time access to data in BigQuery," Carr said. "For us, that means our data science team is working with up-to-date data for our machine learning models, and our programs are running with faster business insights and our stores' teammates can better support our customers."


In addition to Datastream, Google introduced a series of new cloud data initiatives during the inaugural Data Cloud Summit.

The new Analytics Hub and Dataplex systems are aimed at boosting data and analytics capabilities in the Google Cloud. All three services are now available in preview.

The Analytics Hub is a central location to share analytics, while Dataplex provides a data fabric that can help organizations curate data for analysis across the cloud.

Google Datastream change data capture in the real world

Meanwhile, the Datastream change data capture (CDC) service will help bring data from multiple sources into Google's cloud data services, including the BigQuery data warehouse, Cloud SQL database and Cloud Spanner distributed SQL database.

While the name Datastream might appear to imply some form of streaming data service like Apache Kafka or Amazon Kinesis, that's not the case.

IDC analyst Stewart Bond emphasized that Datastream is, first and foremost, a CDC technology. Bond explained that CDC monitors source database log files for changes to data, then captures and forwards those changes to a target for processing.

"Log-based change data capture is a method that has been used for many years to capture changes to data in databases in a noninvasive manner," Bond said. "It means there is no query impact on the source database, no stored procedures or triggers to write, and no shadow tables to manage."

Where Datastream is headed

In a keynote session at the event, Andi Gutmans, general manager and VP of engineering for databases at Google, explained that Datastream is built on a serverless architecture that scales up or down in real time as data volumes shift.

Gutmans noted that Datastream will integrate with Google's Dataflow templates to help create up-to-date replicated tables in BigQuery for analytics. Datastream can also help replicate and synchronize databases into Google's Cloud SQL or Cloud Spanner for database migration or to enable hybrid cloud configurations.
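Once a Dataflow template has landed a replicated table in BigQuery, analytics teams can query it like any other table with the standard google-cloud-bigquery client. The short sketch below assumes a hypothetical project, dataset and table kept current by Datastream; it is an illustration, not part of Google's announcement.

```python
# Minimal sketch: querying a Datastream-replicated table in BigQuery with the
# standard client library. Project, dataset and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # assumed project ID

query = """
    SELECT store_id, SUM(total) AS daily_sales
    FROM `my-analytics-project.replica_dataset.orders`
    WHERE DATE(order_ts) = CURRENT_DATE()
    GROUP BY store_id
    ORDER BY daily_sales DESC
"""

for row in client.query(query).result():
    print(row.store_id, row.daily_sales)
```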

"The public preview of data stream supports streaming change data from Oracle and MySQL sources, hosted either on premises or in the cloud," Gutmans said. "Over time it will add support for more sources, and more destinations."
