fotomek/stock.adobe.com

AWS expands cloud databases with data virtualization

At AWS re:invent 2020 the public cloud giant unveiled enhancements to its database and analytics portfolio, including the Babelfish project for migrating from SQL Server to PostgreSQL.

AWS has enhanced the services surrounding its growing portfolio of cloud databases.

As part of the AWS re:invent 2020 virtual conference, a three-week event running from Nov. 30 to Dec. 18, the cloud giant introduced a series of new cloud database systems and tools.

Among the new cloud database items was the preview release of AWS Glue Elastic Views on Dec. 1. Glue Elastic Views provides data virtualization capabilities for the Glue data catalog, enabling users to combine data from different sources.

Also on Dec. 1, AWS unveiled a preview of the Babelfish for Aurora PostgreSQL service, which enables users to more easily migrate Microsoft SQL Server workloads.

AWS has pledged that Babelfish will be an open source effort that brings T-SQL support to PostgreSQL. A week later on Dec. 8, AWS introduced Amazon Neptune ML bringing machine learning capabilities to its graph database.

In this Q&A, Herain Oberoi, general manager of databases, analytics and blockchain marketing at AWS, outlines some of the cloud giant's recent database innovations.

How do you limit data duplication when moving data around for analysis and where does data virtualization fit in?

Herain Oberoi: Moving data around doesn't necessarily mean duplicating the data. Moving it around means being able to access the data, regardless of where you're accessing it from.

AWS general manager of databases, analytics and blockchain marketing Herain OberoiHerain Oberoi

There are a number of different capabilities that the team has been working on. We've had, for example, a capability called federated query in RedShift and Athena, for a while now. With federated query I can run a query that can access data from Aurora, but also access data in S3, and also access data in Redshift.

We also announced a new capability called AWS Glue Elastic Views. What Glue Elastic Views does is, it basically allows you to create a materialized view. It's kind of like a virtual table of data that can come from multiple different data sources. So, I can have some data coming from S3, some coming from Aurora and other data coming from DynamoDB, for example.

This is not a one-time thing either. Once I create a materialized view, it actually creates a target; it stores that virtual table in the target database, and it will continuously monitor changes to any of the source systems and it'll keep that target database up to date.

We debated whether Glue Elastic Views should be called data virtualization. In general, I think we are just trying to be as descriptive as possible and not get fanciful. Practically Glue Elastic Views is a materialized view. It is a virtual table that is virtualizing data if you want to put it that way, but it's effectively the same thing.

We debated whether Glue Elastic Views should be called data virtualization. In general, I think we are just trying to be as descriptive as possible and not get fanciful.
Herain OberoiGeneral manager of databases, analytics and blockchain marketing, AWS

Where do you see opportunities for graph cloud databases from what you have seen with Neptune?

Oberoi: In a general sense, any data set where the data itself has a lot of relationships between different aspects is where a graph model would make sense.

We've seen it used by customers trying to build a knowledge graph. So, you've got a lot of different documents or research notes and things like that and you want to create relationships over the data you've got. You also want to be able to easily search and traverse and find connections.

Fraud detection is a common use case, because sometimes it's easier to find anomalies when you can make connections between what might be disparate things. Product recommendation would be another example of common use case as is life sciences since those data sets tend to be highly connected.

What is the opportunity for PostgreSQL cloud databases with Babelfish?

Oberoi: Today, when a customer wants to migrate an on-premises database, like SQL Server to the cloud, there are a few steps they have to go through.

First, they have to map the schema in the source database to the target. Once they map the schema, then they need to move the data from the source to the target. Usually, while moving it, if there are differences in how the schemas work in the two different databases, there's some amount of data transformation and cleansing that will need to occur.

Once your schema and data has moved over, there's still the application itself and there is some amount of code that has to get rewritten.

If I have an application that uses a lot of T-SQL, which is the proprietary query language for SQL Server, then I have to rewrite all the T-SQL code in my application now to say PostgreSQL. That particular step in the migration, the application code rewrite, is the thing that takes a lot of time. That's the fundamental challenge.

So, the team came up with a pretty creative way to do this with Babelfish, which effectively builds a translation layer.

The benefit is that in my application, I can continue to keep that T-SQL code, but now my application is pointing and running against the Aurora PostgreSQL database. Then I can in my own time decide what parts of that application I want to rewrite in PostgreSQL. I can choose to build new functionality in PostgreSQL and still have my old functionality still running just fine on T-SQL.

Editor's note: This interview has been edited for clarity and conciseness.

Dig Deeper on Data warehousing