Convergence of Data Virtualization Servers and SQL-on-Hadoop Engines?
Hadoop has become a popular and powerful platform for data storage and data processing. Data stored in Hadoop can be used by a wide range of applications and tools, and for a wide range of use cases. The fact that SQL can be used to retrieve Hadoop data has opened up the data to even more tools, especially tools for reporting and analytics. A question organizations have to ask themselves is which SQL-on-Hadoop technology they should invest in. They can go for straightforward SQL-on-Hadoop engines or for data virtualization servers.
Examples of SQL-on-Hadoop engines are Drill, Hive, Impala, and Spark SQL. Many of them only allow data to be queried, but some, such as Splice Machine, offer transactional support on Hadoop. Others, such as Cirro and ScleraDB, support data federation, allowing Hadoop data to be joined with data stored in SQL databases. A technical challenge for most SQL-on-Hadoop engines is how to turn the non-relational data stored in Hadoop, such as variable data, self-describing data, and schema-less data, into flat relational structures. Not all engines are capable of that; those that are not can access only data that is already flat. Nevertheless, SQL-on-Hadoop engines make it easier to use popular tools for reporting and analytics to access big data stored in Hadoop.
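To make the flattening challenge concrete, here is a minimal Python sketch of how nested, self-describing records could be turned into flat rows. It is an illustration only and not how any particular SQL-on-Hadoop engine works. The function names flatten and to_table, the underscore naming of columns, and the choice to emit one row per element of a repeating group are all assumptions made for this example.

```python
from __future__ import annotations

from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> list[dict[str, Any]]:
    """Flatten one nested record into one or more flat rows.

    Nested objects become prefixed columns; lists (repeating groups)
    multiply the rows, one row per list element.
    """
    rows: list[dict[str, Any]] = [{}]
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            sub_rows = flatten(value, prefix=name + "_")
        elif isinstance(value, list):
            sub_rows = []
            for item in value:
                if isinstance(item, dict):
                    sub_rows.extend(flatten(item, prefix=name + "_"))
                else:
                    sub_rows.append({name: item})
            if not sub_rows:
                # Assumption: an empty list is kept as a single NULL column.
                sub_rows = [{name: None}]
        else:
            sub_rows = [{name: value}]
        # Combine every row built so far with every row of this field.
        rows = [{**row, **sub} for row in rows for sub in sub_rows]
    return rows


def to_table(records: list[dict[str, Any]]) -> tuple[list[str], list[tuple]]:
    """Build one flat table; columns missing from a record become None."""
    rows = [row for record in records for row in flatten(record)]
    columns = sorted({column for row in rows for column in row})
    return columns, [tuple(row.get(c) for c in columns) for row in rows]


records = [
    {"id": 1, "customer": {"name": "Ann"}, "items": [{"sku": "A"}, {"sku": "B"}]},
    {"id": 2, "tags": []},  # a record with a different shape
]
columns, table = to_table(records)
```

The two records have different shapes, which is typical of schema-less data. The resulting table has one column for every field seen in any record, and the repeating items group is expanded into two rows for the first record.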
But they are not the only game in town. Data virtualization servers, such as those of Cisco, Denodo, Red Hat, and Stone Bond, also allow Hadoop to be accessed through SQL. In fact, most data virtualization servers allow SQL access to data stored in almost any kind of file system or database server, including spreadsheets, XML and JSON documents, sequential files, pre-relational database servers, data hidden behind APIs such as SOAP and REST, and data stored in applications such as SAP and Salesforce.com. As indicated, data virtualization servers offer access to Hadoop as well, and with that they have entered the market of SQL-on-Hadoop solutions. However, when they access Hadoop, it’s through one of the existing SQL-on-Hadoop engines.
Note that data virtualization servers are more than engines that translate one language into another. For example, all of them offer data federation capabilities for many non-SQL data sources, a high-level design and modeling environment with lineage and impact analysis features, caching capabilities to minimize access to data sources, advanced distributed join optimization techniques, and extensive data security features.
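Two of these features, data federation and caching, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the design of any actual data virtualization server: an in-memory SQLite database stands in for a SQL source, a JSON document stands in for a non-SQL source, and the names CachedSource, make_sql_source, make_json_source, and federated_totals are hypothetical.

```python
import json
import sqlite3


class CachedSource:
    """Wraps a fetch function and caches its result after the first call."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = None
        self.fetch_count = 0  # how often the underlying source was accessed

    def rows(self):
        if self._cache is None:
            self._cache = self._fetch()
            self.fetch_count += 1
        return self._cache


def make_sql_source():
    """A SQL data source: customers stored in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

    def fetch():
        cursor = conn.execute("SELECT id, name FROM customers ORDER BY id")
        return [{"id": row[0], "name": row[1]} for row in cursor]

    return fetch


def make_json_source():
    """A non-SQL data source: orders delivered as a JSON document."""
    document = (
        '[{"customer_id": 1, "amount": 40},'
        ' {"customer_id": 1, "amount": 2},'
        ' {"customer_id": 2, "amount": 7}]'
    )
    return lambda: json.loads(document)


def federated_totals(customers, orders):
    """Join the two sources on customer id and total the order amounts."""
    totals = {}
    for order in orders.rows():
        key = order["customer_id"]
        totals[key] = totals.get(key, 0) + order["amount"]
    return [(c["name"], totals.get(c["id"], 0)) for c in customers.rows()]


customers = CachedSource(make_sql_source())
orders = CachedSource(make_json_source())
first = federated_totals(customers, orders)
second = federated_totals(customers, orders)  # answered from the cache
```

The second call returns the same result without touching either source again, which is the point of caching. A real product would add cache invalidation, query pushdown, and join optimization, none of which this sketch attempts.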
In a nutshell, most current SQL-on-Hadoop engines are tools that solve one technical problem, in this case offering SQL access to Hadoop data. Data virtualization servers are broader solutions that offer access through a range of languages and APIs to almost any kind of data source. They are a more architectural solution.
It’s very likely that SQL-on-Hadoop engines will be extended with typical data virtualization features and that, vice versa, data virtualization servers will be enriched with full-blown, native support for Hadoop access by embedding their own SQL-on-Hadoop technology. Because both try to solve some comparable problems, it’s not unlikely that the two product categories will converge in some way. Some products will merge and others will be extended. This is definitely a market to keep an eye on in the coming years.