Imply advances Apache Druid real-time analytics database
The database vendor is continuing its effort to evolve the Apache Druid database, developing new capabilities to help users more easily load and transform data.
The open source Apache Druid real-time analytics database now includes a multi-stage query engine that commercial database vendor Imply built as part of the Project Shapeshift effort. The new features in Druid are generally available today.
Based in Burlingame, Calif., Imply raised $100 million in a Series D funding round in May in an effort to continue building out its own commercial Imply Polaris cloud database service. Imply Polaris is based on Apache Druid.
Apache Druid is an online analytical processing (OLAP) database that competes against both open source and closed source systems. On the closed source side, competitors include Rockset and Aerospike. In open source, Druid competes with Apache Pinot, which has a commercial offering led by database vendor StarTree.
Apache Druid now integrates a multi-stage query execution engine that enables data to be loaded faster than in previous versions of the database.
The execution engine also allows users to execute data transformations using SQL. The new Druid features are coming to the cloud via Imply's Polaris database as a service, which is also updated to make it easier for developers to operate the service with new data visualizations. Imply has been working for most of this year on making Druid more powerful in a project it has codenamed Project Shapeshift.
What multi-stage queries bring to the Druid real-time analytics database
The multi-stage query engine for Druid with SQL-based ingestion and in-database transformation is a step forward, said IDC analyst Amy Machado.
Using SQL code for ingestion not only makes it easier for the developer but also makes the query faster, Machado noted. The same is true for the in-database transforms, which remove an additional step that otherwise could have taken more time.
"Complex queries for real-time apps and data analytics need a database that can handle high concurrencies with very low latency, and the ability to do joins with historical data to make sure that the data is always in context," Machado said. "Imply is contributing to the Apache Druid project to help ensure these requirements."
Optimizing workflows for real-time analytics
Multi-stage queries will help optimize the overall workflow for users of the Druid database, said Fangjin Yang, co-founder and CEO of Imply.
The ability to load data and then transform it inside the database minimizes the need for organizations to build out an external data pipeline. Without the multi-stage engine, an organization might have needed multiple tools to load and transform data before bringing it into Druid.
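As a rough illustration of loading and transforming in one step, the Python sketch below builds a single SQL statement that reads an external JSON file, parses a timestamp, and derives a column during ingestion. The datasource name, source URL, and column names are hypothetical, and the `INSERT ... SELECT ... FROM TABLE(EXTERN(...)) PARTITIONED BY` form is an assumption based on Druid's SQL-based ingestion syntax that should be checked against the current Druid documentation.

```python
import json


def build_ingest_request(datasource: str, source_uri: str) -> dict:
    """Build a request body holding one SQL statement that loads and transforms data.

    All table and column names here are hypothetical examples.
    """
    # EXTERN is assumed to take three JSON strings:
    # where the data lives, how it is encoded, and its column signature.
    input_source = json.dumps({"type": "http", "uris": [source_uri]})
    input_format = json.dumps({"type": "json"})
    signature = json.dumps([
        {"name": "ts", "type": "string"},
        {"name": "page", "type": "string"},
        {"name": "bytes_added", "type": "long"},
    ])

    # The SELECT list is where the in-database transformation happens:
    # parse the timestamp, normalize a string, and derive a new column.
    sql = f"""
INSERT INTO "{datasource}"
SELECT
  TIME_PARSE("ts") AS "__time",
  UPPER("page") AS "page",
  "bytes_added" / 1024.0 AS "kb_added"
FROM TABLE(EXTERN('{input_source}', '{input_format}', '{signature}'))
PARTITIONED BY DAY
""".strip()

    # The body would be posted to Druid's SQL ingestion endpoint;
    # this sketch only builds it and does not contact a server.
    return {"query": sql}


request_body = build_ingest_request("wiki_edits", "https://example.com/edits.json")
print(request_body["query"])
```

The point of the sketch is that the transformation lives in the same statement as the load, so no separate pipeline tool sits between the source file and the database.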
The data transformation technology in Druid can also load nested data structures, in which fields contain other fields and which can be difficult for users to manipulate directly.
Before the Druid update, users had to first flatten the nested data structures or use some sort of external data pipeline to manipulate that data so it could be used in the database, Yang said. That's no longer the case, as the multi-stage engine is able to load and transform nested data.
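To make the earlier workaround concrete, here is a minimal Python sketch of the kind of flattening step a user might have run outside the database before loading nested records. The `flatten` helper and the sample event are hypothetical and are not part of Druid or Imply Polaris.

```python
def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Recursively flatten nested dictionaries into one level of dotted keys."""
    flat = {}
    for key, value in record.items():
        # Prefix the key with its parent's path, if it has one.
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Descend into nested objects and merge their flattened fields.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical nested event, as it might arrive from an application.
event = {"user": {"id": 7, "geo": {"country": "US"}}, "action": "click"}
print(flatten(event))
# {'user.id': 7, 'user.geo.country': 'US', 'action': 'click'}
```

With nested data handled by the multi-stage engine, a preprocessing step like this would no longer need to sit in front of the database.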
Yang noted that Druid is not competing against data transformation technologies, most notably the open source dbt Core project. The dbt tool enables workflow automation for data transformation, with the ability to schedule and repeat operations, which is not something that Druid or Imply Polaris intends to offer, he said.
Looking forward, Yang said Imply is working on improving ease of use for both Druid and its Polaris DBaaS.
"Some of the stuff that we're working on is basically removing the need for tuning and configuration, even if you deploy just from open source," Yang said. "Basically, we just want it to work without you having to think that much."