Dremio accelerates data lake operations with Dart Initiative
The cloud data lake engine platform vendor looks to accelerate queries with its Dart Initiative, which uses a query plan cache to optimize SQL queries against data lakes.
Dremio's Summer 2021 update of its cloud data lake engine platform provides fast query capabilities, powered by an effort the vendor refers to as the Dart Initiative.
Dremio, based in Santa Clara, Calif., has been building out a platform that enables organizations to organize and query cloud data lakes.
This has been an eventful year for the vendor, as Dremio raised $135 million in a Series D round of funding in January to build out the platform. Dremio has been working for several years on making data lake engine queries run faster, but it said it is now going further with Dart, which works with the new platform update, released in general availability on June 3.
With Dart, Dremio is aiming to make queries fast enough to reduce or eliminate the need for an organization to maintain a separate data warehouse into which data must be loaded or copied.
Enterprise data is increasingly fragmented. As a result of that fragmentation, enterprises struggle to create the data plane to build their next-generation applications that can help them thrive in the era of digital disruption, said Holger Mueller, an analyst at Constellation Research.
"Dremio helps them as it allows to keep data in place, respecting data gravity and minimizing data egress costs," Mueller said. He noted that IT management executives who are considering a technology like Dremio need to evaluate latency and performance for their data plane implementations.
Doug Henschen, another Constellation analyst, noted that Dremio has been an innovator in the cloud data lake sphere since the introduction of Apache Arrow in 2016.
One of Dremio's co-founders, Tomer Shiran, helped start the Arrow project, a columnar in-memory format for flat and hierarchical data.
Dremio separates the query layer from where data is stored, which is not a unique approach, according to Henschen. He noted that Cloudera, Databricks and Microsoft Synapse are also among the platforms based on separating data from a processing engine, while also offering combined data lake-data warehouse environments.
"That said, Dremio's multi-cloud capabilities and performance promises are compelling," Henschen said.
Accelerating cloud data lakes with Dart
Shiran, who is chief product officer at Dremio, explained that Dart is a multistage project designed to make data lake operations faster, so users can execute queries as fast as they can inside a data warehouse.
Among the ways Dart speeds up its cloud data lake engine platform is with an advanced query plan cache.
Shiran explained that in any database or data warehouse, a data query needs to be compiled into a query plan that defines how the query will be executed. With the new update, Dremio has accelerated the query plan with a cache system.
Dremio now collects statistics about past queries and how they executed across data tables and columns and then uses that history to optimize the query plan. The query plan itself is also cached, which can be useful for business intelligence dashboards that are constantly refreshed by users, Shiran said.
"We actually cache the query plans themselves so that we don't have to plan the query again and again, every time somebody submits a query and the data hasn't changed," he said.
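The idea Shiran describes can be illustrated with a minimal sketch. This is not Dremio's implementation; the `PlanCache` class, the `toy_planner` function and the `data_version` parameter are hypothetical names chosen for illustration, and the sketch assumes the engine can supply a version number that changes whenever the underlying data changes.

```python
from typing import Callable, Dict, Tuple


def normalize(sql: str) -> str:
    """Collapse runs of whitespace so trivially different texts share one key."""
    return " ".join(sql.split())


class PlanCache:
    """Hypothetical plan cache: reuse a plan while the data version is unchanged."""

    def __init__(self, planner: Callable[[str], str]) -> None:
        self._planner = planner
        # normalized SQL -> (data version the plan was built for, plan)
        self._plans: Dict[str, Tuple[int, str]] = {}
        self.hits = 0
        self.misses = 0

    def get_plan(self, sql: str, data_version: int) -> str:
        key = normalize(sql)
        cached = self._plans.get(key)
        if cached is not None and cached[0] == data_version:
            # Same query, same data: skip planning entirely.
            self.hits += 1
            return cached[1]
        # New query, or the data changed since the plan was built: plan again.
        self.misses += 1
        plan = self._planner(key)
        self._plans[key] = (data_version, plan)
        return plan


def toy_planner(sql: str) -> str:
    """Stand-in for an expensive planning step (assumption, not a real planner)."""
    return f"PLAN({sql})"


if __name__ == "__main__":
    cache = PlanCache(toy_planner)
    query = "SELECT region, SUM(sales) FROM orders GROUP BY region"
    cache.get_plan(query, data_version=1)  # first submission: planned
    cache.get_plan(query, data_version=1)  # dashboard refresh: served from cache
    cache.get_plan(query, data_version=2)  # data changed: planned again
    print(cache.hits, cache.misses)
```

In this sketch a repeatedly refreshed dashboard pays the planning cost once per data change rather than once per refresh, which is the behavior the quote describes.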
As part of the Dart Initiative, the Dremio Summer 2021 update now also supports what Shiran referred to as "unlimited table sizes." He noted that in the past, in the world of data lakes, organizations often struggled with data sets so large, spanning millions of files, that formulating and executing a query plan against them took a significant amount of time.
"We have eliminated that entire problem, so we now support an unlimited table size, with any number of partitions, any number of files -- there's really no limits," Shiran said.
Gandiva
Another way Dremio has made its entire platform faster and more responsive is with the Gandiva code execution technology that is part of the Apache Arrow project, according to the vendor.
Dremio first introduced Gandiva in 2019, alongside its Data Lake Engine 4.0 update. Gandiva uses the LLVM compiler framework to compile query expressions into native code that operates directly on Arrow data, instead of evaluating them inside a Java Virtual Machine (JVM). Running code through a JVM introduces some compute resource requirements and latency.
Gandiva has steadily advanced over the last few years and, with the Summer 2021 Dremio update, contributes toward the overall platform speedup.
Dremio will continue to work on Dart to further accelerate cloud data lake operations, Shiran said.