Dremio speeds up cloud data lakes for business intelligence
The Dremio fall 2020 update brings performance improvements to the vendor's cloud data lake engine technology, including Apache Arrow-based caching and runtime filtering.
Enterprises have increasingly used cloud data lakes in recent years to enable data analysis. What hasn't been as common is using a cloud data lake as a source to power a business intelligence dashboard, which typically requires fast access and query speed.
Cloud data lake engine vendor Dremio on Oct. 27 made generally available its latest platform update, aimed at better supporting business intelligence workloads on cloud data lakes.
Dremio provides a platform that enables users to query cloud data lakes. The company has been busy in 2020 accelerating different aspects of its technology for specific use cases, including an AWS Edition that promises faster data queries.
With Dremio's fall 2020 update, the vendor is integrating new caching and runtime filtering capabilities that further speed up cloud data lake queries. Dremio also revealed a partnership and integration with Microsoft's Power BI, a widely deployed business intelligence platform.
Why BI in the cloud data lake matters
Some market trends suggest that more organizations are moving toward relatively inexpensive cloud data lake storage, while wanting to retain familiar user-facing BI tools, IDC analyst Dan Vesset said.
"The use case here is for pervasive deployment of BI to potentially very large number of users who may be accessing the data concurrently," Vesset said. "In the short term, Dremio is unlikely to replace many established data warehouses, but would be attractive for net new projects where the organization has a need for cloud data lakes and SQL-based BI tools."
Mike Leone, senior analyst at ESG, said the performance and efficiency problem that Dremio solves for BI on cloud data lakes is a big deal. The lack of BI workload support in some traditional SQL engines in the cloud is forcing data teams to move subsets of data into data warehouses, he added.
"This update will minimize the need to move and create multiple copies of data, while supporting the high levels of concurrency and low latency that data-driven organizations desire for their BI workloads," Leone said.
Why BI has been a challenge for cloud data lakes
There are a few reasons why BI workloads differ from other data analysis processes that run on cloud data lake storage, according to Tomer Shiran, Dremio co-founder and chief product officer.
Shiran explained that with a typical data exploration use case, the user submits a SQL query to the cloud data lake engine, gets a response and then submits another query. That type of approach has a relatively low level of concurrency, in which few queries are running at the same time.
In contrast, a BI dashboard provides metrics on business data and is constantly being refreshed by many people within an organization, so many queries run concurrently. As such, Shiran pointed out, the number of queries per second is much higher for BI than for the data exploration use case.
How Dremio accelerates cloud data lake queries for business intelligence
To help enable faster queries on cloud data lakes, Dremio has added a data caching capability that comes from the open source Apache Arrow project. The cache holds data read from Amazon S3, so queries that need that data take less time to execute.
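The general idea behind caching reads from object storage can be shown with a minimal Python sketch. It is not Dremio's implementation and does not use Apache Arrow; the `ReadThroughCache` class, the `FAKE_STORE` dictionary standing in for S3, and the block names are all hypothetical, chosen only to show how a cache cuts down on remote reads when a dashboard repeats the same query.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Small least-recently-used cache in front of a slow object store (hypothetical)."""

    def __init__(self, fetch, capacity=2):
        self.fetch = fetch            # function that reads one block from remote storage
        self.capacity = capacity      # maximum number of blocks kept locally
        self.blocks = OrderedDict()   # cached blocks, oldest first
        self.remote_reads = 0         # how many times remote storage was actually hit

    def get(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)     # mark as most recently used
            return self.blocks[key]
        self.remote_reads += 1               # cache miss: go to remote storage
        value = self.fetch(key)
        self.blocks[key] = value
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return value


# Stand-in for object storage such as S3 (assumption: keys map to lists of values).
FAKE_STORE = {"sales/part-0": [1, 2, 3], "sales/part-1": [4, 5, 6]}

cache = ReadThroughCache(FAKE_STORE.__getitem__)

# The same dashboard query runs three times; only the first run reads remotely.
for _ in range(3):
    total = sum(cache.get("sales/part-0")) + sum(cache.get("sales/part-1"))

print(total, cache.remote_reads)
```

In this toy setup, the three identical queries touch remote storage only for the two initial block reads; every later read is served from the local cache.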
Another innovation that has landed in Dremio to better enable business intelligence applications for cloud data lakes is a runtime filtering feature.
Shiran explained that as a query executes, the Dremio engine learns which data needs to be pulled from S3 to get the right result. Without runtime filtering, Shiran said, a query engine would typically need to go through a large volume of data from an underlying data source to identify where the correct result can be found.
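A rough Python sketch illustrates the general technique. It is a simplified illustration under stated assumptions, not Dremio's engine: the `stores` and `fact_partitions` data, the `join_total` function and the per-partition `min_id`/`max_id` metadata are all hypothetical. The query first reads the small table, learns which keys matter, and then skips any partition of the large table that cannot contain those keys.

```python
# Hypothetical small dimension table.
stores = [
    {"store_id": 1, "region": "EU"},
    {"store_id": 2, "region": "US"},
    {"store_id": 7, "region": "EU"},
]

# Hypothetical large fact table, split into partitions that each carry
# min/max metadata for store_id (assumed readable without scanning the rows).
fact_partitions = [
    {"min_id": 1, "max_id": 2,
     "rows": [{"store_id": 1, "amount": 10}, {"store_id": 2, "amount": 20}]},
    {"min_id": 3, "max_id": 4,
     "rows": [{"store_id": 3, "amount": 30}, {"store_id": 4, "amount": 40}]},
    {"min_id": 7, "max_id": 8,
     "rows": [{"store_id": 7, "amount": 70}, {"store_id": 8, "amount": 80}]},
]


def join_total(region, use_runtime_filter):
    """Sum sales for one region; return (total, number of fact rows read)."""
    # Step 1: read the small side and collect the join keys that matter.
    wanted = {s["store_id"] for s in stores if s["region"] == region}

    total = 0
    rows_read = 0
    for part in fact_partitions:
        # Step 2: with the filter on, skip partitions whose key range
        # cannot contain any wanted key.
        if use_runtime_filter and not any(
            part["min_id"] <= key <= part["max_id"] for key in wanted
        ):
            continue
        for row in part["rows"]:
            rows_read += 1
            if row["store_id"] in wanted:
                total += row["amount"]
    return total, rows_read


print(join_total("EU", use_runtime_filter=False))
print(join_total("EU", use_runtime_filter=True))
```

Both calls return the same total, but the filtered run reads four fact rows instead of six because the middle partition is never scanned. On a real data lake, the skipped partitions would be files that never have to be fetched from S3.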
"Runtime filtering can drastically reduce the amount of data that has to get read from S3, which, in turn, can increase the performance by orders of magnitude as well," Shiran said.