
Dremio updates data catalog to support all deployment types

Recognizing that some customers need to keep data on-premises, the vendor has extended the availability of its Data Catalog for Apache Iceberg beyond its cloud-based platform.

Dremio on Tuesday updated its Data Catalog for Apache Iceberg to support all deployment options, enabling customers to manage data across varied data centers, regions and clouds.

Previously, the vendor's Data Catalog for Apache Iceberg was available only in Dremio Cloud, the vendor's cloud-based platform. The update adds availability in Dremio Software, the on-premises version of the platform, so the data catalog now supports cloud, on-premises and hybrid deployments.

In addition, Dremio unveiled integrations with Snowflake's managed service for Apache Polaris (Incubating) and Databricks' managed Unity Catalog service, enabling customers to choose the data catalog that best suits their needs.

Many enterprises still have data on-premises because of how difficult it can be to migrate data to the cloud and the need for some data to be kept on-premises for safety and security reasons, noted Kevin Petrie, an analyst at BARC U.S. Dremio's support is therefore important because it enables all customers to catalog Iceberg tables rather than only those with cloud-based deployments.

"This is a significant addition because many enterprises still have data on premises thanks to migration complexity, sovereignty requirements and data gravity," Petrie said. "Dremio can help them catalog Iceberg tables wherever they are -- on-premises or in the cloud."

Based in Santa Clara, Calif., Dremio is a data lakehouse vendor whose platform is built on the Apache Iceberg open table format. Other open table formats for lakehouses include Apache Hudi and Delta Lake.

Lakehouses, meanwhile, combine the structured data management capabilities of data warehouses with the unstructured data management capabilities of data lakes, enabling users to combine their data for a more complete view of their operations. Because of that ability to unite disparate data types, lakehouses are a preferred storage architecture for AI models and applications, including generative AI, which perform best when trained on large amounts of high-quality data.

Dremio has extended the availability of its Data Catalog for Apache Iceberg to on-premises and hybrid users in addition to the cloud-based customers it initially supported.

New capabilities

Data catalogs are essentially indexes for data. They organize datasets, dashboards, reports, models and other assets based on their semantics and other metadata, making data easy to discover and to govern for safety and security.

Vendors such as Alation and Collibra specialize in data catalogs, while data management vendors such as Informatica also provide them. Meanwhile, data platform vendors, including Dremio, Databricks, AWS, Google, Microsoft and Snowflake, offer data catalogs as well.

Dremio's Data Catalog for Apache Iceberg is built on Project Nessie, an open source catalog that uses semantic modeling to organize metadata of tables created with open table formats with the aim of making it easier and faster to query data.
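Project Nessie's catalog-level versioning works much like Git for tables: clients can branch the catalog, work against the branch in isolation and merge changes back. As an illustrative sketch only (the catalog name `nessie` and the branch name `etl` are hypothetical values, not Dremio-documented defaults), Nessie's Spark SQL extensions express this along these lines:

```sql
-- Create an isolated branch of the catalog from main
CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main;

-- Point the session at the branch; subsequent reads and writes are isolated
USE REFERENCE etl IN nessie;

-- ... run experimental writes against Iceberg tables here ...

-- Publish the branch's changes back to main
MERGE BRANCH etl INTO main IN nessie;
```

Because the branch is a lightweight pointer over catalog metadata rather than a physical copy, experimentation carries no duplication cost and no risk to the production state of the tables.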

It first launched for cloud-based customers in May and has since been hardened to support additional deployment options, according to James Rowland-Jones, Dremio's vice president of product.

Now, using Dremio's data catalog, customers can take advantage of that speed and simplicity across deployments, whether in the cloud, on premises or a hybrid of both.

"Dremio Data Catalog for Apache Iceberg has been developed and hardened in the cloud, supporting … use cases for our enterprise customers," Rowland-Jones said. "With this release, Dremio is bringing these capabilities to our Dremio Software customers, empowering them to build data lakehouses on their terms."

Regarding the impetus for extending Dremio's data catalog capabilities to Dremio Software users, Rowland-Jones noted that feedback led the vendor to understand that certain customers, especially those in highly regulated industries, were being underserved.

"Customers who operate in regulated markets need to retain control of their metadata in the same way they need to retain control of their data," he said. "They needed an Apache Iceberg data catalog on their terms, where the data stored in this catalog had to reside in their environment."

Key features of the updated data catalog include the following:

  • Support for all Iceberg engines, including Dremio, Flink, Spark and Trino, through Iceberg's REST catalog API to provide users with choices and enable them to continue using Dremio should they alter other parts of their data ecosystems.
  • Centralized data governance across all of an organization's data, including role-based access control and other fine-grained access privileges to make sure data is secure and compliant.
  • Automated table optimization to improve performance and lower compute costs by compacting data files and removing unneeded data.
  • Data branching -- making copies of data for testing -- and version control to enable experimentation, including virtual development environments, while preventing risk to production data.
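The engine interoperability described above hinges on Iceberg's REST catalog specification, which Spark, Flink, Trino and Dremio all speak. As a hedged illustration (the catalog name, endpoint URL and warehouse path are hypothetical placeholders, not values documented by Dremio), a Spark deployment might register such a catalog with configuration along these lines:

```properties
# spark-defaults.conf -- registering an Iceberg REST catalog (illustrative values)
spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakehouse.type=rest
spark.sql.catalog.lakehouse.uri=https://catalog.example.com/api/iceberg
spark.sql.catalog.lakehouse.warehouse=s3://example-bucket/warehouse
```

Because any REST-capable engine can be pointed at the same endpoint this way, swapping one query engine for another does not require re-cataloging tables.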

Combined, the features help Dremio customers simplify and accelerate queries in diverse data environments by eliminating the need to move data or create data duplicates, according to Petrie.

"By using Dremio, they can speed up their performance to support analytics projects," he said.

Meanwhile, just as the update to its own data catalog aims to aid users of all deployment types, Dremio's integrations with data catalog tools from Snowflake and Databricks give customers open source options for storing and managing metadata, according to Petrie.

Apache Polaris is an open source catalog for Apache Iceberg. Snowflake's managed service for Apache Polaris (Incubating) enables users to integrate and govern data stored in lakes and lakehouses. Similarly, Databricks' managed Unity Catalog service is based on an open source project and enables users to integrate and govern data, whether stored in Databricks or another vendor's environment.

"Dremio is integrating with Snowflake's Polaris service and Databricks' Unity Catalog service because it understands the need to maintain open metadata repositories," Petrie said. "The Polaris integration is especially strategic given its ties to the popular Iceberg format."

Plans

Following the addition of Data Catalog for Apache Iceberg to Dremio Software and the integrations with open source data catalogs from Databricks and Snowflake, continuing to participate in the open source community is an important part of Dremio's roadmap, according to Rowland-Jones.


In particular, the vendor plans to help advance the data catalog capabilities of Apache Polaris (Incubating).

"You will see us continue to support the project, extending its capabilities," Rowland-Jones said.

Petrie, meanwhile, said that Dremio's participation in the open source community is wise.

Dremio is far smaller than some of its competitors, including Databricks and tech giants that offer lakehouse options such as AWS and Google Cloud. Working with open source projects provides Dremio with exposure it might not otherwise get if its sole product development focus were internal.

Specifically, Dremio has been able to attract users by promoting the use of open source projects such as Apache Arrow for columnar processing in memory and Apache Iceberg for open table formats, according to Petrie. The extension of Data Catalog for Apache Iceberg to Dremio Software users continues that effort.

"Dremio has strengthened its market position by evangelizing several popular open-source projects," he said.

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.
