
Databricks introduces Delta Lake 3.0 to help unify data

As part of the open source community developing the data storage platform, the vendor unveiled the platform's latest iteration, with data unification the main goal of the update.

Databricks on Wednesday unveiled Delta Lake 3.0, the latest version of the open source storage format used by many of the vendor's customers.

In addition to Delta Lake, which Databricks first developed and open sourced in 2019, the cloud storage platforms of Databricks and other data lakehouse vendors can be used with the Apache Hudi or Apache Iceberg storage formats.

But when an organization deploys data storage repositories with different storage formats in different instances -- for example, the finance department's data tables stored in a lakehouse on top of Delta Lake and the human resources department's data tables stored in a lakehouse on top of Apache Iceberg -- the deployments are unable to work together.

The result is data isolated within different deployments.

Delta Lake 3.0 aims to unify data regardless of the storage format with the introduction of Universal Format (UniForm). UniForm enables Delta Lake to be used not only with Delta Lake tables but also with Hudi and Iceberg tables, which enables organizations to unify their previously isolated data.

That is an important evolution, according to Kevin Petrie, an analyst at Eckerson Group. He noted that a recent survey conducted by Eckerson Group and BARC found that more than 80% of organizations deploy two or more data platforms.

"Databricks is right to offer a common format that makes tables more compatible and portable across platforms," Petrie said.

Databricks unveiled Delta Lake 3.0 during Data + AI Summit, the vendor's user conference in San Francisco. Currently in preview, the data storage architecture is expected to be generally available sometime during the second half of 2023.

Unification

One of Databricks' goals from its inception was to help organizations unify their data.

Historically, structured data stored in the cloud was housed in data warehouses, while unstructured data was housed in data lakes. Structured and unstructured data were isolated from one another.

Databricks, when it was founded in 2013, aimed to join structured and unstructured data by helping pioneer the concept of a data lakehouse, which is a combination of a data warehouse and data lake that enables organizations to store all types of data together.

To further help organizations unify their data, Databricks in 2022 unveiled Unity Catalog, a data catalog that enables organizations to put data governance and lineage measures in place.

In addition, a recent focus for the vendor has been industry-specific versions of its lakehouse aimed at helping organizations become more efficient.

However, the problem of disparate storage formats for lakehouses remained.

Databricks, which actively participates in open source projects and makes much of its own product development open source, worked with the Delta Lake community to develop the latest version of the architecture. The goal was to eliminate isolated data, according to Joel Minnick, vice president of marketing at Databricks.

"There's still one piece out there that's creating silos," he said. "The Delta Lake community looked at this and said, 'If the idea of the lakehouse is unification, let's really solve the problem.' The big change is UniForm, which … unifies all three lakehouse formats and opens up the entire lakehouse ecosystem."


UniForm works by enabling Delta Lake users to read and write to data tables stored in Hudi and Iceberg as if they were stored in Delta Lake.

The feature automatically translates the tables' format-specific metadata into metadata that Delta Lake can understand, which enables the data tables to be unified. Previously, data engineers had to manually convert table formats to combine data tables stored in different formats.
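
Based on the preview documentation, that conversion is a table-level setting rather than a migration job. The PySpark sketch below shows roughly how enabling UniForm looks; the table and columns are invented for illustration, and the delta.universalFormat.enabledFormats property reflects the preview docs, so it may change before general availability.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session configured with the Delta Lake 3.0 preview.
spark = SparkSession.builder.appName("uniform-demo").getOrCreate()

# Hypothetical table. UniForm is switched on with a table property rather
# than a manual format conversion; the property name follows the Delta
# Lake 3.0 preview docs and may change before general availability.
spark.sql("""
    CREATE TABLE finance_transactions (
        txn_id   BIGINT,
        amount   DECIMAL(10, 2),
        txn_date DATE
    )
    USING DELTA
    TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg')
""")

# Writes go through Delta as usual; UniForm keeps Iceberg metadata in sync
# alongside the Delta transaction log, so Iceberg-based engines can read
# the same underlying data files without a copy.
spark.sql("INSERT INTO finance_transactions VALUES (1, 42.50, current_date())")
```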

But not only will UniForm ease the burden on data engineers and make them more efficient, it also has the potential to result in better data quality, according to Petrie.

"It's a great concept," he said. "[The] mix of data warehouses, data lakes, lakehouses and other approaches creates data quality issues, transformation complexities and confusion about what data to use for a given project."

However, UniForm is only a great concept if using Delta Lake 3.0 doesn't result in vendor lock-in to Databricks, Petrie continued. If choosing Delta Lake 3.0 limits organizations' options to reformat data or move it out of Delta Lake to another storage format, it will be problematic.

In addition to UniForm, Delta Lake 3.0 includes Delta Kernel and Liquid Clustering.

Delta Kernel aims to address connector fragmentation with a stable API that ensures connectors are built against the Delta specification, eliminating the need to update connector code with each new version of Delta or change to its protocol.
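
Conceptually, Kernel moves transaction-log handling out of each connector and behind one narrow interface. The Python sketch below illustrates that design idea only; every class and method name in it is hypothetical, not the real Kernel API.

```python
from dataclasses import dataclass
from typing import Iterator, List

# Hypothetical sketch of the idea behind Delta Kernel: a connector codes
# against one narrow, stable interface instead of parsing the Delta
# transaction log itself. None of these names are the real Kernel API.

@dataclass
class ScanFile:
    path: str              # data file the connector must read
    partition_values: dict # partition values already resolved by the kernel

class Snapshot:
    """A resolved table version: the kernel has already replayed the log."""

    def __init__(self, files: List[ScanFile]):
        self._files = files

    def scan_files(self) -> Iterator[ScanFile]:
        return iter(self._files)

class Kernel:
    """Stable facade: new Delta protocol features land here, not in connectors."""

    def snapshot(self, table_path: str) -> Snapshot:
        # Toy stand-in for reading the _delta_log and resolving the version.
        return Snapshot([ScanFile(f"{table_path}/part-0000.parquet", {})])

# A connector written once against the interface above keeps working
# across protocol changes, which is the fragmentation Kernel targets.
for f in Kernel().snapshot("/data/events").scan_files():
    print(f.path)
```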

Delta Liquid Clustering, meanwhile, is a flexible data layout that aims to improve the performance of reads and writes to data tables as well as provide more cost-efficient data clustering as data volumes increase.
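
As a rough illustration, declaring a liquid-clustered table is expected to look like the PySpark sketch below; the table and columns are invented, and the CLUSTER BY clause follows the preview announcement, so the exact syntax may change before general availability. Unlike fixed Hive-style partitioning, the clustering keys act as a hint the engine can reorganize around incrementally as data grows.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session configured with the Delta Lake 3.0 preview.
spark = SparkSession.builder.appName("liquid-demo").getOrCreate()

# Hypothetical table. CLUSTER BY replaces a rigid PARTITIONED BY directory
# layout: the engine can incrementally re-cluster data on the named keys
# as volume grows, rather than rewriting a fixed partition scheme.
spark.sql("""
    CREATE TABLE hr_employees (
        emp_id BIGINT,
        dept   STRING,
        hired  DATE
    )
    USING DELTA
    CLUSTER BY (dept)
""")
```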

All of the new capabilities included in the new version of Delta Lake resulted from customer requests, according to Minnick.

"No customer wants a format war -- it's not good for anybody," he said. "We were hearing that different formats are being picked across different tools, and that's creating complexity. Lakehouses are about being open and flexible. So we're trying to follow that through and not make the format people choose for their lakehouse restrictive."

AI in focus going forward

Beyond its participation in developing Delta Lake 3.0, Databricks on June 20 expanded its marketplace to better enable customers to share and monetize the applications they develop and access data and AI tools within the vendor's Delta Sharing ecosystem.

In addition, on June 26 Databricks reached an agreement to acquire MosaicML for $1.3 billion in a move aimed at better enabling Databricks customers to develop their own generative AI and large language models (LLMs).

Generative AI and LLMs, meanwhile, will be a major emphasis for Databricks in the coming months, according to Minnick.

In the seven months since OpenAI launched ChatGPT -- which marked a substantial increase in generative AI and LLM capabilities -- numerous data management and analytics vendors have unveiled plans to add generative AI and LLM capabilities.

For example, Databricks competitor Snowflake on June 27 unveiled a host of generative AI and LLM capabilities now on its roadmap.

Databricks is no different. But rather than add generative AI through integrations, the vendor developed its own LLM in March and is acquiring generative AI capabilities -- and know-how -- with MosaicML.

That focus on generative AI is appropriate, according to Petrie.

In particular, he noted that Databricks has recognized that organizations don't necessarily need all the public data in LLMs and could be better served building their own smaller language models trained mostly on their own data and augmented with relevant public data.

Co-founder and chief technologist Matei Zaharia and Databricks have helped educate the lakehouse community about how LLMs "don't really need to be large," Petrie said.

"Language models that address domain-specific use cases are a lot easier to build than the industry expected," he added. "We should get ready for a boom of small language models, with Databricks playing a key role."

Eric Avidon is a senior news writer for TechTarget Editorial and is a journalist with more than 25 years of experience. He covers analytics and data management.
