
Databricks opens up data lakes for data sharing

At its virtual Data + AI Summit, the AI vendor detailed new capabilities to make information stored in cloud data lakes more accessible and secure.

Databricks on May 26 introduced Delta Sharing, an open source protocol designed to make data stored in data lakes easier to share.

Also at its virtual Data + AI Summit on May 26, the machine learning and data lake vendor released a series of other new capabilities for its Delta Lake data lake project, including tools for data collaboration and streaming data.

Databricks, based in San Francisco, has had a busy year so far, raising $1 billion in a Series G funding round on Feb. 1.

Among the key technologies that Databricks has been building out in recent years is the open source Delta Lake project, a data lake technology that is now run by the Linux Foundation. Databricks provides a commercially supported platform that implements Delta Lake.

Now Databricks is expanding Delta Lake with a new open source data collaboration protocol, Delta Sharing. Databricks also introduced new data governance capabilities for Delta Lake, grouped in the Unity Catalog, as well as a live table capability to support streaming data.

Delta Sharing is a useful concept, said Dave Menninger, an analyst at Ventana Research.

"As data migrates out of data centers and lives in a variety of cloud-based data sources, an open protocol for sharing data makes sense," Menninger said. "The real value of Delta Sharing will be determined by how many vendors agree to support it."

Menninger added that Databricks seems to have done a good job initially with third-party commitments, which could create the momentum needed to entice even more support.


In a keynote address at the virtual conference on May 26, Databricks CEO Ali Ghodsi said that AWS Data Exchange is among the services that will support the Delta Sharing protocol. Ghodsi also noted that Microsoft, Google, Tableau and Starburst have said they will integrate support for Delta Sharing into their products.

How Delta Sharing works to enable data lake collaboration

In a keynote address at the conference, Matei Zaharia, co-founder and CTO of Databricks, said that a primary goal of Delta Sharing is to smooth the sharing of data that an organization already has in its data lake, without the need to copy it out into another system.

"We wanted to make data easy to consume in a wide range of clients," Zaharia said.

Two parties are involved in the Delta Sharing model: the data provider and data recipient. Zaharia explained that the data provider can start with an existing table it already has in the Delta Lake format. Delta Sharing also supports the Apache Parquet format, which is widely used for data lakes. 

The open source Delta Sharing protocol aims to provide a mechanism that makes it easier to share and collaborate on data stored in cloud data lakes.

"If you're not using Delta Lake and you're just using Apache Parquet, it's also very easy to create a Delta table that points to your existing Parquet data," Zaharia said.

A Delta Sharing server is deployed on the data provider's side. The server implements the protocol and exposes the interface that enables the actual sharing with the data recipient.

Zaharia pointed out that Delta Sharing allows the recipient to ask for just a subset of the table data. For example, if a user just cares about the sales for one line of products, they can access just that subset of data.

The process of getting data to the recipient uses cloud object storage to transfer data quickly. The Delta Sharing server generates short-lived URLs on Amazon S3 that allow the client to request only the specific files it is actually allowed to access.
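The flow Zaharia described can be illustrated with a toy sketch: a sharing server filters a table's backing files by the recipient's request and hands back short-lived URLs for just those files. All names below are invented for illustration; the real protocol uses pre-signed cloud object storage URLs rather than the simple expiry query string shown here.

```python
import time

# Files backing a shared table, keyed by the partition they hold
# (paths and partition names are hypothetical).
TABLE_FILES = {
    "product_line=widgets": "s3://bucket/sales/part-0001.parquet",
    "product_line=gadgets": "s3://bucket/sales/part-0002.parquet",
}

def sign_short_lived_urls(requested_partition, ttl_seconds=300):
    """Return 'signed' URLs, with an expiry, for only the files
    matching the recipient's request."""
    expires = int(time.time()) + ttl_seconds
    return [
        f"{path}?expires={expires}"
        for partition, path in TABLE_FILES.items()
        if partition == requested_partition
    ]

# A recipient who only cares about one product line gets only that file.
urls = sign_short_lived_urls("product_line=widgets")
print(urls)
```

Because the URLs expire, the recipient must return to the sharing server for fresh ones, which is where the server re-checks what the recipient is allowed to see.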

"We really think that the future of data sharing is open, and we think Delta Sharing is going to be a key part of that," Zaharia said.

Delta Sharing is generally available immediately.
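As a sketch of the recipient's side, the open source project's Python client addresses a shared table as `<profile-file>#<share>.<schema>.<table>`, where the profile file holds credentials issued by the provider. The share, schema, and table names below are invented for illustration.

```python
# Build the table identifier the delta-sharing Python client expects
# (all names here are hypothetical).
profile_path = "config.share"  # credentials file from the data provider
share, schema, table = "sales_share", "retail", "orders"

table_url = f"{profile_path}#{share}.{schema}.{table}"
print(table_url)  # config.share#sales_share.retail.orders

# With the delta-sharing client installed and a live sharing server,
# a recipient could then load the table into pandas:
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(table_url)
```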

Unity Catalog brings data governance to data lakes

The Unity Catalog, a data governance capability, is now in preview for Databricks cloud users.

With the Unity Catalog, Databricks puts a single unified object model in front of all the data an organization might have in a data lake. Access policies in the catalog are defined using standard SQL.

"It's a really powerful way to manage security permissions at scale," Zaharia said.
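A toy model can illustrate the SQL-style grants the article describes. The commented statement mirrors standard SQL `GRANT` syntax; the group and table names are invented, and this sketch is not Unity Catalog's actual implementation.

```python
# Minimal sketch of grant-based access checks, assuming a policy of the
# form: GRANT SELECT ON sales.orders TO analysts;
GRANTS = set()

def grant(privilege, obj, principal):
    """Record a grant, as the equivalent SQL GRANT statement would."""
    GRANTS.add((privilege, obj, principal))

def is_allowed(privilege, obj, principal):
    """Check whether a principal holds a privilege on an object."""
    return (privilege, obj, principal) in GRANTS

grant("SELECT", "sales.orders", "analysts")

print(is_allowed("SELECT", "sales.orders", "analysts"))     # True
print(is_allowed("SELECT", "sales.customers", "analysts"))  # False
```

Defining policy declaratively in SQL, rather than per file or per cluster, is what makes it practical to manage permissions at scale across a data lake.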
