Treeverse set to launch LakeFS cloud data lake service

The open source LakeFS technology gets a fully managed cloud service to help organizations better iterate on and version cloud data sources for development efforts.

Startup Treeverse on Wednesday introduced its LakeFS Cloud service, a managed offering that will provide organizations with versioning capabilities for cloud data lakes.

The service is expected to be generally available on June 27, the vendor said.

With a cloud data lake, users can store different types of data, but there is usually little or no tracking of how the data changes over time, and no ability to easily revert to an earlier version.

The open source LakeFS project, which Treeverse created in 2020, is designed to enable versioning for a data lake, in much the same way that the Git version control system enables developers to track and build versions of application code.

The vendor's goal with the cloud service is to provide an offering that is managed and deployed by Treeverse, rather than requiring users to deploy and manage it on their own.

Treeverse, based in Tel Aviv, Israel, and founded in 2020, faces a number of competitors, including the open source Nessie project run by Dremio, as well as the AWS Lake Formation service that provides limited versioning and data catalog capabilities.

At launch, LakeFS Cloud will be available only on AWS, with support for Google Cloud and Microsoft Azure planned for the coming months.

The versioned data lake and healthcare

Among the users of the open source LakeFS technology is healthcare startup Karius, based in Redwood City, Calif., which has developed a technology that combines chemistry and AI to diagnose infectious diseases without the need for an invasive procedure.

"As you can imagine, such a complex technology is fueled by massive amounts of complex data that comes with every patient," said Sivan Bercovici, CTO of Karius. "To go from what's in the tube, to what's in the cloud, to what's in a physician report, the chain of custody of data needs to be secured."

Bercovici noted that in the world of data and precision medicine, many organizations have become accustomed to the idea of never deleting any data.

As complexity grows, the challenge of managing all the data is immense, which is why Karius uses LakeFS. By versioning its critical data on LakeFS the same way it versions its code, Bercovici said, the company can rely on that data being available and discoverable.

"LakeFS brings the much-needed focus in the clouded data space which is the daily reality of pharma and biotech," Bercovici said. "We went from weeks' worth of data hunting, and anxiety around whether or not we got the right data version, to simply being able to rely on the availability of the right data, to the right data scientist, at the right time. It is liberating."

Karius now self-hosts LakeFS and intends to move to the cloud offering in the future to ease management.

"As a rapidly growing company, we want to make sure someone who is deeply versed in the specific technology has you covered for uptime and develops efforts, while we focus on building our differentiated value," Bercovici said.

How LakeFS works to version cloud data lakes

Einat Orr, co-founder and CEO at Treeverse, said a goal for LakeFS from the outset was to enable organizations to apply to data lakes the same engineering best practices used in code development.

Those best practices, as implemented in LakeFS, include the ability to create multiple versions, or branches, of data, allowing users to work with any branch. The technology also enables reversion, so if something is updated, a user can revert to an older version if needed. The ability to merge different branches is another core capability of LakeFS.
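As a rough sketch of that branch-commit-merge workflow, the example below uses the Python client generated from the LakeFS OpenAPI specification (the lakefs_client package). The endpoint, credentials, repository and branch names are placeholders, and exact method and model names can vary between client versions, so treat this as illustrative rather than definitive.

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# Placeholder endpoint and credentials for an existing LakeFS installation.
configuration = lakefs_client.Configuration()
configuration.host = 'https://lakefs.example.com/api/v1'
configuration.username = 'AKIA...'   # LakeFS access key ID (placeholder)
configuration.password = 'secret'    # LakeFS secret access key (placeholder)

client = LakeFSClient(configuration)

# Create an isolated branch of the data lake from main, much like a Git branch.
client.branches.create_branch(
    repository='analytics-repo',
    branch_creation=models.BranchCreation(name='experiment', source='main'),
)

# ... write or modify objects on the 'experiment' branch, then commit the change.
client.commits.commit(
    repository='analytics-repo',
    branch='experiment',
    commit_creation=models.CommitCreation(message='Recompute aggregate tables'),
)

# If the results look good, merge the branch back into main; if not, the branch
# can be discarded or rolled back without touching production data.
client.refs.merge_into_branch(
    repository='analytics-repo',
    source_ref='experiment',
    destination_branch='main',
)
```

The same branch, commit, revert and merge operations are also exposed through the lakectl command-line tool and the LakeFS web interface.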

The open source LakeFS technology requires a server, a database and access to storage.

While users can, and do, set that up on their own in a self-hosted approach, keeping LakeFS running in an optimal deployment can be time-consuming and complex.

That's where the new LakeFS Cloud service comes into play as a managed offering that handles the deployment and operation of LakeFS for users. In the AWS deployment, LakeFS Cloud includes a gateway component that lets organizations use AWS PrivateLink to securely connect to and access their data lakes.

The ability to version data in a data lake can aid development efforts and also helps with data quality, which can be hard to troubleshoot, according to Orr.

"The moment you have the quality of your data questioned, the process is manual, difficult and hard to manage, and this is where the value of LakeFS shines," Orr said. "LakeFS allows reproducibility and reversion capabilities, and it can support working in isolation for development and debugging."
