
Apache Iceberg rising for new cloud data lake platforms

The open source Apache Iceberg project is helping to define a new data tier for cloud data lakes, one intended to improve performance and access for large data sets.

The open source Apache Iceberg data project moves forward with new features and is set to become a new foundational layer for cloud data lake platforms.

At the Subsurface 2021 virtual conference on Jan. 27 and 28, developers and users outlined how Apache Iceberg is used and what new capabilities are in the works. The Apache Iceberg project was originally developed at streaming media giant Netflix in 2018 and became part of the Apache Software Foundation in 2019. Iceberg provides an open table format for large data sets and is particularly useful for cloud data lake deployments. It is often compared to the Linux Foundation's Delta Lake open source project, which has similar goals.

While Iceberg was created at Netflix to help solve its cloud data lake challenges, the Apache Iceberg technology is finding increasing adoption by large companies, including Apple, Expedia and Adobe, among others. For cloud data lake engine vendor Dremio, which was the host and lead sponsor of the Subsurface conference, Iceberg is set to become the foundation of a new data tier to help organizations make more effective use of their data.

Iceberg in Adobe's cloud data lake

In a technical session on Thursday, Gautam Kowshik, senior computer scientist at Adobe, outlined how the software giant is using Iceberg to help enable its Adobe Experience Platform.

The Adobe Experience Platform uses data to help provide personalized experiences to users. The platform runs on Microsoft Azure Data Lake Storage (ADLS) at the infrastructure layer and processes up to 13 TB of data per day in the data lake, Kowshik explained.

"We needed a way to be able to do ACID-compliant transactions and Iceberg is great for that with cloud object stores," Kowshik said. "It's very easy to integrate Iceberg, it doesn't have any long running processes and we could integrate into our data management layer and our SDK in a fairly easy way."

Adobe first tested Iceberg in 2019 and now runs 80% of its cloud data lake workloads with the technology. The plan is to have 100% of the platform using Iceberg by the end of the first quarter of 2021, according to Kowshik.
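The ACID-style commits Kowshik describes can be illustrated with a toy model. The sketch below is not Iceberg's API; `ToyTable`, `Snapshot` and `append` are hypothetical names, and an in-process lock stands in for whatever atomic swap a real catalog provides. It assumes only the general idea of a table format: each table version is an immutable snapshot listing data files, and a writer publishes a new snapshot only if the table has not changed since the writer read it.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Immutable list of data files that make up one version of the table."""
    snapshot_id: int
    data_files: tuple


class ToyTable:
    """Hypothetical in-memory stand-in for a table's current-snapshot pointer."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap
        self._current = Snapshot(0, ())

    def current(self):
        return self._current

    def commit(self, base, new_files):
        """Publish base + new_files only if nobody committed since base was read."""
        with self._lock:
            if self._current.snapshot_id != base.snapshot_id:
                return False  # conflict: caller must re-read and retry
            self._current = Snapshot(
                base.snapshot_id + 1,
                base.data_files + tuple(new_files),
            )
            return True


def append(table, new_files, max_retries=3):
    """Optimistic append: on conflict, retry against the latest snapshot."""
    for _ in range(max_retries):
        if table.commit(table.current(), new_files):
            return True
    return False


table = ToyTable()
stale = table.current()                                # writer A reads snapshot 0
append(table, ["part-001.parquet"])                    # writer B commits first
conflict = table.commit(stale, ["part-002.parquet"])   # writer A's stale commit is rejected
retried = append(table, ["part-002.parquet"])          # writer A retries on fresh state
```

Readers in this model always see either the old snapshot or the new one, never a half-written state, and no long-running coordinator process is needed, only an atomic pointer swap.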

In January 2019, when Adobe first began working with Iceberg, the Delta Lake project wasn't available; it launched in April 2019.

"We went to Iceberg, because that was the only viable option at the time," Kowshik said.

What's new in Apache Iceberg

Ryan Blue, senior software engineer at Netflix, explained during a keynote session on Wednesday that Iceberg exists because Netflix realized it needed a new data table format.

"It turns out with the benefit of hindsight, that table formats are more important than file formats for overall performance, usability and all sorts of goals for what you want from your data platform," Blue said.


Iceberg adoption and code contributions to the open source project have grown. In particular, Blue highlighted Iceberg's support for data processing engines, including Spark and Trino (formerly known as PrestoSQL), as capabilities that were developed outside of Netflix by the broader open source community.

"We have really formed a great community around this project," Blue said.

The most recent release of Apache Iceberg is version 0.11.0, which became generally available on Jan. 26. Among the key features are new data metastore options that go beyond Apache Hive, the only metastore Iceberg initially supported. Blue noted that Amazon developers have contributed an AWS Glue module for tracking Iceberg tables. Iceberg now also supports the nascent open source Project Nessie effort. Nessie provides a new type of data metastore model that is inspired by the Git version control system. Iceberg has also improved its support for Apache Flink for streaming data processing.
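To make the Git-inspired metastore idea concrete, here is a minimal sketch in Python. It is not Nessie's API: `ToyCatalog`, its methods and the metadata file names are all hypothetical, and the model assumes only that a catalog commit records which metadata file each table points to, and that branches are named pointers to commits.

```python
class ToyCatalog:
    """Hypothetical Git-style catalog: branches are named pointers to commits."""

    def __init__(self):
        # Each commit is (parent_index, {table name: metadata location}).
        self._commits = [(None, {})]
        self._branches = {"main": 0}

    def create_branch(self, name, source="main"):
        """Start a new branch at the current head of the source branch."""
        self._branches[name] = self._branches[source]

    def commit(self, branch, table, metadata_location):
        """Record a new table state on top of the branch head."""
        head = self._branches[branch]
        state = dict(self._commits[head][1])
        state[table] = metadata_location
        self._commits.append((head, state))
        self._branches[branch] = len(self._commits) - 1

    def tables(self, branch):
        """Return the table -> metadata mapping visible on a branch."""
        return dict(self._commits[self._branches[branch]][1])

    def _is_ancestor(self, ancestor, commit):
        while commit is not None:
            if commit == ancestor:
                return True
            commit = self._commits[commit][0]
        return False

    def merge(self, source, target):
        """Fast-forward only: fail if target has commits the source lacks."""
        if not self._is_ancestor(self._branches[target], self._branches[source]):
            raise ValueError("target has diverged; fast-forward not possible")
        self._branches[target] = self._branches[source]


catalog = ToyCatalog()
catalog.commit("main", "events", "meta/events-v1.json")
catalog.create_branch("etl")
catalog.commit("etl", "events", "meta/events-v2.json")
catalog.commit("etl", "users", "meta/users-v1.json")
main_before = catalog.tables("main")   # main still sees only the v1 events table
catalog.merge("etl", "main")
main_after = catalog.tables("main")    # both table changes become visible together
```

In this model, changes to several tables made on a branch stay invisible to readers of `main` until the merge moves the pointer, which is the kind of isolation a version-control-style metastore aims to offer.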

A new data tier for cloud data lakes

While Iceberg on its own is interesting, Dremio co-founder and chief product officer Tomer Shiran sees it as a foundational element of a new data tier that is emerging.

In his keynote address, Shiran outlined three evolving open source projects that are helping to define a new type of data tier for cloud data lake use. The three projects are Iceberg, which provides the open table format for the data lake; Nessie, which provides a new type of data metastore; and Apache Arrow Flight. Apache Arrow is an open source project that is tightly integrated with Dremio's platform, providing fast data access capabilities. Apache Arrow Flight is a new framework that further accelerates data access for large data sets. Shiran said that, in his view, Apache Arrow Flight is a modern replacement for Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC).

"With Iceberg, Nessie and Arrow Flight, we're moving into an era this year where the data lake will be able to do everything that you can do with a data warehouse and actually quite a bit more," Shiran said.
