Onehouse emerges with managed Apache Hudi data lake service
One of the original creators of the Hudi project at Uber has launched a new company that is bringing a managed service to market to help organizations operationalize cloud data lakes.
Data lakehouse startup Onehouse, a descendant of the Apache Hudi project at Uber, emerged from stealth on Feb. 2 with $8 million in seed funding.
The open source Apache Hudi cloud data lake project was originally developed in 2016 by a group of engineers including Vinoth Chandar, the CEO and founder of Onehouse.
Uber contributed Hudi to the Apache Software Foundation in 2019. Over the last several years, Hudi has found a home in a number of large organizations beyond Uber, including Walmart and Disney+ Hotstar.
With its new funding, Onehouse is looking to build out a managed service to help organizations deploy and use Apache Hudi-based data lakes.
The Apache Hudi project and Onehouse are in a competitive market for open source data lakehouse technologies, which includes Apache Iceberg and the Delta Lake project originally created by Databricks.
In this Q&A, Chandar discusses the challenges Apache Hudi was built to solve and how his startup is looking to help organizations.
Why did you start a data lake company based on Apache Hudi?
Vinoth Chandar: We built Hudi during the hypergrowth stage at Uber as a way for the company to scale its data lake and handle data transactions faster. We made Hudi feel more like a data warehouse than just a data lake. Over the last four years, the Hudi community has grown and has helped pioneer new transactional data lake capabilities.
What we routinely see in the community is that it still takes a lot of time for companies to operationalize their data lakes. We felt we could actually create value here by building a managed service that helps companies get started.
Onehouse is not about being an enterprise Hudi company; it's more about helping companies get started with data lakes and open data formats, without the kind of investment that Uber had to make to get Hudi off the ground.
What types of data lake services does Hudi provide to help build a data lakehouse?
Chandar: If you look at how people generally talk about the data lake space, they talk about table formats. A format is a passive thing. An open format does not by itself mean you have total freedom, because the services on top, which generate the value, also have to be open.
I think, at the very minimum, organizations need a very standardized data ingestion service. The service needs to be able to take data from sources like cloud storage or event streaming systems like Kafka or Pulsar, and build tables. Another thing that people routinely need is some way to automatically reclaim storage space.
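As a rough illustration of those two services, here is a minimal PySpark sketch that upserts a batch of records into a Hudi table and enables Hudi's cleaner to reclaim space held by old file versions. The bucket paths, table name and field names are hypothetical, and the job assumes Spark was launched with a matching hudi-spark bundle on the classpath.

```python
# Minimal sketch: upsert a batch into a Hudi table and let the cleaner
# reclaim storage from superseded file versions. Paths and field names
# are hypothetical stand-ins, not a real pipeline.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-ingest-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Stand-in for an ingestion source such as cloud storage or a Kafka topic.
events = spark.read.json("s3://example-bucket/raw/events/")  # hypothetical path

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",    # unique key per record
    "hoodie.datasource.write.precombine.field": "event_ts",   # latest version wins on upsert
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
    # Cleaner: automatically delete file versions older than the
    # last N commits to reclaim storage space.
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.commits.retained": "10",
}

(events.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lake/events/"))  # hypothetical base path
```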
One of the core advantages of Hudi is the ability to index data quickly, which is also what makes the data usable. Last but not least, there is a need for data optimization techniques that organize storage and data so that queries can be faster.
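Those capabilities surface as write-side configuration in Hudi. Below is a hedged sketch of options exercising the indexing and optimization features Chandar describes; the values are illustrative rather than tuned recommendations, and the field names carry over from the hypothetical example above.

```python
# Illustrative Hudi write options for indexing and data optimization;
# these would be merged into the ingestion options in the sketch above.
optimization_options = {
    # Bloom-filter index over record keys speeds up the lookups
    # Hudi performs on every upsert.
    "hoodie.index.type": "BLOOM",
    # Inline clustering periodically rewrites small files into larger,
    # sorted ones so that queries scan less data.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.sort.columns": "event_date,event_id",
}
```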
What do you see as a primary challenge for organizations with data lakes?
Chandar: There is a lot of frustration around data lakes just being data swamps.
In fact, we didn't start out with Hudi at Uber because we thought it would be cool to enable data transactions on top of a data lake. We saw that it was easy to get all kinds of data sets into a data warehouse, but it wasn't as easy to scale or query the data. So we decided to bring transactions to the data lake and then enable open query engines.
With Hudi, data scientists can now use Spark, and operations people can use Presto and Trino. At the end of the day, we built a data layer that is extremely scalable and open.
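To make that engine openness concrete: continuing the hypothetical example above, the same table can be read back through Spark, while Presto and Trino can query the identical files through their own connectors.

```python
# Read the (hypothetical) Hudi table back through Spark; Presto and Trino
# can query the same underlying files through their Hudi/Hive connectors.
snapshot = spark.read.format("hudi").load("s3://example-bucket/lake/events/")
snapshot.createOrReplaceTempView("events")
spark.sql("SELECT event_date, COUNT(*) AS n FROM events GROUP BY event_date").show()
```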
For organizations today, the challenge is also that they need to hire data engineers just to get started with a data lake. Data volumes have generally grown a lot in recent years. I feel that the large volumes of data we saw at Uber in 2016 are now routine at companies where, five years ago, you wouldn't have expected them.
In the coming years, people are going to want to start with a more effective data lake technology. Since the data volumes are growing so fast, they can't build it on their own.
The principle that I've held for a while is that data should be independent. People should be able to [democratize their] data very easily.
Editor's note: This interview has been edited for clarity and conciseness.