Lakehouse architecture the best fit for modern data needs
While data warehouses and data lakes each excel at handling certain types of data, a hybrid of the two is the best means of handling the increasing complexity of data management.
With data becoming more diverse and complex, the data lakehouse architecture may be the best means for managing that data and helping fuel data-driven decisions.
Speaking on July 18 during a webinar hosted by Transforming Data with Intelligence (TDWI), Fern Halper, the research and advisory firm's senior research director for advanced analytics, noted that data warehouses and data lakes each have their strengths.
But data warehouses and data lakes each fail to meet certain needs as organizations expand their use of data and attempt to combine different types of data to develop a more complete view of their operations.
Defining differences
Data warehouses were developed to store structured data with predefined schema such as financial records and demographic information. Data lakes were developed to store unstructured data with no predefined schema such as text, images and audio and video files.
The result of using data warehouses and lakes is isolated data, with structured data stored in one place and unstructured data stored in another. Without copious manual work by data engineers to give unstructured data a measure of structure by assigning each file a vector -- an array of numerical values -- organizations are unable to combine structured and unstructured data into a richer data set to inform decisions.
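For illustration, here is a minimal sketch of that vectorization step. The sentence-transformers library and model name are illustrative choices, not tools prescribed by any of the vendors discussed in this article.

```python
# A minimal sketch of the vectorization step described above. The
# sentence-transformers library and model name are illustrative choices,
# not tools prescribed by any of the vendors discussed in this article.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

documents = [
    "Customer email: the delivery arrived two weeks late.",
    "Support transcript: caller asked about the refund policy.",
]

# Each file becomes a fixed-length vector of floats -- an array of numbers,
# not a single value -- that can be stored next to structured records and
# searched by similarity.
vectors = model.encode(documents)
print(vectors.shape)  # (2, 384) for this model
```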
Data lakehouses, however, are more flexible than both data warehouses and data lakes, combining the capabilities of each so that structured and unstructured data can be stored in the same place and combined to form complete data sets.
As generative AI -- which depends on machine learning models trained on massive sets of diverse data -- becomes more widespread, data lakehouses can handle organizations' demands.
"Organizations are collecting more diverse data in order to gain better insights, produce new applications and, overall, better compete," Halper said. "This is why cloud platforms have become more mainstream. They're more capable of handling large amounts of data for advanced analytics."
But more than simply moving to the cloud, many organizations are moving beyond just cloud data warehouses and cloud data lakes and deploying data lakehouses, she continued.
According to a recent TDWI survey of about 400 respondents, 21% of organizations now use data lakehouses, up from 15% in a previous survey.
"These data lakehouses are blurring the distinction between data warehouses and data lakes. So no more silos and going back and forth between the two to get all the data for your use cases," Halper said.
Databricks was one of the pioneers of the data lakehouse architecture. But it is not alone in offering lakehouse capabilities. Fellow independent vendors such as Snowflake and Dremio also offer lakehouses, as do tech giants including Google with BigLake and Oracle with its Cloud Infrastructure Data Lakehouse.
Dealing with modern demands
Worldwide data volume stood at two zettabytes in 2010, according to Statista. That grew to 64.2 zettabytes in 2020 and is expected to total 120 zettabytes this year.
That's a 5,900% increase over 13 years, and an increase of nearly 90% in just the last three.
By 2025, worldwide data volume is expected to rise by another 50% to 181 zettabytes.
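Those percentages check out against the Statista figures quoted above; a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the growth figures quoted above
# (worldwide data volume in zettabytes, per Statista).
volumes = {2010: 2, 2020: 64.2, 2023: 120, 2025: 181}

def pct_increase(old: float, new: float) -> float:
    return (new - old) / old * 100

print(f"2010 to 2023: {pct_increase(volumes[2010], volumes[2023]):,.0f}%")  # 5,900%
print(f"2020 to 2023: {pct_increase(volumes[2020], volumes[2023]):,.0f}%")  # ~87%
print(f"2023 to 2025: {pct_increase(volumes[2023], volumes[2025]):,.0f}%")  # ~51%
```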
That exponential increase in data volume is making on-premises databases obsolete. They simply can't match the compute power of cloud data platforms, which were built to handle massive amounts of data.
But at the same time data volume is growing, so is its diversity and complexity.
Data isn't just historical sales information anymore. It's now real-time information from motion-capture devices, text from emails and other messages, photographs, videos, and audio files.
That diversity and complexity are proving more than data warehouses and data lakes can manage. Each is expert at managing a certain type of data, but neither is expert at managing all kinds of data.
Data lakehouses may also someday prove unable to handle the data needs of organizations as demands increase in yet unseen ways. But now, data lakehouses are proving to be the best means of managing both the volume of data and its complexity, according to Halper.
"The data lakehouses can support efficient querying and analysis," she said.
In addition, Halper noted that lakehouses separate compute power from storage in ways data warehouses and data lakes do not. Scaling one therefore doesn't force the other to scale with it, which results in better cost control.
"The data lakehouse separates the two and allows each to scale independently," she said.
Ultimately, Halper said that there are five pillars of data lakehouses that make them the best option for turning this era's data into functional sets that can be used to train generative AI and machine learning models and fuel analytics-based decisions.
They include the following:
- A unified architecture for diverse data and analytics.
- Support for a unified governance layer.
- Support for all analytics, from reports and dashboards to machine learning and generative AI.
- Support for open source standards.
- Optimization for modern data implementations and use cases.
Architecture and governance
The unified architecture is a reference to the joining together of the disparate capabilities of data warehouses and data lakes to combine structured and unstructured data.
A typical data lakehouse has a storage layer at its base where data is ingested. Above that is a metadata layer where data management features such as file format descriptions are applied. Next can be a semantic layer that includes tools like data catalogs. Then comes an API layer before there's finally a consumption layer where analytics and machine learning tools are located.
"Lakehouse architectures may vary, but the modern lakehouse provides a structured format … that can be queried for analytics," Halper said. "It's doing a lot of things that neither the data warehouse by itself or the data lake by itself can do."
Unified governance is about support for data quality tools that standardize an organization's data, including data catalogs and other ways of tracking data lineage and organizing data to make sure it can be found easily and put to use when needed. It's also about ensuring regulatory compliance, data privacy and access control.
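The access control piece of that governance layer can be pictured as a single policy lookup that applies to every table, no matter where the underlying data lives. The sketch below is purely hypothetical and not any vendor's API.

```python
# Purely hypothetical sketch of the access-control piece of a unified
# governance layer: one policy lookup applies to every table, regardless
# of whether the data beneath it is structured or unstructured.
POLICIES = {
    "sales": {"analyst", "data_scientist"},
    "hr_records": {"hr_admin"},
}

def can_read(role: str, table: str) -> bool:
    """Return True if the role is granted read access to the table."""
    return role in POLICIES.get(table, set())

assert can_read("analyst", "sales")
assert not can_read("analyst", "hr_records")
```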
Support for all analytics, meanwhile, means enabling users to do many things with their data, from simple tasks like developing reports to building complicated data science models trained with machine learning and generative AI capabilities.
It also means letting organizations combine their own data with third-party data to enrich their data sets, enabling all user types from self-service analysts to data scientists and making it simple to move data products from development into production.
Open source modern applications
Support for open source standards is important because it enables organizations to avoid vendor lock-in, in which they become too tied to the tools of a single vendor.
"Open source standards are developed and maintained by a community of contributors that promote innovation and interoperability," Halper said. "That means organizations can move between implementations if they don't want their data in a proprietary format."
In addition, organizations prefer open source tools such as Delta Lake and Apache Iceberg because they can be free, Halper continued.
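A brief sketch of what that portability looks like in practice, using the open source deltalake Python package; the table contents and path are made up. Because Delta Lake is an open standard, any engine that implements it can read what another tool wrote.

```python
# Sketch of format portability with the open source deltalake package;
# the table contents and path are made up. Because Delta Lake is an open
# standard, any engine that implements it can read what another tool wrote.
import pyarrow as pa
from deltalake import write_deltalake, DeltaTable

orders = pa.table({"order_id": [1, 2, 3], "amount": [9.99, 24.50, 3.25]})
write_deltalake("/tmp/orders_delta", orders)  # written without any Spark cluster

# Spark, Trino, DuckDB or delta-rs itself can now open the same files.
print(DeltaTable("/tmp/orders_delta").to_pandas())
```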
"In some ways, [open source] is future proof as long as the community continues to be engaged," she said.
Finally, Halper noted that unlike data warehouses and lakes, data lakehouses are optimized for modern data implementations and use cases. They're multi-cloud, they support batch files and streaming data, they're scalable, and they support data sharing and collaboration.
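For example, a lakehouse table can serve as the sink for a stream while remaining readable by batch jobs. The sketch below uses PySpark's structured streaming with Delta Lake as the sink; it assumes the Delta Lake Spark connector is installed and configured, and all paths are placeholders.

```python
# Sketch of streaming and batch sharing one table, using PySpark structured
# streaming with Delta Lake as the sink. Assumes the Delta Lake Spark
# connector is installed and configured; all paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-streaming-sketch").getOrCreate()

events = spark.readStream.format("rate").load()  # built-in test source

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start("/tmp/lakehouse/events")
)

# While the stream runs, a batch job can read the very same table:
# spark.read.format("delta").load("/tmp/lakehouse/events")
```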
"[Sharing and collaboration] have become very popular in organizations," Halper said.
For example, sharing data between different partners is critical when putting together a supply chain.
Better together
Together, the five pillars of data lakehouses effectively separate lakehouses from data warehouses and data lakes in meeting the growing demands organizations are now placing on their data, according to Halper. Data warehouses and data lakes can each meet some demands. But they don't work well with each other and, therefore, limit what organizations can do in ways that data lakehouses do not.
Data warehouses were first developed in the 1980s by a pair of IBM researchers. Data lakes were introduced by Pentaho (now part of Hitachi) around 2011 to address the shortcomings of data warehouses. But they had their own limitations.
Around the same time, Databricks developed the data lakehouse format. Other vendors have since followed suit.
Data lakehouses are not perfect. As noted in a blog post by data and security vendor Sakura Sky, they can be complex, so they require a level of expertise to manage that some organizations may not have. They're also relatively new, so they are still maturing. And because they're cloud-based, costs can be difficult to control.
But they build on what data warehouses and lakes can do and are therefore best-suited to meet modern data management and analytics demands. "The data lakehouse is optimized for modern data implementations and use cases [such as] AI, ML, large amounts of data and self-service across the organization," Halper said.
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.