Hadoop data lake architecture tests IT on data integration
Hortonworks users talk about building Hadoop data lakes to support new applications -- and the challenges their teams face in ingesting and refining data for end users.
SAN JOSE, Calif. -- These days, a rolling pageant of glittering new data objects includes data planes, data fabrics, data streaming and more. It's almost enough to make you forget about the shiny object of just a few years ago: the Hadoop data lake architecture.
But plenty of data teams are still working hard to successfully implement real-life Hadoop data lakes -- ones that underlie many organizations' hopes for better predictive analytics and maybe even artificial intelligence.
Such was one of the lessons from Hortonworks' DataWorks Summit 2018. That is where people like Sudhir Menon told the story behind big companies' moves to use their data hoards for digital transformation -- as seen at tech-infused upstarts, like Airbnb and Uber.
The journey that Menon, vice president of enterprise information management at hotelier Hilton Worldwide, described includes a Hadoop data lake architecture as an integral part.
"We have a lot of information in different formats, and we are bringing that into the data lake. Every [data] entity from every channel is coming into the lake now," Menon told a conference session audience.
The Hadoop data lake architecture has formed a basis for a potential consumer application -- a new digital key app that allows Hilton Honors program guests to, in effect, check themselves directly into their room, he said.
Populating the data lake takes time
Still, this will be a multiyear project, Menon noted. The project, which seeks to incrementally build out the Hadoop-based Hortonworks Data Platform (HDP) into a new repository for enterprise data, began about two years ago and is now moving toward going live, with many more "agile sprints" ahead, according to Menon.
The system employs a variety of tools beyond HDP, including WSO2 API management, Talend integration and Amazon Redshift cloud data warehouse software.
For Menon's team, populating a data lake means transforming assorted ingested data into JSON events with a microservices architecture.
The transformations are a step in a data-refining process. The experience of many data lake users has shown that data has to be sorted into some sensible format as soon as possible if it is to be used with BI tools by business analysts, although raw data versions are still preserved for experimental data scientists.
Menon emphasized that, while this "renovation and innovation" project supports data science, it also provides a new foundation for leaner everyday data reporting. He said, during the course of the project, Hilton has decommissioned 380 dashboards for management reporting and replaced them with a more compact roster of 40 dashboards.
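As a rough illustration of that refining step, the sketch below wraps one ingested record in a JSON event that carries both a normalized version for BI tools and the untouched raw version for data scientists. It is not Hilton's implementation: the function name `to_json_event`, the envelope fields and the sample record are all hypothetical, chosen only to show the idea.

```python
import json
from datetime import datetime, timezone


def to_json_event(raw_record: dict, channel: str, entity: str) -> str:
    """Wrap one ingested record in a uniform JSON event envelope.

    Hypothetical helper: the field names are illustrative assumptions,
    not taken from any system described in the article.
    """
    # Normalize keys so downstream BI tools see consistent field names.
    refined = {
        key.strip().lower().replace(" ", "_"): value
        for key, value in raw_record.items()
    }
    event = {
        "entity": entity,
        "channel": channel,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "refined": refined,
        "raw": raw_record,  # untouched copy kept for exploratory work
    }
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    sample = {"Guest ID": "12345", " Room Type ": "King"}
    print(to_json_event(sample, channel="mobile_app", entity="reservation"))
```

Keeping the raw record alongside the refined one in the same event is only one possible design; a team could just as well store the two versions in separate zones of the lake.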
What price, data democracy?
For companies like Hilton, with years of legacy data, Hadoop data lake architectures can require significant effort.
Another data lake project is underway at United Airlines. At DataWorks, Joe Olson, senior manager for big data analytics at the Chicago-based airline, recounted a move to a new big data analytics environment that includes a data lake, along with a "curated layer of data."
Olson described the significant work required to connect existing Teradata data warehouse analytics with pieces of Hortonworks' platform.
"Moving small data sets is trivial ... but we don't yet have a good way to handle large data sets of certain types," Olson said in a session on big data at United.
In yet another session, which was whimsically entitled "The Curious Case of Data Lake Redemption," Shivinder Singh, a database architect and engineer at Verizon Wireless, based in Basking Ridge, N.J., described issues the telecom giant encountered as it opened up a Hadoop data lake to a wider group of analytics users.
Singh described differences in block file sizes in Hadoop data lakes versus single-client implementations that can bring about garbage collection issues with observable performance impact. He said his teams worked with Hortonworks to address such issues in its data lake buildouts.
Despite such implementation issues, he emphasized, the Hadoop platform has helped drive diverse analytics advances, as Verizon Wireless evolved its architecture to deal with bigger and ever more variegated amounts of data.
Sampling the data fabric
While the Hadoop data lake architecture was meant, in part, to reduce data silos in organizations, the reality has been that several data lakes may arise, becoming silos in themselves. At its user event, Hortonworks expanded on its recent discussions of data fabric architectures, meant to mesh varied data lakes and other data framework components.
The company's Hortonworks DataPlane Service (DPS) is an example, being a software layer that handles governance and data management for multiple data lakes.
Like other products from vendors that once focused solely on Hadoop Distributed File System data storage, DPS supports a variety of storage formats, including cloud object storage.
"The data fabric is an idea that has been around and has evolved to encompass the reality that, going forward, these systems will be hybrids of on-premises and public cloud systems and, eventually, will be on multiple clouds," said Doug Henschen, an analyst at Constellation Research, in an interview.
That could mean that, while the data lake may continue to find use, it may not actually be a Hadoop data lake architecture, Henschen said.
"Now, companies want the data lake concept to encompass more than just Hadoop," he said.