Apache Hadoop 3.0 goes GA, adds hooks for cloud and GPUs
Is this the post-Hadoop era? Not in the eyes of Hadoop 3.0 backers, who see the latest update to the big data framework succeeding in machine learning applications and cloud systems.
It may still have a place among screaming new technologies, but Apache Hadoop is neither quite as new nor quite as screaming as it once was. And the somewhat subdued debut of Apache Hadoop 3.0 reflects that.
Case in point: In 2017, the name Hadoop was removed from the title of more than one event previously known as a "Hadoop conference." Also, IBM dropped off the list of Hadoop distro providers, and it was a year in which machine learning applications -- and tools like Spark and TensorFlow -- became the focus of many big data efforts.
So, the low level of fanfare that accompanied the mid-December release of Hadoop 3.0 wasn't too surprising. The release does hold notable improvements, however. This update to the 11-year-old distributed data framework reduces storage requirements, allows cluster pooling on the latest graphics processing unit (GPU) resources, and adds a new federation scheme that enables the crucial Hadoop YARN resource manager and job scheduler to greatly expand the number of Hadoop nodes that can run in a cluster.
This last capability could find use in Hadoop cloud applications -- which is where many Hadoop deployments appear to be heading.
Scaling nodes to tens of thousands
"Federation for YARN means spanning out to much larger clusters," according to Carlo Aldo Curino, a principal scientist at Microsoft who is also an Apache Hadoop committer and a member of the Hadoop Project Management Committee (PMC). With federation, in effect, a routing layer now sits in front of Hadoop Distributed File System (HDFS) clusters, he said.
Curino emphasized that he was speaking in his role as a PMC member, and not for Microsoft. He did note, though, that the greater scalability is useful in clouds such as Microsoft's Azure platform. Most of "the biggest among Hadoop clusters to date have been in the low thousands of nodes, but people want to go to tens of thousands of nodes," he said.
If Hadoop applications are going to begin to include millions of machines running YARN, federation will be needed to get there, Curino said. Looking ahead, he expects YARN to be a focus of future updates to Hadoop.
In fact, YARN was the biggest cog in the machine that was Hadoop 2.0, released in 2013 -- most particularly because it untied Hadoop from reliance on its original MapReduce processing engine. So, its central role in Hadoop 3.0 shouldn't be a surprise.
In Curino's estimation, YARN carries forward important new trends in distributed architecture. "YARN was an early incarnation of the serverless movement," he said, referring to the computing scheme that has risen to some prominence on the back of Docker containers.
Curino noted that some of the important updates in Hadoop 3.0, which is now generally available and deemed production-ready, had been brewing in previous point updates.
Opening up the Hadoop 3.0 pack
Among other new aspects of Hadoop 3.0, GPU enablement is important, according to Vinod Vavilapalli, who leads Hadoop YARN and MapReduce development at big data technology vendor Hortonworks Inc., based in Santa Clara, Calif.
That's because GPUs, as well as field-programmable gate arrays -- which are also supported in Hadoop 3.0 API updates -- are becoming go-to hardware for some machine learning and deep learning workloads.
Without updated APIs such as those found with Hadoop 3.0, Vavilapalli noted, these workloads require special setups to access modern data lakes.
"With Hadoop 3.0, we are moving into larger scale, better storage efficiency and deep learning and AI workloads, and improving interoperability with the cloud," Vavilapalli said. In the latter regard, he added, Hadoop 3.0 brings better support via erasure coding, an alternative to typical Hadoop replication that saves on storage space.
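To give a rough sense of the storage savings at stake, the short Python sketch below compares the raw disk space needed under replication and under erasure coding. It is an illustration rather than anything taken from Hadoop itself: the function names are hypothetical, and it assumes three-way replication and a Reed-Solomon layout of six data units plus three parity units, ignoring stripe padding and metadata.

```python
def raw_bytes_replicated(logical_bytes, replicas=3):
    """Raw storage used when every block is stored `replicas` times.

    Assumption: three copies, the commonly used HDFS replication factor.
    """
    return logical_bytes * replicas


def raw_bytes_erasure_coded(logical_bytes, data_units=6, parity_units=3):
    """Raw storage used when data is striped across data and parity units.

    Assumption: a Reed-Solomon (6 data, 3 parity) layout; padding of
    partially filled stripes and metadata overhead are ignored.
    """
    return logical_bytes * (data_units + parity_units) / data_units


if __name__ == "__main__":
    logical = 600  # e.g. 600 TB of user data (illustrative figure)
    replicated = raw_bytes_replicated(logical)
    coded = raw_bytes_erasure_coded(logical)
    print(f"replication:    {replicated} units of raw storage")
    print(f"erasure coding: {coded:.0f} units of raw storage")
    print(f"saving:         {1 - coded / replicated:.0%}")
```

Under these assumptions, replication carries a 200% storage overhead, while the erasure-coded layout carries 50%, so the same data occupies half the raw space. The trade-off is extra computation and network traffic when lost data has to be reconstructed.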
Will Hadoop take a back seat?
Both Curino and Vavilapalli concurred that the original model of Hadoop, in which HDFS is tightly matched with MapReduce, may be fading, but they said that isn't necessarily a reason to declare this the "post-Hadoop" era, as some pundits suggest.
"One of the things I noticed about sensational pieces that say 'Hadoop is dead' is that it's a bit of a mischaracterization," Curino said. "What it is that people see is MapReduce losing popularity -- it's not the paradigm use anymore. This was clear to the [Hadoop] community long ago. It's why we started work on YARN."
For his part, Vavilapalli said he sees Hadoop becoming more powerful and enabling newer use cases.
"This constant reinvention tells me that Hadoop is always going to be relevant," he said. Even if it becomes something running in the background of big data systems, "it will be part of the critical infrastructure that powers an increasingly data-driven world."