Explore Hadoop distributions to manage big data
Discover the uses of Hadoop distributions and the first steps in evaluating these products, as well as how the merger of rivals Cloudera and Hortonworks affects the market.
Hadoop is an open source technology that today is the data management platform most commonly associated with big data applications. Its creators designed the original distributed processing framework in 2006 and based it partly on ideas that Google outlined in a pair of technical papers.
Yahoo became the first production user of Hadoop that year. Soon, other internet companies, such as Facebook, LinkedIn and Twitter, adopted the technology and began contributing to its development. Hadoop eventually evolved into a complex ecosystem of infrastructure components and related tools that several vendors package together in commercial Hadoop distributions.
Running on clusters of commodity servers, Hadoop offers users a high-performance, low-cost approach to establishing a big data management architecture to support advanced analytics initiatives.
As awareness of Hadoop's capabilities has increased, its use has spread to other industries for both reporting and analytical applications involving a mix of traditional structured data and newer forms of unstructured and semi-structured data. This includes web clickstream data, online ad information, social media data, healthcare claims records, and sensor data from manufacturing equipment and other internet of things devices.
What is Hadoop?
The Hadoop framework encompasses a large number of open source software components: a set of core modules designed to capture, process, manage and analyze massive volumes of data, surrounded by a variety of supporting technologies. The core components include:
- The Hadoop Distributed File System (HDFS): Supports a conventional hierarchical directory and file system that distributes files across the storage nodes -- i.e., DataNodes -- in a Hadoop cluster.
- YARN (short for Yet Another Resource Negotiator): Manages job scheduling and allocates cluster resources to running applications, arbitrating among them when there's contention for the available resources. It also tracks and monitors the progress of processing jobs.
- MapReduce: A programming model and execution framework for parallel processing of batch applications (a basic example follows this list).
- Hadoop Common: A set of libraries and utilities that the other components utilize.
- Hadoop Ozone and Hadoop Submarine: Newer technologies that offer users an object store and a machine learning engine, respectively.
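To make the MapReduce programming model concrete, here is a minimal sketch along the lines of the classic word count example, written against Hadoop's Java MapReduce API. It assumes the Hadoop client libraries are on the classpath and that the two command-line arguments are HDFS directory paths; details such as the job name are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in each line of the input files
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mapper emits a count of 1 for every word it encounters, and the reducer sums those counts per word; YARN schedules the map and reduce tasks across the cluster, while HDFS supplies the input splits and stores the results.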
In Hadoop clusters, those core pieces and other software modules layer on top of a collection of computing and data storage hardware nodes. The nodes connect via a high-speed internal network to form a high-performance parallel and distributed processing system.
Because Hadoop is a collection of open source technologies, no single vendor controls it; rather, the Apache Software Foundation manages its development. Apache offers Hadoop under a license that grants users a no-charge, royalty-free right to use the software.
Developers and other users can download the software directly from the Apache website and build Hadoop environments on their own. However, Hadoop vendors provide prebuilt community versions with basic functionality that users can also download at no charge and install on a variety of hardware platforms. The vendors also market commercial -- or enterprise -- Hadoop distributions that bundle the software with different levels of maintenance and support services.
In some cases, vendors also offer performance and functionality enhancements over the base Apache technology -- for example, by providing additional software tools to ease cluster configuration and management or data integration with external platforms. These commercial offerings make Hadoop increasingly attainable for companies of all sizes.
This is especially valuable when a vendor's support services team can jump-start a company's design and development of its Hadoop infrastructure and help guide the selection of tools and the integration of advanced capabilities needed to deploy high-performance analytical systems that meet emerging business needs.
The components of a typical Hadoop software stack
What do you actually get when you use a commercial version of Hadoop? In addition to the core components, typical Hadoop distributions will include -- but aren't limited to -- the following:
- Alternative data processing and application execution managers, such as Spark, Kafka, Flink, Storm or Tez, that can run on top of or alongside YARN to provide cluster management, cached data management and other means of improving processing performance.
- Apache HBase: A column-oriented database management system modeled after Google's Bigtable project that runs on top of HDFS.
- SQL-on-Hadoop tools, such as Hive, Impala, Presto, Drill and Spark SQL, that provide varying degrees of compliance with the SQL standard for direct querying of data stored in HDFS (see the query sketch after this list).
- Development tools, such as Pig, that help developers build MapReduce programs.
- Configuration and management tools, such as ZooKeeper or Ambari, that are useful for monitoring and administration.
- Analytics environments such as Mahout, which supplies analytical models for machine learning, data mining and predictive analytics.
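As an illustration of how the SQL-on-Hadoop layer is typically used, the following sketch submits a HiveQL query from a Java program through Hive's standard JDBC driver. The HiveServer2 address, credentials and the page_views table are assumptions made for the example, not part of any particular distribution.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Load the Hive JDBC driver (shipped with the Hive client libraries)
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Connect to a HiveServer2 instance; host, port, database and user are assumptions
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive_user", "");
         Statement stmt = conn.createStatement()) {

      // Query a hypothetical clickstream table whose underlying files live in HDFS
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits " +
          "FROM page_views " +
          "GROUP BY page " +
          "ORDER BY hits DESC " +
          "LIMIT 10");

      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```

Depending on the distribution, a similar query could be directed at Impala, Presto or Drill through their own JDBC drivers; the engines differ mainly in latency, SQL coverage and how they execute the work on the cluster.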
Because the software is open source, companies don't have to purchase a Hadoop distribution as a product, per se. Instead, the vendors sell annual support subscriptions with varying service-level agreements (SLAs). All of the vendors are active participants in the Apache Hadoop community, although each may promote its own add-on components that it contributes to the community as part of its Hadoop distribution.
Who manages the Hadoop big data management environment?
It's important to recognize that getting the desired performance out of a Hadoop system requires a coordinated team of skilled IT professionals who collaborate on architecture planning, design, development, testing, deployment, and ongoing operations and maintenance to ensure peak performance. Those IT teams typically include:
- requirements analysts to assess the system performance requirements based on the types of applications that will run in the Hadoop environment;
- system architects to evaluate performance requirements and design hardware configurations;
- system engineers to install, configure and tune the Hadoop software stack;
- application developers to design and implement applications;
- data management professionals to prepare and run data integration jobs, create data layouts and perform other management tasks;
- system managers to ensure operational management and maintenance;
- project managers to oversee the implementation of the various levels of the stack and application development work; and
- a program manager to oversee the implementation of the Hadoop environment and the prioritization, development and deployment of applications.
The Hadoop software platform market
The evolution of Hadoop as a viable, large-scale data management ecosystem has also created a new software market that's transforming the business intelligence and analytics industry. This has expanded both the kinds of analytics applications that user organizations can run and the types of data that the companies can collect and analyze as part of those applications.
The market now includes two major independent vendors that specialize in Hadoop: Cloudera Inc. -- formed when rivals Cloudera and Hortonworks merged in October 2018 -- and MapR Technologies Inc. Other companies that offer Hadoop distributions or capabilities include cloud platform market leaders AWS, Google and Microsoft, which uses the Hortonworks platform as part of its managed big data service.
Over the years, the Hadoop market has matured -- and consolidated -- significantly. IBM, Intel and Pivotal Software all dropped out of the market, but the combination of Cloudera and Hortonworks is the biggest change for users to date. The merger of the former rivals gives the new Cloudera a larger share of the market and could enable it to compete more effectively in the cloud.
In fact, Cloudera's new messaging is that it will deliver "the industry's first enterprise data cloud" -- an indication of its desire to compete with the AWS, Microsoft Azure and Google clouds.
Cloudera plans to develop a unified offering called the Cloudera Data Platform, although it hasn't said when it will become available. In the meantime, the company will continue to develop the existing Cloudera and Hortonworks platforms and support them until at least January 2022.
Although the new Cloudera may be more competitive, a potential downside to the merger is that Hadoop users now have fewer options. That makes it even more critical to evaluate the vendors that provide Hadoop distributions and to compare their product offerings on two primary aspects.
First is the technology itself: what's included in the different distributions, which platforms they're supported on and, most importantly, which specific components the individual vendors support.
Second is the service and support model: what types of support and SLAs the vendors provide at each subscription level, and how much the different subscriptions cost.
Understanding how these aspects relate to your specific business requirements will highlight the characteristics that are important for a vendor relationship.
Linda Rosencrance contributed to this report.