4 factors to consider in a Hadoop distributions comparison
Examine the key characteristics to evaluate in a Hadoop distribution comparison, focusing on enterprise features, subscription options and deployment models.
Although most of the software components that constitute the Hadoop ecosystem stack are open source technologies, there are numerous benefits to paying a vendor for a subscription to use its commercial Hadoop platform.
For example, a subscription provides technical support and training, as well as access to enterprise features not available to the open source community. While the enterprise editions of vendor Hadoop distributions all provide the core components of the Hadoop ecosystem stack, the key differentiators are what these vendors offer beyond the openly accessible functionality.
Recent changes in the market have thinned the ranks of Hadoop distribution vendors. For example, in 2016, Pivotal Software pulled the plug on its own Hadoop distribution and said it would start reselling the Hortonworks Data Platform (HDP) instead. IBM did the same thing a year later. Then, in 2018, Cloudera Inc. purchased Hortonworks, its former archrival.
But there's still a group of suppliers to consider, including Hadoop specialists Cloudera and MapR Technologies Inc. and the three leading cloud platform providers: AWS, Microsoft and Google. To further complicate things, Cloudera plans to roll out a unified offering in 2019 that combines features from its own Cloudera Distribution Including Apache Hadoop (CDH) and HDP, but said it will also continue to develop and support the two existing platforms at least until the end of 2021.
To determine the right Hadoop provider, a company must be able to perform Hadoop distribution comparisons for specific vendors based on several key characteristics, such as deployment models, enterprise-class features, security and data protection features, and support services.
Note that while the Hadoop big data management ecosystem is engineered for scalable data storage and high-performance distributed computing, actual performance can vary for several reasons, including how the software stack is implemented and, just as often, the characteristics of the applications that will run on it. To account for this, buyers should examine how each vendor's Hadoop distribution is targeted to meet the business needs of user organizations.
1. Hadoop deployment models
The Hadoop offerings from AWS, Microsoft and Google deploy solely in cloud environments. AWS uses its Amazon Elastic Compute Cloud, the central part of Amazon's cloud computing platform, and its Simple Storage Service data store to underpin Amazon EMR (Elastic MapReduce), which bundles its Hadoop distribution with the Spark processing engine and various other big data tools and technologies. In addition, Amazon EMR provides the option of using MapR's Hadoop distribution instead of the Amazon distribution.
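As a concrete illustration, a basic EMR cluster with Hadoop and Spark can be provisioned from the AWS CLI. This is a sketch only: the instance type and count, key pair name and region below are illustrative placeholders, not recommendations.

```shell
# Hypothetical sketch: launch a small EMR cluster running Hadoop and Spark.
# Instance type/count, key name and region are illustrative placeholders.
aws emr create-cluster \
  --name "example-cluster" \
  --release-label emr-5.20.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --region us-east-1
```

The `--applications` flag is how EMR bundles additional big data tools, such as Spark, into the cluster at launch time.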
Microsoft utilizes its Azure cloud infrastructure for HDInsight, a managed service that is currently based on HDP. Likewise, Google offers a managed service on its cloud platform called Google Cloud Dataproc. The service is built around the open source versions of Hadoop and Spark.
The cloud deployment model provides a rapid yet low-effort means of provisioning a Hadoop cluster, and AWS, Microsoft and Google all enable users to resize their environments on demand to handle dynamic computing and storage capacity needs. This elasticity is desirable for organizations with computational and storage needs that may vary over time.
While Cloudera and MapR also offer cloud-based deployments, they aren't limited to that model. These vendors allow users to download distributions that can be deployed on premises or in private clouds on a variety of servers running Linux or Windows. Cloudera and MapR also provide sandbox versions that can run in a virtual environment such as VMware.
The bottom line: Consider whether the organization prefers to manage its big data environment in-house or use a managed service. In-house management implies oversight and maintenance of the software environment, as well as continuous monitoring of the system, whether that environment is a physical platform on premises or a cluster that runs in the cloud. The on-premises option may be preferable if the company has an experienced IT staff that knows the proper system sizing characteristics or if security concerns warrant managing the system behind a trusted firewall.
The alternative is to use a vendor with a hosted services platform that will help configure, launch, manage and monitor operations. This may be preferable if the organization isn't sure what size system it will need or if it expects the required system size to fluctuate based on changing demand. The benefit of working with a cloud or managed service is that it will provide the necessary elasticity for both storage and processing resources.
2. Enterprise-class features of the top Hadoop distributions
Before the merger of Cloudera and Hortonworks, there were some notable differences in the development approaches of those two vendors and MapR that should be factored into a Hadoop distributions comparison.
Cloudera often augmented the Hadoop core with internally developed add-on technologies -- for example, its Impala SQL-on-Hadoop query engine; Cloudera Manager administration tools; and Kudu, an alternative data store to the Hadoop Distributed File System (HDFS) for use in real-time analytics applications. The company eventually open sourced some of those technologies after doing the initial development work itself, but it kept others proprietary.
Hortonworks, on the other hand, touted that it was "innovating 100% of its software in the Apache Hadoop community, [with] no proprietary extensions." Add-on technologies that it was the driving force behind, such as the Atlas data governance framework and Ambari provisioning and management software, were launched as open source projects from the outset. Hortonworks also banded together with IBM and other companies to form a group called the Open Data Platform Initiative (ODPi), an organization devoted to creating a common set of core technical specifications for Hadoop platforms. ODPi members claim that this will improve interoperability and minimize vendor lock-in.
Cloudera hasn't fully clarified how it will harmonize those two approaches. But the company said its unified Cloudera Data Platform (CDP) distribution will be "a 100% open source data platform."
MapR has taken a third path by eschewing some core Hadoop components and developing its own foundational technologies in an effort to support large clusters with enterprise-class performance needs. For example, instead of using HDFS, MapR built a file system that was initially known as MapR-FS and is now the MapR XD Distributed File and Object Store. It also created a NoSQL database, first called MapR-DB and now MapR Database, as an alternative to the HBase system that's tied to Hadoop.
Reflecting a strategic focus on real-time and stream processing applications, the MapR Data Platform also includes an internally developed event streaming technology that was introduced as MapR Streams and is now called the MapR Event Store for Apache Kafka.
From a features standpoint, the enterprise version of Cloudera's existing CDH distribution provides tools for operational management and reporting, as well as for supporting business continuity. This includes such items as configuration history and rollbacks, rolling updates and service restarts, and automated disaster recovery. The HDP distribution developed by Hortonworks offers proactive monitoring and maintenance, plus data governance and metadata management tools. The unified CDP offering will blend features from CDH and HDP, with some overlap in functionality to ease migrations, Cloudera said.
MapR's enterprise offering provides tools to better manage and ensure the resiliency and reliability of data in Hadoop clusters, as well as multi-tenancy and high availability capabilities.
While the AWS cloud platform itself is the primary calling card for Amazon EMR, AWS also offers tools for monitoring and managing clusters and enabling application and cluster interoperability as part of the Hadoop service.
Amazon EMR collects and utilizes metrics to track progress and measure the health of a cluster. Users can retrieve cluster health metrics through the command-line interface, software development kits (SDKs) or APIs, and can view them in the EMR management console. Additionally, Amazon's CloudWatch monitoring service can be used along with EMR's implementation of the Apache Ganglia performance monitoring component to check the cluster and set alarms on events triggered by those metrics.
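For instance, EMR publishes a built-in IsIdle metric to CloudWatch that reports whether a cluster is doing work, and an alarm on it could be set from the AWS CLI roughly as follows. The alarm name, cluster ID, evaluation window and SNS topic ARN are illustrative assumptions.

```shell
# Hypothetical sketch: alarm when an EMR cluster has been idle for ~30 minutes.
# Cluster ID, alarm name and SNS topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "emr-cluster-idle" \
  --namespace AWS/ElasticMapReduce \
  --metric-name IsIdle \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
  --statistic Average \
  --period 300 \
  --evaluation-periods 6 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```

The same pattern -- a CloudWatch alarm on a cluster-level metric with a notification action -- applies to other EMR health metrics as well.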
Microsoft's Azure HDInsight managed service offers more than 30 Hadoop and Spark applications that the company said can be installed with a single click. The service uses the Azure Log Analytics tool as an interface for monitoring clusters, and it's integrated with various other technologies in the Azure cloud, including Azure Cosmos DB, SQL Data Warehouse, Blob Storage and Data Lake Storage.
Google Cloud Dataproc provides automated cluster deployment, configuration and management, although users can manually configure systems if they prefer. Cloud Dataproc also includes built-in integration with other Google Cloud Platform services, such as Google Cloud Storage and the BigQuery data warehouse.
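For comparison, the automated deployment that Cloud Dataproc provides reduces cluster creation to a single gcloud command. This is a sketch under assumed defaults; the cluster name, region, machine types and worker count are illustrative placeholders.

```shell
# Hypothetical sketch: create a small Dataproc cluster with managed
# Hadoop and Spark. Name, region and machine types are placeholders.
gcloud dataproc clusters create example-cluster \
  --region us-central1 \
  --master-machine-type n1-standard-4 \
  --worker-machine-type n1-standard-4 \
  --num-workers 2
```

Users who prefer manual configuration can override these defaults with additional flags rather than accepting the automated setup.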
The bottom line: Choosing a Hadoop vendor that provides value-added components as part of its enterprise subscription may mean committing to a long-term relationship -- especially if those components integrate with its standard distribution stack. If companies are concerned about vendor lock-in, they should consider the vendors that participate in the ODPi.
3. Security and data protection offerings from the Hadoop vendors
Despite the expanding use of open source software for enterprise-class applications, there remain suspicions about its suitability for production use from a security and data protection perspective. However, the Hadoop distribution vendors have taken steps to alleviate some of this anxiety.
For example, before it was acquired by Cloudera, Hortonworks had teamed up with other vendors and customers to launch a Data Governance Initiative for Hadoop, with an initial focus on the Apache Atlas project for managing shared metadata, data classification, auditing and security and policy management for data protection. It also integrated Atlas with Apache Ranger, an open source security tool for enforcing data access policies.
Cloudera provides tools that enable users to manage data security and governance for the CDH platform, supporting an organization's need to meet compliance and regulatory requirements. The company said it plans to build a single stack of security and data governance tools into CDP, but it didn't specify whether that will be based on the CDH stack, the Hortonworks one or a combination of the two.
In addition, Cloudera and MapR provide data encryption. Both CDH and HDP from Cloudera support encryption of data at rest while MapR provides encryption of data transmitted to, from and within a cluster.
Amazon EMR encrypts data at rest and in transit. It also offers identity and access management (IAM) policies to set permissions for individual users and groups in Hadoop systems. IAM policies can also be combined with resource tagging to enforce cluster-by-cluster access control. Other security features include Kerberos authentication and Secure Shell (SSH) support.
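As a sketch of how IAM policies and tagging can combine for cluster-by-cluster access control, a policy along these lines could limit a group's EMR actions to clusters carrying a given tag. The tag key and value here are illustrative assumptions.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowActionsOnTaggedClustersOnly",
      "Effect": "Allow",
      "Action": "elasticmapreduce:*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "elasticmapreduce:ResourceTag/team": "analytics"
        }
      }
    }
  ]
}
```

Attached to an IAM group, a policy of this shape allows its members to act only on clusters tagged `team=analytics`, leaving other clusters inaccessible.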
Azure HDInsight includes the Enterprise Security Package (ESP), a technology bundle designed to help organizations protect their data through the use of Microsoft's Azure Virtual Network service, server-side encryption of data at rest and integration with Azure Active Directory for user authentication. ESP also incorporates Ranger for setting access control policies and searching the audit logs that track access to cluster resources.
Google Cloud Platform's standard security model applies to Cloud Dataproc. That provides a set of authentication, authorization and encryption mechanisms, including IAM permissions and both Secure Sockets Layer and Transport Layer Security encryption. Data can also be encrypted with user-supplied keys so that it's accessible only to specified users -- when a cluster is set up, when the data is in transit to or from a cluster, or when a processing job is submitted.
The bottom line: The Hadoop vendors provide different approaches to authentication, role-based access control, security policy management and data encryption. Buyers should carefully specify security and protection requirements and review how each vendor addresses those needs. Doing a comprehensive Hadoop distribution comparison will help distinguish needs from wants.
4. Support subscriptions for the top Hadoop distributions
The fundamental value proposition of the commercial open source software model is the bundling and simplification of system deployment with support and services. The alternative for deploying Hadoop involves downloading the source code for each component from its open source repository and then building and integrating all the parts together -- a process that takes both skill and effort and is likely to be iterative. The distribution vendors have already done that heavy lifting, providing preconfigured distributions and maintaining up-to-date integrated stacks.
What differentiates the vendors to a large degree is their support models. For example, Cloudera offers subscriptions with business-day and 24/7 support options for enterprise license holders. In both cases, it promises an initial response within one hour to a total functionality loss on a production system, although support technicians only work on problems from 9 a.m. to 5 p.m. local time Monday to Friday with the 8/5 business-day option. For its 24/7 customers, the company also offers premium support that includes a 15-minute response time on critical issues. Cloudera recommends opening an online support case when technical problems arise, but it said customers can call for help if need be.
All AWS accounts include basic support, which provides 24/7 customer service, access to community forums and documentation, as well as access to the AWS Trusted Advisor application. Developer support includes 12- or 24-hour response times, depending on the severity of the issue, and email access to support engineers during business hours (8 a.m. to 6 p.m. in the customer's country). Business-level support provides 24/7 phone, email and chat access to support engineers as well as shortened response times based on severity. Enterprise-level support adds a response time of less than 15 minutes for critical issues, plus a dedicated technical account manager and a concierge support team.
MapR offers three levels of support. Beyond a free Community Support option that provides a support knowledge base and other information resources, production users can choose between the company's standard support and technical account management support programs. Both include follow-the-sun support and 24/7 phone support with a one-hour initial response time on severe system errors. The TAM service adds a technical account manager to be a single point of contact during a customer's normal business hours, as well as performance and upgrade guidance.
The bottom line: If support services are the source of added value from the vendor, the costs for the different support subscriptions should be aligned with customer expectations. Subscriptions providing one-hour or even 15-minute response times on a 24/7 basis with dedicated support staff will cost a lot more than 24-hour response time from a web-based interface during business hours.
Hadoop and related technologies have transformed the BI, analytics and data management industry since the big data framework was created in 2006. But, as we've examined, the open source Hadoop framework offers only so much, and companies that need more comprehensive performance and functionality capabilities as well as maintenance and support are turning to commercial Hadoop software distributions. Hopefully, this information will help companies make a more informed choice when purchasing a Hadoop distribution.
Linda Rosencrance contributed to this report.