Five factors to help select the right data warehouse product
How big is your company, and what resources does it have? What are your performance needs? Answering these questions and others can help you select the right data warehouse platform.
Once you've decided to implement a new data warehouse, or expand an existing one, you'll want to ensure that you choose the technology that's right for your organization. This can be challenging, as there are many data warehouse platforms and vendors to consider.
Long-time data warehouse users generally have a relational database management system (RDBMS) such as IBM DB2, Oracle or SQL Server. It makes sense for these companies to expand their data warehouses by continuing to use their existing platforms. Each of these platforms offers updated features and add-on functionality (see the sidebar, "What if you already have a data warehouse?").
But the decision is more complicated for first-time users, as all data warehousing platform options are available to them. They can opt to use a traditional DBMS, an analytic DBMS, a data warehouse appliance or a cloud data warehouse. The following factors may help make the decision process easier.
1. How large is your company?
Larger companies looking to deploy data warehouse systems generally have more resources, both financial and staffing, which translates to more technology options. It can make sense for these companies to implement multiple data warehouse platforms, such as an RDBMS coupled with an analytical DBMS such as Hewlett Packard Enterprise (HPE) Vertica or SAP IQ. Traditional queries can be processed by the RDBMS, while online analytical processing (OLAP) and nontraditional queries can be processed by the analytical DBMS. Nontraditional queries are those not usually found in transactional applications, which are typified by quick lookups. Examples include document-based queries and free-form searches, such as those run on web search sites like Google and Bing.
For example, HPE Vertica offers Machine Data Log Text Search, which helps users collect and index large log file data sets. The product's enhanced SQL analytics functions deliver in-depth capabilities for OLAP, geospatial and sentiment analysis. An organization might also consider SAP IQ for in-depth OLAP as a near-real-time service to SAP HANA data.
Teradata Corp.'s Active Enterprise Data Warehouse (EDW) platform is another viable option for large enterprises. Active EDW is a database appliance designed to support data warehousing and built on a massively parallel processing (MPP) architecture. The platform combines relational and columnar capabilities, along with limited NoSQL capabilities. Teradata Active EDW can be deployed on-premises or in the cloud, either directly from Teradata or through Amazon Web Services.
For midsize organizations, where a mixture of flexibility and simplicity is important, reducing the number of vendors is a good idea. That means looking for suppliers that offer compatible technology across different platforms. For example, Microsoft, IBM and Oracle all have significant software portfolios that can help minimize the number of other vendors an organization might need. Hybrid transaction/analytical processing (HTAP) capabilities that enable a single DBMS to run both transaction processing and analytics applications should also appeal to midsize organizations.
What if you already have a data warehouse?
Data warehousing has been around for several decades, so it isn't uncommon for an organization to have already implemented a data warehouse. But even if you have a system in place with no plans to change the underlying core technology, there are still things you can do to improve performance and capabilities.
Your RDBMS vendor likely has released several new and improved versions since you first implemented your data warehouse. Take advantage of new features such as OLAP functions, materialized query tables and built-in extract, transform and load capabilities.
In many cases, RDBMS vendors have added HTAP capabilities to enable both transactional and analytical processing using the same DBMS.
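As a rough illustration of the two ideas above, OLAP functions and a single DBMS serving both transactional and analytical work, the following minimal sketch uses Python's built-in sqlite3 module. SQLite is only a stand-in here, not a data warehouse platform, and the sales table and its columns are hypothetical names chosen for the example. Window functions require SQLite 3.25 or later.

```python
import sqlite3

# In-memory SQLite stands in for a single DBMS that serves both workloads.
# Illustration only: SQLite is not a data warehouse platform, and the
# "sales" table is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# Transactional side: a short write that commits atomically.
with conn:
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (?, ?)",
        [("east", 100.0), ("east", 50.0), ("west", 70.0), ("west", 30.0)],
    )

# Analytical side: an OLAP-style window function over the same table,
# computing a running total per region.
rows = conn.execute(
    """
    SELECT id, region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY id) AS running_total
    FROM sales
    ORDER BY region, id
    """
).fetchall()

for row in rows:
    print(row)
```

The point is the shape of the workload rather than the engine: the same table accepts short committed writes and answers an analytic query without a separate copy of the data. A production HTAP system layers storage and optimizer features on top of this that the sketch does not attempt to show.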
Smaller organizations and those with minimal IT support should consider a data warehouse appliance or a cloud-based data warehouse as a service (DWaaS) offering. Both options make it easier to get up and running, and minimize the administration work needed to keep a data warehouse functional. In the cloud, for example, Amazon Redshift and IBM dashDB offer fully managed data warehousing services that can lower up-front implementation costs and ongoing management expenses.
Regardless of company size, it can make sense for an organization to work with a vendor or product that it has experience using. For example, companies using Oracle Database might consider the Oracle Exadata Database Machine, Oracle's data warehouse appliance. Exadata runs Oracle Database 12c, so Oracle developers and DBAs should immediately be able to use the appliance. Also, the up-front system planning and integration required for data warehousing projects is eliminated with Exadata because it bundles the DBMS with compute, storage and networking technologies.
A similar option for organizations that use IBM DB2 is the IBM PureData System for Analytics, which is based on DB2 for LUW. Keep in mind, however, that data warehouse appliances can be costly, at times pricing themselves out of the market for smaller organizations.
Microsoft customers should consider the preview release of Microsoft Azure SQL Data Warehouse. It's a fully managed data warehouse service that's compatible and integrated with the Microsoft SQL Server ecosystem.
2. What are your availability and performance needs?
Other factors to consider include high availability and rapid response. Most organizations that decide to deploy a data warehouse will likely want both, but not every data warehouse actually requires them.
When availability and performance are the most important criteria, DWaaS should be at the bottom of your list because network latency slows cloud access. An on-premises deployment, by contrast, can be tuned and optimized by IT staff to deliver higher system availability and faster performance at the high end. This can mean using the latest features of an RDBMS, including the HTAP capabilities of Oracle Database, or IBM's DB2 with either the IBM DB2 Analytics Accelerator add-on product for DB2 for z/OS or BLU Acceleration capabilities for DB2 for LUW. Most RDBMS vendors offer capabilities such as materialized views, bitmap indexes, zone maps and high-end compression for data and indexes. For most users, however, satisfactory performance and availability can be achieved with data warehouse appliances such as IBM PureData, Teradata Active EDW and Oracle Exadata. These platforms are engineered for data warehousing workloads and require minimal tuning and administration.
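To make the materialized view idea mentioned above more concrete, here is a minimal sketch that emulates one by hand with Python's built-in sqlite3 module. SQLite has no native materialized views, so a summary table rebuilt on demand stands in for the feature. The orders and orders_by_region tables and the refresh_summary helper are hypothetical names for illustration, not any vendor's API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
with conn:
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("east", 100.0), ("east", 50.0), ("west", 70.0)],
    )


def refresh_summary(conn):
    """Rebuild the precomputed summary table (a hand-rolled materialized view)."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS orders_by_region")
        conn.execute(
            "CREATE TABLE orders_by_region AS "
            "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
        )


def read_summary(conn):
    """Read the stored aggregate without rescanning the detail rows."""
    return dict(conn.execute("SELECT region, total FROM orders_by_region"))


refresh_summary(conn)

# A new detail row arrives after the summary was built.
with conn:
    conn.execute("INSERT INTO orders VALUES ('west', 30.0)")

stale = read_summary(conn)   # still reflects the state at the last refresh
refresh_summary(conn)
fresh = read_summary(conn)   # now includes the new row

print(stale)
print(fresh)
```

The trade-off shown is the usual one: queries against the summary are cheap because the aggregation has already been done, but results are only as current as the last refresh. RDBMS products that support materialized views natively typically manage that refresh for you.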
Another appliance to consider is the Actian Analytics Platform, which is designed to support high-speed data warehouse implementation and management. The platform combines relational and columnar capabilities, but also includes high-end features for data integration, analytics and performance. It can be a good choice for organizations requiring both traditional and nontraditional data warehouse queries. The Actian Analytics Platform includes Actian Vector, a symmetric multiprocessing (SMP) DBMS designed for high-performance analytics, which exploits many newer, performance-oriented features such as single instruction, multiple data (SIMD). These features enable a single operation to be applied to a set of data at once and the CPU cache to be used as execution memory.
Pivotal Greenplum is an open source, massively parallel data warehouse platform capable of delivering high-speed analytics on large volumes of data. The platform combines relational and columnar capabilities and can be deployed on-premises as software or an appliance, or as a service in the cloud. Given its open source orientation, Pivotal Greenplum may be viewed favorably by organizations basing their infrastructure on an open source computing stack.
3. Are you already in the cloud?
DWaaS is probably the best option for companies that already conduct cloud-based operations. The other data warehouse platform options would require your business to move data from the cloud to an on-premises data warehouse. Keep in mind, though, that in addition to cloud-only options like Amazon Redshift, IBM dashDB and Microsoft Azure SQL Data Warehouse, many data warehouse platform providers offer cloud-based deployments.
4. What are your data volume and latency requirements?
Although many large data warehouses contain petabytes of raw data, every data warehouse implementation has different data storage needs. The largest data warehouses are usually customized combinations of RDBMS and analytic DBMS or HTAP implementations. As data volume requirements diminish, more varied options can be utilized, including data warehouse appliances.
5. Is a data warehouse part of your big data strategy?
Big data requirements have begun to impact the data warehouse, and many organizations are integrating unstructured and multimedia data into their data warehouse to combine analytics with business intelligence requirements -- aka polyglot data warehousing. If your project could benefit from integrated polyglot data warehousing, you need a platform that can manage and utilize this type of data. For example, the big RDBMS vendors -- IBM, Oracle and Microsoft -- are integrating support for nontraditional data and Hadoop in each of their respective products.
You may also wish to consider IBM dashDB, which can process unstructured data via its direct integration with IBM Cloudant, enabling you to store and access JSON and NoSQL data. The Teradata Active EDW supports Teradata's Unified Data Architecture, which enables organizations to seamlessly access and analyze relational and nonrelational data. The Actian Analytics Platform delivers a data science workbench that simplifies analytics, as well as a scaled-out version of Actian Vector for processing data in Hadoop. Finally, the Microsoft Azure SQL Data Warehouse enables analysis across many kinds of data, including relational data and semi-structured data stored in Hadoop, using its T-SQL language.
Although organizations have been building data warehouses since the 1980s, the manner in which they are being implemented has changed considerably. You should now have a better idea of how modern data warehouses are built and what each of the leading vendors provides. Armed with this knowledge, you can make a more informed choice when purchasing data warehouse products.