Top 5 elements needed for a successful data warehouse
While conventional data warehouses may struggle to keep up with growing volumes of data, these five elements best position organizations to tap into valuable BI.
As companies strive for a unified view of their customers and other key corporate data, many turn to data warehouses as a place to keep the single source of digital truth.
A survey released earlier this year by TDWI Research found that 53% of companies have an on-premises data warehouse and 36% have one in the cloud. According to Gartner, 75% of all databases will be in the cloud by 2022, and cloud database revenue will account for 50% of the total market by 2023.
Common cloud-based data warehouse database management systems include Amazon Redshift, Cloudera Data Warehouse, Databricks, Google BigQuery, Microsoft Azure Synapse, Snowflake Data Cloud and Teradata Vantage. On premises, popular platforms include IBM Db2, Oracle Autonomous Database, Teradata Vantage, SAP HANA and Vertica, though they can be deployed in the cloud as well.
Data warehouse vendors take a broad range of approaches to analytics, machine learning and artificial intelligence. Some have the most advanced tools built right into the system, even offering self-configuring automated machine learning (AutoML) capabilities. Others support integration with third-party data science platforms and tools, or require companies to do their own data science.
Some data warehouses provide not just information management and analytics capabilities, but also orchestration functionality, said Amaresh Tripathy, global leader of analytics at Genpact, a digital transformation consultancy.
It's not just collecting data and making predictions from it, but also sending execution commands to other systems to act on those predictions. "The next generation of data warehouses will have all those things," Tripathy said.
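In practice, that orchestration loop can already be wired up outside the warehouse. The following is a minimal, hypothetical sketch: it pulls data from the warehouse with SQLAlchemy and pandas, scores it with a previously trained model, and posts an execution command to a downstream system. The connection string, table, model file and API endpoint are all placeholders.

```python
# A minimal sketch of the "collect, predict, act" orchestration pattern,
# assembled by hand with generic libraries. Every name below is hypothetical.
import pickle

import pandas as pd
import requests
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://analytics@warehouse.example.com/dw")

# 1. Collect: pull current stock levels and recent sales from the warehouse.
stock = pd.read_sql(
    "SELECT sku, on_hand_units, avg_daily_sales FROM inventory_snapshot", engine
)

# 2. Predict: score each SKU with a previously trained demand model.
with open("demand_model.pkl", "rb") as f:       # hypothetical model artifact
    model = pickle.load(f)
stock["days_of_cover"] = model.predict(stock[["on_hand_units", "avg_daily_sales"]])

# 3. Act: send an execution command to the ERP for SKUs about to run out.
for sku in stock.loc[stock["days_of_cover"] < 3, "sku"]:
    requests.post(
        "https://erp.example.com/api/purchase-orders",   # hypothetical endpoint
        json={"sku": str(sku), "quantity": 500},
        timeout=10,
    )
```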
The 5 characteristics of a successful data warehouse project
1. Cloud and on-premises deployment options
"These days, most -- but not all -- organizations are looking to move legacy on-premises deployments into the cloud," said Doug Henschen, analyst at Constellation Research.
That means companies selecting a data warehouse platform should consider whether the data warehouse can be deployed in their preferred cloud environment or is available as a service.
Companies should also check whether there's an on-premises option and if it's compatible with the cloud version, he added.
2. Data science capabilities
According to Henschen, every data warehouse will support standard SQL queries, but support for data science varies greatly. Some data warehouses have advanced analytics and data science capabilities built in, while others leave the data science up to the customer.
When those features are available, the usability can vary greatly as well. "Do these capabilities target exclusively data scientists or are they AutoML-type features that can be exploited by SQL-savvy analysts and power users?" Henschen said.
In addition, companies may want to look at the support available for third-party data science platforms and ecosystems.
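As one concrete illustration of the "built in" route, a warehouse with in-database ML lets a SQL-savvy analyst train and score a model without leaving the platform. The sketch below uses BigQuery ML through its Python client; the dataset, table and column names are hypothetical, and the exact options and output columns vary by product, so treat it as a rough sketch rather than a recipe.

```python
# Rough sketch of in-database ML: training and scoring are just SQL statements
# submitted through the standard client. All dataset/table/column names are
# hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials are already configured

train_sql = """
CREATE OR REPLACE MODEL `analytics.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `analytics.customer_features`
"""
client.query(train_sql).result()  # blocks until the training job finishes

# Scoring is also SQL, so the analyst never has to leave the warehouse.
predictions = client.query("""
SELECT customer_id, predicted_churned_probs
FROM ML.PREDICT(MODEL `analytics.churn_model`,
                (SELECT * FROM `analytics.customer_features_today`))
""").to_dataframe()
```

The alternative route, exporting query results to a third-party data science platform, shifts that work, and the required skills, onto the customer's own team.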
3. Performance capabilities
The key measure of a data warehouse's performance is how well it handles queries.
"Performance will depend on the number, frequency and sophistication of your queries and the number of concurrent users," Henschen said.
Companies need to consider their performance requirements and expectations when making this evaluation. Henschen said companies will need to ask themselves if the workload primarily involves predictable queries driving reports and dashboards at scale, or if unpredictable ad hoc queries will come into the picture.
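Nothing answers that question like running the organization's own queries at the expected concurrency. The snippet below is a minimal, hypothetical evaluation harness along those lines: it fires one representative query from a pool of simulated users through SQLAlchemy and reports the latency spread. The connection string and query are placeholders; a real proof of concept would replay the organization's actual workload mix.

```python
# Minimal concurrency check: run the same query from N simulated users at once
# and look at median vs. worst-case latency. DSN and query are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor

import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql://analyst@warehouse.example.com/dw",  # hypothetical connection string
    pool_size=20,
)
QUERY = sqlalchemy.text("SELECT region, SUM(order_total) FROM fact_orders GROUP BY region")

def run_once(_):
    start = time.perf_counter()
    with engine.connect() as conn:
        conn.execute(QUERY).fetchall()
    return time.perf_counter() - start

concurrent_users = 20
with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
    latencies = sorted(pool.map(run_once, range(concurrent_users)))

print(f"median: {latencies[len(latencies) // 2]:.2f}s  worst: {latencies[-1]:.2f}s")
```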
4. Deployment management
Data warehouses tend to be massive projects. Even as-a-service deployments require a lot of work. Companies need to connect data sources or migrate data from other data warehouses, and then set up the analytics and other functionality. Some vendors make this easier than others.
"Despite the marketing hyperbole, enterprise software is rarely, if ever, easy to configure and deploy," Henschen said.
Some vendors offer support for customers who are deploying on premises, or tools and services to help deploy in the cloud. Some vendors also offer container-based deployment options so that companies can deploy in hybrid and multi-cloud environments consistently.
If a company's requirements change, certain platforms are easier to scale up than others. For example, Henschen said some vendors offer serverless data warehouses that automatically match requirements and then scale up as data stores grow.
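Much of that deployment work is plumbing. As a rough illustration, the sketch below copies one table from a legacy warehouse to a new one in chunks with pandas and SQLAlchemy. Both connection strings and the table name are placeholders, and in practice most vendors provide bulk-load paths (COPY commands, external stages, migration services) that are far faster than row-level inserts.

```python
# Bare-bones table migration: stream a table out of the legacy warehouse and
# append it to the new one in chunks. All connection details are hypothetical.
import pandas as pd
import sqlalchemy

source = sqlalchemy.create_engine("postgresql://etl@legacy-dw.example.com/dw")
target = sqlalchemy.create_engine("postgresql://etl@new-dw.example.com/dw")

for chunk in pd.read_sql("SELECT * FROM fact_orders", source, chunksize=50_000):
    chunk.to_sql("fact_orders", target, if_exists="append", index=False)
```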
5. Workload management
Many data warehouse vendors today promise intelligent automation that makes it easy to manage workloads, but AI and automation technologies are in their infancy.
"There's no such thing as a clairvoyant product that understands your workloads, workload priorities and SLAs [service-level agreements] with zero guidance from humans," Henschen said.
But some data warehouse platforms do make it easier to set priority levels and assign resources. "They're letting the product make all sorts of query tuning, data tiering and caching decisions behind the scenes," he said.
The question is, does the data warehouse just keep adding more computing capacity if there's a problem? That can increase costs unnecessarily compared with tweaking other performance-related options. Henschen said he's seen organizations end up with more capacity than they planned for.
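What setting priority levels and assigning resources actually looks like differs by vendor. As one example, the sketch below uses Snowflake-style SQL, submitted through its Python connector, to give dashboards and ad hoc users separate compute pools, cap the ad hoc pool's credit spend and let idle compute suspend itself. The warehouse names, sizes and quotas are hypothetical, and other platforms expose comparable controls through workload-management rules or resource classes instead.

```python
# One possible shape of workload and cost management, using Snowflake-style SQL
# through its Python connector. Names, sizes and quotas are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount", user="admin", password="..."  # hypothetical credentials
)
cur = conn.cursor()

# Separate compute pools so dashboards aren't starved by ad hoc exploration.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS dashboards_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
""")
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS adhoc_wh
      WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
""")

# Cap what the ad hoc pool can spend instead of letting capacity (and cost)
# grow unchecked.
cur.execute("""
    CREATE OR REPLACE RESOURCE MONITOR adhoc_budget
      WITH CREDIT_QUOTA = 100
      TRIGGERS ON 90 PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND
""")
cur.execute("ALTER WAREHOUSE adhoc_wh SET RESOURCE_MONITOR = adhoc_budget")
conn.close()
```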
"Keep in mind that some automation features consume compute cycles to monitor and optimize performance," he added. The automation that's designed to make the warehouse more efficient can itself reduce efficiency.
The flip side is that a less automated warehouse may offer many more fine-grained, manual controls, but then it will require a lot of work and specialized skills to keep it optimized. "Buyers of nonautomated systems complain about people costs and the difficulty of finding and hiring skilled staff," Henschen said.
Plus, older platforms often give discounts to customers who buy capacity in advance. That means companies that commit a year or three ahead of time might wind up overprovisioned and spending far more than they needed to.
"Experience is the best teacher," Henschen said. "Talk to existing customers about their performance and capacity-planning experiences."
The future of data warehouses
The major alternative to the data warehouse is the data lake. While data warehouses hold structured, well-organized data, data lakes are more free-form: data can arrive in a variety of formats, and AI and machine learning tools then read it directly.
But since the data in data lakes isn't well organized, it's harder to get value out of it, Gartner analyst Adam Ronthal said, and it requires data science expertise. One of the first things companies typically do with a data lake is add a layer of optimization to make some sense of the data.
"We can bring it back to where the business analysts can get value from it," Ronthal said. "Very few people can get value from raw data sets."
The result starts to look more and more like a data warehouse. Meanwhile, data warehouses have been adding support for unstructured data.
So, data warehouses and data lakes are converging into something called a data lakehouse, Ronthal said. It combines the data science focus of the data lake and the analytics power of the data warehouse.
"All the traditional data warehouse approaches are reaching into cloud data stores to implement data lake options," Ronthal said. "And all those that started out as data lakes are layering in layers of optimization so they can be data warehouses."