Capacity planning for the data warehouse environment

Capacity planning is essential for all data warehouse environments. Data warehouses grow at high rates and must be managed to keep budgets under control.

There are many reasons for doing capacity planning for the data warehouse, DSS environment. Data warehouses grow at a tremendous rate and costs escalate over time. Data warehouses operate on a variety of technologies and have similar performance issues to other technology environments. Therefore, it only makes sense to plan your data warehouse environment. In doing so, you place your corporation in a proactive, not reactive position.

Who Does Data Warehouse Capacity Planning?

How many corporations do capacity planning for the data warehouse, DSS environment? And, of those few corporations that actually do capacity planning, how many have done capacity planning for the data warehouse environment effectively? The answer is that there are, at best, a handful of companies that have done data warehouse, DSS capacity planning effectively. All enterprises should do capacity planning for the data warehouse environment.

Three Approaches

This article will describe three approaches for capacity planning for the data warehouse, DSS environment. After reading about these three approaches, you can decide which approach would be best for your enterprise.

What Needs To Be Planned?

What is it that needs to be planned in the data warehouse, DSS environment? While there are many facets to the data warehouse DSS environment, the two most important aspects of capacity planning are planning for storage and planning for processors.

In addition, there are plenty of other decisions that come with this planning:

  • what dbms to use;
  • what form of networking to be used;
  • what end-user access and analysis tools should be used; and
  • what kind of integration and transformation tools should be used, etc?

But the heart of the matter is how much storage and how many and what types of processors should be used for the data warehouse.

Key Factors

The two key factors the capacity planner looks at are the amount of data to be managed and the workload that will be run against the data warehouse. Unfortunately, both of these factors in the data warehouse DSS environment are very difficult to ascertain.

The Analytical Approach

The first approach to capacity planning is the analytical approach. The analytical approach is one in which the capacity planner attempts to calculate and/or predict capacity needs before the equipment is purchased. In the analytical approach the analyst attempts to quantify such things as:

  • how many customers will be in the warehouse;
  • at what rate will the customers grow;
  • how many transactions will be in the warehouse;
  • at what rate will the transactions grow;
  • what other data will be in the warehouse;
  • at what rate will the other data grow;
  • what is the proper level of granularity for data in the warehouse;
  • can the level of granularity be changed if needed;
  • what amount of history is needed in the warehouse;
  • will the user decide to add more history than anticipated, etc?

Each of these interrelated questions must be answered in order for the analyst to determine how much data will be in the warehouse. If you have ever struggled through an exercise of trying to accurately predict the answers to these questions, you know that accuracy is very difficult to achieve.

But volumes of data are only one aspect of capacity planning. The other side of capacity planning in the data warehouse, DSS environment is that of workload projection. If you thought trying to predict volumes of data was difficult, try to predict what the workload for the DSS environment will be.

DSS Workload

There are four types of end-users that will access the data warehouse, DSS environment. They are tourists, farmers, explorers and data miners.

Tourists are people who know how to cover a breadth of material quickly, but who have little depth. Tourists know how to find things.

Farmers are those people who do the same activity repeatedly, except on different data. Farmers know what they want before they set out to execute a query. Farmers operate in a very predictable manner. Farmers execute the same query repeatedly, against very small amounts of data. Farmers expect good performance for their queries. Many knowledge workers are farmers, typically in finance and accounting. 

Explorers are people who do not know what they want. Explorers are people who do "out-of-the-box" thinking. Explorers operate on intuition. Explorers create huge queries, looking at much detail and history. Response time for explorers may be very long. Explorers look at data one way and then another. Then they move on to other data. Explorers often find nothing. But when an explorer finds something, the results can be spectacular. Many actuaries and process control engineers are explorers. 

Data miners are people who methodically scan data. ­Usually data miners examine large amounts of data at a detailed level, ­looking for confirmation of a hypothesis or perhaps even for a new hypothesis. Data miners look for suspected patterns. Once having found the pattern, the data miner tries to explain the pattern, in both the technical sense and the business sense. Once the pattern is discovered and explained, it is then open to exploitation. The business potential is tremendous.

There are many factors that must be considered when trying to profile the data warehouse, DSS workload. Some of the more interesting factors are:

  • how many end-users will you have?
  • what types of end-users do you have?
  • what does the average query look like for each type of end-user?
  • what does the pattern of submission look like for each type of end-user?
  • are there predictable loads for daily, monthly or quarterly processing?
  • will there be an attempt to use a resource governor? etc.

There are many questions that need to be answered to portray the data warehouse, DSS workload. As in the case of volumes of data, an accurate picture simply cannot be painted.

But perhaps the toughest question for the analytical approach is gauging the attitude of the community of people using the data warehouse, DSS environment. In most cases the data warehouse, DSS environment is one of discovery. People simply don't know what is going to happen until they get there. People don't know what they will do until they know what the possibilities are. And if people really don't know what they will do, trying to look into a crystal ball and predict what will happen is impossible.

The Calibrated Extrapolation Approach

The calibrated extrapolation approach is one where there is at best a rudimentary attempt at analytical capacity planning. But after the first or second iteration of the warehouse is created and after the first few users have become enamored of the data warehouse, then careful track is kept for the warehouse and its usage. Over calibrated periods of time, the growth of the warehouse is tracked. Based on the incremental growth that is being measured, an extrapolation of future capacity needs is made. The extrapolation of capacity needs then becomes an educated guess. Of course the educated guess can be refined. The analyst can factor in known growth factors such as the addition of new subject areas, addition of history, and the like. In doing so, the analyst combines the best of the calibrated extrapolation approach and the analytical approach.

But even when the calibrated extrapolation approach is used wisely, the calibrated extrapolation approach has only a short time horizon for effectiveness. Extrapolation can be done for three months or maybe even for six months. But anything beyond that is questionable.

The Third-Party Approach

The Third-Party approach is to find an expert, company or trusted vendor who has worked with a data warehouse, DSS environment that has roughly the same characteristics as your company.   There is no substitute for experience.  But there are pitfalls with the third-party approach. Some of the pitfalls are:

  • the third-party has not provided accurate information;
  • the third-party being examined has fundamental business and technological differences from your company;
  • the third-party being examined is affected by, and is responding to, business pressures which you are not aware of, etc.

The use of experts and vendors can be beneficial if they have your best interests at heart.

Summary

Capacity planning is essential for all data warehouse environments.  Data warehouses grow at high rates and must be managed to keep budgets under control.

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.

Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!

Dig Deeper on Data integration