Data warehouse architectures, concepts and phases: What you need to know
This tutorial explains the different types of data warehouse architecture including bus, federated and hub-and-spoke. Learn about ETL processes and data staging and data warehouse phases.
1.3 Data Warehouse Architecture
The following architecture properties are essential for a data warehouse system (Kelly, 1997):
- Separation: Analytical and transactional processing should be kept apart as much as possible.
- Scalability: Hardware and software architectures should be easy to upgrade as the data volume, which has to be managed and processed, and the number of users' requirements, which have to be met, progressively increase.
- extensibility: The architecture should be able to host new applications and technologies without redesigning the whole system.
- Security: Monitoring accesses is essential because of the strategic data stored in data warehouses.
- Administerability: Data warehouse management should not be overly difficult.
For more information on this title
This passage is excerpted from Data Warehouse Design: Modern Principles and Methodologies by Matteo Golfarelli and Stefano Rizzi (McGraw-Hill; 2009) with permission from McGraw-Hill.
It explains eight different types of data warehouse architecture including single-, two- and three-layer architecture, bus architecture, federated architecture and hub-and-spoke. Learn about ETL processes and data staging and how to get relevant data through extraction. Plus, read definitions of data marts and legacy systems in this data warehouse architecture tutorial.
Two different classifications are commonly adopted for data warehouse architectures. The first classification, described in sections 1.3.1, 1.3.2, and 1.3.3, is a structure-oriented one that depends on the number of layers used by the architecture. The second classification, described in section 1.3.4, depends on how the different layers are employed to create enterprise-oriented or department-oriented views of data warehouses.
1.3.1 Single-Layer Architecture
A single-layer architecture is not frequently used in practice. Its goal is to minimize the amount of data stored; to reach this goal, it removes data redundancies. Figure 1-2 shows the only layer physically available: the source layer. In this case, data warehouses are virtual. This means that a data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer (Devlin, 1997).
The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are submitted to operational data after the middleware interprets them. It this way, the queries affect regular transactional workloads. In addition, although this architecture can meet the requirement for integration and correctness of data, it cannot log more data than sources do. For these reasons, a virtual approach to data warehouses can be successful only if analysis needs are particularly restricted and the data volume to analyze is huge.
What is a legacy system?
Definition: The term legacy system denotes corporate applications, typically running on mainframes or minicomputers, that are currently used for operational tasks but do not meet modern architectural principles and current standards. For this reason, accessing legacy systems and integrating them with more recent applications is a complex task. All applications that use a nonrelational database are examples of legacy systems.
1.3.2 Two-Layer Architecture
The requirement for separation plays a fundamental role in defining the typical architecture for a data warehouse system, as shown in Figure 1-3. Although it is typically called a two-layer architecture to highlight a separation between physically available sources and data warehouses, it actually consists of four subsequent data flow stages (Lechtenbörger, 2001):
- Source layer A data warehouse system uses heterogeneous sources of data. That data is originally stored to corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
- Data staging The data stored to sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one common schema. The so-called Extraction, Transformation, and Loading tools (ETL) can merge heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a data warehouse (Jarke et al., 2000). Technologically speaking, this stage deals with problems that are typical for distributed information systems, such as inconsistent data management and incompatible data structures (Zhuge et al., 1996). Section 1.4 deals with a few points that are relevant to data staging.
- Data warehouse layer Information is stored to one logically centralized single repository: a data warehouse. The data warehouse can be directly accessed, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories (section 1.6) store information on sources, access procedures, data staging, users, data mart schemata, and so on.
- Analysis In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. Technologically speaking, it should feature aggregate data navigators, complex query optimizers, and user-friendly GUIs. Section 1.7 deals with different types of decision-making support analyses.
What is a data mart?
Definition: A data mart is a subset or an aggregation of the data stored to a primary data warehouse. It includes a set of information pieces relevant to a specific business area, corporate department, or category of users.
The architectural difference between data warehouses and data marts needs to be studied closer. The component marked as a data warehouse in Figure 1-3 is also often called the primary data warehouse or corporate data warehouse. It acts as a centralized storage system for all the data being summed up. Data marts can be viewed as small, local data warehouses replicating (and summing up as much as possible) the part of a primary data warehouse required for a specific application domain.