Databases, Data Warehouses, and OLAP
3.3. Architecture of Data Warehouses
A DW has a three-level (tier) architecture that includes
• a bottom tier that consists of the DW server, which may include several specialized data marts and a metadata repository
• a middle tier that consists of an OLAP server (described later) for fast querying of the data warehouse
• a top tier that includes front-end tools for displaying results provided by OLAP, as well as additional tools for data mining of the OLAP-generated data
Figure 6.5. A virtual data warehouse (left) and an enterprise data warehouse (right). [Figure labels: operational databases; external data sources; middleware server; extraction, cleaning, transformation, and load; enterprise warehouse; query results; decision support environment.]
Figure 6.6. Data warehouse consisting of an enterprise warehouse and data marts. [Figure labels: operational databases; external data sources; extraction, cleaning, transformation, and load; enterprise warehouse; selection and aggregation; data marts; decision support environment.]
The overall DW architecture is shown in Figure 6.7.
The metadata repository stores information that defines DW objects. It includes the following parameters and information for the middle-tier and top-tier applications:
– a description of the DW structure, including the warehouse schema, dimensions, hierarchies, data mart locations and contents, etc.
– operational metadata, which usually describe the currency level of the stored data, i.e., active, archived, or purged, and warehouse monitoring information, i.e., usage statistics, error reports, audit trails, etc.
– system performance data, which includes indices used to improve data access and retrieval performance
– information about mapping from operational databases, which includes source RDBMSs and their contents, cleaning and transformation rules, etc.
– summarization algorithms, predefined queries and reports
– business data, which include business terms and definitions, ownership information, etc.
Similarly to an RDBMS, the internal structure of a DW is defined using a warehouse schema.
There are three major types of warehouse schemas:
• the star schema, where a so-called fact table, which is connected to a set of dimension tables, is in the middle
• the snowflake schema, which is a refinement of the star schema, in which some dimensional tables are normalized into a set of smaller tables, forming a shape similar to a snowflake
• the galaxy schema, in which there are multiple fact tables that share dimension tables (this collection of star schemas is also called a fact constellation).
Figure 6.7. Three-tier architecture of a data warehouse. [Figure labels: operational databases and external data sources; extraction, cleaning, transformation, and load; bottom tier: data warehouse server (data warehouse, data marts, metadata repository, monitoring, administration); middle tier: OLAP servers; top tier: front-end tools (querying/reporting, simple analysis, data mining).]
Each schema type is illustrated using the example of a computer hardware reseller company.
The company sells various computer hardware (CPUs, printers, monitors, etc.), has multiple locations (Edmonton, Denver, San Diego), and sells products from different vendors (CompuBus, CyberMax, MiniComp). The subject of this DW is “sales,” e.g., the number of units sold and the related costs and profits.
3.3.1. Star Schema
An example star schema for the hardware reseller company is shown in Figure 6.8.
The item, time, location, and producer tables are dimension tables, associated with the fact table sales through key attributes, such as item_code, time_code, etc. These tables are implemented as regular relational tables. The fact table stores time-ordered data that concern the predefined DW subject, while the dimension tables store supplementary data that allow the user to organize, dissect, and summarize the subject-related data.
A star schema-based DW contains a single fact table that includes nonredundant data. Each tuple in the fact table can be identified using a composite key that usually consists of the key attributes from the dimension tables. Each dimension also consists of a single table, i.e., items, time, location, and producer are each described using one table. These tables may be denormalized, i.e., they may contain redundant data.
Using the star schema, the data are retrieved by performing a join operation between the fact table and one or more dimension tables, followed by a projection operation and a selection operation. The join operation combines tuples from two or more tables that agree on a common attribute, the projection operation selects a set of particular columns, and the selection operation selects a set of particular tuples. The main benefits of the star schema include ease of understanding and a reduction in the number of joins needed to retrieve data. This translates into higher efficiency when compared with the other two schemas. On the other hand, the star schema does not provide support for concept (attribute) hierarchies, which are explained later in the chapter. Example data for the star schema of our example company are shown in Figure 6.9.
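The join, projection, and selection operations described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the column names other than item_code, the sample rows, and the product names are hypothetical, and the time and producer dimensions are omitted for brevity.

```python
import sqlite3

# In-memory database holding a tiny, hypothetical star schema:
# one fact table (sales) and two dimension tables (item, location).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item     (item_code INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE location (location_code INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales (
    item_code     INTEGER REFERENCES item(item_code),
    location_code INTEGER REFERENCES location(location_code),
    units_sold    INTEGER,
    PRIMARY KEY (item_code, location_code)  -- composite key built from dimension keys
);
INSERT INTO item VALUES (1, 'printer A', 'printer'), (2, 'monitor B', 'monitor');
INSERT INTO location VALUES (10, 'Edmonton', 'Canada'), (20, 'Denver', 'USA');
INSERT INTO sales VALUES (1, 10, 5), (1, 20, 7), (2, 10, 3), (2, 20, 4);
""")

rows = conn.execute("""
    SELECT i.name, l.city, s.units_sold                      -- projection: chosen columns
    FROM sales AS s
    JOIN item AS i ON s.item_code = i.item_code              -- join: fact to dimension
    JOIN location AS l ON s.location_code = l.location_code  -- join: fact to dimension
    WHERE l.city = 'Edmonton'                                -- selection: chosen tuples
    ORDER BY i.name
""").fetchall()

print(rows)  # [('monitor B', 'Edmonton', 3), ('printer A', 'Edmonton', 5)]
```

Each dimension is reached with exactly one join from the fact table, which is the property that keeps star schema queries short.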
3.3.2. Snowflake and Galaxy Schemas
The snowflake schema is a refinement of the star schema. Similarly to the star schema, it has only one fact table, but dimensional tables are normalized into a set of smaller tables, forming a shape similar to a snowflake. The normalization results in tables that do not contain redundant data.
The normalized dimensions make the dimension tables easier to maintain and also save storage space. However, the space savings are, in most cases, negligible in comparison with the size of the fact table. The snowflake schema is suitable for concept hierarchies, but it requires the execution of a much larger number of join operations to answer most queries and thereby has a strong negative impact on data retrieval performance.
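The extra joins can be seen in a small sketch, again using sqlite3. It assumes a hypothetical normalization of the location dimension into a location table and a city table; the table layout, column names, and sample rows are illustrative and not taken from the figures.

```python
import sqlite3

# Hypothetical snowflake fragment: the location dimension is normalized,
# so city and country are stored once per city instead of once per location.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (city_code INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE location (
    location_code INTEGER PRIMARY KEY,
    street        TEXT,
    city_code     INTEGER REFERENCES city(city_code)
);
CREATE TABLE sales (
    location_code INTEGER REFERENCES location(location_code),
    units_sold    INTEGER
);
INSERT INTO city VALUES (1, 'Edmonton', 'Canada'), (2, 'Denver', 'USA');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '2 Oak Ave', 1), (20, '3 Pine Rd', 2);
INSERT INTO sales VALUES (10, 5), (11, 3), (20, 7);
""")

# Summarizing sales per city now needs two joins (sales -> location -> city);
# with a denormalized star-schema location table, one join would suffice.
per_city = conn.execute("""
    SELECT c.city, SUM(s.units_sold)
    FROM sales AS s
    JOIN location AS l ON s.location_code = l.location_code
    JOIN city AS c ON l.city_code = c.city_code
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()

print(per_city)  # [('Denver', 7), ('Edmonton', 8)]
```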
Figure 6.9. Example data for the computer hardware reseller company in the star schema.
Finally, the galaxy schema is a collection of several snowflake schemas, in which there are multiple fact tables that may share some of their dimension tables. Example snowflake and galaxy schemas for our example are shown in Figure 6.10.
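As a rough illustration of fact tables sharing a dimension, the sketch below uses sqlite3 with two fact tables that both reference one item dimension. The second fact table (shipping), its columns, and all sample rows are hypothetical and are not part of the reseller example in the figures.

```python
import sqlite3

# Hypothetical galaxy fragment: two fact tables share the item dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item     (item_code INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales    (item_code INTEGER REFERENCES item(item_code), units_sold INTEGER);
CREATE TABLE shipping (item_code INTEGER REFERENCES item(item_code), units_shipped INTEGER);
INSERT INTO item VALUES (1, 'printer A'), (2, 'monitor B');
INSERT INTO sales VALUES (1, 5), (1, 7), (2, 3);
INSERT INTO shipping VALUES (1, 10), (2, 4), (2, 1);
""")

# Aggregate each fact table on its own, then combine the results through
# the shared dimension; joining the raw fact tables directly would
# multiply rows and inflate the sums.
rows = conn.execute("""
    SELECT i.name, so.sold, sh.shipped
    FROM item AS i
    JOIN (SELECT item_code, SUM(units_sold) AS sold
          FROM sales GROUP BY item_code) AS so ON so.item_code = i.item_code
    JOIN (SELECT item_code, SUM(units_shipped) AS shipped
          FROM shipping GROUP BY item_code) AS sh ON sh.item_code = i.item_code
    ORDER BY i.name
""").fetchall()

print(rows)  # [('monitor B', 3, 5), ('printer A', 12, 10)]
```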
3.3.3. Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of very specific, low-level concepts to more general, higher-level concepts. In a data warehouse, it is usually used to express different levels of granularity of an attribute from one of the dimension tables. To illustrate, we use the concept of location, in which each street address is mapped into a corresponding city, which is mapped into the state or province, which is finally mapped into the corresponding country. The location concept hierarchy is shown in Figure 6.11.
Concept hierarchies are crucial for the formulation of useful OLAP queries. The hierarchy allows the user to summarize the data at various levels. For instance, using the location hierarchy, the user can retrieve data that summarize sales for each individual location, for all locations in a given city, a given state, or even a given country without the necessity of reorganizing the data.
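Summarizing along the location hierarchy can be sketched in plain Python. The street addresses, unit counts, and the helper name roll_up are hypothetical; only the hierarchy levels (street address, city, state or province, country) and the three cities come from the text.

```python
from collections import defaultdict

# One mapping per step of the hierarchy: address -> city -> state/province -> country.
CITY_OF = {"1 Main St": "Edmonton", "2 Oak Ave": "Edmonton",
           "3 Pine Rd": "Denver", "4 Bay Dr": "San Diego"}
STATE_OF = {"Edmonton": "Alberta", "Denver": "Colorado", "San Diego": "California"}
COUNTRY_OF = {"Alberta": "Canada", "Colorado": "USA", "California": "USA"}

LEVELS = ["address", "city", "state", "country"]
MAPPINGS = [CITY_OF, STATE_OF, COUNTRY_OF]

# Hypothetical fact data: (street address, units sold).
SALES = [("1 Main St", 5), ("2 Oak Ave", 3), ("3 Pine Rd", 7), ("4 Bay Dr", 2)]


def roll_up(level):
    """Total units sold at the requested level of the location hierarchy."""
    steps = LEVELS.index(level)  # number of mappings to climb
    totals = defaultdict(int)
    for address, units in SALES:
        key = address
        for mapping in MAPPINGS[:steps]:
            key = mapping[key]  # move one level up the hierarchy
        totals[key] += units
    return dict(totals)


print(roll_up("city"))     # {'Edmonton': 8, 'Denver': 7, 'San Diego': 2}
print(roll_up("country"))  # {'Canada': 8, 'USA': 9}
```

The stored sales records are never reorganized; only the grouping key changes with the requested level.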