
Architecture of Data Warehouses

From the document Data Mining (pages 115-119)

Databases, Data Warehouses, and OLAP


3.3. Architecture of Data Warehouses

DWs have a three-tier architecture that includes

• a bottom tier that consists of the DW server, which may include several specialized data marts and a metadata repository

• a middle tier that consists of an OLAP server (described later) for fast querying of the data warehouse

• a top tier that includes front-end tools for displaying results provided by OLAP, as well as additional tools for data mining of the OLAP-generated data

Figure 6.5. A virtual data warehouse (left) and an enterprise data warehouse (right).
[Diagram labels: operational databases; external data sources; middleware server; query results; extraction, cleaning, transformation, and load; enterprise warehouse; decision support environment.]

Figure 6.6. Data warehouse consisting of an enterprise warehouse and data marts.
[Diagram labels: operational databases; external data sources; extraction, cleaning, transformation, and load; enterprise warehouse; selection and aggregation; data marts; decision support environment.]

The overall DW architecture is shown in Figure 6.7.

The metadata repository stores information that defines DW objects. It includes the following parameters and information for the middle-tier and top-tier applications:

– a description of the DW structure, including the warehouse schema, dimensions, hierarchies, data mart locations and contents, etc.

– operational metadata, which usually describe the currency level of the stored data (active, archived, or purged) and warehouse monitoring information (usage statistics, error reports, audit trails, etc.)

– system performance data, which includes indices used to improve data access and retrieval performance

– information about mapping from operational databases, which includes source RDBMSs and their contents, cleaning and transformation rules, etc.

– summarization algorithms, predefined queries and reports

– business data, which include business terms and definitions, ownership information, etc.

As with an RDBMS, the internal structure of a DW is defined using a warehouse schema.

There are three major types of warehouse schemas:

• the star schema, in which a so-called fact table sits in the middle and is connected to a set of dimension tables

• the snowflake schema, which is a refinement of the star schema in which some dimension tables are normalized into a set of smaller tables, forming a shape similar to a snowflake

• the galaxy schema, in which there are multiple fact tables that share dimension tables (this collection of star schemas is also called a fact constellation)

Figure 6.7. Three-tier architecture of a data warehouse.
[Diagram labels: operational databases and external data sources; extraction, cleaning, transformation, and load; BOTTOM TIER: data warehouse server (data warehouse, data marts, metadata repository, monitoring, administration); MIDDLE TIER: OLAP server; TOP TIER: front-end tools (querying/reporting, simple analysis, data mining).]

Each schema type is illustrated using the example of a computer hardware reseller company.

The company sells various computer hardware (CPUs, printers, monitors, etc.), has multiple locations (Edmonton, Denver, San Diego), and sells products from different vendors (CompuBus, CyberMax, MiniComp). The subject of this DW is sales, e.g., the number of units sold and the related costs and profits.

3.3.1. Star Schema

An example star schema for the hardware reseller company is shown in Figure 6.8.

The item, time, location, and producer tables are dimension tables, associated with the fact table sales through key attributes such as item_code, time_code, etc. These tables are implemented as regular relational tables. The fact table stores time-ordered data that concern the predefined DW subject, while the dimension tables store supplementary data that allow the user to organize, dissect, and summarize the subject-related data.

The star schema-based DW contains a single fact table that holds nonredundant data. Each tuple in the fact table can be identified using a composite key that usually consists of key attributes from the dimension tables. Each dimension also consists of a single table, i.e., items, time, location, and producer are each described using a single table. These tables may be denormalized, i.e., they may contain redundant data. Using the star schema, the data are retrieved by performing a join operation between the fact table and one or more dimension tables, followed by a projection operation and a selection operation. The join operation combines matching data from two or more tables, the projection operation selects a set of particular columns, and the selection operation selects a set of particular tuples. The main benefits of the star schema are ease of understanding and a reduction in the number of joins needed to retrieve data, which translates into higher efficiency when compared with the other two schemas. On the other hand, the star schema does not provide support for concept (attribute) hierarchies, which are explained later in the chapter. Example data for our star schema are shown in Figure 6.9.
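The retrieval steps described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the book's implementation: only two of the four dimensions are included, and the column names other than item_code (for example location_code and units_sold) as well as all row values are assumptions made up for the example.

```python
import sqlite3

# In-memory database; all rows below are made-up illustration data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item     (item_code INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE location (location_code INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales (
    item_code     INTEGER REFERENCES item(item_code),
    location_code INTEGER REFERENCES location(location_code),
    units_sold    INTEGER,
    PRIMARY KEY (item_code, location_code)  -- composite key built from dimension keys
);
INSERT INTO item VALUES (1, 'CPU', 'component'), (2, 'printer', 'peripheral');
INSERT INTO location VALUES (10, 'Edmonton', 'Canada'), (20, 'Denver', 'USA');
INSERT INTO sales VALUES (1, 10, 5), (1, 20, 7), (2, 10, 3), (2, 20, 4);
""")

star_rows = conn.execute("""
    SELECT i.name, l.city, s.units_sold                      -- projection: three columns
    FROM sales AS s
    JOIN item     AS i ON i.item_code = s.item_code          -- join fact to dimensions
    JOIN location AS l ON l.location_code = s.location_code
    WHERE l.city = 'Denver'                                  -- selection: matching tuples
    ORDER BY i.name
""").fetchall()

print(star_rows)
```

Each dimension costs exactly one join here because every dimension is a single, possibly denormalized, table; the location table repeats the country name for every city in that country.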

3.3.2. Snowflake and Galaxy Schemas

The snowflake schema is a refinement of the star schema. Like the star schema, it has only one fact table, but its dimension tables are normalized into a set of smaller tables, forming a shape similar to a snowflake. The normalization results in tables that do not contain redundant data.

The normalized dimensions make the dimension tables easier to maintain and also save storage space. However, the space savings are, in most cases, negligible in comparison with the size of the fact table. The snowflake schema is suitable for concept hierarchies, but it requires a much larger number of join operations to answer most queries and therefore has a strong negative impact on data retrieval performance.
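The extra joins can be seen in a small sketch, again using sqlite3. It assumes, purely for illustration, that the location dimension is normalized into hypothetical location, city, and country tables; the table layout, column names, and rows are not taken from the book's figures.

```python
import sqlite3

# In-memory database; schema and rows are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country  (country_code INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE city     (city_code INTEGER PRIMARY KEY, city_name TEXT,
                       country_code INTEGER REFERENCES country(country_code));
CREATE TABLE location (location_code INTEGER PRIMARY KEY, street TEXT,
                       city_code INTEGER REFERENCES city(city_code));
CREATE TABLE sales    (location_code INTEGER REFERENCES location(location_code),
                       units_sold INTEGER);
INSERT INTO country  VALUES (1, 'Canada'), (2, 'USA');
INSERT INTO city     VALUES (1, 'Edmonton', 1), (2, 'Denver', 2), (3, 'San Diego', 2);
INSERT INTO location VALUES (10, '1 Main St', 1), (20, '2 Oak Ave', 2), (30, '3 Bay Rd', 3);
INSERT INTO sales    VALUES (10, 5), (20, 7), (30, 4);
""")

# Reaching the country name now takes three joins instead of one.
snowflake_rows = conn.execute("""
    SELECT co.country_name, SUM(s.units_sold)
    FROM sales AS s
    JOIN location AS l  ON l.location_code = s.location_code
    JOIN city     AS ci ON ci.city_code = l.city_code
    JOIN country  AS co ON co.country_code = ci.country_code
    GROUP BY co.country_name
    ORDER BY co.country_name
""").fetchall()

print(snowflake_rows)
```

Each country name is stored once, which removes the redundancy of a denormalized location table, at the price of a longer join chain for any query that groups or filters by country.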

Finally, the galaxy schema is a collection of several star or snowflake schemas, in which there are multiple fact tables that may share some of their dimension tables. Example snowflake and galaxy schemas for our example are shown in Figure 6.10.

Figure 6.9. Example data for the computer hardware reseller company in the star schema.
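A galaxy schema can be sketched in the same way. The second fact table below, shipments, is hypothetical and is not part of the book's example; it is there only to show two fact tables sharing the item dimension. All column names and rows are likewise assumptions.

```python
import sqlite3

# In-memory database; the shipments fact table is a made-up second subject.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item      (item_code INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales     (item_code INTEGER REFERENCES item(item_code), units_sold INTEGER);
CREATE TABLE shipments (item_code INTEGER REFERENCES item(item_code), units_shipped INTEGER);
INSERT INTO item      VALUES (1, 'CPU'), (2, 'monitor');
INSERT INTO sales     VALUES (1, 5), (1, 7), (2, 3);
INSERT INTO shipments VALUES (1, 20), (2, 10);
""")

# Aggregate each fact table on its own, then join both results
# to the shared dimension table.
galaxy_rows = conn.execute("""
    SELECT i.name, s.total_sold, sh.total_shipped
    FROM item AS i
    JOIN (SELECT item_code, SUM(units_sold) AS total_sold
          FROM sales GROUP BY item_code) AS s ON s.item_code = i.item_code
    JOIN (SELECT item_code, SUM(units_shipped) AS total_shipped
          FROM shipments GROUP BY item_code) AS sh ON sh.item_code = i.item_code
    ORDER BY i.name
""").fetchall()

print(galaxy_rows)
```

The two fact tables are summarized separately before the join because joining them directly on item_code would pair every sales row with every shipment row for the same item and inflate the sums.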

3.3.3. Concept Hierarchy

A concept hierarchy defines a sequence of mappings from a set of very specific, low-level concepts to more general, higher-level concepts. In a data warehouse, it is usually used to express different levels of granularity of an attribute from one of the dimension tables. To illustrate, we use the concept of location, in which each street address is mapped into a corresponding city, which is mapped into the state or province, which is finally mapped into the corresponding country. The location concept hierarchy is shown in Figure 6.11.

Concept hierarchies are crucial for the formulation of useful OLAP queries. The hierarchy allows the user to summarize the data at various levels. For instance, using the location hierarchy, the user can retrieve data that summarize sales for each individual location, for all locations in a given city, a given state, or even a given country without the necessity of reorganizing the data.
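As a rough sketch of how such a hierarchy supports summarization, the location hierarchy can be written as a chain of mappings and the same sales facts rolled up to any level. The street addresses, unit counts, and the helper name roll_up are invented for this illustration; only the address, city, state or province, country ordering comes from the text.

```python
from collections import defaultdict

# Hierarchy levels from most specific to most general.
LEVELS = ["address", "city", "state", "country"]

# Each mapping lifts a value one level up (hypothetical example data).
CITY_OF = {"1 Main St": "Edmonton", "2 Oak Ave": "Denver",
           "3 Bay Rd": "San Diego", "4 Elm St": "San Diego"}
STATE_OF = {"Edmonton": "Alberta", "Denver": "Colorado", "San Diego": "California"}
COUNTRY_OF = {"Alberta": "Canada", "Colorado": "USA", "California": "USA"}
MAPPINGS = [CITY_OF, STATE_OF, COUNTRY_OF]

# Made-up sales facts stored once, at the lowest level of granularity.
SALES = [("1 Main St", 5), ("2 Oak Ave", 7), ("3 Bay Rd", 4), ("4 Elm St", 6)]


def roll_up(sales, level):
    """Total units sold per value of the requested hierarchy level."""
    steps = LEVELS.index(level)  # number of mappings to apply
    totals = defaultdict(int)
    for address, units in sales:
        value = address
        for mapping in MAPPINGS[:steps]:
            value = mapping[value]
        totals[value] += units
    return dict(totals)


for level in LEVELS:
    print(level, roll_up(SALES, level))
```

The facts are never reorganized; only the number of mappings applied changes with the requested level.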
