

5.2 The functional components of a warehouse environment

Before we can begin to illustrate for you the way this revolutionary new approach works, we have to establish some basic understanding and some basic vocabulary for just what exactly it is we are trying to build.

A data warehousing environment is made up of three component pieces:

the acquisition area (identifying data in legacy systems and making it available for the warehouse to use), the storage area (databases and/or files that hold the data in a form that makes it easy for end users to get at), and the access area (the network, personal computers, and data access tools that end users leverage to work with the available information), as shown in Figure 5.2.

The warehouse consists of these three functional parts, but is also held together and supported by two critical infrastructure components: the physical infrastructure (the hardware, network, and software that hold the system together) and the operational infrastructure (the people, roles, responsibilities, and procedures).

The typical information systems architect will want to immediately start basing his or her decisions about that infrastructure on the physical parameters.

But that is actually impossible to do. In reality, you are never really going to be able to stabilize the physical infrastructure at all. What we submit to you here is that the real way to make an effective warehouse is to concentrate on developing the operational infrastructure as the first priority, and simply “cobble together” the physical infrastructure as best you can at any given point in time.

Figure 5.2 The warehouse and its component parts.

The following sections will detail some of the critical information and characteristics about each of these components and will help illustrate exactly how we come to this uncomfortable, but necessary, conclusion.

5.2.1 Acquisition

Acquisition is the largest, most difficult and complicated, least glamorous, and most critical of all of the warehouse components. We include within our definition of acquisition all of the legacy systems of interest and all of the ways we can find the critical information they hold and make it available for use within the environment.

As much as we would like to believe that this is a trivial area, that is far from the truth. First of all, we must deal with the fact that most companies are forced to live with an incredibly diverse and incongruous set of legacy systems. Mainframes, minis, and PCs of every size, brand, and description are usually cobbled together into some kind of unmanageable mess that we know as the legacy systems environment. In order to make the warehouse useful, we will have to figure out how to get the information out of each of these and place that extracted data into a database that is usable.

Any attempt to stabilize or create the one ultimate warehouse architecture will immediately be thwarted by these facts about the legacy systems environment. For one thing, the legacy systems environment is always changing.

Therefore, the warehouse can never be more stable than the legacy systems that feed it.

As we stated earlier, acquisition represents 60% to 80% of the cost of warehousing, and 80% of that cost is the investment in people and their knowledge about the legacy systems themselves. Unfortunately, while many companies are trying to sell “automated” data extraction tools that are supposed to make this job easier, so far those tools have failed to deliver as promised.

The reality of data acquisition, at least for the present, is that it is a process that is mostly about the writing of programs that extract, format, cleanse, merge, purge, and otherwise prepare data for loading into the warehouse. This usually involves the creation of a long series of programs, each of which does a different part of the job, as shown in Figure 5.3.

The process of acquisition is actually made more complex by the fact that these operations are hardly ever performed on the same platform. The problem of building an acquisition job stream, therefore, includes figuring out which parts of the process should run on which machine (see Figure 5.4).

Figure 5.3 Acquisition (data preparation) job streams.

Figure 5.4 Acquisition on multiple platforms (extract, format, validate, merge, and purge steps spread across AS/400, UNIX, mainframe, and neutral platforms).
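To make the shape of such a job stream concrete, the following is a minimal Python sketch of the extract/format/merge/purge/load sequence that Figures 5.3 and 5.4 depict. It is purely illustrative: the file names, field layouts, and validation rules are assumptions standing in for whatever your own legacy extracts require, and in practice each step would often be a separate program running on a different platform.

```python
import csv
from pathlib import Path

# Hypothetical inputs: flat-file extracts pulled down from two legacy systems.
SALES_EXTRACT = Path("sales_as400.csv")        # e.g., extracted from an AS/400 sales system
ACCTNG_EXTRACT = Path("acctng_mainframe.csv")  # e.g., extracted from a mainframe accounting system

def extract(path):
    """Read one legacy extract file into a list of dictionaries."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))

def format_records(records):
    """Standardize field names and types so both sources look alike (assumed columns)."""
    return [
        {"customer_id": r["CUST_ID"].strip(), "amount": float(r["AMT"])}
        for r in records
    ]

def merge(*sources):
    """Combine the formatted record sets into a single stream."""
    return [row for source in sources for row in source]

def purge(records):
    """Drop records that fail basic validation (a stand-in for real cleansing rules)."""
    return [r for r in records if r["customer_id"] and r["amount"] >= 0]

def load(records, target=Path("warehouse_load.csv")):
    """Write the prepared data to the file the warehouse load utility will pick up."""
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["customer_id", "amount"])
        writer.writeheader()
        writer.writerows(records)

if __name__ == "__main__":
    sales = format_records(extract(SALES_EXTRACT))
    acctng = format_records(extract(ACCTNG_EXTRACT))
    load(purge(merge(sales, acctng)))
```

The important point is the chain itself: each program consumes the output of the one before it, which is why the whole stream can never be more stable than the legacy systems feeding its first step.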

5.2.2 Storage

The storage area of the warehouse is probably the easiest to understand. To meet the storage needs of the warehouse environment, we really need nothing more than to provide a database or a file storage area, or even just a tape drive that can hold the formatted data, and make it available to end users.

Several things of special interest about the storage area include the following:

1. It doesn’t have to be a database—Tapes, flat files, and databases of all sizes and shapes can all effectively meet the warehouse’s storage needs.

What’s important about this area is that it store data in a form that is usable. As long as users can get at it, it’s good enough.

2. It doesn’t have to be limited to one database—Today’s client-server, middleware, and distributed database technologies make it possible for us to envision and create a megastorage area, made up of several databases, running on various kinds of hardware and using different kinds of database software. The assumption that to be an effective warehouse we must have one database can create a lot of unnecessary complexity. (A small sketch of a query that spans two separate stores follows this list.)

3. Data basements versus data warehouses—One of the biggest mistakes a developer of a warehouse can make is to assume that the databases being built are going to be permanent and relatively stagnant. In the worst cases, companies have built warehouses to hold many years’ worth of historical information with no clear idea of who was going to use it. The assumption is that someday someone will want it. Then the information sits there and is never used. We call these kinds of systems, those that store data for the sake of storing it, “data basements.” Data warehouses, on the other hand, need to be like real warehouses. You don’t use a real-world warehouse to hold items that nobody is going to use. That’s what museums are for. No, the only way to measure the effectiveness of your warehouse is to figure out how often and by how many people it will be used. A data warehouse should be measured, just like a real warehouse, in terms of how many “turns” the data makes within it. And just like the real warehouse, its manager should be ready to eliminate any inventory item that isn’t getting enough use. Adopting this attitude means that the warehouse manager will need to be constantly vigilant, always ready to drop tables that aren’t being used enough and to add tables that can deliver new value; a rough sketch of that kind of usage accounting also follows this list.
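To make item 2 concrete, here is a minimal Python sketch of a read that spans two separate databases and stitches the answer together on the client side. The database files, table names, and columns are assumptions for illustration only; in a real environment the two stores might be entirely different products on different machines, reached through middleware rather than as local files.

```python
import sqlite3

# Hypothetical pair of warehouse stores: sales history in one database,
# customer reference data in another.
SALES_DB = "sales_history.db"
CUSTOMER_DB = "customer_ref.db"

def combined_view(customer_id):
    """Answer one user question by pulling from both stores and joining in the client."""
    with sqlite3.connect(SALES_DB) as sales, sqlite3.connect(CUSTOMER_DB) as customers:
        name_row = customers.execute(
            "SELECT name FROM customer WHERE customer_id = ?", (customer_id,)
        ).fetchone()
        total_row = sales.execute(
            "SELECT SUM(amount) FROM sales WHERE customer_id = ?", (customer_id,)
        ).fetchone()
    return {
        "customer_id": customer_id,
        "name": name_row[0] if name_row else None,
        "total_sales": total_row[0] if total_row else 0,
    }

if __name__ == "__main__":
    print(combined_view("C-1001"))
```

The design point is that the user asks one question and never needs to know that the answer came from two places.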
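And as a rough illustration of the “turns” measurement in item 3, the sketch below reads a hypothetical query-log extract and flags tables that fall below a usage threshold. The log format, column names, and the threshold itself are assumptions; the point is only that the warehouse manager needs a regular, quantitative read on which tables are earning their keep.

```python
from collections import Counter
import csv

# Hypothetical export of the warehouse query log: one row per query,
# with the table it touched and the user who ran it.
LOG_FILE = "query_log.csv"      # columns assumed: table_name, user_id, run_date
MIN_QUERIES_PER_MONTH = 25      # assumed threshold; tune to your own environment

def table_usage(log_file):
    """Count queries and distinct users per table from the log extract."""
    queries = Counter()
    users = {}
    with open(log_file, newline="") as f:
        for row in csv.DictReader(f):
            table = row["table_name"]
            queries[table] += 1
            users.setdefault(table, set()).add(row["user_id"])
    return queries, users

def candidates_for_removal(queries, users, threshold=MIN_QUERIES_PER_MONTH):
    """Tables with too few 'turns' are candidates to drop from the warehouse."""
    return [
        (table, count, len(users[table]))
        for table, count in queries.items()
        if count < threshold
    ]

if __name__ == "__main__":
    q, u = table_usage(LOG_FILE)
    for table, count, user_count in candidates_for_removal(q, u):
        print(f"{table}: {count} queries by {user_count} users -- review for removal")
```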

5.2.3 Access

In many ways, the development of the access area of the warehouse is the easiest part. Data mining tools relieve I/T of most of the process of application programming and design. For the most part, these tools simply “hook up” to the warehouse and immediately let the user start being productive.

The technological piece of the process, the hooking up of the PC to the warehouse, is almost always managed using normal client-server technology and middleware. So that, at least, is relatively straightforward.
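As a minimal sketch of what that hookup amounts to from the desktop side, the following Python fragment connects to a warehouse database through an ODBC data source and runs one query. The data source name, credentials, and the sales_summary table are assumptions for illustration; any standard client-server driver or middleware layer plays the same role.

```python
import pyodbc  # assumes an ODBC driver and a configured data source for the warehouse

# Hypothetical data source name and credentials set up by the middleware/DBA team.
CONNECTION_STRING = "DSN=warehouse;UID=analyst;PWD=secret"

def top_customers(limit=10):
    """Run a simple query against an assumed sales summary table in the warehouse."""
    with pyodbc.connect(CONNECTION_STRING) as conn:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT customer_id, SUM(amount) AS total "
            "FROM sales_summary GROUP BY customer_id "
            "ORDER BY total DESC"
        )
        return cursor.fetchmany(limit)

if __name__ == "__main__":
    for customer_id, total in top_customers():
        print(customer_id, total)
```

Commercial access tools manage this same kind of connection for the user behind the scenes.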

What can be confusing, however, is the dizzying array of different tools and capabilities that are available to do the access job. While almost all products attempt to integrate different features and functions into their offerings, we have identified seven major categories of tool type/functionality that all of the products fall under. When evaluating particular tools, first determining which tool types and which functionalities you are looking for can make the process a lot easier.

5.2.3.1 The seven access tool types

These function/tool types include the following:

1. Query and reporting—Traditional query managers and report writers;

2. Agents—Software that schedules, runs, evaluates, or searches for things for the user;

3. OLAP (online analytical processing)—Multidimensional analysis tools;

4. Statistical analysis—Traditional SPSS/SAS and other stats packages;

5. Data discovery—Neural networks, CART, CHAID, and other advanced artificial intelligence and knowledge-generating software;

6. Visualization—Systems that graphically or geographically display complex data relationships;

7. Web tools—Software that performs search, query, and agency work in the WWW environment.

5.2.3.2 The challenges of getting requirements

Gaining a good understanding of what kinds of tools are available is, of course, only half of the problem. The other half comes when trying to help users figure out what to do with the tools—to do that we need to go through an education/indoctrination process where we demonstrate the tools and help the users decide how they can best leverage them to solve the business problems they are facing.

5.2.4 The operational infrastructure

Far more important than any of the technical details required to make the warehouse work is the operational infrastructure that we provide for it. Hardware, software, legacy systems, and business needs are guaranteed to change, and to change often and violently. What will not change will be the need to have people who understand the legacy systems and how to get information out of them, who understand the databases (what’s in them and how they work), and who understand the business users, the nature of their problems, and data solutions.

The operational infrastructure—that collection of people, skills, experience, policies, and procedures—is the way you tie the warehouse together and make it usable.

5.2.5 The physical infrastructure

It is in the area of the physical infrastructure that most technicians want to spend most of their time planning and executing. While these issues are important, they must take a back seat to the dynamic needs of the business and the technology. By definition, the acquisition area will be in a constant state of flux.

The cost of storing data is dropping drastically, and the size of the optimum data storage area keeps shifting. The power and ease of use of access tools guarantee that this environment will also be undergoing change.

The sad truth is that the optimum technical solution today is guaranteed to be less than optimum tomorrow. Therefore, the way we develop an optimum physical infrastructure is to remake that decision each time we add functionality, dynamically changing the environment to meet the needs of the business today and in the future.
