
FIGURE 2.15: OLAP operations. (a) Single cell (b) Multiple cells (c) Slice (d) Dice.

complex, but also less efficient. MDD systems may presummarize along all dimensions.

A third approach, hybrid OLAP (HOLAP), combines the best features of ROLAP and MOLAP. Queries are stated in multidimensional terms. Data that are not updated frequently will be stored as MDD, whereas data that are updated frequently will be stored as RDB.

As seen in Figure 2.15, there are several types of OLAP operations supported by OLAP tools:

A simple query may look at a single cell within the cube [Figure 2.15(a)].

Slice: Look at a subcube to get more specific information. This is performed by selecting on one dimension. As seen in Figure 2.15(c), this is looking at a portion of the cube.

Dice: Look at a subcube by selecting on two or more dimensions. This can be performed by a slice on one dimension and then rotating the cube to select on a second dimension. In Figure 2.15(d), a dice is made because the view in (c) is rotated from all cells for one product to all cells for one location.

Roll up (dimension reduction, aggregation): Roll up allows the user to ask questions that move up an aggregation hierarchy. Figure 2.15(b) represents a roll up from (a). Instead of looking at one single fact, we look at all the facts. Thus, we could, for example, look at the overall total sales for the company.

Drill down: Figure 2.15(a) represents a drill down from (b). These functions allow a user to get more detailed fact information by navigating lower in the aggregation hierarchy. We could perhaps look at quantities sold within a specific area of each of the cities.

Visualization: Visualization allows the OLAP users to actually "see" results of an operation.

To assist with roll up and drill down operations, frequently used aggregations can be precomputed and stored in the warehouse. There have been several different definitions for a dice. In fact, the term slice and dice is sometimes used as a whole to indicate that the cube is subdivided by selecting on multiple dimensions.
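These operations can be illustrated with a small sketch. The cube below is a plain Python dictionary over hypothetical products, locations, and quarters (all figures invented); slice, dice, and roll up then become simple selections and aggregations over the cell keys.

```python
# Toy OLAP cube: (product, location, quarter) -> sales (invented values).
cube = {
    ("hats", "Dallas", "Q1"): 10, ("hats", "Dallas", "Q2"): 15,
    ("hats", "Austin", "Q1"): 7,  ("hats", "Austin", "Q2"): 12,
    ("shoes", "Dallas", "Q1"): 20, ("shoes", "Dallas", "Q2"): 25,
    ("shoes", "Austin", "Q1"): 9,  ("shoes", "Austin", "Q2"): 11,
}

def slice_cube(cube, dim, value):
    """Slice: select on a single dimension."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice_cube(cube, selections):
    """Dice: select on two or more dimensions, given {dim_index: value}."""
    return {k: v for k, v in cube.items()
            if all(k[d] == val for d, val in selections.items())}

def roll_up(cube, keep_dim):
    """Roll up: aggregate away every dimension except keep_dim."""
    totals = {}
    for k, v in cube.items():
        totals[k[keep_dim]] = totals.get(k[keep_dim], 0) + v
    return totals

hats = slice_cube(cube, 0, "hats")                       # slice on product
hats_dallas = dice_cube(cube, {0: "hats", 1: "Dallas"})  # dice on two dims
by_location = roll_up(cube, 1)                           # totals per location
```

A drill down is simply the inverse navigation: starting from `by_location`, one would return to the finer-grained cells of `cube`.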

2.8 WEB SEARCH ENGINES


As a result of the large amount of data on the Web and the fact that it is continually growing, obtaining desired information can be challenging. Web search engines are used to access the data and can be viewed as query systems much like IR systems. As with IR queries, search engine queries can be stated as keyword, boolean, weighted, and so on. The difference lies primarily in the data being searched (pages with heterogeneous content and extensive hyperlinks) and in the architecture involved.
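As a rough sketch of keyword-style querying, the toy inverted index below maps each word to the documents containing it. The document texts are invented; real engines add crawling, ranking, and vastly larger indices.

```python
# Toy inverted index for boolean keyword queries (document texts invented).
docs = {
    1: "data mining on the web",
    2: "web search engines index web pages",
    3: "information retrieval systems",
}

index = {}
for doc_id, text in docs.items():
    for word in set(text.split()):
        index.setdefault(word, set()).add(doc_id)

def keyword_query(*terms):
    """Boolean AND query: documents containing every given term."""
    result = None
    for t in terms:
        hits = index.get(t, set())
        result = hits if result is None else result & hits
    return result if result else set()

print(keyword_query("web"))            # documents mentioning "web"
print(keyword_query("web", "search"))  # documents mentioning both terms
```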

Conventional search engines suffer from several problems [RS99]:

Abundance: Most of the data on the Web are of no interest to most people. In other words, although there is a lot of data on the Web, an individual query will retrieve only a very small subset of it.

Limited coverage: Search engines often provide results from a subset of the Web pages. Because of the extreme size of the Web, it is impossible to search the entire Web any time a query is requested. Instead, most search engines create indices that are updated periodically. When a query is requested, often only the index is directly accessed.

Limited query: Most search engines provide access based only on simple keyword-based searching. More advanced search engines may retrieve or order pages based on other properties such as the popularity of pages.

Limited customization: Query results are often determined only by the query itself. However, as with traditional IR systems, the desired results are often dependent on the background and knowledge of the user as well. Some more advanced search engines add the ability to do customization using user profiles or historical information.

Traditional IR systems (such as Lexis-Nexis) may actually be tailored to a specific domain. The Web, however, has information for everyone.

As discussed in Chapter 7, true Web mining consists of content, structure, and usage mining. Web search engines are very simplistic examples of Web content mining.

2.9 STATISTICS

Such simple statistical concepts as determining a data distribution and calculating a mean and a variance can be viewed as data mining techniques. Each of these is, in its own right, a descriptive model for the data under consideration.
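For instance, a mean and a population variance computed over a handful of invented values already constitute a (very simple) descriptive model of those data.

```python
# Mean and population variance as minimal descriptive models.
# The data values are invented for illustration.
data = [4.0, 8.0, 6.0, 5.0, 7.0]

n = len(data)
mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / n  # population variance

print(mean, variance)  # 6.0 2.0
```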

Part of the data mining modeling process requires searching the actual data. An equally important part requires inferencing from the results of the search to a general model. A current database state may be thought of as a sample (albeit large) of the real data that may not be stored electronically. When a model is generated, the goal is to fit it to the entire data, not just that which was searched. Any model derived should be statistically significant, meaningful, and valid. This problem may be compounded by the preprocessing step of the KDD process, which may actually remove some of the data. Outliers compound this problem. The fact that most database practitioners and users are not probability experts also complicates the issues. The tools needed to make the computationally difficult problems tractable may actually invalidate the results. Assumptions often made about independence of data may be incorrect, thus leading to errors in the resulting model.

An often-used tool in data mining and machine learning is that of sampling. Here a subset of the total population is examined, and a generalization (model) about the entire population is made from this subset. In statistics this approach is referred to as statistical inference. Of course, this generalization may lead to errors in the final model caused by the sampling process.
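A minimal sketch of this idea, using a synthetic population: only a 100-element sample is examined, and its mean serves as the inferred model of the whole population. The seed and sizes are arbitrary, and the gap between estimate and true mean is exactly the sampling error the text warns about.

```python
# Statistical inference by sampling: estimate a population mean from a
# random subset. Population, sample size, and seed are illustrative.
import random

random.seed(42)
population = list(range(1, 10001))       # "true" data; mean is 5000.5
sample = random.sample(population, 100)  # examine only a small subset

estimate = sum(sample) / len(sample)     # inferred model of the population
true_mean = sum(population) / len(population)
error = abs(estimate - true_mean)        # error introduced by sampling
```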

The term exploratory data analysis was actually coined by statisticians to describe the fact that the data can actually drive the creation of the model and any statistical characteristics. This seems contradictory to the traditional statistical view that one should not be corrupted or influenced by looking at the data before creating the model. This would unnecessarily bias any resulting hypothesis. Of course, database practitioners would never think of creating a model of the data without looking at the data in detail first and then creating a schema to describe it.

Some data mining applications determine correlations among data. These relationships, however, are not causal in nature. A discovered association rule may show that 60 percent of the time when customers purchased hot dogs they also bought beer. Care must be taken when assigning any significance to this relationship. We do not know why these items were purchased together. Perhaps the beer was on sale or it was an extremely hot day. There may be no relationship between these two data items except that they were often purchased together. There need not be any probabilistic inference that can be deduced.
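The hot dog example can be made concrete. Over the invented transactions below, the confidence of the rule hot dogs → beer works out to exactly 60 percent, yet nothing in the computation says anything about why the items co-occur.

```python
# Confidence of the association rule "hot dogs -> beer" over invented
# market-basket transactions. High confidence does not imply causation.
transactions = [
    {"hot dogs", "beer"},
    {"hot dogs", "beer", "buns"},
    {"hot dogs", "mustard"},
    {"hot dogs", "beer"},
    {"hot dogs", "soda"},
    {"beer", "chips"},
]

with_hot_dogs = [t for t in transactions if "hot dogs" in t]
also_beer = [t for t in with_hot_dogs if "beer" in t]
confidence = len(also_beer) / len(with_hot_dogs)

print(confidence)  # 0.6, i.e., 60 percent
```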

Statistics research has produced many of the proposed data mining algorithms. The difference lies in the goals, the fact that statisticians may deal with smaller and more formatted data sets, and the emphasis of data mining on the use of machine learning techniques. However, it is often the case that the term data mining is used in a derogatory manner by statisticians, as data mining is "analysis without any clearly formulated hypothesis" [Man96]. Indeed, this may be the case because data itself, not a predefined hypothesis, is the guide.

Probability distributions can be used to describe the domains found for different data attributes. Statistical inference techniques can be viewed as special estimator and prediction methods. Use of these approaches may not always be applicable because the precise distribution of real data values may not actually follow any specific probability distribution, assumptions on the independence of attributes may be invalid, and some heuristic-based estimator techniques may never actually converge.
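As a sketch, modeling a numeric attribute's domain with a normal distribution reduces to estimating two parameters from the data; the estimate is only as good as the (possibly false) assumption that the attribute really is normally distributed. The values are illustrative.

```python
# Estimating the parameters of an assumed normal distribution for an
# attribute's domain. The data values are invented for illustration,
# and the normality assumption itself may be wrong for real data.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(data)
mu = sum(data) / n                              # estimated mean
sigma2 = sum((x - mu) ** 2 for x in data) / n   # estimated variance

print(mu, sigma2)  # 5.0 4.0
```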

It has been stated that "the main difference between data mining and statistics is that data mining is meant to be used by the business user-not the statistician" [BS97, 292]. As such, data mining (particularly from a database perspective) involves not only modeling but also the development of effective and efficient algorithms (and data structures) to perform the modeling on large data sets.

2.10 MACHINE LEARNING

Artificial intelligence (AI) includes many DM techniques such as neural networks and classification. However, AI is more general and involves areas outside traditional data mining. AI applications also may not be concerned with scalability as data sets may be small.

Machine learning is the area of AI that examines how to write programs that can learn. In data mining, machine learning is often used for prediction or classification.

With machine learning, the computer makes a prediction and then, based on feedback as to whether it is correct, "learns" from this feedback. It learns through examples, domain knowledge, and feedback. When a similar situation arises in the future, this feedback is used to make the same prediction or to make a completely different prediction. Statistics are very important in machine learning programs because the results of the predictions must be statistically significant and must perform better than a naive prediction. Applications that typically use machine learning techniques include speech recognition, training moving robots, classification of astronomical structures, and game playing.

When machine learning is applied to data mining tasks, a model is used to represent the data (such as a graphical structure like a neural network or a decision tree). During the learning process, a sample of the database is used to train the system to properly perform the desired task. Then the system is applied to the general database to actually perform the task. This predictive modeling approach is divided into two phases. During the training phase, historical or sampled data are used to create a model that represents those data. It is assumed that this model is representative not only for this sample data, but also for the database as a whole and for future data as well. The testing phase then applies this model to the remaining and future data.
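The two phases can be sketched with a deliberately tiny "nearest class mean" classifier; the one-dimensional training data, labels, and test points are all invented.

```python
# Two-phase predictive modeling: build a model from a training sample,
# then apply it to further data. The "nearest class mean" classifier
# and all values here are illustrative only.
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
test = [1.5, 8.5, 2.2]

# Training phase: summarize each class by the mean of its examples.
by_class = {}
for x, label in train:
    by_class.setdefault(label, []).append(x)
model = {label: sum(xs) / len(xs) for label, xs in by_class.items()}

# Testing phase: assign each new point to the class with the closest mean.
def predict(x):
    return min(model, key=lambda label: abs(x - model[label]))

predictions = [predict(x) for x in test]
print(predictions)  # ['low', 'high', 'low']
```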

A basic machine learning application includes several major aspects. First, an appropriate training set must be chosen. The quality of the training data determines how well the program learns. In addition, the type of feedback available is important.

Direct feedback entails specific information about the results and impact of each possible move or database state. Indirect feedback is at a higher level, with no specific information about individual moves or predictions. An important aspect is whether the learning program can actually propose new moves or database states. Another major feature that impacts the quality of learning is how representative the training set is of the overall final system to be examined. If a program is to be designed to perform speech recognition, it is hoped that the system is allowed to learn with a large sample of the speech patterns it will encounter during its actual processing.

There are two different types of machine learning: supervised learning and unsupervised learning. A supervised approach learns by example. Given a training set of data plus correct answers, the computational model successively applies each entry in the training set. Based on its ability to correctly handle each of these entries, the model is changed to ensure that it works better with this entry if it were applied again. Given enough input values, the model will learn the correct behavior for any potential entry.
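A toy perceptron makes this learn-by-example loop concrete: after each training entry, the feedback (the known correct answer) is used to adjust the model when it is wrong. The AND function, learning rate, and epoch count are illustrative choices.

```python
# Supervised learning by example: a perceptron-style model is corrected
# after each training entry it mishandles. Target function (logical AND),
# learning rate, and epoch count are all illustrative.
train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [0.0, 0.0]   # weights
b = 0.0          # bias
rate = 0.1

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

for _ in range(20):                  # repeated passes over the training set
    for x, answer in train:
        error = answer - predict(x)  # feedback: known correct answer
        w[0] += rate * error * x[0]  # adjust the model only on mistakes
        w[1] += rate * error * x[1]
        b += rate * error

results = [predict(x) for x, _ in train]
print(results)  # [0, 0, 0, 1]
```

Because AND is linearly separable, the loop converges; for an unsupervised task no such per-entry correct answer would be available.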

With unsupervised data, data exist but there is no knowledge of the correct answer of applying the model to the data.

Although machine learning is the basis for many of the core data mining research topics, there is a major difference between the approaches taken by the AI and database disciplines. Much of the machine learning research has focused on the learning portion rather than on the creation of useful information (prediction) for the user. Also, machine learning looks at things that may be difficult for humans to do or concentrates on how to develop learning techniques that can mimic human behavior. The objective for data


TABLE 2.3: Relationship Between Topics [FPSM91]

Database Management: the database is an active, evolving entity; records may contain erroneous or missing data.

mining is to uncover information that can be used to provide information to humans (not take their place). These two conflicting views are summarized in Table 2.3, which originally appeared in [FPSM91]. The items listed in the first column indicate the concerns and views that are taken in this book. Many of the algorithms introduced in this text were created in the AI community and are now being exported into more realistic data mining activities. When applying these machine learning concepts to databases, additional concerns and problems are raised: size, complex data types, complex data relationships, noisy and missing data, and databases that are frequently updated.

2.11 PATTERN MATCHING

Pattern matching or pattern recognition finds occurrences of a predefined pattern in data. Pattern matching is used in many diverse applications. A text editor uses pattern matching to find occurrences of a string in the text being edited. Information retrieval and Web search engines may use pattern matching to find documents containing a predefined pattern (perhaps a keyword). Time series analysis examines the patterns of behavior in data obtained from two different time series to determine similarity. Pattern matching can be viewed as a type of classification where the predefined patterns are the classes under consideration. The data are then placed in the correct class based on a similarity between the data and the classes.
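Viewed as classification, pattern matching assigns data to the predefined pattern it most resembles. The sketch below uses invented "rising", "falling", and "flat" patterns and Euclidean distance as the similarity measure, both illustrative choices.

```python
# Pattern matching as classification: assign a sequence to the predefined
# pattern (class) it is nearest to. Patterns and the Euclidean distance
# measure are illustrative choices.
patterns = {
    "rising":  [1.0, 2.0, 3.0, 4.0],
    "falling": [4.0, 3.0, 2.0, 1.0],
    "flat":    [2.0, 2.0, 2.0, 2.0],
}

def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(series):
    """Place the series in the class of its most similar pattern."""
    return min(patterns, key=lambda name: distance(series, patterns[name]))

print(classify([1.1, 1.9, 3.2, 3.8]))  # rising
print(classify([3.9, 3.0, 2.1, 1.0]))  # falling
```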

2.12 SUMMARY

When viewed as a query system, data mining queries extend both database and IR concepts. Data mining problems are often ill-posed, with many different solutions. Judging the effectiveness of the result of a data mining request is often difficult. A major difference between data mining queries and those of earlier types is the output. Basic database queries always output either a subset of the database or aggregates of the data.

A data mining query outputs a KDD object. A KDD object is either a rule, a classifier, or a clustering [IM96]. These objects do not exist before executing the query and they are not part of the database being queried. Table 2.4 compares the different query systems. Aggregation operators have existed in SQL for years. They do not return objects existing in the database, but return a model of the data. For example, an average operator returns the average of a set of attribute values rather than the values themselves. This is a simple type of data mining operator.
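A sketch of this last point: an AVG-style operator returns a derived value, a tiny model of the attribute, rather than any value stored in the database. The salary figures are invented.

```python
# An SQL AVG-style aggregation returns a model of the data, not a stored
# value. The salary figures below are invented for illustration.
salaries = [40000, 55000, 52000, 45000, 60000]

average = sum(salaries) / len(salaries)

print(average)  # 50400.0 -- not itself one of the stored salaries
```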

2.13 EXERCISES

1. Compare and contrast database, information retrieval, and data mining queries.

What metrics are used to measure the performance of each type of query?

2. What is the relationship between a fuzzy set membership function and classification? Illustrate this relationship using the problem of assigning grades to students in classes where outliers (extremely high and low grades) exist.

3. (Research) Data warehouses are often viewed to contain relatively static data.

Investigate techniques that have been proposed to provide updates to this data from the operational data. How often should these updates occur?

2.14 BIBLIOGRAPHIC NOTES

There are many excellent database books, including [DatOO], [ENOO], [GMUW02], and [0001]. There are several books that provide introductions to database query processing and SQL, including [DD97] and [YM98].

Fuzzy sets were first examined by Lotfi Zadeh in [Zad65]. They continue to be an important research topic in all areas of computer science, with many introductory texts available, such as [KY95], [NW99], and [San98]. Some texts explore the relationship between fuzzy sets and various data mining applications. Pedrycz and Gomide examine fuzzy neural networks and fuzzy genetic algorithms [PG98].

There are several excellent information retrieval texts, including [BYRN99] and [SM83]. The ER (entity-relationship) data model was first proposed by Chen in 1976 [Che76].

An examination of the relationship between statistics and KDD can be found in [IP96]. This article provides an historical perspective of the development of statistical techniques that have influenced data mining. Much of the research concerning outliers has been performed in the statistics community [BL94] and [Haw80].

There is an abundance of books covering DSS, OLAP, dimensional modeling, multidimensional schemas, and data warehousing, including [BE97], [PKB98], [Sin98], and [Sin99].

An excellent textbook on machine learning is [Mit97]. The relationships between machine learning and data mining are investigated in [MBK99].

CHAPTER 3
