
Concepts of Learning, Classification, and Regression


In this chapter, we introduce the main concepts and types of learning, classification, and regression, and elaborate on the generic properties of classifiers and regression models (regressors), along with their architectures, learning, and assessment (performance evaluation) mechanisms.

1. Introductory Comments

In data mining, we encounter a diversity of concepts that support the creation of models of data.

Here we elaborate in detail on learning, classification, and regression as the most dominant categories in the development of a variety of models.

1.1. Main Modes of Learning from Data: Problem Formulation

To deal with huge databases, we first describe several fundamental ways to analyze them, build the underlying models, and deliver major findings. We present the fundamental approaches to such analyses by distinguishing between supervised and unsupervised learning. We stress that this dichotomy is not the only taxonomy available, since a number of interesting and useful alternatives lie somewhere in-between the two, thus forming a continuum of options that could be utilized. The relevance and usefulness of such alternatives are described below.

1.2. Unsupervised Learning

The paradigm of unsupervised learning, quite often referred to as clustering, involves a process that automatically reveals (discovers) structure in data and does not involve any supervision. Given a dataset X = {x1, x2, ..., xN} of N patterns, where each xk is characterized by a set of attributes, we want to determine the structure of X, i.e., identify and describe the groups (clusters) present within it. To illustrate the essence of the problem and build some conceptual prerequisites, let us consider the examples of two-dimensional data shown in Figure 4.1. What can we say about the first one, shown in Figure 4.1(a)? Without any hesitation, we can distinguish three well-separated spherical groups of points. These are the clusters expressing the structure in the data.

The clusters could also exhibit a very different geometry. For instance, in Figure 4.1(b), we see two elongated structures and one ellipsoidal cluster. All the clusters are well separated and clearly visible. In Figure 4.1(c), the structure is much less apparent; the clusters overlap significantly.

Perhaps the two that are close to each other could be considered to form a single cluster. In Figure 4.1(d), the shapes are even more complicated: a horseshoe exhibits far higher geometric complexity than the two remaining spherical clusters.

Figure 4.1. Examples of two-dimensional data (all in the two-dimensional feature space x1−x2) and the search for their structures: geometry of clusters and their distribution in the two-dimensional data space.

The practicality of clustering is enormous. In essence, we perform clustering almost everywhere.

Clusters form aggregates or, to put it differently, build an abstraction of the dataset. Rather than dealing with millions of data points, we focus on a few clusters, and doing so is evidently very convenient. Note, however, that clusters do not have a numeric character; instead, we perceive "clouds" of data and afterwards operate on such structures. Hopefully, each cluster comes with well-defined semantics that capture some dominant and distinguishable parts of the data. Consider, for instance, a collection of transactions in a supermarket; the datasets generated daily are enormous. What could we learn from them? Who are the customers? Are there any clusters (segments of the market) we should learn about and study for the purpose of an advertising campaign? Discovering these segments takes place through clustering. A concise description of the clusters leads to an understanding of the structure within the collection of customers. Obviously, these data are highly dimensional and involve a significant number of features that describe the customers. A seamless visual inspection of the data (as we have already seen in the case of the examples in Figure 4.1) is not feasible. We need a powerful "computer eye" that will help us explore the structure in any space, even a highly dimensional one. Clustering delivers an algorithmic solution to this problem.

In spite of the evident diversity of clustering mechanisms and their algorithmic underpinnings, the underlying principle of grouping is evident and quite intuitive. We look for the closest data points and put them together. The clusters start to grow as we expand them by bringing more points together. This stepwise formation of clusters is the crux of hierarchical clustering. Alternatively, we could introduce some centers (prototypes) and request that the data points be split among them so that a given distance function assumes its lowest value (i.e., the similarity within each group is highest). A rough sketch of the stepwise formation of clusters is given below.
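To make the stepwise formation of clusters concrete, here is a minimal sketch of agglomerative (hierarchical) grouping under the Euclidean distance; the toy data, the single-linkage merging rule, and the stopping criterion (a target number of clusters) are illustrative assumptions rather than a prescription from the text.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two points given as tuples.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def agglomerate(points, target_clusters):
    # Start with every point in its own cluster, then repeatedly merge the
    # pair of clusters whose closest members are nearest to each other
    # (single linkage), until the requested number of clusters remains.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Three well-separated groups, as in Figure 4.1(a).
data = [(0.1, 0.2), (0.0, 0.0), (5.0, 5.1), (5.2, 4.9), (9.8, 0.1), (10.0, 0.0)]
print(agglomerate(data, 3))
```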

Here we note that all strategies rely heavily on the concept of similarity or distance between the data. Data that are close to each other are likely to be assigned to the same cluster. Distance impacts clustering in the sense that it predefines the character of the "computer eye" we use when searching for structure. Let us briefly recall that for any elements (patterns) x, y, z, treated formally as vectors, the distance function (metric) is a function producing a nonnegative numeric value d(x, y) such that the following straightforward conditions are met:

(a) d(x, x) = 0

(b) d(x, y) = d(y, x) (symmetry)

(c) d(x, z) + d(z, y) ≥ d(x, y) (triangle inequality)

While the above requirements are very general, there are a number of commonly encountered ways in which distances are described. One frequently used class of distances is known as the Minkowski distance (which includes the Euclidean, Manhattan, and Tchebyshev distances as special cases).

Examples of selected distances are given below, with their pertinent plots shown in Figure 4.2.

We should note that while the distances here concern vectors of numeric values, they could be defined for nonnumeric descriptors of the patterns as well.

Hamming distance:     d(x, y) = Σ_{i=1}^{n} |x_i − y_i|                (1)

Euclidean distance:   d(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² )          (2)

Tchebyschev distance: d(x, y) = max_i |x_i − y_i|                      (3)
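A small sketch of formulas (1)-(3) for numeric vectors follows; the function names are ours and only the Python standard library is assumed.

```python
import math

def hamming(x, y):
    # Formula (1): sum of absolute coordinate differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    # Formula (2): square root of the sum of squared coordinate differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def tchebyschev(x, y):
    # Formula (3): largest absolute coordinate difference.
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x, y = (1.0, 4.0, 2.0), (3.0, 1.0, 2.0)
print(hamming(x, y), euclidean(x, y), tchebyschev(x, y))  # 5.0 3.605... 3.0
```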

Figure 4.2. Graphic visualization of distances (equidistant regions): (a) Hamming, (b) Euclidean, and (c) Tchebyshev. Shown are a three-dimensional plot and the two-dimensional contour plots with the distances computed between x and the origin, d(x, 0).

The graphic illustration of distances is quite revealing and deserves attention. Each distance comes with its own geometry. The Euclidean distance imposes spherical shapes of equidistant regions. The Hamming distance imposes a diamond-like geometry, while the Tchebyschev distance forms hyper-squares. The weighted versions affect the original shapes: for the weighted Euclidean distance, we arrive at ellipsoidal shapes, while the weighted Tchebyschev distance leads to hyperboxes whose sides are of different lengths. It is easy to see that the choice of a certain distance function for expressing the closeness of patterns determines which patterns will be treated as closest neighbors and candidates for belonging to the same cluster. If we are concerned with, say, the Euclidean distance, the search for structure leads to the discovery of spherical shapes. This means that the method becomes predisposed towards searching for such geometric shapes in the structure.

Clustering methods are described in Chapter 9, while another class of unsupervised learning methods, namely, association rules, is described in Chapter 10.

1.3. Supervised Learning

Supervised learning is at the other end of the spectrum from unsupervised learning in the existing diversity of learning schemes. In unsupervised learning, we are provided with data and requested to discover its structure.

In supervised learning, the situation is very different. We are given a collection of data (patterns) and their characterization, which can be expressed in the form of discrete labels (in which case we have a classification problem) or values of some auxiliary continuous variable (in which case we are faced with a regression or approximation problem). In classification problems, each data point xk comes with a certain class label, say ωk, where the values of ωk come from a small set of integers, ωk ∈ {1, 2, ..., c}, and "c" stands for the number of classes.

Some examples of two-dimensional classification data are shown in Figure 4.3. The objective here is to build a classifier, that is, a function Φ that generates a class label as its output, Φ(xk) = ωk.

The geometry of the classification problem depends on the distribution of classes. Depending upon the distribution, we can design linear or nonlinear classifiers. Several examples of such classifiers are illustrated in Figure 4.3. The geometry of the classifiers is reflective of the distribution of the data. Here we emphasize the fact that the nonlinearity of the classifier depends upon the geometry of the data. Likewise, the patterns belonging to the same class could be distributed in several disjoint regions of X.

Figure 4.3. Examples of classification problems in a two-dimensional space: (a) linear classifier, (b) piecewise linear classifier, (c) nonlinear classifier. Two classes of patterns are denoted by black and white dots, respectively.

1.4. Reinforcement Learning

Reinforcement learning is a learning paradigm positioned between unsupervised and supervised learning. In unsupervised learning, there is no guidance as to the assignment of patterns to classes.

In supervised learning, class assignment is known. In reinforcement learning, we are offered less detailed information (the supervision mechanism) than that encountered in supervised learning.

This information (guidance) comes in the form of some reinforcement (the reinforcement signal).

For instance, given "c" classes, the reinforcement signal r(ω) could be binary in its nature:

r(ω) =   1    if the class label is even (2, 4, ...)
        −1    otherwise                                  (4)

See also Figure 4.4. When used to supervise the development of the classifier, reinforcement offers a fairly limited level of supervision. For instance, we do not "tell" the classifier to which class the pattern belongs but only distinguish between two super-categories composed of the odd- and even-numbered class labels, respectively. Nevertheless, this information provides more guidance than no labeling at all. In the continuous problem (regression), reinforcement results from the discretization of the original continuous target value. Another situation arises when detailed supervision over time is replaced by mean values regarded as its aggregates, i.e., a certain reinforcement signal, as shown in Figure 4.4(c). The sketch below illustrates the signal defined by (4).
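As a minimal illustration of the reinforcement signal (4), the following sketch collapses detailed class labels into the two super-categories; the integer labeling 1, 2, ..., c is assumed.

```python
def reinforcement_signal(class_label):
    # Formula (4): the learner is told only whether the true class label
    # is even (+1) or odd (-1), not the label itself.
    return 1 if class_label % 2 == 0 else -1

# Detailed labels 1..4 collapse into two super-categories.
print([reinforcement_signal(label) for label in [1, 2, 3, 4]])  # [-1, 1, -1, 1]
```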

In a nutshell, reinforcement learning is guided by signals that could be viewed as an aggregation (generalization) of the more detailed supervision signals used in "standard" supervised learning. The emergence of this type of learning could be motivated by the scarcity of available supervision, which in turn could be dictated by economic factors (less supervision effort).

Figure 4.4. Examples of reinforcement signals: (a) classification provides only partial guidance by combining several classes together, as exemplified by (4); (b) regression offers a thresholded version of the continuous target signal; (c) reinforcement supervision involves aggregates (mean values) over time.

1.5. Learning with Knowledge Hints and Semi-supervised Learning

Supervised and unsupervised learning are two extremes, with reinforcement learning positioned in-between them. However, a number of other interesting options exist that fall under the umbrella of what could generally be called learning with knowledge-based hints. These options reflect practice: rarely are we provided with complete knowledge, and rarely do we approach a problem with no domain knowledge at all; both extremes are quite impractical. Noting this point, we now discuss several possibilities in which domain knowledge comes into the picture.

In a large dataset X, we may have a small portion of labeled patterns; these lead to the notion of clustering with partial supervision (see Figure 4.5).

These labeled patterns in an ocean of data form some “anchor” points that help us navigate the process of determining (discovering) clusters. The search space and the number of viable structures in the data are thus reduced, simplifying and focusing the overall search process.

Partial supervision, as a format that directly encapsulates the available knowledge, can be present in numerous situations in data mining. Imagine a huge database of handwritten characters (e.g., digits and letters used in postal codes). Typically there will be millions of characters. A very small fraction of these are labeled by an expert, who chooses some handwritten characters (maybe those that are difficult to decipher) and labels them. In this way, we produce a small labeled dataset, as sketched below.
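One simple way to picture such a partially supervised dataset is sketched here; the use of None to mark unlabeled patterns is just an illustrative convention.

```python
# A partially supervised dataset: most patterns are unlabeled (None),
# while a few "anchor" patterns carry class labels supplied by an expert.
patterns = [(0.1, 0.2), (0.3, 0.1), (5.0, 5.2), (5.1, 4.8), (9.9, 0.2)]
labels   = [1,          None,       None,       2,          None]

labeled = [(p, c) for p, c in zip(patterns, labels) if c is not None]
print(f"{len(labeled)} of {len(patterns)} patterns are labeled:", labeled)
```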

Knowledge hints come in different formats. Envision a huge collection of digital pictures. In this dataset, we have some pairs of data (patterns) whose proximity has been quantified by an expert or user (see Figure 4.6). These proximity hints are a useful element of supervision during data clustering or, in this case, when organizing a digital photo album.

We note that knowledge hints of this nature are very different from those we had in partial supervision. The striking difference is this: in the case of clustering with partial supervision, we assume that the number of clusters is equal to the number of classes and that this number is known.

This assumption could be true in many instances, for example, when dealing with handwritten characters. In proximity-based clustering, however, we do not specify the number of clusters, and in this sense the format of the hints is far more general. In a photo album, the number of clusters is obviously unknown. Thus the use of proximity-based hints under these circumstances is fully justifiable.

Figure 4.5. Clustering with partial supervision: the highlighted patterns come with class labeling.


Figure 4.6. Clustering with proximity hints: selected pairs of data are assigned some degree of proximity (closeness).

2. Classification

2.1. Problem Formulation

In the previous section, we introduced the concept of classification. Classifiers are constructs (algorithms) that discriminate between classes of patterns. Depending upon the number of classes in the problem, we may encounter two-class or many-class classifiers. The design of a classifier depends upon the character of the data, the number of classes, the learning algorithm, and the validation procedures. Let us recall that the development of the classifier gives rise to the mapping

Φ: X → {1, 2, ..., c}                                    (5)

that maps any pattern x in X to one of the labels (classes). In practice, both linear and nonlinear mappings require careful quality assessment. Building classifiers requires prudent use of data so that we reach a sound balance between the accuracy and the generalization abilities of the constructed classifier. This goal calls for the arrangement of data into training and testing subsets and for running the training procedures in some mode. The remainder of this section covers these topics.

2.2. Two- and Many-class Classification Problems

Classification tasks involve two or more classes. This taxonomy is quite instructive, as it reveals several ways to form classifier architectures and to organize them into a certain topology. In the simplest case, let us consider two classes of data (patterns). Here the classifier (denoted by Φ) generates a single output whose value depends on the class to which the given pattern is assigned, as shown in Figure 4.7.

Figure 4.7. (a) A two-class classifier; (b) some alternative ways to code its single output (y) to represent the two classes in the problem.

Since there are two classes, their coding could be realized in several ways. As shown in Figure 4.7(b), a given range of real numbers could be split into two intervals. In particular, if we use the range [0, 1], the coding could assume the following form:

y ∈ [0, ½)    if the pattern belongs to class ω1
y ∈ [½, 1]    if the pattern belongs to class ω2             (6)

In another typical coding alternative, we use the entire real space, and the coding assumes the form:

Φ(x) < 0    if x belongs to ω1
Φ(x) ≥ 0    if x belongs to ω2                               (7)
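A brief sketch of the two coding schemes (6) and (7) follows; the linear form of Φ used in the example is a hypothetical choice for illustration only.

```python
def classify_unit_interval(y):
    # Coding (6): the output y lies in [0, 1]; values below 1/2 indicate
    # class omega_1, values of 1/2 or more indicate class omega_2.
    return "omega_1" if y < 0.5 else "omega_2"

def classify_real_line(phi_of_x):
    # Coding (7): the sign of Phi(x) decides between the two classes.
    return "omega_1" if phi_of_x < 0 else "omega_2"

# Example: a hypothetical linear classifier Phi(x) = w.x + w0 in two dimensions.
w, w0 = (1.0, -2.0), 0.5
x = (1.0, 1.0)
phi = sum(wi * xi for wi, xi in zip(w, x)) + w0
print(classify_real_line(phi))  # Phi(x) = 1 - 2 + 0.5 = -0.5 < 0 -> omega_1
```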

We will come back to this classification rule when we discuss the main classifier architectures.

The multiclass problem can be handled in two different ways. An intuitive way to build a classifier is to consider all classes together and to create a classifier with "c" outputs, where the class to which x belongs is identified by the output attaining the maximal value. We express this option in the form

i0 = arg max {y1, y2, ..., yc}

where y1, y2, ..., yc are the outputs of the classifier (see Figure 4.8).
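A minimal sketch of this maximum-output decision rule, assuming the classifier outputs are already available as a list:

```python
def assign_class(outputs):
    # The pattern is assigned to the class whose output y_i is maximal,
    # i.e., i0 = arg max {y_1, y_2, ..., y_c} (classes indexed from 1).
    return max(range(len(outputs)), key=lambda i: outputs[i]) + 1

y = [0.12, 0.71, 0.40]   # outputs of a hypothetical three-class classifier
print(assign_class(y))   # 2
```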

The other general approach is to split the c-class problem into a series of two-class problems.

In each, we consider one class, say ω1, with the other class composed of all the patterns that do not belong to ω1. In this case, we come up with a dichotomy:

Φ1(x) ≥ 0    if x belongs to ω1
Φ1(x) < 0    if x does not belong to ω1                      (8)

(here the index in the classifier formula pertains to the class label).

In the same way, we can design a two-class classifier for ω2, ω3, ..., ωc. When used, these classifiers are invoked by some pattern x and return decisions about class assignment (see Figure 4.9).

If only one classifier generates a nonnegative value, the assignment of the class is obvious.

There are two other possible outcomes, however: (a) several classifiers identify the pattern as belonging to a specific class, in which case we have a conflicting situation that needs to be resolved, or (b) no classifier issues a classification decision, in which case the class assignment of the pattern becomes undefined. An example of the resulting geometry of the two-class linear classifiers is shown in Figure 4.9; a small sketch of this decision logic is given below.
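The following sketch applies the dichotomy (8) class by class and makes the three possible outcomes explicit; the linear discriminants are hypothetical and serve only to illustrate the conflict and lack-of-decision regions of Figure 4.9.

```python
def one_versus_rest(x, discriminants):
    # Evaluate each two-class classifier Phi_i; class i claims the pattern
    # when Phi_i(x) >= 0, as in (8).
    claims = [i + 1 for i, phi in enumerate(discriminants) if phi(x) >= 0]
    if len(claims) == 1:
        return claims[0]          # unambiguous class assignment
    if len(claims) > 1:
        return f"conflict between classes {claims}"
    return "lack of decision"     # no classifier claimed the pattern

# Three hypothetical linear discriminants Phi_i(x).
phis = [
    lambda x: x[0] - 1.0,         # class 1: right half-plane
    lambda x: -x[0] - 1.0,        # class 2: left half-plane
    lambda x: x[1] - 1.0,         # class 3: upper half-plane
]
print(one_versus_rest((2.0, 0.0), phis))   # 1
print(one_versus_rest((2.0, 2.0), phis))   # conflict between classes [1, 3]
print(one_versus_rest((0.0, 0.0), phis))   # lack of decision
```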

Below, we discuss the main categories of classifiers, i.e., we elaborate on the various forms of the classifier (viz., the function Φ) used to realize the discrimination.

Figure 4.8. A single-level c-output classifier Φ(x).


Figure 4.9. Two-class linear classifiers and their classification decisions. Note the regions of conflict and lack of decision.

2.3. Classification and Regression: A General Taxonomy

As we have noted, the essence of classification is to assign a new pattern to one of the classes, where the number of classes is usually quite low. We usually make a clear distinction between classification and regression. In regression, we encounter a continuous output variable, and our objective is to build a model (regressor) such that a certain approximation error is minimized. More formally, consider a dataset formed by pairs of input-output data (xk, yk), k = 1, 2, ..., N, where now yk ∈ R. The regression model (regressor) comes in the form of some mapping F(x) such that for any xk we obtain F(xk) ≈ yk. As illustrated in Figure 4.10, the regression model attempts to "capture" most of the data by passing through the area of the highest density of data.

The quality of the model depends on the nature of the data (including their dispersion) and its functional form. Several cases are illustrated in Figure 4.10.
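As a minimal sketch of constructing a regressor F with F(xk) ≈ yk for single-input, single-output data, we may fit a straight line by least squares; the linear form of F and the toy data are illustrative assumptions, not the only possible functional form.

```python
def fit_line(xs, ys):
    # Least-squares fit of F(x) = a*x + b so that F(x_k) stays close to y_k.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]        # noisy observations around y = x
F = fit_line(xs, ys)
print(round(F(3.0), 2))               # approximately 3.0
```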

Regression becomes a standard model when revealing dependencies between input and output variables. For instance, one might be interested in finding a meaningful relationship between the spending of customers and their income, marital status, job, etc.

One could treat classification as a discrete version of the regression problem. Simply discretize the continuous output variable existing in the regression problem and treat each discrete value as a class label.
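A small sketch of this discretization step follows; the particular thresholds are arbitrary and chosen only for illustration.

```python
def discretize(y_values, thresholds):
    # Turn a continuous target into class labels: the label is the number
    # of thresholds the value exceeds, plus one (labels 1, 2, ..., c).
    return [sum(y > t for t in thresholds) + 1 for y in y_values]

spending = [120.0, 340.0, 55.0, 800.0, 410.0]
print(discretize(spending, thresholds=[100.0, 400.0]))  # [2, 2, 1, 3, 3]
```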

As we will see in the ensuing discussion, the arguments, architectures, and algorithms pertaining to classification models are also relevant when discussing and solving regression problems.

2.4. Main Categories of Classifiers and Selected Architectures

The diversity of classifiers is amazing. It is motivated by the geometry of the classification problems, the ensuing approaches to design, and the complexity of the problem at hand. The reader may have been exposed to names like linear classifier, decision tree, neural network, k-nearest neighbor, polynomial classifier, and the like. The taxonomy of classifiers could be presented in several ways, depending on which facet of the development we decide to focus on. In the ensuing discussion, we concentrate on two different viewpoints.

2.4.1. Explicit and Implicit Characterization of the Classifier

The distinction made here is concerned with the form in which the classifier arises. It could be described as some function, say Φ(x), where Φ could be quite complex yet described in an explicit manner. In other words, Φ could be a linear function, a quadratic relationship, some high-order polynomial, etc. Implicit characterization of a classifier takes place when we do not have a formula; rather, the classifier is described in some graphic form, such as a decision tree, a nearest neighbor classifier, or a cognitive map. To illustrate this point, let us concentrate on

Figure 4.10. Examples of regression models with a single input and a single output and their relationships.
