
Neural networks used for computation were based on early understandings of the structure and function of the human brain. They were proposed as a means for mathematical computation by McCulloch and Pitts (1943). But it was not until the 1980s that the concept was developed for use with digital computers. The underlying assertion of neural nets is that all of the operations of a digital computer can be performed with a set of interconnected McCulloch-Pitts “neurons” (Abu-Mostafa, 1986).

Figure 7.5 shows how human neurons are structured.

Neuron cells receive electrical impulses from neighboring cells and accumulate them until a threshold value is exceeded. Then they “fire” an impulse to an adjacent cell.

FIGURE 7.4 Link graph showing the strength of association by the thickness of the line connecting the Body and Head words of some association rules. (Source: StatSoft Inc.)


The capacity of the cell to store electrical impulses and the threshold are controlled by biochemical processes, which change over time. This change is under the control of the autonomic nervous system and is the primary means by which we “learn” to think or activate our bodies.

Artificial neurons in networks (Figure 7.6) incorporate these two processes and vary them numerically rather than biochemically. The aggregation process accepts data inputs by summing them (usually). The activation process is represented by some mathematical function, usually linear or logistic. Linear activation functions work best for numerical estimation problems (i.e., regression), and the logistic activation function works best for classification problems. A sharp threshold, as is used in decision trees, is shown in Figure 7.6.
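A minimal sketch of these two processes in code may make the sequence concrete; the function names and weight values below are illustrative, not taken from any particular tool:

```python
import math

def aggregate(inputs, weights):
    # Aggregation process: weighted sum of the inputs (the usual choice).
    return sum(x * w for x, w in zip(inputs, weights))

def logistic(z):
    # Logistic activation: squashes the aggregate into (0, 1) for classification.
    return 1.0 / (1.0 + math.exp(-z))

def linear(z):
    # Linear activation: passes the aggregate through for numerical estimation.
    return z

def neuron(inputs, weights, activation=logistic):
    return activation(aggregate(inputs, weights))

# Example: two inputs X1, X2 with weights W1, W2 (values are illustrative).
print(neuron([1.0, 0.0], [0.5, -0.3]))  # approx. 0.62
```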

The symbol Xi represents the input variables, analogous to the impulses a human neuron receives from the other neurons connected to it. Wi represents the numerical weights associated with each linkage, and they are analogous to the strength of the interconnections. This strength of connection represents the proximity of the connection between two neurons in the region called the synapse (Bishop, 1995).

Artificial neurons are connected together into an architecture or processing structure. This architecture forms a network in which each input variable (called an input node) is connected to one or more output nodes. This network is called an artificial neural network (neural net for short). When input nodes with a summation aggregation function and a logistic activation function are directly connected to an output node, the mathematical processing is analogous to a logistic regression with a binary output. This configuration of a neural net is a powerful classifier. It has the ability to handle nonlinear relationships between the output and the input variables, by virtue of the logistic function shown in Figure 7.7.

FIGURE 7.5 Structure of the human neuron, with the axon labeled. [Source: Carlson, Neil R. (1992). Foundations of Physiological Psychology. Needham Heights, Massachusetts: Simon & Schuster. p. 36.]

FIGURE 7.6 Architecture of a neuron, with a number of inputs (Xi) and associated weights (W1, ...).



The logistic function fits many binary classification problems and can capture many of the nonlinear effects of the predictors.
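For reference, the logistic function plotted in Figure 7.7 has the standard form

$$f(z) = \frac{1}{1 + e^{-z}},$$

where z is the weighted sum of the inputs; it maps any real-valued sum onto the interval (0, 1).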

The most interesting property of a neural net comes into view when you intercalate a middle layer of neurons (nodes) between the input and output nodes, as shown in Figure 7.8.

FIGURE 7.7 A plot of the logistic function.

Figure 7.8 depicts a “feed-forward” neural net: nodes of the input layer are connected through weights $w_{ij}$ to middle-layer nodes $H_1$, $H_2$, and $H_3$, which are in turn connected through weights $w_j$ to the output, which computes

$$F\left(\sum_{j=1}^{n} w_j \, F_j\!\left(\sum_{i=1}^{m} w_{ij}\, x_i\right)\right)$$

FIGURE 7.8 Architecture of a neural net.


Weights (Wij) are assigned to each connection between the input nodes and middle layer nodes, and between the middle layer nodes and the output node(s). These weights have the capacity to model nonlinear relationships between the input nodes and output node(s). Herein lies the great value of a neural net for solving data mining problems.
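A sketch of this computation in code, under the same assumptions as before (logistic activations throughout; all weight values are illustrative):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def feed_forward(x, W_hidden, w_output):
    # W_hidden[j][i] is the weight W_ij from input node i to middle-layer node j;
    # w_output[j] is the weight from middle-layer node j to the output node.
    hidden = [logistic(sum(w_ij * x_i for w_ij, x_i in zip(row, x)))
              for row in W_hidden]
    return logistic(sum(w_j * h_j for w_j, h_j in zip(w_output, hidden)))

# Illustrative 2-input, 3-hidden-node net.
x = [1.0, 0.0]
W_hidden = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.9]]
w_output = [0.3, -0.6, 0.8]
print(feed_forward(x, W_hidden, w_output))  # approx. 0.52
```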

The nodes in the middle layer provide the capacity to model nonlinear relationships between the input nodes and the output node (the decision). The greater the number of nodes in the middle layer, the greater the capacity of the neural net to recognize nonlinear patterns in the data set. But as the number of nodes in the middle layer increases, the training time increases exponentially, and so does the probability of overtraining the model. An overtrained model may fit the training set very well but not perform very well on another data set. Unfortunately, there are no great rules of thumb for defining the number of middle-layer nodes to use. The only guideline is to use more nodes when you have many training cases and fewer nodes when you have few. If your classification problem is complex, use more nodes in the middle layer; if it is simple, use fewer nodes.

The neural net architecture can be constructed to contain only one output node and be configured to function as a regression (for numerical outputs) or binary classification (yes/no or 1/0). Alternatively, the net architecture can be constructed to contain multiple output nodes for estimation or classification, or it can even function as a clustering algorithm.

The learning process of the human neuron is reflected (crudely) by performing one of a number of weight adjustment processes, the most common of which is called backpropagation, shown in the diagram of Figure 7.9.

The backpropagation operation adjusts the weights of misclassified cases based on the magnitude of the prediction error. This adaptive process iteratively retrains the model and improves its fit and predictive power.


FIGURE 7.9 A feed-forward neural net with backpropagation.


How Does Backpropagation Work?

The processing steps for backpropagation are as follows (a code sketch following the list walks through them):

1. Randomly assign weights to each connection.

2. Read the first record and calculate the values at each node as the sum of the inputs times their weights.

3. Specify a threshold value above which the output is evaluated to 1 and below which it is evaluated to 0. In the following example, the threshold value is set to 0.01.

4. Calculate the prediction error as

Error = Expected prediction − Actual prediction.

5. Adjust the weights as

Adjustment = Error × Output weight.

6. Calculate the new weight as

New weight = Old input weight + Adjustment (assume the adjustment is 0.1).

7. Do the same for all inputs.

8. Repeat a number of iterations through the data (often 100–200).
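The following is a literal sketch of these eight steps on a tiny net, under stated assumptions: it uses the sharp-threshold output of step 3 and the simple additive adjustment of steps 5–7 rather than the gradient-based updates of production backpropagation, and all names and sizes are illustrative:

```python
import random

def step(value, threshold=0.01):
    # Step 3: output 1 above the threshold, 0 below it.
    return 1 if value > threshold else 0

def train(records, n_inputs, n_hidden=3, threshold=0.01, epochs=150):
    # Step 1: randomly assign weights to each connection.
    w_in = [[random.uniform(-1, 1) for _ in range(n_inputs)]
            for _ in range(n_hidden)]                       # input -> middle layer
    w_out = [random.uniform(-1, 1) for _ in range(n_hidden)]  # middle layer -> output

    for _ in range(epochs):                # Step 8: repeat iterations through the data.
        for inputs, expected in records:   # Step 2: read each record and compute
            hidden = [step(sum(w * x for w, x in zip(row, inputs)), threshold)
                      for row in w_in]     # node values as sums of inputs times weights.
            actual = step(sum(w * h for w, h in zip(w_out, hidden)), threshold)
            error = expected - actual      # Step 4: Error = Expected - Actual.
            for j in range(n_hidden):
                adjustment = error * w_out[j]  # Step 5: Adjustment = Error x output weight.
                for i in range(n_inputs):      # Steps 6-7: add the adjustment to every
                    w_in[j][i] += adjustment   # input weight feeding this middle node.
    return w_in, w_out

# The XOR records used in the worked example below.
xor_records = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
w_in, w_out = train(xor_records, n_inputs=2)
```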

Figure 7.10 shows the evaluation of all weights after the first record is processed.

For example, the new weight of input variable X1 connected to the middle layer (or hidden layer) node H1 is −4.9 + (−1 × 2.2) = −7.1, as shown in Figure 7.11.

In the example of Figure 7.10, the net is given the XOR case (with a threshold value of 0.01). On the first pass, XOR is evaluated as true, but actually it is false, yielding a prediction error of 0 − 1 = −1.

FIGURE 7.10 How backpropagation solves the XOR case. (Note that these neural nets flow from the bottom to the top; previous models flowed from left to right.)


Advantages of neural nets include the facts that they

• Are general classifiers. They can handle problems with very many parameters, and they are able to classify objects well even when the distribution of objects in the N-dimensional parameter space is very complex.

• Can handle a large amount of nonlinearity in the predictor variables.

• Can be used for numerical prediction problems (like regression).

• Require no underlying assumption about the distribution of data.

• Are very good at finding nonlinear relationships. The hidden layer(s) of the neural net architecture provides this ability to model highly nonlinear functions efficiently.

Disadvantages of neural nets include

• They may be relatively slow, especially in the training phase but also in the application phase.

• It is difficult to determine how the net is making its decision. It is for this reason that neural nets have the reputation of being a “black box.”

• No hypotheses are tested and no p-values are available in the output for comparing variables.

Modern implementations of neural nets in many data mining tools open up the “black box” to a significant extent by showing the effect of what the net does in terms of the contribution of each variable. As a result, many modern neural net implementations are referred to as “gray boxes.” These effects of the trained neural net are often displayed in the form of a sensitivity analysis. In this context, the term sensitivity has a slightly different meaning than it does in classical statistical analysis. Classical statistical sensitivity is determined either by calculating the statistical model with all but one of the variables, leaving out a different variable in each of a number of iterations (equal to the number of variables), or by keeping track of the partial least squares values (as is done in partial least squares regression). In neural net analysis, sensitivities are calculated from the normalized weights associated with each variable in the model.
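As a sketch of that last point, assuming sensitivity is taken as the normalized magnitude of the weights attached to each input (implementations differ in the exact formula):

```python
def sensitivities(w_in):
    # w_in[j][i] is the trained weight from input variable i to middle-layer node j.
    # Sum the absolute weights leaving each input, then normalize so they sum to 1.
    n_inputs = len(w_in[0])
    totals = [sum(abs(row[i]) for row in w_in) for i in range(n_inputs)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

# Example with two inputs and three middle-layer nodes (weights illustrative).
print(sensitivities([[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]]))  # approx. [0.64, 0.36]
```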

FIGURE 7.11 Weights after backpropagation for one record.


Training a Neural Net

Training a neural net is analogous to a ball rolling over a series of hills and valleys.

The size (actually, the mass) of the ball represents its momentum, and the learning rate is analogous to the slope of the error pathway over which the ball rolls (Figure 7.12).

The low momentum of the search path means that the solution (depicted by the ball) may get stuck in the local minimum error region of the surface (A) rather than finding the global minimum (B). This happens when the search algorithm does not have enough tendency to continue searching (momentum) to climb the hill over to the global minimum.

This problem is compounded when the learning rate is relatively high, analogous to a steep slope of the decision surface (Figure 7.12).

The configuration of the neural net search path depicted in Figure 7.13 is much more likely to permit the error minimization routine to find the global minimum. This increased likelihood is due to the relatively large momentum (shown by the large ball), which may carry the ball long enough to find the global minimum. In manually configured neural nets, the best learning rate is often 0.9, and the momentum is often set to 0.1 to 0.3.

Another neural net setting that must be optimized is the learning decay rate. Most implementations of a neural net start at the preset learning rate and then reduce it incrementally during subsequent runs. This decay process has the effect of progressively flattening the search surface. The run with the lowest error rate is selected as the final training run.
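One common form such a schedule can take, shown only as an illustration:

```python
def decayed_rate(initial_rate, decay, run):
    # Start at the preset learning rate and shrink it on each subsequent run.
    return initial_rate * (decay ** run)

# e.g., initial_rate=0.9, decay=0.95 gives 0.900, 0.855, 0.812, ... per run.
```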

Modeling with a manual neural net is very much an art. The modeling process usually includes a number of training runs with different combinations of the following settings (a sketch of such a tuning loop follows the list):

• Learning rate

• Learning rate decay

• Momentum

• Number of nodes in the middle layer

• Number of middle layers to add
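A minimal sketch of that kind of manual search, assuming a hypothetical helper train_and_score that performs one full training run and returns its error rate:

```python
import random
from itertools import product

def train_and_score(learning_rate, rate_decay, momentum, hidden_nodes, hidden_layers):
    # Placeholder: a real version would train a net with these settings
    # and return its error rate on held-out data.
    return random.random()

grid = {
    "learning_rate": [0.5, 0.7, 0.9],
    "rate_decay":    [0.90, 0.95, 0.99],
    "momentum":      [0.1, 0.2, 0.3],
    "hidden_nodes":  [3, 5, 10],
    "hidden_layers": [1, 2],
}

best_error, best_settings = float("inf"), None
for combo in product(*grid.values()):
    settings = dict(zip(grid.keys(), combo))
    error = train_and_score(**settings)
    if error < best_error:
        best_error, best_settings = error, settings
print(best_settings)
```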

FIGURE 7.12 The topology of a learning surface associated with a low momentum and a high learning rate. (A marks the local minimum; B, the global minimum; the vertical axis is the error.)

FIGURE 7.13 The topology of a learning surface associated with a high momentum and a low learning rate.


Because of the artistic nature of neural net modeling, it is difficult for novice data miners to use many implementations of neural nets successfully. But there are some implementations that are highly automated, permitting even novice data miners to use them effectively.

The type of neural net described here is sometimes called a Multilayer Perceptron (MLP).

Additional Types of Neural Networks

Following are some additional types of neural nets:

• Linear Networks: These networks have two layers: input and output layers. They do not handle complexities well but can be considered as a “baseline model.”

• Bayesian Networks: Networks that employ Bayesian probability theory, which can be used to control model complexity, optimize weight decay rates, and automatically find the most important input variables.

• Probabilistic Networks: These networks consist of three to four layers.

• Generalized Regression: These networks train quickly but execute slowly. Probabilistic (PNN) and Generalized Regression (GRNN) neural networks operate in a manner similar to that of Nearest-Neighbor algorithms (see Chapter 12), except that the PNN operates only with categorical target variables and the GRNN operates only with numerical target variables. PNN and GRNN networks have advantages and disadvantages compared to MLP networks (adapted from http://www.dtreg.com/pnn.htm); a minimal GRNN-style sketch follows this list:

• It is usually much faster to train a PNN/GRNN network than an MLP network.

• PNN/GRNN networks often are more accurate than MLP networks.

• PNN/GRNN networks are relatively insensitive to outliers (wild points).

• PNN networks generate accurate predicted target probability scores.

• PNN networks approach Bayes optimal classification.

• PNN/GRNN networks are slower than MLP networks at classifying new cases.

• PNN/GRNN networks require more memory space to store the model.

• Kohonen: This type of neural network is used for classification. It is sometimes called a “self-organizing” neural net. It iteratively classifies inputs until the combined difference between classes is maximized. This algorithm can be used as a simple way to cluster data if the number of cases or categories is not particularly large. For data sets with a large number of categories, training the network can take a very long time.
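As promised above, here is a minimal sketch of the GRNN idea: the prediction is a kernel-weighted average of stored training targets, where cases nearer to the query get larger weights. Training amounts to storing the data (fast), while every stored case must be visited at prediction time (slow). The smoothing parameter sigma and the data are illustrative:

```python
import math

def grnn_predict(x, train_X, train_y, sigma=0.5):
    # Weight each stored case by a Gaussian kernel of its squared
    # Euclidean distance from the query point x.
    weights = []
    for xi in train_X:
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-d2 / (2 * sigma ** 2)))
    # Prediction: kernel-weighted average of the numerical targets.
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

# Illustrative numerical data.
train_X = [[0.0], [1.0], [2.0], [3.0]]
train_y = [0.1, 0.9, 2.1, 2.9]
print(grnn_predict([1.5], train_X, train_y))  # approx. 1.5
```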

MLPs can be used to solve most logical problems, but a network without a middle layer can solve only those in which the classes are linearly separable. Figure 7.14 shows a classification problem in which it is possible to separate the classes with a straight line in the space defined by their dimensions.

Figure 7.15 shows two classes that cannot be separated with a straight line (i.e., are not linearly separable).
