
2.3 Cost-Sensitive MLP

In many real-world problems, such as financial analysis and medical diagnosis, different prediction errors usually incur different costs. For example, the cost of making an error in predicting the future of a million-dollar investment is higher than that for a thousand-dollar investment. The cost of diagnosing a person with a serious disease as healthy is much greater than that of misdiagnosing a healthy person as ill. Cost-sensitive neural networks (CSNNs) address these important issues [19]. Cost-sensitive classification trees have been studied by Turney [319] and Ting [312].

2.3.1 Standard Back-propagation

The total input to an artificial neuron is

h = \sum_{j=1}^{N_I} w_j V_j ,   (2.16)

where N_I is the number of inputs to the neuron (the dimension of the input vector), {w_1, w_2, ..., w_{N_I}} are the weights or synapses of the neuron, and {V_1, V_2, ..., V_{N_I}} are the individual inputs to the neuron, either from other neurons or from external sources of input.

The neuron then determines its output a according to

a = f(h + b),   (2.17)

where b is the bias of the neuron and f is usually a non-linear function, which will be specified later.
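To make Eqs. (2.16) and (2.17) concrete, the following is a minimal Python sketch of a single neuron; the function and variable names are illustrative rather than from the original text, and the sigmoid is used only as a placeholder for f:

```python
import numpy as np

def sigmoid(x):
    """Logistic function, used here as a placeholder for the non-linearity f."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(w, V, b):
    """Eqs. (2.16)-(2.17): total input h = sum_j w_j V_j, output a = f(h + b)."""
    h = np.dot(w, V)          # Eq. (2.16)
    return sigmoid(h + b)     # Eq. (2.17)

# Example: a neuron with three inputs
w = np.array([0.5, -0.3, 0.8])
V = np.array([1.0, 0.2, -0.5])
print(neuron_output(w, V, b=0.1))
```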

Let us consider a layer of N_H neurons. All neurons receive an input vector (pattern) {ξ_1, ξ_2, ..., ξ_{N_I}}. Neuron j in this layer has weights {w_{j1}, w_{j2}, ..., w_{jN_I}}. Hence, the total input to neuron j is

h_j = \sum_{k=1}^{N_I} w_{jk} \xi_k ,   (2.18)

which produces output

Vj =g(hj) =g(

NI

k=1

wjkξk), (2.19)

where

g(x) = f(x + b).   (2.20)

Now let us connect a second layer of N_O neurons on top of this first layer of N_H neurons to form a feedforward neural network. The weight connecting neuron i (i = 1, 2, ..., N_O) in the second layer and neuron j (j = 1, 2, ..., N_H) in the first layer is W_{ij}. Hence, neuron i in the second layer receives a total input

h_i = \sum_{j=1}^{N_H} W_{ij} V_j = \sum_{j=1}^{N_H} W_{ij} \, g\left(\sum_{k=1}^{N_I} w_{jk} \xi_k\right),   (2.21)

and produces the output

O_i = g(h_i) = g\left(\sum_{j=1}^{N_H} W_{ij} \, g\left(\sum_{k=1}^{N_I} w_{jk} \xi_k\right)\right).   (2.22)
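The forward pass defined by Eqs. (2.18)-(2.22) can be sketched as follows; this is a minimal illustration in which the layer sizes and the use of a logistic transfer function g are assumptions made only for the example:

```python
import numpy as np

def g(x):
    """Transfer function of Eq. (2.20); a logistic sigmoid is assumed here."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(xi, w, W):
    """Forward pass through a two-layer feedforward network.

    xi : input pattern, shape (N_I,)
    w  : input-to-hidden weights, shape (N_H, N_I)   -- Eq. (2.18)
    W  : hidden-to-output weights, shape (N_O, N_H)  -- Eq. (2.21)
    """
    V = g(w @ xi)      # hidden-layer outputs, Eq. (2.19)
    O = g(W @ V)       # network outputs, Eq. (2.22)
    return V, O

# Example with N_I = 2, N_H = 3, N_O = 1
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 2))
W = rng.normal(size=(1, 3))
V, O = forward(np.array([0.4, -1.2]), w, W)
print(V, O)
```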

Suppose that for an input pattern ξ^µ used during training, the desired output pattern is ζ^µ. We shall use the superscript µ to denote training pattern µ, where µ = 1, 2, ..., N_p and N_p is the number of input-output training pairs.

The objective of training is to minimize the error between the actual output O^µ and the desired output ζ^µ. A commonly used error measure or cost function is

E[w] = \frac{1}{2} \sum_{\mu, i} \left[\zeta_i^\mu - O_i^\mu\right]^2 ,   (2.23)

or, by substituting Eq. (2.22) into Eq. (2.23), we have

E[w] = \frac{1}{2} \sum_{\mu, i} \left[\zeta_i^\mu - g\left(\sum_{j} W_{ij} \, g\left(\sum_{k} w_{jk} \xi_k^\mu\right)\right)\right]^2 .   (2.24)

Back-propagation uses a gradient descent algorithm to learn the weights.
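A short sketch of the sum-of-squares cost of Eq. (2.23), again assuming a logistic transfer function for illustration:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def cost(patterns, targets, w, W):
    """Eq. (2.23): E = 1/2 * sum over mu and i of (zeta_i^mu - O_i^mu)^2."""
    E = 0.0
    for xi, zeta in zip(patterns, targets):
        O = g(W @ g(w @ xi))               # network output, Eq. (2.22)
        E += 0.5 * np.sum((zeta - O) ** 2)
    return E
```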

For the hidden-to-output connections we have

\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = \eta \sum_{\mu} \delta_i^\mu V_j^\mu ,   (2.25)

where η is called the learning rate and we have defined

\delta_i^\mu = g'(h_i^\mu)\left[\zeta_i^\mu - O_i^\mu\right].   (2.26)

For the input-to-hidden connections, we have

\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = \eta \sum_{\mu} \delta_j^\mu \xi_k^\mu ,   (2.27)

where

\delta_j^\mu = g'(h_j^\mu) \sum_{i} W_{ij} \delta_i^\mu .   (2.28)

We see that Eq. (2.25) and Eq. (2.27) have exactly the same form, only with different definitions of the δ's. In general, for a feedforward neural network with an arbitrary number of layers, suppose that layer p receives input from layer q, which can be either a hidden layer or the external input. Then, the gradient descent learning rule for layer p can always be written as follows:

\Delta w_{pq} = \eta \sum_{\mu} \delta_p^\mu V_q^\mu ,   (2.29)

where δ_p^µ represents the error at the output of layer p and V_q^µ is the input to layer p from layer q. If the layer concerned is the final (or top) layer of the network, δ is given by Eq. (2.26), which represents the error between the desired and the actual outputs. If the layer concerned is one of the hidden layers, δ needs to be calculated with a propagating rule such as Eq. (2.28).
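Putting Eqs. (2.25)-(2.29) together, a minimal batch back-propagation sketch for the two-layer network might look as follows; the logistic transfer function, the learning rate value, and all names are assumptions made only for illustration:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(h):
    """Derivative of the logistic function expressed through its output."""
    y = g(h)
    return y * (1.0 - y)

def backprop_epoch(patterns, targets, w, W, eta=0.1):
    """One batch update of w and W following Eqs. (2.25)-(2.28)."""
    dW = np.zeros_like(W)
    dw = np.zeros_like(w)
    for xi, zeta in zip(patterns, targets):
        h_hidden = w @ xi
        V = g(h_hidden)                                     # Eq. (2.19)
        h_out = W @ V
        O = g(h_out)                                        # Eq. (2.22)
        delta_out = g_prime(h_out) * (zeta - O)             # Eq. (2.26)
        delta_hid = g_prime(h_hidden) * (W.T @ delta_out)   # Eq. (2.28)
        dW += eta * np.outer(delta_out, V)                  # Eq. (2.25)
        dw += eta * np.outer(delta_hid, xi)                 # Eq. (2.27)
    return w + dw, W + dW
```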

The most popular non-linear function for a neuron is the sigmoid function, f(x) = 1/(1 + e^{-x}), whose derivative f'(x) = f(x)[1 - f(x)] is particularly convenient when computing the δ's.

2.3.2 Cost-sensitive Back-propagation

In conventional back-propagation, errors made with respect to different patterns are assumed to carry the same cost, as shown in the cost function given by Eq. (2.23). We now write a cost-sensitive cost function as follows [22]:

E[w] = \frac{1}{2} \sum_{\mu, i} \lambda^\mu \left[\zeta_i^\mu - O_i^\mu\right]^2 ,   (2.30)

where λ^µ is a cost-dependent factor. The standard back-propagation situation (2.23) is recovered if we let

\lambda^\mu = 1 \quad \text{for all } \mu.   (2.31)

With Eq. (2.30), we can easily generalize the standard back-propagation (SBP) algorithm to cost-sensitive situations. By going through the same derivations as above, we can show that the cost-sensitive cost function given by Eq. (2.30) is minimized by the same back-propagation algorithm, but with Eq. (2.29) modified as follows:

\Delta w_{pq} = \eta \sum_{\mu} \lambda^\mu \delta_p^\mu V_q^\mu .   (2.32)

Other quantities, such as the δ's and V's, are calculated in the same way as in the standard back-propagation case.

In an iterative implementation, i.e., one in which all weights are updated after each training pattern is presented, the above cost-sensitive back-propagation (CSBP) can be realized by simply replacing the learning rate η of the standard back-propagation case with a cost-sensitive learning rate λ^µ η for each training pattern µ.
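A sketch of the CSBP update in this iterative (per-pattern) form, where the only change relative to standard back-propagation is the per-pattern learning rate λ^µ η; the logistic transfer function and all names are assumptions for illustration:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(h):
    y = g(h)
    return y * (1.0 - y)

def csbp_epoch(patterns, targets, lambdas, w, W, eta=0.1):
    """One iterative epoch of cost-sensitive back-propagation (Eq. (2.32)).

    lambdas[mu] is the cost factor lambda^mu of training pattern mu;
    setting all lambdas to 1 recovers standard back-propagation (Eq. (2.31)).
    """
    for xi, zeta, lam in zip(patterns, targets, lambdas):
        h_hidden = w @ xi
        V = g(h_hidden)
        h_out = W @ V
        O = g(h_out)
        delta_out = g_prime(h_out) * (zeta - O)
        delta_hid = g_prime(h_hidden) * (W.T @ delta_out)
        # Identical to standard back-propagation except for the factor lam:
        W = W + lam * eta * np.outer(delta_out, V)
        w = w + lam * eta * np.outer(delta_hid, xi)
    return w, W
```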

In the CSBP, ‘more important’ pattern classes with larger cost factors (λ) have larger learning rates compared to the ‘less important’ classes with smaller cost factors. Let us consider a case with only two pattern classes.

Suppose that λ(1) = 3λ(2) = 3, i.e., λ(1) = 3 and λ(2) = 1, so that making an error in classifying a pattern of class 1 is three times as costly as making an error in classifying a pattern of class 2. The CSBP requires that the learning rate for class 1 be 3η if the learning rate for class 2 is η. This is roughly equivalent to using the SBP, i.e., the same learning rate for all classes, and presenting each training pattern in class 2 only once to the network, but presenting each training pattern in class 1 three times.

Suppose that there are a total of N_p input-output pattern pairs for training and there are N_cl classes (kinds) of patterns. In particular, there are N_k training patterns for class k, where k = 1, 2, ..., N_cl. In this book, we assume that there is an equal number of training patterns for each class, i.e.,

N_k = N_0 , \quad k = 1, 2, ..., N_{cl},   (2.33)

so that N_p = N_cl N_0.

Since, in the SBP cost function (2.23), the sum of all the coefficients in front of [ζ_i^µ - O_i^µ]^2 is

\sum_{\mu} 1 = N_p ,   (2.35)

it is reasonable to require that the same condition, \sum_{\mu} \lambda^\mu = N_p, be satisfied in the CSBP. This can be achieved by choosing

\lambda^\mu = \frac{N_{cl} \, C_{k(\mu)}}{\sum_{k=1}^{N_{cl}} C_k} ,   (2.36)

where C_k is the cost factor associated with class k and k(µ) denotes the class of pattern µ.

The SBP is recovered if all the C's are the same (thus all λ's are 1).
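A small sketch of Eq. (2.36), showing how per-pattern cost factors are derived from per-class cost factors and that their sum equals N_p when the classes are balanced; the specific class costs and pattern counts below are illustrative only:

```python
import numpy as np

def pattern_lambdas(class_of_pattern, class_costs):
    """Eq. (2.36): lambda^mu = N_cl * C_{k(mu)} / sum_k C_k."""
    class_costs = np.asarray(class_costs, dtype=float)
    n_cl = len(class_costs)
    lam_per_class = n_cl * class_costs / class_costs.sum()
    return lam_per_class[class_of_pattern]

# Two balanced classes, 4 patterns each, with C1 = 0.2 and C2 = 0.8
classes = np.array([0] * 4 + [1] * 4)
lam = pattern_lambdas(classes, [0.2, 0.8])
print(lam)        # 0.4 for class-1 patterns, 1.6 for class-2 patterns
print(lam.sum())  # equals N_p = 8, matching the condition of Eq. (2.35)
```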

2.3.3 Experimental Results

Assume that two classes of patterns are uniformly distributed in two intersecting circles in a plane. We randomly generate 600 training patterns (characterized by their coordinates in the plane) for each of the two classes. These patterns are used to train the neural network under different cost-factor settings. The neural network has an input layer of two neurons, one hidden layer of three neurons with sigmoid transfer functions, and an output layer of one neuron with a sigmoid transfer function. Three different cost-factor settings are used in the simulation study. The cost factors are set as follows:

Case 1. C_1 = C_2 = 0.5: this corresponds to the standard back-propagation (SBP) algorithm.

Case 2. C_1 = 0.2, C_2 = 0.8: this sets a higher cost for class 2 than for class 1.

Case 3. C_1 = 0.8, C_2 = 0.2: this sets a higher cost for class 1 than for class 2.

After the network is trained, we randomly generate another 600 test patterns for each class to evaluate the recognition rate of the neural network.
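The experimental setup can be sketched as follows. The circle centres and radii, the training schedule, and all names are not specified in the text, so the values below are placeholders chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_circle(center, radius, n):
    """Draw n points uniformly from a disc (placeholder geometry)."""
    r = radius * np.sqrt(rng.uniform(size=n))
    theta = rng.uniform(0, 2 * np.pi, size=n)
    return np.column_stack([center[0] + r * np.cos(theta),
                            center[1] + r * np.sin(theta)])

# 600 training patterns per class drawn from two intersecting circles (assumed geometry)
X = np.vstack([sample_circle((0.0, 0.0), 1.0, 600),
               sample_circle((1.2, 0.0), 1.0, 600)])
y = np.hstack([np.zeros(600), np.ones(600)])

# Per-pattern cost factors from Eq. (2.36), e.g. Case 2: C1 = 0.2, C2 = 0.8
C = np.array([0.2, 0.8])
lam = 2 * C[y.astype(int)] / C.sum()

# A 2-3-1 network would then be trained with the csbp_epoch() sketch given earlier,
# and the recognition rate measured on a freshly generated test set of 600 patterns per class.
```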

Fig. 2.7. Two classes are represented by triangles and circles, respectively.
