
2.3 Cost-Sensitive MLP

In many real-world problems, such as financial analysis and medical diagnosis, different prediction errors usually incur different costs. For example, the cost of making an error in predicting the future of a million-dollar investment is higher than that for a thousand-dollar investment. The cost of diagnosing a person with a serious disease as healthy is much greater than that of misdiagnosing a healthy person as ill. Cost-sensitive neural networks (CSNNs) address these important issues [19]. Cost-sensitive classification trees have been studied by Turney [319] and Ting [312].

2.3.1 Standard Back-propagation

The total input to an artificial neuron is

h = \sum_{j=1}^{N_I} w_j V_j ,   (2.16)

where N_I is the number of inputs to the neuron (the dimension of the input vector), {w_1, w_2, ..., w_{N_I}} are the weights or synapses of the neuron, and {V_1, V_2, ..., V_{N_I}} are the individual inputs to the neuron, either from other neurons or from external sources of input.

The neuron then determines its output a according to

a = f(h + b),   (2.17)

where b is the bias of the neuron and f is usually a non-linear function, which will be specified later.
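To make Eqs. (2.16) and (2.17) concrete, the following is a minimal Python sketch of a single neuron; the function and variable names are illustrative rather than from the original text, and the sigmoid is used only as a placeholder for f:

```python
import numpy as np

def sigmoid(x):
    """Logistic function, used here as a placeholder for the non-linearity f."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(w, V, b):
    """Eqs. (2.16)-(2.17): total input h = sum_j w_j V_j, output a = f(h + b)."""
    h = np.dot(w, V)          # Eq. (2.16)
    return sigmoid(h + b)     # Eq. (2.17)

# Example: a neuron with three inputs
w = np.array([0.5, -0.3, 0.8])
V = np.array([1.0, 0.2, -0.5])
print(neuron_output(w, V, b=0.1))
```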

Let us consider a layer of N_H neurons. All neurons receive an input vector (pattern) {ξ_1, ξ_2, ..., ξ_{N_I}}. Neuron j in this layer has weights {w_{j1}, w_{j2}, ..., w_{jN_I}}. Hence, the total input to neuron j is

h_j = \sum_{k=1}^{N_I} w_{jk} \xi_k ,   (2.18)

which produces output

Vj =g(hj) =g(

NI

k=1

wjkξk), (2.19)

where

g(x) = f(x + b).   (2.20)

Now let us connect a second layer of N_O neurons on top of this first layer of N_H neurons to form a feedforward neural network. The weight connecting neuron i (i = 1, 2, ..., N_O) in the second layer and neuron j (j = 1, 2, ..., N_H) in the first layer is W_{ij}. Hence, neuron i in the second layer receives a total input

h_i = \sum_{j=1}^{N_H} W_{ij} V_j = \sum_{j=1}^{N_H} W_{ij} \, g\left(\sum_{k=1}^{N_I} w_{jk} \xi_k\right),   (2.21)

and produces the output

O_i = g(h_i) = g\left(\sum_{j=1}^{N_H} W_{ij} \, g\left(\sum_{k=1}^{N_I} w_{jk} \xi_k\right)\right).   (2.22)
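The forward pass defined by Eqs. (2.18)-(2.22) can be sketched as follows; this is a minimal illustration in which the layer sizes and the use of a logistic transfer function g are assumptions made only for the example:

```python
import numpy as np

def g(x):
    """Transfer function of Eq. (2.20); a logistic sigmoid is assumed here."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(xi, w, W):
    """Forward pass through a two-layer feedforward network.

    xi : input pattern, shape (N_I,)
    w  : input-to-hidden weights, shape (N_H, N_I)   -- Eq. (2.18)
    W  : hidden-to-output weights, shape (N_O, N_H)  -- Eq. (2.21)
    """
    V = g(w @ xi)      # hidden-layer outputs, Eq. (2.19)
    O = g(W @ V)       # network outputs, Eq. (2.22)
    return V, O

# Example with N_I = 2, N_H = 3, N_O = 1
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 2))
W = rng.normal(size=(1, 3))
V, O = forward(np.array([0.4, -1.2]), w, W)
print(V, O)
```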

Suppose that for an input pattern ξ^µ used during training, the desired output pattern is ζ^µ. We shall use the superscript µ to denote training pattern µ, where µ = 1, 2, ..., N_p and N_p is the number of input-output training pairs.

The objective of training is to minimize the error between the actual output O^µ and the desired output ζ^µ. A commonly used error measure or cost function is

E[w] = \frac{1}{2} \sum_{\mu, i} \left[\zeta_i^\mu - O_i^\mu\right]^2 ,   (2.23)

or, by substituting Eq. (2.22) into Eq. (2.23), we have

E[w] = \frac{1}{2} \sum_{\mu, i} \left[\zeta_i^\mu - g\left(\sum_{j} W_{ij} \, g\left(\sum_{k} w_{jk} \xi_k^\mu\right)\right)\right]^2 .   (2.24)

Back-propagation uses a gradient descent algorithm to learn the weights.
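A short sketch of the sum-of-squares cost of Eq. (2.23), again assuming a logistic transfer function for illustration:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def cost(patterns, targets, w, W):
    """Eq. (2.23): E = 1/2 * sum over mu and i of (zeta_i^mu - O_i^mu)^2."""
    E = 0.0
    for xi, zeta in zip(patterns, targets):
        O = g(W @ g(w @ xi))               # network output, Eq. (2.22)
        E += 0.5 * np.sum((zeta - O) ** 2)
    return E
```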

For the hidden-to-output connections we have

\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = \eta \sum_{\mu} \delta_i^\mu V_j^\mu ,   (2.25)

where η is called the learning rate and we have defined

\delta_i^\mu = g'(h_i^\mu)\left[\zeta_i^\mu - O_i^\mu\right].   (2.26)

For the input-to-hidden connections, we have

\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = \eta \sum_{\mu} \delta_j^\mu \xi_k^\mu ,   (2.27)

where

\delta_j^\mu = g'(h_j^\mu) \sum_{i} W_{ij} \delta_i^\mu .   (2.28)

We see that Eq. (2.25) and Eq. (2.27) have exactly the same form, only with different definitions of the δ's. In general, for a feedforward neural network with an arbitrary number of layers, suppose that layer p receives input from layer q, which can be either a hidden layer or the external input. Then, the gradient descent learning rule for layer p can always be written as follows:

\Delta w_{pq} = \eta \sum_{\mu} \delta_p^\mu V_q^\mu ,   (2.29)

where δ_p^µ represents the error at the output of layer p and V_q^µ is the input to layer p from layer q. If the layer concerned is the final (or top) layer of the network, δ is given by Eq. (2.26), which represents the error between the desired and the actual outputs. If the layer concerned is one of the hidden layers, δ needs to be calculated with a propagating rule such as Eq. (2.28).
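Putting Eqs. (2.25)-(2.29) together, a minimal batch back-propagation sketch for the two-layer network might look as follows; the logistic transfer function, the learning rate value, and all names are assumptions made only for illustration:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(h):
    """Derivative of the logistic function expressed through its output."""
    y = g(h)
    return y * (1.0 - y)

def backprop_epoch(patterns, targets, w, W, eta=0.1):
    """One batch update of w and W following Eqs. (2.25)-(2.28)."""
    dW = np.zeros_like(W)
    dw = np.zeros_like(w)
    for xi, zeta in zip(patterns, targets):
        h_hidden = w @ xi
        V = g(h_hidden)                                     # Eq. (2.19)
        h_out = W @ V
        O = g(h_out)                                        # Eq. (2.22)
        delta_out = g_prime(h_out) * (zeta - O)             # Eq. (2.26)
        delta_hid = g_prime(h_hidden) * (W.T @ delta_out)   # Eq. (2.28)
        dW += eta * np.outer(delta_out, V)                  # Eq. (2.25)
        dw += eta * np.outer(delta_hid, xi)                 # Eq. (2.27)
    return w + dw, W + dW
```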

The most popular non-linear function for a neuron is the sigmoid function, f(x) = 1/(1 + e^{-x}), whose derivative f'(x) = f(x)[1 - f(x)] is particularly convenient when computing the δ's.

2.3.2 Cost-sensitive Back-propagation

In conventional back-propagation, errors made with respect to different patterns are assumed to carry the same cost, as shown in the cost function given by Eq. (2.23). We now write a cost-sensitive cost function as follows [22]:

E[w] = \frac{1}{2} \sum_{\mu, i} \lambda^\mu \left[\zeta_i^\mu - O_i^\mu\right]^2 ,   (2.30)

where λ^µ is a cost-dependent factor. The standard back-propagation situation (2.23) is recovered if we let

\lambda^\mu = 1 \quad \text{for all } \mu.   (2.31)

With Eq. (2.30), we can easily generalize the standard back-propagation (SBP) algorithm to cost-sensitive situations. By going through the same derivations as above, we can show that the cost-sensitive cost function given by Eq. (2.30) is minimized by the same back-propagation algorithm, but with Eq. (2.29) modified as follows:

\Delta w_{pq} = \eta \sum_{\mu} \lambda^\mu \delta_p^\mu V_q^\mu .   (2.32)

Other quantities, such as the δ's and V's, are calculated in the same way as in the standard back-propagation case.

In an iterative implementation, i.e., one in which all weights are updated after each training pattern is presented, the above cost-sensitive back-propagation (CSBP) can be realized by simply replacing the learning rate η of the standard back-propagation case with a cost-sensitive learning rate λ^µ η for each training pattern µ.
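A sketch of the CSBP update in this iterative (per-pattern) form, where the only change relative to standard back-propagation is the per-pattern learning rate λ^µ η; the logistic transfer function and all names are assumptions for illustration:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(h):
    y = g(h)
    return y * (1.0 - y)

def csbp_epoch(patterns, targets, lambdas, w, W, eta=0.1):
    """One iterative epoch of cost-sensitive back-propagation (Eq. (2.32)).

    lambdas[mu] is the cost factor lambda^mu of training pattern mu;
    setting all lambdas to 1 recovers standard back-propagation (Eq. (2.31)).
    """
    for xi, zeta, lam in zip(patterns, targets, lambdas):
        h_hidden = w @ xi
        V = g(h_hidden)
        h_out = W @ V
        O = g(h_out)
        delta_out = g_prime(h_out) * (zeta - O)
        delta_hid = g_prime(h_hidden) * (W.T @ delta_out)
        # Identical to standard back-propagation except for the factor lam:
        W = W + lam * eta * np.outer(delta_out, V)
        w = w + lam * eta * np.outer(delta_hid, xi)
    return w, W
```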

In the CSBP, ‘more important’ pattern classes with larger cost factors (λ) have larger learning rates compared to the ‘less important’ classes with smaller cost factors. Let us consider a case with only two pattern classes.

Suppose that λ(1) = 3λ(2) = 3, i.e., λ(1) = 3 and λ(2) = 1, so that making an error in classifying a pattern of class 1 is three times as costly as making an error in classifying a pattern of class 2. The CSBP requires that the learning rate for class 1 be 3η if the learning rate for class 2 is η. This is roughly equivalent to using the SBP, i.e., the same learning rate for all classes, and presenting each training pattern in class 2 only once to the network, but presenting each training pattern in class 1 three times.

Suppose that there are a total of N_p input-output pattern pairs for training and there are N_cl classes (kinds) of patterns. In particular, there are N_k training patterns for class k, where k = 1, 2, ..., N_cl. In this book, we assume that there is an equal number of training patterns for each class, i.e.,

N_k = N_0 , \quad k = 1, 2, ..., N_{cl},   (2.33)

so that N_p = N_cl N_0.

Since, in the SBP cost function (2.23), the sum of all the coefficients in front of [ζ_i^µ - O_i^µ]^2 is

\sum_{\mu} 1 = N_p ,   (2.35)

it is reasonable to require that the same condition, \sum_{\mu} \lambda^\mu = N_p, be satisfied in the CSBP. This can be achieved by choosing

\lambda^\mu = \frac{N_{cl} \, C_{k(\mu)}}{\sum_{k=1}^{N_{cl}} C_k} ,   (2.36)

where C_k is the cost factor associated with class k and k(µ) denotes the class of pattern µ.

The SBP is recovered if all the C's are the same (thus all λ's are 1).
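A small sketch of Eq. (2.36), showing how per-pattern cost factors are derived from per-class cost factors and that their sum equals N_p when the classes are balanced; the specific class costs and pattern counts below are illustrative only:

```python
import numpy as np

def pattern_lambdas(class_of_pattern, class_costs):
    """Eq. (2.36): lambda^mu = N_cl * C_{k(mu)} / sum_k C_k."""
    class_costs = np.asarray(class_costs, dtype=float)
    n_cl = len(class_costs)
    lam_per_class = n_cl * class_costs / class_costs.sum()
    return lam_per_class[class_of_pattern]

# Two balanced classes, 4 patterns each, with C1 = 0.2 and C2 = 0.8
classes = np.array([0] * 4 + [1] * 4)
lam = pattern_lambdas(classes, [0.2, 0.8])
print(lam)        # 0.4 for class-1 patterns, 1.6 for class-2 patterns
print(lam.sum())  # equals N_p = 8, matching the condition of Eq. (2.35)
```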

2.3.3 Experimental Results

Assume that two classes of patterns are uniformly distributed in two intersecting circles in a plane. We randomly generate 600 training patterns (characterized by their coordinates in the plane) for each of the two classes. These patterns are used to train the neural network under different cost-factor settings. The neural network has an input layer of two neurons, one hidden layer of three neurons with sigmoid transfer functions, and an output layer of one neuron with a sigmoid transfer function. Three different cost-factor settings are used in the simulation study. The cost factors are set as follows:

Case 1. C_1 = C_2 = 0.5: this corresponds to the standard back-propagation (SBP) algorithm.

Case 2. C_1 = 0.2, C_2 = 0.8: this sets a higher cost for class 2 than for class 1.

Case 3. C_1 = 0.8, C_2 = 0.2: this sets a higher cost for class 1 than for class 2.

After the network is trained, we randomly generate another 600 test patterns for each class to evaluate the recognition rate of the neural network.
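The experimental setup can be sketched as follows. The circle centres and radii, the training schedule, and all names are not specified in the text, so the values below are placeholders chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_circle(center, radius, n):
    """Draw n points uniformly from a disc (placeholder geometry)."""
    r = radius * np.sqrt(rng.uniform(size=n))
    theta = rng.uniform(0, 2 * np.pi, size=n)
    return np.column_stack([center[0] + r * np.cos(theta),
                            center[1] + r * np.sin(theta)])

# 600 training patterns per class drawn from two intersecting circles (assumed geometry)
X = np.vstack([sample_circle((0.0, 0.0), 1.0, 600),
               sample_circle((1.2, 0.0), 1.0, 600)])
y = np.hstack([np.zeros(600), np.ones(600)])

# Per-pattern cost factors from Eq. (2.36), e.g. Case 2: C1 = 0.2, C2 = 0.8
C = np.array([0.2, 0.8])
lam = 2 * C[y.astype(int)] / C.sum()

# A 2-3-1 network would then be trained with the csbp_epoch() sketch given earlier,
# and the recognition rate measured on a freshly generated test set of 600 patterns per class.
```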

Fig. 2.7. Two classes are represented by triangles and circles, respectively.
