
5 Supervised Learning for Operational Support

5.1 Feedforward Neural Networks

5.1.1 FFNN Architecture

There are already a large number of textbooks on FFNNs, so here they are introduced less technically. Simply speaking, a FFNN is an algorithm, or a piece of computer software, that can learn to identify the complex nonlinear relationships between multiple inputs and outputs. The learning process has a number of characteristics.


Firstly, a FFNN does not need a fundamental domain problem model and is easy to set up and train. This is different from conventional statistical methods, which usually require the user to specify the functions over which the data is to be regressed. In order to specify these functions, the user has to know the forms of the equations governing the correlations between the data. If the functions are incorrectly specified, the data will not be satisfactorily regressed. Furthermore, considerable mathematical and numerical experience is required to obtain convergence if these equations are highly nonlinear. A FFNN does not require the forms of the correlations to be specified, nor any mathematical or numerical expertise. Secondly, the data examples used for training are allowed to be imprecise or noisy, and in some cases even incomplete. Thirdly, it mimics the human learning process: learning from examples by repeatedly updating its performance.

A FFNN consists of a number of processing elements called neurons. These neurons are divided into layers. Figure 5.1 shows a three layered FFNN architecture comprising an input, a hidden and an output layer. Typically the input layer nodes correspond to input variables and the output layer nodes to output variables. Hidden neurons do not have physical meanings. Neurons in adjacent layers are fully connected by branches.

Figure 5.1 A three layer feedforward neural network (input, hidden and output layers).

Each neuron in the hidden and output layer is described by a transfer function (or activation function). Usually a sigmoidal function is used,

f(z) = \frac{1}{1 + e^{-az}}    (5.1)

f(z) transforms an input z to the neuron into the range [0.0, 1.0], as shown in Figure 5.2(a). The parameter a in Equation 5.1 is used to change the shape of the sigmoidal function. Some other activation functions can also be used, as shown in Figure 5.2 (b) and (c). However, for a specific FFNN structure, the neurons in the hidden and output layers usually all use the same transfer function.

Figure 5.2 Activation functions: (a) sigmoid, (b) ramp, (c) step.
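
Purely as an illustration, the three activation functions of Figure 5.2 might be written as follows (a Python/NumPy sketch; the function names and the default shape parameter a are our own choices, not the book's):

import numpy as np

def sigmoid(z, a=1.0):
    # Equation 5.1: maps any z into the range (0, 1); a controls the steepness.
    return 1.0 / (1.0 + np.exp(-a * z))

def ramp(z):
    # Piecewise linear activation as in Figure 5.2(b), saturating at 0 and 1.
    return np.clip(z, 0.0, 1.0)

def step(z):
    # Hard threshold activation as in Figure 5.2(c).
    return np.where(z >= 0.0, 1.0, 0.0)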

Each connection branch is described by a weight representing the strength of the connection between the two linked nodes. The so-called learning or training process is the procedure of adjusting these weights. A bias neuron that supplies an invariant output is connected to each neuron in the hidden and output layers. The bias provides a threshold for activation of the neuron, and is essential for classifying the network's input patterns into various subspaces.
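
For example, the output of a single hidden or output neuron, with the bias neuron treated as an extra input that is always 1, could be computed as in the small sketch below (illustrative Python only; the names are not the book's):

import numpy as np

def neuron_output(inputs, weights, bias_weight, a=1.0):
    # Weighted sum of the previous layer's outputs plus the bias contribution,
    # transformed by the sigmoid of Equation 5.1; the bias neuron always outputs 1.
    z = np.dot(weights, inputs) + bias_weight * 1.0
    return 1.0 / (1.0 + np.exp(-a * z))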

5.1.2 FFNN Training Algorithm

Given some arbitrary values for all the connection weights, for a specific data pattern, the FFNN makes use of the weights and input values to predict the outputs.

The training is intended to gradually update the connection weights to minimise the mean square error E,

E = \sum_{m=1}^{M} \sum_{i=1}^{N} \left( t_i^{(m)} - y_i^{(m)} \right)^2    (5.2)

where

M - number of training data patterns
N - number of neurons in the output layer
t_i^{(m)} - the target value of the ith output neuron for the given mth data pattern
y_i^{(m)} - the prediction for the ith output neuron given the mth data pattern

The process involves a forward path calculation to predict the outputs and a backward path calculation to update the weights. For a neuron in the input layer, the output is equal to the input, so there is in fact no activation function for an input neuron. A neuron in the hidden or output layers receives the outputs of the nodes in the preceding layer and takes their weighted sum as its input. The weighted sum is then transformed by the activation function to give an output. The outputs of the output layer neurons are compared with the target values using Equation 5.2 to calculate an error. This error is then propagated backwards to update the weights.
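
A minimal sketch of this forward path and of the error of Equation 5.2 is given below, for a single hidden layer with sigmoid activations; the weight matrices, bias vectors and function names are illustrative assumptions, not the book's notation.

import numpy as np

def sigmoid(z, a=1.0):
    # Equation 5.1.
    return 1.0 / (1.0 + np.exp(-a * z))

def forward(x, W_hid, b_hid, W_out, b_out):
    # Input neurons simply pass their values on; hidden and output neurons take
    # the weighted sum of the previous layer plus a bias and apply the activation.
    h = sigmoid(W_hid @ x + b_hid)     # hidden layer outputs
    y = sigmoid(W_out @ h + b_out)     # output layer predictions
    return y

def total_error(X, T, W_hid, b_hid, W_out, b_out):
    # Equation 5.2: squared differences between targets and predictions,
    # summed over all M training patterns and N output neurons.
    return sum(np.sum((t - forward(x, W_hid, b_hid, W_out, b_out)) ** 2)
               for x, t in zip(X, T))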

Given the mth data pattern, the weight updating in a supervised learning algorithm follows the formulation,

w_{ji}^{(m)} = w_{ji}^{(m-1)} + \Delta w_{ji}^{(m)}    (5.3)

where

w_{ji}^{(m)} - the weight of the connection between the jth neuron of the upper layer and the ith neuron of the lower layer, in the mth learning iteration
w_{ji}^{(m-1)} - the same weight in the (m-1)th learning iteration
\Delta w_{ji}^{(m)} - the weight change

In the backpropagation learning approach, the weight change is calculated by

\Delta w_{ji}^{(m)} = \eta \, \delta_j^{(m)} o_i^{(m)} + \alpha \, \Delta w_{ji}^{(m-1)}    (5.4)

where

\eta - learning rate, providing the step size during gradient descent. Generally, to assure rapid convergence, larger step sizes which do not lead to oscillation are used
\alpha - coefficient of the momentum term, 0 < \alpha < 1
o_i^{(m)} - the output value of the ith neuron of the previous layer, in the mth iteration
\delta_j^{(m)} - the error signal of the jth neuron in the mth learning iteration

If j belongs to the output layer,

\delta_j^{(m)} = \left( t_j^{(m)} - y_j^{(m)} \right) f'\!\left( \sum_i w_{ji}^{(m)} o_i^{(m)} + w_{j0}^{(m)} \right)    (5.5)

and if j belongs to the hidden layers,

\delta_j^{(m)} = f'\!\left( \sum_i w_{ji}^{(m)} o_i^{(m)} + w_{j0}^{(m)} \right) \sum_k \delta_k^{(m)} w_{kj}^{(m)}    (5.6)

where f' is the derivative of the transfer function.

Therefore, the error backpropagation approach for adjusting weights computes an error signal for each neuron in the output and hidden layers using Equations 5.5 and 5.6, and recursively updates the weights of all the layers using Equation 5.4, starting from the output layer and working backwards to the input layer.
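
To make the recursion concrete, here is a minimal sketch of one backpropagation pass for a single data pattern and a single hidden layer, following Equations 5.3 to 5.6; the variable names, array shapes and the separate handling of the bias weights are our own assumptions rather than the book's notation.

import numpy as np

def sigmoid(z):
    # Equation 5.1 with shape parameter a = 1.
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hid, b_hid, W_out, b_out,
                  dW_hid_prev, dW_out_prev, eta=0.35, alpha=0.7):
    # Forward path.
    h = sigmoid(W_hid @ x + b_hid)          # hidden layer outputs
    y = sigmoid(W_out @ h + b_out)          # output layer predictions

    # Error signals; for the sigmoid, f'(net) = f(net) * (1 - f(net)).
    delta_out = (t - y) * y * (1.0 - y)                  # Equation 5.5 (output layer)
    delta_hid = h * (1.0 - h) * (W_out.T @ delta_out)    # Equation 5.6 (hidden layer)

    # Weight changes with momentum (Equation 5.4) and the update itself (Equation 5.3).
    dW_out = eta * np.outer(delta_out, h) + alpha * dW_out_prev
    dW_hid = eta * np.outer(delta_hid, x) + alpha * dW_hid_prev
    W_out = W_out + dW_out
    W_hid = W_hid + dW_hid
    # Bias weights are updated in the same way (momentum omitted here for brevity).
    b_out = b_out + eta * delta_out
    b_hid = b_hid + eta * delta_hid
    return W_hid, b_hid, W_out, b_out, dW_hid, dW_out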

Like all first order methods, backpropagation learning is a significantly less efficient optimisation method than second order optimisation methods such as the conjugate gradient or quasi-Newton algorithms. Leonard and Kramer [138] studied the use of a conjugate gradient approach in order to speed up convergence. Alternative algorithms have also been proposed by Brent [139], Chen and Billings [140] and Peel et al. [141]. An attractive alternative to the backpropagation learning algorithm is the quasi-Newton method [142, 143].

Since their development, multilayer neural networks have shown surprisingly good performance in solving many complex problems. There is still a lack of theoretical explanation, but an interesting theorem sheds some light on the capabilities of multi-layer perceptrons [144]. This theorem states that any continuous function of N variables can be computed using linear summations and nonlinear but continuously increasing functions of only one variable. It effectively states that a three layer perceptron with N(2N+1) nodes using continuously increasing nonlinearities can compute any continuous function of N variables.
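
One common formal statement of this result, the Kolmogorov superposition theorem, is given below; the symbols are our own notation and slightly more general than the node-count reading above. Every continuous function f of N variables on [0, 1]^N can be written as

f(x_1, \dots, x_N) = \sum_{q=1}^{2N+1} \Phi_q \!\left( \sum_{p=1}^{N} \psi_{q,p}(x_p) \right)

where the \Phi_q and \psi_{q,p} are continuous functions of a single variable.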

5.1.3 Parameter Selection and Training Techniques

From the above discussion, it is clear that FFNN training requires the following initial decisions: network topology, i.e., the number of hidden layers and hidden neurons; learning rate; momentum factor; error tolerance or number of iterations; and initial values of the weights. The learning rate \eta and momentum factor \alpha are not very difficult to set. We can start with some reasonable values, e.g., \eta = 0.35 and \alpha = 0.7, and then find the most appropriate values during training. The error tolerance clearly depends on the problem to be solved. Understanding how these parameters affect the learning performance, which is discussed next, is useful in setting the right values. Initial values of the weights can be generated by a random number generator. Therefore the discussion here will focus on network topology and then some other important issues in learning.
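
As a usage illustration, the initial choices mentioned above might be set up as follows (the topology sizes and the small random range for the weights are our own example values, not recommendations from the book):

import numpy as np

rng = np.random.default_rng()

n_in, n_hid, n_out = 4, 5, 2      # example network topology
eta, alpha = 0.35, 0.7            # learning rate and momentum factor
error_tolerance = 1e-3            # problem-dependent stopping criterion
max_iterations = 10000

# Initial weights and biases drawn from a small random range.
W_hid = rng.uniform(-0.5, 0.5, size=(n_hid, n_in))
b_hid = rng.uniform(-0.5, 0.5, size=n_hid)
W_out = rng.uniform(-0.5, 0.5, size=(n_out, n_hid))
b_out = rng.uniform(-0.5, 0.5, size=n_out)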

5.1.3.1 Network Topology

It is generally accepted that only one hidden layer is necessary in a network using a sigmoid activation function, and that no more than two are necessary if a step activation function is used.

There are no generally available methods to decide how many hidden neurons are required in a three layered network. The number of hidden neurons depends on the nonlinearity of the problem and the error tolerance. Empirically, the number of neurons in the hidden layer is of the same order as the number of neurons in the input and output layers. The number of hidden neurons must be large enough to form a decision region as complex as required by a given problem; too few hidden neurons hinder the learning process and may not be able to achieve the required accuracy. However, the number of hidden neurons must not be so large that the many weights required cannot be reliably estimated from the available training data patterns. An unnecessarily large hidden layer can lead to poor generality. A practical method is to start with a small number of neurons and gradually increase the number.

5.1.3.2 Local Minima

Chemical process models are multidimensional with peaks and valleys [145], which can trap the gradient descent process before it reaches the global minimum. There are several methods of combating the problem of local minima [146, 147]. The momentum factor \alpha, which tends to keep the weight changes moving in the same direction, allows the algorithm to slip over small minima. Another approach is to restart the learning with a different set of initial weights if the network keeps oscillating around a set of weights with no improvement in the error. Sometimes adjusting the shape of the activation function (e.g., through adjusting the constant a in Equation 5.1) can affect the network's susceptibility to local minima. Some newer optimisation approaches applied to multilayer neural networks, such as simulated annealing [147], have proved able to address local minima effectively. In contrast to the comments made by Crowe and Vassiliadis [145] and Chitra [147], Knight [146] thought that a FFNN rarely slips into local minima. Nevertheless, the problem should be handled with care.


5.1.3.3 Overfitting or Over-parameterisation

Overfitting occurs when the network learns the classification of specific training points but fails to capture the relative probability densities of the classes [148]. This can be caused by two situations: (1) an oversized network, e.g., due to the inclusion of irrelevant inputs in the network structure or too many hidden layers or neurons; and (2) an insufficient number of training data patterns.

5.1.3.4 Generality

Overfitting during training deteriorates the generality of the network.

5.2 Variable Selection and Feature Extraction for
