

DATA MINING TECHNIQUES

2.3 Artificial Neural Network

Figure 2.1 Data mining tasks.

Figure 2.2 Data mining techniques.

An artificial neural network (ANN) is a well-known, powerful, and robust classification technique that has been used to approximate real-valued, discrete-valued, and vector-valued functions from examples. ANNs have been applied in many areas, such as interpreting visual scenes, speech recognition, and learning robot control strategies. An ANN simulates the biological nervous system in the human brain, which is composed of a large number of highly interconnected processing units (neurons) working together to produce our feelings and reactions. ANNs, like people, learn by example.

The learning process in a human brain involves adjustments to the synaptic connections between neurons. Similarly, the learning process of an ANN involves adjustments to the node weights. Figure 2.3 presents a simple neuron unit, called the perceptron. The perceptron input x is a vector of real-valued inputs, and w is the weight vector, whose values are determined by training. The perceptron computes a linear combination of the input vector x as follows (Eq. 2.1):

o(x) = 1 if w0 + w1x1 + … + wnxn > 0, and −1 otherwise   (2.1)

Figure 2.3 The perceptron.

Notice that wi determines the contribution of the input component xi to the perceptron output. Also, in order for the perceptron to output a 1, the weighted combination of the inputs w1x1 + … + wnxn must be greater than the threshold −w0.

Training the perceptron involves choosing values for the weights w0, w1, …, wn. Initially, random weight values are given to the perceptron. The perceptron is then applied to each training example, and its weights are updated whenever an example is misclassified. This process is repeated until all training examples are correctly classified. The weights are updated according to the following rule (Eq. 2.2):

wi ← wi + η(t − o)xi   (2.2)

where η is a learning constant, o is the output computed by the perceptron, and t is the target output for the current training example.
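As an illustration, the perceptron of Eqs. 2.1 and 2.2 can be sketched in Python. The toy dataset, learning constant, and epoch limit below are illustrative assumptions, not part of the original method:

```python
import numpy as np

def perceptron_output(w, x):
    """Eq. 2.1: output 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    return 1 if w[0] + np.dot(w[1:], x) > 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply the update rule of Eq. 2.2, wi <- wi + eta*(t - o)*xi,
    repeatedly until every training example is correctly classified."""
    n = len(examples[0][0])
    w = np.zeros(n + 1)                        # w[0] is the bias weight w0
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = perceptron_output(w, x)
            if o != t:                         # update only on mistakes
                w[0] += eta * (t - o)          # the input x0 is implicitly 1
                w[1:] += eta * (t - o) * np.asarray(x, dtype=float)
                errors += 1
        if errors == 0:                        # all examples classified
            break
    return w

# Hypothetical linearly separable data (AND function, targets in {-1, +1}).
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w = train_perceptron(data)
```

Because the toy data is linearly separable, the loop terminates with all four examples classified correctly.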

The computational power of a single perceptron is limited to linear decisions. However, perceptrons can be used as building blocks to compose powerful multi-layer networks, in which case a more complicated update rule is needed to train the network weights. In this work, we employ an artificial neural network of two layers, each composed of three building blocks (see Figure 2.4), and we use the back propagation algorithm to learn the weights. The back propagation algorithm attempts to minimize the squared error function.

Figure 2.4 Artificial neural network.

Figure 2.5 The design of ANN used in our implementation.

A typical training example in WWW prediction is 〈[kt−τ+1, …, kt−1, kt]T, d〉, where [kt−τ+1, …, kt−1, kt]T is the input to the ANN and d is the target web page. Notice that the input units of the ANN in Figure 2.5 are the τ pages the user has most recently visited, where k is a web page id. The output of the network is a boolean value, not a probability. We will see later how to approximate the probability of the output by fitting a sigmoid function to the ANN output. The approximated probabilistic output becomes o′ = f(o(I)) = pt+1, where I is an input session and pt+1 = p(d|kt−τ+1, …, kt). We choose the sigmoid function (Eq. 2.3) as the transfer function so that the ANN can handle non-linearly separable datasets [Mitchell, 1997]:

O = σ(W · I), where σ(y) = 1/(1 + e^(−y))   (2.3)

Notice that in our ANN design (Figure 2.5), we use the sigmoid transfer function of Eq. 2.3 in each building block. In Eq. 2.3, I is the input to the network, O is the output of the network, W is the matrix of weights, and σ is the sigmoid function.
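A minimal sketch of the sigmoid transfer function of Eq. 2.3, together with the layer computation O = σ(W · I); the weight matrix used in any call is an illustrative assumption:

```python
import numpy as np

def sigmoid(y):
    """Eq. 2.3 transfer function: sigma(y) = 1 / (1 + e^(-y)).
    Squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

def layer_output(W, I):
    """Output of one sigmoid building block: O = sigma(W . I)."""
    return sigmoid(W @ I)
```

The sigmoid is smooth and differentiable everywhere, which is what allows the gradient-based weight updates described next.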

We implement the back propagation algorithm for training the weights. The back propagation algorithm employs gradient descent to attempt to minimize the squared error between the network output values and the target values of these outputs.

The sum of the error over all of the network output units is defined in Eq. 2.4:

E(w) = ½ Σk∈D Σi∈outputs (tik − oik)²   (2.4)

In Eq. 2.4, outputs is the set of output units in the network, D is the training set, and tik and oik are the target value and the computed output value associated with the ith output unit and training example k. A specific weight wji in the network is updated for each training example as in Eq. 2.5:

Δwji = −η ∂E/∂wji   (2.5)

where η is the learning rate and wji is the weight associated with the ith input to network unit j (for details see [Mitchell, 1997]). As we can see from Eq. 2.5, the search direction Δw is computed using gradient descent, which guarantees convergence only toward a local minimum and can be slow. To mitigate this, we add a momentum term to the weight update rule, so that the update direction Δwji(n) depends partially on the update direction of the previous iteration, Δwji(n − 1). The new weight update rule is shown in Eq. 2.6:

Δwji(n) = −η ∂E/∂wji + αΔwji(n − 1)   (2.6)

where n is the current iteration and α is the momentum constant. Notice that in Eq. 2.6 the effective step size can be larger than in Eq. 2.5, which contributes to smooth convergence of the search in regions where the gradient is unchanging [Mitchell, 1997].
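The momentum update of Eq. 2.6 can be sketched as follows. The one-dimensional error surface E(w) = w², the learning rate, and the momentum constant are illustrative assumptions chosen only to show the rule in action:

```python
def momentum_update(w, grad, prev_delta, eta=0.05, alpha=0.9):
    """Eq. 2.6: delta_w(n) = -eta * dE/dw + alpha * delta_w(n-1).
    Returns the new weight and the update just taken, which becomes
    prev_delta on the next iteration."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

# Illustrative 1-D error surface E(w) = w^2, whose gradient is 2w.
w, delta = 1.0, 0.0
for _ in range(200):
    w, delta = momentum_update(w, 2.0 * w, delta)
```

In regions where the gradient keeps the same sign, successive deltas reinforce one another, so the effective step grows beyond the plain gradient step of Eq. 2.5.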

In our implementation, we set the step size η dynamically based on the distribution of the classes in the dataset.

Specifically, we use large step sizes when updating on training examples that belong to under-represented classes, and small step sizes otherwise. This is because when the class distribution of the dataset is highly skewed (e.g., a dataset might have 5% positive examples and 95% negative examples), the network weights converge toward the examples of the majority class, which causes slow convergence for the minority class. Furthermore, we adjust the learning rate slightly by applying the momentum constant, Eq. 2.6, to speed up the convergence of the network [Mitchell, 1997].
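One way to realize the class-dependent step size described above is to scale a base learning rate inversely with class frequency. The scaling scheme below is an illustrative assumption, not the authors' exact formula:

```python
from collections import Counter

def class_step_sizes(labels, base_eta=0.1):
    """Assign each class a step size inversely proportional to its
    frequency, so examples from rare classes receive larger updates
    (hypothetical scheme consistent with the description above)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: base_eta * total / (len(counts) * n) for c, n in counts.items()}

# Hypothetical skewed dataset: 5% positive, 95% negative examples.
labels = ["pos"] * 5 + ["neg"] * 95
etas = class_step_sizes(labels)
```

Here the rare positive class gets a step size roughly nineteen times larger than the majority negative class, counteracting the pull of the majority examples on the weights.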

2.4 Support Vector