
Training Algorithms


As mentioned, the weights for each neuron come from a training algorithm. There are many such algorithms, but the most common are:

• Back Propagation

• QuickProp

• RProp

All of these algorithms find optimal weights for each neuron. They do so through iterations, also known as epochs. For each epoch, a training algorithm goes through the entire Neural Network and compares its output against what is expected. At this point, it learns from past miscalculations.

These algorithms have one thing in common: they are trying to find the optimal solution in a convex error surface. You can think of a convex error surface as a bowl with a minimum value in it. Imagine that you are at the top of a hill and want to make it to a valley, but the valley is full of trees. You can’t see much in front of you, but you know that you want to get to the valley. You would do so by proceeding based on local inputs and guessing where you want to go next. This is known as the Gradient Descent algorithm (i.e., determining minimum error by walking down a valley), and it is illustrated in Figure 7-11. The training algorithms do the same thing; they are looking to minimize error by using local information.
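To make the idea concrete, here is a minimal gradient descent sketch in plain Ruby (illustrative only, not the book’s code): it walks down the one-dimensional error surface f(x) = x², using nothing but the local slope.

# Minimal gradient descent sketch (illustrative, not from the book).
# Minimizes f(x) = x**2, whose derivative is 2 * x, using only local slope.
learning_rate = 0.1
x = 5.0 # an arbitrary starting point at the top of the hill

100.times do
  slope = 2 * x              # local information: the derivative at x
  x -= learning_rate * slope # step downhill, proportional to the slope
end

puts x # ends up very close to 0.0, the bottom of the bowl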

Figure 7-11. Gradient Descent algorithm in a nutshell

The delta rule

While we could solve a massive system of equations directly, it is faster to iterate.

Instead of trying to calculate the derivative of the error function with respect to the weight, we calculate a change in weight for each neuron’s weights. This is known as the delta rule, and it is as follows:

$\Delta w_{ji} = \alpha \, (t_j - \phi(h_j)) \, \phi'(h_j) \, x_i$

This states that the change in weight for the neuron j’s weight number i is:

alpha * (expected - calculated) * derivative_of_calculated * input_at_i

alpha is the learning rate and is a small constant. This simple update rule is what powers the Back Propagation algorithm, which is the general case of the delta rule.
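As an illustrative sketch (assumed names and values, not the book’s code), a single delta-rule update for one neuron might look like this in Ruby, assuming a sigmoid activation function phi:

# Illustrative delta-rule update for one neuron (assumed names).
# phi is the sigmoid activation; its derivative is phi(h) * (1 - phi(h)).
phi = ->(h) { 1.0 / (1.0 + Math.exp(-h)) }

alpha    = 0.1               # learning rate: a small constant
inputs   = [0.5, -0.2, 0.8]
weights  = [0.1, 0.4, -0.3]
expected = 1.0               # t_j, the target output

h          = inputs.zip(weights).sum { |x, w| x * w }
calculated = phi.(h)
derivative = calculated * (1 - calculated)

# change = alpha * (expected - calculated) * derivative_of_calculated * input_at_i
weights = weights.each_with_index.map do |w, i|
  w + alpha * (expected - calculated) * derivative * inputs[i]
end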

Back Propagation

Back Propagation is the simplest of the three algorithms that determine the weight of a neuron. You define error as (expected − actual)², where expected is the expected output and actual is the calculated number from the neurons. We want to find where the derivative of that error equals 0, which is the minimum:

$\Delta w_{ji}(t) = \alpha \, (t_j - \phi(h_j)) \, \phi'(h_j) \, x_i + \epsilon \, \Delta w_{ji}(t-1)$

ε is the momentum factor and propels previous weight changes into our current weight change, whereas α is the learning rate.
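A minimal sketch of that update in Ruby (assumed names; it continues the variables from the delta-rule sketch above) adds the momentum term to each weight change:

# Illustrative Back Propagation update with momentum (assumed names;
# continues the delta-rule sketch above).
epsilon = 0.9 # momentum factor
previous_changes = Array.new(weights.size, 0.0) # last epoch's changes

changes = weights.each_with_index.map do |w, i|
  alpha * (expected - calculated) * derivative * inputs[i] +
    epsilon * previous_changes[i]
end

weights = weights.zip(changes).map { |w, dw| w + dw }
previous_changes = changes # carried into the next epoch's update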

Back Propagation has the disadvantage of taking many epochs to calculate. Up until 1988, researchers were struggling to train simple Neural Networks. Their research on how to improve this led to a whole new algorithm called QuickProp.

QuickProp

Scott Fahlman introduced the QuickProp algorithm after he studied how to improve Back Propagation. He asserted that Back Propagation took too long to converge to a solution. He proposed that we instead take the biggest steps without overstepping the solution.

Fahlman determined that there are two ways to improve Back Propagation: making the momentum and learning rate dynamic, and making use of a second derivative of the error with respect to each weight. In the first case, you could better optimize for each weight, and in the second case, you could utilize Newton’s method of approximating functions.

With QuickProp, the main difference from Back Propagation is that you keep a copy of the error derivative computed during the previous epoch, along with the difference between the current and previous values of this weight.

To calculate a weight change at time t, use the following function:

$\Delta w(t) = \frac{S(t)}{S(t-1) - S(t)} \, \Delta w(t-1)$

Here, S(t) is the error derivative with respect to the weight at epoch t; the ratio comes from applying Newton’s method to estimate where that derivative crosses zero.

This carries the risk of changing the weights too much, so there is a new parameter for maximum growth. No weight step is allowed to be greater in magnitude than the maximum growth rate multiplied by the previous step for that weight.
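Here is an illustrative QuickProp step for a single weight in Ruby (assumed names and values, not the book’s code), including the maximum-growth clamp:

# Illustrative QuickProp step for one weight (assumed names, not from the book).
max_growth = 1.75  # Fahlman's commonly cited default for the growth limit

weight    = 0.4    # current value of this weight
prev_step = -0.05  # the previous change made to this weight
s_prev    = 0.30   # error derivative dE/dw from the previous epoch
s_curr    = 0.10   # error derivative dE/dw from the current epoch

# delta_w(t) = S(t) / (S(t-1) - S(t)) * delta_w(t-1)
# (a real implementation would guard against s_prev == s_curr)
step = (s_curr / (s_prev - s_curr)) * prev_step

# No step may exceed max_growth times the previous step's magnitude
limit = max_growth * prev_step.abs
step = step.clamp(-limit, limit)

weight += step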

RProp

RProp is the most used algorithm because it converges fast. It was introduced by Martin Riedmiller in the 1990s and has had some improvements since then. It converges on a solution quickly due to its insight that the algorithm can update the weights many times through an epoch. Instead of calculating weight changes based on a formula, it uses only the sign of the gradient for each change, as well as an increase factor and a decrease factor.

To see what this algorithm looks like in code, we need to define a few constants (or defaults). These are a way to make sure the algorithm doesn’t operate forever or become volatile. These defaults were taken from the FANN library.

The basic algorithm was easier to explain in Ruby than by writing out the partial derivatives.

For ease of reading, note that I am not calculating the error gradients (i.e., the change in error with respect to that specific weight term).

This code gives you an idea of how the RProp algorithm works using just pure Ruby code:

# inputs and neurons are counts; current_gradient[i][j] is assumed to be
# computed elsewhere (the error-gradient calculation is omitted here).
# RProp parameter defaults taken from the FANN library:
delta_zero      = 0.1
increase_factor = 1.2
decrease_factor = 0.5
delta_max       = 50.0
delta_min       = 0.0
max_epoch       = 100

deltas        = Array.new(inputs) { Array.new(neurons) { delta_zero } }
last_gradient = Array.new(inputs) { Array.new(neurons) { 0.0 } }

# sign returns -1, 0, or 1 depending on the sign of x
sign = ->(x) { x <=> 0 }

weights = inputs.times.map { rand(-1.0..1.0) }

1.upto(max_epoch) do
  inputs.times do |i|
    neurons.times do |j|
      # Current gradient is derived from the change of each value at each layer
      gradient_momentum = last_gradient[i][j] * current_gradient[i][j]

      if gradient_momentum > 0
        # Same sign as last epoch: accelerate, capped at delta_max
        deltas[i][j] = [deltas[i][j] * increase_factor, delta_max].min
        change_weight = -sign.(current_gradient[i][j]) * deltas[i][j]
        weights[i] += change_weight
        last_gradient[i][j] = current_gradient[i][j]
      elsif gradient_momentum < 0
        # Sign flipped: the last step overshot, so back off, floored at delta_min
        deltas[i][j] = [deltas[i][j] * decrease_factor, delta_min].max
        last_gradient[i][j] = 0
      else
        # No usable history: take a plain step in the downhill direction
        change_weight = -sign.(current_gradient[i][j]) * deltas[i][j]
        weights[i] += change_weight
        last_gradient[i][j] = current_gradient[i][j]
      end
    end
  end
end

These are the fundamentals you need to understand to be able to build a Neural Network. Next, we’ll talk about how to do so, and what decisions we must make to build an effective one.

