
Part I

Algorithms for the analysis of sleep signals


Chapter 3

Artificial neural networks - an introduction

3.1 Abstract

In the next chapter a novel method for analyzing long-term respiratory recordings will be presented, namely an algorithm built to extract the moments of onset of inspiration and expiration from long-duration respiratory signals. The algorithm is based on artificial neural networks, more precisely multilayer feed-forward networks, using a back propagation algorithm to learn a desired behavior. The present chapter focuses on introducing the theoretical basis of artificial neural networks, as well as some practical points of implementing such algorithms.

3.2 Introduction

Artificial Neural Networks (ANNs) are biologically inspired computational tools. They have their origin in the study of the brain's neurons and in the recognition that the human brain computes in a completely different way than conventional digital computers. The brain is capable of powerful, highly parallel information processing, as required for ordinary day-to-day tasks. The human brain is still largely unbeatable in pattern recognition and in many other tasks, namely all those that require some form of flexibility, of adaptation to new unforeseen cases, and that are not simple to translate into a sequential list of actions.

Programming a computer consists mostly in “translating” a specific task or decision process into a precise set of instructions, a list of actions and dichotomous choices, which necessarily limits flexibility and openness to new cases.

The brain is constituted by large networks of interconnected neurons, and information in the human brain is mainly retained in the connections between neurons, the synapses.

The synaptic connections between neurons have a very important characteristic: plasticity. They can adapt their response with time and experience, and can be driven to learn to produce the appropriate response when faced with a given assortment of stimuli.

In opposition to classical programming, where openness to new, unforeseen cases is small or absent, learning from a predefined set of examples can lead to some sort of generalization capability, that is, to being able to produce reasonable outputs in the face of previously unseen inputs.


In general terms, Haykin ([1]) defines artificial neural networks as any kind of algorithm or machine designed to mimic or model the way the brain performs a particular task or function. This is accomplished by using highly parallel machines or algorithms, composed of individual neurons (also called processing units) that are highly interconnected. Following the biological parallel, the connection weights (synapses) between these individual neurons store the information necessary for the task at hand. Though other types exist, this text will only overview networks that learn, and only a particular subset of those: networks that perform supervised learning. Learning is achieved in ANNs using a learning algorithm, an algorithm that operates, like the biological system it mimics, by changing the weights of the connections between neurons (synapses) in a desired, oriented way. In the present text, learning is achieved by minimizing an error function.

The preceding paragraphs have presented the two foremost benefits of neural networks:

1. They are parallel distributed systems, and are thus able to deploy considerable computing power in an economical way.

2. They are able to generalize from a set of examples, and thus to produce reasonable outputs for inputs not contained in their initial training set.

Other strong points of ANNs are their ability to cope with noise and, more generally, their capability to be fault tolerant. This results from the fact that, as information in a network is stored in a diffuse, distributed way, the network can cope with local losses of information without significantly degrading its overall performance.

Still, ANNs are hardly ever implemented alone; they are usually integrated into a larger engineering approach, where pre- and post-processing as well as decision rules are also important tools. This was also the case in the application to breath detection, as will be made clear in the next chapter.

In the present work feed-forward neural networks were used, and the back-propagation learning algorithm was implemented to achieve learning. According to Ripley ([2]), a feed-forward neural network is defined as “A network in which the vertices can be numbered so that all connections go from a vertex to one with a higher number. In practice the vertices are arranged in layers, with connections to higher layers.” Also from the same source ([2]), back propagation is “the method used to calculate the gradient vector of a fitting criterion for a feed-forward neural network with respect to the parameters (weights).”

3.3 The single neuron

The single (artificial) neuron is the fundamental brick from which artificial neural networks are built. The artificial neuron is a simplified model of a true neuron, presenting the following main elements (figure 3.1):

1. A certain number of inputs, in this case $m$ inputs, denoted $x_j$, $j = 1, 2, \ldots, m$.

2. A set of synapses, each characterized by a weight. Specifically, the input $x_j$ is connected to the neuron $k$ with a synaptic weight $w_{kj}$ (first subscript indicating neuron $k$, second subscript input $j$).


3. The neuron acts as a simple summing unit, adding all inputs weighted by their respective synaptic weights.

4. An activation function, either linear or non-linear, is usually incorporated to limit the scope of the output to a desired interval, and often rendering the neuron’s output comparable to a probability.

Figure 3.1: Schematic view of a single neuron, depicting the path from input to output: the input vector is weighted by the synaptic weights, added in the summing junction together with the bias term; the sum constitutes the argument for an activation function, yielding the output of the neuron (from [1]).

The neuron depicted in figure 3.1 also includes an external bias $b_k$, which acts directly on the input level of the activation function. It is mathematically equivalent to consider that the bias $b_k$ corresponds to an additional input $x_0 = 1$, an additional input that is weighted by a synaptic weight $w_{k0}$. In this description $b_k = w_{k0}$.

Once the neuron integrates (adds) all its $m$ weighted inputs, its input activation $u_k$ can be defined as:

$$u_k = \sum_{j=0}^{m} w_{kj}\, x_j \qquad (3.1)$$

and the final output $y_k$ of the neuron will then be given by:

$$y_k = \varphi(u_k) \qquad (3.2)$$

where ϕ is the activation function.


Many different activation functions can be implemented, from the simple Heaviside function, to piecewise linear functions, to combinations of exponential functions. The most common is the sigmoid function. An example of a sigmoid function is the logistic function, defined by:

$$\varphi(v) = \frac{1}{1 + e^{-av}} \qquad (3.3)$$

The sigmoid function is depicted in figure 3.2 for different values of the parameter $a$. The sigmoid function is a differentiable function, limited to the interval $[0, 1]$, and the parameter $a$ determines the slope. It can easily be shown that the slope at $v = 0$ corresponds to $a/4$. If $a$ tends to infinity, the sigmoid function tends to a threshold function.

Figure 3.2: Representation of the sigmoid (equation 3.3) function for different values of the parameter a (from [1]).
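As a concrete illustration of equations 3.1-3.3, the short sketch below computes the output of a single neuron with a logistic activation. Python with NumPy is used here purely for illustration (the original work used Matlab), and all inputs, weights, and bias values are arbitrary examples:

```python
import numpy as np

def logistic(v, a=1.0):
    """Logistic activation function of equation 3.3, with slope parameter a."""
    return 1.0 / (1.0 + np.exp(-a * v))

def neuron_output(x, w, b, a=1.0):
    """Single-neuron forward pass (equations 3.1 and 3.2).

    The bias b plays the role of the extra input x_0 = 1 weighted by w_k0.
    """
    u = np.dot(w, x) + b      # u_k = sum_j w_kj * x_j   (equation 3.1)
    return logistic(u, a)     # y_k = phi(u_k)           (equation 3.2)

# Example with arbitrary values: three inputs, one neuron.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron_output(x, w, b=0.05))
```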

3.4 Learning

In this section attention will be given to how to make an ANN learn a desired behavior.

Focus will be on supervised learning: learning with a teacher. In supervised learning, a set of training examples and the corresponding desired outputs have been established by an expert, and are repetitively presented to the network. Using a learning algorithm, the weights of the connections between neurons are changed so that the output error is minimized. The network learns to reproduce the association of each input example with its desired output, and thus to mimic the classification of the expert.

In this type of supervised learning, a sufficiently large set of learning examples is required.

Each example consists of an input signal and a desired output. The network is presented with an example picked at random from the set of examples, and the synaptic weights of the network are modified so as to minimize the difference between the desired response and the actual response of the network to the specific example presented. Training is achieved by repeating this procedure for all examples in the training set. Once all examples in the training set have been presented to the network - each presentation of all examples in the training set is called an epoch - the procedure is repeated for as many epochs as required for a steady state to be reached, that is, until no further significant changes take place in the connection weights.

In this way, the network is learning to map the set of inputs it receives into a set of outputs, making no special assumptions about their distribution. With each presentation of the learning set, the performance of the network improves, through learning. This improvement takes place over time, due to a process of adjustment of the synaptic weights and bias levels of the neurons.

A learning algorithm corresponds, generically, to any type of defined rules capable of making a system learn. As would be expected from this vague definition, there is no single, nor an optimal, learning algorithm; rather, there is a diverse variety of learning algorithms, each with its own advantages and disadvantages, as well as its limitations. Learning algorithms differ from each other in numerous ways, but mainly in the way the synaptic weights are adjusted after the presentation of examples. In the current text, only the paradigm of supervised learning will be covered. Supervised learning, in which a teacher or expert associates a given set of learning examples with the corresponding desired outputs, allows the calculation of the error committed, and thus learning can be achieved by correcting errors. The error is defined as the difference between the network's output and the desired output expressed by the teacher. A step by step correction of errors will direct the network to emulate the teacher's classification.

3.4.1 Error correcting learning

To illustrate error-correcting learning, the single neuron depicted in figure 3.1 will be considered. The connection weights $w_{kj}$ of the neuron are arbitrarily initialized. The neuron $k$ yields an output $y_k(n)$, obtained following equation 3.2, for each given input $x(n)$. In this context, $n$ is an implicit discrete time, denoting the time step of the iterative learning process. As the desired output $d_k(n)$ (also called target output) of the neuron is known for any given input (as we are considering supervised learning), the error committed by the neuron can be computed as follows:

$$e_k(n) = d_k(n) - y_k(n) \qquad (3.4)$$

The goal of the learning algorithm is then to minimize the error $e_k(n)$ by adequately adjusting the synaptic weights $w_{kj}$, $j = 1, 2, \ldots, m$, in the “right” direction. It is clear that by minimizing the error function, the output of the neuron, $y_k$, will come closer to the desired output $d_k$.

This objective can be achieved by minimizing a cost function, $\xi(n)$, defined in terms of the error signal $e_k(n)$:

$$\xi(n) = \frac{1}{2}\, e_k^2(n) \qquad (3.5)$$

$\xi(n)$ measures an instantaneous value of the error energy, and the goal of minimizing $\xi(n)$ leads to a learning rule commonly called the delta rule (also known as the Widrow-Hoff rule, or the least mean square (LMS) algorithm).

The difficulty of finding the “right” direction of change of each synaptic weight remains.

With $w_{kj}(n)$ denoting the value of the synaptic weights of neuron $k$, with input $x_j(n)$, at time step $n$, the delta rule proposes to minimize $\xi(n)$ by adjusting the synaptic weights at time step $n$ by $\Delta w_{kj}(n)$, defined by:

$$\Delta w_{kj}(n) = \eta\, e_k(n)\, x_j(n) \qquad (3.6)$$

where $\eta$ is a positive constant that determines the rate of learning. $\eta$ is commonly called the learning rate parameter, and the value of this parameter is of the utmost importance to ensure the convergence of the learning process. It also determines how fast convergence towards an acceptable error is achieved. The change in a synaptic weight is proportional to the learning rate, a user-set parameter, but also to the error committed by the neuron and to the input of that specific connection.

Once $\Delta w_{kj}(n)$ is computed, the synaptic weights are updated:

$$w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n) \qquad (3.7)$$

and $w_{kj}(n+1)$ (the new value of the weights, at time step $n+1$) can go through the next error-correcting iteration, in a cycle that lasts until an acceptable level of error is attained.
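The error-correcting cycle of equations 3.4-3.7 can be sketched in a few lines. The sketch below assumes a linear neuron, as in the Widrow-Hoff (LMS) formulation of the delta rule; the toy problem and all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised problem: learn d = w_true . x from examples.
w_true = np.array([0.3, -0.7, 1.2])
X = rng.normal(size=(200, 3))        # inputs x(n)
d = X @ w_true                       # desired outputs d(n)

w = rng.normal(scale=0.1, size=3)    # arbitrarily initialized weights
eta = 0.05                           # learning rate (illustrative value)

for epoch in range(20):
    for n in rng.permutation(len(X)):    # examples in random order
        y = w @ X[n]                     # neuron output (linear)
        e = d[n] - y                     # error, equation 3.4
        w += eta * e * X[n]              # delta rule, equations 3.6 and 3.7

print(w)   # should now be close to w_true
```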

3.5 Singlelayer and multilayer networks

3.5.1 Network architecture

The single neurons described in section 3.3 can be grouped into more or less complicated structures. The way in which neurons interconnect is closely linked to the learning rule required to train the network, but for the moment let us focus only on the architecture of such connections.

Figure 3.3: Schematic view of a singlelayer neural network (from [1]).

The simplest grouping of neurons can be obtained by organizing a layer of neurons, all receiving the same input, and each presenting an individual output, as presented in figure 3.3. This kind of network is called a single layer network. It is also a feedforward network, for there is no feedback from the output of the neurons towards their input level. A recurrent network, on the other hand, is a network that presents at least one feedback loop. Those networks will not be dealt with in this text.

The next step in network complexity is the multilayer feedforward network, in which more than one layer is present. Two levels of layers are fundamentally distinct: the output layer, the one that yields the final outputs of the system, and the hidden layer or layers that connect the inputs to the output layer. The multilayer network presented in figure 3.4 is also a feedforward network, and has another special property, as it is a fully connected network. A fully connected network is a network in which each node in each layer is connected to every node in the adjacent forward layer.

Figure 3.4: Schematic view of a multilayer neural network (from [1]).

In networks such as the one depicted above (figure 3.4), the “signal” propagates from left to right, with the input vector activating the input layer, propagating to the hidden layer and then to the output layer. It is clear that each individual connection weight contributes to, and is thus partially responsible for, the final output of the network. The knowledge or information the network might have is distributed across the connections and biases.

3.5.2 Multilayer networks, error propagation

In order to enable such networks to learn, a generalization of the learning rule presented in subsection 3.4.1 is necessary. Generalization from the single neuron to multiple neurons, and to multiple layer networks is not straightforward, but is a necessary step in order to introduce the back-propagation algorithm. In the following lines, the analysis will focus on two layer networks. The approach followed to establish the learning rule can be generalized into networks with more than two layers.

In order to ease the mathematical burden, let us begin by presenting a summary of the notation used in the following pages:

- indices $j$ and $k$ refer to different neurons in the network; index $i$ is used to indicate neurons in either one or the other layer

- neurons in the output layer are labeled $k$, and there are $m_{out}$ of them

- neurons in the hidden layer are referred to by the index $j$, and they number $m_{hid}$. A generic $m$ is used for indicating either $m_{out}$ or $m_{hid}$

- $n$ represents the learning iteration (when the $n$th training example is presented to the network); $N$ denotes the total number of examples available in the learning set

- $e_j(n)$ refers to the error of neuron $j$ at iteration $n$

- $d_j(n)$ refers to the desired output of neuron $j$, and is used for computing $e_j(n)$

- $y_j(n)$ refers to the output of neuron $j$ at iteration $n$

- $w_{ji}(n)$ refers to the synaptic weight connecting neuron $i$ to neuron $j$

- $\Delta w_{ji}(n)$ refers to the correction applied to $w_{ji}(n)$ at iteration $n$

- $v_j(n)$ constitutes the output of neuron $j$ at iteration $n$, prior to application of the activation function

- $\varphi_j(\cdot)$ denotes the activation function of neuron $j$

- $b_j(n)$ refers to the bias of neuron $j$, and will also be denoted $w_{j0}(n)$, for $w_{j0}(n) = b_j$ when $w_{j0}(n)$ is connected to a fixed input $= +1$

- $x_j(n)$ is the $j$th element of the input vector $x(n)$

- $o_k(n)$ refers to the $k$th element of the output vector $o(n)$

- $\eta$ is the learning rate parameter

Following this notation, and referring to equation 3.4 for the calculation of individual neurons' errors, the total error of a network of neurons can be computed as:

$$\xi(n) = \frac{1}{2} \sum_{k=1}^{m_{out}} e_k^2(n) \qquad (3.8)$$

where $k$ corresponds to neurons in the output layer, and $m_{out}$ is the number of those neurons.

With $N$ denoting the total number of examples in the learning set, the average squared error over all examples can be computed as follows:

$$\xi_{av} = \frac{1}{N} \sum_{n=1}^{N} \xi(n) \qquad (3.9)$$

$\xi_{av}$ is a function of all free parameters in the network, all the weights and biases. For each given training set, $\xi_{av}$ represents a cost function that can be used to measure learning performance. The objective of the learning algorithm is to adjust the free parameters so that $\xi_{av}$ is minimized.

A simple method of training is one in which the update of the weights is performed for each example in the training set, examples that are presented in an arbitrary order. When all examples have been presented to the network - one epoch - and all corresponding weight updates have been performed, the network will find itself in a new operating point. The rationale behind this approach is that the arithmetic average of all individual weight changes induced by each example in the training set is an estimate of the true change that would result from modifying the weights based on the average cost function $\xi_{av}$, minimized over the entire training set. This approach is analogous to the gradient descent algorithm. The limits of validity of this approach will be discussed later.

In order to achieve the goal of minimizing the error function, the back propagation algorithm is introduced. The back propagation algorithm applies, much as described for the delta rule, a correction to each individual weight, a correction that is proportional to the local derivative of the error function, $\partial \xi(n)/\partial w_{ji}(n)$, in order to minimize the error function. This partial derivative represents a sensitivity factor, determining locally the direction of search in weight space for the optimal synaptic weight $w_{ji}$ minimizing the squared error. We may express the gradient of the squared error function as:

$$\frac{\partial \xi(n)}{\partial w_{ji}(n)} = \frac{\partial \xi(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)} \qquad (3.10)$$

where

$$v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n) \qquad (3.11)$$

and

$$y_j(n) = \varphi(v_j(n)) \qquad (3.12)$$

Equation 3.10 can be broken down into its components. From equation 3.8,

$$\frac{\partial \xi(n)}{\partial e_j(n)} = e_j(n) \qquad (3.13)$$

Differentiating both sides of equation 3.4,

$$\frac{\partial e_j(n)}{\partial y_j(n)} = -1 \qquad (3.14)$$

The next term in the calculation of the partial derivative can be obtained from equation 3.12,

$$\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi'_j(v_j(n)) \qquad (3.15)$$

where prime represents differentiation with respect to the argument. Finally the last term can be computed from 3.11,

∂v

j

(n)

∂w

ji

(n) = y

i

(n) (3.16)

Taken together, the four preceding equations yield:

$$\frac{\partial \xi(n)}{\partial w_{ji}(n)} = -e_j(n)\, \varphi'_j(v_j(n))\, y_i(n) \qquad (3.17)$$

And thus the correction $\Delta w_{ji}(n)$ applied to $w_{ji}(n)$, defined by the generalized delta rule, is given by:

$$\Delta w_{ji}(n) = -\eta\, \frac{\partial \xi(n)}{\partial w_{ji}(n)} = \eta\, \delta_j(n)\, y_i(n) \qquad (3.18)$$

with the local gradient $\delta_j(n)$ defined as

$$\delta_j(n) = -\frac{\partial \xi(n)}{\partial v_j(n)} = -\frac{\partial \xi(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = e_j(n)\, \varphi'_j(v_j(n)) \qquad (3.19)$$

According to this equation, the local gradient for output neuron $j$ is equal to the product of the corresponding error signal $e_j(n)$ and the derivative $\varphi'_j(v_j(n))$ of the associated activation function.

Breaking these last two equations into their components, the adjustment of the weights ($\Delta w_{ji}$) depends on the error $e_j(n)$ committed at neuron $j$. It is then important to distinguish two cases: the one in which neuron $j$ is an output node, where the error it commits can be easily established, and the one in which the neuron under consideration is in a hidden layer. The outputs of hidden layer neurons are not directly associated with an error, yet these neurons are partially responsible for the error committed by the output neurons. The difficulty is then how to penalize and reward hidden neurons for their share in the final output. This difficulty is overcome in an elegant fashion by back propagating the error through the network.

Output layer Each neuron in the output layer produces an output for a given input. This output can be compared with the desired output, and the error committed for that given input can be easily calculated using equation 3.4. Since $e_j(n)$, and even more so $\delta_j(n)$ (using equation 3.19), is easy to determine, $\Delta w_{ji}$ can be computed and the weights updated accordingly.

Hidden layer When neuron j is situated in the hidden layer, there is no desired output attributed to it, thus no direct calculation of the error committed can be performed. The contribution of a specific hidden neuron to the overall error should be computed recursively from the impact it has on the error committed by all output neurons to which that hidden neuron is directly connected.

Rewriting equation 3.19 in a form more convenient to tackle error propagation in hidden neurons,

$$\delta_j(n) = -\frac{\partial \xi(n)}{\partial v_j(n)} = -\frac{\partial \xi(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial \xi(n)}{\partial y_j(n)}\, \varphi'_j(v_j(n)) \qquad (3.20)$$

where $j$ is a hidden neuron and equation 3.15 was used. To calculate the partial derivative $\partial \xi(n)/\partial y_j(n)$, equation 3.11 is required, in addition to the knowledge that the error committed by all output neurons is given by

$$\xi(n) = \frac{1}{2} \sum_{k=1}^{m_{out}} e_k^2(n) \qquad (3.21)$$

This is equation 3.8, but it is repeated here to emphasize the point that the sum is made over all output neurons (index $k$), not over neurons in the hidden layer (index $j$).

Differentiating equation 3.21 with respect to $y_j(n)$,

$$\frac{\partial \xi(n)}{\partial y_j(n)} = \sum_{k=1}^{m_{out}} e_k(n)\, \frac{\partial e_k(n)}{\partial y_j(n)} \qquad (3.22)$$

Using the chain rule to rewrite the partial derivative, the preceding equation can be written in the equivalent form

$$\frac{\partial \xi(n)}{\partial y_j(n)} = \sum_{k=1}^{m_{out}} e_k(n)\, \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)} \qquad (3.23)$$

The first term of this product can be computed from equation 3.4,

$$e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n)) \qquad (3.24)$$

from which one can determine the second term of the product:

$$\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi'_k(v_k(n)) \qquad (3.25)$$

Expressing $v_k(n)$ as a function of $y_j(n)$ is not difficult, following equation 3.11, and noting once again that index $k$ refers to the output layer and index $j$ to the hidden layer:

$$v_k(n) = \sum_{j=0}^{m_{hid}} w_{kj}(n)\, y_j(n) \qquad (3.26)$$

where $m_{hid}$ is the number of neurons in the hidden layer. From this equation, the third term of the product is easily computed:

$$\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n) \qquad (3.27)$$

Replacing equations 3.25 and 3.27 into equation 3.23:

$$\frac{\partial \xi(n)}{\partial y_j(n)} = -\sum_{k=1}^{m_{out}} e_k(n)\, \varphi'_k(v_k(n))\, w_{kj}(n) = -\sum_{k=1}^{m_{out}} \delta_k(n)\, w_{kj}(n) \qquad (3.28)$$

in which the definition of the local gradient $\delta$ (equation 3.19) was used. Please note the index $k$ referring to the output layer.

Finally, using equations 3.28 and 3.20, one can express the local error gradient of a hidden layer neuron, $\delta_j(n)$, as

$$\delta_j(n) = \varphi'_j(v_j(n)) \sum_{k=1}^{m_{out}} \delta_k(n)\, w_{kj}(n) \qquad (3.29)$$

This equation connects the local gradient of a hidden neuron $j$ with the derivative of its activation function, $\varphi'_j(v_j(n))$, and with the synaptic weights $w_{kj}$ connecting it to all neurons in the output layer; and, of course, it takes into account the errors committed by each individual neuron in the output layer connected to hidden neuron $j$ (the sum over all $\delta_k(n)$, each weighted by its individual synaptic weight $w_{kj}$).

This last equation (equation 3.29) is called the back propagation formula, for by linking the error committed by the output layer with the contribution of previous layers, it allows local measures of contribution to the output error to be sent “backwards” through the network. The reasoning presented for a two layer network can be further expanded (in a lengthy, laborious, but similar way) to networks with more than one hidden layer.

Updating the weights can thus generally be performed using the delta rule together with equation 3.7, with $\Delta w_{ji}(n)$ defined as:

$$\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n) \qquad (3.30)$$

and $\delta_j(n)$, depending on whether the neuron is in the output layer or in the hidden layer, defined as follows.

If neuron $j$ is an output neuron, then equation 3.19 applies:

$$\delta_j(n) = e_j(n)\, \varphi'_j(v_j(n)) \qquad (3.31)$$

where both the error and the derivative are associated with the neuron in question.

If neuron $j$ is in the hidden layer, then equation 3.29 applies:

$$\delta_j(n) = \varphi'_j(v_j(n)) \sum_{k=1}^{m_{out}} \delta_k(n)\, w_{kj}(n) \qquad (3.32)$$

where the derivative of the local activation function is multiplied by the weighted sum of the $\delta$'s computed for the neurons in the next forward layer that are connected to neuron $j$. The weighting of the different $\delta$'s in the next layer is performed by taking into account the present connection strength $w_{kj}$.
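The complete update scheme of equations 3.30-3.32 can be illustrated with a minimal two-layer sketch, operating in the sequential mode presented in subsection 3.6.1. The XOR problem is used here as an arbitrary example of a learning set, and the learning rate, number of hidden units, and epoch count are illustrative values, not those of the original work:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(v):                     # logistic activation, equation 3.3 (a = 1)
    return 1.0 / (1.0 + np.exp(-v))

def dphi(v):                    # its derivative phi'
    s = phi(v)
    return s * (1.0 - s)

# Toy network: 2 inputs, 4 hidden units, 1 output (sizes are illustrative).
eta = 0.5
W1 = rng.normal(scale=0.5, size=(4, 3))   # hidden weights; column 0 = biases
W2 = rng.normal(scale=0.5, size=(1, 5))   # output weights; column 0 = bias

# XOR as an arbitrary example of a learning set.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

for epoch in range(5000):
    for n in rng.permutation(len(X)):          # sequential mode
        # Forward pass (equations 3.11 and 3.12); the fixed +1 input
        # implements the bias as w_j0.
        x = np.append(1.0, X[n])
        v1 = W1 @ x
        y1 = np.append(1.0, phi(v1))
        v2 = W2 @ y1
        y2 = phi(v2)
        # Backward pass: local gradients, equations 3.31 and 3.32.
        e = D[n] - y2                          # error, equation 3.4
        delta2 = e * dphi(v2)                  # output layer
        delta1 = dphi(v1) * (W2[:, 1:].T @ delta2)   # hidden layer
        # Weight updates, delta rule of equation 3.30.
        W2 += eta * np.outer(delta2, y1)
        W1 += eta * np.outer(delta1, x)

for x_row in X:                                # the learned mapping
    y1 = np.append(1.0, phi(W1 @ np.append(1.0, x_row)))
    print(x_row, phi(W2 @ y1))
```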

3.6 Implementing back propagation learning

Initialization Prior to the presentation of the first example to the network, from which learning will begin, the synaptic weights have to be initialized. The best way to achieve this is by randomly setting all weights from a distribution of zero mean, whose variance is set so as to cover the linear part of the activation function. The reason for imposing this last constraint is evident: as the weight update is proportional to the derivative of the chosen transfer function, and the derivative of the transfer function tends towards zero at the extremities of the interval, synaptic updates are much slower when, from the initial iteration, the neuron finds itself on the plateau of the transfer function.

Calculation steps Implementation of the back propagation algorithm is a process performed in two steps, two different phases of computation. First, an example is presented to the network, the output yielded by the current state of the network is computed, and the output error is calculated. The second phase is the backward pass, when the error propagates backwards through the network and the synaptic weights are adapted in order to minimize the error.

The forward pass During the forward pass, connection weights and biases are not altered anywhere in the network. The input is presented to the network and the output of each neuron in the structure is calculated following equations 3.1 and 3.2 ($n$ represents the present state), thus obtaining $y_j(n)$ and $y_k(n)$, respectively the outputs of the neurons in the hidden and in the output layer.

The signal of the output neurons is then compared to the desired response $d_j(n)$, and the error signal $e_j(n)$ is calculated.

This terminates the forward pass. From an input vector, the state of every neuron in the network and the error committed in the output layer were calculated.

The backward pass Following the forward pass, it is now possible to update the value of weights and bias, and thus take the network into a new operating point where, presumably, less error will occur. This is done during the backward pass.

During the backward pass the error is propagated backwards through the network, layer by layer, recursively calculating the local gradient $\delta$ for each neuron, in order to adjust the weights. Weights are updated according to the delta rule, as defined in equation 3.30.

The weights $w_{kj}$ connecting the hidden layer to the output layer are updated first, using $\Delta w_{kj}$ defined by equations 3.30 and 3.31. Recursive computation then continues to the next layer, the hidden layer, updating all weights $w_{ji}$ connecting the input to the hidden layer by $\Delta w_{ji}$, calculated using equations 3.30 and 3.32.

The forward and backward pass are inseparable. For each example presented to the network, both passes are performed. It is in some way a round-trip, where the consequences of the input move forward, towards the output layer, and the error distributes backwards, allowing for an oriented change in the corresponding synaptic weights.

Fine tuning back propagation In order to obtain faster and/or more accurate convergence towards a minimal error, many variations of the back propagation algorithm exist. Some of them, used in the next chapter, are succinctly introduced in the following lines.

Learning rate The back propagation algorithm performs a steepest gradient descent on an approximated error surface in weight space. It is clear from equation 3.30 that the smaller the learning rate $\eta$, the smaller the changes in synaptic weights from one iteration to the next. The smaller the $\eta$, the longer it takes for the network to learn; on the other hand, a small $\eta$ ensures that the gradient descent is smooth, so there is a smaller probability that a minimum is missed along the descent path. A large $\eta$ speeds up learning, but at the risk of large changes in synaptic weights from one iteration to the next: the network may miss narrow minima in the error surface that could otherwise be reached with a smaller learning rate, and the risk of the network getting stuck in, or oscillating between, local minima that do not correspond to the global minimum is increased.

Adaptive learning rates In the preceding lines, the learning rate $\eta$ has always been considered constant, but in practice it may be useful to make it depend on the connection being updated ($\eta_{ji}$), and also to change its value at each iteration. In the simplest approach, the learning rate can be made to depend on the epoch: $\eta$ is first set to a high value and, as learning evolves and the search for the error minimum becomes more local, the learning rate decreases, in order to search more and more locally.

It can also be performed in a way inspired by the Levenberg-Marquardt algorithm, in which the step (here corresponding to $\eta(n)$) is replaced at each iteration by a new value $\eta(n+1)$, based on the value of the remaining terms of equation 3.30. When the error term becomes small, and $\Delta w_{ji}$ would thus only marginally change the current weight, $\eta$ is increased, so as to search elsewhere for a lower point on the error surface. The larger the value of $\eta$, the wider the surface the algorithm can cover. However, as in the Levenberg-Marquardt algorithm, if the weight update produced by the “big step” does not lead to a decrease in error, the weights are not updated; they remain as they were in the previous iteration.

Generalized delta rule A high learning rate speeds up learning, but at the expense of accuracy in finding the global minimum of the error function. There are several ways to minimize the impact of a high learning rate on learning, or to increase accuracy for the same learning rate. One is to adapt the delta rule (equation 3.30) by adding a momentum term, obtaining what is called the generalized delta rule:

$$\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n) + \alpha\, \Delta w_{ji}(n-1) \qquad (3.33)$$

where $\alpha$ is a positive number called the momentum constant, $0 \leq \alpha < 1$ ($\alpha = 0$ corresponding to the original delta rule, equation 3.30).

In this generalized formulation, the present update of the synaptic weights, $\Delta w_{ji}(n)$, depends not only on the present input, but also on the preceding correction $\Delta w_{ji}(n-1)$. The physical analogy to this situation is to consider that once the network starts to adjust its weights in a certain direction of weight space, it acquires a certain inertia; thus, in the following iteration, a fraction of the movement in that direction is still taken into account, with $\alpha$ determining the relative importance given to inertia.

Introducing a momentum term in the back propagation algorithm will usually smooth the path of gradient descent and speed up convergence. The momentum term may also help prevent the learning process from ending its path in a shallow local minimum of the error surface.
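A minimal sketch of the generalized delta rule follows, applying equation 3.33 to one layer's weight matrix; the layer sizes and all numerical values are illustrative:

```python
import numpy as np

def momentum_update(W, dW_prev, delta, y, eta=0.1, alpha=0.9):
    """One application of the generalized delta rule (equation 3.33).

    W       : weight matrix w_ji of one layer
    dW_prev : previous correction, Delta w_ji(n-1)
    delta   : local gradients delta_j(n) of the layer
    y       : outputs y_i(n) feeding the layer
    """
    dW = eta * np.outer(delta, y) + alpha * dW_prev   # equation 3.33
    return W + dW, dW       # updated weights, and the correction to store

# Illustrative call: a layer with 3 neurons fed by 4 inputs.
W = np.zeros((3, 4))
dW_prev = np.zeros_like(W)      # no inertia before the first iteration
delta = np.array([0.1, -0.2, 0.05])
y = np.array([1.0, 0.3, 0.7, -0.4])
W, dW_prev = momentum_update(W, dW_prev, delta, y)
print(W)
```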

3.6.1 Batch versus sequential mode of training

Learning using back propagation results from the repetitive presentation of a pre-established set of training examples to the chosen network.

An epoch consists of presenting all examples in the learning set to the network. An epoch is thus the fundamental “time” unit in learning. Training will take place for a certain number of epochs, or, as will be presented in section 3.8, using early stopping, one can determine the optimal number of epochs. In either case, in each epoch all examples from the learning set are presented to the network. The order of presentation of the examples is randomized. This is done in order to give a stochastic nature to the search for the minimum error, thus limiting the possibility that this search falls and stays in a local minimum that does not correspond to the global minimum.

Two variants exist for updating the weights $\Delta w_{ji}$:

The sequential mode - which corresponds to the algorithm derived in the previous sections - in which weights are updated after the presentation of each individual example. In this variant, for a learning set with $N$ examples, the connection weights are updated $N$ times in each epoch, once for each example presented from the learning set. The $N$th update yields the initial state for the beginning of the next epoch.

The batch mode, in which weights are updated only once per epoch. This update is made in the direction of minimizing the average squared error over all $N$ examples in the learning set. This introduces only minor changes in the delta rule, namely an averaging over all individual gradient descent directions, an average that is weighted by their relative error. In this variant, randomizing the presentation of the learning examples is, of course, useless.

Both variants have their strong and weak points, the most important constraints being computational in nature. Sequential mode requires less memory, for there is no need to store all synaptic changes corresponding to the N examples in the learning set, and due to its stochastic nature, it is less likely to fall into local minima of the error surface. On the other hand, it is harder to establish theoretical conditions for convergence.

In contrast, batch mode requires more computational memory, yet it yields a more thorough estimate of the local gradient vector and thus converges faster. Convergence conditions are also easier to establish theoretically. However, only convergence to a local minimum can be assured. In order to probe whether the minimum found in batch mode does indeed correspond to the global minimum of the error surface, learning must be repeated from several different random initial conditions (different initial values for all $w_{ji}$ in the network). This heavily increases the computational time required for learning. On the other hand, batch learning is easier to parallelize than the sequential mode, thus decreasing the calculation time on computers that support parallel computation.

The sequential mode takes advantage of redundant or repeated examples while, on the other hand, batch mode has some difficulties in dealing with repeated learning examples.
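The two variants can be contrasted in a short sketch. Here `grad` is a hypothetical helper standing in for the forward and backward passes of this section, assumed to return the per-example weight correction direction $\delta_j(n)\, y_i(n)$ for the layer under consideration:

```python
import numpy as np

def sequential_epoch(W, X, D, grad, eta=0.1, rng=None):
    """One epoch in sequential mode: one update per example, with the
    examples presented in randomized order."""
    rng = rng or np.random.default_rng()
    for n in rng.permutation(len(X)):
        W = W + eta * grad(W, X[n], D[n])
    return W

def batch_epoch(W, X, D, grad, eta=0.1):
    """One epoch in batch mode: the per-example corrections are accumulated
    and their average is applied once, at the end of the epoch."""
    dW = np.zeros_like(W)
    for x, d in zip(X, D):
        dW += grad(W, x, d)
    return W + eta * dW / len(X)
```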

3.7 Generalization

The goal of learning is to adapt a network’s output to a desired behavior. However, the overall goal of deploying neural networks is not to obtain the correct output for the learning examples (for those are known) but to be able to generalize the learning to new examples, unseen during the learning process.

Learning from a set of examples can, and often does, produce some kind of over-specification of the network, which becomes capable of exactly reproducing the desired output on the learning set, but unable, or less able, to cope with new, previously unseen examples. This can be understood by realizing that neural network learning is equivalent to performing a high dimensional fit from the input space to the output space. Since the network has many free parameters, it is possible, even likely, that it will over-fit the learning data, acquiring “knowledge” that is specific to the learning set but useless, or even harmful, for the goal of generalization.

Over-fitting must be reduced, and two simultaneous approaches should be followed:

Reducing the complexity of the network. Over-fitting can happen whenever there are more free parameters than those necessary to describe the input-output mapping. Care should be taken to determine the smallest architecture capable of solving the problem at hand. This will not only reduce computational time in the application phase, but is also in accordance with Occam's razor. Reducing the complexity of the network does not by itself ensure the best generalization: after a certain number of learning epochs the network will start to become too specialized to the training data (over-fitting the training data), thus losing generalization capabilities.

The second approach is called early stopping. Early stopping consists in determining the optimal moment for stopping learning on the basis of a measure of the generalization capability of the network, and not of the error it commits on the learning set. This approach is further discussed in the next section, and is illustrated in figure 3.5.


3.8 Cross validation

Cross validation is a standard statistical tool for testing different models in data fitting. In order to implement cross validation for ANNs, the available data is partitioned into three different sets: a learning set, used to train the network; a validation set, used to test generalization and to stop learning before the network becomes over-specific to the learning examples; and finally a test set. This last set is used to produce a more reliable measure of the accuracy of the network, independent of the two other sets that are used in training and in stopping learning. Other variants of cross validation exist (namely the “leave one out” and “hold out” methods).

Cross validation is used not only to “early stop” the learning process and avoid over-fitting, but also to select the complexity of the model (the number of hidden units), as will be explained in the following paragraphs.

Measuring generalization performance It is possible to measure a network's capability to generalize by testing the output error of the network against a set of examples drawn from the same population as the learning examples, but never presented to the network during learning. This set of examples is called the validation set, and it is used to verify, throughout the learning process, the generalization capacities of the classifier. When the network becomes too specific to the training set, learning is stopped. This procedure is called early stopping. A similar approach is used to determine the optimal network architecture.

Figure 3.5: Illustration of early stopping using cross validation (figure from [1]).

Early stopping As described in detail earlier, neural networks learn by minimizing the error committed in classifying a set of learning examples. A neural network's learning behavior is thus expected to be as schematically described in figure 3.5: with each epoch, the error committed by the network in classifying the examples from the learning set (here the mean squared error) decreases. This decrease will be important in the first epochs and slower thereafter, as the network approaches a local minimum and the error gradient approaches zero.

On the other hand, the generalization error - the error committed by the network in classifying the validation set - computed at the end of each epoch, will decrease at first, in parallel with the decrease observed for the learning set. As soon as the network begins to be over-specific to the learning data, the error committed on the validation set is expected to increase (figure 3.5).

Stopping learning as soon as the classification capability of the network degrades is called early stopping. The early stopping point is determined (see figure 3.5), and the network weights and biases as they were at the moment of early stopping are kept. This is done for each architecture, and for each set of the network's initial conditions.

In order to implement early stopping from a practical perspective, only a minor change in the algorithm is required, namely: training is performed in batches of a certain number of epochs, and generalization is estimated at the end of each batch. Training is resumed if an increase in generalization performance is observed. Training is stopped when a consistent decrease in generalization performance is observed.
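A minimal sketch of this batched early stopping procedure follows; `train_batch` and `validation_error` are hypothetical helpers standing in for the learning loop and the validation-set error, and the batch size and patience values are illustrative:

```python
import numpy as np

def train_with_early_stopping(net, train_batch, validation_error,
                              batch_epochs=50, patience=3):
    """Train in batches of `batch_epochs` epochs, estimating generalization
    on the validation set after each batch. Stop after `patience` consecutive
    batches without improvement and keep the best weights seen so far."""
    best_err, best_net, worse = np.inf, net, 0
    while worse < patience:
        net = train_batch(net, batch_epochs)      # resume learning
        err = validation_error(net)               # generalization estimate
        if err < best_err:
            best_err, best_net, worse = err, net, 0   # still improving
        else:
            worse += 1                            # consistent degradation?
    return best_net, best_err
```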

Statistical determination of the optimal network complexity When approaching a problem using neural networks, probably the first question that one is faced with is the choice of network architecture, which corresponds to determining the amount of complexity necessary to solve the problem at hand. A certain degree of complexity is necessary, but the higher the number of free parameters, the more likely it is that the model will over-fit the data and thus lose generalization capacity. Higher complexity also means longer calculation times when applying the network to real data. Determining the smallest network architecture capable of solving the problem at hand is of the utmost importance.

Cross validation, or one of its variants can be used in order to determine the smallest network architecture. For this purpose networks of different architectures are trained, using early stopping to determine the optimal operational point for each trained network.

As the outcome of learning depends on the initial conditions, it is not sufficient to train a single network for each architecture; this should be performed for different random initializations. By calculating the generalization capability for each network of identical architecture and different initialization, the average generalization capability for that architecture is determined.

Repeating this procedure for different architectures, and performing pair-wise comparisons between the different architectures, one can determine the smallest network complexity that is still capable of succeeding in the task being implemented. That is done by selecting the smallest architecture that presents no statistically significant difference in average performance compared with the more complex networks implemented.

For the chosen architecture, the network presenting the best generalization, amongst the different random initializations, is selected for implementation.

3.9 Statistical validation of the network’s output

As mentioned in the previous section (3.8), cross validation imposes a separation of the available data into three sets. The use of two of those sets (the learning and validation sets) has been discussed in the previous lines. In this section the use of the test set is discussed.

The test set is used to measure the classification performance of the network or algorithm on real data. The test set must thus be representative of the population and should not contain any example presented to the network during training, for this would optimistically bias the results ([3]). Distribution of the data among the different sets should be done in a random way, in order to avoid any bias.

The most widespread measure of a classifier's success is the error rate, which is simply the ratio between the number of errors and the number of test cases. With a finite number of test cases, this constitutes merely an estimate of the true error rate, one that becomes increasingly accurate as the number of cases increases.

In two-class classification problems, where the answer can only be 0 or 1, class A or B, each answer can only be either “correct” or “incorrect”. However, as the output of a neuron or network is often the result of a continuous function with output in the [0, 1] interval, the choice of attributing a certain output to class 0 or 1 depends on an operator-set threshold.

Types of error For a two-class classification problem, such as the one that will be presented in the following chapter, in which only the presence or absence of an event is estimated, the output of a classifier (R) in relation to the expert classification (C) can be grouped into four different kinds, depicted in table 3.1.

                            Class Positive (C+)     Class Negative (C-)
Prediction Positive (R+)    True Positive (TP)      False Positive (FP)
Prediction Negative (R-)    False Negative (FN)     True Negative (TN)

Table 3.1: Types of error in two class classification.

Two classes of correct classification exist: a true positive corresponds to a positive event classified as positive by the algorithm; a true negative corresponds to a negative event accurately classified as negative by the algorithm.

Two types of errors can be committed: a positive event can be scored as negative - this is called a false negative; and a negative event can be classified as positive - a false positive.

From these classes, the following measures of classification performance are computed (table 3.2).

Sensitivity                    TP / C+
Specificity                    TN / C-
Positive predictive value      TP / R+
Negative predictive value      TN / R-
Accuracy                       (TP + TN) / (C+ + C-)

Table 3.2: Measures of classification performance.
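These measures follow directly from the four counts of table 3.1, as the short sketch below illustrates (the counts used are arbitrary example values):

```python
def performance_measures(tp, fp, fn, tn):
    """Measures of table 3.2, computed from the four counts of table 3.1."""
    c_pos, c_neg = tp + fn, tn + fp    # expert positives C+ and negatives C-
    r_pos, r_neg = tp + fp, tn + fn    # predicted positives R+ and negatives R-
    return {
        "sensitivity": tp / c_pos,
        "specificity": tn / c_neg,
        "positive predictive value": tp / r_pos,
        "negative predictive value": tn / r_neg,
        "accuracy": (tp + tn) / (c_pos + c_neg),
    }

# Example with arbitrary counts:
print(performance_measures(tp=90, fp=10, fn=5, tn=95))
```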

An ideal, perfect classifier is a classifier with 100% sensitivity and a 100% positive predictive value. In practice this is never the case. Moreover, classifiers that yield a continuous output (say, a number in the [0, 1] interval, usually corresponding to a probability) require an operator-set threshold, establishing the frontier separating outputs belonging to one class from the other. The level of this threshold influences the distribution between the two classes and thus changes the measures of classification performance.

Figure 3.6: Decision boundary in an idealized two class problem characterized by one-dimensional Gaussian distributions (figure from [1]). The two types of error committed are illustrated in different shades of gray.

It is clear from figure 3.6 that, by increasing the value set as threshold, the number of false positives will decrease, while the number of false negatives will increase. A high threshold will thus maximize the positive predictive value, by attributing to the positive class only those events that have a high probability of belonging there, but this comes at the expense of lowering the sensitivity of the classifier, for more C+ will be excluded from R+ by this raised threshold. This can be clearly seen in the idealized example depicted in figure 3.7. In order to measure classification accuracy independently of the value set for the threshold, an approach more complex than the indexes presented in table 3.2 is needed.

The receiver operator characteristic (ROC) (sometimes also short for “relative operating characteristic”) was introduced for the purpose of establishing a measure of performance independent of the threshold level ([4]). The basis of this approach is to plot, for each threshold level, the corresponding false positive proportion (FP/C-) (x-axis) against the true positive proportion (TP/C+) (y-axis). The area (A) below this curve is then computed; A is a measure of the accuracy of the algorithm independent of the value set for the threshold.

When no discrimination exists (as when a coin is flipped in order to reach a decision) the ROC curve will be the identity line, and A is 0.5. A perfect classifier (100% TP, 0% FP) will present A = 1. The value of A in practical cases will vary between 0.5 and 1; the closer it comes to 1, the better the algorithm.
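A numerical ROC analysis can be sketched as follows: the decision threshold is swept over the [0, 1] interval, the (FP/C-, TP/C+) pairs trace the ROC curve, and the area A is obtained with the trapezoidal rule. The synthetic classifier outputs below are purely illustrative:

```python
import numpy as np

def roc_auc(outputs, labels, n_thresholds=101):
    """Sweep the decision threshold over [0, 1], collect the (FP/C-, TP/C+)
    pairs tracing the ROC curve, and integrate the area A under the curve
    with the trapezoidal rule."""
    outputs, labels = np.asarray(outputs), np.asarray(labels)
    c_pos, c_neg = (labels == 1).sum(), (labels == 0).sum()
    fpp, tpp = [], []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        pred_pos = outputs >= t
        tpp.append((pred_pos & (labels == 1)).sum() / c_pos)
        fpp.append((pred_pos & (labels == 0)).sum() / c_neg)
    order = np.argsort(fpp)
    x, y = np.array(fpp)[order], np.array(tpp)[order]
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

# Synthetic classifier with some discrimination: A should lie in (0.5, 1).
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=500)
outputs = np.clip(0.5 * labels + rng.normal(0.25, 0.2, size=500), 0.0, 1.0)
print(roc_auc(outputs, labels))
```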

ROC analysis thus measures the accuracy of a two-class classification system, and allows one to do so independently of thresholds and decision boundaries. It does not, however, yield the optimal operating point for those thresholds. That decision must take into account the relative cost and consequences of false positive and false negative diagnoses. For instance, if the application in question concerns explosive detection at airports, the consequences of a false negative are intolerable, while the added controls associated with a false positive are only a minor consequence. In such a case, the threshold should be set “low” (see figure 3.6).


Figure 3.7: ROC figure (from [4]), in which the true positive proportion is plotted against the false positive proportion for various possible settings of the decision criterion. The idealized curves shown correspond to individual values of the accuracy measure A. Observed curves usually differ slightly from the idealized curves, being less symmetrical.

On the contrary, the threshold should be set “high” when the relative cost of committing a FN is lower than that of a FP. This is the case, for instance, when analyzing lie detector tests, where the cost associated with a FP is much greater than that associated with committing a FN.


Bibliography

[1] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall / IEEE Press, Piscataway, NJ, 2nd edition, 1999.

[2] Brian D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1997.

[3] Sholom M. Weiss and Casimir A. Kulikowski. Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann Publishers, 1991.

[4] J. A. Swets. Measuring the accuracy of diagnostic systems. Science, 240(4857):1285-1293, 1988.


Chapter 4

Automated breath detection on long-duration signals using feedforward backpropagation artificial neural networks

This chapter is an extended version of the article of the same title, by Rui Carlos Sá and Yves Verbandt, published in IEEE Transactions on Biomedical Engineering, Vol. 49, No. 10, October 2002.

4.1 Abstract

A new breath-detection algorithm is presented, intended to automate the analysis of respiratory data acquired during sleep. The algorithm is based on two independent artificial neural networks ($ANN_{insp}$ and $ANN_{expi}$) that recognize, in the original signal, windows of interest where the onset of inspiration and expiration occurs. Postprocessing consists in finding, inside each of these windows of interest, the minimum and maximum corresponding to each inspiration and expiration. The $ANN_{insp}$ and $ANN_{expi}$ correctly determine respectively 98.0% and 98.7% of the desired windows, when compared with 29 820 inspirations and 29 819 expirations detected by a human expert, obtained from three entire-night recordings. Postprocessing allowed determination of inspiration and expiration onsets with a mean difference with respect to the same human expert of (mean ± SD) 34 ± 71 ms for inspiration and 5 ± 46 ms for expiration. The method proved effective in detecting the onset of inspiration and expiration in full-night continuous recordings. A comparison of five human experts performing the same classification task showed that the automated algorithm was indistinguishable from these human experts, falling within the distribution of human expert results. Besides being applicable to adult respiratory volume data, the presented algorithm was also successfully applied to infant sleep data, consisting of uncalibrated rib cage and abdominal movement recordings. A comparison with two previously published algorithms for breath detection in respiratory volume signals shows that the presented algorithm has a higher specificity, while presenting similar or higher positive predictive values.


4.2 Introduction

The study of physiological time series during sleep and other long-duration experiments requires automated algorithms that permit the detection of events, in order to extract from the raw signals the relevant information for the intended analysis. Ideally, these methods have to be robust in order to cope with the intrinsic physiological variability (for example, that associated with the different sleep stages), and with movement and other artefacts. They should also be operator independent - not subject to the operator's ability to set certain parameters such as filtering, threshold, de-trending, window length, etc. - in order to allow a valid comparison between results coming from different experiments. Implementation of automated algorithms also significantly reduces the amount of operator time required for analysis.

The detection of the onset of inspiration and expiration in each respiratory cycle is of particular importance, not only for the study of respiration ([1]), but also for the study of cardiovascular interactions such as respiratory sinus arrhythmia ([2], [3]) and cardio-respiratory synchronization ([4]).

Respiratory data are conveniently acquired by means of respiratory inductive plethysmography (RIP) ([5]) or strain gauges ([6]). These techniques require neither mask nor mouthpiece, which are uncomfortable for long-duration experiments and are known to induce changes in the respiratory pattern ([7]). Conventionally, RIP or strain gauge acquisition is performed as follows: two sensors are placed, one around the abdomen and the other around the rib cage. With this setup the calibration of the relative gain of each signal can be achieved by performing an isovolume manoeuvre, yielding a signal proportional to respiratory volume ([8]). Figure 4.1(A) depicts a typical relative gain calibrated RIP signal during sleep. Figure 4.1(B)-(D) depicts three of the common disturbances present in sleep RIP signals, namely baseline drift, noise, and movement.

In this chapter, a breath-detection algorithm intended to automate the analysis of respiratory volume data acquired during sleep is presented. The algorithm is based on two independent artificial neural networks (ANNs) that recognize in the original signal windows of interest where the onset of inspiration and expiration occurs. This part of the algorithm is fully user independent and restricts the subsequent analysis to the windows of interest. Postprocessing verifies the alternation of inspiratory and expiratory windows and pinpoints the exact moment of beginning of inspiration and expiration by finding the minimum and maximum inside each of these windows. A schematic view of the implementation of the algorithm is presented in figure 4.2(A).

To the best of our knowledge, previous algorithms used for inspiration and expiration detection in respiratory volume signals generally consist of searching for maxima and minima in the volume signal and verifying their consistency against a certain number of user-imposed thresholds and plausibility rules. We have implemented the methods described by Schmidt in [1] (algorithm 1), by Wilson ([9]), and by Cohen ([10]) for comparison with the newly proposed algorithm.

The comparison algorithms are susceptible to errors due to different physiological events (sighs, swallows, movements, cardiac activity), as well as to noise and drifts in the baseline signal ([1], [9], [10]), which occur frequently in uncontrolled breathing, namely during sleep.

The proposed algorithm should be rendered independent of baseline drifts by the application of a running window followed by normalization. In addition, by using ANNs in the core detection, the overall algorithm is expected to be less susceptible to noise and, to some extent, to movements and other artefacts.


Figure 4.1: (A) Typical relative gain calibrated RIP signal during deep sleep. Calibrated RIP signals presenting the following disturbances: (B) baseline drift, (C) noise, and (D) movement/artefact. (A)-(D) Onset of inspiration and expiration, as scored by a human expert (bottom).


4.3 Algorithm

The algorithm (figure 4.2(A)) is composed of three main blocks: preprocessing, the ANNs, and postprocessing. The preprocessing block transforms the signal by resampling it (when necessary) at 50 Hz (resampling performed using Matlab's built-in polyphase filter implementation [11]) and applying a 2-s running window, with a time step of 20 ms. Each window is normalized in amplitude to the [0, 1] interval. A window length of 2 s was set as a compromise between a shorter window length, which would allow a faster implementation, and the necessity to convey sufficient information as input to the ANN. Two independent ANNs were trained to recognize the 2-s intervals that contain an onset of inspiration or expiration in the central 200 ms of the window. The ANNs constitute the core of the algorithm (figure 4.2(B)).
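The windowing and normalization step can be sketched as follows. This is a reconstruction from the description above, written in Python for illustration (the original implementation used Matlab); the signal is assumed to be already resampled to 50 Hz, and the synthetic breathing-like signal is purely illustrative:

```python
import numpy as np

def preprocess_windows(signal, fs=50, win_s=2.0, step_s=0.02):
    """Cut a respiratory signal (assumed already resampled to fs = 50 Hz)
    into 2-s running windows with a 20-ms step, normalizing each window in
    amplitude to the [0, 1] interval so that baseline drift is removed."""
    win = int(win_s * fs)                     # 100 samples per window
    step = max(1, int(step_s * fs))           # 1 sample at 50 Hz
    windows = []
    for start in range(0, len(signal) - win + 1, step):
        w = np.asarray(signal[start:start + win], dtype=float)
        lo, hi = w.min(), w.max()
        windows.append((w - lo) / (hi - lo) if hi > lo else np.zeros(win))
    return np.array(windows)                  # one row per ANN input vector

# Example: 10 s of a synthetic breathing-like signal with baseline drift.
t = np.arange(0, 10, 1 / 50)
sig = np.sin(2 * np.pi * 0.3 * t) + 0.1 * t
print(preprocess_windows(sig).shape)          # (n_windows, 100)
```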

The postprocessing block analyses the output of both ANNs, imposing alternating windows of interest for inspiration and expiration. Point detection is performed by searching for the minimum and maximum inside each retained window of interest.

4.3.1 Artificial neural networks

Two independent ANNs were trained to recognize, in respiratory volume signals, the moment of beginning of inspiration ($ANN_{insp}$) and of expiration ($ANN_{expi}$). For clarity, in the following paragraphs only the expiration network will be presented. A similar approach is used for inspiration.

1. Training: Each ANN consisted of 100 input units (a 2-s window sampled at 50 Hz), one hidden layer with a variable number of units, and 1 output unit. All units are additive and apply the sigmoid transfer function (figure 4.2(B)). All examples are preprocessed before input to the network, namely normalized in amplitude to the [0, 1] interval. In order to avoid the asymptotic convergence to zero and one of the sigmoid function ([12]), the 0-1 tags associated with each window were transformed into 0.1-0.9.

Learning was achieved by using the resilient backpropagation algorithm, from the Matlab Neural Network toolbox 4.0.1 (The MathWorks Inc, MA) ([13]). All examples in the learning set were presented to the network in each cycle, i.e., each learning epoch. Learning took place for a predetermined number of epochs, with certain intermediate results being saved for further analysis.

2. Cross validation and optimization of the number of hidden units: At intermediate steps during learning, the capability of the network to generalize was tested by means of the validation set (VS). In order to have a measure of generalization performance independent of the value chosen for the output threshold (“Thresh.” in figure 4.2(B)), receiver operating characteristic (ROC) analysis was performed ([14]). The area under the ROC curve (AUC) was calculated numerically, varying the threshold value from zero to one with a step of 0.01.

Cross validation (figure 4.3(A)) yields the optimal number of training epochs for each ANN, i.e., the number of epochs for which the generalization capabilities of that specific ANN are optimal, independently of the threshold value. Figure 4.3(A) depicts the learning and generalization AUC at certain epochs for an expiration network with four hidden units. For this network, the best generalization is obtained at 400 learning epochs and corresponds to an AUC of 0.9832.
