Validation strategies - Methods for emotion assessment

Chapter 4 Methods for emotion assessment

4.1.2 Validation strategies

Given the ground truth acquired with the methods presented in Section 4.1.1, the emotion assessment task is defined as supervised classification. It is supervised because a ground truth is available to learn a model (the y_i values), and it is classification because the goal is to retrieve the emotional classes of interest ŷ_i. The methods usable for emotion assessment therefore originate from the pattern recognition and machine learning fields [119, 120].

When training classifiers, overfitting occurs when the obtained model fits the training data perfectly but performs poorly on new, unseen data [119]. To assess the generalization capability of the model, it is thus important to test the performance of a learned model (the classifier) on a dataset different from the one used for learning. Validation strategies consist of splitting the data into two sets: a training set from which the model is learned and a test set on which the performance of the model is tested (Figure 4.1).

Figure 4.1. Validation scheme for classification, where ŷ is the vector of the classes estimated by the model for the test set and A is the accuracy.

In this study, the performance of a model was tested by using the following measure of accuracy:

$$A = \frac{N_c}{N_t} \qquad (4.1)$$

where N_t is the number of samples in the test set and N_c is the number of test samples correctly classified (test samples for which ŷ_i = y_i). A confusion matrix (Table 4.1) is also used to determine how the samples are distributed among the classes: it gives the percentage of samples belonging to class i that are classified as class j.

Table 4.1. A confusion matrix: rows correspond to the true labels y and columns to the estimated labels ŷ; P_i,j is the percentage of samples belonging to class i and classified as class j.

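The accuracy A can be retrieved from the confusion matrix by summing its diagonal elements P_i,i weighted by the prior probability p(ω_i) of occurrence of class i, where ω_i denotes class i and C is the number of classes:

$$A = \sum_{i=1}^{C} p(\omega_i)\, P_{i,i} \qquad (4.2)$$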
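The accuracy measure, the confusion matrix and the relation between them can be illustrated with a minimal Python sketch. The labels below are hypothetical values chosen for illustration, the function names are not from the original study, and the matrix entries are stored as fractions rather than percentages.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """A = N_c / N_t: fraction of test samples whose estimated label is correct."""
    n_correct = sum(1 for y, y_hat in zip(y_true, y_pred) if y == y_hat)
    return n_correct / len(y_true)

def confusion_matrix(y_true, y_pred, classes):
    """P[i][j]: fraction of the samples of class i that were classified as class j.

    Assumes every class in `classes` occurs at least once in y_true.
    """
    counts = Counter(zip(y_true, y_pred))
    per_class = Counter(y_true)
    return [[counts[(ci, cj)] / per_class[ci] for cj in classes] for ci in classes]

# Hypothetical labels for a two-class problem (illustrative values only).
y_true = ["calm", "calm", "calm", "excited", "excited"]
y_pred = ["calm", "calm", "excited", "excited", "excited"]
classes = ["calm", "excited"]

A = accuracy(y_true, y_pred)
P = confusion_matrix(y_true, y_pred, classes)

# Priors estimated from the test labels, then the prior-weighted sum of the diagonal.
priors = [y_true.count(c) / len(y_true) for c in classes]
A_from_P = sum(p * P[i][i] for i, p in enumerate(priors))
```

When the priors are the class frequencies of the test set, as here, the weighted diagonal sum matches the directly computed accuracy up to rounding.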

For the model to correctly represent the data, it is important that the training set contains enough samples (or instances). On the other hand, it is also important that the test set contains enough samples to avoid a noisy estimate of the model performance. This can be problematic because the amount of collected data is often limited in practice. This is particularly true in our case, since the number of emotional stimulations is limited by the duration of the protocols, which should not be too long in order to avoid participant fatigue as well as the elicitation of undesired emotions.

Cross-validation methods help to solve this problem by splitting the data into several training/test sets so that each sample is used at least once for training and once for testing.

Two well-known cross-validation methods are k-fold and leave-one-out [136]. In k-fold cross-validation, the data is split into k folds containing the same number of samples. Generally the folds are determined so that the prior probability p(ω_i) of observing each class ω_i is the same in each fold. Each fold is then used in turn as the test set and a model is learned from the remaining k-1 folds. This method yields k accuracies, one per test set, from which the average accuracy and its standard deviation can be computed. Leave-one-out cross-validation is similar to k-fold cross-validation except that the size of the test set is always 1: N models are learned, each from N-1 samples of the database, and each is tested on the single remaining sample. The advantage of this method is that it provides the maximum possible size for the training set, which generally helps to find a better model, especially when few samples are available in the database. On the other hand, only the average accuracy can be computed reliably, since each test set contains only one sample (the accuracy of each test is thus either 0 or 1).
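The two splitting schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and labels are hypothetical, samples are dealt to folds in their original order (no shuffling), and class proportions are only approximately equal across folds when class sizes are not multiples of k.

```python
from collections import defaultdict

def stratified_k_fold(labels, k):
    """Assign sample indices to k folds so class proportions are similar in each fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal the samples of each class round-robin
    return folds

def cross_validation_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for t, test in enumerate(folds):
        train = [idx for f, fold in enumerate(folds) if f != t for idx in fold]
        yield train, test

def leave_one_out_splits(n_samples):
    """Leave-one-out is the special case where every fold holds a single sample."""
    return cross_validation_splits([[idx] for idx in range(n_samples)])

labels = ["calm"] * 6 + ["excited"] * 6  # hypothetical labels
folds = stratified_k_fold(labels, k=3)
splits = list(cross_validation_splits(folds))
loo = list(leave_one_out_splits(len(labels)))
```

Each element of `splits` or `loo` gives the indices used to learn one model and the indices on which that model is tested.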

When designing a general computational model for emotion assessment (i.e. a model that is not person dependent but can be used to assess the emotions of anyone) based on physiological features, it is important to take into account the high variability that can be observed in physiological reactions. To assess the performance of such a model, it should be tested on the physiological data of persons whose features were not used for learning the model. For this reason the participant cross-validation method was proposed. The database was segmented into folds, each fold containing the samples computed from the physiological signals of a single participant. The classification performance was then computed as in k-fold cross-validation, by using each fold in turn as the test set. This method allows the classification performance to be tested as in a "real case", where the emotions of a user would be assessed by using a model learned from the physiological activity of other persons.
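A possible way to build the participant folds is sketched below. The participant identifiers and function names are hypothetical; the only property the sketch tries to guarantee is that no sample of the held-out participant appears in the training set.

```python
from collections import defaultdict

def participant_folds(participant_ids):
    """Group sample indices so that each fold holds the samples of one participant."""
    folds = defaultdict(list)
    for idx, pid in enumerate(participant_ids):
        folds[pid].append(idx)
    return dict(folds)

def participant_splits(participant_ids):
    """Yield (held_out_participant, train_indices, test_indices)."""
    folds = participant_folds(participant_ids)
    for held_out, test in folds.items():
        train = [idx for pid, fold in folds.items() if pid != held_out for idx in fold]
        yield held_out, train, test

# Hypothetical participant identifier of each sample in the database.
participant_ids = ["p1", "p1", "p2", "p2", "p2", "p3"]
splits = list(participant_splits(participant_ids))
```

Unlike the k-fold case, the folds here can have different sizes, since participants may contribute different numbers of samples.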

4.1.3 Classifiers

Section 4.1.2 detailed how to determine the performance of a classifier. This section describes the different classifiers used in this study, all of which come from the pattern recognition and machine learning fields [119, 120]. For most classification algorithms, it is important that the features be normalized (i.e. that they lie in the same range of values). This normalization was applied at each cross-validation step by whitening each feature using the mean and standard deviation computed from the training set.
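The point of computing the statistics on the training set only is that no information from the test set leaks into the model. A minimal sketch of this step follows; the feature values and function names are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def fit_normalizer(train_features):
    """Compute the per-feature mean and standard deviation from the training set only."""
    columns = list(zip(*train_features))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant features
    return means, stds

def normalize(features, means, stds):
    """Apply the training-set statistics to any set of samples (training or test)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in features]

train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # hypothetical feature vectors
test = [[2.0, 40.0]]

means, stds = fit_normalizer(train)
train_n = normalize(train, means, stds)
test_n = normalize(test, means, stds)  # test samples reuse the training statistics
```

In a cross-validation loop, `fit_normalizer` would be called again on the training part of every split.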

a. Naïve Bayes

Several classifiers rely on Bayes' rule to find the most probable class for a sample represented by its feature vector f. The estimated label ŷ is the class ω_i that maximizes the posterior probability p(ω_i | f). According to Bayes' rule:

$$p(\omega_i \mid f) = \frac{p(f \mid \omega_i)\, p(\omega_i)}{p(f)} \qquad (4.3)$$

One of the advantages of this type of classifier is that it outputs the posterior probability p(ω_i | f) that a sample belongs to a given class ω_i. Notice that maximizing the numerator is enough to maximize p(ω_i | f), since the denominator has the same value for every class ω_i. The main difference between Bayesian classifiers is the way in which the conditional probabilities p(f | ω_i) are estimated.

For the Naïve Bayes classifier, the features are assumed to be conditionally independent given the class ω_i:

$$p(f \mid \omega_i) = \prod_{j=1}^{F} p(f_j \mid \omega_i) \qquad (4.4)$$

where F is the number of features in the feature vector f. In this study, the conditional probability p(f_j | ω_i) of a feature j was estimated by quantizing the feature into 10 bins of equal size and computing the associated conditional probability mass function from the training set. The prior probability p(ω_i) could also be computed from the training set; it would however be biased by the number of stimuli presented for each class in the protocol. For instance, if the aim of the classifier is to distinguish between calm and excited emotional states and more exciting stimuli were presented to the participants, then the prior probability would be higher for the excited class. While this is coherent within this particular protocol, it has no meaning in real applications, since nothing guarantees a higher occurrence of excited states in that case. For this reason the prior probability p(ω_i) was set to 1/C, where C is the number of classes, under the assumption of equiprobability of the classes.
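The histogram-based Naïve Bayes classifier described above can be sketched as follows. Several details are assumptions made for the sake of a runnable example and are not taken from the text: the bins span the minimum-to-maximum range of each feature in the training set, values outside that range are clipped to the outer bins, and Laplace smoothing is added so that empty bins do not produce zero probabilities. The data and function names are hypothetical.

```python
import math

N_BINS = 10  # number of equal-width bins, as stated in the text

def bin_index(x, lo, hi):
    """Map a feature value to one of N_BINS equal-width bins over [lo, hi]."""
    if hi == lo:
        return 0
    b = int((x - lo) / (hi - lo) * N_BINS)
    return min(max(b, 0), N_BINS - 1)  # clip values outside the training range

def train_naive_bayes(features, labels):
    classes = sorted(set(labels))
    n_features = len(features[0])
    ranges = [(min(r[j] for r in features), max(r[j] for r in features))
              for j in range(n_features)]
    # pmf[c][j][b] estimates p(f_j falls in bin b | class c)
    pmf = {c: [[0.0] * N_BINS for _ in range(n_features)] for c in classes}
    for row, c in zip(features, labels):
        for j, x in enumerate(row):
            pmf[c][j][bin_index(x, *ranges[j])] += 1.0
    for c in classes:
        n_c = labels.count(c)
        for j in range(n_features):
            # Laplace smoothing: an assumption of this sketch, not from the text
            pmf[c][j] = [(v + 1.0) / (n_c + N_BINS) for v in pmf[c][j]]
    return classes, ranges, pmf

def predict(model, row):
    classes, ranges, pmf = model
    log_prior = math.log(1.0 / len(classes))  # equiprobable classes, p = 1/C
    scores = {c: log_prior + sum(math.log(pmf[c][j][bin_index(x, *ranges[j])])
                                 for j, x in enumerate(row))
              for c in classes}
    return max(scores, key=scores.get)

# Hypothetical two-feature training set.
features = [[0.1, 1.0], [0.2, 1.2], [0.9, 3.0], [1.0, 2.8]]
labels = ["calm", "calm", "excited", "excited"]
model = train_naive_bayes(features, labels)
```

Working with sums of logarithms instead of the product of equation 4.4 avoids numerical underflow when F is large; the maximizing class is the same.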

b. Discriminant analysis

Two discriminant analysis methods, namely linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), are used in this study. Both are based on Bayes' rule to find the class with the highest posterior probability p(ω_i | f) [119]. For this purpose the following discriminant functions g_i are defined:

$$g_i(f) = \ln\big(p(f \mid \omega_i)\, p(\omega_i)\big) \qquad (4.5)$$

Finding the class ω_i with the highest g_i value is then equivalent to finding the class that maximizes the numerator of equation 4.3. Under the assumption that the conditional distributions p(f | ω_i) are Gaussian with different means µ_i and covariance matrices Σ_i, this rule automatically defines a (hyper-)quadratic decision boundary (hence the name QDA for the associated classifier):

$$g_i(f) = -\frac{1}{2}(f - \mu_i)^T \Sigma_i^{-1} (f - \mu_i) - \frac{F}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i) \qquad (4.6)$$

where T and |.| respectively stand for the transpose and determinant operators. The vectors µ_i and matrices Σ_i are computed from the training set. In the case where Σ_i = Σ_j for all i, j, the boundary becomes linear, yielding an LDA classifier. With LDA it is sufficient to compute a single covariance matrix from the complete training set without distinction between classes. As for the Naïve Bayes classifier, the prior probability p(ω_i) was defined as 1/C.

In the case where the size F of the feature space is large and the number of samples available for learning is small, discriminant analysis can run into the singularity problem, where the covariance matrix Σ_i is not invertible. In this case we used the diagonalized version, in which covariance matrices are assumed to be diagonal and contain only the variances of the features. Notice that in this case the discriminant analysis is a Naïve Bayes classifier with a conditionally independent Gaussian assumption for the p(f_j | ω_i) distributions. The Matlab Statistics Toolbox (v. 5.0.1) implementation of those algorithms was used in this study.
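With a diagonal covariance matrix, the discriminant function of equation 4.6 reduces to sums over the features, which makes it easy to illustrate without a linear algebra library. The sketch below is not the Matlab implementation used in the study: the function names and data are hypothetical, variances are estimated by dividing by the number of samples (an assumption), and a feature with zero variance within a class would cause a division by zero.

```python
import math

def fit_diagonal_gaussians(features, labels):
    """Per class, estimate the mean and variance of each feature (diagonal covariance)."""
    params = {}
    for c in sorted(set(labels)):
        rows = [r for r, y in zip(features, labels) if y == c]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((x - m) ** 2 for x in col) / n
                     for col, m in zip(zip(*rows), means)]
        params[c] = (means, variances)
    return params

def discriminant(f, means, variances, prior):
    """g_i(f) of equation 4.6 when the covariance matrix is diagonal."""
    F = len(f)
    mahalanobis = sum((x - m) ** 2 / v for x, m, v in zip(f, means, variances))
    log_det = sum(math.log(v) for v in variances)  # ln|Sigma_i| for a diagonal matrix
    return (-0.5 * mahalanobis - 0.5 * F * math.log(2 * math.pi)
            - 0.5 * log_det + math.log(prior))

def classify(f, params):
    prior = 1.0 / len(params)  # equiprobable classes, p = 1/C
    return max(params, key=lambda c: discriminant(f, *params[c], prior))

# Hypothetical two-feature training set.
features = [[0.0, 1.0], [1.0, 3.0], [4.0, 6.0], [6.0, 8.0]]
labels = ["calm", "calm", "excited", "excited"]
params = fit_diagonal_gaussians(features, labels)
```

The value returned by `discriminant` equals the sum of the per-feature Gaussian log-densities plus the log prior, which is the Naïve Bayes equivalence noted above.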

c. Support Vector Machines (SVM’s)

An SVM [120, 137] is a two-class classifier (C=2) using a linear model of the form:

$$h(f) = w^T \varphi(f) + b \qquad (4.7)$$

where a feature vector f is estimated as being from class 1 if h(f) < 0 and from class 2 if h(f) > 0. The function φ projects a feature vector into another feature space, generally of higher dimensionality, thus allowing for non-linear separation of the data in the original feature space. To find the model weights w and b, an SVM tries to maximize the margin between the decision surface defined by the function h and the closest training samples, while minimizing the error on the training set. The trade-off between margin maximization and training error minimization is controlled by a parameter C_SVM, which was empirically set to 1 in this study. The advantage of SVM's is that they minimize an upper bound on the expected risk rather than only the error on the training data, thus enabling good generalization even for undersampled datasets, as well as good performance in high-dimensional feature spaces [138]. Moreover, they provide sparse solutions in which not all of the data points are used for classification.

The SVM optimization problem can be expressed in a dual form [120, 137], where a kernel function k(f, f') = φ(f)^T φ(f') is introduced between two samples f and f'. In this formulation, the decision boundary becomes a function of only some of the data points, called the support vectors. In this study, both linear and radial basis function (RBF) kernels were used:

$$k_{linear}(f, f') = f^T f' \qquad (4.8)$$

$$k_{RBF}(f, f') = e^{-\gamma \|f - f'\|^2} \qquad (4.9)$$

where ||.|| is the norm operator. In the case of RBF kernels, the size of the kernel γ was chosen by applying a 5-fold cross-validation procedure on the training set and selecting the value yielding the best accuracy. The tested values ranged from 5·10^-3 to 5·10^-1 with a step of 5·10^-3.
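The two kernels and the search grid can be written down in a few lines. The sketch below assumes that the RBF kernel has the form exp(-gamma * ||f - f'||^2), which is the parameterization used by libSVM; the `select_gamma` helper and its `cv_accuracy` argument are hypothetical stand-ins for the 5-fold cross-validation performed on the training set.

```python
import math

def linear_kernel(f, f_prime):
    """Equation 4.8: inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(f, f_prime))

def rbf_kernel(f, f_prime, gamma):
    """Equation 4.9, assuming the form exp(-gamma * ||f - f'||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(f, f_prime))
    return math.exp(-gamma * squared_distance)

# Candidate kernel sizes: 5e-3 to 5e-1 in steps of 5e-3, i.e. 100 values.
gamma_grid = [round(5e-3 * step, 3) for step in range(1, 101)]

def select_gamma(grid, cv_accuracy):
    """Return the grid value with the highest cross-validated accuracy.

    `cv_accuracy` is a hypothetical callable mapping a kernel size to the
    mean accuracy of a 5-fold cross-validation on the training set.
    """
    return max(grid, key=cv_accuracy)
```

Because the selection uses only the training set, the test set of the outer validation loop stays untouched by the parameter search.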

There are two drawbacks to the use of SVM's as classifiers: they are intrinsically two-class classifiers, and their output is uncalibrated, so that it is not directly usable as a confidence value when one wants to combine the outputs of different classifiers or modalities. In this study the first point was addressed by using the one-versus-one approach, where C(C-1)/2 classifiers are trained, one for each possible pair of classes. The class assigned to a test sample is the one that receives the highest number of votes from the C(C-1)/2 classifiers.
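The voting step of the one-versus-one approach can be sketched as below. The pairwise classifiers are replaced by hypothetical threshold rules on a one-dimensional sample, purely to make the example runnable; the text does not say how ties between classes were resolved, so the tie-breaking here (first class to reach the top count) is an assumption.

```python
from collections import Counter
from itertools import combinations

def one_versus_one_predict(sample, classes, pairwise_classifiers):
    """Majority vote over the C(C-1)/2 pairwise classifiers.

    `pairwise_classifiers` maps a pair (class_a, class_b) to a function that
    returns the winning class of that pair for the given sample.
    """
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[pairwise_classifiers[pair](sample)] += 1
    return votes.most_common(1)[0][0]

# Hypothetical one-dimensional example: each pairwise rule is a simple threshold.
classes = ["calm", "neutral", "excited"]
pairwise_classifiers = {
    ("calm", "neutral"): lambda x: "calm" if x < 1.0 else "neutral",
    ("calm", "excited"): lambda x: "calm" if x < 2.0 else "excited",
    ("neutral", "excited"): lambda x: "neutral" if x < 3.0 else "excited",
}
label = one_versus_one_predict(1.5, classes, pairwise_classifiers)
```

For C = 3 classes this uses 3 pairwise classifiers; the number grows quadratically with C.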

Figure 4.2. Obtaining posterior probabilities p(ω_i | h) from SVM outputs. a) Histograms representing the distributions of the SVM output for two classes. b) Posterior probability estimates from Bayes' rule applied to the histograms of a) and from the sigmoid fit proposed by Platt [139].

For the second point, Platt (2000) proposed to model the probability p(ω_1 | h) of being in the first of the two classes given the output value h of the SVM. As can be seen in Figure 4.2, this could be done by applying Bayes' rule to the histograms of the SVM outputs. The discrete posterior probability plotted in Figure 4.2 approximately follows a sigmoid curve; this is why Platt proposed to model those probabilities by using:

$$p(\omega_1 \mid h) = \frac{1}{1 + \exp(\alpha h + \beta)} \qquad (4.10)$$

where the α and β values are found by the algorithm proposed in [139] and improved in [140].

Figure 4.2 shows the result of the sigmoid curve fitting. Concretely, the h values were obtained from a 5-fold cross-validation on the training set and the parameters α and β were determined from those h values. The posterior probabilities of the test samples were then computed using equation 4.10. Finally, to compute the posterior probabilities p(ω_i | h) when there are more than two classes to separate, the solution proposed in [141] was employed. The libSVM [142] Matlab toolbox was used as an implementation of the SVM and probabilistic SVM algorithms.
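Once the two sigmoid parameters are known, turning an SVM output into a probability is a one-line computation. The sketch below only evaluates the sigmoid of equation 4.10; it does not implement the fitting algorithm of [139, 140]. The parameter names `alpha` and `beta` and the numerical values are placeholders chosen for illustration.

```python
import math

def platt_probability(h, alpha, beta):
    """Sigmoid of equation 4.10: p(class 1 | h) = 1 / (1 + exp(alpha * h + beta))."""
    return 1.0 / (1.0 + math.exp(alpha * h + beta))

# Hypothetical parameters; in practice they are fitted on cross-validated SVM outputs.
alpha, beta = 2.0, 0.0
p_negative = platt_probability(-1.0, alpha, beta)  # h < 0: class 1 is more probable
p_boundary = platt_probability(0.0, alpha, beta)   # on the boundary when beta = 0
p_positive = platt_probability(1.0, alpha, beta)   # h > 0: class 2 is more probable
```

With a positive `alpha`, the probability of class 1 decreases as h increases, which agrees with the convention that h(f) < 0 corresponds to class 1.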

[Figure 4.2, panels a) and b): horizontal axis, h function value; vertical axis label recovered from the figure, p(ω_1 | h).]

d. Relevance Vector Machines (RVM’s)

RVM's [143] are algorithms that have the same functional form as SVM's but are embedded in a Bayesian learning framework. They have been shown to provide results similar to those of SVM's, generally with sparser solutions. They have the advantage of directly giving an estimate of the posterior probability of each class ω_i.

RVM’s try to maximize the likelihood function of the training set using a linear model including kernels. The main difference with classical probabilistic discriminative models is that a different prior is applied on each weight thus leading to sparse solutions that should generalize well. For all the following studies, the multiclass RVM version presented in [144] was used.

From the document: Emotion assessment for affective computing based on brain and peripheral signals (pages 91-97)