
Chapter 5 Assessment of emotions elicited by visual stimuli

5.2 Data collection

5.3.1 Ground-truth definitions

The visual stimuli used in both the valence and arousal experiments were purposely chosen to belong to two distinct classes: negative vs. positive for the valence experiment and calm vs. excited for the arousal experiment. Since self-evaluations were also collected, the ground-truth can be defined either a priori, based on the classes defined by the IAPS evaluations, or a posteriori, using the self-evaluations. Some of the advantages and disadvantages of these two methods are discussed in Section 4.1.1.

This section compares the IAPS evaluations to the self-assessment values and discusses the construction of the ground-truth for emotion assessment. For this purpose, the IAPS evaluations (ranging from 1 to 9) were linearly mapped onto the same range as the self-assessment values (ranging from -2 to 2 for valence and 0 to 4 for arousal) using the following formulas:

$$A = \frac{A_{IAPS} - 1}{2}, \qquad V = \frac{V_{IAPS} - 5}{2} \tag{5.1}$$

$V_{IAPS}$ and $A_{IAPS}$ being the original IAPS valence / arousal values and $V$ and $A$ being the new valence / arousal values.
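The rescaling can be checked with a short sketch. It assumes only what the text states: IAPS ratings lie in [1, 9], self-assessed valence in [-2, 2] and self-assessed arousal in [0, 4], and the mapping is linear, so the endpoints of one scale map onto the endpoints of the other. The function names are illustrative and do not come from the original work.

```python
def rescale_valence(v_iaps: float) -> float:
    """Map an IAPS valence rating in [1, 9] onto the range [-2, 2]."""
    return (v_iaps - 5.0) / 2.0


def rescale_arousal(a_iaps: float) -> float:
    """Map an IAPS arousal rating in [1, 9] onto the range [0, 4]."""
    return (a_iaps - 1.0) / 2.0


if __name__ == "__main__":
    # Endpoints and midpoint of the IAPS scale.
    for rating in (1.0, 5.0, 9.0):
        print(rating, rescale_valence(rating), rescale_arousal(rating))
```

With this mapping, an IAPS valence of 5 (the midpoint of the 9-point scale) corresponds to a neutral valence of 0.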

a. Valence experiment

As can be seen from Figure 5.2, the valence distribution of the self-evaluations is very close to the one obtained from the IAPS evaluations. This tends to confirm that the visual stimuli elicited the expected emotions, either positive or negative. However, 3 of the 4 participants judged some of the stimuli as being of neutral valence, with 79% of those stimuli belonging to the positive class according to the IAPS evaluations. Moreover, only one participant gave a value of 2, and only to three of the stimuli. These results suggest that, according to the self-assessments of the 4 participants, positive emotions were more difficult to elicit than negative emotions.

Concerning arousal evaluations, the IAPS and self-assessment distributions were found to be quite different. For instance, most of the stimuli were self-evaluated with an arousal value of 1 while, because of the constraints applied for the selection of images, none of the stimuli had an IAPS arousal value below 2. This result can be explained by two factors. First, the stimuli may have elicited lower arousal than expected. Second, the participants may have used the complete scale to report the arousal difference between the stimuli (a measure relative to the set of stimuli) rather than an absolute arousal value. When the IAPS arousal histogram is computed with a higher number of bins than in Figure 5.2 (which is possible because the IAPS values are means computed from a 9-point scale), the histogram over the interval [2,4] is very similar to the one obtained from the self-assessments. This supports the argument of relative self-assessment and shows that the self-assessed arousal values were not so different from the IAPS evaluations.

Since the self-assessments were close to the IAPS evaluations, particularly for valence evaluations, the ground-truth was defined based on the IAPS evaluations only. This allows the construction of two classes of interest: one class corresponding to the positive stimuli and one corresponding to the negative stimuli.

[Figure 5.2: histogram panels titled "Original IAPS evaluations" and "Self-assessments"; x-axes: Valence, Arousal; y-axis: Number of IAPS images.]

Figure 5.2. Histograms of the IAPS and self evaluations (valence and arousal) for the valence experiment. For easier comparison of IAPS evaluations and self evaluations, the IAPS values have been normalized to the same range as the self evaluations.

b. Arousal experiment

As for the valence experiment, the distributions of valence obtained from the IAPS and self-assessment values are quite similar (see Figure 5.3). The higher number of images self-evaluated as having a valence of 1 compared to the IAPS evaluations is mostly due to participants evaluating originally neutral images (valence value of 0) as slightly positive. This is particularly true for participant 3, as can be seen from Figure 5.3.

[Figure 5.3: histogram panels titled "Original IAPS evaluations" and "Self-assessments"; x-axes: Valence, Arousal; y-axis: Number of IAPS images.]

Figure 5.3. Histograms of the IAPS and self evaluations (valence and arousal) for the arousal experiment. For easier comparison of IAPS evaluations and self evaluations, the IAPS values have been normalized to the same range as the self evaluations.

For arousal, the histograms of the IAPS and self-assessment values were again different. Due to the constraints applied to construct sets of low and high arousal stimuli, two peaks can be observed in the IAPS arousal histogram. However, those peaks are not clearly visible in the histogram obtained from the self-assessments. Only participant 3 produced similar peaks, with a high number of images rated with arousal values of 0 and 2. Since no constraint was applied on the variance of arousal during the selection of the stimuli, the histogram difference is certainly due to a large variability of the arousal judgments across participants and demonstrates the difference in evaluation that can be observed for the same stimuli.

Since the distributions of self-evaluations and IAPS values were found to be different for arousal and the purpose of this experiment is to assess the arousal dimension of emotions, different sets of classes were constructed based on either the IAPS values or the self-assessments.

The images used for the arousal assessment were purposely chosen to have either very low or very high IAPS arousal values; that is, they should essentially have belonged to 2 classes. For this reason, when using the IAPS judgments as a basis to build ground-truth classes, it was natural to divide the data into two sets, one for the calm emotions and the other for the exciting emotions. In this way, two well-balanced ground-truth classes of 50 patterns each were obtained.

It is more difficult to determine classes from the self-assessment values. As shown by the histograms of arousal, the evaluations are not equally distributed across the 5 choices and, in particular, do not readily correspond to 2 classes. Taking this into account, two different classification experiments based on the self-assessments were carried out:

- with 2 ground-truth classes, where the calm class contained the patterns judged in the calmest category and the exciting class contained the others,

- with 3 ground-truth classes (calm, neutral, exciting), where the calm class corresponded to the first of the 5 judgment values, the neutral class to the second and third, and the exciting class to the last two.

Both class definitions led to unbalanced classes, especially for the 3-class problem: the exciting class contained very few samples (6 to 23 depending on the participant) compared to the calm class (32 to 45) and the neutral class (32 to 55).
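The two self-assessment-based class definitions can be sketched as follows. The sketch assumes that the 5 judgment values are the integers 0 to 4, with 0 the calmest category, which matches the arousal range given earlier in this section. The function names and the example ratings are hypothetical and serve only to illustrate the mapping and the resulting class imbalance.

```python
from collections import Counter


def two_class_label(arousal: int) -> str:
    """Calm = calmest category (0); exciting = every other rating (1 to 4)."""
    if not 0 <= arousal <= 4:
        raise ValueError("arousal rating must be an integer between 0 and 4")
    return "calm" if arousal == 0 else "exciting"


def three_class_label(arousal: int) -> str:
    """Calm = first value, neutral = second and third, exciting = last two."""
    if not 0 <= arousal <= 4:
        raise ValueError("arousal rating must be an integer between 0 and 4")
    if arousal == 0:
        return "calm"
    if arousal in (1, 2):
        return "neutral"
    return "exciting"


if __name__ == "__main__":
    # Hypothetical ratings, not data from the experiment.
    ratings = [0, 1, 1, 2, 4, 0, 3, 1]
    print(Counter(two_class_label(r) for r in ratings))
    print(Counter(three_class_label(r) for r in ratings))
```

Counting the labels per participant in this way makes the imbalance between classes directly visible before any classifier is trained.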

5.3.2 Methods

An ANOVA test was applied to the features of the EEG_Lateral feature set to check that significant differences were observed in the asymmetry scores between the positively and negatively valenced emotional states. This was done to verify previous findings concerning alpha lateralization in the case where emotions are elicited by pictures, and also to check whether the asymmetry scores can be used as features for classification. Since classification was done in an intra-participant framework (a model was designed for each participant), the ANOVA test was run separately for each participant. For the EEG_Area feature set, no ANOVA test was applied because Aftanas et al. [82] already demonstrated the interest of those features for arousal discrimination in a protocol very similar to the one proposed in this chapter.
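As an illustration of the per-participant test, the sketch below computes the one-way ANOVA F statistic for a single feature and a single participant. It is not the original analysis code: the function name and the asymmetry scores are hypothetical, and only the F statistic and its degrees of freedom are computed. The p-value would then be read from the F distribution with those degrees of freedom.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of feature values per class.
    The within-group sum of squares must be non-zero.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variability of the class means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the samples around their own class mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical asymmetry scores of one participant for one feature.
    positive = [0.12, 0.30, 0.25, 0.18]
    negative = [-0.05, 0.02, 0.10, -0.01]
    print(one_way_anova_f([positive, negative]))
```

Running the function once per participant and per feature mirrors the intra-participant framework described above.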

A Naïve Bayes classifier was first applied to the features of each participant. This classifier is known to be optimal under the assumption of conditionally independent features and in the case of complete knowledge of the underlying probability distributions of the problem. Modeling the underlying distributions is unfortunately difficult in our study, since very few samples are available to construct them; a performance decrease is thus unavoidable. For the sake of comparison, classification based on LDA was also performed. In this case the distributions are assumed to be multivariate Gaussians with no assumption of independence. Due to the rather limited number of patterns, leave-one-out cross-validation was preferred to a k-fold strategy in order to maximize the size of the training set (see Section 4.1.2). The results presented in the next section are the percentages of correctly classified examples.
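The leave-one-out procedure can be sketched as below. Two points are assumptions and not statements about the original implementation: each feature is modeled with a univariate Gaussian per class (the text does not say how the class-conditional distributions were estimated), and a small variance floor is added to avoid division by zero. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-6):
    """Estimate the prior and per-feature mean/variance of each class."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, features):
    """Return the class with the highest log-posterior (up to a constant)."""
    def log_posterior(params):
        prior, means, variances = params
        total = math.log(prior)
        for x, m, v in zip(features, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))


def leave_one_out_accuracy(X, y):
    """Train on all samples but one, test on the held-out one, and repeat."""
    correct = 0
    for i in range(len(X)):
        model = fit_gaussian_nb(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += predict(model, X[i]) == y[i]
    return 100.0 * correct / len(X)


if __name__ == "__main__":
    # Toy, well-separated patterns; not data from the experiment.
    X = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0],
         [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
    y = ["calm"] * 3 + ["exciting"] * 3
    print(leave_one_out_accuracy(X, y))
```

With N patterns, each model is trained on N - 1 samples, which is the largest training set any cross-validation scheme can offer, at the cost of fitting N models.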

5.4 Results

Source document: Emotion assessment for affective computing based on brain and peripheral signals (pages 110-114)