

HAL Id: hal-00499211

https://hal.archives-ouvertes.fr/hal-00499211

Submitted on 9 Jul 2010


Fear-type emotion recognition for future audio-based surveillance systems

Chloé Clavel, I. Vasilescu, L. Devillers, Gael Richard, T. Ehrette

To cite this version:

Chloé Clavel, I. Vasilescu, L. Devillers, Gael Richard, T. Ehrette. Fear-type emotion recognition for future audio-based surveillance systems. Speech Communication, Elsevier: North-Holland, 2008, 50 (6), pp. 487. doi:10.1016/j.specom.2008.03.012. hal-00499211

Accepted Manuscript

Fear-type emotion recognition for future audio-based surveillance systems

C. Clavel, I. Vasilescu, L. Devillers, G. Richard, T. Ehrette

PII: S0167-6393(08)00037-X
DOI: 10.1016/j.specom.2008.03.012
Reference: SPECOM 1696
To appear in: Speech Communication
Received Date: 26 October 2007
Revised Date: 14 March 2008
Accepted Date: 14 March 2008

Please cite this article as: Clavel, C., Vasilescu, I., Devillers, L., Richard, G., Ehrette, T., Fear-type emotion recognition for future audio-based surveillance systems, Speech Communication (2008), doi:10.1016/j.specom.2008.03.012



Fear-type emotion recognition for future audio-based surveillance systems

C. Clavel (a), I. Vasilescu (b), L. Devillers (b), G. Richard (c), T. Ehrette (a)

(a) Thales Research and Technology France, RD 128, 91767 Palaiseau Cedex, France
(b) LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France
(c) TELECOM ParisTech, 37 rue Dareau, 75014 Paris, France

Abstract

This paper addresses the issue of automatic emotion recognition in speech. We focus on a type of emotional manifestation which has been rarely studied in speech processing: fear-type emotions occurring during abnormal situations (here, unplanned events where human life is threatened). This study is dedicated to a new application in emotion recognition: public safety. The starting point of this work is the definition and the collection of data illustrating extreme emotional manifestations in threatening situations. For this purpose we develop the SAFE corpus (Situation Analysis in a Fictional and Emotional corpus) based on fiction movies. It consists of 7 hours of recordings organized into 400 audiovisual sequences. The corpus contains recordings of both normal and abnormal situations and provides a large scope of contexts and therefore a large scope of emotional manifestations. In this way, not only does it address the issue of the lack of corpora illustrating strong emotions, but it also forms an interesting support to study a high variety of emotional manifestations. We define a task-dependent annotation strategy which has the particularity of describing simultaneously the emotion and the situation evolution in context. The emotion recognition system is based on these data and must handle a large scope of unknown speakers and situations in noisy sound environments. It consists of a fear vs. neutral classification. The novelty of our approach relies on dissociated acoustic models of the voiced and unvoiced contents of speech. The two are then merged at the decision step of the classification system. The results are quite promising given the complexity and the diversity of the data: the error rate is about 30%.

Key words: Fear-type emotion recognition; Fiction corpus; Annotation scheme; Acoustic features of emotions; Machine learning; Threatening situations; Civil safety.

Email address: chloe.clavel@thalesgroup.com (C. Clavel).


1 Introduction

One of the challenges of speech processing is to give computers the ability to understand human behaviour. The computer input is the signal captured by a microphone, i.e. the low level information provided by audio samples. Closing the gap between this low level data and the understanding of human behaviour is a scientific challenge. Consequently, the issue now is not only to know what is said but also to know the speaker's attitude, emotion or personality.

This paper concerns the emerging research field of emotion recognition in speech. We propose to investigate the integration of emotion recognition in a new application, namely automatic surveillance systems. This study comes within the scope of the SERKET¹ project, which aims to develop surveillance systems dealing with dispersed data coming from heterogeneous sensors, including audio sensors. It is motivated by the crucial role played by the emotional component of speech in the understanding of human behaviour, and therefore in the diagnosis of abnormal situations.

Our audio-based surveillance system is ultimately designed to consider the information conveyed by abnormal non-vocal events such as gunshots (Clavel et al., 2005), though we focus here on the part of the system dealing with vocal manifestations in abnormal situations. We look at things from the viewpoint of protecting human life in the context of civil safety and we choose to focus on abnormal situations during which human life is in danger (e.g. fire, psychological and physical attack). In this context, the targeted emotions correspond to a type of emotional manifestation which has so far been rarely studied: fear-type emotions occurring during abnormal situations.

The development of an emotion recognition system can be broken down into four distinct steps: the acquisition of emotional data, the manual annotation of the emotional content, the acoustic description of the emotional content and the development of machine learning algorithms. Given that the emotional phenomenon is especially complex and hard to define, these steps require the know-how of a set of distinct disciplines such as psychology, social sciences, biology, phonetics, linguistics, artificial intelligence, statistics, acoustics and audio signal processing. In this introduction, we set out first to unravel the know-how of these disciplines from an emotion recognition system point of view and then to present the additional challenges implied by the surveillance application.

¹ http://www.research.thalesgroup.com/software/cognitive_solutions/Serket/index.html


1.1 Overview of emotion recognition systems

1.1.1 Acquisition of the emotional recordings

The basis of emotion research studies is the acquisition of data that are recordings of emotional manifestations. More precisely, data are required for the conception of emotion recognition systems so that the machine can learn to differentiate the acoustic models of emotion. In this case, the challenge is to collect a large number of recordings illustrating emotions as they are expected to occur in application data. In particular, data should ideally be representative of everyday life if the application has to run in everyday life contexts (Douglas-Cowie et al., 2003). Besides, not only the type of collected emotions but also the type of pictured contexts should be appropriate for the targeted application. The context of emotion emergence concerns the situation (place, triggering events), the interaction (human-human or human-machine, duration of the interaction), the social context (agent-customer for call centres), the speaker (gender, age), the cultural context, the linguistic context (language, dialect), and the inter-modal context (gesture and speech for surveillance applications or speech alone for call centres).

The HUMAINE network of excellence has carried out an evaluation of the existing emotional databases². This evaluation shows that one requirement is not adequately addressed in existing databases: there is a lack of corpora illustrating strong emotions with an acceptable level of realism. Indeed specific real-life emotional data are difficult to collect given their unpredictable and confidential nature. That is the reason why acted databases are still used to a large extent in emotional speech studies: Juslin and Laukka (2003) list 104 studies on emotions and estimate at 87% the percentage of studies carried out on acted data. The difficulty is greater when dealing with extreme emotions occurring in real-life threat contexts, and extreme emotions are almost exclusively illustrated in acted databases (Mozziconacci (1998), Kienast and Sendlmeier (2000), van Bezooijen (1984), Abelin and Allwood (2000), Yacoub et al. (2003), McGilloway (1997), Dellaert et al. (1996), Banse and Scherer (1996), Bänziger et al. (2006)).

² http://emotion-research.net/wiki/Databases

Acted databases generally tend to reflect stereotypes that are more or less far from emotions likely to occur in real-life contexts. This realism depends on the speaker (professional actor or not) and on the context or the scenario provided to the speaker for emotion simulation. Most acted databases are laboratory data produced under conditions designed to remove contextual information. Some recent studies have aimed to collect more realistic emotional portrayals by using acting techniques that are thought to stir genuine emotions through action (Bänziger et al., 2006), (Enos and Hirshberg, 2006).

An alternative way to obtain realistic emotional manifestations is to induce emotions without the speaker's knowledge, such as with the eWIZ database (Aubergé et al., 2004), and the SAL database (Douglas-Cowie et al., 2003). However, the induction of fear-type emotions may be medically dangerous and unethical, so that fear-type emotions are not illustrated in elicited databases.

The third type of emotional database, the real-life database, illustrates, to a large extent, everyday life contexts in which social emotions currently occur. Some real-life databases illustrate strong emotional manifestations (Vidrascu and Devillers (2005), France et al. (2003)) but the types of situational contexts are very specific (emergency call centre and therapy sessions), which raises the matter of using databases illustrating a restricted scope of contexts (as defined previously) for various applications.

1.1.2 Annotation of the emotional content

The second step consists of the emotional content annotation of the recordings. The challenge is to define an annotation strategy which is a good trade-off between genericity (data-independence) and the complexity of the annotation task. Annotated data are required not only to evaluate the performance of the system, but also to build the training database by linking recordings to their emotional classes. The annotated data must therefore provide an acceptable level of agreement. However, the emotional phenomenon is especially complex and subjected to discord. According to Scherer et al. (1980), this complexity is increased by the two opposite effects push/pull implied in emotional speech: physiological excitations push the voice in one direction and conscious attempts driven by cultural rules pull them in another direction.

The literature on emotion representation models has its roots in psychological studies, and offers two major description models. The first one consists of representing the range of emotional manifestations in abstract dimensions. Various dimensions have been proposed and they vary according to the underlying psychological theory. The activation/evaluation space has recently been the one used most frequently and is known to capture a large range of emotional variation (Whissel, 1989).

The second one consists of using categories for the emotion description. A large amount of studies dedicated to emotional speech use a short list of 'basic' (see the overview of Orthony and Turner (1990)) or 'primary' (Damasio, 1994) emotion terms which differ according to the underlying psychological theories. The 'Big Six' (fear, anger, joy, disgust, sadness and surprise) defined by Ekman and Friesen (1975) are the most popular. However fuller lists (Ekman (1999), Whissel (1989), Plutchik (1984)) have been established to describe 'emotion-related states' (Cowie and Cornelius, 2003) and Devillers et al. (2005b) have shown that emotions in real-life are rarely 'basic emotions' but complex and blended emotional manifestations. At a cognitive level, this type of description involves drawing frontiers in the perceptive space. Each emotional category may be considered as a prototype, the centre of a class of more or less similar emotional manifestations (Kleiber, 1990), which can be linked to other similar manifestations. The difficulty of the categorization task strongly depends on the emotional material. The majority of acted databases aim to illustrate predefined emotional prototypes. All the emotional manifestations illustrated by this type of corpus are strongly convergent to the same prototype. By contrast, emotions occurring in real-life corpora are uncontrolled. They display unpredictable distances to their theoretical prototype. This propensity to occur in different situations through various manifestations engenders labelling challenges when one makes use of a predefined list of labels. In addition, the complexity of emotional categorization is increased by the diversity of the data.

Existing annotation schemes fall short of industrial expectations. Noting this, which is closely akin to the motivation of the EARL proposal³ by the W3C Emotion Incubator Group, leads us to unravel the emotion description task from an emotion recognition system point of view.

³ Emotion Annotation and Representation Language, http://emotion-research.net/earl/proposal. The W3C Emotion Incubator Group, after one year of joint work involving several HUMAINE partners, has published its Final Report and a paper in ACII 2007 (Schroeder et al., 2007).

1.1.3 Acoustic description of the emotional content

After the emotional material has been collected and labelled, the next step is to extract from the speech recordings acoustic features characterizing the various emotional manifestations. This representation of the speech signal will be used as the input of the emotion classification system. Existing representations are based on both high-level and low-level features. High-level features, such as pitch or intensity, aim at characterizing the speech variation accompanying physiological or bodily emotional modifications (Picard (1997), Scherer et al. (2001)). First studies focus on prosodic features which typically include pitch, intensity and speech rate, are largely used in emotion classification systems (Kwon et al. (2003), McGilloway (1997), Schuller et al. (2004)) and stand out as especially salient for fear characterization (Scherer (2003), Devillers and Vasilescu (2003), Batliner et al. (2003)). Voice quality features which characterize creaky, breathy or tensed voices have also recently been used for the acoustic representation of emotional content (Campbell and Mokhtari, 2003). Low-level features such as spectral and cepstral features were initially used for speech processing systems, but can also be used for emotion classification systems (Shafran et al. (2003), Kwon et al. (2003)).

1.1.4 Classification algorithms

The final step consists in the development of classification algorithms which aim to recognize one emotional class among others or to classify emotional classes among themselves. The emotional classes used vary according to the targeted application or the type of studied emotional data.

Emotion classification systems are essentially based on supervised machine learning algorithms: Support Vector Machines (Devillers and Vidrascu, 2007), Gaussian Mixture Models (Schuller et al., 2004), Hidden Markov Models (Wagner, 2007), k nearest neighbors (Lee et al., 2002), etc. It is rather difficult to compare the efficiency of the various existing approaches, since no evaluation campaign has been carried out so far. Performances are besides not only dependent on the adopted machine learning algorithm but also on:

- the diversity of the tested data: contexts (speakers, situations, types of interaction), recording conditions;
- the emotional classes (number and type);
- the training and test conditions (speaker-dependent or not) (Schuller et al., 2003);
- the techniques for acoustic feature extraction, which are more or less dependent on prior knowledge of the linguistic content and of the speaker identity (normalization by speaker or by phone, analysis units based on linguistic content).

A first effort to connect existing systems has been carried out with CEICES (Combining Efforts for Improving automatic Classification of Emotional user States), launched in 2005 by the FAU Erlangen through the HUMAINE⁴ network of excellence (Batliner et al., 2006).

⁴ http://www5.informatik.uni-erlangen.de/Forschung/Projekte/HUMAINE/?language=en

1.2 Contributions

It emerges from the previous overview that the development of emotion recognition systems is a recent research field and that the integration of such systems in effective applications requires raising new scientific issues. The first system on laboratory emotional data was indeed carried out recently by Dellaert et al. (1996). Although some emotion recognition systems are now dealing with spontaneous and more complex data (Devillers et al., 2005b), this research field is only beginning to be studied with the perspective of industrial applications such as call centres (Lee et al., 1997) and human-robot interaction (Oudeyer, 2003). In this context, our approach contributes to an important challenge, since the surveillance application implies the consideration of a new type of emotion and context, namely fear-type emotions occurring during abnormal situations, and the integration of new constraints.

1.2.1 The application: audio-surveillance

Existing automatic surveillance systems are essentially based on video cues to detect abnormal situations: intrusion, abnormal crowd movement, etc. Such systems aim to provide an assistance to human operators. The parallel surveillance of multiple screens indeed increases the cognitive overload of the staff and raises the matter of vigilance.

However, audio event detection has only begun to be used in some specific surveillance applications such as medical surveillance (Vacher et al., 2004). Audio cues, such as gun shots or screams (Clavel et al., 2005) typically, may convey useful information about critical situations. Using several sensors increases the available information and strengthens the quality of the abnormal situation diagnoses. Besides, audio information is useful when the abnormal situation manifestations are poorly expressed by visual cues, such as gun-shot events or human shouts, or when these manifestations go out of shot of the cameras.

1.2.2 The processing of a specific emotional category: fear-type emotions occurring during abnormal situations

Studies dedicated to the recognition of emotion in speech commonly refer to a restricted number of emotions such as the 'Big Six' (see 1.1.2), especially when they are based on acted databases. Among the studied emotions, fear-type emotions in their extreme manifestations are not frequently studied in the research field of real-life affective computing. Studies prefer to take into account more moderate emotional manifestations which occur in everyday life and which are shaped by politeness habits and cultural behaviours. Indeed, a large part of applications is dedicated to improving the naturalness of the human-machine interaction for everyday tasks (dialog systems for banks and commercial services (Devillers and Vasilescu, 2003), artificial agents (Pelachaud, 2005), robots (Breazeal and Aryananda, 2002)). However, some applications, such as dialog systems for military applications (Varadarajan et al., 2006), (Fernandez and Picard, 2003) or emergency call centres (Vidrascu and Devillers, 2005), deal with strong fear-type emotions in specific contexts (see Section 1.1.1).

The emotions targeted by surveillance applications belong to the specific class of emotions emerging in abnormal situations. More precisely, fear-type emotions may be symptomatic of threat situations where the matter of survival is raised. Here, we are looking for fear-type emotions occurring in dynamic situations, during which the matter of survival is raised. In such situations some expected emotional manifestations correspond to primary manifestations of fear (Darwin, 1872): they may occur as a reaction to a threat. But the targeted emotional class also includes more complex fear-related emotional states (Cowie and Cornelius, 2003) ranging from worry to panic.

Fear manifestations indeed vary according to the imminence of the threat (potential, latent, immediate or past). For our surveillance application, we are interested in assisting the human operator by detecting not only the threat but also the threat emergence. There is therefore a strong interest in considering all the various emotional manifestations inside the fear class.

1.2.3 The application constraints

From a surveillance application point of view, the emotion recognition system has to:

- run on data with a high diversity in terms of number and type of speakers,
- cope with more or less noisy environments (e.g. bank, stadium, airport, subway, station),
- be speaker independent and cope with a high number of unknown speakers,
- be text-independent, i.e. not rely on a speech recognition tool, as a consequence of the need to deal with various qualities of the recorded signal in a surveillance application.

1.2.4 Approach and outline

In this paper, we tackle all the various steps involved in the development of an emotion recognition system:

- the development of a new emotional database in response to the application constraints: the challenge is to collect data which illustrate a large scope of threat contexts, emotional manifestations, speakers, and environments (Section 2),
- the definition and development of a task-dependent annotation strategy which integrates this diversity and the evolution of emotional manifestations according to the situation (Section 2),
- the extraction of relevant acoustic features for fear-type emotion characterization: the difficulty lies in finding speaker-independent and text-independent relevant features (Section 3),
- the development of an emotion recognition system based on machine-learning techniques: the system needs to be robust to the variability of the expected data and to the noise environment (Section 3),
- the performance evaluation in experimental conditions as close as possible to those of the effective targeted application (Section 4).

2 Collection and annotation of fear-type emotions in dynamic situations

2.1 Collection of audiovisual recordings illustrating abnormal situations

Abnormal situations are especially rare and unpredictable, and real-life surveillance data are often inaccessible in order to protect personal privacy. Given these difficulties, we chose to rely on a type of support hitherto unexploited by emotional studies, namely fiction. Our fiction corpus (the SAFE Corpus: Situation Analysis in a Fictional and Emotional corpus) consists of 400 audiovisual sequences in English extracted from a collection of 30 recent movies from various genres: thrillers, psychological dramas, horror movies, and movies which aim at reconstituting dramatic news items, historical events or natural disasters. The duration of the corpus totals 7 hours of recordings organized in sequences from 8 seconds to 5 minutes long. A sequence is a movie section illustrating one type of situation: kidnapping, physical aggression, flood, etc. The sequence duration depends on the way of illustration and segmentation of the targeted situation in the movie. A majority (71%) of the SAFE corpus depicts abnormal situations with fear-type emotional manifestations among other emotions, the remaining data consisting in normal situations to ensure the occurrence of a sufficient number of other emotional states or verbal interactions.

The fictional movie support has so far rarely been exploited for emotional computing studies⁵. On the one hand, fiction undoubtedly provides acted emotions and audio recording effects which cannot always reflect a true picture of the situation. Furthermore, the audio and video channels are often remixed afterward and are recorded under better conditions than in real surveillance data. On the other hand, we are here working on data very different from laboratory data, which are taken out of context with clean recording conditions, and which have been largely studied in the past. The fiction provides recordings of emotional manifestations in their environmental noise. It offers a large scope of believable emotion portrayals. Emotions are expressed by skilled actors in interpersonal interactions. The large context defined by the movie script favours the identification of actors with characters and tends to stir genuine emotions. Besides, the emotional material is quite relevant from the application point of view. Various threat situations, speakers, and recording conditions are indeed illustrated. This diversity is required for surveillance applications. But the two major contributions of such a corpus are:

- the dynamic aspect of the emotions: the corpus illustrates the emotion evolution according to the situation in interpersonal interactions;
- the diversity of emotional manifestations: the fiction depicts a large variety of emotional manifestations which could be relevant for a number of applications but which would be very difficult to collect in real life.

⁵ We found only one paper (Amir and Cohen, 2007) which exploits dialog extracted from an animated film to study emotional speech.

2.2 In situ description of the emotional content

We propose a task-dependent annotation strategy which aims both to define the emotional classes that will be considered by the system and to provide information to help understand system behaviours.

2.2.1 Annotation tools and strategy

The annotation scheme is defined via the XML formalism (eXtensible Markup Language) under ANVIL (Kipp, 2001) (Devillers et al., 2005a), which provides an appropriate interface for multimodal corpora annotation (see Figure 1).

The audio content description is carried out 'in situ', which means in the context of the sequence and with the help of the video support. It consists in the sequence description of both situational and emotional contents. The sequence is split into audio-based annotation units, the segments. These derive from the dialog and emotional structure of the interpersonal interactions. The segment corresponds to a speaker turn or a portion of a speaker turn with a homogeneous emotional content, that is without abrupt emotional change, taking into account the following emotional descriptors (categorical and dimensional). This 'in situ' description makes it possible to capture the evolution of the emotional manifestations occurring in a sequence and to study its correlation with the evolution of the situation.

2.2.2 Annotation tracks

Fig. 1. Annotation scheme under ANVIL.

The situation illustrated in the sequence is depicted by various contextual tracks:

- The speaker track provides the gender of the speaker and also its position in the interaction (aggressor, victim or others).
- The threat track gives information about the degree of imminence of the threat (no threat, potential, latent, immediate or past threats) and its intensity. Besides, a categorization of threat types is proposed by answering the following step by step questions: If there is a threat, is it known by the victim(s)? Do the victims know the origin of the threat? Is the aggressor present in the sequence? Is he/she a familiar of the victims?
- The speech track stores the verbal and non-verbal (shouts, breathing) content of speech according to the LDC⁶ transcription rules. The type of audio environment (music/noise) and the quality of speech are also detailed. The categories obtained via this annotation could be employed to test the robustness of the detection methods to environmental noise.

⁶ Linguistic Data Consortium


Categorical and dimensional descriptors are used to describe the emotional manifestations at the segment level. Categorical descriptors provide a task-dependent description of the emotional content with various levels of accuracy. Indeed, it is especially difficult to accurately delimit the emotional categories (in terms of perceived classes for the annotation strategy and of acoustic models for the detection system, see Section 1.1.2) when the data variability is high, as it is the case here. In order to limit the number of emotion classes, we have selected four major emotion classes: the global class fear, other negative emotions, neutral, and positive emotions. The global class fear corresponds to all fear-related emotional states and the neutral class corresponds to non-negative and non-positive emotional speech with a faint emotional activation, as defined in Devillers (2006)⁷. These broad emotional categories are specified by emotional subcategories which are chosen from a list of emotions occurring in abnormal situations. This list consists in both simple subcategories presented in Table 1 and mixed subcategories obtained by combining the simple subcategories (e.g. stress-anger).

Table 1
Emotional categories and subcategories.

Broad categories          Subcategories
fear                      stress, terror, anxiety, worry, anguish, panic, distress, mixed subcategories
other negative emotions   anger, sadness, disgust, suffering, deception, contempt, shame, despair, cruelty, mixed subcategories
neutral                   -
positive emotions         joy, relief, determination, pride, hope, gratitude, surprise, mixed subcategories

Dimensional descriptors are based on three abstract dimensions: evaluation, intensity and reactivity. They are quantified on discrete scales. The evaluation axis covers discrete values from wholly negative to wholly positive (-3, -2, -1, 0, +1, +2, +3). The intensity and reactivity axes provide four levels from 0 to 3. The intensity dimension is a variant of the activation dimension defined in psychological theories (Osgood et al., 1975) as the level of corporal excitation expressed by physiological reactions such as an increasing heartbeat or transpiration. But we prefer to use the intensity dimension as we estimate it more suitable for the description of the oral emotional manifestations. For intensity and evaluation, the level 0 corresponds to neutral. The reactivity value indicates whether the speaker seems to be subjected to the situation (passive, level 0) or to react to it (active, level 3); it has been adapted to the application context from the most frequently used third dimension, named the control dimension (Russell, 1997). Besides, this dimension is only used for emotional manifestations occurring during threats.

⁷ The concept of neutral emotion is ambiguous and needs to be clarified. The perception of neutral emotion is speaker-dependent and varies according to the emotional intelligence of the labellers (Clavel et al., 2006a). In this work, the neutral emotion corresponds to the cases where the judges could not perceive any emotion in the multimodal expression. Indeed, we are here focusing on the expressive aspect of the emotional phenomenon, that is one of the three aspects (cognitive, physiological, and expressive) currently accepted as composing the emotional phenomenon (Scherer, 1984). Harrigan et al. (2005) also specify this focus on the expressive aspect of emotion to define neutral attributes for their study.

Abstract dimensions allow the specification of the broad emotional categories by combining the different levels of the scaled abstract dimensions. The perceptual salience of those descriptors and of the annotation unit was evaluated at the beginning of our work, as a preliminary step in validating the data acquisition and annotation strategy, in Clavel et al. (2004).

2.2.3 Annotation task of labellers

The segmentation and the first annotation of the corpus were carried out by a native English labeller (Lab1). Two other French/English bilingual labellers (Lab2 and Lab3) independently annotated the emotional content of the pre-segmented sequences. It would be interesting to carry out further annotation exercises to strengthen the reliability but the annotation task is especially costly.

The contextual and video support of the sequence complicates the segment annotation and increases the annotation time. Indeed, the annotation of the emotional content in the pre-segmented sequences (7 hours) takes about 100 hours as the decision is taken by considering the context and the several channels (audio, video). But this support is crucial to strengthen the reliability of the annotations. The segmentation process is also very costly, since the complete segmentation and annotation task takes twice the time of the simple annotation of the pre-segmented sequences. Given the scale of this task we do not so far have a validation protocol for the segmentation step and for the other annotation tracks.

2.2.4 Evaluation of the reliability

When dealing with emotion computing, there are two main aspects to handle: the diversity of emotional manifestations and the subjectivity of emotion perception. We attempt to deal with the first aspect by considering various levels of accuracy in our annotation strategy. The second aspect is here unraveled by a comparative in-depth analysis of the annotations obtained by the three labellers (Lab1, Lab2 and Lab3). The inter-labeller agreement is evaluated using traditional kappa statistics (Carletta, 1996) (Bakeman and Gottman, 1997) for the four emotional categories, and using Cronbach's alpha measure (Cronbach, 1951) for the three dimensions.

The kappa coefficient \(\kappa\) corresponds here to the agreement ratio taking into account the proportion of times that raters would agree by chance alone:

\[ \kappa = \frac{\bar{p}_o - \bar{p}_e}{1 - \bar{p}_e} \]

where \(\bar{p}_o\) is the observed agreement proportion and \(\bar{p}_e\) the chance term. These two proportions are computed as follows:

\[ \bar{p}_o = \frac{1}{N_{seg}} \sum_{i=1}^{N_{seg}} p_{seg_i} \qquad \bar{p}_e = \sum_{k=1}^{K} p_{cl_k}^2 \]

\(p_{cl_k}\) corresponds to the overall proportion of segments labelled with the class k, and the proportion \(p_{seg_i}\) corresponds to the measure of agreement on each segment i between the \(N_{ann}\) labellers. The kappa is at 0 when the agreement level corresponds to chance, and at 1 when the agreement is total.
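For concreteness, the following Python sketch computes this agreement ratio for a set of labelled segments. The per-segment agreement \(p_{seg_i}\) is taken here as the proportion of agreeing labeller pairs (a Fleiss-style convention); the paper does not spell this term out, so that choice, like the function and variable names, is an assumption.

```python
import numpy as np

def kappa_score(labels):
    """labels: (n_segments, n_labellers) array of integer category ids.
    Returns the chance-corrected agreement ratio described above."""
    labels = np.asarray(labels)
    n_seg, n_ann = labels.shape
    cats = np.unique(labels)
    # counts[i, k] = number of labellers assigning category k to segment i
    counts = np.stack([(labels == c).sum(axis=1) for c in cats], axis=1)
    # assumed Fleiss-style per-segment agreement: proportion of agreeing pairs
    p_seg = (counts * (counts - 1)).sum(axis=1) / (n_ann * (n_ann - 1))
    p_o = p_seg.mean()
    p_cl = counts.sum(axis=0) / (n_seg * n_ann)   # overall class proportions
    p_e = (p_cl ** 2).sum()
    return (p_o - p_e) / (1 - p_e)

# e.g. kappa_score([[0, 0, 1], [2, 2, 2], [1, 0, 1]]) for three labellers
```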

Cronbach's alpha is another measure of inter-labeller reliability, more suitable than kappa for labels on a numerical scale. It is computed by the following formula:

\[ \alpha = \frac{N_{ann}\,\bar{r}}{1 + (N_{ann} - 1)\,\bar{r}} \]

where \(\bar{r}\) is the average inter-correlation between the labellers. The higher the score, the more reliable the generated scale is. The widely-accepted social science cut-off is that alpha values at .70 or higher correspond to an acceptable reliability coefficient, but lower thresholds are sometimes used in the literature (Nunnaly, 1978).
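A minimal sketch of this computation, assuming \(\bar{r}\) is the mean pairwise Pearson correlation between the labellers' numerical ratings (the names below are illustrative, not the authors'):

```python
import numpy as np
from itertools import combinations

def cronbach_alpha(ratings):
    """ratings: (n_segments, n_labellers) numerical scores on one dimension."""
    ratings = np.asarray(ratings, dtype=float)
    n_ann = ratings.shape[1]
    # average pairwise correlation between labellers
    r_bar = np.mean([np.corrcoef(ratings[:, i], ratings[:, j])[0, 1]
                     for i, j in combinations(range(n_ann), 2)])
    return n_ann * r_bar / (1 + (n_ann - 1) * r_bar)
```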

The Cronbach's alpha and the kappa statistics computed on the SAFE Corpus between the three labellers' annotations are presented in Table 2.

Table 2
Kappa score and Cronbach's alpha coefficient computed between the three labellers on the 5275 segments of the SAFE corpus

                                 Kappa   Cronbach
Categories (four categories)     0.49    .
  Lab1 vs. Lab2                  0.47    .
  Lab1 vs. Lab3                  0.54    .
  Lab2 vs. Lab3                  0.48    .
Intensity (four levels)          0.26    0.77
Evaluation (seven levels)        0.32    0.86
Reactivity (four levels)         0.14    0.55


The kappa score obtained for the agreement level of the four emotional categories between the three labellers is 0.49, which is an acceptable level of agreement for subjective phenomena such as emotions (Landis and Koch, 1977). Indeed, the use of global emotional categories allows us to obtain an acceptable level of agreement for the development of an automatic emotion recognition system. Moreover, this choice has been adopted by other studies: Douglas-Cowie et al. (2003), Shafran et al. (2003), Devillers et al. (2005b).

On the other hand, the kappas obtained for the three dimensions are indeed much lower than for the global categories. The kappa used here is the same as the one used for the categories and is dedicated to measuring the level of strict agreement between the dimensional levels. The best kappa value is at 0.32 and is obtained for the evaluation axis from which the categories are derived. It shows that the level of strict agreement is poor and not sufficient to use the dimensions as distinct classes for the system. However the labellers' annotations according to the dimensions come out as correlated, especially for intensity and evaluation, as illustrated by the high Cronbach's alpha values in Table 2. Each labeller seems to use his own reference scale on the dimension axis. However, this dimensional annotation provides interesting information to analyse the discrepancies between the labellers' annotations, as done in Clavel et al. (2006a).

For the system presented in this work, we make use of the annotation in global categories. For each category, we keep the data annotated as this category by the two labellers who have shown the highest disagreement on the entire corpus (the couple of labellers who has the lowest kappa). Segments for which these two labellers disagree are not considered for the system. This choice corresponds to a trade-off between the quantity and the reliability of the data considered for the training. Indeed, we did not choose to consider the three annotations for the system because the quantity of data where the three annotations converge is insufficient to build Gaussian mixture models. The intersection of two annotations allows us to obtain more data, and the consideration of the two most divergent labellers (with the lowest kappa) ensures that on the data where they agree, someone else would more probably also agree⁸.

⁸ Another solution to obtain a trade-off between the quantity and the reliability is to consider the segments where at least two of the evaluators agree. This configuration could be tested in a future work.


2.3 SAFE Corpus Content

2.3.1 Global content

Table 3 describes the SAFE corpus content in terms of sequences and segments. The segment duration depends on dialog interactions and on emotional variations in a speaker's turn. It follows that the segment duration is highly variable. The 5275 segments of the SAFE corpus represent 85% of the total duration of the corpus, and correspond to 6 hours of speech. The remaining 15% correspond to portions of recordings without speech, that is with silence or noise only.

Table 3
SAFE Corpus Content

Unit                                   minimum    maximum   mean
Segments: number per sequence          1          53        13
Segments: duration                     40 msec    80 sec    4 sec
Segments: total number and duration    5275       -         6 h
Sequences: duration                    8 sec      5 min     1 min
Sequences: total number and duration   400        -         7 h

2.3.2 Sound environments

In most movies, recording conditions tend to mirror reality: speaker movements implying a natural variation in voice sound level are thus respected. However, the principal speaker will be audible more often in a fictional context. We can hypothesize that this is not systematically the case in real recording conditions. Overall, the sound environments are strongly dependent on the movie and may vary inside a movie and also inside a sequence. They depend on the type of situations depicted and their evolution.

The first diagram (Figure 2) indicates the segment distribution according to their sound environment (N = noise only, M = music only, N & M = noise and music, clean = with neither noise nor music) and the second diagram (Figure 3) according to speech quality (Q0 = bad quality to Q3 = good quality). 78% of the segments present an acceptable level of speech quality (Q2 or Q3) even though clean segments are not the majority part (21%) of the corpus. It shows that, despite the strong presence of noise or music, the speech quality is quite good. The corpus therefore provides a high number of exploitable segments. As presented further (see corpus SAFE_1 in Section 4.1), we select, for the system development and evaluation, segments with an acceptable level of speech quality.

Fig. 2. Segment division according to the sound environment.

Fig. 3. Segment division according to speech quality.

2.3.3 Speakers

The surveillance application needs to cope with a high number of unknown speakers. With this in mind, the SAFE Corpus contains about 400 different speakers. The distribution of speech duration according to gender is as follows: 47% male speakers, 31% female speakers, 2% children. The remaining 20% of spoken duration consists in overlaps between speakers, including oral manifestations of the crowd (2%). We are aware of the need to process all the various types of spoken manifestations, including overlaps, for the ultimate application. However, the current work, given its exploratory character, does not take into account the 20% of the spoken duration consisting in overlaps (see corpus SAFE_1 in Section 4.1), as the acoustic modelling of fear is much harder in this case (e.g. concurrent sources for pitch estimation).

2.3.4 Emotions

In this paper we emphasize the main features characterizing the emotional content of the SAFE corpus, that is the presence of extreme fear as illustrated by the abstract dimension intensity, and the relationship between the emotion label and the context (threat). The emotional content is presented by considering the percentage of attributions for each label by the three labellers, so that the three annotations are taken into account⁹. The attribution percentage of the four emotional categories is thus the following: 32% for fear, 31% for other negative emotions, 29% for neutral, and 8% for positive emotions.

⁹ We do not use majority voting because there are more possible annotation choices than labellers. So there are segments where the three labellers may have three distinct annotations and which could therefore not be taken into account for the corpus description.


The attribution percentages of the various levels of the three dimensions are presented in Table 4. The reactivity is only evaluated on segments occurring during abnormal situations (71% of the segments) (see Section 2.2.2). In this context, the emotional manifestations are mostly associated with a low reactivity of the speaker to the threat. Very few segments are evaluated as positive on the evaluation axis (8% of positive emotions) and almost none of them is evaluated as level 3. Another specificity of our corpus consists in the presence of intense emotional manifestations: 50% of the segments are evaluated as level 2 or 3 on the intensity axis.

Table 4
SAFE Corpus Content: attribution percentage of emotional dimensions

dimension \ level   -3    -2    -1    0     1     2     3
intensity           .     .     .     29%   21%   30%   20%
evaluation          9%    34%   20%   29%   5%    3%    0%
reactivity          .     .     .     6%    36%   19%   10%

Fear-type emotions are perceived as more intense than other emotions: 85% of fear segments are labelled as level 2 or 3 on the intensity scale while the major part of other emotions are labelled level 1. Besides, the presence of cries (139) seems to be associated with the presence of extreme fear.

2.3.5 Emotional manifestations and threat

The correlation of categorical descriptions of emotions with the threat provides a rich material to analyse the various emotional reactions to a situation. Figure 4 shows the distribution of each emotional category (fear, other negative emotions, neutral, positive emotions) as a function of the threat imminence. Fear is the major emotion during latent and immediate threats. By contrast fear is not very much in evidence during normal situations. Normal situations include a major part of neutral segments and also negative and positive emotions. Latent and past threats seem to cause a large part of other negative emotions than fear, which suggests that emotional reactions against a threat may be various.

Fig. 4. Segment distribution according to emotional categories for each degree of threat imminence.

Table 5 illustrates the segment distribution (%) according to each fear subcategory for each degree of imminence. The subcategory anxiety has almost never been selected by the labellers. Therefore we choose to merge the two subcategories worry and anxiety. The three last columns correspond to the most frequent mixed categories (see Section 2.2.2). Taken as a whole, the subcategories which are the most represented are thus anxiety-worry, panic and stress. During immediate threats, the major emotional subcategory is panic (31.8%). By contrast, normal situations include a major part of anxiety-worry but neither distress nor fear mixed with suffering, and very little terror and stress. These subcategories are almost exclusively present during immediate or latent threats. Otherwise, it is worth noting that fear frequently occurs mixed with other emotions such as anger (mostly), surprise, sadness and suffering.

During past threats, the subcategories anxiety-worry and panic are in the majority. The past threats which are depicted in the SAFE corpus are indeed threats occurring at the end of a sequence, and they occur just after immediate threats. This explains the presence of emotional manifestations similar to those occurring during immediate threats, yet with a higher proportion of anxiety caused by the threat.

Table 5
Segment distribution (%) according to each fear subcategory for each imminence degree (Pot. = potential, Lat. = latent, Imm. = immediate, Norm. = normal situation, anx. = anxiety, wor. = worry, ang. = anguish, distr. = distress, pan. = panic, terr. = terror)

threat \ emotion   anx.-wor.   stress   ang.   distr.   pan.   terr.   fear-anger   fear-suffering   fear-surprise
Norm.              64.1        4.3      5.4    0.0      12.0   7.6     3.3          0.0              2.2
Pot.               61.2        4.1      6.1    0.0      10.2   6.1     10.2         0.0              2.0
Lat.               50.9        9.7      8.3    5.5      17.3   4.2     3.1          0.0              0.0
Imm.               25.2        13.7     6.7    6.0      31.8   8.1     3.5          2.7              0.8
Past               42.0        4.0      4.0    8.0      22.0   6.0     6.0          0.0              0.0


3 Carrying out an audio-based fear-type emotion recognition system

The fear-type emotion detection system focuses on differentiating the fear class from the neutral class. The audio stream has been manually pre-segmented into decision frames which correspond to the segments as defined in Section 2.2. The system is based on acoustic cues and focuses as a first step on classifying the predefined emotional segments.

3.1 Acoustic features extraction, normalization and selection

The emotional content is usually described in terms of global units such as the word, the syllable or the 'chunk' (Batliner et al., 2004), (Vidrascu and Devillers, 2005) by computing statistics. Alternatively, some studies use descriptions at the frame analysis level (Schuller et al., 2003). Here, we propose a new description approach which integrates various units of description that are at both the frame analysis level and the trajectory level. A trajectory gathers successive frames with the same voicing condition (see Figure 5). These two temporal description levels have the advantage of being automatically extracted.

Emotions in abnormal situations are accompanied by a strong body activity, such as running or tensing, which modifies the speech signal, in particular by increasing the proportion of unvoiced speech. Therefore some segments in the corpus do not contain a sufficient number of voiced frames. The information conveyed by the voiced content of the segment is therefore insufficient to deduce whether it is a fear segment or not. Such segments occur less frequently in everyday speech than in strong emotional speech. Here, 16% of the collected fear segments against 3% of the neutral segments contain less than 10% of voiced frames. The voiced model is not able to exploit those segments. Given the frequency of unvoiced portions and in order to handle this deficiency of the voiced model, a model of the emotional unvoiced content needs to be built. The studies which take the unvoiced portions into account consist of global temporal level descriptions (Schuller et al. (2004)), by computing for example the proportion of unvoiced portions in a 'chunk'. Our approach is original because it separately considers:

- the voiced content, traditionally analysed, which corresponds to vowels or voiced consonants such as "b" or "d", and
- the unvoiced content, which is a generic term for both articulatory non-voiced portions of the speech (for example obstruents) and portions of non-modal speech produced without voicing (for example creaky or breathy voice, murmur).

The speech flow of each segment is divided into successive frames of 40 ms with a 30 ms overlap. The voicing strength of the frame is evaluated under Praat (Boersma and Weenink, 2005) by comparing the autocorrelation function to a threshold in order to divide the speech flow into voiced and unvoiced portions. Features are first computed frame by frame. In order to model the temporal evolution of the features, their derivatives and statistics (min, max, range, mean, standard deviation, kurtosis, skewness) are then computed at the trajectory level, as illustrated in Figure 5. Some features (the jitter, the shimmer and the unvoiced proportion) are computed at the segment level.
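As an illustration of this pipeline, the sketch below frames a signal, makes a crude voicing decision and computes trajectory statistics. The autocorrelation-peak heuristic and the 0.45 threshold merely stand in for Praat's voicing decision, and all function and variable names are ours, not the authors'.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def frame_signal(x, sr, frame_ms=40, hop_ms=10):
    """Split a mono signal into 40 ms frames with a 30 ms overlap (10 ms hop)."""
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def is_voiced(frame, threshold=0.45):
    """Toy voicing decision: peak of the normalized autocorrelation (lag > 0)
    compared to a threshold, standing in for Praat's voicing decision."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return bool(ac[0] > 0 and (ac[1:] / ac[0]).max() > threshold)

def trajectories(voiced_flags):
    """Group successive frames sharing the same voicing condition."""
    trajs, start = [], 0
    for i in range(1, len(voiced_flags) + 1):
        if i == len(voiced_flags) or voiced_flags[i] != voiced_flags[start]:
            trajs.append((start, i, voiced_flags[start]))
            start = i
    return trajs

def trajectory_stats(values):
    """Statistics listed in the paper, computed over one trajectory."""
    v = np.asarray(values, dtype=float)
    return {"min": v.min(), "max": v.max(), "range": v.ptp(), "mean": v.mean(),
            "stdev": v.std(), "kurtosis": kurtosis(v), "skewness": skew(v)}
```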

Fig. 5. Feature extraction method which separately considers the voiced and unvoiced content and integrates various temporal levels of description.

The computed features allow us to characterize three types of acoustic content, and can be sorted into three feature groups:

- the prosodic group, which includes features relating to pitch (F0), intensity contours, and the duration of the voiced trajectory, which are extracted with Praat. Pitch is computed using a robust algorithm for periodicity detection based on signal autocorrelation on each frame. Pitch and duration of the voiced trajectory are of course computed only for the voiced content;
- the voice quality group, which includes the jitter (pitch modulation), the shimmer (amplitude modulation), the unvoiced rate (corresponding to the proportion of unvoiced frames in a given segment) and the harmonic to noise ratio (ratio of the periodic part of the signal to the non-harmonic part, computed here using the algorithm developed in Yegnanarayana et al. (1998)). The HNR allows us to characterize the noise contribution of speech during the vocal effort. The perceived noise is due to irregular oscillations of the vocal cords and to additive noise. The algorithm relies on the substitution degree of harmonics by noise;
- the spectral and cepstral features group, consisting of the first two formants and their bandwidths, the Mel Frequency Cepstral Coefficients (MFCC), the Bark band energy (BBE) and the spectral centroid (Cs).

A total of 534 features are thus calculated for the voiced content and 518 for the unvoiced content.

Acoustic features do not vary exclusively with the emotional content. They are also dependent on the speaker and on the phonetic content. This is typically the case for pitch-related features and for the first two formants. To handle this difficulty most studies use a speaker normalization for pitch-related features and a phoneme normalization for the first two formants. However, speaker normalization may be judged inadequate for the surveillance application, since the system needs to be speaker independent and has to cope with a high number of unknown speakers. The SAFE corpus provides about 400 different speakers for this purpose. The phoneme normalization is also not performed here, as it relies on the use of a speech recognition tool in order to align the transcription and the speech signal. The recording conditions of the speech signal in a surveillance application require the development of a text-independent emotion detection system which does not rely on a speech recognition tool. As a preliminary solution, we choose to use a min-max normalization which consists in the scaling of the features between -1 and 1. However, in future work, we plan to test more complex normalization techniques such as those used for speaker recognition (e.g. feature warping), which might improve robustness to the mismatch of sound recordings and to noise.
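One possible reading of this min-max normalization is sketched below. Whether the ranges are estimated on the training data only, and whether out-of-range test values are clipped, is not stated in the paper, so both choices here are assumptions.

```python
import numpy as np

def fit_minmax(X_train):
    """Learn per-feature min/max on the training data."""
    X_train = np.asarray(X_train, dtype=float)
    x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant features
    return x_min, span

def apply_minmax(X, params):
    """Scale features to [-1, 1]; unseen test values are clipped to the range."""
    x_min, span = params
    return np.clip(2.0 * (np.asarray(X, dtype=float) - x_min) / span - 1.0, -1.0, 1.0)
```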

The feature space is reduced by selecting the 40 most relevant features for the two-class discrimination by using the Fisher selection algorithm (Duda and Hart, 1973) in two steps. A first selection is carried out on each feature group (prosodic, voice quality, and spectral) separately. One fifth of the features is selected for each group, providing a first feature set including about 100 features. The final feature set is then selected by applying the Fisher algorithm to the first feature set a second time. This method avoids having strong redundancies between the selected features by forcing the selection algorithm to select features from each group. The salience of the features is evaluated separately for the voiced and unvoiced contents. The Fisher selection algorithm relies on the computation of the Fisher Discriminant Ratio (FDR) of each feature i:

\[ FDR_i = \frac{(\mu_{i,\mathrm{neutral}} - \mu_{i,\mathrm{fear}})^2}{\sigma^2_{i,\mathrm{neutral}} + \sigma^2_{i,\mathrm{fear}}} \]

where \(\mu_{i,\mathrm{neutral}}\) and \(\mu_{i,\mathrm{fear}}\) are the class mean values of feature i for the neutral class and the fear class respectively, and \(\sigma^2_{i,\mathrm{neutral}}\) and \(\sigma^2_{i,\mathrm{fear}}\) the corresponding variance values.
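The two-step selection could look like the following sketch, where fdr_scores implements the ratio above and the per-group one-fifth quota and the final re-ranking follow the description in the text. The data layout (a dict of per-group feature matrices) and all names are assumptions made for illustration.

```python
import numpy as np

def fdr_scores(X, y):
    """Fisher Discriminant Ratio of each feature for the fear / neutral problem.
    X: (n_segments, n_features) feature matrix, y: boolean array, True = fear."""
    fear, neutral = X[y], X[~y]
    num = (neutral.mean(axis=0) - fear.mean(axis=0)) ** 2
    den = neutral.var(axis=0) + fear.var(axis=0)
    return num / np.maximum(den, 1e-12)

def two_step_fisher_selection(groups, y, n_final=40):
    """Keep one fifth of each feature group first, then re-rank the surviving
    features together and keep the n_final best.
    groups: dict mapping group name -> (X_group, original column indices)."""
    kept_cols, kept_X = [], []
    for name, (Xg, cols) in groups.items():
        k = max(1, Xg.shape[1] // 5)                     # one fifth per group
        best = np.argsort(fdr_scores(Xg, y))[::-1][:k]
        kept_cols.extend(np.asarray(cols)[best])
        kept_X.append(Xg[:, best])
    X1 = np.hstack(kept_X)
    final = np.argsort(fdr_scores(X1, y))[::-1][:n_final]
    return [kept_cols[i] for i in final]
```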


3.2 Machine learning and decision process

The classification system merges two classifiers, the voiced classifier and the unvoiced classifier, which consider respectively the voiced portions and the unvoiced portions of the segment (Clavel et al., 2006b).

The classification is performed using the Gaussian Mixture Model (GMM) based approach, which has been thoroughly benchmarked in the speech community. For each class \(C_q\) (Fear, Neutral) and for each classifier (Voiced, Unvoiced) a probability density is computed; it consists of a weighted linear combination of 8 Gaussian components \(p_{m,q}\):

\[ p(x|C_q) = \sum_{m=1}^{8} w_{m,q}\, p_{m,q}(x) \]

where \(w_{m,q}\) are the weighting factors. Other model orders have been tested but led to worse results. The covariance matrix is diagonal, which means that the models are trained by considering independently the data corresponding to each feature.

The parameters of the models (the weighting factors, the mean vector and the covariance matrix of each Gaussian component) are estimated using the traditional Expectation-Maximization algorithm (Dempster et al., 1977) with 10 iterations.
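A sketch of such class models using scikit-learn's GaussianMixture is given below; the paper does not name an implementation. With equal class priors, the mean per-frame log-likelihood returned here plays the role of the a posteriori score used in the decision rule that follows, which is an assumption of this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_models(frames_by_class, n_components=8, n_iter=10, seed=0):
    """One diagonal-covariance GMM per class (fear, neutral), trained by EM.
    frames_by_class: dict mapping class name -> (n_frames, n_features) array."""
    models = {}
    for name, X in frames_by_class.items():
        models[name] = GaussianMixture(n_components=n_components,
                                       covariance_type="diag",  # features treated independently
                                       max_iter=n_iter,
                                       random_state=seed).fit(X)
    return models

def mean_log_likelihood(model, frames):
    """Per-segment score: mean per-frame log-likelihood under the class model."""
    return float(np.mean(model.score_samples(frames)))
```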

Classification is performed using the Maximum A Posteriori decision rule. For the voiced classifier, the A Posteriori Score (APS) of a segment associated with each class corresponds to the mean a posteriori log-probability over the voiced analysis frames, giving for example for the voiced content:

\[ APS_{voiced}(C_q) = \frac{1}{N_{f_{voiced}}} \sum_{n=1}^{N_{f_{voiced}}} \log p(x_n|C_q) \]

Depending on the proportion \(r\) of voiced frames (\(r \in [0;1]\)) in the segment, a weight \(w\) is assigned to the classifiers in order to obtain the final APS of the segment:

\[ APS_{final} = (1 - w)\, APS_{voiced} + w\, APS_{unvoiced} \]

The weight depends on the voiced rate \(r \in [0;1]\) of the segment according to the following function: \(w = 1 - r^{\alpha}\). \(\alpha\) varies from 0 (the results of the unvoiced classifier are considered only when the segment does not contain any voiced frame) to \(+\infty\) (only the results of the unvoiced classifier are considered). The rate at which the weight decreases as a function of the voiced rate is adjusted with \(\alpha\).


The segment is then classified according to the class (fear or neutral) that has the maximum a posteriori score:

\[ q_0 = \arg\max_{q} APS_{final}(C_q) \]
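Putting the pieces together, a hedged sketch of the decision-level fusion (reusing mean_log_likelihood from the sketch above) might be:

```python
def classify_segment(voiced_frames, unvoiced_frames,
                     models_voiced, models_unvoiced, alpha=1.0):
    """Decision-level fusion of the voiced and unvoiced classifiers.
    models_voiced / models_unvoiced: per-class GMMs as in train_class_models."""
    n_v, n_u = len(voiced_frames), len(unvoiced_frames)
    r = n_v / max(n_v + n_u, 1)      # voiced rate of the segment (assumed frame-count ratio)
    w = 1.0 - r ** alpha             # weight given to the unvoiced classifier
    scores = {}
    for cls in ("fear", "neutral"):
        aps_v = mean_log_likelihood(models_voiced[cls], voiced_frames) if n_v else 0.0
        aps_u = mean_log_likelihood(models_unvoiced[cls], unvoiced_frames) if n_u else 0.0
        scores[cls] = (1.0 - w) * aps_v + w * aps_u
    return max(scores, key=scores.get), scores
```

Here the voiced rate r is approximated by the ratio of voiced to total analysis frames, which the paper implies but does not state explicitly.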

4 Experimental validation and results

4.1 Experimental database and protocol

The SAFE corpus represents the variability of spoken emotional manifestations in abnormal situations at several levels: in terms of speakers, sound environments, etc. In order to restrict this variability given the exploratory character of this work, we focused here on the most prototypical emotional distinction, i.e. the fear vs. neutral discrimination. The following experiments and analysis are thus performed on a subcorpus containing only good quality segments labelled fear and neutral. The quality of the speech in the segments concerns the speech audibility and has been evaluated by the labellers (see Section 2.3). The remaining segments include various environment types (noise, music). Segments with overlaps between speakers have been discarded (see Section 2.3.3). Only segments where the two human labellers who have obtained the lowest kappa value agree are considered (see Section 2.2.4), i.e. a total of 994 segments (38% of fear segments and 62% of neutral segments). Table 6 shows the quantity of data corresponding to each class in terms of segments, trajectories and analysis frames. This subcorpus will be named SAFE_1.

Table 6
Experimental database SAFE_1 (seg. = segments, traj. = trajectories)

Classes   number of seg.   number of traj.   number of frames   duration
Fear      381              2891              113385             19 min
Neutral   613              5417              181615             30 min
Total     994              8308              295000             49 min

The test protocol follows the Leave One Movie Out protocol: the data is divided into 30 subsets, each subset containing all the segments of one movie. 30 trainings are performed, each time leaving out one of the subsets from training, and the omitted subset is then used for the test. This protocol ensures that the speakers used for the test are not found in the training database¹⁰.

¹⁰ This is actually almost the case: three speakers over the 400 speakers can be found in two films.
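The protocol can be expressed as a simple generator over per-segment movie identifiers; segment_movie_labels below is a hypothetical array giving, for each segment, the movie it comes from.

```python
import numpy as np

def leave_one_movie_out(movie_ids):
    """Yield (movie, train_idx, test_idx), one fold per movie, so that the
    speakers of the held-out movie never appear in the training data."""
    movie_ids = np.asarray(movie_ids)
    for movie in np.unique(movie_ids):
        test = np.flatnonzero(movie_ids == movie)
        train = np.flatnonzero(movie_ids != movie)
        yield movie, train, test

# e.g.: for movie, train, test in leave_one_movie_out(segment_movie_labels): ...
```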


4.2 Global system behaviour

4.2.1 Sele ted features

It omes out from the feature sele tion step that pit h-related features are the most useful for the fear vs. neutralvoi ed lassier. Withregard to voi e qualityfeatures,boththejitterandtheshimmerhavebeensele ted.The spe -tral entroid is alsothe most relevant spe tral feature for the voi ed ontent. As for the unvoi ed ontent, spe tral features and the Bark Band Energy in parti ular ome out asthe most useful.

Each classifier considers the features selected as the most relevant for the two-class discrimination problem. Table 7 and Table 8 show the 40 selected features sorted by group for each content, voiced or unvoiced. For the voiced content, the prosodic features are all selected after the second overall Fisher selection, which means that this feature group, especially the pitch-related features, seems to be the most relevant for the characterization of fear-type emotions. Voice quality features also seem to be relevant: both jitter and shimmer have been selected. However, the harmonic to noise ratio (HNR) has not been selected. This may be explained by the presence of various environmental noises in our data, which makes the HNR estimation more difficult. We should also keep in mind that the presence of music could bias the feature selection, as could the environmental noise. However, the segments selected for the experiments contain background music at a rather high signal (speech) to noise (music) ratio, so that this influence should not be detrimental.

The spectral and cepstral features, which correspond to lower level features, are preferred over the HNR. This feature group, initially the largest, was also the most represented in the final feature set. The most relevant spectral feature is the spectral centroid, and cepstral features seem to be more relevant than features describing directly the spectral energy. Formants are also largely represented in the final feature set.

For the unvoiced content, the Bark band energy-related features seem to be more relevant than cepstral features. The HNR has also been selected.

Overall, the selected features correspond to statistics computed at the trajectory level, which seems to be suitable for emotional content characterization.
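The two-stage selection mentioned above relies on a Fisher criterion (described earlier in the paper); the sketch below only illustrates the kind of per-feature ranking such a criterion produces and is not the paper's exact procedure.

```python
import numpy as np

def fisher_ratio(x_fear, x_neutral):
    """Two-class Fisher discriminant ratio for a single feature."""
    m_f, m_n = np.mean(x_fear), np.mean(x_neutral)
    v_f, v_n = np.var(x_fear), np.var(x_neutral)
    return (m_f - m_n) ** 2 / (v_f + v_n + 1e-12)   # small constant avoids division by zero

def rank_features(X_fear, X_neutral):
    """Return feature indices sorted by decreasing Fisher ratio.
    X_*: (n_segments, n_features) matrices of trajectory-level statistics."""
    ratios = np.array([fisher_ratio(X_fear[:, j], X_neutral[:, j])
                       for j in range(X_fear.shape[1])])
    return np.argsort(ratios)[::-1]
```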

4.2.2 Voiced classifier vs. unvoiced classifier

Classification performance is evaluated by the equal error rate (EER). The EER corresponds to the error rate obtained when the decision threshold of the GMM classifier is set such that the recall is approximately equal to the precision.
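A sketch of how such an operating point can be located, assuming a segment-level score (e.g. the difference between the final APS of the fear and neutral models) and the reference labels are available; sweeping the observed scores as candidate thresholds is an illustrative choice, not necessarily the paper's exact procedure.

```python
import numpy as np

def eer_recall_equals_precision(scores, labels):
    """Error rate at the threshold where recall and precision for 'fear' are closest.
    scores: higher means more fear-like; labels: 1 for fear, 0 for neutral."""
    best_gap, best_err = np.inf, None
    for thr in np.unique(scores):
        pred = (scores >= thr).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        if tp + fp == 0 or tp + fn == 0:
            continue
        recall, precision = tp / (tp + fn), tp / (tp + fp)
        if abs(recall - precision) < best_gap:
            best_gap = abs(recall - precision)
            best_err = float(np.mean(pred != labels))
    return best_err
```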



Table 7

List of the 40 selected features for the voiced content of SAFE_1 (Table 6). Nini1 = number of extracted features, Nini2 = number of features which are submitted to the second selection ($N_{ini2} = \lceil N_{ini1}/5 \rceil$), Nfinal = number of selected features at the end of the two successive selections, stdev = standard deviation, kurt = kurtosis, skew = skewness, d = derivative.

Group         | Nini2/Nini1 | Selected features | Nfinal/Nini2
Prosodic      | 7/33        | meanF0, minF0, F0, maxF0, stdevdF0, rangedF0, rangeF0 | 7/7
Voice quality | 8/37        | Jitter, Shimmer | 2/8
Spectral      | 93/464      | meanCs, minMFCC1, meanMFCC4, maxF1, minMFCC4, mindF1, mindF2, meanMFCC1, rangedF1, rangedF2, rangeF1, rangeF2, MFCC4, MFCC1, stdevF2, maxdF1, maxdF2, maxMFCC4, maxCs, minMFCC3, skewBBE3, meanBBE3, maxF2, stdevdMFCC11, stdevdF2, kurtdF1, minF2, kurtF1, rangeMFCC1, stdevdMFCC6, minMFCC6 | 31/93

Table 8

List of the 40 selected features for the unvoiced content of SAFE_1 (Table 6).

Group         | Nini2/Nini1 | Selected features | Nfinal/Nini2
Prosodic      | 4/16        | rangeInt | 1/4
Voice quality | 8/36        | tauxNonVoise, kurtdPAP | 2/8
Spectral      | 93/464      | rangeBBE6, rangeBBE7, rangeBBE10, stdevBBE6, stdevBBE7, rangeBBE8, rangeBBE5, stdevBBE10, rangeBBE11, rangeBBE9, stdevBBE8, rangeBBE12, maxMFCC3, stdevBBE5, stdevBBE9, rangeBBE4, rangeMFCC10, rangeMFCC12, stdevBBE11, minBBE10, rangeMFCC8, minBBE7, minBBE6, rangeBBE3, rangeMFCC6, minBBE8, rangedBBE6, minMFCC11, stdevBBE12, stdevBBE4, minBBE9, meanMFCC3, rangeMFCC11, rangedBBE5, maxBw1, rangedBBE4, maxBw2 | 37/93




Figure 6 shows the EER for fear vs. neutral classification for various values of $\alpha$. The voiced classifier is more efficient than the unvoiced one. The EER reaches 40% when the unvoiced classifier is used alone ($\alpha = \infty$). This worst case is equivalent to never considering the voiced content. However, the EER is 32% when the voiced classifier is used in priority (the unvoiced classifier is used only when the segments are totally unvoiced, $\alpha = 0$). Best results ($EER = 29\%$) are obtained when the unvoiced classifier is considered with a weight decreasing quickly as the voiced rate increases ($\alpha = 0.1$).
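A simple numerical illustration of why $\alpha = 0.1$ makes the contribution of the unvoiced classifier vanish quickly: tabulating $w = 1 - r^{\alpha}$ for a few voiced rates shows that the voiced classifier dominates as soon as a segment contains some voiced frames.

```python
for alpha in (0.1, 1.0, 10.0):
    weights = [round(1 - r ** alpha, 2) for r in (0.0, 0.1, 0.5, 0.9)]
    print(f"alpha = {alpha}: w for r in (0, 0.1, 0.5, 0.9) -> {weights}")
# alpha = 0.1 gives [1.0, 0.21, 0.07, 0.01]: the unvoiced classifier is quickly downweighted
```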

Fig. 6. EER according to the weight ($w = 1 - r^{\alpha}$) of the unvoiced classifier against the voiced classifier, obtained on SAFE_1 (Table 6) (confidence interval at 95%: radius 3%)

The confusion matrix resulting from the fear vs. neutral classifier with the $\alpha$ parameter set at $\alpha = 0.1$ is presented in Table 9. It illustrates the confusions between the automatic labelling of the classifier and the manual labels provided by the labellers. We also compute the Mean Error Rate (MER) and the kappa (see Section 2.2.4) between human annotation and system classification for the performance evaluation. The kappa value of 0.53 corresponds here to the performance of the system corrected for chance. This value takes into account the unbalanced repartition of the data into the two classes and allows us to compare the system with chance (when $\kappa = 0$, the system performs no better than chance).
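For reference, a standard sketch of the chance-corrected agreement (Cohen's kappa) computed from a confusion matrix of counts; this is the usual textbook formulation, assumed here to correspond to the paper's kappa.

```python
import numpy as np

def cohen_kappa(confusion):
    """confusion[i, j]: number of segments with manual label i and system label j."""
    c = np.asarray(confusion, dtype=float)
    total = c.sum()
    p_observed = np.trace(c) / total
    p_chance = np.sum(c.sum(axis=0) * c.sum(axis=1)) / total ** 2
    return (p_observed - p_chance) / (1.0 - p_chance)
```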

The mean accuracy rate of the system is 71%. This corresponds to quite promising results given the diversity of fear manifestations illustrated in the SAFE corpus (400 speakers, various emergence contexts and recording conditions).



Table 9

Confusion matrix, mean error rate (MER), equal error rate (EER) and κ for fear vs. neutral classification tested on SAFE_1 (Table 6) (confidence interval at 95%: radius 3%)

manual \ automatic | Neutral | Fear
Neutral            |   71%   | 29%
Fear               |   30%   | 70%
MER = 29%, EER = 29%, κ = 0.53

Otherwise, while one would expect performance to deteriorate when trying to detect fear expressed in real contexts, performance could be improved by adapting the system to the specific sound environment and recording conditions of a given surveillance application.

We also compute the confusion matrices between the system outputs and each of the three labellers separately in Tables 10, 11 and 12. The EERs obtained range from 30% (when considering Lab3's annotations) to 35% (when considering Lab1's annotations), i.e. a 5% gap. These results are similar, given the confidence intervals.

Table 10

Confusion matrix for the fear vs. neutral classification system using Lab1's annotations as a reference (704 segments for the neutral class and 631 segments for the fear class, confidence interval at 95%: radius 4%)

Lab1 \ System | Neutral | Fear
Neutral       |   70%   | 30%
Fear          |   39%   | 61%
MER = 34%, EER = 35%, κ = 0.48



Table 11

Confusion matrix for the fear vs. neutral classification system using Lab2's annotations as a reference (1322 segments for the neutral class and 518 segments for the fear class, confidence interval at 95%: radius 4%)

Lab2 \ System | Neutral | Fear
Neutral       |   69%   | 31%
Fear          |   32%   | 68%
MER = 32%, EER = 32%, κ = 0.45

Table 12

Confusion matrix for the fear vs. neutral classification system using Lab3's annotations as a reference (352 segments for the neutral class and 309 segments for the fear class, confidence interval at 95%: radius 5%)

Lab3 \ System | Neutral | Fear
Neutral       |   72%   | 29%
Fear          |   31%   | 69%
MER = 30%, EER = 30%, κ = 0.53

4.2.3 System performance vs. human performance

A supplementary blind annotation based on the audio support only (i.e. by listening to the segments with no access to the contextual information conveyed by the video and by the global content of the sequence) has been carried out by an additional labeller (LabSys) on SAFE_1. LabSys has to classify the segments into the categories fear or neutral with the same available information as the one provided to the system. We present in Table 13 the confusion matrix and the kappa score obtained by LabSys on SAFE_1. This table can be linked with Table 9 in order to compare the system performance with human performance.

The kappa obtained by the system is 0.53. This is a good performance compared to the value of 0.57 obtained by LabSys. However, the behaviours of LabSys and the system are quite different. LabSys is better at recognizing neutral (99% of correct recognition against 71% for the system), and the system is better at recognizing fear (70% of correct recognition against 64% for LabSys).



Table 13

Confusion matrix obtained by LabSys on SAFE_1

SAFE_1 \ LabSys | Neutral | Fear
Neutral         |   99%   |  1%
Fear            |   36%   | 64%
κ = 0.57

LabSys annotates more segments as neutral: almost all the segments annotated neutral and 36% of those annotated fear in SAFE_1 are labelled neutral by LabSys. This shows that some fear cues are difficult to perceive with the audio channel alone.

4.3 Local system behaviour

Table 14 specifies the system behaviour on the various segments according to the threat during which they occur. With this aim, the emotional category annotations are correlated with the threat track annotations, as presented in Clavel et al. (2007). Five fear subclasses are thus obtained:

- NoThreat Fear: fear occurring during a normal situation, i.e. a situation with no threat,
- Latent Fear: fear occurring during latent threats,
- Potential Fear: fear occurring during potential threats,
- Immediate Fear: fear occurring during immediate threats,
- Past Fear: fear occurring during past threats.
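A minimal sketch of how such subclasses can be derived by crossing the two annotation tracks; the field names and threat labels used below are illustrative, not the corpus' actual annotation format.

```python
THREAT_TO_SUBCLASS = {
    None:        "NoThreat Fear",   # segment occurring outside any threat
    "latent":    "Latent Fear",
    "potential": "Potential Fear",
    "immediate": "Immediate Fear",
    "past":      "Past Fear",
}

def fear_subclass(segment):
    """segment: dict with the emotional label and the threat type of the
    sequence in which it occurs (hypothetical field names)."""
    if segment["emotion"] != "fear":
        return None
    return THREAT_TO_SUBCLASS[segment.get("threat")]
```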

The reliability of the error rates $err$ is evaluated by the 95% confidence interval (Bengio and Mariéthoz, 2004). The radius $r$ of the confidence interval $I = [err - r; err + r]$ is computed according to the following formula:

$r = 1.96 \sqrt{\frac{err(1 - err)}{N_{seg}}}$

where $N_{seg}$ is the number of segments used for the test.
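A direct transcription of this formula, which also recovers the roughly 3% radius quoted for the global results on the 994 segments of SAFE_1:

```python
import math

def confidence_radius(err, n_seg):
    """Radius of the 95% confidence interval for an error rate measured on n_seg segments."""
    return 1.96 * math.sqrt(err * (1.0 - err) / n_seg)

# e.g. for the MER of Table 9: confidence_radius(0.29, 994) ~= 0.028, i.e. about 3%
```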

The segment distribution of the fear class in the experimental database, according to the type of threat during which the segment occurs, is presented in Table 14.

With regard to fear recognition, we can see in Table 9 that 70% of the segments labelled fear are correctly recognized by the system. Best performances (78%) are obtained on Immediate Fear segments. By contrast, the recognition

