
Recognition of Emotion From Vocal Cues

JOHNSON, William F., et al.

Abstract

In two studies investigating the recognition of emotion from vocal cues, each of four emotions (joy, sadness, anger, and fear) was posed by an actress speaking the same, semantically neutral sentence. Judgments of emotion expressed in these segments were compared with similar judgments of voice-synthesized (Moog synthesizer) samples (study 1) or with three different alterations of the full-speech mode (study 2). Correct identification of the posed emotion was high in the full-speech samples. Voice-synthesized samples seemed to capture some cues promoting emotion recognition, but correct identification did not approach that of other segments. Recognition of emotion decreased, but not as dramatically as expected, in each of the three alterations of the original samples.

JOHNSON, William F., et al. Recognition of Emotion From Vocal Cues. Archives of General Psychiatry, 1986, vol. 43, no. 3, p. 280-283.

DOI: 10.1001/archpsyc.1986.01800030098011

Available at:

http://archive-ouverte.unige.ch/unige:101897

Disclaimer: layout of this document may differ from the published version.


Recognition of Emotion From Vocal Cues

William F. Johnson, MD; Robert N. Emde, MD; Klaus R. Scherer, PhD; Mary D. Klinnert, PhD

Accepted for publication Sept 28, 1984.
From the Department of Psychiatry, University of Colorado Health Sciences Center, Denver (Drs Johnson and Emde); the Department of Psychology, University of Giessen, Giessen, West Germany (Dr Scherer); and the Department of Pediatric Psychiatry, the National Jewish Hospital, Denver (Dr Klinnert).
Reprint requests to Department of Psychiatry, C268, University of Colorado Health Sciences Center, 4200 E Ninth Ave, Denver, CO 80262 (Dr Emde).

In two studies investigating the recognition of emotion from vocal cues, each of four emotions (joy, sadness, anger, and fear) was posed by an actress speaking the same, semantically neutral sentence. Judgments of emotion expressed in these segments were compared with similar judgments of voice-synthesized (Moog synthesizer) samples (study 1) or with three different alterations of the full-speech mode (study 2). Correct identification of the posed emotion was high in the full-speech samples. Voice-synthesized samples seemed to capture some cues promoting emotion recognition, but correct identification did not approach that of other segments. Recognition of emotion decreased, but not as dramatically as expected, in each of the three alterations of the original samples.

(Arch Gen Psychiatry 1986;43:280-283)

The voice and the face carry much of the affective content of human communication and often form a basis for clinical intuition. Mental health professionals, in particular, rely on skills in recognizing and understanding emotions in the context of diagnosis and treatment. It would seem valuable to design programs to enhance clinical skills of emotion recognition, but to do this it is important to understand the features of normal emotional communication. The basis of such a view can be found in Darwin,1 who long ago postulated the existence of universal discrete expressions of emotion, an idea that in recent years has received validation from extensive cross-cultural research involving facial expressions.2,3 The cross-cultural findings implied an innate ability to express and recognize different emotional expressions involving the face, and salient features of the face in each of the basic emotions have now been defined.4,5 As Darwin also observed, however, discrete emotions are expressed through the vocal channel as well.

Heretofore, the voice has been a relatively neglected aspect of communication. Nonverbal vocal cues have been little investigated, in part because of the difficulties posed by their intimate association with speech. This investigation was designed to explore the accuracy with which emotions are recognized from vocal cues. Intuitively we seem to understand that humans recognize vocal expressions of emotions quite well. The question remaining for research is, What, other than the verbal content of a person's speech, contributes to the hearer's recognition that the speaker is sad rather than happy or angry rather than fearful? In other words, what role do nonverbal cues play in the recognition of emotion from verbal communications?

Mixed results from previous experimental studies of emotion recognition from the voice make it difficult to draw any conclusions (see summaries of this research6,7). In this study we attempted to overcome a number of problems plaguing previous experimental designs. We used clear signals to give unambiguous expressions of basic emotions. We compared signals that had a recognizable verbal content with nonverbal signals and with those that had the verbal content masked. In previous studies, which concentrated on full speech, it was difficult to determine which cues were important in the judges' decisions. Therefore, in this study, we used a variety of signal-masking techniques to better delineate the acoustic features contributing to emotion recognition. Further, because many previous studies forced respondents to choose between a limited number of listed emotions (thereby potentially introducing unknown constraints), we compared forced-choice responses with a free-response technique.

METHOD

Two studies were done. Both studies used a convenience sample of judges composed primarily of professional trainees in clinical mental health (first- and second-year psychiatric residents and clinical psychology interns). Twenty-one judges were used in study 1 and 23 were used in study 2. The studies were done one year apart, and 16 judges in study 2 had also participated in study 1. Judges were in their late 20s or early 30s and were gathered from the clinical setting in which one of us (W.F.J.) worked. Male judges predominated, with 75% in study 1 and 67% in study 2.

Study 1

In study 1, judgments of the emotion expressed in voice samples were compared with similar judgments of voice-synthesized (Moog synthesizer) samples as used in previous work8 (also see discussion below).

Table 1.—Results of Study 1

                                % (Proportion) of Correct Identification
Acoustic Stimulus and Emotion     Forced Choice       Free Response
Human voice
  Joy                             100 (84/84)         51.2 (43/84)
  Sadness                         100 (84/84)         91.7 (77/84)
  Anger                           100 (84/84)         97.6 (82/84)
  Fear                            98.8 (83/84)        52.4 (44/84)
Voice synthesizer
  Joy                             65.5 (55/84)        40.5 (34/84)
  Sadness                         92.9 (78/84)        50.0 (42/84)
  Anger                           32.1 (27/84)        27.4 (23/84)
  Fear                            32.1 (27/84)        7.1 (6/84)

For the human voice samples, an actress who was familiar with the research question spoke a semantically neutral sentence: "The green book is lying on the table." The sentence was presented in such a way as to give clear vocal expressions of each of four common discrete emotions: joy, sadness, anger, and fear. While this departs somewhat from what would be encountered in a clinical situation, it serves to reduce potential confusion caused by the content of the sentence. For voice-synthesized samples, non-speech stimuli were used that in earlier work8,9 had been shown to generate tendencies toward an attribution of these same four emotions. We thus hoped to assess the relative importance of certain known synthetically encoded cues for emotion recognition.

For study 1, each vocal expression was reproduced four times and randomly ordered, resulting in 16 voice items. Voice-synthesized items were similarly reproduced and randomly ordered. In this study we compared two different judgment procedures: one involving forced-choice labeling and the other involving free-response labeling. In the forced-choice procedure, judges were told the four emotions under study and were asked to select one of the four choices for each auditory stimulus. In the free-response procedure, judges were asked to respond to each auditory stimulus with a word or phrase describing the emotion they heard most clearly expressed. Free-response emotion labels were then categorized into discrete emotions according to a lexicon of emotion words previously validated by Izard.10 The lexicon allows placement of English words into one of the following categories: joy, interest/excitement, sadness/distress, surprise, disgust, anger, fear, shame, passive-bored/sleepy, and unclassifiable. (The original lexicon was modified slightly and is available on request.)
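To make the scoring step concrete, the sketch below shows one way such a lexicon-based categorization could be implemented. The word lists are hypothetical placeholders introduced only for illustration; the modified Izard lexicon actually used in the study is available from the authors on request and is not reproduced here.

# Hypothetical sketch of the free-response scoring step. The word lists
# below are illustrative placeholders, NOT the validated Izard lexicon
# used in the study.
EMOTION_LEXICON = {
    "joy": {"happy", "joyful", "glad", "cheerful", "delighted"},
    "interest/excitement": {"interested", "excited", "eager", "curious"},
    "sadness/distress": {"sad", "unhappy", "distressed", "gloomy"},
    "surprise": {"surprised", "astonished", "startled"},
    "disgust": {"disgusted", "revolted"},
    "anger": {"angry", "mad", "furious", "irritated"},
    "fear": {"afraid", "scared", "fearful", "anxious"},
    "shame": {"ashamed", "embarrassed"},
    "passive-bored/sleepy": {"bored", "sleepy", "tired", "passive"},
}

def categorize_response(response: str) -> str:
    """Map a judge's free-response word or phrase to one discrete-emotion
    category; anything not matched in the lexicon is 'unclassifiable'."""
    for word in response.lower().replace("/", " ").split():
        for category, words in EMOTION_LEXICON.items():
            if word in words:
                return category
    return "unclassifiable"

# Example: categorize_response("kind of gloomy") -> "sadness/distress"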

The presentation of voice-synthesized and human voice samples was varied to examine for order effects, but to avoid bias the free-response procedure was always administered before the forced-choice procedure.

Study 2

One year later the above-mentioned voice samples were used in study 2. In this study, however, responses to these stimuli were compared with those of the same voice samples subjected to three separate "masking" conditions designed to obscure certain acoustic factors and to highlight others. The "altered" stimuli were thus composed of the following: (1) content-filtered stimuli (low-pass filtered with a cutoff frequency of 500 Hz; a filter with 45-dB roll-off was used [Krohn-Hite]); (2) randomly spliced stimuli (with voice sample files segmented into units of about 200 ms and randomly rearranged with the use of a digital speech manipulation system); and (3) reverse speech stimuli, with the original voice sample tapes played backward (for details on these three procedures, see the report by Scherer11). These masking conditions affect different paralinguistic cues. In the content-filtered condition, voice quality is masked and distorted while loudness is partially masked, but fundamental frequency (pitch), intonation, tempo, rhythm, and pauses are not influenced. With random splicing, the pauses in speech are fully masked; rhythm is distorted; intonation and tempo are partially masked; and loudness, fundamental frequency, and voice quality are unchanged. With backward speech, rhythm, intonation, and voice quality are all distorted, and loudness, fundamental frequency, pauses, and tempo are preserved.12
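As a rough illustration of what these three alterations do to a waveform, the sketch below approximates each condition in software with numpy and scipy. This is not the apparatus used in the study (an analog Krohn-Hite filter and a dedicated digital speech manipulation system); only the 500-Hz cutoff and the roughly 200-ms segment length are taken from the description above, and the digital filter order is an assumption chosen to give a roll-off comparable to 45 dB.

# Software approximation of the three masking conditions of study 2
# (illustrative only; the study used analog and dedicated digital equipment).
import numpy as np
from scipy.signal import butter, sosfilt

def content_filter(x: np.ndarray, sr: int, cutoff_hz: float = 500.0) -> np.ndarray:
    """Low-pass filter at 500 Hz: masks voice quality and partially masks
    loudness, while pitch, intonation, tempo, rhythm, and pauses survive.
    An 8th-order Butterworth (~48 dB/octave) stands in for the 45-dB roll-off."""
    sos = butter(8, cutoff_hz, btype="low", fs=sr, output="sos")
    return sosfilt(sos, x)

def random_splice(x: np.ndarray, sr: int, segment_ms: int = 200, seed: int = 0) -> np.ndarray:
    """Cut the sample into ~200 ms units and reassemble them in random order:
    pauses are fully masked and rhythm is distorted, but loudness, fundamental
    frequency, and voice quality are unchanged."""
    rng = np.random.default_rng(seed)
    seg_len = int(sr * segment_ms / 1000)
    segments = [x[i:i + seg_len] for i in range(0, len(x), seg_len)]
    order = rng.permutation(len(segments))
    return np.concatenate([segments[i] for i in order])

def reverse_speech(x: np.ndarray) -> np.ndarray:
    """Play the sample backward: rhythm, intonation, and voice quality are
    distorted; loudness, fundamental frequency, pauses, and tempo remain."""
    return x[::-1]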

As in study 1, our stimulus presentation involved 16 unaltered voice items (four each of the four emotions). In study 2, however, there were also 48 altered stimuli: four samples of each of three alterations of the four original vocal stimuli. All judgments in study 2 were made according to the free-response labeling procedure and were scored as in study 1.

RESULTS

Study 1

For the voice samples, the accuracy of emotion judgment was almost perfect under the forced-choice condition (Table 1). There was only one error among 336 responses. Even with free-response labeling, sadness and anger were accurately recognized. Samples representing fear and joy, however, were identified as such only 50% of the time; judges often "heard" interest/excitement or surprise or made other responses that could not be classified.

The voice-synthesized samples were not as accurately recognized by our judges. In the forced-choice condition, the sadness stimuli were most often correctly identified (93% correct), followed by joy (65% correct). Anger and fear were recognized at levels close to expectable chance probability (32%). With free-response labeling, only 50% recognized the sadness signals, 40% recognized joy, 27% recognized anger, and just 7% recognized fear. The order of stimulus presentation (synthesized voice vs human voice and vice versa) had no effect on the results.

Study 2

Our results for study 2 (given in Table 2) reinforce the impression that the human voice carries important information about emotional state. The procedure of free-response labeling maximizes the opportunity for judgments involving other emotions; thus, some judges may respond to "blends" (eg, of interest or surprise with joy or fear stimuli). For the unaltered voice, judges correctly identified joy, sadness, and anger more than 80% of the time, and sadness and anger signals were almost perfectly recognized. Fear was identified with 51% accuracy. These results replicate those obtained one year earlier with the free-response labeling procedure of study 1. Test-retest reliability for the 16 judges participating in both studies was 74% (189/256), despite the lengthy interval between study 1 and study 2.
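A minimal sketch of how such a figure can be computed, assuming (as the counts suggest) that each of the 16 repeat judges gave one categorized free response to each of the 16 unaltered items in both sessions, yielding 256 paired judgments:

# Minimal sketch of a test-retest agreement computation under the assumption
# of 16 judges x 16 unaltered items = 256 paired judgments across sessions.
def test_retest_agreement(labels_study1: list[str], labels_study2: list[str]) -> float:
    """Percent of paired judgments assigned the same emotion category in both sessions."""
    assert len(labels_study1) == len(labels_study2)
    same = sum(a == b for a, b in zip(labels_study1, labels_study2))
    return 100.0 * same / len(labels_study1)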

The results for the electronically altered voice samples were a surprise. The three alterations diminished correct identification only slightly, never more than 15%. Sadness and anger were still recognized with almost 95% accuracy. Joy fell from 85% to 75%, and fear slipped from 51% to 35% to 40% recognition. Thus, the pattern of incorrect responses changed little with signal masking. Joy and fear signals were most often misclassified as interest/excitement whether they were original or altered stimuli.

COMMENT

Our judges had little difficulty in recognizing the emotions represented by the unaltered voice samples. When forced to choose between four alternatives in study 1, the posed emotion was correctly identified more than 99% of the time, as compared with the 25% one might expect by chance. When judges were given no restrictions, they still accurately recognized all emotions more than 50% of the time, and for sadness and anger the recognition surpassed 90%. Similar results were obtained with the unaltered voice samples in study 2. The percent correct identification of sadness, anger, and fear was almost identical to that of study 1, but for joy, identification increased from 50% to 85%. Such improvement led us to wonder about a possible practice effect, even though the studies were conducted approximately one year apart. Had our judges "learned" something about recognizing joy?

Table 2.—Results of Study 2

Emotion    Acoustic Stimulus    % (Proportion) of Correct Identification*
Joy        Full voice           84.8 (78/92)
           Random-spliced       76.1 (70/92)
           Reverse              72.8 (67/92)
           Content filtered     77.2 (71/92)
Sadness    Full voice           100 (92/92)
           Random-spliced       96.7 (89/92)
           Reverse              94.6 (87/92)
           Content filtered     96.7 (89/92)
Anger      Full voice           98.9 (91/92)
           Random-spliced       96.7 (89/92)
           Reverse              94.6 (87/92)
           Content filtered     92.4 (85/92)
Fear       Full voice           51.1 (47/92)
           Random-spliced       41.3 (38/92)
           Reverse              41.3 (38/92)
           Content filtered     35.9 (33/92)

*Free response. Incorrect responses fell into the categories joy, interest/excitement, sadness/distress, passive-bored/sleepy, surprise, disgust, and "other."†
†"Other" includes the anger, fear, shame, and unclassifiable categories: 68 of 69 incorrect responses labeled "other" were "unclassifiable"; one was "fear."

The voice-synthesized samples were considerably harder for our judges to "decode" emotionally. Some of the essential cues for specific emotion recognition seemed to be contained in the samples because, in the forced-choice situation, joy and sadness were recognized with greater than chance frequency. The voice-synthesized sounds, however, did not contain a sufficient number of unambiguous, emotion-specific cues to allow accurate emotion recognition, a conclusion further demonstrated by the fact that correct identifications fell off dramatically when judges were free to answer with any emotion word. Some samples did seem to capture more of the essential cues than others. Even with the free-response technique, 50% of our judges correctly identified the voice-synthesized sadness samples, whereas only 7% recognized "fear."

The remarkable ease with which emotions were recognized in the human voice in study 1 led us to the signal manipulations of study 2. A high percentage of correct judgments persisted even when acoustic cues present in the original voice samples were degraded by these techniques. On the basis of previous theoretic and experimental work,8,13 we had expected judgment accuracy of an emotion to fall off as relevant acoustic factors were masked by each signal alteration. For each type of alteration, the percent of correct judgments fell slightly. Because none of these resulted in a major drop of accuracy, however, we could not form any hypotheses concerning the relevance of specific cues that were masked.

Overall, sadness was the emotion most clearly recognized; even voice-synthesized samples were correctly identified 50% of the time (90% in the forced-choice procedure). The nonverbal cues for this emotion seem to be basic: slow tempo and little pitch variation that yield an impression of an energyless or passive speech style. Tempo is partially masked in the randomly spliced stimuli, but these samples were no less accurately identified than the other two. Enough essential cues survived the manipulations that produced the altered stimuli, and such cues were fairly well captured in the voice-synthesized samples.

Anger was consistently recognized with a high degree of accuracy from the human voice-based signals but fared poorly in the voice-synthesized samples. On the basis of previous experimental work, salient acoustic characteristics of anger were thought to be many harmonics, fast tempo, high pitch level, small pitch variation, and rising pitch contours. Although each of the masking techniques used in this study would substantially degrade selected characteristics, recognition of anger in the altered stimuli remained high. These results suggest that there is no unique combination of paralinguistic features necessary to recognize anger and that there is enough redundancy in the features remaining to allow identification. Perhaps there is a special "angry" voice quality, eg, a growling harshness, that persists even in the electronic filter condition and allows for the clear identification. This voice-quality impression might be based on pitch perturbation.13

Fear was the emotion our judges had most difficulty recognizing. Although there was only one error in the forced-choice condition of study 1, correct identification of unaltered voice samples of fear fell to 50% in the two (studies 1 and 2) free-response conditions. All other emotions in study 2 were recognized over 80% of the time. Fear signals were often classified as interest/excitement and, less often, as surprise; this may reflect a common element of "arousal." Arousal (sometimes referred to as "stress" or "anxiety" in previous studies of nonverbal communication) has been shown to affect emotion recognition by means of its association with an increase in mean fundamental frequency.14,15 While this mechanism may have influenced fear judgments, and although there is increased arousal in anger as well, our judges did not misidentify anger.

Despite the masking conditions, joy was correctly identified by approximately 75% of our judges. This figure is substantially less than for anger and sadness but is more than for fear. While there are aspects of a joyful voice that were not rendered unrecognizable by the signal degradations, the judges' confusion with interest/excitement and surprise hints that arousal is again an important cue. Joy is an emotion whose paralinguistic features have been poorly studied (in part because much of the research has focused on detecting stress and/or deception).

CONCLUSION

Our studies have demonstrated that the voice is indeed a powerful source of information about emotion and that it is a source that is difficult to "disguise." Vocal cues seem to be strong and stable, yet subtle. We do not yet understand the degree of redundancy in such signals or how the various characteristics contribute to particular emotion recognition.

Clinicians may underestimate how much the vocal channel contributes to one's intuitive "feeling" about a patient. Voice quality is less obvious than physical appearance but seems to be an important aspect of impressions about another person. Vocal cues may be especially relevant when our intuition is apparently at odds with the words of the patient. Astute clinicians may be able to "hear through" attempts to disguise affect. Although it has been widely claimed that nonverbal communication of emotion is primarily accomplished by the visual channel,16 there is some evidence that personality judgments based on the vocal channel correlate more highly with whole-person judgments.17 As our results indicate, even limited vocal cues allow the correct identification of some emotions.

Clinical sensitivity is likely to remain an important aspect of psychiatric practice, despite the increasing sophistication of special diagnostic procedures. A crucial part of our clinical skill is the ability to recognize both visually and vocally communicated affects and to interpret the words of the patient within these contexts. Our study has demonstrated the considerable communicative power of the voice and has suggested that clinicians may respond to that source of information more than had been appreciated. As the important features of nonverbal vocal communication are further delineated, sensitivity to this channel of information about emotion will be better understood, and methods to impart such sensitivity to clinicians can be developed. We await further research to integrate these findings about the voice with data about facial expression as we strive to better understand the phenomenon of clinical intuition.

References

1. Darwin C: The Expression of the Emotions in Man and Animals. London, John Murray, 1872 (reprinted University of Chicago Press, Chicago, 1965).
2. Ekman P, Friesen WV, Ellsworth P: Emotion in the Human Face: Guidelines for Research and an Integration of Findings. New York, Pergamon Press, 1972.
3. Izard CE: The Face of Emotion. New York, Appleton-Century-Crofts, 1971.
4. Scherer KR, Ekman P (eds): Handbook of Methods in Nonverbal Behavior Research. New York, Cambridge University Press, 1982.
5. Izard CE (ed): Measuring Emotions in Infants and Children. New York, Cambridge University Press, 1982.
6. Scherer KR: Nonlinguistic indicators of emotion and psychopathology, in Izard CE (ed): Emotions in Personality and Psychopathology. New York, Plenum Publishing Corp, 1979, pp 495-529.
7. Scherer KR: Speech and emotional states, in Darby J (ed): Speech Evaluation in Psychiatry. New York, Grune & Stratton Inc, 1981, pp 115-135.
8. Scherer KR, Oshinsky J: Cue utilization in emotion attribution from auditory stimuli. Motivation and Emotion 1977;1:331-346.
9. Scherer KR: Acoustic concomitants of emotional dimensions: Judging affect from synthesized tone segments, in Weitz S (ed): Nonverbal Communication. New York, Oxford University Press Inc, 1979, pp 105-111.
10. Izard C: Patterns of Emotions: A New Analysis of Anxiety and Depression. New York, Academic Press Inc, 1972.
11. Scherer KR: Methods of research on vocal communication: Paradigms and parameters, in Scherer KR, Ekman P (eds): Handbook of Methods in Nonverbal Behavior Research. New York, Cambridge University Press, 1982, pp 136-198.
12. Scherer KR, Feldstein S, Bond RN, Rosenthal R: Vocal cues to deception: A comparative channel approach, in press.
13. Lieberman P: Perturbations in vocal pitch. J Acoust Soc Am 1961;33:597-603.
14. Williams CE, Stevens KN: Emotions and speech: Some acoustical correlates. J Acoust Soc Am 1972;52:1238-1250.
15. Streeter LA, Krauss RM, Geller V, Olson LL, Apple W: Pitch changes during attempted deception. J Pers Soc Psychol 1977;35:345-350.
16. DePaulo BM, Rosenthal R, Eisenstat RA, Rogers PL, Finkelstein S: Decoding discrepant nonverbal cues. J Pers Soc Psychol 1978;36:313-323.
17. Ekman P, Friesen WV, O'Sullivan M, Scherer K: Relative importance of face, body and speech in judgments of personality and affect. J Pers Soc Psychol 1980;38:270-277.
