Recognition of Emotion From Vocal Cues
William F. Johnson, MD; Robert N. Emde, MD; Klaus R. Scherer, PhD; Mary D. Klinnert, PhD

In two studies investigating the recognition of emotion from vocal cues, each of four emotions (joy, sadness, anger, and fear) was posed by an actress speaking the same, semantically neutral sentence. Judgments of emotion expressed in these segments were compared with similar judgments of voice-synthesized (Moog synthesizer) samples (study 1) or with three different alterations of the full-speech mode (study 2). Correct identification of the posed emotion was high in the full-speech samples. Voice-synthesized samples seemed to capture some cues promoting emotion recognition, but correct identification did not approach that of other segments. Recognition of emotion decreased, but not as dramatically as expected, in each of the three alterations of the original samples.

(Arch Gen Psychiatry 1986;43:280-283)

Accepted for publication Sept 28, 1984.
From the Department of Psychiatry, University of Colorado Health Sciences Center, Denver (Drs Johnson and Emde); the Department of Psychology, University of Giessen, Giessen, West Germany (Dr Scherer); and the Department of Pediatric Psychiatry, the National Jewish Hospital, Denver (Dr Klinnert).
Reprint requests to Department of Psychiatry, C268, University of Colorado Health Sciences Center, 4200 E Ninth Ave, Denver, CO 80262 (Dr Emde).
The voice and the face carry much of the affective content of human communication and often form a basis for clinical intuition. Mental health professionals, in particular, rely on skills in recognizing and understanding emotions in the context of diagnosis and treatment. It would seem valuable to design programs to enhance clinical skills of emotion recognition, but to do this it is important to understand the features of normal emotional communication. The basis of such a view can be found in Darwin,1 who long ago postulated the existence of universal discrete expressions of emotion, an idea that in recent years has received validation from extensive cross-cultural research involving facial expressions.2,3 The cross-cultural findings implied an innate ability to express and recognize different emotional expressions involving the face, and salient features of the face in each of the basic emotions have now been defined.4,5 As Darwin also observed, however, discrete emotions are expressed through the vocal channel as well. Heretofore, the voice has been a relatively neglected aspect of communication. Nonverbal vocal cues have been little investigated, in part because of the difficulties posed by their intimate association with speech. This investigation was designed to explore the accuracy with which emotions are recognized from vocal cues.

Intuitively we seem to understand that humans recognize vocal expressions of emotions quite well. The question remaining for research is, What, other than the verbal content of a person's speech, contributes to the hearer's recognition that the speaker is sad rather than happy or angry rather than fearful? In other words, what role do nonverbal cues play in the recognition of emotion from verbal communications? Mixed results from previous experimental studies of emotion recognition from the voice make it difficult to draw any conclusions (see summaries of this research6,7).

In this study we attempted to overcome a number of problems plaguing previous experimental designs. We used clear signals to give unambiguous expressions of basic emotions. We compared signals that had a recognizable verbal content with nonverbal signals and with those that had the verbal content masked. In previous studies, which concentrated on full speech, it was difficult to determine which cues were important in the judges' decisions. Therefore, in this study, we used a variety of signal-masking techniques to better delineate the acoustic features contributing to emotion recognition. Further, because many previous studies forced respondents to choose between a limited number of listed emotions (thereby potentially introducing unknown constraints), we compared forced-choice responses with a free-response technique.
METHOD
Two studies were done. Both studies used a convenient sample of judges composed primarily of professional trainees in clinical mental health (first- and second-year psychiatric residents and clinical psychology interns). Twenty-one judges were used in study 1 and 23 were used in study 2. The studies were done one year apart, and 16 judges in study 2 had also participated in study 1. Judges were in their late 20s or early 30s and were gathered from the clinical setting in which one of us (W.F.J.) worked. Male judges predominated, with 75% in study 1 and 67% in study 2.

Study 1

In study 1, judgments of the emotion expressed in voice samples were compared with similar judgments of voice-synthesized (Moog synthesizer) samples as used in previous work8 (also see discussion below).
Table 1.—Results of Study 1

                                % (Proportion) of Correct Identification
Acoustic Stimulus and Emotion     Forced Choice       Free Response
Human voice
  Joy                             100 (84/84)         51.2 (43/84)
  Sadness                         100 (84/84)         91.7 (77/84)
  Anger                           100 (84/84)         97.6 (82/84)
  Fear                            98.8 (83/84)        52.4 (44/84)
Voice synthesizer
  Joy                             65.5 (55/84)        40.5 (34/84)
  Sadness                         92.9 (78/84)        50 (42/84)
  Anger                           32.1 (27/84)        27.4 (23/84)
  Fear                            32.1 (27/84)        7.1 (6/84)
For the human voice samples, an actress who was familiar with the research question spoke a semantically neutral sentence: "The green book is lying on the table." The sentence was presented in such a way as to give clear vocal expressions of each of four common discrete emotions: joy, sadness, anger, and fear. While this departs somewhat from what would be encountered in a clinical situation, it serves to reduce potential confusion caused by the content of the sentence. For voice-synthesized samples, nonspeech stimuli were used that in earlier work8,9 had been shown to generate tendencies toward an attribution of these same four emotions. We thus hoped to assess the relative importance of certain known synthetically encoded cues for emotion recognition.

For study 1, each vocal expression was reproduced four times and randomly ordered, resulting in 16 voice items. Voice-synthesized items were similarly reproduced and randomly ordered. In this study we compared two different judgment procedures: one involving forced-choice labeling and the other involving free-response labeling. In the forced-choice procedure, judges were told the four emotions under study and were asked to select one of the four choices for each auditory stimulus. In the free-response procedure, judges were asked to respond to each auditory stimulus with a word or phrase describing the emotion they heard most clearly expressed. Free-response emotion labels were then categorized into discrete emotions according to a lexicon of emotion words previously validated by Izard.10 The lexicon allows placement of English words into one of the following categories: joy, interest/excitement, sadness/distress, surprise, disgust, anger, fear, shame, passive-bored/sleepy, and unclassifiable. (The original lexicon was modified slightly and is available on request.)
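To make the categorization step concrete, the following Python sketch shows how free-response labels might be mapped to discrete categories of this kind. The category names follow the text above, but the word lists and the categorize function are illustrative assumptions, not Izard's validated lexicon.

```python
# Hypothetical sketch of free-response scoring against an emotion-word lexicon.
# The category names mirror those listed in the text; the example word sets are
# illustrative placeholders, not Izard's validated lexicon.

LEXICON = {
    "joy": {"happy", "joyful", "glad", "cheerful"},
    "interest/excitement": {"interested", "excited", "curious"},
    "sadness/distress": {"sad", "unhappy", "distressed", "gloomy"},
    "surprise": {"surprised", "astonished"},
    "disgust": {"disgusted", "repulsed"},
    "anger": {"angry", "furious", "irritated"},
    "fear": {"afraid", "scared", "fearful", "anxious"},
    "shame": {"ashamed", "embarrassed"},
    "passive-bored/sleepy": {"bored", "sleepy", "tired"},
}

def categorize(response: str) -> str:
    """Map a free-response word or phrase to a discrete emotion category."""
    words = [w.strip('.,!?"') for w in response.lower().split()]
    for category, entries in LEXICON.items():
        if any(word in entries for word in words):
            return category
    return "unclassifiable"

print(categorize("She sounds very scared"))  # -> fear
print(categorize("kind of wistful"))         # -> unclassifiable
```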
The presentation of voice-synthesized and human voice samples was varied to examine for order effects, but to avoid bias the free-response procedure was always administered before the forced-choice procedure.
Study 2

One year later the above-mentioned voice samples were used in study 2. In this study, however, responses to these stimuli were compared with those of the same voice samples subjected to three separate "masking" conditions designed to obscure certain acoustic factors and to highlight others. The "altered" stimuli were thus composed of the following: (1) content-filtered stimuli (low-pass filtered with a cutoff frequency of 500 Hz; a filter with 45-dB roll-off was used [Krohn-Hite]); (2) randomly spliced stimuli (with voice sample files segmented into units of about 200 ms and randomly rearranged with the use of a digital speech manipulation system); and (3) reverse speech stimuli, with the original voice sample tapes played backward (for details on these three procedures, see the report by Scherer11). These masking conditions affect different paralinguistic cues. In the content-filtered condition, voice quality is masked and distorted while loudness is partially masked, but fundamental frequency (pitch), intonation, tempo, rhythm, and pauses are not influenced. With random splicing, the pauses in speech are fully masked; rhythm is distorted, intonation and tempo are partially masked, and loudness, fundamental frequency, and voice quality are unchanged. With backward speech, rhythm, intonation, and voice quality are all distorted, and loudness, fundamental frequency, pauses, and tempo are preserved.12
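For readers who want a concrete picture of the three alterations, the Python sketch below approximates them digitally. The original study used an analog Krohn-Hite filter and a laboratory speech manipulation system; the scipy-based Butterworth filter (whose order only roughly matches the 45-dB roll-off), the fixed 200-ms segment length, and the function names are assumptions made for illustration only.

```python
# Illustrative digital approximations of the three masking conditions.
# Filter design and segmentation details are assumptions for this sketch,
# not the equipment or settings used in the original study.
import numpy as np
from scipy.signal import butter, sosfilt

def content_filter(signal, sr, cutoff_hz=500.0):
    """Low-pass filter (cutoff ~500 Hz): masks voice quality and partially
    masks loudness, while pitch, intonation, tempo, rhythm, and pauses remain."""
    sos = butter(8, cutoff_hz, btype="low", fs=sr, output="sos")  # steep roll-off
    return sosfilt(sos, signal)

def random_splice(signal, sr, segment_ms=200, rng=None):
    """Cut the sample into ~200-ms units and reorder them randomly: pauses are
    masked and rhythm distorted, but loudness, pitch, and voice quality stay."""
    rng = rng or np.random.default_rng()
    seg_len = int(sr * segment_ms / 1000)
    segments = [signal[i:i + seg_len] for i in range(0, len(signal), seg_len)]
    rng.shuffle(segments)
    return np.concatenate(segments)

def reverse_speech(signal):
    """Play the sample backward: rhythm, intonation, and voice quality are
    distorted; loudness, fundamental frequency, pauses, and tempo are kept."""
    return signal[::-1]
```

Applied to a mono waveform array sampled at sr hertz, each function returns an altered copy of the stimulus.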
As in study 1, our stimulus presentation involved 16 unaltered voice items (four each of the four emotions). In study 2, however, there were also 48 altered stimuli: four samples of each of three alterations of the four original vocal stimuli. All judgments in study 2 were made according to the free-response labeling procedure and were scored as in study 1.

RESULTS
Study 1

For the voice samples, the accuracy of emotion judgment was almost perfect under the forced-choice condition (Table 1). There was only one error among 336 responses. Even with free-response labeling, sadness and anger were accurately recognized. Samples representing fear and joy, however, were identified as such only 50% of the time; judges often "heard" interest/excitement or surprise or made other responses that could not be classified.

The voice-synthesized samples were not as accurately recognized by our judges. In the forced-choice condition, the sadness stimuli were most often correctly identified (93% correct), followed by joy (65% correct). Anger and fear were recognized at levels close to expectable chance probability (32%). With free-response labeling, only 50% recognized the sadness signals, 40% recognized joy, 27% recognized anger, and just 7% recognized fear. The order of stimulus presentation (synthesized voice vs human voice and vice versa) had no effect on the results.
Study 2

Our results for study 2 (given in Table 2) reinforce the impression that the human voice carries important information about emotional state. The procedure of free-response labeling maximizes the opportunity for judgments involving other emotions; thus, some judges may respond to "blends" (eg, of interest or surprise with joy or fear stimuli). For the unaltered voice, judges correctly identified joy, sadness, and anger more than 80% of the time, and sadness and anger signals were almost perfectly recognized. Fear was identified with 51% accuracy. These results replicate those obtained one year earlier with the free-response labeling procedure of study 1. Test-retest reliability for the 16 judges participating in both studies was 74% (189/256), despite the lengthy interval between study 1 and study 2.

The results for the electronically altered voice samples were a surprise. The three alterations diminished correct identification only slightly, never more than 15%. Sadness and anger were still recognized with almost 95% accuracy. Joy fell from 85% to 75%, and fear slipped from 51% to 35% to 40% recognition. Thus, the pattern of incorrect responses changed little with signal masking. Joy and fear signals were most often misclassified as interest/excitement whether they were original or altered stimuli.

COMMENT
Our judges had little difficulty in recognizing the emotions represented by the unaltered voice samples. When forced to choose between four alternatives in study 1, the posed emotion was correctly identified more than 99% of the time, as compared with the 25% one might expect by chance. When judges were given no restrictions, they still accurately recognized all emotions more than 50% of the time, and for sadness and anger the recognition surpassed 90%. Similar results were obtained with the unaltered voice samples in study 2. The percent correct identification of sadness, anger, and fear was almost identical to that of study 1, but for joy, identification increased from 50% to 85%. Such improvement led us to wonder about a possible practice effect, even though the studies were conducted approximately one year apart. Had our judges "learned" something about recognizing joy?
Table 2.—Results of Study 2

Emotion and Acoustic Stimulus    % (Proportion) of Correct Identification*
Joy
  Full voice                     84.8 (78/92)
  Random-spliced                 76.1 (70/92)
  Reverse                        72.8 (67/92)
  Content filtered               77.2 (71/92)
Sadness
  Full voice                     100 (92/92)
  Random-spliced                 96.7 (89/92)
  Reverse                        94.6 (87/92)
  Content filtered               96.7 (89/92)
Anger
  Full voice                     98.9 (91/92)
  Random-spliced                 96.7 (89/92)
  Reverse                        94.6 (87/92)
  Content filtered               92.4 (85/92)
Fear
  Full voice                     51.1 (47/92)
  Random-spliced                 41.3 (38/92)
  Reverse                        41.3 (38/92)
  Content filtered               35.9 (33/92)

*Free response. Incorrect responses were categorized as joy, interest/excitement, sadness/distress, passive-bored/sleepy, surprise, disgust, or other.†
†"Other" includes anger, fear, shame, and unclassifiable categories: 68 of 69 incorrect responses labeled "other" were "unclassifiable"; one was "fear."

The voice-synthesized samples were considerably harder for our judges to "decode" emotionally. Some of the essential cues for specific emotion recognition seemed to be contained in the samples because, in the forced-choice situation, joy and sadness were recognized with greater than chance frequency. The voice-synthesized sounds, however, did not contain a sufficient number of unambiguous, emotion-specific cues to allow accurate emotion recognition, a conclusion further demonstrated by the fact that correct identifications fell off dramatically when judges were free to answer with any emotion word. Some samples did seem to capture more of the essential cues than others. Even with the free-response technique, 50% of our judges correctly identified the voice-synthesized sadness samples, whereas only 7% recognized "fear."
"fear."The remarkable ease with which emotions wererecog¬
nized in the human voice in
study
1 led us to thesignal manipulations
ofstudy
2. Ahigh percentage
of correctjudgments persisted
evenwhenacousticcuespresent
intheoriginal
voicesamples
weredegraded by
thesetechniques.
On the basis of
previous
theoretic andexperimental work,8,13
wehadexpected judgment
accuracyofanemotion tofall offasrelevant acoustic factorsweremaskedby
eachsignal
alteration. For eachtype
ofalteration,
thepercent
ofcorrect
judgments
fellslightly.
Because none of theseresulted ina
major drop
ofaccuracy,however,
wecouldnot form anyhypotheses concerning
the relevance ofspecific
cuesthat weremasked.
Overall, sadness was the emotion most clearly recognized; even voice-synthesized samples were correctly identified 50% of the time (90% in the forced-choice procedure). The nonverbal cues for this emotion seem to be basic: slow tempo and little pitch variation that yield an impression of an energyless or passive speech style. Tempo is partially masked in the randomly spliced stimuli, but these samples were no less accurately identified than the other two. Enough essential cues survived the manipulations that produced the altered stimuli, and such cues were fairly well captured in the voice-synthesized samples.

Anger was consistently recognized with a high degree of accuracy from the human voice-based signals, but fared poorly in the voice-synthesized samples.
On the basis of previous experimental work, salient acoustic characteristics of anger were thought to be many harmonics, fast tempo, high pitch level, small pitch variation, and rising pitch contours. Although each of the masking techniques used in this study would substantially degrade selected characteristics, recognition of anger in the altered stimuli remained high. These results suggest that there is no unique combination of paralinguistic features necessary to recognize anger and that there is enough redundancy in the remaining features to allow identification. Perhaps there is a special "angry" voice quality, eg, a growling harshness, that persists even in the electronic filter condition and allows for the clear identification. This voice-quality impression might be based on pitch perturbation.13
Fear was the emotion our judges had most difficulty recognizing. Although there was only one error in the forced-choice condition of study 1, correct identification of unaltered voice samples of fear fell to 50% in the two (studies 1 and 2) free-response conditions. All other emotions in study 2 were recognized over 80% of the time. Fear signals were often classified as interest/excitement and, less often, as surprise; this may reflect a common element of "arousal." Arousal (sometimes referred to as "stress" or "anxiety" in previous studies of nonverbal communication) has been shown to affect emotion recognition by means of its association with an increase in mean fundamental frequency.14,15 While this mechanism may have influenced fear judgments and although there is increased arousal in anger as well, our judges did not misidentify anger.
Despite the masking conditions, joy was correctly identified by approximately 75% of our judges. This figure is substantially less than that for anger and sadness but is more than that for fear. While there are aspects of a joyful voice that were not rendered unrecognizable by the signal degradations, the judges' confusion with interest/excitement and surprise hints that arousal is again an important cue. Joy is an emotion whose paralinguistic features have been poorly studied (in part because much of the research has focused on detecting stress and/or deception).
CONCLUSION
Our studies have demonstrated that the voice is indeed a powerful source of information about emotion and that it is a source that is difficult to "disguise." Vocal cues seem to be strong and stable, yet subtle. We do not yet understand the degree of redundancy in such signals or how the various characteristics contribute to particular emotion recognition.

Clinicians may underestimate how much the vocal channel contributes to one's intuitive "feeling" about a patient. Voice quality is less obvious than physical appearance but seems to be an important aspect of impressions about another person. Vocal cues may be especially relevant when our intuition is apparently at odds with the words of the patient. Astute clinicians may be able to "hear through" attempts to disguise affect. Although it has been widely claimed that nonverbal communication of emotion is primarily accomplished by the visual channel,16 there is some evidence that personality judgments based on the vocal channel correlate more highly with whole-person judgments.17 As our results indicate, even limited vocal cues allow the correct identification of some emotions.

Clinical sensitivity is likely to remain an important aspect of psychiatric practice, despite the increasing sophistication of special diagnostic procedures. A crucial part of our clinical skill is the ability to recognize both visually and vocally communicated affects and to interpret the words of the patient within these contexts. Our study has demonstrated the considerable communicative power of the voice and has suggested that clinicians may respond to that source of information more than had been appreciated. As the important features of nonverbal vocal communication are further delineated, sensitivity to this channel of information about emotion will be better understood, and methods to impart such sensitivity to clinicians can be developed. We await further research to integrate these findings about the voice with data about facial expression as we strive to better understand the phenomenon of clinical intuition.
References

1. Darwin C: The Expression of Emotions in Man and Animals. London, John Murray, 1872 (reprinted University of Chicago Press, Chicago, 1965).
2. Ekman P, Friesen WV, Ellsworth P: Emotion in the Human Face: Guidelines for Research and an Integration of Findings. New York, Pergamon Press, 1972.
3. Izard CE: The Face of Emotion. New York, Appleton-Century-Crofts, 1971.
4. Scherer KR, Ekman P (eds): Handbook of Methods in Nonverbal Behavior Research. New York, Cambridge University Press, 1982.
5. Izard CE (ed): Measuring Emotions in Infants and Children. New York, Cambridge University Press, 1982.
6. Scherer KR: Nonlinguistic indicators of emotion and psychopathology, in Izard CE (ed): Emotions in Personality and Psychopathology. New York, Plenum Publishing Corp, 1979, pp 495-529.
7. Scherer KR: Speech and emotional states, in Darby J (ed): Speech Evaluation in Psychiatry. New York, Grune & Stratton Inc, 1981, pp 115-135.
8. Scherer KR, Oshinsky J: Cue utilization in emotion attribution from auditory stimuli. Motivation and Emotion 1977;1:331-346.
9. Scherer KR: Acoustic concomitants of emotional dimensions: Judging affect from synthesized tone sequences, in Weitz S (ed): Nonverbal Communication. New York, Oxford University Press Inc, 1979, pp 105-111.
10. Izard C: Patterns of Emotion: A New Analysis of Anxiety and Depression. New York, Academic Press Inc, 1972.
11. Scherer KR: Methods of research on vocal communication: Paradigms and parameters, in Scherer KR, Ekman P (eds): Handbook of Methods in Nonverbal Behavior Research. New York, Cambridge University Press, 1982, pp 136-198.
12. Scherer KR, Feldstein S, Bond RN, Rosenthal R: Vocal cues to deception: A comparative channel approach, in press.
13. Lieberman P: Perturbations in vocal pitch. J Acoust Soc Am 1961;33:597-603.
14. Williams CE, Stevens KN: Emotions and speech: Some acoustical correlates. J Acoust Soc Am 1972;52:1238-1250.
15. Streeter LA, Krauss RM, Geller V, Olson LL, Apple W: Pitch changes during attempted deception. J Pers Soc Psychol 1977;35:345-350.
16. DePaulo BM, Rosenthal R, Eisenstat RA, Rogers PL, Finkelstein S: Decoding discrepant nonverbal cues. J Pers Soc Psychol 1978;36:313-323.
17. Ekman P, Friesen WV, O'Sullivan M, Scherer K: Relative importance of face, body and speech in judgments of personality and affect. J Pers Soc Psychol 1980;38:270-277.