OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible. Any correspondence concerning this service should be sent to the repository administrator: tech-oatao@listes-diff.inp-toulouse.fr

This is an author's version published in: http://oatao.univ-toulouse.fr/22026

To cite this version: Mollaret, Christophe and Mekonnen, Alhayat Ali and Lerasle, Frédéric and Ferrané, Isabelle and Pinquier, Julien and Boudet, Blandine and Rumeau, Pierre. A multi-modal perception based assistive robotic system for the elderly. (2016) Computer Vision and Image Understanding, 149. 78-97. ISSN 1077-3142.

Official URL: https://doi.org/10.1016/j.cviu.2016.03.003

A multi-modal perception based assistive robotic system for the elderly

C. Mollaret a,b,c, A.A. Mekonnen a,b,∗, F. Lerasle b,c, I. Ferrané a, J. Pinquier a, B. Boudet d, P. Rumeau d

a IRIT, Univ de Toulouse, 118 route de Narbonne, 31062 Toulouse Cedex 9, France
b CNRS, LAAS, 7, Avenue du Colonel Roche, F-31400 Toulouse, France
c Univ de Toulouse, UPS, LAAS, F-31400 Toulouse, France
d Laboratoire de gérontechnologie La Grave, CHU Toulouse/UMR 1027 Inserm, Gérontopôle, Université Toulouse-3, place Lange, TSA 60033, 31059 Toulouse Cedex 9, France

Keywords: Assistive technology; Elderly care; Intention detection; Multi-modal data fusion; Human–robot interaction; Robotic perception

Abstract

In this paper, we present a multi-modal perception based framework to realize a non-intrusive domestic assistive robotic system. It is non-intrusive in that it only starts interaction with a user when it detects the user's intention to do so. All the robot's actions are based on multi-modal perceptions which include user detection based on RGB-D data, user's intention-for-interaction detection with RGB-D and audio data, and communication via user distance mediated speech recognition. The utilization of multi-modal cues in different parts of the robotic activity paves the way to successful robotic runs (94% success rate). Each presented perceptual component is systematically evaluated using an appropriate dataset and evaluation metrics. Finally, the complete system is fully integrated on the PR2 robotic platform and validated through system sanity check runs and user studies with the help of 17 volunteer elderly participants.

1. Introduction

As living conditions and healthcare facilities improve, average life expectancy increases, leading to a growing elderly population. For example, in France in 2005 there were five young and adult people for one senior; by 2050, it is expected that there will be 10 young and adult people for every seven seniors [1]. Though most people age well, some become frail, at risk of disease and costly dependence. Therefore, the financial and organizational burden on society is likely to rise. How can we provide quality care to people requiring constant assistance in various aspects, including those suffering from deterioration in cognitive capabilities due to aging, head trauma, and Alzheimer's disease (AD)? This has led to a growing need for new technologies that can assist the elderly in their daily living. One such technology is the deployment of assistive robots for the elderly. In fact, such robotic systems, serving various tasks and purposes in the social care and medical/health sectors beyond the traditional scope of surgical and rehabilitation robots, are poised to become one of the most important technological innovations of the 21st century [2]. But when robots leave industrial mass production to help with daily living activities, i.e., household chores, the requirements for robot platforms change. While industrial production requires strength, precision, speed, and endurance, domestic service requires robust navigation in indoor environments, dexterous object manipulation, and an intuitive way of communicating (speech, gestures, and body language) with users. In this perspective, many issues are still to be solved, such as perception and system integration.

Making a robot a socially competent service provider in all daily life areas is very challenging. Hence, we focus on the conception of a robotic system that provides mild memory assistance to the elderly, a requirement highly coveted for the elderly who might exhibit mild cognitive impairment (MCI) or Alzheimer's disease (AD) [3,4]. A serious issue for the elderly with MCI is forgetting where they have put objects that they use every day, for example keys, remote control, glasses, etc. This leads them to experience stress and loss of confidence, and to become irritable, putting them at health risk, especially considering their frailty [4]. Consequently, we consider deploying a robotic system that helps the elderly with MCI in locating everyday objects which are hidden (out of the user's sight) or put in unusual places. The work presented in this paper is part of a French National Research Agency (ANR) funded research project called RIDDLE¹, an acronym for Robots for perceptual Interactions Dedicated to Daily Life Environment, which aims to make a step forward in these directions by combining the underlying multiple and uncertain perceptual analyses related to (1) objects and space regarding the robot's spatial intelligence, and (2) multi-modal communication regarding the robot's transactional intelligence.

Fig. 1. Framework adopted to realize the complete non-intrusive autonomous assistive robotic system. [Block diagram: User Detection (RGB-D data) → Move to Monitoring Position → User's Intention-for-Interaction Detection (head/shoulder pose, speech) → Move to Interaction Position → Close Human–Robot Interaction (speech).]

∗ Corresponding author at: CNRS, LAAS, 7, Avenue du Colonel Roche, F-31400 Toulouse, France.
E-mail addresses: cmollare@laas.fr (C. Mollaret), aamekonn@laas.fr, alhayat.ali@gmail.com (A.A. Mekonnen), lerasle@laas.fr (F. Lerasle), ferrane@irit.fr (I. Ferrané),

To paint a clear picture, let us consider the following exemplar scenario. A person suffering from MCI, whom we henceforth refer to as the user, is carrying out his/her normal everyday activity. Then, let us say the user wants to change the channel on the TV but realizes he/she cannot remember where the remote control is. The user will then pose the question to the robot, which is monitoring him/her while stowed away in a non-interfering position. The robot will then answer the user's questions/riddles about the object using appropriate actions (speech, displacement, pointing action). Based on this scenario, we identify three main key functional requirements for the robot: (1) detecting the user at all times; (2) detecting when the user wants to interact with the robot, i.e., when the user wants to pose a question to the robot or needs its attention, called the user's intention-for-interaction; and (3) interaction via speech based communication. In this work, we assume the type and position of the objects are known a priori and we focus only on the three highlighted requirements. The considered objects are specifically eyeglasses, keys, mobile phone, wallet, remote control, and medication (objects frequently lost by the elderly, as identified through a pilot study [4]).

The entire behavior of the proposed assistive system is based on a widely accepted reactive behavior which cycles through "monitor" and "interact" phases, e.g., [5–7]. In this behavior, the robot monitors the user until he/she demands it to do something or shows an interest in interacting with it. Then the robot continues to the interaction phase, where it interacts with the user to provide the requested service or assistance. In line with this, we propose a domestic assistive robot system that incorporates a novel user's intention detection mechanism to transition from the monitoring phase to the interaction phase. Therefore, we propose a scenario where the robot comes into a room, checks the presence of a user in this room, and then stows away at an observation place to monitor the user discreetly. When the user expresses his/her intent to interact with the robot, either by looking at it, calling it, or a combination of both, the robot approaches the user and starts the close interaction phase. We refer to this as a non-intrusive behavior (the robot does not move to stalk the user) with three distinct phases: user detection, user monitoring, and interaction (see Fig. 1 in Section 3). Initially, when the robot correctly detects the user, it goes to the monitoring stage, actively reading the user's intention. Then, when it detects the user's intention-for-interaction, it makes a transition into the interaction phase, which is carried out via speech modalities. During the speech based communication, depending on the utilized sensor, recognition tool, and human/robot (H/R) situation (distance variations), the communication quality can be affected [8,9]. Whenever possible, an adaptive mechanism should be put in place to maximize the chances of having an ideal communication given the available resources.

¹ http://projects.laas.fr/riddle/

In this work, the robot considers that there is only one user to communicate with. The language used is French; it could, nevertheless, be generalized to any language. The person can freely move in his/her environment. However, when an interaction starts, after the intention-for-interaction step, the person's position is fixed during the interaction process. The goal of the interaction phase is for the robot to find a lost object upon request. The robot has to indicate the direction of the object by pointing its head towards it and giving some verbal precisions about its location; at the moment no object displacement is managed by the robot. For example: "the remote control is under the table". The objects are stored in a user-defined semantic map since the focus of the paper is not on object detection. All the algorithms presented in this work are embedded on the PR2² robotic platform using the Robot Operating System (ROS) middleware framework [10]. ROS is a collection of software frameworks for robot software development with a very active community and numerous publicly available packages that provide operating system-like functionality on a heterogeneous computer cluster. Hence, we base all our implementations and the associated robotic integration on ROS (using the C++ and Python programming languages). Furthermore, all essential algorithms are integrated on the robot, while data visualization modules are seamlessly integrated on an external computer without overloading the PR2 system. All sensors are embedded on the robot. The only exception is an Android smartphone, located within 2 m of the user (in his/her hand or on a table near him/her), that is used to capture the audio signal.

This paper makes the following four core contributions:

1. A complete multi-modal perception driven non-intrusive domestic robotic system.
2. A novel multi-modal user's intention-for-interaction detection modality.
3. A fusion method to improve speech recognition given the user's position, available sensors, and recognition tools.
4. Details of the complete implemented system along with relevant evaluations that demonstrate the soundness of the framework via an exemplar application. The application is an assistive scenario whereby the robot helps the user find hidden or misplaced objects in his/her living place.

² PR2 (Personal Robot 2) is a robotic platform developed by Willow Garage: http://www.willowgarage.com

The proposed framework is further investigated by conducting relevant user studies involving 17 elderly participants.

This paper is organized as follows: Section 2 briefly discusses related work, and Section 3 presents an overview of the adopted system. Sections 4–6 describe the user detection, the user's intention detection, and the close HRI (with speech modality) parts, respectively. Then, Section 7 explains how the task-level coordination is realized along with relevant implementation details. Experiments and results on the PR2 are detailed in Section 8, and finally, the paper finishes with concluding remarks in Section 9.

2. Related work

As highlighted in the Introduction, we endow the robot with monitoring and interacting phases similar to most assistive robotic systems, e.g., [5–7]. During the monitoring phase, the robot remains static and observes the scene until a triggering action leads it to transition to the interacting phase, making it begin an interaction with the user. In the literature, the interaction and monitoring phases adopted by various assistive robotic systems are very similar; Broekens et al. [3] provide a review of relevant assistive social robots in elderly care. The main difference comes from what triggers the transition from monitoring to interaction. Different approaches have been investigated: a user interface (on screen) [7], an explicit vocal demand [5], a recognized gesture [11], or a scheduled triggering (e.g., take medicine at 8 am) [6] are some examples. Even though all these triggering mechanisms have led to successful assistive robotic scenarios in their specific contexts, we argue that a further improved system can be reached by basing the triggering mechanism on the user's intention, i.e., starting the interaction phase whenever the user expresses intent to interact with the robot. We call this notion the user's intention-for-interaction. Indeed, endowing robots with the ability to understand humans' intentions opens up the possibility to create robots that can successfully interact with people in a social setting as humans do, stepping towards proactivity [5]. We use the term non-intrusive to describe the behavior of this assistive robotic system. It is non-intrusive in that it only starts interaction with a user when it detects the user's intention to do so, and before that the robot monitors the person with an RGB-D camera and an audio sensor. This is in line with several works in the literature that identify with the term by using cameras for observation without interrupting and/or intruding on the target user [12–15].

The rest of this section presents related work that pertains to each considered perceptual component, i.e., user detection, intention perception, and speech recognition.

2.1. User detection

User detection in our context falls within the generic research area of automated people detection. Most of the promising approaches rely on visual sensors, primarily classical RGB cameras and RGB-D (D stands for depth) cameras/stereo-heads providing 3D data. To date, several visual (camera based) people detectors have been proposed (see comprehensive surveys in [16,17]). When considering a camera on a moving vehicle, as on a mobile robot, the detector has to rely on per-frame information and cannot rely on stationary or slowly changing background assumptions/models. For RGB based sensors, the most relevant works are Dalal and Triggs's histogram of oriented gradients (HOG) based detector [18], Felzenszwalb et al.'s deformable parts model (DPM) based detector [19], and the aggregate channel feature (ACF) based detector of Dollar et al. [20]. In this vein, further improvements have been achieved by utilizing various heterogeneous pools of features [21,22]. With the advent of efficient and easy to use RGB-D cameras, e.g., Microsoft's Kinect sensor, more improved and robust people detectors have been proposed [23,24]. These RGB-D based propositions have had a remarkable impact in the robotics community; in addition to being compelling alternatives to laser range finder based detectors [25], they have various useful qualities: accurate distance to the user, better occlusion reasoning, robustness to appearance clutter, etc. As a result, they have recently been a popular choice whenever an RGB-D sensor is involved [23]. Consequently, in our framework, we primarily use an RGB-D based detector and couple it with an RGB based detector for further improvement.

2.2. Intention perception

Detecting the user's intention has, in recent years, gained significant attention in human–robot interaction (HRI) research. Generally speaking, understanding a user's intentions plays a fundamental role in establishing coherent communication among people [26]. Inspired by this, different researchers have been working on detecting the user's intention for improved human–machine interaction in general, e.g., [27–30]. Knowing a user's true intention opens up several possibilities: (1) understanding his/her activity at the earliest moment (before the activity is even complete); (2) constraining the space of possible future actions and providing context [29]; and (3) correctly understanding his/her action, for example in the event of a motor neuron disorder, where actions might not reflect the user's true intention [31]. In particular, a user with Parkinson's disease could produce incoherent gestures for the robot, which could be difficult to interpret.

Intention has previously been synthesized in the literature as "robot-awareness" by Drury et al. [32]. We can also find a similar concept defined as "attention estimation" in Hommel and Handmann [33]. However, these terms refer to many notions in robotics, such as "context-awareness", "user-awareness", etc. Even "user-awareness" can have more than one meaning. For example, it could mean that the robot needs only to know where the user is, in an obstacle avoidance context, or it could mean that the robot needs to interpret the user's emotions in the context of human–robot interaction [34]. In this work, we focus on detecting the intention of a user to interact with the robot, which we call intention-for-interaction, often simply stated as user intention detection. The usage of the term intention detection is in line with the terminology used in relevant literature that addresses similar problems, e.g., [35–37].

Recently, various works revolving around user intention perception have been burgeoning in the HRI community [29,31,38,39]. The need to understand people's intentions mostly stems from early activity detection [30,38,40], context establishment [29,30], and true intention understanding in the case of confusing actions [31]. Intention can be described in several aspects, such as the nature of the data (mono-modal, multi-modal, discrete, continuous, etc.), the fusion strategy, and finally the application context. Focusing on the inputs of the intention perception detector, several data channels can be distinguished. First, the most obvious are head pose and eye gaze estimation, as demonstrated in Martinez et al. [31]. A second cue follows from context awareness, as in Clair et al. [29]. Bascetta et al. [40] used an online prediction of the user's trajectory, which can be associated with a user's habits. More cues are related to the orientation of the user's body parts: Huber [28] based his work on the user's feet position and orientation, and Kuan et al. [38] used elbow angles and force signals. In order to extract all these features, RGB-D cameras and classical cameras are dominantly chosen for tracking, head pose, and eye gaze estimation, but sometimes physiological sensors are used, such as muscular electromyography (EMG) and force sensors. Surprisingly, contrary to their pervasive presence, audio sensors/signals have rarely been utilized for intention detection, but rather for user engagement detection on a few occasions, e.g., [41].

Evidently, fusing different heterogeneous cues further robustifies the estimation step. When considering multi-modal/multi-cue based intention estimation, the chosen fusion/inference module plays an important role in robustness. In the literature, the most promising works utilize probabilistic frameworks for fusion and inference, for instance dynamic Bayesian networks [42], hidden Markov models (HMMs) [30,40], and generic recursive Bayesian filters [31]. Generally, all intention perception modules relate to safety considerations and improved communication. They are used in a large variety of applications: action prediction [29], electric wheelchair navigation [31], and guiding or resisting a user as part of a rehabilitation process [38]. These types of perception modules are even used in smart public displays in order to determine the intention to read an advertisement [28]. Based on these insights, the presented multi-modal intention-for-interaction detection scheme fuses user head orientation, user anterior body orientation, and audio activity (heterogeneous cues that have not been considered altogether before for detecting intention) in a probabilistic framework.

2.3. Speech recognition

For many years, researchers have been proposing methods to improve speech recognition accuracy by combining automatic speech recognition (ASR) systems. Different fusion techniques are described by Lecouteux et al. [43], some relying on the posterior merging of the recognition hypotheses, others on acoustic cross-adaptation. These authors have also proposed a driven decoding algorithm integrating the outputs of secondary systems into the search algorithm of a primary one, improving their results by 14.5% relative word error rate (WER). In these different cases, each system processes the same audio signal. However, in our context, we consider two different inputs: embedded and external microphones. We also consider that the user can speak from different locations, being close to one microphone and far from another. Depending on the user position, speech recognition could be more efficient if the input were sometimes taken from one microphone and sometimes from another. In many cases, audio and computer vision are used separately. However, approaches based on the user/microphone distance have already been investigated in different contexts, for example with meeting data, as in Maganti et al. [44], where the user is tracked in order to extract his position, which is then sent to a beamforming algorithm. The beam is created in the direction of the user in order to remove any external audio interference and retrieve the user's speech only. In this vein, the work by Onuma et al. [8] presents a speech denoising algorithm working with a bank of filters and using the position of the user around two microphones. In both cases, the work is done at the signal level, while our fusion approach is carried out at the hypothesis level. In Stiefelhagen et al. [9], a study on remote microphones and variable distances was carried out. The data were recorded at different user-to-microphone distances and the speech recognition system was adapted, with an adaptive algorithm, to use the correct model knowing the distance. In that work a 10%–15% WER reduction was reached. In our work, we want to benefit from the specificity and diversity of existing sensors and ASR systems in order to exploit the link between multiple microphone/speech API combinations; we also want to use the current distance between the user and the sensor capturing the audio input, the user/microphone distance. These distances are inferred by computer vision with the skeleton fitting algorithm [45]. Since there is no single best global combination for all distances, we propose a framework to fuse information from simultaneously running combinations and dynamically select the most adapted or appropriate configuration as a function of the user/microphone distance. The aim is to obtain a reduced WER during an interaction session. This fusion framework, inspired by the work of Herranz et al. [46], enhances our interaction module and adapts it to more spatial variations in interaction.

3. Complete system overview

This work is based on the overall framework illustrated in Fig. 1. This framework is generic and can be applied to a whole range of scenarios involving HRI. It can be summarized as follows: (1) find the person in the environment; (2) go to a predefined monitoring (garage) position and start monitoring the person; (3) detect the person's intention-for-interaction (desire to interact with the robot), e.g., name calling, directing gaze, etc.; (4) approach this person, who becomes the user once intention has been detected, by moving to a convenient position for interaction; and (5) begin close human–robot interaction. To elaborate further, first, similar to any system involving HRI, our framework begins by finding the location of the user-to-be. This is accomplished with the user detection modality presented in Section 4. Once the robot has detected the person in the room, it does not start interaction directly, as we are interested in a non-intrusive robotic behavior (see Section 1). It rather positions itself in a waiting area, monitoring the activity of this person. In our framework, the monitoring step aims to detect the user's intention-for-interaction, which is presented in Section 5. When it is detected, the robot approaches the person and starts the envisaged interaction routine described in Section 6. In Fig. 1, the steps involving robot motion (physical transition) are shown in dashed rectangles; the other blocks, which are also pictorially illustrated, involve some form of sensor driven multi-modal perceptual activity. It goes without saying that the close HRI block is self-looping unless explicitly stopped by the user, in which case the robot goes back to the monitoring state. As an instance of this framework, we illustrate a demonstrative application in Section 8 where the framework is used to help a user find various objects in the vicinity of the robot through primarily speech based interaction and knowledge about the user environment.

Throughout this work, we use the PR2 robot from Willow Garage Inc., shown in Fig. 2, as the target robotic platform. PR2 is a popular robot that has been used as a test bench by many robotics researchers all over the world. It is composed of an omnidirectional mobile base enabling its movements; two articulated arms; a telescopic spine; two laser sensors, one tilting on the torso and another fixed on the base; a pan/tilt capable head; and many more components (which are not listed here as they are not used in this work). The head features an RGB-D sensor (Kinect), two wide-angle color cameras, and two narrow-angle monochrome cameras. Its computing power relies primarily on two Quad-Core i7 Xeon processors (8 cores) with 24 GB RAM. The robot is equipped with a 1.3 kWh lithium-ion embedded battery pack that provides approximately 2 h of runtime; this made cord-less robotic runs possible during experimental sessions. We primarily use the Kinect RGB-D sensor mounted on it for user related perception. Audio data, in addition to the Kinect microphones, is captured using a Samsung Galaxy Note 2 smartphone (running Android 4.2). The smartphone communicates with the PR2 via a common Wi-Fi network. Software-wise, our PR2 uses the Groovy instance of the ROS software architecture [10]. Fig. 3 illustrates an ROS node based architecture of our perception driven assistive robotic system implementation.

Fig. 3 is presented here to provide a general idea of the overall system structure. Each rounded rectangle represents an individual ROS node, with the arrows indicating the data flow between the different nodes. The shaded nodes correspond to the nodes we implemented, either from scratch or as a wrapper on top of an existing classical implementation, and the rest denote publicly available implementations. The main perceptual modalities presented in this work rely on the Kinect sensor: RGB-D data for vision related perceptions and audio data for speech related perceptions. The intention-for-interaction detector node, labeled "fused_intention", takes the shoulder pose estimation, the head pose estimation, and the result of voice activity detection (VAD) to measure the intention of the user. On the other hand, the dialogue module, the "audio_interpret" node, takes speech recognition results and gives the sought objects' location information (with the help of the speech synthesizer module). The task coordinator, dubbed the "demo_smach_supervisor" node, is the highest decision maker. It coherently controls the execution of the different steps in the scenario along with their transitions. It acts as the main interface between the robot's native software that controls its actuators and the developed modalities, leading to consistent coordination and execution of the envisaged scenario. It is worth mentioning here that the robot has a native software architecture that carries out the basic functionalities of an autonomous mobile robot. Some of these functionalities include accurate robot localization within a given map, navigation, obstacle avoidance, system diagnosis, etc. Further details are provided in Section 7.

Fig. 2. The PR2 robot with part of its sensors and hardware information.
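To give a concrete feel for how such a perception node is wired in ROS, the following rospy sketch mimics the role of the "fused_intention" node. It is illustrative only: the topic names ("head_pose", "shoulder_pose", "vad", "intention") and message types are assumptions, and the fusion body is a placeholder for the HMM of Section 5, not the actual implementation.

```python
#!/usr/bin/env python
# Illustrative sketch of an intention-fusion ROS node (hypothetical topics).
import rospy
from std_msgs.msg import Bool, Float32, Float32MultiArray

class FusedIntentionNode(object):
    def __init__(self):
        self.head_pose = None      # latest head yaw/pitch estimate
        self.shoulder_yaw = None   # latest shoulder yaw estimate
        self.voice_active = False  # latest VAD flag
        rospy.Subscriber('head_pose', Float32MultiArray, self.on_head)
        rospy.Subscriber('shoulder_pose', Float32, self.on_shoulder)
        rospy.Subscriber('vad', Bool, self.on_vad)
        self.pub = rospy.Publisher('intention', Float32, queue_size=1)

    def on_head(self, msg):
        self.head_pose = msg.data          # e.g. [yaw, pitch]

    def on_shoulder(self, msg):
        self.shoulder_yaw = msg.data

    def on_vad(self, msg):
        self.voice_active = msg.data

    def update(self, event):
        # Placeholder fusion: the real node runs the HMM of Section 5 here.
        if self.head_pose is None or self.shoulder_yaw is None:
            return
        score = 1.0 if self.voice_active else 0.5
        self.pub.publish(Float32(score))

if __name__ == '__main__':
    rospy.init_node('fused_intention')
    node = FusedIntentionNode()
    rospy.Timer(rospy.Duration(0.1), node.update)  # 10 Hz update loop
    rospy.spin()
```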

4. User detection

As highlighted in Section 3, one requirement of the presented framework is the correct detection and localization of the user in the room. The user detection module should be able to detect the user whether he/she is sitting on a chair or standing in an upright position. It should also be as robust as possible to partial body occlusions. Even though during close interaction phases the distance between the robot and the user is less than 5 m (our depth sensor's range), the adopted use case entails detecting a person at a farther distance than that, especially when the robot is initially looking for the user in a room.

Primarily relying on the RGB-D sensor mounted on the robot, we have chosen to use a state-of-the-art upper body detector (head and shoulder) recently proposed by Jafari et al. [23], with two main improvements. The original detector proposed in [23] has two components: an RGB-D data based upper body detector, and an RGB only full body detector called groundHOG. The upper body detector detects close-by persons up to 5 m (the sensor's depth operating range), but fails to detect people beyond this range. On the other hand, the complementary groundHOG, which is based on the Dalal and Triggs HOG detector [18], effectively detects people that are farther away, as a person's full body will be in the camera field of view (FOV), but fails to detect close-by people due to possibly cropped out (out of camera FOV) body parts (e.g., legs). Both detections are then directly combined and non-maximal suppression is applied to discard overlapping bounding boxes on the image plane. In this work, we use this combined detector with two important modifications: (1) we replace groundHOG with our optimized Binary Integer Programming (BIP) based full body detector (BIP-HOG) [47], and (2) we employ a greedy data association algorithm [48] to combine the upper body and full body detections with real world spatial consistency. These modifications improve the overall detection performance as demonstrated in Section 8.3.1.

Fig. 3. Illustration of the complete implemented system based on the ROS framework. Each rounded rectangle represents a standalone ROS node and the arrows indicate the message (data) passing pipeline. The shaded nodes correspond to the nodes we implemented or adapted and the rest denote publicly available implementations we utilized.

Fig. 4. Block diagram illustrating the utilized user detector. [Depth image → upper-body detector [23]; RGB image → optimized BIP-HOG detector [47]; both feed a fusion step with spatial consistency that outputs the combined detections.]

Fig. 4 shows a block diagram of the combined detector. The upper body detector and the optimized BIP-HOG based detector are used to detect people in the environment independently. Their detections are then combined to try to maximize the chance of detecting all people in the scene. The upper body detector [23] is based on a learned upper body template which is applied on incoming depth images in a sliding window mode. It computes a distance matrix consisting of the Euclidean distance between the template and each overlaid normalized depth image segment, labeling each window whose normalized exponential distance score from the template is above a threshold as a positive instance. Rather than applying the template at all positions, the authors use different techniques to extract a region of interest (ROI) based on ground plane estimation and color image segmentation. This step filters out the majority of the image background, reducing possible false alarms and speeding up detection rates. The optimized BIP-HOG detector is a graphical processing unit (GPU) implementation of our detector presented in Mekonnen et al. [47]. This detector uses discrete optimization to select a subset of HOG features in a cascade framework, which leads to an improved frame rate while maintaining detection comparable to the classical Dalal and Triggs HOG detector.

To clearly present the adopted fusion strategy in detail, let us denote all the detections obtained from each detector as $\mathcal{D}_{bip} = \{D^{(k)}_{bip}\}_{k=1}^{N_{bip}}$ and $\mathcal{D}_{ub} = \{D^{(k)}_{ub}\}_{k=1}^{N_{ub}}$, where the suffix $bip$ refers to BIP-HOG and $ub$ to upper body detections. $N_{bip}$ and $N_{ub}$ stand for the number of detections from BIP-HOG and upper body respectively. Each detection $D$ is represented as $\{r, p, \vartheta\}$, where $r$ specifies the rectangular bounding box, $p$ denotes the 3D position, and $\vartheta$ denotes the detection confidence score of the target as determined by the detector. For example, the $k$th detection $D^{(k)}_{bip}$ obtained using the BIP-HOG detector is expressed as $\{r^{(k)}_{bip}, p^{(k)}_{bip}, \vartheta^{(k)}_{bip}\}$. The $r$, $p$, and $\vartheta$ for the upper body detector are provided by the detector ($p$ corresponds to the 3D coordinate of the approximate head position determined directly from the depth data). For BIP-HOG, $r$ and $\vartheta$ are also provided by the detector directly; to determine the corresponding value of $p$, we rely on the standard camera calibration parameters ($\alpha_u$, $\alpha_v$, $u_o$, $v_o$ [49]) and an average human height assumption of $h_p = 1.75$ m. First, given a detection bounding box $r$, we determine an approximate head position in pixel coordinates $(u_d, v_d)$ by taking a fixed offset from the top bounding box margin, as shown in Fig. 5. Then, the approximate 3D position of the $k$th detection, $p^{(k)}_{bip}$ (corresponding to the pixel $(u^{(k)}_d, v^{(k)}_d)$), is determined with the help of the assumed $h_p$ using Eq. (1) (please cross reference the relevant variables with Fig. 5). Finally, the fused set of detections $\mathcal{D}_{fus} = \{D^{(k)}_{fus}\}_{k=1}^{N_{fus}}$ is determined by combining $\mathcal{D}_{bip}$ and $\mathcal{D}_{ub}$ using the greedy data association algorithm outlined in Algorithm 1. The algorithm merges all detections arising from the same spatial position and separately adds the rest if they seem reliable.

Fig. 5. Illustration of pixel coordinates used to determine an approximate 3D position of a BIP-HOG detection.

$$
p^{(k)}_{bip} =
\begin{bmatrix} x \\ y \\ z \end{bmatrix}^{(k)}_{bip}
= \frac{h_p \cdot \alpha_v}{v^{(k)}_1 - v^{(k)}_2} \cdot
\begin{bmatrix}
\dfrac{u^{(k)}_d - u_o}{\alpha_u} \\
\dfrac{v^{(k)}_d - v_o}{\alpha_v} \\
1
\end{bmatrix}
\tag{1}
$$
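As a concrete illustration of Eq. (1), the following sketch computes the approximate 3D position of a full-body detection from its head pixel coordinate. The calibration values in the example call are made up, not the PR2's actual calibration, and the exact row convention for the bounding-box height follows Fig. 5, which we only paraphrase here.

```python
import numpy as np

def approx_3d_position(u_d, v_d, v1, v2, alpha_u, alpha_v, u_o, v_o, h_p=1.75):
    """Approximate 3D head position of a full-body detection, following Eq. (1).

    (u_d, v_d): approximate head pixel, taken at a fixed offset below the top
                of the detection bounding box.
    v1, v2:     bottom and top image rows of the bounding box in this sketch,
                so that (v1 - v2) is its pixel height.
    alpha_u, alpha_v, u_o, v_o: pinhole camera calibration parameters.
    h_p:        assumed average person height in metres (1.75 m in the paper).
    """
    scale = (h_p * alpha_v) / (v1 - v2)           # metric scale from pixel height
    direction = np.array([(u_d - u_o) / alpha_u,  # normalized image ray
                          (v_d - v_o) / alpha_v,
                          1.0])
    return scale * direction                      # [x, y, z] in the camera frame

# Example with illustrative calibration values:
print(approx_3d_position(u_d=320, v_d=120, v1=420, v2=100,
                         alpha_u=525.0, alpha_v=525.0, u_o=319.5, v_o=239.5))
```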

Referring to Algorithm 1, the fusion algorithm begins by computing a distance score matrix $S(m, n)$ for each detection–detection pair $(D^{(m)}_{bip}, D^{(n)}_{ub})$ as $S(m, n) = \left\| p^{(m)}_{bip} - p^{(n)}_{ub} \right\|_2$ (line 2). Then the two detections $m^*$ and $n^*$ with the least score are associated if their distance score is less than the threshold $\rho_d$ (lines 4–7). This associated detection is added to the set of fused detections, with $r$ and $p$ taken from the upper body detector and a detection score set to the sum of the coupled detection scores. The corresponding detections are removed from the $bip$ and $ub$ detection sets. If associated detections have a score higher than the threshold (lines 8–10), the upper body detection is directly added to the fused detection set, whereas the BIP-HOG detection is added only if it is farther than 5 m (upper body is more reliable at close range, while BIP-HOG is not). Finally, all unassociated upper body detections are directly added to the fused detections, whereas unassociated BIP-HOG detections are added only if they are farther than 5 m (lines 14–20). A comparative experimental evaluation of this fused detector and its constituents is detailed in Section 8.3.1.

Data: $\mathcal{D}_{bip}$, $\mathcal{D}_{ub}$
Result: $\mathcal{D}_{fus}$
1   $\mathcal{D}_{fus} = \emptyset$,  $m \in \{1, \ldots, |\mathcal{D}_{bip}|\}$,  $n \in \{1, \ldots, |\mathcal{D}_{ub}|\}$
2   $S(m, n)$: scores for each detection–detection pair $(D^{(m)}_{bip}, D^{(n)}_{ub})$
3   while $\mathcal{D}_{bip} \neq \emptyset$ and $\mathcal{D}_{ub} \neq \emptyset$ do
4       $(m^*, n^*) = \arg\min_{m,n} S(m, n)$
5       if $S(m^*, n^*) < \rho_d$ then
6           $d \leftarrow \{r^{(n^*)}_{ub},\ p^{(n^*)}_{ub},\ \vartheta^{(m^*)}_{bip} + \vartheta^{(n^*)}_{ub}\}$
7           $\mathcal{D}_{fus} \leftarrow \{\mathcal{D}_{fus} \cup d\}$
8       else
9           $\mathcal{D}_{fus} \leftarrow \{\mathcal{D}_{fus} \cup D^{(n^*)}_{ub}\}$
10          if $p^{(m^*)}_{bip,z} > 5$ m then $\mathcal{D}_{fus} \leftarrow \{\mathcal{D}_{fus} \cup D^{(m^*)}_{bip}\}$
11      end
12      $\mathcal{D}_{bip} \leftarrow \{\mathcal{D}_{bip} \setminus D^{(m^*)}_{bip}\}$,  $\mathcal{D}_{ub} \leftarrow \{\mathcal{D}_{ub} \setminus D^{(n^*)}_{ub}\}$
13  end
14  while $\mathcal{D}_{bip} \neq \emptyset$ do
15      if $p^{(m)}_{bip,z} > 5$ m then $\mathcal{D}_{fus} \leftarrow \{\mathcal{D}_{fus} \cup D^{(m)}_{bip}\}$
16      $\mathcal{D}_{bip} \leftarrow \{\mathcal{D}_{bip} \setminus D^{(m)}_{bip}\}$
17  end
18  while $\mathcal{D}_{ub} \neq \emptyset$ do
19      $\mathcal{D}_{fus} \leftarrow \{\mathcal{D}_{fus} \cup D^{(n)}_{ub}\}$,  $\mathcal{D}_{ub} \leftarrow \{\mathcal{D}_{ub} \setminus D^{(n)}_{ub}\}$
20  end
21  return $\mathcal{D}_{fus}$

Algorithm 1: Algorithm for people detection fusion with spatial consistency.
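A compact Python rendering of this greedy association is given below. Detections are represented as simple dicts; the 5 m cut-off follows Algorithm 1, while the association threshold rho_d value and the dict representation are illustrative choices.

```python
import numpy as np

def fuse_detections(d_bip, d_ub, rho_d=0.5, far_range=5.0):
    """Greedy fusion of BIP-HOG and upper-body detections (Algorithm 1 sketch).

    Each detection is a dict with keys 'r' (bounding box), 'p' (3D position as
    a length-3 numpy array) and 'score'. rho_d is the association distance
    threshold; far_range is the 5 m depth limit beyond which only BIP-HOG is
    trusted (the upper-body detector cannot see that far).
    """
    d_bip, d_ub, fused = list(d_bip), list(d_ub), []
    while d_bip and d_ub:
        # Pair with the smallest Euclidean distance between 3D positions.
        pairs = [(np.linalg.norm(b['p'] - u['p']), i, j)
                 for i, b in enumerate(d_bip) for j, u in enumerate(d_ub)]
        dist, i, j = min(pairs)
        bip, ub = d_bip.pop(i), d_ub.pop(j)
        if dist < rho_d:
            # Same person seen by both detectors: keep the upper-body box and
            # position, and sum the two confidence scores.
            fused.append({'r': ub['r'], 'p': ub['p'],
                          'score': bip['score'] + ub['score']})
        else:
            fused.append(ub)                     # upper body is reliable up close
            if bip['p'][2] > far_range:          # BIP-HOG kept only if far away
                fused.append(bip)
    # Leftovers: upper-body detections are always kept, BIP-HOG only if far.
    fused.extend(d for d in d_bip if d['p'][2] > far_range)
    fused.extend(d_ub)
    return fused
```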

5. Intention-for-interaction detection

Fig. 6 shows a block diagram of the framework used to estimate the user's intention-for-interaction using an RGB-D camera (e.g., Kinect, stereo rig) and an audio sensor (e.g., smartphone, tablet, microphone, etc.). It estimates the user's intention based on three important cues: the user's line of sight, inferred from the head pose; the user's anterior body direction, determined from the shoulder orientation; and speech used to draw attention, captured via VAD. The head pose detection and shoulder orientation detection modules rely on depth images. The detection outputs are further filtered using a particle swarm optimization inspired tracker (PSOT) presented in Mollaret et al. [50]. Both the VAD and tracker outputs are considered as observation inputs and are fused to provide a probabilistic intention estimate using an HMM.

[Fig. 6: Block diagram of the intention-for-interaction detector. The RGB-D sensor feeds head pose detection and shoulder orientation detection, filtered by the particle swarm based tracker [50]; the audio sensor feeds voice activity detection; both streams enter the intention estimation module, which outputs a probabilistic intention measure.]

Fig. 7. Probabilistic graphical model used for intention estimation.

The probabilistic graphical model depicted in Fig. 7 illustrates the relationship between the hidden variables $x_t$ and $x_{t-1}$, which are the intention indicators at times $t$ and $t-1$ respectively, and the observation variables $z^1_t$, $z^2_t$, $z^1_{t-1}$, and $z^2_{t-1}$. The intention indicator $x_t$ is a discrete random variable that takes on values $\{intent, \neg intent\}$ at time $t$. The observation variables $z^1_t$ and $z^2_t$ are defined

according to Eq. (2):

$$z^1_t = \left[\theta_h,\ \phi_h,\ \theta_{sh}\right]^T; \qquad z^2_t \in \{vad,\ \neg vad\} \tag{2}$$

$z^1_t$ is a continuous vector-valued random variable that represents the head orientation (head yaw $\theta_h$ and pitch $\phi_h$) and shoulder orientation (shoulder yaw $\theta_{sh}$) observations from the particle swarm based tracker, which provides the estimated head pose (position and orientation) in space and the shoulder orientation with respect to the vertical plane of the camera optical frame (yaw). $z^2_t$, on the other hand, is a discrete random variable that takes on either $vad$ or $\neg vad$ to represent the observation from the voice activity detection module. Further description of these two observation variables, along with their associated probability distributions, is provided in Sections 5.1 and 5.2, respectively.

With the assumption that the observations are conditionally independent given the state (encoded in the graphical model in Fig. 7), and making use of Bayes' rule, the posterior probability distribution over the state, $P(x_t \mid Z_{1:t})$, given all measurements up to time $t$, $Z_{1:t}$, can be expressed with Eq. (3):

$$P(x_t \mid Z_{1:t}) = \eta\, P\!\left(z^1_t \mid x_t\right) P\!\left(z^2_t \mid x_t\right) \sum_{x_{t-1}} P(x_t \mid x_{t-1})\, P(x_{t-1} \mid Z_{1:t-1}), \tag{3}$$

where $P(x_t \mid x_{t-1})$ is the state transition (dynamics) distribution and $\eta$ is a normalization factor. Again, both $x$ and $z^2$ are discrete random variables that take on values $\{intent, \neg intent\}$ and $\{vad, \neg vad\}$ respectively, whereas $z^1$ takes on continuous values given the PSOT tracker output.
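The recursion in Eq. (3) amounts to a short prediction/update loop. The sketch below is illustrative only: the transition matrix, the Gaussian covariance, the VAD likelihood table, and in particular the flat angular likelihood assumed for the "no intent" state are placeholder values, not the ones tuned on the robot.

```python
import numpy as np
from scipy.stats import multivariate_normal

STATES = ('intent', 'not_intent')
A = np.array([[0.9, 0.1],          # P(x_t | x_{t-1}); rows index the previous state
              [0.2, 0.8]])
SIGMA = np.diag([0.2, 0.2, 0.4])    # covariance of the zero-mean Gaussian on
                                    # [head yaw, head pitch, shoulder yaw] (rad^2)
P_VAD = {'intent': {True: 0.7, False: 0.3},       # P(z2 | x), illustrative values
         'not_intent': {True: 0.2, False: 0.8}}

def intention_update(belief, z1, z2):
    """One step of the recursive Bayes update of Eq. (3).

    belief: length-2 array, P(x_{t-1} | Z_{1:t-1}) over STATES.
    z1:     [theta_h, phi_h, theta_sh] angles from the PSOT tracker (radians).
    z2:     boolean voice-activity flag from the VAD module.
    """
    predicted = A.T @ belief                      # sum_x P(x_t | x_{t-1}) P(x_{t-1})
    gauss = multivariate_normal.pdf(z1, mean=np.zeros(3), cov=SIGMA)
    lik = np.array([gauss * P_VAD['intent'][z2],
                    0.05 * P_VAD['not_intent'][z2]])  # assumed flat likelihood
    posterior = lik * predicted
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])
belief = intention_update(belief, z1=[0.05, -0.02, 0.1], z2=True)
print(dict(zip(STATES, belief)))
```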

5.1. Head and shoulder pose estimation

This observation is based on head pose estimation, shoulder orientation estimation, and particle swarm optimization based tracking for filtered estimates, as explained in the previous section.

Head pose estimation. This module is based on the work of Fanelli et al. [51], which formulates pose estimation as a regression problem and uses random regression forests on depth images from an RGB-D sensor. This choice is motivated by the capability of regression forests to handle large training datasets. The regression is based on differences of rectangular patches resembling the generalized Haar-like features in [52]. Training is done using the Biwi Kinect Head Pose Dataset [51]. The 6D state vector $[x_h, y_h, z_h, \theta_h, \phi_h, \psi_h]^T$ contains the 3D head position and the three orientation angles relative to the sensor. The precision claimed in the paper is a 5.7° mean error in yaw estimation with a 15.2° standard deviation, and a 5.1° mean error in pitch estimation with a 4.9° standard deviation. Additionally, the head position is detected with a mean error of 13.4 mm (±21.1 mm). This module works best with close-by subjects, i.e., subjects placed at a distance of 1.5–2.0 m.

Fig. 8. The head pose is displayed with the green cylinder (head) on the point cloud, while shoulders are displayed in red and their orientation in dark blue (below the neck). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Shoulder orientation estimation. For this, we primarily rely on the OpenNI library [45], which provides a fitted skeleton model of the user based on the depth data. Then, using simple geometry, the user's shoulder orientation is obtained by computing the vector between the left and right shoulder joint poses determined from the fitted skeleton. The shoulder orientation is expressed with respect to the RGB-D sensor parallel optical plane, providing a yaw angle $\theta_{sh}$ for the tracking step. Illustrative estimated shoulder and head orientations are displayed in Fig. 8. Given the characteristics of the Kinect (the specific RGB-D camera used in this work) and the OpenNI library, the skeleton tracking algorithm works up to a range of 4.0 m.
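For illustration, the shoulder yaw can be derived from the two skeleton shoulder joints roughly as follows. This is a sketch under the assumption that the joint positions are already expressed in the camera frame; the sign and plane conventions of the actual module may differ.

```python
import numpy as np

def shoulder_yaw(left_shoulder, right_shoulder):
    """Yaw of the shoulder line w.r.t. the camera optical plane (radians).

    left_shoulder, right_shoulder: 3D joint positions [x, y, z] in the camera
    frame, e.g. as returned by an OpenNI skeleton fit. A yaw of 0 means the
    shoulder line is parallel to the image plane (user facing the camera).
    """
    v = np.asarray(right_shoulder) - np.asarray(left_shoulder)
    return np.arctan2(v[2], v[0])   # depth difference vs. lateral difference

print(shoulder_yaw([-0.2, 0.0, 2.0], [0.2, 0.0, 2.1]))  # slightly rotated user
```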

The head and shoulder poses provided by the above two modules are computed frame by frame without any temporal link. It is also possible to have missing estimates from either module at any time. To alleviate this and provide a smoothed continuous estimate, hence filtering, we make use of our PSOT tracker. Given the head pose and shoulder orientation in the form of $s_d = [x_h, y_h, z_h, \theta_h, \phi_h, \psi_h, \theta_{sh}]^T$ from the head pose and shoulder orientation estimation modules at time frame $t$, PSOT is used to determine spatio-temporal posterior point estimates (filtered estimates) of the head pose and shoulder orientation at time frame $t$, with the state vector of the PSOT tracker represented as $s = [x_h, y_h, z_h, \theta_h, \phi_h, \psi_h, \theta_{sh}]^T$. The PSOT is a filter adapted from particle swarm optimization (PSO) [53] that combines interesting amenities of PSO with those of the particle filter [54], namely a state dynamic model for improved tracking performance. Contrary to PSO, there is only one iteration of the PSOT at each time frame (there is no adaptive learning). It has a linear complexity in the number of particles used (150 in this work), and it does not need an expensive computing resource to work in real time. Also, contrary to other particle based algorithms, particles interact with each other through a social and a cognitive component in the update step. This behavior leads to a more efficient estimation of the state, as there is no particle degeneration phenomenon. For the head pose and shoulder orientation estimation, the adopted tracker uses a random walk dynamic model and a multivariate Gaussian model with a diagonal covariance matrix as the observation model in the fitness evaluation (please refer to [50] for details).

The distribution $P(z^1_t \mid x_t)$ is derived based on the tracker output, i.e., based on the $\theta_h$, $\phi_h$, $\theta_{sh}$ angles. These angles are represented in such a way that when the user is looking right into the optical frame with their anterior body oriented parallel to the image plane, all angles are 0. With this in mind, $P(z^1_t \mid x_t)$ is represented as a multivariate normal distribution, i.e., $P(z^1_t \mid x_t) = \mathcal{N}(z^1_t;\, 0, \Sigma)$ with $z^1_t = [\theta_h, \phi_h, \theta_{sh}]^T$ [...]; though not applied here, its values can be varied using the tracked head position.

5.2. Voice activity detection

The audio signal is used for intention detection based on user voice activity detection. Users have a tendency to talk to a robot when they want its attention. Taking advantage of this, we consider the onset of a voice activity as one indicator of the user's intention-for-interaction. In this work we rely on the voice activity detection module from the PocketSphinx C library.³ This algorithm is based on signal energy: it flags a given audio frame as containing speech elements if the signal energy is above a predefined threshold. Since the signal energy is affected by the noise in the environment, the PocketSphinx implementation performs an initial calibration stage so as to best separate the signal from stationary noise using a statistical noise removal method. Depending on the ambient noise of the environment (robot noise and room noise), the VAD is estimated properly up to 2 m from the audio sensor.
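The energy-based decision can be sketched as follows. The frame length, the margin, and the way the noise floor is calibrated are illustrative assumptions; PocketSphinx's actual implementation differs in detail.

```python
import numpy as np

def calibrate_noise_floor(noise_frames):
    """Estimate a noise energy threshold from a short stretch of 'silence'."""
    energies = [np.mean(f.astype(np.float64) ** 2) for f in noise_frames]
    return np.mean(energies) + 3.0 * np.std(energies)

def is_speech(frame, threshold, margin=2.0):
    """Flag a frame (e.g. 512 samples of 16-bit audio) as containing speech
    when its mean energy exceeds the calibrated noise floor by some margin."""
    energy = np.mean(frame.astype(np.float64) ** 2)
    return energy > margin * threshold

# Usage sketch with random data standing in for audio buffers:
noise = [np.random.randint(-200, 200, 512, dtype=np.int16) for _ in range(50)]
thr = calibrate_noise_floor(noise)
speech_frame = np.random.randint(-8000, 8000, 512, dtype=np.int16)
print(is_speech(speech_frame, thr))
```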

The observation from the VAD module is represented by the random variable $z^2_t$ at time $t$, taking discrete values $\{vad, \neg vad\}$. Since both $z^2_t$ and $x_t$ take binary discrete values, the associated likelihood distribution $P(z^2_t \mid x_t)$ is simply represented by four probability values.

6. Close human–robot interaction

During this phase, the human and the robot are engaged in basic interaction. The person asks for assistance, and the robot answers by providing a useful response or assistance. This phase is implemented by a static state machine dialogue module that manages the interaction. Each specific question/request coming from the user can trigger transitions to different states leading to robotic service provision.

The close interaction is started when the user's intention-for-interaction has been detected. The interaction stops when the user explicitly tells the robot that he/she does not need it anymore, for example by saying the French equivalents of "thank you, goodbye!", "thank you, all is ok now", "goodbye", and the like. The robot then goes back to its garage position and resumes the user monitoring state until the next user's intention-for-interaction is detected.

As an exemplar instance, we implement a scenario where the robot, upon specific inquiry, helps to find objects that the user has forgotten. Hence, in this specific case the goal of the dialogue is to retrieve the targeted object. The basic semantic analysis is done by spotting keywords in the speech recognition hypotheses and using a vote to find the preponderant meaning among the interpretations of the N-best hypotheses. For example, if the interpretation set is Is = [FIND, FIND, GREET, UNKNOWN, UNKNOWN, FIND], the interpretation result will be I = FIND, moving the interaction to the next state accordingly. The verbal answer is given to the user via the Google Text To Speech API, drawn from a set of predefined sentences.

Our contributions in this part do not lie in the interaction per se, but in the improvement of the user's perception by using multiple input audio streams and multiple ASR systems. In this work, we use the CMU PocketSphinx API and the Google Speech API as ASR systems. We also use two microphones: the embedded microphone of the Kinect sensor, and the microphone of a Samsung Galaxy Note 2 smartphone. We create four combinations of microphone/ASR system, but the global framework could easily be generalized to more combinations with the addition of more microphones or ASR systems. Based on the work of [46], a fusion framework is created as the Bayesian network model illustrated in Fig. 9. The idea is that, for each set of distances $d$ between the user and the microphones, and for each set of hypotheses $U$ coming from each combination, we want to find the best combination $S$. This is illustrated by Eq. (4), where $P(S \mid d, U)$ represents the probability of the combination $S$ being the most reliable knowing the distances and the hypotheses. $P(S \mid d)$ stands for the probability of the combination $S$ being the most reliable knowing the distances alone, and $[U_S \neq \emptyset]$ represents the presence of a hypothesis for the combination $S$: $[U_S \neq \emptyset] = 1$ if there is a hypothesis and $[U_S \neq \emptyset] = 0$ otherwise.

$$\operatorname*{arg\,max}_{S}\, P(S \mid d, U) \;\propto\; \operatorname*{arg\,max}_{S}\, P(S \mid d)\,[U_S \neq \emptyset] \tag{4}$$

³ http://cmusphinx.sourceforge.net/

Fig. 9. Bayesian network for homogeneous fusion (nodes: d, U, S).

In order to dynamically select the best combination S given the user distances and the hypotheses, we first learn the density P(S|d), which is inversely proportional to the WER as a function of the distance. This measure is further explained in Section 8.2. The density is learned with the dataset presented in Section 8.1, and the WER values are interpolated using a 3rd degree polynomial. This density is then used in Algorithm 2, which is an implementation of the Bayesian network previously described.

1   Result: $\operatorname*{arg\,max}_{S} P(S \mid d, U) = \mathrm{Fusion}(d, U, P(S \mid d))$
2   $P_{max} = 0$
3   for each system $S$ do
4       compute $[U_S \neq \emptyset]$
5       compute $P(S \mid d, U) \propto P(S \mid d)\,[U_S \neq \emptyset]$
6       if $P(S \mid d, U) > P_{max}$ then
7           $P_{max} = P(S \mid d, U)$
8       end
9   end
10  compute $S^* = \operatorname*{arg\,max}_{S} P(S \mid d, U)$
11  return $S^*$

Algorithm 2: Fusion algorithm based on a Bayesian network for homogeneous fusion.
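A small Python sketch of Algorithm 2 follows. Here P(S|d) is represented by per-combination 3rd degree polynomial models of reliability versus distance; the combination names and polynomial coefficients are placeholders, whereas in the paper these densities are learned from the corpus of Section 8.1.

```python
import numpy as np

# Placeholder reliability models: for each microphone/ASR combination, a 3rd
# degree polynomial giving an (unnormalized) P(S | d) as a function of the
# user/microphone distance d in metres. Coefficients are purely illustrative.
RELIABILITY = {
    'kinect+pocketsphinx': np.poly1d([0.02, -0.15, 0.05, 0.80]),
    'kinect+google':       np.poly1d([0.01, -0.10, 0.00, 0.85]),
    'phone+pocketsphinx':  np.poly1d([0.03, -0.20, 0.10, 0.75]),
    'phone+google':        np.poly1d([0.02, -0.12, 0.02, 0.90]),
}

def select_combination(distances, hypotheses):
    """Pick argmax_S P(S | d) [U_S not empty], as in Eq. (4) / Algorithm 2.

    distances:  dict combination -> user/microphone distance in metres.
    hypotheses: dict combination -> ASR hypothesis string (or None/empty).
    """
    best, p_max = None, 0.0
    for s, poly in RELIABILITY.items():
        has_hyp = 1.0 if hypotheses.get(s) else 0.0    # [U_S != empty set]
        p = max(poly(distances[s]), 0.0) * has_hyp      # P(S | d, U) up to a constant
        if p > p_max:
            best, p_max = s, p
    return best

d = {'kinect+pocketsphinx': 1.5, 'kinect+google': 1.5,
     'phone+pocketsphinx': 0.3, 'phone+google': 0.3}
u = {'kinect+pocketsphinx': "ou est la telecommande", 'kinect+google': None,
     'phone+pocketsphinx': "ou est la telecommande",
     'phone+google': "où est la télécommande"}
print(select_combination(d, u))
```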

7. Task-level coordination and implementation details

Deploying the complete system described in this paper on a robotic platform is a complex task which needs a coordinating tool to start and stop services whenever required and to make transitions between different services smoothly, with proper exception propagation. These different services can be thought of as individual robotic tasks that can be coordinated to create and launch a complete working robotic demo. For example, in this work, user detection, user's intention detection, human–robot interaction, and robot control (robot movement or specific joint control) make up individual tasks. A popular tool, especially for ROS based systems, adopted here is smach [55]. Smach is a task-level architecture for rapidly creating complex robot behavior that has been successfully used in many robotic applications, e.g., [58,59]. The complete system, spanning from user detection to human–robot interaction, is coordinated via different state containers including finite state machines, concurrent state machines, and action state machines.

Fig. 10. Visualization of the task-level coordination state machine constructed using smach [55]. Each shaded state machine is a container by itself (consisting of a sub state machine), the outputs of which are shown in red rectangular blocks. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 1
List of the main tasks and associated sub-tasks deployed to realize the assistive robotic system presented. Boldface nodes indicate the ones developed in this work.

Main task | Main smach state machine | Sub-tasks | Corresponding ROS node (cf. Fig. 3) | Executing machine
User detection | FINDPERSON | User detection | user_detector | PR2_c1
 | | RGB-D Kinect streaming | openni_camera [56] | PR2_c1
 | | Move robot head | ROS actionlib interface | PR2_c1
Move to monitoring position | MOVETOGARAGE | Robot localization | pr2_2dnav [57] | PR2_c2
 | | Navigation to goal | ROS actionlib interface | PR2_c2
 | | Obstacle avoidance | ROS actionlib interface | PR2_c2
Intention-for-interaction detection | INTENT4INTERACT | RGB-D Kinect streaming | openni_camera [56] | PR2_c1
 | | Head orientation | head_pose_estimation [51] | PR2_c1
 | | Shoulder orientation | sh_pose_estimation | PR2_c1
 | | PSOT tracking | tpso_head_shoulder | PR2_c1
 | | Audio (Android/Kinect) streaming | audio_acquis | Smartphone/PR2_c1
 | | VAD detection | audio_vad | PR2_c1
Move to interaction position | APPROACHUSER | Robot localization | pr2_2dnav [57] | PR2_c2
 | | Navigation to goal | ROS actionlib interface | PR2_c2
 | | Obstacle avoidance | ROS actionlib interface | PR2_c2
Close human–robot interaction | DIALOGUE | Audio (Android/Kinect) streaming | audio_acquis | Smartphone/PR2_c1
 | | VAD detection | audio_vad | PR2_c1
 | | Speech recognition | audio_stt | PR2_c1
 | | Speech synthesis | audio_tts | PR2_c1
 | | Speech interpretation | audio_interpreter | PR2_c1
Point robot head | POINT-HEAD | Move robot head | ROS actionlib interface | PR2_c2
Task-level coordination | All (Fig. 10) | – | demo_smach_supervisor | PR2_c2
Data visualization | – | – | ROS tools (rviz, rqt_plot, etc.) | Workstation

The concurrent state machine, for example, is used to launch the user detector action in parallel with a robot head pan action to scan the room for the possible presence of a person, assuming the person is not obscured by any furniture in the environment. Fig. 10 shows a trimmed down version (without the subtle intermediary states that make adjustments for the detected user location) of the smach based state machine for this specific application. The 'FINDPERSON' concurrent machine handles the search for the user in the room, and the 'DIALOGUE' state machine shows a representation of the different states in the close HRI interaction geared by a to-and-fro dialogue.
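A minimal smach sketch in the spirit of Fig. 10 is shown below. It is trimmed to three of the containers, the state bodies are stubs, and the outcome names are illustrative; on the robot each state wraps the corresponding perception module or actionlib call from Table 1.

```python
import smach

class FindPerson(smach.State):
    def __init__(self):
        smach.State.__init__(self, outcomes=['found', 'not_found'])
    def execute(self, userdata):
        # Stub: on the robot this runs the user detector while panning the head.
        return 'found'

class Intent4Interact(smach.State):
    def __init__(self):
        smach.State.__init__(self, outcomes=['intent_detected'])
    def execute(self, userdata):
        # Stub: runs the intention-for-interaction estimator until it fires.
        return 'intent_detected'

class Dialogue(smach.State):
    def __init__(self):
        smach.State.__init__(self, outcomes=['user_done'])
    def execute(self, userdata):
        # Stub: speech-based close interaction loop.
        return 'user_done'

sm = smach.StateMachine(outcomes=['finished', 'failed'])
with sm:
    smach.StateMachine.add('FINDPERSON', FindPerson(),
                           transitions={'found': 'INTENT4INTERACT',
                                        'not_found': 'failed'})
    smach.StateMachine.add('INTENT4INTERACT', Intent4Interact(),
                           transitions={'intent_detected': 'DIALOGUE'})
    smach.StateMachine.add('DIALOGUE', Dialogue(),
                           transitions={'user_done': 'INTENT4INTERACT'})

# sm.execute()  # with these stubs the DIALOGUE/INTENT4INTERACT loop never ends
```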

The main tasks and associated sub-tasks are categorized and listed in Table 1. Each main task has an associated smach state machine shown in Fig. 10. The sub-tasks represented as ROS node implementations are also shown (cross reference with Fig. 3). The boldface nodes indicate the ones developed in line with this work, while the rest are publicly available implementations. The main components of our framework, as presented in Section 3, have corresponding smach state machines labeled FINDPERSON, MOVETOGARAGE, INTENT4INTERACT, APPROACHUSER, and DIALOGUE respectively. The POINT-HEAD state machine is an intermediary state used to point the head of the robot towards the person before the INTENT4INTERACT and DIALOGUE phases. The corresponding sub-tasks are also detailed, clearly showing the link with the corresponding ROS node. For all the robot actions that involve an actuator (robot motion, head movement), the ROS actionlib interface specifically developed for the PR2 is used [60]. All the modules presented in Table 1 execute across four heterogeneous machines: two Intel(R) Quad-Core i7 Xeon machines on the PR2 (PR2_c1 and PR2_c2), an Intel(R) Core(TM) i7-2720QM workstation, and a Samsung Galaxy Note 2 smartphone, thanks to ROS's abstraction that enables seamless TCP/IP communication between these machines. The robot and the smartphone are connected to the network via Wi-Fi connections, whereas the workstation is connected via a fixed network cable. In the current implementation, the positions of the objects are assumed to be known a priori. This information is stored in the Object Map Database.

Table 2
Summary of the different datasets used for evaluating the user detection, tracker precision, and intention detection components. A partial view of the cluttered robotic lab-1 is shown in Fig. 14, and that of the cluttered robotics lab-2 in Figs. 15 (bottom row), 19, and 20. The cluttered office scene is shown in Fig. 15 (top row).

Name | Mode | Image frames | Duration | Max persons | Ground truth | Environment
UserDet-DT | RGB-D | 235 | – | 4 | Manual annotation | Cluttered robotic lab-1
Intent-DT1 | RGB-D + Audio | 4230 | 141 s | 1 | Manual annotation | Cluttered office
Intent-DT2 | RGB-D + Audio | 3930 | 131 s | 1 | Manual annotation | Cluttered robotic lab-2
Intent-DT3 | RGB-D + Audio | 3180 | 106 s | 1 | Manual annotation | Cluttered robotic lab-2

Fig. 11. A screen snapshot of the developed Android application user interface.

It contains the positions of the various objects in the environment, for example as "The remote is under the table". For improved and more realistic scenarios, it will suffice to update this database with information about detected objects should an automated object detection module become available in the future.

Regarding the audio data from the smartphone, the real-time audio signal used by the ASR systems is streamed to the robot from an Android application. This application was developed in the context of this work using the rosjava Android build tool [61], to use the phone both as an input device and as a debugging tool when operating the robot. Therefore, the smartphone can be used to visualize the speech recognition hypothesis and the current state of the robot. The audio is streamed to the network in 512-sample buffers at a 16-bit quantization level. A lock button is provided to prevent any unwanted touches on the screen (e.g., from stopping the audio stream accidentally). A screen snapshot of the Android application user interface is shown in Fig. 11.

8. Experiments and results

This section presents the different experimental evaluations carried out to validate and demonstrate the proposed multi-modal perception driven assistive robotic framework, along with the obtained results. It is categorized under four headings: (1) the various datasets used for evaluation in Section 8.1; (2) the adopted standard evaluation metrics in Section 8.2; (3) details of the obtained results in Section 8.3; and (4) the conducted user study along with an analysis of the observations made therein in Section 8.4.

8.1. Datasets

The various components of the presented system rely on multi-modal data acquired using a Kinect sensor (RGB-D and audio) and an Android mobile device (audio). Due to the multi-modal nature of the data, and hence the difficulty of finding a public dataset tailored for our application, we have used several proprietary datasets. The datasets include RGB-D image frames for user detection, combined RGB-D and audio for intention detection, and audio only for the speech recognition part. All the datasets described in this section were purposely collected by us for the consequent experimental evaluations. Table 2 summarizes the ones used for evaluating the user detection and intention detection components.

User detection dataset (UserDet-DT). To evaluate the user detection module, we use a proprietary RGB-D dataset consisting of 235 image frames intermittently acquired using a Kinect sensor mounted on our mobile robot in our robotics lab (cluttered robotic lab-1 in Table 2). Each frame of the dataset contains at least one person, the majority contain two persons, and a few image frames feature four persons (the maximum number of persons per image frame). In terms of detection targets, there are a total of 521 target occurrences, out of which 182 are situated farther than 5 m from the Kinect sensor. Even though the application context in this work is single user detection, we evaluate the detection module with this dataset containing multiple persons per image frame to characterize its detection capability thoroughly, as is done in the standard people detection literature [16]. This increases the chance of testing the detector under broadly varying conditions, e.g., inter-person occlusions, deformations due to articulations, and different person postures (walking, sitting, standing, etc.). Additionally, it helps demonstrate its capability and potential to be used in multi-user scenarios. The dataset is manually annotated to create a complete ground truth by delineating each person in each image frame with rectangular bounding boxes.

Intention evaluation dataset (Intent-DT). For the user's intention detection evaluation, we acquired three separate datasets: Intent-DT1, Intent-DT2, and Intent-DT3. Intent-DT1 was acquired merely in an office setting using a standalone Kinect and a smartphone, whereas the other two, Intent-DT2 and Intent-DT3, were acquired using the PR2 in a robotic experimental area. Their lengths vary between 3180 and 4230 image frames (acquired at 30 fps). The datasets consist of RGB-D and audio streams. In all cases, the user sits at an approximate distance of 1.5–2 m from the RGB-D sensor and demonstrates intention-for-interaction by facing the Kinect sensor and/or speaking. The datasets are manually annotated to mark intention-active regions with the help of the user.

Speech recognition evaluation dataset. To evaluate the microphone/speech recognition API fusion framework, a proprietary dataset (corpus) dedicated to this study was collected, involving four speakers. In this corpus, each participant repeatedly utters 17 French sentences that have been selected to fit our HRI context.


Each sentence is repeated three times by each of the four speakers at four different distances: 10 cm, 50 cm, 1 m, and 2 m. To further clarify, during the acquisition session, the same speaker repeats the same sentence a total of 12 times (3 repetitions × 4 distances), leading to 204 sentences per speaker (17 sentences × 12 repetitions). When the user/microphone distance is greater than 2 m, the WER (see Section 8.2) usually reaches its maximum and no hypothesis is produced by any combination. The acquisition is iterated over the four speakers. All in all, this dataset contains 17 sentences uttered 12 times by four speakers, resulting in 816 recorded sentences spanning a total time of 34'03".

8.2. Evaluation metrics

To quantify the performance of the different perceptual blocks utilized in this work, we make use of various well-established metrics.

User detector performance. To evaluate the performance of the user detector, we use the miss rate (MR) and average false positives per image (FPPI) metrics, as defined in Eqs. (5) and (6) respectively. Both MR and FPPI are the most widely established evaluation metrics in people detection [16]. MR indicates the proportion of people not detected in the entire dataset (it is "1 − true positive rate"). FPPI, on the other hand, indicates the number of false positives averaged over the total number of image frames; it signifies how many false positives are likely to occur when the detector is applied to a new image frame.

\[
\text{MR} = 1 - \frac{\sum_{i=1}^{N_f} TP_i}{\sum_{i=1}^{N_f} \left( TP_i + FN_i \right)} \tag{5}
\]
\[
\text{FPPI} = \frac{1}{N_f} \sum_{i=1}^{N_f} FP_i \tag{6}
\]

In Eqs. (5) and (6), $i$ indexes the image frame, $N_f$ is the total number of image frames in the dataset, and $TP_i$, $FP_i$, $FN_i$ denote the true positives, false positives, and false negatives at the $i$th test image frame respectively. The evaluation generally results in an MR–FPPI plot in log–log scale that is generated by varying the threshold $\vartheta_o$ on the final detector score $\vartheta$. For example, increasing the threshold $\vartheta_o$ will increase the MR, since fewer windows will be detected, but it also reduces the number of false positives (hence the FPPI), and vice-versa. $\vartheta_o$ is a tunable parameter that defines the operating point of the detector. To summarize the performance of the detector, the log-average miss rate is used. It is computed by averaging the miss rate at nine FPPI rates evenly spaced in log-space in the range $10^{-2}$–$10^{0}$ (for curves that end before reaching a given FPPI rate, the minimum miss rate achieved is used) [16].
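For concreteness, the following minimal Python sketch (ours, not the original evaluation code; function names are illustrative) computes MR and FPPI from per-frame counts and summarizes an MR–FPPI curve with the log-average miss rate. The exact sampling and averaging conventions of [16] may differ slightly from this simplified version.

```python
import numpy as np

def mr_fppi(per_frame_counts):
    """per_frame_counts: list of (TP_i, FP_i, FN_i) tuples, one per image frame.
    Returns (MR, FPPI) following Eqs. (5) and (6)."""
    tp = sum(c[0] for c in per_frame_counts)
    fp = sum(c[1] for c in per_frame_counts)
    fn = sum(c[2] for c in per_frame_counts)
    mr = 1.0 - tp / float(tp + fn)
    fppi = fp / float(len(per_frame_counts))
    return mr, fppi

def log_average_miss_rate(fppi_curve, mr_curve):
    """Sample the miss rate at nine FPPI values evenly spaced in log-space over
    [1e-2, 1e0]; if the curve ends before a reference FPPI, its minimum miss
    rate is used."""
    refs = np.logspace(-2, 0, 9)
    sampled = []
    for ref in refs:
        candidates = [m for f, m in zip(fppi_curve, mr_curve) if f <= ref]
        sampled.append(min(candidates) if candidates else min(mr_curve))
    # The log-average of [16] is usually a geometric mean of the sampled rates;
    # a plain arithmetic mean is a close, simpler variant.
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))
```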

Intention evaluation. The final inference engine of the intention-for-interaction is based on an HMM. Hence, to quantify its detection performance, we make use of various metrics mostly used in HMM applications in the literature (e.g., [62]), listed below; a small computation sketch following the list illustrates them. For easier mathematical notation, let us represent the different measures as follows. We define the indicator function $I(x, y)$ in Eq. (7), assuming $x, y \in \{\text{intent}, \neg\text{intent}\}$, to be used to flag a true positive, a false positive, and a true negative. Let $I_t, G_t \in \{\text{intent}, \neg\text{intent}\}$ represent the intention label assigned by the intention detection module and the ground truth intention label at time frame $t$ respectively. Let $T$ represent the entire length of the evaluation dataset, which consists of $T_{\text{intent}} = \{T_{\text{intent},j}\}_{j=1}^{N_{IT}}$ and $T_{\neg\text{intent}} = \{T_{\neg\text{intent},j}\}_{j=1}^{N_{NT}}$, the disjoint spans where there is and there is no user intention in the ground truth annotation respectively. $N_{IT}$ is the total number of such spans where there is user intention, and $N_{NT}$ is the total number of such spans where there is no intention. The $|\cdot|$ operator indicates the duration of a time span, e.g., $|T_{\text{intent},j}|$ is the duration of the $j$th intention-marked time span. Consequently, $T = T_{\text{intent}} \cup T_{\neg\text{intent}}$ is satisfied. Finally, let $J^*$ represent the set of indexes of the time spans where an intention is correctly detected at some point in the span, i.e., $J^* = \{\, j \mid \sum_{t \in T_{\text{intent},j}} I(I_t, G_t) > 0 \,\}$.

\[
I(x, y) =
\begin{cases}
1, & \text{if } x = \text{intent and } y = \text{intent} \\
0, & \text{otherwise}
\end{cases} \tag{7}
\]

True positive rate (TPR): It is defined as the ratio of correct intention detections (in accordance with the ground truth) to the total number of intention-tagged ($G_t = \text{intent}$) data frames. It is formally expressed in Eq. (8).

\[
\text{TPR} = \frac{\sum_{t \in T} I(I_t, G_t)}{\sum_{t \in T} I(G_t, G_t)} \tag{8}
\]

False alarm rate (FAR): It is the ratio of the number of observation data frames for which the detection output flags an intention where there is none in the ground truth, to the total number of no-intention data frames, as described in Eq. (9).

\[
\text{FAR} = \frac{\sum_{t \in T} I(I_t, \neg G_t)}{\sum_{t \in T} I(\neg G_t, \neg G_t)} \tag{9}
\]

Average early detection (AED): Given an observation span $j$ labeled with intent, $T_{\text{intent},j}$, the early detection time is the discrete time $t_{d,j}$ the system took to correctly detect an intention. The AED is then computed by averaging the normalized early detection time over all correctly detected intentions. Using $J^*$, the AED can be expressed as in Eq. (10).

\[
\text{AED} = \frac{1}{|J^*|} \sum_{j \in J^*} \frac{t_{d,j}}{|T_{\text{intent},j}|} \tag{10}
\]

Average correct duration (ACD): It is defined in a similar fashion as AED, but instead considers the correctly detected intention duration. If $c_{d,j}$ represents the discrete time span (duration) through which an intention is correctly detected, the ACD is determined by averaging over all correctly detected intentions as in Eq. (11).

\[
\text{ACD} = \frac{1}{|J^*|} \sum_{j \in J^*} \frac{c_{d,j}}{|T_{\text{intent},j}|} \tag{11}
\]

Speech recognition. The evaluation of the speech recognition module is based on the WER metric. The classic WER is defined by
\[
\text{WER} = \frac{S + D + I}{N}
\]
where $S$ is the number of word substitutions, $D$ is the number of word deletions, $I$ is the number of word insertions, and $N$ is the number of words in the reference.

In order to learn $P(S|d)$, which is used by the fusion Algorithm 2, the WER of each combination, as a function of the distance, is first learned. Two sets of metrics are defined and computed on the dataset presented in Section 8.1 (predefined sentences uttered at different distances) used for the $P(S|d)$ estimation. These metrics are presented below, followed by a small computation sketch.

Total WER (T-WER): In this case, the WER is computed for all utterances, even if no hypothesis is found by a speech recognition API. This is the classic mean WER. This metric is used to compare different systems.

Utterance WER (U-WER): In this case, the WER is computed only for utterances for which hypotheses are found; when nothing has been recognized, the utterance is discarded. This can be seen as the precision of the recognition of a speech recognition API. This metric is used to learn the mapping matrix.
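To make the distinction between T-WER and U-WER concrete, a minimal Python sketch follows (illustrative only; the function names and the convention of encoding a missing hypothesis as None are ours). T-WER scores every utterance, counting a missing hypothesis as all deletions, whereas U-WER only scores utterances that produced a hypothesis.

```python
def wer(hyp, ref):
    """Word error rate (S + D + I) / N via Levenshtein alignment on word lists."""
    h, r = hyp.split(), ref.split()
    # dp[i][j]: edit distance between the first i reference and j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / float(len(r))

def t_wer_and_u_wer(pairs):
    """pairs: list of (hypothesis or None, reference) for one microphone/API
    combination.  Per-utterance WERs are averaged here; pooling S, D, I and N
    over the whole set is an equally common convention."""
    t_scores = [wer(h if h is not None else "", r) for h, r in pairs]
    u_scores = [wer(h, r) for h, r in pairs if h is not None]
    t_wer = sum(t_scores) / len(t_scores)
    u_wer = sum(u_scores) / len(u_scores) if u_scores else 1.0
    return t_wer, u_wer
```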

