
Reinforcement learning-based cell selection in sparse mobile crowdsensing

Wenbin Liu, Leye Wang, En Wang, Yongjian Yang, Djamal Zeghlache, Daqing Zhang

To cite this version:
Wenbin Liu, Leye Wang, En Wang, Yongjian Yang, Djamal Zeghlache, Daqing Zhang. Reinforcement learning-based cell selection in sparse mobile crowdsensing. Computer Networks, Elsevier, 2019, 161, pp. 102-114. doi:10.1016/j.comnet.2019.06.010. HAL Id: hal-02321018 (https://hal.archives-ouvertes.fr/hal-02321018), submitted on 23 Oct 2019.


Reinforcement learning-based cell selection in sparse mobile crowdsensing

Wenbin Liu a,d,1, Leye Wang b,c,1, En Wang a,∗, Yongjian Yang a, Djamal Zeghlache d, Daqing Zhang b,c,d

a College of Computer Science and Technology, Jilin University, Changchun, Jilin, China
b Key Lab of High Confidence Software Technologies, Peking University, Beijing, China
c School of Electronic Engineering and Computer Science, Peking University, Beijing, China
d RS2M, Telecom SudParis, Evry, France

Article info

Article history: Received 27 March 2019; Revised 28 May 2019; Accepted 11 June 2019; Available online 12 June 2019.

Keywords: Mobile crowdsensing; Cell selection; Reinforcement learning; Compressive sensing

Abstract

Sparse Mobile Crowdsensing (MCS) is a novel MCS paradigm which allows us to use mobile devices to collect sensing data from only a small subset of cells (sub-areas) in the target sensing area while intelligently inferring the data of the other cells with a quality guarantee. Since selecting sensed data from different cell sets will probably lead to diverse levels of inference data quality, cell selection (i.e., choosing which cells in the target area to collect sensed data from participants) is a critical issue that impacts the total amount of data that needs to be collected (i.e., the data collection cost) for ensuring a certain level of data quality. To address this issue, this paper proposes reinforcement learning-based cell selection algorithms for Sparse MCS. First, we model the key concepts in reinforcement learning, including state, action, and reward, and then propose a Q-learning based cell selection algorithm. To deal with the large state space, we employ a deep Q-network to learn the Q-function that can help decide which cell is a better choice under a certain state during cell selection. Then, we modify the Q-network to a deep recurrent Q-network with LSTM to catch the temporal patterns and handle partial observability. Furthermore, we leverage transfer learning techniques to relieve the dependency on a large amount of training data. Experiments on various real-life sensing datasets verify the effectiveness of our proposed algorithms over the state-of-the-art mechanisms in Sparse MCS, by reducing up to 20% of sensed cells with the same data inference quality guarantee.

© 2019 Published by Elsevier B.V.

1. Introduction

Mobile crowdsensing (MCS) [3] is a novel sensing mechanism which allows us to use ubiquitous mobile devices to address various urban monitoring needs, such as environment and traffic monitoring [29]. Traditional MCS applications usually recruit many participants in order to cover all the cells (i.e., sub-areas) of the target area to ensure sensing quality, which costs a lot and may even be impossible (e.g., there is no participant in some cells) [18,19,28]. To deal with these problems, a new MCS paradigm, namely Sparse MCS, was proposed recently [21,23], which collects data from only a subset of cells while intelligently inferring the data of the other cells with a quality guarantee (i.e., the error of the inferred data is lower than a threshold).

∗ Corresponding author.
E-mail addresses: liuwb16@mails.jlu.edu.cn (W. Liu), leyewang@sei.pku.edu.cn (L. Wang), wangen@jlu.edu.cn (E. Wang), yyj@jlu.edu.cn (Y. Yang), djamal.zeghlache@telecom-sudparis.eu (D. Zeghlache), daqing.zhang@telecom-sudparis.eu (D. Zhang).
1 Wenbin Liu and Leye Wang contributed equally.

In Sparse MCS, one key issue is cell selection: which cells should the organizer choose and collect sensed data from participants [21]. To show the importance of cell selection, Fig. 1 (left part) gives an illustrative example of two different cell selection cases in a city, which is split into 4 × 4 cells. In Case 1.1, all the selected cells are gathered in one corner of the city; in Case 1.2, the collected data is evenly distributed over the whole city. As the data of most sensing tasks has spatial correlations (i.e., nearby cells may have similar data), e.g., air quality [30], the cell selection of Case 1.2 will generate a higher quality of the inferred data than Case 1.1. Moreover, an MCS campaign usually lasts for a long time (e.g., sensing every hour), so that not only spatial correlations but also temporal correlations need to be carefully considered in cell selection. As shown in Fig. 1 (right part), sensing the same cells in consecutive cycles (Case 2.1) may not be as efficient as sensing different cells (Case 2.2) considering the inference quality. Therefore, the data of different MCS applications may involve diverse spatio-temporal correlations, which are hard to model and determine, so a proper cell selection strategy is a non-trivial task.

Fig. 1. Different cell selection cases.

Existing works on Sparse MCS mainly leverage Query-By-Committee (QBC) [20,23] in cell selection. QBC first uses various inference algorithms to deduce the data of all the unsensed cells, and then chooses the cell where the inferred data of the various algorithms has the largest variance as the next cell for sensing. Briefly, QBC chooses the most uncertain cell according to a committee of inference algorithms, which handles cell selection skilfully and has shown its effectiveness as a whole [20,23]. However, QBC only chooses the cell which is the most uncertain at that moment, and ignores whether the current selection would help the inference in the future or not. For example, as shown in Fig. 1 (right part), if we select one cell at time t_k, it would help the inference not only at this moment but also at the subsequent instant t_{k+1}.

To overcome these limitations, in this paper we study the critical cell selection problem in Sparse MCS with reinforcement learning, which can capture the spatio-temporal correlations in the sensing data and approximate the globally optimal strategy for cell selection. In recent years, reinforcement learning has shown its success in decision making problems in diverse areas such as robot control and game playing [11,16], which can be abstracted as 'an agent needs to decide the action under a certain state, in order to maximize some notion of cumulative reward'. Reinforcement learning tries out different actions, observes the rewards, and thus learns the optimal decision for each state. Our cell selection problem can actually be interpreted as 'an MCS server (agent) needs to choose the next cell for sensing (action) considering the data already collected (state), in order to minimize the number of sensed cells under a quality guarantee (reward)'. In this regard, it is appropriate to apply reinforcement learning to the cell selection problem in Sparse MCS.

By using reinforcement learning, the cell selection problem in Sparse MCS can be well solved. First of all, a model-free reinforcement learning method can record which cell would help most under a certain state through trial and error. In fact, trial and error is exactly the fundamental idea of reinforcement learning. After sufficient training, reinforcement learning records the rewards for each state-action pair and selects the action which has the biggest reward under the state. Moreover, reinforcement learning adds the reward attainable from the future state to the reward of its current state, effectively influencing the current selection by the potential reward in the future. Thus, reinforcement learning can approximate the globally optimal strategy for cell selection in Sparse MCS.

To effectively employ reinforcement learning in cell selection, we face several issues. (1) How to mathematically model the state, action, and reward, which are key concepts in reinforcement learning [17]. Briefly speaking, reinforcement learning attempts to learn a Q-function which takes the current state as input and generates a reward score for each possible action as output. Then, we can take the action with the highest reward score as our decision. (2) How to learn the Q-function. Traditional Q-learning techniques in reinforcement learning use tables to store rewards for each state-action pair. This works well in scenarios where the number of states and actions is limited. However, in Sparse MCS, the number of states is actually quite large. We propose to use a neural network to replace the table, i.e., leveraging deep reinforcement learning to learn the Q-function for our cell selection problem. (3) The training data scarcity issue. Usually, deep reinforcement learning requires a lot of training data to learn the Q-function. However, in MCS, we can only obtain a small amount of data for training. To deal with this problem, we propose to collect a small amount of redundant data to conduct effective training by random combination. Moreover, we introduce a transfer learning technique, in order to make use of a well-trained Q-function and reduce the required training data for heterogeneous sensing tasks in a similar target area.

In summary, this work makes the following contributions:

(1) To the best of our knowledge, this work is the first research that attempts to leverage reinforcement learning to address the critical cell selection issue in Sparse MCS. We believe that using reinforcement learning is a promising way to solve such decision making problems, especially when we cannot obtain a direct solution and the decisions have long-term utilities.

(2) We propose reinforcement learning-based algorithms for cell selection in Sparse MCS. First, we model the state, action, and reward and propose a tabular Q-learning based algorithm, which records the reward scores for each state-action pair in tables. Considering the extremely large state space, we employ a neural network instead of tables and learn a Q-function to calculate the reward scores. Since this neural network cannot catch the temporal patterns and handle partial observability well, we propose a recurrent deep neural network structure, which uses a Long Short-Term Memory layer instead of the dense layer. Finally, we collect a small amount of redundant data to conduct effective training by random combination and propose a transfer learning method between heterogeneous sensing tasks, in order to relieve the dependence on a large amount of training data.

(3) Experiments with applications in temperature, humidity, air quality, and traffic monitoring have verified the effectiveness of our proposed algorithms. In particular, our proposed algorithms can outperform the state-of-the-art mechanism QBC by selecting up to 20% fewer cells while guaranteeing the same quality in Sparse MCS.

The remainder of the paper is organized as follows. First, we review related works in Section 2. The problem formulation is introduced in Section 3. In Section 4, we propose the reinforcement learning-based cell selection algorithms and discuss the training and transfer learning methods. Then, the performance of the proposed algorithms is evaluated through extensive simulations over three real-world datasets in Section 5. Finally, we conclude this paper in Section 6.

2. Related works

2.1. Sparse mobile crowdsensing

MCS is proposed to utilize widespread crowds to perform large-scale sensing tasks [3,29]. Existing works in MCS mainly recruit many participants to ensure sensing quality [18,19,28], which costs a lot and may even be impossible. To minimize sensing cost while ensuring data quality, some MCS tasks involve inference algorithms to fill in the missing data of unsensed cells, such as noise sensing [12], traffic monitoring [31], and air quality sensing [20]. It is worth noting that in such MCS tasks, compressive sensing has become the de facto choice of inference algorithm [12,20,23,27,31]. Recently, by extracting the common research issues involved in such tasks with data inference, Wang et al. [21] proposed a new MCS paradigm, called Sparse MCS. Besides the inference algorithm, Sparse MCS also abstracts other critical research issues such as cell selection and quality assessment. Later, a privacy protection mechanism was also added to Sparse MCS [22]. In this paper, we focus on cell selection and aim to use deep reinforcement learning techniques to address it.

2.2. Reinforcement learning

Reinforcement Learning (RL) [17] is concerned with how to map states to actions so as to maximize cumulative rewards. It utilizes rewards to guide the agent to make better sequential decisions, and has substantive and fruitful interactions with other engineering and scientific disciplines. Recently, many researchers have focused on combining deep learning with reinforcement learning to enhance RL and solve concrete problems in the sciences, business, and other areas. Mnih et al. [10] proposed the first deep reinforcement learning model (DQN) to successfully deal with high-dimensional sensory input and applied it to play seven Atari 2600 games. More recently, Silver et al. [15] applied DQN and presented AlphaGo, which was the first program to defeat world-class players in Go. Moreover, to deal with partially observable states, Hausknecht and Stone [6] introduced a deep recurrent neural network (DRQN) and applied it to play Atari 2600 games. Lample and Chaplot [9] even used DRQN to play FPS games.

Although reinforcement learning has already been used in a variety of areas, like object recognition, robot control, and communication protocols [17], MCS researchers have only recently begun to apply it. Xiao et al. [24] formulated the interactions between a server and vehicles as a vehicular crowdsensing game. They then proposed Q-learning based strategies to help the server and vehicles make optimal decisions in the dynamic game. Moreover, Xiao et al. [25] applied a Deep Q-Network to derive the optimal policy for the Stackelberg game between an MCS server and the smartphone users. As far as we know, this paper is the first research that attempts to use reinforcement learning in the cell selection of Sparse MCS, so as to reduce the recruited participants while still guaranteeing the data quality.

3. System model and problem formulation

First, we define several key concepts and briefly introduce compressive sensing for data inference and Bayesian inference for quality assessment. Then we mathematically formulate the cell selection problem in Sparse MCS. Finally, a running example is illustrated to explain our problem in more detail.

3.1. Definitions

Definition 1. Sensing Area. We suppose that the target sensing area can be split into a set of cells (e.g., 1 km × 1 km grids [23,30]). The objective of a sensing task is to get a certain type of data (e.g., temperature, air quality) for all the cells in the target area.

Definition 2. Sensing Cycle. We suppose the sensing tasks can be split into equal-length cycles, and the cycle length is determined by the MCS organizers according to their requirements [23,26]. For example, if an organizer wants to update the data of the target sensing area every hour, then he can set the cycle length to one hour.

Definition 3. Ground Truth Data Matrix. Suppose we have m cells and n cycles; then, for a certain sensing task, the ground truth data matrix is denoted by D_{m×n}, where D[i,j] is the true data in cell i at cycle j.

Definition 4. Cell Selection Matrix. In Sparse MCS, we only select partial cells in each cycle for data collection, while inferring the data for the rest of the cells. The cell selection matrix, denoted as C_{m×n}, marks the cell selection results. C[i,j] = 1 means that cell i is selected at cycle j for data collection; otherwise, C[i,j] = 0.

Definition 5. Collected Data Matrix. A collected sensing data matrix S_{m×n} records the actually collected data: S_{m×n} = D ∘ C, where ∘ denotes the element-wise product of two matrices.

Definition 6. Inferred Data Matrix. In Sparse MCS, when an organizer decides not to collect any more data in the current cycle, the data of the unsensed cells will then be inferred. We denote the inferred data of the k-th cycle as D̂[:,k], and thus the inferred data of all the cycles as a matrix D̂_{m×n}. Note that in Sparse MCS, compressive sensing is the de facto choice of data inference algorithm nowadays [12,20,23,27,31], and we also use it in this work.

Definition 7. (e, p)-quality [23]. In Sparse MCS, the quality guarantee is called (e, p)-quality, meaning that in p · 100% of cycles, the inference error (e.g., mean absolute error) is not larger than e. Formally,

\[
\left|\left\{ k \;\middle|\; \mathrm{error}\big(D[:,k], \hat{D}[:,k]\big) \le e,\ 1 \le k \le n \right\}\right| \ge n \cdot p, \tag{1}
\]

where n is the total number of sensing cycles.

Note that in practice, since we do not know the ground truth data matrix D, we also cannot know with 100% confidence whether error(D[:,k], D̂[:,k]) is smaller than e in the current cycle. This is why we include p in the quality requirement, as it is impossible to ensure that 100% of the cycles have an error less than e. To ensure (e, p)-quality, a quality assessment method is needed in Sparse MCS to estimate the probability that the error is less than e for the current cycle. If the estimated probability is larger than p, then the current cycle satisfies (e, p)-quality and no more data will be collected (we then move to the next sensing cycle). In Sparse MCS, a leave-one-out based Bayesian inference method is often leveraged for quality assessment [20,21,23], and we also use it in this work.
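As a concrete reading of Definition 7, the sketch below (hypothetical helper name, assuming mean absolute error as the per-cycle error metric) counts the cycles whose inference error stays within e and checks whether their fraction reaches p.

```python
import numpy as np

def satisfies_ep_quality(D, D_hat, e, p):
    """(e, p)-quality of Eq. (1): in at least p*100% of the n cycles,
    the per-cycle mean absolute error must not exceed e."""
    n = D.shape[1]
    cycle_errors = np.abs(D - D_hat).mean(axis=0)   # one error per cycle (column)
    return np.sum(cycle_errors <= e) >= n * p

# toy example: 5 cells, 4 cycles of temperature-like data
rng = np.random.default_rng(0)
D = rng.normal(20.0, 1.0, size=(5, 4))              # ground truth
D_hat = D + rng.normal(0.0, 0.1, size=D.shape)      # inferred matrix
print(satisfies_ep_quality(D, D_hat, e=0.25, p=0.9))
```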

3.2. Data inference

Compressive sensing is the de facto choice to infer the full sensing matrix from the partially collected sensing values and has shown its effectiveness in several scenarios [20,23]. It reconstructs the full sensing matrix D̂ based on the low-rank property:

\[
\min\ \operatorname{rank}(\hat{D}) \tag{2}
\]

\[
\text{s.t.}\quad \hat{D} \circ C = S, \tag{3}
\]

where D̂ ∘ C is the element-wise product of the inferred full sensing matrix and the cell selection matrix, and S is the collected data matrix.

With the help of the singular value decomposition, i.e., D̂ = LR^T, we convert the above optimization problem as follows [31]:

\[
\min\ \lambda \left( \|L\|_F^2 + \|R\|_F^2 \right) + \left\| LR^T \circ C - S \right\|_F^2. \tag{4}
\]

Moreover, in order to better capture the spatio-temporal correlations in the sensing data, we further add explicit spatio-temporal constraints into compressive sensing [8,13], and the optimization function is denoted by Eq. (5):

\[
\min\ \lambda_r \left( \|L\|_F^2 + \|R\|_F^2 \right) + \left\| LR^T \circ C - S \right\|_F^2 + \lambda_s \left\| \mathbf{S}(LR^T) \right\|_F^2 + \lambda_t \left\| (LR^T)\mathbf{T}^T \right\|_F^2, \tag{5}
\]

where S and T are spatial and temporal constraint matrices, while λ_r, λ_s, and λ_t are chosen to balance the weights of the different terms. We then use a least squares [8,13] procedure to estimate L and R iteratively, in order to get the optimal D̂ (D̂ = LR^T).
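To make the optimization of Eq. (4) concrete, the following is a minimal alternating least squares sketch (without the spatial and temporal terms of Eq. (5)); the rank, the regularization weight, and the helper name are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def als_complete(S, C, rank=3, lam=0.1, iters=50, seed=0):
    """Approximate Eq. (4): min lam*(||L||_F^2 + ||R||_F^2) + ||(L R^T) o C - S||_F^2.
    S: collected data matrix (m x n); C: 0/1 cell selection matrix (m x n)."""
    m, n = S.shape
    rng = np.random.default_rng(seed)
    L = rng.normal(scale=0.1, size=(m, rank))
    R = rng.normal(scale=0.1, size=(n, rank))
    I = np.eye(rank)
    for _ in range(iters):
        # update each row of L using only the cycles observed for that cell
        for i in range(m):
            obs = C[i] == 1
            if obs.any():
                A = R[obs].T @ R[obs] + lam * I
                L[i] = np.linalg.solve(A, R[obs].T @ S[i, obs])
        # update each row of R using only the cells observed in that cycle
        for j in range(n):
            obs = C[:, j] == 1
            if obs.any():
                A = L[obs].T @ L[obs] + lam * I
                R[j] = np.linalg.solve(A, L[obs].T @ S[obs, j])
    return L @ R.T   # the reconstructed full matrix D_hat
```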

3.3. Quality assessment

In this paper, leave-one-out based Bayesian inference is used to assess the inference quality. First, we use leave-one-out resampling to obtain a set of inferred-true data pairs. Then, comparing the inferred data to the truly collected data, Bayesian inference is leveraged to assess whether the current data quality can satisfy the predefined (e, p)-quality requirement or not.

The basic idea of leave-one-out resampling is simple but effective. Consider that we collect sensing data from m′ out of all the m cells and thus have m′ observations. Each time, we leave one observation out and infer it based on the rest of the m′ − 1 observations by using compressive sensing. After running this process for all m′ observations, we obtain m′ inferred-true data pairs.

Based on the m′ inferred-true data pairs, we can use Bayesian inference to estimate the probability distribution of the inference error E in all the m cells, which helps quality assessment. Actually, satisfying the (e, p)-quality can be seen as P(E ≤ e) ≥ p. We regard E as an unknown parameter and update the probability distribution of E based on our observation θ (the m′ inferred-true data pairs). Therefore, we can approximate P(E ≤ e):

\[
P(E \le e) \approx \int_{-\infty}^{e} g(E \mid \theta)\, dE, \tag{6}
\]

where g(E | θ) is the estimated probability distribution of E. For two widely used error metrics, mean absolute error (for continuous values) and classification error (for classification labels), calculating g(E | θ) based on the observation can be seen as two classic Bayesian statistics problems: inferring a normal mean with unknown variance, and coin flipping. We can then calculate g(E | θ) by the t-distribution [1] and the Beta distribution [4], respectively.
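The sketch below shows one plausible instantiation of this assessment for the mean-absolute-error case: the leave-one-out errors are treated as samples from a normal error with unknown variance, so that, under a standard noninformative prior, the posterior of the mean error follows a Student t-distribution; the function name and the prior choice are assumptions, not the paper's exact code.

```python
import numpy as np
from scipy import stats

def prob_error_below(true_values, loo_inferred, e):
    """Approximate P(E <= e) of Eq. (6) from m' leave-one-out (inferred, true) pairs.
    Posterior of the mean absolute error: t-distribution with m'-1 degrees of freedom."""
    errors = np.abs(np.asarray(loo_inferred) - np.asarray(true_values))
    m_prime = len(errors)
    mean, std = errors.mean(), errors.std(ddof=1)
    return stats.t.cdf(e, df=m_prime - 1, loc=mean, scale=std / np.sqrt(m_prime))

# data collection for the current cycle stops once prob_error_below(...) >= p
```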

3.4. Problem formulation

Based on the previous definitions and the brief introduction of the compressive sensing and Bayesian inference used in this paper, we now define our research problem, focusing on cell selection.

Problem [Cell Selection]: Given a Sparse MCS task with m cells and n cycles, using compressive sensing as the data inference method and leave-one-out based Bayesian inference as the quality assessment method, we aim to select a minimal subset of sensing cells during the whole sensing process (i.e., minimize the number of non-zero entries in the cell selection matrix C), while satisfying (e, p)-quality:

\[
\min \sum_{i=1}^{m} \sum_{j=1}^{n} C[i,j] \qquad \text{s.t.}\ \ \text{satisfy } (e, p)\text{-quality}.
\]

We now use a running example to illustrate our problem in more detail, as shown in Fig. 2. (1) Consider that the MCS task has only 5 cells and it is currently in the 5th cycle. (2) Using the cell selection algorithm, we select cell 3 to collect the sensing data, and then use compressive sensing and Bayesian inference to assess whether the selected cells in this cycle can satisfy the (e, p)-quality. (3) Since the current cycle cannot satisfy the quality requirement, we continue and select cell 5 for data collection. (4) The quality requirement is now satisfied, so data collection is terminated for the current cycle, and the data of the unsensed cells is inferred by compressive sensing. In this example, we obtain 11 data submissions in total over these 5 cycles, and our objective is exactly to minimize the number of data submissions while ensuring the quality. In addition, note that some cells may not be sensable in the current sensing cycle (e.g., there are no users in these cells). In practical use, we first update a candidate cell set, in which cells can be sensed in the current sensing cycle, and then select the next cell to sense from the candidate cell set.

Fig. 2. Running example.
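The per-cycle loop implied by this running example can be sketched as follows; select_next_cell, collect, infer_with_cs, and quality_satisfied stand in for the cell selection policy, the participants' data collection, the compressive sensing inference, and the Bayesian quality assessment, and are assumed names rather than the paper's API.

```python
def run_one_cycle(candidate_cells, select_next_cell, collect,
                  infer_with_cs, quality_satisfied):
    """Select cells one by one until the inferred data satisfies (e, p)-quality."""
    sensed, D_hat = {}, None
    candidates = set(candidate_cells)                 # cells with available participants
    while candidates:
        cell = select_next_cell(sensed, candidates)   # e.g., RL policy or QBC
        sensed[cell] = collect(cell)                  # ask a participant in that cell
        candidates.discard(cell)
        D_hat = infer_with_cs(sensed)                 # fill in the unsensed cells
        if quality_satisfied(sensed, D_hat):          # Bayesian (e, p)-quality check
            break
    return sensed, D_hat
```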

4. Methodology

In this section, we propose reinforcement learning-based algorithms to address the cell selection problem in Sparse MCS. First, we mathematically model the state, reward, and action. Then, with a simplified MCS task example (i.e., only a few cells in the target area), we explain how traditional reinforcement learning finds the most appropriate cell for sensing based on our state, action, and reward modeling. Afterward, we elaborate how deep learning can be combined with reinforcement learning to work on more realistic cases of cell selection where the target area includes a large number of cells. Finally, we describe the training stage and explain how we can collect a small amount of redundant data to conduct effective training by random combination. Moreover, we introduce a transfer learning technique to help us generate a cell selection strategy with only a little training data under some specific conditions.

4.1. Modeling state, action, and reward

To apply reinforcement learning to cell selection, we first model the key concepts in terms of state, action, and reward, as shown in Fig. 3. Specifically, under the state (consisting of the current data collection and some additional information), we should learn a Q-function (elaborated in the next few subsections), which calculates the reward score for each action (choosing which cell to collect sensing data from). If an action gets a higher reward score, it may be a better choice. Next we formally model the three concepts.

(1) State represents the current data collection condition. In Sparse MCS, the cell selection matrix (Definition 4) can naturally model the state well, as it records both where and when we have collected data from the target sensing area during the whole task.

In this paper, instead of the full cell selection matrix, we keep the recent k cycles' cell selection vectors together with a last-time selection vector, called the recent-cycle selection and the last-time selection, denoted as [s_{−k+1}, ..., s_{−1}, s_0, L]. s_0 represents the cell selection vector of the current cycle (1 means selected and 0 means not), s_{−1} represents the last cycle, and so on; L records how long each cell has not been selected. The recent-cycle selection only keeps the recent selections, which avoids that previous selections of low value for data inference disturb the results, while the last-time selection gathers more of the previous selections without missing too much information. In addition, we should also add some necessary information into the state, e.g., the time T, because of the strong temporal correlations existing in many sensing tasks such as traffic monitoring.

Fig. 4. An example of the state model.

Fig. 4 shows an example of how we encode the current data collection condition into the state model. In this example, the recent two cycles, a last-time selection over a total of five cycles, and the time are considered, and the state can be denoted as S = [s_{−1}, s_0, L, T]. Note that the value for cell 5 in the last-time selection is 6, which means that the last-time selection for cell 5 is out of range (a total of five cycles) and we set it as 5 + 1 = 6. The time T is set according to the specific scene. For example, if the data is collected every hour, the time T can be set as {0, 1, ..., 23}, in order to capture the strong temporal correlations.

In addition, we use S to denote the whole set of states. As an easy example, suppose that we only consider the recent-cycle selection (two cycles) and ignore the last-time selection and the time. If there are five cells in the target area in total, then the number of possible states is |S| = 2^{2×5} = 1024, which is already a large state space.
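A minimal sketch of this state encoding (the function name is illustrative; we assume k recent cycles, a last-time-selection cap of max_gap cycles, and an hourly time index):

```python
import numpy as np

def encode_state(recent_selections, last_seen_gap, hour, max_gap=5):
    """Build the state [s_{-k+1}, ..., s_0, L, T].

    recent_selections: (k, m) 0/1 array, the last row is the current cycle s_0
    last_seen_gap: length-m array, cycles since each cell was last selected
    hour: integer time index, e.g. 0..23
    """
    L = np.minimum(last_seen_gap, max_gap + 1)   # out-of-range entries become max_gap + 1
    return np.concatenate([recent_selections.reshape(-1),
                           L.astype(float),
                           [float(hour)]])

# example with the 5-cell, 2-recent-cycle setting of Fig. 4
s = encode_state(np.array([[0, 1, 0, 0, 1],
                           [1, 0, 0, 0, 0]]),
                 last_seen_gap=np.array([0, 1, 3, 6, 1]),
                 hour=14)
print(s.shape)   # (5*2 + 5 + 1,) = (16,)
```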

(2) Action means all the possible decisions that we may make in cell selection. Suppose there are m cells in the target sensing area in total; then our next selected cell has m choices, leading to the whole action set A = {1, 2, ..., m}. In practice, we will not select one cell more than once in one cycle. To make the action set consistent under different states, we assume that the possible action set is always the complete set of all the cells under any state; more specifically, if some cells have already been selected in the current cycle, then the probability of choosing these cells is zero.

Note that we select cells one after another, since selecting multiple cells at a time may lead to a large action space. Also, the reinforcement learning algorithms consider the potential rewards in the future; after sufficient training, the one-by-one selection achieves the largest total reward, which is the same goal as selecting multiple cells at a time.

(3) Reward is used to indicate how good an action is. In each sensing cycle, we select actions one by one until the selected cells can satisfy the quality requirement of the current cycle (i.e., inference error less than e). Satisfying this quality requirement is the goal of cell selection and should be reflected in the reward modeling. Hence, a positive reward, denoted by R, is given to an action under a state S if the quality requirement is satisfied in the current cycle after the action is taken. In addition, as selecting participants to collect data incurs cost, we also put a negative score −c into the reward modeling of an action. Then, the reward of an action can be written as q · R − c, where q ∈ {0, 1} indicates whether the action makes the current cycle satisfy the inference quality requirement.

This reward is actually the immediate utility of one state-action pair. Considering that the current selection also helps the inference in the future, we should add the reward attainable from the future state to the reward of the current state, which is calculated iteratively. Suppose that we have n cycles and select on average m̄ cells per cycle to satisfy the quality requirement; then the final reward over the whole task is n(R − m̄ · c). The different actions under a certain state face the same n, R, and c, while the action which incurs a smaller m̄ achieves a larger reward. Thus, our reward mechanism guides the agent to minimize the number of selected cells while ensuring the data quality. We would like to set a positive reward to accelerate convergence, i.e., set R ≥ m̄ · c. The exact values of R and c do not influence the performance after the Q-function has been well trained, since the difference between rewards only depends on the number of selected cells.
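A toy version of this reward shaping, with R and c as above (the discounted future term is added later by the Q-update, not here):

```python
def immediate_reward(quality_satisfied, R=5.0, c=1.0):
    """q*R - c: pay the cost c for every selected cell and gain R only once
    the (e, p)-quality requirement of the current cycle is met."""
    return (R if quality_satisfied else 0.0) - c
```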

With the above modeling, we then need to learn the Q-function (see Fig. 3), which outputs the reward score of every possible action under a certain state. In the next subsection, we first use a traditional reinforcement learning method, tabular Q-learning, to illustrate a simplified case where only a small number of cells exist in the target sensing area.

4.2. Training the Q-function with tabular Q-learning

In traditional reinforcement learning, tabular Q-learning has been widely used to obtain the Q-function. In this method, we can use a Q-table to represent the Q-function. The Q-table, denoted as Q_{|S|×|A|}, records the reward score, called the Q-value, for each possible action A ∈ A under the state S ∈ S. The objective of learning the Q-function is then equivalent to filling in all the elements of the Q-table.

The tabular Q-learning based cell selection algorithm is shown in Algorithm 1. Under the current state S, the algorithm first updates the candidate action set A_c, in which cells can be sensed by users in the current sensing cycle and have not yet been selected. Then, it checks the Q-table and selects the action with the largest value among Q[S, A], A ∈ A_c (in fact, the best action is not always selected, as elaborated later). After the action has been conducted, i.e., the cell has been selected and the data of the cell has been collected, the current state changes to the next state S′. Note that if the current cycle satisfies the quality requirement (i.e., inference error less than e), the next state shifts to a new cycle.

Fig. 5. An illustrative example of tabular Q-learning.

Algorithm 1 Tabular Q-learning based cell selection.
Initialization: Q-table: Q[S, A] = 0, ∀S ∈ S, ∀A ∈ A; L, T, A_c
1: while True do
2:   S = [s_{−k}, ..., s_{−1}, s_0, L, T]
3:   Update A_c, in which cells can be sensed in the current sensing cycle and have not been selected.
4:   Check the Q-table, select and perform A from A_c which has the largest Q-value, via the ε-greedy algorithm.
5:   if the (e, p)-quality is satisfied then
6:     // Next cycle
7:     s_1 = 0_{m×1}, S′ = [s_{−k+1}, ..., s_0, s_1, L, T]
8:     R = R − c
9:   else
10:    s_0 = s_0 + [0, ..., 0, 1, 0, ..., 0]^T (the 1 is in the A-th element), S′ = [s_{−k}, ..., s_{−1}, s_0, L, T]
11:    R = −c
12:  end if
13:  Update the Q-table via (7) and (8).
14: end while

For the selected action, we obtain the real reward, i.e., R − c if the quality requirement of the current cycle is satisfied and −c otherwise. Then we should iteratively add the possible reward that we might get in the future and update the Q-table according to the following equations:

\[
Q[S, A] = (1 - \alpha)\, Q[S, A] + \alpha \big( R + \gamma V(S') \big), \tag{7}
\]

\[
V(S') = \max_{A} Q[S', A], \quad \forall A \in \mathcal{A}, \tag{8}
\]

where V(S′) provides the highest expected reward score of the next state S′; γ ∈ [0, 1] is the discount factor indicating how myopic the Q-learning is regarding the future reward; α ∈ (0, 1] is the learning rate.

Note that if we always select the action with the largest reward score in the Q-table, the algorithm may get stuck in a local optimum. To address this issue, we need to explore during training, i.e., sometimes try actions other than the best one. We thus use the ε-greedy algorithm for selection. More specifically, under a certain state, we select the best action according to the Q-table with probability 1 − ε and randomly select one of the other actions with probability ε. Following the existing literature, at the beginning of training we set a relatively large ε so that we explore more; then, as training proceeds, we gradually reduce ε until the Q-table converges, at which point Algorithm 1 terminates.
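A compact sketch of the ε-greedy selection and the tabular update of Eqs. (7) and (8), with the Q-table stored as a dictionary keyed by a hashable state; the default alpha, gamma, and epsilon values mirror the experimental settings in Section 5.3 but are otherwise illustrative.

```python
import random
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] -> reward score

def select_action(state, candidate_actions, epsilon=0.1):
    """Epsilon-greedy choice among the not-yet-selected, sensable cells."""
    if random.random() < epsilon:
        return random.choice(list(candidate_actions))
    return max(candidate_actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, all_actions,
             alpha=0.05, gamma=0.9):
    """Eqs. (7)-(8): Q[S,A] <- (1-alpha)Q[S,A] + alpha(R + gamma * max_A' Q[S',A'])."""
    v_next = max(Q[(next_state, a)] for a in all_actions)
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + \
                         alpha * (reward + gamma * v_next)
```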

Fig. 5 illustrates an example of our proposed tabular Q-learning based cell selection algorithm. For simplicity, the discount factor γ and the learning rate α are set to 1, and we only consider the two most recent cycles (i.e., the last and the current one) as the state in this example. We suppose that there are five cells in the target area, and hence the state S has a dimension of 5 × 2, as shown in S_0, S_1, and S_2. The value 1 means that the cell has been selected and 0 means not. First, we initialize the table, i.e., all the values in the Q-table are set to 0. When we first meet some state, e.g., S_0, the scores of all the actions in the Q-table under S_0 are 0 (Q-table: t_0 in Fig. 5). We then randomly select one action since all the values are equal. If we choose the action A_3 (select cell 3), the state turns to S_1. Then we update Q[S_0, A_3] as the current reward plus the maximum score of the next state S_1 (i.e., the future reward). The current reward is −c since the current cycle cannot satisfy the quality requirement (c = 1 in this example). The maximum score for the state S_1 is 0 in the Q-table. Hence, we get Q[S_0, A_3] = −1 + 0 = −1 (Q-table: t_1 in Fig. 5). Similarly, under S_1, we choose A_5. If these selections satisfy the quality, the current reward is R − c = 4 (R is set to 5, i.e., the total number of cells). Also, the maximum possible reward of the next state S_2 is 0 in the current Q-table. Then we update Q[S_1, A_5] = 5 − 1 + 0 = 4 (Q-table: t_2 in Fig. 5). After some rounds, we have met S_0 many times and found that selecting other actions under S_0 is perhaps not a good choice, and the Q-table has changed to Q-table: t_k in Fig. 5. This time, under S_0, we check the Q-table and find that A_3 has the largest value, so we choose and perform A_3. Then, we update Q[S_0, A_3] = −1 + 4 = 3, since the maximum reward score of the next state S_1 is 4 (Q-table: t_{k+1} in Fig. 5). Therefore, the next time we meet S_0 again, we will probably choose the action A_3, since it has the largest reward score, which means that under the state S_0, the action A_3 gives us the most return.

The tabular Q-learning based algorithm can work well for an MCS task in a target area including a small number of cells, as shown in the above example, while practical MCS applications usually contain a large number of cells. Suppose there are 50 cells in the target area and we only consider the recent 2 cycles to model the state; then the state space becomes extremely huge, |S| = 2^{2×50} = 2^{100}, which is intractable in practice. Moreover, if we add the last-time selection and some necessary additional information to give a more comprehensive representation of the state, the state space will be even larger, which is known as the "curse of dimensionality". To overcome this difficulty, in the next subsection we propose to combine deep learning with reinforcement learning to train the decision function for cell selection in Sparse MCS.

4.3. Training the Q-function with deep reinforcement learning

4.3.1. Deep Q-network

To overcome the problem incurred by the extremely large state space in cell selection, we turn to the Deep Q-Network (DQN), which combines Q-learning with neural networks. The difference between DQN and tabular Q-learning is that a neural network is used instead of the Q-table to deal with the curse of dimensionality. In DQN, we do not need Q-table lookups, but calculate Q(S, A) for each state-action pair. More specifically, the DQN takes the current state and action as input and uses a neural network to obtain an estimated value of Q(S, A), shown as

\[
Q(S, A) = \mathbb{E}\left[ R + \gamma \max_{A'} Q(S', A') \right]. \tag{9}
\]

In DQN, how to design the network structure impacts the effectiveness of the learned Q-function. One common way is to use dense layers to connect the input (state) and output (a reward score vector over all possible actions). Actually, a network structure with dense layers is appropriate for cell selection: it can handle heterogeneous inputs (consisting of the recent-cycle selection, the last-time selection, and the time) and catch the comprehensive correlations in our state. Thus, we use a neural network parameterized by θ to calculate the Q-function, which consists of two fully connected layers.² The state is fed into the fully connected layers and a linear layer outputs the Q-values for all possible actions.

The DQN-based cell selection algorithm, i.e., D-Cell, is summarized in Algorithm 2.

Algorithm 2 DQN/DRQN-based cell selection.
Initialization: t = 0, D = ∅, L, T, A_c; initialize the DQN/DRQN with random weights θ
1: while True do
2:   S = [s_{−k}, ..., s_{−1}, s_0, L, T]
3:   Update A_c, in which cells can be sensed in the current sensing cycle and have not been selected.
4:   Calculate the Q-values by the DQN/DRQN with θ_t via (9); select A from A_c with the ε-greedy algorithm.
5:   if the (e, p)-quality is satisfied then
6:     // Next cycle
7:     s_1 = 0_{m×1}, S′ = [s_{−k+1}, ..., s_0, s_1, L, T]
8:     R = R − c
9:   else
10:    s_0 = s_0 + [0, ..., 0, 1, 0, ..., 0]^T (the 1 is in the A-th element), S′ = [s_{−k}, ..., s_{−1}, s_0, L, T]
11:    R = −c
12:  end if
13:  e_t = ⟨S, A, R, S′⟩, D = D ∪ {e_t}
14:  Randomly select some experiences e from D
15:  Update θ_t via (12)/(14)
16:  t ← t + 1
17:  if t mod REPLACE_ITER == 0 then
18:    θ⁻ = θ_t
19:  end if
20: end while

Same as in Q-learning, we first update the current state S. The state S is fed into the neural network to obtain the Q-values. Then, we update the candidate action set A_c and select the action from A_c with the ε-greedy algorithm, which is also used in Q-learning to balance exploration and exploitation. To obtain an estimate of the Q-value which approximates the expectation in (9), our proposed DQN-based algorithm uses the experience replay technique [11]. After one selection, we obtain the experience at the current time step t, denoted as e_t = ⟨S, A, R, S′⟩, and the memory pool is D = {e_1, e_2, ..., e_t}. The algorithm then randomly chooses part of the experiences to learn and update the network parameters θ. The goal is to calculate the best θ to obtain Q_θ ≈ Q. The stochastic gradient algorithm is applied with the learning rate α, and the loss function is defined as follows:

\[
L(\theta_t) = \mathbb{E}_{S,A,R,S'}\left[ \left( R + \gamma \max_{A'} Q_{\theta_t}(S', A') - Q_{\theta_t}(S, A) \right)^2 \right]. \tag{10}
\]

Thus,

\[
\nabla_{\theta_t} L(\theta_t) = \mathbb{E}_{S,A,R,S'}\left[ \left( R + \gamma \max_{A'} Q_{\theta_t}(S', A') - Q_{\theta_t}(S, A) \right) \nabla_{\theta_t} Q_{\theta_t}(S, A) \right]. \tag{11}
\]

For each update, D-Cell randomly chooses part of the experiences from D, then calculates and updates the network parameters θ. Moreover, to avoid oscillations (i.e., the Q-function changing quite rapidly during training), we apply the fixed Q-targets technique [11]. More specifically, we do not always use the latest network parameters θ_t to calculate the maximum possible reward of the next state (i.e., max_{A'} Q(S', A')), but update the corresponding target parameters θ⁻ only every few iterations, i.e.,

\[
\nabla_{\theta_t} L(\theta_t) = \mathbb{E}_{S,A,R,S'}\left[ \left( R + \gamma \max_{A'} Q_{\theta^-}(S', A') - Q_{\theta_t}(S, A) \right) \nabla_{\theta_t} Q_{\theta_t}(S, A) \right]. \tag{12}
\]

² How to design the network structure is an important research problem, but it is not the main concern of this paper. Other network structures could be adapted for cell selection to deal with the heterogeneous inputs. For example, we could use a convolutional neural network to pretrain the recent-cycle selection; then its output and the rest of the state would be fed into the fully connected layers.
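The following is a minimal tf.keras sketch of the D-Cell ingredients described above (a two-dense-layer Q-network, an experience replay pool, and a periodically copied target network for the fixed Q-targets of Eq. (12)); the layer sizes, optimizer, and batch size are illustrative choices, not the paper's reported configuration.

```python
import random
import numpy as np
import tensorflow as tf

def build_q_network(state_dim, num_cells):
    """Two fully connected layers followed by a linear layer of Q-values (one per cell)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(num_cells)])           # linear output: Q(S, .)

q_net = build_q_network(state_dim=16, num_cells=5)
target_net = build_q_network(state_dim=16, num_cells=5)
target_net.set_weights(q_net.get_weights())          # fixed Q-targets (theta^-)
q_net.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')

replay = []   # memory pool D of experiences (S, A, R, S')

def train_step(batch_size=32, gamma=0.9):
    """One stochastic gradient update on the squared TD error of Eqs. (10)-(12)."""
    if not replay:
        return
    batch = random.sample(replay, min(batch_size, len(replay)))
    S = np.array([b[0] for b in batch])
    S_next = np.array([b[3] for b in batch])
    q = q_net.predict(S, verbose=0)
    q_next = target_net.predict(S_next, verbose=0)   # bootstrapped with theta^-
    for i, (_, a, r, _) in enumerate(batch):
        q[i, a] = r + gamma * q_next[i].max()        # regression target for the taken action
    q_net.fit(S, q, verbose=0)
```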

4.3.2. Deep recurrent Q-network

In the DQN-based cell selection, we use a neural network with two dense layers to catch the correlations in our state. However, temporal correlations also exist across our states, and the DQN only looks at a single state and thus cannot catch the temporal patterns well. Moreover, real-world tasks often feature incomplete and noisy state information resulting from partial observability, leading to a decline in the DQN's performance. We thus propose to use an LSTM (Long Short-Term Memory) layer instead of a dense layer in the DQN, so as to catch the temporal patterns in our states and handle partial observability; this is also called a Deep Recurrent Q-Network (DRQN) [6]. More specifically, in DRQN-based cell selection, the Q-function can be defined as

\[
Q_{\theta_t}(S, H_{t-1}, A), \tag{13}
\]

where H_{t−1} is the extra input returned by the LSTM network from the previous time step t − 1. Same as for D-Cell, the loss gradient is defined as follows:

\[
\nabla_{\theta_t} L(\theta_t) = \mathbb{E}_{S,A,R,S'}\left[ \left( R + \gamma \max_{A'} Q_{\theta^-}(S', H_{t-1}, A') - Q_{\theta_t}(S, H_{t-1}, A) \right) \nabla_{\theta_t} Q_{\theta_t}(S, H_{t-1}, A) \right]. \tag{14}
\]

Different from DQN, DRQN uses an LSTM layer instead of the first fully connected layer. Sequential states S_{t−k}, ..., S_{t−1}, and S_t are processed through time by the LSTM layer, and the Q-values are output after the last fully connected layer. Note that we use the LSTM layer to train our network to understand temporal dependencies, so we cannot randomly choose experiences from D as in DQN. Hence, we randomly choose traces of experiences of a given length, e.g., two traces of two consecutive experiences, such as ⟨e_1, e_2⟩ and ⟨e_9, e_10⟩. Despite the changes in the neural network, the DRQN-based algorithm, i.e., DR-Cell, is almost the same as D-Cell, and we summarize the two algorithms together in Algorithm 2.
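A sketch of the DR-Cell network: the first dense layer is swapped for an LSTM that consumes a short trace of consecutive states (trace length and layer sizes are illustrative):

```python
import tensorflow as tf

def build_drqn(trace_len, state_dim, num_cells):
    """LSTM over a trace of consecutive states, then a linear layer of Q-values."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(trace_len, state_dim)),
        tf.keras.layers.LSTM(128),        # replaces the first dense layer of the DQN
        tf.keras.layers.Dense(num_cells)])

# trained like the DQN above, except that replay samples are short traces of
# consecutive experiences (e.g., pairs) rather than independent transitions
drqn = build_drqn(trace_len=2, state_dim=16, num_cells=5)
```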

4.4. Training data and transfer learning

With deep reinforcement learning, we can get the Q-function that outputs reward scores for all the possible actions under a certain state; we can then choose the cell that has the largest score in cell selection.

Table 1. Statistics of the three datasets.

              Sensor-Scope            U-Air            TaxiSpeed
City          Lausanne                Beijing          Beijing
Data          temperature, humidity   PM2.5            traffic speed
Cell size     50 × 30 m²              1000 × 1000 m²   road segment
Cell number   57                      36               118
Cycle length  0.5 h                   1 h              0.5 h
Duration      7 days                  11 days          4 days
Mean ± Std.   6.04 ± 1.87 °C,         79.11 ± 81.21    13.01 ± 6.97 m/s
              84.52 ± 6.32 %

Obviously, the Q-function learning algorithms mentioned in the previous sections may need a large amount of training data, while in MCS we cannot have unlimited historical data for training. Can we then reduce the amount of training data under certain circumstances?

The easiest way to deal with this problem is to use a small amount of historical data to construct an effective training set by random combination. The historical data can be obtained by a preliminary study on the target sensing area, i.e., collecting data from some cells for a short time before running. We randomly combine the sensed cells from the same cycles and obtain many experiences, i.e., e_t = ⟨S, A, R, S′⟩, to train our model. Note that we would like to select some redundant cells for each cycle; as an extreme example, we collect data from all the cells for a short time. We use these redundant data to construct various combinations of selected cells in one cycle which can satisfy our (e, p)-quality, which ensures the effectiveness of training.

In a practical application, we do not need to collect data from many cells, since the efficient cells under a certain state are finite, and thus the effective combinations which can satisfy the quality are limited. Therefore, we can select³ a small amount of redundant cells to construct a smaller but effective training set, which contains enough experiences for training. The experiments in Section 5.4 show that our method can collect a small amount of redundant data to train the Q-function and still achieve a good enough performance. However, this method still requires a preliminary study, and too much training on the small amount of data may lead to a local optimum.

Moreover, in a practical application, we further consider periodic retraining as a supplement to our system. On the one hand, periodic retraining makes the system better able to deal with environment changes. On the other hand, the system has collected more data after running for a period of time, which can be used in the new training and further improve the performance of reinforcement learning. Note that the periodic retraining can be conducted in an offline manner, without affecting the availability of the online system. Moreover, our proposed transfer learning/fine-tuning techniques can be used to significantly reduce the retraining cost.

In order to make better use of the well-trained Q-function and further reduce the amount of training data, we introduce the transfer learning technique into our problem. In reality, many types of data have inter-data correlations, e.g., temperature and humidity [23]. Then, if there are multiple correlated sensing tasks in a target area, the cell selection strategy learned for one task can probably benefit another task. With this intuition, we present a transfer learning method for learning the Q-function of an MCS task (target task) with the help of the cell selection strategy learned from another correlated task (source task). We assume that the source task has adequate training data, while the target task has only a little training data. Inspired by the fine-tuning techniques widely used in image processing with deep neural networks, for training the Q-function of the target task, we initialize the parameters of its DRQN to the parameter values of the source task's DRQN (learned from the adequate training data of the source task). Then, we use the limited training data of the target task to continue the DRQN learning process (Algorithm 2). In this way, we can make use of the well-trained Q-function and reduce the amount of training data required to obtain a good cell selection strategy for the target task.

³ Without loss of generality, we randomly select cells for each cycle to collect data for training. Actually, the reinforcement learning-based algorithms also randomly select cells in the early stages, as discussed in the previous sections.
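The fine-tuning step itself reduces to a weight copy followed by further training on the target task; a sketch, reusing a DRQN builder like the one in the previous subsection, with illustrative sizes:

```python
import tensorflow as tf

def build_drqn(trace_len, state_dim, num_cells):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(trace_len, state_dim)),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(num_cells)])

TRACE_LEN, STATE_DIM, NUM_CELLS = 2, 343, 57          # illustrative sizes
source_drqn = build_drqn(TRACE_LEN, STATE_DIM, NUM_CELLS)   # e.g., temperature (source task)
target_drqn = build_drqn(TRACE_LEN, STATE_DIM, NUM_CELLS)   # e.g., humidity (target task)
# ... train source_drqn on the adequate source-task data ...
target_drqn.set_weights(source_drqn.get_weights())    # fine-tuning initialization
# then continue the DRQN training of Algorithm 2 on the small target-task dataset
```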

5. Evaluation

In this section, we conduct extensive experiments based on three real-world datasets, which contain various types of sensed data, including temperature, humidity, air quality, and traffic speed.

5.1. Datasets

We adopt three real-life datasets, Sensor-Scope [7], U-Air [30], and TaxiSpeed [14], to evaluate the performance of our proposed cell selection algorithms D-Cell and DR-Cell. These three datasets contain various types of sensed data, including temperature, humidity, air quality, and traffic speed. The detailed settings of the three datasets are shown in Table 1. Although the sensed data in the three datasets are collected from static sensors or stations, mobile devices can also be used to obtain them (as in [2,5]). Thus, we can treat them as data sensed by smartphones and use these datasets in our experiments to show the effectiveness of our algorithms.

Sensor-Scope [7]: The Sensor-Scope dataset contains the temperature and humidity readings for 7 days collected from the EPFL campus, an area of about 500 m × 300 m. This target area is divided into 100 cells with a size of 50 m × 30 m. The average temperature/humidity readings and their distributions are shown in Fig. 6. Since only 57 out of these 100 cells are deployed with valid sensors, we use the sensed data at these 57 cells to evaluate our algorithms. The inference error is measured by mean absolute error.

Fig. 6. The average temperature/humidity readings and their distributions in Sensor-Scope.

U-Air [30]: The U-Air dataset collected the air quality data for 11 days in Beijing from existing monitoring stations. Same as [30], we split Beijing into cells of 1 km × 1 km each, which gives 36 cells with sensed air quality readings. With this dataset, we conduct the experiment of PM2.5 sensing and try to infer the air quality index category⁴ of unsensed cells. The inference error is measured by classification error.

TaxiSpeed [14]: The TaxiSpeed dataset contains the speed information over 4 days for road segments in Beijing. The dataset has more than 33,000 trajectories collected by GPS on taxis. Same as [31], we consider the road segments as the cells, and 118 road segments with valid sensed values are selected to evaluate our algorithms. The inference error is measured by mean absolute error.

⁴ Six categories [30]: Good (0-50), Moderate (51-100), Unhealthy for Sensitive Groups (101-150), Unhealthy (150-200), Very Unhealthy (201-300), and Hazardous (> 300).

5.2. Baseline algorithms

We compare D-Cell and DR-Cell to two existing methods: QBC and RANDOM.

QBC: Existing works on Sparse MCS mainly leverage Query by Committee in cell selection [20,23]. QBC selects the salient cell determined by the "committee" to allocate the next task. More specifically, QBC uses several different data inference algorithms (such as compressive sensing and K-Nearest Neighbors) to infer the full sensing matrix. Then, it chooses the cell where the inferred data of the various algorithms has the largest variance as the next selection for sensing.

RANDOM: In each sensing cycle, RANDOM randomly selects cells one by one until the selected cells can ensure a satisfying inference accuracy. Note that RANDOM actually achieves a competitive performance, since random selection already provides a lot of information to a powerful inference technique such as compressive sensing. Hence, we consider RANDOM suitable as a baseline.

5.3. Experiment process

To train our proposed reinforcement learning-based algorithms, we use the first 10 h to 2 days of data of each dataset to train the Q-function, i.e., we suppose that the MCS organizers conduct a 10 h to 2 day preliminary study to collect data from the cells. Then we train our Q-function on the training dataset by constructing various experiences until the Q-function converges. We also vary the proportion of selected cells for each cycle, in order to show that a small amount of data can be turned into an effective training set without loss of performance. Besides, we also conduct some experiments to evaluate our state and reward settings. We set the discount factor γ = 0.9 and the learning rate α = 0.05 in Eq. (7) and dynamically adjust ε from 1 to 0.1 over the whole training process.

After the training stage, we obtain the well-trained Q-function and enter the running stage. For each sensing cycle, we use the proposed cell selection algorithms to select cells for sensing until the selected cells satisfy the (e, p)-quality. Note that satisfying (e, p)-quality means that in p · 100% of cycles the inference error is not larger than e, which is practical in real-world applications. Here, p should be set to a large value such as 0.9 or 0.95, and we set e to a small value according to the sensing task, such as 0.25 °C for temperature. These large p and small e build up a more reasonable and realistic scenario for Sparse MCS and allow us to evaluate the effectiveness of our proposed algorithms well. Thus, our objective is to select as few cells as possible under the quality guarantee, and we compare the number of cells selected by D-Cell, DR-Cell, and the baseline methods to verify the effectiveness of our proposed reinforcement learning-based algorithms.

5.4. Experiment results

We evaluate the performance by using the temperature and humidity data in Sensor-Scope, the PM2.5 data in U-Air, and the traffic speed data in TaxiSpeed, respectively. Without loss of generality, we first evaluate the performance without considering (e, p)-quality. We compare our inferred values with the real values to obtain the average inference error, while changing the number of selected cells for each cycle. As shown in Fig. 7, the results show similar tendencies over the four types of sensing tasks. Along with the increase of the number of selected cells, the average errors become smaller, since more selected cells provide more information to help the data inference. Our proposed DR-Cell and D-Cell achieve better performance than the other baseline algorithms, especially when the number of selected cells is small, which proves the effectiveness of our algorithms. Next, we evaluate and discuss the performance of our cell selection algorithms considering (e, p)-quality, which is practical in real-world applications.

5.4.1. Number of selected cells

We consider the recent 5 cycles (the last 4 cycles and the current cycle), the last-time selection, and the time as our state. The results are shown in Fig. 8 and Table 2.

Fig. 7. Average inference error for the temperature, humidity, PM2.5, and traffic speed sensing tasks.

Fig. 8. Number of selected cells for the temperature, humidity, PM2.5, and traffic speed sensing tasks.

For the temperature in Sensor-Scope, we set the error bound e to 0.25 °C or 0.3 °C and p to 0.9 or 0.95 as the predefined (e, p)-quality. Thus, the quality requirement in this scenario is that the inference error is smaller than 0.25 °C or 0.3 °C for around 90% or 95% of cycles. The average numbers of selected cells per sensing cycle are shown in Fig. 8(a) and (b), where DR-Cell and D-Cell always outperform the two baseline methods. Specifically, when (e, p) = (0.25 °C, 0.9), DR-Cell and D-Cell select 16.8% and 9.7% fewer cells than QBC, and 21.3% and 14.6% fewer cells than RANDOM. In general, DR-Cell only needs to select 11.93 out of 57 cells per sensing cycle while ensuring the inference error is below 0.25 °C in 90% of cycles. When we raise the quality requirement to p = 0.95, DR-Cell and D-Cell need to select more cells to satisfy the higher requirement. In particular, DR-Cell and D-Cell select 14.93 and 15.93 out of 57 cells under the (0.25 °C, 0.95)-quality and achieve better performance by selecting 12.3%/6.4% and 16.9%/11.3% fewer cells than QBC and RANDOM, respectively. When we relax the error bound to e = 0.3 °C, DR-Cell and D-Cell need to select fewer cells since we have a lower quality requirement. Here, DR-Cell and D-Cell have closer performance, and the number of sensed cells is reduced by 10.0% to 16.2%. For the humidity in Sensor-Scope, a similar tendency is observed in Fig. 8(c) and (d), with the quality requirement set as (1.5%/2.0%, 0.9/0.95). Note that DR-Cell and D-Cell achieve better performance than QBC and RANDOM, and DR-Cell performs better than D-Cell, since it better captures the temporal patterns and handles partial observability in the humidity data of Sensor-Scope.

For the other two scenarios, i.e., PM2.5 in U-Air and traffic speed in TaxiSpeed, we get similar observations, as shown in Fig. 8(e)-(h). For the PM2.5 scenario, we set e as 6/36 or 9/36 and p as 0.9 or 0.95. When e is 6/36 and p is 0.9/0.95, DR-Cell selects 13.9/16.7 out of 36 cells and reduces the selected cells by 8.8%/5.8% and 10.3%/6.8% compared with QBC and RANDOM, respectively. When e is 9/36, the number of sensed cells is reduced by 8.7% to 18.0%. For traffic speed, we set e as 2 m/s or 2.5 m/s and achieve a reduction of 6.4% to 20.0%. Note that D-Cell may underperform since traffic speed has such a strong correlation with time, which is better processed by our DR-Cell.

Table 2. Proportion of the cycles which satisfy the (e, p)-quality.

Temperature
(e, p)            D-Cell  DR-Cell  QBC
(0.25 °C, 0.9)    0.906   0.892    0.919
(0.25 °C, 0.95)   0.957   0.948    0.965
(0.30 °C, 0.9)    0.910   0.904    0.948
(0.30 °C, 0.95)   0.976   0.957    0.974

Humidity
(e, p)            D-Cell  DR-Cell  QBC
(1.5%, 0.9)       0.861   0.879    0.896
(1.5%, 0.95)      0.933   0.957    0.940
(2.0%, 0.9)       0.926   0.901    0.956
(2.0%, 0.95)      0.969   0.961    0.975

PM2.5
(e, p)            D-Cell  DR-Cell  QBC
(6/36, 0.9)       0.901   0.896    0.930
(6/36, 0.95)      0.951   0.957    0.961
(9/36, 0.9)       0.918   0.909    0.925
(9/36, 0.95)      0.968   0.944    0.950

Traffic Speed
(e, p)            D-Cell  DR-Cell  QBC
(2.0 m/s, 0.9)    0.886   0.861    0.895
(2.0 m/s, 0.95)   0.928   0.935    0.977
(2.5 m/s, 0.9)    0.852   0.883    0.906
(2.5 m/s, 0.95)   0.940   0.947    0.987

Table 2 shows the actual proportion of the cycles which satisfy the (e, p)-quality. We see that most of the values in the table are larger than the predefined p, which means that our proposed DR-Cell and D-Cell provide accurate inferences most of the time. Note that some results are slightly less than the predefined p, since the compressive sensing and Bayesian inference in our algorithms have intrinsic probabilistic characteristics and cause some minor errors, which is within the acceptable range. Based on these results, we can say that our proposed algorithms achieve a satisfactory performance.

Fig. 9. State, reward, and training data for the temperature and humidity sensing tasks (e = 0.25 °C / 1.5%, p = 0.9).

5.4.2. State and reward

We then evaluate the state and reward settings in the reinforcement learning based cell selection, i.e., DR-Cell. We conduct experiments on two MCS scenarios, i.e., temperature and humidity monitoring. The state in our work consists of the recent-cycle selection, the last-time selection, and the time. Since the recent-cycle selection makes up the largest percentage and has the greatest impact on the next cell selection, we vary it over the last 3-6 cycles while keeping the other parts of the state fixed, as shown in Fig. 9(a). We can see that when we keep the recent 4 or 5 cycles, our algorithms achieve better performance, while fewer or more cycles (3 or 6) reduce it. This is probably due to the fact that more cycles kept in the state provide too much information of low value, which may disturb the outcome.

For the rewards, we would like to illustrate that different values of R and c do not influence the performance after the Q-function has been well trained. In this paper, we consider all the costs c to be the same and set the cost c to 1, without loss of generality. Note that the case where the data collection costs of different cells are diverse could be considered in future work by providing a more complex reward function. We vary R from 5 to 25, as shown in Fig. 9(b); the performances under different R are very close if the Q-function has been trained enough, and the small changes are most likely due to the randomness in our experiments.

5.4.3. Training data

Fig. 9(c) and (d) illustrate that we can use a small amount of training data to train our Q-function while keeping a good enough performance. We first study how the number of cycles used for training impacts the evaluation results. We collect data from all the cells and vary the number of cycles from 20 to 100, i.e., conduct a 10 h to 2 day (50 h) preliminary study in the temperature and humidity monitoring tasks. As shown in Fig. 9(c), the reinforcement learning-based algorithm achieves better performance with the increase in the number of cycles. When we have enough cycles for training, i.e., 80-100 cycles, the performances are very close. The reason could be that our proposed algorithms capture the temporal correlation well by using 2 days of training data, while using only 10 h of data does not perform as well. We then use the 2 day data but randomly select part of the cells in each cycle to construct the training set. The results are shown in Fig. 9(d). The numbers of selected cells increase as the proportion of collected data per cycle used for training is reduced, since less collected data cannot produce a comprehensive training set. However, the performance when using only part of the training data is still good enough. DR-Cell using 20% of the training data still achieves better performance by selecting 8.8%/15.0% and 14.3%/14.9% fewer cells than QBC/RANDOM on the temperature and humidity tasks, respectively.

5.4.4. Transferlearning

Wethen conduct theexperimentson themulti-task MCS sce-nario, i.e., temperature-humidity monitoring, in Sensor-Scope to verifythe transfer learningperformance. We use DR-Cellto con-duct 2-way experiments, i.e.temperature as the source taskand humidityasthe target task;andvice versa.More specifically,for the source task,we still suppose that we obtain 2 day data for training;butforthe targettask,we suppose that weonly obtain 10 cycles(i.e.,5 h)of trainingdata.Moreover, we add two com-paredmethodsto verifytheeffectivenessofourtransferlearning method: NO-TRANSFER and SHORT-TRAIN. NO-TRANSFER is the methodthatdirectlyusestheQ-functionofthesourcetasktothe targettask,andSHORT-TRAINmeansthatthetargettaskmodelis onlytrainedonthe10-cycletrainingdata.


Fig. 10. Number of selected cells for temperature and humidity sensing tasks (transfer learning).

The quality requirement of temperature is (0.25 °C, 0.9)-quality and that of humidity is (1.5%, 0.9)-quality. Fig. 10 shows the average numbers of selected cells. When temperature is the target task, TRANSFER achieves better performance by reducing the number of selected cells by 5.0%, 6.0%, and 6.4% compared with NO-TRANSFER, SHORT-TRAIN, and RANDOM, respectively. When humidity is the target task, similarly, TRANSFER selects 4.0%, 5.0%, and 3.4% fewer cells than NO-TRANSFER, SHORT-TRAIN, and RANDOM, respectively. Note that NO-TRANSFER and SHORT-TRAIN even perform worse than RANDOM in this case, which emphasizes the importance of having an adequate amount of training data for DR-Cell. By using transfer learning, we can significantly reduce the training data required for learning a good Q-function in DR-Cell, and thus further reduce the data collection costs of MCS organizers.

5.4.5. Computation time

Finally, we report the computation time of DR-Cell. Our experiment platform is equipped with an Intel Xeon CPU E2630 v4 @ 2.20 GHz and 32 GB RAM. We implement our D-Cell and DR-Cell training algorithms in TensorFlow (CPU version). In our experiment scenarios, training consumes around 2–4 h, which is totally acceptable in real-life deployments as the training is an off-line process. Table 3 shows the running time of the online process, i.e., the testing stage in our experiments. Compared with 'Cell Selection', 'Quality Assessment' costs the most, since it needs to run 'Data Inference' several times to estimate the current quality by leave-one-out based Bayesian inference. In 'Cell Selection', although our algorithms need off-line training, DR-Cell and D-Cell only need very little time (∼0.002 s) to decide the next selected cell during the online processing, while QBC needs ∼1 s since it has to run various inference algorithms. We believe that it is worthwhile to conduct a ∼4 h off-line training in order to achieve a faster and more efficient cell selection strategy.
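To illustrate why 'Quality Assessment' dominates the online runtime, the leave-one-out procedure behind it looks roughly like the following, where `infer_missing` is a placeholder for the data inference routine; in the paper the resulting errors further feed a Bayesian estimate of the quality probability, which this sketch omits.

```python
import numpy as np

def leave_one_out_error(collected, infer_missing):
    """Estimate inference quality from the collected data only (illustrative sketch).

    collected:     dict mapping cell id -> sensed value in the current cycle.
    infer_missing: function(known: dict, cell_id) -> inferred value, assumed to
                   wrap the data inference step used in the paper.
    The data inference runs once per collected cell, which is why this stage
    is the most expensive part of the online process in Table 3.
    """
    errors = []
    for cell, true_value in collected.items():
        known = {c: v for c, v in collected.items() if c != cell}
        errors.append(abs(infer_missing(known, cell) - true_value))
    return float(np.mean(errors)) if errors else float("inf")
```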

Table 3
Runtime for each stage.

                     Temperature   Humidity   PM2.5      Traffic Speed
Data Inference       0.49 s        0.50 s     0.35 s     0.97 s
Quality Assessment   4.43 s        4.46 s     4.75 s     8.01 s
DR-Cell              0.0015 s      0.0016 s   0.0007 s   0.0026 s
D-Cell               0.0014 s      0.0018 s   0.0009 s   0.0028 s
QBC                  1.04 s        1.18 s     0.91 s     1.39 s

6. Conclusion

In this paper, we propose novel reinforcement learning-based cell selection algorithms to improve the cell selection efficiency in Sparse MCS. First, we model the state, reward, and action for cell selection and propose a Q-learning based cell selection algorithm. To deal with the large state space, we use a neural network to replace the Q-table, which yields the DQN-based cell selection algorithm, and then modify the DQN with LSTM to catch the temporal patterns in our state and handle partial observability. Furthermore, we collect a small amount of redundant data to conduct effective training by random combination and propose a transfer learning method to relieve the dependence on a large amount of training data. Extensive experiments verify the effectiveness of our proposed algorithms in reducing the data collection costs. In future work, we would like to study how to conduct the reinforcement learning-based cell selection in a completely online manner, so that we no longer need a preliminary study stage for collecting the training data.

Declaration of Competing Interest

None.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant No. 61772230 and Natural Science Foundation of China for Young Scholars No. 61702215, Chinese Scholarship Council No. 201706170165, and China Postdoctoral Science Foundation No. 2017M611322 and No. 2018T110247. This work is supported in part by the NSFC under Grant No. 61572048 and 71601106, Hong Kong ITF Grant No. ITS/391/15FX.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.comnet.2019.06.010.

