• Aucun résultat trouvé

Temporal Variation of Terms as Concept Space for Early Risk Prediction

N/A
N/A
Protected

Academic year: 2022

Partager "Temporal Variation of Terms as Concept Space for Early Risk Prediction"

Copied!
12
0
0

Texte intégral

(1)

onept spae for early risk predition

MareloL.Errealde

1

, Ma.PaulaVillegas

1

,DarioG.Funez

1

,

Ma.JoséGariarenaUelay

1

,andLetiia C.Cagnina

1,2

1

LIDICResearhGroup,UniversidadNaionaldeSanLuis,Argentina

2

ConsejoNaionaldeInvestigaionesCientíasyTénias(CONICET)

{merrealde,villegasmariapaula 74,f unezd ario} gma il.o m

{mjgariarenauelay,lagnina} gmai l.o m

Abstrat. Early risk predition involvesthree dierent aspets to be

onsideredwhenanautomatilassierisimplementedfor thistask:a)

supportfor lassiation with partial information read up to dierent

timesteps, b) supportfor dealing with unbalaneddata sets and )a

poliytodeidewhenadoumentouldbelassiedasbelongingtothe

relevantlasswithareasonableondene.Inthispaperweproposean

approahthatnaturallyopeswiththersttwoaspetsandshowsgood

perspetives to deal with the last one. Our proposal, named temporal

variationofterms (TVT)isbasedonusingthevariationofvoabulary

along the dierent timesteps as onept spae to represent the dou-

ments.ResultswiththeeRisk2017datasetshowabetterperformane

ofTVTinomparison toothersuessfulsemanti analysisapproahes

andthestandardBOWrepresentation.Besides,italsoreahesthebest

reportedresultsuptothemomentfor

ERDE 5

and

ERDE 50

erroreval-

uationmeasures.

Keywords:Early Risk Detetion, Unbalaned Data Sets, Text Repre-

sentations,SemantiAnalysisTehniques.

1 Introdution

Early risk detetion (ERD) is a new researh area potentially appliable to a

widevarietyofsituationssuhasdetetionofpotentialpaedophiles,peoplewith

suiidal inlinations, or people suseptible to depression, among others. In a

ERDsenario,dataaresequentiallyreadasastreamandthehallengeonsists

in deteting risk ases assoon as possible. A usual situation in these ases is

thatthetargetlass(theriskyone)islearlyunder-sampledwithrespettothe

ontrollass(thenon-riskyone).Thatunequaldistributionbetweenthepositive

(minority)lassandthenegativeone,isawell-knownprobleminategorization

tasksandpopularlyreferredasunbalaneddata sets (UDS).

Besides dealing with the UDS problem, an ERD system needs to onsider

theproblemof assigningalasstodoumentswhenonlypartial informationis

available. A doument is proessedas a sequene of terms, and the goalis to

(2)

withpartialinformation (CPI)mightbeaddressedwithasimpleapproahthat

onsistsintrainingwithompletedoumentsasusualandonsideringthepartial

doumentsreaduptothelassiationpointasstandardompletedouments.

In[3℄ theCPI aspet wasonsidered byanalysing therobustnessof theNaïve

Bayesalgorithmtodealwithpartialinformation.

Last,but notleast, anERD system needsto onsider notonlywhih lass

shouldbeassignedtoadoument,but alsodeidingwhen to makethat assign-

ment. Thisaspet, that we willreferasthelassiation time deision (CTD)

issuehasbeenaddressedwithverysimpleheuristirules 3

althoughmoreelabo-

ratedapproahesmightbeused.

Inthisartileweproposeanoriginalideathatexpliitlyonsidersthesequen-

tialityofdata todealwiththeunbalaneddatasets problem.Inanutshell, we

usethetemporalvariationoftermsasoneptspaeofareentoniseseman-

tianalysis(CSA)approah[7℄.CSA isaninterestingdoumentrepresentation

tehniquewhihmodelswordsanddoumentsinasmalloneptspae whose

onepts are obtainedfrom ategorylabels.CSA hasobtained good resultsin

author proling tasks [8℄and the variantproposed in this artile, named tem-

poral variation of terms (TVT), seemstoshowsomeinterestingharateristis

todealwiththeERDproblem.Infat,itobtainedarobustperformaneonthe

eRisk 2017 data set and reahed the best (lowest) reported results up to the

momentfor

ERDE 5

and

ERDE 50

errorevaluationmeasures.

The rest of this doument is organized as follows: Setion 2 desribes our

proposed method for the ERD problem. Setion 3 showsthe obtainedresults

withourmethod ontheeRisk2017dataset.Finally,Setion4depitspotential

future worksandtheobtainedonlusions.

2 The proposed method

Our method is based on the onise semanti analysis (CSA) tehnique pro-

posed in [7℄ and later extendedin [8℄for author proling tasks. Therefore, we

rstpresentinSubsetion2.1thekeyaspetsofCSA andthenexplaininSub-

setion2.2 howweinstantiateCSAwith oneptsderivedfrom thetermsused

in thetemporalhunksanalysedbyanERDsystematdierenttimesteps.

2.1 Conise Semanti Analysis

Standard text representation methods suh as Bag of Words (BoW) suer of

two well known drawbaks. First, their high dimensionality and sparsity; se-

ond,theydonotapturerelationshipsamongwords.CSAisasemantianalysis

tehniquethat aims at dealingwith those shortomingsby interpreting words

and douments in a spae of onepts. Dierently from other semanti analy-

sisapproahessuhas latentsemantianalysis (LSA)[2℄andexpliit semanti

3

Forinstane,exeedingaspeiondenethresholdinthepreditionofthelas-

(3)

wordsandtextfragmentsinaspaeofoneptsthat arelose(orequal)tothe

ategory labels. Forinstane, if douments in the data set are labeled with

q

dierent ategorylabels (usually nomorethan 100 elements),wordsand do-

umentswill berepresentedin a

q

-dimensionalspae. That spaesizeis usually muhsmallerthanstandardBoWrepresentationswhihdiretlydependonthe

voabularysize(morethan10000or20000elementsingeneral).

ToexplainthemainoneptsoftheCSAtehniquewerstintroduesome

basinotationthatwillbeusedintherestofthiswork.Let

D = {hd 1 , y 1 i, . . . , hd n , y n i}

beatrainingsetformedby

n

pairsofdouments(

d i

)andvariables(

y i

)thatindi-

atetheoneptthedoumentisassoiatedwith,

y i ∈ C

where

C = {c 1 , . . . , c q }

is theonept spae.Forthemoment, onsiderthat theseonepts orrespond

tostandardategorylabelsalthough,aswewillsee later,theymightrepresent

moreelaborate aspets. Inthis ontext, wewill denote as

V = {t 1 , . . . , t m }

to

thevoabularyoftermsoftheolletionbeinganalysed.

Representing terms in the onept spae In CSA, eah term

t i ∈ V

is

represented as a vetor

t i ∈ R q

,

t i = ht i,1 , . . . , t i,q i

. Here,

t i,j

represents the

degreeofassoiationbetweentheterm

t i

andtheonept

c j

anditsomputation

requiressomebasistepsthat areexplainedbelow.First,therawterm-onept

assoiation between the

i

th term and the

j

th onept, denoted

w ij

, will be

obtained.If

D c u ⊆ D

,

D c u = {d r | hd r , y s i ∈ D ∧ y s = c u }

is thesubset ofthe

traininginstaneswhoselabelistheonept

c u

,then

w ij

mightbedened as:

w ij = X

∀d k ∈D cj

log 2

1 + tf ik

len(d k )

(1)

where

tf ik

is the numberof ourrenesofthe term

t i

in thedoument

d k

and

len(d k )

isthelength(numberofterms) of

d k

.

Asnotedin [7℄and[8℄,diret useof

w ij

to representtermsinthevetor

t i

ouldbesensibleto highly unbalaneddata. Thus,somekindofnormalization

isusuallyrequiredand,inourase,weseletedtheoneproposedin [8℄:

t ij = w ij m

P

i=1

w ij

(2)

t ij = t ij

q

P

j=1

w ij

(3)

With this last onversion we nally obtain, for eah term

t i ∈ V

, a

q

-

dimensional vetor

t i

,

t i = ht i,1 , . . . , t i,q i

dened over a spae of

q

onepts.

Up to now, those onepts orrespond to the original ategories used to label

thedouments.Later,wewilluseothermoreelaboratedonepts.

Representingdoumentsin the onept spae Onethetermsarerepre-

sentedinthe

q

-dimensionaloneptspae,thosevetorsanbeusedtorepresent

(4)

portanefordierentdoumentssoitisnotagood ideaomputingthatvetor

for thedoument asthesimpleaverage ofall itsterm vetors.Previousworks

in BoW [6℄ haveonsidered dierentstatisti tehniques to weight theimpor-

taneof termsin adoument suh as

tf idf

,

tf ig

,

tf χ 2

or

tf rf

,among others.

Here, we will use theapproah used in [8℄ for author proling that represents

eahdoument

d k

asthe weightedaggregation oftherepresentations(vetors) oftermsthatitontains:

d k = X

t i ∈d k

tf ik

len(d k ) × t

i

(4)

Thus,doumentsarealsorepresentedina

q

-dimensionaloneptspae(i.e.,

d k ∈ R q

) whih is muh smaller in dimensionality than the one required by standardBoWapproahes(

q ≪ m

).

2.2 Temporal Variation ofTerms

InSubsetion2.1wesaidthattheoneptspae

C

usuallyorrespondsto stan-

dard ategory names used to label the training instanes in supervised text

ategorizationtasks.Inthis senario,that in[7℄isreferredasdiretderivation,

eah ategory label simply orresponds to aonept. However, in [7℄ also are

proposed other alternatives likesplit derivation and ombined derivation. The

formerusesthelow-levellabelsinhierarhialorporaandthelatterisbasedon

ombiningsemantiallyrelatedlabelsinauniqueonept.In[8℄thoseideasare

extendedby rstlustering eahategoryof theorporaand then using those

subgroups(sub-lusters)asnewoneptspae.

4

Asweansee,theommonideatoalltheaboveapproahesisthatoneaset

of doumentsis identied asbelonging to agroup/ategory, that ategoryan

beonsideredasaoneptand CSA anbeapplied in theusual way.We take

asimilar viewto those works by onsideringthat thepositive(minority)lass

in ERDproblemsanbeaugmentedwiththeoneptsderivedfrom thesetsof

partial doumentsread along the dierent time steps. In order to understand

this ideait is neessarytorst introdueasequentialwork sheme astheone

proposedin[9℄forresearhinERD systemsfordepressionases.

Following [9℄,wewill assume a orpusof douments written by

p

dierent

individuals (

{I 1 , . . . , I p }

). For eah individual

I l

(

l ∈ {1, . . . , p}

),the

n l

dou-

mentsthat he haswrittenare provided in hronologialorder (fromthe oldest

texttothemostreenttext):

D I l ,1 , D I l ,2 , . . . , D I l ,n l

.Inthisontext,giventhese

p

streamsofmessages,theERDsystemhastoproesseverysequeneofmessages

(inthehronologialordertheyareprodued)andtomakeabinarydeision(as

early aspossible) on whether ornotthe individualmight be apositiveaseof

depression.Evaluationmetrisonthistaskmustbetime-aware,soanearlyrisk

detetionerror(ERDE)isproposed.Thismetrinotonlytakesintoaountthe

4

(5)

makethedeision.

Inausualsupervisedtextategorizationtask,wewouldonlyhavetwoate-

gorylabels:positive (risk/depressivease)andnegative(non-risk/non-depressive

ase).ThatwouldonlygivetwooneptsforaCSArepresentation.However,in

ERD problems there is additional temporal information that ould be used to

obtainanimprovedoneptspae.Forinstane, thetrainingset ouldbesplit

in

h

hunks,

C ˆ 1 , C ˆ 2 , . . . , C ˆ h

,insuhawaythat

C ˆ 1

ontainstheoldestwritings

ofallusers(rst(100/

h

)%ofsubmittedpostsoromments),hunk

C ˆ 2

ontains

the seond oldest writings,and soforth. Eah hunk

C ˆ k

anbepartitioned in twosubsets

C ˆ k +

and

C ˆ k

,

C ˆ k = ˆ C k + S C ˆ k

where

C ˆ k +

ontainsthe positiveases

ofhunk

C ˆ k

and

C ˆ k

thenegativesonesofthis hunk.

Itisinterestingtonotethatweanalsoonsiderthedatasetsthatresultof

onatenatinghunks that are ontiguousin timeand using thenotation

C ˆ i−j

torefertothehunkobtainedfromonatenatingallthe(original)hunksfrom

the

i

thhunk to the

j

th hunk (inlusive).Thus,

C ˆ 1−h

willrepresent thedata

set withtheompletestreamsofmessagesofallthe

p

individuals.Inthisase,

C ˆ 1−h +

and

C ˆ 1−h

willhavetheobvioussemantispeiedabovefortheomplete

doumentsofthetrainingset.

The lassi way of onstruting alassier would be to take the omplete

doumentsofthe

p

individuals(

C ˆ 1−h

)and useanindutivelearningalgorithm

suhas SVMorNaïveBayestoobtainthatlassier. Aswementionedearlier,

anotherimportantaspetinEDSsystemsisthatthelassiationproblembeing

addressed is usually highly unbalaned (UDS problem). That is, the number

of douments of the majority/negative lass (non-depression) is signiantly

larger than that of the minority/positive lass (depression). More formally,

followingthepreviouslyspeiednotation

| C ˆ 1−h |≫| C ˆ 1−h + |

.

AnalternativetotrytoalleviatetheUDSproblemwouldbetoonsiderthat

theminoritylassisformednotonlybytheompletedoumentsoftheindividu-

alsbutalsobythepartialdoumentsobtainedinthedierenthunks.Following

the generalideasposed in CSA, weouldonsider that thepartial douments

readin thedierenthunksrepresenttemporal oneptsthatshouldbetaken

intoaount.Inthisontext,onemightthinkthatvariationsofthetermsusedin

thesedierentsequentialstagesofthedoumentsmayhaverelevantinformation

forthelassiationtask.With thisideainmind, themethodproposedin this

worknamedtemporalvariation ofterms(TVT)arises,whihonsistsinenrih-

ingthedoumentsoftheminoritylasswiththepartial doumentsread inthe

rsthunks.Thesersthunksoftheminoritylass,alongwiththeiromplete

douments,willbeonsideredasanewoneptspaeforaCSA method.

Therefore, in TVT we rst determine the number

f

of initial hunks that

willbeusedtoenrih theminority(positive) lass.Then,weusethedoument

sets

C ˆ 1 + , C ˆ 1−2 + , . . . , C ˆ 1−f +

and

C ˆ 1−h +

asonepts forthe positive lassand

C ˆ 1−h

for the negative lass. Finally, we represent terms as douments in this new

(

f + 2

)-dimensionalspaeusingtheCSA approahexplainedinSetion 2.1.

(6)

3.1 Data Set

OurapproahwastestedonthedatasetusedintheeRisk2017pilottask 5

and

desribed in [9℄. It is a olletion of writings (posts oromments) from a set

of Soial Mediausers. Therearetwoategories ofusers,depressed and non-

depressed and,foreahuser,theolletionontainsasequeneofwritings(in

hronologial order). For eahuser, theolletionof writingshasbeendivided

into 10 hunks. The rst hunk ontains the oldest 10% of the messages, the

seond hunkontainstheseondoldest 10%,and soforth.This olletionwas

splitintoatrainingandatestsetthatwewillreferas

T R DS

and

T E DS

respe-

tively.The(training)

T R DS

setontained486users(83positive,403negative)

and the (test)

T E DS

set ontained 401 users (52 positive, 349 negative). The

userslabeledaspositivearethosethathaveexpliitlymentionedthattheyhave

beendiagnosedwithdepression.

Thistask was dividedinto atrainingstage and atestingstage.In therst

one, the partiipating teams had aess to the

T R DS

set with all hunks of

all training users. They ould therefore tune their systems with the training

data. To reproduethe same onditions of the pilot task, weuse the training

set (

T R DS

) togenerate aneworpusdividedinto atrainingset (that we will

refer as

T R DS − train

) and a test set (named

T R DS − test

) with the same

ategories (depressed and non-depressed)for eah sequene of writings of the

usersintheolletion.Thosesets maintainedthesameproportionsof post per

userandwordsperuserasdesribedin[9℄.

T R DS −train

and

T R DS −test

were

generatedbyrandomlyseletingarounda70%ofwritingsfor therstone and

therest 30%fortheseond one.Thus,

T R DS − train

resultedin 351writings

(63positive,288negative)meanwhile

T R DS − test

ontains135individuals(20 positive,115negative).Inthe pilot taskthe olletionof writingswasdivided

into10hunks,sowemadethesamedivisionon

T R DS −train

and

T R DS −test

.

3.2 ExperimentalResults

WereproduedthesameonditionsfaedbythepartiipantsoftheeRiskpilot

task, sowerstworkedonthedataset releasedonthetrainingstage (

T R DS

)

andthen,theobtainedmodelsweretestedontheteststage(

T E DS

).Theativ-

itiesarriedoutateahstagearedesribedbelow.

Training stage CSA is a doument representation that aims at addressing

somedrawbaksof lassialrepresentations suh as BoW.On theother hand,

TVTissupposedtoextendCSAbydeningoneptsthatapturethesequential

aspets oftheERD problems and thevariationsofvoabularyobservedin the

distintstagesoftheindividuals'writings.Thus,CSAandBoWariseasobvious

andidatesto ompareTVTinthedata setusedin thepilottask.Those three

5

(7)

NaïveBayesand RandomForest,among others.Ineahase, thebestparam-

eters were seletedfor eah algorithm-representationombination (model) and

thereportedresultsorrespondtothebest obtainedvalues.

We tested BoW with dierent weighting shemes and learning algorithms

but,in allases,thebestresultswereobtainedwithbinaryrepresentationsand

the Naïve Bayes algorithm. From now on, all referenes to BoW will stand

for that setting. We use CSA with representations of terms with normalized

weightsaordingtoEquations2and3anddoumentrepresentationsobtained

from Equation 4as proposed in [8℄ for author proling tasks. We named this

settingas

CSA

.FortheTVTrepresentation,adeisionmustbemaderelated to thenumber

f

ofhunksthatwill enrihtheminority(positive)lass.Inour

studies,weuse

f = 4

and,inonsequene,thepositivelasswasrepresentedby 5onepts.Inthat way,the numberof doumentsin thedepressed lasswas

inremented by 5 with respet to the original size,from 83 positive instanes

to 415. Aswean see, with this tehnique weare also obtainingsomekind of

balaningin thesizeofbothlassesandaddressinginthatwayanotherusual

problemthat wepreviouslyreferastheUDSproblem.

ApartiularitythatERD methodsmustonsideristheriterionusedtode-

idewhen (inwhatsituations)thelassiationgeneratedbythesystemison-

sideredthenal/denitivedeisionontheevaluatedinstanes(thelassiation

time deision (CTD) issue).We willstartourevaluation ofthedierentdou-

mentrepresentationsandalgorithmsassumingthatthelassiationismadeon

astatihunkbyhunkbasis.Thatis,foreahhunk

C ˆ i

providedtotheERD

systemswewillevaluatetheir performane onsidering thatall themodelsare

(simultaneously)appliedtothewritingsreeiveduptothehunk

C ˆ i

.With this

kind of information it will be possible to observeto what extent the dierent

approahesarerobusttothepartialinformationinthedierentstages,inwhih

momenttheystarttoobtainaeptableresults,andotherinterestingstatistis.

Tables1, 2and 3show theresults ofexperiments forthis stati hunk by

hunk lassiationsheme.Valuesofpreision (

π

), reall (

ρ

)and

F 1

-measure

(

F 1

) of the target (depressed) lass are reported for eah onsidered model.

Statistisalso inludetheearly riskdetetion error (ERDE) measureproposed

in [9℄.Thismeasure onsidersnotonlytheorretnessofthedeisionmadeby

thesystembutalsothedelayinmakingthatdeision.ERDEusesspeiosts

to penalize false positivesand false negatives. However,ERDE hasa dierent

treatmentwiththetwopossiblesuessfulpreditions(true negativesandtrue

positives).Truenegativeshavenoost(ost=0)butERDEassoiatesaostto

thedelayinthedetetionoftruepositivesthatmonotoniallyinreaseswiththe

number

k

oftextualitemsseenbeforegivingtheanswer.Inanutshell,thatost

islowwhen

k

islowerthan athresholdvalue

o

butrapidly approahes1when

k > o

.Inthat way,

o

representssometypeofurgeny in detetingdepression

ases:the lowest the

o

valuesthehighestthe urgenyin detetingthe positive

(8)

Inour study weonsider the two valuesof

o

used in the pilottask:

o = 5

(

ERDE 5

) and

o = 50

(

ERDE 50

). In eah hunk, lassiers usually produe

theirpreditionswithsomekindofondene,ingeneral,theestimatedprob-

abilityofthepreditedlass.Inthoseases,weanseletdierentthresholds

tr

onsideringthataninstane(doument)isassignedtothetargetlasswhenits

assoiatedprobability

p

isgreater(orequal)thanertainthreshold

tr

(

p ≥ tr

).

Inthis studyweevaluated5dierentsettingsfor theprobabilitiesassignedfor

eah lassier:

p = 1

,

p ≥ 0.9

,

p ≥ 0.8

,

p ≥ 0.7

and

p ≥ 0.6

. Due to spae

onstraints,onlythebestresultsobtainedwithapartiularsettingareshown.

6

Table1showstheresultsobtainedwith aBoW representationandaNaïve

Bayes lassier. Those values orrespond to the setting where an instane is

onsidered as depressive if the lassier assigns to the target/positive lass a

probabilitygreaterorequalthan0.8(

p ≥ 0.8

).Surprisingly,thebestresultsfor all theonsidered measures are obtainedon the rsthunk.Inthis hunk,we

an observethat this model only reoversa45%of the depressed individuals.

However, this is not the worst aspet. Only a 12%of the individual lassied

asdepressed eetively hadthis ondition resultingin onsequenein avery

low

F 1

measure(0.19).Table2showssimilarresultswhena

CSA

-RF(random

forest)ombinationwith

p ≥ 0.6

isusedtolassifythewritingsoftheindividuals.

Here,

F 1

measureisalsolowbutweanobserveadeteriorationinthe(

ERDE 5

)

and(

ERDE 50

)errorvalueswithrespettothepreviousmodel.

Finally,inTable3,theresultsofTVTwithaNaïveBayesalgorithmand

p ≥ 0.6

areshown.There,weanseearemarkableimprovementintheperformane ofthelassierinthehunk3withexellentvaluesof

ERDE 50

(7.02),preision

π

(0.63), reall

ρ

(0.85)and

F 1

measure(0.72). Analysingtheresultsalongthe

10 onsidered hunks we observe how the measures keep improving from the

hunk 1up to reahthe best values in hunk 3and, from then on,they start

to deteriorate hunk byhunk and obtainingtheworstresults onthe last two

hunks.Asweakpointsofthoseresultsweansaythatthebestvalueof

ERDE 5

obtainedinhunk1isnotverygood.Besides,eventhought

ERDE 50

valuesare

aeptable formostoftheonsidered hunks,theyneedat leasttwohunks to

show a ompetitive performane. That aspet looks reasonable if we onsider

that TVT is based onthe variation of terms between onseutivehunks and

that informationisnotavailable onthersthunk.

Asgeneralonlusiontothehunkbyhunkanalysis,weouldsaythatim-

balanedlassesseemtoaetinadierentwaytothedierentmethods.BoW

and CSA diretly depend on the voabulary of positive and negative lasses.

In therst hunk where texts are supposed to bethe shortest, relevantwords

of the positive lass appearing in the posts will probably havemore hane of

beingbalanedwithrespettothewordsappearinginthenegativelass.That

makeslassiersbemoresensitivetothepositivelassand,inonsequene,the

reallandgeneralperformaneisimproved.Asmoreinformationisread,words

relatedtothenegativelassaremoreprobabletoourintroduingnoiseand

6

All the tables generated for the dierent probabilities an be downloaded from

(9)

Table 1.Model:

BoW

+NaïveBayes(

p ≥ 0 . 8

).Chunkbyhunksetting.

ERDE 5

,

ERDE 50

,

F 1

-measure(

F 1

),preision(

π

)andreall(

ρ

)ofthedepressedlass.

ch 1 ch 2 ch 3 ch 4 ch 5 ch 6 ch 7 ch 8 ch 9 ch 10 ERDE 5

18.0920.98 21.5 21.7321.9521.9521.9521.9522.1722.17

ERDE 50

15.1716.8420.7720.2521.2121.9521.9521.5222.1722.17

F 1

0.19 0.16 0.09 0.11 0.09 0.09 0.09 0.09 0.09 0.13

π

0.12 0.11 0.06 0.17 0.06 0.06 0.06 0.06 0.06 0.08

ρ

0.45 0.35 0.2 0.25 0.2 0.2 0.2 0.2 0.2 0.3

Table2.Model:

CSA

+RF(

p ≥ 0 . 6

).Chunkbyhunksetting.

ERDE 5

,

ERDE 50

,

F 1

-measure(

F 1

),preision(

π

)andreall(

ρ

)ofthedepressedlass.

ch 1 ch 2 ch 3 ch 4 ch 5 ch 6 ch 7 ch 8 ch 9 ch 10 ERDE 5

21.9325.6425.4625.5726.1225.6825.6825.4625.3525.68

ERDE 50

19.4724.9425.4623.3525.37 24.2 23.46 22.5 22.3923.47

F 1

0.19 0.08 0.05 0.1 0.06 0.08 0.13 0.16 0.16 0.14

π

0.11 0.05 0.03 0.06 0.04 0.05 0.07 0.09 0.09 0.08

ρ

0.6 0.25 0.15 0.3 0.2 0.25 0.4 0.5 0.5 0.45

aetingin onsequenetheperformane.TVTdoesnotseemtobeso aeted

bythis problem showingamorestable performanealongallthe hunks,with

thebestresultsinthethirdhunkandthenwithalittledeteriorationfromthen

on. Those results ould be giving evidene that the variation of terms (with

f = 4

)allowsto betterdetet theourreneof relevantwords ofthe positive

lassin thersthunks.However,italsoseemstobeaetedbytheunbalane

problem in subsequent hunks, although in a lowerlevelthan BoW and CSA

representations.Unfortunately,verifyingthosehypotheseswouldrequireonsid-

eringmorebalaned settingsanddierent

f

valueswhatisoutofthesopeof

thispaper.However,thatimportantaspetwillbeaddressedinfuture works

Anotherapproah for the CTD issue ould be diretly use the probability

(orsomemeasureofondene)assignedbythelassiertodeidewhentostop

reading adoumentand givingitslassiation.That approah,that in [9℄is

referredasdynami,onlyonsidersthatthisprobabilityexeedssomepartiular

threshold to lassifytheinstane/individual aspositive.That means, that dif-

ferentstreams ofmessagesouldbelassiedasdepressed in dierentstages

(hunks). Table 4showthose statistisfor BoW,

CSA

and TVT representa- tionsforthoselearningalgorithmsandprobabilitythresholdsthatobtainedthe

best performane. There, we an see that TVT representation, with a Naïve

Bayesandlassifyinginstanesasdepressedwhentheassignedprobabilityis1,

obtains the best results for the measures we are more interested in:

ERDE 5

,

ERDE 50

and

F 1

-measure.Inthis ontext, BoW gets abetterreall valuebut

attheexpenseofloweringthepreisionvaluesresultinginapoor

F 1

-measure.

Testing stage The previous results were obtained by training the lassiers

with the

T R DS − train

data setand testingthemwith the

T R DS − test

data

(10)

Table 3.Model:

T V T

+NaïveBayes(

p ≥ 0 . 6

).Chunkbyhunksetting.

ERDE 5

,

ERDE 50

,

F 1

-measure(

F 1

),preision(

π

)andreall(

ρ

)ofthedepressedlass.

ch 1 ch 2 ch 3 ch 4 ch 5 ch 6 ch 7 ch 8 ch 9 ch 10 ERDE 5

14.2414.2714.5914.8315.1715.5115.7415.8416.2116.13

ERDE 50

10.80 7.22 7.02 9.24 9.25 9.97 10.7310.7311.0610.96

F 1

0.42 0.65 0.72 0.67 0.67 0.67 0.64 0.64 0.57 0.58

π

0.39 0.58 0.63 0.60 0.60 0.60 0.58 0.58 0.50 0.52

ρ

0.45 0.75 0.85 0.75 0.75 0.75 0.70 0.70 0.65 0.65

Table4.DynamiModels forBoW-NB,

CSA

-NBandTVT-NB.

ERDE 5 ERDE 50 F 1 π ρ

BoW(

p ≥ 0 . 8

) 21.05 18.13 0.24 0.14 0.75

CSA

-NB(

p = 1

) 23.09 23.07 0.06 0.04 0.15 TVT-NB(

p = 1

) 14.13 11.25 0.400.47 0.35

set.Theobviousquestionnowisifsimilarresultsareobtainedbytrainingwith

the full training set of the pilot task (

T R DS

) and using the lassiers with

the data set

T E DS

that was inrementally released during the testing phase of the pilottask. Inthis new senario,the TVT representationwasused with

a simple rule for the CTD issue that onsists in lassifying all the individual

in the hunk 3 as positive (depressed) if a Naïve Bayes lassier produed a

probabilityequalorgreaterthan 0.6forthe positivelass.That strategy, that

wewill referas

T V T p≥0.6 3

,is motivatedbythegoodresultsshowedbyTVTin

Table3.WealsotestedtheBoW,

CSA

andTVTrepresentationswithdynami strategiesandusingthoseprobabilitiesthatbestvaluesobtainedinthetraining

stage.As baselineswealso tested twoapproahes desribed in [9℄that will be

namedas

Ran

and

M in

.

Ran

,simplyemitsarandomdeision(depressed/non- depressed)foreahuserin thersthunk.

M in

,ontheotherhand,standsfor

minorityandonsistsinlassifyingeahuserasdepressiveinthersthunk.

Table5showstheperformaneofalltheabovementionedapproahesonthe

test set of thepilottask (

T E DS

).We alsoinluded theresultsreported in the

eRiskpageforthesystemsthatobtainedthebest

ERDE 5

(

F HDO−BCSGB

),

ERDE 50

(

U N SLA

) and

F 1

(

F HDO − BCSGB

) measures on the pilot task.

Here we an observethat results obtained with

T V T p≥0.6 3

are not as good as

those obtained in the training stage. However, the setting TVT-NB (

p = 1

)

wouldhaveobtainedthebest

ERDE 5

soreandthethird

ERDE 50

value,with

asmalldierenerespettothebest reportedvalue(9.84versus9.68).

Thosegood resultsof TVT were ahievedtaking into aountthe best pa-

rameters obtainedin the trainingstage. However,it also would beinteresting

analysingwhatwouldhavebeentheTVT'sperformaneifotherparameterset-

tingshadbeenseleted.Table6showsthistypeofinformationbyreportingthe

results obtainedwith dierent learningalgorithms (Naïve Bayes and Random

Forest)and dierent probability valuesfor dynami approahes to the CTD

(11)

Table5.Resultsonthe

T E DS

testset.

ERDE 5 ERDE 50 F 1 π ρ Ran

16.83 14.63 0.17 0.11 0.4

M in

21.67 15.03 0.23 0.13 1

BoW(

p ≥ 0 . 8

) 16.45 10.87 0.38 0.25 0.77

CSA

-NB(

p = 1

) 20.58 19.58 0.05 0.03 0.15

T V T p≥ 3 0.6

13.64 10.17 0.53 0.46 0.62

TVT-NB(

p = 1

) 12.38 9.84 0.42 0.50 0.37

F HDO − BCSGA

12.82 9.69 0.64 0.61 0.67

F HDO − BCSGB

12.70 10.39 0.55 0.690.46

U N SLA

13.66 9.68 0.59 0.48 0.79

Table6.ResultsofTVTwithdierentlearningalgorithmsandprobabilityvalues.

ERDE 5 ERDE 50 F 1 π ρ

TVT-NB(

p ≥ 0 . 6

) 13.59 8.40 0.50 0.37 0.75 TVT-NB(

p ≥ 0 . 7

) 13.43 8.24 0.51 0.39 0.75 TVT-NB(

p ≥ 0 . 8

) 13.13 8.17 0.54 0.42 0.73 TVT-NB(

p ≥ 0 . 9

) 13.07 8.35 0.52 0.42 0.69 TVT-NB(

p = 1

) 12.38 9.84 0.42 0.50 0.37 TVT-RF(

p ≥ 0 . 6

) 12.46 8.37 0.55 0.49 0.63 TVT-RF(

p ≥ 0 . 7

) 12.49 8.52 0.55 0.50 0.62 TVT-RF(

p ≥ 0 . 8

) 12.30 8.95 0.56 0.54 0.58 TVT-RF(

p ≥ 0 . 9

) 12.34 10.28 0.47 0.55 0.40 TVT-RF(

p = 1

) 12.82 11.82 0.20 0.67 0.12

the

ERDE

measuresindependentlyofthealgorithmusedtolearnthemodeland theprobabilityusedinthedynamiapproahes.Mostofthe

ERDE 5

valuesare

lowandin 7outof10settingsthe

ERDE 50

valuesarelowestthanthebestre-

portedinthepilottask(

U N SLA

:9.68).Inthisontext,TVTahievesthebest

reported

ERDE 5

valueupto now (12.30)with thesetting TVT-RF(

p ≥ 0.8

)

andthelowest

ERDE 50

value(8.17)withthemodel TVT-NB(

p ≥ 0.8

).

4 Conlusions and future work

In this artile wepresenttemporal variation of terms (TVT) anapproah for

earlyriskdetetionbasedonusingthevariationofvoabularyalongthedierent

time stepsas oneptspaefordoumentrepresentation.TVTnaturallyopes

with the sequentialnature of ERD problems and also givesa toolfor dealing

with unbalaned data sets. Preliminary results with the eRisk 2017 data set

showabetter performane ofTVT in omparisonto othersuessful semanti

analysisapproahandthestandardBOWrepresentation.Italsoshowsarobust

performane along dierent parameter settings and reahes the best reported

resultsuptothemomentfor

ERDE 5

and

ERDE 50

errorevaluationmeasures.

Asfuture work,weplantoapplytheTVTapproahtootherproblemsthat

(12)

PAN-2012ompetitiononsexualpredatoridentiation[5℄whihsharesseveral

harateristiswiththedatasetusedinthepresentworksuhasthesequentially

ofdata, unbalanedlassesandtherequirementofdetetingtheminoritylass

(predator)as soonaspossible,amongothers.

TVT is expliitly based on the enrihment of the minority lass with new

oneptsderivedfrom thepartialinformationobtainedfromtheinitialhunks.

However, some improvements an be ahieved by also lustering the negative

lass as proposed by [8℄ in author proling tasks. We arried out some initial

experiments by ombining TVT with the lustering of the negative lass but

more study is required to determine how both approahes an be eetively

integrated. Besides, in the present work, the eletion of

f = 4

mainly aimed

atobtainingbalanedpositiveandnegativelasses.Infuture works,dierent

f

valueswillbeonsideredto seehowtheyimpatontheTVT'sperformane.

TVTprovides,asasideeet,aninterestingtoolfordealingwiththeunbal-

aneddatasetproblem.WeplantoapplyTVTonunbalaneddatasetsthatdo

notneessarilyorrespondtotheERDeldandomparingitagainstotherwell

knownmethodsinthisarea,suhasSMOTE[1℄.Finally,itwouldbeinteresting

omparingtheoneptspae usedinourapproahagainstother reentand ef-

fetiverepresentationsbasedonwordembeddings.Inthisontext,itouldalso

be analysed how our onept spae representation an be extended/improved

withinformationprovidedbythoseembeddings.

Referenes

1. N.V.Chawla,K.W.Bowyer,L.O.Hall,andW.Ph.Kegelmeyer.Smote:Syntheti

minorityover-samplingtehnique. J.Artif.Intell.Res.,16:321357, 2002.

2. S. Deerwester, S.T.Dumais, G.W. Furnas, T.K. Landauer,and R. Harshman.

Indexingbylatentsemantianalysis. Journalofthe ASIS,41(6):391407,1990.

3. H. Jair Esalante, M. Montes-y-Gómez, L. Villaseñor Pineda, and M. Errealde.

Earlytextlassiation:anaïvesolution.InA.Balahur,E.VanderGoot,P.Vossen,

and A.Montoyo,editors, Pro.of WASSANAACL-HLT 2016, 2016, SanDiego,

California,USA,pages9199.TheAssoiationforComputerLinguistis,2016.

4. E. Gabrilovih and S. Markovith. Wikipedia-based semanti interpretation for

naturallanguageproessing. JAIR,34(1):443498,Marh2009.

5. G.InhesandF.Crestani. Overviewof theinternationalsexual predatoridenti-

ationompetitionatpan-2012. InP.Forner,J.Karlgren,andC.Womser-Haker,

editors, CLEF(OnlineWorkingNotes/Labs/Workshop),pages112,2012.

6. M. Lan, Ch. Tan, J. Su, and Y. Lu. Supervisedand traditional termweighting

methodsforautomatitextategorization. IEEETPAMI,31(4):721735,2009.

7. Z.Li,Z.Xiong,Y.Zhang,Ch.Liu,andK.Li. Fasttextategorizationusingonise

semantianalysis. PatternReogn. Lett.,32(3):441448,February2011.

8. A. Pastor López-Monroy, M. Montes y Gómez, H. Jair Esalante, L. Villaseñor-

Pineda, and E. Stamatatos. Disriminative subprole-spei representations for

authorprolinginsoialmedia. Knowledge-BasedSystems,89:134147,2015.

9. D.E.LosadaandF.Crestani. Atestolletionforresearhondepressionandlan-

guageuse.InExperimentalIRMeetsMultilinguality,Multimodality,andInteration

-7thInt.Conf.oftheCLEFAssoiation,Portugal,pages2839,2016.

Références

Documents relatifs

Each Cluster spacecraft contains 11 identical instruments: an active spacecraft potential control instrument (ASPOC), an ion spectrometry instrument (CIS), an electron drift

In this lesson the students apply the FLE in order to calculate the expansion of space backwards from the current light horizon r lh.. The observable space is limited by the

Characterization of insect pheromones might have dominated the early days of the Journal of Chemical Ecology, but exciting contributions from the multifaceted discipline of

–   Mathieu Acher, Hugo Martin, Juliana Alves Pereira, Arnaud Blouin, Djamel Eddine Khelladi, Jean-Marc Jézéquel.. –

We studied the impact of plasminogen ac- tivator inhibitor (PAI)-1 deficiency in a transgenic mouse model of ocular tumors originating from retinal epithelial cells and leading to

Thus the next conceptual sign, the sign of differential integration, of electronic education is the ability of any system of education and each subject of education for

This requires the addition of traditional methodological foundations of the formation of professional development of undergrad- uates, new information and networking technologies,

Our first approach is quite similar to the one previously described, but we also make use of word or char embeddings in every model, as well as a fully connected layer after the