Temporal variation of terms as concept space for early risk prediction
Marcelo L. Errecalde¹, Ma. Paula Villegas¹, Dario G. Funez¹,
Ma. José Garciarena Ucelay¹, and Leticia C. Cagnina¹,²

¹ LIDIC Research Group, Universidad Nacional de San Luis, Argentina
² Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
{merrecalde, villegasmariapaula74, funezdario}@gmail.com
{mjgarciarenaucelay, lcagnina}@gmail.com
Abstract. Early risk prediction involves three different aspects to be considered when an automatic classifier is implemented for this task: a) support for classification with partial information read up to different time steps, b) support for dealing with unbalanced data sets and c) a policy to decide when a document could be classified as belonging to the relevant class with a reasonable confidence. In this paper we propose an approach that naturally copes with the first two aspects and shows good perspectives to deal with the last one. Our proposal, named temporal variation of terms (TVT), is based on using the variation of vocabulary along the different time steps as concept space to represent the documents. Results with the eRisk 2017 data set show a better performance of TVT in comparison to other successful semantic analysis approaches and the standard BoW representation. Besides, it also reaches the best reported results up to the moment for the ERDE_5 and ERDE_50 error evaluation measures.

Keywords: Early Risk Detection, Unbalanced Data Sets, Text Representations, Semantic Analysis Techniques.
1 Introduction

Early risk detection (ERD) is a new research area potentially applicable to a wide variety of situations such as detection of potential paedophiles, people with suicidal inclinations, or people susceptible to depression, among others. In an ERD scenario, data are sequentially read as a stream and the challenge consists in detecting risk cases as soon as possible. A usual situation in these cases is that the target class (the risky one) is clearly under-sampled with respect to the control class (the non-risky one). That unequal distribution between the positive (minority) class and the negative one is a well-known problem in categorization tasks, popularly referred to as unbalanced data sets (UDS).
Besides dealing with the UDS problem, an ERD system needs to consider the problem of assigning a class to documents when only partial information is available. A document is processed as a sequence of terms, and the goal is to classify it as soon as possible. This problem of classification with partial information (CPI) might be addressed with a simple approach that consists in training with complete documents as usual and considering the partial documents read up to the classification point as standard complete documents. In [3] the CPI aspect was considered by analysing the robustness of the Naïve Bayes algorithm to deal with partial information.
Last, but not least, an ERD system needs to consider not only which class should be assigned to a document, but also when to make that assignment. This aspect, that we will refer to as the classification time decision (CTD) issue, has been addressed with very simple heuristic rules,³ although more elaborate approaches might be used.
In this article we propose an original idea that explicitly considers the sequentiality of data to deal with the unbalanced data sets problem. In a nutshell, we use the temporal variation of terms as concept space of a recent concise semantic analysis (CSA) approach [7]. CSA is an interesting document representation technique which models words and documents in a small concept space whose concepts are obtained from category labels. CSA has obtained good results in author profiling tasks [8] and the variant proposed in this article, named temporal variation of terms (TVT), seems to show some interesting characteristics to deal with the ERD problem. In fact, it obtained a robust performance on the eRisk 2017 data set and reached the best (lowest) reported results up to the moment for the ERDE_5 and ERDE_50 error evaluation measures.

The rest of this document is organized as follows: Section 2 describes our proposed method for the ERD problem. Section 3 shows the results obtained with our method on the eRisk 2017 data set. Finally, Section 4 depicts potential future works and the obtained conclusions.
2 The proposed method

Our method is based on the concise semantic analysis (CSA) technique proposed in [7] and later extended in [8] for author profiling tasks. Therefore, we first present in Subsection 2.1 the key aspects of CSA and then explain in Subsection 2.2 how we instantiate CSA with concepts derived from the terms used in the temporal chunks analysed by an ERD system at different time steps.
2.1 Concise Semantic Analysis

Standard text representation methods such as Bag of Words (BoW) suffer from two well-known drawbacks: first, their high dimensionality and sparsity; second, they do not capture relationships among words. CSA is a semantic analysis technique that aims at dealing with those shortcomings by interpreting words and documents in a space of concepts. Differently from other semantic analysis approaches such as latent semantic analysis (LSA) [2] and explicit semantic analysis (ESA) [4], CSA interprets words and text fragments in a space of concepts that are close (or equal) to the category labels. For instance, if documents in the data set are labeled with q different category labels (usually no more than 100 elements), words and documents will be represented in a q-dimensional space. That space size is usually much smaller than that of standard BoW representations, which directly depend on the vocabulary size (more than 10000 or 20000 elements in general).

³ For instance, exceeding a specific confidence threshold in the prediction of the classifier.
To explain the main concepts of the CSA technique we first introduce some basic notation that will be used in the rest of this work. Let D = {⟨d_1, y_1⟩, ..., ⟨d_n, y_n⟩} be a training set formed by n pairs of documents (d_i) and variables (y_i) that indicate the concept each document is associated with, y_i ∈ C, where C = {c_1, ..., c_q} is the concept space. For the moment, consider that these concepts correspond to standard category labels although, as we will see later, they might represent more elaborate aspects. In this context, we will denote as V = {t_1, ..., t_m} the vocabulary of terms of the collection being analysed.
Representing terms in the concept space. In CSA, each term t_i ∈ V is represented as a vector t_i ∈ R^q, t_i = ⟨t_{i,1}, ..., t_{i,q}⟩. Here, t_{i,j} represents the degree of association between the term t_i and the concept c_j, and its computation requires some basic steps that are explained below. First, the raw term-concept association between the i-th term and the j-th concept, denoted w_ij, will be obtained. If D_{c_u} ⊆ D, D_{c_u} = {d_r | ⟨d_r, y_s⟩ ∈ D ∧ y_s = c_u}, is the subset of the training instances whose label is the concept c_u, then w_ij might be defined as:

    w_ij = Σ_{∀d_k ∈ D_{c_j}} log₂(1 + tf_ik / len(d_k))    (1)

where tf_ik is the number of occurrences of the term t_i in the document d_k and len(d_k) is the length (number of terms) of d_k.
As noted in [7] and [8], direct use of w_ij to represent terms in the vector t_i could be sensitive to highly unbalanced data. Thus, some kind of normalization is usually required and, in our case, we selected the one proposed in [8]:

    t′_ij = w_ij / Σ_{i=1..m} w_ij    (2)

    t_ij = t′_ij / Σ_{j=1..q} w_ij    (3)

With this last conversion we finally obtain, for each term t_i ∈ V, a q-dimensional vector t_i = ⟨t_{i,1}, ..., t_{i,q}⟩ defined over a space of q concepts. Up to now, those concepts correspond to the original categories used to label the documents. Later, we will use other more elaborate concepts.
Representing documents in the concept space. Once the terms are represented in the q-dimensional concept space, those vectors can be used to represent documents. However, terms usually have a different importance for different documents, so it is not a good idea to compute the vector of a document as the simple average of all its term vectors. Previous works in BoW [6] have considered different statistical techniques to weight the importance of terms in a document, such as tf-idf, tf-ig, tf-χ² or tf-rf, among others. Here, we will use the approach used in [8] for author profiling, which represents each document d_k as the weighted aggregation of the representations (vectors) of the terms that it contains:

    d_k = Σ_{t_i ∈ d_k} (tf_ik / len(d_k)) × t_i    (4)

Thus, documents are also represented in a q-dimensional concept space (i.e., d_k ∈ R^q) which is much smaller in dimensionality than the one required by standard BoW approaches (q ≪ m).
2.2 Temporal Variation of Terms

In Subsection 2.1 we said that the concept space C usually corresponds to standard category names used to label the training instances in supervised text categorization tasks. In this scenario, which in [7] is referred to as direct derivation, each category label simply corresponds to a concept. However, in [7] other alternatives are also proposed, like split derivation and combined derivation. The former uses the low-level labels in hierarchical corpora and the latter is based on combining semantically related labels in a unique concept. In [8] those ideas are extended by first clustering each category of the corpora and then using those subgroups (sub-clusters) as the new concept space.

As we can see, the idea common to all the above approaches is that once a set of documents is identified as belonging to a group/category, that category can be considered as a concept and CSA can be applied in the usual way. We take a similar view to those works by considering that the positive (minority) class in ERD problems can be augmented with the concepts derived from the sets of partial documents read along the different time steps. In order to understand this idea it is necessary to first introduce a sequential work scheme as the one proposed in [9] for research in ERD systems for depression cases.
Following [9], we will assume a corpus of documents written by p different individuals ({I_1, ..., I_p}). For each individual I_l (l ∈ {1, ..., p}), the n_l documents that he has written are provided in chronological order (from the oldest text to the most recent text): D_{I_l,1}, D_{I_l,2}, ..., D_{I_l,n_l}. In this context, given these p streams of messages, the ERD system has to process every sequence of messages (in the chronological order they are produced) and to make a binary decision (as early as possible) on whether or not the individual might be a positive case of depression. Evaluation metrics on this task must be time-aware, so an early risk detection error (ERDE) is proposed. This metric takes into account not only the correctness of the decision but also the delay taken to make that decision.
In a usual supervised text categorization task, we would only have two category labels: positive (risk/depressive case) and negative (non-risk/non-depressive case). That would only give two concepts for a CSA representation. However, in ERD problems there is additional temporal information that could be used to obtain an improved concept space. For instance, the training set could be split in h chunks, Ĉ_1, Ĉ_2, ..., Ĉ_h, in such a way that Ĉ_1 contains the oldest writings of all users (first (100/h)% of submitted posts or comments), chunk Ĉ_2 contains the second oldest writings, and so forth. Each chunk Ĉ_k can be partitioned in two subsets Ĉ_k⁺ and Ĉ_k⁻, Ĉ_k = Ĉ_k⁺ ∪ Ĉ_k⁻, where Ĉ_k⁺ contains the positive cases of chunk Ĉ_k and Ĉ_k⁻ the negative ones of this chunk.
It is interesting to note that we can also consider the data sets that result of concatenating chunks that are contiguous in time, using the notation Ĉ_{i-j} to refer to the chunk obtained from concatenating all the (original) chunks from the i-th chunk to the j-th chunk (inclusive). Thus, Ĉ_{1-h} will represent the data set with the complete streams of messages of all the p individuals. In this case, Ĉ_{1-h}⁺ and Ĉ_{1-h}⁻ will have the obvious semantics specified above for the complete documents of the training set.
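The chunking and concatenation notation above can be sketched as follows. This is a minimal illustration under our own naming, assuming each user's writings come as a chronologically ordered list and chunks are split as evenly as possible:

```python
def split_chunks(writings, h):
    """Split one user's chronologically ordered writings into h chunks:
    chunk 1 holds the oldest (100/h)% of items, chunk 2 the next, etc."""
    n = len(writings)
    bounds = [round(k * n / h) for k in range(h + 1)]
    return [writings[bounds[k]:bounds[k + 1]] for k in range(h)]

def concat_chunks(chunks, i, j):
    """C_{i-j}: concatenation of the original chunks i..j (1-based, inclusive)."""
    return [w for c in chunks[i - 1:j] for w in c]

# Example: 10 writings split into h = 5 chunks of 2 writings each
chunks = split_chunks(list(range(10)), 5)
assert chunks[0] == [0, 1]                            # oldest chunk
assert concat_chunks(chunks, 1, 3) == [0, 1, 2, 3, 4, 5]   # C_{1-3}
assert concat_chunks(chunks, 1, 5) == list(range(10))      # C_{1-h}
```

Applying `concat_chunks(chunks, 1, h)` to every user reconstructs the complete stream Ĉ_{1-h} used by a classic (non-temporal) classifier.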
The classic way of constructing a classifier would be to take the complete documents of the p individuals (Ĉ_{1-h}) and use an inductive learning algorithm such as SVM or Naïve Bayes to obtain that classifier. As we mentioned earlier, another important aspect of ERD systems is that the classification problem being addressed is usually highly unbalanced (UDS problem). That is, the number of documents of the majority/negative class (non-depression) is significantly larger than that of the minority/positive class (depression). More formally, following the previously specified notation, |Ĉ_{1-h}⁻| ≫ |Ĉ_{1-h}⁺|.
An alternative to try to alleviate the UDS problem would be to consider that the minority class is formed not only by the complete documents of the individuals but also by the partial documents obtained in the different chunks. Following the general ideas posed in CSA, we could consider that the partial documents read in the different chunks represent temporal concepts that should be taken into account. In this context, one might think that variations of the terms used in these different sequential stages of the documents may carry relevant information for the classification task. With this idea in mind arises the method proposed in this work, named temporal variation of terms (TVT), which consists in enriching the documents of the minority class with the partial documents read in the first chunks. These first chunks of the minority class, along with the complete documents, will be considered as a new concept space for a CSA method.
Therefore, in TVT we first determine the number f of initial chunks that will be used to enrich the minority (positive) class. Then, we use the document sets Ĉ_1⁺, Ĉ_{1-2}⁺, ..., Ĉ_{1-f}⁺ and Ĉ_{1-h}⁺ as concepts for the positive class and Ĉ_{1-h}⁻ for the negative class. Finally, we represent terms and documents in this new (f + 2)-dimensional space using the CSA approach explained in Section 2.1.

3.1 Data Set
Our approach was tested on the data set used in the eRisk 2017 pilot task and described in [9]. It is a collection of writings (posts or comments) from a set of Social Media users. There are two categories of users, depressed and non-depressed, and, for each user, the collection contains a sequence of writings (in chronological order). For each user, the collection of writings has been divided into 10 chunks. The first chunk contains the oldest 10% of the messages, the second chunk contains the second oldest 10%, and so forth. This collection was split into a training and a test set that we will refer to as TR_DS and TE_DS respectively. The (training) TR_DS set contained 486 users (83 positive, 403 negative) and the (test) TE_DS set contained 401 users (52 positive, 349 negative). The users labeled as positive are those that have explicitly mentioned that they have been diagnosed with depression.
This task was divided into a training stage and a testing stage. In the first one, the participating teams had access to the TR_DS set with all chunks of all training users. They could therefore tune their systems with the training data. To reproduce the same conditions of the pilot task, we used the training set (TR_DS) to generate a new corpus divided into a training set (that we will refer to as TR_DS-train) and a test set (named TR_DS-test) with the same categories (depressed and non-depressed) for each sequence of writings of the users in the collection. Those sets maintained the same proportions of posts per user and words per user as described in [9]. TR_DS-train and TR_DS-test were generated by randomly selecting around 70% of the writings for the first one and the remaining 30% for the second one. Thus, TR_DS-train resulted in 351 individuals (63 positive, 288 negative) meanwhile TR_DS-test contains 135 individuals (20 positive, 115 negative). In the pilot task the collection of writings was divided into 10 chunks, so we made the same division on TR_DS-train and TR_DS-test.
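A user-level split of this kind can be reproduced with a sketch like the following. The seed and the exact sampling procedure are our assumptions; the text above only specifies the approximate 70/30 proportions and that whole users (not individual writings) are assigned to one side:

```python
import random

def split_users(users, train_frac=0.7, seed=0):
    """Randomly split users so that each user's full chronological
    stream of writings stays entirely in one of the two sides."""
    users = list(users)
    random.Random(seed).shuffle(users)  # deterministic for a fixed seed
    cut = round(train_frac * len(users))
    return users[:cut], users[cut:]

train, test = split_users(range(486), 0.7)
assert len(train) + len(test) == 486
```

Splitting by user rather than by writing keeps the chunk structure intact, so the 10-chunk division can then be applied to both sides exactly as in the pilot task.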
3.2 Experimental Results

We reproduced the same conditions faced by the participants of the eRisk pilot task, so we first worked on the data set released in the training stage (TR_DS) and then the obtained models were tested on the test stage (TE_DS). The activities carried out at each stage are described below.

Training stage. CSA is a document representation that aims at addressing some drawbacks of classical representations such as BoW. On the other hand, TVT is supposed to extend CSA by defining concepts that capture the sequential aspects of ERD problems and the variations of vocabulary observed in the distinct stages of the individuals' writings. Thus, CSA and BoW arise as obvious candidates to compare TVT against on the data set used in the pilot task. Those three representations were combined with different learning algorithms such as Naïve Bayes and Random Forest, among others. In each case, the best parameters were selected for each algorithm-representation combination (model) and the reported results correspond to the best obtained values.
We tested BoW with different weighting schemes and learning algorithms but, in all cases, the best results were obtained with binary representations and the Naïve Bayes algorithm. From now on, all references to BoW will stand for that setting. We used CSA with term representations normalized according to Equations 2 and 3 and document representations obtained from Equation 4, as proposed in [8] for author profiling tasks. We named this setting CSA⋆. For the TVT representation, a decision must be made related to the number f of chunks that will enrich the minority (positive) class. In our studies, we used f = 4 and, in consequence, the positive class was represented by 5 concepts. In that way, the number of documents in the depressed class was multiplied by 5 with respect to the original size, from 83 positive instances to 415. As we can see, with this technique we are also obtaining some kind of balancing in the sizes of both classes, addressing in that way another usual problem that we previously referred to as the UDS problem.
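The construction of this enriched training set can be sketched as follows, with hypothetical data structures and names of our own; with f = 4 and h = 10 it yields the five positive concepts plus one negative concept described above:

```python
def tvt_training_set(pos_users, neg_users, h=10, f=4):
    """TVT sketch: each positive user contributes the partial streams
    C_{1-1}, ..., C_{1-f} plus the complete stream C_{1-h}, each labeled
    with its own temporal concept; negative users contribute only their
    complete stream under the single negative concept (f + 2 concepts).
    Each user is a list of h chunks; each chunk is a list of writings."""
    data = []  # (writings, concept_label) pairs
    for chunks in pos_users:
        for k in list(range(1, f + 1)) + [h]:
            partial = [w for c in chunks[:k] for w in c]
            data.append((partial, "pos_1-%d" % k))
    for chunks in neg_users:
        data.append(([w for c in chunks for w in c], "neg_1-%d" % h))
    return data
```

Each positive individual thus yields f + 1 = 5 training documents, which is exactly the 83 to 415 increment of positive instances reported above; the resulting concept labels can be fed directly to the CSA representation of Section 2.1.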
A particularity that ERD methods must consider is the criterion used to decide when (in what situations) the classification generated by the system is considered the final/definitive decision on the evaluated instances (the classification time decision (CTD) issue). We will start our evaluation of the different document representations and algorithms assuming that the classification is made on a static chunk by chunk basis. That is, for each chunk Ĉ_i provided to the ERD systems we will evaluate their performance considering that all the models are (simultaneously) applied to the writings received up to the chunk Ĉ_i. With this kind of information it will be possible to observe to what extent the different approaches are robust to the partial information in the different stages, at which moment they start to obtain acceptable results, and other interesting statistics.
Tables 1, 2 and 3 show the results of experiments for this static chunk by chunk classification scheme. Values of precision (π), recall (ρ) and F_1-measure (F_1) of the target (depressed) class are reported for each considered model. Statistics also include the early risk detection error (ERDE) measure proposed in [9]. This measure considers not only the correctness of the decision made by the system but also the delay in making that decision. ERDE uses specific costs to penalize false positives and false negatives. However, ERDE gives a different treatment to the two possible successful predictions (true negatives and true positives). True negatives have no cost (cost = 0) but ERDE associates a cost to the delay in the detection of true positives that monotonically increases with the number k of textual items seen before giving the answer. In a nutshell, that cost is low when k is lower than a threshold value o but rapidly approaches 1 when k > o. In that way, o represents some type of urgency in detecting depression cases: the lower the o value, the higher the urgency in detecting the positive cases.

In our study we consider the two values of o used in the pilot task: o = 5 (ERDE_5) and o = 50 (ERDE_50). In each chunk, classifiers usually produce their predictions with some kind of confidence, in general, the estimated probability of the predicted class. In those cases, we can select different thresholds tr, considering that an instance (document) is assigned to the target class when its associated probability p is greater than or equal to a certain threshold tr (p ≥ tr). In this study we evaluated 5 different settings for the probabilities assigned by each classifier: p = 1, p ≥ 0.9, p ≥ 0.8, p ≥ 0.7 and p ≥ 0.6. Due to space constraints, only the best results obtained with a particular setting are shown.⁶
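The ERDE behaviour described above can be sketched as follows. The sigmoid latency cost for true positives follows the definition in [9]; the cost constants used here are illustrative assumptions, not the exact values of the pilot task:

```python
import math

def erde(decision, truth, k, o, c_fp=0.1296, c_fn=1.0, c_tp=1.0):
    """ERDE_o for one individual. decision/truth are 'pos'/'neg'; k is the
    number of textual items seen before the decision was emitted."""
    if decision == "pos" and truth == "neg":
        return c_fp
    if decision == "neg" and truth == "pos":
        return c_fn
    if decision == "pos" and truth == "pos":
        # latency cost: low for k < o, rapidly approaching 1 for k > o
        return (1 - 1 / (1 + math.exp(k - o))) * c_tp
    return 0.0  # true negative: no cost

assert erde("neg", "neg", k=3, o=5) == 0.0
assert erde("pos", "pos", k=1, o=50) < 0.01   # early true positive: tiny cost
assert erde("pos", "pos", k=100, o=50) > 0.99 # late true positive: near 1
```

The value reported for a system is then this cost averaged over all evaluated individuals, so a model that answers correctly but late on the positive cases is still penalized.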
Table 1 shows the results obtained with a BoW representation and a Naïve Bayes classifier. Those values correspond to the setting where an instance is considered as depressive if the classifier assigns to the target/positive class a probability greater than or equal to 0.8 (p ≥ 0.8). Surprisingly, the best results for all the considered measures are obtained on the first chunk. In this chunk, we can observe that this model only recovers 45% of the depressed individuals. However, this is not the worst aspect. Only 12% of the individuals classified as depressed effectively had this condition, resulting in consequence in a very low F_1 measure (0.19). Table 2 shows similar results when a CSA⋆-RF (random forest) combination with p ≥ 0.6 is used to classify the writings of the individuals. Here, the F_1 measure is also low but we can observe a deterioration in the ERDE_5 and ERDE_50 error values with respect to the previous model.
Finally, in Table 3, the results of TVT with a Naïve Bayes algorithm and p ≥ 0.6 are shown. There, we can see a remarkable improvement in the performance of the classifier in chunk 3, with excellent values of ERDE_50 (7.02), precision π (0.63), recall ρ (0.85) and F_1 measure (0.72). Analysing the results along the 10 considered chunks we observe how the measures keep improving from chunk 1 up to reaching the best values in chunk 3 and, from then on, they start to deteriorate chunk by chunk, obtaining the worst results on the last two chunks. As weak points of those results we can say that the best value of ERDE_5, obtained in chunk 1, is not very good. Besides, even though ERDE_50 values are acceptable for most of the considered chunks, they need at least two chunks to show a competitive performance. That aspect looks reasonable if we consider that TVT is based on the variation of terms between consecutive chunks and that information is not available on the first chunk.
As a general conclusion to the chunk by chunk analysis, we could say that imbalanced classes seem to affect the different methods in different ways. BoW and CSA directly depend on the vocabulary of the positive and negative classes. In the first chunk, where texts are supposed to be the shortest, relevant words of the positive class appearing in the posts will probably have more chance of being balanced with respect to the words appearing in the negative class. That makes classifiers more sensitive to the positive class and, in consequence, the recall and general performance are improved. As more information is read, words related to the negative class are more likely to occur, introducing noise and affecting in consequence the performance. TVT does not seem to be so affected by this problem, showing a more stable performance along all the chunks, with the best results in the third chunk and then a little deterioration from then on. Those results could be giving evidence that the variation of terms (with f = 4) allows to better detect the occurrence of relevant words of the positive class in the first chunks. However, it also seems to be affected by the unbalance problem in subsequent chunks, although at a lower level than the BoW and CSA representations. Unfortunately, verifying those hypotheses would require considering more balanced settings and different f values, which is out of the scope of this paper. However, that important aspect will be addressed in future works.

⁶ All the tables generated for the different probabilities can be downloaded from

Table 1. Model: BoW + Naïve Bayes (p ≥ 0.8). Chunk by chunk setting. ERDE_5, ERDE_50, F_1-measure (F_1), precision (π) and recall (ρ) of the depressed class.

         ch 1  ch 2  ch 3  ch 4  ch 5  ch 6  ch 7  ch 8  ch 9  ch 10
ERDE_5  18.09 20.98 21.50 21.73 21.95 21.95 21.95 21.95 22.17 22.17
ERDE_50 15.17 16.84 20.77 20.25 21.21 21.95 21.95 21.52 22.17 22.17
F_1      0.19  0.16  0.09  0.11  0.09  0.09  0.09  0.09  0.09  0.13
π        0.12  0.11  0.06  0.17  0.06  0.06  0.06  0.06  0.06  0.08
ρ        0.45  0.35  0.20  0.25  0.20  0.20  0.20  0.20  0.20  0.30

Table 2. Model: CSA⋆ + RF (p ≥ 0.6). Chunk by chunk setting. ERDE_5, ERDE_50, F_1-measure (F_1), precision (π) and recall (ρ) of the depressed class.

         ch 1  ch 2  ch 3  ch 4  ch 5  ch 6  ch 7  ch 8  ch 9  ch 10
ERDE_5  21.93 25.64 25.46 25.57 26.12 25.68 25.68 25.46 25.35 25.68
ERDE_50 19.47 24.94 25.46 23.35 25.37 24.20 23.46 22.50 22.39 23.47
F_1      0.19  0.08  0.05  0.10  0.06  0.08  0.13  0.16  0.16  0.14
π        0.11  0.05  0.03  0.06  0.04  0.05  0.07  0.09  0.09  0.08
ρ        0.60  0.25  0.15  0.30  0.20  0.25  0.40  0.50  0.50  0.45
Another approach for the CTD issue could be to directly use the probability (or some measure of confidence) assigned by the classifier to decide when to stop reading a document and give its classification. That approach, that in [9] is referred to as dynamic, only requires that this probability exceeds some particular threshold to classify the instance/individual as positive. That means that different streams of messages could be classified as depressed in different stages (chunks). Table 4 shows those statistics for the BoW, CSA⋆ and TVT representations with the learning algorithms and probability thresholds that obtained the best performance. There, we can see that the TVT representation, with a Naïve Bayes classifier and classifying instances as depressed when the assigned probability is 1, obtains the best results for the measures we are more interested in: ERDE_5, ERDE_50 and F_1-measure. In this context, BoW gets a better recall value but at the expense of lowering the precision values, resulting in a poor F_1-measure.
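The dynamic strategy reduces to a simple stopping rule over the stream of chunk-level predictions. A minimal sketch, where the probability estimates would come from any classifier over the writings read so far and all names are illustrative:

```python
def dynamic_decision(chunk_probs, tr):
    """chunk_probs: P(positive) estimated on the writings read up to each
    chunk, in chronological order. Returns (decision, chunk_number): 'pos'
    at the first chunk whose probability reaches the threshold tr, else
    'neg' after the last chunk has been read."""
    for k, p in enumerate(chunk_probs, start=1):
        if p >= tr:
            return "pos", k   # stop early: classified as depressed here
    return "neg", len(chunk_probs)

# A user whose estimated risk grows as more writings are read:
assert dynamic_decision([0.2, 0.55, 0.81, 0.9], tr=0.8) == ("pos", 3)
assert dynamic_decision([0.2, 0.3], tr=0.8) == ("neg", 2)
```

The chunk number at which the rule fires is what determines the delay k that enters the ERDE measure, which is why higher thresholds trade recall for later (or missed) detections.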
Table 3. Model: TVT + Naïve Bayes (p ≥ 0.6). Chunk by chunk setting. ERDE_5, ERDE_50, F_1-measure (F_1), precision (π) and recall (ρ) of the depressed class.

         ch 1  ch 2  ch 3  ch 4  ch 5  ch 6  ch 7  ch 8  ch 9  ch 10
ERDE_5  14.24 14.27 14.59 14.83 15.17 15.51 15.74 15.84 16.21 16.13
ERDE_50 10.80  7.22  7.02  9.24  9.25  9.97 10.73 10.73 11.06 10.96
F_1      0.42  0.65  0.72  0.67  0.67  0.67  0.64  0.64  0.57  0.58
π        0.39  0.58  0.63  0.60  0.60  0.60  0.58  0.58  0.50  0.52
ρ        0.45  0.75  0.85  0.75  0.75  0.75  0.70  0.70  0.65  0.65

Table 4. Dynamic models for BoW-NB, CSA⋆-NB and TVT-NB.

                 ERDE_5  ERDE_50  F_1   π     ρ
BoW (p ≥ 0.8)    21.05   18.13    0.24  0.14  0.75
CSA⋆-NB (p = 1)  23.09   23.07    0.06  0.04  0.15
TVT-NB (p = 1)   14.13   11.25    0.40  0.47  0.35

Testing stage. The previous results were obtained by training the classifiers with the TR_DS-train data set and testing them with the TR_DS-test data
set. The obvious question now is whether similar results are obtained by training with the full training set of the pilot task (TR_DS) and using the classifiers with the data set TE_DS that was incrementally released during the testing phase of the pilot task. In this new scenario, the TVT representation was used with a simple rule for the CTD issue that consists in classifying all the individuals in chunk 3 as positive (depressed) if a Naïve Bayes classifier produced a probability equal to or greater than 0.6 for the positive class. That strategy, which we will refer to as TVT_3^{p≥0.6}, is motivated by the good results showed by TVT in Table 3. We also tested the BoW, CSA⋆ and TVT representations with dynamic strategies, using the probabilities that obtained the best values in the training stage. As baselines we also tested two approaches described in [9] that will be named Ran and Min. Ran simply emits a random decision (depressed/non-depressed) for each user in the first chunk. Min, on the other hand, stands for minority and consists in classifying each user as depressive in the first chunk.
Table 5 shows the performance of all the above mentioned approaches on the test set of the pilot task (TE_DS). We also included the results reported in the eRisk page for the systems that obtained the best ERDE_5 (FHDO-BCSGB), ERDE_50 (UNSLA) and F_1 (FHDO-BCSGB) measures on the pilot task. Here we can observe that the results obtained with TVT_3^{p≥0.6} are not as good as those obtained in the training stage. However, the setting TVT-NB (p = 1) would have obtained the best ERDE_5 score and the third ERDE_50 value, with a small difference with respect to the best reported value (9.84 versus 9.68).

Those good results of TVT were achieved taking into account the best parameters obtained in the training stage. However, it would also be interesting to analyse what TVT's performance would have been if other parameter settings had been selected. Table 6 shows this type of information by reporting the results obtained with different learning algorithms (Naïve Bayes and Random Forest) and different probability values for dynamic approaches to the CTD issue.
Table 5. Results on the TE_DS test set.

                 ERDE_5  ERDE_50  F_1   π     ρ
Ran              16.83   14.63    0.17  0.11  0.40
Min              21.67   15.03    0.23  0.13  1.00
BoW (p ≥ 0.8)    16.45   10.87    0.38  0.25  0.77
CSA⋆-NB (p = 1)  20.58   19.58    0.05  0.03  0.15
TVT_3^{p≥0.6}    13.64   10.17    0.53  0.46  0.62
TVT-NB (p = 1)   12.38    9.84    0.42  0.50  0.37
FHDO-BCSGA       12.82    9.69    0.64  0.61  0.67
FHDO-BCSGB       12.70   10.39    0.55  0.69  0.46
UNSLA            13.66    9.68    0.59  0.48  0.79

Table 6. Results of TVT with different learning algorithms and probability values.

                  ERDE_5  ERDE_50  F_1   π     ρ
TVT-NB (p ≥ 0.6)  13.59    8.40    0.50  0.37  0.75
TVT-NB (p ≥ 0.7)  13.43    8.24    0.51  0.39  0.75
TVT-NB (p ≥ 0.8)  13.13    8.17    0.54  0.42  0.73
TVT-NB (p ≥ 0.9)  13.07    8.35    0.52  0.42  0.69
TVT-NB (p = 1)    12.38    9.84    0.42  0.50  0.37
TVT-RF (p ≥ 0.6)  12.46    8.37    0.55  0.49  0.63
TVT-RF (p ≥ 0.7)  12.49    8.52    0.55  0.50  0.62
TVT-RF (p ≥ 0.8)  12.30    8.95    0.56  0.54  0.58
TVT-RF (p ≥ 0.9)  12.34   10.28    0.47  0.55  0.40
TVT-RF (p = 1)    12.82   11.82    0.20  0.67  0.12

As we can see in Table 6, TVT shows a robust behaviour in the ERDE measures independently of the algorithm used to learn the model and the probability used in the dynamic approaches. Most of the ERDE_5 values are low and in 7 out of 10 settings the ERDE_50 values are lower than the best reported in the pilot task (UNSLA: 9.68). In this context, TVT achieves the best reported ERDE_5 value up to now (12.30) with the setting TVT-RF (p ≥ 0.8) and the lowest ERDE_50 value (8.17) with the model TVT-NB (p ≥ 0.8).
4 Conclusions and future work

In this article we present temporal variation of terms (TVT), an approach for early risk detection based on using the variation of vocabulary along the different time steps as concept space for document representation. TVT naturally copes with the sequential nature of ERD problems and also gives a tool for dealing with unbalanced data sets. Preliminary results with the eRisk 2017 data set show a better performance of TVT in comparison to other successful semantic analysis approaches and the standard BoW representation. It also shows a robust performance along different parameter settings and reaches the best reported results up to the moment for the ERDE_5 and ERDE_50 error evaluation measures.

As future work, we plan to apply the TVT approach to other problems, such as the one posed in the PAN-2012 competition on sexual predator identification [5], which shares several characteristics with the data set used in the present work, such as the sequentiality of data, unbalanced classes and the requirement of detecting the minority class (predator) as soon as possible, among others.
TVT is explicitly based on the enrichment of the minority class with new concepts derived from the partial information obtained from the initial chunks. However, some improvements can be achieved by also clustering the negative class, as proposed by [8] in author profiling tasks. We carried out some initial experiments combining TVT with the clustering of the negative class but more study is required to determine how both approaches can be effectively integrated. Besides, in the present work, the choice of f = 4 mainly aimed at obtaining balanced positive and negative classes. In future works, different f values will be considered to see how they impact on TVT's performance.

TVT provides, as a side effect, an interesting tool for dealing with the unbalanced data set problem. We plan to apply TVT on unbalanced data sets that do not necessarily correspond to the ERD field and to compare it against other well-known methods in this area, such as SMOTE [1]. Finally, it would be interesting to compare the concept space used in our approach against other recent and effective representations based on word embeddings. In this context, it could also be analysed how our concept space representation can be extended/improved with information provided by those embeddings.
References

1. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. Ph. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res., 16:321–357, 2002.
2. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the ASIS, 41(6):391–407, 1990.
3. H. Jair Escalante, M. Montes-y-Gómez, L. Villaseñor Pineda, and M. Errecalde. Early text classification: a naïve solution. In A. Balahur, E. Van der Goot, P. Vossen, and A. Montoyo, editors, Proc. of WASSA NAACL-HLT 2016, San Diego, California, USA, pages 91–99. The Association for Computational Linguistics, 2016.
4. E. Gabrilovich and S. Markovitch. Wikipedia-based semantic interpretation for natural language processing. JAIR, 34(1):443–498, March 2009.
5. G. Inches and F. Crestani. Overview of the international sexual predator identification competition at PAN-2012. In P. Forner, J. Karlgren, and C. Womser-Hacker, editors, CLEF (Online Working Notes/Labs/Workshop), pages 1–12, 2012.
6. M. Lan, Ch. Tan, J. Su, and Y. Lu. Supervised and traditional term weighting methods for automatic text categorization. IEEE TPAMI, 31(4):721–735, 2009.
7. Z. Li, Z. Xiong, Y. Zhang, Ch. Liu, and K. Li. Fast text categorization using concise semantic analysis. Pattern Recogn. Lett., 32(3):441–448, February 2011.
8. A. Pastor López-Monroy, M. Montes y Gómez, H. Jair Escalante, L. Villaseñor-Pineda, and E. Stamatatos. Discriminative subprofile-specific representations for author profiling in social media. Knowledge-Based Systems, 89:134–147, 2015.
9. D. E. Losada and F. Crestani. A test collection for research on depression and language use. In Experimental IR Meets Multilinguality, Multimodality, and Interaction – 7th Int. Conf. of the CLEF Association, Portugal, pages 28–39, 2016.