Temporal variation of terms as concept space for early risk prediction
Marcelo L. Errecalde¹, Ma. Paula Villegas¹, Dario G. Funez¹,
Ma. José Garciarena Ucelay¹, and Leticia C. Cagnina¹,²

¹ LIDIC Research Group, Universidad Nacional de San Luis, Argentina
² Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
{merrecalde, villegasmariapaula74, funezdario}@gmail.com
{mjgarciarenaucelay, lcagnina}@gmail.com
Abstract. Early risk prediction involves three different aspects to be considered when an automatic classifier is implemented for this task: a) support for classification with partial information read up to different time steps, b) support for dealing with unbalanced data sets and c) a policy to decide when a document could be classified as belonging to the relevant class with a reasonable confidence. In this paper we propose an approach that naturally copes with the first two aspects and shows good perspectives to deal with the last one. Our proposal, named temporal variation of terms (TVT), is based on using the variation of vocabulary along the different time steps as concept space to represent the documents. Results with the eRisk 2017 data set show a better performance of TVT in comparison to other successful semantic analysis approaches and the standard BoW representation. Besides, it also reaches the best reported results up to the moment for the ERDE_5 and ERDE_50 error evaluation measures.

Keywords: Early Risk Detection, Unbalanced Data Sets, Text Representations, Semantic Analysis Techniques.
1 Introduction

Early risk detection (ERD) is a new research area potentially applicable to a wide variety of situations such as detection of potential paedophiles, people with suicidal inclinations, or people susceptible to depression, among others. In an ERD scenario, data are sequentially read as a stream and the challenge consists in detecting risk cases as soon as possible. A usual situation in these cases is that the target class (the risky one) is clearly under-sampled with respect to the control class (the non-risky one). That unequal distribution between the positive (minority) class and the negative one is a well-known problem in categorization tasks, popularly referred to as unbalanced data sets (UDS).
Besides dealing with the UDS problem, an ERD system needs to consider the problem of assigning a class to documents when only partial information is available. A document is processed as a sequence of terms, and the goal is to classify it as soon as possible. This problem of classification with partial information (CPI) might be addressed with a simple approach that consists in training with complete documents as usual and considering the partial documents read up to the classification point as standard complete documents. In [3] the CPI aspect was considered by analysing the robustness of the Naïve Bayes algorithm to deal with partial information.
Last, but not least, an ERD system needs to consider not only which class should be assigned to a document, but also when to make that assignment. This aspect, that we will refer to as the classification time decision (CTD) issue, has been addressed with very simple heuristic rules,³ although more elaborate approaches might be used.
In this article we propose an original idea that explicitly considers the sequentiality of data to deal with the unbalanced data sets problem. In a nutshell, we use the temporal variation of terms as concept space of a recent concise semantic analysis (CSA) approach [7]. CSA is an interesting document representation technique which models words and documents in a small concept space whose concepts are obtained from category labels. CSA has obtained good results in author profiling tasks [8] and the variant proposed in this article, named temporal variation of terms (TVT), seems to show some interesting characteristics to deal with the ERD problem. In fact, it obtained a robust performance on the eRisk 2017 data set and reached the best (lowest) reported results up to the moment for the ERDE_5 and ERDE_50 error evaluation measures.

The rest of this document is organized as follows: Section 2 describes our proposed method for the ERD problem. Section 3 shows the results obtained with our method on the eRisk 2017 data set. Finally, Section 4 depicts potential future works and the obtained conclusions.
2 The proposed method

Our method is based on the concise semantic analysis (CSA) technique proposed in [7] and later extended in [8] for author profiling tasks. Therefore, we first present in Subsection 2.1 the key aspects of CSA and then explain in Subsection 2.2 how we instantiate CSA with concepts derived from the terms used in the temporal chunks analysed by an ERD system at different time steps.
2.1 Concise Semantic Analysis

Standard text representation methods such as Bag of Words (BoW) suffer from two well-known drawbacks: first, their high dimensionality and sparsity; second, they do not capture relationships among words. CSA is a semantic analysis technique that aims at dealing with those shortcomings by interpreting words and documents in a space of concepts. Differently from other semantic analysis approaches such as latent semantic analysis (LSA) [2] and explicit semantic analysis (ESA) [4], CSA interprets words and text fragments in a space of concepts that are close (or equal) to the category labels. For instance, if documents in the data set are labeled with q different category labels (usually no more than 100 elements), words and documents will be represented in a q-dimensional space. That space size is usually much smaller than that of standard BoW representations, which directly depend on the vocabulary size (more than 10000 or 20000 elements in general).

³ For instance, exceeding a specific confidence threshold in the prediction of the classifier.
To explain the main concepts of the CSA technique we first introduce some basic notation that will be used in the rest of this work. Let D = {⟨d_1, y_1⟩, ..., ⟨d_n, y_n⟩} be a training set formed by n pairs of documents (d_i) and variables (y_i) that indicate the concept each document is associated with, y_i ∈ C, where C = {c_1, ..., c_q} is the concept space. For the moment, consider that these concepts correspond to standard category labels although, as we will see later, they might represent more elaborate aspects. In this context, we will denote as V = {t_1, ..., t_m} the vocabulary of terms of the collection being analysed.
Representing terms in the concept space. In CSA, each term t_i ∈ V is represented as a vector t_i ∈ R^q, t_i = ⟨t_{i,1}, ..., t_{i,q}⟩. Here, t_{i,j} represents the degree of association between the term t_i and the concept c_j, and its computation requires some basic steps that are explained below. First, the raw term-concept association between the i-th term and the j-th concept, denoted w_ij, will be obtained. If D_{c_u} ⊆ D, D_{c_u} = {d_r | ⟨d_r, y_s⟩ ∈ D ∧ y_s = c_u}, is the subset of the training instances whose label is the concept c_u, then w_ij might be defined as:

    w_ij = Σ_{∀d_k ∈ D_{c_j}} log₂(1 + tf_ik / len(d_k))    (1)

where tf_ik is the number of occurrences of the term t_i in the document d_k and len(d_k) is the length (number of terms) of d_k.
As noted in [7] and [8], direct use of w_ij to represent terms in the vector t_i could be sensitive to highly unbalanced data. Thus, some kind of normalization is usually required and, in our case, we selected the one proposed in [8]:

    t′_ij = w_ij / Σ_{i=1..m} w_ij    (2)

    t_ij = t′_ij / Σ_{j=1..q} w_ij    (3)

With this last conversion we finally obtain, for each term t_i ∈ V, a q-dimensional vector t_i = ⟨t_{i,1}, ..., t_{i,q}⟩ defined over a space of q concepts. Up to now, those concepts correspond to the original categories used to label the documents. Later, we will use other more elaborate concepts.
Representing documents in the concept space. Once the terms are represented in the q-dimensional concept space, those vectors can be used to represent documents. However, terms usually have a different importance for different documents, so it is not a good idea to compute the vector of a document as the simple average of all its term vectors. Previous works in BoW [6] have considered different statistical techniques to weight the importance of terms in a document, such as tf-idf, tf-ig, tf-χ² or tf-rf, among others. Here, we will use the approach used in [8] for author profiling, which represents each document d_k as the weighted aggregation of the representations (vectors) of the terms that it contains:

    d_k = Σ_{t_i ∈ d_k} (tf_ik / len(d_k)) × t_i    (4)

Thus, documents are also represented in a q-dimensional concept space (i.e., d_k ∈ R^q) which is much smaller in dimensionality than the one required by standard BoW approaches (q ≪ m).
2.2 Temporal Variation of Terms

In Subsection 2.1 we said that the concept space C usually corresponds to standard category names used to label the training instances in supervised text categorization tasks. In this scenario, which in [7] is referred to as direct derivation, each category label simply corresponds to a concept. However, in [7] other alternatives are also proposed, like split derivation and combined derivation. The former uses the low-level labels in hierarchical corpora and the latter is based on combining semantically related labels in a unique concept. In [8] those ideas are extended by first clustering each category of the corpora and then using those subgroups (sub-clusters) as the new concept space.

As we can see, the idea common to all the above approaches is that once a set of documents is identified as belonging to a group/category, that category can be considered as a concept and CSA can be applied in the usual way. We take a similar view to those works by considering that the positive (minority) class in ERD problems can be augmented with the concepts derived from the sets of partial documents read along the different time steps. In order to understand this idea it is necessary to first introduce a sequential work scheme as the one proposed in [9] for research in ERD systems for depression cases.
Following [9], we will assume a corpus of documents written by p different individuals ({I_1, ..., I_p}). For each individual I_l (l ∈ {1, ..., p}), the n_l documents that he has written are provided in chronological order (from the oldest text to the most recent text): D_{I_l,1}, D_{I_l,2}, ..., D_{I_l,n_l}. In this context, given these p streams of messages, the ERD system has to process every sequence of messages (in the chronological order they are produced) and to make a binary decision (as early as possible) on whether or not the individual might be a positive case of depression. Evaluation metrics on this task must be time-aware, so an early risk detection error (ERDE) is proposed. This metric takes into account not only the correctness of the decision but also the delay taken to make that decision.
In a usual supervised text categorization task, we would only have two category labels: positive (risk/depressive case) and negative (non-risk/non-depressive case). That would only give two concepts for a CSA representation. However, in ERD problems there is additional temporal information that could be used to obtain an improved concept space. For instance, the training set could be split in h chunks, Ĉ_1, Ĉ_2, ..., Ĉ_h, in such a way that Ĉ_1 contains the oldest writings of all users (first (100/h)% of submitted posts or comments), chunk Ĉ_2 contains the second oldest writings, and so forth. Each chunk Ĉ_k can be partitioned in two subsets Ĉ_k⁺ and Ĉ_k⁻, Ĉ_k = Ĉ_k⁺ ∪ Ĉ_k⁻, where Ĉ_k⁺ contains the positive cases of chunk Ĉ_k and Ĉ_k⁻ the negative ones of this chunk.
It is interesting to note that we can also consider the data sets that result of concatenating chunks that are contiguous in time, using the notation Ĉ_{i-j} to refer to the chunk obtained from concatenating all the (original) chunks from the i-th chunk to the j-th chunk (inclusive). Thus, Ĉ_{1-h} will represent the data set with the complete streams of messages of all the p individuals. In this case, Ĉ_{1-h}⁺ and Ĉ_{1-h}⁻ will have the obvious semantics specified above for the complete documents of the training set.
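The chunking and concatenation notation above can be sketched as follows. This is a minimal illustration under our own naming, assuming each user's writings come as a chronologically ordered list and chunks are split as evenly as possible:

```python
def split_chunks(writings, h):
    """Split one user's chronologically ordered writings into h chunks:
    chunk 1 holds the oldest (100/h)% of items, chunk 2 the next, etc."""
    n = len(writings)
    bounds = [round(k * n / h) for k in range(h + 1)]
    return [writings[bounds[k]:bounds[k + 1]] for k in range(h)]

def concat_chunks(chunks, i, j):
    """C_{i-j}: concatenation of the original chunks i..j (1-based, inclusive)."""
    return [w for c in chunks[i - 1:j] for w in c]

# Example: 10 writings split into h = 5 chunks of 2 writings each
chunks = split_chunks(list(range(10)), 5)
assert chunks[0] == [0, 1]                            # oldest chunk
assert concat_chunks(chunks, 1, 3) == [0, 1, 2, 3, 4, 5]   # C_{1-3}
assert concat_chunks(chunks, 1, 5) == list(range(10))      # C_{1-h}
```

Applying `concat_chunks(chunks, 1, h)` to every user reconstructs the complete stream Ĉ_{1-h} used by a classic (non-temporal) classifier.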
The classic way of constructing a classifier would be to take the complete documents of the p individuals (Ĉ_{1-h}) and use an inductive learning algorithm such as SVM or Naïve Bayes to obtain that classifier. As we mentioned earlier, another important aspect of ERD systems is that the classification problem being addressed is usually highly unbalanced (UDS problem). That is, the number of documents of the majority/negative class (non-depression) is significantly larger than that of the minority/positive class (depression). More formally, following the previously specified notation, |Ĉ_{1-h}⁻| ≫ |Ĉ_{1-h}⁺|.
An alternative to try to alleviate the UDS problem would be to consider that the minority class is formed not only by the complete documents of the individuals but also by the partial documents obtained in the different chunks. Following the general ideas posed in CSA, we could consider that the partial documents read in the different chunks represent temporal concepts that should be taken into account. In this context, one might think that variations of the terms used in these different sequential stages of the documents may carry relevant information for the classification task. With this idea in mind arises the method proposed in this work, named temporal variation of terms (TVT), which consists in enriching the documents of the minority class with the partial documents read in the first chunks. These first chunks of the minority class, along with the complete documents, will be considered as a new concept space for a CSA method.
Therefore, in TVT we first determine the number f of initial chunks that will be used to enrich the minority (positive) class. Then, we use the document sets Ĉ_1⁺, Ĉ_{1-2}⁺, ..., Ĉ_{1-f}⁺ and Ĉ_{1-h}⁺ as concepts for the positive class and Ĉ_{1-h}⁻ for the negative class. Finally, we represent terms and documents in this new (f + 2)-dimensional space using the CSA approach explained in Section 2.1.

3.1 Data Set
Our approach was tested on the data set used in the eRisk 2017 pilot task and described in [9]. It is a collection of writings (posts or comments) from a set of Social Media users. There are two categories of users, depressed and non-depressed, and, for each user, the collection contains a sequence of writings (in chronological order). For each user, the collection of writings has been divided into 10 chunks. The first chunk contains the oldest 10% of the messages, the second chunk contains the second oldest 10%, and so forth. This collection was split into a training and a test set that we will refer to as TR_DS and TE_DS respectively. The (training) TR_DS set contained 486 users (83 positive, 403 negative) and the (test) TE_DS set contained 401 users (52 positive, 349 negative). The users labeled as positive are those that have explicitly mentioned that they have been diagnosed with depression.
This task was divided into a training stage and a testing stage. In the first one, the participating teams had access to the TR_DS set with all chunks of all training users. They could therefore tune their systems with the training data. To reproduce the same conditions of the pilot task, we used the training set (TR_DS) to generate a new corpus divided into a training set (that we will refer to as TR_DS-train) and a test set (named TR_DS-test) with the same categories (depressed and non-depressed) for each sequence of writings of the users in the collection. Those sets maintained the same proportions of posts per user and words per user as described in [9]. TR_DS-train and TR_DS-test were generated by randomly selecting around 70% of the writings for the first one and the remaining 30% for the second one. Thus, TR_DS-train resulted in 351 individuals (63 positive, 288 negative) meanwhile TR_DS-test contains 135 individuals (20 positive, 115 negative). In the pilot task the collection of writings was divided into 10 chunks, so we made the same division on TR_DS-train and TR_DS-test.
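A user-level split of this kind can be reproduced with a sketch like the following. The seed and the exact sampling procedure are our assumptions; the text above only specifies the approximate 70/30 proportions and that whole users (not individual writings) are assigned to one side:

```python
import random

def split_users(users, train_frac=0.7, seed=0):
    """Randomly split users so that each user's full chronological
    stream of writings stays entirely in one of the two sides."""
    users = list(users)
    random.Random(seed).shuffle(users)  # deterministic for a fixed seed
    cut = round(train_frac * len(users))
    return users[:cut], users[cut:]

train, test = split_users(range(486), 0.7)
assert len(train) + len(test) == 486
```

Splitting by user rather than by writing keeps the chunk structure intact, so the 10-chunk division can then be applied to both sides exactly as in the pilot task.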
3.2 Experimental Results

We reproduced the same conditions faced by the participants of the eRisk pilot task, so we first worked on the data set released in the training stage (TR_DS) and then the obtained models were tested on the test stage (TE_DS). The activities carried out at each stage are described below.

Training stage. CSA is a document representation that aims at addressing some drawbacks of classical representations such as BoW. On the other hand, TVT is supposed to extend CSA by defining concepts that capture the sequential aspects of ERD problems and the variations of vocabulary observed in the distinct stages of the individuals' writings. Thus, CSA and BoW arise as obvious candidates to compare TVT against on the data set used in the pilot task. Those three representations were combined with different learning algorithms such as Naïve Bayes and Random Forest, among others. In each case, the best parameters were selected for each algorithm-representation combination (model) and the reported results correspond to the best obtained values.
We tested BoW with different weighting schemes and learning algorithms but, in all cases, the best results were obtained with binary representations and the Naïve Bayes algorithm. From now on, all references to BoW will stand for that setting. We used CSA with term representations normalized according to Equations 2 and 3 and document representations obtained from Equation 4, as proposed in [8] for author profiling tasks. We named this setting CSA⋆. For the TVT representation, a decision must be made related to the number f of chunks that will enrich the minority (positive) class. In our studies, we used f = 4 and, in consequence, the positive class was represented by 5 concepts. In that way, the number of documents in the depressed class was multiplied by 5 with respect to the original size, from 83 positive instances to 415. As we can see, with this technique we are also obtaining some kind of balancing in the sizes of both classes, addressing in that way another usual problem that we previously referred to as the UDS problem.
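The construction of this enriched training set can be sketched as follows, with hypothetical data structures and names of our own; with f = 4 and h = 10 it yields the five positive concepts plus one negative concept described above:

```python
def tvt_training_set(pos_users, neg_users, h=10, f=4):
    """TVT sketch: each positive user contributes the partial streams
    C_{1-1}, ..., C_{1-f} plus the complete stream C_{1-h}, each labeled
    with its own temporal concept; negative users contribute only their
    complete stream under the single negative concept (f + 2 concepts).
    Each user is a list of h chunks; each chunk is a list of writings."""
    data = []  # (writings, concept_label) pairs
    for chunks in pos_users:
        for k in list(range(1, f + 1)) + [h]:
            partial = [w for c in chunks[:k] for w in c]
            data.append((partial, "pos_1-%d" % k))
    for chunks in neg_users:
        data.append(([w for c in chunks for w in c], "neg_1-%d" % h))
    return data
```

Each positive individual thus yields f + 1 = 5 training documents, which is exactly the 83 to 415 increment of positive instances reported above; the resulting concept labels can be fed directly to the CSA representation of Section 2.1.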
A particularity that ERD methods must consider is the criterion used to decide when (in what situations) the classification generated by the system is considered the final/definitive decision on the evaluated instances (the classification time decision (CTD) issue). We will start our evaluation of the different document representations and algorithms assuming that the classification is made on a static chunk by chunk basis. That is, for each chunk Ĉ_i provided to the ERD systems we will evaluate their performance considering that all the models are (simultaneously) applied to the writings received up to the chunk Ĉ_i. With this kind of information it will be possible to observe to what extent the different approaches are robust to the partial information in the different stages, at which moment they start to obtain acceptable results, and other interesting statistics.
Tables 1, 2 and 3 show the results of experiments for this static chunk by chunk classification scheme. Values of precision (π), recall (ρ) and F_1-measure (F_1) of the target (depressed) class are reported for each considered model. Statistics also include the early risk detection error (ERDE) measure proposed in [9]. This measure considers not only the correctness of the decision made by the system but also the delay in making that decision. ERDE uses specific costs to penalize false positives and false negatives. However, ERDE gives a different treatment to the two possible successful predictions (true negatives and true positives). True negatives have no cost (cost = 0) but ERDE associates a cost to the delay in the detection of true positives that monotonically increases with the number k of textual items seen before giving the answer. In a nutshell, that cost is low when k is lower than a threshold value o but rapidly approaches 1 when k > o. In that way, o represents some type of urgency in detecting depression cases: the lower the o value, the higher the urgency in detecting the positive cases.

In our study we consider the two values of o used in the pilot task: o = 5 (ERDE_5) and o = 50 (ERDE_50). In each chunk, classifiers usually produce their predictions with some kind of confidence, in general, the estimated probability of the predicted class. In those cases, we can select different thresholds tr, considering that an instance (document) is assigned to the target class when its associated probability p is greater than or equal to a certain threshold tr (p ≥ tr). In this study we evaluated 5 different settings for the probabilities assigned by each classifier: p = 1, p ≥ 0.9, p ≥ 0.8, p ≥ 0.7 and p ≥ 0.6. Due to space constraints, only the best results obtained with a particular setting are shown.⁶
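The ERDE behaviour described above can be sketched as follows. The sigmoid latency cost for true positives follows the definition in [9]; the cost constants used here are illustrative assumptions, not the exact values of the pilot task:

```python
import math

def erde(decision, truth, k, o, c_fp=0.1296, c_fn=1.0, c_tp=1.0):
    """ERDE_o for one individual. decision/truth are 'pos'/'neg'; k is the
    number of textual items seen before the decision was emitted."""
    if decision == "pos" and truth == "neg":
        return c_fp
    if decision == "neg" and truth == "pos":
        return c_fn
    if decision == "pos" and truth == "pos":
        # latency cost: low for k < o, rapidly approaching 1 for k > o
        return (1 - 1 / (1 + math.exp(k - o))) * c_tp
    return 0.0  # true negative: no cost

assert erde("neg", "neg", k=3, o=5) == 0.0
assert erde("pos", "pos", k=1, o=50) < 0.01   # early true positive: tiny cost
assert erde("pos", "pos", k=100, o=50) > 0.99 # late true positive: near 1
```

The value reported for a system is then this cost averaged over all evaluated individuals, so a model that answers correctly but late on the positive cases is still penalized.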
Table 1 shows the results obtained with a BoW representation and a Naïve Bayes classifier. Those values correspond to the setting where an instance is considered as depressive if the classifier assigns to the target/positive class a probability greater than or equal to 0.8 (p ≥ 0.8). Surprisingly, the best results for all the considered measures are obtained on the first chunk. In this chunk, we can observe that this model only recovers 45% of the depressed individuals. However, this is not the worst aspect. Only 12% of the individuals classified as depressed effectively had this condition, resulting in consequence in a very low F_1 measure (0.19). Table 2 shows similar results when a CSA⋆-RF (random forest) combination with p ≥ 0.6 is used to classify the writings of the individuals. Here, the F_1 measure is also low but we can observe a deterioration in the ERDE_5 and ERDE_50 error values with respect to the previous model.
Finally, in Table 3, the results of TVT with a Naïve Bayes algorithm and p ≥ 0.6 are shown. There, we can see a remarkable improvement in the performance of the classifier in chunk 3, with excellent values of ERDE_50 (7.02), precision π (0.63), recall ρ (0.85) and F_1 measure (0.72). Analysing the results along the 10 considered chunks we observe how the measures keep improving from chunk 1 up to reaching the best values in chunk 3 and, from then on, they start to deteriorate chunk by chunk, obtaining the worst results on the last two chunks. As weak points of those results we can say that the best value of ERDE_5, obtained in chunk 1, is not very good. Besides, even though ERDE_50 values are acceptable for most of the considered chunks, they need at least two chunks to show a competitive performance. That aspect looks reasonable if we consider that TVT is based on the variation of terms between consecutive chunks and that information is not available on the first chunk.
As a general conclusion to the chunk by chunk analysis, we could say that imbalanced classes seem to affect the different methods in different ways. BoW and CSA directly depend on the vocabulary of the positive and negative classes. In the first chunk, where texts are supposed to be the shortest, relevant words of the positive class appearing in the posts will probably have more chance of being balanced with respect to the words appearing in the negative class. That makes classifiers more sensitive to the positive class and, in consequence, the recall and general performance are improved. As more information is read, words related to the negative class are more likely to occur, introducing noise and affecting in consequence the performance. TVT does not seem to be so affected by this problem, showing a more stable performance along all the chunks, with the best results in the third chunk and then a little deterioration from then on. Those results could be giving evidence that the variation of terms (with f = 4) allows to better detect the occurrence of relevant words of the positive class in the first chunks. However, it also seems to be affected by the unbalance problem in subsequent chunks, although at a lower level than the BoW and CSA representations. Unfortunately, verifying those hypotheses would require considering more balanced settings and different f values, which is out of the scope of this paper. However, that important aspect will be addressed in future works.

⁶ All the tables generated for the different probabilities can be downloaded from

Table 1. Model: BoW + Naïve Bayes (p ≥ 0.8). Chunk by chunk setting. ERDE_5, ERDE_50, F_1-measure (F_1), precision (π) and recall (ρ) of the depressed class.

         ch 1  ch 2  ch 3  ch 4  ch 5  ch 6  ch 7  ch 8  ch 9  ch 10
ERDE_5  18.09 20.98 21.50 21.73 21.95 21.95 21.95 21.95 22.17 22.17
ERDE_50 15.17 16.84 20.77 20.25 21.21 21.95 21.95 21.52 22.17 22.17
F_1      0.19  0.16  0.09  0.11  0.09  0.09  0.09  0.09  0.09  0.13
π        0.12  0.11  0.06  0.17  0.06  0.06  0.06  0.06  0.06  0.08
ρ        0.45  0.35  0.20  0.25  0.20  0.20  0.20  0.20  0.20  0.30

Table 2. Model: CSA⋆ + RF (p ≥ 0.6). Chunk by chunk setting. ERDE_5, ERDE_50, F_1-measure (F_1), precision (π) and recall (ρ) of the depressed class.

         ch 1  ch 2  ch 3  ch 4  ch 5  ch 6  ch 7  ch 8  ch 9  ch 10
ERDE_5  21.93 25.64 25.46 25.57 26.12 25.68 25.68 25.46 25.35 25.68
ERDE_50 19.47 24.94 25.46 23.35 25.37 24.20 23.46 22.50 22.39 23.47
F_1      0.19  0.08  0.05  0.10  0.06  0.08  0.13  0.16  0.16  0.14
π        0.11  0.05  0.03  0.06  0.04  0.05  0.07  0.09  0.09  0.08
ρ        0.60  0.25  0.15  0.30  0.20  0.25  0.40  0.50  0.50  0.45
Another approach for the CTD issue could be to directly use the probability (or some measure of confidence) assigned by the classifier to decide when to stop reading a document and give its classification. That approach, that in [9] is referred to as dynamic, only requires that this probability exceeds some particular threshold to classify the instance/individual as positive. That means that different streams of messages could be classified as depressed in different stages (chunks). Table 4 shows those statistics for the BoW, CSA⋆ and TVT representations with the learning algorithms and probability thresholds that obtained the best performance. There, we can see that the TVT representation, with a Naïve Bayes classifier and classifying instances as depressed when the assigned probability is 1, obtains the best results for the measures we are more interested in: ERDE_5, ERDE_50 and F_1-measure. In this context, BoW gets a better recall value but at the expense of lowering the precision values, resulting in a poor F_1-measure.
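The dynamic strategy reduces to a simple stopping rule over the stream of chunk-level predictions. A minimal sketch, where the probability estimates would come from any classifier over the writings read so far and all names are illustrative:

```python
def dynamic_decision(chunk_probs, tr):
    """chunk_probs: P(positive) estimated on the writings read up to each
    chunk, in chronological order. Returns (decision, chunk_number): 'pos'
    at the first chunk whose probability reaches the threshold tr, else
    'neg' after the last chunk has been read."""
    for k, p in enumerate(chunk_probs, start=1):
        if p >= tr:
            return "pos", k   # stop early: classified as depressed here
    return "neg", len(chunk_probs)

# A user whose estimated risk grows as more writings are read:
assert dynamic_decision([0.2, 0.55, 0.81, 0.9], tr=0.8) == ("pos", 3)
assert dynamic_decision([0.2, 0.3], tr=0.8) == ("neg", 2)
```

The chunk number at which the rule fires is what determines the delay k that enters the ERDE measure, which is why higher thresholds trade recall for later (or missed) detections.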
Table 3. Model: TVT + Naïve Bayes (p ≥ 0.6). Chunk by chunk setting. ERDE_5, ERDE_50, F_1-measure (F_1), precision (π) and recall (ρ) of the depressed class.

         ch 1  ch 2  ch 3  ch 4  ch 5  ch 6  ch 7  ch 8  ch 9  ch 10
ERDE_5  14.24 14.27 14.59 14.83 15.17 15.51 15.74 15.84 16.21 16.13
ERDE_50 10.80  7.22  7.02  9.24  9.25  9.97 10.73 10.73 11.06 10.96
F_1      0.42  0.65  0.72  0.67  0.67  0.67  0.64  0.64  0.57  0.58
π        0.39  0.58  0.63  0.60  0.60  0.60  0.58  0.58  0.50  0.52
ρ        0.45  0.75  0.85  0.75  0.75  0.75  0.70  0.70  0.65  0.65

Table 4. Dynamic models for BoW-NB, CSA⋆-NB and TVT-NB.

                 ERDE_5  ERDE_50  F_1   π     ρ
BoW (p ≥ 0.8)    21.05   18.13    0.24  0.14  0.75
CSA⋆-NB (p = 1)  23.09   23.07    0.06  0.04  0.15
TVT-NB (p = 1)   14.13   11.25    0.40  0.47  0.35

Testing stage. The previous results were obtained by training the classifiers with the TR_DS-train data set and testing them with the TR_DS-test data
set. The obvious question now is whether similar results are obtained by training with the full training set of the pilot task (TR_DS) and using the classifiers with the data set TE_DS that was incrementally released during the testing phase of the pilot task. In this new scenario, the TVT representation was used with a simple rule for the CTD issue that consists in classifying all the individuals in chunk 3 as positive (depressed) if a Naïve Bayes classifier produced a probability equal to or greater than 0.6 for the positive class. That strategy, which we will refer to as TVT_3^{p≥0.6}, is motivated by the good results showed by TVT in Table 3. We also tested the BoW, CSA⋆ and TVT representations with dynamic strategies, using the probabilities that obtained the best values in the training stage. As baselines we also tested two approaches described in [9] that will be named Ran and Min. Ran simply emits a random decision (depressed/non-depressed) for each user in the first chunk. Min, on the other hand, stands for minority and consists in classifying each user as depressive in the first chunk.
Table 5 shows the performance of all the above mentioned approaches on the test set of the pilot task (TE_DS). We also included the results reported in the eRisk page for the systems that obtained the best ERDE_5 (FHDO-BCSGB), ERDE_50 (UNSLA) and F_1 (FHDO-BCSGB) measures on the pilot task. Here we can observe that the results obtained with TVT_3^{p≥0.6} are not as good as those obtained in the training stage. However, the setting TVT-NB (p = 1) would have obtained the best ERDE_5 score and the third ERDE_50 value, with a small difference with respect to the best reported value (9.84 versus 9.68).

Those good results of TVT were achieved taking into account the best parameters obtained in the training stage. However, it would also be interesting to analyse what TVT's performance would have been if other parameter settings had been selected. Table 6 shows this type of information by reporting the results obtained with different learning algorithms (Naïve Bayes and Random Forest) and different probability values for dynamic approaches to the CTD issue.
Table 5. Results on the TE_DS test set.

                 ERDE_5  ERDE_50  F_1   π     ρ
Ran              16.83   14.63    0.17  0.11  0.40
Min              21.67   15.03    0.23  0.13  1.00
BoW (p ≥ 0.8)    16.45   10.87    0.38  0.25  0.77
CSA⋆-NB (p = 1)  20.58   19.58    0.05  0.03  0.15
TVT_3^{p≥0.6}    13.64   10.17    0.53  0.46  0.62
TVT-NB (p = 1)   12.38    9.84    0.42  0.50  0.37
FHDO-BCSGA       12.82    9.69    0.64  0.61  0.67
FHDO-BCSGB       12.70   10.39    0.55  0.69  0.46
UNSLA            13.66    9.68    0.59  0.48  0.79

Table 6. Results of TVT with different learning algorithms and probability values.

                  ERDE_5  ERDE_50  F_1   π     ρ
TVT-NB (p ≥ 0.6)  13.59    8.40    0.50  0.37  0.75
TVT-NB (p ≥ 0.7)  13.43    8.24    0.51  0.39  0.75
TVT-NB (p ≥ 0.8)  13.13    8.17    0.54  0.42  0.73
TVT-NB (p ≥ 0.9)  13.07    8.35    0.52  0.42  0.69
TVT-NB (p = 1)    12.38    9.84    0.42  0.50  0.37
TVT-RF (p ≥ 0.6)  12.46    8.37    0.55  0.49  0.63
TVT-RF (p ≥ 0.7)  12.49    8.52    0.55  0.50  0.62
TVT-RF (p ≥ 0.8)  12.30    8.95    0.56  0.54  0.58
TVT-RF (p ≥ 0.9)  12.34   10.28    0.47  0.55  0.40
TVT-RF (p = 1)    12.82   11.82    0.20  0.67  0.12

As we can see in Table 6, TVT shows a robust behaviour in the ERDE measures independently of the algorithm used to learn the model and the probability used in the dynamic approaches. Most of the ERDE_5 values are low and in 7 out of 10 settings the ERDE_50 values are lower than the best reported in the pilot task (UNSLA: 9.68). In this context, TVT achieves the best reported ERDE_5 value up to now (12.30) with the setting TVT-RF (p ≥ 0.8) and the lowest ERDE_50 value (8.17) with the model TVT-NB (p ≥ 0.8).
4 Conclusions and future work

In this article we present temporal variation of terms (TVT), an approach for early risk detection based on using the variation of vocabulary along the different time steps as concept space for document representation. TVT naturally copes with the sequential nature of ERD problems and also gives a tool for dealing with unbalanced data sets. Preliminary results with the eRisk 2017 data set show a better performance of TVT in comparison to other successful semantic analysis approaches and the standard BoW representation. It also shows a robust performance along different parameter settings and reaches the best reported results up to the moment for the ERDE_5 and ERDE_50 error evaluation measures.

As future work, we plan to apply the TVT approach to other problems, such as the one posed in the PAN-2012 competition on sexual predator identification [5], which shares several characteristics with the data set used in the present work, such as the sequentiality of data, unbalanced classes and the requirement of detecting the minority class (predator) as soon as possible, among others.
TVT is explicitly based on the enrichment of the minority class with new concepts derived from the partial information obtained from the initial chunks. However, some improvements can be achieved by also clustering the negative class, as proposed by [8] in author profiling tasks. We carried out some initial experiments combining TVT with the clustering of the negative class but more study is required to determine how both approaches can be effectively integrated. Besides, in the present work, the choice of f = 4 mainly aimed at obtaining balanced positive and negative classes. In future works, different f values will be considered to see how they impact on TVT's performance.

TVT provides, as a side effect, an interesting tool for dealing with the unbalanced data set problem. We plan to apply TVT on unbalanced data sets that do not necessarily correspond to the ERD field and to compare it against other well-known methods in this area, such as SMOTE [1]. Finally, it would be interesting to compare the concept space used in our approach against other recent and effective representations based on word embeddings. In this context, it could also be analysed how our concept space representation can be extended/improved with information provided by those embeddings.
References

1. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. Ph. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res., 16:321–357, 2002.
2. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the ASIS, 41(6):391–407, 1990.
3. H. Jair Escalante, M. Montes-y-Gómez, L. Villaseñor Pineda, and M. Errecalde. Early text classification: a naïve solution. In A. Balahur, E. Van der Goot, P. Vossen, and A. Montoyo, editors, Proc. of WASSA NAACL-HLT 2016, San Diego, California, USA, pages 91–99. The Association for Computational Linguistics, 2016.
4. E. Gabrilovich and S. Markovitch. Wikipedia-based semantic interpretation for natural language processing. JAIR, 34(1):443–498, March 2009.
5. G. Inches and F. Crestani. Overview of the international sexual predator identification competition at PAN-2012. In P. Forner, J. Karlgren, and C. Womser-Hacker, editors, CLEF (Online Working Notes/Labs/Workshop), pages 1–12, 2012.
6. M. Lan, Ch. Tan, J. Su, and Y. Lu. Supervised and traditional term weighting methods for automatic text categorization. IEEE TPAMI, 31(4):721–735, 2009.
7. Z. Li, Z. Xiong, Y. Zhang, Ch. Liu, and K. Li. Fast text categorization using concise semantic analysis. Pattern Recogn. Lett., 32(3):441–448, February 2011.
8. A. Pastor López-Monroy, M. Montes y Gómez, H. Jair Escalante, L. Villaseñor-Pineda, and E. Stamatatos. Discriminative subprofile-specific representations for author profiling in social media. Knowledge-Based Systems, 89:134–147, 2015.
9. D. E. Losada and F. Crestani. A test collection for research on depression and language use. In Experimental IR Meets Multilinguality, Multimodality, and Interaction – 7th Int. Conf. of the CLEF Association, Portugal, pages 28–39, 2016.