HAL Id: hal-00675578
https://hal.archives-ouvertes.fr/hal-00675578
Submitted on 1 Mar 2012
What About Sequential Data Mining Techniques to
Identify Linguistic Patterns for Stylistics?
Solen Quiniou, Peggy Cellier, Thierry Charnois, Dominique Legallois
To cite this version:
Solen Quiniou, Peggy Cellier, Thierry Charnois, Dominique Legallois. What About Sequential Data
Mining Techniques to Identify Linguistic Patterns for Stylistics?. CICLing 2012: Computational
Linguistics and Intelligent Text Processing, Mar 2012, New Delhi, India. pp.166-177,
10.1007/978-3-642-28604-9_14. hal-00675578
Solen Quiniou (1,2), Peggy Cellier (3), Thierry Charnois (1), and Dominique Legallois (2)
(1) GREYC, Université de Caen Basse-Normandie, Campus 2, 14000 Caen
(2) CRISCO, Université de Caen Basse-Normandie, Campus 1, 14000 Caen
(3) IRISA-INSA de Rennes, Campus de Beaulieu, 35042 Rennes Cedex
Abstract. In this paper, we study the use of data mining techniques for stylistic analysis, from a linguistic point of view, by considering emerging sequential patterns. First, we show that mining sequential patterns of words with gap constraints gives new relevant linguistic patterns with respect to patterns built on n-grams. Then, we investigate how sequential patterns of itemsets can provide more generic linguistic patterns. We validate our approach from a linguistic point of view by conducting experiments on three corpora of various types of French texts (Poetry, Letters, and Fiction). By considering more particularly poetic texts, we show that characteristic linguistic patterns can be identified using data mining techniques. We also discuss how to improve our proposed approach so that it can be used more efficiently for linguistic analyses.

1 Introduction
The study of phraseology, including stylistics, is a research field that has been investigated over the past 30 years by the linguistic community. More recently, there has been a particular interest in studies from corpus linguistics. Two main approaches can be identified: corpus-based and corpus-driven. Corpus-based approaches assume the existence of linguistic theories and use corpora to analyze their application and hence to validate them. Corpus-driven approaches consider that linguistic constructs emerge from corpus analysis. This analysis allows the discovery of co-occurring word patterns that will be the basis of linguistic analyses. Our work is part of the corpus-driven approaches since our goal is to assist linguists in discovering new linguistic constructs without any prior knowledge.
One of the first corpus-driven approaches was proposed by Renouf and Sinclair [1]. It consists of a study of collocational frameworks based on corpora; collocational frameworks represent discontinuous sequences of two grammatical words enclosing a lexical word (e.g., many + ? + of, which means many followed by a variable lexeme, symbolized by ?, itself followed by of). However, this approach is not entirely corpus-driven since the studied collocational frameworks were pre-selected by Renouf and Sinclair. In fact, most of the so-called corpus-driven approaches are partly corpus-based. More recently, Biber presented an [...] corpora [2]. To do so, he relies both on a preliminary work on the identification of lexical bundles (i.e., frequent sequences of contiguous words, aka n-grams) and on collocational frameworks to identify fixed and variable elements in the patterns he extracted. Furthermore, Biber considers two language registers (conversation and academic writing) and shows the interest of using a corpus-driven approach to study the specificities of patterns appearing in each register.

In this paper, we present a first and original study which aims at showing the interest of data mining methods for the stylistic analysis of large texts. The goal is to provide linguist experts with prominent, relevant, and understandable patterns which can be characteristic of a specific type of text, so that these experts can carry out a stylistic analysis based on these patterns. In fact, our work is in the continuity of Biber's but we consider various text types (instead of language registers) that we study from a stylistic point of view. To do so, we set up a methodology based on sequential data mining, from the extraction of patterns to the selection of the most relevant ones. We apply this methodology to stylistics. To the best of our knowledge, data mining methods have not yet been used in the field of stylistics whereas one of their advantages is to offer an interpretable result to users, as opposed to numerical methods such as Hidden Markov Models or Conditional Random Fields. Indeed, the latter methods have been shown to achieve good results for tasks like text categorization or information extraction but they produce outputs that are hardly understandable by humans. Thus, the approach that we propose is based on frequent sequential patterns [3], a well-known data mining technique to automatically discover frequent patterns based on the sequential order of data. We consider two types of sequential patterns: single-item patterns (an item represents a single piece of information, e.g. a word form); and itemset patterns. In this second type of patterns, a word is represented by a set of features. Therefore, extracted itemset patterns may combine different levels of abstraction (word forms, lemmas, POS tags, etc.): for instance, ⟨(PREP)(DET)(NC)⟩ or ⟨(to)(the DET)(NC)⟩. Furthermore, as we set our study in the field of stylistics, the end-goal is to extract patterns that are characteristic of a certain type of text. This is the reason why we focus on a specific type of sequential patterns: emerging patterns. Emerging patterns can capture contrast characteristics between classes or datasets [4]. Furthermore, these patterns can be analyzed by experts to discover new relationships in a given domain for a better understanding of it. Here, extracted emerging patterns could then be analyzed by linguists to discover linguistic patterns, characteristic of a certain type of text.
The rest of this paper is organized as follows. First, our methodology based on sequential data mining is introduced in Section 2. Then, Section 3 presents experimental results on the use of our methodology for stylistics both from a quantitative and a linguistic point of view. Finally, Section 4 discusses the leads to further investigate, while Section 5 draws some conclusions.
[Figure 1: tagged corpus 1 ... tagged corpus N; computation of the patterns; pattern set 1 ... pattern set N; computation of the emerging patterns; emerging pattern set 1 ... emerging pattern set N; linguistic interpretation]
Fig. 1. Overview of our proposed approach
2 Methodology
In this section, we give an overview of the proposed approach to identify characteristic linguistic patterns for each type of text (Section 2.1). Then, we present the sequential data mining techniques on which our approach is based: frequent sequential patterns (Section 2.2) and emerging patterns (Section 2.3).

2.1 Overview of the Proposed Approach

Figure 1 illustrates the various steps of our approach. $N$ corpora are used as the inputs of the process, with one corpus corresponding to a considered type of text. Each corpus is first pre-processed and then all its words are labeled with their lemma and their POS category (see Section 3.1). In the first step of our approach, sequential patterns are extracted for each corpus: $N$ sets of sequential patterns are therefore obtained. Then, in the second step, sets of emerging patterns are selected for each corpus, from the $N$ sets of sequential patterns previously extracted. Lastly, the $N$ sets of emerging patterns are given to a linguist who can use them to perform a linguistic interpretation. The first and second steps are presented in greater detail in the next sub-sections.

2.2 Sequential Pattern Mining
Sequential pattern mining is a well-known data mining technique used to find regularities in sequence databases, by considering the temporal order of the data. This technique was introduced by Agrawal et al. in [3].

An itemset, $I$, is defined as a set of literals called items, denoted by $I = (i_1 \ldots i_n)$. For example, $(a\ b)$ is an itemset with two items: $a$ and $b$. A sequence, $S$, is defined as an ordered list of itemsets, denoted by $S = \langle I_1 \ldots I_m \rangle$. For instance, $\langle (a\ b)(c)(d)(a) \rangle$ is a sequence of four itemsets. It should be noted that a lot of applications need only one item in their itemsets (e.g. DNA strings or protein sequences). These particular kinds of sequences are called single-item sequences; for the sake of clarity, they are denoted by $S = \langle i_1 \ldots i_n \rangle$, where $i_1 \ldots i_n$ are items. Several algorithms have been developed to efficiently mine that kind of specific sequences, for example [5]. In the rest of the paper, both kinds of sequences will be considered, i.e. single-item sequences and itemset sequences.

A sequence $S_1 = \langle I_1 \ldots I_n \rangle$ is included in a sequence $S_2 = \langle I'_1 \ldots I'_m \rangle$ if there exist integers $1 \le j_1 < \ldots < j_n \le m$ such that $I_1 \subseteq I'_{j_1}, \ldots, I_n \subseteq I'_{j_n}$. The sequence $S_1$ is thus called a subsequence of $S_2$, which is denoted $S_1 \preceq S_2$. For example, we have the following relation: $\langle (c)(a) \rangle \preceq \langle (a\ b)(c)(d)(a) \rangle$.

Table 1. $SDB_1$: a sequence database

Sequence identifier    Sequence
1                      $\langle (a\ b)(c)(d)(a) \rangle$
2                      $\langle (d)(a)(e) \rangle$
3                      $\langle (d)(a\ b\ e)(c\ d\ e) \rangle$
4                      $\langle (c)(a) \rangle$

A sequence database $SDB$ is a set of tuples $(sid, S)$, where $sid$ is a sequence identifier and $S$ a sequence. For instance, Table 1 represents a sequence database of four sequences. A tuple $(sid, S)$ contains a sequence $S_1$ if $S_1 \preceq S$. The support of a sequence $S_1$ in a sequence database $SDB$, denoted $sup(S_1)$, is the number of tuples containing $S_1$ in the database. For example, in Table 1, $sup(\langle (a)(e) \rangle) = 2$ since sequences 2 and 3 contain an itemset with $a$ followed by an itemset with $e$. The relative support of sequences may also be used, as defined by Equation 1:

\[ sup(S_1) = \frac{|\{(sid, S) \mid (sid, S) \in SDB \wedge (S_1 \preceq S)\}|}{|SDB|} \tag{1} \]

A frequent pattern is a sequence such that its support is greater than or equal to a given threshold: $minsup$. Sequential pattern mining algorithms thus extract all the frequent sequential patterns that appear in a sequence database.

Because the set of frequent sequential patterns can be very large, there exists a condensed representation which eliminates redundancies without loss of information: closed sequential patterns [6]. A frequent sequential pattern $S$ is closed if there exists no other frequent sequential pattern $S'$ such that $S \preceq S'$ and $sup(S) = sup(S')$. For instance, with $minsup = 2$, the sequential pattern $\langle (b)(c) \rangle$ from Table 1 is not closed whereas the pattern $\langle (a\ b)(c) \rangle$ is closed. Moreover, in order to drive the mining process towards the user objectives and to eliminate irrelevant patterns, one can define constraints [7,8]. The most commonly used constraint is the frequency constraint (that assigns a value to $minsup$). Another widespread constraint is the gap constraint. A sequential pattern with a gap constraint $[M, N]$, denoted by $P_{[M,N]}$, is a pattern such that at least $M - 1$ itemsets and at most $N - 1$ itemsets are allowed between every two neighbor itemsets, in the original sequences. For instance, let $P_{[1,3]} = \langle (c)(a) \rangle$ and $P_{[2,3]} = \langle (c)(a) \rangle$ be two patterns with two different gap constraints and let us consider the sequences of Table 1. Sequences 1 and 4 are occurrences of pattern $P_{[1,3]}$ (sequence 1 contains one itemset between $(c)$ and $(a)$ whereas sequence 4 contains no itemset between $(c)$ and $(a)$), but only sequence 1 is an occurrence of $P_{[2,3]}$ (only sequences with one or two itemsets between $(c)$ and $(a)$ are occurrences of this pattern).

In this paper, the considered databases correspond to corpora. Furthermore, two kinds of sequential patterns are considered: single-item patterns and itemset patterns. In that last case, itemsets can be made up of three types of items: word forms, lemmas, and POS tags.
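The definitions of this sub-section (inclusion, support, and gap constraints) can be checked on the database of Table 1 with a few lines of Python. This is only a minimal sketch: the names `seq`, `occurs`, and `support` are illustrative choices, the search is a naive backtracking test rather than one of the mining algorithms cited above, and the gap constraint is assumed to have $M \ge 1$.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Itemset = FrozenSet[str]
ItemsetSequence = List[Itemset]


def seq(*itemsets: str) -> ItemsetSequence:
    """Build a sequence from one string per itemset, e.g. seq("a b", "c")."""
    return [frozenset(s.split()) for s in itemsets]


# The sequence database of Table 1, copied from the text.
SDB1: Dict[int, ItemsetSequence] = {
    1: seq("a b", "c", "d", "a"),
    2: seq("d", "a", "e"),
    3: seq("d", "a b e", "c d e"),
    4: seq("c", "a"),
}


def occurs(pattern: ItemsetSequence, sequence: ItemsetSequence,
           gap: Optional[Tuple[int, int]] = None) -> bool:
    """Test whether `pattern` is included in `sequence`.

    With gap=(M, N), two neighbor itemsets of the pattern must be matched
    at positions whose distance lies in [M, N], i.e. with M-1 to N-1
    itemsets in between (M >= 1 is assumed).
    """
    def search(k: int, prev: Optional[int]) -> bool:
        if k == len(pattern):
            return True
        if prev is None:
            candidates = range(len(sequence))
        elif gap is None:
            candidates = range(prev + 1, len(sequence))
        else:
            m, n = gap
            candidates = range(prev + m, min(prev + n, len(sequence) - 1) + 1)
        # Itemset inclusion is the subset test on frozensets.
        return any(pattern[k] <= sequence[j] and search(k + 1, j)
                   for j in candidates)

    return search(0, None)


def support(pattern: ItemsetSequence, sdb: Dict[int, ItemsetSequence],
            gap: Optional[Tuple[int, int]] = None) -> int:
    """Absolute support: number of tuples of the database containing the pattern."""
    return sum(occurs(pattern, s, gap) for s in sdb.values())
```

Dividing the value returned by `support` by `len(SDB1)` gives the relative support of Equation 1. The pair of patterns used to illustrate closedness can be compared the same way, since both reach the same support on Table 1 and one is included in the other.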
2.3 Emerging Patterns

Emerging patterns are defined as sequential patterns whose support increases significantly from one dataset to another one [4]. More specifically, emerging patterns are sequential patterns whose growth rate, i.e. the ratio of the supports in the two datasets, is at least equal to a given threshold $\rho$. Thus, a sequential pattern $P$ from a dataset $D_1$ is an emerging pattern with respect to another dataset $D_2$ if $GrowthRate(P) \ge \rho$, with $\rho > 1$ and with $GrowthRate(P)$ being defined by:

\[ GrowthRate(P) = \begin{cases} \infty, & \text{if } sup_{D_2}(P) = 0 \\ \dfrac{sup_{D_1}(P)}{sup_{D_2}(P)}, & \text{otherwise} \end{cases} \tag{2} \]

with $sup_{D_1}(P)$ ($sup_{D_2}(P)$ respectively) being the relative support of the pattern $P$ in $D_1$ ($D_2$ respectively). Since we are only interested in patterns belonging to $D_1$, we do not consider patterns $P$ with $sup_{D_1}(P) = 0$.

In the case of stylistic analyses, each dataset contains the frequent sequential patterns of a corpus and thus of the corresponding type of text. It corresponds to the patterns extracted during the first step of our approach (see Section 2.2). Because we consider more than two types of text, we compute the emerging patterns of a considered type of text with respect to every other type, according to Equation 2. Finally, only the patterns that are emerging with respect to every other type of text are kept as emerging patterns for a considered type of text. The computation of all the emerging patterns is done efficiently based on [9].
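The selection rule of this sub-section can be sketched as follows. This is an illustration of Equation 2 only, not the efficient computation of [9]; the function names are hypothetical, and treating a pattern that is absent from another pattern set as having support 0 there is an assumption of the sketch.

```python
import math
from typing import Dict, Hashable, List, Set


def growth_rate(sup_d1: float, sup_d2: float) -> float:
    """Equation 2: ratio of the relative supports, infinite if sup_d2 is 0."""
    if sup_d2 == 0:
        return math.inf
    return sup_d1 / sup_d2


def emerging_patterns(target: Dict[Hashable, float],
                      others: List[Dict[Hashable, float]],
                      rho: float = 1.001) -> Set[Hashable]:
    """Keep the patterns of `target` that are emerging w.r.t. every other set.

    `target` and each element of `others` map a pattern to its relative
    support in one type of text. A pattern missing from another set is
    given support 0 there (assumption of this sketch).
    """
    return {
        pattern
        for pattern, sup in target.items()
        if sup > 0 and all(growth_rate(sup, other.get(pattern, 0.0)) >= rho
                           for other in others)
    }
```

With toy supports, a pattern is kept only if its growth rate reaches $\rho$ against each of the other types of text, so a pattern that is equally frequent in one other type is discarded even if it never appears in the remaining ones.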
3 Experimental Evaluation
In this section, we report the results of our experimental evaluation on using sequential pattern mining techniques for stylistics. First, in Section 3.1, we describe the used corpora as well as the setup of the various parameters used to extract emerging sequential patterns. Then, we present an analysis of the extracted sequential patterns, at two levels: from a quantitative point of view (in Section 3.2), and from a linguistic point of view for stylistics (in Section 3.3).

3.1 Experimental Setup
Corpora. We created three corpora, corresponding to various types of text: Poetry, Letters, and Fiction. To build each corpus, we selected all the texts of the 1800-1900 era, provided by the French resources of the CNRTL, corresponding to the considered type of text. For example, authors from Poetry include Lamartine and Musset, whereas Hugo and Lamennais are part of the authors of Letters, and Chateaubriand and Zola are authors of Fiction. Then, these three corpora were pre-processed. The pre-processing steps consisted in setting the words in lower-case, and then splitting the texts into sequences at punctuation marks of the set: {'.', '?', '!', '...', ';', ':', ','}. Table 2 gives some details on each corpus: the number of authors, of works, of sequences, and of words.

Table 2. Details on each corpus

Corpus     #authors   #works   #sequences   #words
Poetry     27         48       151116       1167422
Letters    5          9        234997       1562543
Fiction    37         52       663860       5105240
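The two pre-processing steps described above (lower-casing, then splitting at the listed punctuation marks) can be sketched in Python. The function name `to_sequences` is hypothetical, and tokenizing each sequence on whitespace is an assumption of this sketch, since the tokenization actually used is not detailed here.

```python
import re
from typing import List

# Punctuation marks of the set given in the text; '...' is listed first so
# that an ellipsis is consumed as one separator.
_SEPARATORS = re.compile(r"\.\.\.|[.?!;:,]")


def to_sequences(text: str) -> List[List[str]]:
    """Lower-case `text` and split it into sequences of words.

    Words are obtained by splitting on whitespace (assumption); empty
    sequences, e.g. after a final punctuation mark, are dropped.
    """
    chunks = _SEPARATORS.split(text.lower())
    return [chunk.split() for chunk in chunks if chunk.split()]
```

For instance, a verse containing a comma yields two sequences, which is why Table 2 reports many more sequences than works.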
After being pre-processed, the corpora were POS tagged using Cordial, a tagger that is known to outperform TreeTagger on French texts. Thus, each word of the corpora was associated with its form, its lemma and its POS tag. After first experiments, it turned out that the POS tags given by Cordial were too specific; we thus post-processed them to reduce their number (as a consequence, it reduces the number of extracted patterns). First, too specific categories were merged into more general ones. For example, the adjective category was initially decomposed into 16 categories (depending on the gender, the number, or whether the word starts with a mute h letter). Thus, the following categories were created, to replace their corresponding sub-categories: adjectives (ADJ), determiners (DET), common nouns (NC), proper nouns (NP), demonstrative pronouns (PD), relative pronouns (PR), indefinite pronouns (PI), and past participles (VPARP). Then, categories corresponding to personal pronouns were decomposed into 2 tags: one for the personal pronoun (PPER), and one for the person (e.g. 1S for the singular first person). Moreover, categories corresponding to verbs were decomposed into 3 tags: one for the verb (V), one for the mode of the verb (e.g. INDP for the present of the indicative mode), and one for the person (the same ones as for the personal pronouns). At the end, we had a set of 35 tags instead of the 133 initial tags. Using this new set of tags, the phrase a rose that we smell is translated as ⟨(a a DET)(rose rose NC)(that that PR)(we we PPER 1P)(smell smell V PRES 1P)⟩.
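As an illustration of this representation, the example phrase can be encoded as a sequence of itemsets in Python. This is a sketch under assumptions: the helper `word_itemset` is hypothetical, and prefixing each item with its type ("form", "lemma", "pos") is a choice made here so that a form and a lemma with the same spelling remain two distinct items in a set; the text does not say how the mining tools encode them.

```python
from typing import FrozenSet, Iterable, List, Tuple

Item = Tuple[str, str]  # (item type, value), e.g. ("pos", "DET")


def word_itemset(form: str, lemma: str, tags: Iterable[str]) -> FrozenSet[Item]:
    """Represent one tagged word as an itemset of typed items."""
    items = {("form", form), ("lemma", lemma)}
    items.update(("pos", tag) for tag in tags)
    return frozenset(items)


# The phrase "a rose that we smell" with the tags given in the text.
phrase = [
    ("a", "a", ["DET"]),
    ("rose", "rose", ["NC"]),
    ("that", "that", ["PR"]),
    ("we", "we", ["PPER", "1P"]),
    ("smell", "smell", ["V", "PRES", "1P"]),
]
sequence: List[FrozenSet[Item]] = [word_itemset(*word) for word in phrase]
```

Each element of `sequence` is one itemset, so the verb contributes five items (form, lemma, and three tags) while the determiner contributes three.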
Mining Single-Item Sequences. First, we considered single-item sequences of words. To perform the mining task on the three corpora, we used dmt4 [5] that allows the definition of various constraints on the extracted single-item sequential patterns: the length, the frequency (by setting $minsup$, the support threshold), or the gaps (by choosing the values of $[M, N]$). We set the length of the patterns to be between 2 and 20. We chose the value of $minsup$ empirically as a trade-off between having interesting patterns with a low support (thus setting a low value to $minsup$) and having not too many patterns (thus setting a high value to $minsup$). Because of the differences in the corpora sizes (Fiction is five times bigger than Poetry), we chose a relative threshold whose value is 0.001%; it corresponds to the following absolute thresholds: 16 for Poetry, 12 for Letters, and 51 for Fiction. That means that only patterns appearing in at least 16 sequences are kept for Poetry, for example. For the gap constraints, we chose to consider different values in the following experiments (see Section 3.2): $[1, 1]$, $[1, 2]$, $[1, 3]$, and $[1, 5]$. It is worth noting that the $[1, 1]$ gap constraint corresponds to considering n-gram patterns. Indeed, patterns extracted under this constraint correspond to sub-sequences of consecutive words of the corpus.

Table 3. Number of extracted patterns (ratio of emerging patterns in parentheses)

Corpus     Single-item patterns with gaps                                  Itemset
           [1, 1]           [1, 2]           [1, 3]           [1, 5]           patterns
Poetry     18816 (30.7%)    37933 (27.0%)    55762 (24.3%)    86901 (22.6%)    2245326 (11.4%)
Letters    16936 (50.2%)    36849 (50.7%)    56755 (50.4%)    96549 (50.0%)    10128288 (57.4%)
Fiction    78210 (6.1%)     175645 (5.3%)    282967 (4.9%)    512647 (4.6%)    11681913 (71.2%)
Total      113962 (16.7%)   250427 (15.3%)   395484 (14.2%)   696097 (13.2%)   24055527 (59.8%)

Mining Itemset Sequences. Finally, we considered itemset sequences, where each itemset represents a word with its form, its lemma, and its POS tag. To mine these itemset sequences, we chose CloSpan [6] that extracts closed sequential itemset patterns. CloSpan allows only one constraint to be set: the support threshold $minsup$. We also chose empirically the value of $minsup$ to be 0.15%. Note that, because no gap constraint can be set in CloSpan, we had to choose a higher value for $minsup$ to limit the total number of patterns that are generated and hence to limit the computation time. The drawback of that choice is that interesting patterns may not be extracted because their support may be too low (for example, the absolute support threshold is 1000 for Fiction).

Selecting Emerging Patterns. To select the emerging patterns of the corpora, we set the threshold $\rho$ just above 1: $\rho = 1.001$. This threshold is used on both single-item patterns and itemset patterns.

3.2 Quantitative Analysis of the Patterns
In this sub-section, we present quantitative results on the single-item patterns and on the itemset patterns. The set of extracted patterns being large, this quantitative analysis allows us to select the patterns that will be actually analyzed from a linguistic point of view, for the stylistic task (see Section 3.3).
Table 3 gives the number of extracted patterns for the three corpora, by considering the two types of patterns: single-item patterns (with various gap constraints) and itemset patterns. The ratio of emerging patterns is also given for each type of patterns. Thus, among the 18816 patterns extracted from Poetry (by setting the gap constraint to $[1, 1]$), 30.7% of the patterns are emerging [...] patterns allows a large reduction of the total number of sequential patterns to analyze. Moreover, it allows us to focus our attention on more interesting patterns in the context of stylistics. That is why we will only consider emerging patterns in the rest of the analyses. Furthermore, we can see that by increasing the gap constraint, the rate of single-item emerging patterns tends to decrease: it means that additional extracted patterns tend to be non-specific patterns of the studied types of text. For the stylistic analysis presented in Section 3.3, we set the gap constraint to $[1, 3]$ as a trade-off between the total number of extracted single-item patterns and their relevance. Finally, we can see that many more itemset patterns are extracted, compared to the number of extracted single-item patterns.

[Figure 2: percentage of sequential patterns (%) as a function of the length of the sequential patterns, for single-item patterns with gap [1,1], [1,2], [1,3], [1,5] and for itemset patterns]
Fig. 2. Distribution of the emerging patterns w.r.t. the length

[Figure 3: aggregate percentage of sequential patterns (%) as a function of the growth rate of the sequential patterns, for single-item patterns with gap [1,1], [1,2], [1,3], [1,5] and for itemset patterns]
Fig. 3. Distribution of the emerging patterns w.r.t. the growth rate

Then, we study the distribution of the emerging patterns w.r.t. their length. Figure 2 plots the relative number of patterns for the various pattern lengths, for the single-item patterns (the length is given as the number of items) and for the itemset patterns (the length is given as the number of itemsets). The pattern distributions are computed on the three corpora considered as a whole, for each gap constraint value considered. We can see that most of the single-item [...] linguistic patterns. Moreover, there are a lot of single-item patterns of length 2 but they are not as instructive as longer patterns, from a linguistic point of view. That is why we will only consider patterns whose length is greater than 2, for the stylistic analysis.
Next, we study the distribution of the emerging patterns w.r.t. growth rates. Figure 3 plots the aggregate relative number of emerging patterns as a function of the growth rate, by considering the three corpora as a whole. It means that, for example, 67.1% of the emerging itemset patterns have a growth rate greater than 4. We can see that most of the emerging patterns have an infinite growth rate as the aggregate rate of emerging patterns is stable for growth rates greater than 10. It means that most of the emerging patterns appear only in a certain type of text (and not at all in the other types of text). In the stylistic analysis, we consider only emerging patterns with an infinite growth rate.

Lastly, only itemsets containing both POS tags and word forms or lemmas are considered during the stylistic analysis. Patterns containing only POS tags are therefore removed as they are too general and patterns containing only word forms or lemmas are also removed as they are too specific. In fact, most of the itemset patterns contain both POS tags and words since these patterns represent 93.5% of all the itemset patterns. That concurs with Biber's conclusions [2] on the extracted patterns that contain both variable and fixed elements (patterns only with POS tags thus contain only variable elements whereas patterns only with words contain only fixed elements).
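The three selection criteria retained in this sub-section for itemset patterns (length greater than 2, infinite growth rate, and a mix of POS tags with word forms or lemmas) can be combined into a single filter. This is a sketch only: the typed-item encoding `(kind, value)` with kinds "form", "lemma", and "pos" and both function names are assumptions made for illustration.

```python
import math
from typing import FrozenSet, List, Tuple

Item = Tuple[str, str]  # (kind, value) with kind in {"form", "lemma", "pos"}
Pattern = List[FrozenSet[Item]]


def mixes_pos_and_words(pattern: Pattern) -> bool:
    """True if the pattern has at least one POS tag and one form or lemma."""
    kinds = {kind for itemset in pattern for kind, _ in itemset}
    return "pos" in kinds and bool(kinds & {"form", "lemma"})


def keep_for_stylistic_analysis(pattern: Pattern, growth_rate: float) -> bool:
    """Apply the three criteria: length > 2, infinite growth rate, mixed items."""
    return (len(pattern) > 2
            and math.isinf(growth_rate)
            and mixes_pos_and_words(pattern))
```

A pattern such as ⟨(the)(NC)(that)⟩ with an infinite growth rate passes the filter, whereas a pattern made of POS tags only, a pattern of length 2, or a pattern with a finite growth rate is discarded.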
3.3 Stylistic Analysis of the Emerging Patterns

In this sub-section, we present a stylistic analysis of some extracted emerging patterns. We focus our attention more particularly on the Poetry corpus.

First of all, we consider single-item patterns. By studying them, we can find some interesting patterns, characteristic of Poetry. Table 4 shows examples of such identified characteristic patterns. In the patterns, the symbol * is used to represent a gap of one or more words⁵. Furthermore, we also illustrate each pattern with examples of underlying sequences in Poetry. The extracted patterns allow the observation of schematic grammatical structures that are relatively lexicon-independent. Indeed, fixed elements of these patterns are grammatical words whereas variable elements (i.e., filling the gaps) are generally lexical words (e.g., nouns, verbs, or adjectives). We also show the interest of gap constraints that are given as intervals. The pattern some * more * than allows the identification of two sequences, among others, where the first gap is filled with a different number of words (see Table 4): in the first one, the word bites fills the gap whereas it is filled with the words angular rocks in the second sequence. This illustrates the generalization capability of single-item patterns with gap constraints (w.r.t. n-gram patterns, for instance).

⁵ Symbol * corresponds to symbol ? used in [1]. Note that symbol * is also used in [2] but it represents a single variable lexeme whereas, in our approach, this symbol [...]
Table 4. Examples of single-item patterns characteristic of Poetry

Single-item pattern                 Example (with English translation)

des * plus * que                    il a des morsures plus venimeuses que celles de ta bouche
(some * more * than)                (he has some bites more venomous than those from your mouth)
                                    des cailloux anguleux plus brillants que des marbres
                                    (some angular rocks brighter than some marbles)

on * et * on                        une rose qu'on respire et qu'on jette
(we * and * we)                     (a rose that we smell and that we throw)
                                    sur des tombeaux divins qu'on brise et qu'on insulte?
                                    (on divine tombs that we break and that we insult?)

le/la/l' * qui * et * qui           la nuit qui m'oppresse et qui trouble mes yeux
(the * that * and * that)           (the night that oppresses me and that troubles my eyes)
                                    le grelot qui résonne et le troupeau qui bêle
                                    (the bell that resounds and the flock that bleats)

le * du * qui * dans                le vent du soir qui meurt dans le feuillage
(the * of the * that * in)          (the wind of the night that dies in the foliage)
                                    le bruit du vieux qui bêche dans la nuit
                                    (the sound of the old that digs in the night)

est * un * qui                      est-ce un goéland qui bat de l'aile?
(is * a * that)                     (is it a gull that flaps its wing?)
                                    ta grâce est comme un luth qui vibre au fond du bois
                                    (your grace is like a lute that vibrates deep in the wood)
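To make the role of the gap interval concrete, the pattern des * plus * que can be matched against the two Poetry examples of Table 4 with a small Python function. This is a sketch: the name `matches` is hypothetical, words are obtained by whitespace splitting (an assumption), and the gap constraint $[M, N]$ is interpreted as in Section 2.2, i.e. between $M - 1$ and $N - 1$ words may separate two neighbor words of the pattern.

```python
from typing import List, Optional, Tuple


def matches(pattern: List[str], words: List[str],
            gap: Tuple[int, int] = (1, 3)) -> bool:
    """True if `pattern` occurs in `words` under the gap constraint [M, N].

    Two neighbor words of the pattern must be found at positions whose
    distance lies in [M, N]; M >= 1 is assumed.
    """
    m, n = gap

    def search(k: int, prev: Optional[int]) -> bool:
        if k == len(pattern):
            return True
        if prev is None:
            candidates = range(len(words))
        else:
            candidates = range(prev + m, min(prev + n, len(words) - 1) + 1)
        return any(words[j] == pattern[k] and search(k + 1, j)
                   for j in candidates)

    return search(0, None)


pattern = ["des", "plus", "que"]
first = "il a des morsures plus venimeuses que celles de ta bouche".split()
second = "des cailloux anguleux plus brillants que des marbres".split()
```

Under the $[1, 3]$ constraint both sequences are occurrences of the pattern, although the first gap holds one word in `first` and two words in `second`. Under $[1, 1]$, which corresponds to n-grams, neither sequence matches.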
Table 5 gives the correspondence between the single-item patterns presented in Table 4 and their associated itemset patterns. First, it can be seen that several itemset patterns may correspond to the same single-item pattern. Furthermore, extracted itemset patterns make it possible to obtain the POS categories of the variable elements. Therefore, in the context of a stylistic study of types of text, the work of linguists consists in selecting relevant patterns among automatically extracted itemset patterns: this directly gives them grammatical patterns characteristic of a considered type of text.

Table 5. Single-item patterns and their associated grammatical patterns

Single-item pattern (English translation)           Grammatical pattern
des * plus * que (some * more * than)               some N more ADJ than
on * et * on (we * and * we)                        N that we V and that we V
le/la/l' * qui * et * qui (the * that * and)        the N that V and (that) V
                                                    the N that V and the N that V
le * du * qui * dans (the * of the * that * in)     the N of the N that V in the N
est * un * qui (is * a * that)                      is it a N that V
                                                    is like a N that V

In fact, the grammatical patterns we consider correspond to collocational frameworks in the sense of Renouf and Sinclair [1], i.e. collocations on grammatical units and not on lexical units. However, as opposed to their work, we do not choose a priori the patterns that are then studied but we automatically discover them from corpora. We can also compare our work to that of Biber [2], who also works on collocational frameworks, but there are some differences. Indeed, our approach directly extracts single-item patterns with gaps as well as itemset patterns (corresponding to grammatical patterns) whereas Biber first extracts frequent sequences from corpora and then analyzes them one by one to identify variable and fixed elements, to finally build various types of patterns that he studies afterwards. Since Renouf and Sinclair's paper, works on collocational frameworks have been done in English corpus linguistics, but not [...] insights when associated to an actual usage theory considering that grammatical forms come from a linguistic usage (i.e. corpus-driven approaches) and are not the result of integrated rules (i.e. corpus-based approaches). Therefore, it is interesting to have approaches that automatically extract patterns to provide these collocation frameworks, as is the case with our proposed approach.
4 Discussion

In the previous section, we have shown that sequential patterns can be interpreted by linguists for stylistic analyses. However, a huge number of sequential patterns are extracted with data mining techniques, from which the interesting ones have to be identified. In this section, we discuss the improvements that could be brought to our current approach to make it easier for linguists to deal with the presented sequential patterns. To this end, we identified two leads.
First, in order to focus our attention on the interesting sequential patterns, it is necessary to be able to set new constraints during the data mining process to narrow the number of extracted patterns down. Thus, it would be interesting to also set gap constraints on itemset patterns (as it is already the case for single-item patterns). In addition, as we can set a minimum threshold, $minsup$, for the pattern supports, it would be interesting to set a maximum threshold, $maxsup$, for the pattern supports as well. Indeed, most interesting sequential patterns generally appear in few sequences. Thus, by not considering too frequent sequential patterns, the total number of patterns would be reduced (for instance, by setting $maxsup = 50$, 21.2% of the Poetry single-item patterns would not be extracted). Moreover, it would allow us to set $minsup$ to a lower value, and hence to discover rarer sequential patterns without increasing the total number of patterns. In addition, membership constraints on a certain item type could also be defined to filter out more sequential patterns (e.g. only considering sequential patterns containing at least one verb).

Lastly, it would be of interest to provide tools allowing the ordering of the patterns, their filtering, or their exploration jointly with the sequences of the corpus they refer to. Therefore, it would be easier for linguists, in particular, to analyze the extracted sequential patterns (more particularly for itemset patterns [...]
5 Conclusion

In this paper, we have presented a first study on using data mining techniques for stylistics by proposing a methodology based on the extraction of sequential patterns and applied to stylistics. Thus, we have considered two types of sequential patterns: single-item patterns and itemset patterns (based on word forms, lemmas and POS tags). Moreover, we focused our attention on specific sequential patterns: emerging patterns. A quantitative analysis of the sequential patterns extracted from three corpora (representing various types of text, namely Poetry, Letters, and Fiction) has shown that sequential patterns are more powerful than n-grams to express linguistic patterns. That has been confirmed by a linguistic analysis of the extracted emerging sequential patterns since some grammatical patterns characteristic of Poetry were identified from these sequential patterns. We also compared our methodology to the one proposed by Biber [2] by showing that ours directly yields patterns characteristic of types of text. Lastly, we have discussed the improvements that could be brought to our proposed approach both by limiting the total number of sequential patterns that are extracted and hence have to be analyzed (by defining new constraints on the patterns), and by making it easier for linguists to explore and analyze the patterns (by developing suitable tools for this task). Therefore, these discussions give us leads to investigate in future studies; some of these works are already in progress.

Acknowledgments

This work is partly supported by the French Région Basse-Normandie and by the ANR (National Research Agency) funded project Hybride ANR-11-BS02-002.
References
1. Renouf, A., Sinclair, J.: Collocational Frameworks in English. In: English Corpus Linguistics: Studies in Honour of Jan Svartvik. Longman (1991) 128-143
2. Biber, D.: A corpus-driven approach to formulaic language in English. International Journal of Corpus Linguistics 14 (2009) 275-311
3. Agrawal, R., Srikant, R.: Mining sequential patterns. (In: Proc. of ICDE'95) 3-14
4. Dong, G., Li, J.: Efficient mining of emerging patterns: Discovering trends and differences. (In: Proc. of SIGKDD'99) 43-52
5. Nanni, M., Rigotti, C.: Extracting trees of quantitative serial episodes. (In: Proc. of KDID'07) 170-188
6. Yan, X., Han, J., Afshar, R.: Clospan: Mining closed sequential patterns in large databases. (In: Proc. of SDM'03)
7. Dong, G., Pei, J.: Sequence Data Mining. Springer (2007)
8. Ng, R., Lakshmanan, L., Han, J., Pang, A.: Exploratory mining and pruning optimizations of constrained associations rules. (In: Proc. of SIGMOD'98) 13-24
9. Plantevit, M., Crémilleux, B.: Condensed representation of sequential patterns [...]