What About Sequential Data Mining Techniques to Identify Linguistic Patterns for Stylistics?


HAL Id: hal-00675578

https://hal.archives-ouvertes.fr/hal-00675578

Submitted on 1 Mar 2012


Solen Quiniou, Peggy Cellier, Thierry Charnois, Dominique Legallois

To cite this version:

Solen Quiniou, Peggy Cellier, Thierry Charnois, Dominique Legallois. What About Sequential Data Mining Techniques to Identify Linguistic Patterns for Stylistics?. CICLing 2012: Computational Linguistics and Intelligent Text Processing, Mar 2012, New Delhi, India. pp.166-177, 10.1007/978-3-642-28604-9_14. hal-00675578

Solen Quiniou (1,2), Peggy Cellier (3), Thierry Charnois (1), and Dominique Legallois (2)

1 GREYC, Université de Caen Basse-Normandie, Campus 2, 14000 Caen
2 CRISCO, Université de Caen Basse-Normandie, Campus 1, 14000 Caen
3 IRISA-INSA de Rennes, Campus de Beaulieu, 35042 Rennes Cedex

Abstract. In this paper, we study the use of data mining techniques for stylistic analysis, from a linguistic point of view, by considering emerging sequential patterns. First, we show that mining sequential patterns of words with gap constraints gives new relevant linguistic patterns with respect to patterns built on $n$-grams. Then, we investigate how sequential patterns of itemsets can provide more generic linguistic patterns. We validate our approach from a linguistic point of view by conducting experiments on three corpora of various types of French texts (Poetry, Letters, and Fiction). By considering more particularly poetic texts, we show that characteristic linguistic patterns can be identified using data mining techniques. We also discuss how to improve our proposed approach so that it can be used more efficiently for linguistic analyses.

1 Introduction

The study of phraseology, including stylistics, is a research field that has been investigated over the past 30 years by the linguistic community. More recently, there has been a particular interest in studies from corpus linguistics. Two main approaches can be identified: corpus-based and corpus-driven. Corpus-based approaches assume the existence of linguistic theories and use corpora to analyze their application and hence to validate them. Corpus-driven approaches consider that linguistic constructs emerge from corpus analysis. This analysis allows the discovery of co-occurring word patterns that will be the basis of linguistic analyses. Our work is part of the corpus-driven approaches since our goal is to assist linguists in discovering new linguistic constructs without any prior knowledge.

One of the first corpus-driven approaches was proposed by Renouf and Sinclair [1]. It consists of a study of collocational frameworks based on corpora; collocational frameworks represent discontinuous sequences of two grammatical words enclosing a lexical word (e.g., many + ? + of, which means many followed by a variable lexeme, symbolized by ?, itself followed by of). However, this approach is not entirely corpus-driven since the studied collocational frameworks were pre-selected by Renouf and Sinclair. In fact, most of the so-called corpus-driven approaches are partly corpus-based. More recently, Biber presented an [...] corpora [2]. To do so, he relies both on a preliminary work on the identification of lexical bundles (i.e., frequent sequences of contiguous words, aka $n$-grams) and on collocational frameworks to identify fixed and variable elements in the patterns he extracted. Furthermore, Biber considers two language registers (conversation and academic writing) and shows the interest of using a corpus-driven approach to study the specificities of patterns appearing in each register.

In this paper, we present a first and original study which aims at showing the interest of data mining methods for the stylistic analysis of large texts. The goal is to provide linguists with prominent, relevant, and understandable patterns which can be characteristic of a specific type of text, so that these experts can carry out a stylistic analysis based on these patterns. In fact, our work is in the continuity of Biber's but we consider various text types (instead of language registers) that we study from a stylistic point of view. To do so, we set up a methodology based on sequential data mining, from the extraction of patterns to the selection of the most relevant ones. We apply this methodology to stylistics. To the best of our knowledge, data mining methods have not yet been used in the field of stylistics, whereas one of their advantages is to offer an interpretable result to users, as opposed to numerical methods such as Hidden Markov Models or Conditional Random Fields. Indeed, the latter methods have been shown to achieve good results for tasks like text categorization or information extraction, but they produce outputs that are hard for humans to understand.

Thus, the approach that we propose is based on frequent sequential patterns [3], a well-known data mining technique to automatically discover frequent patterns based on the sequential order of data. We consider two types of sequential patterns: single-item patterns (an item represents a single piece of information, e.g. a word form); and itemset patterns. In this second type of patterns, a word is represented by a set of features. Therefore, extracted itemset patterns may combine different levels of abstraction (word forms, lemmas, POS tags, etc.): for instance, $\langle (PREP)\ (DET)\ (NC) \rangle$ or $\langle (to)\ (the\ DET)\ (NC) \rangle$. Furthermore, as we set our study in the field of stylistics, the end-goal is to extract patterns that are characteristic of a certain type of text. This is the reason why we focus on a specific type of sequential patterns: emerging patterns. Emerging patterns can capture contrast characteristics between classes or datasets [4]. Furthermore, these patterns can be analyzed by experts to discover new relationships in a given domain for a better understanding of it. Here, extracted emerging patterns could then be analyzed by linguists to discover linguistic patterns, characteristic of a certain type of text.

The rest of this paper is organized as follows. First, our methodology based on sequential data mining is introduced in Section 2. Then, Section 3 presents experimental results on the use of our methodology for stylistics, both from a quantitative and a linguistic point of view. Finally, Section 4 discusses the leads to further investigate, while Section 5 draws some conclusions.

[Figure: tagged corpus 1 ... tagged corpus N, computation of the patterns, pattern set 1 ... pattern set N, computation of the emerging patterns, emerging pattern set 1 ... emerging pattern set N, linguistic interpretation]

Fig. 1. Overview of our proposed approach

2 Methodology

In this section, we give an overview of the proposed approach to identify characteristic linguistic patterns for each type of text (Section 2.1). Then, we present the sequential data mining techniques on which our approach is based: frequent sequential patterns (Section 2.2) and emerging patterns (Section 2.3).

2.1 Overview of the Proposed Approach

Figure 1 illustrates the various steps of our approach. $N$ corpora are used as the inputs of the process, with one corpus corresponding to a considered type of text. Each corpus is first pre-processed and then all its words are labeled with their lemma and their POS category (see Section 3.1). In the first step of our approach, sequential patterns are extracted for each corpus: $N$ sets of sequential patterns are therefore obtained. Then, in the second step, sets of emerging patterns are selected for each corpus, from the $N$ sets of sequential patterns previously extracted. Lastly, the $N$ sets of emerging patterns are given to a linguist so that he can use them to perform a linguistic interpretation. The first and second steps are presented in greater detail in the next sub-sections.

2.2 Sequential Pattern Mining

Sequential pattern mining is a well-known data mining technique used to find regularities in sequence databases, by considering the temporal order of the data. This technique was introduced by Agrawal et al. in [3].

An itemset, $I$, is defined as a set of literals called items, denoted by $I = (i_1 \ldots i_n)$. For example, $(a\ b)$ is an itemset with two items: $a$ and $b$. A sequence, $S$, is defined as an ordered list of itemsets, denoted by $S = \langle I_1 \ldots I_m \rangle$. For instance, $\langle (a\ b)(c)(d)(a) \rangle$ is a sequence of four itemsets. It should be noted that a lot of applications need only one item in their itemsets (e.g. DNA strings or protein sequences). These particular kinds of sequences are called single-item sequences; for the sake of clarity, they are denoted by $S = \langle i_1 \ldots i_n \rangle$, where $i_1 \ldots i_n$ are items. Several algorithms have been developed to efficiently mine that specific kind of sequences, for example [5]. In the rest of the paper, both kinds of sequences will be considered, i.e. single-item sequences and itemset sequences.

A sequence $S_1 = \langle I_1 \ldots I_n \rangle$ is included in a sequence $S_2 = \langle I'_1 \ldots I'_m \rangle$ if there exist integers $1 \le j_1 < \ldots < j_n \le m$ such that $I_1 \subseteq I'_{j_1}, \ldots, I_n \subseteq I'_{j_n}$. The sequence $S_1$ is thus called a subsequence of $S_2$, which is noted $S_1 \preceq S_2$. For example, we have the following relation: $\langle (c)(a) \rangle \preceq \langle (a\ b)(c)(d)(a) \rangle$.

Table 1. $SDB_1$: a sequence database

Sequence identifier   Sequence
1                     $\langle (a\ b)(c)(d)(a) \rangle$
2                     $\langle (d)(a)(e) \rangle$
3                     $\langle (d)(a\ b\ e)(c\ d\ e) \rangle$
4                     $\langle (c)(a) \rangle$

A sequence database $SDB$ is a set of tuples $(sid, S)$, where $sid$ is a sequence identifier and $S$ a sequence. For instance, Table 1 represents a sequence database of four sequences. A tuple $(sid, S)$ contains a sequence $S_1$ if $S_1 \preceq S$. The support of a sequence $S_1$ in a sequence database $SDB$, denoted $sup(S_1)$, is the number of tuples containing $S_1$ in the database. For example, in Table 1, $sup(\langle (a)(e) \rangle) = 2$ since sequences 2 and 3 contain an itemset with $a$ followed by an itemset with $e$. The relative support of sequences may also be used, as defined by Equation 1:

$$ sup(S_1) = \frac{|\{(sid, S) \mid (sid, S) \in SDB \wedge (S_1 \preceq S)\}|}{|SDB|} \qquad (1) $$

A frequent pattern is a sequence such that its support is greater than or equal to a given threshold: $minsup$. Sequential pattern mining algorithms thus extract all the frequent sequential patterns that appear in a sequence database.
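The inclusion and support definitions can be checked against Table 1 with a few lines of Python. This is only a minimal sketch, not the mining algorithm used in the paper: the helper names (`seq`, `is_subsequence`, `support`, `relative_support`) are illustrative, and modelling an itemset as a frozenset of strings is an assumption.

```python
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[str]
Sequence = List[Itemset]


def seq(*itemsets: str) -> Sequence:
    """Build a sequence from space-separated itemsets, e.g. seq("a b", "c")."""
    return [frozenset(items.split()) for items in itemsets]


def is_subsequence(s1: Sequence, s2: Sequence) -> bool:
    """Return True if s1 is included in s2 (order-preserving itemset inclusion)."""
    j = 0
    for itemset in s1:
        # Move to the leftmost remaining itemset of s2 that contains `itemset`.
        while j < len(s2) and not itemset <= s2[j]:
            j += 1
        if j == len(s2):
            return False
        j += 1
    return True


def support(pattern: Sequence, sdb: Dict[int, Sequence]) -> int:
    """Absolute support: number of sequences of the database containing the pattern."""
    return sum(is_subsequence(pattern, s) for s in sdb.values())


def relative_support(pattern: Sequence, sdb: Dict[int, Sequence]) -> float:
    """Relative support, as in Equation 1."""
    return support(pattern, sdb) / len(sdb)


# The sequence database of Table 1.
SDB1 = {
    1: seq("a b", "c", "d", "a"),
    2: seq("d", "a", "e"),
    3: seq("d", "a b e", "c d e"),
    4: seq("c", "a"),
}
```

The leftmost-match scan is sufficient here because no gap constraint is involved: if any embedding exists, the leftmost one does too.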

Because the set of frequent sequential patterns can be very large, there exists a condensed representation which eliminates redundancies without loss of information: closed sequential patterns [6]. A frequent sequential pattern $S$ is closed if there exists no other frequent sequential pattern $S'$ such that $S \preceq S'$ and $sup(S) = sup(S')$. For instance, with $minsup = 2$, the sequential pattern $\langle (b)(c) \rangle$ from Table 1 is not closed whereas the pattern $\langle (a\ b)(c) \rangle$ is closed.

Moreover, in order to drive the mining process towards the user objectives and to eliminate irrelevant patterns, one can define constraints [7,8]. The most commonly used constraint is the frequency constraint (that assigns a value to $minsup$). Another widespread constraint is the gap constraint. A sequential pattern with a gap constraint $[M, N]$, denoted by $P_{[M,N]}$, is a pattern such that at least $M - 1$ itemsets and at most $N - 1$ itemsets are allowed between every two neighbor itemsets, in the original sequences. For instance, let $P_{[1,3]} = \langle (c)(a) \rangle$ and $P_{[2,3]} = \langle (c)(a) \rangle$ be two patterns with two different gap constraints and let us consider the sequences of Table 1. Sequences 1 and 4 are occurrences of pattern $P_{[1,3]}$ (sequence 1 contains one itemset between $(c)$ and $(a)$ whereas sequence 4 contains no itemset between $(c)$ and $(a)$), but only sequence 1 is an occurrence of $P_{[2,3]}$ (only sequences with one or two itemsets between $(c)$ and $(a)$ are occurrences of this pattern).
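The gap constraint can be illustrated with a small matcher run on the sequences of Table 1. The sketch below is an assumption-laden illustration, not the dmt4 implementation: `occurs_with_gap` is a hypothetical helper that reads $[M, N]$ as "the positions of two neighbor pattern itemsets differ by at least $M$ and at most $N$", which is equivalent to allowing between $M - 1$ and $N - 1$ itemsets in between.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]


def seq(*itemsets: str) -> Sequence:
    """Build a sequence from space-separated itemsets, e.g. seq("a b", "c")."""
    return [frozenset(items.split()) for items in itemsets]


def occurs_with_gap(pattern: Sequence, sequence: Sequence, m: int, n: int) -> bool:
    """True if `pattern` occurs in `sequence` under the gap constraint [m, n] (m >= 1)."""
    if not pattern:
        return True

    def extend(k: int, pos: int) -> bool:
        # pattern[:k] is matched and pattern[k - 1] sits at position `pos`.
        if k == len(pattern):
            return True
        last = min(pos + n, len(sequence) - 1)
        return any(
            pattern[k] <= sequence[nxt] and extend(k + 1, nxt)
            for nxt in range(pos + m, last + 1)
        )

    # Every start position is tried, since a greedy leftmost match can violate the gaps.
    return any(
        pattern[0] <= sequence[start] and extend(1, start)
        for start in range(len(sequence))
    )


# The sequence database of Table 1.
SDB1 = {
    1: seq("a b", "c", "d", "a"),
    2: seq("d", "a", "e"),
    3: seq("d", "a b e", "c d e"),
    4: seq("c", "a"),
}


def occurrences(pattern: Sequence, m: int, n: int) -> List[int]:
    """Identifiers of the Table 1 sequences in which the pattern occurs."""
    return [sid for sid, s in SDB1.items() if occurs_with_gap(pattern, s, m, n)]
```

With the pattern `seq("c", "a")`, `occurrences(..., 1, 3)` selects sequences 1 and 4, while `occurrences(..., 2, 3)` keeps only sequence 1, as in the example of the text.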

In this paper, the considered databases correspond to corpora. Furthermore, two kinds of sequential patterns are considered: single-item patterns and itemset patterns. In that last case, itemsets can be made up of three types of items: word forms, lemmas, and POS tags.

2.3 Emerging Patterns

Emerging patterns are defined as sequential patterns whose support increases significantly from one dataset to another one [4]. More specifically, emerging patterns are sequential patterns whose growth rate (the ratio of the supports in the two datasets) is larger than a given threshold: $\rho$. Thus, a sequential pattern $P$ from a dataset $D_1$ is an emerging pattern to another dataset $D_2$ if $GrowthRate(P) \ge \rho$, with $\rho > 1$ and with $GrowthRate(P)$ being defined by:

$$ GrowthRate(P) = \begin{cases} \infty, & \text{if } sup_{D_2}(P) = 0 \\ \dfrac{sup_{D_1}(P)}{sup_{D_2}(P)}, & \text{otherwise} \end{cases} \qquad (2) $$

with $sup_{D_1}(P)$ ($sup_{D_2}(P)$ respectively) being the relative support of the pattern $P$ in $D_1$ ($D_2$ respectively). Since we are only interested in patterns belonging to $D_1$, we do not consider patterns $P$ with $sup_{D_1}(P) = 0$.

In the case of stylistic analyses, each dataset contains the frequent sequential patterns of a corpus and thus of the corresponding type of text. It corresponds to the patterns extracted during the first step of our approach (see Section 2.2). Because we consider more than two types of text, we compute the emerging patterns of a considered type of text with respect to every other type, according to Equation 2. Finally, only the patterns that are emerging to every other type of text are kept as emerging patterns for a considered type of text. The computation of all the emerging patterns is done efficiently based on [9].
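The selection rule (emerging with respect to every other type of text) can be sketched as follows. This is a naive illustration and not the efficient computation of [9]: the function names, the dictionary layout, and the toy supports are all invented for the example, and a pattern missing from another corpus's frequent set is assumed to have support 0 there.

```python
import math
from typing import Dict, Set


def growth_rate(sup_d1: float, sup_d2: float) -> float:
    """Growth rate of Equation 2, computed from two relative supports."""
    if sup_d1 <= 0:
        raise ValueError("patterns with zero support in D1 are not considered")
    return math.inf if sup_d2 == 0 else sup_d1 / sup_d2


def emerging_patterns(
    target: str,
    supports: Dict[str, Dict[str, float]],
    rho: float = 1.001,
) -> Set[str]:
    """Patterns of `target` whose growth rate is >= rho w.r.t. every other corpus.

    `supports` maps a corpus name to {pattern: relative support}.
    """
    others = [name for name in supports if name != target]
    return {
        pattern
        for pattern, sup in supports[target].items()
        if all(
            growth_rate(sup, supports[other].get(pattern, 0.0)) >= rho
            for other in others
        )
    }


# Toy relative supports, invented for illustration only.
TOY_SUPPORTS = {
    "Poetry": {"p1": 0.004, "p2": 0.002, "p3": 0.003},
    "Letters": {"p1": 0.001, "p2": 0.002},
    "Fiction": {"p1": 0.002, "p2": 0.001},
}
```

In this toy data, `p2` is rejected for Poetry because its growth rate with respect to Letters is exactly 1, while `p3` has an infinite growth rate with respect to both other corpora.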

3 Experimental Evaluation

In this section, we report the results of our experimental evaluation on using sequential pattern mining techniques for stylistics. First, in Section 3.1, we describe the corpora used as well as the setup of the various parameters used to extract emerging sequential patterns. Then, we present an analysis of the extracted sequential patterns, at two levels: from a quantitative point of view (in Section 3.2), and from a linguistic point of view for stylistics (in Section 3.3).

3.1 Experimental Setup

Corpora We created three corpora, corresponding to various types of text: Poetry, Letters, and Fiction. To build each corpus, we selected all the texts of the 1800-1900 era (provided by the French resources of the CNRTL) corresponding to the considered type of text. For example, authors from Poetry include Lamartine and Musset, whereas Hugo and Lamennais are part of the authors of Letters, and Chateaubriand and Zola are authors of Fiction. Then, these three corpora were pre-processed. The pre-processing steps consisted in setting the words in lower-case, and then splitting the texts into sequences at punctuation marks of the set: {'.', '?', '!', '...', ';', ':', ','}. Table 2 gives some details on each corpus: the number of authors, of works, of sequences, and of words.

Table 2. Details on each corpus

Corpus    #authors  #works  #sequences  #words
Poetry    27        48      151116      1167422
Letters   5         9       234997      1562543
Fiction   37        52      663860      5105240
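The pre-processing described above (lower-casing, then splitting at the listed punctuation marks) could look like the following sketch. The function name `to_sequences` and the whitespace tokenization are assumptions; the paper does not say how words were tokenized, and apostrophes are simply left inside tokens here.

```python
import re
from typing import List

# '...' is listed first so that an ellipsis is consumed as a single separator.
_SEPARATORS = re.compile(r"\.\.\.|[.?!;:,]")


def to_sequences(text: str) -> List[List[str]]:
    """Lower-case a text and split it into word sequences at punctuation marks."""
    sequences = []
    for chunk in _SEPARATORS.split(text.lower()):
        words = chunk.split()  # assumed tokenization: whitespace only
        if words:
            sequences.append(words)
    return sequences
```

For instance, `to_sequences("Une rose qu'on respire, et qu'on jette... Le vent!")` yields three sequences, one per punctuation-delimited segment.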

After being pre-processed, the corpora were POS tagged using Cordial, a tagger that is known to outperform TreeTagger on French texts. Thus, each word of the corpora was associated with its form, its lemma and its POS tag. After first experiments, it turned out that the POS tags given by Cordial were too specific; we thus post-processed them to reduce their number (as a consequence, it reduces the number of extracted patterns). First, too specific categories were merged into more general ones. For example, the adjective category was initially decomposed into 16 categories (depending on the gender, the number, or whether the word starts with a mute h letter). Thus, the following categories were created, to replace their corresponding sub-categories: adjectives (ADJ), determiners (DET), common nouns (NC), proper nouns (NP), demonstrative pronouns (PD), relative pronouns (PR), indefinite pronouns (PI), and past participles (VPARP). Then, categories corresponding to personal pronouns were decomposed into 2 tags: one for the personal pronoun (PPER), and one for the person (e.g. 1S for the singular first person). Moreover, categories corresponding to verbs were decomposed into 3 tags: one for the verb (V), one for the mood of the verb (e.g. INDP for the present of the indicative mood), and one for the person (the same ones as for the personal pronouns). At the end, we had a set of 35 tags instead of the 133 initial tags. Using this new set of tags, the phrase a rose that we smell is translated as $\langle$(a a DET) (rose rose NC) (that that PR) (we we PPER 1P) (smell smell V PRES 1P)$\rangle$.
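To make the itemset representation concrete, the tagged phrase from the example above can be encoded as a sequence of itemsets. This is a hypothetical encoding: the `form:`, `lemma:`, and `pos:` prefixes are an assumption introduced here so that a word form and an identical lemma remain distinct items inside a set; the paper does not describe its internal encoding.

```python
from typing import FrozenSet, List, Tuple

# (word form, lemma, POS tags) for one word; tags follow the reduced tag set.
TaggedWord = Tuple[str, str, Tuple[str, ...]]


def to_itemset_sequence(tagged: List[TaggedWord]) -> List[FrozenSet[str]]:
    """Turn tagged words into one itemset per word, with typed item prefixes."""
    return [
        frozenset(
            {f"form:{form}", f"lemma:{lemma}", *(f"pos:{tag}" for tag in tags)}
        )
        for form, lemma, tags in tagged
    ]


# The phrase "a rose that we smell", tagged as in the text.
PHRASE = [
    ("a", "a", ("DET",)),
    ("rose", "rose", ("NC",)),
    ("that", "that", ("PR",)),
    ("we", "we", ("PPER", "1P")),
    ("smell", "smell", ("V", "PRES", "1P")),
]

ITEMSETS = to_itemset_sequence(PHRASE)
```

Each word becomes one itemset, so the phrase yields a sequence of five itemsets whose items mix the three levels of abstraction.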

Mining Single-Item Sequences First, we considered single-item sequences of words. To perform the mining task on the three corpora, we used dmt4 [5] that allows the definition of various constraints on the extracted single-item sequential patterns: the length, the frequency (by setting $minsup$, the support threshold), or the gaps (by choosing the values of $[M, N]$). We set the length of the patterns to be between 2 and 20. We chose the value of $minsup$ empirically as a trade-off between having interesting patterns with a low support (thus setting a low value to $minsup$) and having not too many patterns (thus setting a high value to $minsup$). Because of the differences in the corpora sizes (Fiction is five times bigger than Poetry), we chose a relative threshold whose value is 0.001%; it corresponds to the following absolute thresholds: 16 for Poetry, 12 for Letters, and 51 for Fiction. That means that only patterns appearing in at least 16 sequences are kept for Poetry, for example. For the gap constraints, we chose to consider different values in the following experiments (see Section 3.2): $[1, 1]$, $[1, 2]$, $[1, 3]$, and $[1, 5]$. It is worth noting that the $[1, 1]$ gap constraint corresponds to considering $n$-gram patterns. Indeed, patterns extracted under this constraint correspond to sub-sequences of consecutive words of the corpus.

Table 3. Number of extracted patterns (ratio of emerging patterns in parentheses)

          Single-item patterns with gaps                                      Itemset
Corpus    [1, 1]          [1, 2]          [1, 3]          [1, 5]          patterns
Poetry    18816 (30.7%)   37933 (27.0%)   55762 (24.3%)   86901 (22.6%)   2245326 (11.4%)
Letters   16936 (50.2%)   36849 (50.7%)   56755 (50.4%)   96549 (50.0%)   10128288 (57.4%)
Fiction   78210 (6.1%)    175645 (5.3%)   282967 (4.9%)   512647 (4.6%)   11681913 (71.2%)
Total     113962 (16.7%)  250427 (15.3%)  395484 (14.2%)  696097 (13.2%)  24055527 (59.8%)

Mining Itemset Sequences Finally, we considered itemset sequences, where each itemset represents a word with its form, its lemma, and its POS tag. To mine these itemset sequences, we chose CloSpan [6] that extracts closed sequential itemset patterns. CloSpan allows only one constraint to be set: the support threshold $minsup$. We also chose empirically the value of $minsup$ to be 0.15%. Note that, because no gap constraint can be set in CloSpan, we had to choose a higher value for $minsup$ to limit the total number of patterns that are generated and hence to limit the computation time. The drawback of that choice is that interesting patterns may not be extracted because their support may be too low (for example, the absolute support threshold is 1000 for Fiction).

Selecting Emerging Patterns To select the emerging patterns of the corpora, we set the threshold $\rho$ just above 1: $\rho = 1.001$. This threshold is used on both single-item patterns and itemset patterns.

3.2 Quantitative Analysis of the Patterns

In this sub-section, we present quantitative results on the single-item patterns and on the itemset patterns. The set of extracted patterns being large, this quantitative analysis allows us to select the patterns that will be actually analyzed from a linguistic point of view, for the stylistic task (see Section 3.3).

Table 3 gives the number of extracted patterns for the three corpora, by considering the two types of patterns: single-item patterns (with various gap constraints) and itemset patterns. The ratio of emerging patterns is also given for each type of patterns. Thus, among the 18816 patterns extracted from Poetry (by setting the gap constraint to $[1, 1]$), 30.7% of the patterns are emerging [...] patterns allows a large reduction of the total number of sequential patterns to analyze. Moreover, it allows us to focus our attention on more interesting patterns in the context of stylistics. That is why we will only consider emerging patterns in the rest of the analyses. Furthermore, we can see that by increasing the gap constraint, the rate of single-item emerging patterns tends to decrease: it means that additional extracted patterns tend to be non-specific patterns of the studied types of text. For the stylistic analysis presented in Section 3.3, we set the gap constraint to $[1, 3]$ as a trade-off between the total number of extracted single-item patterns and their relevance. Finally, we can see that many more itemset patterns are extracted, compared to the number of extracted single-item patterns.

Fig. 2. Distribution of the emerging patterns w.r.t. the length (x-axis: length of the sequential patterns; y-axis: percentage of sequential patterns (%); curves: single-item patterns with gap [1,1], [1,2], [1,3], [1,5], and itemset patterns)

Fig. 3. Distribution of the emerging patterns w.r.t. the growth rate (x-axis: growth rate of the sequential patterns; y-axis: aggregate percentage of sequential patterns (%); curves: single-item patterns with gap [1,1], [1,2], [1,3], [1,5], and itemset patterns)

Then, we study the distribution of the emerging patterns w.r.t. their length. Figure 2 plots the relative number of patterns for the various pattern lengths, for the single-item patterns (the length is given as the number of items) and for the itemset patterns (the length is given as the number of itemsets). The pattern distributions are computed on the three corpora considered as a whole, for each gap constraint value considered. We can see that most of the single-item [...] linguistic patterns. Moreover, there are a lot of single-item patterns of length 2 but they are not as instructive as longer patterns, from a linguistic point of view. That is why we will only consider patterns whose length is greater than 2, for the stylistic analysis.

Finally, we study the distribution of the emerging patterns w.r.t. growth rates. Figure 3 plots the aggregate relative number of emerging patterns as a function of the growth rate, by considering the three corpora as a whole. It means that, for example, 67.1% of the emerging itemset patterns have a growth rate greater than 4. We can see that most of the emerging patterns have an infinite growth rate as the aggregate rate of emerging patterns is stable for growth rates greater than 10. It means that most of the emerging patterns appear only in a certain type of text (and not at all in the other types of text). In the stylistic analysis, we consider only emerging patterns with an infinite growth rate.

Lastly, only itemsets containing both POS tags and word forms or lemmas are considered during the stylistic analysis. Patterns containing only POS tags are therefore removed as they are too general and patterns containing only word forms or lemmas are also removed as they are too specific. In fact, most of the itemset patterns contain both POS tags and words since these patterns represent 93.5% of all the itemset patterns. That concurs with Biber's conclusions [2] on the extracted patterns that contain both variable and fixed elements (patterns only with POS tags thus contain only variable elements whereas patterns only with words contain only fixed elements).
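The selection criteria stated in this sub-section (length greater than 2, infinite growth rate, and a mix of POS tags and words) can be combined into a single filter. The sketch below is one possible reading, not the authors' tooling: `Pattern`, `is_mixed`, and `keep_for_stylistics` are hypothetical names, applying all three criteria to every pattern is an assumption, and `POS_TAGS` only lists the category tags named in Section 3.1 (person and mood tags are left out for brevity).

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Category tags quoted in Section 3.1; the full reduced tag set has 35 tags.
POS_TAGS = frozenset(
    {"ADJ", "DET", "NC", "NP", "PD", "PR", "PI", "VPARP", "PPER", "V"}
)


@dataclass(frozen=True)
class Pattern:
    itemsets: Tuple[FrozenSet[str], ...]
    growth_rate: float


def is_mixed(pattern: Pattern) -> bool:
    """True if the pattern contains at least one POS tag and at least one word."""
    items = set().union(*pattern.itemsets)
    return bool(items & POS_TAGS) and bool(items - POS_TAGS)


def keep_for_stylistics(pattern: Pattern) -> bool:
    """Apply the three selection criteria of the quantitative analysis."""
    return (
        len(pattern.itemsets) > 2
        and math.isinf(pattern.growth_rate)
        and is_mixed(pattern)
    )


def fs(*items: str) -> FrozenSet[str]:
    """Shorthand for building an itemset."""
    return frozenset(items)


# Invented example patterns, for illustration only.
MIXED = Pattern((fs("the", "DET"), fs("NC"), fs("that", "PR"), fs("V")), math.inf)
POS_ONLY = Pattern((fs("DET"), fs("NC"), fs("V")), math.inf)
WORDS_ONLY = Pattern((fs("the"), fs("night"), fs("that")), math.inf)
FINITE = Pattern((fs("the", "DET"), fs("NC"), fs("V")), 4.0)
SHORT = Pattern((fs("the", "DET"), fs("NC")), math.inf)
```

Only `MIXED` passes the filter; each of the other four patterns violates exactly one criterion.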

3.3 Stylistic Analysis of the Emerging Patterns

In this sub-section, we present a stylistic analysis of some extracted emerging patterns. We focus our attention more particularly on the Poetry corpus.

First of all, we consider single-item patterns. By studying them, we can find some interesting patterns, characteristic of Poetry. Table 4 shows examples of such identified characteristic patterns. In the patterns, the symbol * is used to represent a gap of one or more words (see footnote 5). Furthermore, we also illustrate each pattern with examples of underlying sequences in Poetry. The extracted patterns allow the observation of schematic grammatical structures that are relatively lexicon-independent. Indeed, fixed elements of these patterns are grammatical words whereas variable elements (i.e., filling the gaps) are generally lexical words (e.g., nouns, verbs, or adjectives). We also show the interest of gap constraints that are given as intervals. The pattern some * more * than allows the identification of two sequences, among others, where the first gap is filled with a different number of words (see Table 4): in the first one, the word bites fills the gap whereas it is filled with the words angular rocks in the second sequence. This illustrates the generalization capability of single-item patterns with gap constraints (w.r.t. $n$-gram patterns, for instance).

5 Symbol * corresponds to symbol ? used in [1]. Note that symbol * is also used in [2] but it represents a single variable lexeme whereas, in our approach, this symbol [...]

Table 4. Examples of single-item patterns characteristic of Poetry

Single-item pattern: des * plus * que (some * more * than)
  Example: il a des morsures plus venimeuses que celles de ta bouche
           (he has some bites more venomous than those from your mouth)
  Example: des cailloux anguleux plus brillants que des marbres
           (some angular rocks brighter than some marbles)

Single-item pattern: on * et * on (we * and * we)
  Example: une rose qu'on respire et qu'on jette
           (a rose that we smell and that we throw)
  Example: sur des tombeaux divins qu'on brise et qu'on insulte ?
           (on divine tombs that we break and that we insult?)

Single-item pattern: le/la/l' * qui * et * qui (the * that * and * that)
  Example: la nuit qui m'oppresse et qui trouble mes yeux
           (the night that oppresses me and that troubles my eyes)
  Example: le grelot qui résonne et le troupeau qui bêle
           (the bell that resounds and the flock that bleats)

Single-item pattern: le * du * qui * dans (the * of the * that * in)
  Example: le vent du soir qui meurt dans le feuillage
           (the wind of the night that dies in the foliage)
  Example: le bruit du vieux qui bêche dans la nuit
           (the sound of the old that digs in the night)

Single-item pattern: est * un * qui (is * a * that)
  Example: est-ce un goéland qui bat de l'aile ?
           (is it a gull that flaps its wing?)
  Example: ta grâce est comme un luth qui vibre au fond du bois
           (your grace is like a lute that vibrates deep in the wood)

Table 5 gives the correspondence between the single-item patterns presented in Table 4 and their associated itemset patterns. First, it can be seen that several itemset patterns may correspond to the same single-item pattern. Furthermore, extracted itemset patterns make it possible to obtain the POS categories of the variable elements. Therefore, in the context of a stylistic study of types of text, the work of linguists consists in selecting relevant patterns among automatically extracted itemset patterns: this directly gives them grammatical patterns characteristic of a considered type of text.

Table 5. Correspondence between single-item patterns and grammatical patterns

Single-item pattern         (English translation)         Grammatical pattern
des * plus * que            (some * more * than)          some N more ADJ than
on * et * on                (we * and * we)               N that we V and that we V
le/la/l' * qui * et * qui   (the * that * and)            the N that V and (that) V
                                                          the N that V and the N that V
le * du * qui * dans        (the * of the * that * in)    the N of the N that V in the N
est * un * qui              (is * a * that)               is it a N that V
                                                          is like a N that V

In fact, the grammatical patterns we consider correspond to collocational frameworks in the sense of Renouf and Sinclair [1], i.e. collocations on grammatical units and not on lexical units. However, as opposed to their work, we do not choose a priori the patterns that are then studied but we automatically discover them from corpora. We can also compare our work to Biber's [2], who also works on collocational frameworks, but there are some differences. Indeed, our approach makes it possible to directly extract single-item patterns with gaps as well as itemset patterns (corresponding to grammatical patterns) whereas Biber first extracts frequent sequences from corpora and then analyzes them one by one to identify variable and fixed elements to finally build various types of patterns that he studies afterwards. Since Renouf and Sinclair's paper, works on collocational frameworks have been done in English corpus linguistics, but not [...] insights when associated to an actual usage theory considering that grammatical forms come from a linguistic usage (i.e. corpus-driven approaches) and are not the result of integrated rules (i.e. corpus-based approaches). Therefore, it is interesting to have approaches that automatically extract patterns to provide these collocational frameworks, as is the case with our proposed approach.

4 Discussion

In the previous section, we have shown that sequential patterns can be interpreted by linguists for stylistic analyses. However, a huge number of sequential patterns are extracted with data mining techniques, from which the interesting ones have to be identified. In this section, we discuss the improvements that could be brought to our current approach to make it easier for linguists to deal with the presented sequential patterns. To this end, we identified two leads.

First, in order to focus our attention on the interesting sequential patterns, it is necessary to be able to set new constraints during the data mining process to narrow the number of extracted patterns down. Thus, it would be interesting to also set gap constraints on itemset patterns (as it is already the case for single-item patterns). In addition, as we can set a minimum threshold, $minsup$, for the pattern supports, it would be interesting to set a maximum threshold, $maxsup$, for the pattern supports as well. Indeed, most interesting sequential patterns generally appear in few sequences. Thus, by not considering too frequent sequential patterns, the total number of patterns would be reduced (for instance, by setting $maxsup = 50$, 21.2% of the Poetry single-item patterns would not be extracted). Moreover, it would allow us to set $minsup$ to a lower value, and hence to discover rarer sequential patterns without increasing the total number of patterns. In addition, membership constraints on a certain item type could also be defined to filter out more sequential patterns (e.g. only considering sequential patterns containing at least one verb).
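The constraints proposed in this paragraph can be pictured as a post-filter over already extracted patterns. The sketch below is purely illustrative: the paper proposes pushing such constraints into the mining process itself, which is not shown here, and the function name, the data layout, and the toy supports are all assumptions.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Pattern = Tuple[FrozenSet[str], ...]


def filter_patterns(
    supports: Dict[Pattern, int],
    minsup: int,
    maxsup: int,
    required_item: Optional[str] = None,
) -> Dict[Pattern, int]:
    """Keep patterns with minsup <= support <= maxsup and, optionally,
    at least one itemset containing `required_item` (membership constraint)."""
    kept = {}
    for pattern, sup in supports.items():
        if not minsup <= sup <= maxsup:
            continue
        if required_item is not None and not any(
            required_item in itemset for itemset in pattern
        ):
            continue
        kept[pattern] = sup
    return kept


def fs(*items: str) -> FrozenSet[str]:
    """Shorthand for building an itemset."""
    return frozenset(items)


# Toy absolute supports, invented for illustration only.
TOY = {
    (fs("the", "DET"), fs("NC"), fs("V")): 30,
    (fs("the", "DET"), fs("NC"), fs("ADJ")): 30,
    (fs("of", "PREP"), fs("the", "DET"), fs("NC")): 80,
    (fs("a", "DET"), fs("NC"), fs("V")): 10,
}

SELECTED = filter_patterns(TOY, minsup=16, maxsup=50, required_item="V")
```

With these toy values, only the first pattern survives: the second contains no verb, the third exceeds `maxsup`, and the fourth falls below `minsup`.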

Lastly, it would be of interest to provide tools allowing the ordering of the patterns, their filtering, or their exploration jointly with the sequences of the corpus they refer to. Therefore, it would be easier for linguists, in particular, to analyze the extracted sequential patterns (more particularly for itemset patterns [...]

5 Conclusion

In this paper, we have presented a first study on using data mining techniques for stylistics by proposing a methodology based on the extraction of sequential patterns and applied to stylistics. Thus, we have considered two types of sequential patterns: single-item patterns and itemset patterns (based on word forms, lemmas and POS tags). Moreover, we focused our attention on specific sequential patterns: emerging patterns. A quantitative analysis of the sequential patterns extracted from three corpora (representing various types of text, namely Poetry, Letters, and Fiction) has shown that sequential patterns are more powerful than $n$-grams to express linguistic patterns. That has been confirmed by a linguistic analysis of the extracted emerging sequential patterns since some grammatical patterns characteristic of Poetry were identified from these sequential patterns. We also compared our methodology to the one proposed by Biber [2] by showing that ours makes it possible to directly obtain patterns characteristic of types of text. Lastly, we have discussed the improvements that could be brought to our proposed approach both by limiting the total number of sequential patterns that are extracted and hence have to be analyzed (by defining new constraints on the patterns), and by making it easier for linguists to explore and analyze the patterns (by developing suitable tools for this task). Therefore, these discussions give us leads to investigate in future studies; some of these works are already in progress.

Acknowledgments

This work is partly supported by the French Région Basse-Normandie and by the ANR (National Research Agency) funded project Hybride ANR-11-BS02-002.

References

1. Renouf, A., Sinclair, J.: Collocational Frameworks in English. In: English Corpus Linguistics: Studies in Honour of Jan Svartvik. Longman (1991) 128-143
2. Biber, D.: A corpus-driven approach to formulaic language in English. International Journal of Corpus Linguistics 14 (2009) 275-311
3. Agrawal, R., Srikant, R.: Mining sequential patterns. (In: Proc. of ICDE'95) 3-14
4. Dong, G., Li, J.: Efficient mining of emerging patterns: Discovering trends and differences. (In: Proc. of SIGKDD'99) 43-52
5. Nanni, M., Rigotti, C.: Extracting trees of quantitative serial episodes. (In: Proc. of KDID'07) 170-188
6. Yan, X., Han, J., Afshar, R.: Clospan: Mining closed sequential patterns in large databases. (In: Proc. of SDM'03)
7. Dong, G., Pei, J.: Sequence Data Mining. Springer (2007)
8. Ng, R., Lakshmanan, L., Han, J., Pang, A.: Exploratory mining and pruning optimizations of constrained associations rules. (In: Proc. of SIGMOD'98) 13-24
9. Plantevit, M., Crémilleux, B.: Condensed representation of sequential patterns [...]
