
Towards a mixed approach to extract biomedical terms from text corpus


HAL Id: lirmm-00859846

https://hal-lirmm.ccsd.cnrs.fr/lirmm-00859846v2

Submitted on 7 Jul 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche, Maguelonne Teisseire

To cite this version:

Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche, Maguelonne Teisseire. Towards a mixed approach to extract biomedical terms from text corpus. International journal of Knowledge Discovery in Bioinformatics, IGI Global, 2014, 4 (1), pp.1-15. ⟨10.4018/ijkdb.2014010101⟩. ⟨lirmm-00859846v2⟩

International Journal of Knowledge Discovery in Bioinformatics, 4(1), 1-15, January-March 2014

Juan Antonio Lossio-Ventura, LIRMM, University Montpellier 2, Montpellier, France & CNRS, Paris, France
Clement Jonquet, LIRMM, University Montpellier 2, Montpellier, France & CNRS, Paris, France
Mathieu Roche, UMR TETIS, Cirad, Irstea, AgroParisTech, Montpellier, France
Maguelonne Teisseire, UMR TETIS, Cirad, Irstea, AgroParisTech, Montpellier, France

ABSTRACT

The objective of this paper is to present a methodology to automatically extract and rank biomedical terms from free text. The authors present new extraction methods that take into account linguistic patterns specialized for the biomedical domain, statistical term extraction measures such as C-value, and statistical keyword extraction measures such as Okapi BM25 and TFIDF. These measures are combined in order to improve the extraction process, and the authors investigate which combinations are the most relevant in different contexts. Experimental results show that an appropriate harmonic mean of C-value and keyword extraction measures offers better precision, both for single-word and multi-word term extraction. The experiments describe the extraction of English and French biomedical terms from a corpus of laboratory tests available online. The results are validated using UMLS (in English) and only MeSH (in French) as reference dictionaries.

Keywords: Biomedical Natural Language Processing (BioNLP), Biomedical Term Extraction, Biomedical Terminologies and Ontologies, Biomedical Thesaurus, Statistic Measure, Text Mining

INTRODUCTION

The huge amount of data available online today is often composed of plain text fields, for instance clinical trial descriptions, adverse event reports or electronic health records. These texts often contain the real language (expressions and terms) used by the community. Although in the biomedical domain there exist hundreds of terminologies and ontologies to describe such languages (Noy et al.), those terminologies often miss concepts or possible alternative terms for those concepts. Our motivation is to improve the precision of the automatic term extraction process, mainly because language evolves faster than our ability to formalize and catalog it. This is even more true for French, in which the number of terms formalized in terminologies is significantly smaller than in English.

NLP (natural language processing) tools and methods make it possible to enrich biomedical dictionaries from texts. Automatic Term Recognition (ATR) is an approach in language technology that involves the extraction of technical terms from domain-specific language corpora (Zhang et al.). In addition, Automatic Keyword Extraction (AKE) is the process of extracting the most relevant words or phrases in a document. Keywords, which we define as a sequence of one or more words, provide a compact representation of a document's content. Two popular AKE measures are Okapi BM25 and TFIDF (also called weighting measures). These two fields are summarized in Table 1.

In our work, we adopt as baseline measures an ATR method, C-value (Frantzi et al.), and the best two AKE methods (Hussey et al.). Indeed, compared to other ATR methods, C-value often gets the best precision results, especially in biomedical studies (Knoth et al.; Zhang et al.; Zhang et al.). Moreover, this measure is defined for multi-word term extraction but can easily be adapted to single-word terms (presented later on), and it has never been applied to French text, which is appealing in our case. Okapi and TFIDF are the best AKE methods (Hussey et al.).

We propose to define new extraction methods by combining ATR and AKE measures in different manners in order to rank the best candidate terms. Our experimental results underline the precision gain obtained with the proposed methods. We give priority to precision in order to focus on the extraction of new valid terms (precision) rather than on missed terms (recall), i.e., on whether a candidate term is a valid biomedical term or not.

The rest of the paper is organized as follows: section "Related Work" describes the state of the art in the field of ATR, and especially the methods based on C-value; section "Proposed Approach" presents our proposal of ranking measures; section "Experiments and Results" details and discusses the conducted experiments and the associated results; and section "Conclusion" concludes the paper.

RELATED WORK

ATR studies can be divided into four main categories: (i) rule-based approaches, (ii) dictionary-based approaches, (iii) statistical approaches, and (iv) hybrid approaches. Rule-based approaches, for instance (Gaizauskas et al.), attempt to recover terms from their formation patterns; the main idea is to build rules that describe naming structures for different classes using orthographic, lexical, or morphosyntactic characteristics. Dictionary-based approaches use existing terminology resources in order to locate term occurrences in texts (Krauthammer et al.). Statistical approaches are often built for extracting general terms (Eck et al.). The most basic measure is frequency. C/NC-value (Frantzi et al.) is another statistical method, well known in the literature, that combines statistical and linguistic information for the extraction of multi-word and nested terms. While most studies address specific types of entities, C/NC-value is a domain-independent method used for extracting terms from biomedical literature (Hliaoutakis et al.). The C/NC-value method has also been applied to many languages besides English (Frantzi et al.), such as Serbian (Nenadić et al.), Slovenian (Vintar), Polish (Kupsc), Chinese (Ji et al.), Spanish (Barrón et al.), and Arabic (Khatib et al.). To the best of our knowledge, it has never been applied to French texts.

The main objective of our work is thus to combine this method with AKE methods and to evaluate them for both English and French. Indeed, we argue that the combination of biomedical term extraction and keyword extraction methods could highlight relevant terms of the biomedical domain.

Table 1. Differences between ATR and AKE

| | Automatic Term Recognition (ATR) | Automatic Keyword Extraction (AKE) |
| Input | one large corpus (i.e., not explicitly separated into documents) | single document within a dataset of documents |
| Output | technical terms of a domain | keywords that describe the document |
| Domain | very specific | none |
| Examples | C-value | TFIDF, Okapi |

PROPOSED APPROACH

This section describes the baseline measures and their customizations, as well as new combinations of these measures for automatic biomedical term extraction. In subsection A, we detail the extensions of the baseline measures. In particular, we improve the C-value method by taking into consideration linguistic patterns specialized for the biomedical domain. In addition, we adapt the statistical measure in order to extract both single-word and multi-word terms. These approaches are applied to both French and English. We also use Okapi BM25 (hereafter Okapi) and TFIDF. Subsection B presents the proposed combinations of the basic measures: (i) computing harmonic mean combinations, and (ii) taking into account the Okapi and TFIDF values within the calculation of C-value.

Our method for automatic term extraction has four main steps (Lossio et al.), described in Figure 1:

(1) Part-of-Speech tagging of the corpus
(2) Candidate term extraction following patterns
(3) Ranking of candidate terms
(4) Computing new combined measures

We execute those four steps by taking either C-value (right branch) or Okapi/TFIDF (left branch) as baseline methods. Notice that, as the input of C-value is a single element whereas the weighting measures deal with many documents (cf. Table 1), we need to merge all documents to build a single textual element. A preliminary step, not represented in Figure 1, is the creation of patterns for French and English, as described hereafter.

Building Biomedical Patterns

We make the following assumption: biomedical terms have similar syntactic structures. Therefore, we build a list of the most common lexical patterns according to the syntactic structure of terms found in biomedical databases: UMLS (Unified Medical Language System) for English and MeSH (Medical Subject Headings) for French.

First, part-of-speech tagging of the biomedical terms is done using TreeTagger. The frequency of syntactic structures is then computed. The top [...] are selected as patterns for each language. The number of terms used to build the list was [...] for English and [...] for French. Examples of patterns, sorted by frequency, are given in Table 2.

Part-of-Speech Tagging (See Part (1) in Figure 1)

Part-of-speech (POS) tagging assigns each word in a text to its grammatical category (e.g., noun, adjective). This process is based on the definition of the word or on the context in which it appears. At this step, as suggested in the C-value method, part-of-speech tagging is applied to the whole corpus. We evaluated three tools (TreeTagger, Stanford Tagger and Brill's rules) and finally chose TreeTagger, which gave better results and is usable for both French and English.

Candidate Terms Extraction Based on Biomedical Patterns (See Part (2) in Figure 1)

Before applying any measure, we filter the content of our input corpus using the patterns previously computed. We select only the terms whose syntactic structure is in the pattern list. Of course, the pattern filtering is language-specific (i.e., when the text is in French, only the French list of patterns is used).

• Union Documents: The C-value method needs a single text document as input. This step merges all texts of the corpus into one document.

Figure 1. Workflow of biomedical term extraction
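The pattern-based filtering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the text has already been POS-tagged (for example by TreeTagger), and the coarse tag names, the `PATTERNS` set, the `SENTENCE` data and the function name `extract_candidates` are all hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Tagged = Tuple[str, str]  # (token, POS tag) pair, assumed to come from an external tagger


def extract_candidates(
    tagged_tokens: Sequence[Tagged],
    patterns: Set[Tuple[str, ...]],
) -> List[str]:
    """Return every token n-gram whose POS-tag sequence appears in `patterns`."""
    lengths = sorted({len(p) for p in patterns})
    candidates = []
    for start in range(len(tagged_tokens)):
        for n in lengths:
            window = tagged_tokens[start:start + n]
            if len(window) < n:
                break  # not enough tokens left for this or any longer pattern
            tags = tuple(tag for _, tag in window)
            if tags in patterns:
                candidates.append(" ".join(tok.lower() for tok, _ in window))
    return candidates


# Hypothetical pattern list and tagged sentence, for illustration only.
PATTERNS = {("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")}
SENTENCE = [
    ("white", "ADJ"), ("blood", "NOUN"), ("cell", "NOUN"),
    ("count", "NOUN"), ("is", "VERB"), ("low", "ADJ"),
]

candidates = extract_candidates(SENTENCE, PATTERNS)
```

Keeping the patterns as a set of tag tuples makes it straightforward to hold one such set per language, which matches the language-specific filtering mentioned in the text.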

Ranking of Candidate Terms (See Part (3) in Figure 1)

1. Ranking Terms with C-value: The C-value method combines linguistic and statistical information (Frantzi et al.). The linguistic information is based on the use of a general regular expression (i.e., linguistic patterns). The statistical information is the value assigned by the C-value measure, based on the term frequency, to compute the termhood (i.e., the association strength of a term to domain concepts). The aim of the C-value method is to improve the extraction of nested terms. It was specifically defined for extracting multi-word terms.

\[
\text{C-value}(a) =
\begin{cases}
w(a) \times f(a), & \text{if } a \notin \text{nested} \\
w(a) \times \left( f(a) - \dfrac{1}{|S_a|} \times \sum_{b \in S_a} f(b) \right), & \text{otherwise}
\end{cases}
\qquad (1)
\]

Where $a$ is the candidate term, $w(a) = \log_2(|a|)$, $|a|$ is the number of words in $a$, $f(a)$ is the frequency of $a$ in the unique document, $S_a$ is the set of terms that contain $a$, and $|S_a|$ is the number of terms in $S_a$. In a nutshell, C-value either uses the frequency of the term if the term is not included in other terms (first line), or decreases this frequency if the term appears in other terms, by using the frequency of those other terms (second line).

We modified the measure in order to extract all terms (single-word and multi-word terms), as suggested in (Barrón et al.) but in a different manner: in place of $w(a) = \log_2(|a|)$ we use $w(a) = \log_2(|a| + 1)$ in order to avoid null values for single-word terms, as illustrated in Table 3. Note that we use neither a stop word list nor a frequency threshold, as was originally proposed. Table 3 shows the proposed changes for the computation of $w(a)$ with the original and modified C-value definitions.

2. Ranking Terms with Okapi and TFIDF: In a nutshell, these measures are used to associate each occurrence of a term with a weight representing its relevance to the meaning of the document it appears in, relative to the corpus it is included in (and also relative to the size of the document in the case of Okapi). The output is a ranked list of terms for each document. They also serve as ranking measures to order documents by their importance given a query (Robertson et al.). Okapi can be seen as an improvement of the TFIDF measure that takes into account the document length. Both measures are mostly used for information retrieval and text mining.

(a) Normalization: The Okapi and TFIDF measures are calculated with a variable number of elements, so the obtained values are heterogeneous. In order to manipulate these result lists, the weights obtained from each document must be normalized over the whole corpus. Therefore, the results of each measure have to be normalized, for instance between 0 and 1.

(b) Merging Lists: Once the values are normalized, we have to merge the terms into a single list in order to evaluate the results. Clearly, the precision will depend on the method used to perform such merging. We merged using three functions: Sum (S), Maximum (M) and Average (A), which calculate respectively the sum, maximum and average of the weights of a term in the whole collection. At the end of this task we obtain three lists from Okapi and three lists from TFIDF. These lists are denoted $Okapi_X(a)$ and $TFIDF_X(a)$, where $a$ is the term and $X \in \{M, S, A\}$ is the factor. For instance, $Okapi_M(a)$ is the list obtained by taking the maximum Okapi value for a term $a$ in the whole corpus.

Table 2. Examples of the most frequent patterns for English and French (sorted by frequency)

| Rank | English | French |
| 1 | ProperNoun | Noun |
| 2 | Noun | Noun Adj |
| 3 | ProperNoun ProperNoun | Noun Prep Noun |
| 4 | Noun Noun | Noun Adj Adj |
| 5 | Adj Noun | Noun Prep det Noun |
| 6 | Noun Noun ProperNoun | Noun Prep ProperNoun |
| 7 | Adj ProperNoun ProperNoun | Noun ProperNoun |
| 8 | Noun ProperNoun ProperNoun | Noun Noun |
| 9 | Noun Noun Prep Noun | Noun Prep Noun Adj |
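The normalization and merging steps (a) and (b) can be sketched as below. This is only an illustration under stated assumptions: the text does not specify the normalization formula, so min-max scaling over the whole corpus is assumed here, and the Average factor is assumed to average over the documents in which the term occurs. The names `normalize`, `merge`, `docs` and the sample weights are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

DocScores = Dict[str, float]  # term -> weight (Okapi or TFIDF) inside one document


def normalize(scores_per_doc: List[DocScores]) -> List[DocScores]:
    """Min-max scale every weight of the corpus into [0, 1] (assumed scheme)."""
    values = [v for doc in scores_per_doc for v in doc.values()]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero when all weights are equal
    return [{t: (v - low) / span for t, v in doc.items()} for doc in scores_per_doc]


def merge(scores_per_doc: List[DocScores], factor: str) -> DocScores:
    """Merge the per-document lists into one list with factor 'M', 'S' or 'A'."""
    collected = defaultdict(list)
    for doc in scores_per_doc:
        for term, value in doc.items():
            collected[term].append(value)
    reducer = {"M": max, "S": sum, "A": mean}[factor]
    return {term: reducer(values) for term, values in collected.items()}


# Hypothetical per-document weights for a two-document corpus.
docs = [{"platelet": 2.0, "blood": 4.0}, {"platelet": 6.0}]
normalized = normalize(docs)
okapi_m = merge(normalized, "M")  # plays the role of the Okapi_M list
```

Calling `merge` three times, once per factor, yields the three lists per weighting measure mentioned in the text.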

Computing the New Combined Measures (See Part (4) in Figure 1)

With the aim of improving the precision of term extraction, we have conceived two new combined measure schemes, taking into account the results obtained in the above steps. The first one is based on the harmonic mean of two values. The second one is obtained by replacing the frequency in Equation (1) of C-value with the value of the weighting measures.

F-OCapi and F-TFIDF-C: Considered as the harmonic mean of the two values used, this method has the advantage of using all values of the distribution.

\[
\text{F-OCapi}_X(a) = 2 \times \frac{Okapi_X(a) \times \text{C-value}(a)}{Okapi_X(a) + \text{C-value}(a)}
\]

\[
\text{F-TFIDF-C}_X(a) = 2 \times \frac{TFIDF_X(a) \times \text{C-value}(a)}{TFIDF_X(a) + \text{C-value}(a)}
\]
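As a minimal sketch of the harmonic-mean combination, the following code ranks terms by the harmonic mean of a merged weighting score and a C-value score. It assumes that both inputs have already been brought to comparable scales, which the text does not spell out; the function names and the sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def harmonic_mean(weight: float, c_value: float) -> float:
    """Harmonic mean of a weighting score and a C-value score."""
    total = weight + c_value
    return 0.0 if total == 0 else 2 * weight * c_value / total


def rank_by_f_measure(
    weights: Dict[str, float],
    c_values: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank terms by decreasing harmonic mean; terms without a weight get 0."""
    scores = {t: harmonic_mean(weights.get(t, 0.0), c) for t, c in c_values.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical, already normalized scores.
okapi_m = {"platelet": 0.9, "white blood": 0.2}
c_values = {"platelet": 0.6, "white blood": 0.8}
ranking = rank_by_f_measure(okapi_m, c_values)
```

Because the harmonic mean is dominated by the smaller of its two arguments, a term needs a reasonable score under both measures to be ranked high.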

 

C-Okapi and C-TFIDF: For this measure, our assumption is that C-value can be more representative if the frequency of the terms in Equation (1) is replaced with a more significant value, in this case with the Okapi and TFIDF values of the terms (over the whole corpus).

\[
\text{C-}m_x(a) =
\begin{cases}
w(a) \times m_x(a), & \text{if } a \notin \text{nested} \\
w(a) \times \left( m_x(a) - \dfrac{1}{|S_a|} \times \sum_{b \in S_a} m_x(b) \right), & \text{otherwise}
\end{cases}
\]

Where $m_x(a) \in \{Okapi_X, TFIDF_X\}$ and $X \in \{M, S, A\}$.

Table 3. Calculation of w(a)

| | Original C-value: $w(a) = \log_2(|a|)$ | Modified C-value: $w(a) = \log_2(|a| + 1)$ |
| antiphospholipid antibodies | $\log_2(2) = 1$ | $\log_2(2 + 1) \approx 1.58$ |
| white blood | $\log_2(2) = 1$ | $\log_2(2 + 1) \approx 1.58$ |
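The C-value style scoring can be sketched with a single function: given raw frequencies it follows the C-value definition, and given merged Okapi or TFIDF weights it follows the C-Okapi / C-TFIDF variant. This is an illustrative sketch rather than the authors' code; it assumes that "nested" means "contained as a contiguous word sequence in a longer candidate term", and the names `contains`, `c_measure` and the sample frequencies are hypothetical.

```python
import math
from typing import Dict, List


def contains(longer: List[str], shorter: List[str]) -> bool:
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_measure(weights: Dict[str, float], modified: bool = True) -> Dict[str, float]:
    """Score each term a with w(a) * m(a), discounting m(a) when a is nested."""
    scores = {}
    for a, m_a in weights.items():
        words = a.split()
        # Modified weight log2(|a| + 1) avoids a null score for single-word terms.
        w_a = math.log2(len(words) + 1) if modified else math.log2(len(words))
        # S_a: the other candidate terms that contain a.
        s_a = [b for b in weights if b != a and contains(b.split(), words)]
        if s_a:
            m_a = m_a - sum(weights[b] for b in s_a) / len(s_a)
        scores[a] = w_a * m_a
    return scores


# Hypothetical frequencies of three candidate terms in the merged corpus.
frequencies = {"blood": 10, "white blood": 4, "white blood cell": 2}
scores = c_measure(frequencies)
```

With `modified=False` the single-word term "blood" receives a score of 0 because $\log_2(1) = 0$, which is the null-value problem that the modified weight avoids.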

Table 4 shows the different rankings of terms obtained with different measures. This example highlights specific and very relevant terms such as "antiphospholipid antibodies" and "platelet". Indeed, these terms obtain a better ranking with our measures, such as F-TFIDF-C_M.

Implementation and Availability

We developed BioTex, a web application (illustrated in Figure 2) that implements the entire workflow presented in this paper: for a given text corpus as input, BioTex extracts and ranks biomedical terms according to the selected extraction measure, chosen among C-value, Okapi, TFIDF, or one of the newly proposed combinations. In addition, BioTex can automatically validate terms that already exist in the available UMLS/MeSH-fr terminologies. We have implemented and evaluated BioTex for both English and French. The application is available online, and can also be used in any program through a Java API: http://tubo.lirmm.fr/biotex/

In the following section, we evaluate a large list of terms extracted and ranked with our new measures and their different combinations, using our web application.

DATA AND EXPERIMENTAL PROTOCOL

Test Collection

We used biological laboratory tests as corpus, obtained from LabTestsOnline.org. This site provides information in several languages to patients or family caregivers on clinical lab tests. Each test, which is considered as a document in our corpus, includes the formal lab test name, its synonyms and many alternate names, as well as a description of the test. Our extracted corpus contains [...] clinical tests (about [...] words) for English and [...] (about [...] words) for French.

Validation Data

In order to automatically validate our candidate terms, we build a validation dictionary that includes the official name, the synonyms and the alternate names of the LabTestsOnline tests, plus all UMLS terms for English and the MeSH terms for French. We can now evaluate precision with a proper reference for valid terms. Note that, as a consequence, the recall is equal to [...] with the whole list of extracted terms.
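The automatic validation amounts to computing precision over the top k ranked terms against the validation dictionary. A minimal sketch follows, assuming exact case-insensitive matching (the matching rule is not stated in the text); the function name and the sample data are hypothetical.

```python
from typing import List, Set


def precision_at_k(ranked_terms: List[str], dictionary: Set[str], k: int) -> float:
    """Fraction of the top-k ranked terms that appear in the validation dictionary."""
    top = ranked_terms[:k]
    if not top:
        return 0.0
    hits = sum(1 for term in top if term.lower() in dictionary)
    return hits / len(top)


# Hypothetical ranked list and validation dictionary (lower-cased entries).
ranked = ["platelet", "white blood", "test result", "antiphospholipid antibodies"]
dictionary = {"platelet", "antiphospholipid antibodies"}

p_at_1 = precision_at_k(ranked, dictionary, 1)
p_at_4 = precision_at_k(ranked, dictionary, 4)
```

Terms that are valid but absent from the dictionary count as errors under this measure, which is why the paper complements it with a manual validation.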

EXPERIMENTS AND RESULTS

A first evaluation was done automatically, without an expert verifying the terms that are not found in our validation dictionary. Results are evaluated in terms of precision obtained over the top k terms at the different steps of our workflow presented in the previous section. Okapi and TFIDF provided three lists of ranked candidate terms (M, S, A). For each combined measure using Okapi or TFIDF, the experiments are conducted with the three lists. Therefore, the number of ranked lists to compare is 19: C-value (1) + Okapi (3) + TFIDF (3) + F-OCapi (3) + F-TFIDF-C (3) + C-Okapi (3) + C-TFIDF (3). In addition, we ran the workflow either for all terms (single and multi) or for multi terms only, which finally gives 38 ranked lists (19 × 2 = 38 experiments for each language).

The following sections show part of the experimental results, for all terms or multi terms only, considering the top [...], [...] and [...] extracted terms, because it is appropriate and easier for an expert to evaluate only the top k extracted terms. We evaluated first the baseline measures and then the new combined measures, for English and French.

Table 4. Rank of terms based on different measures. Columns: C-value, TFIDF_M, Okapi_M, F-TFIDF-C_M, F-OCapi_M, C-TFIDF_M, C-Okapi_S. Rows: antiphospholipid antibodies, white blood. (Rank values lost in extraction.)

Experiments with AKE Methods: Okapi and TFIDF

The experiments with these methods were performed after applying the linguistic filter. The experiments were carried out for All and Multi term extraction. Table 5 and Table 6 show the results of term extraction with Okapi_X. The best results were often obtained with Okapi_M for both languages.

Table 7 and Table 8 show the results of terminology extraction with TFIDF_X. The best results were obtained with TFIDF_M for All terms for both languages. For Multi terms, the best results were obtained with TFIDF_S for both languages.

Experiments with C-value

In this subsection, we evaluated the ATR method C-value (see Tables 9 and 10).

Experiments with New Combined Measures

The new measures were also evaluated. Table 11 and Table 12 present the results of terminology extraction with these new measures. In general, the best precision rate is obtained with F-TFIDF-C_M for English and F-OCapi_M for French.

Manual Validation

The automatic validation does not give the true precision, because some extracted terms are valid but are not in our dictionaries. We therefore exported a list of extracted terms to be manually validated. For this, we chose the list with the best precision rate in the automatic validation process. Table 13 and Table 14 compare the best results of the measures evaluated above. In general, F-TFIDF-C_M obtained the best results for English term extraction, and F-OCapi_M obtained the highest precision for French biomedical terms. Experts validated these two lists, composed of [...] terms. Table 15 and Table 16 show the precision computed with the manual validation compared to the one computed with the automatic validation. Note that the manual validation confirms that our ranking function behaves well, because the precision value is better for the first terms.

Table 5. Precision of Okapi_X on English corpus. Columns: All Terms and Multi Terms, each at three top-k cut-offs. Rows: Okapi_M, Okapi_S, Okapi_A. (Cell values lost in extraction.)

Table 6. Precision of Okapi_X on French corpus. Columns: All Terms and Multi Terms, each at three top-k cut-offs. Rows: Okapi_M, Okapi_S. (Cell values lost in extraction.)

DISCUSSION

In the results of AKE methods, TFIDF obtains better results than Okapi. The main reason for this is that the English corpus is larger than the French one, and Okapi is known to perform better when the corpus size is smaller (Lv et al.).

Table 10 shows that C-value can be used to extract French biomedical terms with a better precision than what was obtained in the previously cited works in other languages. The precision of C-value in those previous works was between [...] and [...].

For the new combined measures, the best results are obtained by combining C-value with the best results from AKE methods, i.e., F-TFIDF-C_M and F-OCapi_M. Table 13 and Table 14 compare the precision of the best baseline measures and the best combined measures. The best results were obtained in general with F-TFIDF-C_M for English and F-OCapi_M for French. These figures show that the combined measures based on the harmonic mean are better than the baseline measures, especially for multi-word terms, for which the gain in precision reaches [...]. This result is particularly positive because in the biomedical domain it is often more interesting to extract multi-word terms than single-word terms. However, one can notice that the results obtained when extracting all terms with C-Okapi and C-TFIDF are not better than those of Okapi or TFIDF used directly. The main reason for this is that the performance of those new combined measures is absorbed by the effect of also extracting single-word terms. Overall, all the new combined measures perform better for multi-word terms.

Several terms proposed by our system are considered as irrelevant (i.e., false positive examples) by our automatic validation protocol because they are not present in known biomedical dictionaries, which does not mean that they are actually irrelevant. Elements that are not found in biomedical resources can be judged relevant through a manual validation. For instance, they can represent new terms to add to biomedical dictionaries. So, in Tables 15 and 16, the precision rate is naturally higher with a manual validation.

In addition to LabTestsOnline.org, we also ran experiments with two more corpora: (i) the Drugs data from MedlinePlus, which contains about [...] million words in English, on which we verified that the new combined measures perform better, particularly those based on the harmonic mean, F-TFIDF-C_M and F-OCapi_M; (ii) PubMed citation titles in English and French, which contain about [...] article titles, on which the results show a small difference between the baseline measures and the new combined measures, mainly because titles are small pieces of text and therefore the new combined measures cannot take advantage of the frequency.

Table 7. Precision of TFIDF_X on English corpus. Columns: All Terms and Multi Terms, each at three top-k cut-offs. Rows: TFIDF_M, TFIDF_S, TFIDF_A. (Cell values lost in extraction.)

Table 8. Precision of TFIDF_X on French corpus. Columns: All Terms and Multi Terms, each at three top-k cut-offs. Rows: TFIDF_M, TFIDF_S, TFIDF_A. (Cell values lost in extraction.)

Table 9. Precision of C-value on English corpus. Columns: All Terms and Multi Terms, each at three top-k cut-offs. Row: C-value. (Cell values lost in extraction.)

Table 10. Precision of C-value on French corpus. Columns: All Terms and Multi Terms, each at three top-k cut-offs. (Cell values lost in extraction.)

Table 11. Precision comparison of new measures for English. Columns: All Terms and Multi Terms, each at three top-k cut-offs. Rows: F-OCapi_M, F-TFIDF-C_M, C-Okapi_S, C-TFIDF_S. (Cell values lost in extraction.)

Table 12. Precision comparison of new measures for French. Columns: All Terms and Multi Terms, each at three top-k cut-offs. Rows: F-OCapi_M, F-TFIDF-C_M, C-Okapi_S, C-TFIDF_S. (Cell values lost in extraction.)

Table 13. Precision of the best measures for the extraction of all terms for English. Columns: All Terms at four top-k cut-offs. Rows: F-TFIDF-C_M, C-TFIDF_S, C-value, Okapi_M, TFIDF_M. (Cell values lost in extraction.)

Table 14. Precision of the best measures for the extraction of all terms for French. Columns: All Terms at four top-k cut-offs. Rows: F-OCapi_M, C-TFIDF_S, C-value, Okapi_M, TFIDF_M. (Cell values lost in extraction.)

Table 15. Precision of F-TFIDF-C_M for English with automatic and manual validations. Columns: Multi Terms by F-TFIDF-C_M at six top-k cut-offs. Rows: Automatic Validation, Manual Validation. (Cell values lost in extraction.)

CONCLUSION AND FUTURE WORK

This paper presents a new methodology to automatically extract biomedical terminology in order to propose relevant terms to experts. For term ranking, [...] measures have been proposed for two languages, French and English.

We have adapted C-value to extract French biomedical terms, which had not been proposed in the literature before. The precision of C-value in previous works was between [...] and [...]. With this proposal we greatly improved these results. This measure has been improved first by adding linguistic patterns of the biomedical field. Second, the statistical aspects of the measure have been changed in order to take into account all types of terms (i.e., single-word and multi-word terms).

We applied two AKE methods for extracting keywords from a document, merging the terms into a single list according to three merging factors.

We presented and evaluated two new measures obtained by combining three existing methods. The evaluation showed that these combinations obtain the best precision rates in both cases, all-term and multi-term extraction, for French.

Future work will be dedicated to (i) a web ranking in order to improve the precision of the terminology lists, and (ii) the improvement of the BioTex web application (web service) in order to enable anyone to use any of our proposed biomedical term extraction methods on other datasets. We are also considering enriching our validation dictionaries with BioPortal terms for English and CISMeF terms for French.

ACKNOWLEDGMENT

This work was supported in part by the French National Research Agency under the JCJC program, grant ANR-[...]-JS[...], as well as by University Montpellier 2 and CNRS, and by the IBC Project (www.ibc-montpellier.fr).

Table 16. Precision of F-OCapi_M for French with automatic and manual validations. Columns: Multi Terms by F-OCapi_M at six top-k cut-offs. Rows: Automatic Validation, Manual Validation. (Cell values lost in extraction.)

REFERENCES

Al Khatib, K., & Badarneh, A. Automatic extraction of Arabic multi-word terms. In Proc. of Computer Science and Information Technology (IMCSIT).

Barrón-Cedeño, A., Sierra, G., Drouin, P., & Ananiadou, S. An improved automatic term recognition method for Spanish. In Proc. of Computational Linguistics and Intelligent Text Processing.

Cilibrasi, R. L., & Vitanyi, P. M. B. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering.

Eck, N. J., Waltman, L., Noyons, E. C. M., & Buter, R. K. Automatic term identification for bibliometric mapping. Scientometrics.

Frantzi, K., Ananiadou, S., & Mima, H. Automatic recognition of multiword terms: The C-value/NC-value method. International Journal on Digital Libraries.

Gracia, J., Trillo, R., Espinoza, M., & Mena, E. Querying the web: A multi-ontology disambiguation method. In Proceedings of the International Conference on Web Engineering.

Gracia, J., Trillo, R., Espinoza, M., & Mena, E. Web-based measure of semantic relatedness. In Proceedings of the International Conference on Web Information Systems Engineering.

Hliaoutakis, A., Zervanou, K., & Petrakis, E. G. M. The AMTEx approach in the medical document indexing and retrieval application. Data & Knowledge Engineering.

Hussey, R., Williams, S., & Mitchell, R. Automatic keyphrase extraction: A comparison of methods. In Proc. of the International Conference on Information Process, and Knowledge Management.

Ji, L., Sum, M., Lu, Q., Li, W., & Chen, Y. Chinese terminology extraction using window-based contextual information. In Proceeding of CICLing, LNCS.

Knoth, P., Schmidt, M., Smrz, P., & Zdrahal, Z. Towards a framework for comparing automatic term recognition methods. In Proceedings of the Conference Znalosti.

Krauthammer, M., & Nenadic, G. Term identification in the biomedical literature. Journal of Biomedical Informatics.

Kupsc, A. Extraction automatique de termes à partir de textes polonais. Journal Linguistique de Corpus.

LabTestOnLine. (n.d.). Retrieved from http://labtestsonline.org/

Lossio-Ventura, J. A., Jonquet, C., Roche, M., & Teisseire, M. Combining C-value and keyword extraction methods for biomedical terms extraction. In Proceedings of the International Symposium on Languages in Biology and Medicine.

Lv, Y., & Zhai, C. X. When documents are very long, BM25 fails! In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval.

Medelyan, O., Eibe, F., & Witten, I. H. Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the International Conference of Empirical Methods in Natural Language Processing (EMNLP), Singapore.

MeSH. Medical Subject Headings is the NLM controlled vocabulary thesaurus used for indexing articles for PubMed. Retrieved from http://www.ncbi.nlm.nih.gov/mesh

Nenadic, G., Spasic, I., & Ananiadou, S. Morpho-syntactic clues for terminological processing in Serbian. In Proceedings of the EACL Workshop on Morphological Processing of Slavic Languages.

Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M., Griffith, N., et al. BioPortal: Ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research.

Robertson, S. E., Walker, S., & Beaulieu, M. Okapi at TREC: Automatic ad hoc, filtering, VLC and interactive track.

Sclano, F., & Velardi, P. TermExtractor: A web application to learn the common terminology of interest groups and research communities. In Enterprise Interoperability II.

Vintar, S. Comparative evaluation of C-value in the treatment of nested terms. In Proceedings of the Workshop Methodologies and Evaluation of Multiword Units in Real-world Applications (LREC).

TreeTagger. (n.d.). Retrieved from www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Unified Medical Language System (UMLS). Retrieved from http://www.nlm.nih.gov/research/umls

Zhang, Y., Milios, E., & Zincir-Heywood, N. A comparison of keyword- and keyterm-based methods for automatic web site summarization. In Proceedings of the AAAI Workshop on Adaptive Text Extraction and Mining.

Zhang, Z., Iria, J., Brewster, C., & Ciravegna, F. A comparative evaluation of term recognition algorithms. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC).

ENDNOTES

1 http://www.nlm.nih.gov/research/umls
2 http://mesh.inserm.fr
3 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger
4 http://www.nlm.nih.gov/medlineplus
5 http://www.ncbi.nlm.nih.gov/pubmed
6 http://bioportal.bioontology.org
7 http://www.chu-rouen.fr/cismef
