HAL Id: hal-01059329
https://hal.archives-ouvertes.fr/hal-01059329
Submitted on 20 Oct 2020
The Semantic Measures Library and Toolkit: fast
computation of semantic similarity and relatedness using
biomedical ontologies
Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, Jacky Montmain
To cite this version:
Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, Jacky Montmain. The Semantic Measures Library
and Toolkit: fast computation of semantic similarity and relatedness using biomedical ontologies.
Bioinformatics, Oxford University Press (OUP), 2014, 30 (5), pp.740-742.
⟨10.1093/bioinformatics/btt581⟩. ⟨hal-01059329⟩
The semantic measures library and toolkit: fast computation of semantic similarity and relatedness using biomedical ontologies
Sébastien Harispe*, Sylvie Ranwez, Stefan Janaqi and Jacky Montmain
LGI2P/EMA Research Centre, Site EERIE, Parc Scientifique G. Besse, 30035 Nîmes cedex 1, France
Associate Editor: Martin Bishop
ABSTRACT
Summary: The semantic measures library and toolkit are robust, open-source and easy-to-use software solutions dedicated to semantic measures. They can be used for large-scale computations and analyses of semantic similarities between terms/concepts defined in terminologies and ontologies. The comparison of entities (e.g. genes) annotated by concepts is also supported. A large collection of measures is available. Not limited to a specific application context, the library and the toolkit can be used with various controlled vocabularies and ontology specifications (e.g. Open Biomedical Ontology, Resource Description Framework). The project targets both designers and practitioners of semantic measures, providing a JAVA library as well as a command-line tool that can be used on personal computers or computer clusters.
Availability and implementation: Downloads, documentation, tutorials, evaluation and support are available at http://www.semantic-measures-library.org.
Contact: harispe.sebastien@gmail.com
1 INTRODUCTION
Biomedical ontologies provide well-structured and controlled vocabularies of specific domains, e.g. biological processes and clinical healthcare terminology. They are increasingly used to drive data integration, information retrieval, data annotations and decision support, to cite a few (Stevens et al., 2000). Open repositories such as the Open Biomedical Ontology (OBO) Foundry or BioPortal (Smith et al., 2007; Whetzel et al., 2011) provide access to hundreds of biomedical ontologies expressed in various formats, e.g. Resource Description Framework (RDF), OBO, Web Ontology Language (OWL). These structured vocabularies are used to characterize entities through conceptual annotations. For instance, genes (products) can be annotated by Gene Ontology (GO) terms to define their molecular functions, their cellular locations or the biological processes in which they are involved (Ashburner et al., 2000). Those unambiguous annotations can, therefore, be used to query large collections of data taking into account the knowledge defined in the ontology, i.e. practitioners searching for genes annotated to 'nucleoside binding' will also retrieve genes annotated to 'ATP binding', as the ontology specifies that 'ATP binding' is a specific type of 'nucleoside binding'. However, in some cases, exact searches are too constraining, and we search for entities that are similar or related to the query. Such an imprecise search is based on information retrieval techniques that require a function to estimate whether or not two entities are similar or related with regard to their conceptual annotations. Therefore, to exploit ontologies and corresponding annotations, semantic measures are required. They aim to compare concepts by taking into account the semantic space in which they are defined. They can, therefore, be used to assess the degree of likeness of concepts defined in ontologies or between entities annotated by those concepts (Pesquita et al., 2009).
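The ontology-aware exact search described above can be illustrated with a minimal Python sketch. It is not SML code: the toy "is a" graph, the gene names and the helper functions `descendants` and `search` are all assumptions made for illustration, and only the 'ATP binding' / 'nucleoside binding' relationship is taken from the text (simplified here to a direct edge).

```python
from collections import defaultdict

# Toy "is a" edges (child -> parents). Illustrative only; the direct
# ATP binding -> nucleoside binding edge is a simplification.
IS_A = {
    "ATP binding": {"nucleoside binding"},
    "nucleoside binding": {"binding"},
    "DNA binding": {"binding"},
}

# Hypothetical gene annotations, invented for this example.
ANNOTATIONS = {
    "geneA": {"ATP binding"},
    "geneB": {"nucleoside binding"},
    "geneC": {"DNA binding"},
}


def descendants(concept, is_a):
    """Return the concept together with every concept that specializes it."""
    children = defaultdict(set)
    for child, parents in is_a.items():
        for parent in parents:
            children[parent].add(child)
    found, stack = {concept}, [concept]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(concept, is_a, annotations):
    """Genes annotated by the concept or by any of its specializations."""
    wanted = descendants(concept, is_a)
    return sorted(g for g, terms in annotations.items() if terms & wanted)


print(search("nucleoside binding", IS_A, ANNOTATIONS))  # ['geneA', 'geneB']
```

Such a search is still exact: geneC is never returned for 'nucleoside binding', however close its annotation may be. Ranking it as a near match requires a graded similarity function, which is the role of semantic measures.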
An increasing number of algorithms rely on semantic measures, for instance to analyze genes based on their molecular functions (Sy et al., 2012) or related diseases (Li et al., 2011). Semantic measures can also assist in comparisons of patient records, chemical compounds, diseases or any entity that can be characterized by unambiguous terms or concepts defined in ontologies or thesauri.
Numerous communities are involved in the study of semantic measures (e.g. bioinformatics, Natural Language Processing, artificial intelligence and Semantic Web). Owing to their popularity, many measures have been designed for different ontologies and treatments (e.g. gene analysis, information retrieval): a recent survey distinguished dozens of measures dedicated to the GO alone (Guzzi et al., 2012). Moreover, communities focusing on other types of annotated entities (e.g. patient records) also benefit from theoretical findings made by studying measures in other specific domains such as molecular biology, and vice versa. Nevertheless, most software solutions related to semantic measures are developed for a specific terminology/ontology and only focus on a limited set of measures (Fröhlich et al., 2007; Li et al., 2011; McInnes et al., 2009; Yu et al., 2010). To federate efforts related to the design and analysis of semantic measures and to respond to the need for a generic software tool dedicated to them, we developed the semantic measures library (SML). This article presents its benefits for the computation of semantic measures using bio-ontologies.
2 THE SML AND TOOLKIT
The SML is an extensive, efficient and generic open-source library dedicated to the computation, development and analysis of semantic measures. Numerous functionalities provided by the SML are also available within the SML-Toolkit, a command-line program that can be used by non-developers to easily compute semantic measures on personal computers or computer clusters. The SML and the toolkit are distributed under the open-source CeCILL license (compatible with the widely used GNU General Public License).

* To whom correspondence should be addressed.
The SML uses the cross-platform JAVA programming language (version 1.7), which is available for most operating systems. It can be used to compute semantic similarities of concepts/terms defined in structured terminologies and ontologies. It can also be used to assess the semantic similarity of pairs of entities annotated by concepts, e.g. patient records annotated by groups of concepts, genes annotated by GO terms, PubMed articles annotated by MeSH descriptors. Given a pair of terms/entities, the library computes a similarity score. Developers can, therefore, easily embed source code referring to the library to compute measures in their own algorithms and applications. The library supports various ontology formats and specifications (e.g. OBO, RDF, OWL). Specific ontology loaders are also provided to handle widely used biomedical terminologies such as MeSH and SNOMED Clinical Terms (SNOMED CT). Custom knowledge representation loaders can also be added to the SML. In addition, low-level access to the library enables developers to finely control the underlying graph model (ontology) to apply specific treatments sometimes required for the computation of semantic measures (e.g. transitive reduction to remove taxonomical redundancies).
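To make the notion of a similarity score for a pair of concepts or a pair of annotated entities concrete, here is a minimal Python sketch. It does not use the SML API. The toy taxonomy, the choice of a Jaccard index over ancestor sets for concepts, the choice of a best-match average for entities, and all function names are assumptions made for illustration; they are two simple strategies among the many that exist.

```python
# Toy "is a" taxonomy (child -> parents), invented for illustration.
IS_A = {
    "ATP binding": {"nucleoside binding"},
    "nucleoside binding": {"binding"},
    "DNA binding": {"binding"},
    "binding": set(),
}


def ancestors(concept, is_a):
    """Inclusive ancestor set of a concept (the concept plus all its ancestors)."""
    seen, stack = {concept}, [concept]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def concept_sim(a, b, is_a):
    """Jaccard index of the two inclusive ancestor sets (1.0 for identical concepts)."""
    anc_a, anc_b = ancestors(a, is_a), ancestors(b, is_a)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def entity_sim(terms_a, terms_b, is_a):
    """Best-match average between two non-empty sets of annotating concepts."""
    def avg_best(src, dst):
        # Each source term is paired with its most similar term on the other side.
        return sum(max(concept_sim(s, d, is_a) for d in dst) for s in src) / len(src)

    return (avg_best(terms_a, terms_b) + avg_best(terms_b, terms_a)) / 2


print(concept_sim("ATP binding", "nucleoside binding", IS_A))
print(concept_sim("ATP binding", "DNA binding", IS_A))
print(entity_sim({"ATP binding"}, {"nucleoside binding", "DNA binding"}, IS_A))
```

The entity-level function only depends on a pairwise concept measure, so a different pairwise measure can be swapped in without touching the aggregation step.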
A large collection of semantic measures is provided out of the box: version 0.7 supports about 50 measures relying on different strategies. Thanks to the fine-grained control provided by the library, this leads to about 1500 specific measure configurations that can be specified for context-specific applications. In addition, the algorithms developed in the SML provide the designers of semantic measures with an extensive Application Programming Interface and framework to easily develop, test and evaluate new measures. Moreover, because of its generic underlying graph data model, semantic measures developed using the SML will benefit a large audience. Unlike the measures offered by existing software solutions, they are not restricted to a specific ontology and can, therefore, be used with the various knowledge representations supported by the library. Furthermore, the SML relies on a graph model compatible with the Linked Data paradigm. This enables SML users to take advantage of the growing number of datasets published according to Linked Data and Semantic Web visions, e.g. see the Bio2RDF initiative (Belleau et al., 2008).
The SML enables large-scale computations and analyses of semantic measures. It supports multi-threaded processes for fast parallel computation on multicore processors. Table 1 presents a running time comparison between three libraries dedicated to the GO and the SML (detailed protocol, associated source code and additional evaluations are provided at http://www.semantic-measures-library.com/sml/performance).
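The general pattern behind scoring many pairs in parallel can be sketched as follows: split the list of pairs into chunks, score each chunk in a worker and merge the results in input order. This Python sketch is an illustration only and does not reflect how the SML schedules its Java threads. The function names, the chunk size and the placeholder similarity are assumptions. Note also that CPython threads do not speed up pure-Python CPU-bound code, so a process pool would be the usual substitute in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_pairs(pairs, sim, workers=4, chunk_size=1000):
    """Score every pair with `sim`, spreading chunks over a pool of workers."""
    def score_chunk(chunk):
        return [sim(a, b) for a, b in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields chunk results in submission order,
        # so the flattened scores stay aligned with the input pairs.
        results = pool.map(score_chunk, chunked(pairs, chunk_size))
        return [score for chunk_scores in results for score in chunk_scores]


def toy_sim(a, b):
    """Placeholder similarity: Jaccard index of two non-empty annotation sets."""
    return len(a & b) / len(a | b)


pairs = [({"t1", "t2"}, {"t2", "t3"}), ({"t1"}, {"t1"}), ({"t1"}, {"t4"})]
print(score_pairs(pairs, toy_sim, workers=2, chunk_size=2))
```

Because each pair is scored independently, the work partitions cleanly; the only shared state is the read-only knowledge base used by the similarity function.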
Based on the SML, an open-source toolkit enables non-developers to benefit from functionalities provided by the library through easy-to-use command-line software. The SML-Toolkit is highly tuneable and enables context-specific configurations to be specified depending on the experiment performed: knowledge base to use (ontologies, annotations), required data preprocessing (e.g. the removal of taxonomic redundancies), measure constraints (e.g. algorithmic complexity, information to take into account), set of queries to perform (i.e. concept or entity identifiers) and other (optional) parameters (e.g. output file, computer resources allocated). Detailed configurations can be specified using an Extensible Markup Language (XML) file. Specific command-line interfaces, called profiles, are also developed to ease the use of the SML-Toolkit in specific use cases, e.g. to estimate the similarity of genes regarding their GO term annotations. Such profiles can be used to hide the advanced capabilities of the library, and therefore improve the experience for users interested only in computing semantic measures in a specific context of use (e.g. gene or disease analysis). Related source code and issue trackers are available from the dedicated public repository. Community support is also provided to facilitate usage and ensure improvements of both the library and the toolkit.
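The removal of taxonomic redundancies listed among the preprocessing options corresponds to a transitive reduction of the taxonomy: a direct "is a" edge is dropped when a longer path already implies it. The Python sketch below illustrates the idea on a toy graph; it is not the SML implementation, the function name is hypothetical, and it assumes the graph is acyclic.

```python
def transitive_reduction(is_a):
    """Return a copy of an acyclic child -> parents map without implied edges."""
    def strict_ancestors(start):
        # Every concept reachable from `start` by following one or more edges.
        seen, stack = set(), [start]
        while stack:
            for parent in is_a.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    reduced = {}
    for child, parents in is_a.items():
        # A direct parent is redundant if another direct parent already leads to it.
        implied = set()
        for parent in parents:
            implied |= strict_ancestors(parent)
        reduced[child] = set(parents) - implied
    return reduced


# Toy graph with one redundant edge: ATP binding -> binding.
IS_A = {
    "ATP binding": {"nucleoside binding", "binding"},
    "nucleoside binding": {"binding"},
    "binding": set(),
}

print(transitive_reduction(IS_A))
```

Measures that count edges or direct parents can give different scores before and after this step, which is one reason such preprocessing is exposed as an explicit option.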
Open source, generic, efficient and highly tuneable, the SML and the toolkit are not limited to a specific ontology and can, therefore, be used in a broad range of applications, (scientific) projects and software solutions [e.g. Harispe et al. (2013), Sy et al. (2012)].
ACKNOWLEDGEMENT
The authors would like to thank SML users and developers for their contributions to the project.
Funding: French Life Sciences and Healthcare Alliance (AVIESAN).
Conflict of Interest: none declared.
REFERENCES
Ashburner, M. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.
Belleau, F. et al. (2008) Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J. Biomed. Inf., 41, 706–716.
Fröhlich, H. et al. (2007) GOSim – an R-package for computation of information theoretic GO similarities between terms and gene products. BMC Bioinformatics, 8, 166.
Guzzi, P.H. et al. (2012) Semantic similarity analysis of protein data: assessment with biological features and issues. Briefings Bioinformatics, 13, 569–585.
Harispe, S. et al. (2013) Semantic measures based on RDF projections: application to content-based recommendation systems. In: On the Move to Meaningful Internet Systems: OTM 2013 Conferences. Springer Berlin Heidelberg, Graz (Austria), pp. 606–615.
Li, J. et al. (2011) DOSim: An R package for similarity between diseases based on Disease Ontology. BMC Bioinformatics, 12, 266.
Table 1. Running times of the generic SML and three tools dedicated to the GO

Tools           1K         10K        1M         100M
FastSemSim(a)   0m13.36    0m16.79    7m8.14     X
GOSim           X          X          X          X
GOSemSim        27m02.66   X          X          X
SML             0m10.01    0m11.18    1m38.87    133m27.44
SML parallel    0m9.80     0m10.24    0m47.62    58m

(a) http://sourceforge.net/projects/fastsemsim/.
Note: Four tests have been performed considering random samples of gene pairs with fixed sizes (see columns, K = 10^3, M = 10^6). SML parallel corresponds to the SML configured with four threads. 'X' specifies that the process required >6 Go of RAM or took >4 h.
McInnes, B.T. et al. (2009) UMLS-Interface and UMLS-Similarity: open source software for measuring paths and semantic similarity. AMIA Annu. Symp. Proc., 2009, 431–435.
Pesquita, C. et al. (2009) Semantic similarity in biomedical ontologies. PLoS Comput. Biol., 5, 12.
Smith, B. et al. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat. Biotechnol., 25, 1251–1255.
Stevens, R. et al. (2000) Ontology-based knowledge representation for bioinformatics. Briefings Bioinformatics, 1, 398–414.
Sy, M.-F. et al. (2012) User centered and ontology based information retrieval system for life sciences. BMC Bioinformatics, 13 (Suppl. 1), S4.
Whetzel, P.L. et al. (2011) BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res., 39, W541–W545.
Yu, G. et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics, 26, 976–978.