
From Data Mining in Biomedicine Using Ontologies (pages 49-52)

Ontological Similarity Measures

2.2 Traditional Approaches to Ontological Similarity


w(X, Y) = [w(X →r Y) + w(Y →r′ X)] / 2d (2.14)

The relative scaling by this depth is based on the intuition that siblings deep in a network are more closely related than siblings higher up. The semantic distance between two arbitrary nodes, c1 and c2, is then computed as the sum of the distances between the pairs of adjacent nodes along the shortest path connecting c1 and c2.

2.2.2 Information Content Measures

The foundation for this approach is the insight that conceptual similarity between two concepts, c1 and c2, may be judged by the degree to which they share information [41]. The more information they share, the more similar they are. In an is-a network, this common information is contained in the most specific concept

that subsumes both c1 and c2, the common subsumer, which is also referred to as the lowest common ancestor (LCA), c3. For example, Figure 2.2 shows a fragment of the WordNet ontology.

The most specific superclass or lowest common ancestor for nickel and dime is coin, and for nickel and credit card it is medium of exchange. The semantic similarity between nickel and dime should be determined by the information content of coin, and that between nickel and credit card by the information content of medium of exchange. According to standard information theory [61], the information content of a concept c is based on the probability of encountering an instance of concept c in a certain corpus, p(c). As one moves up the taxonomy, p(c) is monotonically nondecreasing. The probability is based on using the corpus to perform a frequency count of all occurrences of concept c, including occurrences of any of its descendants. The information content (IC) of concept c using the corpus approach is then quantified as

IC_corpus(c) = −log p(c) (2.15)
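The corpus-based definition in (2.15) can be sketched in a few lines of Python. The frequency table and concept names below are hypothetical stand-ins for real corpus counts; following the text, an occurrence of a concept also counts toward every one of its ancestors, so the root's count equals the corpus total.

```python
import math

# Hypothetical cumulative frequency counts (each concept's count already
# includes all of its descendants, so counts never decrease going up).
freq = {"medium of exchange": 1000, "money": 600, "coin": 300,
        "nickel": 120, "dime": 100}
total = freq["medium of exchange"]  # root subsumes every occurrence

def ic_corpus(c):
    """IC_corpus(c) = -log p(c) = log(total / freq(c)) (2.15)."""
    return math.log(total / freq[c])

print(ic_corpus("medium of exchange"))  # 0.0: the root carries no information
print(ic_corpus("nickel") > ic_corpus("coin"))  # True: deeper is more informative
```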

Another approach to determine IC was proposed in [47] using WordNet as an example. The structure of the ontology itself is used as a statistical resource, with no need for external ones such as corpora. The assumption is that the ontology is organized in a meaningful and structured way, so that concepts with many descendants communicate less information than leaf concepts. The more descendants a concept has, the less information it expresses. The IC for a concept c is defined as

IC_ont(c) = log((desc(c) + 1) / max_ont) / log(1 / max_ont) = 1 − log(desc(c) + 1) / log(max_ont) (2.16)

where desc(c) is the number of descendants of concept c, and max_ont is the maximum number of concepts in the ontology.
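A minimal sketch of the ontology-intrinsic IC of (2.16), over a hypothetical five-edge fragment loosely modeled on Figure 2.2; the names `parent`, `desc`, and `ic_ont` are illustrative, not from the text.

```python
import math

# Hypothetical is-a edges (child -> parent), loosely following Figure 2.2.
parent = {"money": "medium of exchange", "credit card": "medium of exchange",
          "coin": "money", "nickel": "coin", "dime": "coin"}
concepts = set(parent) | set(parent.values())
max_ont = len(concepts)  # total number of concepts in the ontology (6 here)

def is_ancestor(a, x):
    """True if a is a strict ancestor of x."""
    while x in parent:
        x = parent[x]
        if x == a:
            return True
    return False

def desc(c):
    """Number of descendants of concept c."""
    return sum(1 for x in concepts if is_ancestor(c, x))

def ic_ont(c):
    """IC_ont(c) = 1 - log(desc(c) + 1) / log(max_ont) (2.16)."""
    return 1 - math.log(desc(c) + 1) / math.log(max_ont)

print(ic_ont("nickel"))              # 1.0: a leaf is maximally informative
print(ic_ont("medium of exchange"))  # 0.0: the root subsumes everything
```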

Figure 2.2 A fragment of the WordNet ontology [41].

Resnik proposed the use of information content [41] to determine the semantic similarity between two concepts, c1 and c2, with lowest common ancestor c3, as

sim_Res(c1, c2) = IC_corpus(c3) (2.17)

Note that this is intuitively satisfying, since the higher the position of c3 in the taxonomy, the more abstract c3 is, and therefore the lower the similarity between c1 and c2. The lowest common ancestor c3, also referred to as the most specific subsumer, most informative common ancestor [11], or nearest common ancestor [39], is the concept from the set of concepts subsuming both c1 and c2 that has the greatest information content.
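As a sketch, Resnik's measure (2.17) reduces to looking up the IC of the most informative common ancestor. The taxonomy fragment and IC values below are hypothetical; in practice the ICs would come from (2.15) or (2.16), and here they are only chosen to respect monotonicity.

```python
# Hypothetical is-a edges and IC values; real ICs would come from (2.15)/(2.16).
parent = {"money": "medium of exchange", "credit card": "medium of exchange",
          "coin": "money", "nickel": "coin", "dime": "coin"}
ic = {"medium of exchange": 0.5, "money": 1.2, "credit card": 2.0,
      "coin": 2.5, "nickel": 4.0, "dime": 4.1}

def subsumers(c):
    """Concept c together with all of its ancestors up to the root."""
    out = {c}
    while c in parent:
        c = parent[c]
        out.add(c)
    return out

def lca(c1, c2):
    """Most informative common ancestor: common subsumer with maximal IC."""
    return max(subsumers(c1) & subsumers(c2), key=ic.get)

def sim_res(c1, c2):
    """sim_Res(c1, c2) = IC(c3), with c3 the LCA of c1 and c2 (2.17)."""
    return ic[lca(c1, c2)]

print(lca("nickel", "dime"))         # coin
print(lca("nickel", "credit card"))  # medium of exchange
print(sim_res("nickel", "dime"))     # 2.5
```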

While most similarity measures increase with commonality and decrease with difference, simRes takes only commonality into account. The semantic similarity between two concepts proposed by Lin [31] takes both commonality and difference into account. It uses the shared information content of the lowest-common-ancestor concept and normalizes it by the sum of the information content of both concepts c1 and c2, given by

sim_L(c1, c2) = 2 IC_corpus(c3) / [IC_corpus(c1) + IC_corpus(c2)] (2.18)
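A sketch of Lin's measure (2.18); for brevity, the LCA's IC is passed in directly rather than computed from the taxonomy, and the numeric IC values are hypothetical.

```python
def sim_lin(ic1, ic2, ic3):
    """sim_L(c1, c2) = 2*IC(c3) / (IC(c1) + IC(c2)) (2.18)."""
    return 2 * ic3 / (ic1 + ic2)

# Hypothetical ICs: IC(nickel) = 4.0, IC(dime) = 4.1, IC(coin) = 2.5.
print(sim_lin(4.0, 4.1, 2.5))  # 5.0 / 8.1, roughly 0.617
# An identical pair shares all of its information: similarity is exactly 1.
print(sim_lin(4.0, 4.0, 4.0))  # 1.0
```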

Jiang and Conrath [24] began by trying to combine the network-distance approach with the information-theoretic approach. They envisioned using corpus statistics as a corrective factor to fix the problems with the weighted-edge-counting approaches and developed a general formula for the weight of a link between a child concept cc and its parent concept cp in a hierarchy. This formula incorporates the ideas of node depth, local density (similar to the TSF), and the link type. These ideas parallel the approach used in [51].

Jiang and Conrath studied the roles of the density and depth components, concluded that they are not major factors in the overall edge weight, and shifted their focus to the link-strength factor. The link-strength factor uses information content, but in the form of conditional probability, or in other words, the probability of encountering an instance of a child concept c1 given an instance of a parent concept c3. If the probabilities are assigned as in Jiang and Conrath [24], then the distance between concepts c1 and c2, with concept c3 as the most specific concept that subsumes both, is

dist_JC(c1, c2) = IC_corpus(c1) + IC_corpus(c2) − 2 IC_corpus(c3) (2.19)
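A sketch of the Jiang-Conrath distance (2.19) with hypothetical IC values: identical concepts are at distance 0, and the distance grows as the common subsumer becomes more abstract.

```python
def dist_jc(ic1, ic2, ic3):
    """dist_JC(c1, c2) = IC(c1) + IC(c2) - 2*IC(c3) (2.19)."""
    return ic1 + ic2 - 2 * ic3

# Hypothetical ICs: nickel = 4.0, dime = 4.1, coin = 2.5,
# credit card = 2.0, medium of exchange = 0.5.
print(dist_jc(4.0, 4.1, 2.5))  # nickel vs. dime: about 3.1 (specific LCA, close)
print(dist_jc(4.0, 2.0, 0.5))  # nickel vs. credit card: 5.0 (abstract LCA, far)
print(dist_jc(4.0, 4.0, 4.0))  # identical concepts: 0.0
```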

Note that this measure is simply a different arithmetic combination of the same terms used in the Lin measure given in (2.18). Since the Lin similarity measure lies in [0,1], it can be converted into a dissimilarity measure by subtracting it from 1, which produces

dissim_L(c1, c2) = [IC(c1) + IC(c2) − 2 IC(c3)] / [IC(c1) + IC(c2)] (2.20)


The numerator in (2.20) is Jiang-Conrath’s distance measure given in (2.19).

Jiang-Conrath’s distance measure is not normalized. If normalized by [IC(c1) + IC(c2)], then it is equivalent to the corresponding dissimilarity measure for Lin’s similarity measure.
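This equivalence is easy to check numerically; the IC values below are hypothetical, chosen only so that the LCA's IC does not exceed either concept's IC.

```python
# Hypothetical ICs for c1, c2, and their LCA c3 (ic3 <= min(ic1, ic2)).
ic1, ic2, ic3 = 4.0, 4.1, 2.5

dist_jc = ic1 + ic2 - 2 * ic3    # Jiang-Conrath distance (2.19)
sim_lin = 2 * ic3 / (ic1 + ic2)  # Lin similarity (2.18)

# Normalizing dist_JC by IC(c1) + IC(c2) gives exactly 1 - sim_L (2.20).
assert abs(dist_jc / (ic1 + ic2) - (1 - sim_lin)) < 1e-12
print("equivalence holds")
```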

2.2.3 A Relationship Between Path-Based and Information-Content Measures
