APPROXIMATE SEMANTIC MATCHING OF MUSIC CLASSES ON THE INTERNET

9.5 Experiment with approximate matching

In this section we summarize the results of experiments conducted with the approximate matching method. We used the metadata schemas extracted from ArtistDirectNetwork (ADN) and MusicMoz (MM).

The linguistic interpretations (i.e., the formulas built from the labels of the nodes) were obtained using simple techniques. For example, Alternative Rock was transformed into the following formula:

$$(\text{Alternative} \cap \text{Rock}) \cup \text{Alternative Rock}.$$

The special characters “&” and “/” were treated as logical union. For example, Pop & Rock was transformed into the formula $\text{Pop} \cup \text{Rock}$. No background knowledge was used. When background knowledge is used, each atomic concept (e.g., Alternative, Rock, Alternative Rock) should be replaced with the union of the different senses of that concept.
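The label transformation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `label_to_formula` and the nested-tuple representation (`"or"` for union, `"and"` for intersection) are hypothetical, no background knowledge is used, and the handling of multi-word parts inside an “&” or “/” label is our own guess rather than something the text specifies.

```python
import re


def label_to_formula(label):
    """Build a formula from a class label (hypothetical helper, no background knowledge).

    Strings are atomic concepts; ("or", ...) is a union, ("and", ...) an intersection.
    """
    # "&" and "/" are treated as logical union between the parts of the label.
    phrases = [" ".join(part.split()) for part in re.split(r"[&/]", label)]
    terms = []
    for phrase in phrases:
        if not phrase:
            continue
        words = phrase.split()
        if len(words) == 1:
            terms.append(phrase)
        else:
            # Multi-word phrase: the intersection of the word meanings,
            # or the phrase taken as a single term.
            terms.append(("or", ("and", *words), phrase))
    if not terms:
        raise ValueError("empty label")
    return terms[0] if len(terms) == 1 else ("or", *terms)


print(label_to_formula("Alternative Rock"))
print(label_to_formula("Pop & Rock"))
```

With background knowledge, each atomic string would instead be expanded into the union of its senses.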

Zharko Aleksovski, Warner ten Kate and Frank van Harmelen

We assumed that concepts with the same label have the same meaning. When comparing the disjunct-conjunct relations we made a simplification: a disjunct $A_i$ is considered to be a subclass of a conjunct $B_j$ when some literal in the disjunct (which is an intersection of literals) is present in the conjunct (which is a union of literals). So, given a disjunct-conjunct pair

$$A_i = (A_i^1 \cap A_i^2 \cap \cdots \cap A_i^{n_i}), \qquad B_j = (B_j^1 \cup B_j^2 \cup \cdots \cup B_j^{m_j}),$$

we say that $A_i \subseteq B_j$ if $A_i^n = B_j^m$ for some $n$ and $m$. If no such pair is found, the disjunct $A_i$ is not considered to be a subclass of the conjunct $B_j$. This simplification, however, may lead to some incorrect rejections of subclass relations.

Also, more sophisticated techniques can be used to match the names [Bilenko et al., 2003].
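The simplified disjunct-conjunct test amounts to checking whether the two sets of literals share at least one label. A minimal sketch, assuming literals are plain strings compared by exact equality (the function name `disjunct_in_conjunct` is hypothetical):

```python
def disjunct_in_conjunct(disjunct, conjunct):
    """Simplified subclass test for one disjunct-conjunct pair.

    disjunct: literals that are intersected, e.g. {"Glam", "Rock"}
    conjunct: literals that are united, e.g. {"Glam", "Glam Rock"}
    Returns True when some literal occurs on both sides.
    """
    return not set(disjunct).isdisjoint(conjunct)


print(disjunct_in_conjunct({"Glam", "Rock"}, {"Rock"}))
print(disjunct_in_conjunct({"Glam Rock", "Rock"}, {"Glam"}))
```

A more sophisticated name matcher would replace the exact string comparison hidden inside `isdisjoint`.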

9.5.1 Example of an approximate matching

We now explain in detail the process of approximately inferring an equivalence relation. For the sake of the explanation we have chosen an example that produces simple formulas; in practice these formulas can grow larger and more complex.

In our example, consider the relation between two styles from ADN and MM that are named Glam Rock on both portals (Figure 9.4).

Figure 9.4. Glam Rock style from the schemas of ADN and MM.

The first step is to transform the concepts into formulas. We first transform the Glam Rock style from ADN. Note that Glam Rock is a substyle of Rock, as shown in Figure 9.4. Also note that Glam Rock consists of two words. For the formula, we therefore have to take into account the separate meanings of those words (i.e., the intersection of their meanings), as well as the two words constituting a single term (as is the case in “New Zealand”). The formula representing the meaning of Glam Rock from ADN is therefore:

$$\text{Glam Rock}_A = \text{Rock} \cap ((\text{Glam} \cap \text{Rock}) \cup \text{Glam Rock}).$$

This leads to the following normal forms:

$$\text{Glam Rock}_A^{\mathrm{DNF}} = (\text{Glam} \cap \text{Rock}) \cup (\text{Glam Rock} \cap \text{Rock}), \tag{9.2}$$

$$\text{Glam Rock}_A^{\mathrm{CNF}} = (\text{Rock}) \cap (\text{Glam} \cup \text{Glam Rock}). \tag{9.3}$$

Analogously, the Glam Rock style from MM is transformed into the formula:

$$\text{Glam Rock}_B = \text{Rock} \cap \text{Glam} \cap ((\text{Glam} \cap \text{Rock}) \cup \text{Glam Rock}) = \text{Rock} \cap \text{Glam}.$$

The literal Glam Rock in the formula is discarded because of the absorption rule [Mendelson, 1997]. This leads to the following normal forms:

$$\text{Glam Rock}_B^{\mathrm{DNF}} = (\text{Glam} \cap \text{Rock}), \tag{9.4}$$

$$\text{Glam Rock}_B^{\mathrm{CNF}} = (\text{Rock}) \cap (\text{Glam}). \tag{9.5}$$

The normal forms can be used to test the equivalence relation between the concepts $\text{Glam Rock}_A$ and $\text{Glam Rock}_B$. We therefore have to check the subclass relation for those two concepts in both directions.
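The normal forms (9.2) to (9.5) can be reproduced mechanically. The sketch below is one possible way to do so, not necessarily the procedure used in the experiments: formulas are nested tuples (a representation we assume for illustration), a normal form is a set of clauses, each clause a `frozenset` of literals, and the absorption rule is applied by dropping every clause that strictly contains another one. The CNF is obtained through duality (swap the operators, compute the DNF, read the clauses as conjuncts).

```python
from itertools import product


def absorb(clauses):
    """Absorption rule: drop every clause that strictly contains another clause."""
    return {c for c in clauses if not any(other < c for other in clauses)}


def dnf(formula):
    """Disjunctive normal form as a set of disjuncts (frozensets of literals)."""
    if isinstance(formula, str):
        return {frozenset([formula])}
    op, *args = formula
    parts = [dnf(arg) for arg in args]
    if op == "or":
        clauses = set().union(*parts)
    else:
        # "and": distribute the intersection over the unions.
        clauses = {frozenset().union(*combo) for combo in product(*parts)}
    return absorb(clauses)


def dual(formula):
    """Swap union and intersection throughout the formula."""
    if isinstance(formula, str):
        return formula
    op, *args = formula
    return ("and" if op == "or" else "or", *[dual(arg) for arg in args])


def cnf(formula):
    """Conjunctive normal form as a set of conjuncts (frozensets of literals)."""
    return dnf(dual(formula))


# The two Glam Rock formulas from the text.
GLAM_ROCK_A = ("and", "Rock", ("or", ("and", "Glam", "Rock"), "Glam Rock"))
GLAM_ROCK_B = ("and", "Rock", "Glam", ("or", ("and", "Glam", "Rock"), "Glam Rock"))

print(dnf(GLAM_ROCK_A), cnf(GLAM_ROCK_A))
print(dnf(GLAM_ROCK_B), cnf(GLAM_ROCK_B))
```

For the MM formula the absorption step is what removes the literal Glam Rock, leaving a single disjunct.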

In order to check the subsumption $\text{Glam Rock}_B \subseteq \text{Glam Rock}_A$, the normal forms (9.3) and (9.4) are needed. $\text{Glam Rock}_B$ consists of only one disjunct, and $\text{Glam Rock}_A$ consists of two conjuncts. We therefore have to check two disjunct-conjunct pairs:

$$(\text{Glam} \cap \text{Rock}) \subseteq (\text{Rock}) \quad \text{true (Rock is on both sides)},$$

$$(\text{Glam} \cap \text{Rock}) \subseteq (\text{Glam} \cup \text{Glam Rock}) \quad \text{true (Glam is on both sides)}.$$

Both disjunct-conjunct pairs satisfy the relation, so $\text{Glam Rock}_B \subseteq \text{Glam Rock}_A$ holds with a sloppiness of 0%.

In order to check the subsumption $\text{Glam Rock}_A \subseteq \text{Glam Rock}_B$, the normal forms (9.2) and (9.5) are needed. $\text{Glam Rock}_A$ consists of two disjuncts, and $\text{Glam Rock}_B$ consists of two conjuncts. We therefore have to check four disjunct-conjunct pairs:

$$(\text{Glam} \cap \text{Rock}) \subseteq (\text{Rock}) \quad \text{true (Rock is on both sides)},$$

$$(\text{Glam} \cap \text{Rock}) \subseteq (\text{Glam}) \quad \text{true (Glam is on both sides)},$$

$$(\text{Glam Rock} \cap \text{Rock}) \subseteq (\text{Rock}) \quad \text{true (Rock is on both sides)},$$

$$(\text{Glam Rock} \cap \text{Rock}) \subseteq (\text{Glam}) \quad \text{false}.$$

Three out of four disjunct-conjunct pairs satisfy the relation, but one does not. Hence, 25% of the disjunct-conjunct pairs do not satisfy the subsumption relation, and the relation $\text{Glam Rock}_A \subseteq \text{Glam Rock}_B$ therefore holds with a sloppiness of 25%.

When assessing the sloppiness of the equivalence relation between $\text{Glam Rock}_A$ and $\text{Glam Rock}_B$, we take the maximum of the sloppiness values calculated for the two subsumptions. The equivalence relation between $\text{Glam Rock}_A$ and $\text{Glam Rock}_B$ therefore holds with a sloppiness of 25%.
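The arithmetic of this example can be retraced with a short sketch. It assumes, as in the worked example, that the sloppiness of a subsumption is the fraction of disjunct-conjunct pairs that fail the simplified test, and that the equivalence takes the maximum over both directions; the function names are hypothetical.

```python
from itertools import product


def subsumption_sloppiness(dnf_sub, cnf_super):
    """Fraction of disjunct-conjunct pairs with no literal in common."""
    pairs = list(product(dnf_sub, cnf_super))
    if not pairs:
        raise ValueError("both normal forms must contain at least one clause")
    failed = sum(1 for disjunct, conjunct in pairs if disjunct.isdisjoint(conjunct))
    return failed / len(pairs)


def equivalence_sloppiness(dnf_a, cnf_a, dnf_b, cnf_b):
    """Maximum of the sloppiness values of the two subsumptions."""
    return max(subsumption_sloppiness(dnf_a, cnf_b),
               subsumption_sloppiness(dnf_b, cnf_a))


# Normal forms (9.2)-(9.5) written as lists of literal sets.
DNF_A = [frozenset({"Glam", "Rock"}), frozenset({"Glam Rock", "Rock"})]
CNF_A = [frozenset({"Rock"}), frozenset({"Glam", "Glam Rock"})]
DNF_B = [frozenset({"Glam", "Rock"})]
CNF_B = [frozenset({"Rock"}), frozenset({"Glam"})]

print(subsumption_sloppiness(DNF_B, CNF_A))               # B within A: 0.0
print(subsumption_sloppiness(DNF_A, CNF_B))               # A within B: 0.25
print(equivalence_sloppiness(DNF_A, CNF_A, DNF_B, CNF_B)) # equivalence: 0.25
```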

9.5.2 Comparison with instance data

For our experiments we extracted real data from the Internet (Section 9.3). The results presented below were obtained using the data sets MM and ADN (Table 9.1).

Table 9.1. Size of the data in ArtistDirectNetwork and MusicMoz.

name                    number of   number of   number of            number of shared
                        classes     artists     classified artists   classified artists
Artist Direct Network   465         16072       16072
MusicMoz                1073        6451        2356                 1183

Most of the shared classified artists are classified under Rock-related classes (e.g., Alternative Rock, Glam Rock, Heavy Metal). A significant limitation of our data set is that the number of instances is of the same order as the number of classes.

The tests were performed to discover the equivalence matchings between the classes in both hierarchies, i.e., whether each is a subclass of the other.

Different values for the sloppiness measure were used in the tests. In order to assess the success of the matching we introduce a value called significance, which we define as the cardinality ratio between the intersection and the union of the two classes. Formally:

$$\text{significance}(A \Leftrightarrow B) = \frac{|A \cap B|}{|A \cup B|}.$$

The significance is close to 0 when the two classes have little or no overlap, i.e., relatively few instances belong to their intersection. When the value is close to 1 (or 100%), the two classes denote almost the same set of instances.
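As a minimal sketch, the significance of a match can be computed directly from the two instance sets. The function name and the artist identifiers below are invented for illustration, and returning 0.0 for two empty classes is our own assumption, since the definition leaves that case open.

```python
def significance(instances_a, instances_b):
    """Cardinality ratio between the intersection and the union of two classes."""
    a, b = set(instances_a), set(instances_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty classes carry no evidence of a match
    return len(a & b) / len(union)


# Hypothetical artist identifiers, not taken from the ADN or MM data.
print(significance({"artist1", "artist2", "artist3"},
                   {"artist2", "artist3", "artist4"}))
```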

Figure 9.5 presents the average significance for different values of the sloppiness in the case of equivalence testing between ADN and MM. Only classes with at least 3 instances were considered, leaving roughly 150 classes to be compared against 350. The figure shows that the significance stays constant with increasing sloppiness before dropping. The number of matched equivalences, on the other hand, was found to increase with sloppiness: from 18 matches at 0%, to 51 at 30%, to 140 at 45%, to 900 at 55%, by which point the onset of the explosion toward all 43000 possible matches at 100% has been passed. This increase at constant significance suggests that the additional matches found at first do represent correct matches. Above 40% incorrect matches prevail.

Figure 9.5. Significance of matched equivalences between ADN and MM.

The relatively low value of the initial average significance reflects the presence of fuzziness, as discussed in Section 9.3.2. It indicates that people deviate widely in the way they think about music style names. Aucouturier & Pachet [2003] state that the music domain constantly evolves and that there is no centralized authority that can assign styles to artists. Artists are therefore classified in different ways, even though the music providers use the same style name.

Figure 9.6 shows the number of equivalence relations inferred for a given value of the sloppiness parameter. The number of inferences increases when the sloppiness is increased. At the beginning, it increases slowly. This is reasonable, since only a relatively small number of pairs of classes from different sources should be considered equivalent or approximately equivalent. In general, most pairs of classes are not related at all, and adding sloppiness should not change this. Still, as noted above, more matching classes were found, and most of them were relevant, leaving the significance unchanged. From 50% onward, the number of inferences increases more rapidly. At 100% there is a “cliff”, because all classes are considered to be equivalent with a sloppiness of 100%.
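The two curves discussed in this section, the number of inferred equivalences and their average significance per sloppiness value, could be tabulated along the following lines. This is a sketch under assumptions: each candidate class pair is taken to carry a precomputed sloppiness and significance, a pair counts as matched when its sloppiness does not exceed the threshold, and the toy values are invented, not the experimental data.

```python
def summarize(candidates, threshold):
    """Count the matches accepted at a sloppiness threshold and average their significance.

    candidates: iterable of (sloppiness, significance) tuples, one per class pair.
    Returns (number_of_matches, average_significance or None when nothing matches).
    """
    accepted = [sig for slop, sig in candidates if slop <= threshold]
    count = len(accepted)
    average = sum(accepted) / count if count else None
    return count, average


# Invented toy values for four class pairs (sloppiness, significance).
toy_pairs = [(0.0, 0.5), (0.25, 0.25), (0.5, 0.0), (1.0, 0.0)]

for threshold in (0.0, 0.3, 1.0):
    print(threshold, summarize(toy_pairs, threshold))
```

At a threshold of 1.0 every pair is accepted, which corresponds to the “cliff” at 100% sloppiness.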