

From the document Data Mining in Biomedicine Using Ontologies (pages 190-194)

Text Summarization Using Ontologies

8.4 Data Summarization Through Background Knowledge

8.4.1 Connectivity Clustering

Connectivity clustering is clustering based solely on connectivity in an instantiated ontology. More specifically, the idea is to cluster a given set of concepts based on their connections to common ancestors (for instance, grouping two siblings due to their common parent) and, in addition, to replace each group by the common ancestor.

Thus, taking a bottom-up hierarchical clustering view, connectivity clustering is about moving not towards a smaller number of larger clusters, but towards a smaller number of more general concepts.

3. Notice that due to the use of SemCor, there is no attribution in the initial set of concepts.

For a set of concepts C = {c1, ..., cn} we can consider as a generalizing description a new set of concepts δ(C) = {ĉ1, ..., ĉk}, where each ĉi is either a concept generalizing concepts in C or an element from C. Each generalizer in δ(C) is a least upper bound (lub) of a subset of C, ĉi = lub(Ci), where {C1, ..., Ck} is a division (clustering) of C.

Notice that the lub of a singleton set is the single element itself.
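As a minimal sketch of how a lub can be computed, assume (purely for illustration; this is not the book's implementation) that the instantiated ontology is encoded as a dict mapping each concept to its set of direct parents, with a < b meaning that b is an ancestor of a:

```python
def ancestors(onto, c):
    """All concepts subsuming c, including c itself (reflexive closure)."""
    seen, stack = {c}, [c]
    while stack:
        for p in onto.get(stack.pop(), set()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def lub(onto, concepts):
    """Least upper bound of a set of concepts: the minimal common
    ancestors. For a singleton set this is the element itself."""
    common = set.intersection(*(ancestors(onto, c) for c in concepts))
    # keep only the minimal elements of the common ancestors
    return {a for a in common
            if not any(a != b and a in ancestors(onto, b) for b in common)}

# Tiny hypothetical ontology: cat and dog share the parent mammal.
onto = {"cat": {"mammal"}, "dog": {"mammal"},
        "mammal": {"animal"}, "animal": set()}
print(lub(onto, {"cat", "dog"}))  # {'mammal'}
print(lub(onto, {"cat"}))         # {'cat'}
```

Note that in an ontology with multiple inheritance the result may contain several concepts, which is why the function returns a set.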

We define the most specific generalizing description δ(C) = {ĉ1, ..., ĉk} for a given C as a description restricted by the following properties:

∀ĉ ∈ δ(C): ĉ ∈ C ∨ ∃c′, c″ ∈ C: c′ ≠ c″ ∧ c′ < ĉ ∧ c″ < ĉ (8.6)

∀ĉ′, ĉ″ ∈ δ(C): ĉ′ ≮ ĉ″ (8.7)

∀c ∈ LC, ∀ĉ ∈ δ(C): (∃c′, c″ ∈ C: c′ ≠ c″ ∧ c′ < c ∧ c″ < c) ⇒ c ≮ ĉ (8.8)

Figure 8.3 An instantiated ontology based on a paragraph from SemCor.

where (8.6) restricts δ(C) to elements that either originate from C or generalize two or more concepts from C. Equation (8.7) restricts δ(C) to be without redundancy (no element of δ(C) may be subsumed by another element), and (8.8) reduces it to the most specific, in the sense that no subsumer for two elements of C may be subsumed by an element of δ(C).

Observe that δ(C), like C, is a subset of LC, and that we can therefore refer to an mth-order summarizer δm(C). Obviously, to obtain an appropriate description of C, we will in most cases need to consider higher orders of δ. At some point m we will, in most cases, have that δm(C) = Top, where Top is the topmost element in the ontology. An exception is when a more specific single summarizer is reached in the ontology.
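The idea of iterating towards higher orders can be illustrated with a toy sketch. The one-step function below simply replaces each concept by its parent in a tree-shaped dict-of-parents ontology (a deliberately simplified stand-in, not the full connectivity algorithm):

```python
def step(onto, C):
    """One toy generalization step: replace each non-root concept by its
    parent (assumes a tree-shaped ontology: concept -> set of parents)."""
    return {next(iter(onto[c])) if onto[c] else c for c in C}

def orders(onto, C):
    """Yield C, then the 1st, 2nd, ... order summaries, stopping at the
    fixed point (the topmost element)."""
    prev = None
    while C != prev:
        yield C
        prev, C = C, step(onto, C)

# Hypothetical chain: water/oil -> liquid -> substance -> entity (= Top).
onto = {"water": {"liquid"}, "oil": {"liquid"}, "liquid": {"substance"},
        "substance": {"entity"}, "entity": set()}
print([sorted(s) for s in orders(onto, {"water", "oil"})])
# [['oil', 'water'], ['liquid'], ['substance'], ['entity']]
```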

The most specific generalizing description δ(C) for a given C is obviously not unique, and there are several different sequences of most specific generalizing descriptions of C from C toward Top. However, a reasonable approach is to take the largest possible steps obeying the restrictions for δ above, as done in the algorithm below.

For a poset S, we define min(S) as the subset of minimal elements of S: min(S) = {s | s ∈ S, ∀s′ ∈ S: s′ ≮ s}

ALGORITHM—Connectivity summary.

INPUT: Set of concepts C = {c1, ..., cn}.

OUTPUT: A most specific generalizing description δ(C) for C.

1. Let the instantiated ontology for C be OC = (LC, ≤C, R).
2. Derive U, the set of most specific concepts in LC that generalize two or more concepts from C.
3. Let L be the set of concepts in C that specialize the generalizers in U.
4. For every (multiply inheriting) concept in L, add its parents not covered by the generalizations in U.
5. Derive δ(C) from C by adding the generalizers in U and the parents from step 4, and subtracting the concepts in L.

In step 2, all of the most specific concepts U that generalize two or more concepts from C are derived. Notice that these may include concepts from C when C contains concepts subsuming other concepts. In step 3, L defines the set of concepts in C that specialize the generalizers in U. In step 4, additional parents for (multiply inheriting) concepts covered by generalizations in U are added. Step 5 derives δ(C) from C by adding the most specific generalizers and subtracting the concepts specializing these.

Notice especially that step 4 is needed in the case of multiple inheritance. If a concept c has multiple parents and is replaced by a more general concept due to one of its senses (parents), we need to add parents corresponding to the other senses of c; otherwise, we lose the information corresponding to these other senses. For instance, in Figure 8.4 we have that δ({a, b}) = {c, d}, because a and b will be replaced by c, and d will be added to specify the second sense of b.
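The algorithm can be sketched in code. This is a hedged illustration, assuming an ontology encoded as a dict from each concept to its set of direct parents (a representation chosen here for brevity, not taken from the book); the demo reproduces the Figure 8.4 situation:

```python
def ancestors(onto, c):
    """All concepts subsuming c, including c itself (reflexive closure)."""
    seen, stack = {c}, [c]
    while stack:
        for p in onto.get(stack.pop(), set()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def connectivity_summary(onto, C):
    """One step of the connectivity summary: a most specific generalizing
    description of C (steps 2-5 of the algorithm)."""
    C = set(C)
    # Step 2: concepts generalizing two or more concepts of C (reflexively,
    # so a C-concept subsuming another C-concept qualifies) ...
    cover = {a for c in C for a in ancestors(onto, c)
             if sum(a in ancestors(onto, x) for x in C) >= 2}
    # ... kept only if most specific (minimal among the candidates).
    U = {a for a in cover
         if not any(a != b and a in ancestors(onto, b) for b in cover)}
    # Step 3: concepts of C that specialize some generalizer in U.
    L = {c for c in C if any(u != c and u in ancestors(onto, c) for u in U)}
    # Step 4: keep extra parents of replaced multiply inheriting concepts,
    # so that senses not covered by U are not lost.
    extra = {p for c in L for p in onto.get(c, set())
             if not any(u in ancestors(onto, p) for u in U)}
    # Step 5: replace the specialized concepts by their generalizers.
    return (C - L) | U | extra

# Figure 8.4: b multiply inherits from c and d; a is a child of c only.
onto = {"a": {"c"}, "b": {"c", "d"}, "c": set(), "d": set()}
print(connectivity_summary(onto, {"a", "b"}))  # {'c', 'd'}
```

On this fragment, a and b are replaced by their lub c, and d survives as the second parent (sense) of b, matching δ({a, b}) = {c, d} in the text.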

As a more elaborate example consider again Figure 8.3. Summarization of C by connectivity will proceed as follows.

C = {case, system, dirt, phase, capillary action, interfacial tension, grease, oil, water, liquid, surface-active agent, surface}

δ1(C) = {abstraction, binary compound, liquid, oil, phase, surface, surface-active agent, surface tension}

δ2(C) = {abstraction, compound, liquid, molecule, natural phenomenon, surface, surface-active agent}

δ3(C) = {abstraction, molecule, natural phenomenon, substance, surface, surface-active agent}

δ4(C) = {abstraction, physical entity}

δ5(C) = {entity}

The chosen approach, taking the largest possible steps in which everything that can be grouped is grouped, is of course not the only possible approach. If we alternatively want to form only some of the possible clusters complying with the restrictions, some kind of priority mechanism for selection is needed.

Among the important properties that might contribute to priority are deepness, redundancy, and support. The deepest concepts, those with the largest depth in the ontology, are structurally, and thereby often also conceptually, the most specific concepts. Thus, collecting these first would probably lead to a better balance with regard to how specific the participating concepts are in candidate summaries. Redundancy, where participating concepts include (subsume) others, is avoided as regards the more general concepts introduced (step 3 in the algorithm). However, redundancy in the input set may still survive, so priority could also be given to removing this first. In addition, we could consider support for candidate summarizers. One option is simply to measure support in terms of the number of subsumed concepts in the input set, while more refinement could be obtained by also taking the frequencies of concepts, as well as their distribution in documents⁴ in the original text, into consideration. Support may guide the clustering in several ways. It indicates for a concept how much it covers in the input and can thus be considered as an importance weight for the concept as a summarizer for the input. High importance should probably imply more reluctance as regards further generalization.
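Two of these priority measures, depth and support, are easy to make concrete. The sketch below is illustrative only (the function names and the simple longest-chain notion of depth are assumptions, not the book's definitions), again over a dict-of-parents ontology:

```python
def ancestors(onto, c):
    """All concepts subsuming c, including c itself."""
    seen, stack = {c}, [c]
    while stack:
        for p in onto.get(stack.pop(), set()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def depth(onto, c):
    """Depth of c: length of the longest chain of parents up to a root.
    Deeper concepts are structurally the most specific ones."""
    parents = onto.get(c, set())
    return 0 if not parents else 1 + max(depth(onto, p) for p in parents)

def support(onto, u, C):
    """Support of a candidate summarizer u: the number of input concepts
    from C that it subsumes (a plain count, ignoring term frequencies)."""
    return sum(1 for c in C if u in ancestors(onto, c))

onto = {"water": {"liquid"}, "oil": {"liquid"},
        "liquid": {"substance"}, "substance": {"entity"}, "entity": set()}
print(depth(onto, "water"))                       # 3
print(support(onto, "liquid", {"water", "oil"}))  # 2
```

A frequency-weighted variant of support would sum per-concept occurrence counts instead of counting each subsumed concept once.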

4. Corresponding to term and document frequencies in information retrieval.

Figure 8.4 Ontology fragment with multiple inheritance.

