Unveiling Biological Associations by Extracting Rules Involving GO Terms

Extracting Biological Knowledge by Association Rule Mining

7.2 Using GO in Association Rule Mining

7.2.1 Unveiling Biological Associations by Extracting Rules Involving GO Terms

Some authors incorporate GO terms into their datasets to obtain associations that relate the terms to the variables under study. Rules involving GO terms can describe, in an intuitive and concise way, relations between biological concepts and the rest of the studied variables. This makes the integration of GO terms with other data sources an attractive approach, and several authors have recently developed proposals along these lines [2, 3, 47].

In these types of studies, the dataset typically consists of a data table in which rows represent genes and columns represent the set of variables of interest (e.g., microarrays, annotations from other databases, gene features, and so on). An example of this type is the work reported by Carmona-Saez et al. [3]. The naive approach to integrating the GO terms in the analysis therefore consists of directly including the GO annotations of each gene in an additional column. For example, in Table 7.5, an additional column has been added containing the list of GO annotations for each gene. Each GO term constitutes an item of the form (GO annotation = GO:xxxxxxx). Thus, when an association rule mining algorithm is run over this data table, associations between the GO terms and the rest of the variables might be obtained. Since the number of terms is quite high (~6,800 terms related to the human genome) and, in these types of studies, associations among GO terms are not usually of interest, the search space may be substantially pruned by not generating itemsets that contain more than one GO annotation.
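
The pruning idea described above can be illustrated with a minimal sketch. It assumes a simplified level-wise (Apriori-style) search rather than any specific algorithm used by the cited authors; the item naming scheme, the GO identifiers, and the function names are hypothetical placeholders.

```python
from itertools import combinations

GO_PREFIX = "GO annotation="  # hypothetical naming scheme for GO items


def is_go_item(item):
    """Return True if the item encodes a GO annotation."""
    return item.startswith(GO_PREFIX)


def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise search that never builds an itemset with two GO items."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    current = [frozenset([i]) for i in items]
    size = 1
    while current and size <= max_size:
        survivors = []
        for cand in current:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                frequent[cand] = support
                survivors.append(cand)
        # Join frequent itemsets into candidates one item larger,
        # skipping any candidate with more than one GO annotation.
        next_cands = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            if len(union) == size + 1 and sum(map(is_go_item, union)) <= 1:
                next_cands.add(union)
        current = list(next_cands)
        size += 1
    return frequent


# Toy data table: one transaction per gene (all values are made up).
transactions = [
    {"expr=HIGH", "GO annotation=GO:0000001", "GO annotation=GO:0000002"},
    {"expr=HIGH", "GO annotation=GO:0000001", "GO annotation=GO:0000002"},
    {"expr=LOW", "GO annotation=GO:0000002"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
```

In this toy table the pair of GO items co-occurs in two of the three genes and would be frequent, but it is never generated, so no support counting is spent on it or on any of its supersets.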

Nevertheless, when using the terms in which the genes are directly annotated, some problems might arise:

1. Some of these terms may represent very specific concepts. This means that only a few genes would be annotated to these terms, and thereby, these terms would not form frequent itemsets.

2. Suppose a set of genes annotated to a term T and a different set of genes annotated to a term T′, where T′ is an ancestor of T. When counting the occurrences of the itemsets containing T′ in the data table, those genes annotated to T would not be taken into account, since only the term T appears in their transactions. Since terms are considered to share the attributes of all the parent nodes, all the genes annotated to term T must also be taken into account when counting the frequency of term T′; otherwise an important loss of information might occur.

Martinez et al. [47] avoided these problems by including in the data table not only the terms to which the genes are directly annotated, but also all of their ancestors. However, an important drawback arises when using this last strategy: if every ancestor is included in the analysis, very general terms (e.g., molecular_function, biological_process, cellular_component, and so on) may be considered. These terms are so general that they do not provide any interesting information. Moreover, they slow down the mining process and disturb the interpretation of the final rule set, since they generate many trivial or uninteresting rules.
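
As an illustration of the ancestor-inclusion strategy, the following sketch expands each gene's direct annotations with all ancestor terms. The toy ontology, the term names, and the helper functions are hypothetical; the real GO graph would have to be loaded from an ontology file, which is not shown here.

```python
def ancestors(term, parents):
    """Collect every ancestor of a term in a DAG given a child -> parents map."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_annotations(direct, parents):
    """Add all ancestors of the directly annotated terms to each gene."""
    return {
        gene: set(terms) | {a for t in terms for a in ancestors(t, parents)}
        for gene, terms in direct.items()
    }


def count_term(annotations, term):
    """Number of genes whose annotation set contains the term."""
    return sum(1 for terms in annotations.values() if term in terms)


# Toy ontology (hypothetical): T is a child of T_prime, which is a child of root.
parents = {"T": ["T_prime"], "T_prime": ["root"]}
direct = {"gene1": {"T"}, "gene2": {"T_prime"}}
expanded = expand_annotations(direct, parents)
```

With direct annotations only, T_prime is counted for one gene; after the expansion it is counted for both. The same expansion also places the root term in every transaction, which is the drawback discussed above.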

Hence, a possible approach consists of including only terms of a selected GO level. Terms below the selected level are mapped to their corresponding ancestor at that level, and terms above it are discarded. Some applications (not necessarily association rule-based applications), such as FatiGO [48], adopted this methodology, and, in principle, it seems that GO level 3 represents a good compromise between information quality and the number of annotated genes [49]. Nevertheless, GO levels are not homogeneous; in other words, terms representing general concepts and terms representing more specific concepts might be found in the same GO level [50]. Therefore, some information might be lost when using this strategy.
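
A rough sketch of the level-based mapping follows. It is not the procedure of FatiGO or of any other cited tool. It assumes, for illustration only, that the level of a term is its shortest distance from the root (root at level 0), and that a term is mapped to every ancestor found at the selected level; the ontology and names are hypothetical.

```python
def term_level(term, parents, memo=None):
    """Level = shortest distance to a root term (an assumption of this sketch)."""
    memo = {} if memo is None else memo
    if term not in memo:
        ps = parents.get(term, ())
        memo[term] = 0 if not ps else 1 + min(term_level(p, parents, memo) for p in ps)
    return memo[term]


def self_and_ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def map_to_level(term, parents, target):
    """Map a term to its ancestors (or itself) at the target level.

    Terms located above the target level are discarded (empty set).
    """
    if term_level(term, parents) < target:
        return set()
    return {t for t in self_and_ancestors(term, parents)
            if term_level(t, parents) == target}


# Toy ontology (hypothetical): root -> A -> B -> C, and root -> D.
parents = {"A": ["root"], "B": ["A"], "C": ["B"], "D": ["root"]}
mapped = {t: map_to_level(t, parents, target=2) for t in ["A", "B", "C", "D"]}
```

Here C is replaced by B, B is kept, and A and D are discarded because they lie above the selected level. Because a GO term can reach the root through several paths, other level definitions are possible and would give different mappings.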

Table 7.5 An Example of a Data Table in Which GO Terms Have Been Included

Lopez et al. [2] noticed the previous problems and proposed an alternative methodology: consider all the ancestors, calculate the information provided by each term, and remove those that are uninformative. Assuming that the more specific a term is, the more information it gives, the information content (IC) of a term T can be computed as IC(T) = −log(P(T)) / −log(P(min)), where P(T) represents the probability of finding T or a child of T in the ontology. The denominator normalizes the value, with P(min) = 1/Total_number_of_annotations. Note that the deeper the GO term is in the ontology, the greater its IC. This is due to the ontological structure of GO: as the number of annotations decreases, the probability of the term occurring also decreases, and therefore its IC tends to increase (Figure 7.7).
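
The IC formula can be checked with a short sketch. The total of 169,524 annotations is the value given in the caption of Figure 7.7; the per-term annotation counts, the term names, and the IC threshold below are hypothetical values chosen only to show the behavior of the formula.

```python
import math

TOTAL_ANNOTATIONS = 169_524  # from the caption of Figure 7.7


def information_content(n_annotations, total=TOTAL_ANNOTATIONS):
    """Normalized IC: -log(P(T)) / -log(P(min)), with P(min) = 1 / total.

    n_annotations counts the annotations to the term or to its descendants.
    """
    p_term = n_annotations / total
    p_min = 1 / total
    return math.log(p_term) / math.log(p_min)


# Hypothetical annotation counts for four terms.
counts = {"root": TOTAL_ANNOTATIONS, "broad_term": 50_000,
          "specific_term": 120, "rare_term": 1}
ic = {term: information_content(n) for term, n in counts.items()}

# Keep only the terms whose IC reaches a (hypothetical) threshold.
IC_THRESHOLD = 0.2
informative = {term for term, value in ic.items() if value >= IC_THRESHOLD}
```

The normalization bounds the IC between 0 (the root, which covers every annotation) and 1 (a term with a single annotation), so a threshold can be chosen on a fixed scale regardless of the size of the ontology.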

Figure 7.7 A fragment of the ontology molecular function in GO. Each node is labeled with its name and, under it, the number of annotations in it (N), the probability derived from the number of annotations (P), and its information content (IC). The Total_number_of_annotations used to calculate the probabilities corresponds to the number of annotations of the highest node in the ontology, that is, 169,524 in this case.

Additionally, if many rules involving GO terms are obtained, these authors propose reducing the resulting rule set by merging subsets of rules containing GO terms that may provide similar information. This strategy takes advantage of the GO structure to filter the rule set. First, the rule set is scanned to look for groups of rules that involve a GO term and share all their items except the GO node. For each group, if one of its GO terms is a common ancestor of the rest of the GO nodes in the group, only the rule involving the common ancestor is kept, while the rest of the rules in the group are discarded. This strategy relies on the idea that each Gene Ontology term shares the attributes of all its parent nodes. Since setting an appropriate IC threshold ensures that the terms included in the analysis are informative enough, the common ancestor represents the most intuitive term. See Figure 7.8 for an example.
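
The rule-merging step can be sketched as follows. This is one possible reading of the description above, not the authors' implementation: rules are modeled as (antecedent, consequent) pairs of item sets containing exactly one GO item, and the ontology, item names, and function names are hypothetical.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Collect every ancestor of a term given a child -> parents map."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def split_rule(rule):
    """Return (grouping key, GO term); assumes exactly one GO item per rule."""
    antecedent, consequent = rule
    (go,) = [i for i in antecedent | consequent if i.startswith("GO:")]
    key = (antecedent - {go}, consequent - {go}, go in antecedent)
    return key, go


def merge_rules(rules, parents):
    """Keep only the common-ancestor rule in each group, when one exists."""
    groups = defaultdict(list)
    for rule in rules:
        key, go = split_rule(rule)
        groups[key].append((go, rule))
    kept = []
    for members in groups.values():
        terms = {go for go, _ in members}
        common = [rule for go, rule in members
                  if all(go in ancestors(other, parents) for other in terms - {go})]
        # Without a common ancestor in the group, every rule is kept.
        kept.extend(common[:1] or [rule for _, rule in members])
    return kept


# Toy ontology and rules (hypothetical names).
parents = {"GO:child1": ["GO:parent"], "GO:child2": ["GO:parent"]}
high, low = frozenset({"expr=HIGH"}), frozenset({"expr=LOW"})
r1 = (high, frozenset({"GO:parent"}))
r2 = (high, frozenset({"GO:child1"}))
r3 = (high, frozenset({"GO:child2"}))
r4 = (low, frozenset({"GO:child1"}))
r5 = (low, frozenset({"GO:child2"}))
merged = merge_rules([r1, r2, r3, r4, r5], parents)
```

The first group collapses to the rule involving GO:parent, whereas the second group is left untouched because neither of its GO terms is an ancestor of the other.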

Regarding the application of fuzzy techniques, to the best of our knowledge, so far only the work by Lopez et al. [2] makes use of fuzzy association rules. In this case, the domains of the continuous variables are partitioned into three fuzzy sets that represent the linguistic labels HIGH, MEDIUM, and LOW. Fuzzy sets are defined by using the expert-guided percentiles p₂₀, p₄₀, p₆₀, p₈₀, as shown in Figure 7.9, and a fuzzy version of the Top Down FP-Growth algorithm [51] is used to mine the data table.
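
The fuzzy partition can be sketched as below. Figure 7.9 is not reproduced in this excerpt, so the shape of the membership functions is an assumption: the sketch uses trapezoidal sets in which LOW fades out between p₂₀ and p₄₀, HIGH fades in between p₆₀ and p₈₀, and MEDIUM covers the middle. The function names and cut values are hypothetical.

```python
def ramp_up(x, a, b):
    """Linear ramp: 0 at or below a, 1 at or above b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def memberships(x, p20, p40, p60, p80):
    """Membership degrees of x in LOW, MEDIUM, and HIGH (assumed trapezoids)."""
    rising = ramp_up(x, p20, p40)   # leaving LOW, entering MEDIUM
    falling = ramp_up(x, p60, p80)  # leaving MEDIUM, entering HIGH
    return {
        "LOW": 1.0 - rising,
        "MEDIUM": min(rising, 1.0 - falling),
        "HIGH": falling,
    }


# Hypothetical percentile values for one continuous variable.
cuts = dict(p20=20.0, p40=40.0, p60=60.0, p80=80.0)
examples = {x: memberships(x, **cuts) for x in (10.0, 30.0, 50.0, 70.0, 90.0)}
```

Under this assumed shape the three degrees always sum to 1, so a value between two percentiles belongs partly to each neighboring label instead of being forced into a single crisp interval.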

It is worth mentioning the absence of works that, trying to extract useful knowledge from the Gene Ontology by association rule mining, consider GO as a taxonomy. As previously stated, an ontology can be considered as a taxonomy, since it represents a set of concepts hierarchically organized, according to their specificity. Many works have proposed efficient algorithms for mining association rules from taxonomies, and their application in future works may provide higher quality rule sets. In addition, the use of algorithms able to mine fuzzy taxonomies could also be interesting. However, their application does not make sense as long as there is no fuzzy version of the Gene Ontology.

From the document Data Mining in Biomedicine Using Ontologies (pages 161-164)