HAL Id: hal-02194229
https://hal.archives-ouvertes.fr/hal-02194229
Submitted on 25 Jul 2019
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
A Feature Selection Method based on Tree Decomposition of Correlation Graph
Abdelkader Ouali, Nyoman Juniarta, Bernard Maigret, Amedeo Napoli
To cite this version:
Abdelkader Ouali, Nyoman Juniarta, Bernard Maigret, Amedeo Napoli. A Feature Selection Method
based on Tree Decomposition of Correlation Graph. LEG@ECML-PKDD 2019 - The third Interna-
tional Workshop on Advances in Managing and Mining Large Evolving Graphs, Sep 2019, Würzburg,
Germany. �hal-02194229�
A Feature Selection Method based on Tree Decomposition of Correlation Graph
Abdelkader Ouali 1 , Nyoman Juniarta 1 , Bernard Maigret 1 , and Am´ed´eo Napoli 1
Universit´e de Lorraine, CNRS, Inria, LORIA F-54000 Nancy, France.
firstname.lastname@loria.fr
Molecular descriptors are tightly connected to the concept of molecular structure and play an important role in the scientific sphere due to their connection to a complex network of knowledge [7]. The availability of large numbers of theoretical descriptors, currently by thousands, provides diverse sources to better understand relationships be- tween molecular structure and experimental evidence, but adds more complexity by in- troducing new features. To confront the high dimensionality of data, a correct selection of the features allows to reduce the size of the data and the computational time while preserving the predictive power of the classifier. Feature selection methods, wherein subsets of features available from the original data are selected, are now commonly used in data mining and represent an active field of research [3]. Some of these meth- ods rely on correlation graph to reveal structural properties hidden in the data. These graph-based approaches belong to the class of filter-based methods. We introduce a method which uses a tree decomposition [4] of feature-feature correlation graph to se- lect representatives in order to reduce redundancy. This method is exploited to target molecular descriptors involved in identifying molecules that can be characterized as antibacterials.
Prior Work. Different graph-based approaches have been proposed for feature cluster- ing where the main objective is to avoid the selection of redundant features. The authors in [6] use graph-theoretic clustering methods to separate features into clusters, represen- tatives are selected next using features which are strongly related to target classes. The authors in [8] use a hyper-graph clustering to extract maximally coherent feature groups from the dataset which includes third or higher order dependencies. The authors in [1]
proposed a method which relies on community detection techniques to cluster graphs which describe the strongest correlations among features. Compared to existing graph- based feature selection methods, our method exploits tree decomposition techniques to reveal groups of highly correlated features.
FSTDCG 1 method. A dataset D is described as set of examples M and a set of features F. Each example m ∈ M is defined w.r.t. all the features in F. Let Dom(f ) be the set of values taken by a feature f ∈ F over all the examples in M. We assume that Dom is a numerical set for all features in F. One specific feature can act as the class of the examples. We assume that the set of values of a feature class Dom(class) is defined in N . A feature selection to F, resulting in F S , is the set of more discriminating features for the classes in the dataset. Our method proceeds in three steps. First, a correlation graph is computed between each two features (x, y) ∈ F using the Pearson Product
1
Feature Selection Tree Decomposition Correlation Graph.
2 A. Ouali et al.
A
B C
D E
F
(a) Initial graphG.
A
B C
D E
F
(b) Example of a triangulation ofG.
A,B,C B,C,D
A,B,E D,F
B,C
A,B D
B
(c) Maximal cliques corresponding to the triangulated graph (graph of
clusters).
A,B,C
B,C,D A,B,E
D,F
B,C A,B
D
(d) Tree decomposition ofG with width2.
Fig. 1: Steps for computing a tree decomposi- tion of a graph
GAlgorithm 1: Pseudo code of FSTDCG.
Input:
A set of features
F, and Correlation GraphG.Output:
Selected subset of features
FS.
1FS←notCorrelatedFeatures(F, G);
2(CT, T)←treeDecomposition(G);
3for eachclusterC∈CTdo
4 fmax←0,max←0;
5 for eachfeaturef∈Cdo
6 ifOccur(f,CT)> maxthen
7 max←Occur(f,CT);
8 fmax←f;
9 FS← FS∪ {fmax};
10returnFS