
HAL Id: hal-02194229

https://hal.archives-ouvertes.fr/hal-02194229

Submitted on 25 Jul 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


A Feature Selection Method based on Tree Decomposition of Correlation Graph

Abdelkader Ouali, Nyoman Juniarta, Bernard Maigret, Amedeo Napoli

To cite this version:

Abdelkader Ouali, Nyoman Juniarta, Bernard Maigret, Amedeo Napoli. A Feature Selection Method based on Tree Decomposition of Correlation Graph. LEG@ECML-PKDD 2019 - The Third International Workshop on Advances in Managing and Mining Large Evolving Graphs, Sep 2019, Würzburg, Germany. ⟨hal-02194229⟩


A Feature Selection Method based on Tree Decomposition of Correlation Graph

Abdelkader Ouali, Nyoman Juniarta, Bernard Maigret, and Amedeo Napoli

Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France.

firstname.lastname@loria.fr

Molecular descriptors are tightly connected to the concept of molecular structure and play an important role in the scientific sphere due to their connection to a complex network of knowledge [7]. The availability of large numbers of theoretical descriptors, currently in the thousands, provides diverse sources to better understand relationships between molecular structure and experimental evidence, but adds complexity by introducing new features. To confront the high dimensionality of the data, a proper selection of features makes it possible to reduce the size of the data and the computational time while preserving the predictive power of the classifier. Feature selection methods, wherein subsets of features available from the original data are selected, are now commonly used in data mining and represent an active field of research [3]. Some of these methods rely on a correlation graph to reveal structural properties hidden in the data. These graph-based approaches belong to the class of filter-based methods. We introduce a method which uses a tree decomposition [4] of the feature-feature correlation graph to select representatives in order to reduce redundancy. This method is exploited to target molecular descriptors involved in identifying molecules that can be characterized as antibacterials.

Prior Work. Different graph-based approaches have been proposed for feature clustering, where the main objective is to avoid the selection of redundant features. The authors in [6] use graph-theoretic clustering methods to separate features into clusters; representatives are then selected among the features which are strongly related to the target classes. The authors in [8] use hyper-graph clustering to extract maximally coherent feature groups from the dataset, which includes third- or higher-order dependencies. The authors in [1] proposed a method which relies on community detection techniques to cluster graphs which describe the strongest correlations among features. Compared to existing graph-based feature selection methods, our method exploits tree decomposition techniques to reveal groups of highly correlated features.

FSTDCG (Feature Selection Tree Decomposition Correlation Graph) method. A dataset D is described by a set of examples M and a set of features F. Each example m ∈ M is defined w.r.t. all the features in F. Let Dom(f) be the set of values taken by a feature f ∈ F over all the examples in M. We assume that Dom(f) is a numerical set for every feature f ∈ F. One specific feature can act as the class of the examples; we assume that its set of values Dom(class) is defined in ℕ. A feature selection on F yields FS, the set of the most discriminating features for the classes in the dataset. Our method proceeds in three steps. First, a correlation graph is computed between every pair of features x, y ∈ F using the Pearson product-moment correlation [5]. An edge is created for each pair (x, y) whose correlation value is higher than an empirical threshold. Second, a tree decomposition of the correlation graph is computed based on the min-fill heuristic [2]. Fig. 1 depicts the different steps (a-d) to compute the tree decomposition. The clusters CT produced in step (c) gather highly correlated features, since clusters correspond to maximal cliques which cannot be extended by including one more adjacent node. Third, representatives are selected based on their maximum occurrence in other clusters.
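As an illustration of the first two steps, here is a minimal Python sketch that builds a thresholded correlation graph and extracts the clique clusters of a triangulation with pandas and NetworkX. It is not the authors' implementation: the input DataFrame X, the threshold value and the use of the absolute correlation are assumptions, and NetworkX triangulates with an MCS-M based completion rather than the min-fill heuristic used in our method.

import networkx as nx
import pandas as pd

def correlation_graph(X: pd.DataFrame, threshold: float = 0.05) -> nx.Graph:
    # One node per feature, one edge per pair of features whose absolute
    # Pearson correlation exceeds the (empirical) threshold.
    corr = X.corr(method="pearson").abs()
    G = nx.Graph()
    G.add_nodes_from(X.columns)
    for i, f1 in enumerate(X.columns):
        for f2 in X.columns[i + 1:]:
            if corr.loc[f1, f2] > threshold:
                G.add_edge(f1, f2)
    return G

def clique_clusters(G: nx.Graph):
    # Triangulate G and return the maximal cliques of the chordal graph,
    # i.e. the clusters of a tree decomposition (MCS-M completion here,
    # not min-fill as in the paper).
    H, _ = nx.complete_to_chordal_graph(G)
    return list(nx.chordal_graph_cliques(H))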


[Fig. 1: Steps for computing a tree decomposition of a graph G. (a) Initial graph G; (b) example of a triangulation of G; (c) maximal cliques corresponding to the triangulated graph (graph of clusters); (d) tree decomposition of G with width 2.]

Algorithm 1: Pseudo code of FSTDCG.
Input: a set of features F, and a correlation graph G.
Output: selected subset of features FS.

1  FS ← notCorrelatedFeatures(F, G);
2  (CT, T) ← treeDecomposition(G);
3  for each cluster C ∈ CT do
4      fmax ← 0, max ← 0;
5      for each feature f ∈ C do
6          if Occur(f, CT) > max then
7              max ← Occur(f, CT);
8              fmax ← f;
9      FS ← FS ∪ {fmax};
10 return FS;

Algorithm 1 shows the pseudo code of the FSTDCG method. The uncorrelated features, i.e. those not appearing in the correlation graph, are kept in the set FS (line 1). FSTDCG then selects, in each cluster of the tree decomposition, the feature with the maximum occurrence over all clusters. This strategy benefits from the topology of the tree and reduces redundancy among features appearing in different clusters (lines 3-9).
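A minimal sketch of this selection loop, assuming the clusters CT are given as an iterable of feature sets (for instance, the clique clusters computed earlier); Occur(f, CT) is taken to be the number of clusters containing f, and the handling of uncorrelated features (line 1) is left out.

from collections import Counter

def select_representatives(clusters) -> set:
    # Lines 3-9 of Algorithm 1: keep, in each cluster, the feature that
    # occurs in the largest number of clusters.
    clusters = list(clusters)
    occur = Counter(f for C in clusters for f in C)   # Occur(f, CT)
    selected = set()
    for C in clusters:
        selected.add(max(C, key=lambda f: occur[f]))
    return selected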

Experiments. Experiments were carried out on 18 different datasets available from the UCI repository. An edge is created in the correlation graph if the ρ-value of the correlation is greater than 0.05. We used a Random Forest classifier with 10-fold cross-validation to evaluate the accuracy. The method retains a reasonable number of features, on average 20.32% of the original datasets, while decreasing the accuracy of the classifier by 4.36% on average. We are currently investigating the use of FSTDCG on more specific molecular datasets. Further work will consist in using evolving graphs to handle changes that occur in the dataset when new data are added or removed.
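The evaluation protocol can be reproduced along the following lines with scikit-learn; this is a generic sketch rather than the authors' experimental code, and X_selected and y stand for the reduced feature matrix and the class labels of one dataset.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate(X_selected, y) -> float:
    # Mean accuracy of a Random Forest under 10-fold cross-validation.
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X_selected, y, cv=10, scoring="accuracy").mean()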

References

1. Horvath, S.: Correlation and Gene Co-Expression Networks (2011)

2. Kjaerulff, U., Datasystemer A/s, J.: Triangulation of graphs - algorithms giving small total state space (2002)

3. Li, J., Liu, H.: Challenges of feature selection for big data analytics (2017)

4. Robertson, N., Seymour, P.D.: Graph Minors. II. Algorithmic Aspects of Tree-Width (1986)

5. Rodgers, J.L., Nicewander, W.A.: Thirteen ways to look at the correlation coefficient (1988)

6. Song, Q., Ni, J., Wang, G.: A fast clustering-based feature subset selection algorithm for high-dimensional data (2013)

7. Todeschini, R., Consonni, V.: Molecular Descriptors for Chemoinformatics (2009)

8. Zhang, Z., Hancock, E.R.: A graph-based approach to feature selection (2011)
