Feature selection and evaluation

Performance evaluation of feature selection and tree-based algorithms for traffic classification

competitive with each other. It is hence left to the user to adopt an appropriate algorithm for their requirements and environment.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have presented data analysis and exploration techniques to select the most relevant features that can be used for network traffic classification. Then, an empirical analysis of different DT-based traditional classifiers (DT, RF, AdaBoost), as well as the recently developed CatBoost, LightGBM, and XGBoost classifiers, has been conducted. This comparison has been carried out with the data subset selected via Recursive Feature Elimination (RFE). Using RFE, we have derived a mechanism to identify the best 15 features out of the 87 features in our dataset. This has not only significantly reduced the execution time but has also identified useful features for network traffic classification in a real-world dataset, increasing the ML models' accuracy. Furthermore, experimental results and analysis have shown that more features do not always improve the classification performance. Moreover, from the comparison of the DT-based models, we conclude that hyper-parameter search is necessary to construct accurate boosting-based models, whereas this is not the case for DT and especially RF, which generalize well with the default hyper-parameters.
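The RFE mechanism summarized above can be sketched with scikit-learn; the 87-feature synthetic dataset and the random-forest ranker below are illustrative stand-ins for the paper's traffic dataset and models:

```python
# Sketch of Recursive Feature Elimination (RFE): iteratively drop the
# weakest features as ranked by a tree-based estimator, keeping the top 15.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for an 87-feature traffic dataset.
X, y = make_classification(n_samples=500, n_features=87,
                           n_informative=15, random_state=0)

selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=15, step=5)
selector.fit(X, y)

best_idx = [i for i, keep in enumerate(selector.support_) if keep]
print(len(best_idx))  # 15 features retained out of 87
```

Downstream classifiers are then trained on `X[:, best_idx]` only, which is what reduces the execution time reported above.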

Bimodal spectroscopic evaluation of ultra violet-irradiated mouse skin inflammatory and precancerous stages: instrumentation, spectral feature extraction/selection and classification (k-NN, LDA and SVM)

In a study characterizing early neoplastic changes in a DMBA/TPA-induced mouse skin tumor model, using AF multi-excitation in the wavelength band 280-460 nm and LDA classification, Diagardjane et al. [19] obtained good results in discriminating different precancerous states. They defined 5 categories: category I (Healthy), category II (Inflammation and Hyperplasia), category III (Hyperplasia and Dysplasia), category IV (Moderately Differentiated Squamous Cell Carcinoma) and category V (Poorly Differentiated Squamous Cell Carcinoma). Considering classifications for categories I, II and III using excitations in the wavelength band 360-420 nm, their results were 62.1% < Se < 69.0% and 82.2% < Sp < 95.6%. In our work we concentrated on a refined classification within the corresponding categories II and III and obtained higher Se and Sp, while only using NUV-UV excitation wavelengths higher than 360 nm (i.e. wavelengths with the lowest or no mutagenic potential, with a view to clinical implementation). Chang et al. [6] discriminated healthy (Squamous Normal, SN) vs. Low and High Grades of Squamous Intraepithelial cervical Lesions (LGSIL, HGSIL), using AF multi-excitation in the range 330-480 nm and DR spectroscopies, in combination with PCA and a Mahalanobis distance algorithm. Several discrimination pairs were defined (SN vs. CN, SN vs. LGSIL, SN vs. HGSIL, CN vs. LGSIL and CN vs. HGSIL). Considering classification of the discrimination pairs with SN and LHSIL using AF multi-excitation and DR, their results were 53% < Se < 95% and 69% < Sp < 91%. Although our work concerns another type of epithelium, we proposed here a refined classifica…

Network Feature Selection based on Machine Learning for Resource Management

This work presents a comparative analysis between two feature selection methods, Recursive Feature Elimination (RFE) and Information Gain Attribute Evaluation (InfoGain), using several classifiers on different reduced versions of the network dataset. To prepare our data for resource management in SDN, attribute selection methods (RFE and InfoGain) followed by classification have been used to find a subset of appropriate features from our dataset.
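A minimal sketch of the InfoGain side of this comparison, assuming a scikit-learn workflow (information gain is computed here as mutual information between each attribute and the class; the dataset is synthetic):

```python
# Rank attributes by information gain, i.e. mutual information between each
# attribute and the class label, as in InfoGain attribute evaluation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Illustrative stand-in for the network dataset.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)
top10 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
print(top10)  # indices of the 10 highest-scoring attributes
```

A reduced version of the dataset is then `X[:, top10]`, analogous to the reduced versions compared in the text.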

Texture feature benchmarking and evaluation for historical document image analysis

4.2 Corpus and preparation of ground truth

Many important issues arise in providing an informative benchmarking of the most classical and widely used texture-based feature sets for HDI layout analysis and HDI segmentation, such as the lack of a common dataset of HDIs and the lack of appropriate quantitative evaluation measures for segmentation quality [67]. Moreover, many researchers have addressed the need for a good dataset. Antonacopoulos et al. [1] considered a dataset to be a good one if it is realistic (i.e. it must be composed of real digitized DIs), comprehensive (i.e. it must be well characterized and detailed to ensure in-depth evaluation) and flexibly structured (i.e. to facilitate the selection of sub-sets with specific conditions). Although the issues of realistic dataset availability and broad researcher access for the performance evaluation of contemporary DIs have been discussed and solved by Antonacopoulos et al. [1], representative datasets of HDIs with their associated ground truths are currently hardly publicly accessible for HDI layout analysis. Finding a large corpus of HDIs with many annotated pages of varied content and layout characteristics, collected from several European libraries, is still a challenging issue for HDI layout analysis, mainly due to intellectual and industrial property rights. Another challenge in building a representative dataset of HDIs concerns the definition of its objective and complete associated ground truth. Defining an objective ground truth is still not a straightforward task, due to the characteristics of HDIs (e.g. noise and degradation, presence of handwriting, overlapping layouts, great variability of page layout). These characteristics complicate the definition of an appropriate and objective ground truth and the characterization or segmentation of HDIs, and make the processing of this kind of DIs a difficult task.

Dynamic Reconfiguration of Feature Models: an Algorithm and its Evaluation

2 Run Time Model

2.1 Feature Model Representation

Feature Models (FM) [3,11] allow a designer to represent all possible run-time configurations of a system and provide a compact formalism to model software commonalities and variabilities. In this approach, "features" correspond to selectable concepts of a software system and of its contextual environment. They are organized along a tree, with logical selection relations (optional/mandatory features, exclusive choices, etc.). Moreover, since features are not independent, one usually adds cross-tree constraints to express relationships between features. In our case, we added constraints in the form of simple first-order logic formulas of two types: imply, noted ⇒, and exclude, noted ⊗: a feature may either imply or exclude another one.
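The imply/exclude cross-tree constraints can be checked against a candidate configuration in a few lines; the feature names below are invented for illustration:

```python
# Hypothetical feature model: a configuration is a set of selected features;
# "imply" (A => B) and "exclude" (A x B) cross-tree constraints must hold.
implies = [("GPS", "Antenna")]          # selecting GPS requires Antenna
excludes = [("Bluetooth", "LowPower")]  # Bluetooth and LowPower are incompatible

def valid(config):
    # Every antecedent of an imply must bring its consequent along.
    ok_imp = all(b in config for a, b in implies if a in config)
    # No excludes pair may be selected together.
    ok_exc = all(not (a in config and b in config) for a, b in excludes)
    return ok_imp and ok_exc

print(valid({"GPS", "Antenna"}))        # True
print(valid({"GPS"}))                   # False: GPS implies Antenna
print(valid({"Bluetooth", "LowPower"})) # False: mutual exclusion
```

Real FM tooling also encodes the tree structure (optional/mandatory, exclusive groups); this sketch covers only the two constraint types named in the text.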

Forward and Backward Feature Selection for Query Performance Prediction

Evaluation metrics. We use the Pearson and Spearman correlations between the predicted effectiveness value and the ground-truth effectiveness, as in previous works [5, 9, 17, 29, 30, 42]. To measure the ground-truth effectiveness of the system, we use AP (average precision) and NDCG (normalized discounted cumulative gain), since they are commonly adopted in related works [9, 17, 26, 29, 42].

Experimental settings. As is common practice in QPP evaluation [26, 30, 42], we randomly split the queries into two equally-sized sets and conduct two-fold cross-validation. We repeat these steps 30 times and report the average results. Statistically significant differences in prediction performance are estimated using a two-tailed paired t-test with Bonferroni correction (p < 0.05) computed over the 30 splits. As in previous works [12, 26, 32, 45], we chose Language Modeling with Dirichlet smoothing and µ = 1000 without query expansion (as implemented in the Lemur Indri platform, using default parameters) to retrieve n documents for each query and to calculate the performance of the IR system (and thus determine the results to be predicted in terms of AP or NDCG).
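The correlation part of this protocol can be sketched with SciPy; the per-query predicted scores and AP values below are made up:

```python
# Correlate a QPP method's predicted scores with ground-truth effectiveness
# (e.g. AP) over a set of queries; the values here are purely illustrative.
from scipy.stats import pearsonr, spearmanr

predicted = [0.31, 0.12, 0.58, 0.44, 0.27, 0.66]  # predictor output per query
actual_ap = [0.35, 0.10, 0.52, 0.40, 0.30, 0.61]  # measured AP per query

r, _ = pearsonr(predicted, actual_ap)     # linear agreement
rho, _ = spearmanr(predicted, actual_ap)  # rank agreement
print(round(r, 3), round(rho, 3))         # both close to 1 on this toy data
```

In the full protocol this computation is run once per split, and the 30-split averages are reported.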

Robust supervised classification and feature selection using a primal-dual method

6.1. Experimental settings

In this section, we compare the constrained primal-dual ℓ1 approach with the one based on the Frobenius norm. Our primal-dual method can be applied to any classification problem with feature selection on high-dimensional datasets stemming from computational biology, image recognition, social network analysis, customer relationship management, etc. We provide an experimental evaluation in computational biology on simulated and real single-cell sequencing datasets. Such biological datasets have two advantages. First, many public data are now available for testing reproducibility; besides, these datasets suffer from outliers ("dropouts") with different levels of noise depending on the sequencing experiments. Single-cell is a new technology which was elected "method of the year" in 2013 by Nature Methods (Evanko, 2014). We also provide an evaluation on proteomics mass-spectrometric datasets. A test query x (a d-dimensional row vector) is classified according to the following rule: it belongs to the unique class j* such that

Protein Structural Annotation: Multi-Task Learning and Feature Selection

However, by proceeding in this way, the user of wrapper methods must be aware that the selected features may lead to a model that overfits the samples if no special care is taken. Indeed, in the simple case in which the training samples are used to select the subset of features, the evaluation of the model on an independent set of samples may show a strong lack of generalization, since the selected features can perfectly fit the training samples. To tackle this phenomenon, a widely used approach is to cross-validate the selection, i.e., to run the feature selection once for each of several train/test splits. Although this method is usually well adapted to evaluating a model with a fixed set of features, it is not suitable for feature selection. Indeed, by proceeding in this way, the same data would be used both for selecting the set of features and for assessing the quality of this selected set. Ambroise et al. [6] have shown that this approach is biased, because the same data serve both for selecting and for evaluating, and that it can lead to highly over-estimated model performance. To avoid this risk of overfitting, we recommend a more evolved approach based on a second loop of cross-validation, in which the training stages perform a cross-validated feature selection on a part of the data and the testing stages perform the evaluation of the selected features on the remaining data. This approach is further detailed in Chapters 5 and 6.
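The recommended double loop can be sketched with scikit-learn: wrapping the selector and the classifier in a pipeline ensures the selection is re-run inside each outer training fold, so test folds never influence it (the selector and classifier choices below are illustrative):

```python
# Nested evaluation of feature selection: the selector lives inside a
# Pipeline, so each cross-validation fold refits it on training data only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

pipe = Pipeline([("select", SelectKBest(f_classif, k=5)),
                 ("clf", LogisticRegression(max_iter=1000))])

# Outer loop: each fold refits the whole pipeline, selection included.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Selecting features on all the data first and only then cross-validating the classifier is exactly the biased setup the text warns against.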

Evaluation of Road Marking Feature Extraction

VI. Conclusion

We presented an experimental comparison study of six representative road feature extractors and two variants. These algorithms were selected to sample the state of the art in road marking detection. The comparison was performed on a database of 116 images featuring variable conditions and situations. A hand-made ground truth was used to assess the performance of the algorithms. Experiments show that photometric selection must be combined with geometric selection to extract road markings correctly. As usual in pattern recognition, this task is not trivial, even for objects that seem quite simple, such as road markings. In particular, pitfalls such as lousy models and overly selective models must be avoided. The methodology we proposed in this paper is a helpful tool for this purpose. For example, several times during this study, we were faced with intuitively good variants which turned out to be inefficient in practice when systematically evaluated on the test base.

Feature space selection and combination for native language identification

4 Discussion and Conclusion

Our results suggest that on the shared task, a combination of features relying only on word and character n-grams provided a strong baseline. Our best system ended up being a combination of models trained on various sets of lexical and syntactic features, using a simple majority vote. Our submission #4 combined only our three other submissions, but we later experimented with a larger pool of models. Table 3 shows that the best performance is obtained using the top 10 models, and that many of the combinations are competitive with the best performance achieved during the evaluation. Our cross-validation estimate was also maximized for 10 models, with an estimated accuracy of 83.23%. It is interesting that adding some of the weaker models does not seem to hurt the voting combination very much.
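The majority-vote combination can be sketched in a few lines; the three models' per-document predictions below are invented native-language labels:

```python
# Simple majority vote over per-model predictions, as in the model
# combination described above; the label lists are illustrative.
from collections import Counter

preds = [  # one row per model, one column per test document
    ["FR", "DE", "ES", "FR"],
    ["FR", "ES", "ES", "IT"],
    ["IT", "DE", "ES", "FR"],
]

# For each document, take the most frequent label across models.
voted = [Counter(col).most_common(1)[0][0] for col in zip(*preds)]
print(voted)  # -> ['FR', 'DE', 'ES', 'FR']
```

With an odd number of models and few label ties, this is the whole combination scheme; `Counter.most_common` breaks any ties by first occurrence.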

Intelligent feature based resource selection and process planning

Effective process selection for a given set of product characteristics and requirements is a multi-criterion problem which is strongly influenced by interdependent manufacturing knowledge, such as product knowledge (product complexity, design requirements, product quality) and resource knowledge (resource availability, characteristics and cost). This knowledge is formalized based on the information provided regarding the role and experience of experts in product-process design. The process-resource selection involves two steps: step 1 evaluates the available processes for their technical capability to respond appropriately to design requirements, and step 2 ranks process performance economically with regard to resource consumption [4]. Many methods have been proposed to support these steps. Ashby [5] proposes selecting manufacturing processes by considering their compatibility with the geometrical parameters and the resources needed, followed by a ranking of the associated relative costs. The final choice is made with respect to the rating, industrial knowledge, and available manufacturing resources; however, the approach adopted in this research does not target a specific manufacturing domain.

Contributions to the estimation of probabilistic discriminative models: semi-supervised learning and feature selection

…the function to minimize becomes ℓ(θ) + (ρ/2)‖θ‖₂², where ρ is a regularization parameter. Besides its good empirical performance, the practical appeal of this approach is that evaluating the objective function and its gradient requires the same computations as for ℓ(θ), and that any numerical method for minimizing a differentiable and convex function, moreover without domain constraints, can be used. The main limitations of this standard approach are, on the one hand, the execution time, since the forward-backward recursion must be run over all the training sequences at each evaluation of the function or of its gradient, and, on the other hand, the memory footprint of the code, owing to the usually very large size of the parameter vector. In practice, this second aspect rules out algorithms that try to estimate the Hessian or its inverse directly, and leads to the predominant use of conjugate-gradient or limited-memory quasi-Newton algorithms (of the L-BFGS, Limited-Memory BFGS, type in particular).
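This setup can be sketched with SciPy's L-BFGS-B: an ℓ2-penalized objective whose value and gradient are returned together, as described above. A toy quadratic stands in for the actual negative log-likelihood ℓ(θ):

```python
# Minimize l(theta) + (rho/2)*||theta||_2^2 with a limited-memory
# quasi-Newton method (L-BFGS); the quadratic l is a toy stand-in for
# the CRF negative log-likelihood mentioned in the text.
import numpy as np
from scipy.optimize import minimize

rho = 0.1
target = np.array([1.0, -2.0, 3.0])

def objective(theta):
    diff = theta - target                      # l(theta) = 0.5*||theta - target||^2
    value = 0.5 * diff @ diff + 0.5 * rho * theta @ theta
    grad = diff + rho * theta                  # gradient costs the same as l itself
    return value, grad

res = minimize(objective, np.zeros(3), jac=True, method="L-BFGS-B")
print(res.x)  # shrunk toward 0: analytically target / (1 + rho)
```

L-BFGS-B only keeps a small history of gradient differences instead of the Hessian, which is exactly why it suits the very large parameter vectors discussed above.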

Feature selection and term weighting beyond word frequency for calls for tenders documents

• For this particular collection, we found that both feature selection by sentence filtering and the term weighting method improve the performance of the Naive Bayes classifier by a bigger margin.

Regression Trees and Random forest based feature selection for malaria risk exposure prediction

2.3 Environmental and behavioral data

Rainfall was recorded twice a day with a pluviometer in each village. In and around each catch site, the following information was systematically collected: (1) type of soil (dry lateritic or humid hydromorphic), assessed using a soil map of the area (map IGN Benin at 1/200 000, sheets NB-31-XIV and NB-31-XV, 1968) that was georeferenced and input into a GIS; (2) presence of areas where building constructions are ongoing, with tools or holes representing potential breeding habitats for anopheles; (3) presence of abandoned objects (or utensils) susceptible to be used as oviposition sites by female mosquitoes; (4) a watercourse nearby; (5) number of windows and doors; (6) type of roof (straw or metal); (7) number of inhabitants; (8) ownership of a bed-net or (9) insect repellent; and (10) the normalized difference vegetation index (NDVI), which was estimated for 100 meters around the catch site with a SPOT 5 High Resolution (10 m colors) satellite image (Image Spot5, CNES, 2003, distribution SpotImage S.A.), with assessment of the chlorophyll density of each pixel of the image. Due to logistical problems, rainfall measurements are only available after the second entomological survey. Consequently, we excluded the first and second surveys (performed in July and August 2007, respectively) from the statistical analyses.

Speeding-up model-selection in GraphNet via early-stopping and univariate feature-screening

Sparsity- and structure-inducing priors are used to perform jointly the prediction of a target variable and region segmentation in multivariate analysis settings. Specifically, it has been shown that one can employ priors like Total Variation (TV) [1], TV-L1 [2], [3], TV-ElasticNet [4], and GraphNet [5] (aka S-Lasso [6] outside the neuroimaging community) to regularize regression and classification problems in brain imaging. The results are brain maps which are both sparse (i.e. regression coefficients are zero everywhere, except at predictive voxels) and structured (blobby). The superiority of such methods over methods without structured priors, like the Lasso, ANOVA, Ridge, SVM, etc., for yielding more interpretable maps and improved prediction scores is now well established (see for example [2], [3]). These priors are fast becoming popular for brain decoding and segmentation. Indeed, they combine a feature-selection function (since they limit the number of active voxels) with a structuring function (since they penalize local differences in the values of the brain map). Such priors also yield state-of-the-art methods for the automatic extraction of functional brain atlases [7].

Random forests variable importances Towards a better understanding and large-scale feature selection

G. Louppe, Understanding Random Forests: From Theory to Practice, Ph.D. thesis, University of Liège, 2014.
G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts, Understanding variable importances in forests of randomized trees, Advances in Neural Information Processing Systems, 2013.

Lasso based feature selection for malaria risk exposure prediction

8. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edn. (2009)
9. Friedman, J., Hastie, T., Simon, N., Tibshirani, R.: Lasso and elastic-net regularized generalized linear models (2015), http://www.jstatsoft.org/v33/i01/, R CRAN
10. Ng, A.Y.: Preventing "overfitting" of cross-validation data. In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 245–253. ICML '97, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1997), http://dl.acm.org/citation.cfm?id=645526.657119

Feature selection and classification in genetic programming: application to haptic based biometric data

I. INTRODUCTION

The integration of haptics into immersive virtual environments has been an active research area over the past decade. Immersive digital environments consist of computer-created scenes within which users can immerse themselves and interact with other users or various objects through a virtual reality experience. Conversely, haptic systems enable physical interactions with virtual three-dimensional objects through the sense of touch, and are therefore expected to become the next dimension of human-computer interaction. Haptic-based applications are wide-ranging and span many areas, including medicine, rehabilitation, education and entertainment. In recent years, however, the possible use of haptic devices in biometric systems has been suggested, to enable improved user identification/verification performance over more traditional techniques, such as those based on handwritten signatures. Biometric systems provide a solution to ensure that protected services are solely accessible to a legitimate user. This is achieved by relying on users' behavioral and/or physiological characteristics. Haptic data depict trajectory, cutaneous as well as kinesthetic information, which essentially consist of position, velocity,

Mutual information-based feature selection enhances fMRI brain activity classification

3.4. Results on Real Data

Effect of α: The parameter α has a strong influence on the outcome of the selection. We have studied the size of the selection for different values of α (see Fig. 3) on the set of images of subject 6, where we have pre-selected the 300 voxels with the highest F-score in ANOVA in order to reduce the computation time. We can see that the final number of features depends on α. When α increases, the selection is less strict, and the number of selected features is higher. The generalization performance of the feature selection is constant, which seems to imply that the first set of 4 voxels contains all the information needed to classify the images. However, the voxels added by increasing α do not seem to decrease the generalization performance of the SVM. It is interesting to notice that for a very low threshold (i.e. for a low
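The ANOVA pre-selection step (keeping only the voxels with the highest F-scores before the mutual-information stage) can be sketched as follows, on a synthetic stand-in for the fMRI data:

```python
# Pre-select the k features with the highest ANOVA F-score, as done above
# for the 300 best voxels; the synthetic "fMRI" matrix is illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 120 scans x 1000 voxels, standing in for the real images of subject 6.
X, y = make_classification(n_samples=120, n_features=1000,
                           n_informative=10, random_state=0)

pre = SelectKBest(f_classif, k=300).fit(X, y)
X_reduced = pre.transform(X)
print(X_reduced.shape)  # (120, 300): only the top-scoring voxels kept
```

The α-controlled mutual-information selection described in the text then operates on `X_reduced` instead of the full voxel set.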
