
Master 2 Thesis Project — 2021

A scalable bi-clustering rough set-based approach for big data pre-processing

Project overview

A major challenge in the knowledge discovery process is performing data pre-processing, specifically feature selection, on large amounts of data with high-dimensional attribute sets. A variety of techniques have been proposed in the literature to deal with this challenge, with varying degrees of success, as most of these techniques require additional information about the input data for thresholding, need noise levels to be specified, or rely on feature ranking procedures (Reddy et al., 2020).

To overcome these limitations, Rough Set Theory (RST) (Raza & Qamar, 2017) can be used to discover the dependencies within the data and to reduce the number of attributes in an input data set, using the data alone and requiring no supplementary information.
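To make the dependency notion concrete, the following is a minimal, self-contained sketch of the classical RST dependency degree gamma_C(D) = |POS_C(D)| / |U| together with a greedy, QuickReduct-style attribute reduction. It is an illustration of standard RST definitions, not the proposed approach; the toy decision table and function names are assumptions made for the example.

```python
from collections import defaultdict

def dependency_degree(table, cond_attrs, dec_attr):
    """gamma_C(D) = |POS_C(D)| / |U|: the fraction of rows whose
    condition-attribute values determine the decision unambiguously."""
    # Group rows into equivalence classes by their condition-attribute values.
    classes = defaultdict(set)
    for i, row in enumerate(table):
        classes[tuple(row[a] for a in cond_attrs)].add(i)
    # A class lies in the positive region if all its rows share one decision.
    pos = sum(len(ids) for ids in classes.values()
              if len({table[i][dec_attr] for i in ids}) == 1)
    return pos / len(table)

def quickreduct(table, cond_attrs, dec_attr):
    """Greedy forward selection: repeatedly add the attribute that most
    increases gamma until it matches the full-set dependency degree."""
    full = dependency_degree(table, cond_attrs, dec_attr)
    reduct = []
    while dependency_degree(table, reduct, dec_attr) < full:
        best = max((a for a in cond_attrs if a not in reduct),
                   key=lambda a: dependency_degree(table, reduct + [a], dec_attr))
        reduct.append(best)
    return reduct

# Toy decision table: attribute "b" alone determines the decision "d".
rows = [{"a": 0, "b": 0, "d": "no"},
        {"a": 1, "b": 0, "d": "no"},
        {"a": 0, "b": 1, "d": "yes"},
        {"a": 1, "b": 1, "d": "yes"}]
print(quickreduct(rows, ["a", "b"], "d"))  # ['b']
```

Note that the reduct is found from the data alone: no threshold, noise level, or external ranking is supplied, which is exactly the property the proposal exploits.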

However, despite its scalable version (Dagdia et al., 2020), the distributed rough set-based technique faces a key challenge tied to partitioning the feature search space in a distributed environment while guaranteeing data dependency. A first attempt at guaranteeing data dependency was proposed in (Dagdia et al., 2018), where the algorithm applies a hashing technique to partition the universe. Nevertheless, the proposed feature selection technique does not fully guarantee data dependency, as data may belong to more than a single block. In addition, such a crisp separation between data instances does not align with the fundamentals of rough set theory, which assigns each data instance to different classes with different degrees of membership.

Therefore, in this Master Thesis, we will investigate a new distributed RST version based on bi-clustering techniques to guarantee data dependency and to better perform the big data feature selection task. The notion of a bi-cluster (Padilha & Campello, 2017) gives rise to a more flexible computational framework, as it allows simultaneous clustering of the rows and columns of an input data set, where a data instance can belong to more than a single cluster. Within a distributed environment, when partitioning the feature search space, bi-clustering analysis is therefore highly appropriate. By using bi-clustering, the rough set-based hybrid technique will be better suited to partitioning the high-dimensional feature search space in a reliable way, hence better preserving data dependency in the distributed environment.
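To illustrate why bi-clusters permit the overlap that a hard hash-based partition forbids, the following toy sketch represents a bi-cluster as a pair of row and column index sets and scores its coherence with the classical Cheng-Church mean squared residue. The data matrix and the two bi-clusters are illustrative assumptions, not output of the proposed algorithm.

```python
def mean_squared_residue(matrix, rows, cols):
    """Cheng-Church mean squared residue of the bi-cluster (rows, cols):
    the mean over (i, j) of (a_ij - a_iJ - a_Ij + a_IJ)^2, where a_iJ,
    a_Ij, a_IJ are the row, column, and overall means of the sub-matrix.
    A residue of 0 means a perfectly coherent (additive) sub-matrix."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    row_mean = [sum(r) / len(cols) for r in sub]
    col_mean = [sum(sub[i][j] for i in range(len(rows))) / len(rows)
                for j in range(len(cols))]
    all_mean = sum(row_mean) / len(rows)
    return sum((sub[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
               for i in range(len(rows))
               for j in range(len(cols))) / (len(rows) * len(cols))

# Toy data matrix: rows are instances, columns are features.
M = [[1, 2, 9, 3],
     [2, 3, 1, 4],
     [5, 6, 2, 7],
     [9, 1, 8, 2]]

# Two coherent bi-clusters that OVERLAP: instance 1 belongs to both,
# which is exactly the soft membership a crisp hash-based split forbids.
b1 = ([0, 1], [0, 1, 3])
b2 = ([1, 2], [0, 1, 3])
print(mean_squared_residue(M, *b1))  # 0.0 — perfectly coherent
print(mean_squared_residue(M, *b2))  # 0.0 — perfectly coherent
```

Because instance 1 sits in both bi-clusters, a feature-space partition built from such blocks can keep every dependency an instance participates in, at the cost of processing that instance in more than one block.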

Keywords: Big data, rough set theory, bi-clustering, feature selection, data dependency, distributed processing.

References

- Reddy GT, Reddy MP, Lakshmanna K, Kaluri R, Rajput DS, Srivastava G, Baker T. Analysis of dimensionality reduction techniques on big data. IEEE Access. 2020;8:54776-88.

- Padilha VA, Campello RJ. A systematic comparative evaluation of biclustering techniques. BMC Bioinformatics. 2017;18(1):55.

- Raza MS, Qamar U. Redefining core preliminary concepts of classic Rough Set Theory for feature selection. Engineering Applications of Artificial Intelligence. 2017;65:375-87.

- Chelly Dagdia Z, Zarges C, Beck G, Lebbah M. A scalable and effective rough set theory-based approach for big data pre-processing. Knowledge and Information Systems. 2020;62(8):3321-3386.

- Chelly Dagdia Z, Zarges C, Beck G, Azzag H, Lebbah M. A distributed rough set theory algorithm based on Locality Sensitive Hashing for an efficient big data pre-processing. IEEE BigData 2018:2597-2606.

Work plan

1) Studying the state-of-the-art of bi-clustering techniques and their distributed versions;

2) Studying the state-of-the-art of scalable feature selection techniques (including the rough set-based approaches);

3) Synthesizing the pros and cons of existing works;

4) Proposing a new, efficient, and scalable bi-clustering rough set-based approach for big data pre-processing, specifically for feature selection, while guaranteeing data dependency;

5) Evaluating the performance of the proposed technique via a set of comparative experiments, using commonly adopted benchmark datasets, metrics, and experimental methodologies;

6) Summarizing the work in a scientific research paper to be submitted to a refereed international conference or journal.

Required qualifications

- Excellent skills in machine learning (e.g., (bi-)clustering, classification, etc.) and in knowledge discovery in databases (e.g., feature engineering);

- Excellent skills in parallel and distributed systems/processing (e.g., Spark);

- Excellent programming skills (Scala, Python, etc.);

- Excellent English writing skills;

- Good communication skills;

- Commitment, perseverance, efficiency, integrity, and ability to get things done;

- Strong motivation to publish in refereed international conferences/journals.

Supervisor’s details

Dr Zaineb CHELLY DAGDIA, Associate Professor at Versailles Saint-Quentin-en-Yvelines University

Contact: zaineb.chelly-dagdia@uvsq.fr

Website: https://sites.google.com/site/zeinebchelly/home

Funding Details

- Starts February 2021 or soon after;

- Duration: 6 months;


- Funding: 600 euros per month plus partial coverage of public transportation expenses;

- Working place: Versailles Saint-Quentin-en-Yvelines University, UFR des Sciences, 45 Avenue des États Unis, 78000 Versailles.

- Note: With regard to the health measures related to COVID-19, telecommuting and remote supervision may be envisaged.

Host environment

Host institution: The Versailles Saint-Quentin-en-Yvelines University (UVSQ), a leading multidisciplinary institution and a member of Paris-Saclay University, fosters research addressing major scientific, technological, economic, and societal challenges by building on strong partnerships (CNRS, INSERM, CEA, INRA, INRIA, etc.) and by breaking down barriers between disciplines.

Host laboratory and team: The Data and Algorithms for an Intelligent and Sustainable City (DAVID: https://www.david.uvsq.fr/home/) laboratory is a Computer Science laboratory of the Faculty of Sciences of UVSQ. The objective of the laboratory is to conduct research activities combining big data and the extraction of quality knowledge, data security and confidentiality, and modeling and algorithmics, in order to propose innovative applications in the context of the smart and sustainable city. Within DAVID, the Ambient Data Access and Mining (ADAM) team will host the proposed research project. The ADAM team's research focuses on the management of massive data characterized by large-scale distribution, high heterogeneity, and dynamism. Whether produced by sensors or mobile devices, or available on the Web or in specialized databases, extracting value from these data requires a set of high-added-value services, such as the modeling of imperfect or incomplete data, data integration and fusion, complex query definition and execution, and data mining.

How to apply

- Detailed Curriculum Vitae;

- Master 1 transcripts;

- Two recommendation letters (not mandatory but will be a plus);

- A personal statement (max 1 page in English) to present your interest, motivation and suitability for this project.

Candidates holding a Master 1 in Computer Science can send their applications to zaineb.chelly-dagdia@uvsq.fr, quoting the reference “GS21-01” in the email subject.

Application deadline: 09/01/2021
