HAL Id: hal-02913092
https://hal.archives-ouvertes.fr/hal-02913092
Submitted on 7 Aug 2020
OUTLIERS DETECTION METHODS: A SURVEY
Maurras Togbe, Yousra Chabchoub, Aliou Boly
Maurras U. TOGBE (1), Yousra CHABCHOUB (2), Aliou BOLY (3)
(1) maurras.togbe@isep.fr, (2) yousra.chabchoub@isep.fr, (3) aliou.boly@ucad.edu.sn
Context
• The emergence of data streams from a variety of sources, at an ever-increasing rate, requires efficient anomaly detection algorithms to identify suspicious behavior.
• Depending on the knowledge available about the dataset, supervised, semi-supervised or unsupervised anomaly detection techniques can be applied.
Application Domains
Anomaly detection is a current issue in many fields.
Constraints and challenges
Constraints of the data stream context:
• A continuous generation and reception of the data stream for analysis
• A high and variable data stream rate
• Poor data quality, so the data stream must first be cleaned
These constraints raise several challenges for online outlier detection methods:
• A single pass over the data stream
• A fast online analysis (low complexity)
• No a priori knowledge on data distribution
• Adaptation for distributed processing
• Ability to provide specific information about the anomaly (start time, intensity)
• Correlation of the observed data stream with external knowledge (weather conditions, social networks...) to achieve a better detection accuracy
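The first three challenges (single pass, low cost per point, no distributional assumption) can be illustrated with a small sketch. It is not a method from this poster: it assumes a non-parametric sliding-window detector based on the median and the median absolute deviation (MAD). The class name, the window size, the 0.6745 scaling constant and the 3.5 threshold are illustrative choices.

```python
from collections import deque
from statistics import median


class SlidingWindowMADDetector:
    """Single-pass outlier detector using a sliding-window median and MAD.

    Illustrative sketch only: each point is seen once, and memory is
    bounded by the window size.
    """

    def __init__(self, window_size=50, threshold=3.5, min_points=10):
        self.window = deque(maxlen=window_size)  # keeps only the latest values
        self.threshold = threshold
        self.min_points = min_points  # warm-up length before scoring starts

    def update(self, x):
        """Score x against the current window, then store it. Returns True if flagged."""
        is_outlier = False
        if len(self.window) >= self.min_points:
            med = median(self.window)
            mad = median(abs(v - med) for v in self.window)
            if mad == 0:
                # Degenerate window (at least half the values equal the median):
                # flag any value that differs from the median.
                is_outlier = x != med
            else:
                # Robust (modified) z-score; 0.6745 rescales the MAD.
                score = 0.6745 * abs(x - med) / mad
                is_outlier = score > self.threshold
        self.window.append(x)
        return is_outlier


# Hypothetical stream: a stable signal with one spike at index 12.
stream = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3, 10.0, 9.7, 10.2, 10.1,
          9.9, 10.0, 25.0, 10.1, 9.8]
detector = SlidingWindowMADDetector(window_size=20, threshold=3.5, min_points=10)
flagged = [i for i, x in enumerate(stream) if detector.update(x)]
print(flagged)
```

Recomputing the median on every update costs O(w log w) for a window of size w, which is acceptable for small windows; a production detector would likely maintain the order statistics incrementally.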
Definitions
• Outlier: an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism. [3]
• Outlier detection: effective anomaly detection is based on the fundamental concept of modeling what is normal in order to discover what is not. [2]
Classification of the outliers detection methods
Only some of the existing anomaly detection methods are suited to data streams and time series [1].
Performance metrics:
• Detection accuracy (false positives, false negatives)
• Response time (anomaly detection delay)
• Assumptions on data distribution
• Scalability (adaptation to high-dimensional data or a high-rate data stream)
• Adaptation for distributed processing
• Complexity
• Memory consumption
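As a rough illustration of the first two metrics, the sketch below counts false positives and false negatives from binary labels (1 = outlier) and measures the delay between the start of an anomaly and the first alarm. The function names and the sample labels are hypothetical, and precision and recall are added here as common summaries of the two error counts.

```python
def detection_metrics(true_labels, predicted_labels):
    """Count detection errors and derive precision/recall (1 = outlier, 0 = normal)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1  # false positive: normal point flagged as an outlier
        elif truth == 1 and pred == 0:
            fn += 1  # false negative: outlier that was missed
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


def detection_delay(anomaly_start, alarm_times):
    """Time between the anomaly start and the first alarm at or after it.

    Returns None when no alarm follows the start of the anomaly.
    """
    later = [t for t in alarm_times if t >= anomaly_start]
    return min(later) - anomaly_start if later else None


# Hypothetical ground truth and detector output.
truth = [0, 0, 1, 0, 1, 0, 0, 1]
preds = [0, 1, 1, 0, 0, 0, 0, 1]
metrics = detection_metrics(truth, preds)
delay = detection_delay(100, [90, 104, 110])
print(metrics, delay)
```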
Comparison of outliers detection methods
Statistics-based
Advantages:
- Non-parametric methods are adapted to data stream context
Disadvantages:
- Parametric methods can't be used in data stream
- Non-parametric methods can only be used for low dimensional data stream

Distance-based
Advantages:
- Adapted for global outliers detection
Disadvantages:
- Not adapted for non-homogeneous densities
- High computational cost for high dimensional data stream

Density-based
Advantages:
- Adapted for local outliers detection
- More efficient than distance-based methods
Disadvantages:
- High complexity
- Not effective for high dimensional data stream

Clustering-based
Advantages:
- Adapted for clusters identification
Disadvantages:
- Not optimized for outliers identification
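The contrast between distance-based and density-based approaches can be sketched on a toy dataset with non-homogeneous densities. The example assumes two common representatives that the poster does not name: the distance to the k-th nearest neighbour as a distance-based score, and a simplified Local Outlier Factor (LOF) as a density-based score. The dataset, the value of k and the function names are illustrative, and the code assumes there are no duplicate points.

```python
import math


def knn(points, i, k):
    """k nearest neighbours of points[i] as (distance, index) pairs, nearest first."""
    dists = sorted((math.dist(points[i], q), j)
                   for j, q in enumerate(points) if j != i)
    return dists[:k]


def knn_distance_scores(points, k):
    """Distance-based score: distance to the k-th nearest neighbour."""
    return [knn(points, i, k)[-1][0] for i in range(len(points))]


def lof_scores(points, k):
    """Density-based score: simplified LOF (exactly k neighbours, ties cut by index)."""
    n = len(points)
    neigh = [knn(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neigh]

    def local_reachability_density(i):
        # Reachability distance to neighbour j: max(k-distance of j, d(i, j)).
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for _, j in neigh[i]) / (k * lrd[i]) for i in range(n)]


# Toy data: a dense 3x3 grid, a sparse 3x3 grid, and one point sitting just
# outside the dense grid (a local outlier).
dense = [(x, y) for x in (0.0, 0.1, 0.2) for y in (0.0, 0.1, 0.2)]
sparse = [(x, y) for x in (10.0, 12.0, 14.0) for y in (10.0, 12.0, 14.0)]
points = dense + sparse + [(0.1, 0.7)]
outlier = len(points) - 1

kd = knn_distance_scores(points, k=3)
lof = lof_scores(points, k=3)
print("k-NN distance of the local outlier:", round(kd[outlier], 3))
print("LOF of the local outlier:", round(lof[outlier], 3))
```

On this data the k-NN distance of the local outlier is smaller than that of every member of the sparse cluster, so a single global distance threshold cannot isolate it, whereas its LOF is well above the values of all other points. Both functions compare every pair of points, which reflects the high complexity noted for these approaches.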
Future Works
• Benchmark of several existing outliers detection methods based on the different performance metrics.
• Focus on the distributed outliers detection methods and design a distributed version for some existing outliers detection techniques.
References
[1] Charu C. Aggarwal. Outlier Analysis. Springer, second edition, 2017.
[2] Ted Dunning and Ellen Friedman. Practical machine learning: a new look at anomaly detection. O’Reilly Media, Inc., 2014.