OUTLIERS DETECTION METHODS: A SURVEY

Academic year: 2021


HAL Id: hal-02913092

https://hal.archives-ouvertes.fr/hal-02913092

Submitted on 7 Aug 2020

Maurras U. TOGBE¹, Yousra CHABCHOUB², Aliou BOLY³

¹ maurras.togbe@isep.fr, ² yousra.chabchoub@isep.fr, ³ aliou.boly@ucad.edu.sn

Context

• The emergence of data streams from a variety of sources, at an ever-increasing rate, requires efficient anomaly detection algorithms to identify suspicious behavior.

• Depending on the knowledge available about the dataset, supervised, semi-supervised, or unsupervised anomaly detection techniques can be applied.

Application Domains

Anomaly detection is a topical issue in many fields.

Constraints and challenges

Constraints of the data stream context:

• A continuous generation and reception of the data stream for analysis

• A high and variable data stream rate

• Poor data quality, so the data stream must first be cleaned

These constraints raise several challenges for online outlier detection methods:

• A single pass over the data stream

• A fast online analysis (low complexity)

• No a priori knowledge of the data distribution

• Adaptation for distributed processing

• Ability to provide specific information about the anomaly (start time, intensity)

• Correlation of the observed data stream with external knowledge (weather conditions, social networks...) to achieve better detection accuracy
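As a minimal sketch of a detector that respects several of the challenges listed above (single pass, bounded memory, no fitted model of the data distribution), the following Python class keeps a sliding window of recent values and flags a point that lies far from the window median, measured in units of the median absolute deviation (MAD). The class name, window size, threshold, and warm-up length are illustrative assumptions, not taken from the poster. The 0.6745 factor is only the usual convention for rescaling the MAD so that the score reads like a z-score.

```python
from collections import deque
from statistics import median


class SlidingWindowDetector:
    """Hypothetical example: flag points far from the median of a recent window."""

    def __init__(self, window_size=50, threshold=3.5, min_points=10):
        self.window = deque(maxlen=window_size)  # bounded memory
        self.threshold = threshold               # assumed cut-off on the score
        self.min_points = min_points             # warm-up before any flagging

    def update(self, x):
        """Process one observation (seen exactly once); return True if flagged."""
        is_outlier = False
        if len(self.window) >= self.min_points:
            med = median(self.window)
            mad = median([abs(v - med) for v in self.window])
            if mad > 0:
                score = 0.6745 * abs(x - med) / mad
                is_outlier = score > self.threshold
        # The point is stored even when flagged; a real system might skip it.
        self.window.append(x)
        return is_outlier


# Usage: a regular signal with one injected spike at position 30.
stream = [10.0 + 0.1 * (i % 5) for i in range(30)] + [50.0]
stream += [10.0 + 0.1 * (i % 5) for i in range(20)]
detector = SlidingWindowDetector()
flagged = [i for i, x in enumerate(stream) if detector.update(x)]
print(flagged)
```

Each update recomputes two medians over the window, so the cost per observation grows with the window size; a high-rate stream would call for an incremental median structure instead.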

Definitions

• Outlier: an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism. [3]

• Outlier detection: effective anomaly detection is based on the fundamental concept of modeling what is normal in order to discover what is not. [2]

Classification of outlier detection methods

Only some of the existing anomaly detection methods are suited to data streams and time series [1].

Performance metrics:
- Detection accuracy (false positives, false negatives)
- Response time (anomaly detection delay)
- Assumptions on data distribution
- Scalability (adaptation to high-dimensional data or a high-rate data stream)
- Adaptation for distributed processing
- Complexity
- Memory consumption
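The first two metrics can be made concrete with a short Python sketch, assuming a labeled stream where each observation carries a ground-truth flag and a detector flag. The function names are hypothetical, and measuring delay as the number of observations between the start of an anomalous segment and its first detection is one possible convention, not one stated in the poster.

```python
def detection_metrics(truth, flags):
    """Count detection errors over equal-length sequences of booleans."""
    if len(truth) != len(flags):
        raise ValueError("truth and flags must have the same length")
    pairs = [(bool(t), bool(f)) for t, f in zip(truth, flags)]
    tp = sum(t and f for t, f in pairs)
    fp = sum((not t) and f for t, f in pairs)   # false alarms
    fn = sum(t and (not f) for t, f in pairs)   # missed anomalies
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


def detection_delays(truth, flags):
    """For each run of consecutive anomalous observations, return the number
    of observations between its start and the first flag raised inside it,
    or None when the whole run was missed."""
    delays = []
    i, n = 0, len(truth)
    while i < n:
        if truth[i]:
            start = i
            while i < n and truth[i]:
                i += 1
            hit = next((j for j in range(start, i) if flags[j]), None)
            delays.append(None if hit is None else hit - start)
        else:
            i += 1
    return delays


# Usage: two anomalous segments, one detected late and one missed.
truth = [False, False, True, True, True, False, False, True, False, False]
flags = [False, True, False, True, True, False, False, False, False, False]
print(detection_metrics(truth, flags))
print(detection_delays(truth, flags))
```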

Comparison of outlier detection methods

Approach: Statistics-based
Advantages:
- Non-parametric methods are suited to the data stream context
Disadvantages:
- Parametric methods cannot be used on data streams
- Non-parametric methods can only be used for low-dimensional data streams

Approach: Distance-based
Advantages:
- Suited to global outlier detection
Disadvantages:
- Not suited to non-homogeneous densities
- High computational cost for high-dimensional data streams

Approach: Density-based
Advantages:
- Suited to local outlier detection
- More efficient than distance-based methods
Disadvantages:
- High complexity
- Not effective for high-dimensional data streams

Approach: Clustering-based
Advantages:
- Suited to cluster identification
Disadvantages:
- Not optimized for outlier identification
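The contrast between distance-based and density-based approaches can be illustrated with a small brute-force Python sketch. The distance-based score is the distance to the k-th nearest neighbor; the density-based score follows the Local Outlier Factor (LOF), chosen here only as a well-known representative, since the poster does not name a specific algorithm. The function names and the toy data are assumptions, ties between neighbors are ignored, and all points are assumed distinct.

```python
import math


def knn_sets(points, k):
    """For each point, the k nearest neighbours as (distance, index) pairs."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        result.append(others[:k])
    return result


def knn_distance_scores(points, k):
    """Distance-based score: distance to the k-th nearest neighbour."""
    return [nbrs[-1][0] for nbrs in knn_sets(points, k)]


def lof_scores(points, k):
    """Density-based score in the style of LOF (simplified: no tie handling)."""
    nbrs = knn_sets(points, k)
    k_dist = [n[-1][0] for n in nbrs]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for n in nbrs:
        reach = [max(k_dist[j], d) for d, j in n]
        lrd.append(len(reach) / sum(reach))
    # Ratio of the neighbours' average density to the point's own density.
    return [
        sum(lrd[j] for _, j in n) / (len(n) * lrd[i])
        for i, n in enumerate(nbrs)
    ]


# Toy data: a dense cluster, a sparse cluster, and one point (1.5) that sits
# just outside the dense cluster.
dense = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,)]
sparse = [(10.0,), (12.0,), (14.0,), (16.0,), (18.0,)]
points = dense + sparse + [(1.5,)]

for p, d, s in zip(points, knn_distance_scores(points, 2), lof_scores(points, 2)):
    print(p, round(d, 3), round(s, 3))
```

On this data every member of the sparse cluster has a larger k-th neighbor distance than the point at 1.5, so a single global distance threshold would flag the sparse cluster first, whereas the LOF-style score is highest for the point at 1.5. Both functions compare every pair of points, which reflects the cost concerns listed in the comparison.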

Future Works

• Benchmark several existing outlier detection methods against the different performance metrics.

• Focus on distributed outlier detection methods and design a distributed version of some existing outlier detection techniques.

References

[1] Charu C. Aggarwal. Outlier Analysis. Springer, second edition, 2017.

[2] Ted Dunning and Ellen Friedman. Practical machine learning: a new look at anomaly detection. O’Reilly Media, Inc., 2014.
