• Aucun résultat trouvé

Then, we formulate various remarks for random forests in the Big Data context

N/A
N/A
Protected

Academic year: 2022

Partager "Then, we formulate various remarks for random forests in the Big Data context"

Copied!
1
0
0

Texte intégral

(1)

Random Forests for Big Data

Nathalie Villa-Vialaneix, Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot

Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method allowing to consider in a single and versatile framework regression problems, as well as two-class and multi- class classification problems. Focusing on classification problems, this paper reviews available proposals about random forests in parallel environments as well as about online random forests. Then, we formulate various remarks for random forests in the Big Data context.

Finally, we experiment three variants involving subsampling, Big Data- bootstrap and MapReduce respectively, on two massive datasets (15 and 120 millions of observations), a simulated one as well as real world data.

Références

Documents relatifs

A lot of work has been done in the concerned direction by different authors, and the following are the contribu- tions of this work: (1) An exhaustive study of the existing systems

Keywords: analytical programming, spectra analysis, CUDA, evolu- tionary algorithm, differential evolution, parallel implementation, sym- bolic regression, random decision forest,

Consider optimization problem (P2). Consider the assumptions of Proposition 4. In other words, as we experimentally show in the next section, most of the existing big datasets

Bien qu’ils soient partisans de laisser parler les données, les auteurs nous préviennent: “nous devons nous garder de trop nous reposer sur les données pour ne pas

difference between the “small data” standard framework compared to the Big Data in terms of statistical accuracy when approximate versions of the original approach are used to deal

More precisely, one variant relies on subsampling while three others are related to parallel implementations of random forests and involve either various adaptations of bootstrap to

As indicated in [4], the standard MR version of RF, denoted by MR-RF in the re- maining of this paper, is based on the parallel construction of trees obtained on bootstrap subsamples

I Sequential text classification offers good performance and naturally uses very little information. I Sequential image classification also performs