• Aucun résultat trouvé

AIDEme: An active learning based system for interactive exploration of large datasets

N/A
N/A
Protected

Academic year: 2021

Partager "AIDEme: An active learning based system for interactive exploration of large datasets"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-02430750

https://hal.inria.fr/hal-02430750

Submitted on 7 Jan 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

AIDEme: An active learning based system for interactive exploration of large datasets

Enhui Huang, Luciano Palma, Laurent Cetinsoy, Yanlei Diao, Anna Liu

To cite this version:

Enhui Huang, Luciano Palma, Laurent Cetinsoy, Yanlei Diao, Anna Liu. AIDEme: An active learning based system for interactive exploration of large datasets. 2019. �hal-02430750�

(2)

AIDEme: An active learning based system for interactive exploration of large datasets

Enhui Huang, Luciano Di Palma, Laurent Cetinsoy, Yanlei Diao, Anna Liu

AIDEme is a scalable interactive data exploration system for efficiently learning a user interest pattern over a large dataset

Motivation

• An increasing gap between fast growth of data and limited human ability to comprehend data.

• A growing demand of data analytics tools that can bridge this gap and help the user retrieve high-value content from data more effectively.

System Overview

• Consider the data content as a set of records, and the user is interested in some of them but not all.

• In each iteration,

– the user labels a record as ”interesting” or ”not interesting”, – a classification model is built,

– active learning techniques are employed to select a new record from the unlabeled data source.

• Construct an increasingly-more-accurate model of the user interest.

• Upon convergence, the model is run through the entire data source to retrieve all relevant records.

Key Techniques

• Challenge:

Slow Convergence

• Novel techniques in AIDEme:

1. Factorization

– Factorized Version Space

Color (B/R) Size (S/L) Label

B(lack) L(arge) {-, +}

B(lack) S(mall) {-, +}

R(ed) L(arge) {-, +}

R(ed) S(mall) {-, +}

Version Space V (size = 16)

A labeled example (BL, ‘–’)

Version Space V (size = 8)

Color

(B/R) Label B(lack) {-, +}

R(ed) {-, +}

Size

(S/L) Label L(arge) {-, +}

S(mall) {-, +}

Version Space V (size = 16)

A labeled example

((B, ‘–’), (L, ‘+’))

Color

(B/R) Label B(lack) { - }

R(ed) {-, +}

Size

(S/L) Label L(arge) { + } S(mall) {-, +}

Version Space V (size = 4)

Color (B/R) Size (S/L) Label

B(lack) L(arge) { - }

B(lack) S(mall) {-, +}

R(ed) L(arge) {-, +}

R(ed) S(mall) {-, +}

(a) Without factorization

(b) With factorization

2. Formal results on convergence

– Theoretical results on the convergence of our proposed techniques.

– Detect convergence and terminate the exploration process.

3. Scaling to large datasets – Subsampling procedures

– Provide provable results that guarantee the performance of the model learned from the sample over the entire data source

4. Optimization using Class Distribution

– Subspatial convex property: the user in- terest pattern projected onto a subspace of- ten entails a convex object.

– When the subspatial convex property holds, we introduce a Dual-Space Model (DSM).

DSM ←−−−−−predict

sample next Classifier + Polytope model.

Demonstration

• Demonstration

Setup =⇒ Iterative Exploration =⇒ Final retrieval =⇒ Comparison

• Comparison

SDSS Q2

0 0.2 0.4 0.6 0.8 1

0 20 40 60 80 100

F-score

Iteration

AIDEme SM QBD

SDSS Q3

0 0.2 0.4 0.6 0.8 1

0 20 40 60 80 100

F-score

Iteration

AIDEme SM QBD

SDSS Q6

0 0.2 0.4 0.6 0.8 1

0 20 40 60 80 100

F-score

Iteration AIDEme

SM QBD

Car Q4 and Q17

0 0.2 0.4 0.6 0.8 1

0 20 40 60 80 100

F1-score

Iteration

Q4-AIDEme Q17-AIDEme Q4-SM Q17-SM

References

1. Huang, E., Peng, L., Di Palma, L., Abdelkafi, A., Liu, A. & Diao, Y. Optimization for active learning- based interactive database exploration. Proceedings of the VLDB Endowment (PVLDB), 12(1), 71-84, September 2018.

2. Di Palma, L., Diao, Y. & Liu, A. A Factorized Version Space Algorithm for “Human-In-the-Loop” Data Exploration. IEEE International Conference on Data Mining (ICDM), November 2019.

Références

Documents relatifs

A level-of-detail rendering and processing approach combined with distributed and parallel feature extraction enables the real-time exploration of data.. However, by now the

When the true query is con- junctive, taking advantage of the factorization property of the data space proves to be effective for enlarging the positive and nega- tive regions known

Positive and negative regions in factorized space. Then in the I th subspace defined by A I , the positive and negative regions of Q I can be defined as in Definition 3.1 and 3.2,

This data structure is two orders of magnitude smaller than the input, making it possible to explore interactively an entire family of features along with aggre- gate attributes,

We experimentally noticed that selecting images close to the boundary between relevant and irrelevant classes is definitively efficient, but when considering a subset of images close

Target search strategy is working as follows: the user presents an example of the images he is looking for, and the system extracts from the database the images most similar to

In this section, we describe Exquisitor, a novel interactive learning approach that tightly integrates high-dimensional indexing with the interactive learning process,

c) Focusing on an image: The user can focus his attention on a single image by clicking on it. A zoom is done by placing the selected image on the center of the visualisation