• Aucun résultat trouvé

PRIM related correlations

N/A
N/A
Protected

Academic year: 2021

Partager "PRIM related correlations"

Copied!
15
0
0

Texte intégral

(1)

PRIM related correlations

Axel Math´ei

Montefiore Institute - University of Li`ege

19th Belgian Workshop on Mathematical Optimization

(2)

Table of contents

Problem statement PRIM algorithm

(3)

Problem statement

An industrial decision maker owns a production line having dozens of parameters. Currently, he’s tuning them by trial and error, but wants a program finding the right ”zone” to be in.

Input: Data (parameters + quality) Output: Parameter intervals

Resulting in the best quality box possible. Limited to a given number of parameters.

(4)

Interpretability

Why boxes?

→ Interpretability

The decision maker prefers to know directly which parameters to controls and in which ranges to put them.

That would be impossible with ellipsoid for example.

Box

(5)

PRIM

Patient Rule Induction Method, J. Friedman & N. Fisher (1998) Idea:

1 Starts with all points.

2 For every dimension, removes a fraction α from the points at

an extreme then at the other.

3 Chooses the best among the 2d sets generated and restart at

step 2 until it remains a fraction β from the starting points. One can detect many hyperrectangles (or ”boxes”) by removing the points from the box found and restarting the algorithm on the remaining points.

(6)
(7)
(8)

PRIM - Variables selection

It is difficult to control a lot of parameters at the same time, so we want to limit the number of output intervals.

To select a reduced number of dimensions, we iterate:

1 Release the interval for each dimension, one by one.

2 Remove the dimension which, when released, gives the best

resulting box.

(9)

PRIM - First results

Database : AGC

Total: 748 points, 440 good ones. Dimensions: 53

Parameters: α = 5%, β = 20%

⇒ 5s, 144 points, 140 good ones → 97%

⇒ 5D limitation : 306 points (266 good ones, 40 bad ones)

(10)
(11)

Ongoing research - Correlations

When a box is defined by a few dimensions, we could want to exchange one of them keeping a good box, without restarting the algorithm.

It can be useful if a parameter is more difficult or dangerous to control, and the others are already tuned.

2 kind of scores we can maximize: Mean: best possible box

→ maximum-density segment problem (O(n))

Similarity: X1 is the constrained box; X2 is X1 without the

removed constraint; searching a X2 subarray maximizing:

|X1∩ X2|

(12)
(13)

Ongoing research - First results

Database : marketing (Friedman & Fisher) Total: 9409 points on 14 dimensions

Goal: isolate largest income and swap the most important variable

Parameters: α = 5%, β = 20%, 3 important variables Most important variable: dual income

⇒ Mean: Swaping it with ”householder status” even increased

the score by 3%

⇒ Similarity: Swaping it with ”marital status” gave a similarity

(14)

Complexity

PRIM:

1D projections: O(d .N. log(N))

Top-down peeling: − log(N)/ log(1 − α) steps

Swap:

Mean: O(d .N) Similarity: O(d .N2)

For each removed dimension:

Intersection: O(N) (with hash set) Max intersection score subarray: O(N2)

(15)

References

1 Jerome H. Friedman and Nicholas I. Fisher. Bump hunting in

high-dimensional data. Statistics and Computing, 9(2):123–143, April 1999.

2 Michael H. Goldwasser, Ming-Yang Kao, and Hsueh-I Lu.

Linear-time algorithms for computing maximum-density sequence segments with bioinformatics applications. CoRR, cs.DS/0207026, 2002.

Références

Documents relatifs

We have shared large cross-domain knowledge bases [4], enriched them with linguistic and distributional information [7, 5, 2] and created contin- uous evolution feedback loops

(C unique (product i ) + C reuse (product i )) (1) The importance of knowing the number of products of a feature diagram is recognized by many commercial tools for SPL development,

In particular, in Lemma 4.1 we prove the limit (1.1) when A is the number of marked nodes, and we deduce the convergence of a critical GW tree conditioned on the number of marked

From a mathematical point of view, this problem can be formulated as a time optimal control problem with state and control constraints, where the control variable (T 1 , T 2 , ρ 1 , ρ

Let X be an algebraic curve defined over a finite field F q and let G be a smooth affine group scheme over X with connected fibers whose generic fiber is semisimple and

They showed that the problem is NP -complete in the strong sense even if each product appears in at most three shops and each shop sells exactly three products, as well as in the

Throughout this paper, we shall use the following notations: N denotes the set of the positive integers, π( x ) denotes the number of the prime numbers not exceeding x, and p i

3 – We present 3 other script files to binarize this image (threshold=128).. b) The second script uses the Matlab function find (to avoid using “for”. loops and to reduce run