A Practical Guide to Experimentation (and Benchmarking)


(1)

HAL Id: hal-01959453

https://hal.inria.fr/hal-01959453

Submitted on 18 Dec 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


A Practical Guide to Experimentation (and Benchmarking)

Nikolaus Hansen

To cite this version:

Nikolaus Hansen. A Practical Guide to Experimentation (and Benchmarking). GECCO ’18 Companion: Proceedings of the Genetic and Evolutionary Computation Conference Companion, Jul 2018, Kyoto, Japan. ⟨hal-01959453⟩

(2)

A Practical Guide to Experimentation (and Benchmarking)

Nikolaus Hansen, Inria

Research Centre Saclay, CMAP, Ecole polytechnique, Université Paris-Saclay

Installing IPython is not a prerequisite to follow the tutorial.

For downloading the material, see

slides: http://www.cmap.polytechnique.fr/~nikolaus.hansen/gecco2018-experimentation-guide-slides.pdf at http://www.cmap.polytechnique.fr/~nikolaus.hansen/invitedtalks.html

code: https://github.com/nikohansen/GECCO-2018-experimentation-guide-notebooks


(3)

Overview

Scientific experimentation

Invariance

Statistical Analysis

A practical experimentation session

Approaching an unknown problem

Performance Assessment

What to measure

How to display

Aggregation

Empirical distributions

Do not hesitate to ask questions!

(4)

Why Experimentation?

The behaviour of many if not most interesting algorithms is

not amenable to a (full) theoretical analysis, even when applied to simple problems
 calling for an alternative to theory for investigation

not fully comprehensible or even predictable without (extensive) empirical examination, even on simple problems
 comprehension is the main driving force for scientific progress

 “If it disagrees with experiment, it's wrong. And that simple statement is the key to science.” — R. Feynman

Virtually all algorithms have parameters

like most (physical/biological/…) models in science, we rarely have explicit knowledge about the “right” choice; this is a big obstacle in designing and benchmarking algorithms

We are interested in solving black-box optimisation problems

which may be “arbitrarily” complex and (by definition) not well-understood

(5)

Scientific Experimentation (dos and don’ts)

What is the aim? Answer a question, ideally quickly (minutes, seconds) and comprehensively.

• what is most helpful to do?

• what is better to avoid?

consider in advance what the question is and in which way the experiment can answer it

do not (blindly) trust in what one needs to rely upon (code, claims, …) without good reasons
 check/test “everything” yourself; practice stress testing (e.g. weird parameter settings), which also boosts understanding and is one key element for success
 interpreted/scripted languages have an advantage here
 see Why Most Published Research Findings Are False [Ioannidis 2005]

practice making predictions of the (possible) outcome(s)
 to develop a mental model of the object of interest, to practice being proven wrong, to overcome confirmation bias

run rather many than few experiments, iteratively; practice online experimentation (see demonstration)
 to run many experiments they must be quick to implement and run, ideally seconds rather than minutes (start with small dimension/budget); this develops a feeling for the effect of setup changes


(7)

Scientific Experimentation (dos and don’ts)


run any experiment at least twice
 assuming that the outcome is stochastic, this gives an estimator of variation/dispersion/variance

display: the more the better, the better the better
 figures are intuition pumps (not only for presentation or publication); it is hardly possible to overestimate the value of a good figure
 data are the only way experimentation can help to answer questions, therefore look at them, study them carefully!

don’t make minimising CPU-time a primary objective
 avoid spending time on implementation details to tweak performance; prioritize code clarity (minimize the time to change, debug, and maintain code)
 yet code optimization may be necessary to run experiments efficiently

(8)

Scientific Experimentation (dos and don’ts)


Testing Heuristics: We Have it All Wrong [Hooker 1995]

“The emphasis on competition is fundamentally anti-intellectual and does not build the sort of insight that in the long run is conducive to more effective algorithms”

It is usually (much) more important to understand why algorithm A performs badly on function f than to make algorithm A faster for unknown, unclear or trivial reasons
 mainly because an algorithm is applied to unknown functions, not to f, and the “why” makes it possible to predict the effect of design changes

there are many devils in the details; results or their interpretation may crucially depend on simple or intricate bugs or subtleties
 yet another reason to run many (slightly) different experiments; check limit settings to give consistent results

(9)

Scientific Experimentation (dos and don’ts)


Invariance is a very powerful, almost indispensable tool

(10)

Invariance: binary variables

Assigning 0/1 (for example, minimize Σᵢ xᵢ vs Σᵢ (1 − xᵢ))

is an “arbitrary” and “trivial” encoding choice and

amounts to the affine linear transformation xᵢ ↦ 1 − xᵢ
 this transformation or the identity is the coding choice in each variable
 in continuous domain: a norm-preserving (isotropic, “rigid”) transformation

does not change the function “structure”
 all level sets { x | f(x) = const } have the same size (number of elements, same volume)
 the same neighbourhood
 no variable dependencies are introduced (or removed)

Instead of 1 function, we now consider 2**n different but equivalent functions
 2**n is non-trivial, it is the size of the search space itself

(11)

Invariance: binary variables

Permutation of variables

is another “arbitrary” and “trivial” encoding choice and

is another norm-preserving transformation

does not change the function “structure” (as above)

consider one-point vs two-point crossover: which is better choice?

only two-point crossover is invariant to variable permutation

Instead of 1 function, we now consider n! different but equivalent functions

n! ≫ 2**n is much larger than the size of the search space

(12)

f = h, f = g₁ ∘ h, f = g₂ ∘ h: three functions belonging to the same equivalence class

Invariance Under Order Preserving Transformations

A function-value free search algorithm is invariant under the transformation h ↦ g ∘ h with any order preserving (strictly increasing) g.

Invariances make

• observations meaningful, as a rigorous notion of generalization

• algorithms predictable and/or “robust”
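This can be checked empirically: a function-value free (comparison-based) algorithm only consumes the ranking of candidate solutions, and any strictly increasing g leaves that ranking unchanged. A minimal sketch; the sphere function and the particular g below are illustrative choices, not from the tutorial material:

```python
import math
import random

# a function-value free algorithm only consumes the ranking of candidates;
# composing f with a strictly increasing g cannot change that ranking
random.seed(1)

f = lambda x: sum(v * v for v in x)        # sphere function, for illustration
g_of_f = lambda x: math.expm1(3 * f(x))    # g(t) = exp(3t) - 1, order preserving

population = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(20)]
ranking_f = sorted(range(len(population)), key=lambda i: f(population[i]))
ranking_g = sorted(range(len(population)), key=lambda i: g_of_f(population[i]))

assert ranking_f == ranking_g  # identical rankings, hence identical behaviour
```

Since the two rankings coincide for every population, a comparison-based algorithm takes exactly the same decisions on f and on g ∘ f.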

(13)

Invariance Under Rigid Search Space Transformations

[figure: f-level sets in dimension 2 of f = h_Rast (separable)]

for example, invariance under search space rotation (separable vs non-separable)

(14)

[figure: f-level sets in dimension 2 of f = h_Rast and of the rotated f = h_Rast ∘ R (non-separable)]

for example, invariance under search space rotation (separable vs non-separable)

(15)

Invariance

Consequently, invariance is of greatest importance for the assessment of search algorithms.

(16)

Statistical Analysis

“The first principle is that you must not fool yourself, and you are the easiest person to fool. So you have to be very careful about that. After you've not fooled yourself, it's easy not to fool other [scientist]s. You just have to be honest in a conventional way after that.”

— Richard P. Feynman

(17)

Statistical Analysis

“experimental results lacking proper statistical analysis must be considered anecdotal at best, or even wholly inaccurate”

— M. Wineberg

Do you agree (sounds about right) or disagree (is taken a little over the top) with the quote?

an experimental result (shown are all data obtained):

Do we (even) need a statistical analysis?

(18)

Statistical Significance: General Procedure

first, check the relevance of the result, for example of the difference which is to be tested for statistical significance
 this also means: do not do explorative testing (e.g. test all pairwise combinations); any ever so small difference can be made statistically significant with a simple trick, but not significant in the sense of important or meaningful

prefer “nonparametric” methods
 not assuming that the data come from a parametrised family of probability distributions

p-value = significance level = probability of a false positive outcome, given H0 is true
 smaller p-values are better
 <0.1% or <1% or <5% is usually considered statistically significant

given a found/observed p-value, fewer data are better
 more data (almost inevitably) lead to smaller p-values, hence to achieve the same p-value with fewer data, the between-difference must be larger compared to the within-variation

(19)

[figure: example of a test-statistic distribution density given H0, with the false positive error area marked]


(21)

Statistical Significance: Methods

use the rank-sum test (aka Wilcoxon or Mann-Whitney U test)

Assumption: all observations (data values) are obtained independently and no equal values are observed
 the “lack” of necessary preconditions is the main reason to use the rank-sum test
 even a few equal values are not detrimental
 the rank-sum test is nearly as efficient as the t-test, which requires normal distributions

Null hypothesis (nothing relevant is observed): Pr(x < y) = Pr(y < x)
 H0: the probability to be greater or smaller (better or worse) is the same
 the aim is to be able to reject the null hypothesis

Procedure: compute the sum of ranks in the ranking of all (combined) data values

Outcome: a p-value
 the probability that the observed or a more extreme data set was generated under the null hypothesis; the probability to mistakenly reject the null hypothesis
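The procedure can be spelled out in a few lines of Python. The sketch below computes the exact two-sided rank-sum p-value by enumerating all possible rank assignments; it is practical only for small samples without ties (in practice one would call, e.g., scipy.stats.mannwhitneyu):

```python
from itertools import combinations

def rank_sum_p_value(x, y):
    """Exact two-sided rank-sum (Wilcoxon / Mann-Whitney) p-value by full
    enumeration of rank assignments; assumes independent data without ties."""
    pooled = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    observed = sum(rank[v] for v in x)
    n = len(x) + len(y)
    expected = len(x) * (n + 1) / 2          # expected rank sum under H0
    dev = abs(observed - expected)
    # under H0 every subset of ranks is equally likely for group x
    sums = [sum(c) for c in combinations(range(1, n + 1), len(x))]
    return sum(1 for s in sums if abs(s - expected) >= dev) / len(sums)

# fully separated samples of sizes 5+5: the smallest achievable p-value, 2/252
p = rank_sum_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(p)  # about 0.0079, i.e. statistically significant at the 1% level
```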

(22)

Statistical Significance: How many data do we need?

aka test efficiency

assumption: the data are fully “separated”, that is, ∀ i, j: xᵢ < yⱼ or ∀ i, j: xᵢ > yⱼ (two-sided); then the smallest achievable p-value is

p_min = 2 ∏_{i=1..n1} i / (i + n2) = 2 · n1! · n2! / (n1 + n2)!

observation: adding 2 data points in each group gains about one additional order of magnitude in p_min

use the Bonferroni correction for multiple tests
 simple and conservative: multiply the computed p-value by the number of tests
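A quick numerical check of this formula (the helper name p_min is ours):

```python
from math import comb

def p_min(n1, n2):
    """Smallest achievable two-sided rank-sum p-value for fully separated
    samples of sizes n1 and n2: 2 * n1! * n2! / (n1 + n2)! = 2 / C(n1+n2, n1)."""
    return 2 / comb(n1 + n2, n1)

print(p_min(5, 5))  # about 0.0079: 5+5 data points can reach p < 1%
print(p_min(4, 4))  # about 0.029: 4+4 can reach p < 5% but not p < 1%
print(p_min(7, 7) / p_min(5, 5))  # about 0.07: 2 more data points per group,
                                  # roughly one order of magnitude smaller p
```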

(23)

Statistical Significance: How many data do we need?

• In the best case: at least ten (two times five), and two times nine is plenty
 minimum number of data to possibly get two-sided p < 1%: 5+5 or 4+6 or 3+9 or 2+19 or 1+200
 and p < 5%: 4+4 or 3+5 or 2+8 or 1+40

• I often take two times 11, 31, or 51
 median, 5%-tile and 95%-tile are easily accessible with 11 or 31 or 51… data

• Too many data make statistical significance meaningless

(24)

Statistical Significance: How many data do we need?


[figure: two empirical distributions whose overlaid summary statistics are nearly identical, e.g. the fraction of x below median(y) is 51.6% and the fraction of y above median(x) is 51.9%]

(25)

Statistical Analysis

“experimental results lacking proper statistical analysis must be considered anecdotal at best, or even wholly inaccurate”

— M. Wineberg

Do you agree (sounds about right) or disagree (is taken a little over the top) with the quote?

an experimental result (shown are all data obtained):

Do we (even) need a statistical analysis?

(26)

Jupyter IPython notebook

(27)
(28)

Questions?

(29)

see https://github.com/nikohansen/GECCO-2018-experimentation-guide-notebooks

• Demonstrations

• A somewhat typical working mode

• A parameter investigation

Jupyter IPython notebook

(30)

Approaching an unknown problem

• Problem/variable encoding
 for example log scale vs linear scale vs quadratic transformation

• Fitness formulation
 for example, √(Σᵢ |xᵢ|) and √(Σᵢ xᵢ²) have the same optimal (minimal) solution but may be very differently “optimizable”

• Try to locally improve a given (good) solution

• Start local search from different initial solutions
 ending up always in different solutions? Or always in the same?

• Apply a “global search” setting

• see also http://cma.gforge.inria.fr/cmaes_sourcecode_page.html#practical
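The multistart idea can be sketched in code: run a local search from different initial solutions and check whether the end points coincide. Both the compass-search routine and the Rastrigin test function below are illustrative choices, not part of the tutorial material:

```python
import math

def rastrigin(x):
    # multimodal test function: local optima near every integer grid point
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)

def local_search(f, x0, step=0.1, tol=1e-8):
    """Coordinate-wise pattern (compass) search with step-size halving."""
    x = list(x0)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                y = list(x)
                y[i] += d
                if f(y) < f(x):
                    x, improved = y, True
        if not improved:
            step /= 2
    return x

# different initial solutions end up in different solutions
print(local_search(rastrigin, [0.2, 0.2]))  # near the global optimum [0, 0]
print(local_search(rastrigin, [3.2, 3.2]))  # near [3, 3]: a local optimum only
```

Always ending up in (nearly) the same solution would instead suggest an effectively unimodal problem.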

(31)

Questions?

(32)

Performance Assessment

• methodology: run an algorithm on a set of test

functions and extract performance measures from the generated data

choice of measure and aggregation

• display

subtle display changes can make a huge difference

• there are surprisingly many devils in the details

(33)

Why do we want to measure performance?

• compare algorithms and algorithm selection (the obvious)

ideally we want standardized comparisons

• regression testing after (small) changes

as we may expect (small) changes in behaviour, 
 conventional regression testing may not work

• understanding of algorithms

to improve algorithms

non-standard experimentation is often preferable or necessary

(34)

Measuring Performance

Empirically, convergence graphs are all we have to start with

the right presentation is important!

(35)

Displaying Three Runs

not like this (it’s unfortunately not an uncommon picture). Why not, what’s wrong with it?

(36)


(38)

Displaying 51 Runs

(39)

There is more to display than convergence graphs

(40)

Aggregation: Which Statistics?


(45)

Implications

• use the median as summary datum
 unless there are good reasons for a different statistic; out of practicality: use an odd number of repetitions

• more generally: use quantiles as summary data
 for example, out of 15 data: the 2nd, 8th, and 14th values represent the 10%, 50%, and 90%-tile
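As a concrete sketch of the 15-data example (the helper name is ours):

```python
def quantile_summary(data):
    """Out of 15 repetitions, the sorted 2nd, 8th, and 14th values
    serve as 10%, 50% (median), and 90%-tile summary data."""
    s = sorted(data)
    assert len(s) == 15, "this index choice is specific to 15 repetitions"
    return s[1], s[7], s[13]

print(quantile_summary(range(100, 1600, 100)))  # (200, 800, 1400)
```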

(46)

Examples

caveat: the range display with error bars fails if, for example, only 30% of all runs “converge"

How can we deal with large variations?

(47)

Aggregation: Fixed Budget vs Fixed Target

• for aggregation we need comparable data

• missing data: problematic when many runs lead to missing data

(48)

Performance Measures for Evaluation

Generally, a performance measure should be

• quantitative, on the ratio scale (the highest possible)
 statements like “algorithm A is two times better than algorithm B”, i.e. “performance(B) / performance(A) = 1/2 = 0.5”, should be meaningful

• assuming a wide range of values

• meaningful (interpretable) with regard to the real world

transfer the measure from benchmarking to real world

runtime or first hitting time is the prime candidate

(49)


(50)

Fixed Budget vs Fixed Target

Fixed budget => measuring/displaying final/best f-values

Fixed target => measuring/displaying needed budgets (#evaluations)

Numbers of function evaluations

are quantitatively comparable (on a ratio scale)
 ratio scale: “A is 3.5 times faster than B”, A/B = 1/3.5, is a meaningful notion

the measurement itself is interpretable independently of the function
 time remains time regardless of the underlying problem: 3 times faster is 3 times faster on every problem

there is a clever way to account for missing data
 via restarts

=> fixed target is (much) preferable

(51)


The Problem of Missing Values

(52)

The Problem of Missing Values

how can we compare the following two algorithms?

number of evaluations

function (or indicator) value

(53)

The Problem of Missing Values

Consider simulated (artificial) restarts using the given independent runs.

Caveat: the performance of algorithm A then critically depends on its termination methods (before hitting the target).

(54)

The Problem of Missing Values

The expected runtime (ERT, aka SP2, aRT) to hit a target value, measured in #evaluations, is computed (estimated) as

ERT = #evaluations (until the target is hit) / #successes
    = avg(evals_succ) + (N_unsucc / N_succ) × avg(evals_unsucc)
    ≈ avg(evals_succ) + (N_unsucc / N_succ) × avg(evals_succ)
    = ((N_succ + N_unsucc) / N_succ) × avg(evals_succ)
    = avg(evals_succ) / success rate

where N_unsucc / N_succ is the odds ratio and unsuccessful runs count (only) in the numerator. ERT is defined (only) for #successes > 0. The last three lines are aka Q-measure or SP1 (success performance).
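As code, the estimator is one line: the total number of evaluations spent, divided by the number of successes (a hypothetical helper, not taken from the COCO source):

```python
def expected_runtime(evals_succ, evals_unsucc):
    """ERT estimate from per-run evaluation counts; defined only if
    at least one run was successful."""
    if not evals_succ:
        raise ValueError("ERT requires #successes > 0")
    return (sum(evals_succ) + sum(evals_unsucc)) / len(evals_succ)

# two successful runs (100 and 200 evaluations), one unsuccessful (500):
print(expected_runtime([100, 200], [500]))  # 400.0 = avg(150) + (1/2) * 500
```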

(55)

Empirical Distribution Functions

• Empirical cumulative distribution functions (ECDF, or in short, empirical distributions) are arguably the single most powerful tool to “aggregate” data in a display.

(56)

a convergence graph

(57)

a convergence graph with first hitting times; the lower envelope (black) is a monotonous graph

(58)

another convergence graph

(59)

another convergence graph with hitting time

(60)

a target value delivers two data points (or possibly missing values)

(61)

another target value delivers two more data points

(62)

the ECDF with four steps (between 0 and 1)

(63)

reconstructing a single run

(64)

50 equally spaced targets

(65)
(66)

the ECDF recovers the monotonous graph

(67)

the ECDF recovers the monotonous graph, discretised and flipped

(69)

the ECDF recovers the monotonous graph, discretised and flipped; the area over the ECDF curve is the average runtime (the geometric average if the x-axis is in log scale)
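The construction above reduces to a few lines: collect the first hitting times of all (run, target) pairs and count, for each budget, the fraction already hit. The helper below is a sketch, with None standing in for a missing value:

```python
def runtime_ecdf(hitting_times, budgets):
    """Empirical CDF of runtimes: for each budget, the fraction of
    (run, target) pairs whose first hitting time is within the budget."""
    n = len(hitting_times)
    return [sum(1 for t in hitting_times if t is not None and t <= b) / n
            for b in budgets]

# four (run, target) pairs; one target was never hit (missing value):
print(runtime_ecdf([10, 100, None, 50], [1, 10, 60, 1000]))
# [0.0, 0.25, 0.5, 0.75]
```

Note that missing values need no special treatment here: they simply never count as hit, so the ECDF plateaus below 1.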

(70)

Data and Performance Profiles

• so-called data profiles (Moré and Wild 2009) are empirical distributions of runtimes [# evaluations] to achieve a given single target
 usually divided by dimension + 1

• so-called performance profiles (Dolan and Moré 2002) are empirical distributions of relative runtimes [# evaluations] to achieve a given single target
 normalized by the runtime of the fastest algorithm on the respective problem

(71)

Benchmarking with COCO

COCO — Comparing Continuous Optimisers

is a (software) platform for comparing continuous optimisers in a black-box scenario

https://github.com/numbbo/coco

automatises the tedious and repetitive task of benchmarking numerical optimisation algorithms in a black-box setting

advantage: saves time and prevents common (and not so common) pitfalls

COCO provides

experimental and measurement methodology

main decision: what is the end point of measurement

suites of benchmark functions

single objective, bi-objective, noisy, constrained (in beta stage)

data of already benchmarked algorithms to compare with

(72)

COCO: Installation and Benchmarking in Python

(73)

Benchmark Functions

should be

• comprehensible

• difficult to defeat by “cheating”

examples: optimum in zero, separable

• scalable with the input dimension

• reasonably quick to evaluate

e.g. 12-36h for one full experiment

• reflect reality

specifically, we model well-identified difficulties encountered also in real-world problems

(74)

The COCO Benchmarking Methodology

• budget-free

a larger budget means more data to investigate; any budget is comparable; termination and restarts are or become relevant

• using runtime as (almost) single performance measure

measured in number of function evaluations

• runtimes are aggregated

• in empirical (cumulative) distribution functions

• by taking averages

geometric average when aggregating over different problems

(75)
(76)

Using Theory

“In the course of your work, you will from time to time encounter the situation where the facts and the theory do not coincide. In such circumstances, young gentlemen, it is my earnest advice to respect the facts.”

— Igor Sikorsky, airplane and helicopter designer

(77)

Using Theory in Experimentation

• shape our expectations and objectives

• debugging / consistency checks

theory may tell us what we expect to see

• knowing the limits (optimal bounds)
 for example, we cannot converge faster than optimally; trying to improve on that is a waste of time

• utilize invariance
 it may be possible to design a much simpler experiment and get to the same or a stronger conclusion by invariance considerations; a change of coordinate system is a powerful tool

(78)

FIN
