• Aucun résultat trouvé

Histograms, Regressograms and Charts

Dans le document Contributions Statistics (Page 40-47)

Exploratory Statistical Techniques

2.5 Histograms, Regressograms and Charts

2.5.1 Histogram

Histograms are a graphical representation of the whole distribution of a dataset. In the time of text based terminals histograms were easily built up. The stem and leaf plot is a text based variation of a histogram, which allows to pick up some location parameters like quantiles, median etc.

BOxplot Histogram

+ +

I I

I I

+ +

FIGURE 2.5. Boxplot and relative frequency histogram of 1000 data sampled from N(O,I) (binwidth = 0.1, origin = 0).

34 Exploratory Statistical Techniques

,

'

nl.' "'_1 -,.11 1 •. ' '",., ,.... .a.'

FIGURE 2.6. Histogram of the sixth variable of the Swiss banknote dataset. We see four histograms with a binwidth of 1.5, 1.0, 0.3 and 0.065. The bimodal structure is hidden if the binwidth is too big (left upper picture).

FIGURE 2.7. Histogram of the sixth variable of the Swiss banknote dataset. We see four histograms with a the same binwidth of 1.0, but the origin has different values 0.1, 0.35, 0.6 and 0.85. The bimodal stru,c-ture is hidden in the right upper picstru,c-ture (origin is 0.6).

..

a

Exploratory Statistical Techniques 35

.. a

a

..

a

a~~--~----'---~r----r----r----r----~

137.0 138.0 139.0 140.0 lU.O 142.0 143.0 1".0 PH[.lI

FIGURE 2.8. Averaged shifted histogram of the sixth variable of the Swiss banknote dataset. The histogram has the binwidth of 0.75 and is composed from 5 histograms with the origins 0.00, 0.15, 0.30, 0.45 and 0.60. The bimodal structure is now clearly visible and we could give some speculation if the structure is even trimodal.

A histogram provides more detailed informations than a boxplot as can be seen in Figure 2.5. It shows us how many data can be found at which loca-tion. Histograms appear in different forms, e.g. absolute frequency histogram, relative frequency histograms and other forms.

The construction of a histogram can be expressed as in HardIe (1991):

• Divide the real line into bins

Bj = [XO

+

(j - l)h, Xo

+

jh) j an integer with binwidth h and origin Xo.

• Count how many data fall into each bin

• Depending on the wanted type of histogram we have to multiply with a constant. If the histogram represents a density function we write the histogram as

Computationally it will be done in the following way:

36 Exploratory Statistical Techniques

1. Bin the data Xi and get the bincenters iJi and the bincounts Ci:

2. Ensure that for empty bins a 0 will be inserted 3. Double your data:

iJ"

C"

iJ"

C"

4. Add and subtract now the half of the binwidth 6:

5. Plot this dataset

iJ

1 - 6/2 C1

iJ

1

+

6/2 C1

iJ" -

6/2 C"

iJ" +

6/2 C"

Two problems arise in connection with the drawing of histograms:

• Bow big should a bar (binwidth) of a histogram be?

• Where should we put the origin of the histogram?

To illustrate the problem see Figure 2.6 which shows the histogram for dif-ferent bandwidths for a univariate dataset. In Figure 2.7 we can see what happens if we move the origin of the histogram.

The problem of the origin in the case of density estimation can be solved with average shifted histograms suggested by Scott (1985). The idea is to use a smaller binwidth and to distribute a observation over more than one bin.

One practical solution would be to calculate the histogram for the different origins and to take z the mean of the values of the histogram at every point

Exploratory Statistical Techniques 37 which is a specialized form given in Scott (1985). We can compare the Figure 2.8 with the Figure 2.7 and see that the problem of the origin is solved, but the problem of the binwidth remains. In the beginning of computational statistics it was suggested that the number of bins is proportional to

..;n.

In S-Plus the number of bins is proportional to log2(n). A calculation of the approximated mean integrated squared error (AMISE) of the histogram (HardIe 1991) leads to

with lithe derivative of the unknown density function and

II .

II~ the squared L2-norm of the density function. A minimization of the AM ISE leads to an

"optimal" bin width of

In practice this choice is not very helpful since we would have to know the unknown density function, but we could plug in a reference density for the unknown density function. Silverman (1986) suggested the gaussian distribu-tion.

Another approach is to show a lot of different histograms and to change the bin width interactively with a slider.

2.5.2 Regressogram

In the regressogram we try to estimate the unknown regression function by a stepwise constant function. The algorithm is similar to the algorithm for density estimation.

As an example see Figure 2.9 for a regressogram. Since the problems with them are mainly the same as for the histograms we will not go into details here. The regression estimator is very similar to the Nadaraya-Watson esti-mator.

2.5.3 Charts

Some charts for the plotting of noncontinuous variables have to be mentioned too. Barcharts and piech arts are used to represent such variables (see Figure 2.10 and Figure 2.11). Cleveland (1985) introduced dotcharts (see Figure 2.12) as an alternative to the pie and the barcharts.

38 Exploratory Statistical Techniques

o "

o

o

"

0.0 0.' 1.0 I.S 2 . 0 2.5

LSTAT (-10)

3.0 3.' 6 . 0

FIGURE 2.9. Regressogram of the variable LSTAT (percent lower sta-tus) and MEDV (median value of owner-occupied homes in thousand of dollars) of the Boston housing dataset. The binwidth is 3 and the origin

o.

FIGURE 2.10. A barchart of the frequency of values of the variable exp(RAD) of the Boston housing dataset (accessibility to radial high-ways).

Exploratory Statistical Techniques 39

FIGURE 2.11. A piech art of the frequency of values of the variable exp(RAD) of the Boston housing dataset (accessibility to radial high-ways).

1 ... -0 ... . 2 ... ·0 ... · ... ..

4 ... · ... ·0 ... · ... · ... · ..

5 ... -0 ... ..

6 ... -0 ... ..

7 ... -0 ... . 8 ... -& ... _ ... _ ... _ ... . 24 ... · ... 0 .. ..

20 60 60 100 120

FIGURE 2.12. A dotchart of the frequency of values of the variable exp(RAD) of the Boston housing dataset (accessibility to radial high-ways).

40 Exploratory Statistical Techniques

Dans le document Contributions Statistics (Page 40-47)