Probability distributions MATHEMATICS


SALES AND MARKETING Department

MATHEMATICS

3rd Semester

Probability distributions

LESSONS

Online document : on http://jff-dut-tc.weebly.com section DUT Maths S3.


INTRODUCTION AND HISTORY

LESSONS

1 DISCRETE PROBABILITY DISTRIBUTIONS

1.1 GENERAL CASE: REMINDERS

1.2 HYPERGEOMETRIC DISTRIBUTION

1.3 BINOMIAL DISTRIBUTION

1.4 POISSON'S DISTRIBUTION

2 A CONTINUOUS PROBABILITY DISTRIBUTION: THE NORMAL LAW

2.1 CONVERGENCE OF DISCRETE LAWS

2.2 CONTINUOUS REAL RANDOM VARIABLE

2.3 THE NORMAL LAW (OR LAPLACE'S LAW)

3 SAMPLING DISTRIBUTIONS

3.1 INTRODUCTION

3.2 RANDOM SAMPLING

3.3 SAMPLING DISTRIBUTION OF MEANS

3.4 SAMPLING DISTRIBUTION OF PROPORTIONS

4 ESTIMATES (STATISTICAL INFERENCE)

4.1 POINT ESTIMATE

4.2 ESTIMATE BY A CONFIDENCE INTERVAL

5 STATISTICAL HYPOTHESIS TESTING

5.1 ADEQUACY χ² TEST (PEARSON'S TEST)

5.2 CONFORMANCE TEST OF A MEAN, OF A PROPORTION

5.3 THE RISKS (NON REQUIRED)


INTRODUCTION AND HISTORY

A quick story of the normal law

In the late XVIIth century, Jakob Bernoulli found the way to the binomial law by calculating the chances of success when a given experiment is performed several times. His manual calculations became horribly complicated for big numbers, because of the factorials involved. In the first half of the XVIIIth century, Abraham de Moivre worked on the calculus of chance and discovered a formula that gives (approximately) the factorial of a natural number.

Stirling-Moivre formula: $n! \approx \sqrt{2\pi n}\,\left(\dfrac{n}{e}\right)^{n}$ (with n > 8, deviation < 1 %; as n increases, the percentage of deviation decreases)

Afterwards Leonhard Euler improved this formula, proving the following equality: $n! = \int_{0}^{+\infty} e^{-x}\,x^{n}\,dx$. The function within the integral shows a typical "bell" curve, whose vertex is the point $\left(n \,;\, \left(\dfrac{n}{e}\right)^{n}\right)$. Pierre Simon de Laplace gave a new proof of this formula, using Euler's works.

With Euler, and then with Laplace and Legendre, a new theory was developed: the theory of errors (born to simplify the work of astronomers). Among several fluctuating measurements of the same object or phenomenon (fluctuations due to a lack of sharpness, dilatation of materials, variable pressure in the atmosphere, …), what unique value could be considered as the true one? Thus, laws of distribution had to be created: distributions of values and of sample means. There are infinitely many such distributions of values, one for each possible concrete example. The general case of the theory of errors is still an unsolved problem today.

Between 1790 and 1800, Carl Friedrich Gauss, the "prince of mathematicians", applied the least squares method (invented by Laplace) to the theory of errors, arguing that the best representative value for a data series xi is the value x that minimises Σ(xi − x)². In this way, and starting from simple distributions, x turns out to be the arithmetic mean of the xi; this result also holds for a bell distribution (which is generally typical of a distribution of sample means, with samples of the same size taken from the same parent population). These works are the only ones in which Gauss mentioned the now famous "bell curve", but he never drew one and its function already existed; that is why calling it a Gauss curve is irrelevant.

Laplace soon objected to Gauss's work: if a bell distribution leads to a bell distribution of sample means, nothing is said about the numerous other concrete situations whose populations don't behave this way (bell curve). According to Laplace, Gauss's work was only theoretical thinking and, worse, circular (bell leads to bell… because it's bell!). In the 1810s, he demonstrated that if the values are uniformly distributed on an interval (a constant probability density over an interval whose mean is µ), then the distribution of the means of n-sized samples (n big enough) is a bell one, whose mean is µ and whose standard deviation is about µ/√(3n).


Then, he enunciated a theorem that is the cornerstone of statistical inference:

Laplace's theorem (nowadays central limit theorem):

Whatever the distribution of the values, for n big enough, the distribution of the means of the n-sized samples is normal (bell curve); its mean is the arithmetic mean of the values, and its standard deviation can be easily calculated by a formula (which always looks like the one given above).

In this way, he built Laplace's law (that is, the normal law) and discovered its fundamental properties.

The profession of statistician only appeared in the XIXth century (for many purposes, people needed to know how a population behaves). The most famous and prolific statistician of that time was the Belgian Adolphe Quételet, who published an analysis of Laplace's philosophy and numerous concrete data series showing bell-shaped distributions (for instance, the "chest sizes of 4000 Scottish soldiers", whose distribution fits the theoretical normal curve very well. Indeed, the chest size of a man is the sum of several random and independent factors: genetics, education, feeding, activity, … and Laplace's theorem asserts that the distribution of a sum, like that of a mean, is normal!). It should also be noted that Quételet was the first to draw one of these famous normal bell curves (neither Gauss nor Laplace felt the need to draw one while thinking about theory).

Everything isn't necessarily normal

During the second half of the XIXth century, statisticians showed that a lot of data series are in fact not normally distributed (the symmetry of the normal law isn't always representative of what happens in our complex world).

Consequently, other continuous or discrete laws were created in order to model several concrete situations.

For instance:

* Poisson's law, quite asymmetrical, in case of rare events,

* Pareto's law for income distributions, asymmetrical as well,

* Exponential law and others based on the same model, for life lengths, asymmetrical again, …

Other laws had been found before the normal law was created:

* Uniform law, the probability of each value is the same (throw of a die; choose a number between 0 and 1),

* Binomial law (from Bernoulli),

* Geometric law, dealing with the number of attempts until your first success (in binomial situations),

* Hypergeometric law, similar to the binomial law, but in which repetition isn't allowed, ...

In the early XXth century, laws of higher order were built, dealing with more than one variable and generally involving degrees of freedom:

* Student's law (sampling distribution of means, built with two variables: mean and standard deviation)

* χ² law, "Chi-squared" (evaluates the differences between a theoretical law and a real distribution)

At this time, English statisticians like Pearson, Student (pseudonym of William Sealy Gosset) or Fisher began to develop a true methodology in statistics, that is to say a well-formalized theory of inference (drawing conclusions about a population while knowing only one or more of its samples), by means of creating new probability laws to describe phenomena.

Between 1900 and 1950, they imposed an "objectivist" or "frequentist" interpretation of the concept of probability. Since the 1950s, a school of thought known as "neo-Bayesian" has appeared, arguing that statistical inference shouldn't be based on the collected data alone, but also needs the knowledge and the use of underlying probabilistic models: it's the "subjectivist" school.

Calculation tools are increasingly powerful

Computing helped a new discipline take off: "multidimensional data analysis". It consists in describing, sorting and simplifying large sets of collected data (e.g.: a survey of 3000 people, each of whom gives 80 answers). The observed and cross-tabulated results may suggest laws (already existing or not), models or explanations, which may spare statisticians from having to compare the data with arbitrary, previously created laws.


PROBABILITY DISTRIBUTIONS - LESSONS

1 Discrete probability distributions

1.1 General case: reminders

Let's consider an object or a set of objects and conceive a random experiment on it, whose outcomes form a sample space partitioned into a certain number of events.

e.g.:

objects: two dice

experiment: roll them and add both numbers

sample space: Ω = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} (non equally likely outcomes)

partition of Ω: E1 : "less than 7" ; E2 : "from 7 to 10" ; E3 : "11 or 12"

Each event Ei can be associated with a value xi (here a gain), which is random because the upcoming outcome is unpredictable; the variable that takes these values xi is called a random variable, denoted X.

events:            E1       E2       E3
gain X (€):        -3       1        5

For each value of the gain, we have to be able to calculate the probability of the associated event.

This is called "getting the probability distribution of X".

gain X (€):        -3       1        5
pi = p(X = xi):    15/36    18/36    3/36

Interpretation and purpose of these probabilities:

If you play this game many times, your numbers of losses and wins can be estimated from the proportions given by these probabilities.

With our example: every 36 games, you will have on average 15 losses of €3, 18 wins of €1 and 3 wins of €5; combining them gives an overall loss of €12, on average, every 36 games.

This overall result can be expressed as an average per game: 12/36 ≈ €0.33.

In the long run, you will have an average loss of approximately 33 cents per game.

This value is called the expected value of X: E(X) = −12/36 ≈ −0.33 (a loss counts as a negative gain).

In general, the expected value is:

$E(X) = \sum_{i=1}^{n} p_i\,x_i$

where n is the number of possible values of X.

These long-term forecasts allow us to regard the previous table as a statistical series in which the probabilities play the role of real frequencies of occurrence of the gains (though they are only "ideal" frequencies). In this context, the table can be interpreted from a statistical angle, leading for instance to the calculation of the variance V(X) and of the standard deviation σ(X):

$V(X) = \sum_{i=1}^{n} p_i\,x_i^{2} - E(X)^{2} = E(X^{2}) - E(X)^{2} \quad ; \quad \sigma(X) = \sqrt{V(X)}$
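The two-dice example of this section can be checked numerically. The sketch below is only an illustration: the function name `gain` and the variable names are assumptions, not part of the lesson. It enumerates the 36 equally likely pairs of faces, rebuilds the probability distribution of X, and applies the formulas for E(X), V(X) and σ(X).

```python
from fractions import Fraction
from itertools import product
from math import sqrt

def gain(total):
    """Gain in euros for the sum of two dice, following the partition E1, E2, E3."""
    if total < 7:        # E1: "less than 7"
        return -3
    if total <= 10:      # E2: "from 7 to 10"
        return 1
    return 5             # E3: "11 or 12"

# Probability distribution of X: each of the 36 pairs of faces has probability 1/36.
distribution = {}
for die1, die2 in product(range(1, 7), repeat=2):
    x = gain(die1 + die2)
    distribution[x] = distribution.get(x, 0) + Fraction(1, 36)

# E(X) = sum of p_i x_i ; V(X) = sum of p_i x_i^2 - E(X)^2 ; sigma(X) = sqrt(V(X))
expectation = sum(p * x for x, p in distribution.items())
variance = sum(p * x**2 for x, p in distribution.items()) - expectation**2
std_dev = sqrt(float(variance))

print(distribution)
print(float(expectation), float(variance), std_dev)
```

Exact fractions are used so that the probabilities 15/36, 18/36 and 3/36 are recovered without rounding.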


1.2 Hypergeometric distribution

Its study will be restricted to a simple partition of the initial set into TWO subsets.

1.2.1 Definition and implementation

The probability distribution of a random variable X is hypergeometric iff:

* an experiment is conducted n times without repetition of any outcome (which leads to combinations), the outcomes falling inside a partition of Ω into an event (success) and its contrary (failure),

* X is the total number of successes obtained after n attempts; X takes its values in {0; 1; 2; 3; …; n}.

Let's consider a sample space Ω, a set of N outcomes, split into two events:

A, containing a outcomes called successes,

Ā, containing the N − a other outcomes, called failures.

An experiment is conducted n times, without possibility of repetition of any outcome (which forces n to be less than or equal to N). In the end, k successes are obtained; k is random but satisfies k ≤ n and k ≤ a (number of available "success" outcomes), and of course there are n − k failures, with n − k ≤ N − a (available failures).

X refers to the random variable number k of successes after n attempts.

Then, the probability distribution of X is hypergeometric, with parameters n, a and N.

Notation: H(n , a , N).

1.2.2 Calculation of probabilities

The total number of different sets of outcomes after n attempts is: $C_N^{n}$

Among them, the number of sets that contain exactly k successes is: $C_a^{k} \times C_{N-a}^{n-k}$

Hence, the probability of reaching k successes after n attempts is:

$p(X = k) = \dfrac{C_a^{k} \times C_{N-a}^{n-k}}{C_N^{n}}$

1.2.3 Mean and variance

In that context, both parameters are accessible thanks to the following formulas:

$E(X) = n \times \dfrac{a}{N}$

$V(X) = n \times \dfrac{a\,(N-a)}{N^{2}} \times \dfrac{N-n}{N-1}$

Comment: on naming "p" the probability of success at the first attempt, and "q" the complementary probability of failure, we can notice that: $p = \dfrac{a}{N}$ and $q = \dfrac{N-a}{N}$

Thus, the formulas above become: $E(X) = np$ and $V(X) = npq \times \dfrac{N-n}{N-1}$
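The formulas of sections 1.2.2 and 1.2.3 can be checked on a small case. The following sketch is an illustration under assumptions: the helper name `hypergeometric_pmf` and the card example (5 cards drawn from 32, 4 of which are aces) are hypothetical and not taken from the lesson. The combinations C are computed with Python's `math.comb`.

```python
from math import comb

def hypergeometric_pmf(k, n, a, N):
    """p(X = k) when X is distributed by H(n, a, N)."""
    if k < 0 or k > min(n, a) or n - k > N - a:
        return 0.0  # impossible number of successes
    return comb(a, k) * comb(N - a, n - k) / comb(N, n)

# Hypothetical example: n = 5 cards drawn from N = 32 cards, a = 4 of which are aces.
n, a, N = 5, 4, 32
probabilities = [hypergeometric_pmf(k, n, a, N) for k in range(n + 1)]

# Mean and variance computed directly from the distribution...
mean = sum(k * pk for k, pk in enumerate(probabilities))
variance = sum(k**2 * pk for k, pk in enumerate(probabilities)) - mean**2

# ...compared with the closed formulas E(X) = np and V(X) = npq (N - n)/(N - 1).
p, q = a / N, (N - a) / N
print(mean, n * p)
print(variance, n * p * q * (N - n) / (N - 1))
```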


1.3 Binomial distribution

1.3.1 Definition and implementation

The probability distribution of a random variable X is binomial iff:

* an experiment is conducted n times with allowed repetition of an outcome (which leads to p-lists), the outcomes falling inside a partition of Ω into an event (success) and its contrary (failure),

* X is the total number of successes obtained after n attempts; X takes its values in {0; 1; 2; 3; …; n}.

a. Bernoulli's scheme

Let's consider a random experiment leading to a sample space Ω. The event A, named success, has a probability p(A) to occur, denoted p.

The probability of its contrary, named failure, is q = 1 - p.

b. Binomial law

This experiment is conducted n times under the same conditions, so p is constant.

X refers to the random variable number k of successes after n attempts.

Then, the probability distribution of X is binomial with parameters n and p.

Notation: B(n ; p).

1.3.2 Calculation of probabilities

A tree (n-level Bernoulli scheme) leads to the formula to be used. In this example, the experiment is conducted three times: n = 3; A is the success.

On the right of the tree, the numbers of successes, values of X, are matched with the probabilities of the corresponding intersections. For instance, the probability that X = 1 is the sum of pq², qpq and q²p. Thus: p(X = 1) = 3pq².

Why are there 3 paths in the tree leading to X = 1? Because there are 3 ways to place one success among 3 attempts.

We can generalise: the probability of reaching k successes after n attempts is:

$p(X = k) = C_n^{k}\,p^{k}\,q^{n-k}$

1.3.3 Mean and variance

In that context, both parameters are accessible thanks to the following formulas:

$E(X) = np$ ; $V(X) = npq$
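As a quick check of sections 1.3.2 and 1.3.3, the sketch below computes the binomial probabilities for the tree example (n = 3) and compares the mean and variance with np and npq. The value p = 0.2 and the helper name `binomial_pmf` are assumptions chosen for illustration only.

```python
from math import comb

def binomial_pmf(k, n, p):
    """p(X = k) when X is distributed by B(n ; p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 3, 0.2          # n = 3 as in the tree; p = 0.2 is a hypothetical value
q = 1 - p
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The three paths pq², qpq and q²p add up to 3pq².
print(binomial_pmf(1, n, p), 3 * p * q**2)

mean = sum(k * pk for k, pk in enumerate(probabilities))
variance = sum(k**2 * pk for k, pk in enumerate(probabilities)) - mean**2
print(mean, n * p)          # E(X) = np
print(variance, n * p * q)  # V(X) = npq
```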

1.3.4 Approximation of a hypergeometric distribution by a binomial one

In case N ≥ 20n, the law H(n, a, N) comes close to the law B(n, p) where p = a/N.
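To see how close the two laws are when N ≥ 20n, the sketch below compares them term by term. The values n = 10, a = 500, N = 5000 are borrowed from section 2.1; the helper names are assumptions made for this illustration.

```python
from math import comb

def hypergeometric_pmf(k, n, a, N):
    """p(X = k) when X is distributed by H(n, a, N)."""
    return comb(a, k) * comb(N - a, n - k) / comb(N, n)

def binomial_pmf(k, n, p):
    """p(X = k) when X is distributed by B(n ; p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, a, N = 10, 500, 5000     # here N = 500n, so the condition N >= 20n holds
p = a / N

# Largest difference between the two probabilities over all values of k.
largest_gap = max(
    abs(hypergeometric_pmf(k, n, a, N) - binomial_pmf(k, n, p))
    for k in range(n + 1)
)
print(largest_gap)
```

A small value of `largest_gap` means that replacing the hypergeometric law by the binomial one changes each probability only slightly.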


1.4 Poisson's distribution

1.4.1 Why it has been created

In many cases, the number of different values that a variable X can take is very large. Calculating a probability may then involve very large numbers of combinations (and also large powers if the law is binomial), which even a computer might not be able to calculate. Moreover, when a success is a rare event, not every result is useful: for instance, there is little point in calculating the extremely low probabilities of the many unrealistic situations involving a large number of successes (unrealistic because very far from the low average number of expected successes).

In the context of a binomial law, with a low value of p, another formula can be used (instead of the binomial formula) based on a Poisson's law, whose results will appear to be close enough to reality.

Concrete examples of use :

* examining a sample taken from a large quantity of products, or from a large harvest, when the probability p that an element is defective is low:

Here, the n elements of the sample are taken among N elements without possible repetition, which gives a hypergeometric law; but n is very small compared to N, so we can simplify the situation by treating it as if repetition were allowed. This case can therefore be treated by a binomial law, whose results will be reliable. Moreover, the low value of p allows us to use a Poisson's law instead of a binomial law.

* problems of length of a queue

* predicting a "maximum" number of accidents or failures, or other rare events concerning a large population (for insurance companies, or study of rare diseases, for instance).

In the context of a binomial or hypergeometric law, under certain conditions, we can therefore use an approximate model, a Poisson's law, whose results will be fairly close to reality.

1.4.2 Definition, calculation of a probability

This law has been designed for a theoretical random variable X that may take any natural number k as a value (0, 1, 2, 3, 4, … without upper bound). This value still represents a number of successes.

Probabilities are defined by the following formula:

$p(X = k) = e^{-\lambda}\,\dfrac{\lambda^{k}}{k!}$

e is the base of the exponential function, λ is the expectation of X: λ = E(X).

The probability distribution of X is the Poisson's law with parameter λ, denoted P(λ).

The use of a Poisson's law in an exercise must be justified: either the exercise states that the law is a Poisson's one, or a binomial law logically leads to the corresponding Poisson's law (section 1.4.4).

1.4.3 Mean and variance

In that context, both parameters are very simple: $E(X) = \lambda$ ; $V(X) = \lambda$

1.4.4 Approximation of both previous laws by a Poisson's one

Given a random variable X whose distribution is B(n, p), for n "big enough" (n > 30) and p "little enough" (p ≤ 0.1), such that npq ≤ 10, B(n, p) comes close to the law P(λ) where λ = E(X) = np.

Given a random variable X whose distribution is H(n, a, N), for a sample "little enough inside the population" (N ≥ 20n) but "big enough" (n > 30), and for a proportion of success elements "little enough" (a/N ≤ 0.1), H(n, a, N) comes close to the law P(λ) where λ = E(X) = na/N.
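The approximation of a binomial law by a Poisson's law can be observed numerically. In the sketch below, the values n = 50 and p = 0.1 are assumptions chosen so that the three conditions above hold (n > 30, p ≤ 0.1, npq = 4.5 ≤ 10); the helper names are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """p(X = k) when X is distributed by B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """p(X = k) when X is distributed by P(lam): e^(-lam) * lam^k / k!"""
    return exp(-lam) * lam**k / factorial(k)

n, p = 50, 0.1              # hypothetical values for illustration
q = 1 - p
lam = n * p                 # parameter of the approximating Poisson's law
conditions_met = n > 30 and p <= 0.1 and n * p * q <= 10

# Largest difference between the binomial and Poisson probabilities.
largest_gap = max(
    abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(n + 1)
)
print(conditions_met, largest_gap)
```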


2 A continuous probability distribution: the Normal law

2.1 Convergence of discrete laws

Let's display a few probability distributions, with given values for n, a, N:

[Three graphs: n = 10, a = 500, N = 5000 ; n = 50, a = 500, N = 5000 ; n = 200, a = 500, N = 5000]

Some comments can be made from them:

* for the whole set of graphs, p = 0.1. This probability of success isn't very low, hence the reliability criterion "npq ≤ 10" for the use of a Poisson's law isn't met everywhere,

* the population's size (N = 5000) is rather big compared to n, which implies that both hypergeometric and binomial distributions are quite similar,

* the higher n is, the more the distributions appear to be symmetrical, around a central value that is actually the expectation of the variable.

* the higher n is, the more the distributions seem to follow a curve, that might be the same whatever the genuine discrete law, or at least that might belong to a unique class of functions.

Then, could we, under conditions on n and p, define a unique law that would correctly and quickly describe the reality?

* As n becomes large, looking for each point probability among a lot of others may not be relevant or useful. It is better to look for the probability that X lies inside some interval.

Could this unique law be described in terms of intervals (instead of point values), by a continuous random variable?

To conclude, the case for a new and general probability distribution is clear. Nevertheless, this law would be usable only for big populations and big samples taken from them (but small enough compared to the population)… which is actually the setting of many current surveys!


2.2 Continuous real random variable

2.2.1 Statistical introduction to a "continuous" distribution

2.2.2 Continuous random variable

Let's consider the ideal situation where X can take every possible real value, working in an infinite population. Here, the "frequencies concentration" is renamed "probability density".

A probability density of X is a function f, positive and continuous on ℝ, such that

$\int_{\mathbb{R}} f(x)\,dx = 1$

where a probability is the area of the surface bounded by its curve and the abscissa axis (Ox).

For example (tutorial), the probability that a mass is less than 3.7 kg is $\int_{-\infty}^{3.7} f(x)\,dx$.

The distribution function of X is the function F that associates with each input x the output F(x) = p(X < x).

F is an increasing function of x.

Comments:

* the graph of a probability density doesn't necessarily have an axis of symmetry, contrary to what the graphs above could lead us to conclude; the density drawn there is that of a normal distribution, which is indeed symmetrical.

* the expectation of a continuous random variable is: $E(X) = \int_{\mathbb{R}} x \times f(x)\,dx$.

* the variance of a continuous random variable is: $V(X) = \int_{\mathbb{R}} \left(x - E(X)\right)^{2} \times f(x)\,dx$, a definition from which we can rediscover the well-known property: $V(X) = E(X^{2}) - E(X)^{2}$.

[Figures: curves y = f(x) and y = F(x), with F(3.7) and F(3.85) marked]


2.3 The Normal law (or Laplace's law)

As we glimpsed, with a large number of observations from a big population, a lot of concrete phenomena, as well as discrete probability distributions, can be modelled by probability densities sharing a typical shape.

The general expression of such functions f is: $f(x) = k\,e^{-a(x-b)^{2}}$

Their graphs are named "bell curves".

2.3.1 General definition of the normal law N(µ , σ)

Let X be a random variable whose mean and standard deviation are µ and σ (E(X) = µ ; V(X) = σ²). Its probability distribution is N(µ , σ) when its probability density is:

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$

e.g.: probability density of N(25 , 10):

Comment 1: such a curve has two inflexion points, whose abscissas are µ − σ and µ + σ. Hence, we can read the standard deviation graphically.

Comment 2: some typical results have to be known:

p(µ − σ < X < µ + σ) ≈ 68.3 %          p(µ − 1.96σ < X < µ + 1.96σ) ≈ 95 %

p(µ − 2σ < X < µ + 2σ) ≈ 95.4 %        p(µ − 2.58σ < X < µ + 2.58σ) ≈ 99 %

Comment 3: the term "normal" can't be defined for one individual. Only a population may show a normal distribution, using this adjective because it's known that these functions fit with many concrete situations.

[Figure: bell curve of N(25 , 10), with µ = 25 and the abscissas 15 = µ − σ and 35 = µ + σ marked]


2.3.2 The standard normal law N (0 , 1)

We will sometimes have to use it (because an exercise requires it or because it is necessary…).

The variable of this special distribution (mean = 0 , standard deviation = 1) is denoted U (its values: u).

The comment 2 above gives here:

p(−1 < U < 1) ≈ 68.3 %          p(−1.96 < U < 1.96) ≈ 95 %

p(−2 < U < 2) ≈ 95.4 %          p(−2.58 < U < 2.58) ≈ 99 %

A lot of values F(u) = p(U < u) are given in a table (formula sheet), but only for u ≥ 0.

This restriction still allows us to find other probabilities, thanks to the following formulas:

p(a < U < b) = p(U < b) – p(U < a)

p(U > a) = 1 – p(U < a)

p(U < –a) = p(U > a)
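The values of the table and the three formulas above can be reproduced with a few lines of code. This sketch is an illustration, not the method of the lesson: it assumes the standard identity F(u) = (1 + erf(u/√2))/2 for the distribution function of N(0 , 1), and the helper names `F`, `prob_between` and `prob_above` are hypothetical.

```python
from math import erf, sqrt

def F(u):
    """Distribution function of N(0 , 1): F(u) = p(U < u)."""
    return 0.5 * (1 + erf(u / sqrt(2)))

def prob_between(a, b):
    """p(a < U < b) = p(U < b) - p(U < a)"""
    return F(b) - F(a)

def prob_above(a):
    """p(U > a) = 1 - p(U < a)"""
    return 1 - F(a)

# Typical results: p(-u < U < u) for u = 1, 1.96, 2 and 2.58.
for u in (1, 1.96, 2, 2.58):
    print(u, round(prob_between(-u, u), 4))

# Symmetry: p(U < -a) = p(U > a), here with a = 1.5.
print(F(-1.5), prob_above(1.5))
```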

2.3.3 Variable change: transition from N(µ , σ) to N(0 , 1)

Sometimes, a problem can't be solved when expressed in a given normal law, especially when one parameter is unknown. We then have to make a transition to N(0 , 1).

X is distributed by N(µ , σ) ⇔ $U = \dfrac{X - \mu}{\sigma}$ is distributed by N(0 , 1).

U is distributed by N(0 , 1) ⇔ X = µ + Uσ is distributed by N(µ , σ).

Hence: $p(X < x) = p\left(U < \dfrac{x - \mu}{\sigma}\right)$

Whatever the parameters of the normal distribution, a probability is the area of a given surface under the curve. On applying the variable change given above, you only modify the labels on the horizontal axis, without modifying the curve! For instance, the abscissa µ + 0.5σ for X matches the abscissa 0.5 for U, and then p(X < µ + 0.5σ) = p(U < 0.5).
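The variable change can be illustrated with the N(25 , 10) example of section 2.3.1. The sketch below uses `statistics.NormalDist` from the Python standard library; the threshold x = 30 is a hypothetical value chosen so that x = µ + 0.5σ.

```python
from statistics import NormalDist

mu, sigma = 25, 10          # the N(25 , 10) example of section 2.3.1
x = 30                      # hypothetical threshold: x = mu + 0.5 * sigma

# Probability computed directly with N(mu , sigma)...
direct = NormalDist(mu, sigma).cdf(x)

# ...and after the variable change u = (x - mu) / sigma, with N(0 , 1).
u = (x - mu) / sigma
standardised = NormalDist(0, 1).cdf(u)

print(u, direct, standardised)
```

Both probabilities coincide, which is what p(X < µ + 0.5σ) = p(U < 0.5) states.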


2.3.4 Approximation of discrete laws by a normal one

We have already said that as n gets large, the hypergeometric, binomial and Poisson's distributions become close to a normal one. This normal distribution can reliably replace them under the following conditions:

Approximation of a binomial distribution by a normal one:

From B(n , p), if n > 30 and npq > 5, then we can use N(µ , σ) with $\mu = np$ and $\sigma = \sqrt{npq}$

Approximation of a Poisson's distribution by a normal one:

From P(λ), if λ > 20, then we can use N(µ , σ) with $\mu = \lambda$ and $\sigma = \sqrt{\lambda}$

(starting with a hypergeometric distribution will require a first transformation into a binomial one)

2.3.5 Calculation of a discrete probability

In a discrete situation, where the variable X can only take integers as values (e.g.: number of successes, but not only), we're interested in the calculation of p(X = k). However, the normal law only permits us to calculate probabilities of intervals.

In that case, the best way is to apply the following rule: p(X = k) ≈ p(k – 0.5 < X < k + 0.5), the right-hand side being calculated with the normal law.

2.3.6 Important consequence

The binomial and Poisson's distributions are discrete, so that X can only take natural numbers as values: something like X = 3.8 has no reality for them. However, point 2.3.5 shows that using the normal law instead transforms any integer into an interval of width 1 centred on it.

In discrete situations, the numbers 3, 0, or –8 for instance, have to be translated into the intervals [2.5 ; 3.5], [–0.5 ; 0.5], [–8.5 ; –7.5].

As for the probability that X is greater than or equal to 10, it will be translated into p(X > 9.5); but be careful: the probability that X is strictly greater than 10 will be translated into p(X > 10.5) (the probability that X equals 10 being p(9.5 < X < 10.5)).
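The translation of integers into intervals can be tested against exact binomial probabilities. In this sketch, the law B(100 , 0.3) is a hypothetical example chosen so that n > 30 and npq = 21 > 5; the variable names are assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.3             # hypothetical values: n > 30 and npq = 21 > 5
q = 1 - p
normal = NormalDist(n * p, sqrt(n * p * q))   # N(np , sqrt(npq))

def binomial_pmf(k):
    """Exact p(X = k) for B(n , p)."""
    return comb(n, k) * p**k * q**(n - k)

# p(X = 30) is translated into p(29.5 < X < 30.5).
exact_equal_30 = binomial_pmf(30)
approx_equal_30 = normal.cdf(30.5) - normal.cdf(29.5)

# p(X >= 35) is translated into p(X > 34.5).
exact_at_least_35 = sum(binomial_pmf(k) for k in range(35, n + 1))
approx_at_least_35 = 1 - normal.cdf(34.5)

print(exact_equal_30, approx_equal_30)
print(exact_at_least_35, approx_at_least_35)
```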


3 Sampling distributions

3.1 Introduction

Do you know of an operation in which the entire population is surveyed to collect various pieces of information?

The means deployed are huge. It takes more than a year to collect and analyse the whole data set, and also an impressive number of surveyors to cover the whole country. Of course, this work can't be carried out for every survey…

By selecting a part of the population, you can get a pretty good representation of reality. This selection, more or less "representative" of reality, is called a sample. Survey methods exist to build a sample as representative of the population as possible.

Our aim in this section is, given a completely known population, to be able to tell how its set of samples will behave.

Naming conventions:

The population's parameters will be written with Greek letters:

mean: µ ; standard deviation: σ ; proportion: π

The sample's parameters will be written using our alphabet:

mean: x̄ ; standard deviation: s ; proportion: p

3.2 Random sampling

There are two main types of random sampling:

* the simple random sampling (SRS) allows the repetition of an individual and takes the order into account (which leads to p-lists in counts and to the binomial law in probabilities),

* the exhaustive sampling doesn't allow the repetition of an individual and doesn't take the order into account (which leads to combinations in counts and to the hypergeometric law in probabilities).

3.3 Sampling distribution of means

A variable X has to be studied in a population.

Once a size n has been chosen, we can virtually extract all the samples of this size.

The sample n° k gives rise to the calculation of its own mean: x̄_k.

We denote X̄ the random variable of the means of the n-sized samples, and we call sampling distribution of the means the probability distribution of the whole set of x̄_k, that is to say the probability distribution of the random variable X̄.

Let there be a "big enough" population (N > 30), on which a variable X is known in detail (at least, its mean µ and standard deviation σ are known). The mean x̄_k of each n-sized sample is more or less close to µ. In case n is big enough too (n ≥ 5),

X̄ is distributed by $\mathcal{N}\left(\mu \,,\, \dfrac{\sigma}{\sqrt{n}}\right)$ on SRS, and by $\mathcal{N}\left(\mu \,,\, \dfrac{\sigma}{\sqrt{n}} \sqrt{\dfrac{N-n}{N-1}}\right)$ on exhaustive sampling.

Comment 1: in case N > 20n ("little enough" sample), we can consider that $\dfrac{N-n}{N-1}$ is close to 1 and then ignore it. An exhaustive sampling (which is the most used) will in this case be handled as a SRS.

Comment 2: if, in an exercise, no comparison between N and n is possible, we will use the SRS results.

Comment 3: (from the "central limit" theorem) The higher N and n are, the closer the law of X̄ is to a normal law, whatever the probability distribution of X.

Comment 4: in case n is little (< 5), the distribution of X̄ is not close to a normal one. However, its mean and standard deviation are still those given above.


Activity: Let be the population: Ω = {0, 1, 2, 3, 4, 5} (N = 6), uniformly distributed.

Its mean is: µ = 2.5 and its standard deviation is: σ = 1.7078.

Below are listed all the samples of size 2 (SRS): (bold: the sample; besides: sample's mean)

Now, let's analyse the statistical distribution of these samples means:

their mean is: 2.5! Their standard deviation is: 1.2076… and indeed $\dfrac{\sigma}{\sqrt{n}} = 1.2076$!

Below are listed all the samples of size 2 (exhaustive): (bold: the sample; besides: sample's mean)

Now, let's analyse the statistical distribution of these samples means:

their mean is: 2.5! Their standard deviation is: 1.0801… and indeed $\dfrac{\sigma}{\sqrt{n}} \sqrt{\dfrac{N-n}{N-1}} = 1.0801$!
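The figures of this activity can be recomputed by listing the samples with a short program. The sketch below follows the definitions of section 3.2 (SRS: repetition allowed and order taken into account; exhaustive: no repetition, order ignored); the variable names are assumptions made for this illustration.

```python
from itertools import combinations, product
from statistics import mean, pstdev

population = [0, 1, 2, 3, 4, 5]     # N = 6, uniformly distributed
n = 2

# SRS: all ordered samples with repetition allowed (36 samples).
srs_means = [mean(sample) for sample in product(population, repeat=n)]

# Exhaustive sampling: all unordered samples without repetition (15 samples).
exhaustive_means = [mean(sample) for sample in combinations(population, n)]

print(mean(population), round(pstdev(population), 4))
print(mean(srs_means), round(pstdev(srs_means), 4))
print(mean(exhaustive_means), round(pstdev(exhaustive_means), 4))
```

`pstdev` is the population standard deviation (division by the number of values), which is the one used in the activity.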

3.4 Sampling distribution of proportions

Let there be a population of N individuals, among which we know that a number a of individuals share the character A. The proportion of such individuals in the population is then: $\pi = \dfrac{a}{N}$.

Once a size n has been chosen, we can virtually extract all the samples of this size.

The sample n° k gives rise to the calculation of its own proportion: p_k.

We denote P the random variable of the proportions, set of all values p_k, and we call sampling distribution of proportions the probability distribution of P.

Let there be a "big enough" population (N > 30), in which a proportion π is known. The proportion p_k in each n-sized sample is more or less close to π. In case n is big enough too (n ≥ 5),

P is distributed by $\mathcal{N}\left(\pi \,,\, \sqrt{\dfrac{\pi(1-\pi)}{n}}\right)$ on SRS, and by $\mathcal{N}\left(\pi \,,\, \sqrt{\dfrac{\pi(1-\pi)}{n}} \sqrt{\dfrac{N-n}{N-1}}\right)$ on exhaustive sampling.

Comment: let's explain these results for the SRS case. Let's call Y the variable giving, in each n-sized sample, the number of individuals having the character A. The law of Y is binomial, with parameters n and π. Reminders: E(Y) = nπ and V(Y) = nπ(1 – π).

Moreover: P = Y/n, which leads to: E(P) = π and $V(P) = \dfrac{\pi(1-\pi)}{n}$. In addition, the four comments made in part 3.3 are still relevant here.


4 Estimates (statistical inference)

A large population is partially or totally unknown. When a single n-sized sample is extracted, to what extent does it represent the whole population? Is the information obtained from this sample reliable enough to estimate the reality of the unknown population?

As it’s a large population, we will systematically consider SRS samples.

4.1 Point estimate

The sign ^ will have to be placed above a parameter in order to express its estimate.

The mean of a sample serves as an estimate of the population's mean; same for a proportion.

$\hat{\mu} = \bar{x}$ ; $\hat{\pi} = p$

(indeed, it has been noted in sections 3.3 and 3.4 that X̄ is centred on µ and that P is centred on π. We say that the variables X̄ and P are unbiased estimators)

The variance s² of a sample is not the best estimate of the population's variance σ². It has to be corrected:

$\hat{\sigma}^{2} = \dfrac{n}{n-1} \times s^{2}$ ; $\hat{\sigma} = \sqrt{\dfrac{n}{n-1}} \times s$ (s² is a biased estimator)

4.2 Estimate by a confidence interval

A point estimate doesn't guarantee any accuracy. Indeed, a sample might represent the population very badly, and the two means or proportions might be far from each other.

A confidence interval tells us the probability that a population's parameter lies within a given distance of the one obtained from a sample. For instance, we will build around the mean x̄ of a sample an interval "that has a 95 % chance" of containing the population's mean µ.

We call significance level, α, the probability that a confidence interval doesn't contain the population's parameter. α = 5 % or α = 1 % are the most commonly used.

We call confidence level the probability that a confidence interval contains the population's parameter.

Commonly: 1 – α = 95 % or 1 – α = 99 %.

4.2.1 Estimate of a mean

The way to build the interval depends on the knowledge of σ.

if σ is known: $I_{\alpha} = \left[\bar{x} - u\,\dfrac{\sigma}{\sqrt{n}} \;;\; \bar{x} + u\,\dfrac{\sigma}{\sqrt{n}}\right]$

using the variable U distributed by N(0 , 1), looking for u such that p(−u < U < u) = 1 – α. e.g.: with α = 5 %, u = 1.96 and with α = 1 %, u = 2.58.

If σ is unknown: $I_\alpha = \left[ \bar{x} - t \frac{s}{\sqrt{n-1}} \; ; \; \bar{x} + t \frac{s}{\sqrt{n-1}} \right]$

Both the population's mean and standard deviation are unknown, which prevents us from using the variable U and obliges us to replace it with the variable T distributed by Student's law with n – 1 "degrees of freedom" (dof), St(0 , 1), looking for t such that p(-t < T < t) = 1 – α.

Several values of t are given in the corresponding table of the formula sheet.
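The case where σ is unknown can be sketched in the same way. Python's standard library has no Student quantile function, so in this illustration the value t is passed in as read from a table; the function name and the data are made up, and t = 2.262 is the usual table value for 9 dof with α = 5 %.

```python
from math import sqrt
from statistics import mean, pstdev

def mean_interval_unknown_sigma(sample, t):
    """Confidence interval for mu; t is read from a Student table (n - 1 dof)."""
    n = len(sample)
    x_bar = mean(sample)
    s = pstdev(sample)                  # sample standard deviation (division by n)
    half_width = t * s / sqrt(n - 1)
    return x_bar - half_width, x_bar + half_width

# Hypothetical sample of n = 10 values, invented for this illustration only.
data = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
low, high = mean_interval_unknown_sigma(data, t=2.262)
print(round(low, 3), round(high, 3))
```

Here the sample mean is 13.5 and s = 1.5, so the half-width is 2.262 × 1.5 / 3 = 1.131.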

4.2.2 Estimate of a proportion

$I_\alpha = \left[ p - u \sqrt{\frac{p(1-p)}{n}} \; ; \; p + u \sqrt{\frac{p(1-p)}{n}} \right]$

(automatic use of the normal law)
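A minimal sketch of the interval for a proportion, under the same normal-law assumption. The function name and the survey figures (p = 0.4 observed on n = 600 individuals) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(p, n, alpha=0.05):
    """Confidence interval for pi from a sample proportion p."""
    u = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = u * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical survey, invented for this illustration only.
low, high = proportion_interval(0.4, 600, alpha=0.05)
print(round(low, 3), round(high, 3))
```

The half-width is about 1.96 × 0.02 ≈ 0.039, so the interval is roughly [0.361 ; 0.439].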


5 Statistical hypothesis testing

Knowing (at least) one sample, we can formulate a hypothesis about the unknown population. It is called the null hypothesis and denoted H0. An appropriate statistical test will allow us to reject it (or not).

A rejection of H0 carries a risk of being wrong: the significance level α. Sometimes, it is useful to state its contrary, the alternative hypothesis H1.

5.1 Adequacy χ² test (Pearson's test)

Aim: to compare a distribution of observed values to a given law.

By convention, the null hypothesis H0 is: "the observed distribution fits the chosen law".

This hypothesis will be rejected if the observed distribution differs "very much" from the chosen law.

e.g.: [Figure: frequency bar diagram of observed values (vertical bars) compared to the normal law N(6 , 2)]

H0 : in the population, the variable is distributed by the law N(6 , 2).

(we venture the hypothesis that the observed values, in the sample, are consistent with the idea that the population is distributed by this normal law. For this purpose, we have to perform a χ² adequacy test in order to decide whether H0 can be rejected with a high enough confidence level)

Implementation of the test

1. Expression of the null hypothesis

2. Calculation of the observed chi-square: χ²calc

n observations are made: n individuals are evaluated. k different values are recorded.

The tested law lets us calculate theoretical frequencies.

3. Rejection area

Look up the value χ²lim corresponding to the significance level α and to the number of dof, which is always k – 1 here.

4. Comparison and decision

If χ²calc > χ²lim, then we can reject H0 with a risk α of being wrong.

If χ²calc < χ²lim, then we cannot reject H0 at the level α (the risk of being wrong would be more than α).

values    observed frequencies    theoretical frequencies    χ² terms
val 1     obs1                    th1                        χ²1
val 2     obs2                    th2                        χ²2
…         …                       …                          …
val k     obsk                    thk                        χ²k
total     n                       n                          χ²calc
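The four steps of the χ² test can be sketched as follows. The lesson's table does not spell out how each χ² term is obtained; this sketch assumes the usual Pearson formula, (obs − th)² / th, summed over the k values. The die example, the function name and the counts are invented; 11.07 is the usual table value for 5 dof and α = 5 %.

```python
def chi_square_calc(observed, probabilities):
    """Observed chi-square, assuming each term is (obs - th)² / th."""
    n = sum(observed)
    theoretical = [n * p for p in probabilities]   # theoretical frequencies
    return sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))

# Hypothetical data: a die thrown 120 times; H0: "the die is fair".
observed = [25, 17, 15, 23, 24, 16]
probabilities = [1 / 6] * 6

chi2_calc = chi_square_calc(observed, probabilities)
chi2_lim = 11.07            # table value for k - 1 = 5 dof, alpha = 5 %
reject_h0 = chi2_calc > chi2_lim
print(round(chi2_calc, 2), reject_h0)
```

Each theoretical frequency is 20, so χ²calc = (25 + 9 + 25 + 9 + 16 + 16) / 20 = 5, which is below the limit: H0 cannot be rejected at the 5 % level.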


5.2 Conformance test of a mean, of a proportion

5.2.1 Principle

The aim of these tests is to decide whether the unknown mean µ (or proportion π) of a population differs from a given value µ0 (or π0) or not.

Null hypothesis: H0 : µ = µ0 (or: H0 : π = π0)

Alternative hypothesis: H1 : µ ≠ µ0 : two-sided test,

or H1 : µ > µ0 : right one-sided test,

or H1 : µ < µ0 : left one-sided test

(the same holds for a proportion).

This alternative hypothesis is essential because:

* in the case of a two-sided test, α has to be split in two, the two halves creating two rejection areas, on the left and on the right of the tested value,

* in the case of a one-sided test, a single rejection area corresponds to the whole of α.
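To illustrate how the alternative hypothesis changes the rejection area, here is a small sketch that computes the limit value(s) of U for each kind of test. The function name `normal_limits` and its `kind` labels are assumptions made for this example.

```python
from statistics import NormalDist

def normal_limits(alpha, kind="two-sided"):
    """Limit value(s) of U demarcating the rejection area(s)."""
    z = NormalDist()                       # standard normal law N(0, 1)
    if kind == "two-sided":
        u = z.inv_cdf(1 - alpha / 2)       # alpha is split in two halves
        return (-u, u)
    if kind == "right":
        return (z.inv_cdf(1 - alpha),)     # the whole of alpha on the right
    if kind == "left":
        return (z.inv_cdf(alpha),)         # the whole of alpha on the left
    raise ValueError("kind must be 'two-sided', 'right' or 'left'")

for kind in ("two-sided", "right", "left"):
    print(kind, [round(u, 3) for u in normal_limits(0.05, kind)])
```

At α = 5 % the two-sided limits are about ±1.96, whereas a one-sided test uses about 1.645 (or −1.645), because the whole 5 % sits on one side.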

There is a strong relationship between confidence intervals around an observed value (as we saw in the previous section) and performing a test on a given value; hence what follows:

5.2.2 Conformance test of a mean

If the standard deviation of the population, σ, is known

The associated decision variable is: $U = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$

which is distributed, under the null hypothesis, by the standard normal law, provided that X is normally distributed or n is big enough (n ≥ 5).
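A minimal sketch of a two-sided conformance test of a mean when σ is known. The function name and the figures (µ0 = 500, σ = 20, n = 100, observed mean 496) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_test_known_sigma(x_bar, mu0, sigma, n, alpha=0.05):
    """Two-sided test of H0: mu = mu0; returns (u_calc, u_lim, reject)."""
    u_calc = (x_bar - mu0) / (sigma / sqrt(n))    # value of the decision variable
    u_lim = NormalDist().inv_cdf(1 - alpha / 2)   # limit of the rejection areas
    return u_calc, u_lim, abs(u_calc) > u_lim

# Hypothetical figures, invented for this illustration only.
u_calc, u_lim, reject = mean_test_known_sigma(496, 500, 20, 100)
print(round(u_calc, 2), round(u_lim, 2), reject)
```

Here u = (496 − 500) / 2 = −2, which lies beyond −1.96: H0 is rejected at the 5 % level.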

If the standard deviation of the population, σ, is unknown

The associated decision variable is: $T = \frac{\bar{X} - \mu_0}{S / \sqrt{n-1}}$

which is distributed, under the null hypothesis, by Student's law with n – 1 degrees of freedom, provided that X is normally distributed or n is big enough (n ≥ 5).

(S is the random variable "standard deviations in the samples")
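The same test with σ unknown can be sketched as below. As Python's standard library has no Student quantile, the limit t is assumed to be read from a table (2.262 for 9 dof, two-sided, α = 5 %); the function name and the sample summary are invented.

```python
from math import sqrt

def mean_test_unknown_sigma(x_bar, s, n, mu0, t_lim):
    """Two-sided test of H0: mu = mu0; t_lim comes from a Student table."""
    t_calc = (x_bar - mu0) / (s / sqrt(n - 1))    # value of the decision variable
    return t_calc, abs(t_calc) > t_lim

# Hypothetical sample summary: n = 10, mean 13.5, standard deviation 1.5.
t_calc, reject = mean_test_unknown_sigma(13.5, 1.5, 10, mu0=12.5, t_lim=2.262)
print(round(t_calc, 3), reject)
```

Here t = 1 / (1.5 / 3) = 2, which stays inside ±2.262: H0 cannot be rejected at the 5 % level, although the same value would have led to a rejection with the normal limit 1.96.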

5.2.3 Conformance test of a proportion

The associated decision variable is: $U = \frac{P - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$

which is distributed, under the null hypothesis, by the standard normal law, provided that X is normally distributed or n is big enough (n ≥ 5).

(P is the random variable "proportions in the samples")
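A sketch of a right one-sided conformance test of a proportion (H1 : π > π0). The function name and the figures (π0 = 0.5, p = 0.56 observed on n = 400 individuals) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_test_right(p, pi0, n, alpha=0.05):
    """Right one-sided test of H0: pi = pi0; returns (u_calc, u_lim, reject)."""
    u_calc = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
    u_lim = NormalDist().inv_cdf(1 - alpha)   # the whole of alpha on the right
    return u_calc, u_lim, u_calc > u_lim

# Hypothetical figures, invented for this illustration only.
u_calc, u_lim, reject = proportion_test_right(0.56, 0.5, 400)
print(round(u_calc, 2), round(u_lim, 3), reject)
```

Here u = 0.06 / 0.025 = 2.4, which exceeds the limit of about 1.645: H0 is rejected at the 5 % level.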

5.2.4 Methodology

1. Clearly spell out the null hypothesis and the alternative hypothesis

2. Calculate the value (u or t) of the decision variable from the sample's value $\bar{x}$ or p

3. Calculate the limit value(s) u or t demarcating the rejection area

4. Compare the results of points 2 and 3, then conclude about the rejection (or not) of H0


5.3 The risks (non required)

5.3.1 Accept a hypothesis?

To decide whether or not to reject a hypothesis, we have to perform a statistical test: make observations and confront them with an alternative hypothesis (defining a rejection area). The main idea is to try to reject H0 when the observations appear to be too far from what was expected under this hypothesis. However, making a wrong decision is possible: each decision is associated with a probability of being wrong, which we try to minimize.

The conclusion of a test can only be the rejection or the non-rejection of H0, and never its acceptance. In statistical inference (as in any observational activity: physics, chemistry, astronomy, economics, …), proving that a theory is true is impossible (it is!); on the other hand, an observation may contradict this theory, forcing the analyst to modify it.

Let's consider a statistical test at a 5% significance level. If our observation lies in the rejection area, then we can reject H0 with less than a 5% risk of being wrong in doing so (and thus a 95% reliability). On the other hand, if our observation does not lie in the rejection area, we only know that our chances of being right in rejecting H0 would be less than 95% ("we can't reject H0 at a 5% significance level"), which is surely not a situation that would lead us to accept H0!

5.3.2 Decisions and risks

There are two different kinds of mistakes while making a decision, each one associated with a risk:

We reject H0 whereas H0 is true: associated with the risk α : type 1 error, $\alpha = p_{H_0 \text{ true}}(\text{reject } H_0)$

We don't reject H0 whereas H0 is false: associated with the risk β : type 2 error, $\beta = p_{H_0 \text{ false}}(\text{not reject } H_0)$

                      H0 true    H0 false
reject H0             α          1-β
don't reject H0       1-α        β

The four probabilities in the table are conditional; be careful with their interpretation!

α is the probability of rejecting H0, given that H0 is true,
1-α is the probability of not rejecting H0, given that H0 is true,
1-β is the probability of rejecting H0, given that H0 is false,
β is the probability of not rejecting H0, given that H0 is false.

5.3.3 Risks and statistical tests

The probability α to be wrong rejecting H0 is named significance level of the test.

The probability 1-α to be right not rejecting H0 is named confidence level of the test.

The probability 1-β to be right rejecting H0 is named power of the test.

While the risk α is well-known, since we have to decide its value, it's unfortunately impossible to know the risk β, since the population remains unknown.


e.g.: let's test the hypothesis that the mean of a population is 4.

So, we assume that the sampling distribution of means is the one shown in the graph alongside.

We choose a significance level α = 5%, which leads us to the conclusion: if our sample's mean is more than 5.3, then we can reject the hypothesis that claims µ = 4.

However, if $\bar{x} > 5.3$, then the risk of being wrong (on rejection) is 5%, since if µ = 4 is true, 5% of the samples would show a mean greater than 5.3!

Now, let's suppose that the real mean of the population is 6 (but the person who performs the test doesn't know it!). The real distribution has been added in the second graph below, dotted. If our sample's mean is less than 5.3, the person who performs the test will not be allowed to reject the hypothesis µ = 4 (too low a confidence level), thereby making a mistake.

The real proportion of the samples whose mean is less than 5.3 is β : risk to be wrong while not rejecting µ = 4, unfortunately unknown (because the real value of µ is unknown) and possibly very high!

Nevertheless, the error risk β decreases as the number of observations (the sample's size) increases.

The third graph below displays what happens when the sample's size is doubled (the standard deviation is divided by √2), with the same value for α:

To conclude: there is a safe method to reduce the risks of errors in surveys: enlarge the sample!
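The example above can be reproduced numerically with a short sketch. Several things are assumed here because the lesson does not state them: the sampling distribution of means is normal, the test is right one-sided, and its standard deviation is chosen so that the 5 % limit falls at 5.3. The function name `beta_risk` is made up.

```python
from math import sqrt
from statistics import NormalDist

def beta_risk(mu0, mu_real, std_error, alpha=0.05):
    """Right one-sided test of H0: mu = mu0; returns (limit, beta)."""
    # limit above which H0 is rejected, computed under H0
    limit = NormalDist(mu0, std_error).inv_cdf(1 - alpha)
    # beta: proportion of samples below the limit when the real mean is mu_real
    beta = NormalDist(mu_real, std_error).cdf(limit)
    return limit, beta

# Standard deviation of the sample means chosen so that the limit is 5.3
# (an assumption made to match the example, not a value from the lesson).
se = (5.3 - 4) / NormalDist().inv_cdf(0.95)

limit, beta = beta_risk(4, 6, se)                # original sample size
limit2, beta2 = beta_risk(4, 6, se / sqrt(2))    # sample size doubled
print(round(limit, 2), round(beta, 3))
print(round(limit2, 2), round(beta2, 3))
```

Under these assumptions β is a little under 20 % with the original sample and drops to only a few percent once the sample's size is doubled, while α stays at 5 %.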
