
1.4 DATA: PROBABILISTIC VIEW


The probabilistic view of the data assumes that each numeric attribute X is a random variable, defined as a function that assigns a real number to each outcome of an experiment (i.e., some process of observation or measurement). Formally, X is a function X : O → R, where O, the domain of X, is the set of all possible outcomes of the experiment, also called the sample space, and R, the range of X, is the set of real numbers. If the outcomes are numeric, and represent the observed values of the random variable, then X : O → R is simply the identity function: X(v) = v for all v ∈ O. The distinction between the outcomes and the value of the random variable is important, as we may want to treat the observed values differently depending on the context, as seen in Example 1.6.

A random variable X is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range, whereas X is called a continuous random variable if it can take on any value in its range.

Example 1.6. Consider the sepal length attribute (X1) for the Iris dataset in Table 1.1. All n = 150 values of this attribute are shown in Table 1.2; they lie in the range [4.3, 7.9], with centimeters as the unit of measurement. Let us assume that these constitute the set of all possible outcomes O.

By default, we can consider the attribute X1 to be a continuous random variable, given as the identity function X1(v) = v, because the outcomes (sepal length values) are all numeric.

On the other hand, if we want to distinguish between Iris flowers with short and long sepal lengths, with long being, say, a length of 7 cm or more, we can define a discrete random variable A as follows:

$$A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \geq 7 \end{cases}$$

In this case the domain of A is [4.3, 7.9], and its range is {0, 1}. Thus, A assumes nonzero probability only at the discrete values 0 and 1.
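This discretization is straightforward to reproduce in code. The following is a minimal sketch; the array sepal_lengths is a placeholder name of ours holding a few illustrative values rather than the full set from Table 1.2:

```python
import numpy as np

# A few sepal length values (cm); in practice these would be the
# n = 150 values from Table 1.2.
sepal_lengths = np.array([4.3, 5.8, 7.1, 6.4, 7.9, 5.0])

# Discrete random variable A: 1 if sepal length >= 7 (long), else 0.
A = (sepal_lengths >= 7).astype(int)
print(A)  # [0 0 1 0 1 0]
```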


Probability Mass Function

If X is discrete, the probability mass function of X is defined as

$$f(x) = P(X = x) \quad \text{for all } x \in \mathbb{R}$$

In other words, the function f gives the probability P(X = x) that the random variable X has the exact value x. The name "probability mass function" intuitively conveys the fact that the probability is concentrated or massed at only discrete values in the range of X, and is zero for all other values. f must also obey the basic rules of probability. That is, f must be non-negative:

$$f(x) \geq 0$$

and the sum of all probabilities should add to 1:

$$\sum_x f(x) = 1$$

Example 1.7 (Bernoulli and Binomial Distribution). In Example 1.6, A was defined as a discrete random variable representing long sepal length. From the sepal length data in Table 1.2 we find that only 13 Irises have sepal length of at least 7 cm. We can thus estimate the probability mass function of A as follows:

$$f(1) = P(A = 1) = \frac{13}{150} = 0.087 = p$$

and

$$f(0) = P(A = 0) = \frac{137}{150} = 0.913 = 1 - p$$

In this case we say that A has a Bernoulli distribution with parameter p ∈ [0, 1], which denotes the probability of a success, that is, the probability of picking an Iris with a long sepal length at random from the set of all points. On the other hand, 1 − p is the probability of a failure, that is, of not picking an Iris with long sepal length.
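The estimate of p is simply the fraction of successes in the data; a standalone sketch (again with a small stand-in array rather than the full Table 1.2):

```python
import numpy as np

sepal_lengths = np.array([4.3, 5.8, 7.1, 6.4, 7.9, 5.0])  # stand-in for Table 1.2
A = (sepal_lengths >= 7).astype(int)

p_hat = A.mean()         # fraction of successes estimates f(1) = p
print(p_hat, 1 - p_hat)  # with all 150 values this gives 13/150 and 137/150
```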

Let us consider another discrete random variable B, denoting the number of Irises with long sepal length in m independent Bernoulli trials with probability of success p. In this case, B takes on the discrete values {0, 1, ..., m}, and its probability mass function is given by the Binomial distribution:

$$f(k) = P(B = k) = \binom{m}{k} p^k (1 - p)^{m-k}$$

The formula can be understood as follows. There are $\binom{m}{k}$ ways of picking k long sepal length Irises out of the m trials. For each selection of k long sepal length Irises, the total probability of the k successes is $p^k$, and the total probability of the m − k failures is $(1 - p)^{m-k}$. For example, because p = 0.087 from above, the probability of observing exactly k = 2 Irises with long sepal length in m = 10 trials is given as

$$f(2) = P(B = 2) = \binom{10}{2} (0.087)^2 (0.913)^8 = 0.164$$

Figure 1.6 shows the full probability mass function for different values of k for m = 10. Because p is quite small, the probability of k successes in so few trials falls off rapidly as k increases, becoming practically zero for values of k ≥ 6.
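The full probability mass function plotted in Figure 1.6 can be reproduced in a few lines; here is a sketch using scipy.stats (an assumption on our part; any binomial pmf routine would do):

```python
from scipy.stats import binom

m, p = 10, 13 / 150  # number of trials and estimated success probability
for k in range(m + 1):
    # P(B = k) under the Binomial(m, p) distribution
    print(k, binom.pmf(k, m, p))
```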

Figure 1.6. Binomial distribution: probability mass function (m = 10, p = 0.087).

Probability Density Function

If X is continuous, its range is the entire set of real numbers R. The probability of any specific value x is only one out of the infinitely many possible values in the range of X, which means that P(X = x) = 0 for all x ∈ R. However, this does not mean that the value x is impossible, because in that case we would conclude that all values are impossible! What it means is that the probability mass is spread so thinly over the range of values that it can be measured only over intervals [a, b] ⊂ R, rather than at specific points. Thus, instead of the probability mass function, we define the probability density function, which specifies the probability that the variable X takes on values in any interval [a, b] ⊂ R:

$$P(X \in [a, b]) = \int_a^b f(x)\, dx$$

As before, the density function f must satisfy the basic laws of probability:

$$f(x) \geq 0 \quad \text{for all } x \in \mathbb{R}$$

and

$$\int_{-\infty}^{\infty} f(x)\, dx = 1$$

We can get an intuitive understanding of the density function f by considering the probability density over a small interval of width 2ε > 0, centered at x, namely [x − ε, x + ε]:

$$P(X \in [x - \epsilon, x + \epsilon]) = \int_{x-\epsilon}^{x+\epsilon} f(x)\, dx \simeq 2\epsilon \cdot f(x)$$

$$f(x) \simeq \frac{P(X \in [x - \epsilon, x + \epsilon])}{2\epsilon} \tag{1.8}$$

f(x) thus gives the probability density at x, given as the ratio of the probability mass to the width of the interval, that is, the probability mass per unit distance. Thus, it is important to note that P(X = x) ≠ f(x).

Even though the probability density function f(x) does not specify the probability P(X = x), it can be used to obtain the relative probability of one value x1 over another x2, because for a given ε > 0, by Eq. (1.8), we have

$$\frac{P(X \in [x_1 - \epsilon, x_1 + \epsilon])}{P(X \in [x_2 - \epsilon, x_2 + \epsilon])} \simeq \frac{2\epsilon \cdot f(x_1)}{2\epsilon \cdot f(x_2)} = \frac{f(x_1)}{f(x_2)} \tag{1.9}$$

Thus, if f(x1) is larger than f(x2), then values of X close to x1 are more probable than values close to x2, and vice versa.

Example 1.8 (Normal Distribution). Consider again the sepal length values from the Iris dataset, as shown in Table 1.2. Let us assume that these values follow a Gaussian or normal density function, given as

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$$

There are two parameters of the normal density distribution, namely, µ, which represents the mean value, and σ², which represents the variance of the values (these parameters are discussed in Chapter 2). Figure 1.7 shows the characteristic "bell" shape plot of the normal distribution. The parameters, µ = 5.84 and σ² = 0.681, were estimated directly from the data for sepal length in Table 1.2.

Whereas

$$f(x = \mu) = f(5.84) = \frac{1}{\sqrt{2\pi \cdot 0.681}} \exp\{0\} = 0.483,$$

we emphasize that the probability of observing X = µ is zero, that is, P(X = µ) = 0. Thus, P(X = x) is not given by f(x); rather, P(X = x) is given as the area under the curve for an infinitesimally small interval [x − ε, x + ε] centered at x, with ε > 0. Figure 1.7 illustrates this with the shaded region centered at µ = 5.84. From Eq. (1.8), we have

$$P(X = \mu) \simeq 2\epsilon \cdot f(\mu) = 2\epsilon \cdot 0.483 = 0.967\epsilon$$

As ε → 0, we get P(X = µ) → 0. However, based on Eq. (1.9), we can claim that the probability of observing values close to the mean value µ = 5.84 is 2.69 times the probability of observing values close to x = 7, as

$$\frac{f(5.84)}{f(7)} = \frac{0.483}{0.18} = 2.69$$
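These density values and the relative-probability ratio of Eq. (1.9) are easy to verify numerically; a minimal sketch, assuming scipy.stats is available and using the parameter estimates from the text:

```python
import numpy as np
from scipy.stats import norm

mu, var = 5.84, 0.681
rv = norm(loc=mu, scale=np.sqrt(var))  # normal with mean mu, variance var

print(rv.pdf(mu))              # density at the mean, approx. 0.483
print(rv.pdf(mu) / rv.pdf(7))  # relative probability, approx. 2.69
```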

Figure 1.7. Normal distribution: probability density function (µ = 5.84, σ² = 0.681); the shaded region marks the interval µ ± ε.

Cumulative Distribution Function

For any random variable X, whether discrete or continuous, we can define the cumulative distribution function (CDF) F : R → [0, 1], which gives the probability of observing a value at most some given value x:

$$F(x) = P(X \leq x) \quad \text{for all } -\infty < x < \infty$$

When X is discrete, F is given as

$$F(x) = P(X \leq x) = \sum_{u \leq x} f(u)$$

and when X is continuous, F is given as

$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\, du$$

Example 1.9 (Cumulative Distribution Function). Figure 1.8 shows the cumulative distribution function for the binomial distribution in Figure 1.6. It has the characteristic step shape (right continuous, non-decreasing), as expected for a discrete random variable. F(x) has the same value F(k) for all x ∈ [k, k + 1) with 0 ≤ k < m, where m is the number of trials and k is the number of successes. The closed (filled) and open circles demarcate the corresponding closed and open interval [k, k + 1). For instance, F(x) = 0.404 = F(0) for all x ∈ [0, 1).

Figure 1.9 shows the cumulative distribution function for the normal density function shown in Figure 1.7. As expected, for a continuous random variable, the CDF is also continuous and non-decreasing. Because the normal distribution is symmetric about the mean, we have F(µ) = P(X ≤ µ) = 0.5.
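Both CDF values quoted above can be checked directly; a brief sketch, again assuming scipy.stats:

```python
import numpy as np
from scipy.stats import binom, norm

# Discrete case: F(0) = P(B <= 0) for Binomial(m = 10, p = 13/150)
print(binom.cdf(0, 10, 13 / 150))  # approx. 0.404

# Continuous case: F(mu) for the normal fit of Example 1.8
print(norm.cdf(5.84, loc=5.84, scale=np.sqrt(0.681)))  # exactly 0.5
```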


Figure 1.8. Cumulative distribution function for the binomial distribution.

Figure 1.9. Cumulative distribution function for the normal distribution; the marked point is (µ, F(µ)) = (5.84, 0.5).

1.4.1 Bivariate Random Variables

Instead of considering each attribute as a random variable, we can also perform pair-wise analysis by considering a pair of attributes, X1 and X2, as a bivariate random variable:

$$\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$

X : O → R² is a function that assigns to each outcome in the sample space a pair of real numbers, that is, a 2-dimensional vector $(x_1, x_2)^T \in \mathbb{R}^2$. As in the univariate case, if the outcomes are numeric, then the default is to assume X to be the identity function.

Joint Probability Mass Function

If X1 and X2 are both discrete random variables then X has a joint probability mass function given as follows:

$$f(\mathbf{x}) = f(x_1, x_2) = P(X_1 = x_1, X_2 = x_2) = P(\mathbf{X} = \mathbf{x})$$

f must satisfy the following two conditions:

$$f(\mathbf{x}) = f(x_1, x_2) \geq 0 \quad \text{for all } -\infty < x_1, x_2 < \infty$$

$$\sum_{\mathbf{x}} f(\mathbf{x}) = \sum_{x_1} \sum_{x_2} f(x_1, x_2) = 1$$

Joint Probability Density Function

If X1 and X2 are both continuous random variables then X has a joint probability density function f given as follows:

$$P(\mathbf{X} \in W) = \iint_{\mathbf{x} \in W} f(\mathbf{x})\, d\mathbf{x} = \iint_{(x_1, x_2)^T \in W} f(x_1, x_2)\, dx_1\, dx_2$$

where W ⊂ R² is some subset of the 2-dimensional space of reals. f must also satisfy the following two conditions:

$$f(\mathbf{x}) = f(x_1, x_2) \geq 0 \quad \text{for all } -\infty < x_1, x_2 < \infty$$

$$\int_{\mathbb{R}^2} f(\mathbf{x})\, d\mathbf{x} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_1\, dx_2 = 1$$

As in the univariate case, the probability mass $P(\mathbf{x}) = P((x_1, x_2)^T) = 0$ for any particular point x. However, we can use f to compute the probability density at x. Consider the square region $W = ([x_1 - \epsilon, x_1 + \epsilon], [x_2 - \epsilon, x_2 + \epsilon])$, that is, a 2-dimensional window of width 2ε centered at $\mathbf{x} = (x_1, x_2)^T$. The probability density at x can be approximated as

$$P(\mathbf{X} \in W) = P\big(\mathbf{X} \in ([x_1 - \epsilon, x_1 + \epsilon], [x_2 - \epsilon, x_2 + \epsilon])\big) = \int_{x_1 - \epsilon}^{x_1 + \epsilon} \int_{x_2 - \epsilon}^{x_2 + \epsilon} f(x_1, x_2)\, dx_1\, dx_2 \simeq 2\epsilon \cdot 2\epsilon \cdot f(x_1, x_2)$$

which implies that

$$f(x_1, x_2) = \frac{P(\mathbf{X} \in W)}{(2\epsilon)^2}$$

The relative probability of one value (a1, a2) versus another (b1, b2) can therefore be computed via the probability density function:

$$\frac{P\big(\mathbf{X} \in ([a_1 - \epsilon, a_1 + \epsilon], [a_2 - \epsilon, a_2 + \epsilon])\big)}{P\big(\mathbf{X} \in ([b_1 - \epsilon, b_1 + \epsilon], [b_2 - \epsilon, b_2 + \epsilon])\big)} \simeq \frac{(2\epsilon)^2 \cdot f(a_1, a_2)}{(2\epsilon)^2 \cdot f(b_1, b_2)} = \frac{f(a_1, a_2)}{f(b_1, b_2)}$$


Example 1.10 (Bivariate Distributions). Consider the sepal length and sepal width attributes in the Iris dataset, plotted in Figure 1.2. Let A denote the Bernoulli random variable corresponding to long sepal length (at least 7 cm), as defined in Example 1.7. Define another Bernoulli random variable B corresponding to long sepal width, say, at least 3.5 cm. Let $\mathbf{X} = (A, B)^T$ be a discrete bivariate random variable; then the joint probability mass function of X can be estimated from the data as follows:

$$f(0, 0) = P(A = 0, B = 0) = \frac{116}{150} = 0.773$$
$$f(0, 1) = P(A = 0, B = 1) = \frac{21}{150} = 0.140$$
$$f(1, 0) = P(A = 1, B = 0) = \frac{10}{150} = 0.067$$
$$f(1, 1) = P(A = 1, B = 1) = \frac{3}{150} = 0.020$$

Figure 1.10 shows a plot of this probability mass function.
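Such an estimate amounts to counting co-occurrences and normalizing; a sketch with NumPy, where sep_len and sep_wid are placeholder names of ours for the 150 sepal length and width columns:

```python
import numpy as np

def joint_pmf(sep_len, sep_wid):
    """Estimate the joint pmf of (A, B) by counting co-occurrences."""
    A = (sep_len >= 7.0).astype(int)   # long sepal length
    B = (sep_wid >= 3.5).astype(int)   # long sepal width
    f = np.zeros((2, 2))
    for a, b in zip(A, B):
        f[a, b] += 1
    return f / len(A)                  # normalize so the entries sum to 1

# Toy call; with the full 150 Iris rows this reproduces the estimates above.
print(joint_pmf(np.array([5.1, 7.2, 6.3]), np.array([3.6, 3.0, 3.3])))
```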

Treating attributes X1 and X2 in the Iris dataset (see Table 1.1) as continuous random variables, we can define a continuous bivariate random variable $\mathbf{X} = (X_1, X_2)^T$. Assuming that X follows a bivariate normal distribution, its joint probability density function is given as

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{2\pi \sqrt{|\boldsymbol{\Sigma}|}} \exp\left\{ -\frac{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}{2} \right\}$$

Here µ and Σ are the parameters of the bivariate normal distribution, representing the 2-dimensional mean vector and covariance matrix, which are discussed in detail in Chapter 2. Further, |Σ| denotes the determinant of Σ.

Figure 1.10. Joint probability mass function: X1 (long sepal length), X2 (long sepal width).

Figure 1.11. Bivariate normal density: µ = (5.843, 3.054)^T (solid circle).

The plot of the bivariate normal density is given in Figure 1.11, with mean

$$\boldsymbol{\mu} = (5.843, 3.054)^T$$

and covariance matrix

$$\boldsymbol{\Sigma} = \begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}$$

It is important to emphasize that the function f(x) specifies only the probability density at x, and f(x) ≠ P(X = x). As before, we have P(X = x) = 0.
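Evaluating this bivariate density at any point takes one call with scipy.stats.multivariate_normal; a sketch under the parameter estimates quoted above:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([5.843, 3.054])
Sigma = np.array([[0.681, -0.039],
                  [-0.039, 0.187]])

rv = multivariate_normal(mean=mu, cov=Sigma)
print(rv.pdf(mu))  # density at the mean, the peak of the bell in Figure 1.11
```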

Joint Cumulative Distribution Function

The joint cumulative distribution function for two random variables X1 and X2 is defined as the function F such that, for all values x1, x2 ∈ (−∞, ∞),

$$F(\mathbf{x}) = F(x_1, x_2) = P(X_1 \leq x_1 \text{ and } X_2 \leq x_2) = P(\mathbf{X} \leq \mathbf{x})$$

Statistical Independence

Two random variables X1 and X2 are said to be (statistically) independent if, for every W1 ⊂ R and W2 ⊂ R, we have

$$P(X_1 \in W_1 \text{ and } X_2 \in W_2) = P(X_1 \in W_1) \cdot P(X_2 \in W_2)$$

Furthermore, if X1 and X2 are independent, then the following two conditions are also satisfied:

$$F(\mathbf{x}) = F(x_1, x_2) = F_1(x_1) \cdot F_2(x_2)$$
$$f(\mathbf{x}) = f(x_1, x_2) = f_1(x_1) \cdot f_2(x_2)$$

where Fi is the cumulative distribution function, and fi is the probability mass or density function for random variable Xi.
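In the discrete case one can probe independence empirically by comparing the estimated joint pmf to the product of its marginals; a sketch using the estimates from Example 1.10 (only suggestive, since these are sample estimates, not the true distribution):

```python
import numpy as np

# Joint pmf of (A, B) estimated in Example 1.10.
f = np.array([[0.773, 0.140],
              [0.067, 0.020]])

f1 = f.sum(axis=1)  # marginal pmf of A: P(A = 0), P(A = 1)
f2 = f.sum(axis=0)  # marginal pmf of B: P(B = 0), P(B = 1)

# Under independence f(x1, x2) = f1(x1) * f2(x2); compare the two.
print(f)
print(np.outer(f1, f2))  # noticeably different from f, e.g. at (1, 1)
```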

1.4.2 Multivariate Random Variable

A d-dimensional multivariate random variable $\mathbf{X} = (X_1, X_2, \ldots, X_d)^T$, also called a vector random variable, is defined as a function that assigns a vector of real numbers to each outcome in the sample space, that is, X : O → R^d. The range of X can be denoted as a vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$. In case all Xj are numeric, then X is by default assumed to be the identity function. In other words, if all attributes are numeric, we can treat each outcome in the sample space (i.e., each point in the data matrix) as a vector random variable. On the other hand, if the attributes are not all numeric, then X maps the outcomes to numeric vectors in its range.

If all Xj are discrete, then X is jointly discrete and its joint probability mass function f is given as

$$f(\mathbf{x}) = P(\mathbf{X} = \mathbf{x})$$
$$f(x_1, x_2, \ldots, x_d) = P(X_1 = x_1, X_2 = x_2, \ldots, X_d = x_d)$$

If all Xj are continuous, then X is jointly continuous and its joint probability density function is given as

$$P(\mathbf{X} \in W) = \int \cdots \int_{\mathbf{x} \in W} f(\mathbf{x})\, d\mathbf{x}$$
$$P\big((X_1, X_2, \ldots, X_d)^T \in W\big) = \int \cdots \int_{(x_1, x_2, \ldots, x_d)^T \in W} f(x_1, x_2, \ldots, x_d)\, dx_1\, dx_2 \ldots dx_d$$

for any d-dimensional region W ⊆ R^d.

The laws of probability must be obeyed as usual, that is, f(x) ≥ 0 and the sum of f over all x in the range of X must be 1. The joint cumulative distribution function of $\mathbf{X} = (X_1, \ldots, X_d)^T$ is given as

$$F(\mathbf{x}) = P(\mathbf{X} \leq \mathbf{x})$$
$$F(x_1, x_2, \ldots, x_d) = P(X_1 \leq x_1, X_2 \leq x_2, \ldots, X_d \leq x_d)$$

for every point x ∈ R^d.

We say that X1, X2, ..., Xd are independent random variables if and only if, for every region Wi ⊂ R, we have

$$P(X_1 \in W_1 \text{ and } X_2 \in W_2 \cdots \text{ and } X_d \in W_d) = P(X_1 \in W_1) \cdot P(X_2 \in W_2) \cdot \cdots \cdot P(X_d \in W_d) \tag{1.10}$$

If X1, X2, ..., Xd are independent, then the following conditions are also satisfied:

$$F(\mathbf{x}) = F(x_1, \ldots, x_d) = F_1(x_1) \cdot F_2(x_2) \cdot \ldots \cdot F_d(x_d)$$
$$f(\mathbf{x}) = f(x_1, \ldots, x_d) = f_1(x_1) \cdot f_2(x_2) \cdot \ldots \cdot f_d(x_d) \tag{1.11}$$

where Fi is the cumulative distribution function, and fi is the probability mass or density function for random variable Xi.

1.4.3 Random Sample and Statistics

The probability mass or density function of a random variable X may follow some known form, or, as is often the case in data analysis, it may be unknown. When the probability function is not known, it may still be convenient to assume that the values follow some known distribution, based on the characteristics of the data. However, even in this case, the parameters of the distribution may still be unknown. Thus, in general, either the parameters, or the entire distribution, may have to be estimated from the data.

In statistics, the word population is used to refer to the set or universe of all entities under study. Usually we are interested in certain characteristics or parameters of the entire population (e.g., the mean age of all computer science students in the United States). However, looking at the entire population may not be feasible or may be too expensive. Instead, we try to make inferences about the population parameters by drawing a random sample from the population, and by computing appropriate statistics from the sample that give estimates of the corresponding population parameters of interest.

Univariate Sample

Given a random variable X, a random sample of size n from X is defined as a set of n independent and identically distributed (IID) random variables S1, S2, ..., Sn, that is, all of the Si's are statistically independent of each other, and follow the same probability mass or density function as X.

If we treat attribute X as a random variable, then each of the observed values of X, namely, xi (1 ≤ i ≤ n), are themselves treated as identity random variables, and the observed data is assumed to be a random sample drawn from X. That is, all xi are considered to be mutually independent and identically distributed as X. By Eq. (1.11) their joint probability function is given as

$$f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i)$$

where fX is the probability mass or density function for X.
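This product is exactly the likelihood of the sample; because it underflows quickly for large n, one usually works with its logarithm. A sketch under the normal model of Example 1.8 (the use of scipy.stats and the name sample are our assumptions):

```python
import numpy as np
from scipy.stats import norm

sample = np.array([4.9, 5.8, 6.1, 5.4, 7.0])  # a small illustrative sample

# Joint density of an IID sample: product of per-point densities,
# computed as a sum of log-densities for numerical stability.
log_lik = norm.logpdf(sample, loc=5.84, scale=np.sqrt(0.681)).sum()
print(log_lik, np.exp(log_lik))
```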

Multivariate Sample

For multivariate parameter estimation, the n data points xi (with 1 ≤ i ≤ n) constitute a d-dimensional multivariate random sample drawn from the vector random variable X = (X1, X2, ..., Xd). That is, xi are assumed to be independent and identically distributed, and thus their joint distribution is given as

$$f(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) = \prod_{i=1}^{n} f_{\mathbf{X}}(\mathbf{x}_i) \tag{1.12}$$

where fX is the probability mass or density function for X.
