TP 1: Description of a qualitative or discrete variable

Academic year: 2022

Université Joseph Fourier, L2/STA230

The aim of these 13 tutorials is to enable you to carry out a standard statistical study of your experimental data, whether the variables are qualitative, discrete or continuous.

Objectives: The first objective of this session is to learn the basic commands for handling a data series stored in a vector. The second is to compute the usual descriptive indicators and draw the usual graphs for a qualitative or discrete variable.

R is a free software environment for statistical computing and graphics, and RStudio is a free interface for working with R. You can download RStudio to your computer from www.rstudio.com. R (and therefore RStudio) is case-sensitive: it distinguishes capital from non-capital letters.

RStudio has four windows:

• Lower left corner: type an instruction and execute it with Enter.

• Upper left corner: write the "script" that saves all your instructions. Ctrl+Enter sends the current instruction to the lower left corner and executes it there.

• Lower right corner: plots appear here.

• Upper right corner: history of the instructions.

1 First steps with RStudio

1. Open a new script. Save your file as “Lab1.R”.

2. Create the data series (1,2,3,4,5) with the instruction:

c(1,2,3,4,5)

c() is the concatenation function: it combines the values it is given into a single vector.

3. The previous vector has no name. To give a name to a vector:

c(1,2,3,4,5) -> x

or

x <- c(1,2,3,4,5)

To check that the vector x contains the values, execute the instruction x

4. Create the vector y with the values (2,4,6,8,10).

5. Check that x and y have the same length (total number of values):

length(x)
length(y)

6. Plot the points defined by the two vectors (x, y):

plot(x,y)

7. Customize your plot:

plot(x,y, main="y given x", xlab = "x", ylab = "y") # add a title and labels on both axes
plot(x,y, type = "p", pch = 3) # change the symbol
plot(x,y, type = "b") # add a line
plot(x,y, col = "red") # change the color

All the instructions are described in the help:

help(plot)

8. Save your plot: click "Export" in the lower right corner and save the plot as a PDF file.
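
The export in step 8 can also be written in the script. The sketch below assumes the vectors x and y from steps 2 and 4; the file name Lab1_plot.pdf and the variable plot_file are illustrative and not part of the handout. It opens a PDF graphics device with pdf(), draws the plot, and closes the device with dev.off() so that the file is written.

```r
# Vectors from steps 2 and 4
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Both vectors must have the same length to be plotted against each other
stopifnot(length(x) == length(y))

# "Lab1_plot.pdf" is a hypothetical file name, written to a temporary folder here
plot_file <- file.path(tempdir(), "Lab1_plot.pdf")

pdf(plot_file)         # open a PDF graphics device
plot(x, y, main = "y given x", xlab = "x", ylab = "y", type = "b", col = "red")
invisible(dev.off())   # close the device so that the file is written
```

Replacing tempdir() with a folder of your choice keeps the file next to Lab1.R.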

2 Basic computations

1. Basic operations:

x/5 # multiply or divide a vector by a scalar
x+5 # add a scalar to a vector
sum(x) # sum up the elements of x
cumsum(x) # cumulative sums of x
sqrt(x) # square root of x
x^3 # x to the power 3

2. Add new values to the vector x:

c(x,6)

c(x,1,1,1,1,1)

c(x,rep(1,5)) # rep() repeats the same values

c(x,seq(from=1, to=10, by=2)) # seq() creates a sequence

3. Extract the second and fourth values of y:

y[c(2,4)]

Try to understand the following instructions:

y[1:4]

y[(y>4)]

y[(y!=4)]

y[(y==4)]

y[(y>4)&(y<=6)]

4. Basic operations with two vectors:

x+y
x*y
x/y

5. Create a table (or a matrix) with the two vectors x and y:

cbind(x,y) # matrix with 5 rows and 2 columns
rbind(x,y) # matrix with 2 rows and 5 columns
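
The indexing instructions in step 3 all rely on the same mechanism: a comparison such as y > 4 returns a vector of TRUE and FALSE values, and the square brackets keep the positions where it is TRUE. The sketch below assumes x and y as defined in section 1; the name keep is illustrative.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

keep <- y > 4           # logical vector: FALSE FALSE TRUE TRUE TRUE
y[keep]                 # values of y at the TRUE positions: 6 8 10
y[(y > 4) & (y <= 6)]   # & combines two conditions element by element: 6

# cbind() binds the vectors as columns, rbind() binds them as rows
dim(cbind(x, y))        # 5 2
dim(rbind(x, y))        # 2 5
```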

3 Basic statistics of a qualitative or discrete variable

The table below gives the number of non-smoking mothers by age at delivery.

age 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

number 7 8 9 10 12 3 2 5 4 5 2 4 2 0 1

1. Is this a discrete or continuous variable?

2. What are the levels?

3. Create the vector with all the ages

ages <- c(rep(21,7), rep(22,8),rep(23,9),rep(24,10),rep(25,12), rep(26,3),rep(27,2), rep(28,5),rep(29,4),rep(30,5),rep(31,2), rep(32,4),rep(33,2),35)

4. Find the numbers of each level:

t<-table(ages); t

5. Find the empirical frequencies of the levels, by dividing the numbers by the total number of individuals.

6. Plot the empirical frequencies with a bar plot:

barplot(t)

What is the problem?

The solution is to create a vector with the numbers of all the levels, without forgetting the levels with a number of 0:

fages <- c(7,8,9,10,12,3,2,5,4,5,2,4,2,0,1) # include the 0 for age 34
names(fages) <- as.character(21:35) # name each entry with its age, as in a table
barplot(fages)
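
An alternative, offered here as a sketch rather than as the method intended by the handout, is to declare every possible level with factor() before tabulating, so that table() reports age 34 with a number of 0. The name counts is illustrative.

```r
ages <- c(rep(21,7), rep(22,8), rep(23,9), rep(24,10), rep(25,12), rep(26,3),
          rep(27,2), rep(28,5), rep(29,4), rep(30,5), rep(31,2), rep(32,4),
          rep(33,2), 35)

# Declaring the levels 21 to 35 keeps the level 34, which has no observation
counts <- table(factor(ages, levels = 21:35))
counts["34"]                    # number of mothers aged 34: 0
barplot(counts / sum(counts))   # bar plot of the empirical frequencies
```

This avoids retyping the numbers by hand, at the cost of having to know the full range of levels in advance.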

7. Verify that the empirical frequencies are the same:

fages/sum(fages)

8. Find the empirical mean of the age of non-smoking mothers at delivery.

9. Compare the result with:

mean(ages)
sum((21:35)*fages)/sum(fages)

10. Find the empirical variance (denoted $s^2$) and the corrected empirical variance ($s'^2 = \frac{n}{n-1}\, s^2$).

11. Find the standard deviation of the sample.

12. Compare the results with

var(ages) # is this equal to s^2 or s'^2?
sd(ages)
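
The comparison in step 12 can be checked numerically. The sketch below rebuilds the ages from the table with rep() and computes both variances by hand; the names s2 and s2_corrected are illustrative. In R, var() divides by n - 1.

```r
# Same data as in step 3: each age repeated according to its number
ages <- rep(21:35, times = c(7, 8, 9, 10, 12, 3, 2, 5, 4, 5, 2, 4, 2, 0, 1))
n <- length(ages)

s2 <- mean((ages - mean(ages))^2)   # empirical variance, denominator n
s2_corrected <- n / (n - 1) * s2    # corrected variance, denominator n - 1

all.equal(var(ages), s2_corrected)        # TRUE
all.equal(sd(ages), sqrt(s2_corrected))   # TRUE: sd() is the square root of var()
```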

13. Find the values of the empirical distribution function.

14. Plot the empirical distribution function with plot(ecdf(ages)).

15. What is the empirical frequency of the interval [22; 25]?

16. Give the median and the quartiles of the sample:

summary(ages) # minimum, quartiles, mean and maximum
boxplot(ages) # box-and-whisker plot
quantile(ages,c(0.25,0.5,0.75)) # quartiles
median(ages) # median
IQR(ages) # interquartile range
plot(ecdf(ages)) # ecdf
abline(h=0.5,col="red") # horizontal at 0.5
abline(v=median(ages),col="red") # vertical at median
abline(h=0.25,col="blue") # horizontal at 0.25
abline(v=quantile(ages,0.25),col="blue") # vertical at first quartile
abline(h=0.75,col="green") # horizontal at 0.75
abline(v=quantile(ages,0.75),col="green") # vertical at third quartile

17. Compare the mean with the median, then the standard deviation with the distances between the median and the quartiles.
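
Steps 13 and 15 are linked: ecdf() returns a function, so the empirical distribution function can be evaluated at any age, and the frequency of an interval is a difference of two of its values. A minimal sketch, with the illustrative names Fn and freq_interval:

```r
# Same data as in step 3: each age repeated according to its number
ages <- rep(21:35, times = c(7, 8, 9, 10, 12, 3, 2, 5, 4, 5, 2, 4, 2, 0, 1))

Fn <- ecdf(ages)   # Fn(a) is the proportion of ages less than or equal to a
Fn(21:35)          # values of the empirical distribution function

# Frequency of [22; 25]: ages are whole numbers, so subtract the proportion <= 21
freq_interval <- Fn(25) - Fn(21)
all.equal(freq_interval, mean(ages >= 22 & ages <= 25))   # TRUE
```

Subtracting Fn(21) rather than Fn(22) matters because the interval is closed at 22.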

4 Exercise

The table below gives the number of smoking mothers by age at delivery.

age 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

number 5 5 4 3 3 5 1 4 3 2 3 2 1 1 1

1. Is this a discrete or continuous variable?

2. What are the levels?

3. Find the empirical frequencies of the levels.

4. Plot the empirical frequencies with a bar chart.

5. Find the empirical mean, variance and standard deviation of the age of smoking mothers at delivery.

6. What is the empirical frequency of the interval [22; 25]?

7. Draw a graphical representation of the empirical distribution function. Determine from the graph the median and the quartiles of the sample.

8. Compare the mean with the median, then the standard deviation with the distances between the median and the quartiles.
