Université Joseph Fourier L2/STA230
TP 1: Description of a qualitative or discrete variable
The aim of these 13 tutorials is to enable you to carry out a standard statistical study of your experimental data, with qualitative, discrete and continuous variables.
Objectives: The first objective of this session is to learn the basic commands for handling a data series stored in a vector. The second objective is to compute the usual descriptive indicators and graphs for a qualitative or a discrete variable.
RStudio is a free development environment for R, a language and environment for statistical computing and graphics. You can download it from www.rstudio.com. R is case-sensitive: upper-case and lower-case letters are distinct.
RStudio has four windows:
• Lower left corner: the console, where instructions are executed with Enter.
• Upper left corner: the script, where you write and save all your instructions. Ctrl+Enter sends the current instruction to the console in the lower left corner.
• Lower right corner: plots will appear here.
• Upper right corner: history of the instructions
1 First steps with RStudio
1. Open a new script. Save your file as “Lab1.R”.
2. Create the data series (1,2,3,4,5) with the instruction:
c(1,2,3,4,5)
c() is the concatenation function: it combines the given values into a vector.
3. The previous vector has no name. To give a name to a vector:
c(1,2,3,4,5) -> x
or
x <- c(1,2,3,4,5)
To check that the vector x contains the values, execute the instruction x
4. Create the vector y with the values (2,4,6,8,10).
5. Check that x and y have the same length (total number of values):
length(x)
length(y)
6. Plot the points defined by the two vectors (x, y):
plot(x,y)
7. Customize your plot:
plot(x,y, main="y given x", xlab = "x", ylab = "y") # add a title and labels on both axes
plot(x,y, type = "p", pch = 3) # change the symbol
plot(x,y, type = "b") # add a line
plot(x,y, col = "red") # change the color
All the options are described in the help:
help(plot)
8. Save your plot: in the lower right window, click "Export" and save the plot as a PDF file.
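The plot can also be saved from the script instead of the menu. The lines below are a minimal sketch, assuming the vectors x and y from steps 2 and 4. The file name Lab1_plot.pdf and the variable outfile are hypothetical choices made for this illustration, and the file is written to R's temporary folder; replace the path with any location you prefer.

```r
# Vectors from steps 2 and 4
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Hypothetical output file, placed in the session's temporary folder
outfile <- file.path(tempdir(), "Lab1_plot.pdf")

pdf(outfile)     # open a PDF graphics device: plots now go to the file
plot(x, y, main = "y given x", xlab = "x", ylab = "y")
dev.off()        # close the device so that the file is written to disk
```

While the PDF device is open, nothing appears in the lower right window; the plot is drawn into the file only.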
2 Basic computations
1. Basic operations:
x/5 # multiply or divide a vector by a scalar
x+5 # add a scalar to a vector
sum(x) # sum up the elements of x
cumsum(x) # cumulative sums of x
sqrt(x) # square root of x
x^3 # x to the power 3
2. Add new values to the vector x:
c(x,6)
c(x,1,1,1,1,1)
c(x,rep(1,5)) # rep() repeats the same values
c(x,seq(from=1, to=10, by=2)) # seq() creates a sequence
3. Extract the second and fourth values of y:
y[c(2,4)]
Try to understand the following instructions:
y[1:4]
y[(y>4)]
y[(y!=4)]
y[(y==4)]
y[(y>4)&(y<=6)]
4. Basic operations with two vectors:
x+y
x*y
x/y
5. Create a table (or a matrix) with the two vectors x and y:
cbind(x,y) # matrix with 5 rows and 2 columns
rbind(x,y) # matrix with 2 rows and 5 columns
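The bracket instructions in step 3 rely on logical vectors: a comparison such as y>4 returns one TRUE or FALSE per element, and the brackets keep the elements for which the value is TRUE. The sketch below spells this out, assuming the vector y from section 1. The names keep and both are introduced here only for illustration.

```r
y <- c(2, 4, 6, 8, 10)

keep <- y > 4                # logical vector: FALSE FALSE TRUE TRUE TRUE
y[keep]                      # same result as y[(y>4)]: 6 8 10

both <- (y > 4) & (y <= 6)   # element-wise AND of two conditions
y[both]                      # only the value 6 satisfies both

which(y == 4)                # position (not value) where the condition holds: 2
sum(y > 4)                   # number of elements greater than 4: 3
```

Storing the condition in a variable first makes it easy to inspect which elements are selected before extracting them.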
3 Basic statistics of a qualitative or discrete variable
The table below gives the number of non-smoking mothers by age at delivery.
age    21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
number  7  8  9 10 12  3  2  5  4  5  2  4  2  0  1
1. Is this a discrete or a continuous variable?
2. What are the levels?
3. Create the vector with all the ages
ages <- c(rep(21,7), rep(22,8), rep(23,9), rep(24,10), rep(25,12),
          rep(26,3), rep(27,2), rep(28,5), rep(29,4), rep(30,5),
          rep(31,2), rep(32,4), rep(33,2), 35)
4. Find the count of each level:
t<-table(ages); t
5. Find the empirical frequencies of the levels by dividing the counts by the total number of individuals.
6. Plot the empirical frequencies with a bar plot:
barplot(t)
What is the problem?
The solution is to create a vector of counts that does not omit the levels with zero frequency:
fages <- c(7,8,9,10,12,3,2,5,4,5,2,4,2,0,1) # include the 0 count
names(fages) <- as.character(21:35) # label each count with its level
barplot(fages)
7. Verify that the empirical frequencies are the same:
fages/sum(fages)
8. Find the empirical mean of the age of non-smoking mothers at delivery.
9. Compare the result with:
mean(ages)
sum((21:35)*fages)/sum(fages)
10. Find the empirical variance (denoted $s^2$) and the corrected empirical variance ($s'^2 = \frac{n}{n-1} s^2$).
11. Find the standard deviation of the sample.
12. Compare the results with:
var(ages) # is this equal to s^2 or s'^2?
sd(ages)
13. Find the values of the empirical distribution function.
14. Plot the empirical distribution function with plot(ecdf(ages)).
15. What is the empirical frequency of the interval [22; 25]?
16. Give the median and the quartiles of the sample:
summary(ages) # some values
boxplot(ages) # box-and-whisker plot
quantile(ages,c(0.25,0.5,0.75)) # quartiles
median(ages) # median
IQR(ages) # interquartile range
plot(ecdf(ages)) # ecdf
abline(h=0.5,col="red") # horizontal at 0.5
abline(v=median(ages),col="red") # vertical at median
abline(h=0.25,col="blue") # horizontal at 0.25
abline(v=quantile(ages,0.25),col="blue") # vertical at first quartile
abline(h=0.75,col="green") # horizontal at 0.75
abline(v=quantile(ages,0.75),col="green") # vertical at third quartile
17. Compare the mean with the median, then the standard deviation with the distances between the median and the quartiles.
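Steps 5, 10 and 15 of this section can be approached directly from the counts. The sketch below shows one possible way, not the only one. The names levels_age, n, freq, m, s2, s2_corr, f_22_25 and Fn are chosen here for illustration and do not appear elsewhere in this sheet.

```r
# Counts per age from the table of non-smoking mothers (age 34 has count 0)
levels_age <- 21:35
fages <- c(7, 8, 9, 10, 12, 3, 2, 5, 4, 5, 2, 4, 2, 0, 1)
ages <- rep(levels_age, times = fages)   # rebuild the raw sample from the counts

n <- sum(fages)                          # sample size
freq <- fages / n                        # empirical frequencies (step 5)

m <- sum(levels_age * freq)              # empirical mean
s2 <- sum((levels_age - m)^2 * freq)     # empirical variance (denominator n)
s2_corr <- n / (n - 1) * s2              # corrected variance (denominator n - 1)

all.equal(s2_corr, var(ages))            # var() uses the n - 1 denominator

# Empirical frequency of the interval [22; 25] (step 15), computed two ways
f_22_25 <- mean(ages >= 22 & ages <= 25) # proportion of observations in the interval
Fn <- ecdf(ages)                         # empirical distribution function
Fn(25) - Fn(21)                          # ages are integers, so Fn(21) covers all ages < 22
```

The comparison with all.equal() rather than == avoids false mismatches caused by floating-point rounding.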
4 Exercise
The table below gives the number of smoking mothers by age at delivery.
age    21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
number  5  5  4  3  3  5  1  4  3  2  3  2  1  1  1
1. Is this a discrete or a continuous variable?
2. What are the levels?
3. Find the empirical frequencies of the levels.
4. Plot the empirical frequencies with a bar chart.
5. Find the empirical mean, variance and standard deviation of the age of smoking mothers at delivery.
6. What is the empirical frequency of the interval [22; 25]?
7. Draw a graphical representation of the empirical distribution function. Determine from the graph the median and the quartiles of the sample.
8. Compare the mean with the median, then the standard deviation with the distances between the median and the quartiles.