Statistical inference is the process of drawing conclusions from data, for example through confidence intervals and significance tests. Take the mean as an example. The mean is the most commonly discussed parameter of a population, and it describes the center of a distribution. I will use the data on this webpage as an example to show how to make inferences about means.

These data were taken from the Child Health and Development Studies conducted at the Kaiser Foundation Hospital in Oakland, CA, in which the birth weights of infants were recorded. The variables are

  1. bwt: baby’s weight in ounces at birth
  2. gestation: duration of pregnancy in days
  3. parity: parity indicator (first born = 1, later birth = 0)
  4. age: mother’s age in years
  5. height: mother’s height in inches
  6. weight: mother’s weight in pounds (during pregnancy)
  7. smoke: indicator for whether mother smokes (1=yes, 0=no)

We need to import these data first. R provides many functions for importing data from files. The one most often used is probably read.table(), which is suitable for importing spreadsheet-like data stored in a text file (e.g., *.txt). Many people prefer entering data in MS Excel and storing them as an *.xlsx file. This type of file is not a plain-text file (e.g., *.txt or *.csv), so it is recommended to save your data in MS Excel as a *.csv file. Although there are R packages that can import data from Excel files perfectly well, it is easier and less troublesome to import data from a text file in R.
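
For example, here is a minimal sketch of importing a *.csv file exported from Excel; the file name mydata.csv is only a placeholder for your own file.

# Hypothetical example: import a *.csv file saved from MS Excel.
# read.csv() is read.table() with header=TRUE and sep="," as defaults.
my.dta<-read.csv("mydata.csv")
str(my.dta)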

Example 1

Normally, read.table() requires users to fill in at least the first two arguments: the file name (including the file path) and whether the file has a header. In the code below, the first line saves the string “http://people.reed.edu/~jones/141/Bwt.dat” as a variable url. This string records the path and name of the target file. The second line calls read.table(). The first argument file=url defines the file name and path. The second argument header=T means the first line of this data set contains the headers; here T is the abbreviation for TRUE. The third argument sep=“,” indicates that the columns are separated by “,”, as shown on the webpage. Note that the object returned by read.table() is a data frame.

url<-"http://people.reed.edu/~jones/141/Bwt.dat"
bw.dta<-read.table(file=url,header=T,sep=",")

If there is no Warning or Error message in the console pane, the data have been imported successfully. Now check the structure of the data. There are 1,174 observations and 7 variables. You can also use dim() to check the numbers of rows and columns, which are likewise 1174 and 7. The function dim() returns the dimensions of a matrix. Any data frame can be viewed as a matrix, so dim() works for a data frame as well. We can also glance at this data set by showing its first 6 rows with head().

str(bw.dta)
## 'data.frame':    1174 obs. of  7 variables:
##  $ bwt      : int  120 113 128 108 136 138 132 120 143 140 ...
##  $ gestation: int  284 282 279 282 286 244 245 289 299 351 ...
##  $ parity   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age      : int  27 33 28 23 25 33 23 25 30 27 ...
##  $ height   : int  62 64 64 67 62 62 65 62 66 68 ...
##  $ weight   : int  100 135 115 125 93 178 140 125 136 120 ...
##  $ smoke    : int  0 0 1 1 0 0 0 0 1 0 ...
dim(bw.dta)
## [1] 1174    7
head(bw.dta)
##   bwt gestation parity age height weight smoke
## 1 120       284      0  27     62    100     0
## 2 113       282      0  33     64    135     0
## 3 128       279      0  28     64    115     1
## 4 108       282      0  23     67    125     1
## 5 136       286      0  25     62     93     0
## 6 138       244      0  33     62    178     0

Describe Data

The first variable in this data set is the birth weight of infants. Suppose we are interested in estimating the mean birth weight of infants in the world. What can we do with the data from this sample of 1,174 infants? First, of course, we can plot a histogram. R is very powerful for making excellent graphics. Here I will introduce how to use the functions in the package {ggplot2} to make your graphics. First, you have to install this package by clicking on the Packages tab in the lower-right pane, clicking the Install button, and entering ggplot2 in the dialog box. Whenever you want to use the functions in the package {ggplot2}, you need to load it into memory in advance with library().
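
If you prefer typing to clicking, the same installation can be done from the console; this only needs to be run once.

# Download and install {ggplot2} from CRAN (run once)
install.packages("ggplot2")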

library(ggplot2)

The principle of using {ggplot2} to make graphics is very straightforward. Just like painting on a canvas, we put on the graphics layer by layer. The first layer is defined by ggplot(), within which the argument data defines the data source and the function aes() sets up the coordinates of the figure and other parameters for plotting. You can run ?ggplot() in the console pane to see the details of this function. The second layer is the histogram itself, which is plotted by calling geom_histogram() with bins=10 to divide the whole value range into 10 bins. Note that we use + to link different graphic layers in ggplot.

fg<-ggplot(data=bw.dta,aes(bwt))+
      geom_histogram(bins=10)
fg

We can change the colors of the contour and the body of the bars by assigning color names to the arguments col and fill respectively. The argument alpha sets the degree of transparency.

fg<-ggplot(data=bw.dta,aes(bwt))+
      geom_histogram(bins=10,col="red",fill="blue",alpha=0.2)
fg

Alternatively, we can show the bar colors in a gradient to represent the counts of observations.

fg<-ggplot(data=bw.dta,aes(bwt))+
      geom_histogram(bins=10,col="red",aes(fill=after_stat(count)))+
      scale_fill_gradient("Count",low="green",high="red")
fg

In addition, we can superimpose a smooth density curve estimated from the data on the histogram. By checking the similarity between this curve and a bell shape, we can quickly get an idea of how normally distributed the data are. To this end, we need to change the Y axis from the counts of observations to the probability densities of observations. Subsequently, we add one more layer to plot the density curve with geom_density().

fg<-ggplot(data=bw.dta,aes(bwt))+
      geom_histogram(aes(y=after_stat(density)),bins=10,col="red",fill="blue",alpha=0.2)+
      geom_density()
fg
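
Note that geom_density() draws a kernel density estimate from the data rather than a theoretical normal curve. If you want to overlay the normal curve with the sample mean and standard deviation instead, one possible sketch uses stat_function(); fg.n below is just a new variable name.

# Overlay the theoretical normal density with the sample mean and SD of bwt
fg.n<-ggplot(data=bw.dta,aes(bwt))+
      geom_histogram(aes(y=after_stat(density)),bins=10,col="red",fill="blue",alpha=0.2)+
      stat_function(fun=dnorm,args=list(mean=mean(bw.dta$bwt),sd=sd(bw.dta$bwt)))
fg.n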

Normally, we use a Q-Q plot to check how close the observed data are to the theoretical quantiles generated by the standard normal distribution. In {ggplot2}, it is very easy to make a Q-Q plot. Different from the previous settings for the data source, now we use the argument sample=bwt to indicate that the variable bwt is to be compared against the standard normal distribution. The points in a Q-Q plot are computed and plotted by stat_qq(), and the Q-Q line is plotted by stat_qq_line(). Finally, we can plot the Q-Q plot together with the histogram by grid.arrange(), which comes from the package {gridExtra}. As shown in the two figures below, the distribution of the observed birth weights looks quite normal.

fg1<-ggplot(data=bw.dta,aes(sample=bwt))+
      stat_qq(col="blue",alpha=0.2)+stat_qq_line(col="red")
library(gridExtra)
grid.arrange(fg,fg1,ncol=2)

In addition to the histogram, we can directly compute the statistics of the distribution of this sample. summary() reports the mean and median, and the mode can be computed by the code underneath it. Thus, the mean, median, and mode are respectively 119.5, 120, and 115. These three indices of central tendency are quite close to each other, so it is reasonable to regard this distribution as approximately normal.

summary(bw.dta$bwt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    55.0   108.0   120.0   119.5   131.0   176.0
# Tabulate the frequency of each birth weight value
weights<-table(bw.dta$bwt)
# Find the most frequent value, i.e., the mode
flag<-which(weights==max(weights))
as.numeric(names(weights[flag]))
## [1] 115

Inference about Mean

Now we can draw an inference about the population mean weight via a 95% confidence interval centered at the sample mean. The upper bound of the confidence interval is computed as sample mean + t(0.975) x SE, whereas the lower bound is computed as sample mean + t(0.025) x SE. Therefore, we compute the standard error, SE = standard deviation / square root of sample size, and the two t scores for the bounds. The function qt() returns the t score given the area under the curve to its left (i.e., the cumulative probability).

# Compute standard error
SE<-sd(bw.dta$bwt)/sqrt(length(bw.dta$bwt))
SE
## [1] 0.53493
# Compute the t scores (df = n - 1)
ut<-qt(0.975,length(bw.dta$bwt)-1)
lt<-qt(0.025,length(bw.dta$bwt)-1)
c(ut,lt)
## [1]  1.961987 -1.961987
# Upper and lower bounds
c(mean(bw.dta$bwt)+lt*SE,mean(bw.dta$bwt)+ut*SE)
## [1] 118.413 120.512

In fact, we can use t.test() to get the same result without that much programming. When the first argument of t.test() is a vector, it performs a one-sample t-test, and mu defaults to 0; that is, the test examines whether the population mean is significantly different from 0. For computing the confidence interval, we can simply pass the birth weights as the first argument and set mu=mean(bw.dta$bwt). The returned results include the test result (trivially t = 0 here, because mu equals the sample mean) as well as the confidence interval. The bounds are the same as those returned by the code above.

bw.t.result<-t.test(bw.dta$bwt,mu=mean(bw.dta$bwt))
bw.t.result
## 
##  One Sample t-test
## 
## data:  bw.dta$bwt
## t = 0, df = 1173, p-value = 1
## alternative hypothesis: true mean is not equal to 119.4625
## 95 percent confidence interval:
##  118.413 120.512
## sample estimates:
## mean of x 
##  119.4625
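
Of course, mu can be set to any hypothesized value. For instance, to test whether the population mean differs from 120 ounces (an arbitrary value chosen purely for illustration):

# One-sample t-test against a hypothesized mean of 120 ounces
t.test(bw.dta$bwt,mu=120)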

If you are familiar with the critical Z scores on the Z distribution that cut off areas of 0.025 and 0.975 under it, then you know they are -1.96 and 1.96, the same as the lower and upper t scores we just computed. This is no coincidence: as df gets larger (normally > 30), the t distribution gets close to the Z distribution. In the demonstration below, t distributions with three dfs (4, 14, and 74) as well as the Z distribution are plotted. The idea of plotting these distributions is simple: given a sequence of X values, compute the corresponding Y values and connect them by lines. In the code below, the function rep() repeats each component of its first argument (i.e., c(“t4”,“t14”,“t74”,“z”)) length(x) times, as specified by each=length(x).

The functions dt() and dnorm() are used to generate the probability densities of the t distributions with different dfs and of the Z distribution. It is obvious that as df gets larger, the t distribution gets closer to the Z distribution.

# Demonstration for t distributions and Z distribution
# First create the data
x<-seq(-4,4,0.1)
dist.dta<-data.frame(x=x,y=c(dt(x,4),dt(x,14),dt(x,74),dnorm(x,0,1)),
             g=rep(c("t4","t14","t74","z"),each=length(x)))
ggplot(data=dist.dta,aes(x=x,y=y))+
     geom_line(aes(color=g))

Compare Means

Researchers are interested in whether a mother's smoking influences the birth weight of her infant. In this data set, the indicator for whether the mother smokes is smoke. Thus, we can compare the mean birth weights of the two groups of infants whose mothers do or do not smoke. First, we can make a bar plot of the means of these two groups.

bw.dta$smoke<-as.factor(bw.dta$smoke)
fg3<-ggplot(data=bw.dta,aes(x=smoke,y=bwt,fill=smoke))+
      stat_summary(fun="mean",geom="bar")
fg3

Sometimes we want not only bars to show the group means, but also error bars of one standard error as an indicator of how different the group means are. To this end, we need to compute the standard error for each group. The first line of the code below computes the standard deviations of the two groups. Here, with() works like $: with(bw.dta,bwt) is equivalent to bw.dta$bwt. Unlike $, with() sets up the data frame in its first argument as the working environment, so we can apply any function to the variables in this data frame by calling their names directly instead of using $. In the first line of this code, tapply() applies the function sd() to the variable bwt with smoke as the grouping index, without spelling out their full paths. The returned result is a vector containing the standard deviations of the two groups.

As the standard error is computed by dividing the standard deviation by the square root of the group size, the second line of the code below computes the standard errors of the two groups.

sds<-with(bw.dta,tapply(bwt,smoke,sd))
ses<-sds/sqrt(with(bw.dta,tapply(bwt,smoke,length)))
ses
##         0         1 
## 0.6516092 0.8539380

We create a new data frame to store the group means and standard errors. In the code below, geom_bar() is used to plot the bars for the means and geom_errorbar() is used to plot the error bars. The standard errors are too small to show clearly.

dta<-data.frame(means=with(bw.dta,tapply(bwt,smoke,mean)),
                ses=ses,smoke=c("No Smoke","Smoke"))
fg4<-ggplot(data=dta,aes(x=smoke,y=means,fill=smoke))+
       geom_bar(stat="identity")+
       geom_errorbar(aes(ymin=means-ses,ymax=means+ses),width=0.2)
fg4

Visual inspection suggests that the group means differ significantly. A t-test is conducted with H0: no difference between the two means, and H1: the two means are different. Again, we use with() to set up bw.dta as the working environment, within which a t-test compares the mean birth weights of the two groups with smoke as the grouping variable. The argument var.equal=T means that the variances of the two populations are assumed to be equal. Since this t-test examines the difference between the means of two independent groups, df = df1 + df2 = total number of subjects - 2 = 1172. There are other ways to call t.test() for comparing the means of two independent groups (one is sketched after the output below); run ?t.test() in the console pane for more details.

with(bw.dta,t.test(bwt~smoke,var.equal=T))
## 
##  Two Sample t-test
## 
## data:  bwt by smoke
## t = 8.7188, df = 1172, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   7.180973 11.351312
## sample estimates:
## mean in group 0 mean in group 1 
##        123.0853        113.8192
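
As one illustration of those alternative forms, the two groups can be passed to t.test() as separate vectors; this sketch should reproduce the result of the formula interface above (bwt0 and bwt1 are just new variable names).

# Split the birth weights by the smoke indicator
bwt0<-bw.dta$bwt[bw.dta$smoke=="0"]
bwt1<-bw.dta$bwt[bw.dta$smoke=="1"]
# Two-vector form of the independent two-sample t-test
t.test(bwt0,bwt1,var.equal=T)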

Example 2

One of the common experimental designs is the pre/post design. Suppose you wish to test the effect of a medicine on the well-being of depressed individuals, using a standardized “well-being scale” that sums Likert-type items to obtain a score that can range from 0 to 20; the higher the score, the greater the reported well-being. This scale was administered to 9 participants before and after they took the medicine. The well-being scores of each participant are encoded as variables with the code below.

before<-c(3,0,6,7,4,3,2,1,4)
after<-c(5,1,5,7,10,9,7,11,8)
before
## [1] 3 0 6 7 4 3 2 1 4
after
## [1]  5  1  5  7 10  9  7 11  8

Now we can create a data frame for plotting the two group means with error bars. This data frame contains the mean well-being scores before and after taking the medicine as well as their standard errors. In addition, a grouping variable time is also included.

m.fg.dta<-data.frame(y=c(mean(before),mean(after)),
           ses=c(sd(before),sd(after))/sqrt(9),
           time=c("before","after"))
m.fg.dta
##          y      ses   time
## 1 3.333333 0.745356 before
## 2 7.000000 1.013794  after

As time here is treated as a discrete variable (i.e., before and after), a bar plot is suitable. Obviously, the well-being score gets larger after taking the medicine. Visual inspection suggests that the effect of this medicine should be significant.

fg5<-ggplot(data=m.fg.dta,aes(x=time,y=y,fill=time))+
      geom_bar(stat="identity")+
      geom_errorbar(aes(ymin=y-ses,ymax=y+ses),width=0.2)
fg5

We can conduct a paired t-test to verify our observation. Since the paired t-test is equivalent to a one-sample t-test on the differences between the paired scores, we can simply run t.test() on the differences in well-being scores between the post- and pre-test. The result confirms our visual impression that the medicine effectively increases depressed individuals' well-being.

t.test(after-before)
## 
##  One Sample t-test
## 
## data:  after - before
## t = 3.1429, df = 8, p-value = 0.01375
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.9763285 6.3570048
## sample estimates:
## mean of x 
##  3.666667
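
Equivalently, t.test() accepts the two score vectors directly with paired=T, which gives the same t, df, and p-value as the one-sample test on the differences.

# Paired t-test; equivalent to t.test(after-before)
t.test(after,before,paired=T)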

Exercise:

A sleep researcher would like to know whether sleeping time affects people's cognitive ability. He hypothesized that cognitive ability would be better after a longer sleeping time than after a shorter one. Thus, he manipulated sleeping time at two levels, 4 hours and 8 hours, for two groups of 8 participants and recorded their cognitive performance in an ACT task; the higher the score, the better the performance. The ACT scores in the 4-hr group are [5 7 5 3 5 3 3 9], whereas those in the 8-hr group are [8 1 4 6 6 4 1 2]. What kind of statistical test will you use to verify this researcher's hypothesis? What if these data were collected from the same 8 participants on different days? Please conduct the suitable statistical tests as well as plots for describing the data.
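
As a starting point, the scores can be entered as vectors; the names hr4 and hr8 are arbitrary.

# ACT scores from the problem statement
hr4<-c(5,7,5,3,5,3,3,9)
hr8<-c(8,1,4,6,6,4,1,2)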