Boxplots

In addition to histogram, we have other ways of looking at data. One of those ways gives greater prominence to the dispersion of the data, which is known as a boxplot, developed by John Tukey. For instance, Figure 2.14 shows a boxplot for the data in Table 2.7. See Table 2.8. for detailed description for the indices used in a boxplot. The box ranges from Q1 to Q3 and the middle bar in the box is the median of data, not the mean of data. The lower fence = Q1 - 1.5(interquartile range), whereas the upper fence = Q3 + 1.5(interquartile range). How to use R to make a boxplot? Let’s make a boxplot for the data in Table 2.7.

score<-c(2,1,2,3,3,9,4,20,4,1,3,2,3,2,1,33,3,NA,3,2,3,6,5,NA,3,3,2,4,7,2,4,4,10,5,3,2,2,NA,4,4,3)
dta<-data.frame(score=score)
library(ggplot2)
fg1<-ggplot(data=dta,aes(y=score))+
       geom_boxplot()+xlim(-1.5,1.5)
fg2<-ggplot(data=dta,aes(y=score))+
       geom_boxplot()+coord_flip()+xlim(-1.5,1.5)
library(gridExtra)
grid.arrange(fg1,fg2,ncol=2)
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The real usefulness of boxplots comes when we want to compare several groups. For instance, Figure 2.15 shows the reaction times in the Sternberg’s study for those cases in which the stimulus was actually present, broken down by the set size. Let’s replicate it by using R. First, we have to import those RT data from the file chap2_sternberg.txt. The function read.table( ) is specifically used for importing data from a text file. With dim( ), we know that there are 52 rows (or cases) and 6 columns (or experimental conditions). Now we create a new data frame (RTdta1) for the RT’s in the Yes conditions by the set size. In this data frame, there are only two columns, namely rt and group, respectively for the RT’s in the three Yes conditions and the corresponding set sizes of the study lists.

RTdta<-read.table("chap2_sternberg.txt",header=T,sep="")
dim(RTdta)
## [1] 52  6
RTdta1<-data.frame(rt=unlist(RTdta[,c(1,3,5)]),
                   group=rep(c(1,3,5),each=nrow(RTdta)))
dim(RTdta1)
## [1] 156   2
head(RTdta1)
##      rt group
## V1Y1 40     1
## V1Y2 41     1
## V1Y3 47     1
## V1Y4 38     1
## V1Y5 40     1
## V1Y6 37     1

We can make a boxplot for these RT’s with the following codes. The function as.factor( ) turns the mode of variable to factor. In R, a factor is a nominal variable, particular suitable for representing the group label (e.g., male vs. female). Thus, the x axis in the below figure is not a continuous dimension. Similarly, the x axis in a barplot also represents for the group label. Thus, it should be a factor. Back to the below figure, the visual inspection suggests that the RT for the set size = 1 is the shortest and the RT for the set size = 5 is the longest.

ggplot(data=RTdta1,aes(x=as.factor(group),y=rt))+
  geom_boxplot()+xlab("Set Size")

However, if the values of a variable are of the mode of character, this variable by default is a factor in a data set. In the following figure, the x axis represents the English label of each experimental condition. However, these labels follow the alphabet order. You can Google for the solution of how to adjust the order.

RTdta1$group1<-rep(c("One","Three","Five"),each=nrow(RTdta))
ggplot(data=RTdta1,aes(x=group1,y=rt))+
  geom_boxplot()

Probability Distribution

A probability distribution is a mathematical function that provides probabilities of occurrence of different outcomes in an experiment (see Wikipedia). A probability distribution is often used to describe the frequency distribution of observed scores. For instance, a normal distribution is superimposed on the histogram in Figure 2.4.

Normal Distribution

Normal distribution is the most important probability distribution that we will encounter. There are several reasons for this.

  1. Many of the dependent variables with which we deal are commonly assumed to be normally distributed in the population.

  2. If a variable can be assumed to be normally distributed, many useful mathematical techniques allow us to make a number of inferences about values of that variable.

  3. The sampling distribution of the mean, which is extensively used, is approximately normal.

  4. Most of the statistical procedures we will employ have an assumption that the population of observations is normally distributed.

Mathematically the normal distribution is defined as \(f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-0.5z^2}\), where \(z=\frac{x-\mu}{\sigma}\) with \(x\) and \(\mu\) respectively representing a value of the variable and the mean of the population of that variable. The normal distribution is a continuous probability distribution that \(-\infty \leq x \leq \infty\) and \(\int_{-\infty}^{\infty}f(x)dx=1\). The height of this distribution at \(x\) is \(f(x)\), which is called probability density or likelihood.

There are a couple of characteristics of the normal distribution. First, the normal distribution is bell shaped, namely unimodal with the peak at around the mean. Second, the normal distribution is symmetric. Third, the probability density or the height of normal distribution (i.e., \(f(x)\)) gradually decreases towards 0 (but not equal to 0) from the location of peak to the two extreme ends.

Standard Normal Distribution

A normal distribution with \(\mu=0\) and \(\sigma=1\) is called the standard normal distribution. See Figure 3.6. As \(z=\frac{x-\mu}{\sigma}\), \(z\) is the standard normal distribution.

The population mean of \(z\) or the expected value of \(z\) is 0, namely \(E(z)=0\). Given \(E(x)=\frac{\sum x}{N}\), \(E(z)=E(\frac{x-\mu}{\sigma})=\frac{\sum (x-\mu)}{N\sigma}=0\). Given that the variance of \(z\) is \(Var(z)=\sigma^2_z=\frac{\sum(z-0)^2}{N}\) and \(z^2=\frac{(x-\mu)^2}{\sigma^2}\), \(Var(z)=\frac{\sum z^2}{N}=\frac{\sum(x-\mu)^2}{N\sigma^2}=\frac{\sigma^2}{\sigma^2}=1\). For almost all basic probability distributions, R has corresponding functions. Of course, the normal distribution is not the exception. See the following figure for the standard normal distribution. The function dnorm( ) is used to compute the probability density of a normal distribution for a given value or value vectors, namely \(f(x)\). Apparently, this is a bell-shaped curve with the highest density at \(z=0\) and the density decreases symmetrically towards the two ends.

dta<-data.frame(z=seq(-3,3,0.1),y=dnorm(seq(-3,3,0.1),0,1))
ggplot(data=dta,aes(x=z,y=y))+
  geom_line()+ylab("Probability Density")

As we know that the mean of the normal distribution is also the median, that is, 50% of the values should be smaller than the mean. Also, we know that the total area underneath the normal curve is 1. Accordingly, it can be inferred that the area from \(-\infty\) to the mean (i.e., 0 in the \(z\) distribution) is 0.5. We can also say that we have 50% of chance to sample a value smaller than the mean. Note the area underneath the probability distribution is probability. The left panel in the following figure shows the area below \(z=0\), whereas the right one shows the area for \(-1.96 \leq z \leq 1.96\).

fg3<-ggplot(data=dta,aes(x=z,y=y))+
  geom_line()+
  geom_ribbon(data=subset(dta,z<0),aes(ymin=0,ymax=y),fill="tomato",alpha=0.5)
fg4<-ggplot(data=dta,aes(x=z,y=y))+
  geom_line()+
  geom_ribbon(data=subset(dta,z<=1.96 & z>=-1.96),aes(ymin=0,ymax=y),fill="royalblue",alpha=0.5)
grid.arrange(fg3,fg4,ncol=2)

Let’s see how we can compute the area/probability of the normal distribution. The function pnorm( ) is used to compute the area from \(-\infty\) to the particular value assigned by users. Thus, pnorm(0,mean=0,sd=1) asked R to return the probability to sample a value smaller than 0, given the normal distribution with mean = 0 and sd = 1. As we know that for the standard normal distribution, +3 approximates the positive infinity, we can check the probability to sample a value smaller than +3. The answer is quite close to 1, isn’t it? What if we want to compute the probability between two paricular values, just like the right panel of the above figure? You can use the area below 1.96 to minus the area below -1.96. The area in between these two values is about 0.95. That means, we have 95% of chance to sample a value from this region \(-1.96 \leq z \leq 1.96\). This is a very important feature for the standard normal distribution.

pnorm(0,mean=0,sd=1)
## [1] 0.5
pnorm(3,0,1)
## [1] 0.9986501
pnorm(1.96,0,1)-pnorm(-1.96,0,1)
## [1] 0.9500042

We can also compute the value given the area on the normal distribution. The function qnorm( ) fits this need. When checking with area=0.5, it returns 0. This means that given the area between \(-\infty\) and a \(z\) value = 0.5, that \(z\) value should be 0. Also, you can see in the following codes that the probability \(\leq z=-1.96\) is 0.025 and the probability \(\leq z=1.96\) is 0.975. This result can be calculated by hand. As we know that the probability between -1.96 and 1.96 is 0.95, the remaining area must be 1 - 0.95 = 0.05. Since the normal distribution is symmetric, the two triangular areas out of the region equally shares this residual area. That is, the lower triangular area should be 0.025. The upper triangular area should be 0.025 too. This also means that the area below 1.96 is 1 - 0.025 = 0.975.

qnorm(0.5,0,1)
## [1] 0
qnorm(0.025,0,1)
## [1] -1.959964
qnorm(0.975,0,1)
## [1] 1.959964

In addition to computing with the normal distribution, another useful function in R is to draw random samples from a normal distribution. The following codes draw three samples of n = 30, 100, and 1000 from the standard normal distribution. With these sample data, three histograms are plotted. As inspected, the larger the sample size, the more similar the histogram is like the normal distribution.

set.seed(123)
rdta1<-data.frame(x=rnorm(30,0,1))
rdta2<-data.frame(x=rnorm(100,0,1))
rdta3<-data.frame(x=rnorm(1000,0,1))
fg5<-ggplot(data=rdta1,aes(x))+geom_histogram(color="black",fill="gray",bins=30)
fg6<-ggplot(data=rdta2,aes(x))+geom_histogram(color="black",fill="gray",bins=30)
fg7<-ggplot(data=rdta3,aes(x))+geom_histogram(color="black",fill="gray",bins=30)
grid.arrange(fg5,fg6,fg7,ncol=3)