Sampling Distribution

The sampling distribution of a statistic (e.g., the sum, the mean, etc.) is the distribution of that statistic, considered as a random variable, when derived from a random sample of size \(n\).

Suppose the population is a normal distribution with \(\mu=0\) and \(\sigma=1\). If we randomly draw 100 samples (\(n=10\) in each sample) from this population, what would the distribution of these 100 sample means look like? That is, what would the sampling distribution of the mean look like? We can use R to simulate the result. First, we create the population of \(N=10,000\) values with rnorm( ). We set the random seed to 123, so that the random sequence starts from the same value every time. The histogram of this population is shown below. Obviously, this is a normal distribution.

set.seed(123)
P<-rnorm(10000,0,1)
library(ggplot2)
ggplot(data=data.frame(x=P),aes(x))+
  geom_histogram(color="white",fill="deepskyblue3",bins=30)

Now we draw 100 random samples of size \(n=10\) with replacement from this population. The function sapply( ) is used to repeat a set of commands a certain number of times. Here, the command is very simple: draw a sample of 10 values from the population, 100 times. The returned object s is a \(10\times100\) matrix, with the columns representing the sampling iterations and the rows representing the cases. With colMeans( ), we can compute the mean of each column of this matrix; these column means are exactly the 100 sample means.

s<-sapply(1:100,function(x){
  temp<-sample(P,10,replace=T)  # one sample of size 10, drawn with replacement
  return(temp)
})
dim(s)
## [1]  10 100
sm<-colMeans(s)
ggplot(data=data.frame(x=sm),aes(x))+
  geom_histogram(color="white",fill="pink",bins=30)

The histogram of these 100 sample means looks somewhat like a normal distribution. What happens if we increase the sample size? The code below generates four groups of 100 sample means with different sample sizes (i.e., \(n=10\), \(n=30\), \(n=100\), and \(n=500\)).

set.seed(123)
s1<-sapply(1:100,function(x)sample(P,10,replace=T))
s1m<-colMeans(s1)
s2<-sapply(1:100,function(x)sample(P,30,replace=T))
s2m<-colMeans(s2)
s3<-sapply(1:100,function(x)sample(P,100,replace=T))
s3m<-colMeans(s3)
s4<-sapply(1:100,function(x)sample(P,500,replace=T))
s4m<-colMeans(s4)

Now we can inspect these distributions of sample means. Clearly, from left to right, the distribution of the sample means becomes more and more similar to a normal distribution as the sample size increases. These distributions are called sampling distributions, since the mean is a sample statistic.

library(gridExtra)
fig1<-ggplot(data=data.frame(x=s1m),aes(x))+
  geom_histogram(color="white",fill="tomato",bins=20)
fig2<-ggplot(data=data.frame(x=s2m),aes(x))+
  geom_histogram(color="white",fill="tomato",bins=20)
fig3<-ggplot(data=data.frame(x=s3m),aes(x))+
  geom_histogram(color="white",fill="tomato",bins=20)
fig4<-ggplot(data=data.frame(x=s4m),aes(x))+
  geom_histogram(color="white",fill="tomato",bins=20)
grid.arrange(fig1,fig2,fig3,fig4,ncol=4)

Characteristics of the Sampling Distribution

Visual inspection of these sampling distributions suggests that their means are all around 0. Indeed, the mean of the sampling distribution is an unbiased estimate of the population mean \(\mu\). The mean of a single sample is an unbiased estimate of the population mean, \(E(\bar{x})=\mu\), so the mean of \(J\) sample means is also unbiased: \(E(\frac{1}{J}(\bar{x}_1+\cdots+\bar{x}_J))=\frac{1}{J}(E(\bar{x}_1)+\cdots+E(\bar{x}_J))=\frac{J\mu}{J}=\mu\).

However, these distributions are more condensed than the population. The standard deviation of the population is 0.998, whereas the sd's of the sampling distributions for sample sizes \(n=10\), \(n=30\), \(n=100\), and \(n=500\) are 0.317, 0.197, 0.110, and 0.044, respectively. In fact, the standard deviation of the sampling distribution is \(\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}<\sigma\).

To derive the variance of the sampling distribution, \(Var(\bar{x})\), recall that the variance of a single observation equals the population variance, \(Var(x)=\sigma^2\), and note two properties of the operator \(Var(\ )\). First, \(Var(a+b)=Var(a)+Var(b)+2cov(a,b)\), and if \(a\) is independent of \(b\), then \(cov(a,b)=0\). Second, \(Var(ax)=a^2Var(x)\). Now back to the equation: \(Var(\bar{x})=Var(\frac{\sum_{i=1}^{n}x_i}{n})\). As \(x_i\) is independent of \(x_j\), \(cov(x_i,x_j)=0\). Then, \(Var(\bar{x})=\frac{1}{n^2}\sum_{i=1}^{n}Var(x_i)=\frac{n}{n^2}\sigma^2=\frac{\sigma^2}{n}\). Therefore, the standard deviation of the sampling distribution is \(\frac{\sigma}{\sqrt{n}}\), which is called the standard error.
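
We can check this relationship numerically with the objects created above (a quick sketch; the exact values depend on the random seed):

# Simulated SDs of the four sampling distributions
c(sd(s1m),sd(s2m),sd(s3m),sd(s4m))  # reported above: 0.317, 0.197, 0.110, 0.044
# Theoretical standard errors sigma/sqrt(n), using sd(P) = 0.998
sd(P)/sqrt(c(10,30,100,500))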

Central Limit Theorem

In probability theory, the Central Limit Theorem establishes that, in many situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution as the number of variables grows (\(n\rightarrow\infty\)), even if the original variables themselves are not normally distributed (see Wikipedia).
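
In symbols, one common form of the theorem (for independent, identically distributed variables with mean \(\mu\) and finite variance \(\sigma^2\)) is:

\[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}\xrightarrow{d}N(0,1)\quad\text{as }n\rightarrow\infty\]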

When tossing a coin once, we will get either a head or a tail. If we encode a head as 1 and a tail as 0, then the population of outcomes of tossing a coin contains only two values. Obviously, this is a Bernoulli distribution, not a normal distribution. However, if we toss this coin several times and record the sum of the observed values, the sampling distribution of these sums tends to be normally distributed. Let us do this simulation with R. First, we create the population with rbinom( ). In the first line below, each call to rbinom(n=1,size=1,prob=0.5) draws one random value that equals 1 with probability 0.5.

# Population of 10,000 coin tosses: 1 = head, 0 = tail, each with probability 0.5
P<-sapply(1:10000,function(x)rbinom(n=1,size=1,prob=0.5))
ggplot(data=data.frame(x=P),aes(x))+
  geom_histogram(color="white",bins=30)

Suppose we toss this coin repeatedly and compute the sum of the outcomes on each trial as the dependent variable. According to the CLT, the sampling distribution of these sums tends to be normally distributed as the sample size increases, say from \(n=10\) and \(n=30\) to \(n=100\). Let us run this simulation for 100 trials with each sample size.

# For each sample size (n = 10, 30, 100), compute 100 sums of n coin tosses
s5<-sapply(1:100,function(j)sum(sample(P,10,replace=T)))
s6<-sapply(1:100,function(j)sum(sample(P,30,replace=T)))
s7<-sapply(1:100,function(j)sum(sample(P,100,replace=T)))
fig5<-ggplot(data=data.frame(x=s5),aes(x))+
  geom_histogram(color="white",fill="deepskyblue2",bins=10)
fig6<-ggplot(data=data.frame(x=s6),aes(x))+
  geom_histogram(color="white",fill="deepskyblue2",bins=10)
fig7<-ggplot(data=data.frame(x=s7),aes(x))+
  geom_histogram(color="white",fill="deepskyblue2",bins=10)
grid.arrange(fig5,fig6,fig7,ncol=3)

Sampling Distribution of Differences Between Means

In Chapter 4.2, an interesting study was introduced. Ruback and Juieng (1997) ran a simple study in which they divided drivers into two groups of 100 participants each: those who had someone waiting for their parking space and those who did not. The results showed that for those who had no one waiting, the average time to leave the parking space was 32.15 seconds, whereas drivers who had someone waiting took about 39.03 seconds on average to leave. For each of these groups, the standard deviation of waiting times was 14.6 seconds. We want to answer the question: "Is the obtained difference too great to be attributable to chance?" To do this we have to use the sampling distribution of the differences between mean waiting times.

Suppose the two population means are both 35 and the population standard deviation is 15. The code below creates 10,000 means with sample size \(n=100\) from each population. Then, the frequency distribution of the differences between the two means can be plotted. It is a normal distribution, and the Q-Q plot confirms this inspection.

set.seed(123)
# 10,000 sample means (n = 100) from each of two identical populations N(35, 15^2)
g1<-sapply(1:10000,function(x){
  mean(rnorm(100,35,15))})
g2<-sapply(1:10000,function(x){
  mean(rnorm(100,35,15))})
mean(g1-g2)
## [1] 0.01551114
sd(g1-g2)
## [1] 2.134577
fig9<-ggplot(data=data.frame(x=g1-g2),aes(x))+
  geom_histogram(color="white",fill="tomato",alpha=0.5,bins=30)
fig10<-ggplot(data=data.frame(x=g1-g2),aes(sample=x))+
  stat_qq(color="red",shape=1,size=0.6,alpha=0.5)+stat_qq_line()
grid.arrange(fig9,fig10,ncol=2)

The mean and the standard deviation of this sampling distribution are 0.016 and 2.135, respectively. The standard error of the sampling distribution of differences between means is \(\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}\). You can compute the standard error with this equation; see the third line of the code below.

mean(g1-g2)
## [1] 0.01551114
sd(g1-g2)
## [1] 2.134577
sqrt(2*15^2/100)
## [1] 2.12132

Now we return to the question. The observed difference between the mean waiting times of the two groups is \(39.03-32.15=6.88\). There are 6 out of 10,000 simulated differences (i.e., a proportion of 0.0006) larger than 6.88. Alternatively, we can use pnorm( ) to compute the probability of observing a difference larger than 6.88. The computed probability is 0.00065, which is quite close to the proportion obtained from the simulation of 10,000 samples.

sum((g1-g2)>6.88)
## [1] 6
1-pnorm(6.88,mean(g1-g2),sd(g1-g2))
## [1] 0.0006502841

Suppose the sampling distribution of these 10,000 differences is actually the sampling distribution we would get if there were no difference between the two groups. We can use its mean and standard error to plot the probability distribution. Clearly, 6.88 falls outside the 95% interval. If we sample one value from this "population", there is a 95% chance of getting a value between -4.17 and 4.20. However, we got 6.88, which is well outside this interval. That means this observation is quite unlikely under random sampling alone, yet it happened. Thus, we conclude that this observation does not result from random sampling, but from some real difference.
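
The interval bounds follow directly from the normal quantiles; with the simulated mean 0.016 and standard error 2.135, the 95% interval is approximately

\[0.016\pm1.96\times2.135\approx(-4.17,\ 4.20)\]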

x<-seq(-7,7,0.1)
dta<-data.frame(x=x,y=dnorm(x,mean=mean(g1-g2),sd(g1-g2)))
# Get the lower and upper bounds of the 95% interval
lower<-qnorm(0.025,mean(g1-g2),sd(g1-g2))
upper<-qnorm(0.975,mean(g1-g2),sd(g1-g2))
ggplot(dta,aes(x,y))+
  geom_line()+
  geom_ribbon(data=subset(dta,x>lower & x<=upper),
              aes(ymin=0,ymax=dnorm(x,mean(g1-g2),sd(g1-g2))),
                  fill="tomato",alpha=0.6)+
  # annotate() draws the arrow once instead of mapping single values through aes()
  annotate("segment",x=6.88,y=0.025,xend=6.88,yend=0,color="deepskyblue3",
           arrow=arrow(length=unit(0.2,"cm")))+
  annotate("text",x=0,y=0.05,label="95%")

The table below shows the SEs of three sampling distributions: for one normal population, for two normal populations, and for a Bernoulli population.

| Population | Statistic | Sampling Distribution | SE |
|---|---|---|---|
| \(N(\mu,\sigma^{2})\) | \(\overline{x}\) from samples of size \(n\) | \(\overline{x}\sim N(\mu,\frac{\sigma^{2}}{n})\) | \(\frac{\sigma}{\sqrt{n}}\) |
| Bernoulli(\(p\)) | \(\overline{x}\) | \(n\overline{x}\sim Binomial(n,p)\) | \(\sqrt{\frac{p(1-p)}{n}}\) |
| \(N(\mu_{1},\sigma_{1}^{2})\), \(N(\mu_{2},\sigma_{2}^{2})\) | \(\overline{x}_{1}-\overline{x}_{2}=\Delta\overline{x}\) | \(\Delta\overline{x}\sim N(\Delta\mu,\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}})\) | \(\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}\) |
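
As a quick check of the Bernoulli row, here is a small simulation (the values \(p=0.5\) and \(n=100\) are chosen for illustration):

set.seed(123)
# 10,000 sample proportions, each the mean of n = 100 Bernoulli(0.5) trials
pm<-sapply(1:10000,function(x)mean(rbinom(100,1,0.5)))
sd(pm)                 # simulated SE of the sample proportion
sqrt(0.5*(1-0.5)/100)  # theoretical SE = sqrt(p(1-p)/n) = 0.05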