We make inferences about our research question based on the data we collect. Hypothesis testing is a procedure for making such inferences statistically. The stages of hypothesis testing are much the same for most statistical tests, and are as follows.
Begin with a research hypothesis (or \(H_1\)). A hypothesis is a proposition which can be judged as true or false. For example, people backing out of a parking place take longer when someone is waiting. This hypothesis can be either true or false.
Set up the null hypothesis (or \(H_o\)). The null hypothesis is the hypothesis we actually test statistically: a proposition contrary to the research hypothesis, normally stating that there is no effect. For example, the null hypothesis paired with the research hypothesis that people backing out of a parking place take longer when someone is waiting is that people backing out of a parking place do not take longer when someone is waiting.
Construct the sampling distribution of the particular statistic on the assumption that \(H_o\) is true. Take the parking case as the example. We can express the research hypothesis as \(H_1:\mu_w>\mu_{nw}\) or \(\Delta\mu>0\), where \(\mu_w\) and \(\mu_{nw}\) respectively denote the mean leaving time when someone is and is not waiting, and \(\Delta\mu=\mu_w-\mu_{nw}\). The null hypothesis is naturally \(H_o:\mu_w\leq\mu_{nw}\) or \(\Delta\mu\leq0\). Since we do not know the exact difference between the means of the two population distributions if \(H_1\) is true, there is no way to know in advance the mean of the sampling distribution of the difference between the two sample means. The only thing of which we are sure is that the mean of this sampling distribution should be 0 if \(H_o\) is true. Note that although the hypotheses in this example contain an inequality, we normally use \(=\) and \(\neq\) in the hypotheses.
Collect some data. For example, we run an experiment recording the leaving times of drivers when someone is waiting and when no one is waiting.
Compare the sample statistic to that distribution. Take the study of Ruback and Juieng (1997) as the example. The mean difference in leaving time observed by Ruback and Juieng (1997) is 6.88 seconds. The question is how probable it is that we would obtain a sample mean difference of at least 6.88 seconds if \(H_o\) is true. We can check the area under the sampling distribution for values larger than 6.88.
Reject or retain \(H_o\), depending on the probability, under \(H_o\), of a sample statistic as extreme as the one we have obtained.
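The tail-area check in the comparison step can be sketched with pnorm( ). As a quick preview, the snippet below uses the sample standard deviations (14.6) and sample sizes (\(n=100\)) from the parking example later in this chapter to compute the probability, under \(H_o\), of a mean difference of at least 6.88 seconds:

```r
# Standard error of the mean difference: both samples have sd = 14.6 and n = 100
se<-sqrt(14.6^2/100+14.6^2/100)
# Probability under Ho of a mean difference of at least 6.88 seconds
p_tail<-1-pnorm(6.88,mean=0,sd=se)
p_tail
```

This probability is well below .001, which is why the observed difference counts as strong evidence against \(H_o\).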
The sampling distribution is the distribution over sample statistics, such as the mean, median, variance, range, correlation coefficient, or any other statistic you care to consider. Although we have already seen that the sampling distribution of the mean approximates the normal distribution as the sample size \(\rightarrow\infty\), the sampling distributions of different statistics can differ in shape. For these different shapes, different test statistics, such as z, t, F, and \(\chi^2\), are used.
Whenever we reach a decision with a statistical test, there is always a chance that our decision is the wrong one. A statistician does not only make a decision by some rational process, but also specifies the conditional probability of that decision being in error. That is, following the logic of hypothesis testing, a statistician can precisely state the probability of erroneously rejecting a true \(H_o\). Consider the parking example. The sampling distribution is the distribution of differences in sample means. This distribution is depicted in the figure below with mean \(=0\) and standard error \(=2.065\). The standard error of the distribution of mean differences is estimated by \(\sqrt{\frac{s^2_1}{n}+\frac{s^2_2}{n}}\), where \(s^2_1\) and \(s^2_2\) are the variances of the two samples. It does not matter for now if you do not quite get the equation for computing the standard error.
# Standard error of the mean difference: both samples have sd = 14.6 and n = 100
se<-sqrt(14.6^2/100+14.6^2/100)
dta<-data.frame(x=seq(-10,10,0.1),y=dnorm(seq(-10,10,0.1),0,se))
library(ggplot2)
fg0<-ggplot(data=dta,aes(x=x,y=y))
fg0+geom_line()+labs(x="Difference in Means")
How large must the difference in mean waiting times be for us to make a statistical decision? This question is equivalent to asking how strong the evidence should be for us to reject \(H_o\). In statistical testing, we normally set up a critical value on the sampling distribution, so that the largest probability allowed for finding evidence against \(H_o\) when it is true is 5%. In this case, the critical value can be computed by qnorm( ) with the area below it equalling 0.95 and with the mean and standard deviation of this normal distribution as 0 and the standard error. The critical value is 3.396 seconds in this case. Compared with the observed difference in mean waiting time (i.e., 6.88), the observed difference is clearly far larger. Therefore, we say that the observed score falls in the rejection region and \(H_o\) is rejected.
# Find the critical value
cx<-qnorm(1-0.05,0,se)
cx
## [1] 3.396214
fg0+geom_line()+
geom_ribbon(data=subset(dta,x>=cx),aes(ymin=0,ymax=y),
fill="tomato",alpha=0.5)
However, this does not mean that rejecting \(H_o\) is 100% correct. What we can say is only that we have rejected \(H_o\) with a 5% chance that \(H_o\) is actually true. That is, we have a 5% chance of making an error if we reject \(H_o\). This kind of error is called Type I error, or \(\alpha\). Why 5%? Why not 50%? This is a convention in statistics. Since \(H_o\) is the null hypothesis, it should be treated as true until the collected evidence is strong enough to reject it. Therefore, a strict criterion is chosen for evaluating the evidence, in order to protect \(H_o\). Below is a simulation experiment showing why the critical value is needed when making a statistical inference. First, we create a population of 10,000 entities sampled from a standard normal distribution. Second, we get the sampling distribution of the difference between two sample means which are in fact drawn from the same population; that is, there should be no difference between the two sample means, which is exactly what \(H_o\) states. The sampling distribution is created by running this experiment for 100 iterations, with the sample size \(n=100\) for S1 as well as S2, and computing the difference between the means of the two samples on every iteration. Third, the critical value is set up so that the probability beyond it is only 0.05. This can be done by using qnorm( ) with the area below it as 0.95, the mean as 0 as suggested by \(H_o\), and the standard deviation as the standard deviation of the sampling distribution of the mean differences. Fourth, we check how many of the mean differences are larger than the critical value. Only 5 of the 100 scores meet this criterion. Thus, we know that even if we simply draw the samples from the same population, we still have a 5% chance of observing a difference between the two sample means this large.
Compared with this criterion, if the samples in two experimental conditions really reflect two different populations, the difference between their means should be at least as large as the critical value.
set.seed(1234)
# Population of 10,000 entities from a standard normal distribution
P<-rnorm(10000,0,1)
# Sampling distribution of the mean difference: 100 iterations, n = 100 per sample
meanDiff<-sapply(1:100,function(i){
S1<-sample(P,100,replace=T)
S2<-sample(P,100,replace=T)
return(mean(S1)-mean(S2))
})
cv<-qnorm(0.95,0,sd(meanDiff))
sum(meanDiff>cv)
## [1] 5
sum(meanDiff>cv)/length(meanDiff)
## [1] 0.05
You might feel that a 5% chance of making an error is too great a risk to take, and suggest that we make our criterion much more stringent by rejecting, for example, only the most extreme 1% of the distribution. This procedure is of course legitimate. In fact, in statistics we have two conventions for Type I error: \(\alpha=.05\) and \(\alpha=.01\).
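For instance, reusing the standard error from the parking example, the critical value under the stricter \(\alpha=.01\) can be obtained with qnorm( ) in the same way as before:

```r
se<-sqrt(14.6^2/100+14.6^2/100)  # standard error from the parking example
cv01<-qnorm(1-0.01,0,se)         # critical value for alpha = .01
cv01
```

The \(\alpha=.01\) critical value (about 4.80 seconds) is larger than the \(\alpha=.05\) critical value of 3.396 seconds, so stronger evidence is required before \(H_o\) is rejected.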
Parallel to Type I error, the other kind of error in hypothesis testing is called Type II error, or \(\beta\). Type II error means that we erroneously retain \(H_o\) when \(H_1\) is in fact true. The difficulty in evaluating Type II error stems from the fact that if \(H_o\) is false, we almost never know what the true distribution (the distribution under \(H_1\)) would look like. Suppose the two conditions in the parking lot experiment actually reflect two different populations; that is, \(H_1\) is true. Suppose further that the true mean difference is 2. We can then plot the sampling distributions under \(H_1\) and \(H_o\), using the code below. The figure shows the two sampling distributions under \(H_1\) and \(H_o\) respectively. With the same critical value, the blue area in the distribution for \(H_o\) is the rejection region corresponding to Type I error, or \(\alpha\). When the observed score falls in this region, \(H_o\) is rejected with Type I error \(=.05\). The red area in the distribution for \(H_1\) is where we will retain \(H_o\) even though \(H_1\) is true; this area corresponds to the probability of making Type II error, or \(\beta\). Note that this area is far larger than 0.05.
dta1<-dta
dta1$x<-dta1$x+2
fg0+geom_line(data=dta,aes(x=x,y=y))+
geom_ribbon(data=subset(dta,x>cx),aes(ymin=0,ymax=y),
fill="deepskyblue2",alpha=0.5)+
geom_line(data=dta1,aes(x,y))+
geom_ribbon(data=subset(dta1,x<=cx),aes(ymin=0,ymax=y),
fill="tomato",alpha=0.5)+
geom_text(x=-5,y=0.15,label="Ho")+
geom_text(x=7.5,y=0.15,label="H1")+
geom_text(x=0.35,y=0.05,label="Beta",color="red")+
geom_text(x=6.25,y=0.025,label="Alpha",color="deepskyblue3")
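The red area can also be computed numerically. Assuming, as above, that the true mean difference under \(H_1\) is 2, \(\beta\) is the area below the critical value in the \(H_1\) distribution:

```r
se<-sqrt(14.6^2/100+14.6^2/100)  # standard error from the parking example
cx<-qnorm(0.95,0,se)             # critical value under Ho
beta<-pnorm(cx,mean=2,sd=se)     # Type II error if the true mean difference is 2
beta
1-beta                           # power of the test
```

Here \(\beta\approx.75\), far larger than .05 and consistent with the figure; the corresponding power \(1-\beta\) is accordingly low.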
When the critical value is moved to the right, the rejection region gets smaller. This means that fewer Type I errors are allowed, but the Type II error rate grows at the same time. In contrast, when the critical value is moved to the left, the rejection region gets larger; consequently, the Type II error rate shrinks while the Type I error rate grows. In fact, there are four possible outcomes of the decision-making process in hypothesis testing. They are shown in the table below.
Decision | \(H_o\) True | \(H_o\) False |
---|---|---|
Reject \(H_o\) | Type I error \(p=\alpha\) | Correct decision \(p=1-\beta=\) Power |
Don’t Reject \(H_o\) | Correct decision \(p=1-\alpha\) | Type II error \(p=\beta\) |
The previous parking lot experiment brings us to a consideration of one- and two-tailed tests. In that case, \(H_1: \Delta\mu>0\) and \(H_o: \Delta\mu\leq0\). The direction of the test is indicated by the direction of the inequality. Since \(H_1\) states that the difference between the mean waiting times should be larger than 0, we need to check the probability, under \(H_o\), of obtaining a mean difference larger than the observed score. Conversely, if we are interested in whether a medical treatment can help people lose weight, we instead check the lower tail: the probability, under \(H_o\), of a mean weight change smaller than the observed one. A test that uses only one tail of the distribution is called a one-tailed test.
By contrast, if \(H_1:\Delta\mu\neq0\) and \(H_o:\Delta\mu=0\), there is no direction for the test. In this circumstance, Type I error is still .05, but the rejection region is equally divided into an upper and a lower region. Suppose in the parking lot experiment that \(H_1:\Delta\mu\neq0\) and consequently \(H_o:\Delta\mu=0\). The rejection region under \(H_o\) is depicted as follows. There are now two critical values, and an observed score either larger than the upper critical value or smaller than the lower critical value is regarded as falling in the rejection region; that is, \(H_o\) is rejected. Note that in the two-tailed test, the upper and lower critical values respectively mark the boundaries of the largest 2.5% and the smallest 2.5% of the distribution under \(H_o\).
# Lower and upper critical values for the two-tailed test
cx1<-qnorm(0.025,0,se)
cx2<-qnorm(0.975,0,se)
fg0+geom_line(data=dta,aes(x,y))+
geom_ribbon(data=subset(dta,x<cx1),aes(ymin=0,ymax=y),
fill="tomato",alpha=0.5)+
geom_ribbon(data=subset(dta,x>cx2),aes(ymin=0,ymax=y),
fill="tomato",alpha=0.5)
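Equivalently, the two-tailed p value for the observed mean difference of 6.88 seconds simply doubles the one-tailed tail area:

```r
se<-sqrt(14.6^2/100+14.6^2/100)  # standard error from the parking example
p2<-2*(1-pnorm(6.88,0,se))       # two-tailed p value for the observed 6.88
p2
```

Even after doubling, the probability remains far below .05, so \(H_o\) is still rejected under the two-tailed test.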
Although the preceding example uses the sampling distribution to demonstrate how hypothesis testing is done, this procedure is not restricted to the sampling distribution. In Chapter 4.12, the norm for the performance of normal patients on a finger-tapping speed task has a mean of 47.8 and a standard deviation of 5.3. Since a norm is normally treated as the population distribution and finger-tapping performance presumably follows a normal distribution, the mean and standard deviation of the norm can be regarded as the parameters of the population distribution of finger-tapping performance. Suppose we have a patient whose tapping speed is 35. Is this patient far enough below normal that we would suspect some sort of neurological damage? We follow the steps of hypothesis testing. First, we establish a research hypothesis that this patient is impaired and should be drawn from a population with a mean below 47.8, \(H_1:\mu<47.8\). Accordingly, the null hypothesis is that this patient is from the normal population, \(H_o:\mu\geq47.8\). With the normal population as the distribution for the statistical test, we can easily compute the z score for the observed tapping speed as \(z=\frac{35-47.8}{5.3}=-2.42\). We can do a one-tailed test in two ways. First, we can compute the critical value for the lowest 5% of the z distribution. This critical value is -1.64, which is larger than -2.42; that is, the observed score falls in the rejection region. We should reject \(H_o\) and conclude that this patient is from the population of impaired patients. Second, we can directly compute the probability below the observed score in the z distribution, which is 0.0078, or \(p=.0078\). This probability is smaller even than the stricter Type I error \(\alpha=.01\). Thus, we reject \(H_o\) and conclude that this patient is from the population of impaired patients.
qnorm(0.05,0,1)
## [1] -1.644854
pnorm(-2.42,0,1)
## [1] 0.007760254
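Equivalently, we can skip standardizing and ask pnorm( ) for the area below the raw score in the population distribution directly; any tiny discrepancy from the value above comes only from rounding z to -2.42:

```r
pnorm(35,47.8,5.3)        # area below the raw score in the population distribution
pnorm((35-47.8)/5.3,0,1)  # identical: area below the exact z score
```

Both calls give the same probability, since standardizing only relabels the axis of the normal distribution.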
Similarly, suppose we know that the mean weight of 36-month-old infants is 13.8 kg and the standard deviation is 2.07 kg. Now a 36-month-old infant weighs 12 kg. A doctor claims that he is different from normal infants at this age. Is the doctor right? The research hypothesis is \(H_1:\mu\neq13.8\) and the null hypothesis is \(H_o:\mu=13.8\), where \(\mu\) is the mean weight of the population from which this infant was drawn. As we know infant weights are normally distributed, the distribution under \(H_o\) is the norm of the weights. We can now compute the z score for this infant's weight: \(z=\frac{12-13.8}{2.07}=-0.87\). For a two-tailed test, the lower critical value on the z distribution is -1.96. Obviously, \(-0.87>-1.96\), so the observed score does not fall in the lower rejection region. Of course, \(-0.87<1.96\), so it does not fall in the upper rejection region either. Therefore, we cannot reject \(H_o\).
z<-(12-13.8)/2.07
z
## [1] -0.8695652
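To finish this two-tailed test with a p value instead of critical values, we can double the tail area beyond the observed z:

```r
z<-(12-13.8)/2.07     # z score for the infant's weight
p2<-2*pnorm(-abs(z))  # two-tailed p value
p2
```

The p value is about .38, far larger than .05, so again we cannot reject \(H_o\): there is no evidence that this infant differs from normal infants of the same age.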