In Fisher's null hypothesis significance testing (NHST), Type I errors \(\alpha\) have traditionally received far more emphasis than Type II errors \(\beta\). Recently, however, more investigators have argued that \(\beta\) is equally important. The issue of \(\beta\) is normally framed in terms of statistical power \(=1-\beta\), which is the probability of rejecting \(H_0\) when \(H_1\) is actually true. We can run a simulated experiment to estimate the power of an experiment. Suppose the mean and standard deviation of the experimental group (n = 20) are 9.64 and 3.17, whereas the mean and standard deviation of the control group (n = 20) are 6.58 and 3.03. The power of this experiment can be estimated by the following procedure. First, we create one population for the experimental group and one for the control group. Second, we randomly draw 20 data points from each population. Third, we run an independent-groups \(t\) test and record the \(t\) value. Fourth, we repeat the second and third steps until we have collected 10,000 \(t\) values. Fifth, we compute the probability of rejecting \(H_0\) in favor of \(H_1:\Delta\mu\neq0\) based on these 10,000 results. The positive critical value in a two-tailed \(t\) test with \(df=38=2\times(20-1)\) is 2.024.
set.seed(202031)
Exp<-rnorm(10000,9.64,3.17)   # population for the experimental group
Ctrl<-rnorm(10000,6.58,3.03)  # population for the control group
Results<-sapply(1:10000,function(i){
  # draw 20 observations per group and record the t statistic
  Exp.dta<-sample(Exp,20,replace=T)
  Ctrl.dta<-sample(Ctrl,20,replace=T)
  return(t.test(Exp.dta,Ctrl.dta,var.equal=T)$statistic)
})
cv<-qt(1-0.025,df=38)
cv
## [1] 2.024394
# proportion of t values beyond the positive critical value; with this effect
# size, t values below the negative critical value are vanishingly rare
power<-sum(Results>cv)/length(Results)
power
## [1] 0.874
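As a cross-check on the simulation, the analytic power can be computed with base R's power.t.test(). This is a minimal sketch; it assumes a common \(\sigma\) of about 3.1 (roughly the pooled value of the two sample standard deviations), so its answer should land close to the simulated value of about .87.
# Analytic cross-check (assumed common sd ~3.1, pooled from 3.17 and 3.03)
power.t.test(n=20,delta=9.64-6.58,sd=3.1,sig.level=0.05,type="two.sample")$power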
The power is about .87. That is, with the two population parameters set to the two sample statistics, there is an 87% chance of rejecting \(H_0\) in favor of \(H_1:\Delta\mu\neq0\). The R code below produces the following figure, which shows the histogram of the \(t\) values collected in the simulated experiment, the critical value, and the power.
library(ggplot2)
# central t density shifted to the simulated mean: an approximation to the
# sampling distribution of the collected t values
x<-seq(min(Results),max(Results),0.1)
ymax<-dt(x-mean(Results),df=38)
ymin<-rep(0,length(x))
qt.dta<-data.frame(x=x,ymax=ymax,ymin=ymin)
ggplot()+
  # histogram of the 10,000 simulated t values
  geom_histogram(data=data.frame(x=Results),aes(x=x,y=..density..),
                 bins=30,color="gray75",fill="white")+
  # shade the region beyond the critical value (i.e., the power)
  geom_ribbon(data=subset(qt.dta,x>cv),aes(x=x,ymin=ymin,ymax=ymax),
              alpha=0.2,color=NA,fill="red")+
  # mark the critical value and overlay the shifted t density
  geom_segment(aes(x=cv,y=dt(cv-mean(Results),df=38),xend=cv,yend=0),color="red")+
  geom_line(data=qt.dta,aes(x=x,y=ymax),color="black",size=0.3)+
  # annotations for the power and the critical value
  annotate("text",label="Area=.87",x=6,y=0.3,size=4)+
  geom_segment(aes(x=6,y=0.28,xend=mean(Results),yend=0.15),size=0.3)+
  annotate("text",label="t=2.024",x=0,y=0.1,size=4)+
  geom_segment(aes(x=0,y=0.08,xend=1.98,yend=0.01),
               arrow=arrow(length=unit(0.1,"cm")),size=0.3,color="red")
The power of a test is influenced by three factors: the Type I error rate \(\alpha\), the effect size, and the sample size \(n\). The figure below shows the distributions under \(H_0\) and \(H_1\) in green and red respectively. The Type I error is the area of the small green triangle, and the Type II error is the red-shaded area under the \(H_1\) distribution.
Now suppose we use a larger Type I error rate, say 0.1. The green triangle becomes larger and the red-shaded area becomes smaller; consequently, power becomes larger. See the figure below.
Now suppose the mean of the \(H_1\) distribution is 3 units larger than that of the \(H_0\) distribution. The relative positions of the two distributions then change as shown in the following figure. Note that the Type I error rate is still .05, so the size of the green triangle is unchanged. However, the red-shaded area gets smaller, meaning a smaller Type II error. This results from the larger difference between the two means. Since power \(=1-\beta\), as \(\beta\) gets smaller, power gets larger.
When the sample size \(n\) gets larger, power also gets larger. This is because the standard error (\(\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}\)) shrinks as the sample size grows. Thus, even if the difference between means and the Type I error rate are kept unchanged, power increases as the sample size increases.
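We can see all three influences numerically. Here is a minimal sketch using base R's power.t.test() with illustrative numbers (a mean difference of 3 and \(\sigma=3.1\), echoing the earlier simulation); only one factor changes per line.
# baseline: n = 20 per group, difference = 3, sd = 3.1, alpha = .05
power.t.test(n=20,delta=3,sd=3.1,sig.level=0.05)$power
# larger alpha (.10) -> larger power
power.t.test(n=20,delta=3,sd=3.1,sig.level=0.10)$power
# larger effect (difference = 4) -> larger power
power.t.test(n=20,delta=4,sd=3.1,sig.level=0.05)$power
# larger sample (n = 40 per group) -> larger power
power.t.test(n=40,delta=3,sd=3.1,sig.level=0.05)$power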
Since power is influenced by effect size, sample size, and the Type I error rate, we need to know these three quantities before we can calculate power. First, we need to estimate the effect size. There are three traditional ways to estimate the effect size \(d\).
When the expected effect size has been determined, we can start calculating the power. First we combine the effect size and the sample size as \(\delta=d[f(n)]\), where \(d\) is the effect size and \(f(n)\) is a function of sample size that differs across statistical tests. For instance, \(f(n)\) for the one-sample \(t\) test is \(\sqrt{n}\). Suppose a researcher would like to know whether people seeking treatment for psychological problems have higher IQs. She plans to randomly select 25 clients and wants to know the power of detecting a difference of 5 points between the mean of the IQ norm and the mean of the population from which those clients were drawn. Suppose the mean and standard deviation of the IQ norm are 100 and 15. The following code computes \(\delta\) as 1.67. Referring to the power table in Appendix Power, you can find that the power is between .38 and .40 for a two-tailed test.
d<-5/15
delta<-d*sqrt(25)
delta
## [1] 1.666667
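As a check on the table value, base R's power.t.test() computes the power exactly (via the noncentral \(t\) distribution), so its answer should come out slightly below the table's normal-approximation range:
# exact power: one-sample t test, n = 25, 5-point difference, sd = 15
power.t.test(n=25,delta=5,sd=15,sig.level=0.05,type="one.sample")$power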
Of course, the power will increase as the sample size increases. Suppose the sample size is 64. The estimated power now falls between .74 and .77.
delta2<-d*sqrt(64)
delta2
## [1] 2.666667
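The same exact check can be run at n = 64, and should land near the table's range:
# exact power at n = 64 for the same effect size
power.t.test(n=64,delta=5,sd=15,sig.level=0.05,type="one.sample")$power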
Normally, we calculate the power of a test in order to estimate the minimum number of participants we need to recruit. In practice, we hope the power is at least 0.8. For the IQ case, how many participants do we need at a minimum? From the power table, for power = 0.8 and \(\alpha=0.05\), \(\delta=2.8\). Then \(n=(\delta/d)^2=70.56\). That is, we need at least 71 participants.
(2.8/d)^2
## [1] 70.56
Alternatively, you can use the functions in the {pwr} package to calculate power. We can estimate the sample size needed to reach power = 0.8 with an effect size of 5/15. As shown below, the suggested minimum sample size is 73 (rounding 72.58 up). This is of course more accurate than our rough estimate based on the power table.
library(pwr)
pwr.t.test(d=d,power=0.8,type="one.sample")
##
## One-sample t test power calculation
##
## n = 72.5839
## d = 0.3333333
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
For the independent-groups \(t\) test, \(\delta=d\sqrt{\frac{n}{2}}\) when the sample sizes of the two groups are equal. Suppose we expect a difference of 5 points between the means of the experimental and control groups, and assume from past data that \(\sigma\) is approximately 10. Then \(d=5/10=0.5\). What will the power be with 25 observations in each of the two groups? From Appendix Power, for \(\delta=1.77\), the power is about 0.43.
d<-5/10
delta<-d*sqrt(25/2)
delta
## [1] 1.767767
Of course, we can use R to do this calculation. The estimated power is 0.41. We can also estimate the minimum sample size needed for power = 0.8; the estimate is about 64 per group.
pwr.t.test(n=25,d=d,type="two.sample")
##
## Two-sample t test power calculation
##
## n = 25
## d = 0.5
## sig.level = 0.05
## power = 0.4101003
## alternative = two.sided
##
## NOTE: n is number in *each* group
pwr.t.test(d=d,power=0.8,type="two.sample")
##
## Two-sample t test power calculation
##
## n = 63.76561
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
For calculating the power of the \(t\) test for two unequal-sized groups, \(\delta=d\sqrt{\frac{\bar{n}_h}{2}}\), where \(\bar{n}_h=\frac{2}{\frac{1}{n_1}+\frac{1}{n_2}}=\frac{2n_1n_2}{n_1+n_2}\) is the harmonic mean of the two sample sizes. We can calculate the power of the \(t\) test used to compare the mean mathematics scores of two groups of Asian students tested with or without stereotype threat. The \(\delta\) is computed as 2.62. From Appendix Power, the power is about 0.75 under \(\alpha=.05\) (two-tailed).
d<-(9.58-6.55)/3.1
nbar<-2*18*12/(18+12)
delta<-d*sqrt(nbar/2)
delta
## [1] 2.622691
Of course, we can let R do this calculation for us. The power is then computed as 0.72.
pwr.t2n.test(n1=18,n2=12,d=d)
##
## t test power calculation
##
## n1 = 18
## n2 = 12
## d = 0.9774194
## sig.level = 0.05
## power = 0.716317
## alternative = two.sided
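If one group's size is fixed, pwr.t2n.test() can also solve for the other group's size needed to reach a target power. A sketch, assuming \(n_1\) stays at 18 and we aim for power = 0.8:
# with n1 fixed at 18, solve for the n2 that yields power = .8
pwr.t2n.test(n1=18,d=d,power=0.8)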
Things become more complicated when calculating \(\delta\) for a matched-sample \(t\) test. In this case, \(d=\frac{\mu_1-\mu_2}{\sigma_{x_1-x_2}}\). It is not difficult to estimate \(\sigma_{x_1}\) and \(\sigma_{x_2}\); however, it is a bit more complicated to estimate \(\sigma_{x_1-x_2}\). The variance sum law helps us solve this problem: \(var(a\pm b)=var(a)+var(b)\pm 2\rho\sigma_a\sigma_b\), or equivalently \(\sigma^2_{a\pm b}=\sigma^2_{a}+\sigma^2_{b}\pm 2\rho\sigma_a\sigma_b\). In this case, \(\sigma^2_{x_1-x_2}=\sigma^2_{x_1}+\sigma^2_{x_2}-2\rho\sigma_{x_1}\sigma_{x_2}\). Under the assumption that \(\sigma_{x_1}=\sigma_{x_2}=\sigma\), \(\sigma_{x_1-x_2}=\sqrt{2\sigma^2-2\rho\sigma^2}=\sigma\sqrt{2(1-\rho)}\).
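A quick simulation can verify this identity. The sketch below assumes \(\sigma=3.1\) and \(\rho=.92\), the values used in the example that follows, and draws correlated pairs with MASS::mvrnorm():
library(MASS)
set.seed(1)
# bivariate normal with common sd 3.1 and correlation .92
Sigma<-matrix(c(3.1^2,0.92*3.1^2,0.92*3.1^2,3.1^2),2,2)
xy<-mvrnorm(100000,mu=c(0,0),Sigma=Sigma)
sd(xy[,1]-xy[,2])        # empirical sd of the difference
3.1*sqrt(2*(1-0.92))     # theoretical value from the variance sum law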
As long as we know \(\rho\), we can compute \(d\) and \(\delta=d\sqrt{n}\). Suppose the stereotype experiment is a test-retest experiment (n = 30). The students got a mean of 9.58 on the first test, which dropped to 7.55 on the second test under stereotype threat. Suppose the correlation between the two sets of scores is 0.92 and the standard deviation is 3.1. Then \(\delta=8.97\). From the power table, the power is 0.99.
sigma<-3.1*sqrt(2*(1-0.92))
d<-(9.58-7.55)/sigma
delta<-d*sqrt(30)
delta
## [1] 8.966748
We can also use R to calculate the power of a matched-sample \(t\) test. The computed power is essentially 1.
pwr.t.test(n=30,d=d,type="paired")
##
## Paired t test power calculation
##
## n = 30
## d = 1.637097
## sig.level = 0.05
## power = 1
## alternative = two.sided
##
## NOTE: n is number of *pairs*
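Such extreme power comes largely from the high test-retest correlation. As a final sketch using the same illustrative numbers, we can watch how power declines as \(\rho\) weakens (though it remains high here, given the large raw difference):
# power of the paired design at weaker test-retest correlations
sapply(c(0.5,0.7,0.9),function(rho){
  d.rho<-(9.58-7.55)/(3.1*sqrt(2*(1-rho)))
  pwr.t.test(n=30,d=d.rho,type="paired")$power
})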