Suppose we wish to compare driving proficiency on three different sizes of cars to test the experimental hypothesis that small cars are easier to handle. We have three groups of drivers available, but we are not able to match individual subjects on driving experience, nor (as will be discussed in more detail later) can we assume that the mean level of driving experience is equal across groups. Suppose further that the number of steering errors is used as our dependent variable. One straightforward way to answer the question is to conduct an ANOVA and examine whether the groups differ significantly in steering errors. However, driving experience also influences the number of steering errors, so its effect might be confounded with the effect of car size. What can we do to eliminate this possibility?
Analysis of Covariance (ANCOVA) is a solution. ANCOVA rests on two assumptions. First, whatever the relationship between the dependent variable (Y) and the covariate (C), it is linear. Second, we assume homogeneity of regression: the regression coefficients relating Y to C are equal across the treatments (i.e., car sizes). This is equivalent to assuming that there is no interaction between the covariate (C) and the independent variable (X).
When viewed within the framework of multiple regression, ANCOVA is basically no different from ANOVA, except that we wish to partial out the effect of the covariate. A covariate is nothing but an independent variable that assumes priority among the set of independent variables as the basis for accounting for Y variance (Cohen, 1968). If we ask about the variation in Y after the covariate (C) has been partialled out, and if the variation in Y can be associated with only C, the treatment effect (\(\alpha\)), and error, then \(SS_{c,\alpha}\) represents the total amount of accountable variation. If we now compare \(SS_{c,\alpha}\) with \(SS_c\), the difference is the variation attributable to treatment effects over and above that attributable to the covariate.
Conti and Musty (1984) analyzed post-injection activity as a percentage of pre-injection activity, because that is the way such data are routinely analyzed in their field. ANCOVA, treating pre-injection activity as the covariate, is an alternative. Their data are used to create a data frame in R with the code below.
# Y: post-injection activity, C: pre-injection activity,
# X: condition (control plus four drug doses), coded as a factor
dta<-data.frame(Y=c(1.3,0.94,2.25,1.05,0.92,1.9,0.32,0.64,0.69,0.93,
0.93,4.44,4.03,1.92,0.67,1.7,0.77,3.53,3.65,4.22,
5.1,4.16,1.54,6.36,3.96,4.51,3.76,1.92,3.84,
2.29,4.75,3.48,2.76,1.67,1.51,1.07,2.35,
1.44,1.11,2.17,2,0.84,0.99,0.44,0.84,2.84,2.93),
C=c(4.34,3.5,4.33,2.76,4.62,5.4,3.95,1.55,1.42,1.9,
1.55,10.56,8.39,3.7,2.4,1.83,2.4,7.67,5.79,9.58,
7.18,8.33,4.05,10.78,6.09,7.78,5.08,2.86,6.3,
6.94,6.1,4.9,3.69,4.76,4.3,2.32,7.35,
4,4.1,3.62,3.92,2.9,2.9,1.82,4.94,5.69,5.54),
X=as.factor(c(rep(0,10),rep(1,10),rep(2,9),rep(3,8),rep(4,10))))
head(dta)
## Y C X
## 1 1.30 4.34 0
## 2 0.94 3.50 0
## 3 2.25 4.33 0
## 4 1.05 2.76 0
## 5 0.92 4.62 0
## 6 1.90 5.40 0
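Before fitting any model, we can eyeball the two assumptions stated earlier. The sketch below (not part of the original analysis) plots Y against C separately for each condition and fits one regression line per group; roughly linear and roughly parallel patterns are what ANCOVA requires.
# Visual check of linearity and homogeneity of regression:
# one fitted line per group; the lines should be roughly parallel.
library(ggplot2)
ggplot(dta,aes(C,Y,color=X))+
geom_point()+
geom_smooth(method="lm",se=FALSE)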
The full model is \(Y_{ij}=\tau_j+c+c\tau_j+e_{ij}\), where \(\tau_j\) is the effect of the \(j\)th level of the experimental treatment and \(c\tau_j\) is the term testing homogeneity of regression. First, we test the homogeneity of regression assumption: \(c\tau_j\) should not differ significantly from 0. The easiest way to test this is simply to conduct an ANOVA on the full model. If the interaction between \(c\) and \(\tau\) is not significant, we can run the ANCOVA on the reduced model \(Y_{ij}=\tau_j+c+e_{ij}\).
summary(aov(Y~C*X,dta))
## Df Sum Sq Mean Sq F value Pr(>F)
## C 1 73.42 73.42 150.012 1.39e-14 ***
## X 4 9.22 2.31 4.712 0.00357 **
## C:X 4 2.02 0.50 1.030 0.40462
## Residuals 37 18.11 0.49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result shows no interaction between \(c\) and \(\tau\), so the assumption of homogeneity of regression is supported. Now the model \(Y_{ij}=\tau_j+c+e_{ij}\) is set as the full model, in which \(SS_{total}=SS_c+SS_\tau+SS_{e}\). The effect of treatment, with the covariate partialled out, is then tested as \(F=\frac{MS_\tau}{MS_e}\).
summary(aov(Y~C+X,dta))
## Df Sum Sq Mean Sq F value Pr(>F)
## C 1 73.42 73.42 149.572 2.91e-15 ***
## X 4 9.22 2.31 4.698 0.00326 **
## Residuals 41 20.13 0.49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
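As a check on the partialling logic described earlier, the treatment sum of squares over and above the covariate is simply the difference between the regression (model) sums of squares of \(Y_{ij}=\tau_j+c+e_{ij}\) and \(Y_{ij}=c+e_{ij}\). A minimal sketch:
# Model SS for C alone and for C plus the treatment;
# the difference reproduces the X row of the table above (about 9.22).
ss_c<-sum((fitted(lm(Y~C,dta))-mean(dta$Y))^2)
ss_ct<-sum((fitted(lm(Y~C+X,dta))-mean(dta$Y))^2)
ss_ct-ss_c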
We can redo the ANCOVA with multiple regression. First, we add dummy variables to the data set, since X is a nominal variable (when X is entered as a factor, R constructs this dummy coding for us). Then we compute the sums of squares for the full model \(Y_{ij}=\tau_j+c+c\tau_j+e_{ij}\). Unsurprisingly, we get the same result as the ANOVA above, and again the \(c\tau\) interaction is not significant.
dta$X1<-ifelse(dta$X==0,1,0)
dta$X2<-ifelse(dta$X==1,1,0)
dta$X3<-ifelse(dta$X==2,1,0)
dta$X4<-ifelse(dta$X==3,1,0)
anova(lm(Y~C*X,dta))
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## C 1 73.420 73.420 150.0124 1.387e-14 ***
## X 4 9.224 2.306 4.7116 0.003567 **
## C:X 4 2.017 0.504 1.0302 0.404624
## Residuals 37 18.109 0.489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
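Because the dummies X1 through X4 carry exactly the same information as the factor X, the two parameterizations of the full model fit identically. A quick sanity check (m_fact and m_dummy are names introduced here for illustration):
# Both parameterizations span the same design space,
# so their residual sums of squares must agree.
m_fact<-lm(Y~C*X,dta)
m_dummy<-lm(Y~C*(X1+X2+X3+X4),dta)
all.equal(deviance(m_fact),deviance(m_dummy))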
Then we take the model \(Y_{ij}=\tau_j+c+e_{ij}\) as our basis. We can test the effect of \(c\) with the model that omits the treatment predictors, \(Y_{ij}=c+e_{ij}\). Similarly, we can test the effect of \(\tau\) with the model that omits \(c\), \(Y_{ij}=\tau_j+e_{ij}\).
anova(lm(Y~C,dta))
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## C 1 73.420 73.420 112.57 7.877e-14 ***
## Residuals 45 29.349 0.652
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(Y~X,dta))
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X 4 44.303 11.0757 7.9564 7.214e-05 ***
## Residuals 42 58.466 1.3921
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To test the treatment effect with the covariate partialled out, we check whether the difference in accountable Y variation between the full model \(Y_{ij}=c+\tau_j+e_{ij}\) and the reduced model \(Y_{ij}=c+e_{ij}\) is significant; this amounts to testing the reduction in fit when the treatment is dropped. We can do this with the function anova( ). The F value is significant, suggesting a significant treatment effect. Similarly, we can test the covariate effect by comparing the full model with the reduced model \(Y_{ij}=\tau_j+e_{ij}\); this comparison also supports a significant covariate effect.
m0<-lm(Y~C+X,dta)
m1<-lm(Y~C,dta)
m2<-lm(Y~X,dta)
anova(m1,m0)
## Analysis of Variance Table
##
## Model 1: Y ~ C
## Model 2: Y ~ C + X
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 29.349
## 2 41 20.125 4 9.2239 4.6978 0.003257 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
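The covariate test works the same way: comparing m2 with the full model m0 uses the residual sums of squares already shown above, giving F = (58.466 - 20.125)/(20.125/41), roughly 78 on 1 and 41 df, which is clearly significant.
# Covariate effect: full model vs. the model with C omitted.
anova(m2,m0)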
The group means with the covariate partialled out (the adjusted means) can be computed from the coefficients of the full model, evaluating each group at the grand mean of the covariate.
pre<-mean(dta$C)                     # grand mean of the covariate
o.means<-with(dta,tapply(Y,X,mean))  # original (unadjusted) group means
# Each column of index is a design vector (intercept, C, four dummies)
# for one group, with the covariate fixed at its grand mean.
index<-matrix(c(1,pre,0,0,0,0,
                1,pre,1,0,0,0,
                1,pre,0,1,0,0,
                1,pre,0,0,1,0,
                1,pre,0,0,0,1),6,5)
cofs<-matrix(m0$coefficients,1,6)    # coefficients of the full model m0
a.means<-cofs%*%index                # adjusted group means
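An equivalent way to obtain the adjusted means, sketched here as an alternative rather than as part of the original analysis, is to let predict( ) evaluate m0 at the grand mean of C for each group:
# Predict each group's mean response at the grand mean of the covariate;
# the result matches a.means.
nd<-data.frame(C=mean(dta$C),X=factor(0:4))
predict(m0,newdata=nd)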
library(ggplot2)
dd<-data.frame(y=c(as.numeric(a.means),o.means),cond=rep(as.factor(1:5),2),
group=rep(c("Adjusted","Original"),each=5))
ggplot(dd,aes(cond,y))+
geom_col()+xlab("Conditions")+ylab("Activity")+
scale_x_discrete(labels=c("1"="Control","2"="0.1 ug","3"="0.5 ug","4"="1 ug","5"="2 ug"))+
facet_wrap(~group)
The effect size for the difference between the adjusted means \(m'_1\) and \(m'_3\) is computed as \(d=\frac{m'_3-m'_1}{\hat{\sigma}}=1.23\), where \(\hat{\sigma}\) is the square root of the error mean square from the model without the covariate (m2), so that the standardizer reflects the variability of the original scores.
(a.means[3]-a.means[1])/sqrt(anova(m2)[2,3])
## [1] 1.234504