Analysis of Variance I

The ANALYSIS OF VARIANCE (ANOVA) has long enjoyed the status of being one of the most widely used statistical techniques in psychological research. The popularity and usefulness of this technique can be attributed to two sources. First, like the \(t\) test, the ANOVA deals with differences between or among sample means; unlike the \(t\) test, the ANOVA can handle differences among more than two group means. Second, the ANOVA allows us to deal with two or more independent variables simultaneously.

The Underlying Model

Suppose we collected data from different subjects in five experimental conditions, each of which corresponds to one level of the independent variable. See the table below.

C1            C2            C3            C4            C5            Total
\(X_{1,1}\)   \(X_{1,2}\)   \(X_{1,3}\)   \(X_{1,4}\)   \(X_{1,5}\)
\(\vdots\)    \(\vdots\)    \(\vdots\)    \(\vdots\)    \(\vdots\)
\(X_{n,1}\)   \(X_{n,2}\)   \(X_{n,3}\)   \(X_{n,4}\)   \(X_{n,5}\)
\(\mu_{.,1}\) \(\mu_{.,2}\) \(\mu_{.,3}\) \(\mu_{.,4}\) \(\mu_{.,5}\) \(\mu_{.,.}\) or \(\mu\)

Suppose you had to guess a subject’s score. One obvious guess would be the grand mean in the population \(\mu\).

\(X_{i,j}=\mu\)

But now suppose I told you that the subject is in a certain group \(j\). We will represent the effect of being in Group \(j\) as \(\tau_{j}\), which is simply the difference between the mean of group \(j\) and the grand mean \(\mu\). Thus,

\(X_{i,j}=\mu+(\mu_{j}-\mu)=\mu+\tau_{j}\).

This is a better guess, because now you predict a subject’s score from the grand mean plus the effect of the \(j^{th}\) condition of the experimental treatment. But this is still not the final answer. If you are one of the members of Group \(j\), you probably differ from the group mean in some way. That difference is the error \(\epsilon_{i,j}\). Thus, the final model is

\(X_{i,j}=\mu+(\mu_{j}-\mu)+(X_{i,j}-\mu_{j})=\mu+\tau_{j}+\epsilon_{i,j}\).
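
To make the model concrete, here is a minimal R sketch that simulates scores from it; the values of \(\mu\), \(\tau_j\), \(\sigma_\epsilon\), and \(n\) are hypothetical numbers chosen purely for illustration.

# Simulate X_ij = mu + tau_j + e_ij with hypothetical parameter values
set.seed(1)
mu<-10                      # grand mean
tau<-c(-2,-1,0,1,2)         # treatment effects (sum to zero)
sigma<-3                    # sd of the random error
n<-12                       # subjects per condition
cond<-gl(5,n,labels=1:5)
score<-mu+tau[cond]+rnorm(5*n,0,sigma)
round(tapply(score,cond,mean),2)  # group means should be near mu + tau_j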

Assumptions

Homogeneity of Variance: It is assumed that each of our populations has the same variance. In other words,

\(\sigma_{1}^2=\sigma_{2}^2=\sigma_{3}^2=\sigma_{4}^2=\sigma_{5}^2=\sigma_{\epsilon}^2\).

In condition \(j\), score \(i\) can be decomposed as \(X_{i,j}=\mu+\tau_j+\epsilon_{i,j}\). Since the grand mean \(\mu\) and the treatment effect \(\tau_j\) are fixed, the distribution of scores \(X_{i,j}\) has the same variance as the distribution of \(\epsilon_{i,j}\). As it is assumed that \(\epsilon\sim ND(0,\sigma_\epsilon)\) regardless of condition, the variance of the distribution of the scores in every experimental condition will be the same, \(\sigma^2_\epsilon\).

Normality: Following the same assumption for the random error, \(\epsilon\sim ND(0,\sigma_\epsilon)\), the distribution of the scores in each condition is assumed to be normal.

Independence: The third assumption is that the observations are independent of one another. In other words, the error components \(\epsilon_{i,j}\) are independent.
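
These assumptions can be checked informally in R. As a sketch, using the dta data frame created in the numeric example below (with columns score and cond), we can compare the group variances directly, run Bartlett’s test of homogeneity of variance, and apply the Shapiro-Wilk test to the residuals to check normality.

with(dta,tapply(score,cond,var))               # compare the five group variances
bartlett.test(score~cond,dta)                  # test of homogeneity of variance
shapiro.test(residuals(aov(score~cond,dta)))   # normality of the error component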

Hypothesis Testing

When conducting the ANOVA, the main concern is whether or not a treatment effect exists. If there is no treatment effect, the group means should be quite close to each other. Conversely, if the treatment effect is real, then at least one group mean differs from another group mean. Thus, the null and alternative hypotheses can be stated as follows.

\(H_o: \mu_{1}=\mu_{2}=\mu_{3}=\mu_{4}=\mu_{5}\) or \(\mu_{j}=\mu_{k}\), \(\forall j\neq k\)

\(H_1: \mu_{j}\neq \mu_{k}\) for at least one pair \(j\neq k\)

The Logic of ANOVA

The logic of ANOVA is very simple. Everything hinges on estimating the variance of the population distribution, and there are two routes to estimate it.

First, we can estimate it using the variance of the sample in each experimental condition. The scores in condition \(1\) are sampled from the population corresponding to condition \(1\). As the sample variance is an unbiased estimate of the population variance, \(E(s^2_1)=\sigma^2_1\). The same holds for the other conditions \(2,\dots,k\): \(E(s^2_2)=\sigma^2_2\), \(E(s^2_3)=\sigma^2_3\), \(\dots\), \(E(s^2_k)=\sigma^2_k\). According to the homogeneity of variance assumption, \(\sigma^2_1=\sigma^2_2=\dots=\sigma^2_k=\sigma^2_\epsilon\). Then, \(s^2_1,\dots,s^2_k\) can all serve as estimates of the population variance \(\sigma^2_\epsilon\). Which sample variance is the best estimate? The mean of these sample variances is a better estimate than any one of them alone. Thus,

\(\sigma_\epsilon^2 \doteq \sum^K_{j=1} s^2_j /K\), where \(K\) is the number of treatments. The operator \(\doteq\) means “is estimated by”.

Second, the other route is a bit trickier. Recall the CLT for the sampling distribution of means: when the sample size \(n\rightarrow\infty\), the sampling distribution of means will be normal with mean \(\mu\) and variance \(\sigma^2/n\). The scores in each condition are sampled from a population with mean \(\mu_j\) and variance \(\sigma^2_\epsilon\). If \(H_o\) is true, namely that no two means differ (\(\mu_1=\mu_2=\dots=\mu_k=\mu\)), the scores in different conditions are sampled from the same population, which has mean \(\mu\) and variance \(\sigma^2_\epsilon\). Therefore, according to the CLT, the variance of the sampling distribution of the condition means \(s^2_\overline{x}\), multiplied by the sample size \(n\), is an unbiased estimate of the population variance: \(nE(s^2_\overline{x})=\sigma^2_\epsilon\), or \(\sigma^2_\epsilon \doteq ns^2_\overline{x}\).
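
The two routes can be illustrated with a quick simulation. In the sketch below, five groups of scores are all drawn from one common normal population (i.e., \(H_o\) is true; the values of \(\mu\), \(\sigma\), \(k\), and \(n\) are arbitrary), so the mean of the group variances and \(n\) times the variance of the group means should both be close to \(\sigma^2\).

set.seed(2)
k<-5; n<-12; mu<-10; sigma<-3    # hypothetical values
g<-gl(k,n)
x<-rnorm(k*n,mu,sigma)           # H_o true: one common population
mean(tapply(x,g,var))            # route 1: mean of the group variances
n*var(tapply(x,g,mean))          # route 2: n times the variance of the group means
sigma^2                          # the quantity both routes estimate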

Now we have two estimates of the population variance. When these two estimates are about equal, there is no evidence against \(H_o\); when the treatment-based estimate is significantly larger than the error-based estimate, \(H_o\) is unlikely to be true.

Let’s translate the above logic into equations. First, the variance of the scores in condition \(j\) is \(s^2_j=\frac{\sum_i(x_{ij}-\overline{x}_j)^2}{n-1}\). Then, \(\frac{\sum^K_{j=1} s^2_j}{K}=\frac{\sum_j\sum_i(x_{ij}-\overline{x}_j)^2}{K(n-1)}\). By convention, this term is called the mean square of error, \(MS_{error}\) or \(MS_{within}\), although it is not literally the mean of the squared errors within all conditions.

Second, the term \(ns^2_\overline{x}\) is referred to as \(MS_{treat}\), as it represents the variability of scores attributable to the experimental treatment. From the preceding discussion, the expected value of \(MS_{treat}\) equals the expected value of \(MS_{error}\) only if \(H_o\) is true, namely when there is no treatment effect; in general, \(E(MS_{treat})\geq E(MS_{error})\). Since these two expected values are the same when there is no treatment effect, \(E(MS_{treat})\) can be decomposed into two parts: \(E(MS_{error})\) and the treatment effect. The formal equations representing the relationships between these terms are as follows.

\(E(MS_{error})=\sigma^2_\epsilon\) and

\(E(MS_{treat})=\sigma^2_\epsilon+n\frac{\sum(\tau_j)^2}{K-1}\).

The term \(n\frac{\sum(\tau_j)^2}{K-1}\) is the treatment effect, which is \(n\frac{\sum(\mu_j-\mu)^2}{K-1}\). Note that the expected value of \(ns^2_\overline{x}\) is not \(n\frac{\sum(\tau_j)^2}{K-1}\) alone. When the treatment effect is 0, \(E(MS_{treat})=E(MS_{error})\). We know that \(F=\frac{s^2_1}{s^2_2}\frac{\sigma^2_2}{\sigma^2_1}\). In ANOVA, \(F=\frac{MS_{treat}}{MS_{error}}\frac{E(MS_{error})}{E(MS_{treat})}\). As \(H_o:\mu_j=\mu_k\ \forall j\neq k\) means \(\tau_j=0\), or equivalently \(H_o:E(MS_{treat})=E(MS_{error})\), when \(H_o\) is assumed to be true, \(F=\frac{MS_{treat}}{MS_{error}}\).
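
These expectations can be checked with a small Monte Carlo sketch: generate many data sets with known treatment effects (the values of \(\mu\), \(\tau_j\), \(\sigma_\epsilon\), \(K\), and \(n\) below are hypothetical) and average \(MS_{treat}\) and \(MS_{error}\) across replications; the averages should approach \(\sigma^2_\epsilon+n\frac{\sum\tau_j^2}{K-1}\) and \(\sigma^2_\epsilon\), respectively.

set.seed(3)
K<-5; n<-12; mu<-10; tau<-c(-2,-1,0,1,2); sigma<-3   # hypothetical values
one.rep<-function(){
  g<-gl(K,n)
  x<-mu+tau[g]+rnorm(K*n,0,sigma)
  ms.treat<-n*var(tapply(x,g,mean))     # n * variance of the condition means
  ms.error<-mean(tapply(x,g,var))       # mean of the condition variances
  c(ms.treat,ms.error)
}
ms<-replicate(5000,one.rep())
rowMeans(ms)                            # simulated E(MS_treat) and E(MS_error)
c(sigma^2+n*sum(tau^2)/(K-1),sigma^2)   # theoretical values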

A Numeric Example

Two researchers tested 60 subjects’ aggressive behavior (12 per condition) under five conditions that differed in the extent of distraction during a task. Create the data as follows.

dta<-data.frame(cond=gl(5,12,12*5,labels=1:5),
                score=c(1.28,1.35,3.31,3.06,2.59,3.25,2.98,1.53,-2.68,2.64,1.26,1.06,
                        -1.18,0.15,1.36,2.61,0.66,1.32,0.73,-1.06,0.24,0.27,0.72,2.28,
                        -0.41,-1.25,-1.33,-0.47,-0.60,-1.72,-1.74,-0.77,-0.41,-1.2,-0.31,-0.74,
                        -0.85,0.14,-1.38,1.28,1.85,-0.59,-1.3,0.38,-0.35,2.29,-0.25,0.51,
                        2.01,0.4,2.34,-1.8,5,2.27,6.47,2.94,0.47,3.22,0.01,-0.66))
dim(dta)
## [1] 60  2

Let’s do some math first. Compute the sum of squares for the treatment effect.

means<-with(dta,tapply(score,cond,mean))
grand.mean<-mean(dta$score)
n<-sum(dta$cond=="1") # all conditions contain equal numbers of subjects
sstreat<-n*sum((means-grand.mean)^2)
sstreat
## [1] 66.44906

Compute the sum of squares for the error.

sserror<-unlist(lapply(1:5,function(x)(n-1)*with(dta,sd(score[cond==x]))^2))
sserror<-sum(sserror)
sserror
## [1] 124.4581

The degrees of freedom of \(SS_{treat}\) is \(5-1=4\) and the degrees of freedom of \(SS_{error}\) is \(5\times(12-1)=55\). Then, we can compute the two mean squares, form an F value, and check whether it is significant. Alternatively, you can look up an F table.

mstreat<-sstreat/4
mserror<-sserror/55
fvalue<-mstreat/mserror
fvalue
## [1] 7.34122
1-pf(fvalue,4,55)
## [1] 8.216394e-05
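
Instead of looking up an F table, you can ask R for the critical value directly with qf(); with \(\alpha=.05\) and \(df=4,55\), the cutoff is roughly 2.5, which the observed F clearly exceeds.

qf(0.95,4,55)   # critical F value at alpha = .05 with df = 4 and 55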

There is always an easier way to do statistics in R. We can use the function aov() to carry out the analysis of variance. See the code below.

dta.aov<-aov(score~cond,dta)
summary(dta.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cond         4  66.45  16.612   7.341 8.22e-05 ***
## Residuals   55 124.46   2.263                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Partition of Sums of Squares

The total sum of squares of all observed scores can be decomposed into two sums of squares. One is the sum of squares between conditions, which represents the effect of the experimental treatment. The other is the sum of squares within conditions, which represents the random error. Let’s see the proof. First, the linear model for ANOVA is

\(x_{ij}=\mu+(\mu_j-\mu)+(x_{ij}-\mu_{j})\). This equation can be rewritten as

\(x_{ij}-\mu=(\mu_{j}-\mu)+(x_{ij}-\mu_{j})\).

Let’s square the terms of both sides of this equation. Then,

\((x_{ij}-\mu)^2=[(\mu_{j}-\mu)+(x_{ij}-\mu_{j})]^2\).

Sum these squares along subjects \(i\) and conditions \(j\) and we will get

\(\sum\sum (x_{ij}-\mu)^2=\sum\sum[(\mu_{j}-\mu)+(x_{ij}-\mu_{j})]^2\).

We call \(\sum\sum (X_{ij}-\mu)^2\) sum of squares total, \(SST\). Thus,

\(SST=\sum\sum[(\mu_{j}-\mu)+(x_{ij}-\mu_{j})]^2\).

Since \(x_{ij}-\mu_{j}=\epsilon_{ij}\), which is independent of the treatment effect, the cross-product term \(2\sum\sum(\mu_{j}-\mu)(x_{ij}-\mu_{j})\) equals 0, so

\(SST=\sum\sum(\mu_{j}-\mu)^2+\sum\sum(x_{ij}-\mu_{j})^2\).

We also know that \(\sum\sum(\mu_{j}-\mu)^2\) is the sum of squares of the treatment, which we call \(SS_{treat}\). When the groups are of equal size (e.g., sample size \(n\)), \(SS_{treat}=n\sum(\mu_{j}-\mu)^2\).

The remaining term is the sum of squares of the errors, which we call \(SS_{error}\). Therefore, the equation can be rewritten as

\(SST=SS_{treat}+SS_{error}\).

As the treatment effect represents the variability between treatment groups, \(SS_{treat}\) is also called \(SS_{between}\). Likewise, as the error is computed within each treatment group, \(SS_{error}\) is also called \(SS_{within}\). Thus, \(SST=SS_{between}+SS_{within}\).

Check the summary table of the ANOVA for the 5 conditions and you will see that it basically shows the above partition. One last principle is worth noting: the degrees of freedom partition in the same way. For \(SST\), the variability of each score around the grand mean, the degrees of freedom is the total number of subjects minus 1, \(df_{total}=kn-1\), where \(k\) is the number of groups and \(n\) is the number of subjects in each group. For \(SS_{treat}\), the degrees of freedom is \(df_{treat}=k-1\), since there are \(k\) groups. The degrees of freedom for \(SS_{error}\) is \(df_{error}=k(n-1)\). The same partition that holds for the \(SS\) terms holds for the degrees of freedom.

\(df_{total}=df_{treat}+df_{error}\) or \(kn-1=k-1+k(n-1)\)

Check the summary table for this observation.
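
We can also verify the partition numerically with the sums of squares computed earlier for this data set: the total sum of squares around the grand mean should equal sstreat + sserror, and the degrees of freedom should add up the same way.

sstotal<-sum((dta$score-mean(dta$score))^2)
sstotal
sstreat+sserror               # should equal sstotal
c(5*12-1,(5-1)+5*(12-1))      # df_total and df_treat + df_error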

Another Numeric Example

The level-of-processing theory posits that the more deeply an item is processed, the better we recall it. A psychologist would like to examine this theory. He randomly assigned 50 subjects to 5 conditions, 10 per condition. The conditions differ in the depth of processing of the study list. The dependent variable is the number of items correctly recalled. See the data below.

dta<-data.frame(Y=c(9,8,6,8,10,4,6,5,7,7,
                   7,9,6,6,6,11,6,3,8,7,
                   11,13,8,6,14,11,13,13,10,11,
                   12,11,16,11,9,23,12,10,19,11,
                   10,19,14,5,10,11,14,15,11,11),
                G=rep(c("Counting","Rhyming","Adjective",
                        "Imagery","Intentional"),each=10))
dim(dta)
## [1] 50  2

We can make a box plot to see the distribution of the scores in each condition.

library(ggplot2)
ggplot(dta,aes(x=G,y=Y))+
  geom_boxplot()+xlab("Condition")+ylab("Accuracy")

According to the experimental design, the conditions represent levels of processing of the stimuli which, from shallow to deep, are: counting the number of letters in the word, checking whether the word rhymes (i.e., has a particular ending sound), judging whether the word is an adjective, forming an image of the word, and intentionally memorizing the word. However, the x-axis in this box plot follows alphabetical order, which is R’s default ordering for data on a nominal scale. We can reorder the levels according to our experimental design.

dta$g<-as.factor(rep(1:5,each=10))
ggplot(dta,aes(g,Y))+
  geom_boxplot()+
  scale_x_discrete(labels=c("1"="Counting","2"="Rhyming","3"="Adjective",
                            "4"="Imagery","5"="Intentional"))+
  xlab("Condition")+ylab("Accuracy")

Visual inspection of this box plot suggests that the deeper the processing level, the better the recall accuracy. However, the box plot does not show the mean accuracy of each condition. Thus, we can switch to a bar plot to represent the means of all conditions.

# First, compute the mean of each condition
means<-with(dta,tapply(Y,g,mean))
# Second, compute the standard error
ses<-with(dta,tapply(Y,g,sd))/sqrt(with(dta,table(g)))
dta1<-data.frame(means=means,ses=ses,cond=as.factor(1:5))
ggplot(dta1,aes(cond,means))+
  geom_bar(stat="identity",position=position_dodge(0.9),fill="#00BFC4")+
  geom_errorbar(aes(ymin=means-ses,ymax=means+ses),width=0.2)+
  scale_x_discrete(labels=c("1"="Counting","2"="Rhyming","3"="Adjective",
                            "4"="Imagery","5"="Intentional"))+
  xlab("Condition")+ylab("Mean Accuracy")

Again, we can take a quick look at this figure. First, not every mean is the same. The mean accuracy in the two shallow conditions is clearly lower than in the others. Thus, it can be expected that the omnibus F test will be significant. That is, \(H_o:\mu_j=\mu_k\ \forall j\neq k\) should be rejected. Let’s conduct a one-way between-subjects ANOVA.

result1.aov<-aov(Y~G,dta)
summary(result1.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## G            4  351.5   87.88   9.085 1.82e-05 ***
## Residuals   45  435.3    9.67                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Obviously, the treatment effect is significant, \(F(4,45)=9.09\), \(MSe=9.67\), \(p<.01\). The \(df\) of the treatment effect is \(k-1=5-1=4\) and the \(df\) of the error term is \(k(n-1)=5\times(10-1)=45\). In the summary table, these \(df\)s are all correct, so we can be confident that we set up the ANOVA correctly.
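
If you need these numbers programmatically (e.g., for reporting), they can be pulled out of the summary object, whose first element is a data-frame-like table.

tab<-summary(result1.aov)[[1]]
tab$`F value`[1]    # the F statistic for the treatment effect
tab$`Pr(>F)`[1]     # its p value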

In fact, ANOVA can be viewed as a special case of regression, so the same data set can be analyzed with the function lm( ). Comparing the output with the group means, you might notice that the estimated intercept is actually the mean of the first condition. The second estimated coefficient is -0.1, which is \(\overline{x}_2-\overline{x}_1\). Similarly, the third estimated coefficient is \(\overline{x}_3-\overline{x}_1\).

result1.lm<-lm(Y~g,dta)
summary(result1.lm)
## 
## Call:
## lm(formula = Y ~ g, data = dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7.00  -1.85  -0.45   2.00   9.60 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.0000     0.9835   7.117 6.83e-09 ***
## g2           -0.1000     1.3909  -0.072 0.943004    
## g3            4.0000     1.3909   2.876 0.006138 ** 
## g4            6.4000     1.3909   4.601 3.43e-05 ***
## g5            5.0000     1.3909   3.595 0.000802 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.11 on 45 degrees of freedom
## Multiple R-squared:  0.4468, Adjusted R-squared:  0.3976 
## F-statistic: 9.085 on 4 and 45 DF,  p-value: 1.815e-05
means
##    1    2    3    4    5 
##  7.0  6.9 11.0 13.4 12.0

As \(g\) is a nominal variable, a nominal predictor in lm( ) is by default encoded with dummy codes. Since we have 5 conditions here, namely 5 levels of the independent variable, there are 4 dummy variables. In the summary table, they are \(g_2\), \(g_3\), \(g_4\), and \(g_5\). Thus, the regression model can be written as

\(\hat{Y}=b_1+b_2g_2+b_3g_3+b_4g_4+b_5g_5\).

When \(g_2\), \(g_3\), \(g_4\), and \(g_5\) are all 0, the regression model becomes

\(\hat{Y}=b_1\). This is for condition 1.

In this circumstance, the intercept is the prediction of this regression model for condition 1. As the predicted value is actually the mean, the intercept \(b_1\) is actually the mean of the first condition. When \(g_2=1\) and other dummy variables are 0, the regression model becomes

\(\hat{Y}=b_1+b_2\).

Thus, the mean of condition 2 is estimated by \(b_1+b_2\). As \(\overline{x}_1=b_1\) and \(\overline{x}_2=b_1+b_2\), the second estimated coefficient \(b_2=\overline{x}_2-\overline{x}_1\). This is how lm( ) deals with the problem of ANOVA.
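
The correspondence between the regression coefficients and the condition means can be checked directly as a quick sanity check.

b<-coef(result1.lm)
b[1]           # intercept = mean of condition 1
b[1]+b[2:5]    # intercept + each dummy coefficient = means of conditions 2 to 5
means          # the condition means computed earlier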