Analysis of Variance

The analysis of variance (ANOVA) has long enjoyed the status of being one of the most used statistical techniques in psychological research. The popularity and usefulness of this technique can be attributed to two sources. First, the ANOVA, like the t test, deals with differences between or among sample means; unlike the t test, the ANOVA can deal with differences among more than two means. Second, the ANOVA also allows us to deal with two or more independent variables simultaneously.

The Underlying Model

Suppose we collected data from different subjects in five experimental conditions, each of which corresponds to one level of the independent variable. Each entry \(x_{ij}\) records the score made by subject \(i\) in condition \(j\). The grand mean of all observed data is \(M_{..}=\frac{1}{N}\sum_{j}\sum_{i}x_{ij}\).

Suppose you are asked to estimate a subject’s score based on the current data set. Intuitively, the grand mean \(M_{..}\) would be a relatively safe guess. Thus, you may say \(x_{ij}=M_{..}\). Suppose now that you know the subject is actually in a certain condition \(j\). What is your guess now? Since we expect different conditions to have different effects on subjects’ performance, you might now estimate this subject’s score as the sum of the grand mean and the effect of that particular condition. That is, \(x_{ij}=M_{..}+\tau_j\). The effect of condition \(j\) on a subject’s score is simply the difference between the mean of that condition and the grand mean, \(\tau_j=M_j-M_{..}\). This estimate is better. However, even in the same condition, different subjects may perform differently due to sampling error. Thus, a full model for any subject’s score can be written as \(x_{ij}=M_{..}+\tau_j+\epsilon_{ij}\).
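To make the model concrete, here is a minimal simulation sketch in R. All numbers (grand mean, condition effects, noise SD) are made-up illustrations, not data from this tutorial.

set.seed(1)
M <- 10                               # grand mean
tau <- c(-2,-1,0,1,2)                 # condition effects, summing to zero
n <- 12                               # subjects per condition
cond <- gl(5,n)                       # condition labels
x <- M + tau[cond] + rnorm(5*n,0,2)   # x_ij = M + tau_j + e_ij
round(tapply(x,cond,mean),2)          # condition means land near M + tau_j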

Assumptions of ANOVA

Homogeneity of Variance

It is assumed that each of our populations has the same variance. Just like the assumption for prediction residuals in regression analysis, \(\epsilon\sim ND(0,\sigma_\epsilon)\). According to the above formula, each subject’s score can be written as \(x_{ij}=M_{..}+\tau_j+\epsilon_{ij}=M_j+\epsilon_{ij}\). Thus, the distribution of the scores in condition \(j\) has mean \(M_j\) and standard deviation \(\sigma_j=\sigma_\epsilon\). Similarly, the distributions of scores in the other conditions are assumed to have the same standard deviation \(\sigma_\epsilon\). Therefore, \(\sigma_{1}^2=\sigma_{2}^2=\sigma_{3}^2=\sigma_{4}^2=\sigma_{5}^2=\sigma_{\epsilon}^2\).
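This tutorial does not formally test this assumption, but for reference, base R’s bartlett.test() checks equal variances across groups. The data frame below is simulated purely to make the call concrete.

set.seed(2)
dta0 <- data.frame(cond=gl(5,12), score=rnorm(60))   # hypothetical data
bartlett.test(score~cond, data=dta0)   # H0: all condition variances are equal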

Normality

A second assumption of the ANOVA is that the scores for each condition are normally distributed around their condition mean. Because \(\sigma_{\epsilon}^2\) represents the variability of each subject’s score around the mean of that condition, a more precise way to write this assumption is \(\epsilon_{ij}\sim ND(0,\sigma_\epsilon)\).

Independence

The third assumption is that the observations are independent of one another. In other words, the error components \(\epsilon_{ij}\) are independent.

Hypothesis Testing in ANOVA

Different from the t test, the null hypothesis \(H_{o}\) does not say which two group means are equal. Instead, it states that there is no difference between any two group means; the alternative states that at least one pair of group means differs.

\[\begin{align*} H_o: & \ \mu_{j}=\mu_{k}, \ \forall\, j\neq k\\ H_1: & \ \mu_{j}\neq \mu_{k}, \ \text{for at least one pair } j\neq k \end{align*}\]

But how exactly do we conduct the statistical test for these two hypotheses?

The Logic of ANOVA

The logic of ANOVA is very simple. According to the assumptions of the ANOVA, the only difference between the populations for the experimental conditions is in their means. We know that the variance of the scores in each group is an estimate of the variance of the population from which the scores are sampled, since the unbiased estimate of a population variance is the sample variance. That is,

\(\sigma_{1}^2 \doteq s_{1}^2\), \(\sigma_{2}^2 \doteq s_{2}^2\), \(\cdots\), \(\sigma_\epsilon^2 \doteq s_{e}^2\), where \(\doteq\) is read as “is estimated by”.

According to the assumption of homogeneity of variance, the population variances of the conditions are all the same, namely \(\sigma^2_\epsilon\), and each can be estimated by a sample variance \(s^2_j\). Therefore, a better estimate is the average of all the sample variances. Thus, \(\sigma_\epsilon^2 \doteq s_{e}^2 = \overline{s_j^2} = \frac{\sum s_j^2}{k}\), where \(k\) is the number of groups (or treatments).

This gives us one estimate of the population variance, which we refer to as \(MS_{error}\), or sometimes \(MS_{within}\).

Now let’s assume that \(H_o\) is true. Then, the \(k\) samples can be thought of as \(k\) independent samples from the same population and we can produce another possible estimate of \(\sigma_\epsilon^2\). Recall that the central limit theorem states that the variance of means sampled from the same population equals the variance of the population divided by the sample size. Thus, if \(H_{o}\) is true, the variance of the \(k\) sample means estimates \(\frac{\sigma_\epsilon^2}{n}\), where \(n\) is the sample size or number of subjects in each condition. Therefore, \(\sigma_\epsilon^2 \doteq ns_{\bar{x}}^2\). This term is referred to as \(MS_{treatment}\).

We now have two estimates of the population variance \(\sigma_\epsilon^2\). One of them (\(MS_{error}\)) is valid regardless of whether or not \(H_{o}\) is true. The other (\(MS_{treat}\)) estimates \(\sigma_\epsilon^2\) only when \(H_o\) is true. Thus, if the two estimates agree, we have support for \(H_o\); if they disagree, we can reject \(H_{o}\).
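A small simulation sketch can illustrate this logic. Under \(H_o\) (one common population; the numbers below are illustrative assumptions), the two estimates of \(\sigma_\epsilon^2\) agree up to sampling error.

set.seed(3)
k <- 5; n <- 12
g <- gl(k,n)
y <- rnorm(k*n, mean=10, sd=2)       # H0 true: one population, sigma^2 = 4
ms.error <- mean(tapply(y,g,var))    # average within-group variance
ms.treat <- n*var(tapply(y,g,mean))  # n times the variance of group means
c(ms.error, ms.treat)                # both hover around 4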

Variance Estimation

We first define the treatment effect \(\tau_j=\mu_j-\mu\), and we define \(\theta_\tau^2\) as the variance of the true population means.

\(\theta_{\tau}^2=\frac{\sum (\mu_{j}-\mu)^2}{k-1}=\frac{\sum \tau_{j}^2}{k-1}\)

Also, \(\sigma_\epsilon^2\) is the expected value of \(MS_{error}\).

\(E(MS_{error})=\sigma_\epsilon^2\)

The expected value of \(MS_{treatment}\) is the sum of the population variance \(\sigma_{\epsilon}^2\) and the variability brought by the treatment effects.

\(E(MS_{treat})=\sigma'^2_\epsilon=\sigma_\epsilon^2+n\frac{\sum \tau_{j}^2}{k-1}=\sigma_\epsilon^2+n\theta_{\tau}^2\)

Now if \(H_{o}\) is true, there is no treatment effect, \(\theta_{\tau}^2=0\), so \(E(MS_{treat})=\sigma_{\epsilon}^2+0=E(MS_{error})\). However, if \(H_{o}\) is false, the treatment effect is not 0, \(\theta_{\tau}^2 \neq 0\), and thus \(E(MS_{error})<E(MS_{treat})\).
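A quick simulation sketch (with made-up effects) shows these expectations at work: averaged over many replications, \(MS_{error}\) stays near \(\sigma_\epsilon^2\) while \(MS_{treat}\) is inflated by \(n\theta_{\tau}^2\).

set.seed(4)
k <- 5; n <- 12; tau <- c(-2,-1,0,1,2)   # theta^2 = sum(tau^2)/(k-1) = 2.5
g <- gl(k,n)
reps <- replicate(2000, {
  y <- rnorm(k*n, mean=10+rep(tau,each=n), sd=2)   # sigma^2 = 4, H0 false
  c(mean(tapply(y,g,var)), n*var(tapply(y,g,mean)))
})
rowMeans(reps)   # near c(4, 4 + 12*2.5) = c(4, 34)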

Calculations in the Analysis of Variance

By comparing \(MS_{error}\) with \(MS_{treat}\), we can gain evidence for the truth or falsity of \(H_o\). Recall the F test, which is used to check whether two population variances are equal. Since a mean square (\(MS\)) is simply a variance, the F test is suitable for checking whether \(E(MS_{error})=E(MS_{treat})\), with \(H_{o}: \mu_j=\mu_k\) assumed to be true first. The test statistic is \(F=\frac{MS_{treat}}{MS_{error}}\), with \(k-1\) and \(k(n-1)\) degrees of freedom.
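For reference, the critical value of this \(F\) test is easy to obtain in R; with \(k=5\) groups and \(n=12\) subjects per group (the design of the upcoming example), we reject \(H_o\) at \(\alpha=.05\) when the observed \(F\) exceeds it.

k <- 5; n <- 12
qf(0.95, df1=k-1, df2=k*(n-1))   # critical F(4,55) at alpha = .05, about 2.54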

Example 1

Two researchers tested 60 subjects’ aggressive behavior under five conditions with different extents of distraction during a task. The larger the score, the stronger the aggressive behavior. The data are created as follows. In this data frame, there are two columns. The first is called cond, which is created by gl(). This function creates a factor; that is, the levels of this factor are nominal. Since we have 5 conditions, the first argument is 5. The second argument defines how many repetitions of each level. Since we have 12 subjects in every condition, each condition level repeats 12 times in this column. As shown by calling dta$cond, there are 5 levels. Note that a common vector in R does not have levels; thus, this must be a factor. The second column is called score, which simply stores the aggressive behavior scores.

dta<-data.frame(cond=gl(5,12,12*5,labels=1:5),
                score=c(1.28,1.35,3.31,3.06,2.59,3.25,2.98,1.53,-2.68,2.64,1.26,1.06,
                        -1.18,0.15,1.36,2.61,0.66,1.32,0.73,-1.06,0.24,0.27,0.72,2.28,
                        -0.41,-1.25,-1.33,-0.47,-0.60,-1.72,-1.74,-0.77,-0.41,-1.2,-0.31,-0.74,
                        -0.85,0.14,-1.38,1.28,1.85,-0.59,-1.3,0.38,-0.35,2.29,-0.25,0.51,
                        2.01,0.4,2.34,-1.8,5,2.27,6.47,2.94,0.47,3.22,0.01,-0.66))
head(dta)
##   cond score
## 1    1  1.28
## 2    1  1.35
## 3    1  3.31
## 4    1  3.06
## 5    1  2.59
## 6    1  3.25
dta$cond
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [36] 3 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5
## Levels: 1 2 3 4 5

Describe Data

In this experiment, the 5 conditions are treated as distinct levels. Thus, we can make a dot plot of the data on each level. In the left panel of the figure below, the scores on each level are vertically displayed, with two boxes above and below the mean marking 1 standard deviation. The right panel is what we usually see in journal papers, in which the error bar is one standard error. Different from the previous figures, I remove all the grids and use the black and white background in this figure, which is set by theme_bw() and theme().

library(ggplot2)
library(gridExtra)
# Left panel: dot plot of raw scores with crossbars at mean +/- 1 SD
fg1<-ggplot(data=dta,aes(x=cond,y=score))+
       geom_dotplot(binaxis="y",stackdir="center",binwidth=0.2)+
       stat_summary(fun.data="mean_sdl",fun.args=list(mult=1),
                    geom="crossbar",width=0.5)
# Condition means and standard errors (n = 12 subjects per condition)
means<-with(dta,tapply(score,cond,mean))
ses<-with(dta,tapply(score,cond,sd))/sqrt(12)
# Right panel: bar plot of means with +/- 1 SE error bars, black-and-white theme
fg2<-ggplot(data=data.frame(cond=as.factor(1:5),
                            Aggression=means,
                            ses=ses),aes(x=cond,y=Aggression))+
      theme_bw()+theme(plot.background=element_blank(),
                  panel.grid.major=element_blank(),
                  panel.grid.minor=element_blank())+
      geom_bar(stat="identity",color="black",fill="grey")+
      geom_errorbar(aes(ymin=Aggression-ses,ymax=Aggression+ses),width=0.2,
                    position=position_dodge(0.9))
grid.arrange(fg1,fg2,ncol=2)   # place the two panels side by side

ANOVA

This is a between-subjects design, in which each subject participated in only one condition, or took only one experimental treatment. Of course, these subjects must be randomly recruited. In order to examine whether aggressive behavior is influenced by distractors, a one-way between-subjects ANOVA is conducted. Here one-way means one factor, and the factor is the experimental treatment or condition. In R, we can use aov() to conduct the ANOVA. Similar to lm(), you need to put in the formula as the first argument. In this case, the dependent variable is score and the independent variable is of course cond. Instead of predictor, I use independent variable here. In the convention of experimental design, we always say that the experimental conditions/treatments are manipulated by the experimenter. With errors well controlled or eliminated, the corresponding change in the dependent variable can be attributed to the effect of the independent variable. This is why the experiment is thought to be the only means in research methodology to confirm causality.

agg.aov<-aov(score~cond,dta)
summary(agg.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cond         4  66.45  16.612   7.341 8.22e-05 ***
## Residuals   55 124.46   2.263                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above summary table of the ANOVA shows the source of variance, df, \(SS\), \(MS\), \(F\) value, and \(p\) value. As described in the previous section, the main effect of the independent variable (i.e., cond) can be tested by comparing \(MS_{treat}\) and \(MS_{error}\) via the \(F\) test. Let us check \(MS_{error}\) first. Since each condition has an equal number of subjects, \(MS_{error}=\frac{SS_{error}}{k(n-1)}\) with \(SS_{error}=\sum\sum(x_{ij}-\bar{x}_{.j})^2\). Thus, the df for \(MS_{error}\) is \(k(n-1)=5\times11=55\). According to this formula, \(MS_{error}\) is the mean of the score variances in the 5 conditions, which is the estimate of the population variance \(\sigma_\epsilon^2\). Indeed it is, as shown in the code below.

Second, \(MS_{treat}=\frac{SS_{treat}}{k-1}\). Thus, the df for \(MS_{treat}\) is \(k-1=4\). When \(H_o\) is true, according to the CLT, the variance of the sampling distribution of the condition means is the population variance divided by the sample size \(n\). The second line below therefore multiplies the variance of the condition means by \(n=12\), reproducing \(MS_{treat}\) as the other estimate of \(\sigma_\epsilon^2\).

mean(with(dta,tapply(score,cond,sd))^2)
## [1] 2.262875
sd(means-mean(dta$score))^2*12
## [1] 16.61227

As shown in the summary table above, the main effect of the independent variable is significant, \(F(4,55)=7.34\), \(p<.001\). When reporting this result, you should state that the extent of aggressive behavior is affected by the level of distraction during the task.

Also, just like regression, in ANOVA the total variability of the dependent variable, \(SST\), is the total squared deviation of each data point from the grand mean, and it can be decomposed into the sum of squares from two sources: \(SST=SS_{treat}+SS_{error}\). Then \(R^2=\frac{SS_{treat}}{SST}=1-\frac{SS_{error}}{SST}\). Thus, \(R^2=\frac{66.45}{66.45+124.46}=0.35\), suggesting that 35% of the variability of the scores can be explained by the independent variable, namely the level of distraction.
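This \(R^2\) can also be recovered directly from the fitted aov object, using the sums of squares stored in its summary table.

ss <- summary(agg.aov)[[1]][["Sum Sq"]]   # SS_treat and SS_error
ss[1]/sum(ss)                             # 66.45/(66.45+124.46), about 0.348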

Regression and ANOVA

In the previous tutorial, we saw that a regression model can include nominal predictors. Therefore, we can also conduct a regression analysis for the current case. The regression model is \(\hat{y}=b_0+b_1g_1+b_2g_2+b_3g_3+b_4g_4\), where every \(g\) is a dummy variable. Thus, when all \(g\)’s are 0, the model predicts the mean of the first group as the intercept \(b_0\). Indeed, this is confirmed in the code below. The remaining coefficients are the differences between each of the other group means and the first group mean.

m1<-lm(score~cond,dta)
summary(m1)
## 
## Call:
## lm(formula = score ~ cond, data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4825 -0.5904  0.0500  0.7106  4.5808 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.80250    0.43425   4.151 0.000116 ***
## cond2       -1.12750    0.61412  -1.836 0.071772 .  
## cond3       -2.71500    0.61412  -4.421 4.67e-05 ***
## cond4       -1.65833    0.61412  -2.700 0.009188 ** 
## cond5        0.08667    0.61412   0.141 0.888288    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.504 on 55 degrees of freedom
## Multiple R-squared:  0.3481, Adjusted R-squared:  0.3007 
## F-statistic: 7.341 on 4 and 55 DF,  p-value: 8.216e-05
means
##          1          2          3          4          5 
##  1.8025000  0.6750000 -0.9125000  0.1441667  1.8891667
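To see the dummy coding R builds behind the scenes, we can inspect the model matrix; rows from condition 1 (the reference level) have all \(g\)’s equal to 0.

head(model.matrix(m1), 3)   # intercept column plus dummy columns cond2-cond5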

In the summary of the regression analysis, the residual standard error is 1.504. We already know this term as the standard error of estimate, \(\sqrt{\sum(y-\hat{y})^2/(N-k)}\) (here \(N-k=55\)), which represents the spread of the residuals. Thus, when squared, it becomes the mean square of the residuals. Indeed, \(1.504^2=2.262\), which is the MS of the residuals in the ANOVA summary table. In fact, you can use anova() to turn the output of a regression analysis into the format of an ANOVA table. You can verify that this is exactly the same as what you get when using aov().

anova(m1)
## Analysis of Variance Table
## 
## Response: score
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## cond       4  66.449 16.6123  7.3412 8.216e-05 ***
## Residuals 55 124.458  2.2629                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing Means

The main purpose of conducting an experiment is to test a hypothesis by manipulating the independent variable and observing whether the subjects’ responses change correspondingly in the expected direction. Thus, the difference in subjects’ responses (i.e., the dependent variable) between different conditions is the basis of our hypothesis testing. This difference is often represented by the difference between the mean scores of the responses in different conditions.

Normally, in the context of any experiment, we have a clear hypothesis before we conduct the experiment. For instance, we might expect that a new treatment can effectively help people lose weight. Then the experimental group consists of those subjects who are going to receive this treatment, whereas the control group consists of those who are not. Obviously, the dependent variable of this experiment is the weight loss (perhaps in kg). If the mean weight loss of the experimental group is larger than that of the control group, we can declare that this treatment is effective for losing weight. In order to guarantee the reliability of our experiment, statistical tests are applied for comparing the means of different conditions with the type I error rate at 5%. This is the basic logic of psychological experiments.

However, the ANOVA I just introduced is not used for testing whether a particular group mean is different from another. It is called the omnibus \(F\) test, and it just answers the question of whether the variability of the data due to the experimental manipulation is the same as that due to random error. How then can we compare the means of multiple groups? One common way is to use the \(t\) test with Bonferroni correction. What is the Bonferroni correction? Simple. You divide the type I error rate by the number of comparisons and use the result as the type I error rate for each comparison. For example, if you want to compare the means pairwise in three conditions \(a\), \(b\), and \(c\), the type I error rate is adjusted to \(0.05/3\approx.0167\). The reason we need the Bonferroni correction is to prevent the inflation of type I error. What if we do not correct the type I error rate for each comparison? Since every comparison is made with a 5% chance of rejecting a correct \(H_o\), in total the chance for the whole experiment to reject a correct \(H_o\) is up to 15%, which is far too large. Thus, the Bonferroni correction maintains the family-wise type I error rate at 5%.
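Although this tutorial uses PostHocTest() below, as a sketch, base R’s pairwise.t.test() runs all pairwise \(t\) tests with Bonferroni-adjusted \(p\) values in one call.

pairwise.t.test(dta$score, dta$cond, p.adjust.method="bonferroni")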

Posterior and A Priori Comparison

Since the omnibus \(F\) test cannot test the difference in a particular pair of group means, two kinds of comparisons are often conducted after it. One is the posterior comparison, which is a variant of the \(t\) test. The posterior comparison is designed to compare any pair of group means with the standard error derived from \(MS_{error}\). The tests designed for making posterior comparisons form a big family, collectively called post hoc tests. In contrast with the posterior comparison, the other kind is the a priori comparison, which is also called a planned comparison or contrast. Normally, when the omnibus \(F\) test is significant, posterior or a priori comparisons are tested in order to clarify which pairs of means make the \(F\) test significant. If you have no particular idea about which pairs should be tested, posterior comparisons are conducted. On the contrary, if you have, planned comparisons are conducted. As your suspicion might come from visual inspection of the group means, an a priori comparison does not have to be designed before the experiment. Therefore, the difference between posterior and a priori comparisons is not whether the comparisons are designed before or after the experiment. Instead, the difference is whether you have particular comparisons to test. When there is no particular preference, the post hoc tests examine all pairwise comparisons. However, if you are particularly interested in some pairs of means, you do not have to test all comparisons and can just focus on those of interest. That is the difference between posterior and a priori comparisons. A sketch of one such planned comparison follows.
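As a hedged sketch (not part of the original analysis), a single planned comparison, say condition 5 versus condition 3, can be tested by hand with the pooled error term \(MS_{error}=2.263\) from the ANOVA table; the result reproduces the 5-3 row of the Fisher LSD table below.

mse <- 2.263; n <- 12                       # MS_error and group size from above
d <- means[5]-means[3]                      # 1.889 - (-0.913) = 2.802
t.obs <- d/sqrt(mse*(1/n+1/n))              # SE of a difference of two means
2*pt(abs(t.obs), df=55, lower.tail=FALSE)   # two-tailed p, about 2.9e-05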

Post Hoc Tests

Three popular post hoc tests are introduced here. The first is Fisher’s LSD (Least Significant Difference) test. When we have only a few means to compare, this is a very legitimate and useful procedure. Fisher’s LSD is basically a set of individual t tests, and the only difference between it and linear contrasts is that Fisher’s LSD requires a significant overall \(F\) test. The second is Tukey’s test, which is also called Tukey’s HSD (Honestly Significant Difference) test or WSD (Wholly Significant Difference) test. The third is the Scheffé test, which is the most rigorous post hoc test. The formulas of these tests can easily be found on the Internet; you are welcome to google them. In R, it is very easy to run these post hoc tests with PostHocTest() from the DescTools package. You simply enter the object returned by aov() as the first argument and choose which post hoc test you want in the method argument.

library(DescTools)
PostHocTest(agg.aov,method="lsd")
## 
##   Posthoc multiple comparisons of means : Fisher LSD 
##     95% family-wise confidence level
## 
## $cond
##            diff      lwr.ci     upr.ci    pval    
## 2-1 -1.12750000 -2.35822804  0.1032280  0.0718 .  
## 3-1 -2.71500000 -3.94572804 -1.4842720 4.7e-05 ***
## 4-1 -1.65833333 -2.88906137 -0.4276053  0.0092 ** 
## 5-1  0.08666667 -1.14406137  1.3173947  0.8883    
## 3-2 -1.58750000 -2.81822804 -0.3567720  0.0124 *  
## 4-2 -0.53083333 -1.76156137  0.6998947  0.3911    
## 5-2  1.21416667 -0.01656137  2.4448947  0.0531 .  
## 4-3  1.05666667 -0.17406137  2.2873947  0.0909 .  
## 5-3  2.80166667  1.57093863  4.0323947 2.9e-05 ***
## 5-4  1.74500000  0.51427196  2.9757280  0.0063 ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There are 5 conditions in Example 1, so in total there are \(\binom{5}{2}=10\) pairwise comparisons, which is why the above table shows 10 rows. Now we can run Tukey’s HSD test with the same function, with the method changed to “hsd”. Tukey’s HSD test is more rigorous than Fisher’s LSD test, so you can find fewer significant comparisons in the HSD table.

PostHocTest(agg.aov,method="hsd")
## 
##   Posthoc multiple comparisons of means : Tukey HSD 
##     95% family-wise confidence level
## 
## $cond
##            diff      lwr.ci      upr.ci    pval    
## 2-1 -1.12750000 -2.85952525  0.60452525 0.36394    
## 3-1 -2.71500000 -4.44702525 -0.98297475 0.00044 ***
## 4-1 -1.65833333 -3.39035858  0.07369191 0.06680 .  
## 5-1  0.08666667 -1.64535858  1.81869191 0.99991    
## 3-2 -1.58750000 -3.31952525  0.14452525 0.08725 .  
## 4-2 -0.53083333 -2.26285858  1.20119191 0.90861    
## 5-2  1.21416667 -0.51785858  2.94619191 0.29067    
## 4-3  1.05666667 -0.67535858  2.78869191 0.43010    
## 5-3  2.80166667  1.06964142  4.53369191 0.00027 ***
## 5-4  1.74500000  0.01297475  3.47702525 0.04746 *  
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we use the same function to run the Scheffé test by setting method=“scheffe”. This time, only two comparisons are significant.

PostHocTest(agg.aov,method="scheffe")
## 
##   Posthoc multiple comparisons of means : Scheffe Test 
##     95% family-wise confidence level
## 
## $cond
##            diff     lwr.ci     upr.ci   pval    
## 2-1 -1.12750000 -3.0848789  0.8298789 0.5042    
## 3-1 -2.71500000 -4.6723789 -0.7576211 0.0019 ** 
## 4-1 -1.65833333 -3.6157122  0.2990456 0.1375    
## 5-1  0.08666667 -1.8707122  2.0440456 0.9999    
## 3-2 -1.58750000 -3.5448789  0.3698789 0.1700    
## 4-2 -0.53083333 -2.4882122  1.4265456 0.9443    
## 5-2  1.21416667 -0.7432122  3.1715456 0.4276    
## 4-3  1.05666667 -0.9007122  3.0140456 0.5687    
## 5-3  2.80166667  0.8442878  4.7590456 0.0013 ** 
## 5-4  1.74500000 -0.2123789  3.7023789 0.1045    
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise:

The level-of-processing theory posits that the more deeply an item is processed, the better we recall it. A psychologist would like to examine this theory. He randomly assigned 50 subjects to 5 conditions (10 per condition). The difference between these conditions is the depth of processing of the study list. The dependent variable is the number of items correctly recalled. Please use the following code to create the data and start your data analysis to examine this researcher’s hypothesis.

dta<-data.frame(Y=c(9,8,6,8,10,4,6,5,7,7,
                   7,9,6,6,6,11,6,3,8,7,
                   11,13,8,6,14,11,13,13,10,11,
                   12,11,16,11,9,23,12,10,19,11,
                   10,19,14,5,10,11,14,15,11,11),
                G=rep(c("Counting","Rhyming","Adjective",
                        "Imagery","Intentional"),each=10))