The independent variables of an experiment are referred to as factors. An experiment with two factors is an instance of what is called a two-way factorial design. In the table below, the columns represent the two levels of factor A, whereas the rows represent the three levels of factor B. In general, there are \(j\) levels of A and \(k\) levels of B, and each of the \(j\times k\) cells contains \(n\) subjects.

ANOVA for Two-Way Factorial Design

In the one-way between-subjects design, the total variability of response scores can be decomposed into two parts: one caused by the treatment effect and the other caused by error. That is, \(SST=SS_{treat}+SS_{error}\). Similarly, in the two-way between-subjects design, the total variability of response scores can be decomposed into a treatment part and an error part. However, the treatment part now has three sources: factor A, factor B, and the interaction between A and B. That is, \(SST=SS_A+SS_B+SS_{AB}+SS_{error}\).
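
The decomposition can be checked numerically. The following is a minimal sketch, assuming the detergent scores used in Example 2 of this tutorial (2 brands by 3 temperatures, 4 scores per cell); the object names d, ss_total, and ss_parts are made up for this illustration.

```r
# Detergent scores from Example 2: factor A = Brand, factor B = Temp, n = 4 per cell
d <- data.frame(
  Brand = factor(rep(c("Super", "Best"), each = 12)),
  Temp  = factor(rep(rep(c("Cold", "Warm", "Hot"), each = 4), times = 2)),
  D     = c(4, 5, 6, 5,  7, 9, 8, 12,  10, 12, 11, 9,
            6, 6, 4, 4,  13, 15, 12, 12,  12, 13, 10, 13)
)

# SST: squared deviations of every score from the grand mean
ss_total <- sum((d$D - mean(d$D))^2)

# "Sum Sq" column of the ANOVA table, in the order A, B, AB, error
ss_parts <- summary(aov(D ~ Brand * Temp, data = d))[[1]][["Sum Sq"]]
names(ss_parts) <- c("A", "B", "AB", "error")

print(round(ss_parts, 2))
# TRUE when the four parts add up to the total
print(isTRUE(all.equal(sum(ss_parts), ss_total)))
```

The four parts should add up to the total only because the design is balanced, with the same number of scores in every cell.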

Therefore, the omnibus \(F\) tests for a two-way factorial design examine three effects: the main effect of A, the main effect of B, and the interaction between A and B. Following the same logic as the one-way between-subjects ANOVA, each effect is tested with an \(F\) test that compares the mean square of the variation in response scores caused by that effect with the mean square of error. The scientific hypotheses are as follows.

\[\begin{align*} H_1:& \tau_A \neq 0\\ H_1:& \tau_B \neq 0\\ H_1:& \tau_{AB} \neq 0 \end{align*}\]

The mean square of error is the estimate of the population variance \(\sigma^2_\epsilon\). It is computed as \(\frac{\sum\sum\sum(x_{ijk}-\bar{x}_{.jk})^2}{jk(n-1)}\), where \(jk(n-1)\) is the df and the subscripts index the subject, the level of A, and the level of B. When \(H_0\) is correct, the population variance can also be estimated from the variance of the means of factor A. Because each of these means is based on \(nk\) scores, the variance is multiplied by \(nk\): \(\frac{nk\sum(\bar{x}_{.j.}-\bar{x}_{...})^2}{j-1}\). Thus, the df of factor A is \(j-1\). Likewise, the population variance can be estimated from the variance of the means of factor B multiplied by \(nj\), \(\frac{nj\sum(\bar{x}_{..k}-\bar{x}_{...})^2}{k-1}\). The df of factor B is \(k-1\). The interaction is what remains of each cell mean after the two main effects are removed. Each cell mean is based on \(n\) scores, so the estimate is \(\frac{n\sum\sum(\bar{x}_{.jk}-\bar{x}_{.j.}-\bar{x}_{..k}+\bar{x}_{...})^2}{(j-1)(k-1)}\). Thus, the df of the interaction effect is \((j-1)(k-1)\). What is described above can be summarized by the equations below, where \(\theta^2\) is the variance of the effect \(\tau\).

\[\begin{align*} E(MS_A)&=\sigma^2_{\epsilon}+nk\theta^2_{\tau_A}\\ E(MS_B)&=\sigma^2_{\epsilon}+nj\theta^2_{\tau_B}\\ E(MS_{AB})&=\sigma^2_{\epsilon}+n\theta^2_{\tau_{AB}}\\ E(MS_{\epsilon})&=\sigma^2_{\epsilon} \end{align*}\]
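
To see the mean squares at work, here is a minimal sketch that computes them by hand and compares them with aov(). It assumes the detergent scores from Example 2 of this tutorial, with Brand as factor A and Temp as factor B; all object names are made up for this illustration. Each marginal mean of A averages \(nk\) scores and each marginal mean of B averages \(nj\) scores, which is why those weights appear in the code.

```r
# Detergent scores from Example 2: factor A = Brand, factor B = Temp
d <- data.frame(
  Brand = factor(rep(c("Super", "Best"), each = 12)),
  Temp  = factor(rep(rep(c("Cold", "Warm", "Hot"), each = 4), times = 2)),
  D     = c(4, 5, 6, 5,  7, 9, 8, 12,  10, 12, 11, 9,
            6, 6, 4, 4,  13, 15, 12, 12,  12, 13, 10, 13)
)
n <- 4                    # scores per cell
j <- nlevels(d$Brand)     # levels of A
k <- nlevels(d$Temp)      # levels of B

grand  <- mean(d$D)
mean_A <- tapply(d$D, d$Brand, mean)                 # marginal means of A
mean_B <- tapply(d$D, d$Temp, mean)                  # marginal means of B
cell   <- tapply(d$D, list(d$Brand, d$Temp), mean)   # j x k table of cell means

ms_A <- n * k * sum((mean_A - grand)^2) / (j - 1)
ms_B <- n * j * sum((mean_B - grand)^2) / (k - 1)
# Interaction: cell mean minus both marginal means plus the grand mean
inter <- cell - outer(mean_A, mean_B, "+") + grand
ms_AB <- n * sum(inter^2) / ((j - 1) * (k - 1))
# Error: deviations of each score from its own cell mean
ms_error <- sum((d$D - ave(d$D, d$Brand, d$Temp))^2) / (j * k * (n - 1))

by_hand  <- c(ms_A, ms_B, ms_AB, ms_error)
from_aov <- summary(aov(D ~ Brand * Temp, data = d))[[1]][["Mean Sq"]]
print(round(rbind(by_hand, from_aov), 2))

# F ratios and p values for A, B, and AB
f <- by_hand[1:3] / ms_error
p <- pf(f, c(j - 1, k - 1, (j - 1) * (k - 1)), j * k * (n - 1), lower.tail = FALSE)
print(round(f, 3))
print(signif(p, 3))
```

The two rows of the printed table should agree, and the F ratios should reproduce the ANOVA table reported in Example 2.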

Note that in ANOVA, the partitioning of the sum of squares always corresponds to the partitioning of the degrees of freedom. Therefore, \(SST=SS_A+SS_B+SS_{AB}+SS_{error}\) and \(df_{total}=df_{A}+df_{B}+df_{AB}+df_{error}\), namely \(njk-1=(j-1)+(k-1)+(j-1)(k-1)+jk(n-1)\).
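
As a quick check of this identity, the sketch below plugs in the layout of Example 1 (10 subjects per cell, 2 ages, 5 conditions), taking age as factor A. The vector name df_parts is made up for this illustration.

```r
# Layout of Example 1: n subjects per cell, j levels of A (age), k levels of B (cond)
n <- 10; j <- 2; k <- 5

df_parts <- c(A = j - 1, B = k - 1, AB = (j - 1) * (k - 1), error = j * k * (n - 1))
print(df_parts)

# The parts add up to the total df, njk - 1
print(sum(df_parts) == n * j * k - 1)
```

These are the values to look for in the Df column of the ANOVA table for Example 1.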

Example 1

Suppose the recall task is now tested under 5 different levels of processing for two groups of participants: old and young. The 5 levels of processing, from shallow to deep, are counting the number of letters in the word, judging the rhyme of the word, judging whether the word is an adjective, judging whether the word can be imagined, and learning the word intentionally (the condition labelled intentation in the code). The data are as follows. This is a two-way factorial design (\(2\) ages \(\times\) \(5\) conditions) with 10 subjects in each condition. The research questions are whether age influences memory performance, whether the level of processing influences memory performance, and whether there is an interaction effect between age and level of processing.

dta<-data.frame(age=factor(rep(c(1,2),each=50),labels=c("young","old")),
                cond=factor(rep(1:5,each=10),labels=c("count","rhym","adj",
                                                      "imagery","intentation")),
                y=c(8,6,4,6,7,6,5,7,9,7,
                    10,7,8,10,4,7,10,6,7,7,
                    14,11,18,14,13,22,17,16,12,11,
                    20,16,16,15,18,16,20,22,14,19,
                    21,19,17,15,22,16,22,22,18,21,
                    9,8,6,8,10,4,6,5,7,7,
                    7,9,6,6,6,11,6,3,8,7,
                    11,13,8,6,14,11,13,13,10,11,
                    12,11,16,11,9,23,12,10,19,11,
                    10,19,14,5,10,11,14,15,11,11))
dim(dta)
## [1] 100   3

Describe Data

The subjects’ memory performance is measured as the number of words they correctly recalled; the more, the better. The figure below shows the mean performance in each of the 10 conditions. Normally, when we call geom_bar() to plot data along a nominal dimension, R automatically arranges the levels in alphabetical order. However, in the current example, the levels represent an ordering by depth of mental processing. Thus, I prefer to order the levels according to the experimental manipulation instead of the first letter of the level name. The left panel of the figure shows the bar plot with the labels arranged in alphabetical order. The right panel shows the bar plot with the labels arranged in the order of level of processing. Does it make more sense now? Visual inspection of the right panel suggests that there is an interaction effect between the two factors. An interaction effect means that the effect of one factor on the dependent variable depends on the level of the other factor. For the current data, it looks as if the young and old subjects’ memory performance differs only in the conditions demanding deep mental processing, not in those demanding shallow processing.

means<-with(dta,tapply(y,list(cond,age),mean))
nums<-with(dta,tapply(y,list(cond,age),length))
ses<-with(dta,tapply(y,list(cond,age),sd))/sqrt(nums)
library(reshape)
lop.dta<-melt(means)
names(lop.dta)<-c("cond","age","score")
lop.dta$ses<-melt(ses)[,3]
library(ggplot2)
library(gridExtra)
mem1.fg<-ggplot(data=lop.dta,aes(x=cond,y=score,fill=age))+
          geom_bar(stat="identity",position=position_dodge())+
          geom_errorbar(aes(ymin=score-ses,ymax=score+ses),width=0.2,
                        position=position_dodge(0.9))
lop.dta$treat<-as.factor(rep(1:5,2))
mem2.fg<-ggplot(data=lop.dta,aes(x=treat,y=score,fill=age))+
          geom_bar(stat="identity",position=position_dodge())+
          geom_errorbar(aes(ymin=score-ses,ymax=score+ses),width=0.2,
                        position=position_dodge(0.9))+
          scale_x_discrete(labels=c("count","rhym","adj",
                                    "imagery","intent"))
grid.arrange(mem1.fg,mem2.fg,ncol=2)
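
The right panel above gets its order from a numeric stand-in variable that is relabelled afterwards. An alternative, sketched here under the assumption that the labels are available as a character vector, is to fix the level order of the factor directly, since ggplot2 draws a discrete axis in factor-level order. The names cond, alphabetical, and by_depth are made up for this illustration.

```r
# Condition labels in the order of the experimental manipulation
cond <- c("count", "rhym", "adj", "imagery", "intent")

# Default: factor() sorts the levels alphabetically
alphabetical <- factor(cond)
print(levels(alphabetical))

# Passing levels= explicitly keeps the order of the manipulation
by_depth <- factor(cond, levels = cond)
print(levels(by_depth))
```

Mapping a factor like by_depth to the x aesthetic would make the scale_x_discrete() relabelling unnecessary.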

ANOVA

Now we conduct a two-way between-subjects ANOVA. The formula is set as y~cond*age, which is equivalent to y~cond+age+cond:age. In a formula, the colon operator denotes an interaction effect, so this formula specifies the full model for the two factors. As shown in the summary table below, the main effects of cond and age are significant, and so is the interaction effect between cond and age. This is consistent with our visual inspection. Remember to check the degrees of freedom of each mean square term. Checking the df is a way to verify that the data are properly set up for the experimental design.

mem.aov<-aov(y~cond*age,dta)
summary(mem.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cond         4 1514.9   378.7  47.191  < 2e-16 ***
## age          1  240.2   240.2  29.936 3.98e-07 ***
## cond:age     4  190.3    47.6   5.928 0.000279 ***
## Residuals   90  722.3     8.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
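
Two small checks may help in reading the call and the table above. The sketch below first expands the formula shorthand with terms(), then rebuilds the F column from the Sum Sq and Df columns as printed. Because the printed sums of squares are rounded, the recomputed values can differ slightly in the last digits; the object names are made up for this illustration.

```r
# The '*' shorthand expands to both main effects plus the ':' interaction
print(attr(terms(y ~ cond * age), "term.labels"))

# Sum Sq and Df as printed in the summary table above (rounded values)
ss <- c(cond = 1514.9, age = 240.2, "cond:age" = 190.3)
df <- c(cond = 4, age = 1, "cond:age" = 4)
ms_error <- 722.3 / 90

# Each F is the effect's mean square divided by the mean square of error
f <- (ss / df) / ms_error
p <- pf(f, df, 90, lower.tail = FALSE)
print(round(f, 2))
print(signif(p, 3))
```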

Simple Main Effects

In order to understand the interaction effect further, we can focus on a particular level of one factor and check out the effect of the other factor. We now check the age effect in each condition. As expected, there is no age effect in the two shallow conditions, whereas there is a significant age effect in the deeper conditions.

summary(aov(y~age,subset(dta,cond=="count")))
##             Df Sum Sq Mean Sq F value Pr(>F)
## age          1   1.25   1.250   0.464  0.504
## Residuals   18  48.50   2.694
summary(aov(y~age,subset(dta,cond=="rhym")))
##             Df Sum Sq Mean Sq F value Pr(>F)
## age          1   2.45   2.450   0.586  0.454
## Residuals   18  75.30   4.183
summary(aov(y~age,subset(dta,cond=="adj")))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## age          1   72.2    72.2   7.848 0.0118 *
## Residuals   18  165.6     9.2                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(y~age,subset(dta,cond=="imagery")))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## age          1   88.2   88.20   6.539 0.0198 *
## Residuals   18  242.8   13.49                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(y~age,subset(dta,cond=="intentation")))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## age          1  266.4  266.45   25.23 8.84e-05 ***
## Residuals   18  190.1   10.56                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
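
The five calls above follow one pattern, so they can also be written as a loop. This is a minimal sketch, not the only way to do it: it rebuilds the Example 1 data under the made-up name mem so that it runs on its own, splits the data by condition, and collects the p value of the age effect in each condition.

```r
# Example 1 data, rebuilt here so the block is self-contained
mem <- data.frame(
  age  = factor(rep(c(1, 2), each = 50), labels = c("young", "old")),
  cond = factor(rep(1:5, each = 10, times = 2),
                labels = c("count", "rhym", "adj", "imagery", "intentation")),
  y = c(8,6,4,6,7,6,5,7,9,7,
        10,7,8,10,4,7,10,6,7,7,
        14,11,18,14,13,22,17,16,12,11,
        20,16,16,15,18,16,20,22,14,19,
        21,19,17,15,22,16,22,22,18,21,
        9,8,6,8,10,4,6,5,7,7,
        7,9,6,6,6,11,6,3,8,7,
        11,13,8,6,14,11,13,13,10,11,
        12,11,16,11,9,23,12,10,19,11,
        10,19,14,5,10,11,14,15,11,11)
)

# One one-way ANOVA of age per condition; keep each ANOVA table
by_cond <- lapply(split(mem, mem$cond),
                  function(d) summary(aov(y ~ age, data = d))[[1]])

# First row of each table is the age effect; pull out its p value
p_age <- sapply(by_cond, function(tab) tab[["Pr(>F)"]][1])
print(signif(p_age, 3))
```

Like the separate calls above, each test here uses the error term of its own subset rather than the pooled error of the full model.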

Of course, you can also test the effect of processing level within each age group. Not surprisingly, memory performance is affected by the level of processing for both young and old subjects. These results might suggest that the aging effect on recall memory is a quantitative degeneration.

summary(aov(y~cond,subset(dta,age=="young")))
##             Df Sum Sq Mean Sq F value Pr(>F)    
## cond         4   1354   338.4   53.06 <2e-16 ***
## Residuals   45    287     6.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(y~cond,subset(dta,age=="old")))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cond         4  351.5   87.88   9.085 1.82e-05 ***
## Residuals   45  435.3    9.67                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example 2

Two brands of laundry powder are tested for their capacity to remove dirt in water at three different temperatures. The data are as follows. As shown in the code below, there are 4 samples in each of the 6 (=\(2\) brands \(\times\) \(3\) temperatures) conditions.

dta<-data.frame(Brand=rep(c("Super","Best"),each=3*4),
                Temp=rep(rep(c("Cold","Warm","Hot"),each=4),2),
                D=c(4,5,6,5, 7,9,8,12, 10,12,11,9,
                    6,6,4,4, 13,15,12,12, 12,13,10,13))
dim(dta)
## [1] 24  3

Describe Data

We use a bar plot to show the mean dirt-removal score in each condition. See the figure below. When the water is cold, the two brands look equally bad. When the water is hot, the two brands seem to differ little in their capacity to remove dirt. When the water is warm, the difference between the two brands is larger.

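# Cell means, cell sizes, and standard errors (sd/sqrt(n)) for Brand x Temp
power.means<-with(dta,tapply(D,list(Brand,Temp),mean))
power.nums<-with(dta,tapply(D,list(Brand,Temp),length))
power.ses<-with(dta,tapply(D,list(Brand,Temp),sd))/sqrt(power.nums)
power.dta<-cbind(melt(power.means),melt(power.ses)[,3])
names(power.dta)<-c("Brand","Temperature","Score","ses")
# The Temp columns come out alphabetically (Cold, Hot, Warm);
# recode them so that the axis runs Cold, Warm, Hot
power.dta$temp<-as.factor(rep(c(1,3,2),each=2))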
power.fg<-ggplot(data=power.dta,aes(x=temp,y=Score,fill=Brand))+
            geom_bar(stat="identity",position=position_dodge())+
            geom_errorbar(aes(ymin=Score-ses,ymax=Score+ses),
                          width=0.2,position=position_dodge(0.9))+
            scale_x_discrete(labels=c("Cold","Warm","Hot"))
power.fg

ANOVA

A two-way between-subjects ANOVA is conducted to examine the main effects of brand and water temperature as well as their interaction effect. All these effects are significant.

summary(aov(D~Brand*Temp,dta))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Brand        1  20.17   20.17   9.811  0.00576 ** 
## Temp         2 200.33  100.17  48.730 5.44e-08 ***
## Brand:Temp   2  16.33    8.17   3.973  0.03722 *  
## Residuals   18  37.00    2.06                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In order to understand the interaction effect in more detail, the simple main effects of the two factors are tested. First, we test the difference between the two brands of laundry powder at each of the three temperatures. The results are consistent with our inspection of the bar plot: the simple main effect of brand is significant only when the water is warm. Although it is less meaningful than the simple main effect of brand, the simple main effect of water temperature is tested as well. As expected, the capacity to remove dirt is influenced by water temperature for both brands.

# Simple main effect of brand
summary(aov(D~Brand,subset(dta,Temp=="Cold")))
##             Df Sum Sq Mean Sq F value Pr(>F)
## Brand        1      0       0       0      1
## Residuals    6      6       1
summary(aov(D~Brand,subset(dta,Temp=="Warm")))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Brand        1     32   32.00     9.6 0.0212 *
## Residuals    6     20    3.33                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(D~Brand,subset(dta,Temp=="Hot")))
##             Df Sum Sq Mean Sq F value Pr(>F)
## Brand        1    4.5   4.500   2.455  0.168
## Residuals    6   11.0   1.833
# Simple main effect of temperature
summary(aov(D~Temp,subset(dta,Brand=="Best")))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Temp         2    152   76.00   42.75 2.54e-05 ***
## Residuals    9     16    1.78                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(D~Temp,subset(dta,Brand=="Super")))
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Temp         2  64.67   32.33   13.86 0.00179 **
## Residuals    9  21.00    2.33                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise:

A researcher examined whether appearance affects the judgment of a student’s essay. The two factors are essay quality (EQ) and appearance (Att). There are two levels of EQ: 1 (good) and 2 (poor). There are three levels of Att: 1 (Attractive), 2 (Control), and 3 (Unattractive). Sixty participants were randomly assigned to the six conditions, with 10 in each condition. These data are contained in halo.txt. Please analyze these data to find out whether essay quality and appearance influence the judgment. Also, is there an interaction effect? If so, please use appropriate tests to explain it.