Suppose a \(t\) test was conducted for two matched groups (e.g., pre-test and post-test), and suppose the means of these two groups were significantly different (i.e., \(p<.05\)). The term “significant” here is not a measure of the magnitude of the treatment effect. In fact, when the sample size becomes large, even a small group difference can be statistically significant.
Take as an example the \(t\) test for two matched groups. Suppose the population mean increases by 0.05 units after an experimental treatment, so that the true difference between the pre-test mean and the post-test mean is 0.05. Suppose further that the population SD is 0.2. When each sample size is 16, the standard error is \(\frac{0.2}{\sqrt{16}}=0.05\), and the two-tailed \(t\) test for this mean difference is not significant, \(p=.33\). However, when each sample size is 100, the standard error is \(\frac{0.2}{\sqrt{100}}=0.02\), and the test result becomes significant with \(p<.05\). In the figure below, the arrows show the \(t\) scores in these two studies and the vertical bars show the corresponding critical values. Clearly, the \(t\) score in study 1 is smaller than its critical value, whereas the \(t\) score in study 2 is larger than its critical value, even though the true difference between the two group means (0.05) is the same in both studies.
# Two-tailed p value in study 1: t = 0.05/0.05 = 1, df = 15
2*(1-pt(0.05/0.05,df=15))
## [1] 0.3331701
# Two-tailed p value in study 2: t = 0.05/0.02 = 2.5, df = 99
2*(1-pt(0.05/0.02,df=99))
## [1] 0.0140626
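The same point can be made with a power calculation: hold the true difference (0.05) and the SD (0.2) fixed and vary only the sample size. Below is a minimal sketch using `power.t.test()` from base R, treating the matched-groups test as a one-sample test on the difference scores.
# Power to detect a true difference of 0.05 (SD = 0.2) at the .05 level
power.t.test(n=16,delta=0.05,sd=0.2,sig.level=0.05,type="one.sample")   # power is low
power.t.test(n=100,delta=0.05,sd=0.2,sig.level=0.05,type="one.sample")  # power is much higher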
# t densities for study 1 (df = 15) and study 2 (df = 99)
x<-c(seq(-3,3,0.01),seq(-3,3,0.01))
y<-c(dt(seq(-3,3,0.01),df=15),dt(seq(-3,3,0.01),df=99))
g<-rep(c("study1","study2"),each=length(seq(-3,3,0.01)))
dd<-data.frame(x=x,y=y,g=g)
library(ggplot2)
ggplot(dd,aes(x=x,y=y,color=g))+
  geom_line()+
  # Arrows mark the observed t scores: study 1 (t = 1) and study 2 (t = 2.5)
  annotate("segment",x=1,y=0.05,xend=1,yend=0,color="#F8766D",
           arrow=arrow(length=unit(0.2,"cm")))+
  annotate("segment",x=2.5,y=0.05,xend=2.5,yend=0,color="#00BFC4",
           arrow=arrow(length=unit(0.2,"cm")))+
  # Bars mark the two-tailed .05 critical values: 2.13 (df = 15) and 1.98 (df = 99)
  annotate("segment",x=2.13,y=0.05,xend=2.13,yend=0,color="#F8766D")+
  annotate("segment",x=1.98,y=0.05,xend=1.98,yend=0,color="#00BFC4")
The effect size is a way of quantifying the magnitude of the effect observed in an experiment, as opposed to its mere statistical significance. There are a number of different effect size measures, which can be divided into measures based on differences between groups (the d-family) and measures based on correlations between variables (the \(r\)-family).
In the case of inferring the difference between population means, the most common effect size measure is Cohen’s d, computed as \(d=\frac{\mu_1-\mu_2}{\sigma}\). For two matched groups, the denominator is the standard deviation of either population. Similarly, the effect size in terms of sample statistics is \(d=\frac{\bar{x}_1-\bar{x}_2}{s}\), where the denominator is the standard deviation of either sample. Suppose \(\bar{x}_1=90.49\), \(\bar{x}_2=83.23\), and \(s_{x_1}=5.02\). Then \(d=\frac{90.49-83.23}{5.02}=1.45\). The effect size can thus be viewed as the normalized difference between means. In the figure below, the difference between the two means is 0.5 and the effect size is \(0.5/1=0.5\), where 1 is the standard deviation of the population of both the control group and the experimental group.
ggplot(data.frame(x=c(seq(-3,3,0.01),seq(-3+0.5,3+0.5,0.01)),
                  y=c(dnorm(seq(-3,3,0.01),0,1),
                      dnorm(seq(-3+0.5,3+0.5,0.01),0.5,1)),
                  g=rep(c("Control","Exp"),each=length(seq(-3,3,0.01)))),
       aes(x=x,y=y,color=g))+
  geom_line()+
  # Bars mark the two population means (0 and 0.5)
  annotate("segment",x=0,y=0.4,xend=0,yend=0,color="#F8766D")+
  annotate("segment",x=0.5,y=0.4,xend=0.5,yend=0,color="#00BFC4")
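As a quick numerical check of the earlier example (\(\bar{x}_1=90.49\), \(\bar{x}_2=83.23\), \(s=5.02\)), Cohen’s d can be computed directly:
(90.49-83.23)/5.02   # Cohen's d
## [1] 1.446215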
Similarly, Cohen’s d for the difference between the means of two independent groups is computed as \(d=\frac{\bar{x}_1-\bar{x}_2}{s_p}\), where \(s_p\) is the pooled standard deviation. Take the homophobia data from Chapter 7 as an example. The effect size is \(d=\frac{\bar{x}_1-\bar{x}_2}{s_p}=\frac{24-16.5}{12.02}=0.62\). As described previously, the effect size d can be viewed as the normalized difference between two means; that is, d indicates how many standard deviations apart the two means are. The table below lists Cohen’s conventional benchmarks for the magnitude of d. According to these benchmarks, the effect size measured in the homophobia experiment (\(d=0.62\)) is a medium effect.

d | Magnitude
---|---
0.2 | Small
0.5 | Medium
0.8 | Large
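As a quick check of the homophobia effect size (group statistics as reported above):
(24-16.5)/12.02   # Cohen's d, a medium effect
## [1] 0.6239601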
A researcher examined the effect of cognitive behavior therapy on women with abusive partners. The amount of weight gained after the counseling program was used as an index of the treatment effect. The weight gains of the control group and the cognitive-therapy group are coded in a data frame below.
# Weight gains: the first 25 values form the control group, the last 28 the cognitive-therapy group
dta<-data.frame(gain=c(-0.5,-9.3,-5.4,12.3,-2,-10.2,-12.2,11.6,-7.1,6.2,
                       -0.2,-9.2,8.3,3.3,0,-1,-10.6,-4.6,-6.7,2.8,
                       0.3,1.8,3.7,15.9,-10.2,
                       1.7,0.7,-0.1,-0.7,-3.5,14.9,3.5,17.1,-7.6,1.6,11.7,
                       6.1,1.1,-4,20.9,2.1,-1.4,1.4,-0.3,-3.7,-0.8,
                       2.4,12.6,1.9,3.9,0.1,15.4,-0.7),
                group=c(rep("Control",25),rep("CogTherapy",28)))
Of course, the scientific hypothesis is that cognitive behavior therapy is effective. That is, \(H_1: \mu_{cognitive} \neq \mu_{control}\) and \(H_o:\mu_{cognitive} = \mu_{control}\). A bar plot of the group means is shown below.
# Group means and standard errors; tapply orders the groups alphabetically
# (CogTherapy first, then Control), matching the labels assigned below
means<-with(dta,tapply(gain,group,mean))
ses<-with(dta,tapply(gain,group,sd))/sqrt(with(dta,tapply(gain,group,length)))
ggplot(data.frame(means=means,ses=ses,group=c("Cognitive","Control")),
       aes(x=group,y=means,fill=group))+
  geom_bar(stat="identity")+
  geom_errorbar(aes(ymax=means+ses,ymin=means-ses),width=0.2)
Visual inspection of this figure suggests that the two means differ. This is supported by an independent-groups \(t\) test, \(t_{51}=2.14\), \(p<.05\). The effect size is \(d=0.59\), a medium effect. Note that the standard deviation used to compute the effect size here is the pooled standard deviation of the two samples, \(s_{pool}=\sqrt{\frac{df_1s^2_1+df_2s^2_2}{df_1+df_2}}\).
with(dta,t.test(gain~group,var.equal=T))
##
## Two Sample t-test
##
## data: gain by group
## t = 2.1398, df = 51, p-value = 0.03718
## alternative hypothesis: true difference in means between group CogTherapy and group Control is not equal to 0
## 95 percent confidence interval:
## 0.2693003 8.4492711
## sample estimates:
## mean in group CogTherapy mean in group Control
## 3.439286 -0.920000
# Pooled SD: square root of the df-weighted average of the two sample variances
sds<-with(dta,tapply(gain,group,sd))
dfs<-table(dta$group)-1
sp<-sqrt(sum(sds^2*dfs/sum(dfs)))
# Cohen's d: (CogTherapy mean - Control mean) / pooled SD
effectsize<-(means[1]-means[2])/sp
effectsize
## CogTherapy
## 0.5887842
Since the effect size is a measure of the magnitude of the true effect, it reflects the validity of an experiment; it can be thought of as the mean of the \(H_1\) distribution. The \(p\) value, by contrast, gives the probability of observing a difference at least as large as the one obtained (\(\Delta\bar{x}\)), given that \(H_o\) is correct. A \(p\) value smaller than \(.05\) leads us to reject \(H_o\), but tells us nothing about \(H_1\). What if the difference between the two means is not significant? Then we have no reason to reject \(H_o\); that is, we have no evidence for \(H_1\), and it is unnecessary to compute the effect size.
The relationship between heart attacks and aspirin has long been debated among physicians. In one study, over 22,000 physicians were administered either aspirin or a placebo over a number of years. The data are shown in the table below. This design is a prospective study: the treatments were applied first, and the future outcome was then determined.
Group | Heart Attack | No Heart Attack | Total
---|---|---|---
Aspirin | 104 | 10933 | 11037
Placebo | 189 | 10845 | 11034
Total | 293 | 21778 | 22071
Prospective studies are often called cohort studies (because two or more cohorts of participants are identified). On the other hand, a retrospective study, also called a case-control design, would select people who had, or had not, experienced a heart attack and then look backward in time to see whether they had been in the habit of taking aspirin. For these data, \(\chi^2=25.014\) on one degree of freedom, \(p<.01\), which is statistically significant, indicating a relationship between whether one takes aspirin daily and whether one later has a heart attack.
# 2 x 2 table: rows = (aspirin, placebo), columns = (heart attack, no heart attack)
HA.dta<-matrix(c(104,189,10933,10845),2,2)
chisq.test(HA.dta,correct=F)
##
## Pearson's Chi-squared test
##
## data: HA.dta
## X-squared = 25.014, df = 1, p-value = 5.692e-07
There are two statistics for the data in a contingency table: risk and odds. First look at the risk. In this table, the risk of suffering a heart attack for a man taking aspirin daily is \(\frac{104}{11037}=0.94\%\), and the risk for a man not taking aspirin is \(\frac{189}{11034}=1.71\%\). The effect of aspirin can be measured by comparing these two risks. One d-family measure is the risk difference, \(1.71\%-0.94\%=0.77\%\). However, the magnitude of the risk difference depends on the overall level of risk. A better way to compare the risks is to form a risk ratio, also called the relative risk. For the heart attack data the risk ratio is \(RR=1.71\%/0.94\%=1.82\). That is, the risk of suffering a heart attack when not taking aspirin is nearly twice the risk when taking aspirin. This difference is quite large.
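These risk measures can be computed directly from the `HA.dta` matrix defined above; a minimal sketch (row 1: aspirin, row 2: placebo; column 1: heart attack):
risk<-HA.dta[,1]/rowSums(HA.dta)   # risk of a heart attack in each group
risk[2]-risk[1]                    # risk difference, about 0.0077
risk[2]/risk[1]                    # risk ratio, about 1.82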
The third d-family measure of effect size to consider is the odds ratio. The odds of having a heart attack in the aspirin group are the ratio of the counts in the two outcome conditions (heart attack vs. no heart attack) within that group: \(104/10933=0.95\%\). The odds of having a heart attack in the placebo group are \(189/10845=1.74\%\). The odds ratio is then \(OR=\frac{1.74\%}{0.95\%}=1.83\). Thus, the odds of a heart attack without aspirin are 1.83 times the odds of a heart attack with aspirin.
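The odds and the odds ratio follow the same pattern as the risk sketch above:
odds<-HA.dta[,1]/HA.dta[,2]   # odds of a heart attack in each group
odds[2]/odds[1]               # odds ratio, about 1.83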
Since the risk ratio and the odds ratio are similar, why not just use the risk ratio? The odds ratio has at least two things in its favor. The risk is future oriented: it is the probability of some outcome (e.g., having a heart attack or not) after an action is taken or not (e.g., taking aspirin or not). For example, if we give 1000 people aspirin and withhold it from 1000 others, the risk can be computed as the proportion of heart attacks in each group. In a retrospective study, however, we only know whether a person has had a heart attack or not. If we collect 1000 people with (and without) heart attacks and look backward, we cannot really calculate the risk, because we have sampled heart attack patients at far greater than their normal rate in the population. This problem is circumvented by using the odds ratio. A second important advantage of the odds ratio is that taking the natural log of the odds ratio gives us a statistic that is extremely useful in a variety of situations.
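For instance, continuing the sketch above, the log odds ratio for the aspirin data is a single line:
log(odds[2]/odds[1])   # log odds ratio, about 0.61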
A researcher was interested in how adult abuse is influenced by earlier childhood abuse. Such a study must be retrospective. Each interviewee either was or was not committing abuse as an adult, and each was classified into one of four levels according to the amount of abuse received in childhood. The data are shown in the table below.
Childhood Abuse Level | No Adult Abuse | Adult Abuse | Odds
---|---|---|---
0 | 512 | 54 | .105
1 | 227 | 37 | .163
2 | 59 | 15 | .254
3-4 | 18 | 12 | .667
Total | 816 | 118 | .145
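The odds column is simply the count of adult abusers over the count of non-abusers in each row; a quick check with the counts taken from the table:
# Column 1: no adult abuse; column 2: adult abuse (rows: levels 0, 1, 2, 3-4)
abuse<-matrix(c(512,227,59,18,54,37,15,12),4,2)
round(abuse[,2]/abuse[,1],3)
## [1] 0.105 0.163 0.254 0.667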
The odds ratio for \(k\) rows is computed as the ratio of the \(k^{th}\) odds to the reference odds. In this case, level 0 is a natural reference group, so we can compute an odds ratio for each of the four levels in this data set. Clearly, the odds ratio increases as the level of abuse received in childhood increases.
# Odds ratios relative to level 0 (level 3-4 plotted at x = 3)
OR.dta<-data.frame(x=c(0,1,2,3),y=c(0.105,0.163,0.254,0.667)/rep(0.105,4))
ggplot(OR.dta,aes(x=x,y=y))+
  geom_line()+geom_point()+
  scale_x_continuous(name="Childhood Abuse Level")+
  scale_y_continuous(name="Odds Ratio Relative to Level 0")
The d-family measures focus on comparing differences between conditions, as with Cohen’s d, the risk ratio, and the odds ratio. The r-family measures are measures of association, which instead focus on the correlation between two variables. In fact, the correlation coefficient \(r\) itself is an r-family measure of effect size. Beyond \(r\), there are other r-family measures, two of which can be applied to the data in a \(2\times2\) contingency table. One is called phi (\(\phi\)). The \(\phi\) correlation can be computed in several different ways; an easy one is \(\phi=\sqrt{\frac{\chi^2}{N}}\). For the earlier case concerning the relationship between aspirin and heart attacks, \(\phi\) is 0.03, a quite weak association.
# Phi coefficient: sqrt(chi-squared / N)
result<-chisq.test(HA.dta,correct=F)
phi<-sqrt(result$statistic/sum(HA.dta))
phi
## X-squared
## 0.03366507
Another r-family measure suitable for the \(2\times2\) table is Cramer’s \(V\), computed as \(V=\sqrt{\frac{\chi^2}{N(k-1)}}\), where \(N\) is the sample size and \(k\) is the smaller of the number of rows and columns. Here \(k=2\), so Cramer’s \(V\) is identical to \(\phi\).
# Cramer's V: with k = 2, this reduces to phi
v<-sqrt(result$statistic/(sum(HA.dta)*(2-1)))
v
## X-squared
## 0.03366507