Another Example

Aronson, Lustina, Good, Keough, Steele, & Brown (1998) used two independent groups of college students who were known to excel in mathematics. They assigned 11 students to a control group that was simply asked to complete a difficult mathematics exam. They assigned 12 students to a threat condition, in which they were told that Asian students typically did better than other students on math tests and that the purpose of the exam was to help the experimenter understand why this difference exists. Aronson et al. reasoned that simply telling white students that Asians did better on math tests would arouse feelings of stereotype threat and diminish the students’ performance. Table 7.7 shows the data in their study. According to this description, the research hypothesis is that the mean scores of the two groups differ, and the null hypothesis is that they do not. Let us analyze these data.

# Create data frame for analysis
threat.dta<-data.frame(score=c(4,9,13,9,13,7,12,12,6,8,13,
                               7,6,5,8,9,0,7,7,10,2,10,8),
                       group=c(rep("C",11),rep("T",12)))
# Create data frame for plot
threatfg.dta<-data.frame(score=with(threat.dta,tapply(score,group,mean)),
                         ses=with(threat.dta,tapply(score,group,sd))/
                           sqrt(c(11,12)),
                         group=c("C","T"))

We can use a bar plot to show the group means. Apparently, the control group scored higher than the threat group.

library(ggplot2)
ggplot(data=threatfg.dta,aes(x=group,y=score,fill=group))+
  geom_bar(stat="identity")+
  geom_errorbar(aes(ymin=score-ses,ymax=score+ses),width=0.2)

Now we conduct a \(t\) test for these data under the assumption that the population variances of the two groups are equal.

with(threat.dta,t.test(score~group,var.equal=T))
## 
##  Two Sample t-test
## 
## data:  score by group
## t = 2.3614, df = 21, p-value = 0.02795
## alternative hypothesis: true difference in means between group C and group T is not equal to 0
## 95 percent confidence interval:
##  0.3643033 5.7417573
## sample estimates:
## mean in group C mean in group T 
##        9.636364        6.583333
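
To see where this result comes from, here is a minimal sketch that reproduces the pooled-variance \(t\) statistic by hand from the group means, variances, and sample sizes.

m<-with(threat.dta,tapply(score,group,mean))
v<-with(threat.dta,tapply(score,group,var))
n<-c(11,12)
# Pooled variance: weighted average of the two group variances
sp2<-sum((n-1)*v)/(sum(n)-2)
# t statistic for the difference between the group means
tval<-(m["C"]-m["T"])/sqrt(sp2*sum(1/n))
tval
# Two-tailed p value on n1 + n2 - 2 = 21 df
2*(1-pt(tval,sum(n)-2))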

Scatterplot

When we collect measures on two variables for the purpose of examining the relationship between them, a scatterplot is one of the most useful techniques.

Take the data in Table 9.1 as an example. These data were collected across 36 cities in order to investigate the relationship between “pace of life” and “heart disease”. Let’s create the data first.

# Pace-of-life index and heart-disease rate for 36 cities (Table 9.1)
dta<-data.frame(pace=c(27.67,25.33,23.67,26.33,26.33,25,26.67,26.33,
                       24.33,25.67,22.67,25,26,24,26.33,20,24.67,
                       24,24,20.67,22.33,22,19.33,19.67,23.33,22.33,
                       20.33,23.33,20.33,22.67,20.33,22,20,18,16.67,15),
                hd=c(24,29,31,26,26,20,17,19,26,24,
                     26,25,14,11,19,24,20,13,20,18,
                     16,19,23,11,27,18,15,20,18,21,
                     11,14,19,15,18,16))

In order to investigate the relationship between these two variables, we can map the data points into a geometric space with each variable as one dimension. The code below makes a scatterplot and adds a regression line to it. Apparently, the more intense the pace of life, the higher the rate of ischemic heart disease, suggesting a positive relationship between the two variables.

library(ggplot2)
ggplot(data=dta,aes(x=pace,y=hd))+
  geom_point(color="tomato")+
  geom_smooth(method="lm",se=F,color="deepskyblue3")
## `geom_smooth()` using formula 'y ~ x'

Wagner, Compas, and Howell (1988) investigated the relationship between stress and mental health in first-year college students. Table 9.2 shows the data, which we enter with the following code.

# Mental-health symptom scores (Table 9.2)
symp<-c(40.6,41.1,41.1,41.3,41.3,41.4,41.6,41.7,41.7,41.9,41.9,42.2,42.5,42.5,42.5,
        42.6,42.8,42.9,42.9,43,43,43,43,43.2,43.4,43.4,43.6,43.6,43.6,43.7,43.7,
        43.8,43.8,43.8,43.9,43.9,43.9,44.1,44.1,44.1,44.2,44.2,44.2,44.3,44.3,44.4,
        44.5,44.5,44.5,44.5,44.7,44.7,44.8,44.8,44.8,44.8,44.9,44.9,45,45.1,45.1,45.1,
        45.2,45.2,45.3,45.3,45.4,45.4,45.5,45.5,45.6,45.6,45.7,46,46,46,46,46.1,46.1,46.1,
        46.2,46.2,46.2,46.2,46.2,46.2,46.4,46.5,46.6,46.7,46.7,46.9,46.9,47.1,47.1,47.2,
        47.6,47.7,48,48,48.3,48.4,48.8,49.1,49.1,49.8,49.9)
# Stress scores for the same students
stress<-c(0.1,0.1,0.2,0.3,0.3,0.3,0.4,0.5,0.5,0.6,0.7,0.7,0.8,0.8,0.8,0.9,0.9,0.9,0.9,0.9,
          1,1.1,1.1,1.2,1.2,1.2,1.2,1.3,1.3,1.3,1.3,1.3,1.4,1.4,1.4,1.5,1.5,1.5,1.5,1.5,1.5,1.5,
          1.6,1.6,1.6,1.6,1.7,1.7,1.7,1.8,1.8,1.8,1.9,2,2,2,2,2,2.1,2.1,2.2,2.2,2.2,2.2,2.2,
          2.3,2.3,2.3,2.3,2.4,2.4,2.4,2.5,2.6,2.7,2.7,2.7,2.8,2.9,2.9,3,3,3.1,3.3,3.3,3.3,
          3.4,3.4,3.4,3.4,3.6,3.6,3.7,3.7,3.8,3.8,3.8,3.9,4.3,4.3,4.4,4.5,4.5,4.5,4.5,5.5,5.8)

The Q-Q plots below show that both variables are approximately normally distributed.

fg1<-ggplot(data=data.frame(symp=symp),aes(sample=symp))+
       geom_qq(color="tomato")+geom_qq_line()
fg2<-ggplot(data=data.frame(stress=stress),aes(sample=stress))+
       geom_qq(color="tomato")+geom_qq_line()
library(gridExtra)
grid.arrange(fg1,fg2,ncol=2)

Now we can make the scatterplot for these data points, with stress on the x axis and symptoms on the y axis. The relationship between symptoms and stress is clearly positive. That is, the more stress a first-year college student experiences, the more mental symptoms he or she reports.

ggplot(data=data.frame(symp=symp,stress=stress),aes(stress,symp))+
  geom_point(color="tomato",shape=1)+
  geom_smooth(method="lm",se=F,color="deepskyblue3")
## `geom_smooth()` using formula 'y ~ x'

The Covariance

From the above figures, you can see a noticeable positive relationship between the two variables. However, we are not satisfied with a merely qualitative description of the relationship, such as positive or negative; we need to quantify its extent. The covariance, \(cov_{XY}=\frac{\sum(X-\overline{X})(Y-\overline{Y})}{N-1}\), is one such index. The covariance can be viewed as proportional to the sum of signed areas of rectangles, each defined by a data point and the mean point (\(\overline{X},\overline{Y}\)).
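
As a quick numeric check, we can compute the covariance of the stress and symptom data directly from this formula and compare it with R's built-in cov().

# Covariance computed directly from the formula
sum((stress-mean(stress))*(symp-mean(symp)))/(length(stress)-1)
# The built-in function should give the same value
cov(stress,symp)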

In the scatterplot below, the red point is the mean point and the blue rectangles are two example areas, each defined by the mean point and one data point. The sum of the areas is positive; that is, when X increases, Y increases, and vice versa.

Likewise, in the figure below, the sum of the areas is negative; thus, when X increases, Y decreases, and vice versa. When the sum of the positive areas is about equal to the sum of the negative areas, the covariance is close to 0; that is, there is no relationship between X and Y.
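
The original figures are not reproduced here, but a sketch of the positive case can be drawn with ggplot2 from the pace-of-life data; the two highlighted points are arbitrary choices made only for illustration.

mx<-mean(dta$pace) # x coordinate of the mean point
my<-mean(dta$hd)   # y coordinate of the mean point
p1<-dta[1,]  # an arbitrary data point
p2<-dta[36,] # another arbitrary data point
ggplot(data=dta,aes(x=pace,y=hd))+
  geom_point()+
  annotate("rect",xmin=min(mx,p1$pace),xmax=max(mx,p1$pace),
           ymin=min(my,p1$hd),ymax=max(my,p1$hd),
           fill=NA,color="deepskyblue3")+
  annotate("rect",xmin=min(mx,p2$pace),xmax=max(mx,p2$pace),
           ymin=min(my,p2$hd),ymax=max(my,p2$hd),
           fill=NA,color="deepskyblue3")+
  annotate("point",x=mx,y=my,color="red",size=3)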

The Pearson Product-Moment Correlation Coefficient (r)

The covariance is also a kind of measure of the average “distance” from the data points to the mean point and is therefore influenced by the scales of X and Y. Thus, a value of \(cov_{XY}=1.336\) might reflect a high degree of correlation when the standard deviations of the variables are small, but a low degree of correlation when the standard deviations are large. To resolve this difficulty, we normalize the covariance by dividing it by the product of the two standard deviations and make this our estimate of correlation. That is Pearson’s \(r\), computed as \(r=\frac{cov_{XY}}{S_{X}S_{Y}}\).
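
We can verify this relation with the stress data: dividing the covariance by the product of the two standard deviations should reproduce what cor() returns.

# r as the normalized covariance
cov(stress,symp)/(sd(stress)*sd(symp))
# The built-in correlation function should agree
cor(stress,symp)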

Proof: \(-1\leq r\leq 1\)

When \(Y=X\), the covariance attains its maximum value, \(cov_{XX}=\frac{\sum (X-\overline{X})^2}{N-1}=S^2_{X}\). Thus, \(r=\frac{cov_{XX}}{S_{X}S_{X}}=\frac{S^2_{X}}{S^2_{X}}=1\).

Likewise, when \(Y=-X\), the covariance attains its minimum value, \(cov_{XY}=\frac{-\sum(X-\overline{X})(X-\overline{X})}{N-1}=-S^2_{X}\), and \(S_{Y}=S_{X}\). Thus, \(r=\frac{cov_{XY}}{S_{X}S_{Y}}=\frac{-S^2_{X}}{S^2_{X}}=-1\). More generally, the Cauchy-Schwarz inequality guarantees that \(|cov_{XY}|\leq S_{X}S_{Y}\), so \(-1\leq r\leq 1\) always holds.
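
A quick numeric check of the two boundary cases, using an arbitrary random vector:

x<-rnorm(20) # any vector will do
cor(x,x)     # Y = X gives r = 1
cor(x,-x)    # Y = -X gives r = -1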

Hypothesis Testing for r

The correlation coefficient of a population is denoted \(\rho\). When \(\rho=0\), the sampling distribution of \(r\) is approximately normal around zero, and a legitimate \(t\) test can be formed from the ratio below, which is distributed as \(t\) on \(N-2\) \(df\).

\(t=\frac{r\sqrt{N-2}}{\sqrt{1-r^2}}\).

For example, given \(r=.532\) and \(N=28\), \(t=\frac{.532\sqrt{26}}{\sqrt{1-.532^2}}=3.202\). This value of \(t\) is significant at \(\alpha=.05\) with \(df=28-2=26\).
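
The same computation can be done in R, with the critical value included for comparison (a sketch using the numbers from the example above):

r<-.532
N<-28
tval<-r*sqrt(N-2)/sqrt(1-r^2)
tval # about 3.20
# Two-tailed critical value at alpha = .05 on 26 df, about 2.06
qt(.975,N-2)
# Two-tailed p value
2*(1-pt(tval,N-2))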

Numeric Example

The file bodyWeights.txt contains subjects’ bodyweights at age 8 and at age 20. The hypothesis is that people’s bodyweights at age 20 are correlated with their bodyweights at age 8. Thus,

\(H_{o}:\rho=0\), \(H_{1}: \rho\neq 0\).

First, read in the data.

dta<-read.table("bodyWeights.txt",header=T,sep="")
dta
##    Age8 Age20
## 1    31    55
## 2    25    44
## 3    29    55
## 4    20    51
## 5    40    85
## 6    32    57
## 7    27    59
## 8    33    60
## 9    28    66
## 10   21    49
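
In case bodyWeights.txt is not at hand, the same data frame can be constructed directly from the values listed above:

dta<-data.frame(Age8=c(31,25,29,20,40,32,27,33,28,21),
                Age20=c(55,44,55,51,85,57,59,60,66,49))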

Second, make a scatterplot to show the relationship between these two bodyweights.

ggplot(data=dta,aes(Age8,Age20))+
  geom_point(color="tomato")+
  geom_smooth(method="lm",color="deepskyblue3",se=F)
## `geom_smooth()` using formula 'y ~ x'

Q: What relationship do you expect between these two bodyweights?

Third, compute \(r\) and check whether it is significantly different from 0.

r<-with(dta,cor(Age8,Age20))
r
## [1] 0.7964106

Compute the \(t\) value and its two-tailed \(p\) value by

tr<-r*sqrt(10-2)/sqrt(1-r^2) # t statistic with N = 10 observations
tr
## [1] 3.724789
(1-pt(tr,8))*2 # two-tailed p value on N - 2 = 8 df
## [1] 0.005831181

There is a more direct way to perform the \(t\) test for \(r\): cor.test().

with(dta,cor.test(Age8,Age20))
## 
##  Pearson's product-moment correlation
## 
## data:  Age8 and Age20
## t = 3.7248, df = 8, p-value = 0.005831
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3345328 0.9497788
## sample estimates:
##       cor 
## 0.7964106

Therefore, the test rejects \(H_{o}\): bodyweight at age 20 is related to bodyweight at age 8. That is, if you were overweight in childhood, you might still be overweight when you grow up.

Scatterplots Between Multiple Pairs of Variables

Suppose we have also collected the bodyweights of the same 10 participants at age 17. We can make scatterplots for each pair of the three bodyweights. The data are contained in BodyWeights2.dat.

dta2<-read.table("BodyWeights2.dat",header=T,sep="")
dta2
##    Age20 Age8 Age17
## 1     58   31    69
## 2     52   25    50
## 3     61   29    69
## 4     57   20    39
## 5     93   40    90
## 6     63   32    82
## 7     68   27    65
## 8     71   33    72
## 9     73   28    66
## 10    48   21    43
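
Again, if BodyWeights2.dat is not available, the data frame can be typed in from the values listed above:

dta2<-data.frame(Age20=c(58,52,61,57,93,63,68,71,73,48),
                 Age8=c(31,25,29,20,40,32,27,33,28,21),
                 Age17=c(69,50,69,39,90,82,65,72,66,43))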

The base-R function pairs() provides a quick way to make all pairwise scatterplots for these three variables.

pairs(dta2)
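
A ggplot2-flavored alternative is ggpairs() from the GGally package (assuming it is installed), which by default also displays the pairwise correlations and the marginal densities:

library(GGally) # assumed to be installed
ggpairs(dta2)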