It is common to observe correlations between two variables in the natural environment. For instance, as the number of prey decreases, the number of predators decreases as well; or the taller (shorter) the parents, the taller (shorter) the children. There are many coefficients for describing the correlation between two variables. The most well-known is Pearson's \(r = cov(x,y)/(\sigma_x\sigma_y)\), where \(cov(x,y)=E((x-\mu_x)(y-\mu_y))\) is the covariance between \(x\) and \(y\), and \(\sigma_x\) and \(\sigma_y\) are the standard deviations of \(x\) and \(y\). Note that \(-1\le r \le 1\).
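
As a quick illustration of this formula, Pearson's \(r\) can be computed by hand with cov() and sd() and compared with the built-in cor(). This is a minimal sketch with two made-up vectors, so the numbers are for demonstration only.

x<-c(2,4,6,8,10)
y<-c(1,3,4,7,9)
r.manual<-cov(x,y)/(sd(x)*sd(y))   # covariance divided by the product of standard deviations
r.builtin<-cor(x,y)                # built-in Pearson correlation
c(r.manual,r.builtin)              # the two values are identical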

Example 1

A researcher would like to examine the relationship between weekly studying hours and SAT scores. She collected the SAT scores and weekly studying hours of 20 participants. The data are contained in sat.csv. For the convenience of conducting statistical tests, it is suggested to save this file to your working directory. Import this data set with read.table() and show the first 6 rows with head().

url<-"http://cml.nccu.edu.tw/lxyang/ExpDesign/sat.csv"
SATHr.dta<-read.table(url,header=T,sep=",")
head(SATHr.dta)
##   Hours SAT
## 1     4 390
## 2     9 580
## 3    10 650
## 4    14 730
## 5     4 410
## 6     7 530

Describing Data

Since this researcher was interested in any correlation between studying hours per week and SAT scores, it is suggested to plot the data points as a scatter plot in a space whose dimensions are studying hours and SAT score. In the plotting codes, the second layer draws the points with geom_point(), using shape=1 for hollow points. Apparently, the longer the studying hours, the higher the SAT score.

library(ggplot2)
fg<-ggplot(data=SATHr.dta,aes(x=Hours,y=SAT))+
      geom_point(shape=1,color="blue")
fg

Correlation Analysis

We can test this inspection with cor.test(). When calling cor.test(a,b), where a and b are vectors, R will return r as well as the result of a t-test (df = \(N-2\)) with Ho: \(r=0\) and H1: \(r\ne0\). The codes below show two ways to compute Pearson's \(r\); both give the same result. Consistent with our inspection, the correlation between studying hours and SAT scores is strongly positive (\(r=.93\)).

cor.test(SATHr.dta$Hours,SATHr.dta$SAT)
## 
##  Pearson's product-moment correlation
## 
## data:  SATHr.dta$Hours and SATHr.dta$SAT
## t = 11.055, df = 18, p-value = 1.868e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8368003 0.9738076
## sample estimates:
##       cor 
## 0.9336055
with(SATHr.dta,cor.test(Hours,SAT))
## 
##  Pearson's product-moment correlation
## 
## data:  Hours and SAT
## t = 11.055, df = 18, p-value = 1.868e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8368003 0.9738076
## sample estimates:
##       cor 
## 0.9336055

The correlation coefficient is just a measure of the relation between two variables. However, it cannot predict the value of one variable given the value of the other, which limits its utility. To fulfill this expectation, a model which can quantitatively represent the relation between variables is needed. In statistics, the simple linear regression model is the one most often chosen. This model is actually a line lying relatively close to each data point in the data space. Take as an example the case of two variables \(x\) and \(y\). With \(x\) as the predictor and \(y\) as the dependent variable, the line \(y=\beta_0+\beta_1 x\) is the regression model. The relation between \(x\) and \(y\) is represented by \(\beta_1\): when \(\beta_1\) is positive, \(y\) increases as \(x\) increases; when \(\beta_1\) is negative, \(y\) decreases as \(x\) increases. \(\beta_1\) is the slope of this line, also called the regression coefficient. The parameter \(\beta_0\) is the intercept of this line.

The regression line varies with different pairs of \(\beta_0\) and \(\beta_1\). Among all these candidate lines, the one closest to all data points is the best-fit model. Thus, the regression model is determined by the data. The intercept and the slope of a regression model (i.e., the parameters of the regression model) can be solved directly by the method of least squares or estimated by the method of maximum likelihood to optimize the goodness of fit of the model. Since for each \(x_i\) the model \(\hat{y}=\beta_0+\beta_1 x\) predicts \(\hat{y_i}\), the difference between \(y_i\) and \(\hat{y_i}\) is called the residual \(\epsilon_i\). Thus, \(y=\hat{y}+\epsilon\).
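
For simple linear regression, the least-squares solution has a closed form: \(\beta_1=cov(x,y)/var(x)\) and \(\beta_0=\bar{y}-\beta_1\bar{x}\). As a minimal sketch, these estimates can be computed directly for the sat.csv data imported above; they should match the Estimate column of the summary table shown later.

b1<-with(SATHr.dta,cov(Hours,SAT)/var(Hours))   # least-squares slope
b0<-with(SATHr.dta,mean(SAT)-b1*mean(Hours))    # least-squares intercept
c(b0,b1)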

The assumptions of linear regression analysis include:

  1. The dependent variable is a continuous variable.
  2. The predictors are independent of each other.
  3. The residuals are normally distributed with 0 as the mean and \(\sigma_\epsilon\) as the standard deviation: \(\epsilon\sim ND(0,\sigma_\epsilon)\). (A graphical check is sketched after this list.)
  4. The dependent variable is normally distributed as well, since \(y=\hat{y}+\epsilon\), where \(\hat{y}\) is a fixed value computed by the regression model and \(\epsilon\) is normally distributed.
  5. The predictors should be independent of the residuals.
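
Assumptions 3 and 5 can be checked graphically once a model has been fitted. Below is a minimal sketch using the sat.csv data imported above; lm() itself is explained in the next section.

m<-lm(SAT~Hours,SATHr.dta)            # fit the regression model (see next section)
qqnorm(resid(m)); qqline(resid(m))    # assumption 3: points near the line if residuals are normal
plot(SATHr.dta$Hours,resid(m))        # assumption 5: no systematic pattern against the predictor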

Regression Analysis

In R, lm() is used to conduct the regression analysis. When calling this function, the first argument is the formula or the linear model. For instance, SAT~Hours means that SAT and Hours are respectively the dependent variable and predictor. It is equivalent to saying that we regress SAT by Hours. Here you can view the operator ~ as = in the model \(\hat{y}=\beta_0+\beta_1 x\).

The results returned by lm() are saved in a variable, which can be presented as a summary table with summary(). In this summary table, the user-defined formula is shown at the very top, followed by summary statistics of the residuals \(\epsilon=y-\hat{y}\). Subsequently, the estimated coefficients are reported. The Estimate column lists the estimated intercept and slope. Each of them is examined by a t-test with Ho: \(\beta=0\) and H1: \(\beta\ne0\). As shown in the rightmost column, both estimated coefficients are significantly different from 0. Of most interest is the relation between studying hours and SAT scores. The positive slope supports the inspection of the scatter plot that the longer the studying hours, the higher the SAT score.

m1<-lm(SAT~Hours,SATHr.dta)
summary(m1)
## 
## Call:
## lm(formula = SAT ~ Hours, data = SATHr.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -120.347  -29.308    9.928   33.734   83.570 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  353.165     24.337   14.51 2.24e-11 ***
## Hours         25.326      2.291   11.05 1.87e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.72 on 18 degrees of freedom
## Multiple R-squared:  0.8716, Adjusted R-squared:  0.8645 
## F-statistic: 122.2 on 1 and 18 DF,  p-value: 1.868e-09
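
Once fitted, the model can serve the purpose mentioned earlier, namely predicting one variable from the other. A minimal sketch with predict(), using a few arbitrary numbers of weekly studying hours:

new.hours<-data.frame(Hours=c(5,10,15))               # arbitrary example values
predict(m1,newdata=new.hours)                         # point predictions on the fitted line
predict(m1,newdata=new.hours,interval="confidence")   # predictions with 95% confidence intervals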

It is very easy to add a layer to the previous scatter plot that draws the regression line with stat_smooth().

fg<-fg+stat_smooth(method="lm",col="black",se=F)
fg
## `geom_smooth()` using formula = 'y ~ x'

Example 2

In a study conducted by the US army, different body parts of soldiers (e.g., arm length, foot length, etc.) were measured in order to design suitable uniforms. The file ansur_men.txt contains the data from 1,774 men. As shown by dim(), there are 1,774 rows and 166 columns, namely 166 variables.

url2<-"http://cml.nccu.edu.tw/lxyang/ExpDesign/ansur_men.txt"
men.dta<-read.table(url2,header=T,sep="")
dim(men.dta)
## [1] 1774  166

For practice, the variables of foot length and hand length are chosen. Presumably, the foot length should be positively correlated with the hand length. In the following codes, we are going to verify this hypothesis.

Describing Data

First, we can describe the data with histograms overlaid with estimated density curves.

fg.foot<-ggplot(data=men.dta,aes(x=FOOT_LNTH))+
           geom_histogram(aes(y=..density..),bins=10,col="blue",fill="blue",alpha=0.2)+
           geom_density()
fg.hand<-ggplot(data=men.dta,aes(x=HAND_LNTH))+
           geom_histogram(aes(y=..density..),bins=10,col="red",fill="red",alpha=0.2)+
           geom_density()
library(gridExtra)
grid.arrange(fg.foot,fg.hand,ncol=2)
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
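
As the warning notes, the ..density.. notation is deprecated in recent versions of ggplot2. The same figure can be produced with after_stat(); here is a sketch for the foot-length panel:

fg.foot2<-ggplot(data=men.dta,aes(x=FOOT_LNTH))+
            geom_histogram(aes(y=after_stat(density)),bins=10,col="blue",fill="blue",alpha=0.2)+
            geom_density()
fg.foot2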

Of course, we can make Q-Q plots to check how close the distributions of these two variables are to the normal distribution. Apparently, both distributions are quite close to normal, except that there are more extremely large hand-length values in the sample than a normal distribution would predict.

qq.foot<-ggplot(data=men.dta,aes(sample=FOOT_LNTH))+
           stat_qq(col="blue",alpha=0.2)+stat_qq_line(col="blue")
qq.hand<-ggplot(data=men.dta,aes(sample=HAND_LNTH))+
           stat_qq(col="red",alpha=0.2)+stat_qq_line(col="red")
grid.arrange(qq.foot,qq.hand,ncol=2)

Correlation Analysis

The sample distributions of hand length and foot length are quite normal. Subsequently, we can examine the relation between these two variables. The scatter plot suggests a high positive correlation between them, which is supported by \(r=.81\), significantly different from 0, \(t(1772)=57.24\), \(p<.01\).

fg.fthd<-ggplot(data=men.dta,aes(x=HAND_LNTH,y=FOOT_LNTH))+
            geom_point(shape=1,col="blue")
fg.fthd

with(men.dta,cor.test(FOOT_LNTH,HAND_LNTH))
## 
##  Pearson's product-moment correlation
## 
## data:  FOOT_LNTH and HAND_LNTH
## t = 57.241, df = 1772, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7886369 0.8213534
## sample estimates:
##       cor 
## 0.8056085

Note that the correlation between two variables drops when the observed value range shrinks. We have seen that foot length is highly correlated with hand length. However, if we sample the data only from the region within 0.5 SD above and below the mean of each variable, the correlation coefficient drops.

The codes below compute the means and standard deviations of HAND_LNTH and FOOT_LNTH. Then we get the row positions of the HAND_LNTH and FOOT_LNTH values we want. Subsequently, we use match() to find the row positions on which both the HAND_LNTH and FOOT_LNTH values fall in the target range. For details on match(), please check its help document with ?match in the console pane. Finally, we create a new data frame, newdta, by retrieving the target rows from men.dta. Now conduct the correlation analysis for the hand-length and foot-length data in this new data set. The correlation coefficient drops from \(.81\) to \(.23\). This finding is related to the restricted range of the variables. Recall the equation of Pearson's \(r\): the numerator is the covariance between the variables, \(cov(x,y)=E((x-\bar{x})(y-\bar{y}))\). When we shrink the sampled range of each variable, the deviations from the mean become smaller accordingly, and \(r\) becomes smaller.

mH<-mean(men.dta$HAND_LNTH)
Hd<-0.5*sd(men.dta$HAND_LNTH)
x<-which(men.dta$HAND_LNTH>=mH-Hd & men.dta$HAND_LNTH<mH+Hd)
mF<-mean(men.dta$FOOT_LNTH)
Fd<-0.5*sd(men.dta$FOOT_LNTH)
y<-which(men.dta$FOOT_LNTH>=mF-Fd & men.dta$FOOT_LNTH<mF+Fd)
index<-match(y,x)
newdta<-men.dta[y[!is.na(index)],]
with(newdta,cor.test(HAND_LNTH,FOOT_LNTH))
## 
##  Pearson's product-moment correlation
## 
## data:  HAND_LNTH and FOOT_LNTH
## t = 4.7841, df = 407, p-value = 2.406e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1368376 0.3205358
## sample estimates:
##       cor 
## 0.2307417
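
As a side note, the same subset can be obtained more directly with intersect(), which returns the row positions contained in both x and y; the sketch below reproduces the correlation reported above.

newdta2<-men.dta[intersect(x,y),]            # rows within 0.5 SD of the mean on both variables
with(newdta2,cor.test(HAND_LNTH,FOOT_LNTH))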

Regression Analysis

Again, we can regress FOOT_LNTH by HAND_LNTH. The results show that both the intercept and the slope are significantly different from 0. The slope is quite close to 1, suggesting that foot length increases by about one unit when hand length increases by one unit.

m1<-lm(FOOT_LNTH~HAND_LNTH,men.dta)
summary(m1)
## 
## Call:
## lm(formula = FOOT_LNTH ~ HAND_LNTH, data = men.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.3346  -4.8451  -0.0802   5.0957  24.7630 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 60.71414    3.65532   16.61   <2e-16 ***
## HAND_LNTH    1.07840    0.01884   57.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.761 on 1772 degrees of freedom
## Multiple R-squared:  0.649,  Adjusted R-squared:  0.6488 
## F-statistic:  3277 on 1 and 1772 DF,  p-value: < 2.2e-16
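
To see how precisely the coefficients are estimated, confint() reports their 95% confidence intervals:

confint(m1)   # 95% confidence intervals for (Intercept) and HAND_LNTH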

Finally, we can add the regression line to the scatter plot.

fg.fthd<-fg.fthd+stat_smooth(method="lm",col="black",se=F)
fg.fthd
## `geom_smooth()` using formula = 'y ~ x'

Exercise:

A data frame named cars is built into R. In this data frame, there are two columns representing the car speed and the stopping distance, respectively. Please do a complete analysis of the relationship between car speed and stopping distance.
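
A possible starting point for the exercise (cars is built into R, so no file needs to be imported; its columns are named speed and dist):

head(cars)                                                  # car speed and stopping distance
ggplot(data=cars,aes(x=speed,y=dist))+geom_point(shape=1)   # scatter plot of the two variables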