Simple Linear Regression

In addition to the correlation coefficient, a regression line is often used to describe the relationship between two variables. A regression line is a mathematical model that posits a linear relationship between the predictor \(x\) and the dependent variable \(y\): \(\hat{y}=\beta_0+\beta_1x\). This model has two parameters, \(\beta_0\) and \(\beta_1\), which are the intercept and the slope of the line.

When \(\beta_0=0\), the line passes through the origin. When \(\beta_1>0\), the line ascends, indicating a positive relationship between \(x\) and \(y\). When \(\beta_1<0\), the line descends, indicating a negative relationship between \(x\) and \(y\). Of course, when \(\beta_1=0\), the line is horizontal, indicating no relationship between \(x\) and \(y\).

Suppose we have \(n\) pairs of \(x\) and \(y\) values and we want to build a regression model that predicts a \(y\) value, namely \(\hat{y}\), for any given \(x\) value. The \(x\) and \(y\) values are known; what we need to do is estimate the values of the two parameters \(\beta_0\) and \(\beta_1\).

Least Squares Method

Since we have \(n\) \(x\) values, we can predict \(n\) \(\hat{y}\) values with the regression model \(\hat{y}=\beta_0+\beta_1x\). For each \(\hat{y}\), there is an observed \(y\). The difference between \(y\) and \(\hat{y}\) is called the error, \(\epsilon=y-\hat{y}\). In regression, the errors are assumed to be normally distributed with mean 0 and standard deviation \(\sigma_\epsilon\), that is, \(\epsilon \sim N(0,\sigma^2_\epsilon)\).

Therefore, \(y=\hat{y}+\epsilon=\beta_0+\beta_1x+\epsilon\). Since the mean of \(\epsilon\) is 0 and \(\beta_0\) and \(\beta_1\) are constants, the mean of \(y\) is \(\mu_y=\beta_0+\beta_1\mu_x\). This reveals the first feature of the regression line: it must pass through the point of means, \((\mu_x,\mu_y)\).
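As a quick illustration of this feature (a minimal sketch with simulated data, not part of the numeric example below), the fitted line evaluated at the mean of \(x\) returns the mean of \(y\).

set.seed(123)                       # hypothetical simulated data
x<-rnorm(50,mean=10,sd=2)
y<-3+0.5*x+rnorm(50)
m<-lm(y~x)
# The prediction at mean(x) equals mean(y)
c(predict(m,newdata=data.frame(x=mean(x))),mean(y))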

How do we get the values of these two parameters? The least squares method is the most common approach. First, the total squared error made by the regression model is a function of the parameters, \(E=f(\beta_0,\beta_1)\), where \(f(\beta_0,\beta_1)=\sum(y-\hat{y})^2=\sum(y-\beta_0-\beta_1x)^2\).

Given \(\mu_y=\beta_0+\beta_1\mu_x\), \(\beta_0=\mu_y-\beta_1\mu_x\). Then,

\(E=\sum(y-\mu_y+\beta_1\mu_x-\beta_1x)^2=\sum((y-\mu_y)-\beta_1(x-\mu_x))^2=\sum[(y-\mu_y)^2+\beta^2_1(x-\mu_x)^2-2\beta_1(y-\mu_y)(x-\mu_x)]\).

The minimum error occurs when \(\frac{\partial E}{\partial \beta_1}=0\). \(\frac{\partial E}{\partial \beta_1}=\sum[2\beta_1(x-\mu_x)^2-2(y-\mu_y)(x-\mu_x)]=0\). That is,

\(\beta_1\sum(x-\mu_x)^2-\sum(y-\mu_y)(x-\mu_x)=0\). Dividing both sides by \(n\), the equation becomes \(\frac{\beta_1\sum(x-\mu_x)^2}{n}=\frac{\sum(y-\mu_y)(x-\mu_x)}{n}\).

By definition, we know that \(\frac{\sum(x-\mu_x)^2}{n}\) is \(\sigma^2_x\) and \(\frac{\sum(y-\mu_y)(x-\mu_x)}{n}\) is the covariance between \(x\) and \(y\), \(cov_{xy}\). Therefore, \(\beta_1=\frac{cov_{xy}}{\sigma^2_x}\) and consequently \(\beta_0=\mu_y-\frac{cov_{xy}}{\sigma^2_x}\mu_x\). In parallel with the population parameters, the sample statistics are denoted \(b_0\) and \(b_1\), the intercept and slope of the regression line fitted to a sample. Then, \(b_1=\frac{cov_{xy}}{s^2_x}\) and \(b_0=\bar{y}-\frac{cov_{xy}}{s^2_x}\bar{x}\).

Numeric Example 1

The data from Levine's (1990) study, listed in Table 9.1, are used as an example. The researcher collected life pace scores and heart disease threat scores in 36 cities.

dta<-data.frame(pace=c(27.67,25.33,23.67,26.33,26.33,25,26.67,26.33,
                       24.33,25.67,22.67,25,26,24,26.33,20,24.67,
                       24,24,20.67,22.33,22,19.33,19.67,23.33,22.33,
                       20.33,23.33,20.33,22.67,20.33,22,20,18,16.67,15),
                hd=c(24,29,31,26,26,20,17,19,26,24,
                     26,25,14,11,19,24,20,13,20,18,
                     16,19,23,11,27,18,15,20,18,21,
                     11,14,19,15,18,16))
dim(dta)
## [1] 36  2

Before doing any statistical analysis, we can make a scatter plot to check for any relationship between life pace and heart disease scores. As seen in the figure below, the faster the life pace, the greater the threat of heart disease appears to be. The correlation between life pace and heart disease scores is significant, \(r=.37\), \(p<.05\).

library(ggplot2)
fg1<-ggplot(dta,aes(x=pace,y=hd))+
  geom_point(shape=16,color="deepskyblue3")
fg1

cor.test(dta$pace,dta$hd)
## 
##  Pearson's product-moment correlation
## 
## data:  dta$pace and dta$hd
## t = 2.2869, df = 34, p-value = 0.02855
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04158114 0.61936671
## sample estimates:
##       cor 
## 0.3651289

To describe the relationship between life pace and heart disease with a straight line, we need to compute the intercept \(b_0\) and the slope \(b_1\). Following the equations derived above, we can compute them with the following code. The function cov( ) is used to compute the covariance between \(x\) and \(y\).

means<-apply(dta,2,mean)
sds<-apply(dta,2,sd)
cov<-cov(dta$pace,dta$hd)
b1<-cov/sds[1]^2
b0<-means[2]-b1*means[1]
c(b0,b1)
##        hd      pace 
## 5.3792801 0.6315618

Instead of composing our own code to compute the slope and intercept of a regression line, we can directly use the function lm( ) to conduct the regression analysis. The first argument is a formula: the dependent variable, followed by the symbol \(\sim\), which can be thought of as \(=\) in the regression model, and then the predictor, here the life pace score. The results are assigned to the variable r1. Then, we use summary( ) to get the summary table. The first column contains the names of the coefficients: one is the intercept and the other is pace. The second column contains the estimated values. You will find they are the same values as those we just computed with the equations. According to the summary table, the intercept is not significantly different from 0, but the slope is.

r1<-lm(hd~pace,dta)
summary(r1)
## 
## Call:
## lm(formula = hd ~ pace, data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5368 -3.0610 -0.2463  3.8717 10.6717 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   5.3793     6.3613   0.846   0.4037  
## pace          0.6316     0.2762   2.287   0.0285 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.925 on 34 degrees of freedom
## Multiple R-squared:  0.1333, Adjusted R-squared:  0.1078 
## F-statistic:  5.23 on 1 and 34 DF,  p-value: 0.02855

In R, the package {ggplot2} provides a useful function to easily add the linear regression line to the scatterplot. As done in the following code, the regression line, whose intercept and slope are computed by lm( ), is added to the scatterplot.

fg1+geom_smooth(method="lm",se=F,color="#F8766D")
## `geom_smooth()` using formula = 'y ~ x'

Correlation Coefficient and Beta

Similar to the correlation coefficient \(r\), the slope \(\beta_1\) is also called the regression coefficient or Beta. The correlation coefficient describes the relationship between two variables, and so does the slope of the regression line. Presumably, these two coefficients should be related. Let's start from the equations for computing \(r\) and \(\beta_1\): \(r=\frac{cov_{xy}}{s_xs_y}\) and \(\beta_1=\frac{cov_{xy}}{s^2_x}\). They look very similar. Thus, \(\frac{r}{\beta_1}=\frac{s_x}{s_y}\). This is confirmed by the following code. For the regression model with the normalized values of \(x\) and \(y\), \(r_{z_xz_y}=\beta_1\), as \(s_{z_x}=s_{z_y}=1\).

cor.test(dta$pace,dta$hd)$estimate/b1
##       cor 
## 0.5781364
sds[1]/sds[2]
##      pace 
## 0.5781364

For the correlation coefficient, \(r_{xy}=r_{yx}\). However, this is not true for the regression coefficient. That is, \(\beta_1\) in \(\hat{y}=\beta_0+\beta_1x\) differs from \(\beta'_1\) in \(\hat{x}=\beta'_0+\beta'_1y\). As \(\beta_1=\frac{cov_{xy}}{s^2_x}\) and \(\beta'_1=\frac{cov_{xy}}{s^2_y}\), the regression coefficient generally changes when the predictor and the dependent variable are swapped. However, when \(x\) and \(y\) are both normalized, \(\beta_1=\beta'_1\), as the standard deviations of \(z_x\) and \(z_y\) are both 1.
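As a quick check of these two points (a small sketch reusing the data above, not part of the original analysis), we can standardize both variables with scale( ) and fit the regression in both directions; both slopes should equal the correlation coefficient \(r=.365\).

# Slope with standardized hd regressed on standardized pace
coef(lm(scale(hd)~scale(pace),dta))[2]
# Slope with the predictor and the dependent variable swapped
coef(lm(scale(pace)~scale(hd),dta))[2]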

The Accuracy of Prediction

In the two-dimensional data space, we can theoretically draw infinitely many lines. However, only one of them, the regression line, is overall the closest to the data points. How accurately it predicts the observed \(y\) values is therefore the index used to evaluate the regression model. A straightforward index is the error, the distance between the observed and predicted values, \(\epsilon=y-\hat{y}\). Since \(\sum\epsilon=0\), the sum of squared errors is more suitable: \(\sum\epsilon^2=\sum(y-\hat{y})^2\). This is also called the sum of squares of residuals.
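As a quick check with the model r1 fitted above (not part of the original code), the residuals indeed sum to essentially zero, so only their sum of squares is informative.

# The residuals of r1 sum to (numerically) zero
sum(residuals(r1))
# The sum of squared residuals is therefore used instead
sum(residuals(r1)^2)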

Similar to the definition of variance, the sum of squares of residuals divided by its \(df\) is a variance, which is called the residual variance or error variance and denoted \(s^2_{y.x}=\frac{\sum(y-\hat{y})^2}{n-2}\). Here, the \(df\) is \(n-2\) (\(n\): sample size). Just as in the \(t\)-test for the correlation coefficient \(r\), the degrees of freedom are \(n-2\) instead of \(n-1\). The square root of the residual variance is called the standard error of estimate, \(s_{y.x}=\sqrt{\frac{\sum(y-\hat{y})^2}{n-2}}\). This is one common measure of the accuracy of prediction of a regression model. Since it is an index of error, the smaller it is, the better the model. In fact, the residual standard error in the above summary table is just the standard error of estimate. You can save the output of summary( ) as an object, say results; in this object, the component sigma is the standard error of estimate. Of course, you can also compute the residuals yourself and then compute the standard error of estimate, as in the commented steps below.

results<-summary(r1)
results$sigma
## [1] 4.925227
# Predict the y values given the x values by the regression model
pred.y<-cbind(1,dta$pace)%*%(results$coefficients[,1])
# Compute the residuals
error<-dta$hd-pred.y
# Compute the standard error of estimate
see<-sqrt(sum(error^2)/(nrow(dta)-2))
see
## [1] 4.925227
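As a side note (not in the original code), the same quantities can be obtained with built-in extractor functions, which avoids the matrix multiplication above; sigma( ) requires a reasonably recent version of R.

# fitted( ) returns the same predicted values as pred.y above
head(fitted(r1))
# sigma( ) extracts the residual standard error directly
sigma(r1)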

\(r^2\) and the Standard Error of Estimate

The standard error of estimate is related to \(r^2\), as \(s_{y.x}=s_y\sqrt{(1-r^2)\frac{n-1}{n-2}}\). The proof is as follows.

First, \(\sum(y-\hat{y})^2=\sum(y-\bar{y}+\bar{y}-\hat{y})^2=\sum(y-\bar{y})^2-2\sum(y-\bar{y})(\hat{y}-\bar{y})+\sum(\hat{y}-\bar{y})^2\).

As we know that \(\bar{y}=b_0+b_1\bar{x}\) and \(\hat{y}=b_0+b_1x\), \(\hat{y}-\bar{y}=b_1x-b_1\bar{x}=b_1(x-\bar{x})\).

Second, \(\sum(y-\hat{y})^2=\sum(y-\bar{y})^2-2\sum(y-\bar{y})(\hat{y}-\bar{y})+\sum(\hat{y}-\bar{y})^2=\sum(y-\bar{y})^2-2b_1\sum(y-\bar{y})(x-\bar{x})+b^2_1\sum(x-\bar{x})^2\).

We then divide both sides of this equation by \(n-1\).

Third, \(\frac{\sum(y-\hat{y})^2}{n-1}=\frac{\sum(y-\bar{y})^2}{n-1}-2b_1\frac{\sum(y-\bar{y})(x-\bar{x})}{n-1}+b^2_1\frac{\sum(x-\bar{x})^2}{n-1}\).

As we know that \(b_1=\frac{cov_{xy}}{s^2_x}\), \(\frac{\sum(y-\bar{y})(x-\bar{x})}{n-1}=cov_{xy}\), and \(\frac{\sum(x-\bar{x})^2}{n-1}=s^2_x\), the equation can be rewritten as follows.

Fourth, \(\frac{\sum(y-\hat{y})^2}{n-1}=\frac{\sum(y-\bar{y})^2}{n-1}-2\frac{cov^2_{xy}}{s^2_x}+\frac{cov^2_{xy}}{s^2_x}=s^2_y-\frac{cov^2_{xy}}{s^2_x}\).

Now we divide both sides by \(s^2_y\).

Fifth, \(\frac{\sum(y-\hat{y})^2}{s^2_y(n-1)}=1-\frac{cov^2_{xy}}{s^2_xs^2_y}=1-r^2\).

We know the residual variance is \(s^2_{y.x}=\frac{\sum(y-\hat{y})^2}{n-2}\).

Sixth, \(\frac{\sum(y-\hat{y})^2}{s^2_y(n-1)}=\frac{(n-2)s^2_{y.x}}{s^2_y(n-1)}=1-r^2\). Then we can move every term except \(s^2_{y.x}\) from the left side to the right side.

Seventh, \(s^2_{y.x}=s^2_y(1-r^2)\frac{n-1}{n-2}\). Or \(s_{y.x}=s_y\sqrt{(1-r^2)\frac{n-1}{n-2}}\).
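We can verify this relation with the numeric example (a quick check reusing the objects computed above; n and r are defined here for convenience).

n<-nrow(dta)
r<-cor(dta$pace,dta$hd)
# Standard error of estimate from the formula; it should match results$sigma
sds[2]*sqrt((1-r^2)*(n-1)/(n-2))
results$sigma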

For large samples, the fraction \(\frac{n-1}{n-2}\approx1\). Thus, \(s^2_{y.x}\approx s^2_y(1-r^2)\). Since \(0\leq r^2\leq1\), \(s^2_{y.x}\leq s^2_y\).

What does that mean? It means that with a valid predictor, the variance of the dependent variable around the predicted values is smaller than the variance of the same dependent variable without any predictor. For example, we know that children's heights are positively correlated with their fathers' heights. When we estimate a child's height without knowing his or her father's height, the range of plausible heights is larger than when we do know the father's height. Just think about which estimate will be more accurate: estimating a child's height at age 20 without knowing his father's height vs. estimating his height at age 20 given that his father is 180 cm tall. Obviously, the latter will be more accurate.

\(r^2\) as a Measure of Predictable Variability

Following the previous discussion of the relationship between the residual variance and \(r^2\), \(s^2_{y.x}=s^2_y(1-r^2)\frac{n-1}{n-2}\) can be rewritten as \(\sum(y-\hat{y})^2=\sum(y-\bar{y})^2(1-r^2)\), or \(SS_r=SS_y(1-r^2)\), where \(SS_r\) is the sum of squares residual.

Then, \(SS_r=SS_y-SS_yr^2\), so \(r^2=\frac{SS_y-SS_r}{SS_y}\). The sum of squares of \(y\) is the total variability of the \(y\) values. The sum of squares residual is the variability of the residuals, which is not predicted by the regression model. Thus, \(SS_y-SS_r\) is the variability of the \(y\) values that can be explained by the regression model. We can call it the sum of squares model, \(SSR=SS_y-SS_r\). Then, \(r^2\) is the proportion of the total variability (i.e., \(SS_y\) or \(SST\)) that the regression model can explain.

Source SS
\(SST\) \(\sum(y-\bar{y})^2\)
\(SSR\) \(\sum(\hat{y}-\bar{y})^2\)
\(SS_r\) \(\sum(y-\hat{y})^2\)

Note \(SST=SSR+SS_r\). Obviously, the larger \(r^2\) is, the better the regression model. For our numeric example, we can compute these sums of squares with the code below. The result confirms the equation \(SST=SSR+SS_r\). Alternatively, we can say that we partition the sum of squares total into two independent parts: the sum of squares model and the sum of squares residual.

SST<-sds[2]^2*(nrow(dta)-1)
SSr<-results$sigma^2*(nrow(dta)-2)
SSR<-sum((dta$hd-results$residuals-mean(dta$hd))^2)
SST
##       hd 
## 951.6389
c(SSr,SSR)
## [1] 824.7672 126.8716
SSr+SSR
## [1] 951.6389

Alternatively, we can use anova( ) to compute these sums of squares. In the table, the column Sum Sq contains the sums of squares. Since pace is the predictor, its sum of squares is the sum of squares model, \(SSR\). The sum of squares residual \(SS_r\) can also be read directly from the table. As \(r^2=\frac{SSR}{SST}=1-\frac{SS_r}{SST}\), \(r^2=0.133\). This value can also be seen in the summary table reported by summary( ) for the output object of lm( ). Also, \(r=\sqrt{1-\frac{SS_r}{SST}}=0.365\), which is the same value reported in the correlation analysis.

anova(r1)
## Analysis of Variance Table
## 
## Response: hd
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## pace       1 126.87 126.872  5.2301 0.02855 *
## Residuals 34 824.77  24.258                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1-SSr/SST
##        hd 
## 0.1333191
(1-SSr/SST)^0.5
##        hd 
## 0.3651289