In the previous tutorial, the prediction accuracy of a regression model was expressed as the standard error of estimate \(s_{y.x}\): the smaller \(s_{y.x}\), the more accurate the prediction. In addition, the proportion of the total variance of the dependent variable that the regression model can account for, namely \(R^2\), can also represent the accuracy of model prediction. The relevant proof is as follows.
Suppose we have a sample of pairs of \(x\) and \(y\). The variance of \(y\) is \(s^2_y=\frac{1}{N-1}\sum(y-\bar{y})^2\). The numerator is called \(SST\). According to the regression model \(\hat{y}=b_0+b_1x\),
\[\begin{align*} y-\bar{y}&=y-\hat{y}+\hat{y}-\bar{y}\\ &=\epsilon+\hat{y}-\bar{y} \end{align*}\]
Also, \(\bar{y}=\frac{1}{N}\sum y=\frac{1}{N}\sum(\hat{y}+\epsilon)=\frac{1}{N}\sum(b_0+b_1x+\epsilon)=b_0+b_1\bar{x}+\bar{\epsilon}\). Therefore,
\[\begin{align*} y-\bar{y}&=\epsilon+b_0+b_1x-(b_0+b_1\bar{x}+\bar{\epsilon})\\ &=\epsilon-\bar{\epsilon}+b_1(x-\bar{x}) \end{align*}\]
\(SST\) is the sum of squares of \(y-\bar{y}\), which can be derived as
\[\begin{align*} \sum(y-\bar{y})^2&=\sum(\epsilon-\bar{\epsilon}+b_1(x-\bar{x}))^2\\ &=\sum(\epsilon-\bar{\epsilon})^2+\sum(b_1(x-\bar{x}))^2+\sum 2b_1(\epsilon-\bar{\epsilon})(x-\bar{x}). \end{align*}\]
Given the assumption that \(x\) is independent of \(\epsilon\), \(cov(x,\epsilon)=\sum(x-\bar{x})(\epsilon-\bar{\epsilon})/N=0\); for least-squares residuals this cross-product term is in fact exactly zero. Therefore,
\[\begin{align*} \sum(y-\bar{y})^2=\sum(\epsilon-\bar{\epsilon})^2+\sum b^2_1(x-\bar{x})^2 \end{align*}\]
As \(\sum(\epsilon-\bar{\epsilon})^2\) is the sum of squares of the residuals, it is called \(SSRes\). Thus, \(SST=SSRes+\sum b^2_1(x-\bar{x})^2\), or \(SST=SSRes+SSR\), where \(SSR\) is the part of the sum of squared deviations of \(y\) explained by \(x\). The proportion of the total variance of \(y\) explained by \(x\) is then \(R^2=\frac{SSR}{SST}=1-\frac{SSRes}{SST}\). When \(b_1\rightarrow 0\), \(SSR\rightarrow 0\) and \(SSRes\rightarrow SST\), so \(R^2\rightarrow 0\). When \(SSRes\rightarrow 0\) and \(SSR\rightarrow SST\), \(R^2\rightarrow1\), namely the highest accuracy. Thus, \(0\le R^2\le1\).
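Before turning to real data, a minimal sketch in R can verify this decomposition numerically. The variables x and y below are simulated and purely hypothetical; any sample of paired observations would do.

set.seed(1)                            # for reproducibility
x<-rnorm(30)                           # hypothetical predictor
y<-2+0.5*x+rnorm(30)                   # hypothetical outcome
fit<-lm(y~x)
SST<-sum((y-mean(y))^2)                # total sum of squares
SSRes<-sum(residuals(fit)^2)           # sum of squared residuals
SSR<-sum((fitted(fit)-mean(y))^2)      # sum of squares explained by x
c(SST,SSRes+SSR)                       # the two values should be equal
c(1-SSRes/SST,summary(fit)$r.squared)  # R-squared by the formula vs. lm()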
The ages and blood pressures of 30 adults are contained in ab.txt. With this data, we can examine the relationship between age and blood pressure.
abdta<-read.table("ab.txt",header=T,sep="")
head(abdta)
## age blood
## 1 39 144
## 2 47 220
## 3 45 138
## 4 47 145
## 5 65 162
## 6 46 142
As usual, we can visually inspect the scatter of the data. Of course, we can also add the regression line as well as the 95% confidence band for drawing an inference about where the population regression line lies. It looks as if blood pressure increases with age.
library(ggplot2)
ab.fg<-ggplot(data=abdta,aes(x=age,y=blood))+
geom_point(shape=1,col="blue")+
stat_smooth(method="lm",col="black")
ab.fg
## `geom_smooth()` using formula = 'y ~ x'
To verify this hypothesis, the correlation coefficient is computed and tested for statistical significance. Pearson's \(r=.66\), \(p<.01\), which confirms our inspection.
with(abdta,cor.test(age,blood))
##
## Pearson's product-moment correlation
##
## data: age and blood
## t = 4.6184, df = 28, p-value = 7.867e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3895932 0.8228923
## sample estimates:
## cor
## 0.6575673
Also, we can fit a simple linear regression model to predict blood pressure by a person's age. Both the intercept and the slope are significantly different from 0. Note that in this summary table, Multiple R-squared is \(R^2\) and Adjusted R-squared is the estimate of \(R^2\) for the population.
m1<-lm(blood~age,abdta)
summary(m1)
##
## Call:
## lm(formula = blood ~ age, data = abdta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.724 -6.994 -0.520 2.931 75.654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.7147 10.0005 9.871 1.28e-10 ***
## age 0.9709 0.2102 4.618 7.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121
## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05
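As a quick check of the numbers in this table, the coefficients and the residual standard error can be recomputed by hand from abdta and m1, using the familiar formula \(b_1=\frac{cov(x,y)}{s^2_x}\) and the \(N-2\) degrees of freedom; this is only a sketch to confirm the output above.

with(abdta,cov(age,blood)/var(age))                        # slope b1
with(abdta,mean(blood)-cov(age,blood)/var(age)*mean(age))  # intercept b0
sqrt(sum(residuals(m1)^2)/df.residual(m1))                 # residual standard error, should be 17.31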
You can use the equations in the section above to compute \(R^2\) and check that it is correct. For ease of computation, anova() is used. With the object returned by lm() as the argument, anova() returns the summary table for the analysis of variance. The column Sum Sq lists the sums of squares. Therefore, the sum of squares of the residuals is \(SSRes=8393.4\) and \(SSR=6394.0\). As \(R^2=1-\frac{SSRes}{SST}=\frac{SSR}{SST}\), \(R^2\) can be computed by dividing the first entry in the second column of m2 (i.e., m2[1,2]) by the sum of the entries in the second column of m2 (i.e., sum(m2[,2])), which is .4324, exactly the same as reported in the summary table of the regression analysis.
m2<-anova(m1)
m2
## Analysis of Variance Table
##
## Response: blood
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 6394.0 6394.0 21.33 7.867e-05 ***
## Residuals 28 8393.4 299.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m2[1,2]/sum(m2[,2])
## [1] 0.4323947
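The same sums of squares can also be recovered directly from the fitted values and residuals of m1, without anova(); this is just another way to cross-check the decomposition derived above.

SSR<-sum((fitted(m1)-mean(abdta$blood))^2)   # sum of squares explained by age
SSRes<-sum(residuals(m1)^2)                  # sum of squared residuals
c(SSR,SSRes,SSR/(SSR+SSRes))                 # should reproduce 6394.0, 8393.4, and .4324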
Recall that \(\sum(y-\bar{y})^2=\sum(\epsilon-\bar{\epsilon})^2+\sum b^2_1(x-\bar{x})^2\). Then,
\[\begin{align*} \frac{1}{N-1}\sum(y-\bar{y})^2&=\frac{1}{N-1}(\sum(\epsilon-\bar{\epsilon})^2+\sum b^2_1(x-\bar{x})^2)\\ s^2_y&=\frac{\sum(\epsilon-\bar{\epsilon})^2}{N-1}+\frac{\sum b^2_1(x-\bar{x})^2}{N-1} \end{align*}\]
Since \(\epsilon=y-\hat{y}\), the standard deviation of the distribution of \(y-\hat{y}\), \(s_{y.x}\), is also the standard deviation of the distribution of \(\epsilon\). As \(\bar{\epsilon}\) is a constant, the distribution of \(\epsilon-\bar{\epsilon}\) has the same standard deviation as the distribution of \(\epsilon\). The df for estimating \(\epsilon\) is \(N-2\), because computing \(\hat{y}\) requires estimating two parameters. Thus,
\[\begin{align*} s^2_y&=\frac{N-2}{N-1}\frac{\sum(\epsilon-\bar{\epsilon})^2}{N-2}+\frac{\sum b^2_1(x-\bar{x})^2}{N-1}\\ s^2_y&=\frac{N-2}{N-1}s^2_{y.x}+b^2_1\frac{\sum(x-\bar{x})^2}{N-1} \end{align*}\]
As \(b_1=\frac{cov(x,y)}{s^2_x}\), \(b^2_1=\frac{cov(x,y)^2}{s^4_x}\). Therefore,
\[\begin{align*} s^2_y&=\frac{N-2}{N-1}s^2_{y.x}+\frac{cov(x,y)^2}{s^4_x}\frac{\sum(x-\bar{x})^2}{N-1}\\ s^2_y&=\frac{N-2}{N-1}s^2_{y.x}+\frac{cov(x,y)^2}{s^2_x}\\ 1&=\frac{N-2}{N-1}\frac{s^2_{y.x}}{s^2_y}+\frac{cov(x,y)^2}{s^2_x s^2_y}\\ 1&=\frac{N-2}{N-1}\frac{s^2_{y.x}}{s^2_y}+r^2\\ 1-r^2&=\frac{N-2}{N-1}\frac{s^2_{y.x}}{s^2_y}\\ s_{y.x}&=s_y\sqrt{(1-r^2)\frac{N-1}{N-2}} \end{align*}\]
When \(N\) is very large, \(N-1 \approx N-2\), and the standard error of estimate \(s_{y.x}\) is approximated by \(s_y\sqrt{1-r^2}\). Either way, the larger \(r^2\), the smaller \(s_{y.x}\) is, namely the more precise the regression model is. In fact, \(1-r^2=\frac{N-2}{N-1}\frac{s^2_{y.x}}{s^2_y}=\frac{\sum(\epsilon-\bar{\epsilon})^2}{\sum(y-\bar{y})^2}=\frac{SSRes}{SST}\), which is the proportion of the variance of \(y\) unexplained by the model. Therefore, \(r^2=R^2\). We can check whether the squared \(r\) is the same as the R-squared (i.e., .4324) reported in the summary table of the regression analysis. Indeed, they are the same.
cor.res<-with(abdta,cor.test(age,blood))
cor.res$estimate^2
## cor
## 0.4323947
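We can also verify the relation \(s_{y.x}=s_y\sqrt{(1-r^2)\frac{N-1}{N-2}}\) derived above with the blood pressure data; this sketch assumes abdta, m1, and cor.res from the earlier analysis are still in the workspace.

r2<-cor.res$estimate^2
with(abdta,sd(blood)*sqrt((1-r2)*(30-1)/(30-2)))   # should match the residual standard error
summary(m1)$sigma                                  # residual standard error reported by lm(), 17.31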
The relationship between birth rate and economic development has long interested social scientists. In the study of Weintraub (1962), the birth rates, per capita incomes, proportions of the population in farming, and infant mortality rates during the early 1950s for 30 nations were collected. The data can be downloaded here.
First, we import the data. Importing this data with read.table() would produce an error, because some nation names contain a space, which happens to be the default delimiter for separating the variable columns. Thus, I copied and pasted the data into my text editor and saved it as a text file birth_economics.txt. Subsequently, I converted this file to a .csv file in Excel, which can be found here. As the columns are delimited by commas, the above problem is avoided. Of course, this is not the only way to manage the data you download; you can find your own way.
dta<-read.table("birth_economics.csv",header=F,sep=",")
names(dta)<-c("nation","br","income","pf","mortality")
head(dta)
## nation br income pf mortality
## 1 Venezuela 46.4 392 0.40 68.5
## 2 Mexico 45.7 118 0.61 87.8
## 3 Ecuador 45.3 44 0.53 115.8
## 4 Colombia 38.6 158 0.53 106.8
## 5 Ceylon 37.2 81 0.53 71.6
## 6 Puerto Rico 35.0 374 0.37 60.2
A common observation is that birth rates in developed countries are relatively low. We can check whether the higher the per capita income, the lower the birth rate is. The scatter plot below suggests a descending trend between per capita income and birth rate, which is supported by \(r=-.42\), \(p<.05\).
brincome.fg<-ggplot(data=dta,aes(x=income,y=br))+
geom_point(shape=1)+stat_smooth(method="lm",col="red")+
labs(x="Per Capita Income",y="Birth Rate")
brincome.fg
## `geom_smooth()` using formula = 'y ~ x'
with(dta,cor.test(br,income))
##
## Pearson's product-moment correlation
##
## data: br and income
## t = -2.4425, df = 28, p-value = 0.02116
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.67712346 -0.06928102
## sample estimates:
## cor
## -0.4190898
Of course, we can conduct a regression analysis using per capita income to predict birth rate. The estimated slope is significantly smaller than 0, consistent with the hypothesized negative trend.
m<-lm(br~income,dta)
summary(m)
##
## Call:
## lm(formula = br ~ income, data = dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.063 -6.349 -2.906 7.265 20.182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.237676 2.764106 10.939 1.28e-11 ***
## income -0.010255 0.004199 -2.442 0.0212 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.85 on 28 degrees of freedom
## Multiple R-squared: 0.1756, Adjusted R-squared: 0.1462
## F-statistic: 5.966 on 1 and 28 DF, p-value: 0.02116
Knowing that two variables are correlated with each other helps us make more precise statistical inferences. Take the birth rate data as an example. Based on these data, if we have sampled a nation's birth rate as 24.76, what are the 95% confidence intervals for estimating its true birth rate?
t.res<-with(dta,t.test(br,mu=mean(br)))
confs<-t.res$conf.int
confs
## [1] 21.1837 28.3363
## attr(,"conf.level")
## [1] 0.95
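The same interval can be reproduced by hand with the usual formula \(\bar{y}\pm t_{.975,N-1}\,s_y/\sqrt{N}\); this one-liner simply re-derives what t.test() reports above.

with(dta,mean(br)+qt(c(0.025,0.975),29)*sd(br)/sqrt(30))   # should reproduce 21.18 and 28.34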
However, if we know that the mean per capita income is 534.1 and that 17.56% of the variance of the birth rate can be explained by per capita income, what are the 95% confidence intervals for the birth rate when per capita income is 534.1? First we can get the predicted birth rate via \(\hat{y}=b_0+b_1\times534.1=b_0\times1+b_1\times534.1\). This equation is equivalent to the inner product of the two vectors [\(b_0\) \(b_1\)] and [1 534.1]. In R, the matrix multiplication operator is %*%; see the first line in the code below. Although the standard error of estimate differs across \(x\) values, here we use the one computed for all sample data as a rough estimate. As the standard error of estimate is provided in the output of summary(), I save this output as a variable v, in which sigma is the standard error of estimate. The range of the 95% confidence intervals is smaller when the income is given. This is not surprising: as \(s_{y.x}=s_y\sqrt{(1-r^2)\frac{N-1}{N-2}}\approx s_y\sqrt{1-r^2}\) and \(r^2\le1\), \(s_{y.x}\le s_y\). Therefore, knowing the correlation between variables can increase the precision of inference.
pred.br<-m$coefficients%*%c(1,534.1)
v<-summary(m)
bands<-c(pred.br+v$sigma/sqrt(30)*qt(0.025,29),pred.br+v$sigma/sqrt(30)*qt(0.975,29))
bands
## [1] 21.45578 28.06490
max(confs)-min(confs)
## [1] 7.152602
max(bands)-min(bands)
## [1] 6.609117
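For comparison, R's built-in predict() gives a confidence interval for the mean birth rate at income = 534.1 directly. Because it uses the exact standard error of the fitted value rather than the rough estimate above, its bounds will differ somewhat from bands.

predict(m,newdata=data.frame(income=534.1),interval="confidence",level=0.95)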
In addition to per capita income, does any other variable in this data set predict the birth rate? Please explore the data with your own regression analyses to gain a better understanding of the relationship between birth rate and economic development.