Testing the Significance of the Regression Coefficient

Like \(r\), the slope of the regression line \(b\) shows to what extent the dependent variable \(y\) varies with the predictor \(x\). When \(b=0\), there is no linear relationship between \(y\) and \(x\); that is, \(x\) cannot predict \(y\). Note also that \(b\) is the slope of the regression line generated from a sample and might differ from one sample to another. Thus, \(b\) is the slope of the sample regression line, whereas the regression line for the population of \(x\) and \(y\) values has slope \(\beta\). The sample slope is assumed to be an unbiased estimate of the population slope, \(E(b)=\beta\). Just as when inferring about a population mean, we do hypothesis testing to infer about the population slope, with \(H_1:\beta\neq0\) and \(H_0:\beta=0\).

In order to do hypothesis testing for \(b\), we need to know the sampling distribution of \(b\) and the standard error of that distribution. Since \(b=\frac{cov_{xy}}{s^2_x}=r\frac{s_y}{s_x}\), let’s start by considering how to test the significance of \(r\). We normally test the hypothesis that \(x\) is correlated with \(y\) by stating \(H_1:\rho\neq0\) and \(H_0:\rho=0\) and using a \(t\) test. This is because when \(H_0\) is true, the mean of \(r\) is 0 and the sampling distribution of \(r\) approximates the \(t\) distribution. The \(r\) value can be converted to a \(t\) value as

\(t=\frac{r\sqrt{N-2}}{\sqrt{1-r^2}}=\frac{r-0}{\sqrt{\frac{1-r^2}{N-2}}}\).

The numerator of this equation is \(r-0\), the difference between \(r\) and 0. According to the definition of \(t\), the denominator should be the standard error. Since \(s_{y.x}=s_y\sqrt{(1-r^2)\frac{N-1}{N-2}}\), we can multiply \(\sqrt{\frac{1-r^2}{N-2}}\) by \(\frac{\sqrt{N-1}}{\sqrt{N-1}}\).

\(\sqrt{\frac{1-r^2}{N-2}}\frac{\sqrt{N-1}}{\sqrt{N-1}}=\sqrt{(1-r^2)\frac{N-1}{N-2}}\sqrt{\frac{1}{N-1}}=\frac{s_{y.x}}{s_y}\sqrt{\frac{1}{N-1}}=\frac{s_{y.x}}{s_y\sqrt{N-1}}\).

Now the \(t\) value becomes

\(t=\frac{r}{\frac{s_{y.x}}{s_y\sqrt{N-1}}}=\frac{b\frac{s_x}{s_y}}{\frac{s_{y.x}}{s_y\sqrt{N-1}}}=\frac{b}{\frac{s_{y.x}}{s_x\sqrt{N-1}}}\).

Therefore, the standard error of the sampling distribution of \(b\) is \(s_b=\frac{s_{y.x}}{s_x\sqrt{N-1}}\). However, the \(df\) of this \(t\) distribution is still \(N-2\), as we have only replaced \(r\) with \(b\); the \(t\) distribution here is still the one used for testing the significance of \(r\).

Numerical Example

A data set contains 10 participants’ body weights at the ages of 8 and 20. See below. We are interested in whether one’s weight at the age of 8 can predict his/her weight at the age of 20. As usual, we first make a scatter plot to visualize the relationship between these two weights.

bw.dta<-data.frame(age8=c(31,25,29,20,40,32,27,33,28,31),
                   age20=c(55,44,55,51,85,57,59,60,66,49))
library(ggplot2)
ggplot(bw.dta,aes(age8,age20))+
  geom_point()

Visual inspection of this figure suggests a positive relation between these two weights, \(r=.72\), \(p<.05\). Of course, you can compute the \(t\) value with the equation above instead of using the function cor.test( ); the result is the same.

with(bw.dta,cor.test(age8,age20))
## 
##  Pearson's product-moment correlation
## 
## data:  age8 and age20
## t = 2.9297, df = 8, p-value = 0.01901
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1641658 0.9284805
## sample estimates:
##       cor 
## 0.7194296
r<-cor(bw.dta$age8,bw.dta$age20)
rt<-r*sqrt((nrow(bw.dta)-2)/(1-r^2))
rt
## [1] 2.92968
2*(1-pt(rt,df=nrow(bw.dta)-2))
## [1] 0.01900745

We can add a regression line to this scatter plot. The estimated coefficients of this regression line can be computed using the function lm( ). The slope is 1.53. You can see that its \(t\) value is the same as the one for testing \(r\), and the \(p\) value is also the same as in the significance test for \(r\). This is not surprising, as the two coefficients are interchangeable.

ggplot(bw.dta,aes(age8,age20))+
  geom_point()+
  geom_smooth(method="lm",se=F,color="red")
## `geom_smooth()` using formula = 'y ~ x'

m1<-lm(age20~age8,bw.dta)
summary.m1<-summary(m1)
summary.m1
## 
## Call:
## lm(formula = age20 ~ age8, data = bw.dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.243  -5.126  -2.743   6.918  10.979 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  12.7853    15.6887   0.815    0.439  
## age8          1.5309     0.5225   2.930    0.019 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.302 on 8 degrees of freedom
## Multiple R-squared:  0.5176, Adjusted R-squared:  0.4573 
## F-statistic: 8.583 on 1 and 8 DF,  p-value: 0.01901
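
We can also verify the formula for the standard error of \(b\) derived above, \(s_b=\frac{s_{y.x}}{s_x\sqrt{N-1}}\), directly from the sample statistics. The sketch below uses the correlation r computed earlier; the variable names (sx, sy, syx, sb, b1) are chosen here just for illustration. The results should match the Std. Error and \(t\) value reported in summary(m1).

N<-nrow(bw.dta)
sx<-sd(bw.dta$age8)                  # standard deviation of the predictor
sy<-sd(bw.dta$age20)                 # standard deviation of the dependent variable
syx<-sy*sqrt((1-r^2)*(N-1)/(N-2))    # standard error of estimate s_y.x
sb<-syx/(sx*sqrt(N-1))               # standard error of b; should be about 0.5225
b1<-r*sy/sx                          # slope b = r*s_y/s_x; should be about 1.5309
b1/sb                                # t value; should be about 2.93 with df = N-2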

Confidence Interval of \(\beta\)

Since we have the standard error of the sampling distribution of \(b\), we can easily set up the 95% confidence limits on \(\beta\). The upper limit is \(b+t_{.975}SE_b\) and the lower limit is \(b+t_{.025}SE_b\), where the \(t\) quantiles have \(df=N-2\). Then, the population slope \(\beta\) is located in the range between 0.33 and 2.74 with 95% confidence.

summary.m1$coefficient
##              Estimate Std. Error   t value   Pr(>|t|)
## (Intercept) 12.785261 15.6886771 0.8149356 0.43869429
## age8         1.530903  0.5225496 2.9296802 0.01900745
b<-summary.m1$coefficient[2,1]
seb<-summary.m1$coefficient[2,2]
tup<-qt(0.975,df=nrow(bw.dta)-2)
tlow<-qt(0.025,df=nrow(bw.dta)-2)
c(b+tlow*seb,b+tup*seb)
## [1] 0.3259017 2.735905
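
Alternatively, R’s built-in function confint( ) computes the confidence interval for the slope directly from the fitted model, which is a convenient check on the hand computation above.

confint(m1,"age8",level=0.95)   # should reproduce the limits computed above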

Assumptions Underlying Regression and Correlation

In order to understand the assumptions underlying regression, let’s look at the left panel of the figure below. Obviously, for a given \(x\) value \(x_j\), there are a number of \(y\) values. Thus, the regression model is \(y_{ij}=b_0+b_1x_j+\epsilon_{ij}\).

  1. Normality. It is assumed that in the population the values of \(y\) corresponding to any specific value of \(x\) are normally distributed around \(\hat{y}\). Since \(\epsilon_{ij}\sim ND(0,\sigma_\epsilon)\) and \(b_0+b_1x_j\) is a constant, \(y_{ij}\sim ND(\hat{y}_j,\sigma_{\epsilon})\), where \(\hat{y}_j=b_0+b_1x_j\). Accordingly, the array mean \(\bar{y}_{.j}=\frac{1}{n}\sum_i (b_0+b_1x_j+\epsilon_{ij})=b_0+b_1x_j+\frac{\sum_i \epsilon_{ij}}{n}\approx\hat{y}_j\), because the errors average out to approximately 0.

  2. Homogeneity of variance in arrays. The variance of \(y\) around \(\hat{y}_j\) is the same for every array; that is, \(y_{ij}=\hat{y}_j+\epsilon_{ij}\) and \(y_{ij}\sim ND(\hat{y}_j,\sigma_{\epsilon})\), with the same \(\sigma_\epsilon\), for any given \(x_j\).

  3. Linearity. A linear relationship exists between the predictor and the dependent variable. Whether the relationship between \(x\) and \(y\) is linear can be checked with a scatter plot of the residuals. As \(y=\hat y+\epsilon\), if the relationship between \(x\) and \(y\) is linear, the residuals should be unrelated to \(\hat y\). Take the data in nlreg1.txt as an example. The left panel in the figure below shows the scatter plot for the data set, which looks like a linear relationship between \(x\) and \(y\). However, when we fit the linear regression model to this data set, the residuals actually vary nonlinearly with the \(y\) values; see the middle panel. This means that, in addition to the linear relationship, some higher-order relationship exists between the \(x\) and \(y\) values. The right panel shows the scatter plot of the residuals against the \(x\) values. The pattern is also nonlinear, suggesting that the residuals contain some higher-order function of the \(x\) values.

dta<-read.table("nlreg1.txt",header=T,sep="")
dta
##        y     x
## 1  10.07  77.6
## 2  14.73 114.9
## 3  17.94 141.1
## 4  23.93 190.8
## 5  29.61 239.9
## 6  35.18 289.0
## 7  40.02 332.8
## 8  44.82 378.4
## 9  50.76 434.8
## 10 55.05 477.3
## 11 61.01 536.8
## 12 66.40 593.1
## 13 75.47 689.1
## 14 81.78 760.0
dta.lm<-lm(y~x,dta)                        # fit the linear regression model
dta$residuals<-summary(dta.lm)$residuals   # save the residuals for plotting
fg1<-ggplot(dta,aes(x,y))+
       geom_point()
fg2<-ggplot(dta,aes(y,residuals))+
       geom_point(color="tomato")
fg3<-ggplot(dta,aes(x,residuals))+
       geom_point(color="deepskyblue3")
library(gridExtra)
grid.arrange(fg1,fg2,fg3,ncol=3)

  4. Independence. That is, \(\epsilon\) is independent of the predictor \(x\). The right panel of the figure above shows that the data in nlreg1.txt do not support this assumption of independence.
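
As a rough illustration of how these assumptions can be checked for the body-weight model m1 fitted earlier, the sketch below plots the residuals against the predictor (for independence and homogeneity of variance) and against normal quantiles (for normality). The plot objects fg4 and fg5 are new names introduced here; with only 10 observations this is a quick visual check, not a formal test.

bw.dta$residuals<-resid(m1)                # residuals of the body-weight model
fg4<-ggplot(bw.dta,aes(age8,residuals))+
       geom_point()                        # residuals vs. predictor: independence and variance
fg5<-ggplot(bw.dta,aes(sample=residuals))+
       stat_qq()+stat_qq_line()            # normal Q-Q plot of residuals: normality
grid.arrange(fg4,fg5,ncol=2)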