Prediction Interval on \(\hat{y}\)

One practical purpose of a regression model is to predict the value of the dependent variable given the predictor value(s). The prediction is reported with a prediction interval, which takes into account both the uncertainty about the mean of \(y\) conditional on a fixed value of \(x\) and the variability of individual observations around that mean.

Although the standard error of estimate is useful as an overall measure of error, it is not a good estimate of the error associated with any single prediction. When we predict a value of \(y\) for a subject whose \(x\) score is near \(\bar{x}\), the error in our estimate is smaller than when \(x\) is far from \(\bar{x}\).

If we want a prediction interval for \(y\) on the basis of \(x\) for a new member of the population, the standard error of the prediction is given by

\(s'_{y.x}=s_{y.x}\sqrt{1+\frac{1}{n}+\frac{(x_i-\bar{x})^2}{(n-1)s^2_x}}\).

This leads to the following prediction interval on \(\hat{y}\):

\(PI(y)=\hat{y} \pm t_{\frac{\alpha}{2}}s'_{y.x}\),

where the critical value \(t_{\frac{\alpha}{2}}\) has \(n-2\) degrees of freedom.

Let’s see a numerical example. Import the data in sat.txt, which contains students’ SAT scores and their study hours.

sat.dta<-read.table("sat.txt",header=T,sep="")
dim(sat.dta)
## [1] 20  2

To focus on the interval limits of the regression line, we skip the usual preprocessing steps and go directly to the regression analysis. In R, the function predict() computes the prediction interval for a \(y\) score from a fitted regression model.

sat.lm<-lm(SAT~Hours,sat.dta)
summary(sat.lm)
## 
## Call:
## lm(formula = SAT ~ Hours, data = sat.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -120.347  -29.308    9.928   33.734   83.570 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  353.165     24.337   14.51 2.24e-11 ***
## Hours         25.326      2.291   11.05 1.87e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.72 on 18 degrees of freedom
## Multiple R-squared:  0.8716, Adjusted R-squared:  0.8645 
## F-statistic: 122.2 on 1 and 18 DF,  p-value: 1.868e-09
interval.lm<-predict(sat.lm,interval=c("prediction"))
## Warning in predict.lm(sat.lm, interval = c("prediction")): predictions on current data refer to _future_ responses
interval.lm
##         fit      lwr       upr
## 1  454.4708 344.2703  564.6712
## 2  581.1031 474.0484  688.1578
## 3  606.4296 499.3640  713.4951
## 4  707.7354 598.4851  816.9858
## 5  454.4708 344.2703  564.6712
## 6  530.4502 422.7697  638.1306
## 7  657.0825 549.3483  764.8167
## 8  910.3472 787.4453 1033.2491
## 9  378.4913 263.9916  492.9911
## 10 429.1443 317.7000  540.5886
## 11 555.7766 448.5165  663.0367
## 12 631.7560 524.4635  739.0485
## 13 479.7972 370.6424  588.9521
## 14 505.1237 396.8104  613.4370
## 15 606.4296 499.3640  713.4951
## 16 631.7560 524.4635  739.0485
## 17 758.3884 646.8091  869.9677
## 18 682.4090 574.0209  790.7971
## 19 682.4090 574.0209  790.7971
## 20 606.4296 499.3640  713.4951

The matrix interval.lm has three columns, which respectively contain the fitted values, the lower bounds, and the upper bounds of the intervals. When we choose interval="prediction", predict() returns the bounds of the prediction interval. (The warning above simply reminds us that a prediction interval describes a future response, even though we evaluated it at the observed predictor values.) We are now going to make a scatterplot with the regression line and the prediction intervals.
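As a quick check, we can reproduce the first row of interval.lm by hand from the formula for \(s'_{y.x}\) above. This is a minimal sketch; the variable names (n, x, s.yx, se.pred) are our own.

n<-nrow(sat.dta)
x<-sat.dta$Hours
s.yx<-summary(sat.lm)$sigma  # standard error of estimate
se.pred<-s.yx*sqrt(1+1/n+(x[1]-mean(x))^2/((n-1)*var(x)))
fitted(sat.lm)[1]+c(-1,1)*qt(0.975,df=n-2)*se.pred
# should reproduce the lwr and upr of row 1: 344.2703 and 564.6712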

library(ggplot2)
fg<-ggplot(sat.dta,aes(x=Hours,y=SAT))+
  geom_point(color="deepskyblue3")+
  geom_smooth(method="lm",se=F,color="black")
fg
## `geom_smooth()` using formula = 'y ~ x'

We can add two dashed lines representing the upper and lower bounds of the prediction intervals for all predicted values.

data<-data.frame(Hours=sat.dta$Hours,lwr=interval.lm[,2],upr=interval.lm[,3])
fg+geom_line(data=data,aes(x=Hours,y=lwr),linetype=2,color="deepskyblue2")+
  geom_line(data=data,aes(x=Hours,y=upr),linetype=2,color="deepskyblue2")
## `geom_smooth()` using formula = 'y ~ x'
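In practice, we usually want the interval for a new member of the population, as in the formula above. predict() accepts a newdata argument; the student below who studies 10 hours is hypothetical.

predict(sat.lm,newdata=data.frame(Hours=10),interval="prediction")
# fit is about 606.43, with the same bounds as the observed 10-hour cases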

Confidence Limits on \(\hat{y}\)

The confidence interval is the interval that, with 95% confidence, covers the mean of \(y\) given an \(x\) value; it concerns the conditional mean rather than a single new observation. The only difference from the prediction interval is that the standard error is computed as

\(s''_{y.x}=s_{y.x}\sqrt{\frac{1}{n}+\frac{(x_i-\bar{x})^2}{(n-1)s^2_x}}\).

Then, the confidence interval for an estimated \(\hat{y}\) is

\(CI(y)=\hat{y} \pm t_{\frac{\alpha}{2}}s''_{y.x}\).

According to this equation, the confidence interval should be narrower than the prediction interval, since the 1 under the square root is dropped. Again, we can use predict() to compute the confidence intervals.

cinterval.lm<-predict(sat.lm,interval=c("confidence"))
cinterval.lm
##         fit      lwr      upr
## 1  454.4708 419.3475 489.5940
## 2  581.1031 557.6464 604.5598
## 3  606.4296 582.9235 629.9356
## 4  707.7354 675.7176 739.7532
## 5  454.4708 419.3475 489.5940
## 6  530.4502 504.2856 556.6147
## 7  657.0825 630.6975 683.4675
## 8  910.3472 845.5831 975.1112
## 9  378.4913 331.5903 425.3924
## 10 429.1443 390.2942 467.9944
## 11 555.7766 531.3997 580.1535
## 12 631.7560 607.2370 656.2751
## 13 479.7972 448.1067 511.4877
## 14 505.1237 476.4659 533.7815
## 15 606.4296 582.9235 629.9356
## 16 631.7560 607.2370 656.2751
## 17 758.3884 719.1526 797.6241
## 18 682.4090 653.4696 711.3483
## 19 682.4090 653.4696 711.3483
## 20 606.4296 582.9235 629.9356
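The same hand check works here with the formula for \(s''_{y.x}\), reusing n, x, and s.yx from the earlier sketch.

se.conf<-s.yx*sqrt(1/n+(x[1]-mean(x))^2/((n-1)*var(x)))
fitted(sat.lm)[1]+c(-1,1)*qt(0.975,df=n-2)*se.conf
# should reproduce the lwr and upr of row 1: 419.3475 and 489.5940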

Now we add the upper and lower bounds of the confidence interval to the above figure.

cdata<-data.frame(Hours=sat.dta$Hours,lwr=cinterval.lm[,2],upr=cinterval.lm[,3])
fg+geom_line(data=data,aes(x=Hours,y=lwr),linetype=2,color="deepskyblue2")+
  geom_line(data=data,aes(x=Hours,y=upr),linetype=2,color="deepskyblue2")+
  geom_line(data=cdata,aes(x=Hours,y=upr),linetype=5,color="tomato")+
  geom_line(data=cdata,aes(x=Hours,y=lwr),linetype=5,color="tomato")
## `geom_smooth()` using formula = 'y ~ x'

In fact, we can use geom_smooth() to show the confidence interval directly, because drawing the confidence band is its default setting (se=TRUE). See the figure below.

ggplot(sat.dta,aes(x=Hours,y=SAT))+
  geom_point(color="deepskyblue3")+
  geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
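The width of the band can be adjusted with the level argument of geom_smooth() (the default is level=0.95); for example, a 99% confidence band:

ggplot(sat.dta,aes(x=Hours,y=SAT))+
  geom_point(color="deepskyblue3")+
  geom_smooth(method="lm",level=0.99)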

Simple Linear Regression and Independent Groups t Test

The independent groups \(t\) test is used to check the difference between two sample means. If the difference is significantly different from 0, then \(H_0\) is rejected, suggesting an effect of the experimental treatment. On the other hand, regression analysis is used to check whether a predictor can predict a dependent variable. At first glance, these two tests look totally different. However, the independent groups \(t\) test can actually be viewed as a special case of simple linear regression. Consider the example below, in which a data set contains 7 participants’ heights and their genders. We are going to use this data set to describe the relationship between the independent groups \(t\) test and simple linear regression.

dta<-data.frame(gender=c(1,1,1,0,0,0,0),
                height=c(72.71,70.98,71.35,
                         62.87,64.14,62.48,
                         64.73))
dta
##   gender height
## 1      1  72.71
## 2      1  70.98
## 3      1  71.35
## 4      0  62.87
## 5      0  64.14
## 6      0  62.48
## 7      0  64.73

As usual, we first plot the group means. This time, however, we treat the gender variable as a continuous variable with only two levels, 0 and 1.

means<-with(dta,tapply(height,gender,mean))
ses<-with(dta,tapply(height,gender,sd))/sqrt(table(dta$gender))
library(ggplot2)
ggplot(data.frame(means=means,ses=ses,gender=c(0,1)),aes(x=gender,y=means))+
  geom_point(color="tomato")+geom_line(color="deepskyblue3")+
  geom_errorbar(aes(ymin=means-ses,ymax=means+ses),width=0.05,color="tomato")

The mean height of females is 8.125 less than that of males, a difference significantly different from 0, \(t(5)=-10.63\), \(p<.001\). The standard error for this \(t\) test is about 0.7643.

means[2]-means[1]
##     1 
## 8.125
t.test(height~as.factor(gender),var.equal=T,dta)
## 
##  Two Sample t-test
## 
## data:  height by as.factor(gender)
## t = -10.63, df = 5, p-value = 0.0001274
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -10.089786  -6.160214
## sample estimates:
## mean in group 0 mean in group 1 
##          63.555          71.680
(means[1]-means[2])/(-10.63)
##         0 
## 0.7643462

Although not common, we can run a regression analysis on this data set with height as the dependent variable and gender as the predictor. In the output below, the estimated slope is 8.125, the same as the difference between the two mean heights. This is not a coincidence! In fact, in this case, the slope of the simple linear regression line is exactly the difference between the two sample means.

m1<-lm(height~gender,dta)
summary(m1)
## 
## Call:
## lm(formula = height ~ gender, data = dta)
## 
## Residuals:
##      1      2      3      4      5      6      7 
##  1.030 -0.700 -0.330 -0.685  0.585 -1.075  1.175 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.5550     0.5004  127.02 5.74e-10 ***
## gender        8.1250     0.7643   10.63 0.000127 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.001 on 5 degrees of freedom
## Multiple R-squared:  0.9576, Adjusted R-squared:  0.9492 
## F-statistic:   113 on 1 and 5 DF,  p-value: 0.0001274

The geometric argument is as follows. First, the sample means always lie on the regression line. Given an \(x\) value \(x_j\), the predicted value of \(y\) is \(\hat{y}_j\), which is the mean of the distribution of \(y_{ij}\): since \(\hat{y}_j=b_0+b_1x_j\) and \(y_{ij}=\hat{y}_j+\epsilon_{ij}\), and the least-squares residuals within each group sum to zero when the predictor takes only two values, \(\bar{y}_j=\hat{y}_j\). In other words, the line linking the two group means is exactly the regression line with gender as the predictor and height as the dependent variable. Second, the difference between the two means can be viewed as the increase in height when gender moves from 0 to 1. By the definition of slope, \(\frac{\Delta y}{\Delta x}=\frac{8.125}{1-0}=8.125\). This is the slope estimated by the regression analysis, and it is also the difference between the two means.
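We can verify the first point directly with the fitted model m1: the predicted values at gender = 0 and gender = 1 are exactly the two group means.

predict(m1,newdata=data.frame(gender=c(0,1)))
# returns 63.555 and 71.680, identical to the group means computed earlier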

In addition, the \(t\) value for testing the slope is 10.63, which is the \(t\) value in the independent groups \(t\) test except for the sign. The sign differs because in the \(t\) test above the numerator is \(\bar{x}_{female}-\bar{x}_{male}\) (group 0 minus group 1), whereas the slope corresponds to \(\bar{x}_{male}-\bar{x}_{female}\).

Whether the slope is statistically different from 0 is examined by \(t=\frac{b-0}{s_b}\). In fact, this \(t\) value is the same as \(t=\frac{\bar{x}_{m}-\bar{x}_f}{\sqrt{\frac{s^2_p}{n_1}+\frac{s^2_p}{n_2}}}\). First, the numerators have already been shown to be the same. Second, \(s^2_p\) can be derived from \(s^2_{y.x}\), where the pooled variance is \(s^2_p=\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2}\).

\(s^2_{y.x}=\frac{\sum(y-\hat{y})^2}{n-2}=\frac{\sum_i(y_{i1}-\hat{y}_1)^2+\sum_i(y_{i2}-\hat{y}_2)^2}{n-2}=\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n-2}\)

Since \(n_1+n_2=n\), \(s^2_{y.x}=\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2}=s^2_p\). Finally, with gender coded as 0 and 1, \((n-1)s^2_x=\frac{n_1n_2}{n}\), so \(s_b=\frac{s_{y.x}}{\sqrt{(n-1)s^2_x}}=s_{y.x}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}=\sqrt{\frac{s^2_p}{n_1}+\frac{s^2_p}{n_2}}\), and the two \(t\) values coincide.

The following code computes the pooled SD for the independent groups \(t\) test and the standard error of estimate for the regression analysis. As can be seen, they are indeed the same.

sub<-with(dta,tapply(height,gender,length))
sds<-with(dta,tapply(height,gender,sd))
# pooled SD
sqrt(sum(sds^2*(sub-1))/(sum(sub)-2))
## [1] 1.00075
# standard error of estimate
sqrt(sum(m1$residuals^2)/(nrow(dta)-2))
## [1] 1.00075
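To complete the check, the standard error of the slope should equal the standard error of the mean difference in the \(t\) test. Here we multiply the pooled SD by \(\sqrt{1/n_1+1/n_2}\) and compare it with the slope's SE extracted from summary(m1).

# SE of the mean difference in the t test
sqrt(sum(sds^2*(sub-1))/(sum(sub)-2))*sqrt(sum(1/sub))
# SE of the slope in the regression table: 0.7643
summary(m1)$coefficients["gender","Std. Error"]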