In inferential statistics, we infer the parameters of a population based on sample statistics. For instance, we make an inference about the population mean by estimating the region in which the population mean is located with 95% confidence. Similarly, we can estimate the region covering the predictions of the regression model for the population relationship between \(x\) and \(y\). This is done by estimating the 95% confidence interval for \(\hat{y_j}\) at each \(x_j\). See the left panel of the below figure. For each \(x_j\), the \(y_{j.i}\) values follow a normal distribution: given \(y_{j.i}=\hat{y_j}+\epsilon_{j.i}\) and \(\epsilon_{j.i}\sim ND(0,\sigma_\epsilon)\), it follows that \(y_{j.i}\sim ND(\hat{y_j},\sigma_\epsilon)\). The 95% confidence interval for the population prediction at \(x_j\) can thus be estimated as \(\hat{y_j}\pm t(.025)SE\).

In the case of inferring the population mean, the standard error is \(SE=\sqrt{\sum(x-\bar{x})^2/(N-1)}/\sqrt{N}\). Similarly, in the case of regression analysis, the standard error is computed as \(s_{y.x}=\sqrt{\sum(y-\hat{y})^2/(N-2)}\). Note that the df here is \(N-2\), as two parameters are estimated in the regression model to predict \(\hat{y}\). To distinguish it from the conventional standard error, it is called the standard error of estimate. However, the width of the 95% confidence interval for the population regression line is not the same for all model predictions \(\hat{y}\). Suppose the population regression model is \(y_{j.i}=\beta_0+\beta_1x_j+\epsilon_{j.i}\) with \(\epsilon_{j.i}\sim ND(0,\sigma_\epsilon)\). Given \(\mu_y=\frac{1}{N}\sum\sum y_{j.i}\) (\(N=\) the number of all data points),

\[\begin{align*} \mu_y&=\frac{1}{N}\sum\sum(\beta_0+\beta_1x_j+\epsilon_{j.i})\\ &=\beta_0+\beta_1\frac{1}{N}\sum\sum x_j+\frac{1}{N}\sum\sum\epsilon_{j.i}\\ &=\beta_0+\beta_1\mu_x+\frac{1}{N}\sum\sum\epsilon_{j.i}\\ &=\beta_0+\beta_1\mu_x\\ &=\hat{\mu_y}. \end{align*}\]

Therefore, the population regression line must go through the point (\(\mu_x\), \(\mu_y\)). See the left panel in the above figure. This is the constraint for estimating the population regression line: no matter how many sample regression lines are drawn, this point should be their intersection. As a result, the estimation error is smallest around the population mean, and the farther away from the population mean, the larger the standard error of estimate becomes. The equation used to compute the standard error of estimate at a given \(x\) value is \(S''_{y.x}=S_{y.x}\sqrt{\frac{1}{N}+\frac{(x_{i}-\overline{x})^2}{(N-1)S^2_{x}}}\). Clearly, the width of the 95% confidence interval increases as \(x\) moves away from \(\bar{x}\), namely toward the two ends of the \(x\) range. In the right panel of the above figure, the blue dashed lines show the band of the 95% confidence intervals for all possible \(x\) values. This band is relatively narrow around the mean location of the data points.
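To make the formula concrete, below is a minimal sketch that computes the band by hand and checks it against R's predict() with interval="confidence". The variables x and y here are simulated purely for illustration.

set.seed(1)
x<-rnorm(50,mean=10,sd=2)                # a hypothetical predictor
y<-3+2*x+rnorm(50,sd=1.5)                # a hypothetical response
m<-lm(y~x)
N<-length(x)
s.yx<-sqrt(sum(resid(m)^2)/(N-2))        # standard error of estimate
se.fit<-s.yx*sqrt(1/N+(x-mean(x))^2/((N-1)*var(x)))
upper<-fitted(m)+qt(.975,df=N-2)*se.fit  # upper bound of the 95% CI
lower<-fitted(m)-qt(.975,df=N-2)*se.fit  # lower bound of the 95% CI
head(cbind(manual.lwr=lower,manual.upr=upper))
head(predict(m,interval="confidence")[,c("lwr","upr")])

The two sets of bounds should agree, confirming that the confidence band is essentially this formula applied over the whole range of \(x\).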

Example 1

In a study conducted by the US Army, different body parts of soldiers (e.g., arm length, foot length, etc.) were measured for the purpose of designing suitable uniforms. The file ansur_women.txt contains the data from 2,208 women. As shown by dim() below, there are 2,208 rows and 166 columns, namely 166 variables. Let’s import the data first.

women.dta<-read.table("ansur_women.txt",header=T,sep="")
dim(women.dta)
## [1] 2208  166

Describe Data

Suppose we are interested in the relationship between the lengths of hands and feet. We can run a correlation analysis or even a regression analysis to model it. Before that, let’s do some descriptive statistics to understand our data. First, we can make histograms for these two variables together with Q-Q plots to check whether they are close to normally distributed. The below figure suggests that the two variables are quite normally distributed.

library(ggplot2)
library(gridExtra)
fg.foot<-ggplot(data=women.dta,aes(x=FOOT_LNTH))+
           geom_histogram(aes(y=after_stat(density)),bins=10,col="blue",fill="blue",alpha=0.2)+
           geom_density()
fg.hand<-ggplot(data=women.dta,aes(x=HAND_LNTH))+
           geom_histogram(aes(y=after_stat(density)),bins=10,col="red",fill="red",alpha=0.2)+
           geom_density()
qq.foot<-ggplot(data=women.dta,aes(sample=FOOT_LNTH))+
           stat_qq(col="blue",alpha=0.2)+stat_qq_line()
qq.hand<-ggplot(data=women.dta,aes(sample=HAND_LNTH))+
           stat_qq(col="red",alpha=0.2)+stat_qq_line()
grid.arrange(fg.foot,qq.foot,fg.hand,qq.hand,ncol=2)

Of course, we can get the descriptive statistics of these two variables with summary(). For those who are interested, try to get the result with a single line of code; one possibility is shown below.
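One possible one-liner is to subset the data frame to the two columns of interest inside summary():

summary(women.dta[,c("HAND_LNTH","FOOT_LNTH")])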

Correlation Analysis

Subsequently, we can conduct the correlation analysis by making a scatter plot and by computing and testing the correlation coefficient \(r\). The scatter plot is produced by the code below.

fh.scatter<-ggplot(data=women.dta,aes(x=HAND_LNTH,y=FOOT_LNTH))+
              geom_point(shape=1,col="blue")
fh.scatter

The correlation coefficient can be computed with the following code. It is not surprising that the length of the hand is highly and positively correlated with the length of the foot (\(r=.82\)).

with(women.dta,cor.test(FOOT_LNTH,HAND_LNTH))
## 
##  Pearson's product-moment correlation
## 
## data:  FOOT_LNTH and HAND_LNTH
## t = 68.503, df = 2206, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8109465 0.8376564
## sample estimates:
##       cor 
## 0.8247609

Regression Analysis

Now we know that these two variables are both approximately normally distributed and highly correlated. We can estimate a linear regression model predicting foot length from hand length. The intercept and slope are both significantly different from 0.

m1<-lm(FOOT_LNTH~HAND_LNTH,women.dta)
summary(m1)
## 
## Call:
## lm(formula = FOOT_LNTH ~ HAND_LNTH, data = women.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3686  -4.5454  -0.0774   4.6730  26.8409 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.42137    2.74780   20.53   <2e-16 ***
## HAND_LNTH    1.04160    0.01521   68.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.914 on 2206 degrees of freedom
## Multiple R-squared:  0.6802, Adjusted R-squared:  0.6801 
## F-statistic:  4693 on 1 and 2206 DF,  p-value: < 2.2e-16
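With the fitted model we can also generate predictions for new observations. The hand length of 180 below is a hypothetical value chosen only for illustration, in the same unit as the data.

predict(m1,newdata=data.frame(HAND_LNTH=180))  # predicted foot length
coef(m1)[1]+coef(m1)[2]*180                    # the same computed by hand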

We can add the regression line to the scatter plot. Since the default for the argument se in stat_smooth() is TRUE, which adds the 95% confidence band for the population line, we do not need to set se explicitly. However, the figure shows hardly any visible band like the one depicted in the above figure. This is because the standard error of estimate is small and the number of observations is large, so the band is extremely narrow.

fh.scatter<-fh.scatter+stat_smooth(method="lm",col="black")
fh.scatter
## `geom_smooth()` using formula = 'y ~ x'
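As a quick numeric check of how narrow the band is, we can ask predict() for the 95% confidence interval at the smallest, mean, and largest hand lengths in the sample and inspect the half-widths.

hl<-women.dta$HAND_LNTH
new.hand<-data.frame(HAND_LNTH=c(min(hl),mean(hl),max(hl)))
ci<-predict(m1,newdata=new.hand,interval="confidence")
cbind(new.hand,half.width=(ci[,"upr"]-ci[,"lwr"])/2)  # widest at the two ends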

Residual Analysis

According to the assumptions of regression analysis, the \(x\) values should be independent of \(\epsilon\). We can check this with the data. In the object returned by lm(), the component residuals stores the residuals \(\epsilon\) for all \(y\) values. Running a correlation test between hand length and the residuals shows that the correlation coefficient is essentially \(r=0\).

cor.test(m1$residuals,women.dta$HAND_LNTH)
## 
##  Pearson's product-moment correlation
## 
## data:  m1$residuals and women.dta$HAND_LNTH
## t = 1.6685e-14, df = 2206, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04171495  0.04171495
## sample estimates:
##          cor 
## 3.552352e-16

Also, we can make a scatter plot of \(x\) against \(\epsilon\). The point cloud shows no systematic trend, namely no correlation.

xres<-ggplot(data=data.frame(HAND=women.dta$HAND_LNTH,Residual=m1$residuals),
             aes(x=HAND,y=Residual))+
        geom_point(shape=1,col="deepskyblue2")
xres
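Base R provides ready-made diagnostic plots for an lm object as an alternative; the first one plots residuals against fitted values, which serves the same purpose as the figure above.

plot(m1,which=1)  # residuals vs. fitted values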

Example 2

Ten people’s body weights at ages 8 and 20 are stored in bodyweight.txt. Whether a person’s body weight at the early age can predict his/her weight at the later age is the issue to address here. First, import the data.

bw.dta<-read.table("bodyweight.txt",header=T)
bw.dta
##    Age8 Age20
## 1    31    55
## 2    25    44
## 3    29    55
## 4    20    51
## 5    40    85
## 6    32    57
## 7    27    59
## 8    33    60
## 9    28    66
## 10   21    49

Describe Data

In the data frame bw.dta, the first column records the body weights at age 8 and the second those at age 20. The Q-Q plots below show some outliers, especially at the extremes of the body-weight range.

bw8.fg<-ggplot(bw.dta,aes(sample=Age8))+
          stat_qq(col="blue")+stat_qq_line()
bw20.fg<-ggplot(bw.dta,aes(sample=Age20))+
          stat_qq(col="red")+stat_qq_line()
grid.arrange(bw8.fg,bw20.fg,nrow=2)
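With only ten observations, a Q-Q plot is hard to judge by eye, so as an optional supplementary check (not part of the original analysis) we can run the Shapiro-Wilk normality test on both variables.

shapiro.test(bw.dta$Age8)
shapiro.test(bw.dta$Age20)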

Correlation Analysis

The main concern is the relationship between the two variables in bw.dta. Thus, we make a scatter plot first. It suggests a positive correlation between the body weights at age 8 and age 20. The correlation coefficient \(r=.79\) is significantly larger than 0, \(p<.01\).

bw.fg<-ggplot(bw.dta,aes(x=Age8,y=Age20))+
         geom_point()
bw.fg

with(bw.dta,cor.test(Age8,Age20))
## 
##  Pearson's product-moment correlation
## 
## data:  Age8 and Age20
## t = 3.7248, df = 8, p-value = 0.005831
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3345328 0.9497788
## sample estimates:
##       cor 
## 0.7964106

Regression Analysis

Subsequently, we can describe the relationship between these two variables with a linear regression model, using the body weight at age 8 as the predictor of the body weight at age 20. It turns out that the slope of this regression line is significantly different from 0, whereas the intercept is not. This result is consistent with the inspection of the scatter plot as well as with the result of the correlation analysis.

m1<-lm(Age20~Age8,bw.dta)
summary(m1)
## 
## Call:
## lm(formula = Age20 ~ Age8, data = bw.dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6336 -5.8923 -0.6336  5.3014  9.5897 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  14.6724    11.8810   1.235  0.25189   
## Age8          1.5184     0.4077   3.725  0.00583 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.228 on 8 degrees of freedom
## Multiple R-squared:  0.6343, Adjusted R-squared:  0.5886 
## F-statistic: 13.87 on 1 and 8 DF,  p-value: 0.005831

Also, we can infer the band of the regression line for the population. As shown in the below figure, the band is formed by the 95% confidence intervals and is narrower around the mean point than at the two ends.

bw.fg<-bw.fg+stat_smooth(method="lm",col="red")
bw.fg
## `geom_smooth()` using formula = 'y ~ x'
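Because \(N\) is small here, the band is clearly visible. As a numeric check on the claim above, mirroring the check in Example 1, we can compare the interval widths at the smallest, mean, and largest Age8 values; the interval at the mean should be the narrowest.

a8<-bw.dta$Age8
new.age8<-data.frame(Age8=c(min(a8),mean(a8),max(a8)))
ci<-predict(m1,newdata=new.age8,interval="confidence")
cbind(new.age8,width=ci[,"upr"]-ci[,"lwr"])  # narrowest at the mean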

We can check the correlation between the predictor and the residuals. Obviously, it is not different from 0, consistent with the assumptions of regression analysis.

cor.test(bw.dta$Age8,m1$residuals)
## 
##  Pearson's product-moment correlation
## 
## data:  bw.dta$Age8 and m1$residuals
## t = -2.7719e-17, df = 8, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6296263  0.6296263
## sample estimates:
##           cor 
## -9.800189e-18

Exercise:

R comes with a built-in data frame called cars. It contains two columns, representing the car speed and the stopping distance, respectively. Please do a complete analysis of the relationship between car speed and stopping distance, including estimating the 95% confidence intervals of the population regression line.
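As a possible starting point (not a full solution), the data can be loaded and inspected like this:

data(cars)
str(cars)   # two columns: speed and dist
head(cars)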