Lack of Fit

In previous teaching notes, we introduced two approaches to evaluating how good a regression model is: the standard error of estimate \(s_{y.x}\) and \(r^2\). The former measures the error the model makes in predicting the dependent variable, whereas the latter measures the proportion of the total variance of the dependent variable explained by the model. Thus, for the former, the smaller the better; for the latter, the larger the better. When we have more than one candidate predictor for a regression model, deciding which variables to include is an important concern. One straightforward idea is to compare the goodness of fit of the models with and without a given variable as a predictor. The goodness of fit is an index of how well the model fits the data, that is, how much variance of the dependent variable the model can explain. Correspondingly, the lack of fit refers to the decrease in goodness of fit when some variables are dropped from the model. We can also test the lack of fit with a statistical significance test, with the null hypothesis that the lack of fit equals zero and the alternative hypothesis that it does not. Let’s see the following example.
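As a quick refresher, both indices can be read directly off a fitted lm object. Here is a minimal sketch with simulated data (all names are illustrative, not part of the examples below):

```r
# Minimal sketch with simulated data: extract s_{y.x} and r^2 from lm().
set.seed(1)
x<-rnorm(100)
y<-2+0.5*x+rnorm(100)
fit<-summary(lm(y~x))
s_yx<-fit$sigma      # standard error of estimate: the smaller, the better
r2<-fit$r.squared    # proportion of variance explained: the larger, the better
c(s_yx,r2)
```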

Numerical Example 1

Brown and Maritz (1982) reported a data set containing the modulus of rigidity, the modulus of elasticity, and the air-dried density of 50 samples of timber. With the air-dried density as the dependent variable, we have three candidate regression models according to the combinations of the moduli of rigidity and elasticity as predictors.

url<-"http://www.statsci.org/data/oz/timber.txt"
dta1<-read.table(url,header=T)
head(dta1)
##   Rigid Elast Dens
## 1  1000    99 25.3
## 2  1112   173 28.2
## 3  1033   188 28.6
## 4  1087   133 29.1
## 5  1069   146 30.7
## 6   925    91 31.4

As usual, we can compute the correlation coefficients between the variables. The function corr.test( ) in the package {psych} computes the correlation coefficient between each pair of variables. Apparently, these three variables are all positively correlated with one another.

library(psych)
corr.test(dta1)
## Call:corr.test(x = dta1)
## Correlation matrix 
##       Rigid Elast Dens
## Rigid  1.00  0.82 0.86
## Elast  0.82  1.00 0.74
## Dens   0.86  0.74 1.00
## Sample Size 
## [1] 50
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##       Rigid Elast Dens
## Rigid     0     0    0
## Elast     0     0    0
## Dens      0     0    0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

With the density as the dependent variable, the first model includes the modulus of rigidity as the sole predictor. Its \(r^2=.74\) is high, and all estimated coefficients differ significantly from 0. The second model includes both the moduli of rigidity and elasticity as predictors. It also performs well (\(r^2=.74\)), but the additional predictor brings virtually no improvement. Normally, a model with more parameters fits the data at least as well, so we would expect an additional predictor in a regression model to improve performance. However, that does not seem to be the case here.

m1<-lm(Dens~Rigid,dta1)
summary(m1)
## 
## Call:
## lm(formula = Dens ~ Rigid, data = dta1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0962  -4.0812  -0.8666   3.2035  17.0098 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.727357   3.640866   2.122    0.039 *  
## Rigid       0.024540   0.002113  11.611 1.53e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.21 on 48 degrees of freedom
## Multiple R-squared:  0.7374, Adjusted R-squared:  0.732 
## F-statistic: 134.8 on 1 and 48 DF,  p-value: 1.525e-15
m2<-lm(Dens~Rigid+Elast,dta1)
summary(m2)
## 
## Call:
## lm(formula = Dens ~ Rigid + Elast, data = dta1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.239 -4.326 -1.201  2.662 17.393 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.455052   3.948878   1.635    0.109    
## Rigid       0.021986   0.003688   5.961 3.08e-07 ***
## Elast       0.026357   0.031136   0.847    0.402    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.229 on 47 degrees of freedom
## Multiple R-squared:  0.7414, Adjusted R-squared:  0.7304 
## F-statistic: 67.37 on 2 and 47 DF,  p-value: 1.574e-14

Therefore, we need to evaluate the performance of these two models. Since both models predict the same dependent variable, we can compare the portions of the total sum of squares of density that they explain. For Model 1, \(SST=SSM_1+SSres_1\); for Model 2, \(SST=SSM_2+SSres_2\). Model 2 has one more predictor than Model 1, so necessarily \(SSM_1\leq SSM_2\), or equivalently \(SSres_1 \geq SSres_2\). The difference between the two models can thus be measured by \(SSres_1-SSres_2\), which is called the lack of fit. These three sums of squares are computed by the following code.

SSres1<-sum(m1$residuals^2)
SSres2<-sum(m2$residuals^2)
SSLoF<-SSres1-SSres2
c(SSres1,SSres2,SSLoF)
## [1] 1851.2796 1823.4781   27.8015

In order to test whether the lack of fit is large enough, we compare it with the residual of the model with more parameters. Here, Model 1 has one predictor and Model 2 has two, and the residual sum of squares of Model 1 decomposes as \(SSres_1=SSres_2+SS_{LoF}\). If \(SS_{LoF}\) is large relative to \(SSres_2\), Model 2 likely performs better than Model 1. As usual, we convert these \(SS\) terms to \(MS\) terms and use an \(F\) test on the ratio of \(MS_{LoF}\) to \(MSres_2\). An \(F\) value is a ratio of two variances, \(F=\frac{s^2_1}{s^2_2}\). Thus, when we want to check whether the variances of two samples are statistically the same, we use an \(F\) test with \(H_o:\sigma^2_1=\sigma^2_2\) (or \(\frac{\sigma^2_1}{\sigma^2_2}=1\)) vs. \(H_1:\sigma^2_1\neq \sigma^2_2\) (or \(\frac{\sigma^2_1}{\sigma^2_2}\neq1\)). Since a variance is itself a mean of squares, \(s^2_x=MS_x=\frac{SS_x}{df_x}\), the lack-of-fit sum of squares and the residual sum of squares of Model 2 are converted to mean squares for the \(F\) test. To this end, we need the degrees of freedom of these two terms. The \(df\) for \(SSres_2\) is \(50-3=47\), as 50 data points are fit by Model 2 with 3 parameters (one intercept and two regression coefficients). The \(df\) for the lack of fit is the difference between the residual \(df\) of Model 1 and that of Model 2. For Model 1, the residual \(df\) is \(50-2=48\); thus, the \(df\) for the lack of fit is \(48-47=1\). The following code checks the significance of the lack of fit. The result [\(F<1\), \(p=.40\)] clearly shows that the lack of fit is not significantly larger than the residual of Model 2. Therefore, it is not necessary to add the second predictor.

MSLoF<-SSLoF/1
MSres2<-SSres2/47
f<-MSLoF/MSres2
f
## [1] 0.7165814
1-pf(f,1,47)
## [1] 0.4015572
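The manual steps above generalize to any pair of nested models. A small helper function (hypothetical, written here for illustration only) makes the computation explicit:

```r
# Hypothetical helper: lack-of-fit F test for two nested lm fits,
# where `reduced` is nested within `full`.
lof_test<-function(reduced,full){
  ss_r<-sum(residuals(reduced)^2)   # SSres of the reduced model
  ss_f<-sum(residuals(full)^2)      # SSres of the full model
  df_r<-df.residual(reduced)
  df_f<-df.residual(full)
  f<-((ss_r-ss_f)/(df_r-df_f))/(ss_f/df_f)
  c(F=f,p=1-pf(f,df_r-df_f,df_f))
}
```

Calling lof_test(m1,m2) on the timber models reproduces the \(F\) and \(p\) values computed above.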

Alternatively, we can use the function anova( ) to do the same job. The question then is why adding a second predictor has no effect. A possible interpretation is that the high correlation between the modulus of rigidity and the modulus of elasticity means that either variable alone is sufficient to predict the air-dried density of timber. Thus, adding the second predictor does not improve the model's performance.

anova(m1,m2)
## Analysis of Variance Table
## 
## Model 1: Dens ~ Rigid
## Model 2: Dens ~ Rigid + Elast
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     48 1851.3                           
## 2     47 1823.5  1    27.802 0.7166 0.4016

Numerical Example 2

Allison and Cicchetti (1976) published a paper in Science reporting data on 62 species of mammals, including their brain and body weights, life spans, gestation times, sleep times, and predation and danger indices. The data can be found on the StatSci website. The variable descriptions are listed as follows.

Variable      Description
BodyWt        body weight (kg)
BrainWt       brain weight (g)
NonDreaming   slow wave (“nondreaming”) sleep (hrs/day)
Dreaming      paradoxical (“dreaming”) sleep (hrs/day)
TotalSleep    total sleep, sum of slow wave and paradoxical sleep (hrs/day)
LifeSpan      maximum life span (years)
Gestation     gestation time (days)
Predation     predation index (1-5); 1 = minimum (least likely to be preyed upon), 5 = maximum (most likely to be preyed upon)
Exposure      sleep exposure index (1-5); 1 = least exposed (e.g. animal sleeps in a well-protected den), 5 = most exposed
Danger        overall danger index (1-5), based on the above two indices and other information; 1 = least danger (from other animals), 5 = most danger (from other animals)
url<-"http://www.statsci.org/data/general/sleep.txt"
dta2<-read.table(url,header=T)
dim(dta2)
## [1] 62 11

Suppose we are interested in what variables affect the lifespan of mammals. We first check the correlation coefficients between lifespan and the variables that could potentially affect it.

# As there are NA's in the data, we need to remove those rows first
CheckNA<-sapply(1:nrow(dta2),function(x){
  if(sum(is.na(dta2[x,]))>0){
    return(1)
  }else{
    return(0)
  }
})
dta3<-dta2[-which(CheckNA==1),]
with(dta3,cor.test(LifeSpan,BodyWt))
## 
##  Pearson's product-moment correlation
## 
## data:  LifeSpan and BodyWt
## t = 3.366, df = 40, p-value = 0.001694
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1935241 0.6770714
## sample estimates:
##       cor 
## 0.4698215
with(dta3,cor.test(LifeSpan,BrainWt))
## 
##  Pearson's product-moment correlation
## 
## data:  LifeSpan and BrainWt
## t = 5.1225, df = 40, p-value = 8.008e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4024416 0.7834531
## sample estimates:
##       cor 
## 0.6293894
with(dta3,cor.test(LifeSpan,TotalSleep))
## 
##  Pearson's product-moment correlation
## 
## data:  LifeSpan and TotalSleep
## t = -2.6178, df = 40, p-value = 0.01243
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.61490284 -0.08884152
## sample estimates:
##        cor 
## -0.3824462
with(dta3,cor.test(LifeSpan,Gestation))
## 
##  Pearson's product-moment correlation
## 
## data:  LifeSpan and Gestation
## t = 5.3579, df = 40, p-value = 3.762e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4261846 0.7942779
## sample estimates:
##       cor 
## 0.6463887
with(dta3,cor.test(LifeSpan,Exposure))
## 
##  Pearson's product-moment correlation
## 
## data:  LifeSpan and Exposure
## t = 2.1046, df = 40, p-value = 0.04166
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01306839 0.56541678
## sample estimates:
##       cor 
## 0.3157456
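As a side note, the row-wise NA screening above can be written more compactly with the built-in function complete.cases( ). This is a sketch with a toy data frame, not part of the original analysis:

```r
# complete.cases() returns TRUE for rows with no missing values,
# so a one-line filter replaces the sapply() loop above.
toy<-data.frame(a=c(1,NA,3),b=c(4,5,NA))
toy[complete.cases(toy),]
```

Applied to the sleep data, dta2[complete.cases(dta2),] yields the same rows as dta3.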

Now we can develop a regression model to explore the factors that influence mammalian lifespan. According to these correlation coefficients, Gestation enters the model first. With Gestation as the only predictor, \(r^2=.42\). We then gradually add predictors and check whether the model's performance improves.

m1<-lm(LifeSpan~Gestation,dta3)
summary(m1)
## 
## Call:
## lm(formula = LifeSpan ~ Gestation, data = dta3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.108  -7.067  -3.542   4.694  66.590 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.06229    3.46419   1.750   0.0878 .  
## Gestation    0.10242    0.01912   5.358 3.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.65 on 40 degrees of freedom
## Multiple R-squared:  0.4178, Adjusted R-squared:  0.4033 
## F-statistic: 28.71 on 1 and 40 DF,  p-value: 3.762e-06
m2<-lm(LifeSpan~Gestation+BrainWt,dta3)
summary(m2)
## 
## Call:
## lm(formula = LifeSpan ~ Gestation + BrainWt, data = dta3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.405  -7.735  -5.020   6.318  61.713 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 9.106659   3.690459   2.468   0.0181 *
## Gestation   0.063362   0.027179   2.331   0.0250 *
## BrainWt     0.009290   0.004741   1.959   0.0572 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.12 on 39 degrees of freedom
## Multiple R-squared:   0.47,  Adjusted R-squared:  0.4428 
## F-statistic: 17.29 on 2 and 39 DF,  p-value: 4.203e-06
m3<-lm(LifeSpan~Gestation+BrainWt+TotalSleep,dta3)
summary(m3)
## 
## Call:
## lm(formula = LifeSpan ~ Gestation + BrainWt + TotalSleep, data = dta3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.270  -8.230  -4.261   5.734  61.651 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 11.499912  10.006872   1.149   0.2577  
## Gestation    0.058392   0.033592   1.738   0.0903 .
## BrainWt      0.009558   0.004911   1.946   0.0590 .
## TotalSleep  -0.169706   0.658295  -0.258   0.7980  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.31 on 38 degrees of freedom
## Multiple R-squared:  0.4709, Adjusted R-squared:  0.4291 
## F-statistic: 11.27 on 3 and 38 DF,  p-value: 1.974e-05

However, \(r^2\) does not increase much. Thus, we can check the lack of fit between these models. The following lack-of-fit analyses suggest that the regression model with Gestation alone fits the data well.

anova(m1,m2)
## Analysis of Variance Table
## 
## Model 1: LifeSpan ~ Gestation
## Model 2: LifeSpan ~ Gestation + BrainWt
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     40 9794.9                              
## 2     39 8917.1  1    877.78 3.8391 0.05725 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m2,m3)
## Analysis of Variance Table
## 
## Model 1: LifeSpan ~ Gestation + BrainWt
## Model 2: LifeSpan ~ Gestation + BrainWt + TotalSleep
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     39 8917.1                           
## 2     38 8901.5  1    15.568 0.0665  0.798
anova(m1,m3)
## Analysis of Variance Table
## 
## Model 1: LifeSpan ~ Gestation
## Model 2: LifeSpan ~ Gestation + BrainWt + TotalSleep
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     40 9794.9                           
## 2     38 8901.5  2    893.35 1.9068 0.1625

Of course, we can try other combinations of predictors. As the following attempts show, Model 1 is still the best model. Thus, a simple conclusion can be drawn: the longer the gestation, the longer the lifespan of the mammal.

m4<-lm(LifeSpan~Gestation+BodyWt,dta3)
summary(m4)
## 
## Call:
## lm(formula = LifeSpan ~ Gestation + BodyWt, data = dta3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.850  -7.165  -3.487   4.685  66.878 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.221121   3.889901   1.599 0.117825    
## Gestation   0.100558   0.027662   3.635 0.000801 ***
## BodyWt      0.000831   0.008795   0.094 0.925208    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.85 on 39 degrees of freedom
## Multiple R-squared:  0.418,  Adjusted R-squared:  0.3881 
## F-statistic:    14 on 2 and 39 DF,  p-value: 2.61e-05
anova(m1,m4)
## Analysis of Variance Table
## 
## Model 1: LifeSpan ~ Gestation
## Model 2: LifeSpan ~ Gestation + BodyWt
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     40 9794.9                           
## 2     39 9792.6  1    2.2416 0.0089 0.9252
m5<-lm(LifeSpan~Gestation+Exposure,dta3)
summary(m5)
## 
## Call:
## lm(formula = LifeSpan ~ Gestation + Exposure, data = dta3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.203  -7.820  -3.284   4.400  64.097 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.65494    4.53510   1.688   0.0994 .  
## Gestation    0.10986    0.02354   4.667 3.56e-05 ***
## Exposure    -1.08575    1.97040  -0.551   0.5848    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.79 on 39 degrees of freedom
## Multiple R-squared:  0.4223, Adjusted R-squared:  0.3927 
## F-statistic: 14.26 on 2 and 39 DF,  p-value: 2.254e-05
anova(m1,m5)
## Analysis of Variance Table
## 
## Model 1: LifeSpan ~ Gestation
## Model 2: LifeSpan ~ Gestation + Exposure
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     40 9794.9                           
## 2     39 9719.2  1    75.669 0.3036 0.5848