Polynomial regression

In mathematics, a polynomial is an expression consisting of variables (also called indeterminates) and coefficients that involves only the operations of addition, subtraction, multiplication, and non-negative integer exponents of the variables. An example of a polynomial in a single indeterminate, x, is \(x^2-4x+7\). A regression model can take a polynomial form as well. That is, the model contains terms of \(x\) up to the \(n^{th}\) power, such as \(\hat{y}=b_0+b_1x+b_2x^2\) for \(n=2\).
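
As a quick illustration of how such a model is specified in R, the sketch below fits a quadratic model to simulated data (the variable names and numbers here are purely illustrative, not part of the examples that follow):

set.seed(1)
x<-runif(100,0,10)              # simulated predictor
y<-2+3*x-0.5*x^2+rnorm(100)     # simulated response with a quadratic trend
m<-lm(y~x+I(x^2))               # I() keeps ^ from being parsed as a formula operator
coef(m)                         # estimates of b0, b1, b2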

Numerical Example I

In 1981, 78 bluegill fish were randomly sampled from Lake Mary in Minnesota. The researchers Cook and Weisberg (1999) collected and recorded the data (see here). In this data set, one column is the age of each fish and the other is its length (in mm). The question here is whether a bluegill fish’s length can be predicted by its age.

url<-"https://online.stat.psu.edu/onlinecourses/sites/stat501/files/data/bluegills.txt"
dta<-read.table(url,header=T,sep="")
library(ggplot2)
ggplot(dta,aes(age,length))+
  geom_point()

The scatter plot suggests a linear relationship between age and length, with \(r=.86\), \(p<.01\). The linear regression model \(\hat{length}=b_0+b_1Age\) accordingly has a large effect size, with \(R^2=.73\).

with(dta,cor.test(age,length))
## 
##  Pearson's product-moment correlation
## 
## data:  age and length
## t = 14.514, df = 76, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7843482 0.9067980
## sample estimates:
##       cor 
## 0.8572527
fish.m1<-lm(length~age,dta)
summary(fish.m1)
## 
## Call:
## lm(formula = length ~ age, data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.523  -7.586   0.258  10.102  20.414 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   62.649      5.755   10.89   <2e-16 ***
## age           22.312      1.537   14.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.51 on 76 degrees of freedom
## Multiple R-squared:  0.7349, Adjusted R-squared:  0.7314 
## F-statistic: 210.7 on 1 and 76 DF,  p-value: < 2.2e-16

The residual analysis also shows no correlation between the model residuals and the age of the bluegill fish. However, the scatter of the residuals looks somewhat quadratic: the residuals for the very young and very old fish are strongly negative.

fish.r.m1<-data.frame(age=dta$age,residual=fish.m1$residual)
ggplot(fish.r.m1,aes(age,residual))+
  geom_point()

with(fish.r.m1,cor.test(residual,age))
## 
##  Pearson's product-moment correlation
## 
## data:  residual and age
## t = -4.764e-15, df = 76, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2225308  0.2225308
## sample estimates:
##           cor 
## -5.464649e-16

We can add \(Age^2\) as an additional predictor, so the model becomes \(\hat{length}=b_0+b_1Age+b_2Age^2\). The \(R^2\) of this new model is clearly larger than that of the previous one, and the comparison of the two nested models (via anova()) shows that \(Age^2\) should be included in the model for predicting the length of bluegill fish.

fish.m2<-lm(length~age+I(age^2),dta)
summary(fish.m2)
## 
## Call:
## lm(formula = length ~ age + I(age^2), data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.846  -8.321  -1.137   6.698  22.098 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   13.622     11.016   1.237     0.22    
## age           54.049      6.489   8.330 2.81e-12 ***
## I(age^2)      -4.719      0.944  -4.999 3.67e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.91 on 75 degrees of freedom
## Multiple R-squared:  0.8011, Adjusted R-squared:  0.7958 
## F-statistic: 151.1 on 2 and 75 DF,  p-value: < 2.2e-16
anova(fish.m1,fish.m2)
## Analysis of Variance Table
## 
## Model 1: length ~ age
## Model 2: length ~ age + I(age^2)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     76 11892.8                                  
## 2     75  8920.7  1    2972.1 24.988 3.675e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can compare the predictions of these two models. Clearly, the second model predicts the lengths of the youngest and oldest fish better than the first one.

dta$m1<-fish.m1$fitted.values   # fitted values of the linear model
dta$m2<-fish.m2$fitted.values   # fitted values of the quadratic model
ggplot(dta,aes(age,length))+
  geom_point()+
  geom_line(aes(age,m1),col="blue")+    # linear fit
  geom_line(aes(age,m2),col="tomato")   # quadratic fit
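
To see the difference numerically, we can also ask both models for predicted lengths at a few ages with predict(); the ages below are chosen arbitrarily, just for illustration.

new.ages<-data.frame(age=c(1,3,6))
predict(fish.m1,newdata=new.ages)   # predictions from the linear model
predict(fish.m2,newdata=new.ages)   # predictions from the quadratic model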

Numerical Example II

The code below imports a data set from a web page. This data set contains 13 alloys. Although we do not know what loss actually means here, that does not prevent us from using it as an example for practicing polynomial regression.

url<-"http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5410/data/corrosion.data"
dta<-read.table(url,header=T,sep="")
dta
##      Fe  loss
## 1  0.01 127.6
## 2  0.48 124.0
## 3  0.71 110.8
## 4  0.95 103.9
## 5  1.19 101.5
## 6  0.01 130.1
## 7  0.48 122.0
## 8  1.44  92.3
## 9  0.71 113.1
## 10 1.96  83.7
## 11 0.01 128.0
## 12 1.44  91.4
## 13 1.96  86.2
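
Before fitting any model, a quick scatter plot of loss against Fe is useful (ggplot2 was already loaded in the previous example):

ggplot(dta,aes(Fe,loss))+
  geom_point()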

With \(Fe\) as the predictor, the linear regression model performs quite well: the \(R^2\) is very high, so only a small part of the overall variance is left unexplained, and the model residuals are not correlated with the predictor. However, if you look at the scatter plot more carefully, a cubic curve seems to follow the points even more closely.

g1<-lm(loss~Fe,dta)
summary(g1)
## 
## Call:
## lm(formula = loss ~ Fe, data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7980 -1.9464  0.2971  0.9924  5.7429 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  129.787      1.403   92.52  < 2e-16 ***
## Fe           -24.020      1.280  -18.77 1.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.058 on 11 degrees of freedom
## Multiple R-squared:  0.9697, Adjusted R-squared:  0.967 
## F-statistic: 352.3 on 1 and 11 DF,  p-value: 1.055e-09
dta$residuals<-g1$residuals
ggplot(dta,aes(x=Fe,y=residuals))+
  geom_point()

with(dta,cor.test(Fe,residuals))
## 
##  Pearson's product-moment correlation
## 
## data:  Fe and residuals
## t = -1.2157e-16, df = 11, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5509853  0.5509853
## sample estimates:
##           cor 
## -3.665407e-17

Thus, we can try adding higher powers of \(x\) to the model to improve its performance. First, we fit a model with the predictor up to the second power. Although this model has a slightly larger \(R^2\), the improvement in fit is not significant. We can then add a term of \(x\) to the third power. This time the \(R^2\) is clearly higher, and the model fits significantly better than the first one.

g2<-lm(loss~Fe+I(Fe^2),dta)
summary(g2)
## 
## Call:
## lm(formula = loss ~ Fe + I(Fe^2), data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5580 -2.4578  0.0422  0.8142  5.9997 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  130.320      1.771  73.587 5.24e-15 ***
## Fe           -26.220      4.392  -5.970 0.000137 ***
## I(Fe^2)        1.155      2.198   0.526 0.610688    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.164 on 10 degrees of freedom
## Multiple R-squared:  0.9705, Adjusted R-squared:  0.9646 
## F-statistic: 164.7 on 2 and 10 DF,  p-value: 2.221e-08
anova(g1,g2)
## Analysis of Variance Table
## 
## Model 1: loss ~ Fe
## Model 2: loss ~ Fe + I(Fe^2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     11 102.85                           
## 2     10 100.09  1    2.7639 0.2762 0.6107
g3<-lm(loss~Fe+I(Fe^2)+I(Fe^3),dta)
summary(g3)
## 
## Call:
## lm(formula = loss ~ Fe + I(Fe^2) + I(Fe^3), data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2094 -1.2309 -0.3896  1.2691  3.3284 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  128.889      1.325  97.298 6.49e-15 ***
## Fe            -5.524      6.972  -0.792   0.4485    
## I(Fe^2)      -29.208      9.288  -3.145   0.0118 *  
## I(Fe^3)       10.523      3.173   3.316   0.0090 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.237 on 9 degrees of freedom
## Multiple R-squared:  0.9867, Adjusted R-squared:  0.9823 
## F-statistic: 223.2 on 3 and 9 DF,  p-value: 9.17e-09
anova(g1,g3)
## Analysis of Variance Table
## 
## Model 1: loss ~ Fe
## Model 2: loss ~ Fe + I(Fe^2) + I(Fe^3)
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     11 102.850                              
## 2      9  45.051  2      57.8 5.7734 0.02436 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
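
As in the bluegill example, we can overlay the fitted curves of the linear and the cubic models on the scatter plot to see where they differ; this is only a sketch mirroring the earlier plotting code.

dta$m1<-g1$fitted.values            # fitted values of the linear model
dta$m3<-g3$fitted.values            # fitted values of the cubic model
ggplot(dta,aes(Fe,loss))+
  geom_point()+
  geom_line(aes(Fe,m1),col="blue")+
  geom_line(aes(Fe,m3),col="tomato")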

We can check the correlation between the residuals of this model and the predictors. Visual inspection of the plots below suggests that the predictors are not correlated with the residuals of the cubic model. In principle, you can add as many polynomial terms as you want. However, do not forget the principle of Occam’s razor: we seek a model with as few parameters as possible that accounts for as much of the data as possible. Therefore, the model with predictors up to \(x^3\) is good enough.

library(gridExtra)
dta$residuals3<-g3$residuals
fig1<-ggplot(dta,aes(x=Fe,y=residuals3))+
        geom_point()
fig2<-ggplot(dta,aes(x=I(Fe^2),y=residuals3))+
        geom_point()
fig3<-ggplot(dta,aes(x=I(Fe^3),y=residuals3))+
        geom_point()
grid.arrange(fig1,fig2,fig3,ncol=3)
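
In the spirit of Occam’s razor mentioned above, the three models can also be compared more formally. The sketch below uses AIC (lower values favor a better trade-off between fit and complexity) and a direct nested-model test of the cubic term:

AIC(g1,g2,g3)   # compare the fit/complexity trade-off across the three models
anova(g2,g3)    # does the cubic term improve significantly on the quadratic model?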

When you go back and check the significance of the parameter for \(x^2\), you will find something odd. In the summary table for the object g2 the parameter for \(x^2\) is not significant, whereas in the summary table for the object g3 it is. Why? This happens because of collinearity.

Collinearity

According to Wikipedia, “collinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Collinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors”.

In the alloy example, \(Fe\) is highly correlated with \(Fe^2\), \(r=.95\), and with \(Fe^3\), \(r=.89\). Thus, the predictors in the model \(\hat{y}=b_0+b_1x+b_2x^2+b_3x^3\) are correlated with one another; this is a case of collinearity. It violates the assumption of regression analysis that the predictors should be independent of each other. Normally we would want to avoid this situation, because the individual parameter estimates become unreliable. However, it does little harm if our goal is simply to build the best predictive model and we are not concerned with the individual predictors.

with(dta,cor.test(Fe,I(Fe^2)))
## 
##  Pearson's product-moment correlation
## 
## data:  Fe and I(Fe^2)
## t = 10.489, df = 11, p-value = 4.579e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8479560 0.9863029
## sample estimates:
##       cor 
## 0.9534696
with(dta,cor.test(Fe,I(Fe^3)))
## 
##  Pearson's product-moment correlation
## 
## data:  Fe and I(Fe^3)
## t = 6.6406, df = 11, p-value = 3.657e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6776908 0.9683073
## sample estimates:
##       cor 
## 0.8946252
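
A common way to avoid this collinearity among the power terms is to use orthogonal polynomials via poly(). The sketch below refits the cubic model this way; the fitted values, \(R^2\), and overall F statistic are identical to those of g3, but the constructed polynomial columns are uncorrelated, so the lower-order coefficients no longer change when higher-order terms are added.

g3.orth<-lm(loss~poly(Fe,3),dta)   # orthogonal polynomial terms of degree 1 to 3
summary(g3.orth)                   # same R-squared and F statistic as g3
cor(poly(dta$Fe,3))                # the constructed columns are (nearly) uncorrelated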

Usually, we test for collinearity between the predictors with the variance inflation factor (VIF). The VIF of predictor \(j\) is \(1/(1-R_j^2)\), where \(R_j^2\) is the \(R^2\) obtained by regressing predictor \(j\) on all of the remaining predictors. As a rule of thumb, a variable whose VIF is greater than 4 (some use a cutoff between 5 and 10) may merit further investigation. In R, the package car provides the vif() function for computing it. See the data in peru.txt, which contain blood pressure and other measurements of Peruvian people. The meaning of each variable is listed below.

Systol = systolic blood pressure

Age = age

Years = years in urban area

Years/Age = fraction of life spent in the urban area (entered in the model as I(Years/Age))

Weight = weight (kg)

Height = height (mm)

Chin = chin skinfold

Forearm = forearm skinfold

Calf = calf skinfold

Pulse = resting pulse rate

Let’s import the data and test the model with all 9 predictors included.

dta2<-read.table("peru.txt",header=T,sep="")
m1<-lm(Systol~Age+Years+I(Years/Age)+Weight+Height+Chin+Forearm+Calf+Pulse,dta2)
summary(m1)
## 
## Call:
## lm(formula = Systol ~ Age + Years + I(Years/Age) + Weight + Height + 
##     Chin + Forearm + Calf + Pulse, data = dta2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3442  -6.3972   0.0507   5.7292  14.5257 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   146.81907   48.97096   2.998 0.005526 ** 
## Age            -1.12144    0.32741  -3.425 0.001855 ** 
## Years           2.45538    0.81458   3.014 0.005306 ** 
## I(Years/Age) -115.29395   30.16900  -3.822 0.000648 ***
## Weight          1.41393    0.43097   3.281 0.002697 ** 
## Height         -0.03464    0.03686  -0.940 0.355194    
## Chin           -0.94369    0.74097  -1.274 0.212923    
## Forearm        -1.17085    1.19329  -0.981 0.334612    
## Calf           -0.15867    0.53716  -0.295 0.769810    
## Pulse           0.11455    0.17043   0.672 0.506818    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.655 on 29 degrees of freedom
## Multiple R-squared:  0.6674, Adjusted R-squared:  0.5641 
## F-statistic: 6.465 on 9 and 29 DF,  p-value: 5.241e-05

Not every predictor is significant in this model. Some variables may also be correlated with others, so let’s check the collinearity. Clearly, Years and I(Years/Age) are collinear variables, given their very large VIFs. Weight, whose VIF exceeds 4, might also need more caution when being included in the model.

library(car)
## Loading required package: carData
vif(m1)
##          Age        Years I(Years/Age)       Weight       Height         Chin 
##     3.213372    34.289194    24.387468     4.747711     1.913991     2.063866 
##      Forearm         Calf        Pulse 
##     3.802313     2.414602     1.329233
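
As a rough check (a sketch, not part of the original analysis), the VIF of Years can be reproduced by hand from its definition \(1/(1-R_j^2)\): regress Years on all the other predictors and plug in the resulting \(R^2\).

X<-model.matrix(m1)[,-1]                                         # predictor columns, intercept dropped
r2<-summary(lm(X[,"Years"]~X[,colnames(X)!="Years"]))$r.squared  # R-squared of Years on the rest
1/(1-r2)                                                         # should match vif(m1)["Years"]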

Exercise for yourself

Try to analyze the mtcars data (built into R via the datasets package) with mpg as the dependent variable and see which model best accommodates these data.
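
A possible starting point is sketched below (only one of many defensible specifications; car was already loaded above for vif()):

data(mtcars)                     # mtcars ships with base R's datasets package
m.full<-lm(mpg~.,data=mtcars)    # full model with all predictors
summary(m.full)
vif(m.full)                      # check for collinear predictors before trimming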