In regression analysis, the linear model \(\hat{y}=b_0+b_1x_1+b_2x_2+\cdots+b_kx_k\) assumes that the effects of the predictors \(x_1\) through \(x_k\) on the dependent variable are additive. That is, each predictor explains a part of the variability of \(y\), and the sum of these parts is what the model explains. However, if the predictors are correlated with one another, their effects are not independent, and it becomes difficult to identify which predictors really explain the variability of \(y\). This situation is called multicollinearity. Multicollinearity is often regarded as a problem in regression analysis when we want to estimate the individual contribution of each predictor to the dependent variable.
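As a quick simulated sketch of this point (a toy example of my own, not part of the wage data below): when two predictors are nearly copies of each other, each slope gets a large standard error, even though together they predict \(y\) well.
set.seed(1)
x1<-rnorm(200)
x2<-x1+rnorm(200,sd=0.1)            # x2 is almost a copy of x1
y<-1+x1+x2+rnorm(200)
summary(lm(y~x1+x2))$coefficients   # both slopes have inflated standard errors
summary(lm(y~x1))$coefficients      # dropping x2, the remaining slope absorbs the joint effect and is precise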
The wage data on this web page contain a random sample of 534 persons from the CPS (Current Population Survey), with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence, and union membership. The goal is to examine whether wages are related to these characteristics. First, we need to retrieve the data from the web page. There are a number of ways to do this; I simply copied and pasted the data into a text editor and saved them as a text file.
dta<-read.table("wages.txt",header=T,sep="")
head(dta)
## EDUCATION SOUTH SEX EXPERIENCE UNION WAGE AGE RACE OCCUPATION SECTOR MARR
## 1 8 0 1 21 0 5.10 35 2 6 1 1
## 2 9 0 1 42 0 4.95 57 3 6 1 1
## 3 12 0 0 1 0 6.67 19 3 6 1 0
## 4 12 0 0 4 0 4.00 22 3 6 0 0
## 5 12 0 0 17 0 7.50 35 3 6 0 1
## 6 13 0 0 9 1 13.07 28 3 6 0 0
Variable names in order from left to right:
EDUCATION: Number of years of education.
SOUTH: Indicator variable for Southern Region (1=Person lives in South, 0=Person lives elsewhere).
SEX: Indicator variable for sex (1=Female, 0=Male).
EXPERIENCE: Number of years of work experience.
UNION: Indicator variable for union membership (1=Union member, 0=Not union member).
WAGE: Wage (dollars per hour).
AGE: Age (years).
RACE: Race (1=Other, 2=Hispanic, 3=White).
OCCUPATION: Occupational category (1=Management, 2=Sales, 3=Clerical, 4=Service, 5=Professional, 6=Other).
SECTOR: Sector (0=Other, 1=Manufacturing, 2=Construction).
MARR: Marital Status (0=Unmarried, 1=Married)
Before we regress WAGE on the other characteristics, we first check the distribution of WAGE. It has a long right tail, that is, a right-skewed distribution. We can transform WAGE to log(WAGE) to make the distribution closer to normal. See the figures below.
library(ggplot2)
library(gridExtra)
fg1<-ggplot(dta,aes(WAGE,y=..density..))+
geom_histogram(col="black",fill="grey",binwidth=5)+
geom_density(col="tomato")
fg2<-ggplot(dta,aes(log(WAGE),y=..density..))+
geom_histogram(col="black",fill="grey",binwidth=0.5)+
geom_density(col="tomato")
grid.arrange(fg1,fg2,ncol=2)
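The skewness claim can also be checked numerically; here is a minimal sketch using a small helper function skw( ) (defined here only for illustration), the standardized third moment.
skw<-function(v) mean((v-mean(v))^3)/sd(v)^3
skw(dta$WAGE)       # clearly positive: right-skewed
skw(log(dta$WAGE))  # much closer to 0 after the log transform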
Thus, we run the regression analysis with log(WAGE) as the dependent variable. The summary table shows \(R^2=.32\), which is not too bad. However, EDUCATION, EXPERIENCE, AGE, and OCCUPATION, which presumably should predict a person’s wage, are not statistically significant.
m1<-lm(log(WAGE)~.,dta)
summary(m1)
##
## Call:
## lm(formula = log(WAGE) ~ ., data = dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.16246 -0.29163 -0.00469 0.29981 1.98248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.078596 0.687514 1.569 0.117291
## EDUCATION 0.179366 0.110756 1.619 0.105949
## SOUTH -0.102360 0.042823 -2.390 0.017187 *
## SEX -0.221997 0.039907 -5.563 4.24e-08 ***
## EXPERIENCE 0.095822 0.110799 0.865 0.387531
## UNION 0.200483 0.052475 3.821 0.000149 ***
## AGE -0.085444 0.110730 -0.772 0.440671
## RACE 0.050406 0.028531 1.767 0.077865 .
## OCCUPATION -0.007417 0.013109 -0.566 0.571761
## SECTOR 0.091458 0.038736 2.361 0.018589 *
## MARR 0.076611 0.041931 1.827 0.068259 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4398 on 523 degrees of freedom
## Multiple R-squared: 0.3185, Adjusted R-squared: 0.3054
## F-statistic: 24.44 on 10 and 523 DF, p-value: < 2.2e-16
When the overall \(R^2\) is okay but the critical predictors are not significant, the possibility of multicollinearity should be taken into consideration. There are several ways to check whether multicollinearity occurs (e.g., see here for a detailed introduction). First, we can check the correlations between all variables in the data. The function corr.test( ) performs t tests for all the correlation coefficients at once. The results suggest multicollinearity. For example, although EDUCATION is correlated with WAGE, it is also correlated with other predictors such as AGE, EXPERIENCE, and OCCUPATION.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
corr.test(dta)
## Call:corr.test(x = dta)
## Correlation matrix
## EDUCATION SOUTH SEX EXPERIENCE UNION WAGE AGE RACE OCCUPATION
## EDUCATION 1.00 -0.14 0.00 -0.35 -0.02 0.38 -0.15 0.10 -0.21
## SOUTH -0.14 1.00 -0.02 -0.01 -0.09 -0.14 -0.04 -0.11 0.01
## SEX 0.00 -0.02 1.00 0.08 -0.16 -0.21 0.08 0.03 -0.22
## EXPERIENCE -0.35 -0.01 0.08 1.00 0.12 0.09 0.98 -0.02 -0.02
## UNION -0.02 -0.09 -0.16 0.12 1.00 0.16 0.12 -0.09 0.23
## WAGE 0.38 -0.14 -0.21 0.09 0.16 1.00 0.18 0.09 -0.05
## AGE -0.15 -0.04 0.08 0.98 0.12 0.18 1.00 0.00 -0.07
## RACE 0.10 -0.11 0.03 -0.02 -0.09 0.09 0.00 1.00 0.01
## OCCUPATION -0.21 0.01 -0.22 -0.02 0.23 -0.05 -0.07 0.01 1.00
## SECTOR -0.19 0.00 -0.17 0.11 0.10 0.05 0.08 0.00 0.36
## MARR -0.04 0.01 0.01 0.27 0.09 0.10 0.28 0.04 -0.01
## SECTOR MARR
## EDUCATION -0.19 -0.04
## SOUTH 0.00 0.01
## SEX -0.17 0.01
## EXPERIENCE 0.11 0.27
## UNION 0.10 0.09
## WAGE 0.05 0.10
## AGE 0.08 0.28
## RACE 0.00 0.04
## OCCUPATION 0.36 -0.01
## SECTOR 1.00 0.06
## MARR 0.06 1.00
## Sample Size
## [1] 534
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## EDUCATION SOUTH SEX EXPERIENCE UNION WAGE AGE RACE OCCUPATION
## EDUCATION 0.00 0.04 1.00 0.00 1.00 0.00 0.02 0.85 0.00
## SOUTH 0.00 0.00 1.00 1.00 1.00 0.04 1.00 0.27 1.00
## SEX 0.96 0.62 0.00 1.00 0.01 0.00 1.00 1.00 0.00
## EXPERIENCE 0.00 0.86 0.08 0.00 0.23 1.00 0.00 1.00 1.00
## UNION 0.58 0.05 0.00 0.01 0.00 0.01 0.21 1.00 0.00
## WAGE 0.00 0.00 0.00 0.04 0.00 0.00 0.00 0.86 1.00
## AGE 0.00 0.37 0.07 0.00 0.01 0.00 0.00 1.00 1.00
## RACE 0.03 0.01 0.53 0.58 0.05 0.03 0.92 0.00 1.00
## OCCUPATION 0.00 0.73 0.00 0.62 0.00 0.26 0.11 0.73 0.00
## SECTOR 0.00 0.99 0.00 0.01 0.03 0.30 0.08 0.97 0.00
## MARR 0.41 0.88 0.80 0.00 0.03 0.02 0.00 0.31 0.79
## SECTOR MARR
## EDUCATION 0.00 1.00
## SOUTH 1.00 1.00
## SEX 0.00 1.00
## EXPERIENCE 0.34 0.00
## UNION 0.85 0.91
## WAGE 1.00 0.66
## AGE 1.00 0.00
## RACE 1.00 1.00
## OCCUPATION 0.00 1.00
## SECTOR 0.00 1.00
## MARR 0.20 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
library(corrplot)
## corrplot 0.84 loaded
cor1<-cor(dta)
corrplot.mixed(cor1,lower.col="black",number.cex=0.7)
We can also compute the VIF (Variance Inflation Factor) of each predictor. For the regression model \(\hat{y}=b_0+b_1x_1+b_2x_2+b_3x_3+\cdots+b_kx_k\), the VIF of a predictor \(x_i\) is computed as \(VIF_i=\frac{1}{1-R^2_i}\), where \(R^2_i\) is the \(R^2\) of the model in which \(x_i\) is regressed on the other predictors. For example, \(R^2_1\) used for \(VIF_1\) is the \(R^2\) of the model \(x_1=\alpha_0+\alpha_2x_2+\alpha_3x_3+\cdots+\alpha_kx_k+\epsilon\). Theoretically, \(1\leq VIF<\infty\); the larger the VIF, the more severe the multicollinearity. A predictor with VIF > 5 is commonly considered a sign of multicollinearity. Clearly, EDUCATION, EXPERIENCE, and AGE have extremely large VIF values.
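To make the definition concrete, here is a minimal sketch computing the VIF of EDUCATION by hand; the vif( ) call from {car} below does this for every predictor at once.
r2.edu<-summary(lm(EDUCATION~.-WAGE,dta))$r.squared  # R^2 of EDUCATION regressed on the other nine predictors
1/(1-r2.edu)                                         # should reproduce the VIF reported for EDUCATION below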
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
vif(m1)
## EDUCATION SOUTH SEX EXPERIENCE UNION AGE
## 231.195580 1.046828 1.091634 5184.093895 1.120861 4645.664977
## RACE OCCUPATION SECTOR MARR
## 1.037138 1.298232 1.198670 1.096130
qf(0.95,9,524)
## [1] 1.89774
We can determine the location of the multicollinearity by testing each \(R^2_i\) behind the VIFs with an F test, with numerator df \(k-1\) and denominator df \(n-k\), where k is the number of predictors and n is the sample size. Here we have 10 predictors and 534 persons, so \(df_1=9\) and \(df_2=524\), and the critical F value is 1.89774. Apparently, EDUCATION, EXPERIENCE, and AGE are the cause of the multicollinearity. We can also test the partial correlation coefficients between the predictors to check the location of multicollinearity.
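A minimal sketch of this check: since \(R^2_i=1-1/VIF_i\), the F statistic of each auxiliary regression can be written as \(F_i=(VIF_i-1)\frac{n-k}{k-1}\) and compared with the critical value above.
Fv<-(vif(m1)-1)*(534-10)/(10-1)   # F statistic implied by each VIF
Fv>qf(0.95,9,524)                 # TRUE flags the predictors responsible for the multicollinearity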
library(ppcor)
## Loading required package: MASS
pcor(dta[,-6])
## $estimate
## EDUCATION SOUTH SEX EXPERIENCE UNION
## EDUCATION 1.000000000 -0.031750193 0.051510483 -0.99756187 -0.007479144
## SOUTH -0.031750193 1.000000000 -0.030152499 -0.02231360 -0.097548621
## SEX 0.051510483 -0.030152499 1.000000000 0.05497703 -0.120087577
## EXPERIENCE -0.997561873 -0.022313605 0.054977034 1.00000000 -0.010244447
## UNION -0.007479144 -0.097548621 -0.120087577 -0.01024445 1.000000000
## AGE 0.997261601 0.021525073 -0.053697851 0.99987574 0.012238897
## RACE 0.017230877 -0.111197596 0.020017315 0.01088849 -0.107706183
## OCCUPATION 0.029436911 0.008430595 -0.142750864 0.04205856 0.212996388
## SECTOR -0.021253493 -0.021518760 -0.112146760 -0.01326166 -0.013531482
## MARR -0.040302967 0.030418218 0.004163264 -0.04097664 0.068918496
## AGE RACE OCCUPATION SECTOR MARR
## EDUCATION 0.99726160 0.017230877 0.029436911 -0.021253493 -0.040302967
## SOUTH 0.02152507 -0.111197596 0.008430595 -0.021518760 0.030418218
## SEX -0.05369785 0.020017315 -0.142750864 -0.112146760 0.004163264
## EXPERIENCE 0.99987574 0.010888486 0.042058560 -0.013261665 -0.040976643
## UNION 0.01223890 -0.107706183 0.212996388 -0.013531482 0.068918496
## AGE 1.00000000 -0.010803310 -0.044140293 0.014565751 0.045090327
## RACE -0.01080331 1.000000000 0.057539374 0.006412099 0.055645964
## OCCUPATION -0.04414029 0.057539374 1.000000000 0.314746868 -0.018580965
## SECTOR 0.01456575 0.006412099 0.314746868 1.000000000 0.036495494
## MARR 0.04509033 0.055645964 -0.018580965 0.036495494 1.000000000
##
## $p.value
## EDUCATION SOUTH SEX EXPERIENCE UNION AGE
## EDUCATION 0.0000000 0.46745162 0.238259049 0.0000000 8.641246e-01 0.0000000
## SOUTH 0.4674516 0.00000000 0.490162786 0.6096300 2.526916e-02 0.6223281
## SEX 0.2382590 0.49016279 0.000000000 0.2080904 5.822656e-03 0.2188841
## EXPERIENCE 0.0000000 0.60962999 0.208090393 0.0000000 8.146741e-01 0.0000000
## UNION 0.8641246 0.02526916 0.005822656 0.8146741 0.000000e+00 0.7794483
## AGE 0.0000000 0.62232811 0.218884070 0.0000000 7.794483e-01 0.0000000
## RACE 0.6933788 0.01070652 0.646920379 0.8032546 1.345383e-02 0.8047625
## OCCUPATION 0.5005235 0.84704000 0.001027137 0.3356824 8.220095e-07 0.3122902
## SECTOR 0.6267278 0.62243025 0.010051378 0.7615531 7.568528e-01 0.7389200
## MARR 0.3562616 0.48634504 0.924111163 0.3482728 1.143954e-01 0.3019796
## RACE OCCUPATION SECTOR MARR
## EDUCATION 0.69337880 5.005235e-01 6.267278e-01 0.3562616
## SOUTH 0.01070652 8.470400e-01 6.224302e-01 0.4863450
## SEX 0.64692038 1.027137e-03 1.005138e-02 0.9241112
## EXPERIENCE 0.80325456 3.356824e-01 7.615531e-01 0.3482728
## UNION 0.01345383 8.220095e-07 7.568528e-01 0.1143954
## AGE 0.80476248 3.122902e-01 7.389200e-01 0.3019796
## RACE 0.00000000 1.876376e-01 8.833600e-01 0.2026017
## OCCUPATION 0.18763758 0.000000e+00 1.467261e-13 0.6707116
## SECTOR 0.88336002 1.467261e-13 0.000000e+00 0.4035489
## MARR 0.20260170 6.707116e-01 4.035489e-01 0.0000000
##
## $statistic
## EDUCATION SOUTH SEX EXPERIENCE UNION
## EDUCATION 0.0000000 -0.7271618 1.18069629 -327.2105031 -0.1712102
## SOUTH -0.7271618 0.0000000 -0.69053623 -0.5109090 -2.2436907
## SEX 1.1806963 -0.6905362 0.00000000 1.2603880 -2.7689685
## EXPERIENCE -327.2105031 -0.5109090 1.26038801 0.0000000 -0.2345184
## UNION -0.1712102 -2.2436907 -2.76896848 -0.2345184 0.0000000
## AGE 308.6803174 0.4928456 -1.23097601 1451.9092015 0.2801822
## RACE 0.3944914 -2.5613138 0.45830912 0.2492636 -2.4799336
## OCCUPATION 0.6741338 0.1929920 -3.30152873 0.9636171 4.9902208
## SECTOR -0.4866246 -0.4927010 -2.58345399 -0.3036001 -0.3097781
## MARR -0.9233273 0.6966272 0.09530228 -0.9387867 1.5813765
## AGE RACE OCCUPATION SECTOR MARR
## EDUCATION 308.6803174 0.3944914 0.6741338 -0.4866246 -0.92332727
## SOUTH 0.4928456 -2.5613138 0.1929920 -0.4927010 0.69662719
## SEX -1.2309760 0.4583091 -3.3015287 -2.5834540 0.09530228
## EXPERIENCE 1451.9092015 0.2492636 0.9636171 -0.3036001 -0.93878671
## UNION 0.2801822 -2.4799336 4.9902208 -0.3097781 1.58137652
## AGE 0.0000000 -0.2473135 -1.0114033 0.3334607 1.03321563
## RACE -0.2473135 0.0000000 1.3193223 0.1467827 1.27577106
## OCCUPATION -1.0114033 1.3193223 0.0000000 7.5906763 -0.42541117
## SECTOR 0.3334607 0.1467827 7.5906763 0.0000000 0.83597695
## MARR 1.0332156 1.2757711 -0.4254112 0.8359769 0.00000000
##
## $n
## [1] 534
##
## $gp
## [1] 8
##
## $method
## [1] "pearson"
The t tests of the partial correlation coefficients show that AGE and EXPERIENCE are highly correlated with each other even after the influence of the other variables is removed. The partial correlation between EDUCATION and EXPERIENCE is also still high.
There are several ways to deal with multicollinearity. First, we can remove the predictor(s) that cause it, or select suitable predictors by stepwise regression. Second, we can run PCA (Principal Component Analysis) instead; however, PCA is not really recommended here, as we are interested in the predictors themselves rather than in components that are combinations of the predictors. Third, we can run ridge regression. Let’s try the first way by removing the predictor EXPERIENCE (a brief stepwise sketch follows the VIF output below). As shown in the tables below, the \(R^2\) of this new model is essentially unchanged, and the critical variables such as EDUCATION and AGE become significant. Also, the VIFs are now all acceptable.
m2<-lm(log(WAGE)~.,dta[,-4])
summary(m2)
##
## Call:
## lm(formula = log(WAGE) ~ ., data = dta[, -4])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.16018 -0.29085 -0.00513 0.29985 1.97932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.501358 0.164794 3.042 0.002465 **
## EDUCATION 0.083815 0.007728 10.846 < 2e-16 ***
## SOUTH -0.103186 0.042802 -2.411 0.016261 *
## SEX -0.220100 0.039837 -5.525 5.20e-08 ***
## UNION 0.200018 0.052459 3.813 0.000154 ***
## AGE 0.010305 0.001745 5.905 6.34e-09 ***
## RACE 0.050674 0.028523 1.777 0.076210 .
## OCCUPATION -0.006941 0.013095 -0.530 0.596309
## SECTOR 0.091013 0.038723 2.350 0.019125 *
## MARR 0.075125 0.041886 1.794 0.073458 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4397 on 524 degrees of freedom
## Multiple R-squared: 0.3175, Adjusted R-squared: 0.3058
## F-statistic: 27.09 on 9 and 524 DF, p-value: < 2.2e-16
vif(m2)
## EDUCATION SOUTH SEX UNION AGE RACE OCCUPATION
## 1.125994 1.046306 1.088334 1.120743 1.154496 1.037015 1.295935
## SECTOR MARR
## 1.198460 1.094289
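As for the stepwise option mentioned above, here is a minimal sketch with base R’s step( ); its AIC-based selection need not agree with the manual removal of EXPERIENCE.
m.step<-step(m1,trace=FALSE)   # backward selection by AIC, starting from the full model
formula(m.step)                # which predictors survive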
When there are too many predictors and we want to select those really worth keeping in the model, ridge regression is a good choice. The basic idea is to add a penalty term, a parameter \(\lambda\) multiplied by the sum of squared coefficients (i.e., \(\lambda\sum\beta^2_{j}\)), to the sum of squared residuals (i.e., \(\sum(y_{i}-X^T_{i}\beta)^2\)). The criterion to be minimized thus becomes \(\sum(y_{i}-X^T_{i}\beta)^2+\lambda\sum\beta^2_{j}\). For a given set of coefficients, this criterion is smallest at \(\lambda=0\) and grows as \(\lambda\) increases. To keep the criterion low as \(\lambda\) grows, the model has to shrink the coefficients, and it shrinks the coefficients of predictors that contribute little more aggressively than those of predictors that are really worth keeping. When \(\lambda\) is extremely large, no adjustment of the coefficients can offset the penalty and all of them are driven toward zero. Somewhere in between, there is a \(\lambda\) at which the useful and the redundant predictors are best distinguished.
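For reference, with \(\lambda\) fixed this criterion has the closed-form minimizer \(\hat{\beta}_{ridge}=(X^TX+\lambda I)^{-1}X^Ty\); the \(\lambda I\) term is what shrinks the coefficients toward zero and also stabilizes the matrix inversion when \(X^TX\) is nearly singular, which is exactly the multicollinearity situation.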
In R, we can use glmnet( ) to implement this. The first and second arguments of this function must be the predictor matrix and the dependent-variable vector; since both must be numeric, I first create a suitable matrix and vector. The third argument, alpha, determines the form of the penalty: when alpha\(=1\), the penalty is the lasso penalty (\(\lambda\sum|\beta_{j}|\)), whereas when alpha\(=0\), it is the ridge penalty (\(\lambda\sum\beta^2_{j}\)). The number of \(\lambda\) values, nlambda, is set to 100, the default; that is, glmnet( ) will estimate the coefficients for 100 different \(\lambda\) values. The argument lambda.min.ratio sets the smallest \(\lambda\) as a fraction of the largest one in the sequence; here I set it to a very small value.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-1
x<-as.matrix(dta[,-6])
y<-log(dta$WAGE)
db.mod<-glmnet(x,y,alpha=0,nlambda=100,lambda.min.ratio=0.0001)
As shown in the figure below, the \(\lambda\) value decreases steadily over these 100 rounds of estimation. Which \(\lambda\) value is optimal?
ggplot(data=data.frame(x=1:100,lambda=db.mod$lambda),
aes(x=x,y=lambda))+
geom_point(shape=1,col="blue")
The optimal \(\lambda\) value is the one with the lowest cross-validation error, which we can compute with cv.glmnet( ). Before calling this function, the random seed is fixed to 1 (or any positive integer you like) so that the results do not change across runs because of the random assignment of cases to folds. The figure below shows that the mean-squared error increases as \(\lambda\) increases.
set.seed(1)
cv.out<-cv.glmnet(x,y,alpha=0,nlambda=100,lambda.min.ratio=0.0001)
plot(cv.out)
In the object returned by cv.glmnet( ), lambda.min is the optimal \(\lambda\) value. We can go back and retrieve the estimated coefficients from the round at which this optimal \(\lambda\) occurs; it is the 96th round. Alternatively, we can use predict( ) to regenerate the coefficients given the best \(\lambda\) value.
best.lambda<-cv.out$lambda.min
best.lambda
## [1] 0.02909843
best.round<-which(cv.out$lambda==cv.out$lambda.min)
coefs.ridge<-c(db.mod$a0[best.round],db.mod$beta[,best.round])
predict(db.mod,s=best.lambda,type="coefficients")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.604604718
## EDUCATION 0.081676088
## SOUTH -0.102588586
## SEX -0.209307310
## EXPERIENCE 0.002919042
## UNION 0.193151048
## AGE 0.006864832
## RACE 0.049280923
## OCCUPATION -0.006616402
## SECTOR 0.084440422
## MARR 0.074574614
table.coef<-cbind(coefs.ridge,m1$coefficients)
colnames(table.coef)<-c("Ridge","OLS")
table.coef
## Ridge OLS
## s95 0.604604718 1.078596125
## EDUCATION 0.081676088 0.179366097
## SOUTH -0.102588586 -0.102359696
## SEX -0.209307310 -0.221997414
## EXPERIENCE 0.002919042 0.095822118
## UNION 0.193151048 0.200482978
## AGE 0.006864832 -0.085444480
## RACE 0.049280923 0.050405590
## OCCUPATION -0.006616402 -0.007417461
## SECTOR 0.084440422 0.091457733
## MARR 0.074574614 0.076610575
Now we can compare the coefficients estimated by ridge regression with those from ordinary least squares. Apparently, the regression coefficient of EXPERIENCE shrinks from 0.0958 to 0.0029. This suggests that, in the original regression model, EXPERIENCE is the predictor that can be omitted.
Although multicollinearity deserves researchers’ attention, in some situations it can safely be ignored. First, when the variables of interest do not have high VIFs and only the control variables do. Second, when the high VIFs are caused by including powers or products of other variables, for instance when \(x\) and \(x^2\), or \(x\), \(z\), and \(xz\), are all in the model. Third, when the variables with high VIFs are dummy variables representing a categorical variable.
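A quick simulated sketch of the second situation (again a toy example, not the wage data): a predictor and its square are naturally collinear, but centering the predictor before squaring removes most of that collinearity.
set.seed(123)
xx<-runif(200,10,20)
yy<-2+0.5*xx+0.1*xx^2+rnorm(200)
vif(lm(yy~xx+I(xx^2)))     # huge VIFs: xx and xx^2 are almost perfectly correlated here
xxc<-xx-mean(xx)
vif(lm(yy~xxc+I(xxc^2)))   # after centering, the VIFs drop to about 1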
In the R package {MASS}, the data set Boston contains the median house value (medv) and other predictors; it has 506 cases and 14 variables in total. With the median house value as the dependent variable, a multiple regression model will be built.
library(MASS)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
dim(Boston)
## [1] 506 14
Unlike the previous example, here 80% of the 506 cases will be randomly sampled as the training data, leaving the remaining 20% as the test data.
library(tidyverse) # for easy data manipulation and visualization
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.5 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x psych::%+%() masks ggplot2::%+%()
## x psych::alpha() masks ggplot2::alpha()
## x dplyr::combine() masks gridExtra::combine()
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x tidyr::pack() masks Matrix::pack()
## x dplyr::recode() masks car::recode()
## x dplyr::select() masks MASS::select()
## x purrr::some() masks car::some()
## x tidyr::unpack() masks Matrix::unpack()
library(caret) # for easy machine learning workflow
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
indices<-c(rep(1,round(nrow(Boston)*0.8)),rep(0,nrow(Boston)-round(nrow(Boston)*0.8)))
set.seed(123)
indices<-sample(indices,nrow(Boston))
train.data<-Boston[indices==1,]
test.data<-Boston[indices==0,]
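For reference, a similar split can be drawn with caret’s createDataPartition( ), which stratifies on the outcome; this sketch is an alternative to, not a replacement of, the manual split above.
set.seed(123)
train.idx<-createDataPartition(Boston$medv,p=0.8,list=FALSE)
# train.data<-Boston[train.idx,]; test.data<-Boston[-train.idx,]   # would replace the manual split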
Now we can build our first regression model and generate its predictions for the test data.
# Build model
b.m1<-lm(medv~.,train.data)
# Make predictions
pred.m1<-b.m1 %>% predict(test.data)
# Model performance
data.frame(
RMSE=RMSE(pred.m1,test.data$medv),
R2=R2(pred.m1,test.data$medv)
)
## RMSE R2
## 1 5.195852 0.6725872
We can check whether there is multicollinearity. The predictor tax has the largest VIF among all predictors.
vif(b.m1)
## crim zn indus chas nox rm age dis
## 1.829059 2.240485 3.816790 1.083494 4.408732 1.915704 3.436755 4.109652
## rad tax ptratio black lstat
## 7.382281 8.568818 1.858105 1.312868 3.056858
Also, the tests of the partial correlation coefficients show that tax is highly correlated with rad (r = .77). Although some partial correlations among the other predictors are also significant, the one between tax and rad is clearly the strongest.
pcor(train.data[,-14])
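Since the full pcor( ) output is lengthy, here is a sketch of how to pull out just the entry referred to above.
pc<-pcor(train.data[,-14])
pc$estimate["tax","rad"]   # the partial correlation mentioned above (about .77)
pc$p.value["tax","rad"]    # and its p value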
We can simply remove tax from the model. The second model’s performance on the test data is nearly the same as the first one’s, so tax can be dropped from the model without any loss of predictive accuracy.
# Build a model
b.m2<-lm(medv~. -tax,train.data)
# Make predictions
pred.m2<-b.m2 %>% predict(test.data)
# Model performance
data.frame(
RMSE=RMSE(pred.m2,test.data$medv),
R2=R2(pred.m2,test.data$medv)
)
## RMSE R2
## 1 5.230518 0.6686179
Of course, we can also run ridge regression on the training data to check which predictor(s) should be kept in the model. Clearly, the regression coefficients of tax and rad both shrink, whereas some other predictors, such as indus and rm, end up with larger coefficients.
x<-as.matrix(train.data[,-14])
y<-train.data$medv
b.rgm<-glmnet(x,y,alpha=0,nlambda=100,lambda.min.ratio=0.0001)
set.seed(1)
b.cv.out<-cv.glmnet(x,y,alpha=0,nlambda=100,lambda.min.ratio=0.0001)
plot(b.cv.out)
b.best.lambda<-b.cv.out$lambda.min
b.best.lambda
## [1] 0.6967256
b.best.round<-which(b.cv.out$lambda==b.cv.out$lambda.min)
coefs.ridge<-predict(b.rgm,s=b.best.lambda,type="coefficients")
table.coef<-cbind(coefs.ridge,b.m1$coefficients)
colnames(table.coef)<-c("Ridge","OLS")
table.coef
## 14 x 2 sparse Matrix of class "dgCMatrix"
## Ridge OLS
## (Intercept) 27.208803726 35.401948894
## crim -0.099708789 -0.123094134
## zn 0.032363174 0.046390518
## indus -0.075902198 -0.027013933
## chas 2.490389696 2.259623417
## nox -10.870598491 -16.418753003
## rm 3.861438456 3.645301128
## age 0.002351569 0.007556732
## dis -1.114682631 -1.485241328
## rad 0.154672711 0.309384132
## tax -0.006025604 -0.012414362
## ptratio -0.777182998 -0.863208504
## black 0.009292661 0.009714588
## lstat -0.490399582 -0.550416695
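To close the loop, here is a minimal sketch of how the ridge fit’s test-set performance could be compared with the two OLS models above, reusing the RMSE( ) and R2( ) helpers from caret.
pred.ridge<-as.numeric(predict(b.rgm,s=b.best.lambda,newx=as.matrix(test.data[,-14])))
data.frame(
RMSE=RMSE(pred.ridge,test.data$medv),
R2=R2(pred.ridge,test.data$medv)
)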