In the previous tutorials, the dependent variable was assumed to be influenced by only one predictor. In practice, it is quite common to predict the dependent variable with multiple predictors. When there is more than one predictor, the analysis is called multiple regression analysis. The multiple regression model is written as \(\hat{y}=b_0+b_1x_1+b_2x_2+\cdots+b_kx_k\) for \(k\) predictors and one dependent variable. The same lm() function can be applied directly to conduct a multiple regression analysis.
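For instance, a minimal sketch of such a call (assuming a hypothetical data frame dat with a dependent variable y and predictors x1, x2, and x3) might look like this:
fit <- lm(y ~ x1 + x2 + x3, data = dat)  # dat, y, x1, x2, x3 are hypothetical names
summary(fit)                             # shows the estimates of b0, b1, ..., bk and their tests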
In a study on how performance in graduate school is influenced by intellectual ability and work ethic, data from 20 students were collected and stored in score.txt. The meanings of the column labels are listed as follows.
Y1 - A measure of success in graduate school.
X1 - A measure of intellectual ability.
X2 - A measure of “work ethic.”
X3 - A second measure of intellectual ability.
X4 - A measure of spatial ability.
Y2 - Score on a major review paper.
First import the data.
dta<-read.table("score.txt",header=T)
head(dta)
## Y1 Y2 X1 X2 X3 X4
## 1 125 113 13 18 25 11
## 2 158 115 39 18 59 30
## 3 207 126 52 50 62 53
## 4 182 119 29 43 50 29
## 5 196 107 50 37 65 56
## 6 175 135 64 19 79 49
In order to examine the relationships between the predictors and the dependent variable, I would make scatter plots of these variables in advance. After all, a valid predictor in the regression model is assumed to be correlated with the dependent variable. In addition, the ideal situation is that the predictors are not correlated with each other. In that case, we do not have to worry about an interaction effect of the predictors on the dependent variable, which makes the relationships between the predictors and the dependent variable clearer.
library(ggplot2)
library(gridExtra)
y1x1.fg<-ggplot(data=dta,aes(x=X1,y=Y1))+
geom_point(shape=1,col="blue")
y1x2.fg<-ggplot(data=dta,aes(x=X2,y=Y1))+
geom_point(shape=1,col="red")
x1x2.fg<-ggplot(data=dta,aes(x=X1,y=X2))+
geom_point(shape=1)
grid.arrange(y1x1.fg,y1x2.fg,x1x2.fg,ncol=3)
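As a quick alternative that is not used elsewhere in this tutorial, the base function pairs() draws all pairwise scatter plots in one call:
pairs(dta[, c("Y1", "X1", "X2")])  # scatter plot matrix of the three variables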
It looks as if X1 and X2 both have a positive correlation with Y1, whereas the correlation between X1 and X2 seems to be quite weak. This visual inspection is supported by the analyses below. The results show that Y1 is positively correlated with X1 [\(r=.76\), \(p<.01\)] and X2 [\(r=.77\), \(p<.01\)] respectively. Also, the correlation between X1 and X2 is not significant.
with(dta,cor.test(Y1,X1))
##
## Pearson's product-moment correlation
##
## data: Y1 and X1
## t = 5.026, df = 18, p-value = 8.776e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4859502 0.9017425
## sample estimates:
## cor
## 0.7641466
with(dta,cor.test(Y1,X2))
##
## Pearson's product-moment correlation
##
## data: Y1 and X2
## t = 5.1012, df = 18, p-value = 7.462e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4945910 0.9038464
## sample estimates:
## cor
## 0.7688386
with(dta,cor.test(X1,X2))
##
## Pearson's product-moment correlation
##
## data: X1 and X2
## t = 1.1187, df = 18, p-value = 0.278
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2114169 0.6267637
## sample estimates:
## cor
## 0.2549567
Let us now fit a multiple regression model. Different from the case of simple linear regression, the formula here needs a second predictor. See the code below: the operator + in the formula links the predictors. As shown in the summary table below, the intercept \(b_0\) is significantly larger than 0, and the coefficients of the two predictors are also significantly larger than 0. The results show that 94% of the variance of Y1 can be explained by these two predictors [\(R^2=.94\)].
m<-lm(Y1~X1+X2,dta)
summary(m)
##
## Call:
## lm(formula = Y1 ~ X1 + X2, data = dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5964 -4.6383 -0.2101 5.2806 10.9762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.2219 4.5874 22.065 5.97e-14 ***
## X1 1.0004 0.1042 9.600 2.81e-08 ***
## X2 1.0708 0.1104 9.699 2.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.541 on 17 degrees of freedom
## Multiple R-squared: 0.9363, Adjusted R-squared: 0.9288
## F-statistic: 125 on 2 and 17 DF, p-value: 6.824e-11
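If only the numbers discussed above are needed, they can also be extracted directly from the fitted lm object; for example:
coef(m)               # the estimated coefficients b0, b1, and b2
confint(m)            # their 95% confidence intervals
summary(m)$r.squared  # the R squared reported in the summary table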
This example is taken from an online database. Wang and Elhag (2007) measured the risks of 66 bridges in Britain with four indices: the Safety Risk Rating (SRR), the Functionality Risk Rating (FRR), the Sustainability Risk Rating (SUR), and the Environment Risk Rating (ERR). Each is scaled from 0 to 3, with 3 = High, 2 = Medium, 1 = Low, and 0 = None. In addition, the overall risk of each bridge is measured by a risk score. Before we analyze these data, they need to be imported. The variables from left to right are Bridge ID, SRR, FRR, SUR, ERR, and RiskScore.
url<-"http://users.stat.ufl.edu/~winner/data/bridge_risk.dat"
bridge.dta<-read.table(file=url,header=F)
names(bridge.dta)<-c("ID","SRR","FRR","SUR","ERR","Risk")
head(bridge.dta)
## ID SRR FRR SUR ERR Risk
## 1 1 3 3 3 3 99
## 2 2 3 3 3 2 98
## 3 3 3 2 3 3 98
## 4 4 3 2 2 3 97
## 5 5 3 3 2 2 97
## 6 6 3 2 2 2 96
Obviously, the dependent variable is the risk score of each bridge, and the other variables are the predictors. As usual, the relation between every pair of these variables is depicted with scatter plots. Here, the scatter plots can be divided into two groups. One group shows the relation between each risk measure and the risk score. The other shows the relation between every two risk measures. I will introduce how to lay out these figures with {ggplot2} and {gridExtra}. For the first group, since the 4 risk measures have the same scale, the 4 scatter plots can be put together as one scatter plot. To this end, the original data frame needs to be converted to a long format in which the risk score, the measure name, and the measure value are stored in three separate columns. The function melt() in the package {reshape} meets this need.
library(reshape)
b.dta<-melt(bridge.dta[,-1],id.vars="Risk")
head(b.dta)
## Risk variable value
## 1 99 SRR 3
## 2 98 SRR 3
## 3 98 SRR 3
## 4 97 SRR 3
## 5 97 SRR 3
## 6 96 SRR 3
Now we can start making these scatter plots. In the left panel, different colors represent different risk measures. It looks as if ERR (i.e., the purple dots) has nothing to do with the risk score. The right panels show the relations between every two of these 4 measures. It is hard to judge by eye which correlation is significant.
fg1<-ggplot(data=b.dta,aes(x=value,y=Risk,color=variable))+
geom_point()+
scale_x_continuous(name="Risk Measure")
SRRFRR.fg<-ggplot(data=bridge.dta,aes(x=SRR,y=FRR))+
geom_point()
SRRSUR.fg<-ggplot(data=bridge.dta,aes(x=SRR,y=SUR))+
geom_point()
SRRERR.fg<-ggplot(data=bridge.dta,aes(x=SRR,y=ERR))+
geom_point()
FRRSUR.fg<-ggplot(data=bridge.dta,aes(x=FRR,y=SUR))+
geom_point()
FRRERR.fg<-ggplot(data=bridge.dta,aes(x=FRR,y=ERR))+
geom_point()
SURERR.fg<-ggplot(data=bridge.dta,aes(x=SUR,y=ERR))+
geom_point()
grid.arrange(grobs=list(fg1,SRRFRR.fg,SRRSUR.fg,SRRERR.fg,
FRRSUR.fg,FRRERR.fg,SURERR.fg),
widths=c(4,1,1,1),
layout_matrix=rbind(c(1,2,3,4),c(1,5,6,7)))
Three of the 4 risk measures are significantly correlated with the risk score [for SRR, \(r=.84\), \(p<.01\); for FRR, \(r=.59\), \(p<.01\); for SUR, \(r=.60\), \(p<.01\)], whereas ERR is not [\(r=.21\), \(p=.08\)].
with(bridge.dta,cor.test(Risk,SRR))
##
## Pearson's product-moment correlation
##
## data: Risk and SRR
## t = 12.361, df = 64, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7498560 0.8989059
## sample estimates:
## cor
## 0.8395238
with(bridge.dta,cor.test(Risk,FRR))
##
## Pearson's product-moment correlation
##
## data: Risk and FRR
## t = 5.8012, df = 64, p-value = 2.21e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4021535 0.7259352
## sample estimates:
## cor
## 0.5870465
with(bridge.dta,cor.test(Risk,SUR))
##
## Pearson's product-moment correlation
##
## data: Risk and SUR
## t = 5.9732, df = 64, p-value = 1.126e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4165716 0.7340259
## sample estimates:
## cor
## 0.5982831
with(bridge.dta,cor.test(Risk,ERR))
##
## Pearson's product-moment correlation
##
## data: Risk and ERR
## t = 1.7566, df = 64, p-value = 0.08378
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02908025 0.43396882
## sample estimates:
## cor
## 0.214462
Since it is quite tedious to conduct a correlation test for each pair of variables, corr.test() in the package {psych} is a useful tool: it shows all correlation coefficients in one matrix and their \(p\) values in another. We can use this function to compute the pairwise correlation coefficients of these 4 measures. Thus, the argument is not the whole data frame but the data frame without the first and sixth columns (the bridge ID and the risk score). Based on the unadjusted \(p\) values below the diagonal, SRR is significantly correlated with FRR and with SUR, as is FRR with SUR, whereas the pairwise correlations involving ERR are not significant.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
corr.test(bridge.dta[,-c(1,6)])
## Call:corr.test(x = bridge.dta[, -c(1, 6)])
## Correlation matrix
## SRR FRR SUR ERR
## SRR 1.00 0.45 0.27 0.10
## FRR 0.45 1.00 0.32 0.06
## SUR 0.27 0.32 1.00 0.20
## ERR 0.10 0.06 0.20 1.00
## Sample Size
## [1] 66
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## SRR FRR SUR ERR
## SRR 0.00 0.00 0.11 0.83
## FRR 0.00 0.00 0.05 0.83
## SUR 0.03 0.01 0.00 0.32
## ERR 0.41 0.61 0.11 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
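As the last line of the output notes, the confidence intervals of these correlations can be displayed by printing the result with short = FALSE:
print(corr.test(bridge.dta[, -c(1, 6)]), short = FALSE)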
According to the previous correlation analyses, we know that all risk measures but ERR are correlated with the risk score. For the sake of demonstration, I decided to put all 4 measures into the regression model without discussing which one is more valid.
m<-lm(Risk~SRR+FRR+SUR+ERR,bridge.dta)
summary(m)
##
## Call:
## lm(formula = Risk ~ SRR + FRR + SUR + ERR, data = bridge.dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.118 -6.867 -1.119 6.477 21.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.571 2.530 5.364 1.33e-06 ***
## SRR 15.500 1.186 13.074 < 2e-16 ***
## FRR 4.017 1.193 3.369 0.00131 **
## SUR 8.654 1.184 7.310 6.78e-10 ***
## ERR 1.602 1.117 1.435 0.15647
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.203 on 61 degrees of freedom
## Multiple R-squared: 0.8797, Adjusted R-squared: 0.8718
## F-statistic: 111.5 on 4 and 61 DF, p-value: < 2.2e-16
The file CH06FI05.txt contains the data of 21 participants. Suppose the third column is the dependent variable and we would like to know how well the first two columns predict it. To this end, the data are imported first. The variable names are arbitrarily assigned as x1, x2, and y.
dta<-read.table("CH06FI05.txt",header=F)
names(dta)<-c("x1","x2","y")
head(dta)
## x1 x2 y
## 1 68.5 16.7 174.4
## 2 45.2 16.8 164.4
## 3 91.3 18.2 244.2
## 4 47.8 16.3 154.6
## 5 46.9 17.3 181.6
## 6 66.1 18.2 207.5
As usual, we first check the pairwise correlations among these variables. It looks as if these three variables are positively correlated with one another.
x1y.fg<-ggplot(data=dta,aes(x=x1,y=y))+
geom_point(shape=1)
x2y.fg<-ggplot(data=dta,aes(x=x2,y=y))+
geom_point(shape=1)
x1x2.fg<-ggplot(data=dta,aes(x=x1,y=x2))+
geom_point(shape=1)
grid.arrange(x1y.fg,x2y.fg,x1x2.fg,ncol=3)
The correlation coefficients computed for the above scatter plots support this inspection, with \(r=.94\), \(r=.84\), and \(r=.78\) respectively, and their \(p\) values are all smaller than \(.01\).
with(dta,cor.test(x1,y))
##
## Pearson's product-moment correlation
##
## data: x1 and y
## t = 12.539, df = 19, p-value = 1.229e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8659668 0.9776164
## sample estimates:
## cor
## 0.9445543
with(dta,cor.test(x2,y))
##
## Pearson's product-moment correlation
##
## data: x2 and y
## t = 6.6357, df = 19, p-value = 2.391e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6322259 0.9314262
## sample estimates:
## cor
## 0.8358025
with(dta,cor.test(x1,x2))
##
## Pearson's product-moment correlation
##
## data: x1 and x2
## t = 5.4563, df = 19, p-value = 2.899e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5275392 0.9070570
## sample estimates:
## cor
## 0.7812993
With y as the dependent variable and the other two variables as the predictors, we can test the parameters of the regression model \(\hat{y}=b_0+b_1x_1+b_2x_2\). The summary table shows that the intercept is not significant, whereas the coefficients of \(x_1\) and \(x_2\) are both significant.
m<-lm(y~x1+x2,dta)
summary(m)
##
## Call:
## lm(formula = y ~ x1 + x2, data = dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4239 -6.2161 0.7449 9.4356 20.2151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -68.8571 60.0170 -1.147 0.2663
## x1 1.4546 0.2118 6.868 2e-06 ***
## x2 9.3655 4.0640 2.305 0.0333 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.01 on 18 degrees of freedom
## Multiple R-squared: 0.9167, Adjusted R-squared: 0.9075
## F-statistic: 99.1 on 2 and 18 DF, p-value: 1.921e-10
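As a brief aside, the fitted model can also be used to predict y for new values of the predictors; the values below are made up only for illustration:
new.obs <- data.frame(x1 = 65.4, x2 = 17.6)             # hypothetical predictor values
predict(m, newdata = new.obs, interval = "confidence")  # predicted y with a 95% confidence interval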
The \(R^2\) is very good, suggesting that this model can account for 92% of the variability of y. Since we know that x1 and x2 are correlated, perhaps the interaction effect between these two predictors should be included in the model as well. Thus, the model becomes \(\hat{y}=b_0+b_1x_1+b_2x_2+b_3x_1x_2\). We can test this model with the following code.
m3<-lm(y~x1*x2,dta)
summary(m3)
##
## Call:
## lm(formula = y ~ x1 * x2, data = dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.9159 -7.3276 0.4411 9.4115 20.8075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.47627 227.96284 -0.055 0.957
## x1 0.53195 3.59800 0.148 0.884
## x2 6.09330 13.40395 0.455 0.655
## x1:x2 0.05288 0.20585 0.257 0.800
##
## Residual standard error: 11.3 on 17 degrees of freedom
## Multiple R-squared: 0.9171, Adjusted R-squared: 0.9024
## F-statistic: 62.66 on 3 and 17 DF, p-value: 2.128e-09
Now we have two models: one includes the interaction effect and the other does not. Which model is better? To answer this question, we can examine the lack of fit between these two models. For ease of reference, the model with the interaction effect is called the full model, as all effects are included in it, whereas the other is called the restricted model, as the interaction effect is removed from it. The sum of squared residuals \(SSRes=\sum(y-\hat{y})^2\) of the full model, denoted \(SSRes_3\), is presumably smaller than that of the restricted model, denoted \(SSRes\). The difference between these two sums of squared residuals is called the lack of fit; that is, \(SSRes=SSRes_3+\) lack of fit. The lack of fit is itself a sum of squares and hence \(\geq0\). Turning the residual SS and the lack of fit into mean squares, we can use an F test to check whether the lack of fit is significant. The df of the lack of fit is the difference between the dfs of the two models; here it is 1, as the only difference between the models is the interaction effect. The df of the residual SS of the full model is \(N-4\), as this model has four parameters including the intercept. The test statistic is thus \(F=\dfrac{(SSRes-SSRes_3)/1}{SSRes_3/(N-4)}\). See the code below. The result suggests that the lack of fit is not significant; that is, the full model does not perform better than the restricted model. Alternatively, an easier way to conduct the test for the lack of fit is to use anova().
ssr3<-sum(m3$residuals^2)  # SS of residuals of the full model
ssr<-sum(m$residuals^2)    # SS of residuals of the restricted model
ssr3
## [1] 2172.494
ssr
## [1] 2180.927
mslof<-(ssr-ssr3)/1        # MS of the lack of fit, df = 1
msr3<-ssr3/17              # MS of residuals of the full model, df = N - 4 = 17
1-pf(mslof/msr3,1,17)      # p value of the F test for the lack of fit
## [1] 0.8003446
With anova(a, b), model a is compared with model b via an F test. Thus, we can pass the restricted and the full model as its arguments to check whether the lack of fit is significant. The result is the same as that reported by the previous code.
anova(m,m3)
## Analysis of Variance Table
##
## Model 1: y ~ x1 + x2
## Model 2: y ~ x1 * x2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 18 2180.9
## 2 17 2172.5 1 8.4336 0.066 0.8003
Then, what if we leave only x1 in the model? Will the model performance get significantly worse? The result shows a significant lack of fit. This suggests that, among these candidates, the model with the two predictors included and without their interaction effect provides the relatively best fit to the data.
m1<-lm(y~x1,dta)
summary(m1)
##
## Call:
## lm(formula = y ~ x1, data = dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.403 -6.121 -0.311 4.228 27.452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.0454 9.4622 7.191 7.86e-07 ***
## x1 1.8359 0.1464 12.539 1.23e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.19 on 19 degrees of freedom
## Multiple R-squared: 0.8922, Adjusted R-squared: 0.8865
## F-statistic: 157.2 on 1 and 19 DF, p-value: 1.229e-10
anova(m1,m)
## Analysis of Variance Table
##
## Model 1: y ~ x1
## Model 2: y ~ x1 + x2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19 2824.4
## 2 18 2180.9 1 643.48 5.3108 0.03332 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The data frame mtcars, which comes with base R in the {datasets} package, contains information about 32 cars. Please, based on the lack of fit, build up the best model by selecting suitable variables among cyl, disp, hp, drat, wt, and qsec to predict mpg.
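As one possible starting point (this is only a sketch, not the answer), you can fit a model with all six candidate predictors and compare it with a smaller model via anova(), just as we did above:
full <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
small <- lm(mpg ~ wt, data = mtcars)  # an arbitrary restricted model chosen for illustration
anova(small, full)                    # F test for the lack of fit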