In the simple linear regression model, the dependent variable \(y\) is predicted by a single predictor \(x\). In many cases, however, more than one predictor is involved. A regression model with more than one predictor is called a multiple regression model.
In the data file score.txt, each row represents a participant and each column represents a measurement for that participant. The meanings of the columns are listed as follows.
Y1 - A measure of success in graduate school.
X1 - A measure of intellectual ability.
X2 - A measure of “work ethic.”
X3 - A second measure of intellectual ability.
X4 - A measure of spatial ability.
Y2 - Score on a major review paper.
Suppose we are interested in whether a graduate student’s success in graduate school can be predicted by her/his IQ and work ethic. We set up the multiple regression model \(\hat{Y}_1=b_0+b_1X_1+b_2X_2\) and fit it to the observed dependent variable \(Y_1\). Let’s conduct the regression analysis. First, we import the data and check how many participants and how many variables there are.
dta<-read.table("score.txt",header=T)
dim(dta)
## [1] 20 6
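Before fitting any model, it can also help to preview the first few rows and the variable types, to confirm that the columns match the descriptions above.
head(dta)
str(dta)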
For a variable to be included in the regression model, it should at least be correlated with the dependent variable. The output below shows the correlation coefficients \(r\) between \(Y_1\) and \(X_1\) as well as between \(Y_1\) and \(X_2\). Both estimated correlation coefficients are quite large, suggesting that IQ and work ethic should be included in the regression model.
with(dta,cor.test(Y1,X1))
##
## Pearson's product-moment correlation
##
## data: Y1 and X1
## t = 5.026, df = 18, p-value = 8.776e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4859502 0.9017425
## sample estimates:
## cor
## 0.7641466
with(dta,cor.test(Y1,X2))
##
## Pearson's product-moment correlation
##
## data: Y1 and X2
## t = 5.1012, df = 18, p-value = 7.462e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4945910 0.9038464
## sample estimates:
## cor
## 0.7688386
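Instead of calling cor.test( ) pair by pair, a compact way to screen all the candidate predictors at once is the full correlation matrix. This is only a supplementary sketch, assuming all six columns are numeric; cor( ) reports the estimates without significance tests.
# Pairwise correlations among all variables, rounded to 2 decimals
round(cor(dta),2)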
The figures below show the scatter plots of \(Y_1\) against \(X_1\) and of \(Y_1\) against \(X_2\). They show an ascending trend between the student’s intelligence and his/her success in graduate school, and the same trend between the student’s work ethic and his/her success in graduate school.
library(ggplot2)
library(gridExtra)
fg1<-ggplot(dta,aes(x=X1,y=Y1))+
geom_point(color="deepskyblue2")
fg2<-ggplot(dta,aes(x=X2,y=Y1))+
geom_point(color="tomato")
grid.arrange(fg1,fg2,ncol=2)
However, the student’s intelligence is not significantly correlated with his/her work ethic. That is, these two variables can be treated as independent of each other.
with(dta,cor.test(X1,X2))
##
## Pearson's product-moment correlation
##
## data: X1 and X2
## t = 1.1187, df = 18, p-value = 0.278
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2114169 0.6267637
## sample estimates:
## cor
## 0.2549567
Thus, it is safe to include both of them in the regression model as predictors. Normally, the predictors in a multiple regression model should be independent of one another. Now we can run the regression by calling the function lm( ) with the formula set up as \(\hat{Y}_1=b_0+b_1X_1+b_2X_2\). The results show that the intercept (i.e., \(b_0\)) is significantly different from 0, and so are the other two estimated regression coefficients (i.e., \(b_1\) and \(b_2\)). The \(r^2\) is 0.94, suggesting that about 94% of the total variation in success in graduate school can be explained by the students’ intelligence scores and their degrees of work ethic.
m1<-lm(Y1~X1+X2,dta)
summary(m1)
##
## Call:
## lm(formula = Y1 ~ X1 + X2, data = dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5964 -4.6383 -0.2101 5.2806 10.9762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.2219 4.5874 22.065 5.97e-14 ***
## X1 1.0004 0.1042 9.600 2.81e-08 ***
## X2 1.0708 0.1104 9.699 2.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.541 on 17 degrees of freedom
## Multiple R-squared: 0.9363, Adjusted R-squared: 0.9288
## F-statistic: 125 on 2 and 17 DF, p-value: 6.824e-11
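Besides the point estimates and \(t\) tests in summary( ), we can also obtain interval estimates for the coefficients; a quick supplementary check:
# 95% confidence intervals for b0, b1, and b2
confint(m1)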
The residual analysis shows that the residuals are not correlated with either the students’ intelligence score or the students’ work ethic.
dta$residuals<-m1$residuals
fg3<-ggplot(dta,aes(x=X1,y=residuals))+
geom_point(color="deepskyblue2")
fg4<-ggplot(dta,aes(x=X2,y=residuals))+
geom_point(color="tomato")
grid.arrange(fg3,fg4,ncol=2)
with(dta,cor.test(X1,residuals))
##
## Pearson's product-moment correlation
##
## data: X1 and residuals
## t = 1.0161e-15, df = 18, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4425208 0.4425208
## sample estimates:
## cor
## 2.394938e-16
with(dta,cor.test(X2,residuals))
##
## Pearson's product-moment correlation
##
## data: X2 and residuals
## t = 6.3856e-17, df = 18, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4425208 0.4425208
## sample estimates:
## cor
## 1.505109e-17
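Another routine residual check, complementary to the correlations above, is to inspect the distribution of the residuals; a minimal sketch using base R graphics:
# Q-Q plot of the residuals against a normal distribution
qqnorm(m1$residuals)
qqline(m1$residuals)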
Although the dependent variable in a regression analysis is assumed to be continuous, the predictors need not be. Recall the special case of simple linear regression used to perform the independent groups \(t\) test. Normally, we encode a dichotomous variable as 0 and 1, one code for each of its levels. This way of encoding is called dummy coding, and 0 and 1 are the dummy codes. Such a variable is therefore called a dummy variable. The example below shows an application of a dummy variable in regression analysis.
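To make the link to the independent groups \(t\) test concrete, here is a minimal sketch with hypothetical data: regressing the outcome on a 0/1 dummy variable gives the same \(t\) statistic (up to sign) and the same \(p\)-value for the slope as the equal-variance two-sample \(t\) test.
# Hypothetical two-group data for illustration
set.seed(1)
toy<-data.frame(group=rep(c("A","B"),each=10),
y=c(rnorm(10,mean=50),rnorm(10,mean=55)))
toy$d<-ifelse(toy$group=="B",1,0)   # dummy coding: A = 0, B = 1
summary(lm(y~d,toy))                # t test of the slope b1
t.test(y~group,data=toy,var.equal=TRUE)   # same t (up to sign) and p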
Suppose we have 10 male and 10 female participants’ heights and bodyweights. We know that human heights are highly correlated with bodyweights. That is, when we regress bodyweight on height, the slope of the regression line will be significantly different from 0.
However, is the slope of the regression line the same for the two genders? That is, we are interested in checking whether the slope \(b_1\) of the regression line (weight=\(b_{0}+b_{1}\)height) differs between genders. First, create the data set. See the following code.
bh.dta<-data.frame(gender=rep(c("F","M"),each=10),
height=c(56,60,64,68,72,54,62,65,65,70,
64,68,72,76,80,62,69,74,75,82),
weight=c(117,125,133,141,149,109,128,131,131,145,
211,223,235,247,259,201,228,245,241,269))
We can make a scatter plot to understand the relation between height and weight. Unsurprisingly, bodyweight tends to increase as height increases. See the left panel. The gray line is the estimated regression line. The slope is significantly different from 0 and \(r^2=.64\). This result is not bad.
fg5<-ggplot(bh.dta,aes(x=height,y=weight))+
geom_point(color="deepskyblue2")+
geom_smooth(method="lm",se=F,color="gray")
fg6<-ggplot(bh.dta,aes(x=height,y=weight,color=gender))+
geom_point()
grid.arrange(fg5,fg6,ncol=2)
## `geom_smooth()` using formula = 'y ~ x'
m3<-lm(weight~height,bh.dta)
summary(m3)
##
## Call:
## lm(formula = weight ~ height, data = bh.dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.199 -28.823 3.995 25.228 53.286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -227.292 73.483 -3.093 0.00627 **
## height 6.048 1.076 5.621 2.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.9 on 18 degrees of freedom
## Multiple R-squared: 0.637, Adjusted R-squared: 0.6169
## F-statistic: 31.59 on 1 and 18 DF, p-value: 2.473e-05
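Before moving to a formal test, one exploratory check is to color the residuals of m3 by gender; if the two genders separate systematically in this plot, a single regression line is missing some structure. The column name res3 below is just a convenient label.
bh.dta$res3<-m3$residuals
ggplot(bh.dta,aes(x=height,y=res3,color=gender))+
geom_point()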
However, if we mark the genders with different colors, the slope between height and weight seems flatter for females than for males. See the right panel. Is our observation true? We can make a small change to our regression model to test this observation. We now change our model to \(\hat{y}=b_0+b_1x+b_2g+b_3gx\), with \(x\) as height and \(y\) as weight. The variable \(g\) denotes the dummy variable for gender, and \(gx\) is the product of \(g\) and \(x\), which represents the interaction between gender and height. For females, \(g=0\), and for males, \(g=1\). Thus, for females, the model becomes \(\hat{y}=b_0+b_1x+b_2\cdot0+b_3\cdot0\cdot x=b_0+b_1x\). In this female model, the slope is \(b_1\). For males, the model becomes \(\hat{y}=b_0+b_1x+b_2\cdot1+b_3\cdot1\cdot x=(b_0+b_2)+(b_1+b_3)x\). In this male model, the slope is \((b_1+b_3)\). If \(b_3\) is not significantly different from 0, then \(b_1+b_3=b_1\), indicating that the two slopes are not different. Similarly, if \(b_2\) is not significantly different from 0, then \(b_0+b_2=b_0\); that is, the two regression lines have the same intercept. Now we fit this regression model to our data. In the R code below, I(height*g) means the product of height and g.
bh.dta$g<-ifelse(bh.dta$gender=="F",0,1)
m4<-lm(weight~height+g+I(height*g),bh.dta)
summary(m4)
##
## Call:
## lm(formula = weight ~ height + g + I(height * g), data = bh.dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8312 -1.7797 0.4958 1.3575 3.3585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.3975 8.0114 -0.299 0.769
## height 2.0959 0.1255 16.700 1.51e-11 ***
## g 7.9991 11.3705 0.703 0.492
## I(height * g) 1.0939 0.1678 6.520 7.07e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.175 on 16 degrees of freedom
## Multiple R-squared: 0.9987, Adjusted R-squared: 0.9985
## F-statistic: 4250 on 3 and 16 DF, p-value: < 2.2e-16
The results show that the intercept is not different from 0; that is, \(b_0=0\). Height can predict weight, as \(b_1>0\), \(p<.01\). The coefficient of \(g\) is not significantly different from 0; that is, \(b_2=0\). According to the previous discussion, the intercepts of these two lines are not different from each other. However, the test for the coefficient of the product of \(x\) and \(g\) is significant. Thus, \(b_3\neq0\) and the slopes of these two lines differ. Also, as \(b_3>0\), we know that the slope of the male regression line is larger than that of the female regression line. Note that this model performs far better than the previous one, with \(r^2=.9987\). The following figure shows the two regression lines as well as their corresponding data points.
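As a side note, the same model can be specified without creating the dummy variable by hand: if gender is supplied as a factor (or character) variable, lm( ) builds the dummy and interaction columns automatically, with F as the reference level under R’s default treatment coding. A sketch; the coefficients should match those of m4.
# Equivalent specification using the formula interface
m5<-lm(weight~height*gender,bh.dta)
summary(m5)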
ggplot(bh.dta,aes(x=height,y=weight,color=gender))+
geom_point(shape=1)+
geom_smooth(method="lm",se=F)
## `geom_smooth()` using formula = 'y ~ x'
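Finally, the two fitted lines in the figure can be recovered directly from the coefficients of m4, following the algebra above: the female line has intercept \(b_0\) and slope \(b_1\), and the male line has intercept \(b_0+b_2\) and slope \(b_1+b_3\).
b<-coef(m4)
# Female line: intercept b0 and slope b1
c(b["(Intercept)"],b["height"])
# Male line: intercept b0 + b2 and slope b1 + b3
c(b["(Intercept)"]+b["g"],b["height"]+b["I(height * g)"])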