As shown in the previous tutorial, the predictors in a regression model can be categorical. Specifically, we can turn categorical variables into dummy variables in order to conduct the regression analysis. In this tutorial, some further applications of dummy variables in regression analysis are introduced. Suppose we are interested in predicting \(y\) by \(x\). The usual regression model would be \(\hat{y}=b_0+b_1x\). Suppose further that the data can be separated into two groups, such as males and females. Then, we can estimate a different linear relation between \(x\) and \(y\) for each gender. Now comes the question: are the intercepts/slopes of these two regression models significantly different from each other? To answer this question, a dummy variable can be useful.

For instance, we can set up gender as a dummy variable \(g\) with \(g=0\) for females and \(g=1\) for males. The regression model is then \(\hat{y}=b_{0.all}+b_{1.all}x+b_{2.all}g+b_{3.all}gx\). In this model, \(x\), \(g\), and the interaction between \(x\) and \(g\) are all included; we can call it the full model. Here, \(gx\) is simply the product of \(x\) and \(g\). When \(g=0\), the model becomes \(\hat{y}=b_{0.all}+b_{1.all}x\), which is specifically generated for the females. Note that \(b_{0.all}\) and \(b_{1.all}\) are not the same as \(b_0\) and \(b_1\) in the regression model fitted across all gender groups. When \(g=1\), the model becomes \(\hat{y}=b_{0.all}+b_{1.all}x+b_{2.all}+b_{3.all}x=(b_{0.all}+b_{2.all})+(b_{1.all}+b_{3.all})x\), which is the model for males. Now compare this model with the model for females. If \(b_{2.all}\) is not significantly different from 0, the two regression lines have the same intercept. Similarly, if \(b_{3.all}\) is not significantly different from 0, the two regression lines have the same slope. Therefore, the statistical tests on these coefficients can answer the question. Let us create the data first.

Example 1

The code below creates the heights, weights, and genders of 20 participants.

dta<-data.frame(gender=rep(c("F","M"),each=10),
                height=c(56,60,64,68,72,54,62,65,65,70,
                         64,68,72,76,80,62,69,74,75,82),
                weight=c(117,125,133,141,149,109,128,131,131,145,
                         211,223,235,247,259,201,228,245,241,269))
dta
##    gender height weight
## 1       F     56    117
## 2       F     60    125
## 3       F     64    133
## 4       F     68    141
## 5       F     72    149
## 6       F     54    109
## 7       F     62    128
## 8       F     65    131
## 9       F     65    131
## 10      F     70    145
## 11      M     64    211
## 12      M     68    223
## 13      M     72    235
## 14      M     76    247
## 15      M     80    259
## 16      M     62    201
## 17      M     69    228
## 18      M     74    245
## 19      M     75    241
## 20      M     82    269

Describe Data

The following code draws a scatter plot with height on the x axis and weight on the y axis. Overall, there is a positive relation between height and weight, consistent with our intuition.

library(ggplot2)
hw.fg<-ggplot(data=dta,aes(x=height,y=weight))+
         geom_point(shape=1)
hw.fg

Correlation Analysis

The correlation coefficient between height and weight is \(r=.80\), \(p<.01\). We can also check it for each gender. The correlation coefficient is quite large for each gender (\(r=.99\)) and both \(p\) values are less than .01. Here, subset() is used to extract the rows of a data frame matching the level(s) of one or more variables.

with(dta,cor.test(height,weight))
## 
##  Pearson's product-moment correlation
## 
## data:  height and weight
## t = 5.6208, df = 18, p-value = 2.473e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5498283 0.9168438
## sample estimates:
##       cor 
## 0.7981507
with(subset(dta,gender=="F"),cor.test(height,weight))
## 
##  Pearson's product-moment correlation
## 
## data:  height and weight
## t = 18.969, df = 8, p-value = 6.173e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9527650 0.9975042
## sample estimates:
##       cor 
## 0.9890651
with(subset(dta,gender=="M"),cor.test(height,weight))
## 
##  Pearson's product-moment correlation
## 
## data:  height and weight
## t = 25.883, df = 8, p-value = 5.328e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9742211 0.9986519
## sample estimates:
##       cor 
## 0.9940821

Regression Analysis

Let us first conduct a regression analysis on all the data. As shown in the summary table, both the intercept and the slope are significant. The \(R^2=.64\) suggests that this model can explain 64% of the variability in the weights of these 20 participants.

m1<-lm(weight~height,dta)
summary(m1)
## 
## Call:
## lm(formula = weight ~ height, data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.199 -28.823   3.995  25.228  53.286 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -227.292     73.483  -3.093  0.00627 ** 
## height         6.048      1.076   5.621 2.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.9 on 18 degrees of freedom
## Multiple R-squared:  0.637,  Adjusted R-squared:  0.6169 
## F-statistic: 31.59 on 1 and 18 DF,  p-value: 2.473e-05
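As a side note, in simple regression \(R^2\) is just the squared correlation coefficient, so the \(R^2\) above can be checked directly from the earlier cor.test() result (a quick sanity check using the printed values):

```r
r <- 0.7981507   # cor(height, weight) reported by cor.test() above
r^2              # ~0.637, matching Multiple R-squared in summary(m1)
```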

It is easy to fit this regression model to the female and male data separately. Not surprisingly, the slopes of the two regression lines are both significant, while the intercepts are not significantly different from 0.

m.male<-lm(weight~height,subset(dta,gender=="M"))
summary(m.male)
## 
## Call:
## lm(formula = weight ~ height, data = subset(dta, gender == "M"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8312 -1.5901  0.1174  1.6944  3.3585 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.6017     8.9302   0.627    0.548    
## height        3.1897     0.1232  25.883 5.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.407 on 8 degrees of freedom
## Multiple R-squared:  0.9882, Adjusted R-squared:  0.9867 
## F-statistic: 669.9 on 1 and 8 DF,  p-value: 5.328e-09
m.female<-lm(weight~height,subset(dta,gender=="F"))
summary(m.female)
## 
## Call:
## lm(formula = weight ~ height, data = subset(dta, gender == "F"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8342 -1.2214  0.5906  1.1658  2.0286 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.3975     7.0533   -0.34    0.743    
## height        2.0959     0.1105   18.97 6.17e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.915 on 8 degrees of freedom
## Multiple R-squared:  0.9782, Adjusted R-squared:  0.9755 
## F-statistic: 359.8 on 1 and 8 DF,  p-value: 6.173e-08

Since we have already fitted the regression models, why not add the regression lines to the scatter plot? This is very easy with {ggplot2}: simply use stat_smooth() to specify how the regression model should be fitted and to which data.

hw.fg<-hw.fg+stat_smooth(method="lm",col="black",se=F)+
         stat_smooth(data=subset(dta,gender=="F"),
                     method="lm",col="red",se=F)+
         stat_smooth(data=subset(dta,gender=="M"),
                     method="lm",col="blue",se=F)
hw.fg

Visual inspection of the figure above suggests a clear discrepancy between the estimated regression lines for the two genders. The regression line for the overall data cannot cover most of the variability of the dependent variable; instead, the data of each gender are better explained by their own regression line. The female line looks flatter than the male line. Let us check this observation. First, we create a dummy variable for gender with 0 for female and 1 for male. Second, we simply run lm() for the full model: weight = \(b_0\) + \(b_1\)height + \(b_2\)g + \(b_3\)g\(\times\)height. According to the previous description of the full model, \(b_2\) and \(b_3\) are the key coefficients for testing the differences in intercept and slope between the two genders. As shown in the summary table below, g is not significant, suggesting that the intercepts of the lines for the two genders do not differ. However, the coefficient for the interaction between g and height is significant; that is, the two slopes are different. Since this coefficient is positive, we can say that the relation between height and weight is stronger for males than for females. Also, \(R^2=.999\) suggests that this model explains nearly all the variability in weight.

dta$g<-ifelse(dta$gender=="F",0,1)
m2<-lm(weight~height+g+I(height*g),dta)
summary(m2)
## 
## Call:
## lm(formula = weight ~ height + g + I(height * g), data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8312 -1.7797  0.4958  1.3575  3.3585 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.3975     8.0114  -0.299    0.769    
## height          2.0959     0.1255  16.700 1.51e-11 ***
## g               7.9991    11.3705   0.703    0.492    
## I(height * g)   1.0939     0.1678   6.520 7.07e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.175 on 16 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9985 
## F-statistic:  4250 on 3 and 16 DF,  p-value: < 2.2e-16
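A quick arithmetic check, using the rounded estimates printed above: the male line implied by the full model, intercept \(b_0+b_2\) and slope \(b_1+b_3\), should reproduce the separate male fit from the previous section (within rounding of the printed output).

```r
# Reconstruct the male line from the full-model estimates above
male.intercept <- -2.3975 + 7.9991   # b0 + b2, cf. 5.6017 in m.male
male.slope     <-  2.0959 + 1.0939   # b1 + b3, cf. 3.1897 in m.male
c(male.intercept, male.slope)
```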

Example 2

Risk and Sammarco (1991) examined the relation between distance from the Australian shore and the density of coral skeletons. According to their analysis, the density of coral skeletons appears to increase with the distance from shore. The data are contained in rwg179.txt. Let us re-examine their analysis. First, import the data. There are 27 observations, and the meaning of each variable can be easily recognized from the column names.

RS.dta<-read.table("rwg179.txt",header=T)
head(RS.dta)
##   Sample       Reef Distance Density
## 1      1 MiddleReef      3.5   1.337
## 2      2 MiddleReef      3.5   1.216
## 3      3 MiddleReef      3.5   1.309
## 4      4    AlmaBay     14.3   1.053
## 5      5    AlmaBay     14.3   1.082
## 6      6    AlmaBay     14.3   1.079
dim(RS.dta)
## [1] 27  4

Describe Data

We are interested in the relation between the distance to shore and the density of coral skeletons. Let us draw a scatter plot for a quick check. It looks like the farther from shore, the higher the density.

RS.fg<-ggplot(data=RS.dta,aes(x=Distance,y=Density))+
         geom_point(shape=1,col="blue")
RS.fg

Correlation Analysis

Although the correlation between Distance and Density is significant, \(r=.69\), \(p<.01\), the increasing trend of the density of coral skeletons does not look very linear.

with(RS.dta,cor.test(Distance,Density))
## 
##  Pearson's product-moment correlation
## 
## data:  Distance and Density
## t = 4.728, df = 25, p-value = 7.54e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4155560 0.8461558
## sample estimates:
##       cor 
## 0.6870699

Regression Analysis

We test the linear regression model first. This linear model has a medium \(R^2\), which is not bad. In their study, Risk and Sammarco (1991) also included a squared-Distance term in the regression model. The \(R^2\) increases with this extra predictor, but the improvement is not significant. Thus, the conclusion can be drawn that the farther the sampling location is from shore, the higher the density of coral skeletons.

RS.m1<-lm(Density~Distance,RS.dta)
summary(RS.m1)
## 
## Call:
## lm(formula = Density ~ Distance, data = RS.dta)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.212753 -0.047247 -0.009136  0.062975  0.169705 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.2119729  0.0324376  37.363  < 2e-16 ***
## Distance    0.0037609  0.0007954   4.728 7.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09813 on 25 degrees of freedom
## Multiple R-squared:  0.4721, Adjusted R-squared:  0.4509 
## F-statistic: 22.35 on 1 and 25 DF,  p-value: 7.54e-05
RS.m2<-lm(Density~Distance+I(Distance^2),RS.dta)
summary(RS.m2)
## 
## Call:
## lm(formula = Density ~ Distance + I(Distance^2), data = RS.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20988 -0.03427  0.01100  0.04247  0.14731 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.167e+00  5.556e-02  20.995   <2e-16 ***
## Distance       7.380e-03  3.678e-03   2.006   0.0562 .  
## I(Distance^2) -4.482e-05  4.447e-05  -1.008   0.3237    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0981 on 24 degrees of freedom
## Multiple R-squared:  0.4935, Adjusted R-squared:  0.4513 
## F-statistic: 11.69 on 2 and 24 DF,  p-value: 0.0002851
anova(RS.m1,RS.m2)
## Analysis of Variance Table
## 
## Model 1: Density ~ Distance
## Model 2: Density ~ Distance + I(Distance^2)
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     25 0.24074                           
## 2     24 0.23096  1 0.0097723 1.0155 0.3237
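The F statistic in this table can be computed by hand from the two residual sums of squares: the reduction in RSS per added parameter, divided by the mean squared residual of the larger model. A sketch using the rounded RSS values printed above (so the result matches the table only up to rounding):

```r
rss1 <- 0.24074   # RSS of RS.m1, from the anova table
rss2 <- 0.23096   # RSS of RS.m2
Fval <- ((rss1 - rss2) / 1) / (rss2 / 24)   # 1 extra df; 24 residual df
Fval              # ~1.016, close to the 1.0155 reported above
```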

However, is this the best conclusion? Perhaps the relation between the prediction residuals and the distance to shore can give us some insight. As shown in the analysis below, there is no correlation between them, as assumed in regression analysis. However, if you look at the scatter plot of the distances against the residuals of this model, the model seems to overestimate the density of coral skeletons at AlmaBay, OrpheusIs., and PandoraReef, while underestimating it at GreatPalmIs. and MorindaShoals.

with(RS.dta,cor.test(RS.m1$residuals,RS.dta$Distance))
## 
##  Pearson's product-moment correlation
## 
## data:  RS.m1$residuals and RS.dta$Distance
## t = 4.8501e-16, df = 25, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.380014  0.380014
## sample estimates:
##          cor 
## 9.700116e-17
RSresidual.fg<-ggplot(data=data.frame(Residual=RS.m1$residuals,
                                      Distance=RS.dta$Distance),
                      aes(x=Distance,y=Residual))+
                geom_point(shape=1,col="red")
RSresidual.fg

Does this suggest that the density of coral skeletons is more relevant to the sampling location than to the distance to shore? To find out, we use the sampling location, namely Reef, as the predictor of the density of coral skeletons. The results are shown in the summary table below.

RS.m3<-lm(Density~Reef,RS.dta)
summary(RS.m3)
## 
## Call:
## lm(formula = Density ~ Reef, data = RS.dta)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.071333 -0.031000  0.007667  0.019500  0.093333 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.07133    0.02614  40.979  < 2e-16 ***
## ReefBowdenReef        0.34067    0.03697   9.214 3.10e-08 ***
## ReefGreatPalmIs.      0.28400    0.03697   7.681 4.35e-07 ***
## ReefGrubReef          0.42433    0.03697  11.477 1.03e-09 ***
## ReefLittleBroadhurst  0.34767    0.03697   9.403 2.28e-08 ***
## ReefMiddleReef        0.21600    0.03697   5.842 1.56e-05 ***
## ReefMorindaShoals     0.39467    0.03697  10.675 3.24e-09 ***
## ReefOrpheusIs.        0.17033    0.03697   4.607 0.000219 ***
## ReefPandoraReef       0.21033    0.03697   5.689 2.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04528 on 18 degrees of freedom
## Multiple R-squared:  0.9191, Adjusted R-squared:  0.8831 
## F-statistic: 25.55 on 8 and 18 DF,  p-value: 2.615e-08

Since Reef is a nominal variable, lm() by default turns it into dummy variables. There are 9 reefs in total; therefore, there are 8 predictors in this regression model. Note that in order to represent a nominal variable with \(k\) levels, \(k-1\) dummy variables are needed. The intercept of this model is actually the group mean of the first group. In R, the nominal levels are by default ranked in alphabetical order, so the first group here is AlmaBay. The mean density at AlmaBay is indeed the same as the intercept. The second predictor in this model is ReefBowdenReef. Its estimated coefficient is actually the mean density at BowdenReef minus the mean density at AlmaBay. The regression model here is \(\hat{y}=b_0+b_1g_1+b_2g_2+\cdots+b_8g_8\), with each dummy variable \(g\) being either 0 or 1. Thus, for the first group, all \(g\) variables are 0 and \(\hat{y}=b_0\); this is why the intercept is actually the first group mean. Recall that the regression line always goes through the sample means. Similarly, the second group mean is \(\hat{y}=b_0+b_1\); therefore, \(b_1\) is the difference between the means of these two groups. By the same principle, you can check the rest of the coefficients.

unique(RS.dta$Reef)
## [1] MiddleReef       AlmaBay          OrpheusIs.       PandoraReef     
## [5] GreatPalmIs.     MorindaShoals    LittleBroadhurst BowdenReef      
## [9] GrubReef        
## 9 Levels: AlmaBay BowdenReef GreatPalmIs. GrubReef ... PandoraReef
D.AlmaBay<-mean(subset(RS.dta,Reef=="AlmaBay")$Density)
D.AlmaBay
## [1] 1.071333
D.BowdenReef<-mean(subset(RS.dta,Reef=="BowdenReef")$Density)
D.BowdenReef
## [1] 1.412
D.BowdenReef-D.AlmaBay
## [1] 0.3406667
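The same principle can be illustrated in miniature. The sketch below uses a toy factor with made-up numbers (since rwg179.txt is needed for the real data): the intercept equals the first group's mean, and each dummy coefficient equals that group's mean minus the first group's mean.

```r
# Toy data: three groups with easy-to-check means (A = 2, B = 5, C = 11)
toy <- data.frame(grp = factor(rep(c("A", "B", "C"), each = 3)),
                  y   = c(1, 2, 3, 4, 5, 6, 10, 11, 12))
m <- lm(y ~ grp, toy)               # grp is dummy-coded automatically
means <- tapply(toy$y, toy$grp, mean)
coef(m)[1]                          # mean of group A: 2
coef(m)["grpB"]                     # mean(B) - mean(A): 5 - 2 = 3
coef(m)["grpC"]                     # mean(C) - mean(A): 11 - 2 = 9
```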

The \(R^2\) of this model is far higher than that of the previous two models. The lack of fit is significantly smaller in this model compared with the one containing the squared-distance term, even though this model uses more parameters (df = N − number of parameters = 27 − 9 = 18).

anova(RS.m2,RS.m3)
## Analysis of Variance Table
## 
## Model 1: Density ~ Distance + I(Distance^2)
## Model 2: Density ~ Reef
##   Res.Df      RSS Df Sum of Sq      F   Pr(>F)    
## 1     24 0.230964                                 
## 2     18 0.036908  6   0.19406 15.774 2.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the regression analysis shows that the density of coral skeletons is not linearly predicted by the distance to shore, it is better to replot the density of coral skeletons by Reef. See the figure below. The black dots represent the density data and the red dots are the mean densities of coral skeletons at the reefs.

RSReef.fg<-ggplot(data=RS.dta,aes(x=Reef,y=Density))+
             geom_dotplot(binaxis="y",stackdir="center",binwidth=0.02)+
             stat_summary(fun.y=mean,geom="point",size=2,col="red")+
             theme(axis.text.x = element_text(angle=90,hjust=1))
RSReef.fg

Exercise:

In the study of Pinhas, Tzelgov, and Stern (2011), the RTs of 8 subjects judging which number in a pair is larger were collected at 5 intrapair distances. The data are contained in repRT.txt. Please use the method introduced in this tutorial to analyze the linear trend between intrapair distance and RT, with the individual differences in RT taken into account as well.