Although a bit atypical, the predictor in a regression model can be nominal/categorical. For example, \(y\) is a continuous variable and \(x\) is dichotomous with two levels, 0 and 1. In a scatter plot of such data, the x-axis has only two ticks, so all data points line up in two vertical strips, as in the dot plot shown below. As usual, the linear regression model for these two variables is \(\hat{y}=b_0+b_1x\), except that \(x\) takes only the values 0 and 1. At each level \(x_j\), every observation is \(y_{j.i}=b_0+b_1x_{j}+\epsilon_{j.i}\). Since the least-squares residuals within each group sum to zero, the mean of the observed \(y\) values at a given \(x_j\) is \(\bar{y_j}=b_0+b_1x_{j}\), which is exactly the fitted value \(\hat{y_j}\). Consequently, when \(x=0\), the model predicts \(\hat{y_0}=b_0=\bar{y_0}\); that is, the intercept of the regression line is the mean of the observed \(y\) values for \(x=0\). When \(x=1\), the model predicts \(\hat{y_1}=b_0+b_1=\bar{y_1}\); that is, the mean of the \(y\) values for \(x=1\) is the sum of the intercept and the slope. It follows that the slope is exactly the difference between the two group means when \(x\) is a dummy (0/1) variable.
Below is a fictitious data set of 7 participants' heights and genders (0: female, 1: male).
dta<-data.frame(gender=c(1,1,1,0,0,0,0),
                height=c(72.71,70.98,71.35,
                         62.87,64.14,62.48,64.73))
dta
dta
## gender height
## 1 1 72.71
## 2 1 70.98
## 3 1 71.35
## 4 0 62.87
## 5 0 64.14
## 6 0 62.48
## 7 0 64.73
We can make a dot plot of the heights of the two genders. First, the variable gender is converted to a factor, namely a categorical variable. Then geom_dotplot() is used to make a dot plot, which is essentially a histogram that uses stacked dots to represent the counts. We can also use stat_summary() to compute summary statistics of the sample (e.g., the mean) and mark them as red dots in the plot. In addition, we can connect the two group means with a solid line via geom_segment(). Note that the x coordinates of the segment are 1 and 2, which are the positions of the two factor levels on the discrete axis, not the values of gender (0 and 1).
dta$gender<-as.factor(dta$gender)
heights<-with(dta,tapply(height,gender,mean))
library(ggplot2)
hg.fg<-ggplot(data=dta,aes(x=gender,y=height))+
  geom_dotplot(binaxis="y",stackdir="center",binwidth=0.25)+
  stat_summary(fun.data="mean_sdl",fun.args=list(mult=1),
               geom="point",col="red")+
  geom_segment(data=data.frame(x1=1,x2=2,y1=heights[1],y2=heights[2]),
               aes(x=x1,y=y1,xend=x2,yend=y2),col="red")+
  scale_x_discrete(breaks=c(0,1),labels=c("Female","Male"))
hg.fg
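As a side note, the segment connecting the two means can also be drawn by stat_summary() itself, without precomputing heights. The sketch below assumes a recent version of ggplot2 (where fun= replaces the older fun.y=); aes(group=1) tells ggplot2 to treat the two factor levels as one group so their means can be joined by a line.
# a sketch of an alternative: let stat_summary() compute and connect the means
ggplot(data=dta,aes(x=gender,y=height))+
  geom_dotplot(binaxis="y",stackdir="center",binwidth=0.25)+
  stat_summary(fun=mean,geom="point",col="red")+
  stat_summary(aes(group=1),fun=mean,geom="line",col="red")+
  scale_x_discrete(breaks=c(0,1),labels=c("Female","Male"))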
As usual, lm() is used to conduct the regression analysis. The intercept and the slope are both significantly different from 0. Specifically, the intercept is exactly the mean of the female heights. There are many ways to compute the mean height of each group in a data frame. The first method is to locate the positions in the height vector that belong to one gender (e.g., dta$gender==0 locates the females) and compute the mean of the heights at those positions. The second method introduced here is tapply(). When tapply(data, index, function) is called, function is applied to data separately for each level of index. Thus, with height as the data, gender as the index, and mean as the function, R computes the mean height for each gender. The results are listed below; clearly, the mean of the female heights is exactly the intercept.
m<-lm(height~gender,dta)
summary(m)
##
## Call:
## lm(formula = height ~ gender, data = dta)
##
## Residuals:
## 1 2 3 4 5 6 7
## 1.030 -0.700 -0.330 -0.685 0.585 -1.075 1.175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.5550 0.5004 127.02 5.74e-10 ***
## gender1 8.1250 0.7643 10.63 0.000127 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.001 on 5 degrees of freedom
## Multiple R-squared: 0.9576, Adjusted R-squared: 0.9492
## F-statistic: 113 on 1 and 5 DF, p-value: 0.0001274
f.height<-mean(dta$height[dta$gender==0])
m.height<-mean(dta$height[dta$gender==1])
c(f.height,m.height)
## [1] 63.555 71.680
heights<-with(dta,tapply(height,gender,mean))
heights
## 0 1
## 63.555 71.680
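A third way, sketched here, is aggregate(), which applies a function to height within each level of gender and returns the group means as a small data frame.
# a third option: one row per gender level, with the mean height for that level
aggregate(height~gender,data=dta,FUN=mean)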
According to the previous discussion, the mean of the male heights is \(b_0+b_1\), namely the sum of the intercept and the slope. Equivalently, the slope is the difference between the two group means. Indeed, this is confirmed by the code below.
m.height-f.height
## [1] 8.125
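As a quick check, the slope can also be read directly off the fitted model; it should equal this difference of 8.125. (The coefficient is named gender1 because gender is a factor.)
# slope coefficient from the fitted model; equals the difference of the group means
coef(m)["gender1"]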
Since \(r^2\) in the regression analysis is exactly the squared correlation coefficient, the correlation coefficient \(r\) can still be used to describe the strength of association between a continuous variable and a dichotomous variable. This is called the biserial correlation coefficient. In order to use cor.test(), we have to make gender numeric again. As gender is now a factor, we first turn it into a vector, then change its mode to numeric, and save it as a new variable gender1. However, the biserial correlation coefficient can sometimes become too large, especially when one of the variables does not satisfy the normality assumption. Thus, the point-biserial correlation coefficient is used instead. Its equation is \(r=\frac{\bar{x_p}-\bar{x_q}}{s_t}\sqrt{pq}\), where \(p\) and \(q\) are the proportions of the two groups among all observations and \(s_t\) is the standard deviation of the dependent variable computed across all groups. Normally, the point-biserial correlation coefficient is smaller than the biserial one.
dta$gender1<-as.numeric(as.vector(dta$gender))
with(dta,cor.test(gender1,height))
##
## Pearson's product-moment correlation
##
## data: gender1 and height
## t = 10.63, df = 5, p-value = 0.0001274
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8572884 0.9969553
## sample estimates:
## cor
## 0.9785843
pbr<-(m.height-f.height)/sd(dta$height)*
  sqrt(sum(dta$gender==0)*sum(dta$gender==1)/dim(dta)[1]^2)
pbr
## [1] 0.905993
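We can also verify the claim that \(r^2\) in the regression is the squared correlation coefficient: the square root of the model's \(R^2\) should reproduce (up to sign) the coefficient reported by cor.test(), about 0.9786.
# square root of the regression R-squared; should match the Pearson r above
sqrt(summary(m)$r.squared)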
Normally, when we have data like this, we would run a t-test to compare the heights of males and females rather than a regression or correlation analysis. In this case, an independent-groups t-test is appropriate, and it is conducted here to compare the mean heights of the two genders. Since one of the assumptions of this t-test is that the population variances of the two groups are equal, we set the argument var.equal=T.
t.rest<-with(dta,t.test(height~gender,var.equal=T))
t.rest
##
## Two Sample t-test
##
## data: height by gender
## t = -10.63, df = 5, p-value = 0.0001274
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.089786 -6.160214
## sample estimates:
## mean in group 0 mean in group 1
## 63.555 71.680
The result shows that the difference between the two group means is significantly different from 0. The t value is negative, meaning that the female mean height is smaller than the male mean height. Look at the \(p\) value in the t-test output: it is exactly the same as the \(p\) value reported for the slope in the regression summary. This is not a coincidence. The pooled standard deviation of the independent-groups t-test (with the homogeneity-of-variance assumption) is in this case exactly the standard error of estimate of the regression. See the proof below.
\[\begin{align*} s^2_{pooled}&=\frac{(N_1-1)s^2_1+(N_2-1)s^2_2}{N_1+N_2-2}\\ &=\frac{\sum(y_0-\bar{y_0})^2+\sum(y_1-\bar{y_1})^2}{N-2}\\ &=\frac{\sum(y_0-\hat{y_0})^2+\sum(y_1-\hat{y_1})^2}{N-2}\\ &=\frac{\sum(y-\hat{y})^2}{N-2}\\ &=s^2_{y.x} \end{align*}\]
When testing the mean difference, the standard error \(s_e=s_{pooled}\sqrt{\frac{1}{N_1}+\frac{1}{N_2}}\) is therefore \(s_{y.x}\sqrt{\frac{1}{N_1}+\frac{1}{N_2}}\). We can verify this with the following code. Rearranging the formula of the t value gives \(s_e=\frac{\bar{x_1}-\bar{x_2}}{t}\).
se<-(f.height-m.height)/t.rest$statistic
se
## t
## 0.7643352
Now convert \(s_e\) back to \(s_{pooled}\). There are many ways to get the size of each gender group; the method below is used simply to show some useful programming tips. The standard error of estimate can be retrieved from the object returned by summary() of the regression model. The two values are exactly the same. Thus, we have verified that the independent-groups t-test is a special case of simple linear regression.
sp<-se/sqrt(1/sum(dta$gender==0)+1/sum(dta$gender==1))
sp
## t
## 1.00075
reg.res<-summary(m)
reg.res$sigma
## [1] 1.00075
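In the same spirit, we can pull the two \(p\) values out of the objects created above and confirm that they are identical.
# p value of the slope test and p value of the t-test; both equal 0.0001274
c(reg.res$coefficients["gender1","Pr(>|t|)"],t.rest$p.value)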
You might wonder whether the test for the biserial correlation shown before is also the same as the independent-groups t-test. Recall that the hypothesis that the correlation coefficient differs from 0 is likewise tested with a t statistic, \(t^*=\frac{r\sqrt{N-2}}{\sqrt{1-r^2}}=\frac{r}{\sqrt{1-r^2}/\sqrt{N-2}}\). Try to prove that the independent-groups t-test in this tutorial is equivalent to this t-test for the correlation coefficient.
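Although the algebraic proof is left to you, a quick numerical check is sketched below: plugging the sample correlation into the formula for \(t^*\) should reproduce the magnitude of the t statistic, 10.63.
# plug r into t* = r*sqrt(N-2)/sqrt(1-r^2); compare with |t| = 10.63
r<-with(dta,cor(gender1,height))
r*sqrt(nrow(dta)-2)/sqrt(1-r^2)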