In fact, the regression model is itself a user-defined model, so we have already demonstrated how to fit our own model to data in the Bayesian framework. However, a regression model normally predicts a continuous dependent variable, not a discrete one. In psychology experiments, discrete data are often what researchers care about, such as classification data, multiple-choice data, and so on, and psychologists normally develop their own models to fit these discrete data. How to fit a user-defined model to discrete data in the Bayesian framework is therefore the main concern of this section.
We can start from the logistic regression model as a simple demonstration. The logistic regression model is suitable for predicting a binary dependent variable. Just like the normal regression model, the logistic regression model is a linear model \(y=b_0+b_1x_1+b_2x_2+\cdots+b_kx_k\), except that the model prediction is transformed by the sigmoid function \(p=\frac{1}{1+e^{-y}}\) into the probability of observing one of the two outcomes of the dependent variable.
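As a minimal illustration (not part of the analysis below), the following lines show how the sigmoid squeezes any value of the linear predictor into a probability between 0 and 1; plogis( ) is R's built-in version of the same function.
sigmoid<-function(y) 1/(1+exp(-y))
sigmoid(c(-4,0,4))   # approximately 0.018, 0.500, 0.982
plogis(c(-4,0,4))    # the built-in logistic CDF gives the same values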
The data we will use can be downloaded from the UCLA IDRE website (the URL appears in the code below). These data contain, for 400 applicants, their GRE score, GPA, a rank variable coded 1 to 4, and the result of their application (0 = rejected, 1 = admitted). We will build up a logit model (i.e., a logistic regression model) to predict whether each applicant is admitted or not.
mydata<-read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(mydata)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
As the predictor variable rank is a nominal-scale variable, we convert it into three dummy variables, r2, r3, and r4. The model is then \(y=b_0+b_1\,gre+b_2\,gpa+b_3\,r2+b_4\,r3+b_5\,r4\).
# Dummy-code the nominal predictor rank
mydata$r2<-ifelse(mydata$rank==2,1,0)
mydata$r3<-ifelse(mydata$rank==3,1,0)
mydata$r4<-ifelse(mydata$rank==4,1,0)
# Collect the predictors in a design matrix IV: the first column of 1s
# corresponds to the intercept b0, and the original admit and rank columns
# (columns 1 and 4) are dropped
k<-mydata$admit
IV<-as.matrix(cbind(1,mydata[,-c(1,4)]))
datalist<-list(k=k,IV=IV,Ns=length(k),Nt=ncol(IV))
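A quick sanity check (optional; not part of the original analysis) confirms that the design matrix has one row per applicant and six columns, one per regression weight.
dim(IV)    # should be 400 x 6
head(IV)   # intercept column of 1s, then gre, gpa, r2, r3, r4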
We can compose the JAGS model implementing this logistic regression as follows.
modelString<-"
model{
for(i in 1:Ns){
k[i]~dbern(p[i])
pred[i]~dbern(p[i])
x[i]<-inprod(IV[i,],b)
p[i]<-ilogit(x[i]+error[i])
}
# Prior
for(i in 1:Ns){
error[i]~dnorm(0,lambda.err)
}
lambda.err~dgamma(0.001,0.001)
for(j in 1:Nt){
b[j]~dnorm(0,0.0001)
}
}"
writeLines(modelString,con="e15_model.txt")
Then, we compile this model with jags.model( ), run a burn-in of 10,000 iterations with update( ), and generate the posterior samples with coda.samples( ).
library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
e15.model<-jags.model("e15_model.txt",data=datalist,
                      n.chains=2,n.adapt=1000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 400
## Unobserved stochastic nodes: 807
## Total graph size: 5204
##
## Initializing model
update(e15.model,n.iter=10000,thin=10)
post.e15<-coda.samples(e15.model,c("b","error","lambda.err","pred"),
                       n.iter=10000,thin=10)
Subsequently, we convert the posterior samples to a tibble object in R with ggs( ). First, we can check how well our model predicts the observed data. The posterior predictive distribution is represented by the variable pred: there are 400 observed data points and, for each of them, the model generates 1,000 retained predictions per chain, so collecting these samples gives a \(2000\times 400\) matrix. We therefore compute the mean prediction for each data point; if this mean is larger than 0.5 the prediction is coded as 1, and as 0 otherwise. Finally, the function table( ) is used to check how close the model's predictions are to the observed data.
library(ggmcmc)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: tidyr
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
sample.e15<-ggs(post.e15)
# Assemble a 2000 x 400 matrix of posterior predictive samples:
# one column per data point, one row per retained MCMC sample
preds<-sapply(1:nrow(mydata),function(s){
  temp<-paste0("pred[",s,"]")
  filter(sample.e15,Parameter==temp)$value
})
# Mean predictive probability per data point, dichotomised at 0.5
preds<-colMeans(preds)
table(ifelse(preds>0.5,1,0),k)
##    k
##       0   1
##   0 273   0
##   1   0 127
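The model's predictions match the observed outcomes in every case. As a supplementary check (a sketch using the objects defined above, not part of the original analysis), the table can be summarised as an overall classification accuracy.
mean(ifelse(preds>0.5,1,0)==k)   # (273+127)/400 = 1 for the table above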
Now let’s check the parameters with the following code.
bs<-sapply(1:6,function(s){
  temp<-paste0("b[",s,"]")
  filter(sample.e15,Parameter==temp)$value
})
colMeans(bs)
## [1] -45.34436536 0.03071183 8.71779407 -11.49395174 -21.55246161
## [6] -24.23578188
p.l<-c("b0","b1","b2","b3","b4","b5")
b.dta<-data.frame(x=c(bs),label=rep(p.l,each=2000))
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
figs<-lapply(p.l,function(s){
  fg<-ggplot(subset(b.dta,label==s),aes(x))+
    geom_density()+xlab(s)
  return(fg)
})
grid.arrange(figs[[1]],figs[[2]],figs[[3]],
             figs[[4]],figs[[5]],figs[[6]],ncol=3)
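To complement the density plots with numbers, one might also summarise the posterior samples in bs by their means and 95% credible intervals. This is a sketch using the objects defined above, not part of the original analysis.
b.sum<-apply(bs,2,function(v) c(mean=mean(v),quantile(v,c(0.025,0.975))))
colnames(b.sum)<-p.l
round(t(b.sum),3)   # one row per regression weight: mean, 2.5%, 97.5%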