See Chapter 5.5 of Lee and Wagenmakers’s (2013) textbook. A 68-year-old grandmother in South Korea finally passed the written exam for a driving license after 949 failures. There are 50 items in this written exam, each worth 2 points; the pass criterion is at least 60 points, namely 30 items correct. We do not know exactly how many items she got correct in the 949 failed exams, only that the numbers range from 15 to 25. That is, of the 950 exams in total, we have 949 missing scores and 1 observed score. Suppose each question is equally difficult. We want to estimate her probability of passing.
Each of the 950 exam scores follows a binomial distribution, so the likelihood function is binomial. We also know that the numbers of correct items in the past 949 exams lie between 15 and 25. We need to set up our model according to these constraints. Before that, we prepare our data.
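In symbols, writing \(z_i\) for the number of correct items on exam \(i\) and \(\theta\) for the per-item success probability, the setup above amounts to
\[
z_i \sim \text{Binomial}(\theta, 50), \qquad 15 \le z_i \le 25 \;\; (i = 1,\dots,949), \qquad z_{950} = 30.
\]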
y<-c(rep(0,949),1) # Tag for failure (0) or pass (1) on each exam
z<-c(rep(NA,949),30) # Number of correct items; NA marks missing scores that JAGS will impute
n<-50 # total number of items in the exam
datalist<-list(y=y,z=z,n=n,nattempts=length(y))
Declare our model as follows. Note that the operator T(,) denotes truncation: T(a,b) constrains the values sampled from the preceding distribution to lie between a and b.
modelString<-"
model{
for(i in 1:nattempts){
z.low[i]<-15*(y[i]==0)+0*(y[i]==1)
z.up[i]<-25*(y[i]==0)+n*(y[i]==1)
z[i]~dbin(theta,n)T(z.low[i],z.up[i])
}
# Prior
theta~dbeta(1,1)T(15/50,1)
}"
writeLines(modelString,con="e12_model.txt")
Now we initialize this model.
library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
e12.model<-jags.model(file="e12_model.txt",data=datalist,
n.chains=2,n.adapt=1000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 950
## Total graph size: 1925
##
## Initializing model
update(e12.model,n.iter=1000)
post.e12<-coda.samples(e12.model,variable.names=c("theta"),
n.iter=1000)
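Since we ran two chains, we can also check convergence with the Gelman–Rubin diagnostic from the coda package (an optional check added here, not part of the original workflow; point estimates near 1 indicate convergence).
gelman.diag(post.e12)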
Let’s check the results by plotting the posterior distribution of the estimated \(\theta\), her probability of answering a single item correctly. The posterior mean is close to .6.
library(ggmcmc)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: tidyr
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
sample.e12<-ggs(post.e12)
theta<-filter(sample.e12,Parameter=="theta")$value
library(ggplot2)
ggplot(data.frame(theta=theta),aes(theta))+
geom_density()
# MAP
theta.d<-density(theta)
theta.d$x[which.max(theta.d$y)]
## [1] 0.6150498
# Mean estimate
mean(theta)
## [1] 0.5964281
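The original question asked for her passing probability. As a minimal follow-up sketch (not part of the original script), given the posterior draws of \(\theta\), the posterior predictive probability of passing a future exam is \(P(X \ge 30)\) with \(X \sim \text{Binomial}(50, \theta)\), averaged over the draws:
# Posterior predictive probability of scoring at least 30 items out of 50
mean(pbinom(29,size=50,prob=theta,lower.tail=FALSE))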
See Chapter 6. Suppose a group of 15 people sit an exam made up of 40 true-or-false questions, and they get 21, 17, 21, 18, 22, 31, 31, 34, 34, 35, 35, 36, 39, 36, and 35 right. Obviously, the first 5 people were just guessing, while the last 10 had some level of knowledge. Let’s see how to use Bayesian modeling to infer which group each person belongs to, as well as the rate of success for the knowledge group.
First, we prepare the data.
N<-15
k<-c(21,17,21,18,22,31,31,34,34,35,35,36,39,36,35)
n<-40
datalist<-list(N=N,k=k,n=n)
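Before modeling, a quick look at the observed proportions correct (a simple sanity check added here) already hints at two clusters: the first five proportions hover around .5, while the remaining ten are .775 or higher.
k/n # observed proportion correct for each person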
Now we can set up our model. As a first step, we estimate a separate success rate for every individual person.
modelString<-"
model{
for(i in 1:N){
k[i]~dbin(theta[i],n)
}
# Prior
for(i in 1:N){
theta[i]~dbeta(1,1)
}
}"
writeLines(modelString,con="e13_model.txt")
Second, we initialize this model and start to do the modeling.
e13.model<-jags.model(file="e13_model.txt",data=datalist,
n.chains=2,n.adapt=1000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 15
## Unobserved stochastic nodes: 15
## Total graph size: 33
##
## Initializing model
update(e13.model,n.iter=1000)
post.e13<-coda.samples(e13.model,c("theta"),n.iter=1000)
sample.e13<-ggs(post.e13)
Third, let’s check our modeling results by inspecting the mean \(\theta\)s. Unsurprisingly, the first 5 people’s \(\theta\)s are clearly estimated as lower than the remaining 10 people’s. Thus, it is reasonable to suspect that there are two latent groups in these data.
thetas<-sapply(1:N,function(i){
temp<-paste0("theta[",i,"]")
return(filter(sample.e13,Parameter==temp)$value)
})
colMeans(thetas)
## [1] 0.5215292 0.4292881 0.5219343 0.4513071 0.5468900 0.7615827 0.7628087
## [8] 0.8356792 0.8329870 0.8547200 0.8560357 0.8792558 0.9531454 0.8811692
## [15] 0.8557839
Therefore, we can revise this model to estimate the success rates of two latent groups: a lousy (guessing) group and a knowledge group. The logic is that each person is first assigned to one of the two groups; if the person falls in the lousy group, the success rate is fixed at 0.5, and if in the knowledge group, the success rate is somewhere between 0.5 and 1.
modelString<-"
model{
for(i in 1:N){
k[i]~dbin(theta[i],n)
}
# Priors
for(i in 1:N){
g[i]~dbern(0.5)
}
phi~dbeta(1,1)T(0.5,1)
for(i in 1:N){
theta[i]<-0.5*(g[i]==0)+phi*(g[i]==1)
}
}"
writeLines(modelString,con="e13_model1.txt")
Start modeling.
e13.model1<-jags.model(file="e13_model1.txt",data=datalist,
n.chains=2,n.adapt=1000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 15
## Unobserved stochastic nodes: 16
## Total graph size: 111
##
## Initializing model
update(e13.model1,n.iter=1000)
post1.e13<-coda.samples(e13.model1,c("theta","g"),n.iter=1000)
The results can be converted to a tibble with ggs(). First, we check the group variable: the mean of each column of gs is the posterior mean group assignment for one person. Obviously, the first 5 persons are assigned to group 0, namely the lousy group.
sample1.e13<-ggs(post1.e13)
gs<-sapply(1:N,function(j){
temp<-paste0("g[",j,"]")
return(filter(sample1.e13,Parameter==temp)$value)
})
colMeans(gs)
## [1] 0.0000 0.0000 0.0000 0.0000 0.0000 0.9960 0.9935 1.0000 1.0000 1.0000
## [11] 1.0000 1.0000 1.0000 1.0000 1.0000
Let’s check the success rate of each person. The posterior means show that the lousy group’s success rate is fixed at 0.5, while the knowledge group’s success rate is quite high. However, there are no individual differences in the estimated success rates within the knowledge group, because all of its members share the single rate \(\phi\).
theta<-sapply(1:N,function(j){
temp<-paste0("theta[",j,"]")
return(filter(sample1.e13,Parameter==temp)$value)
})
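Averaging the posterior samples per person makes this concrete (a small addition; the exact values depend on the MCMC run):
colMeans(theta) # first 5 persons pinned at 0.5; the rest share the common rate phi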
In the previous example, all members of the knowledge group are constrained to share the same success rate \(\phi\). As a result, there are no individual differences in the estimated success rates. In the following example, every member of the knowledge group has his/her own \(\theta\).
modelString<-"
model{
for(i in 1:N){
k[i]~dbin(theta[i],n)
}
# Priors
for(i in 1:N){
g[i]~dbern(0.5)
thetaup[i]~dbeta(1,1)T(0.5,1)
theta[i]<-0.5*(g[i]==0)+thetaup[i]*(g[i]==1)
}
}"
writeLines(modelString,con="e13_model2.txt")
Start modeling with JAGS.
e13.model2<-jags.model(file="e13_model2.txt",data=datalist,
n.chains=2,n.adapt=1000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 15
## Unobserved stochastic nodes: 30
## Total graph size: 125
##
## Initializing model
update(e13.model2,n.iter=1000)
post2.e13<-coda.samples(e13.model2,c("g","theta","thetaup"),n.iter=1000)
We can check the estimated group of each person. The first 5 persons are still mostly assigned to the lousy group (though now with some posterior uncertainty) and the remaining 10 to the knowledge group. Now the estimated \(\theta\) differs across members of the knowledge group.
sample2.e13<-ggs(post2.e13)
gs<-sapply(1:N,function(j){
temp<-paste0("g[",j,"]")
return(filter(sample2.e13,Parameter==temp)$value)
})
colMeans(gs)
## [1] 0.1830 0.1045 0.1955 0.1155 0.2400 0.9970 0.9945 1.0000 0.9995 1.0000
## [11] 0.9995 1.0000 1.0000 1.0000 1.0000
thetas<-sapply(1:N,function(j){
temp<-paste0("theta[",j,"]")
return(filter(sample2.e13,Parameter==temp)$value)
})
colMeans(thetas)
## [1] 0.5136475 0.5037161 0.5139877 0.5054122 0.5194401 0.7632733 0.7603221
## [8] 0.8342385 0.8319431 0.8561983 0.8567382 0.8808951 0.9507301 0.8792731
## [15] 0.8564600
We can plot the posterior \(\theta\) samples of the two groups (each person contributes 2000 draws: 2 chains of 1000 iterations).
dd<-data.frame(theta=c(thetas),group=c(rep(0,5*2000),rep(1,10*2000)))
ggplot(dd,aes(theta))+
geom_density(aes(color=as.factor(group)))+
xlim(0,1)+theme(legend.title=element_blank())
In addition to Beta or uniform priors, we can also use Gaussian priors to infer the success rate of each person. The logic is the same as in the previous examples: first, we assign each person to a latent group; second, we sample each member’s success rate from that group’s distribution.
modelString<-"
model{
for(i in 1:N){
k[i]~dbin(theta[i],n)
theta[i]<-thetalow[i]*(g[i]==0)+thetaup[i]*(g[i]==1)
}
# Prios
for(i in 1:N){
g[i]~dbern(0.5)
thetaup[i]~dnorm(muup,lambda)
thetalow[i]~dnorm(mulow,lambda)
}
mulow~dbeta(1,1)
muup~dunif(mulow,1)
lambda~dgamma(0.001,0.001)
}"
writeLines(modelString,con="e13_model3.txt")
Start modeling. First check the estimated groups of these 15 people. Again, the first 5 and the remaining 10 are separated into different groups.
e13.model3<-jags.model(file="e13_model3.txt",data=datalist,
n.chains=2,n.adapt=1000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 15
## Unobserved stochastic nodes: 48
## Total graph size: 144
##
## Initializing model
update(e13.model3,n.iter=1000)
post3.e13<-coda.samples(e13.model3,c("g","theta","muup","mulow","lambda"),
n.iter=1000)
sample3.e13<-ggs(post3.e13)
gs<-sapply(1:N,function(j){
temp<-paste0("g[",j,"]")
return(filter(sample3.e13,Parameter==temp)$value)
})
colMeans(gs)
## [1] 0.0015 0.0000 0.0005 0.0015 0.0050 0.9425 0.9585 0.9965 0.9970 0.9995
## [11] 0.9990 1.0000 1.0000 1.0000 0.9985
Now let’s check the mean estimated success rates of the two groups. Their posterior distributions are shown in the figure below; clearly, the knowledge group has a much higher success rate than the lousy group. The individual success rates can be computed by averaging the posterior samples of \(\theta\) for each person, so we obtain individual success rates as well as group success rates.
muup<-filter(sample3.e13,Parameter=="muup")$value
mulow<-filter(sample3.e13,Parameter=="mulow")$value
c(mean(mulow),mean(muup))
## [1] 0.5059537 0.8666693
dd2<-data.frame(theta=c(mulow,muup),group=rep(c("lousy","knowledge"),each=2000))
ggplot(dd2,aes(theta))+
geom_density(aes(color=group))
thetas<-sapply(1:N,function(j){
temp<-paste0("theta[",j,"]")
return(filter(sample3.e13,Parameter==temp)$value)
})
colMeans(thetas)
## [1] 0.5098565 0.4868673 0.5105771 0.4905117 0.5162206 0.8227059 0.8242063
## [8] 0.8585441 0.8570955 0.8680236 0.8654899 0.8742120 0.9111115 0.8776118
## [15] 0.8677657
Suppose a group of 10 people attend a lecture and are asked a set of 20 questions afterwards, with every answer being either correct or incorrect. The pattern of data is shown in Table 6.1 of Lee and Wagenmakers’s (2013) textbook. We want to infer (1) how well each person attended the lecture and (2) how hard each question was.
As I have encoded the data of Table 6.1 in a text file (i.e., TwentyQuestionsData.txt), we can use the code below to retrieve them and transform them into a matrix.
dta<-read.table("TwentyQuestionsData.txt",
header=F,sep=",")
K<-as.matrix(dta)
Each cell K[i,j] of this matrix records whether subject i answered question j correctly. We model the probability of a correct answer as the product of subject i’s ability \(p_s(i)\) and question j’s easiness \(p_t(j)\). Thus, it is easy to set up the model.
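In symbols, the model just described is
\[
K_{ij} \sim \text{Bernoulli}(\theta_{ij}), \qquad \theta_{ij} = p_s(i)\,p_t(j).
\]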
datalist<-list(Nt=dim(K)[2],Ns=dim(K)[1],K=K)
modelString<-"
model{
for(j in 1:Nt){
for(i in 1:Ns){
K[i,j]~dbern(theta[i,j])
}
}
# Priors
for(j in 1:Nt){
for(i in 1:Ns){
theta[i,j]<-pt[j]*ps[i]
}
}
for(i in 1:Ns){
ps[i]~dbeta(1,1)
}
for(j in 1:Nt){
pt[j]~dbeta(1,1)
}
}"
writeLines(modelString,con="e14_model.txt")
Start modeling. Then let’s collect the modeling results. First, we compute the posterior mean of \(\theta\) for each cell of the subject-question matrix and plot these estimated pass rates as a raster figure. The estimated easiness of each question can be computed as the column means of the matrix pt; apparently, question 1 is relatively easy and question 20 is relatively difficult. Similarly, each subject’s estimated ability can be computed as the column means of the matrix ps; the results show that the first subject performs better than the second.
e14.model<-jags.model(file="e14_model.txt",data=datalist,
n.chains=2,n.adapt=1000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 200
## Unobserved stochastic nodes: 30
## Total graph size: 433
##
## Initializing model
update(e14.model,n.iter=1000)
post.e14<-coda.samples(e14.model,c("theta","pt","ps"),n.iter=1000)
sample.e14<-ggs(post.e14)
thetas<-sapply(1:ncol(K),function(j){
sapply(1:nrow(K),function(i){
temp<-paste0("theta[",i,",",j,"]")
return(mean(filter(sample.e14,Parameter==temp)$value))
})
})
dd3<-data.frame(p=c(thetas),x=rep(1:20,each=10),y=rep(10:1,20))
ggplot(dd3,aes(x,y,fill=p))+
theme_bw()+theme(plot.background=element_blank(),
panel.grid.major=element_blank(),
panel.grid.minor=element_blank())+
geom_raster(stat="identity")+
scale_fill_continuous(guide=guide_legend(title="P"),
low="black",high="white")
pt<-sapply(1:ncol(K),function(j){
temp<-paste0("pt[",j,"]")
return(filter(sample.e14,Parameter==temp)$value)
})
ps<-sapply(1:nrow(K),function(j){
temp<-paste0("ps[",j,"]")
return(filter(sample.e14,Parameter==temp)$value)
})
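To obtain the point estimates described above, we average the posterior samples (a small addition to the original script; the values depend on the MCMC run):
colMeans(pt) # estimated easiness of each of the 20 questions
colMeans(ps) # estimated ability of each of the 10 subjects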