Model Comparison by Bayes Factors

In Bayesian modeling, the posterior distribution over the parameters \(\theta\) given the observed data \(D\) is obtained by applying Bayes' theorem:

\[\begin{align} p(\theta|D)=\frac{p(D|\theta)p(\theta)}{\int{p(D|\theta)p(\theta)d\theta}}. \end{align}\]

In fact, these parameters belong to a model, which is devised to predict the occurrence of the observed data. Thus, the likelihood function can be written as \(p(D|\theta,M)\) to indicate the likelihood of the observed data given the parameters of the model \(M\). Therefore, we can rewrite the above equation, conditioning everything on \(M\), as

\[\begin{align} p(\theta|D,M)=\frac{p(D|\theta,M)p(\theta|M)}{\int{p(D|\theta,M)p(\theta|M)d\theta}}. \end{align}\]

Suppose we have two models, \(M_1\) and \(M_2\), competing to account for the same observed data. Writing \(p(D)\) for the overall probability of the data, we can estimate the posterior probability of \(M_1\) given the observed data as

\[\begin{align} p(M_1|D)=\frac{p(D|M_1)p(M_1)}{p(D)}. \end{align}\]

Similarly,

\[\begin{align} p(M_2|D)=\frac{p(D|M_2)p(M_2)}{p(D)}. \end{align}\]

Here, \(p(M_1|D)\) and \(p(M_2|D)\) are the posterior probabilities of the two models, and the posterior odds ratio is computed as

\[\begin{align} \frac{p(M_1|D)}{p(M_2|D)}=\frac{p(M_1)}{p(M_2)}\frac{p(D|M_1)}{p(D|M_2)}\frac{p(D)}{p(D)}. \end{align}\]

In analogy to the posterior odds ratio, \(\frac{p(M_1)}{p(M_2)}\) is the prior odds ratio. Of most importance is \(\frac{p(D|M_1)}{p(D|M_2)}\), which is called the Bayes factor. What about the final term on the right-hand side? Since the normalizing constant \(p(D)=\sum_i p(D|M_i)p(M_i)\) does not depend on which model we condition on, the right-most fraction is exactly 1.
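As a quick numerical illustration (using a hypothetical Bayes factor and equal prior odds, not values from this note), the Bayes factor combines with the prior odds to give the posterior odds, and hence a posterior model probability:

prior_odds <- 1                        # hypothetical prior odds p(M1)/p(M2)
BF <- 3                                # hypothetical Bayes factor p(D|M1)/p(D|M2)
posterior_odds <- prior_odds * BF      # posterior odds p(M1|D)/p(M2|D)
posterior_odds / (1 + posterior_odds)  # posterior probability of M1, here 0.75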

Marginal Likelihoods

The term \(p(D|M_i)\) is called the marginal likelihood, and it can be very complicated to compute, as the likelihood of the observed data given the model \(M_i\) is integrated over all possible parameter values weighted by their prior:

\[\begin{align} p(D|M_i)=\int{p(D|\theta,M_i)p(\theta|M_i)d\theta}. \end{align}\]

Normally, we would like to avoid having to calculate the marginal likelihood and this is why MCMC methods are so great: they approximate the posterior distribution over parameters \(p(\theta|D)\) without knowledge or computation of the marginal likelihood. This makes clear why computing Bayes factors, in general, can be quite difficult or a substantial computational burden.
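Although the integral is generally intractable, for the simple binomial example used later in this note (9 successes in 10 trials) it is one-dimensional and can be evaluated directly. A minimal sketch in plain R, assuming a uniform Beta(1,1) prior on \(\theta\):

# Marginal likelihood for k = 9 successes in N = 10 trials
# under a uniform Beta(1,1) prior on theta
integrate(function(theta) dbinom(9, 10, theta) * dbeta(theta, 1, 1),
          lower = 0, upper = 1)
# analytically, this integral equals 1/(N+1) = 1/11, about 0.091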

There are two approaches to computing or approximating Bayes factors. One is to target the marginal likelihood of each model separately. The other is to target the Bayes factor directly. Generally, there are four methods:

  1. Savage-Dickey method
  2. naive Monte Carlo sampling
  3. transdimensional MCMC
  4. supermodels

Savage-Dickey method

The Savage-Dickey method is particularly suitable when one model is nested in the other. For instance, a model \(M_0\) has a parameter \(\theta\) fixed at \(\theta=\theta_0\), while in another model \(M_1\) the same parameter \(\theta\) is free to vary. The marginal likelihood of \(M_0\) is then

\[\begin{align} p(D|M_0)&=p(D|\theta=\theta_0,M_0)\\ &=p(D|\theta=\theta_0,M_1)\\ &=\frac{p(\theta=\theta_0|D,M_1)\,p(D|M_1)}{p(\theta=\theta_0|M_1)}. \end{align}\]

The first equality holds because \(M_0\) fixes \(\theta\) at \(\theta_0\), the second because \(M_0\) and \(M_1\) share the same likelihood function, and the third follows from applying Bayes' theorem to \(\theta\) within \(M_1\).

Thus, the Bayes factor is \(\frac{p(D|M_0)}{p(D|M_1)}=\frac{p(\theta=\theta_0|D,M_1)}{p(\theta=\theta_0|M_1)}\), the ratio of the posterior to the prior density at \(\theta=\theta_0\) under \(M_1\). When the prior over the parameter in \(M_1\) is uniform on \([0,1]\), the prior density equals 1, so the Bayes factor is simply the posterior density at \(\theta=\theta_0\) under \(M_1\).

Let us consider again the binomial example of 9 correct responses out of 10 questions. Here we have two models of performance: guessing (i.e., \(M_0:\theta=0.5\)) versus not guessing (i.e., \(M_1:\theta\sim Beta(1,1)\)). For \(M_0\), there is no prior distribution over \(\theta\), as it is fixed at 0.5. For \(M_1\), \(\theta\) has a uniform prior between 0 and 1.

Thus, the BF is the ratio of the posterior density of \(\theta=0.5\) under \(M_1\) to the prior density of \(\theta=0.5\) under \(M_1\). Since the prior density at \(\theta=0.5\) under \(M_1\) is 1, the Bayes factor is simply the posterior density at \(\theta=0.5\) under \(M_1\). We don't need JAGS for this computation: with a Beta(\(a,b\)) prior over the parameter and a binomial likelihood, the posterior over that parameter is again a Beta distribution, with parameters \(k+a\) and \(N-k+b\), where \(k\) is the number of target outcomes (here, correct responses) and \(N\) is the total number of observations; in our case \(a\) and \(b\) are both 1. The R code below evaluates this posterior density at \(\theta=0.5\), which is about 0.11. That means the Bayes factor favoring \(\theta=0.5\) is far below 1; that is, \(M_0\) is far less supported than \(M_1\).
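To see why this conjugate update holds, note that, dropping factors that do not depend on \(\theta\),

\[\begin{align*} p(\theta|D,M_1)&\propto p(D|\theta,M_1)\,p(\theta|M_1)\\ &\propto \theta^{k}(1-\theta)^{N-k}\cdot\theta^{a-1}(1-\theta)^{b-1}\\ &=\theta^{k+a-1}(1-\theta)^{N-k+b-1}, \end{align*}\]

which is the kernel of a \(Beta(k+a,\,N-k+b)\) distribution; with \(k=9\), \(N=10\), and \(a=b=1\), the posterior is \(Beta(10,2)\).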

dbeta(0.5,9+1,10-9+1)   # posterior density at theta = 0.5 under Beta(k+a, N-k+b)
## [1] 0.1074219
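As a cross-check (not part of the original code), the same number follows directly from the definition of the Bayes factor: the marginal likelihood of \(M_0\) is the binomial likelihood at \(\theta=0.5\), and that of \(M_1\) under the uniform prior is \(1/(N+1)\):

dbinom(9, 10, 0.5) / (1 / (10 + 1))   # = (10/1024) / (1/11), about 0.107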

Naive Monte Carlo Sampling

Naive Monte Carlo sampling approximates the marginal likelihood of model \(M_i\) by repeatedly sampling parameter values from the prior and recording the likelihood of the data given each sampled parameter value. If we draw enough samples, the mean of the recorded likelihoods approximates the true marginal likelihood:

\[\begin{align*} p(D|M_i)&=\int p(D|\theta,M_i)p(\theta|M_i)d\theta\\ &\approx \frac{1}{n} \sum^n_{j=1}p(D|\theta^{(j)},M_i),\qquad \theta^{(j)}\sim p(\theta|M_i). \end{align*}\]
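Before turning to JAGS, here is a minimal plain-R sketch of this estimator for the binomial example; the very narrow Beta(50000, 50000) prior stands in for the point hypothesis \(\theta=0.5\), mirroring the JAGS models set up below:

n <- 100000
theta0 <- rbeta(n, 50000, 50000)      # draws from the prior of M0 (narrow peak at 0.5)
theta1 <- rbeta(n, 1, 1)              # draws from the prior of M1 (uniform)
ml0 <- mean(dbinom(9, 10, theta0))    # approximates p(D|M0), analytically about 10/1024
ml1 <- mean(dbinom(9, 10, theta1))    # approximates p(D|M1), analytically 1/11
ml0 / ml1                             # approximate Bayes factor, around 0.11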

That is, we can actually approximate the marginal likelihood by simple sampling. Let's see how it works with JAGS. First, we set up the models. The first model, \(M_0\), uses the binomial distribution as the likelihood function and a Beta prior on \(\theta\) with both parameters extremely large, which yields a very narrow peak around 0.5. In the code below, we ask JAGS to compute the likelihood of the observed data \(D\) given the sampled parameter \(\theta\). The first model file corresponds to \(M_0\) and the second to \(M_1\). Note that here dbin( ) takes three arguments rather than the usual two: the first is the number of target outcomes, the second is the success probability \(\theta\), and the third is the total number of trials.

Set Up Models

modelString0<-"
model{
    lhT<-dbin(k,theta,N)
    lh<-log(lhT)
    theta~dbeta(50000,50000)
}"
writeLines(modelString0,con="e16_model0.txt")
modelString1<-"
model{
    lhT<-dbin(k,theta,N)
    lh<-log(lhT)
    theta~dbeta(1,1)
}"
writeLines(modelString1,con="e16_model1.txt")

For the sake of convenience, we create a function which is used to approximate these two marginal likelihoods.

library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
library(ggmcmc)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: tidyr
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
datalist<-list(k=9,N=10)
model<-c("e16_model0.txt","e16_model1.txt")
nmcs<-function(model,datalist){
  output<-sapply(model,function(filename){
    model<-jags.model(file=filename,data=datalist,n.chains=2,n.adapt=1000)
    update(model,n.iter=20000)
    PostSamples<-coda.samples(model,variable.names="lh",n.chains=2,n.iter=20000)
    mlks<-ggs(PostSamples)
    llh<-filter(mlks,Parameter=="lh")$value
    # return the sampled log-likelihoods; they are exponentiated and
    # averaged below to approximate each marginal likelihood
    return(llh)
  })
  return(output)
}
MgLKs<-nmcs(model,datalist)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 0
##    Unobserved stochastic nodes: 1
##    Total graph size: 6
## 
## Initializing model
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 0
##    Unobserved stochastic nodes: 1
##    Total graph size: 6
## 
## Initializing model

It is worth checking how the Bayes factor estimate changes as the number of samples increases. Let's plot the temporal development of the Bayes factor approximation. The BF computed from the means of the approximated marginal likelihoods should come out close to the Savage-Dickey value of about 0.11, again favoring \(M_1\).

library(ggplot2)
bf.dta<-data.frame(x=seq(1000,20000,by=100),
           BF=sapply(seq(1000,20000,by=100),function(t){
                mean(exp(MgLKs[1:t,1]))/mean(exp(MgLKs[1:t,2]))
           }))
ggplot(bf.dta,aes(x,BF))+
  geom_line()+xlab("Iteration")

mean(exp(MgLKs)[,1])/mean(exp(MgLKs)[,2])   # approximate BF; should be close to 0.11

Transdimensional MCMC

A straightforward method for approximating the posterior odds ratio is transdimensional MCMC. We consider the posterior over a binary model flag parameter \(m\in\{0,1\}\), which represents our prior and posterior beliefs about which of the models \(M_0\) and \(M_1\) is true. That is, we set up a binary parameter that is initially unbiased toward either model; its prior follows a Bernoulli distribution, \(m\sim dbern(0.5)\). During modeling, the two models compete to be chosen to predict the observed data. If one model outperforms the other, it will be chosen more frequently; that is, the posterior probability of choosing the superior model will be larger than 0.5.

We can practice using the transdimensional MCMC method to estimate the BF between the two models for the same data of 9 target outcomes out of 10 observations.

modelString<-"
model{
     k~dbin(theta[m1],N)
     #Priors
     m~dbern(0.5)
     m1<-m+1
     theta[1]<-0.5
     theta[2]~dbeta(1,1)
}"
writeLines(modelString,con="e17_model.txt")

Then we start modeling. The posterior distribution of \(m\) is again a Bernoulli distribution over the two outcomes 0 and 1. Therefore, the BF can be computed as \(\frac{f(M_0)}{f(M_1)}\), i.e., the ratio of the posterior frequencies \(\frac{p(m=0|D)}{p(m=1|D)}\). The result is close to the value of 0.11 obtained with the Savage-Dickey method.

m17<-jags.model(file="e17_model.txt",data=datalist,
                n.chains=2,n.adapt=1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 2
##    Total graph size: 8
## 
## Initializing model
update(m17,n.iter=20000)
postsample17<-coda.samples(m17,variable.names="m",
                           n.chains=2,n.iter=20000)
dta17<-ggs(postsample17)
m<-filter(dta17,Parameter=="m")$value
BF<-sum(m==0)/sum(m==1)
BF
## [1] 0.1045147
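Because the model indicator \(m\) has a symmetric dbern(0.5) prior, the prior odds are 1 and the posterior odds over \(m\) equal the Bayes factor. With an unequal prior we would divide the posterior odds by the prior odds, as in this sketch (reusing the samples of m from above):

pM0 <- mean(m == 0)                 # posterior probability of M0
pM1 <- mean(m == 1)                 # posterior probability of M1
prior_odds <- 0.5 / 0.5             # from the m ~ dbern(0.5) prior
(pM0 / pM1) / prior_odds            # Bayes factor; identical to BF computed above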

Supermodels

Instead of a model flag, which determines which model's likelihood function to use, we can combine the models' likelihoods linearly. Concretely, given parameter vectors \(\theta_i\) and \(\theta_j\) for models \(M_i\) and \(M_j\) respectively, we consider a supermodel that contains both \(M_i\) and \(M_j\) and whose likelihood for the data is \(P(D|\alpha,\theta_i,\theta_j)=\alpha P(D|\theta_i,M_i)+(1-\alpha) P(D|\theta_j,M_j)\), where the mixture weight \(\alpha\) is itself given a prior. Let's set up the model.

modelString<-"
data{
   one<-1
}
model{
   lhT1<-dbin(k,theta1,N)
   lhT2<-dbin(k,theta2,N)
   one~dbern(alpha*lhT1+(1-alpha)*lhT2)
   theta1~dbeta(30,30)
   theta2~dbeta(1,1)
   alpha~dbeta(1,1)
}"
writeLines(modelString,"e18_model.txt")

Now we can start modeling. If the posterior distribution of \(\alpha\) concentrates above 0.5, the data favor the first component model; if it concentrates below 0.5, they favor the second.

library(ggplot2)
library(dplyr)
# Fit the supermodel and return posterior samples of the mixture weight alpha
BF<-function(k,N){
     m18<-jags.model("e18_model.txt",data=list(k=k,N=N),n.chains=2,n.adapt=1000)
     update(m18,n.iter=10000)
     post18<-coda.samples(m18,variable.names=c("alpha"),n.chains=2,n.iter=10000)
     sample18<-ggs(post18)
     return(sample18)
}
alpha<-BF(9,10)
## Compiling data graph
##    Resolving undeclared variables
##    Allocating nodes
##    Initializing
##    Reading data back into data table
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 3
##    Total graph size: 14
## 
## Initializing model
mean(alpha$value)
## [1] 0.3783888
ggplot(alpha,aes(value))+geom_boxplot()+ylim(-1,1)