Difference between Two Rates with Multiple Observations

Suppose we randomly select 7 coins from Mint A and 1 coin from Mint B. Suppose also that the variability among coins within each mint is negligible, so coins from the same mint can be treated as having the same probability of heads. The numbers of heads obtained in 10 flips of each coin are coded as k1 and k2 for the two mints respectively. The question is whether the probability of heads differs between the two mints. To answer it, we can run a Bayesian analysis with the following code.

k1<-c(6,7,7,8,7,6,10)
k2<-4
datalist<-list(k1=k1,k2=k2,N=10,m1=length(k1),m2=length(k2))
modelString<-"
model{
   # Likelihood
   for(i in 1:m1){
      k1[i]~dbin(theta1,N)
   }
   for(i in 1:m2){
      k2[i]~dbin(theta2,N)
   }
   # Priors
   theta1~dbeta(1,1)
   theta2~dbeta(1,1)
   # Derived difference between the two rates
   Deltheta<-theta1-theta2
   # Posterior predictive draws
   postk1~dbin(theta1,N)
   postk2~dbin(theta2,N)
}"
writeLines(modelString,con="e4_model3.txt")
library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
e4_model3<-jags.model(file="e4_model3.txt",data=datalist,
                      n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 8
##    Unobserved stochastic nodes: 4
##    Total graph size: 17
## 
## Initializing model
update(e4_model3,n.iter=1000)
e4_sample3<-coda.samples(e4_model3,variable.names=c("Deltheta","postk1","postk2"),
                         n.iter=1000)
library(ggmcmc)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: tidyr
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
e4_post3<-ggs(e4_sample3)
Delta<-filter(e4_post3,Parameter=="Deltheta")$value
mean(Delta)
## [1] 0.306926
d.Delta<-density(Delta)
d.Delta$x[which.max(d.Delta$y)]
## [1] 0.3391441
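
As an extra numerical summary (a sketch added here, not part of the original analysis), we can quantify the difference directly from the posterior samples: the proportion of samples in which theta1 exceeds theta2, and a 95% HDI computed on the samples themselves.

library(HDInterval)
# Posterior probability that Mint A's rate exceeds Mint B's
mean(Delta>0)
# 95% HDI of the difference, computed from the samples
hdi(Delta)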

The posterior distribution of the estimated difference between the two mints' probabilities of heads is shown in the figure below. Most of the posterior mass lies above zero, suggesting that the probability of heads is higher for Mint A than for Mint B. The mean posterior predictive numbers of heads for Mint A and Mint B are close to the observed means.

library(ggplot2)
# Get HDI
library(HDInterval)
HDI<-hdi(d.Delta)
ggplot(data.frame(delta=Delta),aes(delta))+
  geom_density()+
  # HDI limits are constants, so annotate() is used instead of aes()
  annotate("segment",x=HDI[1],xend=HDI[1],y=0,yend=attr(HDI,"height"))+
  annotate("segment",x=HDI[2],xend=HDI[2],y=0,yend=attr(HDI,"height"))+
  annotate("segment",x=HDI[1],xend=HDI[2],y=attr(HDI,"height"),
           yend=attr(HDI,"height"),linetype=2)

postk1<-filter(e4_post3,Parameter=="postk1")$value
postk2<-filter(e4_post3,Parameter=="postk2")$value
c(mean(postk1),mean(k1))
## [1] 7.247000 7.285714
c(mean(postk2),mean(k2))
## [1] 4.189667 4.000000
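
As a simple visual posterior predictive check (an addition, not in the original analysis), we can compare the distribution of the simulated counts for Mint A with the observed mean number of heads.

# Bar plot of posterior predictive counts for Mint A;
# the dashed line marks the observed mean of k1
ggplot(data.frame(postk1=postk1),aes(postk1))+
  geom_bar()+
  geom_vline(xintercept=mean(k1),linetype=2)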

Joint Distribution

In psychological studies, the behavior of interest often depends on more than one parameter, and these parameters may interact with each other. Bayesian statistics can estimate such parameters simultaneously; see Chapter 3.6 of Lee and Wagenmakers (2013). Suppose 5 helpers each distributed a bundle of surveys to houses. We know only that every bundle contained the same number of surveys, \(n\in\{1,\dots,500\}\). The 5 helpers then received \(k=\{16,18,22,25,27\}\) returned surveys. What are the most likely return rate \(\theta\) and number of surveys \(n\) distributed to houses?

First, each distributed survey is either returned or not, so the observed number of returned surveys should follow a binomial distribution with \(\theta\) and \(n\) as parameters. Unlike the previous examples, in which the number of observations is known, here we also need to estimate the number of surveys. Second, the prior distribution of the return rate is assumed to be uniform, \(\theta\sim Beta(1,1)\). Third, since we have no idea what the number of surveys might be, any number between 1 and 500 is treated as equally likely. Thus, the prior for the survey number can be modeled as a categorical distribution over 500 entries, each sampled with probability 1/500.
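
For intuition (an illustration only, not part of the model code), a single draw from this discrete-uniform prior in base R looks like this:

N<-500
sample(1:N,size=1,prob=rep(1/N,N))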

# Prepare data
N<-500
K<-c(16,18,22,25,27)
datalist<-list(N=N,K=K,m=length(K))
# Set up model
modelString<-"
model{
   # Likelihood
   for(i in 1:m){
      K[i]~dbin(theta,Nk)
   }
   # Priors
   theta~dbeta(1,1)
   Nk~dcat(Pi)
   # Equal probability 1/N for each possible survey count
   for(i in 1:N){
      Pi[i]<-1/N
   }
}"
writeLines(modelString,con="e5_model1.txt")

Now we can start parameter estimation.

e5_model<-jags.model(file="e5_model1.txt",data=datalist,
                     n.chains=2,n.adapt=100)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 5
##    Unobserved stochastic nodes: 2
##    Total graph size: 12
## 
## Initializing model
update(e5_model,n.iter=1000)
post_e5<-coda.samples(e5_model,c("theta","Nk"),n.iter=1000)
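
Before summarizing the posterior, it is good practice to check convergence. A minimal check (added here, not part of the original analysis) uses the Gelman-Rubin diagnostic from the coda package; point estimates near 1 suggest the chains have mixed well.

# Potential scale reduction factor for Nk and theta
gelman.diag(post_e5)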

Convert the object returned by coda.samples() to a tibble; then we can extract the posterior samples of the two parameters. The MAP estimates and the posterior means of both parameters can be computed with the following code.

library(ggmcmc)
sample_e5<-ggs(post_e5)
Nk<-filter(sample_e5,Parameter=="Nk")$value
theta<-filter(sample_e5,Parameter=="theta")$value
# MAP estimate of number of surveys
dNk<-density(Nk)
MAP.N<-dNk$x[which.max(dNk$y)]
# MAP estimate of theta
dtheta<-density(theta)
MAP.theta<-dtheta$x[which.max(dtheta$y)]
c(MAP.N,MAP.theta)
## [1] 72.048197  0.065823
# Mean estimate of number of surveys
mean.N<-mean(Nk)
# Mean estimate of theta
mean.theta<-mean(theta)
c(mean.N,mean.theta)
## [1] 222.4280000   0.1738945
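
Note that the MAP estimates and the posterior means disagree sharply. This happens because the joint posterior is concentrated along a curved ridge on which \(n\theta\) stays close to the observed mean count, so the two marginal summaries land on very different parts of it. As an added check (not in the original analysis), here is a minimal sketch of locating the approximate joint mode with a 2-D kernel density estimate from MASS:

library(MASS)
# 2-D kernel density over the joint posterior samples
jd<-kde2d(Nk,theta,n=100)
# Grid point with the highest joint density: an approximate joint MAP
idx<-which(jd$z==max(jd$z),arr.ind=TRUE)
c(jd$x[idx[1]],jd$y[idx[2]])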

Then, we can plot the joint posterior distribution for these two parameters.

library(ggplot2)
dta<-data.frame(N=Nk,theta=theta)
dta.p<-data.frame(N=c(MAP.N,mean.N),theta=c(MAP.theta,mean.theta),
                  type=c("MAP","Mean"))
ggplot(dta,aes(x=N,y=theta))+
  geom_point(size=0.3,alpha=0.4)+
  geom_point(data=dta.p,aes(N,theta,shape=type,color=type))+
  scale_shape_manual(values=c(2,4))

Inferences with Gaussians

One of the most common inference problems involves assuming that data follow a Gaussian distribution and inferring the mean and standard deviation of that distribution from a sample of independent observations.

In JAGS, a Gaussian distribution is defined by two parameters, \(\mu\) and \(\lambda=\frac{1}{\sigma^2}\). Note that the precision \(\lambda\) is the reciprocal of the variance. Suppose you randomly sample 100 participants to take the WAIS intelligence test. The norm of the WAIS has a known mean of 100 and standard deviation of 15. What will be the estimates of the mean and the standard deviation based on this sample?
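
As a quick illustration of this parameterization (added for clarity), converting between the SD and the precision is a one-liner in R:

sigma<-15
lambda<-1/sigma^2   # the precision JAGS expects in dnorm(mu,lambda)
1/sqrt(lambda)      # converting back recovers sigma = 15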

First, we use R to draw 100 scores from a normal distribution with \(\mu=100\) and \(\sigma=15\) as the data for this simulation experiment. The histogram of the sampled scores is shown in the figure below.

set.seed(123)
k<-rnorm(100,100,15)
datalist<-list(k=k,s=length(k))
ggplot(data.frame(k=k),aes(k))+
  geom_histogram(color="white",fill="deepskyblue2",bins=10)

Now we initialize our JAGS model.

modelString<-"
model{
   for(i in 1:s){
      k[i]~dnorm(mu,lambda)
   }
   # Prior
   mu~dunif(20,200)
   sigma~dunif(0,30)
   lambda<-pow(sigma,-2)
}"
writeLines(modelString,con="e6_model.txt")

Now we estimate the parameters and convert the returned object to a tibble.

e6_model<-jags.model(file="e6_model.txt",data=datalist,
                     n.chains=2,n.adapt=100)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 100
##    Unobserved stochastic nodes: 2
##    Total graph size: 110
## 
## Initializing model
update(e6_model,n.iter=1000)
post_e6<-coda.samples(e6_model,c("mu","lambda","sigma"),
                      n.iter=1000)
sample_e6<-ggs(post_e6)
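
Before summarizing, mixing can be eyeballed with a trace plot (an added diagnostic, not part of the original analysis); ggs_traceplot() from ggmcmc draws one panel per monitored parameter.

ggs_traceplot(sample_e6)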

The MAP and mean estimates of these two parameters can be computed with the following code. For each parameter, the two estimates are quite close to each other.

mu<-filter(sample_e6,Parameter=="mu")$value
sigma<-filter(sample_e6,Parameter=="sigma")$value
# Get MAP estimates
d.mu<-density(mu)
d.mu$x[which.max(d.mu$y)]
## [1] 101.8665
d.sigma<-density(sigma)
d.sigma$x[which.max(d.sigma$y)]
## [1] 13.7164
# Get mean estimates
mean(mu)
## [1] 101.4252
mean(sigma)
## [1] 13.87241
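
As a sanity check (added here), the posterior estimates should track the sample statistics of the simulated data:

# Sample mean and SD of the 100 simulated WAIS scores
c(mean(k),sd(k))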

We can plot the posterior distributions for these two parameters.

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
fig1<-ggplot(data.frame(mu=mu),aes(mu))+
  geom_density()
fig2<-ggplot(data.frame(sigma=sigma),aes(sigma))+
  geom_density()
grid.arrange(fig1,fig2,ncol=2)

We can also check the correlation between the posterior samples of the mean and the SD. The scatter plot is shown below; together with the correlation test, it indicates essentially no correlation between them.

ggplot(data.frame(mu=mu,sigma=sigma),aes(mu,sigma))+
  geom_point(size=0.3,shape=1)+
  # Reference lines through the posterior means; color is set outside
  # aes() so that no spurious legend is created
  geom_segment(aes(x=min(mu),y=mean(sigma),xend=mean(mu),yend=mean(sigma)),
               color="red")+
  geom_segment(aes(x=mean(mu),y=min(sigma),xend=mean(mu),yend=mean(sigma)),
               color="red")

cor.test(mu,sigma)
## 
##  Pearson's product-moment correlation
## 
## data:  mu and sigma
## t = -1.288, df = 1998, p-value = 0.1979
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07254169  0.01504759
## sample estimates:
##         cor 
## -0.02880234

Inference under the Constraint of Other Parameters

Seven scientists with wildly differing skills all make a measurement of the same quantity. They obtain the answers \(x=\{-27.020,3.570,8.191,9.898,9.603,9.945,10.056\}\). Intuitively, it seems clear that the first two measurements are wildly inaccurate, and that the true value of the quantity is probably just a bit below 10. What is the most likely value of this quantity?

Normally, when we estimate the mean of a population, the sample mean is the default unbiased estimator. Here the sample mean of these values is 3.46, with a standard error of 5.16, so the 95% confidence interval for the true value runs from -6.64 to 13.57. This interval is very wide; in other words, the precision of the measurement of the quantity is low. A reasonable suspicion is that the first two values bias the estimate. You may want to trim them out, but you would need a principled reason to do so. In Bayesian data analysis, this is not a problem.
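
These numbers can be reproduced as follows (a quick check added here, using the normal approximation for the interval):

x<-c(-27.020,3.570,8.191,9.898,9.603,9.945,10.056)
m<-mean(x)                 # about 3.46
se<-sd(x)/sqrt(length(x))  # about 5.16
m+c(-1,1)*1.96*se          # roughly (-6.64, 13.57)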

First, we know that an observed score is the true score plus random error. That is, \(X=T+\epsilon\), where \(\epsilon\sim N(0,\sigma_\epsilon)\). Therefore, \(X\sim N(T,\sigma_\epsilon)\). The measurements of these 7 scientists can be viewed as samples drawn from a normal distribution whose mean is the true score and whose SD is the SD of the random errors.

Let us prepare the data first. Suppose for now that every scientist measures with the same precision. The model is shown below. A suitable prior distribution for \(\lambda\) is the Gamma distribution, \(\lambda\sim\Gamma(0.001,0.001)\).

# Prepare data
k<-c(-27.020,3.570,8.191,9.898,9.603,9.945,10.056)
n<-length(k)
datalist<-list(k=k,n=n)
# Set up model
modelString<-"
model{
   for(i in 1:n){
      k[i]~dnorm(mu,lambda)
   }
   # Prior
   mu~dnorm(0,0.001)
   lambda~dgamma(0.001,0.001)
   sigma<-pow(lambda,-0.5)
}"
writeLines(modelString,con="e6_model1.txt")

Now we estimate the parameters. Check the posterior mean of the mean score and you will find it quite close to the mean of the 7 measured scores.

e6_model1<-jags.model(file="e6_model1.txt",data=datalist,
                     n.chains=2,n.adapt=100)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 7
##    Unobserved stochastic nodes: 2
##    Total graph size: 15
## 
## Initializing model
update(e6_model1,n.iter=1000)
post1_e6<-coda.samples(e6_model1,c("mu","sigma"),n.iter=1000)
sample1_e6<-ggs(post1_e6)
mu<-filter(sample1_e6,Parameter=="mu")$value
sigma<-filter(sample1_e6,Parameter=="sigma")$value
mean(mu)
## [1] 3.488244
mean(sigma)
## [1] 15.56902

However, we have reason to believe that the first two scientists measured with much lower precision than the others. Thus, we can re-estimate the parameters under the assumption that each scientist has his or her own measurement precision.

modelString<-"
model{
   for(i in 1:n){
      k[i]~dnorm(mu,lambda[i])
   }
   # Priors
   mu~dnorm(0,0.001)
   for(i in 1:n){
      lambda[i]~dgamma(0.001,0.001)
      sigma[i]<-pow(lambda[i],-0.5)
   }
}"
writeLines(modelString,con="e6_model2.txt")

Now we estimate the parameters. The posterior mean of the true score is now 9.91, which is far more reasonable than the previous estimate. The mean estimates of \(\sigma\) for the 7 scientists confirm that the first two have very poor precision.

e6_model2<-jags.model(file="e6_model2.txt",data=datalist,
                      n.chains=2,n.adapt=100)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 7
##    Unobserved stochastic nodes: 8
##    Total graph size: 27
## 
## Initializing model
update(e6_model2,n.iter=1000)
post2_e6<-coda.samples(e6_model2,c("mu","sigma"),n.iter=1000)
sample2_e6<-ggs(post2_e6)
# Mean estimate of the true score
mu2<-filter(sample2_e6,Parameter=="mu")$value
mean(mu2)
## [1] 9.910931
# Mean estimate of the SD of measurement
sigma2<-sapply(1:n,function(x){
    temp<-paste0("sigma[",x,"]")
    filter(sample2_e6,Parameter==temp)$value
})
dim(sigma2)
## [1] 2000    7
colMeans(sigma2)
## [1] 225.4620471  44.7504210   8.5100584   0.9032357   1.9092042   0.4686368
## [7]   1.1766262

The highest density intervals for the posterior \(\mu\) and \(\sigma\) can be computed as follows. The 95% HDI for \(\mu\) is now quite narrow, indicating that the estimate of the measured quantity is precise. The 95% HDIs for \(\sigma\) are particularly wide for the first two scientists. (The negative lower bounds are an artifact of kernel density smoothing, since density() extends the estimate below the smallest observed value.)

library(HDInterval)
hdi(density(mu2))
## Warning in hdi.density(density(mu2)): The HDI is discontinuous but allowSplit = FALSE;
##     the result is a valid CrI but not HDI.
##     lower     upper 
##  9.595802 10.115288 
## attr(,"credMass")
## [1] 0.95
## attr(,"height")
## [1] 0.3717066
hdi.sigmas<-sapply(1:n,function(x){
  temp<-hdi(density(sigma2[,x]))
  return(temp[c(1,2)])
})
## Warning in hdi.density(density(sigma2[, x])): The HDI is discontinuous but allowSplit = FALSE;
##     the result is a valid CrI but not HDI.
hdi.sigmas
##            [,1]       [,2]      [,3]        [,4]       [,5]        [,6]
## lower -25.46082  -4.633496 -1.232916 -0.09824091 -0.2927514 -0.09085307
## upper 528.34857 123.707110 24.320022  1.83832532  4.5604787  2.31094483
##             [,7]
## lower -0.1582957
## upper  2.7184423
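
The warnings arise because the kernel density estimates are slightly multimodal, so a single contiguous interval is not strictly an HDI. If a split interval is acceptable, hdi() can be asked to return one (an optional refinement):

# Allow the HDI to split into disjoint pieces when the density is multimodal
hdi(density(mu2),allowSplit=TRUE)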