In psychological studies, the behavior of interest often depends on more than one parameter, and these parameters may interact with each other. Bayesian statistics can help estimate such parameters simultaneously; see Chapter 3.6 of the textbook by Lee and Wagenmakers (2013). Suppose 5 helpers each distributed a bundle of surveys to houses. We only know that every bundle contained the same number of surveys, \(n \in\{1,\ldots,500\}\). The 5 helpers then received \(k=\{16,18,22,25,27\}\) surveys back. What are the most likely return rate \(\theta\) and number of distributed surveys \(n\)?
First, each distributed survey is either returned or not, so the observed number of returned surveys should follow a binomial distribution with \(\theta\) and \(n\) as its parameters. Unlike the previous examples, in which the number of observations is known, here the number of surveys must itself be estimated. Second, the prior distribution of the return rate is assumed to be uniform, \(\theta\sim Beta(1,1)\). Third, since we have no idea what the number of surveys might be, any number between 1 and 500 is treated as equally likely. Thus, the prior for the survey number can be taken as a categorical distribution over 500 entries, each with probability 1/500.
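Because this model is small, we can also cross-check the MCMC results with a brute-force grid approximation. The following is a minimal sketch, assuming the same flat priors as the JAGS model below; the grid resolutions are arbitrary choices. The grid-based posterior mean of \(n\) should be close to the MCMC estimate computed later.
# Minimal grid approximation of the posterior, as a cross-check of MCMC.
# Assumes the same flat priors as the JAGS model below.
k<-c(16,18,22,25,27)
n.grid<-max(k):500      # the likelihood is zero for n below max(k)
theta.grid<-seq(0.001,0.999,length.out=500)
loglik<-function(n,th) sum(dbinom(k,size=n,prob=th,log=TRUE))
joint<-exp(outer(n.grid,theta.grid,Vectorize(loglik)))
marg.n<-rowSums(joint)              # unnormalized marginal posterior of n
sum(n.grid*marg.n)/sum(marg.n)      # posterior mean of n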
# Prepare data
N<-500                     # upper bound on the number of surveys
K<-c(16,18,22,25,27)       # returned surveys for each helper
datalist<-list(N=N,K=K,m=length(K))
# Set up model
modelString<-"
model{
  # Likelihood: each helper's returns are binomial with unknown size Nk
  for(i in 1:m){
    K[i]~dbin(theta,Nk)
  }
  # Priors
  Nk~dcat(Pi)              # discrete uniform over 1..N
  theta~dbeta(1,1)         # uniform prior on the return rate
  for(i in 1:N){
    Pi[i]<-1/N
  }
}"
writeLines(modelString,con="e5_model1.txt")
Now we can start parameter estimation.
library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
e5_model<-jags.model(file="e5_model1.txt",data=datalist,
n.chains=2,n.adapt=100)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 5
## Unobserved stochastic nodes: 2
## Total graph size: 12
##
## Initializing model
update(e5_model,n.iter=1000)
post_e5<-coda.samples(e5_model,c("theta","Nk"),n.iter=1000)
Next, we convert the object returned by coda.samples() into a tibble and extract the posterior samples of the two parameters. The MAP estimates and the means of the posterior distributions of the two parameters can then be computed with the following code.
library(ggmcmc)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: tidyr
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
sample_e5<-ggs(post_e5)
Nk<-filter(sample_e5,Parameter=="Nk")$value
theta<-filter(sample_e5,Parameter=="theta")$value
# MAP estimate of number of surveys (Nk is discrete, so the
# kernel density mode is only a continuous approximation)
dNk<-density(Nk)
MAP.N<-dNk$x[which.max(dNk$y)]
# MAP estimate of theta
dtheta<-density(theta)
MAP.theta<-dtheta$x[which.max(dtheta$y)]
c(MAP.N,MAP.theta)
## [1] 110.45078317 0.07146054
# Mean estimate of number of surveys
mean.N<-mean(Nk)
# Mean estimate of theta
mean.theta<-mean(theta)
c(mean.N,mean.theta)
## [1] 223.1205000 0.1372014
Then, we can plot the joint posterior distribution for these two parameters.
library(ggplot2)
dta<-data.frame(N=Nk,theta=theta)
dta.p<-data.frame(N=c(MAP.N,mean.N),theta=c(MAP.theta,mean.theta),
type=c("MAP","Mean"))
ggplot(dta,aes(x=N,y=theta))+
geom_point(size=0.3,alpha=0.4)+
geom_point(data=dta.p,aes(N,theta,shape=type,color=type))+
scale_shape_manual(values=c(2,4))
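Since we ran two chains, it is also worth verifying convergence before trusting these estimates. A quick check with the coda package (already loaded by rjags) might look like this; shrink factors near 1 suggest convergence:
# Gelman-Rubin diagnostic across the two chains, plus effective sample sizes
gelman.diag(post_e5)
effectiveSize(post_e5)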
One of the most common inference problems involves assuming that data follow a Gaussian distribution, and inferring the mean and the standard deviation of this distribution from a sample of independent observations.
In JAGS, a Gaussian distribution is defined by two parameters, \(\mu\) and \(\lambda=\frac{1}{\sigma^2}\). Note that \(\lambda\), called the precision, is the reciprocal of the variance, not of the standard deviation. Suppose you randomly sampled 100 participants to take the WAIS intelligence test. The mean and standard deviation of the WAIS norm are known to be 100 and 15. What will be the estimates of the mean and the standard deviation based on this sample?
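One practical caveat: R's own dnorm()/rnorm() functions take the standard deviation, not the precision, so a conversion is needed when moving between R and JAGS. A small illustration:
# JAGS dnorm(mu,lambda) expects a precision; R's rnorm(n,mean,sd) an SD.
sigma<-15
lambda<-1/sigma^2     # precision for JAGS (here 1/225)
lambda^(-0.5)         # converting back recovers sigma = 15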
First, we use R to draw 100 scores from a normal distribution with \(\mu=100\) and \(\sigma=15\) as the data for this simulation experiment. The histogram of the sampled scores is shown in the figure below.
set.seed(123)
k<-rnorm(100,100,15)
datalist<-list(k=k,s=length(k))
ggplot(data.frame(k=k),aes(k))+
geom_histogram(color="white",fill="deepskyblue2",bins=10)
Now we initialize our JAGS model.
modelString<-"
model{
  for(i in 1:s){
    k[i]~dnorm(mu,lambda)
  }
  # Priors
  mu~dunif(20,200)
  sigma~dunif(0,30)
  lambda<-pow(sigma,-2)    # JAGS dnorm takes precision = 1/sigma^2
}"
writeLines(modelString,con="e6_model.txt")
Now we estimate the parameters and then convert the returned object into a tibble.
e6_model<-jags.model(file="e6_model.txt",data=datalist,
n.chains=2,n.adapt=100)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 100
## Unobserved stochastic nodes: 2
## Total graph size: 110
##
## Initializing model
update(e6_model,n.iter=1000)
post_e6<-coda.samples(e6_model,c("mu","lambda","sigma"),
n.iter=1000)
sample_e6<-ggs(post_e6)
The MAP estimates and mean estimates of these two parameters can be computed with the following code. For each parameter, the two estimates are quite close to each other.
mu<-filter(sample_e6,Parameter=="mu")$value
sigma<-filter(sample_e6,Parameter=="sigma")$value
# Get MAP estimates
d.mu<-density(mu)
d.mu$x[which.max(d.mu$y)]
## [1] 101.2191
d.sigma<-density(sigma)
d.sigma$x[which.max(d.sigma$y)]
## [1] 13.71862
# Get mean estimates
mean(mu)
## [1] 101.3861
mean(sigma)
## [1] 13.8926
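As a sanity check, these posterior estimates can be compared with the classical sample statistics of the simulated scores, which they should track closely:
# Classical estimates from the same simulated sample, for comparison
mean(k)   # sample mean
sd(k)     # sample SD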
We can plot the posterior distributions for these two parameters.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
fig1<-ggplot(data.frame(mu=mu),aes(mu))+
geom_density()
fig2<-ggplot(data.frame(sigma=sigma),aes(sigma))+
geom_density()
grid.arrange(fig1,fig2,ncol=2)
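The spread of each posterior can also be summarized with a 95% highest density interval. Here is a minimal sketch using the HDInterval package (used again later for the seven-scientists example):
library(HDInterval)
# 95% HDIs computed directly from the posterior samples
hdi(mu,credMass=0.95)
hdi(sigma,credMass=0.95)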
We can also check the correlation between the posterior samples of the mean and the SD. The scatter plot of the means and SDs is shown below; there is no significant correlation between them.
ggplot(data.frame(mu=mu,sigma=sigma),aes(mu,sigma))+
geom_point(size=0.3,shape=1)+
geom_segment(aes(x=min(mu),y=mean(sigma),xend=mean(mu),yend=mean(sigma)),
color="red")+
geom_segment(aes(x=mean(mu),y=min(sigma),xend=mean(mu),yend=mean(sigma)),
color="red")
cor.test(mu,sigma)
##
## Pearson's product-moment correlation
##
## data: mu and sigma
## t = -1.3018, df = 1998, p-value = 0.1932
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07284828 0.01473944
## sample estimates:
## cor
## -0.0291103
Seven scientists with wildly differing skills all measure the same quantity. They obtain the answers \(x=\{-27.020,3.570,8.191,9.898,9.603,9.945,10.056\}\). Intuitively, it seems clear that the first two measurements are pretty inept, and that the true value of the quantity is probably just a bit below 10. What is the most likely value of this quantity?
Normally, when we estimate the mean of a population, the sample mean is the default unbiased estimate. Here, the mean of these values is 3.46 and the standard error is 5.16, so the 95% confidence interval for the true value runs from -6.64 to 13.57. This interval is quite wide; in other words, the precision of the measurement is low. A reasonable suspicion is that the first two values bias the estimate. You may want to trim them out, but you would need a justification for doing so. In Bayesian data analysis, this is not a problem.
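These classical numbers can be reproduced in R with a normal-approximation (z = 1.96) confidence interval:
# Classical mean, standard error, and normal-approximation 95% CI
x<-c(-27.020,3.570,8.191,9.898,9.603,9.945,10.056)
m<-mean(x)                  # 3.46
se<-sd(x)/sqrt(length(x))   # 5.16
m+c(-1.96,1.96)*se          # roughly (-6.64, 13.57)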
First, we know that an observed score is the true score plus a random error. That is, \(X=T+\epsilon\), where \(\epsilon\sim N(0,\sigma_\epsilon)\). Therefore, \(X\sim N(T,\sigma_\epsilon)\). The measurements of these 7 scientists can thus be viewed as samples drawn from a normal distribution whose mean is the true score and whose SD is the SD of the random errors.
Let us prepare the data first. Suppose for now that every scientist measures with the same precision. The model is shown below. Here, a suitable prior distribution for \(\lambda\) is the Gamma distribution, \(\lambda\sim \Gamma(0.001,0.001)\).
# Prepare data
k<-c(-27.020,3.570,8.191,9.898,9.603,9.945,10.056)
n<-length(k)
datalist<-list(k=k,n=n)
# Set up model
modelString<-"
model{
  for(i in 1:n){
    k[i]~dnorm(mu,lambda)
  }
  # Priors
  mu~dnorm(0,0.001)
  lambda~dgamma(0.001,0.001)
  sigma<-pow(lambda,-0.5)
}"
writeLines(modelString,con="e6_model1.txt")
Now we estimate the parameters. Note that the mean estimate of the mean score is quite close to the mean of the 7 measured scores.
e6_model1<-jags.model(file="e6_model1.txt",data=datalist,
n.chains=2,n.adapt=100)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 7
## Unobserved stochastic nodes: 2
## Total graph size: 15
##
## Initializing model
update(e6_model1,n.iter=1000)
post1_e6<-coda.samples(e6_model1,c("mu","sigma"),n.iter=1000)
sample1_e6<-ggs(post1_e6)
mu<-filter(sample1_e6,Parameter=="mu")$value
sigma<-filter(sample1_e6,Parameter=="sigma")$value
mean(mu)
## [1] 3.315206
mean(sigma)
## [1] 15.42851
However, we have reason to believe that the first two scientists measured with much lower precision than the others. Thus, we can re-estimate the parameters under the assumption that each scientist has a different measurement precision.
modelString<-"
model{
  for(i in 1:n){
    k[i]~dnorm(mu,lambda[i])    # one precision per scientist
  }
  # Priors
  mu~dnorm(0,0.001)
  for(i in 1:n){
    lambda[i]~dgamma(0.001,0.001)
    sigma[i]<-pow(lambda[i],-0.5)
  }
}"
writeLines(modelString,con="e6_model2.txt")
Now we estimate the parameters. The mean estimate of the true score is 9.92, which is more reasonable than the previous estimate. The mean estimates of \(\sigma\) for the 7 scientists show that the first two have very poor precision.
e6_model2<-jags.model(file="e6_model2.txt",data=datalist,
n.chains=2,n.adapt=100)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 7
## Unobserved stochastic nodes: 8
## Total graph size: 27
##
## Initializing model
update(e6_model2,n.iter=1000)
post2_e6<-coda.samples(e6_model2,c("mu","sigma"),n.iter=1000)
sample2_e6<-ggs(post2_e6)
# Mean estimate of the true score
mu2<-filter(sample2_e6,Parameter=="mu")$value
mean(mu2)
## [1] 9.91709
# Mean estimate of the SD of measurement
sigma2<-sapply(1:n,function(x){
temp<-paste0("sigma[",x,"]")
filter(sample2_e6,Parameter==temp)$value
})
dim(sigma2)
## [1] 2000 7
colMeans(sigma2)
## [1] 279.0978499 104.3341603 15.1707539 1.4654750 2.5421565 3.3917382
## [7] 0.8219529
The highest density intervals (HDIs) for the posteriors of \(\mu\) and \(\sigma\) can be computed as follows. The 95% HDI for \(\mu\) is now quite narrow, meaning that the measurement is precise. In contrast, the 95% HDIs for \(\sigma\) are particularly wide for the first two scientists.
library(HDInterval)
hdi(density(mu2))
## Warning in hdi.density(density(mu2)): The HDI is discontinuous but allowSplit = FALSE;
## the result is a valid CrI but not HDI.
## lower upper
## 9.621187 10.099332
## attr(,"credMass")
## [1] 0.95
## attr(,"height")
## [1] 0.3026617
hdi.sigmas<-sapply(1:n,function(x){
temp<-hdi(density(sigma2[,x]))
return(temp[c(1,2)])
})
hdi.sigmas
## [,1] [,2] [,3] [,4] [,5] [,6]
## lower -27.73943 -4.633415 -1.442246 -0.08405958 -0.3131237 -0.07155505
## upper 712.28292 176.764149 28.837281 2.30444502 7.3187236 7.49315860
## [,7]
## lower -0.1516489
## upper 2.4987315
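The negative lower bounds above are an artifact of the kernel density estimate leaking below zero, since \(\sigma\) is strictly positive. A sketch of an alternative is to apply hdi() directly to the posterior samples:
# HDIs computed from the raw posterior samples instead of a smoothed
# density; the lower bounds now respect the fact that sigma > 0.
apply(sigma2,2,hdi,credMass=0.95)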