Cognitive modeling in the Bayesian framework

In Chapter 17 of their textbook, Lee and Wagenmakers (2014) introduced how to fit the famous classification model, the GCM, to the data of Kruschke's (1993) study. As the GCM is a very successful classification model, it is a reasonable example for learning how to do cognitive modeling in the Bayesian framework.

In the GCM (Generalized Context Model; Nosofsky, 1986, 1987), every object is represented as a point in the stimulus space. The closer two points are to each other, namely the shorter the distance between them, the more similar the corresponding objects are. The summed similarity between an object and the exemplars of each category is the basis for categorization: the higher the summed similarity of an object to a category, the more likely it is to be classified as that category. Suppose there are two categories, A and B. The similarity of an input item \(i\) to exemplar \(j\) is \(s_{ij}=e^{-cd_{ij}}\), where \(c\) is the sensitivity parameter (the larger \(c\) is, the faster similarity drops off with distance) and \(d_{ij}=\sum_m \alpha_m|x_{im}-x_{jm}|\) is the distance between items \(i\) and \(j\), summed over dimensions \(m\) and weighted by the selective attention \(\alpha_m\), with \(\sum_m \alpha_m=1\). This distance is also called the city-block distance, which is suitable for stimuli composed of psychologically separable dimensions (e.g., size and angle). For stimuli composed of psychologically integral dimensions, the Euclidean distance is more suitable.

The summed similarity of item \(i\) to all exemplars of Category A is \(S_A=\sum_{j\in A} s_{ij}\), and likewise the summed similarity to all exemplars of Category B is \(S_B=\sum_{j\in B} s_{ij}\). The probability of assigning this item to Category A is then \(p(A|i)=\frac{\beta_A S_A}{\beta_A S_A+\beta_B S_B}\), where \(\beta_A\) is the bias parameter and \(\beta_B=1-\beta_A\). The bias parameter reflects the propensity to choose a particular category regardless of the similarity evidence; normally it reflects the base rate of a category. For instance, if one category has twice as many exemplars as the other, its bias parameter might be twice as large.
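
To make these equations concrete, below is a minimal sketch of the GCM decision rule in R. The function name gcm_prob_A, its arguments, and the toy stimuli are my own illustration, not taken from the Kruschke (1993) data.

# Sketch of the GCM decision rule: probability of assigning item x to Category A
gcm_prob_A <- function(x, exemplars, category, c = 1, alpha = rep(0.5, 2), beta_A = 0.5) {
  # city-block distance of item x to every exemplar, weighted by selective attention
  d <- apply(exemplars, 1, function(e) sum(alpha * abs(x - e)))
  s <- exp(-c * d)                  # exponential similarity
  S_A <- sum(s[category == "A"])    # summed similarity to Category A exemplars
  S_B <- sum(s[category == "B"])    # summed similarity to Category B exemplars
  beta_A * S_A / (beta_A * S_A + (1 - beta_A) * S_B)
}

# Toy example: four exemplars on two dimensions (values are made up)
ex <- rbind(c(1, 2), c(2, 1), c(3, 4), c(4, 3))
gcm_prob_A(x = c(1.5, 1.5), exemplars = ex, category = c("A", "A", "B", "B"))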

In the study of Kruschke (1993), the category structure is shown in the figure below. The category boundary is an ascending diagonal line: anything above it belongs to Category B, and anything below it to Category A. According to the assumption of exemplar models (e.g., the GCM and ALCOVE), selective attention can only be tuned along the stimulus dimensions. That is, when the selective attention to dimension x is increased, the selective attention to dimension y must be decreased, and vice versa. This category structure provides a situation in which both dimensions should be attended to equally in order to learn the boundary.

library(ggplot2)
# Category structure of Kruschke (1993): 8 stimuli on two dimensions
dta<-data.frame(x=c(1,1,2,2,3,3,4,4),
                y=c(2,3,1,4,1,4,2,3),
                c=c("B","B","A","B","A","B","A","A"))
ggplot(dta,aes(x,y))+
  geom_point(aes(shape=c),size=4)+                 # shape marks the category
  geom_segment(aes(x=1,y=1,xend=4,yend=4),lty=2)   # the diagonal category boundary

Let’s fit the GCM to the data of Kruschke (1993). First, we need to load the data. The file KruschkeData.Rdata contains the distances between each item and each exemplar on dimensions 1 and 2, stored in d1 and d2 respectively. There were 40 subjects in total, each of whom learned these 8 items over 8 blocks. The number of times each subject classified each item as Category A across these 8 blocks is stored in the matrix y. The file also contains the number of stimuli (nstim), the number of subjects (nsubj), and the category assignment of each exemplar (a), which are used in the data lists below.

load(file="KruschkeData.Rdata")

Set up the model

Now we set up our JAGS model to fit the GCM to the Kruschke (1993) data in the Bayesian framework. As in the previous examples, we need to decide which parameters to estimate and what prior distributions they should follow. In this case, as there are equal numbers of exemplars in the two categories, the biases for the two categories are set to be equal. Thus, in total, there are two free parameters to estimate: \(c\) and \(\alpha_1\) (with \(\alpha_2=1-\alpha_1\)). As the response on each trial is either Category A or Category B, the outcome is binary. Thus, for each stimulus across the 8 blocks, the responses follow a binomial distribution. That is, the likelihood of observing \(y\) Category A responses for stimulus \(i\) in 8 blocks is \(\binom{8}{y}p(A|i)^y(1-p(A|i))^{8-y}\). In the previous examples, the probability of Category A was a free parameter. In the current case, however, the GCM computes it, so we do not need to estimate this probability directly. Instead, we estimate \(c\) and \(\alpha_1\) so that the GCM's predictions are as close as possible to the observed data.
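
As a quick check of this likelihood, the same quantity can be computed directly with dbinom(); the value p(A|i) = 0.7 below is just a made-up number for illustration.

# Likelihood of observing y = 6 Category A responses in 8 blocks,
# given a hypothetical GCM-predicted probability p(A|i) = 0.7
dbinom(6, size = 8, prob = 0.7)
## [1] 0.2964755
choose(8, 6) * 0.7^6 * 0.3^2   # the same value, written out explicitly
## [1] 0.2964755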

Fitting to aggregated data

When doing the modeling, we can either take the individual differences among subjects into account or ignore them. Let’s start with the case that ignores individual differences. To this end, we count the number of times each stimulus was classified as Category A across all subjects and all 8 blocks. Thus, the total number of trials for each stimulus is \(t=40\times8=320\).

t<-40*8            # 40 subjects x 8 blocks = 320 trials per stimulus
k<-y               # keep the stimulus-by-subject counts for the individual-level analysis
y<-rowSums(y)      # aggregated counts of Category A responses per stimulus
datalist<-list(d1=d1,d2=d2,t=t,y=y,nstim=nstim,a=a)

The JAGS model can be composed in the following way.

modelString<-"
model{
   # Decision data
   for(i in 1:nstim){
      y[i]~dbin(r[i],t)
      pred[i]~dbin(r[i],t)
   }
   # Similarity computation
   for(i in 1:nstim){
      for(j in 1:nstim){
         s[i,j]<-exp(-c*(w*d1[i,j]+(1-w)*d2[i,j]))
      }
   }
   # Check if exemplar belongs to Category 1
   for(j in 1:nstim){
         index1[j]<-1*(a[j]==1)+0*(a[j]==2)
         index2[j]<-1-index1[j]
   }
   # Decision probability
   for(i in 1:nstim){
      temp1[i]<-inprod(s[i,],index1)
      temp2[i]<-inprod(s[i,],index2)
      r[i]<-temp1[i]/(temp1[i]+temp2[i])
   }
   # Priors
   c~dunif(0,5)
   w~dbeta(1,1)
}"
writeLines(modelString,con="gcm.txt")

Now we can run the model.

library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
gcm.m1<-jags.model(file="gcm.txt",data=datalist,
                   n.chains=2,n.adapt=1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 8
##    Unobserved stochastic nodes: 10
##    Total graph size: 299
## 
## Initializing model
update(gcm.m1,n.iter=1000)
post.m1<-coda.samples(gcm.m1,c("c","w","pred"),n.chains=2,n.iter=1000)

Convert the posterior samples into R objects. The scatter plot below shows that the sensitivity parameter is largely independent of the attentional weight on dimension 1.

library(ggmcmc)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: tidyr
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
sample.m1<-ggs(post.m1)
c<-filter(sample.m1,Parameter=="c")$value
w<-filter(sample.m1,Parameter=="w")$value
preds<-sapply(1:nstim,function(x){
  temp<-paste0("pred[",x,"]")
  filter(sample.m1,Parameter==temp)$value
})
library(ggplot2)
ggplot(data.frame(c=c,w=w),aes(c,w))+
  geom_point(shape=1,size=0.8)+
  xlab("Sensitivity")+ylab("Attention on Dimension 1")

The mean estimate of the sensitivity parameter is close to 1.00, and the mean estimate of the attentional weight on dimension 1 is close to .60 (so the weight on dimension 2 is close to .40). The GCM's predicted counts for each stimulus are also close to the observed counts.

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
figc<-ggplot(data.frame(c=c),aes(c))+
  geom_density()
figw<-ggplot(data.frame(w=w),aes(w))+
  geom_density()
grid.arrange(figc,figw,ncol=2)

mean(c)
## [1] 0.9786591
mean(w)
## [1] 0.5829929
rbind(observed=y,prediction=colMeans(preds))
##                [,1]     [,2]    [,3]     [,4]     [,5]    [,6]    [,7]   [,8]
## observed   245.0000 218.0000 255.000 126.0000 182.0000 71.0000 102.000 65.000
## prediction 249.7825 224.2335 235.735 114.0245 208.3805 87.4325  97.809 73.024

Fitting to individual data

When considering individual differences, we should not use the aggregated data. Instead, the individual subjects' data are used and the model is adjusted accordingly. Specifically, the sensitivity parameter and the attentional weight on dimension 1 are estimated separately for each subject. One thing worth noting about programming in WinBUGS or JAGS is that each variable (node) can be assigned a value only once, when it is declared. Although newer versions of JAGS are said to allow updating the value of an existing variable, it is still a bit unclear how this actually works, so I would suggest creating a new variable to store a value instead of overwriting an existing one. This practice can be seen clearly in the code below. Since every subject has his/her own sensitivity parameter and attentional weight on dimension 1, the similarity between the \(i\)th stimulus and the \(j\)th exemplar can differ across subjects. Thus, to avoid overwriting an existing similarity value, I create a three-dimensional array with the third dimension indexing the subject. Similarly, when computing the probability of Category 1, a two-dimensional matrix is created with the second dimension indexing the subject.

# each stimulus appears once per block, so t = 8 trials per stimulus per subject
datalist<-list(y=k,d1=d1,d2=d2,a=a,nstim=nstim,nsubj=nsubj,t=8)
modelString<-"
model{
   # Decision data
   for(k in 1:nsubj){
      for(i in 1:nstim){
         y[i,k]~dbin(r[i,k],t)
         pred[i,k]~dbin(r[i,k],t)
      }
   }
   # Similarity computation
   for(k in 1:nsubj){
      for(i in 1:nstim){
         for(j in 1:nstim){
            s[i,j,k]<-exp(-c[k]*(w[k]*d1[i,j]+(1-w[k])*d2[i,j]))
         }
      }
   }
   # Check if exemplar belongs to Category 1
   for(j in 1:nstim){
         index1[j]<-1*(a[j]==1)+0*(a[j]==2)
         index2[j]<-1-index1[j]
   }
   # Decision probability
   for(k in 1:nsubj){
      for(i in 1:nstim){
         temp1[i,k]<-inprod(s[i,,k],index1)
         temp2[i,k]<-inprod(s[i,,k],index2)
         r[i,k]<-temp1[i,k]/(temp1[i,k]+temp2[i,k])
      }
   }
   # Priors
   for(k in 1:nsubj){
      c[k]~dunif(0,5)
      w[k]~dbeta(1,1)
   }
}"
writeLines(modelString,con="gcm_indv.txt")

Now we run the individual-subject model. Let’s also check how well the GCM can predict these data, namely its goodness of fit. There are many indices of goodness of fit; the maximum likelihood is in fact one of them. Here I introduce another common index, the RMSD (Root Mean Square Deviation), defined as \(\mathrm{RMSD}=\sqrt{\frac{\sum_i(p_i-o_i)^2}{N}}\), where \(p_i\) is the model prediction for the \(i\)th stimulus, \(o_i\) is the observed value for the \(i\)th stimulus, and \(N\) is the total number of data points. The smaller the RMSD, the better the fit. An RMSD is computed for each subject, and their mean serves as an index of how well the GCM fits these data.
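
As a small numeric illustration of the formula before applying it to the model output below (the helper function rmsd() and the toy numbers are my own, not part of the analysis):

# RMSD between model predictions p and observations o, both as proportions
rmsd <- function(p, o) sqrt(mean((p - o)^2))
rmsd(p = c(0.9, 0.7, 0.4, 0.2), o = c(1.0, 0.6, 0.5, 0.1))
## [1] 0.1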

gcm.m2<-jags.model(file="gcm_indv.txt",data=datalist,
                   n.chains=2,n.adapt=1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 320
##    Unobserved stochastic nodes: 400
##    Total graph size: 5877
## 
## Initializing model
update(gcm.m2,n.iter=1000)
post.m2<-coda.samples(gcm.m2,c("c","w","pred"),n.chains=2,n.iter=1000)
sample.m2<-ggs(post.m2)
preds<-sapply(1:nsubj,function(k){
   sapply(1:nstim,function(j){
      temp<-paste0("pred[",j,",",k,"]")
      mean(filter(sample.m2,Parameter==temp)$value)
   })
})
GoF<-sapply(1:nsubj,function(j){
   p<-preds[,j]/8      # predicted proportion of Category A responses (out of 8 blocks)
   o<-k[,j]/8          # observed proportion of Category A responses (out of 8 blocks)
   return(sqrt(mean((p-o)^2)))
})
GoF
##  [1] 0.11174942 0.19898957 0.23317092 0.07931802 0.20843488 0.14382260
##  [7] 0.13871698 0.13381812 0.16181946 0.22150121 0.17118624 0.13797834
## [13] 0.20123021 0.10530506 0.15360597 0.17733802 0.10439563 0.14690323
## [19] 0.21141308 0.13460798 0.17551814 0.16616706 0.14607310 0.14520364
## [25] 0.08525260 0.18874376 0.16161853 0.06527919 0.22613165 0.12539493
## [31] 0.16412933 0.08371352 0.18534609 0.20628441 0.19784855 0.23609039
## [37] 0.07008575 0.14453346 0.24800764 0.17136728
mean(GoF)
## [1] 0.1592023

Let’s check the posterior estimates of these two parameters. The scatter plot of each subject's estimated sensitivity and attentional weight on dimension 1 shows a clear clustering pattern: these 40 subjects can apparently be separated into two groups. One group pays more attention (>.5) to dimension 1, whereas the other group pays more attention to dimension 2 instead. Although the mean of the estimated attentional weights on dimension 1 seems to suggest that subjects attended to the two dimensions about equally, this is in fact the result of averaging two opposite groups (see the quick k-means check after the scatter plot below).

pars<-sapply(1:nsubj,function(j){
     temp.c<-paste0("c[",j,"]")
     temp.w<-paste0("w[",j,"]")
     mean.c<-mean(filter(sample.m2,Parameter==temp.c)$value)
     mean.w<-mean(filter(sample.m2,Parameter==temp.w)$value)
     return(c(mean.c,mean.w))
})
pars<-data.frame(t(pars))
names(pars)<-c("c","w")
colMeans(pars)
##         c         w 
## 1.3994849 0.5813179
ggplot(pars,aes(c,w))+
   geom_point()+xlab("Sensitivity")+ylab("Attentional Weight on Dimension 1")
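
Before building a latent-group model, a quick and informal way to confirm the two-cluster impression is to run k-means on the estimated attentional weights stored in pars. This is only a descriptive check I add here, not part of the original analysis, so the cluster sizes may differ somewhat from the latent-group results below.

# Informal check: cluster subjects by their mean attentional weight on dimension 1
set.seed(1234)                     # arbitrary seed, for reproducibility only
km <- kmeans(pars$w, centers = 2)
table(km$cluster)                  # sizes of the two clusters
tapply(pars$w, km$cluster, mean)   # one cluster mean should be above .5, the other below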

Latent groups in Kruschke (1993) data

As the scatter plot shows that these subjects can be separated into two groups by their attentional weight, we can estimate the attentional weights of these two groups as well as those of the individual subjects. To this end, we modify our model slightly, as follows.

modelString<-"
model{
   # Decision data
   for(k in 1:nsubj){
      for(i in 1:nstim){
         y[i,k]~dbin(r[i,k],t)
         pred[i,k]~dbin(r[i,k],t)
      }
   }
   # Similarity computation
   for(k in 1:nsubj){
      for(i in 1:nstim){
         for(j in 1:nstim){
            s[i,j,k]<-exp(-c[k]*(w[k]*d1[i,j]+(1-w[k])*d2[i,j]))
         }
      }
   }
   # Check if exemplar belongs to Category 1
   for(j in 1:nstim){
         index1[j]<-1*(a[j]==1)+0*(a[j]==2)
         index2[j]<-1-index1[j]
   }
   # Decision probability
   for(k in 1:nsubj){
      for(i in 1:nstim){
         temp1[i,k]<-inprod(s[i,,k],index1)
         temp2[i,k]<-inprod(s[i,,k],index2)
         r[i,k]<-temp1[i,k]/(temp1[i,k]+temp2[i,k])
      }
   }
   # Priors
   for(k in 1:nsubj){
      c[k]~dunif(0,5)
      # Groupping
      g[k]~dbern(0.5)
      highw[k]~dbeta(1,1)T(0.5,1)
      loww[k]~dbeta(1,1)T(0,0.5)
      w[k]<-loww[k]*(g[k]==0)+highw[k]*(g[k]==1)
   }
}"
writeLines(modelString,con="gcm_latent.txt")

Run the model and convert the posterior samples into a tibble.

gcm.m3<-jags.model(file="gcm_latent.txt",data=datalist,
                   n.chains=2,n.adapt=1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 320
##    Unobserved stochastic nodes: 480
##    Total graph size: 6158
## 
## Initializing model
update(gcm.m3,n.iter=1000)
post.m3<-coda.samples(gcm.m3,c("c","w","g","pred"),n.chains=2,n.iter=1000)
sample.m3<-ggs(post.m3)

According to this model, there are two latent groups: one puts more than 50% of its attention on dimension 1 and the other puts less than 50% on dimension 1. The first is called the high group. The mean probability of each of the 40 subjects being assigned to the high group can be computed as follows. Apparently, some subjects are consistently classified into one group or the other, but some are not.

gs<-sapply(1:nsubj,function(k){
   temp<-paste0("g[",k,"]")
   filter(sample.m3,Parameter==temp)$value
})
colMeans(gs)
##  [1] 0.5285 0.0440 0.9885 0.9950 1.0000 0.9840 0.6360 0.9990 0.6450 1.0000
## [11] 0.9855 0.9815 0.8460 0.6230 0.7720 0.9965 0.9895 0.9940 0.8620 0.9720
## [21] 0.9385 0.7825 0.0685 0.1000 0.0380 0.9995 0.0120 0.9995 0.2425 0.6265
## [31] 0.4130 0.6200 0.0000 0.0005 0.9910 0.8825 0.8110 0.4370 0.0000 0.2555

We can also check the posterior distributions of the estimated attentional weight on dimension 1 for the two groups. Apparently, more subjects mainly attended to dimension 1 when learning the categories. In other words, dimension 1 was more dominant than dimension 2 for categorization in this experiment.

ws<-sapply(1:nsubj,function(j){
   temp<-paste0("w[",j,"]")
   filter(sample.m3,Parameter==temp)$value
})
group<-ifelse(colMeans(gs)>0.5,"Dim1","Dim2")
table(group)
## group
## Dim1 Dim2 
##   28   12
w.dta<-data.frame(w=c(ws),group=rep(group,each=2000))
ggplot(w.dta,aes(w))+
   geom_density(aes(color=group))+xlab("Attentional Weight on Dimension 1")