In this section, the paper by Li, Schofield, and Gönen (2019) in the Journal of Mathematical Psychology is introduced to show how Dirichlet process mixture modeling can be implemented with multivariate Gaussian models. The basic idea is that the observed data points can be separated into \(K\) clusters, each of which can be realized as a multivariate Gaussian distribution with parameters \(\mu_k\) and \(\tau_k\), where \(\tau_k\) is the precision parameter, \(\tau_k=1/\sigma^2_k\). These Gaussian models have their own weights, given by a posterior Dirichlet distribution with \(\alpha\) as the hyperparameter vector. Putting it all together, we have
\[\begin{align} \pi^{(t)}_k&\sim Dirichlet(n_1+\alpha/K,n_2+\alpha/K,\cdots,n_K+\alpha/K)\\ \mu^{(t)}_k|c^{(t-1)},y,\tau^{(t-1)}_k&\sim N\left(\frac{\bar{y}_k n_k\tau^{(t-1)}_k+\mu_0\tau_0}{n_k\tau^{(t-1)}_k+\tau_0},n_k\tau^{(t-1)}_k+\tau_0\right)\\ \tau^{(t)}_k|c^{(t-1)},y,\mu^{(t-1)}_k&\sim Gamma\left(\alpha+\frac{n_k}{2},\beta+\frac{1}{2}\sum_{i:c_i=k}(y_i-\bar{y}_k)^2\right)\\ c^{(t)}&\sim Multinomial(1,\pi^{(t)}_1N(\mu^{(t)}_1,\tau^{(t)}_1),\pi^{(t)}_2N(\mu^{(t)}_2,\tau^{(t)}_2),\cdots,\pi^{(t)}_KN(\mu^{(t)}_K,\tau^{(t)}_K)), \end{align}\]
where \(t\) indexes the trial, \(\pi^{(t)}_k\) is the probability of cluster \(k\) on trial \(t\), \(n^{(t-1)}_k\) is the size of cluster \(k\) on trial \(t-1\), and \(K\) is the total number of clusters. Here \(K\) is finite, but in fact it can be taken to infinity, \(K\rightarrow\infty\).
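For the finite-\(K\) case, one Gibbs sweep cycles through these four conditionals in turn. The sketch below (Python with NumPy; the hyperparameter names `a`, `b`, `mu0`, `tau0` and their values are illustrative assumptions, not taken from the paper) draws \(\pi\), \(\tau_k\), \(\mu_k\), and then \(c\):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(y, c, K, alpha=1.0, mu0=0.0, tau0=1.0, a=1.0, b=1.0):
    """One Gibbs sweep for a finite-K Gaussian mixture with a symmetric
    Dirichlet(alpha/K) prior on the weights (a sketch; a, b are the
    shape/rate hyperparameters of the Gamma prior on the precisions)."""
    n = len(y)
    counts = np.array([(c == k).sum() for k in range(K)])

    # pi | c ~ Dirichlet(n_k + alpha/K)
    pi = rng.dirichlet(counts + alpha / K)

    mu = np.empty(K)
    tau = np.empty(K)
    for k in range(K):
        yk = y[c == k]
        nk = len(yk)
        ybar = yk.mean() if nk > 0 else 0.0
        # tau_k | c, y ~ Gamma(a + n_k/2, b + 0.5 * sum_i (y_i - ybar_k)^2)
        rate = b + 0.5 * ((yk - ybar) ** 2).sum()
        tau[k] = rng.gamma(a + nk / 2.0, 1.0 / rate)
        # mu_k | c, y, tau_k ~ N(posterior mean, precision n_k*tau_k + tau0)
        prec = nk * tau[k] + tau0
        mean = (ybar * nk * tau[k] + mu0 * tau0) / prec
        mu[k] = rng.normal(mean, 1.0 / np.sqrt(prec))

    # c_i ~ Multinomial(1, pi_k * N(y_i; mu_k, tau_k)), normalized per item
    like = (np.sqrt(tau)[:, None]
            * np.exp(-0.5 * tau[:, None] * (y[None, :] - mu[:, None]) ** 2))
    w = pi[:, None] * like
    w /= w.sum(axis=0)
    c_new = np.array([rng.choice(K, p=w[:, i]) for i in range(n)])
    return pi, mu, tau, c_new
```

In practice this sweep would be repeated many times, discarding an initial burn-in period before summarizing the cluster assignments.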
First, we can derive the probability of the cluster assignments given the hyperparameter \(\alpha\) by integrating out the weights \(\pi_k\), where
\[\begin{align} p(c_1,c_2,\cdots,c_n|\alpha)&=\int p(c_1,c_2,\cdots,c_n|\pi_1,\pi_2,\cdots,\pi_K)p(\pi_1,\pi_2,\cdots,\pi_K)d\pi_1 d\pi_2 \cdots d\pi_K\\ &=\frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^K}\int\prod^{K}_{k=1} \pi_k^{n_k+\alpha/K-1}d\pi_1 d\pi_2\cdots d\pi_K\\ &=\frac{\Gamma(\alpha)}{\Gamma(n+\alpha)}\prod^{K}_{k=1} \frac{\Gamma(n_k+\alpha/K)}{\Gamma(\alpha/K)}. \end{align}\]
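As a sanity check on this marginalization, one can approximate \(E_\pi[\prod_k\pi_k^{n_k}]\) by Monte Carlo under the symmetric Dirichlet prior and compare it with the closed form \(\frac{\Gamma(\alpha)}{\Gamma(n+\alpha)}\prod_k\frac{\Gamma(n_k+\alpha/K)}{\Gamma(\alpha/K)}\). This is a quick numerical check with illustrative numbers, not part of the paper:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)

# Counts n_k for K = 3 clusters over n = 3 items (illustrative numbers).
alpha, K = 1.5, 3
counts = np.array([2, 1, 0])
n = counts.sum()

# Closed form: Gamma(alpha)/Gamma(n+alpha) * prod_k Gamma(n_k+alpha/K)/Gamma(alpha/K)
log_closed = (gammaln(alpha) - gammaln(n + alpha)
              + (gammaln(counts + alpha / K) - gammaln(alpha / K)).sum())
closed = np.exp(log_closed)

# Monte Carlo: E_pi[ prod_k pi_k^{n_k} ] with pi ~ Dirichlet(alpha/K, ..., alpha/K)
pis = rng.dirichlet(np.full(K, alpha / K), size=200_000)
mc = (pis ** counts).prod(axis=1).mean()
```

The two quantities agree to Monte Carlo accuracy, confirming that the Dirichlet integral has been carried out correctly.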
This equation can be leveraged to obtain the conditional prior of a single assignment \(c_i\) given all the other assignments \(c_{-i}\), which no longer involves the weights \(\pi_k\). See below.
\[\begin{align} p(c_i=k|c_{-i},\alpha)=\frac{n_{-i,k}+\alpha/K}{n-1+\alpha}, \end{align}\]
where \(n_{-i,k}\) is the number of data points in cluster \(k\) excluding item \(i\), and \(n-1\) is the total number of observed data points excluding item \(i\). When \(K\rightarrow\infty\), this equation becomes
\[\begin{align} p(c_i=k|c_{-i},\alpha)=\frac{n_{-i,k}}{n-1+\alpha}. \end{align}\]
Since the existing clusters together contain all the other items, \(\sum_k n_{-i,k}=n-1\), so the total probability of assigning the current item \(y_i\) to any of the existing clusters \(k\) is \(\frac{n-1}{n-1+\alpha}\). Accordingly, the probability of creating a new cluster is
\[\begin{align} p(c_i\neq c_j\,\forall j\neq i|c_{-i},\alpha)=1-\frac{n-1}{n-1+\alpha}=\frac{\alpha}{n-1+\alpha}. \end{align}\]
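These two expressions together form the Chinese-restaurant-process prior over assignments. A minimal helper (function and argument names are my own) that returns the normalized probabilities:

```python
import numpy as np

def crp_prior(counts, alpha):
    """Prior assignment probabilities in the K -> infinity limit:
    n_{-i,k}/(n-1+alpha) for each existing cluster, alpha/(n-1+alpha) new."""
    counts = np.asarray(counts, dtype=float)  # n_{-i,k}, excluding item i
    denom = counts.sum() + alpha              # n - 1 + alpha
    existing = counts / denom                 # one entry per occupied cluster
    new = alpha / denom                       # probability of a fresh cluster
    return existing, new
```

Note that the existing-cluster probabilities and the new-cluster probability sum to one, as they must.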
Now we know the probability of assigning an item to an old or a new cluster under a Dirichlet prior. However, the assignment of an item \(y_i\) to a cluster depends not only on the cluster size but also on the likelihood that the Gaussian model corresponding to that cluster generated it. That is, we need to solve the conditional posterior probability of \(c_i\), namely
\[\begin{align} p(c_i=k|c_{-i},\mu_k,\tau_k,\alpha)&\propto p(c_i=k|c_{-i},\alpha)p(\tilde{y}_i|\mu_k,\tau_k,c_{-i})\\ &\propto \frac{n_{-i,k}}{n-1+\alpha}N\left(\tilde{y}_i;\frac{\bar{y}_k n_k\tau_k+\mu_0\tau_0}{n_k\tau_k+\tau_0},\left(\frac{1}{n_k\tau_k+\tau_0}+\frac{1}{\tau_y}\right)^{-1}\right). \end{align}\]
Similarly, the conditional posterior probability of assigning an item to a new cluster is
\[\begin{align} p(c_i\neq c_j\,\forall j\neq i|c_{-i},\mu_0,\tau_0,\alpha) &\propto p(c_i\neq c_j\,\forall j\neq i|c_{-i},\alpha)\times\int p(\tilde{y}_i|\mu_k,\tau_k)p(\mu_k,\tau_k|\mu_0,\tau_0)d\mu_k d\tau_k\\ &\propto \frac{\alpha}{n-1+\alpha}N\left(\tilde{y}_i;\mu_0,\left(\frac{1}{\tau_0}+\frac{1}{\tau_y}\right)^{-1}\right), \end{align}\]
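Combining the two cases, the per-item assignment step can be sketched as follows (function and argument names are illustrative; the predictive variances are written as \(1/(n_k\tau_k+\tau_0)+1/\tau_y\) for occupied clusters and \(1/\tau_0+1/\tau_y\) for a new one, i.e. posterior variance plus measurement variance):

```python
import numpy as np

def assignment_probs(y_i, counts, ybar, tau_k, alpha, mu0, tau0, tau_y):
    """Normalized posterior probabilities of assigning y_i to each occupied
    cluster or to a new one (a sketch). counts[k] = n_{-i,k}, ybar[k] = the
    cluster mean, tau_k[k] = the cluster precision."""
    counts = np.asarray(counts, dtype=float)
    ybar = np.asarray(ybar, dtype=float)
    tau_k = np.asarray(tau_k, dtype=float)

    # Posterior mean/precision of mu_k; predictive variance adds 1/tau_y.
    post_prec = counts * tau_k + tau0
    pred_mean = (ybar * counts * tau_k + mu0 * tau0) / post_prec
    pred_var = 1.0 / post_prec + 1.0 / tau_y

    def normal_pdf(x, m, v):
        return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

    # Unnormalized weights: n_{-i,k} * N(...) for old, alpha * N(...) for new.
    w_old = counts * normal_pdf(y_i, pred_mean, pred_var)
    w_new = alpha * normal_pdf(y_i, mu0, 1.0 / tau0 + 1.0 / tau_y)
    w = np.append(w_old, w_new)
    return w / w.sum()
```

The returned vector has one entry per occupied cluster plus a final entry for the new-cluster option, and an item close to a cluster's mean is assigned to that cluster with high probability.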
where \(\tau_y\) is the precision of measurement. In these equations, there are three types of precision parameters. The first is \(\tau_0\), the precision parameter of the base model, from which all Gaussian models are derived. The second is \(\tau_k\), the precision parameter of the particular Gaussian model corresponding to cluster \(k\). The last is \(\tau_y\), the precision parameter of the Gaussian posterior predictive distribution for a cluster. For the sake of simplification, this precision is set to the same \(\tau_y\) for all clusters, so we can treat it as a measurement error.
The posterior predictive distributions (PPDs) for already occupied clusters can be derived as follows. Suppose a variable \(y\) is normally distributed with an unknown mean of \(\mu\) and a measurement error in the form of a fixed and known variance \(\sigma^2_y\), that is
\[\begin{align} y\sim N(\mu,\sigma^2_y). \end{align}\]
Also, if the unknown mean \(\mu\) is expressed with a prior belief of
\[\begin{align} \mu\sim N(\mu_0,\sigma^2_0), \end{align}\]
then the predictive distribution of a newly observed value \(\tilde{y}\) is
\[\begin{align} \tilde{y}\sim N(\mu_0,\sigma^2_0+\sigma^2_y). \end{align}\]
In fact, the posterior distribution of the mean of an already occupied cluster can be derived as
\[\begin{align} p(\mu_k|y,c=k)&\propto p(y,c=k|\mu_k)p(\mu_k)\\ \mu_k|y,c=k&\sim N\left(\frac{\bar{y}_k n_k\tau_k+\mu_0\tau_0}{n_k\tau_k+\tau_0},n_k\tau_k+\tau_0\right). \end{align}\]
Thus, the PPD to generate a new observation \(\tilde{y}\) in the \(k\)th cluster is
\[\begin{align} \tilde{y}\sim N(\frac{\bar{y}_kn_k\tau_k+\mu_0\tau_0}{n_k\tau_k+\tau_0},\frac{1}{n_k\tau_k+\tau_0}+\sigma^2_y). \end{align}\]
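This can be verified numerically: drawing \(\mu_k\) from its posterior and then a new observation around it reproduces the predictive mean and variance above. A simulation check with illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative cluster statistics (not from the paper).
nk, tau_k, tau0, mu0, ybar = 8, 2.0, 0.5, 0.0, 1.5
sigma_y = 0.4  # measurement standard deviation, sigma_y^2 = 1/tau_y

post_prec = nk * tau_k + tau0
post_mean = (ybar * nk * tau_k + mu0 * tau0) / post_prec

# Draw mu_k from its posterior, then a new observation around it:
# marginally y_new ~ N(post_mean, 1/post_prec + sigma_y^2).
mu_k = rng.normal(post_mean, 1.0 / np.sqrt(post_prec), size=500_000)
y_new = rng.normal(mu_k, sigma_y)
```

The sample mean and variance of `y_new` match \(\frac{\bar{y}_kn_k\tau_k+\mu_0\tau_0}{n_k\tau_k+\tau_0}\) and \(\frac{1}{n_k\tau_k+\tau_0}+\sigma^2_y\), illustrating that the posterior variance of the mean and the measurement variance add.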
The pseudocode below implements the Dirichlet process mixture model with multivariate Gaussian distributions as the likelihood functions.
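As a concrete univariate sketch of such an implementation, the following collapsed Gibbs sampler integrates out the cluster means and, for brevity, holds the within-cluster precision \(\tau_k\) fixed rather than sampling it; all names and hyperparameter values are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

def dpmm_gibbs(y, alpha=1.0, mu0=0.0, tau0=0.25, tau_k=4.0, tau_y=4.0,
               sweeps=50):
    """Collapsed Gibbs sampling for a DP mixture of Gaussians (a univariate
    sketch; tau_k is held fixed rather than sampled, for brevity)."""
    n = len(y)
    c = np.zeros(n, dtype=int)  # start with everything in one cluster
    for _ in range(sweeps):
        for i in range(n):
            # Remove item i; its cluster disappears if it is now empty.
            c[i] = -1
            labels = sorted(k for k in set(c) if k != -1)
            probs = []
            for k in labels:
                members = y[c == k]
                nk = len(members)
                prec = nk * tau_k + tau0
                mean = (members.mean() * nk * tau_k + mu0 * tau0) / prec
                var = 1.0 / prec + 1.0 / tau_y
                # Occupied cluster: n_{-i,k} * N(y_i; posterior predictive)
                probs.append(nk * np.exp(-0.5 * (y[i] - mean) ** 2 / var)
                             / np.sqrt(2 * np.pi * var))
            # New cluster: alpha * N(y_i; mu0, 1/tau0 + 1/tau_y)
            var0 = 1.0 / tau0 + 1.0 / tau_y
            probs.append(alpha * np.exp(-0.5 * (y[i] - mu0) ** 2 / var0)
                         / np.sqrt(2 * np.pi * var0))
            probs = np.array(probs) / np.sum(probs)
            j = rng.choice(len(probs), p=probs)
            c[i] = labels[j] if j < len(labels) else max(labels, default=-1) + 1
    return c
```

On well-separated data this sampler discovers the number of clusters on its own, which is the practical payoff of letting \(K\rightarrow\infty\).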
We normally use a DPMM to deal with latent clusters in data. See the picture below.