Dirichlet process mixture model

The Dirichlet distribution can be viewed as a distribution over distributions, and it is commonly used to build mixture models for data whose distribution has unknown parameters. Latent Dirichlet Allocation (LDA) is one example. We will introduce LDA as an example of a Dirichlet process mixture model, starting with the multinomial distribution.

Multinomial, Binomial, and Bernoulli

The multinomial distribution is a generalization of the binomial distribution, whose probability mass function is

\[\begin{align} P(X=k)=\frac{N!}{k!(N-k)!}\theta^{k}(1-\theta)^{N-k}. \end{align}\]

We can replace \(\theta\) by \(\theta_1\) and \((1-\theta)\) by \(\theta_2\). Then, the function can be written as

\[\begin{align} P(X=k)=\frac{N!}{k!(N-k)!}\theta_1^{k}\theta_2^{N-k}. \end{align}\]

Since \(k\) is the number of times the event \(x\) occurs, we can call \(N-k\) the number of times \(x\) does not occur. More generally, we can denote \(k\) and \(N-k\) by \(x_1\) and \(x_2\), with \(N=x_1+x_2\). Therefore,

\[\begin{align} P(x_1,x_2)=\frac{N!}{x_1!x_2!}\theta_1^{x_1}\theta_2^{x_2}. \end{align}\]

If \(N=x_1+x_2=1\), then each \(x_i \in\{0,1\}\) and the factorial term equals 1. Thus, the function becomes

\[\begin{align} P(x_1,x_2)=\theta_1^{x_1}\theta_2^{x_2}. \end{align}\]

This is the probability mass function of the Bernoulli distribution. Similarly, if we have \(k\) possible events \(x=(x_1,x_2,\cdots,x_k)\), the function becomes

\[\begin{align} P(x_1,x_2,\cdots,x_k)=\frac{N!}{x_1!x_2!\cdots x_k!}\theta_1^{x_1}\theta_2^{x_2}\cdots\theta_k^{x_k}. \end{align}\]

This is the probability mass function of the multinomial distribution, where \(N=\sum_{i=1}^k x_i\).
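
As a quick check, we can evaluate this probability mass function directly from the formula and compare it with base R's dmultinom(); the counts and probabilities below are arbitrary values chosen only for illustration.

# Arbitrary example: three outcomes observed 2, 3, and 5 times in N = 10 trials
x<-c(2,3,5)
theta<-c(0.2,0.3,0.5)
N<-sum(x)
# Multinomial probability computed from the formula above
factorial(N)/prod(factorial(x))*prod(theta^x)
# The same value from base R
dmultinom(x,size=N,prob=theta)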

Conjugate prior of multinomial distribution

The conjugate prior of the multinomial distribution is the Dirichlet distribution. Suppose we have \(K\) partitions of the sample space. The probability density function of the Dirichlet distribution is

\[\begin{align} P(\theta_1,\theta_2,\cdots,\theta_K|\alpha_1,\alpha_2,\cdots,\alpha_K)=\frac{\Gamma(\sum_{i=1}^K\alpha_i)}{\prod_{i=1}^K\Gamma(\alpha_i)}\prod_{i=1}^K\theta_i^{\alpha_i-1}. \end{align}\]

where \(\alpha=(\alpha_1,\alpha_2,\cdots,\alpha_K)\) is the vector of hyperparameters. This equation can be written more compactly as

\[\begin{align} P(\theta|\alpha)=\frac{1}{B(\alpha)}\prod_{i=1}^K\theta_i^{\alpha_i-1}, \end{align}\]

where \(B(\alpha)=\frac{\prod_{i=1}^K\Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^K\alpha_i)}\) is the multivariate Beta function. The prior mean is \(E(\theta|\alpha)=P_0=(\frac{\alpha_1}{\sum\alpha},\frac{\alpha_2}{\sum\alpha},\cdots,\frac{\alpha_K}{\sum\alpha})\), where \(P_0\) is called the base probability. The prior variance of each component is \(var(\theta_i|\alpha)=\frac{\bar{\alpha}_i(1-\bar{\alpha}_i)}{\alpha_0+1}\), where \(\alpha_0=\sum_i\alpha_i\) and \(\bar{\alpha}_i=\frac{\alpha_i}{\alpha_0}\). In fact, the Beta distribution is a special case of the Dirichlet distribution with only two outcomes (or partitions).

The hyperparameter vector \(\alpha\) acts as a scale, and its sum \(\alpha_0\) is often interpreted as a prior sample size: the larger \(\alpha_0\), the smaller the prior variance. Let’s see the influence of \(\alpha\).

library(dirmult)
trials<-10000
# Simulate draws from Dir(alpha) and summarise the mean and sd of each component
getDir<-function(trials,alpha){
    x<-rdirichlet(trials,alpha)
    result<-cbind(apply(x,2,mean),apply(x,2,sd))
    colnames(result)<-c("mean","sd")
    rownames(result)<-c("theta_1","theta_2","theta_3")
    return(result)
}
# Large alpha: small prior variance
alphas<-c(100,100,100)
getDir(trials,alphas)
##              mean         sd
## theta_1 0.3333443 0.02708573
## theta_2 0.3334633 0.02705015
## theta_3 0.3331924 0.02681490
# Small alpha: large prior variance
alphas<-c(1,1,1)
getDir(trials,alphas)
##              mean        sd
## theta_1 0.3352238 0.2371939
## theta_2 0.3304746 0.2343746
## theta_3 0.3343017 0.2361053
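
The simulated means and standard deviations match the closed-form mean and variance given above. Below is a minimal check; the helper function name getDirAnalytic is just for illustration.

# Closed-form prior mean and sd of each Dirichlet component
getDirAnalytic<-function(alpha){
    alpha0<-sum(alpha)
    alpha_bar<-alpha/alpha0
    cbind(mean=alpha_bar,sd=sqrt(alpha_bar*(1-alpha_bar)/(alpha0+1)))
}
getDirAnalytic(c(100,100,100)) # sd approximately 0.027, as in the simulation
getDirAnalytic(c(1,1,1))       # sd approximately 0.236, as in the simulation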

Generative model of Dirichlet distribution

In LDA, the words in a document are the result of the posterior predictive distribution, which is a multinomial distribution whose outcome probabilities (i.e., the probability of each word) are drawn from the Dirichlet prior distribution.

\[\begin{align} \phi_1,\phi_2,\cdots,\phi_k \sim Dir(\alpha_1,\alpha_2,\cdots,\alpha_k),\\ w_1,w_2,\cdots,w_k\sim Multinomial(\phi_1,\phi_2,\cdots,\phi_k). \end{align}\]

The vector \(w=(w_1,w_2,\cdots,w_k)\) represents the words in a document. The probability of each word being sampled is \(\phi=(\phi_1,\phi_2,\cdots,\phi_k)\), which is drawn from the Dirichlet distribution given the hyperparameter \(\alpha=(\alpha_1,\alpha_2,\cdots,\alpha_k)\). Thus, we can also write the generative model compactly as

\[\begin{align} \phi\sim Dir(\alpha),\qquad w|\phi\sim Multinomial(\phi). \end{align}\]

Suppose a document consists of three letters A, B, and C, appearing 1, 1, and 8 times respectively. We treat this document as the ideal document. Now we generate a document of 10 letters, using the letter counts as the hyperparameter vector \(\alpha\).

vocab<-c("A","B","C")
document<-rep(vocab,c(1,1,8))   # the ideal document
word_count<-c(1,1,8)            # observed letter counts
alphas<-word_count              # use the counts as hyperparameters
num_words<-sum(word_count)
# For each position, draw phi ~ Dir(alpha), then one letter ~ Multinomial(phi)
alphabets_generated<-sapply(1:num_words,function(i){
  set.seed(i)
  p<-rdirichlet(1,alphas)
  set.seed(i)
  w<-which(rmultinom(1,1,p)==1)
  return(vocab[w])
})
rbind(alphabets_generated,document)
##                     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## alphabets_generated "C"  "C"  "C"  "C"  "C"  "C"  "A"  "C"  "C"  "C"  
## document            "A"  "B"  "C"  "C"  "C"  "C"  "C"  "C"  "C"  "C"

Although the generated document looks a bit different from the ideal document, the difference is due to random sampling. As more documents are generated, the empirical probability of each letter should approach that of the ideal document.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
n_doc<-5 # set up number of documents
# Create a data set for encoding alphabets
alphabet_encoding<-tibble(label=c("A","B","C"),
                          alphabet_prop=c(0.1,0.1,0.8))
# Generate each alphabet
phis<-rdirichlet(n_doc*num_words,alphas)
head(phis)
##            [,1]        [,2]      [,3]
## [1,] 0.05255487 0.016589637 0.9308555
## [2,] 0.04139391 0.012703592 0.9459025
## [3,] 0.08829015 0.087398487 0.8243114
## [4,] 0.11241923 0.255504126 0.6320766
## [5,] 0.08748072 0.008798821 0.9037205
## [6,] 0.14668026 0.024794568 0.8285252
# Sample alphabets with phis
select_alphabets<-apply(phis,1,function(x)which(rmultinom(1,1,x)==1))
ds<-tibble(doc_id=rep(1:n_doc,each=num_words),
           alphabet=alphabet_encoding$label[select_alphabets])
ds %>% group_by(doc_id) %>% summarise(alphabets=paste(alphabet,collapse=" "))
## # A tibble: 5 × 2
##   doc_id alphabets          
##    <int> <chr>              
## 1      1 C C A C C C C B C A
## 2      2 C C C C C C A C C C
## 3      3 C C C A C C C C C C
## 4      4 C C C B A A C C C A
## 5      5 A C C C C C A B C C
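
To check this claim, we can generate a much larger corpus with the same \(\alpha\) and compare the empirical letter proportions with the ideal proportions (0.1, 0.1, 0.8). This is only a sketch; the exact numbers will vary from run to run, and the corpus size of 1000 documents is arbitrary.

n_doc_large<-1000 # an arbitrary, larger number of documents
phis_large<-rdirichlet(n_doc_large*num_words,alphas)
select_large<-apply(phis_large,1,function(x)which(rmultinom(1,1,x)==1))
# Empirical letter proportions; these should be close to c(0.1,0.1,0.8)
prop.table(table(alphabet_encoding$label[select_large]))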

Inference model of Dirichlet distribution

In addition to generating a document by sampling each word from the multinomial distribution given probabilities drawn from the Dirichlet prior, we can also infer the probability of each word from existing documents. The Dirichlet prior is conjugate to the multinomial likelihood, which makes this inference straightforward. Take the above case as an example. The posterior distribution of the word probabilities given the existing documents is

\[\begin{align} \phi|w \sim Dir(\alpha+n), \end{align}\]

where \(w\) denotes the observed words (the letters here), \(\alpha=(\alpha_1,\alpha_2,\cdots,\alpha_k)\) are the hyperparameters, and \(n=(n_1,n_2,\cdots,n_k)\) are the letter counts aggregated across all existing documents. Let’s see how this inference model works with the previously created 5 documents as the existing documents.

# Set up hyperparameters
alphas<-rep(1,3)
# Letter counts aggregated across the 5 existing documents
n<-table(ds$alphabet)
# Draw samples from the posterior Dir(alpha + n)
niters<-2000
phis<-sapply(1:niters,function(x)rdirichlet(1,n+alphas))
phis<-t(phis)
colnames(phis)<-c("A","B","C")
burnin<-500
es.phis<-as_tibble(phis[-(1:burnin),]) %>% gather(alphabet,phi)
library(ggplot2)
ggplot(es.phis,aes(x=alphabet,y=phi,fill=alphabet))+geom_violin()

# Empirical proportion of each letter in the five documents
n/sum(n)
## 
##    A    B    C 
## 0.18 0.06 0.76
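
Because the posterior is \(Dir(\alpha+n)\), its mean is also available in closed form as \(\frac{\alpha_i+n_i}{\sum_j(\alpha_j+n_j)}\); this can be compared with the empirical proportions above and with the centers of the violins.

# Closed-form posterior mean of each letter probability under Dir(alpha + n)
(alphas+n)/sum(alphas+n)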

Generating documents

In the previous examples, we learned how to generate documents given the observed word counts. Now we are going to see how LDA generates documents given multiple topics. Each word of a document is generated in two steps: first a topic is sampled from the topic distribution \(\theta\), and then a word is sampled from the word distribution \(\phi\) of that topic. (In full LDA, \(\theta\) and \(\phi\) would themselves be drawn from Dirichlet priors; here they are fixed for simplicity.)

See the below example.

k<-2 # number of topics
M<-10 # create 10 documents
theta<-c(0.5,0.5) # probabilities in topic distribution
vocab<-c("A","B","C")
# Word distributions of the two topics (rows = topics, columns = vocab)
phi<-matrix(c(0.1,0,0.9,
              0.4,0.4,0.2),2,length(vocab),byrow=T)
N<-10 # word number in each document
# Generate documents
allwords<-sapply(1:M,function(m){
     nw.topic<-sapply(1:N,function(n){
          # Select topic
          topic<-which(rmultinom(1,1,theta)==1)
          # Sample word
          new_word<-vocab[which(rmultinom(1,1,phi[topic,])==1)]
          return(c(new_word,topic))
     })
     return(nw.topic)
})
gd<-tibble(doc_id=rep(1:M,each=N),
           word=c(allwords[seq(1,nrow(allwords),2),]),
           topic=c(allwords[seq(2,nrow(allwords),2),]))
gd %>% group_by(doc_id) %>% summarise(tokens=paste(word,collapse=" "))
## # A tibble: 10 × 2
##    doc_id tokens             
##     <int> <chr>              
##  1      1 C A A C B C B C C C
##  2      2 C B B A A A B C B C
##  3      3 C C C A B C C C C C
##  4      4 A A C C C C B A C B
##  5      5 C C C A A C C B C C
##  6      6 C C B C C A C B C C
##  7      7 A B C C B A C C A C
##  8      8 B C B B C A A B A C
##  9      9 A C C A C A A C C B
## 10     10 A A A A C C C C C C
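
As a quick sanity check, we can compare the empirical topic and word frequencies in the generated corpus with \(\theta\) and \(\phi\). This is a sketch; with only 100 generated words the estimates are rough, and the exact numbers will vary from run to run.

# Empirical topic proportions; should be roughly theta = c(0.5,0.5)
prop.table(table(gd$topic))
# Empirical word proportions within each topic; rows should be roughly phi
prop.table(table(gd$topic,gd$word),margin=1)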