LDA

LDA (Latent Dirichlet Allocation) is an application of Bayesian techniques for discovering the latent topics of documents. The basic idea of LDA is quite simple: an article is a bag of topics and a topic is a bag of words. Any article can be viewed as a multinomial distribution over words. For example, the figure below shows the frequency distribution of words in an article.

knitr::include_graphics("article.png")

We can treat this distribution as a mixture (a weighted sum) of two multinomial distributions, as shown in the figure below.

knitr::include_graphics("topics.png")

Thus, the \(i\)th word of an article, \(w^c_i\), is generated from a multinomial distribution over words given topic \(c\), and the topic \(c\) itself is generated from a multinomial distribution over topics given the article. The conjugate priors of these multinomial distributions are Dirichlet distributions with hyperparameters \(\gamma\) and \(\beta\).

\[\begin{align*} w^c_i & \sim M(\phi^c_1,\phi^c_2,\dots)\\ (\phi^c_1,\phi^c_2,\dots) & \sim Dir(\beta,\beta,\dots)\\ c_i & \sim M(\pi_1,\pi_2,\dots)\\ (\pi_1,\pi_2,\dots) & \sim Dir(\gamma,\gamma,\dots) \end{align*}\]
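To make this generative story concrete, below is a minimal simulation of the process in base R. The toy vocabulary, the number of topics, and the hyperparameter values are made up purely for illustration and have nothing to do with the data analyzed later.

set.seed(1)
# one draw from a Dirichlet distribution via normalized gamma variates
rdir<-function(alpha){ x<-rgamma(length(alpha),shape=alpha); x/sum(x) }
vocab<-c("w1","w2","w3","w4","w5")   # toy vocabulary
K<-2                                 # number of topics
gamma<-0.5; beta<-0.1                # symmetric Dirichlet hyperparameters
pi<-rdir(rep(gamma,K))                                        # topic proportions of one article
phi<-t(sapply(1:K,function(k) rdir(rep(beta,length(vocab))))) # word distribution of each topic
c_i<-sample(1:K,size=20,replace=TRUE,prob=pi)                 # topic of each word
w_i<-sapply(c_i,function(k) sample(vocab,1,prob=phi[k,]))     # word drawn from its topic
table(w_i,c_i)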

LDA is also called topic modeling, and the process of generating topics is not supervised by any ground-truth labels. Thus, LDA is viewed as an unsupervised learning algorithm, whose job is to allocate each word to the topic that generates that word with the highest likelihood.

In R, the function LDA( ) in the package {topicmodels} is often used to conduct topic modeling. To demonstrate how to use this function, I use the comments on posts about the presidential candidates as the data here.

Ly<-read.table("Ly.txt",header=T)
Kp<-read.table("Kp.txt",header=T)
Ho<-read.table("Ho.txt",header=T)

First, we need to do Chinese word segmentation on these comments.

library(jiebaR)
## Loading required package: jiebaRD
wk<-worker(user="user_defined_ptt.txt",stop_word="user_stopwords_ptt.txt")
Ly.comments<-sapply(Ly$comment,function(x){
  temp<-wk[x]
  return(paste(temp,collapse=" "))
})
names(Ly.comments)<-NULL
Kp.comments<-sapply(Kp$comment,function(x){
  temp<-wk[x]
  return(paste(temp,collapse=" "))
})
names(Kp.comments)<-NULL
Ho.comments<-sapply(Ho$comment,function(x){
  temp<-wk[x]
  return(paste(temp,collapse=" "))
})
names(Ho.comments)<-NULL
library(tidytext)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Gsp.dta<-tibble(title=rep(c("Ly","Kp","Ho"),c(17,30,11)),
       post=c(1:17,1:30,1:11),
       text=c(Ly.comments,Kp.comments,Ho.comments))
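As an aside, the three segmentation blocks above are identical except for their input, so they could be collapsed into a single helper function. A minimal sketch (the name segment_comments is mine, not part of the original code):

# segment a vector of comments and rejoin the tokens with spaces
segment_comments<-function(comments,worker){
  out<-sapply(comments,function(x) paste(worker[x],collapse=" "))
  names(out)<-NULL
  out
}
Ly.comments<-segment_comments(Ly$comment,wk)
Kp.comments<-segment_comments(Kp$comment,wk)
Ho.comments<-segment_comments(Ho$comment,wk)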

We need to tidy up the comment data. First, we combine title and post into one document ID. Thereafter, we transform the data into the one-token-per-document-per-row format.

library(tidyr)
by_post<-Gsp.dta %>% unite(document,title,post)
texts<-lapply(by_post$text,function(x){
  temp<-unlist(strsplit(x," "))
  return(temp)
})
# Create a tibble in the one-token-per-document-per-row format
# by_post_word <- by_post %>% unnest_tokens(word, text)  # unnest_tokens() somehow went wrong here, so we build the tibble manually
# Collect the tokens of each document
Word<-sapply(1:length(texts),function(i){
  word<-texts[[i]]
  return(word)
})
# Repeat each document ID once per token
Doc<-sapply(1:length(texts),function(i){
  document<-rep(by_post$document[i],length(texts[[i]]))
  return(document)
})
by_post_word<-tibble(document=unlist(Doc),
                     word=unlist(Word))

We count the frequency of each word in each document. As LDA( ) only accepts a document-term matrix, we need to transform the current data into one. To this end, we use the function cast_dtm( ) in the package {tidytext}.

word_counts<-by_post_word %>% count(document,word,sort=T)
post_dtm<-word_counts %>% cast_dtm(document,word,n)
post_dtm
## <<DocumentTermMatrix (documents: 57, terms: 7768)>>
## Non-/sparse entries: 17729/425047
## Sparsity           : 96%
## Maximal term length: 31
## Weighting          : term frequency (tf)
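If you want to peek inside the document-term matrix, the DocumentTermMatrix class comes from the {tm} package, so (assuming {tm} is installed) its inspect( ) function displays a corner of the matrix:

library(tm)
inspect(post_dtm[1:5,1:8])   # first five documents, first eight terms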

Now we can conduct LDA. Since we know these comments come from posts about three presidential candidates, we set the number of topics to k = 3.

library(topicmodels)
posts_lda<-LDA(post_dtm,k=3,control=list(seed=1234))
posts_lda
## A LDA_VEM topic model with 3 topics.
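Here k = 3 is fixed because we already know the posts concern three candidates. If k were unknown, one common heuristic (not part of the analysis here) is to fit several models and compare them with perplexity( ) from {topicmodels}; a rough sketch:

ks<-c(2,3,4,6,8)
perp<-sapply(ks,function(k){
  m<-LDA(post_dtm,k=k,control=list(seed=1234))
  perplexity(m,post_dtm)   # lower perplexity suggests a better fit to these data
})
data.frame(k=ks,perplexity=perp)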

We can check the probability of each word being generated by each topic. To this end, we use the function tidy( ) to convert the object produced by the LDA( ) function into a tibble. Take the word “功德” (“merit”) as an example. Apparently, “功德” is the least likely to be generated by the third topic and the most likely to be generated by the second topic. In contrast, “夜市” (“night market”) is most likely to be generated by the third topic.

posts_topics<-tidy(posts_lda,matrix="beta")
posts_topics
## # A tibble: 23,304 × 3
##    topic term      beta
##    <int> <chr>    <dbl>
##  1     1 功德  7.20e- 3
##  2     2 功德  1.76e- 2
##  3     3 功德  1.98e- 4
##  4     1 夜市  1.43e- 4
##  5     2 夜市  4.41e-53
##  6     3 夜市  2.29e- 2
##  7     1 老人  3.71e- 4
##  8     2 老人  1.35e- 2
##  9     3 老人  4.01e- 4
## 10     1 柯    1.13e- 2
## # ℹ 23,294 more rows
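Because beta is a probability distribution over the whole vocabulary for each topic, the beta values within a topic should sum to one. A quick sanity check:

posts_topics %>%
  group_by(topic) %>%
  summarise(total_beta=sum(beta))   # each total should be (approximately) 1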

To make it easier to picture each topic, we extract the top 10 words of each topic and then plot their per-topic probabilities.

library(ggplot2)
library(showtext)
## Loading required package: sysfonts
## Loading required package: showtextdb
# enable showtext so that Chinese characters render correctly in the plots
showtext_auto()
top_terms<-posts_topics %>%
           group_by(topic) %>%
           slice_max(beta,n=10) %>%
           ungroup() %>%
           arrange(topic,-beta)
top_terms %>%
  mutate(term=reorder_within(term,beta,topic)) %>%
  ggplot(aes(beta,term,fill=factor(topic)))+
     geom_col(show.legend=FALSE)+
     facet_wrap(~topic,scales="free")+
     scale_y_reordered()

Now we check the probability of each topic being generated, given each post. Subsequently, we can make a box plot of the topic probabilities in the posts for each candidate. It looks like the third topic represents 侯 (Hou) the most and the first topic represents 柯 (Ko) the best. However, no topic exclusively represents 賴 (Lai).

posts_gamma<-tidy(posts_lda,matrix="gamma")
posts_gamma
## # A tibble: 171 × 3
##    document topic      gamma
##    <chr>    <int>      <dbl>
##  1 Ly_17        1 0.00000904
##  2 Kp_29        1 0.00000860
##  3 Ly_16        1 0.00000607
##  4 Kp_1         1 0.0000108 
##  5 Ly_9         1 1.00      
##  6 Ho_8         1 0.0000293 
##  7 Kp_23        1 0.0000197 
##  8 Kp_17        1 0.0000392 
##  9 Kp_30        1 1.00      
## 10 Ly_2         1 1.00      
## # ℹ 161 more rows
posts_gamma<-posts_gamma %>%
  separate(document,c("title","post"),sep="_",convert=T)
posts_gamma %>%
  mutate(title = reorder(title, gamma * topic)) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ title) +
  labs(x = "topic", y = expression(gamma))