Introduction to R

Topic model

In addition to extracting keywords from texts, researchers can also extract topics from texts, each of which is a bag of words. The basic idea of topic modeling is that an article can be decomposed to several topics, each of which is a multinominal distribution over all words. The common method used to extract topics of a text is LDA (Latent Dirichlet Allocation). The purpose of LDA is to assign each word to the most suitable topic for it. That is, the word can be most likely predicted by that topic. For the purpose of demonstrating how to use LDA, we focus on the abstracts of articles in Chinese Journal of Psychology from 2015 to 2024. We would like to how these abstracts can be summarized as topics. First, let’s get those abstracts. I have already scrapped the url of each web page in cjp_urls.txt. Thus, all we have to do here is access each article in each web page in order to get the abstract.

# Get article abstracts in Chinese Journal of Psychology from 2015 to 2024
urls<-scan("cjp_urls.txt",what=character())
library(rvest)
Abstracts<-sapply(urls,function(x){
     webpage<-read_html(x)
     abstracts<-webpage %>% html_nodes("p.fullcontext.二行省略") %>% html_text()
     abstracts<-paste0(abstracts,collapse="_@_")
     return(abstracts)
})

When the abstracts have all been scrapped, we need to do Chinese word segmentation. Here I use <font color=“red>{jiebaR}. The user defined words can be seen here and the stop words are contained in this file. Thereafter, all abstracts were transferred to the format with words separated by a space.

# Chinese word segmentation
library(jiebaR)

## Loading required package: jiebaRD

wk<-worker(user="user_defined.txt",stop_word="user_stopwords_ptt.txt")
abs<-lapply(Abstracts,function(a){
     temp<-unlist(strsplit(a,"_@_"))
     names(temp)<-NULL
     temp<-sapply(temp,function(b)paste0(wk[b],collapse=" "))
     names(temp)<-NULL
     return(temp)
})
abs1<-unlist(abs)
names(abs1)<-NULL

Visual inspection of these abstracts suggests that two abstracts are actually empty. Thus, we select the remaining abstracts for LDA. As the data format that is acceptable to LDA is one-token-per-row, we need to transfer the abstract data to this format. The below codes show how to make it.

library(tidytext)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

valids<-sapply(abs1,function(x){
    t<-length(unlist(strsplit(x," ")))
    return(ifelse(t>0,T,F))     
})
# Get one-token-per-row format
id<-NULL
word<-NULL
for(i in 1:length(abs1[valids])){
  t<-unlist(strsplit(abs1[i]," "))
  id<-c(id,rep(i,length(t)))
  word<-c(word,t)
}
dta<-tibble(id=id,word=word)

Subsequently, we need to generate a document-term matrix from the data set. First, we count the number of words. Second, we use cast_dtm( ) to create a document-term matrix. Generally, the number of topics is determined before implementing LDA by the user. In the present case, it might not be suitable if we use too few topics. After several times of trying, 30 should be a relatively suitable number of topics. Thereafter, we extract the top 5 words in each topic.

# Compute word counts
word_counts<-dta %>% count(id,word,sort=T)
# Cast word_counts to a DocumentTermMatrix
abst_dtm<-word_counts %>% cast_dtm(id,word,n)
library(topicmodels)
# Extract 30 topics
abst_lda<-LDA(abst_dtm,k=30,control=list(seed=1234))
# Turn to a tibble including estimated parameter for each word
abst_topics<-tidy(abst_lda,matrix="beta")
# Get the top 5 words within each topic
top_terms<-abst_topics %>% group_by(topic) %>% slice_max(beta,n=5) %>% 
  ungroup() %>% arrange(topic,-beta)

Thus, we can plot the top 5 words for each topic. As shown in the below figure, it looks like that IO psychology, clincial psychology, and social psychology are relatively popular topics.

# Visualize top terms within each topic
library(ggplot2)
library(showtext)

## Loading required package: sysfonts

## Loading required package: showtextdb

showtext_auto()
top_terms %>% mutate(term=reorder_within(term,beta,topic)) %>%
  ggplot(aes(beta,term,fill=factor(topic)))+
     geom_col(show.legend=F)+
     facet_wrap(~topic,scales="free")+
     scale_y_reordered()+
     theme(axis.text=element_text(size=8))

Introduction to R

Lee-Xieng Yang

Topic model