Reddit is a popular social media platform in the US, especially among young people. Like PTT in Taiwan, Reddit hosts many forums, called subreddits. For example, the forum about cats lives at https://www.reddit.com/r/cats, where r/cats is the path of that forum. Reddit provides an API (Application Programming Interface) that lets users request data from its server. An API request usually looks just like a regular URL. Suppose you search for Gaza in the geopolitics forum. The URL in your browser reads https://www.reddit.com/r/geopolitics/search/?q=Gaza&type=link&cId=9a26529d-5787-4694-aa39-bdd68b59d25a&iId=73cf51e8-5803-4607-9e08-666bafe19ad4, where /search/?q=Gaza… indicates that you are searching this forum for the keyword Gaza, passed as the value of the query parameter q. In R, we would normally use functions from the package {httr} to send such API requests. For Reddit we do not need to, because someone has already written an R package, {RedditExtractoR}, that lets us collect data from Reddit with a couple of function calls. Here is the author’s GitHub, and anyone can easily follow the instructions there to install it.
Suppose we want to find posts on Reddit that mention the keyword Gaza war. The following code collects the posts that match the keyword and saves their text and subreddit to a file.
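To make the URL-as-API idea above concrete, here is a minimal sketch that uses {httr} to take the search URL apart and rebuild it for another keyword. It assumes a shortened version of the URL (the cId and iId parameters are left out for brevity) and does not send any request; the commented-out GET() call only indicates where one would go.

```r
library(httr)

# Shortened search URL from the text (cId and iId omitted; an assumption for brevity)
search_url <- "https://www.reddit.com/r/geopolitics/search/?q=Gaza&type=link"

# Split the URL into its components
parsed <- parse_url(search_url)
parsed$path      # forum path plus the search endpoint
parsed$query$q   # the keyword passed to the query parameter q

# Rebuild the same kind of URL for a different keyword
new_url <- modify_url(search_url, query = list(q = "Gaza war", type = "link"))
new_url

# Sending the request would look like this (not run here):
# resp <- GET(new_url)
```

The query part of the URL is just a set of name-value pairs, which is why a package such as {RedditExtractoR} can wrap it in ordinary function arguments.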
library(RedditExtractoR)
Gaza<-find_thread_urls(keywords="Gaza war")
write.table(data.frame(text=Gaza$text,subreddit=Gaza$subreddit),"gaza.txt")
We save the results as an object called Gaza, which is a data frame with 7 variables. The first variable is the date, and the second is the same date in UNIX time format. The third variable contains the titles of the posts mentioning Gaza war. Suppose we want to know which forums (i.e., subreddits) care more about the Gaza war. We can simply use table( ) to count the posts per subreddit, which lists every subreddit with at least one post relevant to the Gaza war. We first flag the posts whose text is non-empty and does not contain a URL (the code below drops any text containing "https"). Then we turn the subreddits into counts per subreddit and draw a bar plot. Not surprisingly, the subreddits related to Palestine (including Islam) and Israel (including Jewish) have more posts.
Gaza<-read.table("gaza.txt",header=T,sep="")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
valids<-which(nchar(Gaza$text)!=0 & !grepl("https",Gaza$text))
counts<-table(Gaza$subreddit[valids])
dta<-tibble(forum=names(counts),counts=as.numeric(counts))
dta %>% ggplot(aes(forum,counts))+
geom_col(color="tomato",fill="white")+theme(axis.text.x=element_text(angle=90))
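The filtering step above can be checked on a small example. The sketch below uses a hypothetical toy data frame (not the scraped data) to show what the "https" filter keeps, and contrasts it with a stricter alternative that drops only texts consisting of nothing but a URL. The alternative regular expression is an assumption for illustration, not the filter used in this analysis.

```r
# Hypothetical toy data standing in for the scraped posts
toy <- data.frame(
  text = c("Long post about the war",
           "",
           "https://example.com/article",
           "See https://example.com for context and my view"),
  subreddit = c("geopolitics", "worldnews", "worldnews", "geopolitics"),
  stringsAsFactors = FALSE
)

# Filter used above: non-empty text that contains no "https" at all
no_url <- which(nchar(toy$text) != 0 & !grepl("https", toy$text))

# Alternative (an assumption): drop only texts that are nothing but a URL
url_only <- grepl("^\\s*https?://\\S+\\s*$", toy$text)
not_url_only <- which(nchar(toy$text) != 0 & !url_only)

# Posts per subreddit under each filter
table(toy$subreddit[no_url])
table(toy$subreddit[not_url_only])
```

With the first filter a post that merely cites a link is discarded along with its text, so the choice between the two affects how many posts each subreddit contributes.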
Now we can fit an LDA model to generate topics for these posts. First we have to put the data in one-token-per-row format. We remove the stop words, then count how often each word occurs in each subreddit and store these frequencies in a data set. From this data set we build the document-term matrix, in which each subreddit is treated as one document.
library(tidytext)
# Transform to one-token-per-row format
dd<-tibble(subreddit=Gaza$subreddit[valids],text=Gaza$text[valids])
dd1<-dd %>% unnest_tokens(word, text)
data("stop_words")
# Remove stop words
dd1<-dd1 %>% anti_join(stop_words)
## Joining with `by = join_by(word)`
# Compute word counts
word_counts<-dd1 %>% count(subreddit,word,sort=T)
# Cast word_counts to a DocumentTermMatrix
abst_dtm<-word_counts %>% cast_dtm(subreddit,word,n)
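To see what each of these steps produces, here is a minimal sketch on a hypothetical two-document corpus. The tiny stop word list toy_stop is made up for illustration (the code above uses tidytext's stop_words), and the object names are not part of the original analysis.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus
toy <- tibble(
  subreddit = c("a", "b"),
  text = c("The students protest on campus", "The war crimes in the war")
)

# Tiny stop word list for illustration only
toy_stop <- tibble(word = c("the", "on", "in"))

toy_counts <- toy %>%
  unnest_tokens(word, text) %>%         # one token per row, lowercased
  anti_join(toy_stop, by = "word") %>%  # drop stop words
  count(subreddit, word, sort = TRUE)   # frequency of each word per document

# Cast the counts to a DocumentTermMatrix (requires the tm package)
toy_dtm <- toy_counts %>% cast_dtm(subreddit, word, n)
dim(toy_dtm)   # documents x terms
tidy(toy_dtm)  # back to one row per document-term pair
```

Each row of the matrix is a document (here a subreddit) and each column is a term, which is the input format that LDA() expects.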
Finally, we fit the LDA model and extract 2 topics. The results show that hamas and crimes appear together in one topic, so it is reasonable to read this topic as the one more supportive of Israel. In contrast, the other topic groups words such as encampments, students, campus, and mit, which suggests that it relates to the protests against Israel on campuses in the US. Thus, we can say that Topic 1 is the pro-Palestine topic and Topic 2 is the pro-Israel topic.
library(topicmodels)
abst_lda<-LDA(abst_dtm,k=2,control=list(seed=1234))
# Turn to a tibble including estimated parameter for each word
abst_topics<-tidy(abst_lda,matrix="beta")
# Get the top 10 words within each topic
top_terms<-abst_topics %>% group_by(topic) %>% slice_max(beta,n=10) %>%
ungroup() %>% arrange(topic,-beta)
top_terms %>% mutate(term=reorder_within(term,beta,topic)) %>%
ggplot(aes(beta,term,fill=factor(topic)))+
geom_col(show.legend=F)+
facet_wrap(~topic,scales="free")+
scale_y_reordered()+
theme(axis.text=element_text(size=8))
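One way to check a reading of the topics is to look at how strongly each document loads on each topic, which tidytext exposes through matrix="gamma". The sketch below is self-contained and uses a hypothetical toy corpus with made-up counts, so its numbers say nothing about the Reddit data; the same tidy() call could be applied to abst_lda.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical word counts for two documents (made-up numbers)
toy <- tibble(
  document = c("a", "a", "a", "b", "b", "b"),
  word     = c("hamas", "crimes", "war", "students", "campus", "war"),
  n        = c(3, 2, 1, 3, 2, 1)
)

toy_dtm <- toy %>% cast_dtm(document, word, n)
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# Per-document topic proportions: one row per document-topic pair
toy_gamma <- tidy(toy_lda, matrix = "gamma")
toy_gamma %>% arrange(document, topic)
```

For each document the gamma values sum to 1, so a subreddit with a gamma close to 1 for a topic is made up almost entirely of words from that topic.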