Web Scraper for Chinese websites

Let’s scrape Chinese content from webpages. Although more and more websites try to block web scrapers, we can still find some relatively friendly sites, such as PTT. PTT is the largest bulletin-board-style social media site in Taiwan; it stopped accepting new account registrations many years ago, yet it remains active and its posts are often cited by journalists. The URL of PTT is here. When you access it, you will see a long list of forums. For the sake of demonstration, I choose the movie forum. Every forum on PTT has its own regulations on how users publish their posts. On the movie forum, you can easily find a short label within square brackets before each post title. The label highlights the attributes of the post. For example, you might find the Chinese character 雷 (spoiler) in the label, which means that the plot of the movie is revealed in the post. The character 負 means negative, so the label [負雷] suggests that the post contains negative comments on a movie together with spoilers. Of course, if a post contains positive comments on a movie, [喜雷] might be used as the label, and for a neutral comment people might use [普雷]. Sometimes people add a modifier before the label, such as 超負雷, 偏負雷, or 核彈負雷. Alright, let’s start dealing with the movie comments on this forum.
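
To make the labeling scheme concrete, here is a minimal sketch (with a made-up example title; the actual extraction later in this note uses strsplit( ) instead) showing how the bracketed label can be pulled out with a regular expression.

title<-"[負雷] 一部虛構電影的心得"   # hypothetical title, for illustration only
regmatches(title,regexpr("\\[[^]]+\\]",title))
## [1] "[負雷]"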

The procedure of web scraping is as follows. First, we need to get the URL of every post that contains comments on a movie. Second, we need to access each post and scrape whatever information we want from it. To this end, we have to figure out the regularity of the page URLs on this forum. The newest page’s URL is https://www.ptt.cc/bbs/movie/index.html. When we press 上頁 (previous page), we see the second newest page, whose URL is https://www.ptt.cc/bbs/movie/index9826.html. Pressing 上頁 once more shows the third newest page, and the URL becomes https://www.ptt.cc/bbs/movie/index9825.html. Of course, we could keep going back for the URLs of earlier pages, but we have already found the regularity of how a URL on the movie forum is made up: the serial number of the second newest page is 9826 at the time I am composing this note. Therefore, we can decide how many pages we want to scrape. A rough estimation suggests 19 posts per page on average. Suppose that, on average, 15% of the posts on each page are movie comments. If we want to get about 200 comments, we need to scrape about 70 pages.
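
As a quick sanity check on that estimate (using the assumed averages of 19 posts per page and a 15% hit rate):

ceiling(200/(19*0.15))
## [1] 71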

library(rvest)
main.url<-"https://www.ptt.cc/bbs/movie/index"
suffix<-".html"
# build the page URLs from the newest page backwards; the newest page has no
# serial number in its URL, so 9827 maps to index.html
URLs<-sapply(seq(9827,9757),function(x){
        if(x==9827){
          url<-paste0(main.url,suffix)
        }else{
          url<-paste0(main.url,x,suffix)
        }
      return(url)
})
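
The result is one URL per page, covering the 71 newest pages of the forum:

length(URLs)
## [1] 71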

Now we can scrape the web pages. The title and URL of each post will be collected.

library(dplyr)
title_links<-lapply(URLs,function(p){
  results<-tryCatch({
    page<-read_html(p)
    # both the title text and the link sit in the <a> tag under each .title node
    titles<-page %>% html_nodes(".title a") %>% html_text()
    links<-page %>% html_nodes(".title a") %>% html_attrs()
    # flatten the attribute list into a plain character vector of hrefs
    links<-unlist(links)
    names(links)<-NULL
    return(list(titles=titles,links=links))
  },
  error=function(cond){
    message("Error on ",p)
    return(NA)
  },
  warning=function(cond){
    message("Warning on ",p)
    return(NULL)
  },
  finally={
    message("Scraping ",p," is done.")
  })
  return(results)
})

We will select the posts whose label includes 雷. In total, we collected 473 posts that contain comments on a movie.

postTitles<-NULL
postUrls<-NULL
for(s in 1:length(title_links)){
    # skip pages whose scraping failed (NA or NULL returned by tryCatch above)
    if(!is.list(title_links[[s]])) next
    # a valid post carries 雷 inside its bracketed label (the part before "] ")
    valids<-sapply(title_links[[s]]$titles,function(x){
        temp<-unlist(strsplit(x,"] "))[1]
        return(grepl('雷',temp))
    })
    postTitles<-c(postTitles,title_links[[s]]$titles[valids])
    postUrls<-c(postUrls,title_links[[s]]$links[valids])
}
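
A quick count confirms the number of collected posts reported above:

length(postTitles)
## [1] 473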

Before we go one step further, it is better to classify these posts into negative, neutral, and positive comments according to their labels. However, we do not know how many kinds of labels people used, so we first sort out those labels.

labels<-sapply(postTitles,function(x){
      temp<-gsub(" ","",x)              # remove all spaces
      temp<-gsub("]","] ",temp)         # put a single space back after "]" to isolate the label
      l<-unlist(strsplit(temp," "))[1]  # the label is the first chunk, e.g. "[負雷]"
      l<-gsub("R:","",l)                # strip reply markers
      l<-gsub("Re:","",l)
      return(l)
})
names(labels)<-NULL
unique(labels)

According to the labels, we sort these posts into three classes: negative, neutral, and positive. There are more positive comments than negative comments.

# Negative: -1, Neutral: 0, Positive: 1
sentiments<-sapply(labels,function(x){
     if(grepl("好",x) | grepl("神",x) | grepl("爽",x) | grepl("上",x)|
        grepl("刺激",x) | grepl("鸚鵡",x)){
       return(1)
     }else if(grepl("普",x) | grepl("無",x) | grepl("還可以",x) | grepl("有",x)){
       return(0)
     }else{
       return(-1)
     }
})
table(sentiments)
posts.dta<-data.frame(title=postTitles,url=postUrls,sentim=sentiments)

Now we can scrape the content of every post.

postTexts<-sapply(posts.dta$url,function(x){
    u<-paste0("https://www.ptt.cc",x)   # the scraped links are relative paths
    texts<-tryCatch({
         pp<-read_html(u)
         # the post body sits in the container with classes bbs-screen and bbs-content
         temp<-pp %>% html_node(".bbs-screen.bbs-content") %>% html_text()
         return(temp)
    },
    error=function(cond){
      message("Error on ",u)
      return(NA)
    },
    warning=function(cond){
      message("Warning on ",u)
      return(NULL)
    },
    finally={
      message(u," is done.")
    })
    return(texts)
})
names(postTexts)<-NULL

We need to clean up the texts, as they contain a lot of information that is irrelevant to the post content.

postTexts1<-sapply(postTexts,function(x){
     # keep only the text before the "※ 發信站:" line, i.e. drop the PTT footer
     temp<-unlist(strsplit(x,"※ 發信站:"))[1]
     # drop the header block, which ends with the posting date (year 2023)
     temp<-unlist(strsplit(temp,"2023\n"))[2]
     return(temp)
})
names(postTexts1)<-NULL

Chinese word segmentation

We have collected a pile of Chinese posts. Before we can do any text analysis, we have to do word segmentation first, because Chinese has no explicit mark to indicate the boundary of a word. Chinese word segmentation is normally accomplished with dedicated tools. One is CKIP from Academia Sinica and the other is JIEBA (結巴) from Baidu. Since CKIP does not provide an API for R but JIEBA does, I will use JIEBA here. Detailed descriptions of how to use the package {jiebaR} can be found here. I only select the words that function like adjectives, as I want to see which Chinese words are used to express emotions. We need to load the package {jiebaR} first. Thereafter, we create an environment for JIEBA with the function worker( ). I set its first argument, type, to "tag", so that the part of speech of each word is returned along with the segmented words. I also supplied a user-defined dictionary and a text file of stop words; stop words are words that will be excluded from the segmentation results. Finally, it is very important to save your data to a file, so that you can reuse them without going through the tedious scraping procedure again.
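
To see what the tagged output looks like, here is a minimal sketch with a plain worker (default dictionaries, no stop-word file, and a made-up sentence; wk0 is just an illustrative name): the segmented words come back as a character vector whose names are the part-of-speech tags, which is why the code below filters on names(wds).

library(jiebaR)
wk0<-worker(type="tag")          # default settings, for illustration only
wk0["這部電影的劇情很精彩"]       # hypothetical sentence
# returns a named character vector: the values are the segmented words and
# the names are their part-of-speech tags (e.g. "n" for nouns, "a" for adjectives)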

library(jiebaR)
wk<-worker(type="tag",stop_word="user_stopwds.txt",user="user_define.txt")
posts.dta$text<-sapply(postTexts1,function(x){
     wds<-wk[x]                       # segmented words; the names are POS tags
     v<-sapply(names(wds),function(tg){
       # keep only the words whose tag marks an adjective-like function
         if(tg=="Ag" | tg=="a" | tg=="ad" | tg=="an"){
            return(1)
         }else{
            return(0)
         }
     })
     return(paste(wds[v==1],collapse=" "))
})
write.table(posts.dta,file="movies.txt")

Text Analysis

We can retrieve the data from the file we created. Do not forget to change your working directory to wherever your data file is. First, we convert the data frame to a tibble object. Thereafter, we transform this tibble into the one-token-per-row format. Now let’s check the TF-IDF index of each word. To this end, we compute the total number of words in each type of comment and save the totals as a tibble, which is then joined to the tibble containing the words and their frequencies. (A small sketch that reproduces the TF-IDF values by hand follows the code below.)

posts.dta<-read.table("movies.txt",header=T)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
posts.dta<-as_tibble(posts.dta)
posts.dta$group<-ifelse(posts.dta$sentim==1,"positive",
                        ifelse(posts.dta$sentim==0,"neutral","negative"))
wc.dta<-posts.dta %>% unnest_tokens(word,text) %>% group_by(group) %>% count(word,sort=T) %>% ungroup
wc.totaln<-wc.dta %>% group_by(group) %>% summarize(total=sum(n)) %>% ungroup
wc.dta<-left_join(wc.dta,wc.totaln)
## Joining with `by = join_by(group)`
# TF-IDF
wc.tf_idf<-wc.dta %>% bind_tf_idf(word,group,n)
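
For reference, bind_tf_idf( ) follows the standard definitions: tf is a word’s share of all word tokens within its comment type, and idf is the natural log of the number of comment types divided by the number of comment types containing that word. A small sketch reproducing the index by hand (wc.check and the .manual columns are only for checking):

n.groups<-n_distinct(wc.dta$group)   # 3 comment types here
wc.check<-wc.dta %>% group_by(word) %>%
  mutate(idf.manual=log(n.groups/n_distinct(group))) %>% ungroup %>%
  mutate(tf.manual=n/total,tf_idf.manual=tf.manual*idf.manual)
# tf.manual, idf.manual, and tf_idf.manual should match the columns added by bind_tf_idf()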

Let’s make a bar plot of the keywords in each type of comment. Note that ggplot( ) does not display Chinese characters properly by default, so we need to call showtext_auto( ) from the package {showtext}.

library(forcats)
library(ggplot2)
library(showtext)
## Loading required package: sysfonts
## Loading required package: showtextdb
# call showtext_auto()
showtext_auto()
wc.tf_idf %>% group_by(group) %>% slice_max(tf_idf,n=15) %>% ungroup() %>%
  ggplot(aes(tf_idf,fct_reorder(word,tf_idf),fill=group))+
    geom_col(show.legend=F)+
    facet_wrap(~group,ncol=3,scales="free")+
    labs(x="tf-idf",y=NULL)

LSA

In order to understand the relationships between the different types of comments and the different words, we can conduct LSA (Latent Semantic Analysis). First, we need to create a term-document matrix, in which each row corresponds to a word and each column to a post.

all.words<-unique(wc.tf_idf$word)
TD<-sapply(posts.dta$text,function(x){
    temp<-unlist(strsplit(x," "))
    k<-rep(0,length(all.words))        # one counter per word in the vocabulary
    if(length(temp)==0){
       return(k)
    }else{
       # map every token to its position in all.words (0 if not in the vocabulary)
       index<-sapply(temp,function(y){
         l<-which(y==all.words)
         if(length(l)==0){
           return(0)
         }else{
           return(l)
         }
       })
       p<-table(index)
       for(j in 1:length(p)){
         g<-as.numeric(names(p[j]))
         if(g!=0){                     # tokens outside the vocabulary are ignored
            k[g]<-p[j]
         }
       }
       return(k)
    }
})
dim(TD)
## [1] 907 477
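
An equivalent but more compact construction (a sketch, not the approach used above; TD.alt is just an illustrative name) is to tabulate each post’s tokens against the fixed vocabulary with factor( ) and table( ); tokens outside all.words are dropped and empty or missing texts give all-zero columns.

TD.alt<-sapply(posts.dta$text,function(x){
    # factor() pins the row order to all.words; table() counts occurrences
    table(factor(unlist(strsplit(x," ")),levels=all.words))
})
dim(TD.alt)   # should also be 907 x 477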

Subsequently, we run SVD to decompose this term-document matrix. As shown in the elbow plot, the singular values (the square roots of the eigenvalues) decrease quickly. In fact, the first three components might be sufficient, as the decrease slows down markedly after the first three components. For the sake of simplicity, I only plot the top 20 words here.

H<-svd(TD)
ggplot(data.frame(d=H$d,x=1:ncol(TD)),aes(x,d))+geom_line()
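
For reference, svd( ) returns H$u, H$d, and H$v such that TD = U diag(d) V', and LSA amounts to keeping only the first few components. A minimal sketch of the full reconstruction and a rank-3 approximation (k and TD.k are illustrative names):

# full reconstruction: this difference should be near zero (up to floating-point error)
max(abs(TD-H$u%*%diag(H$d)%*%t(H$v)))
# rank-3 LSA approximation, keeping only the first three components
k<-3
TD.k<-H$u[,1:k]%*%diag(H$d[1:k])%*%t(H$v[,1:k])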

# Relationships between words
TC<-H$u%*%diag(H$d)
dim(TC)
## [1] 907 477
# Selectively show high tf-idf words
hws<-wc.tf_idf %>% arrange(desc(tf_idf)) %>% slice_max(order_by=tf_idf,n=20) %>% with(word)
valid<-sapply(hws,function(x)which(all.words==x))
TC.dta<-data.frame(x1=TC[valid,1],x3=TC[valid,3],label=all.words[valid])
ggplot(TC.dta,aes(x1,x3,label=label))+geom_point(alpha=0.3,color="blue")+
  geom_text(size=4)

How about the relationships between the types of comments? We can use the document-component matrix to make a scatter plot. The first and third components are chosen as the dimensions of a geometric space, in which every post is represented by the number -1, 0, or 1, for negative, neutral, and positive comments respectively. It seems that the negative comments show a positive slope and the positive comments show a slightly negative slope in the space spanned by components x1 and x3.

DC<-H$v%*%diag(H$d)
dim(DC)
## [1] 477 477
DC.dta<-data.frame(x1=DC[,1],x3=DC[,3],label=ifelse(posts.dta$group=="positive","1",
                                                    ifelse(posts.dta$group=="neutral","0","-1")))
ggplot(DC.dta,aes(x1,x3,label=label,color=label))+
  geom_text(size=3)+labs(x="x1",y="x3")+
  scale_color_manual(values=c("tomato","gray30","deepskyblue3"))

# Remove neutral comments
DC.dta %>% filter(label!="0") %>% ggplot(aes(x1,x3,label=label,color=label))+
  geom_text(size=3)+labs(x="x1",y="x3")+
  stat_smooth(data=subset(DC.dta,label=="-1"),method="lm")+
  stat_smooth(data=subset(DC.dta,label=="1"),method="lm")+
  scale_color_manual(values=c("tomato","deepskyblue3"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'