Some forums on PTT, such as Gossiping, are not open to users under age 18. For these boards, it is impossible to scrape the posts directly unless we pass the authentication check. This tutorial shows how to do that.
When we access the index page of the Gossiping forum, we are redirected to an authentication page, on which a reader is required to confirm whether or not s/he is over age 18. The url of this authentication page is “https://www.ptt.cc/ask/over18?from=%2Fbbs%2FGossiping%2Findex.html”. That is, one cannot access the Gossiping forum without answering the question on this page, so getting our scraper past this check is the key to scraping the forum's contents. We can use session( ) in {rvest} to establish a session, which we name gossip.session. Printing gossip.session shows a url which is exactly the url of the authentication page.
library(rvest)
url<-"https://www.ptt.cc/bbs/Gossiping"
gossip.session<-session(url=url)
gossip.session
## <session> https://www.ptt.cc/ask/over18?from=%2Fbbs%2FGossiping%2Findex.html
## Status: 200
## Type: text/html; charset=utf-8
## Size: 2411
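As a quick sanity check, we can confirm programmatically that the session landed on the over18 page before we try to handle the form. A minimal sketch, testing the url field of the session object (the same url the printout above displays):
# The session should currently sit on the over18 page
grepl("over18",gossip.session$url)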
Switch to the browser's inspection mode and check out the html node form. The attribute action is set to “ask/over18” and method is set to “post”. That means this form is submitted to the server with an HTTP POST request. When we click the button confirming that we are over age 18, we are in fact posting the form data to that url. We need to replicate this action in R. First, we get the form by locating it with html_node( ) and parsing it with html_form( ).
gossip.form<-gossip.session %>%
html_node("form") %>%
html_form()
gossip.form
## <form> '<unnamed>' (POST https://www.ptt.cc/ask/over18)
## <field> (hidden) from: /bbs/Gossiping/in...
## <field> (button) yes: yes
## <field> (button) no: no
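We can also read these attributes off the parsed form object instead of the inspector; html_form( ) stores them in the method and action elements of the object it returns. A minimal sketch:
# The HTTP method and the url the form is submitted to
gossip.form$method
gossip.form$action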
Then, we submit a yes answer to the over18 question with session_submit( ). The returned object is another session, whose url is “https://www.ptt.cc/bbs/Gossiping/index.html”, the latest index page of the Gossiping forum.
gossip<-session_submit(
x=gossip.session,
form=gossip.form,
submit="yes"
)
gossip
## <session> https://www.ptt.cc/bbs/Gossiping/index.html
## Status: 200
## Type: text/html; charset=utf-8
## Size: 14758
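The reason all subsequent requests work is that the session keeps the cookie set when the form was submitted (on PTT this is the over18 cookie), so later page jumps within the same session stay authenticated. As a sketch, assuming the most recent httr response is stored in the response element of the session, we can list the cookies it carries:
# Inspect the session's cookies; the over18 cookie should appear
httr::cookies(gossip$response)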
Now we can start scraping. We already know the url of the index page, and we need to decide which pages to scrape. This depends on the purpose of our scraping. For example, if we want to collect the posts relevant to the news about the scandals of Leehom Wang, we can scrape from the page numbered 39389 up to the very latest page. After inspecting the serial numbers of the pages of the Gossiping forum, I chose to scrape the urls of the posts relevant to Leehom Wang from the page numbered 39389 to the page numbered 39548, plus the index page. (A sketch for finding the latest page number programmatically follows the code below.)
urls<-sapply(39389:39548,function(x){
url<-paste0("https://www.ptt.cc/bbs/Gossiping/index",x,".html")
return(url)
})
urls<-c(urls,"https://www.ptt.cc/bbs/Gossiping/index.html")
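Rather than reading the serial numbers by eye, we can recover the number of the latest page from the paging buttons on the index page. This is a sketch that assumes the buttons sit inside div.btn-group-paging and that the second link is the “previous page” button; verify the selector and position in the inspector.
# The "previous page" href looks like "/bbs/Gossiping/index39547.html";
# strip the non-digits and add 1 to get the latest page number
paging<-gossip %>%
  html_nodes("div.btn-group-paging a") %>%
  html_attr("href")
latest<-as.numeric(gsub("\\D","",paging[2]))+1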
# Start scraping page by page
allPostUrls<-sapply(urls,function(url){
  # Jump to the page once and reuse it for both titles and links
  # (session_jump_to( ) replaces the deprecated jump_to( ))
  page<-gossip %>%
    session_jump_to(url)
  titles<-page %>%
    html_nodes("div.title") %>%
    html_nodes("a") %>%
    html_text()
  post.urls<-page %>%
    html_nodes("div.title") %>%
    html_nodes("a") %>%
    html_attr("href")
  # Keep the urls of the posts relevant to Leehom Wang and his ex-wife
  URLS<-post.urls[which(grepl("王力宏",titles) |
                        grepl("李靚蕾",titles))]
  URLS<-sapply(URLS,function(x){
    return(paste0("https://www.ptt.cc",x))
  })
  return(paste(URLS,collapse=";"))
})
allPostUrls<-unlist(strsplit(allPostUrls,";"))
names(allPostUrls)<-NULL
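Because this loop fires one request per page, it is polite to pause briefly between requests when scraping this many pages. A minimal sketch of a hypothetical helper, politeJump( ), which can stand in for session_jump_to( ) inside the loops above and below:
politeJump<-function(sess,url,pause=1){
  # Wait before each request so we do not hammer the server
  Sys.sleep(pause)
  session_jump_to(sess,url)
}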
Thereafter, we can extract information about these posts, such as the publication time and the number of unique user ids that commented on each post.
postDta<-sapply(allPostUrls,function(post){
  page<-gossip %>%
    session_jump_to(post)
  # The four article-meta-value fields are author, board, title, time
  meta.dta<-page %>%
    html_nodes("span.article-meta-value") %>%
    html_text()
  # The ids of the users who commented on the post
  userid<-page %>%
    html_nodes("span.f3.hl.push-userid") %>%
    html_text()
  # Count the unique commenters and name the count by the post time
  post.dta<-length(unique(userid))
  names(post.dta)<-meta.dta[4]
  return(post.dta)
})
# Get time: convert each time stamp, e.g. "Fri Dec 17 23:38:31 2021",
# into seconds elapsed since the reference point Dec 17, 23:38:31
# (presumably the time of the earliest post)
ptimes<-sapply(names(postDta),function(x){
  # tt is c(weekday, month, day, "hh:mm:ss", year)
  tt<-unlist(strsplit(x," ")) 
  dd<-as.numeric(tt[3])-17
  dt<-as.numeric(unlist(strsplit(tt[4],":")))-c(23,38,31)
  diftime<-dd*24*3600+dt[1]*3600+dt[2]*60+dt[3]
  return(diftime)
})
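The subtraction above hard-codes the reference time, so it only works within a single month. A more general sketch, assuming all time stamps parse cleanly and an English LC_TIME locale is in effect (e.g. Sys.setlocale("LC_TIME","C")), converts them to POSIXct and measures from the earliest post:
ptt.times<-as.POSIXct(names(postDta),
                      format="%a %b %d %H:%M:%S %Y",
                      tz="Asia/Taipei")
# Seconds elapsed since the earliest post
ptimes<-as.numeric(difftime(ptt.times,min(ptt.times),units="secs"))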
post.dta<-data.frame(ReplyN=postDta,time=ptimes,tlabel=1:length(ptimes))
We can plot the number of ids who replied to each post against its publication time. It is clear that the longer this topic lasted, the fewer people replied to it.
library(ggplot2)
ggplot(post.dta,aes(time,ReplyN))+
geom_line()
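For readability, we might convert seconds to hours and label the axes; a quick sketch:
ggplot(post.dta,aes(time/3600,ReplyN))+
  geom_line()+
  labs(x="Hours since the first post",y="Number of unique repliers")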