On PTT, users under 18 are not allowed to enter certain boards, specifically those that may contain controversial posts, such as the Gossiping board. This board consistently tops the chart on PTT's front page, as it receives more posts per day than any other board. When you click on it, the first thing that pops up is an age-verification page, on which you must click “Yes, I am 18 or older” before you are allowed to enter. It looks as follows.
Before clicking the left button, we can inspect the HTML source of this page, which looks like the following.
You can find an action called “ask/over18”, which corresponds to the left button. However, we do not want to press this button manually every time we run our web crawler. The question is how to make the crawler pass this age verification automatically. Fortunately, PTT only uses a cookie to verify user age. As long as we set the age-verification cookie to “yes”, it is as if we had pressed the “yes” button. The code below shows one way to perform this trick. We first create a session, which can be viewed as a simulated browser, using the function session() in the package {rvest}.
library(rvest)
library(dplyr)
gossiping.session<-session("https://www.ptt.cc/bbs/Gossiping/index.html")
On this session, we access the HTML node called form, which corresponds to the age-verification page. This form is stored as a user-defined object, gossiping.form. We then submit (POST) the answer “yes” on it using session_submit(). The result is a new page redirected from the age-verification page, which is actually the latest page of the Gossiping board.
gossiping.form<-gossiping.session %>%
html_node("form") %>%
html_form()
gossiping<-session_submit(x=gossiping.session,form=gossiping.form,submit="yes")
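As an aside, since the verification result lives in a cookie, one could also set the cookie directly instead of submitting the form. A minimal sketch with {httr} is below; note that the cookie name over18 and value 1 are an assumption based on inspecting the site, so verify them in your own browser's developer tools before relying on this.

```r
library(httr)
library(rvest)

# Assumption: PTT stores the age check in a cookie named "over18"
# with value "1". Check the exact name/value in your browser first.
resp <- GET("https://www.ptt.cc/bbs/Gossiping/index.html",
            set_cookies(over18 = "1"))
page <- read_html(resp)
page %>% html_nodes(".title") %>% html_text(trim = TRUE)
```

Either approach works; the form-submission route above has the advantage of not hard-coding any cookie details.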
Now we can use our web crawler to scrape the contents of the Gossiping board. Suppose we want to scrape pages #39054 down to #38855. We use session_jump_to() to move to the page we want within the current session.
url<-"https://www.ptt.cc/bbs/Gossiping/index"
postsinfo<-lapply(seq(39054,38855),function(i){
  page<-paste0(url,i,".html")
  dd<-tryCatch({
    # jump to the target index page once and reuse it for both queries
  	current<-gossiping %>% session_jump_to(page)
  	titles<-current %>% html_nodes(".title") %>% html_text(trim=TRUE)
  	links<-current %>% html_nodes(".title a") %>% html_attr("href")
  	# deleted posts keep a placeholder title but have no link,
  	# so dropping them keeps titles and links aligned
  	removed<-sapply(titles,function(x)grepl("本文已被刪除",x))
  	if(sum(removed)>0){
  	  titles<-titles[!removed]
  	}
  	return(list(title=titles,link=links))
  },
  error=function(cond){
  	message("Error on ",page)
  	return(NA)
  },
  warning=function(cond){
  	message("Warning on ",page)
  	return(NULL)
  },
  finally={
  	message("Done with ",page)
  })
  return(dd)
})
Suppose we want to analyze the posts relevant to the candidates of the 2024 presidential election. We can select the posts by keywords such as 賴, 柯, 侯, 郭, or 蔡英文. In total, there are more than 600 such posts.
links.target<-sapply(postsinfo,function(x){
  if(!is.list(x)) return(NULL) # skip pages that failed to download
  # a post is a target if its title mentions any candidate keyword
  targets<-grepl("賴|柯|侯|郭|蔡英文",x$title)
  return(x$link[targets])
})
links.target<-unlist(links.target)
Suppose we want to know who replied to these posts. We can collect the user IDs of those who commented on each post.
links.target<-links.target[!is.na(links.target)]
Users<-sapply(links.target,function(l){
  # each link is relative, e.g. "/bbs/Gossiping/M.xxx.html"
  page<-paste0("https://www.ptt.cc",l)
  uu<-tryCatch({
    # the ids of users who pushed (commented on) this post
  	users<-gossiping %>% session_jump_to(page) %>%
  	  html_nodes(".f3.hl.push-userid") %>%
  	  html_text()
  	return(users)
  },
  error=function(cond){
  	message("Error on ",page)
  	return(NA)
  },
  warning=function(cond){
  	message("Warning on ",page)
  	return(NULL)
  },
  finally={
  	message("Done with ",page)
  })
  return(uu)
})
All user IDs can now be collected. Remarkably, there are more than 10,000 users in total. It would be interesting to examine the relationships between them. To this end, we can use social network analysis.
all.users<-unique(unlist(Users))
length(all.users)
The idea is very simple. If two users both replied to the same post, the number of links between them increases by one. Therefore, the more links between two users, the closer they are, and vice versa.
# For demonstration only: pairs of repliers on the second post
uu<-unique(Users[[2]])
temp<-NULL
for(i in 1:(length(uu)-1)){
  for(j in (i+1):length(uu)){
    # one row per unordered pair of users replying to the same post
    temp<-rbind(temp,c(uu[i],uu[j]))
  }
}
temp<-data.frame(temp)
names(temp)<-c("from","to")
write.table(temp,file="users.txt")
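The demonstration above pairs only the repliers of the second post. To build the edge list over all posts, one could repeat the same pairing for every element of Users; the sketch below uses combn() for the within-post pairs, with a tiny synthetic Users list standing in for the real one scraped above.

```r
# Illustrative only: replace this synthetic list with the Users
# object collected from the Gossiping board.
Users <- list(c("u1", "u2", "u3"), c("u1", "u2"), "u4")

# Every pair of users who replied to the same post contributes one edge;
# repeated pairs across posts yield repeated (i.e. stronger) links.
all.edges <- do.call(rbind, lapply(Users, function(u){
  u <- unique(u)
  if (length(u) < 2) return(NULL)   # fewer than two repliers: no edge
  t(combn(u, 2))                    # all unordered pairs of repliers
}))
all.edges <- data.frame(from = all.edges[, 1], to = all.edges[, 2])
all.edges                           # 4 edges: u1-u2 (twice), u1-u3, u2-u3
```

Note that the repeated u1-u2 row is intentional: the multiplicity of an edge is exactly the link count described above.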
user.dta<-read.table("users.txt",header=T)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
net<-graph_from_data_frame(user.dta,directed=F,vertices=unique(unlist(user.dta)))
plot(net,edge.arrow.size=0.3,vertex.shape="none", vertex.label.cex=0.7,vertex.label.color="red")
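Beyond plotting, igraph can also summarize the network numerically. For instance, repeated links between the same pair can be collapsed into a single weighted edge, and the degree of each node then counts how many distinct users someone has co-replied with. A sketch on the net object built above:

```r
# Count repeated links between the same pair as an edge weight,
# then rank users by their number of distinct connections.
E(net)$weight <- 1
net.w <- simplify(net, edge.attr.comb = list(weight = "sum"))
sort(degree(net.w), decreasing = TRUE)[1:5]   # five most-connected users
```

With the full edge list over all 600-plus posts, the same two lines would reveal which users sit at the center of the election discussion.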