On PTT, users under 18 are not allowed to enter certain forums, specifically those that might contain potentially controversial posts, such as the Gossiping forum. This forum is consistently at the top of the popularity chart on the PTT front page, as it receives the most posts of any forum on PTT every day. When you click on this forum, the first thing that pops up is an age-verification page, on which you must click "Yes, I am 18 or older" before you are allowed to enter. It looks as follows.
Before we click the left button, we can inspect the HTML code of this page.
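The relevant part of this form can also be inspected directly from R. The sketch below is only illustrative: the object name verify_page is mine, and the expected attribute value in the comments assumes PTT's current page layout.
library(rvest)
# Request the Gossiping index page without any cookie; PTT returns the
# age-verification page instead of the list of posts
verify_page<-read_html("https://www.ptt.cc/bbs/Gossiping/index.html")
# The action attribute of the form should be "/ask/over18" if the layout has not changed
verify_page %>% html_node("form") %>% html_attr("action")
# The submit buttons correspond to the "yes"/"no" choices on the page
verify_page %>% html_nodes("form button") %>% html_text(trim=T)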
In the HTML you can find an action called "ask/over18", which corresponds to the left button. However, we do not want to press this button manually every time we run our web crawler. The question is how to make our web crawler pass this age verification automatically. Fortunately, PTT only uses a cookie to verify user age: once our session carries the cookie that the server sets after we answer "yes", it is equivalent to having pressed the "yes" button in a browser. The code below shows one way to perform this trick. We first create a session, which can be viewed as a simulated browser, by using the function session() in the package {rvest}.
library(rvest)
gossiping.session<-session("https://www.ptt.cc/bbs/Gossiping/index.html")
On this session, we access an HTML node called form, which is the form that checks your age. This form can be stored as a user-defined object, gossiping.form. Subsequently, we submit (POST) the answer "yes" on it by using session_submit(). The result is a new page redirected from the age-verification page, which is in fact the latest page of the Gossiping board.
gossiping.form<-gossiping.session %>%
html_node("form") %>%
html_form()
gossiping<-session_submit(x=gossiping.session,form=gossiping.form,submit="yes")
Now we can use our web crawler to scrape the contents of the Gossiping board. Suppose we want to track the volume of posts about earthquakes from 4/2 until now. First, we need to know on which web page our web crawler should start scraping. After checking the URLs of the pages on the Gossiping board, we find that the first page containing a post from 4/2 is https://www.ptt.cc/bbs/Gossiping/index36115.html, so this is the starting page. The end page is the latest page of the Gossiping board, whose URL is https://www.ptt.cc/bbs/Gossiping/index.html; the second-newest page is https://www.ptt.cc/bbs/Gossiping/index39147.html. Our plan is to scrape the titles and publishing dates of all posts in the time period of interest, and then count the posts whose titles mention earthquakes or related information.
On the starting page, visual inspection of the HTML suggests that post titles can be retrieved from the div nodes of class title, and publishing dates from the div nodes of class date. As we are going to scrape 3,034 web pages, which is a large number, this might take a while.
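Before launching the full loop, it can be worth checking the selectors on a single page. A minimal sketch, reusing the age-verified session created above (the object name test_page is mine; index36115.html is the starting page identified earlier):
# Jump to the starting page within the age-verified session
test_page<-gossiping %>% session_jump_to("https://www.ptt.cc/bbs/Gossiping/index36115.html")
# Peek at the first few titles and dates to confirm the CSS selectors
test_page %>% html_nodes("div.title") %>% html_text(trim=T) %>% head()
test_page %>% html_nodes("div.date") %>% html_text(trim=T) %>% head()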
# Loop over the index pages of the Gossiping board
# On each page, scrape the titles and publishing dates of all posts
main_url<-"https://www.ptt.cc/bbs/Gossiping/index"
MetaData<-sapply(36115:39148,function(i){
# synthesize full path to a web page
if(i<39148){
url<-paste0(main_url,i,".html")
}else{
url<-paste0(main_url,".html")
}
wps<-tryCatch({
page<-gossiping %>% session_jump_to(url)
titles<-page %>% html_nodes("div.title") %>% html_text(trim=T)
dates<-page %>% html_nodes("div.date") %>% html_text(trim=T)
list(title=titles,date=dates)
},
error=function(cond){
message("Error on ",url)
return(NA)
},
warning=function(cond){
message("Warining on ",url)
return(NULL)
},
finally={
message("Done for ",url)
})
return(wps)
})
Now we sort out the metadata. The variable MetaData is a 2 x 3034 matrix: the first row records the titles and the second row records the dates. It is convenient to turn these metadata into a data frame and save it to a file. This is very important, so please do remember to save the scraped data to a file.
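A quick sanity check before reshaping (this assumes every page was fetched successfully; if any iteration had returned NA from tryCatch(), sapply() would have produced a list rather than a matrix):
dim(MetaData)       # should be 2 3034 when no page failed
rownames(MetaData)  # should be "title" and "date"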
titles<-sapply(MetaData[1,],function(x){
temp<-unlist(x)
names(temp)<-NULL
temp<-paste0(temp,collapse='_@_')
return(temp)
})
dates<-sapply(MetaData[2,],function(x){
temp<-unlist(x)
names(temp)<-NULL
temp<-paste0(temp,collapse='_@_')
return(temp)
})
write.table(data.frame(titles=titles,dates=dates),
file="MetaData.txt",row.names=F)
Suppose now we want to analyze the metadata in the file we saved earlier. It is convenient to turn this data frame into a structure in which each row represents one title and one date.
dta<-read.table("MetaData.txt",header=T,sep="")
alltitles<-sapply(dta$titles,function(j)unlist(strsplit(j,"_@_")))
names(alltitles)<-NULL
alldates<-sapply(dta$dates,function(j)unlist(strsplit(j,"_@_")))
names(alldates)<-NULL
alldates<-sapply(alldates,function(x)paste0("2024/",x))
names(alldates)<-NULL
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Dta<-tibble(Title=c(alltitles),Date=c(alldates))
Dta
## # A tibble: 60,680 × 2
## Title Date
## <chr> <chr>
## 1 Re: [新聞] 4惡煞強盜移工5600判7年多 宣判後竟在 2024/4/01
## 2 [問卦] 土耳其好玩嗎0_0? 2024/4/01
## 3 [問卦] 正妹網紅吸果凍 葉黃素益生菌凍 2024/4/01
## 4 [問卦] 您您韭是您輸不起的八卦 2024/4/01
## 5 [問卦] 想要買C300但爸爸不給我錢該怎麼辦? 2024/4/01
## 6 [新聞] 唱25年不煩嗎?五月天阿信不忍發聲 2024/4/01
## 7 [問卦] 黃明志死了嗎 2024/4/01
## 8 [新聞] 韓國10部核電機組啟動延役程序 扭轉文在 2024/4/01
## 9 [新聞] 愚人節限定! 必勝客超逼真「蛇圈圈」 2024/4/01
## 10 [新聞] 張學友6萬張票 30秒售罄!網哀嚎「到底 2024/4/01
## # ℹ 60,670 more rows
We can calculate the total number of earthquake posts by day. First, we set up the earthquake-related keywords: 地震 (earthquake), 震 (quake), 花蓮 (Hualien), 警報 (alert), 921, 海嘯 (tsunami), 規模 (magnitude), 災情 (damage), 掉 (drop), 搖 (shake), 晃 (sway), and 倒 (collapse). The following code creates two new columns: one recording whether a title contains any of the earthquake keywords, and the other containing Date converted to a Date object. For the new column Y, 1 means the title mentions an earthquake and 0 means it does not.
Dta1<-Dta %>% mutate(Y=ifelse(grepl("地震",Title) |
grepl("震",Title) |
grepl("花蓮",Title) |
grepl("警報",Title) |
grepl("海嘯",Title) |
grepl("規模",Title) |
grepl("災情",Title) |
grepl("921",Title) |
grepl("掉",Title) |
grepl("搖",Title) |
grepl("晃",Title) |
grepl("倒",Title),1,0),
date=as.Date(Date,"%Y/%m/%d"))
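The chained grepl() calls can also be collapsed into a single regular expression. The alternative below should behave the same as the code above; the object name quake_pattern is mine.
# All keywords in one regular expression, separated by "|"
quake_pattern<-"地震|震|花蓮|警報|海嘯|規模|災情|921|掉|搖|晃|倒"
Dta1<-Dta %>% mutate(Y=ifelse(grepl(quake_pattern,Title),1,0),
                     date=as.Date(Date,"%Y/%m/%d"))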
We can plot the total number of earthquake posts by day. In the code below, we first group the data by the column date and compute Num, the total number of earthquake posts on each day. We also create another variable, dates, which is simply a copy of date used for plotting; note that mutate() keeps one row per post, so each date still appears on multiple rows, all carrying the same value of Num. Finally, we restrict the plot to dates before 2024/5/09.
library(ggplot2)
Dta1 %>% group_by(date) %>% mutate(Num=sum(Y),dates=date) %>%
filter(dates<as.Date("2024/5/09","%Y/%m/%d")) %>%
ggplot(aes(dates,Num))+geom_line(col="red")+geom_point(shape=1,color="black")
We also download earthquake data from the Central Weather Administration (中央氣象局). For example, this web page provides the earthquake data for May 2024. I have already downloaded the earthquake data for April and May as two csv files. The code below imports these data and combines them into a tibble.
qdta4<-read.csv("quake4.csv",header=T)
names(qdta4)<-c("id","date","longitude","latitude","size","depth","location")
qdta5<-read.csv("quake5.csv",header=T)
names(qdta5)<-c("id","date","longitude","latitude","size","depth","location")
# Combine data of central weather administration in April and May
qdta<-rbind(qdta4,qdta5)
qdta<-as_tibble(qdta)
Let's check the average magnitude of quakes on each day. We can also plot the earthquake frequency by date. Comparing the patterns of earthquake magnitudes and frequencies with the PTT post counts above suggests that people's discussion of earthquakes on PTT follows earthquake frequency rather than magnitude.
ndate<-sapply(qdta$date,function(x)unlist(strsplit(x," "))[1])
names(ndate)<-NULL
qdta$D<-ndate
qdta %>% group_by(D) %>% mutate(Level=mean(size), date=as.Date(D,"%Y-%m-%d")) %>%
ggplot(aes(date,Level))+geom_line()
qdta %>% group_by(D) %>% mutate(Freq=length(D),dates=as.Date(D,"%Y-%m-%d")) %>%
ggplot(aes(dates,Freq))+geom_line()+geom_point(shape=1,col="blue")
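To go a bit beyond eyeballing the two plots, we can collapse both data sets to one row per day and correlate the daily PTT post counts with quake frequency and average magnitude. The sketch below is only illustrative; it reuses Dta1 and qdta created above, and the object names ptt_daily, quake_daily, and comp are mine.
# Daily number of earthquake posts on PTT
ptt_daily<-Dta1 %>% group_by(date) %>% summarise(Num=sum(Y))
# Daily quake frequency and average magnitude from the CWA data
quake_daily<-qdta %>% group_by(D) %>%
  summarise(Freq=n(),Level=mean(size)) %>%
  mutate(date=as.Date(D,"%Y-%m-%d"))
# Join the two tables on date and compute the correlations
comp<-inner_join(ptt_daily,quake_daily,by="date")
cor(comp$Num,comp$Freq)   # association with how often quakes occur
cor(comp$Num,comp$Level)  # association with how strong they are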