When you want to request data from a server that does not provide an API, your only option is to scrape the written content from its web pages. To do this, you need at least a basic understanding of how a web page is structured in HTML. This page serves as a tutorial for R users who want to build their own web scraper with R.
In order to scrape the written content of a web page, we first install and load the R package {rvest}. If we would like to scrape this news article on BBC, we can use the code below. We need to find the CSS nodes that contain the content we want. If you use Safari on a Mac, you can right-click and choose Inspect Element at the bottom of the context menu. Likewise, in Firefox you can use Inspect to examine the HTML code of a web page, and other browsers such as Chrome or IE offer the same functionality. When you turn on the inspection mode, the web page looks like the figure below.
The highlighted area corresponds to the HTML code shown underneath it, where the mouse cursor points. We can select the node h1 to get the title of this article with the function html_nodes( ) and convert it to text with html_text( ).
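Note that {rvest} and the other packages used later in this tutorial ({dplyr}, {tidytext}, {ggplot2}, {wordcloud2}, {forcats}, and {jiebaR}) have to be installed once before they can be loaded, for example:
# One-off installation of the packages used in this tutorial
install.packages(c("rvest","dplyr","tidytext","ggplot2","wordcloud2","forcats","jiebaR"))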
library(rvest)
url<-"https://www.bbc.com/news/world-us-canada-59536519"
webpage<-read_html(url)
title<-webpage %>%
html_nodes("h1") %>%
html_text()
title
## [1] "Chris Cuomo: CNN fires presenter over help he gave politician brother"
Now, if we want the content of this article, we need to find out which HTML nodes contain it. Inspection shows that the short abstract of the article is wrapped in a <b> tag.
abstract<-webpage %>%
html_nodes("b") %>%
html_text()
The body of the article can be found under the <p> nodes. In total there are 37 strings in the body variable.
body<-webpage %>%
html_nodes("p") %>%
html_text()
length(body)
## [1] 37
body[1]
## [1] "US anchor Chris Cuomo has been fired by CNN for help he gave his brother, ex-New York governor Andrew Cuomo, while he was battling harassment allegations."
Since most of the content on a web page is textual, we need to learn how to do text mining with R, which is in fact the title of a book, in order to make good use of these words. Following this book, we can first turn the scraped lines into a special data structure called a tibble. A tibble is a modern class of data frame in R that has a convenient print method, does not convert strings to factors, and does not use row names.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
text_df<-tibble(line=1:length(body),text=body)
text_df
## # A tibble: 37 x 2
## line text
## <int> <chr>
## 1 1 "US anchor Chris Cuomo has been fired by CNN for help he gave his brot…
## 2 2 "The decision came after CNN said additional information had emerged o…
## 3 3 "Andrew Cuomo resigned in August after prosecutors said he had harasse…
## 4 4 "Chris Cuomo, 51, said in a statement that he was disappointed and it …
## 5 5 "He had worked for the network since 2013 and became one of its most r…
## 6 6 "A CNN statement said that a \"respected law firm\" had been hired to …
## 7 7 "Chris Cuomo had already been suspended by CNN on Tuesday after the ex…
## 8 8 "At that time, the network said that while it \"appreciated the unique…
## 9 9 "Documents released by New York Attorney General Letitia James on Mond…
## 10 10 "\"You need to trust me,\" he texted Melissa DeRosa, his brother's sec…
## # … with 27 more rows
Once we have a tibble, we need to convert it to a special format, with one token per row, using the function unnest_tokens( ) in the package {tidytext}.
library(tidytext)
news<-text_df %>%
unnest_tokens(word, text)
news
## # A tibble: 668 x 2
## line word
## <int> <chr>
## 1 1 us
## 2 1 anchor
## 3 1 chris
## 4 1 cuomo
## 5 1 has
## 6 1 been
## 7 1 fired
## 8 1 by
## 9 1 cnn
## 10 1 for
## # … with 658 more rows
Often in text analysis, we will want to remove stop words. Stop words are words that are not useful for an analysis, typically extremely common words such as "the", "of", and "to" in English. An English stop-word list is kept in the tidytext data set stop_words, and the stop words can be removed from our tokens with an anti_join( ). After the removal, the tibble news has fewer rows.
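The function anti_join(x, y) keeps only the rows of x that have no match in y, so joining our tokens against stop_words drops every word that appears in that list. A minimal sketch with made-up data:
toy_tokens<-tibble(word=c("the","anchor","of","network"))
toy_stops<-tibble(word=c("the","of"))
anti_join(toy_tokens,toy_stops,by="word") # keeps only "anchor" and "network"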
data(stop_words) # Load the data set containing the stop words, which is named stop_words
news<-news %>%
anti_join(stop_words)
## Joining, by = "word"
news
## # A tibble: 330 x 2
## line word
## <int> <chr>
## 1 1 anchor
## 2 1 chris
## 3 1 cuomo
## 4 1 fired
## 5 1 cnn
## 6 1 brother
## 7 1 york
## 8 1 governor
## 9 1 andrew
## 10 1 cuomo
## # … with 320 more rows
Now we can count the frequency of each word in this article with the function count( ) in {dplyr}, with the aim of extracting the high-frequency words. The two most frequent words are cuomo and chris, which together form the name of the protagonist of this article: Chris Cuomo. Quite intuitively, the more frequently a word is used in a text, the more likely it is to reflect the main theme of that text.
news<-news %>%
count(word, sort=T)
news
## # A tibble: 241 x 2
## word n
## <chr> <int>
## 1 cuomo 10
## 2 chris 6
## 3 cnn 6
## 4 york 6
## 5 andrew 4
## 6 brother 4
## 7 governor 4
## 8 allegations 3
## 9 brother's 3
## 10 cnn's 3
## # … with 231 more rows
We can plot the frequencies of the ten most frequent words. In order to print the words in the order of their frequencies, we temporarily turn the variable word into a factor whose levels are ordered by the frequencies n, using mutate( ) and reorder( ). The updated news is then piped to ggplot( ) to plot the frequencies of the ten most common words in this article.
library(ggplot2)
news[1:10,] %>%
mutate(word=reorder(word,n)) %>%
ggplot(aes(n,word))+
geom_col()
Of course, we can also make a word cloud of these common words. wordcloud2( ) takes a data frame with the words in the first column and their frequencies in the second, which is exactly the shape of news.
library(wordcloud2)
wordcloud2(news[1:30,])
Which are the keywords of this article? We can use the TF-IDF algorithm to extract them. The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites. In order to demonstrate how TF-IDF works, we add one more news article, this time from the Guardian. Inspecting its page shows that the article paragraphs sit in <p> elements with the class dcr-t0ikv9, hence the selector p.dcr-t0ikv9 below.
url2<-"https://www.theguardian.com/commentisfree/2021/dec/04/it-is-impossible-to-work-seriously-with-boris-johnsons-government"
webpage2<-read_html(url2)
body2<-webpage2 %>%
html_nodes("p.dcr-t0ikv9") %>%
html_text()
text_df2<-tibble(line=1:length(body2),text=body2)
Now convert this tibble to the one-token-per-row format as well, remove the stop words, and count the word frequencies. See the code below.
news2<-text_df2 %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort=T)
## Joining, by = "word"
news2
## # A tibble: 304 x 2
## word n
## <chr> <int>
## 1 france 9
## 2 french 9
## 3 british 6
## 4 government 5
## 5 britain 4
## 6 countries 4
## 7 european 4
## 8 uk 4
## 9 eu 3
## 10 heart 3
## # … with 294 more rows
Next, we combine the two word counts, labelled by their sources, and compute the tf-idf measure for each word in each source.
documents<-bind_rows(mutate(news,source="BBC"),
mutate(news2,source="Guardian")) %>%
bind_tf_idf(word, source, n)
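To make the statistic concrete, here is a minimal sketch of what bind_tf_idf( ) computes, using a made-up two-document corpus: tf is the count of a word divided by the total number of words in its document, idf is the natural logarithm of the number of documents divided by the number of documents containing the word, and tf-idf is their product, so a word that appears in every document gets a tf-idf of zero.
toy<-tibble(source=c("A","A","B","B"),
word=c("cuomo","news","france","news"),
n=c(2,1,3,1))
toy %>% bind_tf_idf(word,source,n)
# "news" occurs in both documents, so its idf is ln(2/2) = 0 and its tf_idf is 0;
# "cuomo" and "france" each occur in only one document, so their idf is ln(2/1) > 0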
We can visualize the top 10 keywords of each article in terms of the tf-idf measure.
library(forcats)
documents %>%
group_by(source) %>%
slice_max(tf_idf,n=10) %>%
ungroup() %>%
ggplot(aes(tf_idf,fct_reorder(word, tf_idf),fill=source))+
geom_col(show.legend=F)+
facet_wrap(~source, ncol=2, scales="free")+
labs(x='tf_idf',y=NULL)
PTT is a well-known Taiwanese online bulletin board that consists of a large variety of forums. Since many people post their reflections, opinions, and feelings about public and personal affairs there, journalists often report on stories that first appeared on PTT. However, PTT does not provide an API, so we have to compose our own web scraper to collect its content. In the example below, we will learn how to scrape the Boy-Girl forum on PTT. The index page of the Boy-Girl forum shows the latest posts, and the pages before it are numbered in a backward sequence, with URLs such as index5564.html, index5563.html, and so on (these are the numbers at the time of writing). Each page normally contains 20 posts. If we would like to get the common words in the post titles on the latest five pages, the code below can help.
main<-"https://www.ptt.cc/bbs/Boy-Girl/index"
Seq<-seq(5564,5561) # numbers of the four pages before the current index page, in backward order
suffix<-".html"
urls<-sapply(c(0,Seq),function(x){
if(x==0){ # 0 stands for the current index page, whose URL carries no page number
return("https://www.ptt.cc/bbs/Boy-Girl/index.html")
}else{
return(paste0(main,x,suffix))
}
})
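Printing urls confirms the five page addresses: the current index page followed by the four pages before it.
urls
## [1] "https://www.ptt.cc/bbs/Boy-Girl/index.html"
## [2] "https://www.ptt.cc/bbs/Boy-Girl/index5564.html"
## [3] "https://www.ptt.cc/bbs/Boy-Girl/index5563.html"
## [4] "https://www.ptt.cc/bbs/Boy-Girl/index5562.html"
## [5] "https://www.ptt.cc/bbs/Boy-Girl/index5561.html"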
If we want to collect the post titles, we need to know which HTML node corresponds to them. Inspection of the HTML code suggests that the titles are contained in <div> elements of the class title, i.e. the selector div.title.
titles<-sapply(urls,function(url){
webpage<-read_html(url) %>%
html_nodes("div.title") %>%
html_text()
})
titles<-unlist(titles)
names(titles)<-NULL
As some posts have nothing to do with the gist of this forum, we can get rid of them by removing the posts in categories such as 公告 (announcements) and 情報 (information). We do not want posts that have been deleted (marked 刪除) either. The code below removes these posts and then cleans up each remaining title by stripping whitespace characters, the Re: prefix, and the bracketed category tag.
invalid<-which(grepl("公告",titles) | grepl("情報",titles) | grepl("刪除",titles)) # positions of announcement, information, and deleted posts
v.titles<-sapply(titles[-invalid],function(tt){
tt<-gsub("\n","",tt) # remove newline characters
tt<-gsub("\t","",tt) # remove tab characters
tt<-gsub("Re: ","",tt) # remove the reply prefix
return(unlist(strsplit(tt,"]"))[2]) # keep only the text after the bracketed category tag
})
names(v.titles)<-NULL
Now we can do word segmentation on these titles. We segment each title into words and then paste the words back together into a single string, separated by spaces. Such a string looks like an English sentence, in that every two adjacent words are separated by a space.
library(jiebaR)
## Loading required package: jiebaRD
cat(c("Gloria Sol","bug","FB","IG","壞壞"),file="user_defined.txt")
cutter<-worker(user="user_defined.txt")
docs<-sapply(v.titles,function(v){
return(paste(cutter[v],collapse=" "))
})
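As a quick illustration with a made-up sentence (not one of the scraped titles), the segmenter splits a Chinese string into words, which we then join with spaces; the exact segmentation depends on the dictionary in use.
paste(cutter["今天天氣很好"],collapse=" ") # roughly "今天 天氣 很 好"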
Subsequently, we convert these title strings into a tibble, tokenize them, and count the words, keeping only words longer than one character. The result has two columns, word and n, and we can plot the common words in descending order of frequency.
texts<-suppressWarnings( # silence any warnings raised by the tokenization step
tibble(line=1:length(docs),text=docs) %>%
unnest_tokens(word,text) %>%
count(word,sort=T) %>%
filter(nchar(word)>1) # drop one-character tokens
)
texts[1:10,] %>%
mutate(word=reorder(word,n)) %>%
ggplot(aes(n,word))+
geom_col(fill="deepskyblue2",color="white")+
theme(text=element_text(family='DFKai-SB'))+ # a font that can display Chinese characters
labs(x="頻次",y="詞彙") # axis labels: 頻次 = frequency, 詞彙 = word