Introduction to Web Scraping

If the content we want is published on a web page and the server offers no API for requesting it, the only practical way to get it is to read the HTML source of that page and look for the particular content we are interested in. This procedure is called web scraping. In R, we can use the functions in the package {rvest} to do this.

First, you have to install the package {rvest} and load it. Second, we need the URL of a web page. Suppose we want to understand the background of the recent Israel-Hamas war and have found its Wikipedia page. You can open this web page in your browser. You should find a section titled Background, which is presumably what we want. Instead of copying and pasting the content of the Background section, we want to use the functions in {rvest} to access it directly.

How can we get to this content? We first need to convert the web page into an XML object with the function read_html( ). The code below does this for the Wikipedia page; the resulting object is stored as a list. When we inspect the HTML source of this web page, we find that the page title is contained in the node whose class is “mw-page-title-main”. The HTML of a web page is organized as a tree. If we want to access a particular piece of text, we need to provide the path to it. In this case, we already know that the title sits under the class “mw-page-title-main”. Thus, we use the CSS selector “.mw-page-title-main” (the leading dot means class) as the path and use the function html_node( ) to get to the node. We then use the function html_text( ) to convert the content of this node to text.

library(rvest)
url<-"https://en.wikipedia.org/wiki/2023_Israel–Hamas_war"
html<-read_html(url)
length(html)
## [1] 2
mode(html)
## [1] "list"
html %>% html_node(".mw-page-title-main") %>%
  html_text()
## [1] "2023 Israel–Hamas war"
We can access the content by looking for the class mw-parser-output. As all the textual contents are embedded in paragraphs that start with “<p>” and end with “</p>”, we can add p to the path of nodes. The returned result is saved as a variable called doc, which is a character vector. However, we do not know which entry of this vector contains the background information. Going back to check the HTML elements of the web page, we find that the paragraph starting with “The attack took place…” is what we need.

doc<-html %>% html_nodes(".mw-parser-output p") %>% 
  html_text()
# grepl() is vectorized, so it flags every paragraph that contains the phrase
v<-grepl("The attack took place",doc,fixed=TRUE)
Background<-doc[v]
Background
## [1] "The attack took place during the Jewish holiday of Simchat Torah on Shabbat,[71] and a day after the 50th anniversary of the start of the Yom Kippur War, which also began with a surprise attack.[72]"
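
The selector logic above can be tried offline on a tiny page. The sketch below is only an illustration: the HTML snippet and its text are made up, and it assumes a version of {rvest} that provides minimal_html( ). It shows that “.mw-page-title-main” picks a node by class, and that “.mw-parser-output p” picks only the paragraphs inside that node.

```r
library(rvest)

# A made-up page that mimics the structure described above (not the real Wikipedia page).
page <- minimal_html('
  <h1><span class="mw-page-title-main">Toy title</span></h1>
  <div class="mw-parser-output">
    <p>First paragraph.</p>
    <p>The attack took place on a holiday.</p>
  </div>
  <p>Paragraph outside the content node.</p>
')

# ".class" selects by class; html_text() extracts the text of the node.
toy.title <- page %>% html_node(".mw-page-title-main") %>% html_text()

# "A B" selects every B nested inside A, so the outside paragraph is skipped.
toy.paras <- page %>% html_nodes(".mw-parser-output p") %>% html_text()

# Keep the paragraph(s) containing the phrase of interest.
toy.hit <- toy.paras[grepl("The attack took place", toy.paras, fixed = TRUE)]
toy.title
toy.hit
```

Newer versions of {rvest} also offer html_element( ) and html_elements( ) as the preferred names for html_node( ) and html_nodes( ); the older names used in this note still work.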

Example 2: Understanding Human Feelings through Poems

Poems describe human feelings in a precise and concise way. Thus, poems might be a direct road into other people’s inner world. We can also learn from poems how to describe our own feelings properly. Therefore, the words or topics in poems can presumably reflect our emotions. I happened to find the website onlyshortpoems.com, which collects lots of poems in many categories. This website is a good platform for practicing web scraping. For instance, suppose we would like to know which linguistic features are related to depression. We can collect the depression poems on this website. If you click on older posts and check the URL, you will see “https://www.onlyshortpoems.com/poems/depression-poems/page/2/”. The number in this URL is apparently the page number, so the next page of older posts should have the number 3. On each web page, there are 10 poems. We can only see a part of each poem unless we click on read more, which means that the complete poem is shown at another URL. After inspecting the HTML elements of this web page, we find that the link (the href attribute) to each poem is in an a element under the node with class entry-header. There are only 5 pages of depression poems, so we would expect to scrape 50 depression poems. In fact, we manage to scrape only 35 poems.

main<-"https://www.onlyshortpoems.com/poems/"
categ<-"depression-poems/page/"
PoemUrls<-sapply(1:5,function(p){
     url<-paste0(main,categ,p,"/")
     tryCatch({
       page<-read_html(url)
       url.poem<-page %>% html_nodes(".entry-header a") %>% html_attr("href")
       return(url.poem)
     },
     error=function(cond){
       message("Error on page ",p)
       return(NA)
     },
     warning=function(cond){
       message("Warning on page ",p)
       return(NULL)
     },
     finally={
       message("Page ",p," is scraped.")
     })
})
## Page 1 is scraped.
## Page 2 is scraped.
## Page 3 is scraped.
## Page 4 is scraped.
## Error on page 5
## Page 5 is scraped.
PoemUrls<-unlist(PoemUrls)
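
The output above shows “Page 5 is scraped.” right after “Error on page 5”. This is how tryCatch( ) works: the finally part runs whether or not an error occurred. The sketch below reproduces this behaviour without any network access. The function fetch( ), the simulated failure on page 3, and the fake URLs are all hypothetical and only serve the illustration.

```r
# Hypothetical stand-in for a page request: page 3 fails on purpose.
fetch <- function(p) {
  tryCatch({
    if (p == 3) stop("simulated failure")
    paste0("url-for-page-", p)
  },
  error = function(cond) {
    message("Error on page ", p)
    NA_character_                 # value returned when the request fails
  },
  finally = message("Page ", p, " is finished.")  # runs in both cases
  )
}

res <- unlist(lapply(1:3, fetch))

# Failed pages show up as NA and can be dropped before further scraping.
urls <- res[!is.na(res)]
urls
```

Dropping the NA entries with is.na( ) is a little safer than removing the last element by position, because it does not depend on which page failed.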

Now we have collected the URLs of 35 depression poems. The last element of PoemUrls is the NA returned for the page that failed, so we drop it in the script below. The next step is to scrape the contents of the poems. Before we do this, let’s check whether these web pages share a regular structure that we can use to get the poems. Yes, they do. The title of a poem is contained in the node with class = entry-title and the content of a poem is contained in the node with class = entry-content.

poems<-lapply(PoemUrls[-length(PoemUrls)],function(x){
    p<-read_html(x)
    title<-p %>% html_node(".entry-title") %>% html_text()
    content<-p %>% html_node(".entry-content") %>% html_text()
    #message("Poem ", which(x==PoemUrls), " is done.")
    return(list(title=title,content=content))
})

Let’s check out the titles of these poems.

title<-sapply(poems,"[[","title")
title
##  [1] "Waiting"                       "Behind the Smile"             
##  [3] "Lost in the Dark"              "My life"                      
##  [5] "Time Out To Cry ©"             "Welcome To My World"          
##  [7] "Pain Became My Friend Today ©" "Disappear"                    
##  [9] "When You Arrive"               "Hurting Inside"               
## [11] "Whiskey Thirst"                "Peels"                        
## [13] "Oh Heart"                      "Goodbye"                      
## [15] "Mystery Life"                  "Painfull Past"                
## [17] "Butterflies"                   "My Heary Is Crying"           
## [19] "Vividly"                       "Depression Hurts"             
## [21] "Why?"                          "Lost"                         
## [23] "Pretend"                       "Coping"                       
## [25] "My cocoon"                     "River Of Misery"              
## [27] "Forgot How To Smile"           "Waking Sleep"                 
## [29] "Here I Hold my Ground"         "Ordinary Guy"                 
## [31] "The Scary Place"               "Curst"                        
## [33] "The New Ego"                   "Addictions"                   
## [35] "A Perfect Target"

Now let’s check out the contents. The raw text contains newline and tab characters as well as ellipses, so we need to clean it up.

content<-sapply(poems,"[[","content")
content<-sapply(content,function(x){
  temp<-gsub("\n\t\t\t\t\t\t\t","",x)
  temp<-gsub("\r\n\t\t\t\t","",temp)
  temp<-gsub("\n"," ",temp)
  temp<-gsub("\t"," ",temp)
  temp<-gsub("…"," ",temp)
  return(temp)
})
names(content)<-NULL
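
As an alternative to listing each newline and tab pattern, all runs of whitespace can be collapsed in one step. This is a sketch of a different cleaning strategy, not the script used above: it replaces every run of whitespace with a single space and trims the ends, so the result can differ slightly from the step-by-step version. The string raw is a made-up example.

```r
# Made-up raw text with the kind of newlines and tabs seen in scraped pages.
raw <- "\n\t\t\t\t\t\t\tI am all alone\r\n\t\t\t\tand it is cold\n"

# [[:space:]]+ matches any run of spaces, tabs, and line breaks.
cleaned <- trimws(gsub("[[:space:]]+", " ", raw))
cleaned
## [1] "I am all alone and it is cold"
```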

In the previous note, we introduced how to extract keywords with the TF-IDF algorithm. In addition to individual words, combinations of adjacent words can be informative for representing an article. These combinations are n-grams with n = 2, 3, …, k. For example, in the sentence “Birds are singing outside”, the bigrams are birds_are, are_singing, and singing_outside. Similarly, the trigrams are birds_are_singing and are_singing_outside.
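
To make the definition concrete, here is a minimal base R sketch that builds the n-grams of a sentence by sliding a window of n words over it. The helper ngrams( ) is a hypothetical name written for this illustration; it is not a function from {tidytext}.

```r
# Hypothetical helper: split on whitespace, then paste every window of n words.
ngrams <- function(sentence, n) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  if (length(words) < n) return(character(0))   # sentence too short for an n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = "_"),
         character(1))
}

bigrams <- ngrams("Birds are singing outside", 2)
trigrams <- ngrams("Birds are singing outside", 3)
bigrams
## [1] "birds_are"       "are_singing"     "singing_outside"
trigrams
## [1] "birds_are_singing"   "are_singing_outside"
```

A sentence with m words therefore has m - n + 1 n-grams, as long as m is at least n.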

Here we are going to learn how to extract n-grams from texts. Before extracting n-grams, we have to put our texts into a tibble. Thereafter, we use the function unnest_tokens( ) in the package {tidytext} to tokenize these poems in the format of one token per row. Here a token is an n-gram.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
poems.text<-tibble(text=content,title=title)
poems.text<-poems.text %>%
               unnest_tokens(bigram,text,token="ngrams",n=2)
poems.text
## # A tibble: 3,356 × 2
##    title   bigram   
##    <chr>   <chr>    
##  1 Waiting i am     
##  2 Waiting am all   
##  3 Waiting all alone
##  4 Waiting alone and
##  5 Waiting and it’s 
##  6 Waiting it’s cold
##  7 Waiting cold in  
##  8 Waiting in here  
##  9 Waiting here i   
## 10 Waiting i can’t  
## # ℹ 3,346 more rows

As we can see, a lot of the bigrams are combinations of common (uninteresting) words, or what we call stop words. We need to remove them. However, each token contains two words. Thus, we first use tidyr::separate( ) to split a token into two columns, one for each word. Thereafter, we check whether each word is in the stop word list. If either word is, the bigram is not included in the new tibble.

stop_words
## # A tibble: 1,149 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ℹ 1,139 more rows
library(tidyr)
poems.separate<-poems.text %>% separate(bigram,into=c("word1","word2"),sep=" ")
poems.separate
## # A tibble: 3,356 × 3
##    title   word1 word2
##    <chr>   <chr> <chr>
##  1 Waiting i     am   
##  2 Waiting am    all  
##  3 Waiting all   alone
##  4 Waiting alone and  
##  5 Waiting and   it’s 
##  6 Waiting it’s  cold 
##  7 Waiting cold  in   
##  8 Waiting in    here 
##  9 Waiting here  i    
## 10 Waiting i     can’t
## # ℹ 3,346 more rows
poems.bis<-poems.separate %>% filter(!word1 %in% stop_words$word,
                                     !word2 %in% stop_words$word) %>%
           unite(bigram,c(word1,word2),sep=" ")
poems.bis
## # A tibble: 371 × 2
##    title            bigram            
##    <chr>            <chr>             
##  1 Waiting          it’s cold         
##  2 Waiting          pains inside      
##  3 Behind the Smile smile i’m         
##  4 Behind the Smile i’m smiling       
##  5 Behind the Smile inside i’m        
##  6 Behind the Smile i’m dying         
##  7 Behind the Smile crying i’m        
##  8 Behind the Smile i’m happy         
##  9 Behind the Smile panic attacks     
## 10 Behind the Smile depression attacks
## # ℹ 361 more rows

Now what can we do to analyze these n-grams? First, we can check the frequency of a particular word in the bigrams across all poems. We separate each bigram into two words and keep the bigrams in which either word matches the word of interest. For example, if we want to get the frequency of pain in these bigrams, we can use the script below.

library(stringr)
poems.bis %>% separate(bigram, into=c("word1","word2"), sep=" ") %>%
  filter(word1=="pain" | word2=="pain") %>% 
  count(pain=str_c(word1,word2,sep=" "),sort=T)
## # A tibble: 8 × 2
##   pain              n
##   <chr>         <int>
## 1 bye pain          1
## 2 cold pain         1
## 3 consoled pain     1
## 4 cried pain        1
## 5 misery pain       1
## 6 pain bounds       1
## 7 pain inside       1
## 8 tomorrow pain     1

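Second, we can treat a bigram as a keyword and measure how characteristic each bigram is of a poem. The script below uses the weighted log odds computed by bind_log_odds( ) in the package {tidylo} and plots the top 15 bigrams.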

library(tidylo)
library(ggplot2)
poems.bis %>% count(title,bigram,sort=T) %>%
  bind_log_odds(set=title,feature=bigram,n=n) %>%
  # keep exactly 15 rows, even when several log odds are tied
  slice_max(log_odds_weighted,n=15,with_ties=FALSE) %>%
  # label each bar with its own bigram and poem title, ordered by log odds
  mutate(label=reorder(paste0(bigram," (",title,")"),log_odds_weighted)) %>%
  ggplot(aes(y=label,x=log_odds_weighted))+
  geom_col(fill="deepskyblue2",color="white")+labs(y="bigram (poem)")

In addition, we can also check how often words are preceded by a word like “not”.

poems.separate %>% filter(word1=="not") %>% filter(!word2 %in% stop_words$word) %>%
  count(word1, word2, sort=T)
## # A tibble: 3 × 3
##   word1 word2      n
##   <chr> <chr>  <int>
## 1 not   caring     1
## 2 not   nice       1
## 3 not   ready      1

What about other types of poems? Suppose we would like to learn the difference between the concepts of depression and sadness. We can repeat the same analysis for the sad poems. In total, there are 8 pages of sad poems. We can get their URLs.

main<-"https://www.onlyshortpoems.com/poems/"
categ<-"sad-poems/page/"
sad.PoemUrls<-sapply(1:8,function(p){
     url<-paste0(main,categ,p,"/")
     tryCatch({
       page<-read_html(url)
       url.poem<-page %>% html_nodes(".entry-header a") %>% html_attr("href")
       return(url.poem)
     },
     error=function(cond){
       message("Error on page ",p)
       return(NA)
     },
     warning=function(cond){
       message("Warning on page ",p)
       return(NULL)
     },
     finally={
       message("Page ",p," is scraped.")
     })
})
## Page 1 is scraped.
## Page 2 is scraped.
## Page 3 is scraped.
## Page 4 is scraped.
## Page 5 is scraped.
## Page 6 is scraped.
## Page 7 is scraped.
## Page 8 is scraped.
sad.PoemUrls<-unlist(sad.PoemUrls)

Now we can get the information of each poem. The script below might take a while, because it requests one web page per poem and prints a message after each request.

sad.poems<-lapply(sad.PoemUrls,function(x){
    result<-tryCatch({
       p<-read_html(x)
       title<-p %>% html_node(".entry-title") %>% html_text()
       content<-p %>% html_node(".entry-content") %>% html_text()
       return(list(title=title,content=content))
    },
    error=function(cond){
      message("Error on ",x)
      return(NA)
    },
    warning=function(cond){
      message("Warning on ",x)
      return(NULL)
    },
    finally={
      message("done.")
    })
    return(result)
})
## done.
## (the same message repeats once for each URL)

Following the same procedure as for the depression poems, we can get the titles and contents of these sad poems.

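# drop the entries for pages that failed to load (NA or NULL instead of a list)
sad.poems<-Filter(is.list,sad.poems)
sad.title<-sapply(sad.poems,function(t){
  return(t$title)
})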
sad.title[1:30]
sad.content<-sapply(sad.poems,function(t){
  return(t$content)
})
sad.content<-sapply(sad.content,function(x){
  temp<-gsub("\n\t\t\t\t\t\t\t","",x)
  temp<-gsub("\r\n\t\t\t\t","",temp)
  temp<-gsub("\n"," ",temp)
  temp<-gsub("\t"," ",temp)
  temp<-gsub("…"," ",temp)
  return(temp)
})
names(sad.content)<-NULL
sad.content[2]

Again, we do bigrams for sad poems and remove stop words.

sad.poems.text<-tibble(text=sad.content,title=sad.title)
sad.poems.text<-sad.poems.text %>%
               unnest_tokens(bigram,text,token="ngrams",n=2)
sad.poems.text

sad.poems.separate<-sad.poems.text %>% separate(bigram,into=c("word1","word2"),sep=" ")
sad.poems.separate
sad.poems.bis<-sad.poems.separate %>% filter(!word1 %in% stop_words$word,
                                     !word2 %in% stop_words$word) %>%
           unite(bigram,c(word1,word2),sep=" ")
sad.poems.bis

Let’s check for the top 15 bigrams for sad poems.

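sad.poems.bis %>% count(title,bigram,sort=T) %>%
  bind_log_odds(set=title,feature=bigram,n=n) %>%
  # keep exactly 15 rows, even when several log odds are tied
  slice_max(log_odds_weighted,n=15,with_ties=FALSE) %>%
  # label each bar with its own bigram and poem title, ordered by log odds
  mutate(label=reorder(paste0(bigram," (",title,")"),log_odds_weighted)) %>%
  ggplot(aes(y=label,x=log_odds_weighted))+
  geom_col(fill="tomato",color="white")+labs(y="bigram (poem)")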
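
One simple way to start comparing the two categories is to look at which bigrams they share and which appear in only one of them. The sketch below uses two made-up character vectors in place of the real data; with the objects built in this note, one would presumably pass poems.bis$bigram and sad.poems.bis$bigram instead.

```r
# Made-up bigrams standing in for the two scraped categories.
dep.bigrams <- c("panic attacks", "pain inside", "broken heart")
sad.bigrams <- c("broken heart", "tears fall", "pain inside")

# Bigrams found in both categories.
shared <- intersect(dep.bigrams, sad.bigrams)

# Bigrams found in only one category.
only.dep <- setdiff(dep.bigrams, sad.bigrams)
only.sad <- setdiff(sad.bigrams, dep.bigrams)

shared
only.dep
only.sad
```

This only compares which bigrams occur, not how often they occur; the weighted log odds used above would be a natural next step for a frequency-based comparison.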