Understanding Human Feelings through Poems

Poems describe human feelings in a precise and concise way, so they may offer a direct road into other people’s inner worlds, and we can also learn from them how to describe our own feelings properly. The words and topics in poems therefore presumably reflect our emotions. I happened to find this website, which collects many poems in a range of categories, and it makes a good platform for practicing web scraping. Suppose, for instance, that we want to know which linguistic features are relevant to depression. We can collect the depression poems from this website. If you click on older posts and check the URL, you will see “https://www.onlyshortpoems.com/poems/depression-poems/page/2/”. The number in this URL apparently indicates the page number, so the next page of older posts must be page 3. Each page lists 10 poems, but only a part of each poem is shown unless we click on read more; the complete poem is therefore displayed at another URL. After inspecting the HTML elements of the page, we find that the hyperlink to each poem sits in the entry header, inside an h2 element of class entry-title. There are only 4 pages of depression poems. In order to do text analysis on these poems, we first collect the URL of each poem.
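
Before looping over all pages, it can help to confirm on a single page that the selector really captures the poem links. The short sketch below is only a check of my own, using the same rvest functions as the script that follows; the expected counts in the comments simply reflect the page description above.

library(rvest)
# Read the main page of the depression poems
page1<-read_html("https://www.onlyshortpoems.com/poems/depression-poems/")
# Extract the hyperlinks sitting in the entry headers (h2 elements of class entry-title)
links<-page1 %>% html_nodes("h2.entry-title a") %>% html_attr("href")
length(links)   # should be 10, one link per poem on the page
head(links)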

library(rvest)
# Main page of the depression poems
mainURL<-"https://www.onlyshortpoems.com/poems/depression-poems/"
# Build the URLs of the 4 pages; page 1 is the main page itself
urls<-sapply(1:4,function(i){
     if(i==1){
         url<-mainURL
     }else{
         url<-paste0(mainURL,"page/",i,"/")
     }
     return(url)       
})
# Start web scraping
poem_dta<-lapply(urls,function(j){
     # Read a url
     temp1<-read_html(j)
     # Get the titles of all poems
     poem_titles<-temp1 %>% html_nodes("h2.entry-title") %>% html_text(trim=T)
     # Get the hyper references of all poems
     poem_urls<-temp1 %>% html_nodes("h2.entry-title a") %>% html_attr("href")
     # Get the content of each poem
     poem_contents<-sapply(poem_urls,function(i){
          # Read the web page of a poem
          temp2<-read_html(i)
          # Get the content of that poem
          content<-temp2 %>% html_node("div.entry-content") %>% html_text(trim=T)
          # Remove \n (replacing it with a space instead, gsub("\n"," ",content),
          # would avoid gluing the last word of one line to the first word of the next)
          content<-gsub("\n","",content)
          return(content)
     })
     names(poem_contents)<-NULL
     return(list(titles=poem_titles,contents=poem_contents))
})
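
As a quick sanity check (a small sketch of my own, not part of the original script), we can count how many poems were collected from each page and in total:

# Number of poems collected from each page
sapply(poem_dta,function(x) length(x$titles))
# Total number of poems across the 4 pages
sum(sapply(poem_dta,function(x) length(x$titles)))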

After we scrape the web pages of depression poems, we have the titles and contents of 35 poems in total. These data are then saved to a file on the hard drive for future use.

# Combine the titles and contents collected from all pages
all_titles<-unlist(sapply(poem_dta,"[[","titles"))
all_contents<-unlist(sapply(poem_dta,"[[","contents"))
# Save them as a two-column text file
write.table(data.frame(Title=all_titles,Content=all_contents),
            file="poems.txt",row.names=F)
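
write.table( ) stores the data as a plain-text table, which is easy to inspect but flattens the list structure. As an alternative sketch (the file name poems.rds is my own choice), saveRDS( ) keeps the nested list exactly as it was returned by the scraping step:

# Save the raw list returned by lapply( ) above
saveRDS(poem_dta,file="poems.rds")
# Restore it later with the same structure
poem_dta<-readRDS("poems.rds")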

Text analysis

Text analysis is the process of examining and interpreting written material to extract meaningful information, identify patterns, and derive insights. It can be carried out with various techniques and methodologies, often leveraging computational tools and algorithms, and is used in numerous fields such as linguistics, data science, marketing, and the social sciences. Now we can import our poem data to practice text analysis. For convenience, we first turn the imported data into a tibble.

PoemDta<-read.table("poems.txt",header=T)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
PoemDta<-as_tibble(PoemDta)
PoemDta
## # A tibble: 35 × 2
##    Title                         Content                                        
##    <chr>                         <chr>                                          
##  1 Waiting                       I am all alone and It’s cold in here. I can’t …
##  2 Behind the Smile              Behind this smileI’m smiling, but inside I’m d…
##  3 Lost in the Dark              Lost in the darkNowhere to go, nowhere to hide…
##  4 My life                       Lying in my bed in a bad mood for the first ti…
##  5 Time Out To Cry ©             Time Out To Cry ©All alone at the end of the d…
##  6 Welcome To My World           Welcome To My World Have you ever tried to cry…
##  7 Pain Became My Friend Today © Pain Became My Friend Today ©Pain became my fr…
##  8 Disappear                     Sometimes I feel I want to disappear.This is h…
##  9 When You Arrive               When you arrive, i’ll lay sprawldon’t be frigh…
## 10 Hurting Inside                I sit over by the window looking up at the moo…
## # ℹ 25 more rows

What are the features, namely the keywords, of these poems? Let’s check the TF-IDF value of each word across all poems. First, we need to convert the tibble into the one-token-per-document-per-row format, i.e., one row per word per poem, using unnest_tokens( ) from the package tidytext.

library(tidytext)
PoemDta1<-PoemDta %>% unnest_tokens(word,Content)
PoemDta1
## # A tibble: 3,057 × 2
##    Title   word 
##    <chr>   <chr>
##  1 Waiting i    
##  2 Waiting am   
##  3 Waiting all  
##  4 Waiting alone
##  5 Waiting and  
##  6 Waiting it’s 
##  7 Waiting cold 
##  8 Waiting in   
##  9 Waiting here 
## 10 Waiting i    
## # ℹ 3,047 more rows

As can be seen in the tibble, many function words appear. Since function words such as the, i, etc. carry no specific information, we need to remove them. In addition to function words, some other very common words tell us little about the poems and can be excluded as well. The words to be removed are called stop words, and they are collected in a dataset named stop_words. To remove them from our poems, we use anti_join( ), which drops every word that also appears in stop_words.
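
For readers curious about what the stop word list looks like, a quick peek (just a sketch; the dataset is available once tidytext is loaded) shows its two columns, word and lexicon:

head(stop_words)
# Number of stop words contributed by each lexicon
stop_words %>% count(lexicon)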

data(stop_words)
PoemDta1<-PoemDta1 %>% anti_join(stop_words)
## Joining with `by = join_by(word)`
PoemDta1
## # A tibble: 1,227 × 2
##    Title   word    
##    <chr>   <chr>   
##  1 Waiting it’s    
##  2 Waiting cold    
##  3 Waiting can’t   
##  4 Waiting cry     
##  5 Waiting remember
##  6 Waiting fears   
##  7 Waiting wait    
##  8 Waiting i’m     
##  9 Waiting pains   
## 10 Waiting inside  
## # ℹ 1,217 more rows
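
Before turning to TF-IDF, it is instructive to glance at the raw word frequencies across all poems (a minimal sketch):

# The most frequent content words across all 35 poems
PoemDta1 %>% count(word,sort=T) %>% slice_head(n=10)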

Before computing TF-IDF, we need the frequency of each word in each poem, which we obtain with count( ). We then pass these counts to bind_tf_idf( ), which derives the total number of words per poem internally and calculates the TF-IDF value of each word.

# Counting each word
poem_words<-PoemDta1 %>% count(word,Title,sort=T)
# Computing TF-IDF
poem_tf_idf<-poem_words %>% bind_tf_idf(word,Title,n)
poem_tf_idf<-poem_tf_idf %>% arrange(desc(tf_idf))
library(ggplot2)
library(forcats)
poem_tf_idf %>% slice_max(tf_idf,n=10) %>%
  ggplot(aes(tf_idf,fct_reorder(word,tf_idf)))+
      geom_col(show.legend=F,fill="deepskyblue3")+
      labs(x="TF-IDF",y=NULL)
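
To make clear what bind_tf_idf( ) computes, here is a hand-rolled version (a sketch only; n_docs and manual_tf_idf are names introduced here). Term frequency is the count of a word divided by the total number of words in that poem, and inverse document frequency is the natural log of the number of poems divided by the number of poems containing the word:

# Total number of poems (documents)
n_docs<-n_distinct(poem_words$Title)
manual_tf_idf<-poem_words %>%
  group_by(Title) %>% mutate(tf=n/sum(n)) %>%        # term frequency within a poem
  group_by(word) %>% mutate(idf=log(n_docs/n())) %>% # inverse document frequency
  ungroup() %>% mutate(tf_idf=tf*idf) %>%
  arrange(desc(tf_idf))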

In addition to single words, the features of a document can be units of a larger size. If we use two consecutive words as a unit, that unit is called a bi-gram. For example, in the sentence we went to a party, the bi-grams are [we went], [went to], [to a], and [a party]. Similarly, a unit of three consecutive words is called a tri-gram. The script below shows how to extract bi-grams from the poems.
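
As a quick illustration, the sketch below applies unnest_tokens( ) to the toy sentence mentioned above (the object toy is made up for this example); it yields exactly the four bi-grams listed.

toy<-tibble(text="we went to a party")
toy %>% unnest_tokens(bigram,text,token="ngrams",n=2)
# bi-grams: "we went", "went to", "to a", "a party"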

PoemBigrams<-PoemDta %>% unnest_tokens(bigram,Content,token="ngrams",n=2)
PoemBigrams
## # A tibble: 3,022 × 2
##    Title   bigram   
##    <chr>   <chr>    
##  1 Waiting i am     
##  2 Waiting am all   
##  3 Waiting all alone
##  4 Waiting alone and
##  5 Waiting and it’s 
##  6 Waiting it’s cold
##  7 Waiting cold in  
##  8 Waiting in here  
##  9 Waiting here i   
## 10 Waiting i can’t  
## # ℹ 3,012 more rows

Again, we need to remove the stop words. However, a bi-gram consists of two words, so we first split each bi-gram into two single words before filtering out the stop words. This is done with the function separate( ) from the package tidyr.

library(tidyr)
PoemSeparated<-PoemBigrams %>% separate(bigram,into=c("word1","word2"),sep=" ")
PoemSeparated
## # A tibble: 3,022 × 3
##    Title   word1 word2
##    <chr>   <chr> <chr>
##  1 Waiting i     am   
##  2 Waiting am    all  
##  3 Waiting all   alone
##  4 Waiting alone and  
##  5 Waiting and   it’s 
##  6 Waiting it’s  cold 
##  7 Waiting cold  in   
##  8 Waiting in    here 
##  9 Waiting here  i    
## 10 Waiting i     can’t
## # ℹ 3,012 more rows
PoemBigrams_united<-PoemSeparated %>% filter(!word1 %in% stop_words$word) %>% 
                  filter(!word2 %in% stop_words$word) %>%
                  unite(bigram,c(word1,word2),sep=" ")

We can treat a bi-gram as a keyword, count how often it occurs in each poem, and then use bind_log_odds( ) from the package tidylo to find the bi-grams that are most characteristic of the poems.

library(tidylo)
PoemBigrams_united %>% count(Title,bigram,sort=T) %>%
  bind_log_odds(set=Title,feature=bigram,n=n) %>% arrange(desc(log_odds_weighted)) %>%
  top_n(15) %>%
  ggplot(aes(x=log_odds_weighted,y=fct_reorder(bigram,log_odds_weighted)))+
  geom_col(fill="tomato",color="white")+
  labs(x="Weighted log odds",y=NULL)
## Selecting by log_odds_weighted
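
As a simpler complement to the weighted log odds, one can also look at the overall bi-gram frequencies across all poems (a minimal sketch):

# The most frequent bi-grams after stop-word removal
PoemBigrams_united %>% count(bigram,sort=T) %>% slice_head(n=10)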