Latent Semantic Analysis

Latent semantic analysis (LSA) is a method for uncovering the latent structure behind the words used in a collection of texts, or the relationships among texts in terms of the words they use. To demonstrate how LSA is implemented, we first scrape poems from the website onlyshortpoems.com, focusing on sad poems and funny poems. We begin by defining a function that collects the URLs of the poems of interest.

get.URLs<-function(website,num){
   # Visit listing pages 1..num and collect the poem URLs found on each page
   PoemUrls<-sapply(1:num,function(p){
     url<-paste0(website,p,"/")
     tryCatch({
       page<-read_html(url)
       # Each poem link sits inside the entry header of the listing page
       url.poem<-page %>% html_nodes(".entry-header a") %>% html_attr("href")
       return(url.poem)
     },
     error=function(cond){
       message("Error on page ",p)
       return(NA)
     },
     warning=function(cond){
       message("Warning on page ",p)
       return(NULL)
     },
     finally={
       message("Page ",p," is scraped.")
     })
   })
   return(PoemUrls)
}
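
Note the division of labour inside tryCatch(): a page that throws an error returns NA and one that throws a warning returns NULL, so problematic pages can be spotted and dropped later, while the finally block reports progress for every page regardless of outcome.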

Now we can use this function to get the URLs of the sad poems.

library(rvest)
website<-"https://www.onlyshortpoems.com/poems/"
website<-paste0(website,"sad-poems/page/")
sad.urls<-get.URLs(website,8)
## Page 1 is scraped.
## Page 2 is scraped.
## Page 3 is scraped.
## Page 4 is scraped.
## Page 5 is scraped.
## Page 6 is scraped.
## Page 7 is scraped.
## Page 8 is scraped.
sad.urls<-unlist(sad.urls)

In order to get the title and content of each poem, we create another function.

get.Poems<-function(urls){
   # Visit each poem URL and pull out the title and the text of the poem
   poems<-lapply(urls,function(x){
     tryCatch({
        p<-read_html(x)
        title<-p %>% html_node(".entry-title") %>% html_text()
        content<-p %>% html_node(".entry-content") %>% html_text()
        return(list(title=title,content=content))
     },
     error=function(cond){
       message("Error on ",x)
       return(NA)
     },
     warning=function(cond){
       message("Warning on ",x)
       return(NULL)
     },
     finally={
       #message("done.")
     })
   })
   return(poems)
}
sad.poems<-get.Poems(sad.urls)

We can now extract the titles and the contents of these poems. Once we have the contents, we need to clean them up before further analysis by removing nonsense symbols (newline and tab characters left over from the HTML), converting words to lower case, and converting plural words to their singular forms.

sad.titles<-sapply(sad.poems,"[[","title")
sad.contents<-sapply(sad.poems,"[[","content")
library(tidytext)
library(SemNetCleaner)
## Loading required package: SemNetDictionaries
## 
## SemNetDictionaries (version 0.2.0) 
## For help getting started, see <https://doi.org/10.1037/met0000463> 
## 
##  Submit your own dictionary and moniker glossaries to:
##  <https://github.com/AlexChristensen/SemNetDictionaries/issues/new/choose>
## 
## SemNetCleaner (version 1.3.4) 
## For help getting started, see <https://doi.org/10.31234/osf.io/eht87> 
##  For bugs and errors, submit an issue to <https://github.com/AlexChristensen/SemNetCleaner/issues>
## 
## WARNING: There have been major updates to the SemNetCleaner package.
##  Please see 'Package NEWS' for a detailed list of updates (see 'Changes in version 1.2.0')
# Clean up contents
sad.contents<-sapply(sad.contents,function(t){
     # Remove nonsense symbols
     temp<-gsub("\r\n\t\t\t\t","",t)
     temp<-gsub("\n\t\t\t\t\t\t\t","",temp)
     temp<-gsub("\n"," ",temp)
     temp<-gsub("\r"," ",temp)
     temp<-gsub("\t"," ",temp)
     # Convert to lower cases
     temp<-tolower(temp)
     # Convert to singular forms
     temp<-unlist(strsplit(temp," "))
     temp<-sapply(temp,singularize)
     names(temp)<-NULL
     temp<-paste(temp,collapse=" ")
     return(temp)
})
names(sad.contents)<-NULL
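
The singularization step works one word at a time, which is why each poem is split on spaces first. A minimal illustration (the expected results are noted in comments rather than shown as output):

singularize("flowers")   # expected to return "flower"
singularize("poems")     # expected to return "poem"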

Before we remove stop words, we first create a tibble to contain and tokenize these poems.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
sad.dta<-tibble(title=sad.titles,text=sad.contents)
sad.dta<-sad.dta %>% unnest_tokens(word,text)
# Remove stop words
sad.dta<-sad.dta %>% filter(!word %in% stop_words$word)
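
At this point it can be useful to glance at the most frequent remaining words, as an optional check that is not part of the original pipeline (output not shown):

sad.dta %>% count(word,sort=TRUE) %>% head(10)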

In addition to the sad poems, we also scrape funny poems for comparison, and likewise collect their titles and contents.

funny.website<-"https://www.onlyshortpoems.com/poems/"
funny.website<-paste0(funny.website,"funny-poems/page/")
funny.urls<-get.URLs(funny.website,5)
## Page 1 is scraped.
## Page 2 is scraped.
## Page 3 is scraped.
## Page 4 is scraped.
## Page 5 is scraped.
funny.urls<-unlist(funny.urls)
funny.poems<-get.Poems(funny.urls)

Now we clean up these funny poems in the same way, and then form a tibble to contain them.

funny.titles<-sapply(funny.poems,"[[","title")
funny.contents<-sapply(funny.poems,"[[","content")
funny.contents<-sapply(funny.contents,function(t){
     # Remove nonsense symbols
     temp<-gsub("\r\n\t\t\t\t","",t)
     temp<-gsub("\n\t\t\t\t\t\t\t","",temp)
     temp<-gsub("\n"," ",temp)
     temp<-gsub("\r"," ",temp)
     temp<-gsub("\t"," ",temp)
     # Convert to lower cases
     temp<-tolower(temp)
     # Convert to singular forms
     temp<-unlist(strsplit(temp," "))
     temp<-sapply(temp,singularize)
     names(temp)<-NULL
     temp<-paste(temp,collapse=" ")
     return(temp)
})
names(funny.contents)<-NULL
funny.dta<-tibble(title=funny.titles,text=funny.contents)
funny.dta<-funny.dta %>% unnest_tokens(word,text)
# Remove stop words
funny.dta<-funny.dta %>% filter(!word %in% stop_words$word)

TF-IDF

We can compute the TF-IDF of each word in each poem. To this end, we combine the two tibbles and compute the TF-IDF for every word-poem pair. We keep the 182 words at the top of the TF-IDF ranking (roughly the first 10%) as our targets for LSA.
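
Concretely, for a word \(w\) in poem \(d\), bind_tf_idf() uses \(\mathrm{tf}(w,d)=n_{w,d}/N_d\) (the count of \(w\) in \(d\) divided by the total word count of \(d\)) and \(\mathrm{idf}(w)=\ln(D/d_w)\) (the log of the number of poems divided by the number of poems containing \(w\)), and returns their product \(\mathrm{tf}(w,d)\times\mathrm{idf}(w)\). A word therefore scores highly when it is frequent in one poem but rare across the collection.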

all.dta<-rbind(sad.dta,funny.dta)
all.dta<-all.dta %>% group_by(title) %>% count(word) %>% ungroup
all.dta<-all.dta %>% bind_tf_idf(word,title,n)
all.tfidf.dta<-all.dta %>% arrange(desc(tf_idf))
target.words<-all.tfidf.dta$word[1:182]
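
To see which word-poem pairs end up at the top of the ranking (inspection only; output not shown):

head(all.tfidf.dta,10)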

LSA

Thus far, we have scraped two kinds of poems from the onlyshortpoems website. There are many ways to describe the relationships between these two kinds of poems, and LSA is one of them. We now demonstrate how to do LSA. First, we need to create a term-document matrix. To this end, we aggregate all words in the poems. There are in total 1818 distinct words across all poems, so we represent each poem as a vector of 1818 components, each of which is the count of a particular word in that poem.

allwords<-unique(c(sad.dta$word,funny.dta$word))
length(allwords)
## [1] 1818
# For sad poems
sad.TD<-sapply(sad.titles,function(t){
     p.words<-sad.dta %>% filter(title==t) %>% with(word)
     temp<-rep(0,length(allwords))
     for(i in p.words){
       temp[which(i==allwords)]<-temp[which(i==allwords)]+1
     }
     return(temp)
})
# For funny poems
funny.TD<-sapply(funny.titles,function(t){
     p.words<-funny.dta %>% filter(title==t) %>% with(word)
     temp<-rep(0,length(allwords))
     for(i in p.words){
       temp[which(i==allwords)]<-temp[which(i==allwords)]+1
     }
     return(temp) 
})
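
As an aside, the same word-by-poem counts can be obtained in one line from the counted tibble all.dta built in the TF-IDF step, using base R's xtabs(). This is only a cross-check of the loops above; note that xtabs() orders rows and columns alphabetically rather than in the order of sad.titles and funny.titles:

TD.check<-xtabs(n~word+title,data=all.dta)   # words x poem titles
dim(TD.check)   # expected to be 1818 x 125, assuming all poem titles are distinct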

We have now created two term-document matrices. To find the latent structure behind these poems, we combine them into one large term-document matrix whose rows represent words and whose columns represent poem titles. Since the focus is on the words with high TF-IDF, we keep only the rows corresponding to the target words.

TD<-cbind(sad.TD,funny.TD)
dim(TD)
## [1] 1818  125
target.nums<-sapply(target.words,function(x)which(x==allwords))
TD<-TD[target.nums,]
alltitles<-c(sad.titles,funny.titles)

This matrix can be decomposed into three parts using the linear-algebra technique called SVD (singular value decomposition). According to SVD, any matrix can be written as \(W = U \Lambda V^T\), where \(\Lambda\) is the diagonal matrix of singular values of \(W\); the squared singular values are the eigenvalues of \(W^T W\) (equivalently, of \(W W^T\)). SVD is closely related to PCA (principal component analysis): the PCA of a centered data matrix is typically computed through its SVD. Thus, TD can be decomposed into three parts by SVD. The first matrix \(U\) is a term-component matrix; here it has 182 rows (words) and 125 columns (components). Similarly, \(V\) is a document-component matrix with 125 components. Multiplying H$u by diag(H$d) gives a matrix whose rows represent words in component coordinates, and multiplying diag(H$d) by t(H$v) gives one whose columns represent poem titles. The singular values in H$d drop quickly over the first three ranks, suggesting that there might be about three dominant components.

H<-svd(TD)
length(H$d)
## [1] 125
dim(H$u)
## [1] 182 125
dim(H$v)
## [1] 125 125
library(ggplot2)
ggplot(data.frame(x=1:125,y=H$d),aes(x,y))+geom_line()
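
As a quick sanity check that the decomposition behaves as described, the three factors returned by svd() should reconstruct TD up to numerical rounding error:

TD.hat<-H$u%*%diag(H$d)%*%t(H$v)   # U Lambda V^T
max(abs(TD-TD.hat))                # expected to be essentially zero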

Since \(\Lambda\) is a diagonal matrix, multiplying \(U\) by \(\Lambda\) simply rescales each column of \(U\) by the corresponding singular value. Here \(U\) is a word-component matrix, so \(U\Lambda\), which we compute by building a diagonal matrix with d on its diagonal, gives the coordinates of each word in the space spanned by the principal components. The relationships between these words can then be shown geometrically.

WC<-H$u%*%(diag(H$d))
WC<-data.frame(WC)
rownames(WC)
##   [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12" 
##  [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
##  [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 
##  [37] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48" 
##  [49] "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60" 
##  [61] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72" 
##  [73] "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84" 
##  [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96" 
##  [97] "97"  "98"  "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108"
## [109] "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120"
## [121] "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144"
## [145] "145" "146" "147" "148" "149" "150" "151" "152" "153" "154" "155" "156"
## [157] "157" "158" "159" "160" "161" "162" "163" "164" "165" "166" "167" "168"
## [169] "169" "170" "171" "172" "173" "174" "175" "176" "177" "178" "179" "180"
## [181] "181" "182"
#WC.dta<-data.frame(WC[,c(1,3)])
#names(WC.dta)<-c("x","y")
#ggplot(WC.dta,aes(x,y,label=allwords[target.nums]))+geom_text(size=3)
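
The commented-out lines above sketch the idea of plotting the words; a variant using the first two components and labelling each point with its word (a sketch rather than output reproduced from the original analysis) is:

WC.dta<-data.frame(x=WC[,1],y=WC[,2],word=allwords[target.nums])
ggplot(WC.dta,aes(x,y,label=word))+geom_text(size=3)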

Alternatively, we can conduct a k-means cluster analysis of these words. We use the elbow method (computed here with hierarchical clustering via hcut) to estimate the optimal number of clusters, and then partition the words with k-means.

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(WC,
             FUNcluster=hcut,
             method="wss",
             k.max=12)+
  labs(title="Elbow Method for HC")

kmeans.cluster<-kmeans(WC,centers=2)
fviz_cluster(kmeans.cluster,data=WC,geom=c("point","text"),
             ellipse.type="norm")
## Too few points to calculate an ellipse
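
To list which words fall in each of the two clusters (an optional inspection step; output not shown):

split(allwords[target.nums],kmeans.cluster$cluster)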

Now we can check the relationships between the poems, whose coordinates in component space are given by \(V\Lambda\) (the transpose of the \(\Lambda V^T\) matrix computed below). Again, we use the elbow method to estimate the optimal number of clusters. The curve is still dropping fairly steeply even at 12 clusters, so we somewhat arbitrarily settle on 10. The cluster plot reveals that the sad poems are quite dissimilar to each other; the number attached to each point is the position of that poem in the vector of all poem titles. The funny poems, in contrast, gather closely together. This suggests that sad poems tell individual stories, whereas funny poems share a relatively common scenario.

CP<-diag(H$d)%*%t(H$v)
DC<-data.frame(t(CP))   # one row per poem: document coordinates in component space
fviz_nbclust(DC,
             FUNcluster=hcut,
             method="wss",
             k.max=12)+
  labs(title="Elbow Method for HC")

kmeans.cluster<-kmeans(DC,centers=10)
fviz_cluster(kmeans.cluster,data=DC,geom=c("point","text"),
             ellipse.type="norm")
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
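
As with the words, we can list which poem titles fall in each of the ten clusters (inspection only; output not shown):

split(alltitles,kmeans.cluster$cluster)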