Scraping only one article from a web page is rare. Often, answering a question of interest requires texts scraped from multiple web pages. For instance, if we want to know what topics the recent articles on this website cover, we need to collect a sufficient number of texts. On this web page there are 20 articles; the next web page also contains 20 articles, and so does the third. Thus, we know that this website lists 20 articles per page. The questions are then twofold. The first is how to access articles across multiple web pages. The second is how to get the topic of each article.
Let's deal with the first question. Look at the URLs of this web page and the next. The only difference between them is the last number, which is 1 for the first web page and 2 for the second. Presumably, the URL of the third web page ends with 3. You can click on 下一頁 (next page) and check the URL to confirm. Therefore, we know how to access articles across web pages. How do we get the topic of each article? There are a number of ways. Here I will introduce a relatively easy one.
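Since only that last number changes, the URLs of consecutive list pages can be generated programmatically. Below is a minimal sketch assuming the pattern https://dq.yam.com/list/number/ observed above; the object name page_urls is introduced only for illustration.
# Generate the URLs of the first three list pages
page_urls<-paste0("https://dq.yam.com/list/",1:3,"/")
page_urls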
Visual inspection of the first web page suggests that the hashtags of each article may provide information about its topic. Thus, instead of collecting the content of each article, we can just collect its hashtags. Let's see how to get the hashtags of an article. On 2024.05.01, the first article on this web page has the title 台美職場文化差異?美國員工為何怨「台積電是地球上最糟糕的工作場所」 (Taiwan-US workplace culture differences? Why American employees complain that "TSMC is the worst workplace on Earth"). We find that the metadata of this article is contained in an article node with the classes "post__item" and "non". We can use the function html_nodes() to access the metadata of all articles.
library(rvest)
# Read the first list page
url1<-"https://dq.yam.com/list/1/"
html<-read_html(url1)
# Extract the text of every article node (the metadata of each article)
metadata<-html %>% html_nodes("article.post__item.non") %>% html_text()
length(metadata)
## [1] 20
The metadata of an article is a string containing the title of the article, the hashtags, and the published time. The pattern is that the first and the last parts of each article's metadata are the title and the published time. Thus, we can use strsplit() to split the metadata string by the newline character "\n". For instance, splitting the first string gives a character vector of 21 elements. The first is the title and the 21st contains the published time.
unlist(strsplit(metadata[1],"\n"))
## [1] " 台美職場文化差異?美國員工為何怨「台積電是地球上最糟糕的工作場所」 "
## [2] " tsmc"
## [3] " "
## [4] " 工程師"
## [5] " "
## [6] " 文化衝突"
## [7] " "
## [8] " 主管"
## [9] " "
## [10] " 台積電"
## [11] " "
## [12] " 台灣"
## [13] " "
## [14] " 企業"
## [15] " "
## [16] " 美國"
## [17] " "
## [18] " 張忠謀"
## [19] " "
## [20] " 管理"
## [21] " more 陳愷昀 2024-04-302024-04-30bookmark_border"
Therefore, what we need is to extract the elements between the first and the last. Since we have the metadata of 20 articles, it is convenient to use sapply() to extract the hashtags of all 20 articles in a loop. Now we can check whether we successfully got the hashtags of the first article. It looks fine, so we also check the second article. The hashtags of the second article are retrieved successfully too.
hashtags<-sapply(metadata,function(v){
  # Split the metadata string by newlines
  temp<-unlist(strsplit(v,"\n"))
  # Drop the first element (title) and the last element (published time)
  tags<-temp[-c(1,length(temp))]
  # Remove spaces and collapse the hashtags into one string
  tags<-gsub(" ","",tags)
  tags<-paste0(tags,collapse=" ")
  return(tags)
})
names(hashtags)<-NULL
hashtags[1]
## [1] "tsmc 工程師 文化衝突 主管 台積電 台灣 企業 美國 張忠謀 管理"
hashtags[2]
## [1] "app google instagram tiktok youtube 中國 北京 字節跳動 抖音 言論自由 青少年 拜登 美國 短影音 間諜 歐元 憲法 總統"
Since we can retrieve the hashtags of the articles on one web page, we can also retrieve them across many web pages. Suppose we want to get the metadata of 200 articles from this website. We need to scrape 10 pages. Let's start by wrapping the work in a function. See the code below.
WS<-function(mainURL="https://dq.yam.com/list/",num=5){
  allHTs<-sapply(1:num,function(n){
    # Create the URL of the web page to be scraped
    url<-paste0(mainURL,n,"/",collapse="")
    html<-read_html(url)
    # Get the metadata of the articles on this page
    metadata<-html %>% html_nodes("article.post__item.non") %>% html_text()
    # Extract the hashtags of each article, as before
    hashtags<-sapply(metadata,function(v){
      temp<-unlist(strsplit(v,"\n"))
      tags<-temp[-c(1,length(temp))]
      tags<-gsub(" ","",tags)
      tags<-paste0(tags,collapse=" ")
      return(tags)
    })
    names(hashtags)<-NULL
    return(hashtags)
  })
  return(allHTs)
}
Let's scrape 10 web pages to get the hashtags. The output of our function is a 20 x 10 matrix, so in total there are 200 articles. Let's do some descriptive statistics, such as making a frequency distribution of the number of hashtags per article. As shown in the figure below, most articles have about 10 hashtags and very few have more than 20.
results<-WS(mainURL="https://dq.yam.com/list/",num=10)
dim(results)
## [1] 20 10
all_metadata<-sapply(1:200,function(x){
  # Split the hashtag string of article x back into individual tags
  temp<-unlist(strsplit(results[x]," "))
  temp1<-paste0(temp,collapse=" ")
  names(temp1)<-NULL
  # Return both the number of tags and the tags themselves
  return(list(tagnum=length(temp),tags=temp1))
})
library(ggplot2)
ggplot(data.frame(freq=unlist(all_metadata[1,])),aes(freq))+
  geom_histogram(aes(y=after_stat(density)),fill="gray",col="black",bins=10)+
  geom_density(color="red")
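Beyond the histogram, we can also summarize the tag counts numerically. Here is a minimal sketch; the variable tagnum is introduced only for illustration and simply collects the tag numbers stored in the first row of all_metadata.
# Five-number summary and mean of the number of hashtags per article
tagnum<-unlist(all_metadata[1,])
summary(tagnum)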
We now create a term-document matrix for further data analysis. In a term-document matrix, each row represents a word (i.e., a hashtag) and each column a document (i.e., an article). To this end, we need to identify all words appearing in the hashtags. We can use unique() to identify the distinct words. With a term-document matrix, we can conduct LSA.
# Collect all hashtag words across the 200 articles
allwords<-NULL
for(i in 1:200){
  temp<-unlist(strsplit(unlist(all_metadata[2,i])," "))
  names(temp)<-NULL
  allwords<-c(allwords,temp)
}
length(allwords)
## [1] 2006
length(unique(allwords))
## [1] 1098
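Before building the matrix, it can also be informative to see which hashtags occur most often. The following optional sketch counts how often each word in allwords appears and shows the ten most frequent ones; the object name tag_freq is introduced only for illustration.
# Count hashtag frequencies and display the ten most common hashtags
tag_freq<-sort(table(allwords),decreasing=TRUE)
head(tag_freq,10)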
Latent semantic analysis (LSA) is often used to detect the relationships between texts or the relationships between words. In order to understand the relationships between the articles on DQ, we create a term-document matrix, in which rows represent hashtag words and columns represent articles. Each cell [i,j] records whether a particular word i is used in a particular article j (1 if it is used and 0 otherwise). Thereafter, we use Singular Value Decomposition (SVD) to decompose this term-document matrix into three parts: a diagonal matrix of singular values (the square roots of the eigenvalues), a Term-Component matrix, and a Component-Document matrix. We can multiply the matrix of singular values and the Component-Document matrix to form a new matrix that records the coordinates of the documents in a space composed of components. This procedure is closely related to PCA (Principal Component Analysis).

The code below shows how we conduct this analysis with svd(). The results show that these articles can roughly be separated into four clusters. The numbers in the figure are the article numbers. When we go back and check the articles by their numbers, the lower-right cluster is mainly related to Japan. Similarly, the upper-right cluster is more related to China and South Korea. The upper-left cluster has something to do with the USA, technology, and AI. The remaining articles are positioned in the lower-left area. As these articles are quite distant from each other, they may not share a core concept.
# Build the term-document matrix: one row for each element of allwords,
# one column for each article
TD<-matrix(0,length(allwords),length(all_metadata[2,]))
for(x in 1:length(all_metadata[2,])){
  temp<-unlist(strsplit(unlist(all_metadata[2,x])," "))
  # Locate the row position(s) of each hashtag of article x in allwords
  coding<-sapply(temp,function(j)which(j==allwords))
  coding<-unlist(coding)
  names(coding)<-NULL
  # Mark the words used by article x with 1
  for(i in 1:length(coding))TD[coding[i],x]<-1
}
# Decompose the term-document matrix with SVD
H<-svd(TD)
# Multiply the singular values into the Component-Document matrix t(H$v), then
# transpose so that each row of M gives the coordinates of one document
M<-diag(H$d)%*%t(H$v)
M<-t(M)
ggplot(data.frame(x=M[,1],y=M[,2],label=as.character(1:200)),
       aes(x,y,label=label))+
  geom_point(shape=16,col="red")+
  geom_text(size=2.5,color="deepskyblue4",nudge_x=0.1)
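As a quick sanity check on the decomposition described above, the three parts returned by svd() should multiply back to the original term-document matrix. This optional sketch only verifies the reconstruction; it does not change the analysis.
# Reconstruct TD from U, the singular values, and V; the maximum absolute
# difference should be near zero (numerical error only)
reconstructed<-H$u%*%diag(H$d)%*%t(H$v)
max(abs(TD-reconstructed))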