Introduction to R

Web crawler

First, we scrape the text from the first web page. Inspecting the HTML source shows that all quotes are contained between the elements <p> and </p>, so we extract the text content of every p element. However, some of the text between <p> and </p> is not what we want. The quotes we want always begin with a number. We therefore split each string at “.” and check whether the part before it can be converted to a number. If it can, the string is a quote and we keep it; otherwise we discard it.

library(rvest)
# Scrape texts on the web page for romantic breaking up
url1<-"https://poplady-mag.com/article/118175/失戀語錄-22句經典失戀語錄-總有一個能說進你的心裡"
p1<-read_html(url1)
quotes1<-p1 %>% html_nodes("p") %>% html_text()
valids<-sapply(quotes1,function(x){
          temp<-unlist(strsplit(x,"[.]"))
          tag<-suppressWarnings(as.numeric(temp[1]))
          return(!is.na(tag))
})
quotes1<-quotes1[valids]
quotes1

##  [1] "1. 每當別人問我喜歡類型的人，我就又要開始想起你。"                                                                                                            
##  [2] "2. 如果我能回到從前，我會選擇不認識你，並不是因為我後悔了，而是我無法面對現在的結局。"                                                                        
##  [3] "3. 你一定要過得好，別辜負我一生不打擾。"                                                                                                                      
##  [4] "4. 後來我們都很好，沒有問候，沒有打擾。"                                                                                                                      
##  [5] "5. 我一直都在關注你，以你知道或者不知道的方式。"                                                                                                              
##  [6] "6. 我見過你愛我的樣子，所以我確定你不愛我了。"                                                                                                                
##  [7] "7. 對一個男人來說，最無能為力的事兒就是“在最沒有能力的年紀，碰見了最想照顧一生的她。”"                                                                        
##  [8] "8. 我終究還是把你還回了人海。"                                                                                                                                
##  [9] "9. 曾經很生氣也不敢胡鬧，怕自己不夠重要。"                                                                                                                    
## [10] "10. 你走了也好，這樣我就不必再擔心你會離開了。"                                                                                                               
## [11] "11. 餘生不用你指教了，願你過的好，並讓我一無所知。"                                                                                                           
## [12] "12. 誰不曾用卑微的語句挽留誰。"                                                                                                                               
## [13] "13. 你是我所遇到的最好的人，可你卻不給我一個了解你的機會。為什麼不花一點時間來了解我，然後再和我分手呢？"                                                     
## [14] "14. 每個人心底都有那麼一個人，已不是戀人，也成不了朋友。"                                                                                                     
## [15] "15. 我不太會形容什麼愛情的語錄，愛你或許就像這首歌那樣平淡，你也許注意不到這三句開頭字。"                                                                     
## [16] "16. 我不過是想要一份不叛離，不傷害，只有溫暖的愛。"                                                                                                           
## [17] "17. 你可以刪掉一切，卻刪不掉我在你的記憶裡；我可以忘記一切，卻忘不掉你的出現。是不是真的等到我失憶，你才會在我記憶中消失。"                                   
## [18] "18. 小朋友，阿姨以前可喜歡你爸爸了。"                                                                                                                         
## [19] "19. 淋過雨的空氣，疲倦了的傷心，我記憶裡的童話已經慢慢的融化。"                                                                                               
## [20] "20. 時間會慢慢沉澱，有些人會在你心底慢慢模糊。學會放手，你的幸福需要自己的成全。"                                                                             
## [21] "21. 各種表達自己愛你的話，在我看來是那麼虛偽，一句愛你就夠了。"                                                                                               
## [22] "22. 無數個瞬間我都在想，要是你在就好了，結果還是我一個人熬過了所有的這個時刻。後來，不用了，謝謝 。未來我會變成更好的自己，或者也會在恰當的時間，再次遇到你。"

numq1<-length(quotes1)
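
The filtering step can be tried offline on a few strings. The following is a minimal sketch of the same split-and-convert check; the vector `sample_paragraphs` is hypothetical and only stands in for the scraped paragraph text.

```r
# Hypothetical strings standing in for the scraped <p> contents
sample_paragraphs <- c("1. first quote", "About us", "12. another quote", "")

# Same logic as above: split at ".", try to convert the first piece to a number
starts_with_number <- vapply(sample_paragraphs, function(x) {
  first_piece <- unlist(strsplit(x, "[.]"))[1]
  # as.numeric() returns NA (with a warning) when the piece is not a number
  !is.na(suppressWarnings(as.numeric(first_piece)))
}, logical(1), USE.NAMES = FALSE)

kept <- sample_paragraphs[starts_with_number]
kept
```

Strings without a leading number, including the empty string, yield NA in the conversion and are dropped.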

Next, we collect the happiness quotes. Again, the quotes are contained between the elements <p> and </p>, and only the strings that begin with a number are our targets. On this page the number is followed by “、” rather than “.”, so we split each string at “、” and check whether the part before it can be converted to a number.

url2<-"https://www.iglobe.hk/blog/posts/50happiness"
p2<-read_html(url2)
quotes2<-p2 %>% html_nodes("p") %>% html_text()
valids2<-sapply(quotes2,function(x){
            temp<-unlist(strsplit(x,"、"))
            tag<-suppressWarnings(as.numeric(temp[1]))
            return(!is.na(tag))
})
quotes2<-quotes2[valids2]
quotes2

##  [1] "1、幸福就像香水，灑給別人也一定會感染自己。"                                                     
##  [2] "2、羨慕別人得到的，不如珍惜自己擁有的。"                                                         
##  [3] "3、明白事理的人讓自己適應世界，不明事理的人硬想令世界適應自己。"                                 
##  [4] "4、生活是一面鏡子。你對它笑，它就對你笑；你對它哭，它也對你哭。"                                 
##  [5] "5、活著一天，就是有福氣，就該珍惜。"                                                             
##  [6] "6、當我哭泣我沒有鞋子穿的時候，我發現有人卻沒有腳。"                                             
##  [7] "7、還能衝動，表示你還對生活有激情；總是衝動，表示你還不懂生活。"                                 
##  [8] "8、人之所以痛苦，在於追求錯誤的東西。"                                                           
##  [9] "9、如果你不給自己煩惱，別人也永遠不可能給你煩惱，煩惱都是自找的。"                               
## [10] "10、時間是治療心靈創傷的大師，但絕不是解決問題的高手。"                                          
## [11] "11、無論你覺得自己多麼了不起，永遠有人比你更強；無論你覺得自己多麼不幸，永遠有人比你更加不幸。"  
## [12] "12、人生是一條沒有回程的單行線，上帝不會給你一張返程的票，所以一定要活在當下，珍惜目前。"        
## [13] "13、對待生活中的每一天若都像生命中的最後一天，人生定會更精彩。"                                  
## [14] "14、活在昨天的人失去過去，活在明天的人失去未來，活在今天的人擁有過去和未來。"                    
## [15] "15、人生最大的悲哀不是失去太多，而是計較太多，這也是導致一個人不快樂的重要原因。"                
## [16] "16、雖然我們無法改變人生，但可以改變人生觀。"                                                    
## [17] "17、雖然我們無法改變環境，但我們可以改變心境。"                                                  
## [18] "18、當你快樂時，你要想，這快樂不是永恆的。當你痛苦時，你更要想，這痛苦也不是永恆的。"            
## [19] "19、命運就像自己的掌紋，雖然彎彎曲曲，卻永遠掌握在自己手中。"                                    
## [20] "20、長得漂亮是優勢，活得漂亮是本事。"                                                            
## [21] "21、人生是一場旅行，在乎的不是目的地，是沿途的風景以及看風景的心情。"                            
## [22] "22、時間是小偷，他來時悄無聲息，走後損失慘重，機會也是如此。"                                    
## [23] "23、真正的快樂來源於寬容和幫助。"                                                                
## [24] "24、幸福並不需要奢侈和豪華，有時要的越多反而越難幸福。"                                          
## [25] "25、心簡單，世界就簡單，幸福才會生長；心自由，生活就自由，到哪都有快樂。"                        
## [26] "26、人生如煙花，不可能永遠懸掛天際；只要曾經絢爛過，便不枉此生。"                                
## [27] "27、人生四然：來是偶然，去是必然，盡其當然，順其自然。"                                          
## [28] "28、真正的堅韌，應該是哭的時候要徹底，笑的時候要開懷，說的時候要淋漓盡致，做的時候不要猶豫。"    
## [29] "29、時間告訴你什麼叫衰老，回憶告訴你什麼叫幼稚。"                                                
## [30] "30、生命很殘酷，用悲傷讓你了解什麼叫幸福，用噪音教會你如何欣賞寂靜，用彎路提醒你前方還有坦途。"  
## [31] "31、時間並不會真的幫我們解決什麼問題，它只是把原來怎麼也想不通的問題，變得不再重要了。"          
## [32] "32、人生就像一場馬拉松，你的起點高也好，你的提速快也好，但結果比較的是誰能堅持到最遠。"          
## [33] "33、人生就像一本書，出生是封面，歸去是封底，內容要靠自己填。"                                    
## [34] "34、人生就像衛生紙，沒事的時候儘量少扯！"                                                        
## [35] "35、人生就像一杯沒有加糖的咖啡，喝起來是苦澀的，回味起來卻有久久不會退去的余香！"                
## [36] "36、時間給勤勉的人留下智慧的力量，給懶惰的人留下空虛和悔恨。"                                    
## [37] "37、炫耀是需要觀眾的，而炫耀恰恰讓我們失去觀眾。"                                                
## [38] "38、快樂不是因為擁有的多，而是因為計較的少。"                                                    
## [39] "39、我們總在關注我們得到的東西是否值錢，而往往忽略放棄的東西是否可惜。"                          
## [40] "40、能使我們感覺快樂的，不是環境，而是態度。"                                                    
## [41] "41、人生的冷暖取決於心靈的溫度。"                                                                
## [42] "42、勤奮和智慧是雙胞胎，懶惰和愚蠢是親兄弟。"                                                    
## [43] "43、幸福是簡單的，只要你有一顆善良的心，你永遠都會感受到它的存在。"                              
## [44] "44、幸福其實就是一杯白開水，平平淡淡，卻孕育著無限身機。"                                        
## [45] "45、幸福便是那勞動著的美麗，縱然汗流浹背，千辛萬苦，卻又苦中透甜。"                              
## [46] "46、幸福從來不在於你擁有什麼，幸福在於用自己的能力去努力創造，去用心感受。"                      
## [47] "47、幸福沒有標準的，它好像一道門檻，高低與否取決於自己的看法與定位。"                            
## [48] "48、真正的幸福是一點一點爭取的，一天一天積累的。不要計較太多的得與失，要學會用一顆寬容的心包容。"
## [49] "49、幸與不幸其實也是一種福"                                                                      
## [50] "50、生活可以是甜的，也可以是苦的，但不能是沒味的。你可以勝利，也可以失敗，但你不能屈服。"

numq2<-length(quotes2)
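
Since the two pages differ only in the separator that follows the quote number, one helper could serve both. The sketch below is an alternative to the split-and-convert check, not the method used above: it tests for a leading number with a regular expression. The function `is_numbered` and the sample strings are hypothetical.

```r
# Regex-based check for a leading quote number.
# "\u3001" is the ideographic comma 、 used on the happiness page.
is_numbered <- function(x, sep = c(".", "\u3001")) {
  # Build a character class such as [.、] from the allowed separators
  pattern <- paste0("^[0-9]+[", paste(sep, collapse = ""), "]")
  grepl(pattern, x)
}

# Hypothetical strings standing in for scraped paragraph text
sample_text <- c("1. a quote", "2\u3001another quote",
                 "Share this page", "2023 was a good year")
flags <- is_numbered(sample_text)
sample_text[flags]
```

Because `grepl()` is vectorized, no `sapply()` loop or `suppressWarnings()` call is needed in this variant.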

According to past studies, depressed people tend to use the first-person singular pronoun (i.e., I or 我) more often in their texts than non-depressed people do. This finding has been reported consistently and suggests that depressed people focus on themselves. It is therefore reasonable to expect the proportion of the first-person singular pronoun to be significantly higher in the breaking-up quotes than in the happiness quotes. To test this hypothesis, we first combine the quotes from each web page into one large string, which gives two strings: one for breaking-up and one for happiness. We then perform Chinese word segmentation with {jiebaR} to identify the words in each group. Finally, we compare the proportions of the first-person singular pronoun in the two groups with a z test.

# Combine quotes from the same web page as a large string

q1<-paste0(quotes1,collapse=" ")
q2<-paste0(quotes2,collapse=" ")

library(jiebaR)

## Loading required package: jiebaRD

wk<-worker()
# Chinese word segmentation
words1<-wk[q1]
words2<-wk[q2]
# Proportion of 我 in each group; the quote numbers are excluded from the word totals
p1<-sum(words1=="我")/(length(words1)-numq1)
p2<-sum(words2=="我")/(length(words2)-numq2)
p<-(sum(words1=="我")+sum(words2=="我"))/(length(words1)-numq1+length(words2)-numq2)
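
Subtracting the number of quotes assumes that the segmenter returns exactly one numeric token per quote. A possible alternative, sketched here on a hypothetical token vector rather than on real segmenter output, is to drop purely numeric tokens before counting.

```r
target <- "\u6211"  # the first-person singular pronoun 我

# Hypothetical tokens, as a segmenter might return them for two numbered quotes
tokens <- c("1", target, "\u60f3", "\u4f60", "2", target, "\u5f88", "\u597d")

# Drop tokens that consist only of digits, then compute the proportion of 我
is_number <- grepl("^[0-9]+$", tokens)
word_tokens <- tokens[!is_number]
prop_target <- sum(word_tokens == target) / length(word_tokens)
prop_target
```

Both approaches give the same denominator whenever each quote contributes exactly one number token.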

The z score of the difference between two proportions can be computed by

\[\begin{align} Z=\frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{t_1}+\frac{1}{t_2})}}, \end{align}\]

where \(\hat{p}_1=\frac{c_1}{t_1}\), \(\hat{p}_2=\frac{c_2}{t_2}\), and the pooled proportion is \(\hat{p}=\frac{c_1+c_2}{t_1+t_2}\). Here \(c_1\) is the count of “我” in the breaking-up quotes and \(c_2\) is the count in the happiness quotes. The total word count \(t_1\) is the number of segmented words in the breaking-up quotes minus the 22 quote numbers. Likewise, \(t_2\) is the number of segmented words in the happiness quotes minus the 50 quote numbers. The one-sided p value computed below is far smaller than .05, so the difference is significant. This supports our hypothesis that the first-person singular pronoun is used more frequently in the breaking-up quotes than in the happiness quotes.

denominator<-p*(1-p)*(1/(length(words1)-numq1)+1/(length(words2)-numq2))
Z<-(p1-p2)/sqrt(denominator)
c(p1,p2)

## [1] 0.058823529 0.003875969

1-pnorm(Z,0,1)

## [1] 1.128529e-09
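
The same test can be wrapped in a small function and cross-checked against base R. This is a sketch under stated assumptions: the function name `two_prop_z` and the counts passed to it are hypothetical, not the counts from the quotes. `prop.test()` without continuity correction reports a chi-squared statistic equal to the squared Z score.

```r
# Pooled two-proportion z test (upper-tailed), following the formula above
two_prop_z <- function(c1, t1, c2, t2) {
  p1 <- c1 / t1
  p2 <- c2 / t2
  p_pooled <- (c1 + c2) / (t1 + t2)
  z <- (p1 - p2) / sqrt(p_pooled * (1 - p_pooled) * (1 / t1 + 1 / t2))
  list(z = z, p_upper = pnorm(z, lower.tail = FALSE))
}

# Hypothetical counts: 15 of 300 words versus 5 of 600 words
res <- two_prop_z(c1 = 15, t1 = 300, c2 = 5, t2 = 600)

# Cross-check with prop.test(); correct = FALSE turns off the Yates correction
chk <- prop.test(x = c(15, 5), n = c(300, 600),
                 alternative = "greater", correct = FALSE)
c(z_squared = res$z^2, chi_squared = unname(chk$statistic))
c(p_manual = res$p_upper, p_prop_test = chk$p.value)
```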

In addition to the first-person singular pronoun, we can also check other LIWC word categories, such as negative-emotion, positive-emotion, and sad words. To this end, we need two LIWC files. The file cliwc_v1.1.txt contains Chinese words and their corresponding category codes. In each line of this file the fields are separated by tabs, which we replace with spaces so that each line can be split on a single delimiter. In total, this CLIWC dictionary contains 7419 words.

# Import all Chinese LIWC words and their codes
cliwc_codes<-scan("cliwc_v1.1.txt",what=character(),sep="\n")
# Replace tabs with spaces
cliwc_codes<-sapply(cliwc_codes,function(x){
  return(gsub("\t"," ",x))
})
# Get words only
cliwc_ws<-sapply(cliwc_codes,function(x)unlist(strsplit(x," "))[1])
names(cliwc_codes)<-NULL
length(cliwc_ws)

## [1] 7419
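
To illustrate the parsing step without the dictionary file, the sketch below splits a few tab-separated lines directly. The lines in `dict_lines` are hypothetical and assume a simple "word, then codes" layout; the real layout of cliwc_v1.1.txt may differ.

```r
# Hypothetical dictionary lines: a word followed by its category codes
dict_lines <- c("happy\t125\t126", "cry\t125\t127\t130", "think\t131")

# Split each line on tabs; fixed = TRUE treats "\t" as a literal tab
fields <- strsplit(dict_lines, "\t", fixed = TRUE)

# First field is the word, the remaining fields are its codes
dict_words <- vapply(fields, function(f) f[1], character(1))
dict_codes <- lapply(fields, function(f) f[-1])
names(dict_codes) <- dict_words

dict_codes[["cry"]]
```

Splitting on the tab directly avoids the intermediate step of replacing tabs with spaces.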

Next, we convert every word in each group to its CLIWC codes. To this end, we write a function that looks each word up in the dictionary and returns its codes as one space-separated string, or 0 if the word is not in the dictionary.

CLIWC_codes<-function(v){
  One_codes<-sapply(v,function(x){
    codes<-lapply(x,function(y){
        flag<-which(y==cliwc_ws)
        if(length(flag)==0){
           return(0)
        }else{
           tt<-unlist(strsplit(cliwc_codes[flag]," "))[-c(1,2)]
           return(tt)  
        }
    })
    return(paste(unlist(codes),collapse=" "))
  })
  names(One_codes)<-NULL
  return(One_codes)
}
codes1<-CLIWC_codes(words1)
codes2<-CLIWC_codes(words2)
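
The lookup can also be expressed with `match()`, which finds the dictionary position of every word in one call. This is a sketch on a hypothetical three-word dictionary, not a replacement for the function above; note that `match()` returns only the first matching entry, whereas `which()` returns all of them.

```r
# Hypothetical dictionary: words and their category codes
dict_words <- c("happy", "cry", "think")
dict_codes <- list(c("125", "126"), c("125", "127", "130"), "131")

# Hypothetical segmented words
tokens <- c("I", "cry", "and", "think")

# Position of each token in the dictionary (NA if absent)
idx <- match(tokens, dict_words)

# One space-separated code string per token; "0" marks words not in the dictionary
token_codes <- vapply(idx, function(i) {
  if (is.na(i)) "0" else paste(dict_codes[[i]], collapse = " ")
}, character(1))
token_codes
```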

Suppose we focus on the proportions of affective, positive-emotion, negative-emotion, sad, and cognitive-mechanism words in the CLIWC dictionary. The file cliwc_index.txt shows that the codes for these categories are 125, 126, 127, 130, and 131, respectively. We therefore compute the proportion of each category in each group.

# Flag, for each word in the breaking-up quotes, which of the five LIWC categories it belongs to
Ps1<-lapply(codes1,function(t){
     return(list(affect=grepl("125",t),
                 post=grepl("126",t),
                 neg=grepl("127",t),
                 sad=grepl("130",t),
                 cog=grepl("131",t)))
})
# Flag, for each word in the happiness quotes, which of the five LIWC categories it belongs to
Ps2<-lapply(codes2,function(t){
     return(list(affect=grepl("125",t),
                 post=grepl("126",t),
                 neg=grepl("127",t),
                 sad=grepl("130",t),
                 cog=grepl("131",t)))
})
totalWords1<-length(words1)-numq1
totalWords2<-length(words2)-numq2
affect1<-sum(sapply(Ps1,"[[","affect"))/(totalWords1)
affect2<-sum(sapply(Ps2,"[[","affect"))/(totalWords2)
post1<-sum(sapply(Ps1,"[[","post"))/(totalWords1)
post2<-sum(sapply(Ps2,"[[","post"))/(totalWords2)
neg1<-sum(sapply(Ps1,"[[","neg"))/(totalWords1)
neg2<-sum(sapply(Ps2,"[[","neg"))/(totalWords2)
sad1<-sum(sapply(Ps1,"[[","sad"))/(totalWords1)
sad2<-sum(sapply(Ps2,"[[","sad"))/(totalWords2)
cog1<-sum(sapply(Ps1,"[[","cog"))/(totalWords1)
cog2<-sum(sapply(Ps2,"[[","cog"))/(totalWords2)
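
The ten proportion lines follow one pattern, so they could be generated from a named vector of codes. The sketch below uses hypothetical code strings and matches each code as a whole word with `\\b`, which guards against a code matching part of a longer one. Unlike the code above, it divides by the number of strings supplied, so numeric tokens would have to be removed beforehand.

```r
# Category codes as given in the text
liwc_codes <- c(affect = "125", post = "126", neg = "127", sad = "130", cog = "131")

# Hypothetical code strings, one per word ("0" = not in the dictionary)
token_codes <- c("0", "125 126", "125 127 130", "131", "0")

# TRUE where the code appears as a whole word in the string
has_code <- function(code, x) grepl(paste0("\\b", code, "\\b"), x)

# Proportion of words carrying each category code
props <- vapply(liwc_codes, function(code) mean(has_code(code, token_codes)),
                numeric(1))
props
```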

Let’s plot the proportions of these LIWC words across quote types.

library(ggplot2)
# Type: B = breaking-up quotes, H = happiness quotes
fig_dta<-data.frame(Prob=c(affect1,affect2,post1,post2,neg1,neg2,sad1,sad2,cog1,cog2),
                    Type=rep(c("B","H"),5),
                    LIWC=rep(c("affect","post","neg","sad","cog"),each=2))
ggplot(fig_dta,aes(LIWC,Prob))+
  geom_bar(aes(fill=Type),stat="identity",position=position_dodge())

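Let’s check whether the proportions of these LIWC words differ between the two types of quotes. Because the p values are computed as 1-pnorm(Z), the test is one-sided and asks whether a category is used more frequently in the breaking-up quotes. The results show that affective words and cognitive-mechanism words are used significantly more frequently in the breaking-up quotes than in the happiness quotes. Positive-emotion, negative-emotion, and sad words are not used more frequently in the breaking-up quotes. Their Z scores are negative, and the Z score for sad words (-2.32) would be significant in the opposite direction under a two-sided test at the .05 level.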

# Compute Z scores
den<-1/totalWords1+1/totalWords2
affectP<-sum(sapply(Ps1,"[[","affect"))+sum(sapply(Ps2,"[[","affect"))
affectP<-affectP/(totalWords1+totalWords2)
Zaffect<-(affect1-affect2)/sqrt(affectP*(1-affectP)*den)
postP<-sum(sapply(Ps1,"[[","post"))+sum(sapply(Ps2,"[[","post"))
postP<-postP/(totalWords1+totalWords2)
Zpost<-(post1-post2)/sqrt(postP*(1-postP)*den)
negP<-sum(sapply(Ps1,"[[","neg"))+sum(sapply(Ps2,"[[","neg"))
negP<-negP/(totalWords1+totalWords2)
Zneg<-(neg1-neg2)/sqrt(negP*(1-negP)*den)
sadP<-sum(sapply(Ps1,"[[","sad"))+sum(sapply(Ps2,"[[","sad"))
sadP<-sadP/(totalWords1+totalWords2)
Zsad<-(sad1-sad2)/sqrt(sadP*(1-sadP)*den)
cogP<-sum(sapply(Ps1,"[[","cog"))+sum(sapply(Ps2,"[[","cog"))
cogP<-cogP/(totalWords1+totalWords2)
Zcog<-(cog1-cog2)/sqrt(cogP*(1-cogP)*den)
# Show the Z scores for the five categories
c(Zaffect,Zpost,Zneg,Zsad,Zcog)

## [1]  2.200349 -1.240700 -1.661182 -2.315481  2.519632

# One-sided (upper-tail) p values
1-pnorm(c(Zaffect,Zpost,Zneg,Zsad,Zcog),0,1)

## [1] 0.013891080 0.892641612 0.951661521 0.989706686 0.005873881
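
The upper-tail p values only detect categories that are more frequent in the breaking-up quotes. As a complementary check, the sketch below takes the Z scores printed above (rounded to six decimals) and also computes two-sided p values; the variable names are illustrative.

```r
# Z scores as printed above, rounded to six decimals
z <- c(affect = 2.200349, post = -1.240700, neg = -1.661182,
       sad = -2.315481, cog = 2.519632)

p_upper <- pnorm(z, lower.tail = FALSE)  # same as 1 - pnorm(z)
p_two_sided <- 2 * pnorm(-abs(z))        # difference in either direction

round(cbind(z, p_upper, p_two_sided), 4)
```

Under the two-sided test the p value for sad words is about 0.021, reflecting a lower proportion in the breaking-up quotes. Which test is appropriate depends on whether the direction of the difference was hypothesized in advance.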


Lee-Xieng Yang

Detecting psychological attributes reflected in texts
