Web Scraper for Chinese websites

A web page contains many kinds of information, and text is among the most informative. We can learn a lot from the words on a web page: news, celebrity gossip, job advertisements, weather forecasts, and so on. How to extract textual information from a web page, and what analyses we can run on it, are important topics in text analysis for social media.

In this note, I am going to show how to scrape the words from a web page. To do this, we first need some basic concepts about web pages. A web page is not just a text file. In fact, it is a document written in HTML. The HTML codes define the structure of the article shown in the page, such as the levels of headings, the division into paragraphs, and where pictures are inserted. Thus, a web page is far more complicated than what we see. Here is an HTML tutorial for reference.

Let’s look at this web page. If you are reading it in Safari on a Mac, you can right-click on the page and click Inspect Element at the bottom of the menu to show its HTML codes. If you are using Google Chrome, right-click on the page and choose Inspect at the bottom of the menu. See the figure below.

The title of this article is highlighted in blue. The corresponding HTML code is marked in the lower window. The HTML code h1 means a level 1 heading. The class attribute assigns a class name to the node, which a style sheet can then use to style it. Here the class name is “post__title”. We can use functions in the package {rvest} to retrieve the title by combining the node name h1 and the class name post__title.

library(rvest)
url<-"https://dq.yam.com/post/16046"
html<-read_html(url)
mode(html)

## [1] "list"

In this example, we save the address of this web page, namely its URL, as a character variable. Subsequently, we use the function read_html to download and parse that web page. The result, which is called html here, is an XML document object whose mode is a list.

title<-html %>% html_node("h1.post__title") %>% html_text()
title

## [1] "年輕人瘋「躺床耍廢」，用休息時間抗拒生產力文化"
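The selector "h1.post__title" joins the node name and the class name with a dot. As a minimal sketch that runs without an internet connection, the snippet below parses a small made-up HTML string (the title and author in it are placeholders, not taken from the real page) and applies the same kind of selector.

```r
library(rvest)

# A tiny, invented page that mimics the structure described above
toy_page <- read_html('<html><body>
  <h1 class="post__title">A sample title</h1>
  <span class="post__author">Sample Author</span>
</body></html>')

# "h1.post__title" means: an h1 node whose class is post__title
toy_title <- toy_page %>% html_node("h1.post__title") %>% html_text()
toy_title
```

Trying selectors on a small string like this is a cheap way to check them before pointing the code at a live page.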

We can also retrieve the author name and publishing time of this post. Again, we inspect the HTML codes and find that the author name is in a span node with the class name “post__author”.

author<-html %>% html_node("span.post__author") %>% html_text()

Similarly, the publishing time is in a time node with the class name “post__publishTime”, and we can get it using the code below.

time<-html %>% html_node("time.post__publishTime") %>% html_text()

We continue checking the HTML codes to find out where the article is. The article consists of two parts: a summary paragraph ahead of the main article, and the main body of the article. The summary paragraph is in the first p node of the page. The main body is in a div node with the class name “post__contentWrap”. Thus, we can retrieve the summary paragraph and the main body of this article.

summary<-html %>% html_node("p") %>% html_text()
text<-html %>% html_node("div.post__contentWrap") %>% html_text()
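One caveat about html_node("p"): it returns only the first matching node, so it works here only when the summary is the first paragraph of the page. The sketch below, on an invented HTML string, contrasts html_node with html_nodes (which returns all matches) and shows the trim argument of html_text, which strips leading and trailing white space.

```r
library(rvest)

# Invented page: one summary paragraph, then a body with two paragraphs
toy_page <- read_html('<html><body>
  <p>  Summary paragraph.  </p>
  <div class="post__contentWrap"><p>First body paragraph.</p><p>Second body paragraph.</p></div>
</body></html>')

# html_node: first match only
first_p <- toy_page %>% html_node("p") %>% html_text(trim = TRUE)

# html_nodes: every match, returned as a character vector after html_text
all_p <- toy_page %>% html_nodes("p") %>% html_text(trim = TRUE)

# Restrict the search to paragraphs inside the content wrapper
body_p <- toy_page %>% html_nodes("div.post__contentWrap p") %>% html_text()

first_p
length(all_p)
body_p
```

Collecting the body paragraph by paragraph, as in body_p, keeps the paragraph boundaries that a single html_text call on the wrapper would merge.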

The text retrieved from html is saved as a character vector of length 1. We call summary to check whether we have retrieved the summary paragraph correctly, and we do the same for the main body by calling text. The summary paragraph looks fine. However, the last sentence of text does not belong to the article.

mode(summary)

## [1] "character"

length(summary)

## [1] 1

mode(text)

## [1] "character"

length(text)

## [1] 1

summary

## [1] "2023下半年TikTok上出現一波所謂「爛在床上」（bed rotting）風潮，其實跟「慵懶過一天」的概念很類似，指的就是躺在床上什麼都不做，乍聽之下似乎是個相當適合放鬆充電的休息模式，相關影片累積瀏覽次數多到以億計，號稱出發點是基於自我保健和減輕壓力。"

text

## [1] "        「床上」馬鈴薯美國哥倫比亞廣播公司（CBS）指出，「爛在床上」是Z世代在TikTok上流行的術語，意思是選擇整天躺床上作為一種自我保健。做法包括賴在床上或舒服被窩裡花幾個小時滑手機、追劇、看書和吃零食等，好此道者聲稱這有助應對壓力和焦慮。睡眠專家兼行為科學家希爾（Vanessa Hill）說：「這有點像藉由什麼都不做、花時間休息來抗拒生產力文化。」她也在TikTok發短影音加入這股風潮，瀏覽次數超過260萬。希爾表示人們過去多將休息與懶畫上等號並引以為恥，如今這股發布並分享自己賴在床上的影片風潮，有助於洗刷「休息」的刻板印象。她說：「我們很多人其實都累了，相信觀看這類影音的人能感同身受，即以往當想休息時會有壓力。然而像爛在床上這股風潮並非真的只是在床上浪費時間，是讓自己少做一點且告訴自己這麼做不可恥。」只會越躺越累不過這波「爛在床上」遭健康團體與醫界抨擊，尤其是診治焦慮與憂鬱的精神科醫師。精神科醫師瓜塔姆（Rishi Guatam）就說，這股風潮會因加劇肥胖與心血管疾病而讓整體健康惡化。另有不少醫師認為「爛在床上」會打亂睡眠作息。倦怠感是真實存在，也會影響身心健康，當偶爾覺得精疲力盡時，放鬆個一天無傷大雅，一定限度內的休息和充電固然重要，但若搞成悖離正常生活或逃避承擔日常責任時，就可能是個得看醫生的信號。清醒時一直躺床上可能會導致失眠。醫師表示重要的是應將床與睡眠聯繫，而不是工作、吃飯、看電視或只是保持清醒的地方，否則無形中會訓練大腦要在床上時保持清醒。        爛在床上，會發芽的！缺乏活動會成為導致焦慮和憂鬱的一個重要因素。身體不活動的時間越長或不活動的次數越多，焦慮與憂鬱的產生或惡化風險就越大，然後又會降低動力並引發疲勞的惡性循環；而像憂鬱症的有效治療通常包括體力活動、社交互動和設法解決問題。瓜塔姆說：「如果發現自己越來越陷進去這個（爛在床上）無法自拔時，就充分表明這個人出了問題。」愛因斯坦醫學院蒙特菲爾醫學中心（Montefiore-Einstein Center for Cancer Care）首席心理學家芮哥（Simon Rego）指出，「平衡」對於健康很重要，躺在床上的時間太長會打亂人類的情緒，甚至還會增加壓力。芮哥呼籲「不論當時感覺有多舒服，都要避免過度」，還說躺在床上時間超過24小時就已算是太超過，有可能導致各種不同的心理健康問題，若頻繁出現「想整天躺在床上的衝動」就要有所警覺。避免被社會孤立鹽湖城猶他大學（The University of Utah）家庭和預防醫學系教授巴隆（Kelly Glazer Baron）表示，床只能用在睡覺和親密行為，而非追劇、工作甚至吃飯等，如果上床半小時內都無法入睡或夜間醒來超過20分鐘，就應離開床。她說，從睡眠科學的角度分析，「爛在床上」跟專家希望民眾做的事剛好相反，不僅會影響心理健康，還可能讓睡眠品質惡化。福斯新聞（Fox News）引述克里夫蘭臨床兒童醫院（Cleveland Clinic Children’s）兒童心理學家馬德（Emily Mudd）博士說：「若出於逃避某些事而爛在床上、或身心覺得無法離開床，都會是大問題。例如若是因對某件事感到焦慮而選擇待床上，或是為了避免社交。」馬德不諱言，無窮無盡的工作壓力會讓人們不知所措，尤其是對孩子，因此休息一天讓身心喘息是件好事；但若將「爛在床上」作為應對生活遭遇到問題的頭號方法，就恐淪為社會孤立，而社會孤立正是憂鬱和焦慮的危險因子。馬德說：「如果你是家長，當發現孩子已在床上待很長一段時間，那就得留神了。兒童有社交、發展和情感需求，例如與同儕相處和課堂學習，這些需求無法透過一個人躺在床上滿足。休息固然重要，但社交、情感和認知發展也至關重要。」她鼓勵發現孩子出現憂鬱或焦慮等精神健康障礙症狀的父母，快向醫療專業人員求助。        單篇文章贊助 定期／年度贊助 我們為您在DQ飛行船預留了VIP位子，期待您登船贊助DQ"

After we scrape the words from a web page, the first thing we need to do is data cleaning. That is, we have to remove symbols, unnecessary words, functional codes (or macros), etc., to make the later analysis easier. Here, the last sentence obviously should be omitted. Visual inspection of text suggests that the article is delimited by a long run of spaces, and that the unwanted sentence is the final part. Thus, we can use strsplit( ) to split text at this run of spaces. As the output of strsplit( ) is a list, we use unlist( ) to turn it into a character vector. Since the separator occurs three times, text is divided into four parts (the first is an empty string, because text begins with the separator), and the last one is the sentence we do not want.

text1<-unlist(strsplit(text,"        "))
length(text1)

## [1] 4

text1<-text1[-length(text1)]
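Splitting on an exact number of spaces breaks if another page uses a different amount of padding. A slightly more defensive variant, sketched here on a made-up string, splits on any run of two or more spaces, drops any empty pieces, and then removes the trailing footer. The variable names and the toy string are only for illustration.

```r
# Invented string with the same layout: padding, two parts, then a footer
toy_text <- "        Part one of the article.        Part two of the article.        Sponsor footer"

# " {2,}" is a regular expression for two or more consecutive spaces,
# so single spaces inside a sentence are left alone
pieces <- unlist(strsplit(toy_text, " {2,}"))

# Drop empty strings produced by leading padding
pieces <- pieces[nzchar(pieces)]

# Drop the last piece, assumed here to be the footer
body_parts <- pieces[-length(pieces)]
body_parts
```

This still assumes that the footer is always the last piece, which should be checked by eye on each site.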

Now we combine the summary paragraph and the cleaned article into the full article. Together with the author’s name and the publishing time, we can create a data frame to store all these data.

post<-paste(c(summary,text1),collapse=" ")
post

## [1] "2023下半年TikTok上出現一波所謂「爛在床上」（bed rotting）風潮，其實跟「慵懶過一天」的概念很類似，指的就是躺在床上什麼都不做，乍聽之下似乎是個相當適合放鬆充電的休息模式，相關影片累積瀏覽次數多到以億計，號稱出發點是基於自我保健和減輕壓力。  「床上」馬鈴薯美國哥倫比亞廣播公司（CBS）指出，「爛在床上」是Z世代在TikTok上流行的術語，意思是選擇整天躺床上作為一種自我保健。做法包括賴在床上或舒服被窩裡花幾個小時滑手機、追劇、看書和吃零食等，好此道者聲稱這有助應對壓力和焦慮。睡眠專家兼行為科學家希爾（Vanessa Hill）說：「這有點像藉由什麼都不做、花時間休息來抗拒生產力文化。」她也在TikTok發短影音加入這股風潮，瀏覽次數超過260萬。希爾表示人們過去多將休息與懶畫上等號並引以為恥，如今這股發布並分享自己賴在床上的影片風潮，有助於洗刷「休息」的刻板印象。她說：「我們很多人其實都累了，相信觀看這類影音的人能感同身受，即以往當想休息時會有壓力。然而像爛在床上這股風潮並非真的只是在床上浪費時間，是讓自己少做一點且告訴自己這麼做不可恥。」只會越躺越累不過這波「爛在床上」遭健康團體與醫界抨擊，尤其是診治焦慮與憂鬱的精神科醫師。精神科醫師瓜塔姆（Rishi Guatam）就說，這股風潮會因加劇肥胖與心血管疾病而讓整體健康惡化。另有不少醫師認為「爛在床上」會打亂睡眠作息。倦怠感是真實存在，也會影響身心健康，當偶爾覺得精疲力盡時，放鬆個一天無傷大雅，一定限度內的休息和充電固然重要，但若搞成悖離正常生活或逃避承擔日常責任時，就可能是個得看醫生的信號。清醒時一直躺床上可能會導致失眠。醫師表示重要的是應將床與睡眠聯繫，而不是工作、吃飯、看電視或只是保持清醒的地方，否則無形中會訓練大腦要在床上時保持清醒。 爛在床上，會發芽的！缺乏活動會成為導致焦慮和憂鬱的一個重要因素。身體不活動的時間越長或不活動的次數越多，焦慮與憂鬱的產生或惡化風險就越大，然後又會降低動力並引發疲勞的惡性循環；而像憂鬱症的有效治療通常包括體力活動、社交互動和設法解決問題。瓜塔姆說：「如果發現自己越來越陷進去這個（爛在床上）無法自拔時，就充分表明這個人出了問題。」愛因斯坦醫學院蒙特菲爾醫學中心（Montefiore-Einstein Center for Cancer Care）首席心理學家芮哥（Simon Rego）指出，「平衡」對於健康很重要，躺在床上的時間太長會打亂人類的情緒，甚至還會增加壓力。芮哥呼籲「不論當時感覺有多舒服，都要避免過度」，還說躺在床上時間超過24小時就已算是太超過，有可能導致各種不同的心理健康問題，若頻繁出現「想整天躺在床上的衝動」就要有所警覺。避免被社會孤立鹽湖城猶他大學（The University of Utah）家庭和預防醫學系教授巴隆（Kelly Glazer Baron）表示，床只能用在睡覺和親密行為，而非追劇、工作甚至吃飯等，如果上床半小時內都無法入睡或夜間醒來超過20分鐘，就應離開床。她說，從睡眠科學的角度分析，「爛在床上」跟專家希望民眾做的事剛好相反，不僅會影響心理健康，還可能讓睡眠品質惡化。福斯新聞（Fox News）引述克里夫蘭臨床兒童醫院（Cleveland Clinic Children’s）兒童心理學家馬德（Emily Mudd）博士說：「若出於逃避某些事而爛在床上、或身心覺得無法離開床，都會是大問題。例如若是因對某件事感到焦慮而選擇待床上，或是為了避免社交。」馬德不諱言，無窮無盡的工作壓力會讓人們不知所措，尤其是對孩子，因此休息一天讓身心喘息是件好事；但若將「爛在床上」作為應對生活遭遇到問題的頭號方法，就恐淪為社會孤立，而社會孤立正是憂鬱和焦慮的危險因子。馬德說：「如果你是家長，當發現孩子已在床上待很長一段時間，那就得留神了。兒童有社交、發展和情感需求，例如與同儕相處和課堂學習，這些需求無法透過一個人躺在床上滿足。休息固然重要，但社交、情感和認知發展也至關重要。」她鼓勵發現孩子出現憂鬱或焦慮等精神健康障礙症狀的父母，快向醫療專業人員求助。"

data<-data.frame(Author=author,Time=time,Title=title,Post=post)
data

##    Author       Time                                          Title
## 1 許家銘  2024-04-22 年輕人瘋「躺床耍廢」，用休息時間抗拒生產力文化
##   Post
## 1 2023下半年TikTok上出現一波所謂「爛在床上」（bed rotting）風潮，其實跟「慵懶過一天」的概念很類似，指的就是躺在床上什麼都不做，乍聽之下似乎是個相當適合放鬆充電的休息模式，相關影片累積瀏覽次數多到以億計，號稱出發點是基於自我保健和減輕壓力。  「床上」馬鈴薯美國哥倫比亞廣播公司（CBS）指出，「爛在床上」是Z世代在TikTok上流行的術語，意思是選擇整天躺床上作為一種自我保健。做法包括賴在床上或舒服被窩裡花幾個小時滑手機、追劇、看書和吃零食等，好此道者聲稱這有助應對壓力和焦慮。睡眠專家兼行為科學家希爾（Vanessa Hill）說：「這有點像藉由什麼都不做、花時間休息來抗拒生產力文化。」她也在TikTok發短影音加入這股風潮，瀏覽次數超過260萬。希爾表示人們過去多將休息與懶畫上等號並引以為恥，如今這股發布並分享自己賴在床上的影片風潮，有助於洗刷「休息」的刻板印象。她說：「我們很多人其實都累了，相信觀看這類影音的人能感同身受，即以往當想休息時會有壓力。然而像爛在床上這股風潮並非真的只是在床上浪費時間，是讓自己少做一點且告訴自己這麼做不可恥。」只會越躺越累不過這波「爛在床上」遭健康團體與醫界抨擊，尤其是診治焦慮與憂鬱的精神科醫師。精神科醫師瓜塔姆（Rishi Guatam）就說，這股風潮會因加劇肥胖與心血管疾病而讓整體健康惡化。另有不少醫師認為「爛在床上」會打亂睡眠作息。倦怠感是真實存在，也會影響身心健康，當偶爾覺得精疲力盡時，放鬆個一天無傷大雅，一定限度內的休息和充電固然重要，但若搞成悖離正常生活或逃避承擔日常責任時，就可能是個得看醫生的信號。清醒時一直躺床上可能會導致失眠。醫師表示重要的是應將床與睡眠聯繫，而不是工作、吃飯、看電視或只是保持清醒的地方，否則無形中會訓練大腦要在床上時保持清醒。 爛在床上，會發芽的！缺乏活動會成為導致焦慮和憂鬱的一個重要因素。身體不活動的時間越長或不活動的次數越多，焦慮與憂鬱的產生或惡化風險就越大，然後又會降低動力並引發疲勞的惡性循環；而像憂鬱症的有效治療通常包括體力活動、社交互動和設法解決問題。瓜塔姆說：「如果發現自己越來越陷進去這個（爛在床上）無法自拔時，就充分表明這個人出了問題。」愛因斯坦醫學院蒙特菲爾醫學中心（Montefiore-Einstein Center for Cancer Care）首席心理學家芮哥（Simon Rego）指出，「平衡」對於健康很重要，躺在床上的時間太長會打亂人類的情緒，甚至還會增加壓力。芮哥呼籲「不論當時感覺有多舒服，都要避免過度」，還說躺在床上時間超過24小時就已算是太超過，有可能導致各種不同的心理健康問題，若頻繁出現「想整天躺在床上的衝動」就要有所警覺。避免被社會孤立鹽湖城猶他大學（The University of Utah）家庭和預防醫學系教授巴隆（Kelly Glazer Baron）表示，床只能用在睡覺和親密行為，而非追劇、工作甚至吃飯等，如果上床半小時內都無法入睡或夜間醒來超過20分鐘，就應離開床。她說，從睡眠科學的角度分析，「爛在床上」跟專家希望民眾做的事剛好相反，不僅會影響心理健康，還可能讓睡眠品質惡化。福斯新聞（Fox News）引述克里夫蘭臨床兒童醫院（Cleveland Clinic Children’s）兒童心理學家馬德（Emily Mudd）博士說：「若出於逃避某些事而爛在床上、或身心覺得無法離開床，都會是大問題。例如若是因對某件事感到焦慮而選擇待床上，或是為了避免社交。」馬德不諱言，無窮無盡的工作壓力會讓人們不知所措，尤其是對孩子，因此休息一天讓身心喘息是件好事；但若將「爛在床上」作為應對生活遭遇到問題的頭號方法，就恐淪為社會孤立，而社會孤立正是憂鬱和焦慮的危險因子。馬德說：「如果你是家長，當發現孩子已在床上待很長一段時間，那就得留神了。兒童有社交、發展和情感需求，例如與同儕相處和課堂學習，這些需求無法透過一個人躺在床上滿足。休息固然重要，但社交、情感和認知發展也至關重要。」她鼓勵發現孩子出現憂鬱或焦慮等精神健康障礙症狀的父母，快向醫療專業人員求助。
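If several posts are to be collected, the steps so far can be wrapped in a function. The following is only a sketch, assuming that every post uses the same class names as the one above. The function name extract_post is hypothetical, and it is demonstrated on an invented HTML string rather than the live site. It does not include the footer-cleaning step.

```r
library(rvest)

# Hypothetical helper: takes a parsed document, returns a one-row data frame.
# A selector that matches nothing yields NA in that column.
extract_post <- function(doc) {
  get_text <- function(css) doc %>% html_node(css) %>% html_text(trim = TRUE)
  data.frame(
    Author = get_text("span.post__author"),
    Time   = get_text("time.post__publishTime"),
    Title  = get_text("h1.post__title"),
    Post   = get_text("div.post__contentWrap"),
    stringsAsFactors = FALSE
  )
}

# Invented page with placeholder values
toy_doc <- read_html('<html><body>
  <h1 class="post__title">A sample title</h1>
  <span class="post__author">Sample Author</span>
  <time class="post__publishTime">2024-01-01</time>
  <div class="post__contentWrap">Body text.</div>
</body></html>')

toy_data <- extract_post(toy_doc)
toy_data
```

For real pages, the result of read_html(url) would be passed in place of toy_doc, and rows from several posts could be stacked with rbind( ).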

Chinese Word Segmentation

A sentence is composed of words. In English sentences, such as “This is a book.”, adjacent words are separated by a space, so a machine can easily identify each word. For Chinese sentences, this job is not that easy, as there is no space between words. Luckily, tools for Chinese word segmentation are available. Academia Sinica in Taiwan has developed one called CKIP, which however can only be called from Python. Another is the open-source tool 結巴, or jieba, which can be called from Python and R. In R, the package {jiebaR} provides functions for using jieba for Chinese word segmentation. The detailed introduction to jiebaR can be found here. We use the function worker( ) to create an environment in R for doing word segmentation. See the example below.

library(jiebaR)

## Loading required package: jiebaRD

cutter<-worker()
cutter["今天的天氣好溫暖"]

## [1] "今天" "的"   "天氣" "好"   "溫暖"
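To see why a segmenter is needed at all, compare what happens when we simply split on spaces. This base-R sketch uses the two example sentences from this section. The variable names are only for illustration.

```r
english <- "This is a book."
chinese <- "今天的天氣好溫暖"

# Splitting on spaces recovers the English words
en_words <- unlist(strsplit(english, " "))
en_words

# The Chinese sentence has no spaces, so it comes back as one piece
zh_words <- unlist(strsplit(chinese, " "))
length(zh_words)

# Splitting into single characters is possible, but characters are not words:
# 今天 and 天氣 are each one word made of two characters
zh_chars <- unlist(strsplit(chinese, ""))
zh_chars
```

Neither naive split gives the word boundaries that cutter finds, which is why a dictionary-based tool such as jieba is used.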

As we can see, this Chinese sentence is split into words. Thus, we can apply cutter to segment the article. There are 805 words in total. Which words represent this article best? One guess is the words that appear most frequently. We can check the word frequencies by using table( ). The length of freq is 460, far less than 805. This is because table( ) counts each distinct word only once, however many times it repeats, whereas words keeps every occurrence. Based on the word frequencies, the most frequent word in this article is 的.

words<-cutter[data$Post]
length(words)

## [1] 805

freq<-table(words)
length(freq)

## [1] 460

w.mode.position<-which.max(freq)
names(freq)[w.mode.position]

## [1] "的"
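Looking only at the single most frequent word hides the rest of the distribution. As a sketch, the following sorts a frequency table and then removes a few function words by hand. The token vector and the stop-word list are invented for illustration. In practice, words from above would take the place of toy_words.

```r
# Invented tokens: 10 occurrences, 5 distinct words
toy_words <- c("的", "床上", "的", "焦慮", "床上", "的", "會", "床上", "壓力", "的")

# Sort the frequency table from most to least frequent
toy_freq <- sort(table(toy_words), decreasing = TRUE)
head(toy_freq, 3)

# A hand-made stop-word list (an assumption, not a standard list)
stop_words <- c("的", "會")
content_words <- toy_words[!toy_words %in% stop_words]

content_freq <- sort(table(content_words), decreasing = TRUE)
names(content_freq)[1]
```

A hand-made stop-word list is crude, since someone has to decide which words go on it.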

This is quite reasonable, as 的 is the most frequent word in Chinese. However, this word tells us nothing specific about this article, so 的 cannot represent it. How can we get the true keywords of an article?

TF-IDF

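TF-IDF (Term Frequency-Inverse Document Frequency) is the product of two statistics: term frequency and inverse document frequency. The term frequency measures how often a term appears in a document. The inverse document frequency is \(\log\left(\frac{N}{n_t}\right)\), where \(N\) is the number of documents and \(n_t\) is the number of documents containing the term. A word such as 的 appears in nearly every document, so its inverse document frequency is close to zero. The code below shows how to get the top 10 words in terms of TF-IDF in jiebaR.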

cutter<-worker("keywords",topn=10)
kwds<-cutter[data$Post]
kwds

## 161.058 117.392 93.9136 93.9136 82.1744 82.1744 82.1744  58.696  58.696  58.696 
##  "床上"    "會"  "爛在"    "躺"    "說"    "與"  "焦慮"    "做"    "時"  "壓力"

The numbers are the TF-IDF scores of these words. These keywords make more sense than before. In this way, we extract keywords from this article.
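To make the definition concrete, here is a toy TF-IDF computation in base R. It is a sketch under stated assumptions: the three "documents" are invented token vectors, the term frequency is taken as the count divided by the document length, and the natural logarithm is used. The scores from jiebaR above will not match this calculation, because jiebaR does not compute document frequencies from our own documents. As far as I can tell, it reads them from a dictionary file given by the idf argument of worker( ).

```r
# Three invented, already segmented documents
docs <- list(
  d1 = c("床", "躺", "的", "床"),
  d2 = c("天氣", "的", "溫暖"),
  d3 = c("的", "書")
)
n_docs <- length(docs)

# Document frequency: in how many documents does each term appear?
doc_freq <- table(unlist(lapply(docs, unique)))

# Inverse document frequency: log(N / n_t)
idf <- setNames(as.numeric(log(n_docs / doc_freq)), names(doc_freq))

# Term frequency within the first document
tf_tab <- table(docs$d1)
tf <- setNames(as.numeric(tf_tab) / length(docs$d1), names(tf_tab))

# TF-IDF is the product of the two
tfidf <- tf * idf[names(tf)]
sort(tfidf, decreasing = TRUE)
```

In this toy corpus 的 appears in all three documents, so its inverse document frequency is log(3/3) = 0 and its TF-IDF score is zero, while 床 gets the highest score in d1.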

Introduction to R

Lee-Xieng Yang
