Since the emergence of Web 2.0, people have become used to sharing their lives on their own blogs and/or social media. Accordingly, social media have become platforms rich in digital footprints that reflect people's attitudes, thoughts, opinions, emotions, and so on. For psychologists interested in human beings, the data on social media are undoubtedly an appealing treasure. However, how to obtain such data has not been covered in regular research-methods classes, for both academic and technical reasons. Traditionally, psychological studies (experiments as well as surveys) were conducted in laboratories. Since the 2000s, psychologists have also collected data over the web, for example via Amazon Mechanical Turk: researchers post their questionnaires on MTurk and recruit "workers" to answer them. Other platforms, such as SurveyCake or Google Forms, serve the same purpose. Soon after administering questionnaires on the Internet, psychologists also began to run experiments on the Internet. However, these attempts merely change the way human data are collected; they do not expand the kinds of measures available to us.

Beyond treating web pages as a data-collection platform, the information on web pages can itself be a source of data. Specifically, the digital footprints on social media can be informative for understanding human beings. That is, instead of running a psychological study on the Internet, we scrape the contents of web pages and analyze them in order to answer psychological questions. Methodologically, this is quite similar to so-called naturalistic observation, only conducted online rather than offline. Obviously, we need to know how to scrape the contents of a web page for further data analysis, yet this skill has not been part of conventional research-methods classes in Psychology. R is a very useful tool for scraping web content. Therefore, the following examples introduce R as well as the means to exploit data from social media (or any web pages) to benefit our research.
The easiest way to get data from the Internet is to download data that someone else has already prepared and uploaded. This kind of data is normally called open data. The advantage of using open data is that we do not need to scrape pieces of information from web pages and sort them into a spreadsheet structure ourselves, which saves a great deal of effort on data management. For example, this web page provides data on COVID-19 cases sorted by area, country, total positive cases, and so on. We can directly download the data in one of the formats it provides and start our analysis. For the convenience of demonstration, I downloaded the file in CSV format for 2021-11-20. The following code computes the total number of confirmed cases in each continent and shows them in a pie chart. The first problem I encountered was that the numbers in the downloaded file contain commas, which have to be removed before any calculation can be done. Also, I set the text font to "DFKai-SB"; you can set the font to whichever one you like on your computer.
# Import the downloaded CSV file
covid.dta<-read.csv("owl_world.csv",header=T)
# Remove the commas in the numbers so they can be converted to numeric values
covid.dta$total_case<-sapply(covid.dta$總確診數,function(s)gsub(",","",s))
# Sum the confirmed cases by continent (洲名), skipping the first 9 rows
cases<-with(covid.dta[-c(1:9),],tapply(as.numeric(total_case),洲名,sum))
dta<-data.frame(cases=cases,group=names(cases))
# Draw a pie chart: a stacked bar chart transformed to polar coordinates
library(ggplot2)
ggplot(dta[-1,],aes(x="",y=cases,fill=group))+
  theme_bw()+theme(plot.background=element_blank(),
                   panel.grid.major=element_blank(),
                   panel.grid.minor=element_blank())+
  geom_bar(stat="identity",width=1)+
  coord_polar("y",start=0)+
  theme(text=element_text(family="DFKai-SB"))
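If an ordinary bar chart is preferred to the pie chart, a minimal variation of the same code (a sketch, assuming the dta data frame built above) might look like this:

# Sketch: the same totals as a bar chart, ordered from largest to smallest
ggplot(dta[-1,],aes(x=reorder(group,-cases),y=cases,fill=group))+
  theme_bw()+
  geom_bar(stat="identity")+
  theme(text=element_text(family="DFKai-SB"))+
  labs(x="洲名",y="Total confirmed cases")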
An API (Application Programming Interface) is a software intermediary that allows two applications to talk to each other. Each time you use an app like Facebook, send an instant message, or check the weather on your phone, you are using an API. When you Google something, you are also using a Google API. The example below shows how to get data from Dcard via the Dcard API. For your reference, this web page documents the Dcard API, and you can easily find a number of tutorials for it on the Internet. The basic idea of collecting data via an API is that you send a request to the Dcard server, which in turn returns the data according to your request. An API request normally looks like a URL and functions like one. See the example below, which requests the most popular posts on Dcard regardless of forum by setting the query to posts?popular=true. By default, one request returns 30 posts. You can open this URL in your web browser to see what you get.
url<-"https://www.dcard.tw/service/api/v2/posts?popular=true"
The figure below is what I got when I requested the hottest 30 posts on Dcard in the evening of Nov. 21. Unlike what you see on the front page of Dcard, it looks quite messy and hard to read. No need to panic: this is in fact a particular data structure called JSON.
The following code shows how to import the JSON data into R. The logic is quite simple. First, we import all the contents of this web page with readLines( ). Second, we convert the imported text to an R list with fromJSON( ), which is included in the package {rjson}. The function suppressWarnings( ) is used to suppress the warning messages.
p.posts<-suppressWarnings(readLines(url))
mode(p.posts)
## [1] "character"
length(p.posts)
## [1] 1
You can check the mode of the imported contents, which should be "character". Also, the returned contents are represented as a single string, so their length is 1. We then convert this string to a list with fromJSON( ), still named p.posts. You can check the number of elements of this list, which should be 30, because by default Dcard returns 30 posts.
library(rjson)
p.posts<-fromJSON(p.posts)
mode(p.posts)
## [1] "list"
length(p.posts)
## [1] 30
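As an aside, the same request can also be made with the {httr} and {jsonlite} packages, which handle the HTTP request and the JSON parsing more explicitly. The sketch below assumes the same url as above; note that jsonlite provides its own fromJSON( ), so it is called with the package prefix to avoid confusion with {rjson}.

library(httr)
# Send a GET request to the Dcard API
resp<-GET(url)
status_code(resp)  # 200 means the request succeeded
# Parse the JSON body into an R list (simplifyVector=FALSE keeps the list structure)
p.posts2<-jsonlite::fromJSON(content(resp,as="text",encoding="UTF-8"),simplifyVector=FALSE)
length(p.posts2)   # should again be 30 posts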
There are a lot of variables in each list element, each recording a different attribute of the post. For example, id is the post id in the Dcard system, and title is the title of the post. The variable excerpt contains the first portion of the post's text, which functions like a preview. In addition, createdAt and updatedAt are the exact times when the post was created and updated, commentCount is the number of comments on the post, and likeCount is the number of likes it received. The hashtags given by the author can be found in the variable topics. Although Dcard authors are anonymous, their gender is public and can be found in the variable gender. We can check the forums where these posts were published.
forums<-sapply(p.posts,"[[","forumName")
table(forums)
## forums
## 個人看板 女孩 寵物 居家生活 心情 感情 有趣 穿搭
## 2 2 1 1 5 4 3 1
## 美妝 美食 追星 閒聊
## 2 2 2 5
forums[order(forums,decreasing=T)]
## [1] "閒聊" "閒聊" "閒聊" "閒聊" "閒聊" "追星"
## [7] "追星" "美食" "美食" "美妝" "美妝" "穿搭"
## [13] "有趣" "有趣" "有趣" "感情" "感情" "感情"
## [19] "感情" "心情" "心情" "心情" "心情" "心情"
## [25] "居家生活" "寵物" "女孩" "女孩" "個人看板" "個人看板"
genders<-sapply(p.posts,"[[","gender")
table(genders)
## genders
## F M
## 20 10
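Rather than extracting one variable at a time, we can also gather several of these fields into a single data frame, which is handier for later analyses. A minimal sketch, assuming the p.posts list and the field names described above:

# Collect a few fields of each post into one data frame
posts.dta<-data.frame(
  id=sapply(p.posts,"[[","id"),
  title=sapply(p.posts,"[[","title"),
  gender=sapply(p.posts,"[[","gender"),
  likeCount=sapply(p.posts,"[[","likeCount"),
  commentCount=sapply(p.posts,"[[","commentCount")
)
head(posts.dta)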
post.times<-sapply(p.posts,"[[","createdAt")
post.times
## [1] "2021-11-20T05:59:33.856Z" "2021-11-20T01:48:28.385Z"
## [3] "2021-11-21T02:40:54.425Z" "2021-11-20T03:28:14.426Z"
## [5] "2021-11-20T13:32:15.164Z" "2021-11-20T12:03:04.626Z"
## [7] "2021-11-21T04:00:06.557Z" "2021-11-20T03:08:32.272Z"
## [9] "2021-11-20T12:03:37.473Z" "2021-11-20T01:45:42.423Z"
## [11] "2021-11-20T16:09:08.302Z" "2021-11-20T07:33:12.339Z"
## [13] "2021-11-20T08:13:58.787Z" "2021-11-20T16:44:57.422Z"
## [15] "2021-11-20T18:46:07.197Z" "2021-11-20T06:44:21.975Z"
## [17] "2021-11-20T10:05:50.429Z" "2021-11-20T04:44:58.164Z"
## [19] "2021-11-20T17:16:43.862Z" "2021-11-20T08:35:42.104Z"
## [21] "2021-11-20T14:04:54.591Z" "2021-11-20T11:17:14.841Z"
## [23] "2021-11-20T05:27:24.661Z" "2021-11-20T11:36:00.990Z"
## [25] "2021-11-20T09:45:38.939Z" "2021-11-20T15:46:29.653Z"
## [27] "2021-11-20T01:39:56.543Z" "2021-11-20T17:33:17.266Z"
## [29] "2021-11-20T09:30:40.302Z" "2021-11-21T03:03:30.078Z"
We can also check the distribution of publishing times. First, we extract the clock time from each post's time string and convert it to a decimal hour of the day.
# Keep only the clock-time part after "T" (e.g., "05:59:33.856Z")
times<-sapply(post.times,function(x)unlist(strsplit(x,"T"))[2])
# Drop the milliseconds and the trailing "Z"
times<-sapply(times,function(x)unlist(strsplit(x,"[.]"))[1])
# Convert "HH:MM:SS" to a decimal hour of the day
times<-sapply(times,function(x){
  temp<-unlist(strsplit(x,":"))
  hours<-as.numeric(temp[1])+as.numeric(temp[2])/60+
    as.numeric(temp[3])/3600
})
# Histogram with a density curve overlaid
ggplot(data.frame(times=times),aes(times))+
  geom_histogram(aes(y=..density..),binwidth=2,fill="white",color="black")+
  geom_density()
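The same decimal hours can also be obtained by letting R parse the ISO 8601 timestamps directly. A minimal sketch with base R's as.POSIXct( ), assuming the post.times vector above:

# Parse the full timestamps (the trailing "Z" indicates UTC)
stamps<-as.POSIXct(post.times,format="%Y-%m-%dT%H:%M:%OS",tz="UTC")
# Hour of day as a decimal number
hours2<-as.numeric(format(stamps,"%H"))+
  as.numeric(format(stamps,"%M"))/60+
  as.numeric(format(stamps,"%S"))/3600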
We can also check the number of likes each post received. The summary shows that the like counts range from 581 to 2,321, with a median of 845.5. The histogram of like counts is shown below.
likes<-sapply(p.posts,"[[","likeCount")
summary(likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 581.0 673.0 845.5 1151.5 1597.0 2321.0
ggplot(data.frame(likes=likes),aes(likes))+
geom_histogram(fill="pink",color="black",binwidth=100)
In order to get the full content of a post, we need to request that specific post by its id. Suppose we want the full content of the second most popular post.
url.text<-"https://www.dcard.tw/service/api/v2/posts/"
ids<-sapply(p.posts,"[[","id")
url.text<-paste0(url.text,ids[2])
text<-suppressWarnings(readLines(url.text))
text<-fromJSON(text)
text$title
## [1] "台中19歲女大學生大戰把人拽下來的50歲留學高學歷婦人(๑•̀ㅂ•́)و"
Back to the single post: its full content can be accessed as text$content. For the sake of demonstration, I only show the excerpt (the first portion of the text) here. We can then create a word cloud for this post to help us grasp its main theme. To do so, the Chinese sentences must first be segmented into individual words, for which I use the package {jiebaR}. Before word segmentation, we remove the English letters, digits, and some punctuation. The function worker( ) creates an environment for Chinese word segmentation, and the outcome of segmentation is a vector of words. The ten most frequent words are shown below.
text$excerpt
## [1] "#慎入 台中人:「一定愛用東泉欸辣椒醬️」今天出門的時候順便拍一下事發當晚的人行道,今天天氣真好 陽光明媚️,B446,因為我一直依稀記得,我每天上學的路上要騎上來這個人行道的時候都會看到一個藍色的牌子"
# Preprocessing
words<-gsub("[a-z]","",text$content)
words<-gsub("[A-Z]","",words)
words<-gsub("[0-9]","",words)
words<-gsub("/","",words)
words<-gsub("[.]","",words)
# Word segmentation
library(jiebaR)
## Loading required package: jiebaRD
cutter<-worker()
p.words<-cutter[words]
# Count the frequency of each word and sort out all words from the highest frequency to the lowest
t.words<-table(p.words)
names(t.words)[order(t.words,decreasing=T)][1:10]
## [1] "我" "的" "婦人" "你" "就" "是" "了" "他" "在" "要"
A word cloud is useful for data visualization. To make a meaningful word cloud, we keep only the words longer than one character, because in Chinese most single characters are function words that carry little meaning on their own.
library(wordcloud2)
words.dta<-data.frame(t.words)
flag<-sapply(names(t.words),function(x)ifelse(nchar(x)>1,T,F))
words.dta<-words.dta[flag,]
words.dta<-words.dta[order(words.dta$Freq,decreasing=T),]
wordcloud2(words.dta[1:30,])
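The cloud can be cleaned up further by dropping common function words (stop words) that survive the length filter. A minimal sketch with a small hand-picked stop-word list (the list itself is illustrative, not exhaustive):

# A hand-made stop-word list for illustration only
stop.words<-c("沒有","有點","因為","我們","他們","就是","可以","還是")
words.dta2<-words.dta[!(words.dta$p.words %in% stop.words),]
wordcloud2(words.dta2[1:30,])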
Of course, we can also extract the keywords of this post with the TF-IDF algorithm. In {jiebaR}, if we set up the word-parsing environment with type="keywords", it will apply TF-IDF to determine which words best represent this post. The default number of keywords is 5; we can change it to 10 by setting topn=10, as below.
cutter<-worker(type="keywords",topn=10)
cutter[words]
## 281.741 140.87 129.131 93.9136 93.9136 93.9136 90.5649 90.4105
## "婦人" "沒有" "說" "想" "報警" "因為" "人行道" "警察"
## 82.1744 70.4352
## "他們" "有點"