Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site, such as links, text posts, and images, which are then voted up or down by other members. Although Reddit provides a useful and thorough API, in 2023 the CEO of Reddit announced that access to the API would become a paid service, so it is worth collecting the data you need sooner rather than later. The script below is an example of how we can GET posts from a forum (subreddit) on Reddit. The API is simply a URL of the form “https://www.reddit.com/r/{subreddit}/{listing}.json?limit={count}&t={timeframe}”.
The parameters are described as follows; a small helper that builds the URL from these parameters is sketched after the list.
subreddit = the forum from which you want to scrape posts.
listing = controversial, best, hot, new, random, rising, top.
count = number of posts (e.g., 100)
timeframe = hour, day, week, month, year, all.
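To make these parameters concrete, here is a minimal sketch (not part of the original script; the helper name build_reddit_url is our own) that pastes them into the URL format described above.
# A minimal sketch: build a Reddit listing URL from its parameters
# (the helper name build_reddit_url is an assumption, not from the original script)
build_reddit_url<-function(subreddit,listing,count,timeframe){
  paste0("https://www.reddit.com/r/",subreddit,"/",listing,
         ".json?limit=",count,"&t=",timeframe)
}
build_reddit_url("vaccine","controversial",100,"year")
## [1] "https://www.reddit.com/r/vaccine/controversial.json?limit=100&t=year"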
Suppose we want to scrape 100 controversial posts from the subreddit vaccine. We can use the API to do this, and then use fromJSON( ) to transform the returned JSON text into an R list.
library(rjson)
# Listing URL: 100 controversial posts from r/vaccine over the past year
url<-"https://www.reddit.com/r/vaccine/controversial.json?limit=100&t=year"
# Read the raw JSON text and convert it into an R list
link<-suppressWarnings(readLines(url))
dta.v<-fromJSON(link)
Checking the structure of dta.v, we find that the data of interest are stored under the node called children. We can check how many posts we scraped: there are 100, which is exactly what we requested. From the same list we can then extract the titles of these 100 posts.
length(dta.v$data$children)
## [1] 100
titles<-sapply(1:100,function(x)dta.v$data$children[[x]]$data$title)
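The same indexing pattern works for other fields stored under data. For example, a sketch like the one below would also collect each post's score (score is a standard field in Reddit's listing JSON, though the original script only uses the titles).
# A sketch: collect another field (score) with the same indexing pattern
scores<-sapply(1:100,function(x)dta.v$data$children[[x]]$data$score)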
What these titles tell us can be explored through their keywords. There are many different indices for identifying keywords. Let's start with TF-IDF (Term Frequency-Inverse Document Frequency), which consists of two parts: TF and IDF. TF is how frequently a word appears in a document. IDF is computed as
\[\begin{equation} idf(term)=\ln\left(\frac{n_{total}}{n_{specific}}\right), \end{equation}\] where \(n_{total}\) is the total number of documents and \(n_{specific}\) is the number of documents containing that term. TF-IDF is then the product of TF and IDF. Let's see how we can compute it.
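As a quick sanity check of the formula, the toy example below uses made-up counts rather than the Reddit data: a term that appears 2 times in a 10-word document and occurs in 2 of 100 documents.
# A minimal sketch with made-up numbers, just to illustrate the TF-IDF formula
tf<-2/10          # the term appears 2 times in a 10-word document
idf<-log(100/2)   # 100 documents in total, 2 of them contain the term
tf*idf
## [1] 0.7824046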
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
# One row per post title, with an id for each post
vctitles<-tibble(id=as.factor(1:100),title=titles)
# Tokenize the titles into words and count word frequencies within each post
vcwords<-vctitles %>% unnest_tokens(word,title) %>% count(id,word,sort=T)
# Total number of words in each post
total_words<-vcwords %>% group_by(id) %>% summarize(total=sum(n))
vcwords<-left_join(vcwords,total_words)
## Joining with `by = join_by(id)`
# TF-IDF
vcwords_tfidf<-vcwords %>% bind_tf_idf(word,id,n)
vcwords_tfidf %>% select(-total) %>% arrange(desc(tf_idf))
## # A tibble: 1,144 × 6
## id word n tf idf tf_idf
## <fct> <chr> <int> <dbl> <dbl> <dbl>
## 1 64 varicella 1 0.5 4.61 2.30
## 2 80 menb 1 0.5 4.61 2.30
## 3 68 meningitis 1 0.5 3.51 1.75
## 4 24 cytomegalovirus 1 0.333 4.61 1.54
## 5 24 trial 1 0.333 4.61 1.54
## 6 47 question 1 0.333 4.61 1.54
## 7 30 happen 1 0.25 4.61 1.15
## 8 30 often 1 0.25 4.61 1.15
## 9 67 during 1 0.25 4.61 1.15
## 10 67 pregnancy 1 0.25 4.61 1.15
## # ℹ 1,134 more rows
We can plot a word cloud for these words.
library(wordcloud2)
vcwords_tfidf %>% select(word,tf_idf) %>% wordcloud2()
Alternatively, we can make a bar plot for the top 15 words.
library(forcats)
library(ggplot2)
vcwords_tfidf %>% slice_max(tf_idf,n=15) %>%
  ggplot(aes(tf_idf,fct_reorder(word,tf_idf)))+geom_col(fill="deepskyblue2")+
  ylab("Terms")
YouTube also provides an API for users to communicate with its services. We can follow this web page to learn how to use the YouTube API. According to this tutorial, the first step is to create a project on Google Developers. In that project, we apply for the credentials for using the API. Google Developers gives every project a unique API key; with this key, you can use the API. Suppose we have created a project, obtained the API key, and enabled the YouTube Data API v3. There are a few types of YouTube APIs, classified by their functions. For getting information about a single video, the URL has the format https://www.googleapis.com/youtube/v3/videos?id={video id}.
For example, if we want to get some information about the newest video on the channel 蒼藍鴿, we need its video id, which can be seen in the URL of the video, https://www.youtube.com/watch?v=aualTazKEeI; the id of this video is thus aualTazKEeI. The format of the API is https://www.googleapis.com/youtube/v3/videos?id={video id}&key={your API key}&part=snippet. When you create a project, Google Developers gives you {your API key}; that is what you enter here. We use readLines( ) to scrape the information about this video and save it to a new variable, content.
Checking content, we can find some useful information. For example, the channel id and the video title appear in content[11] and content[12], and the description of this video is in content[104].
url3<-"https://www.googleapis.com/youtube/v3/videos?id=aualTazKEeI"
key<-"AIzaSyDxoB-hMmtGa4N3cZvsVBj3GFMCSDE6f8A"
request<-"&part=snippet"
url3<-paste0(url3,"&key=",key,request)
content<-suppressWarnings(readLines(url3))
length(content)
## [1] 114
content[11:12]
## [1] " \"channelId\": \"UCUn77_F5A65HViL9OEvIpLw\","
## [2] " \"title\": \"高燒加虛脫! 鴿的身體發生什麼事?\","
What if we want to get the playlist of the 蒼藍鴿 channel? We need to get the playlist id first. To this end, we use the API https://www.googleapis.com/youtube/v3/channels?part=contentDetails&id={channel id}&key={your API key}. See the script below, where the channel id is the one we found in content[11]. In the content returned by this API, the playlist id is listed after "uploads". Now we have it.
url4<-"https://www.googleapis.com/youtube/v3/channels"
request<-"?part=contentDetails"
channel.id<-"UCUn77_F5A65HViL9OEvIpLw"
url4<-paste0(url4,request,"&id=",channel.id,"&key=",key)
playlist<-suppressWarnings(readLines(url4))
playlist[16]
## [1] " \"uploads\": \"UUUn77_F5A65HViL9OEvIpLw\""
Now we can get the playlist of this channel using the API below. The maximum number of videos that can be scraped per request is 50.
url5<-"https://www.googleapis.com/youtube/v3/playlistItems"
request1<-"?part=snippet,contentDetails,status"
request2<-"maxResults=50"
playlistid<-"UUUn77_F5A65HViL9OEvIpLw"
url5<-paste0(url5,request1,"&playlistId=",playlistid,"&key=",key,"&",request2)
videos<-suppressWarnings(readLines(url5))
Since we read the returned data as plain text lines rather than parsing it as JSON, we have to collect the data we want with our own scripts. Suppose we want the id, the title, and the publishing time of each video. We can run the scripts below.
# Extract each video id from the lines containing "videoId"
Ids<-NULL
for(i in 1:length(videos)){
  x<-videos[i]
  if(grepl("videoId",x)){
    temp<-unlist(strsplit(x,": "))[2]
    temp<-gsub('"',"",temp)
    Ids<-c(Ids,temp)
  }
}
# Extract each video title from the lines containing "title"
Titles<-NULL
for(i in 1:length(videos)){
  x<-videos[i]
  if(grepl("title",x)){
    temp<-unlist(strsplit(x,": "))[2]
    temp<-gsub('"',"",temp)
    Titles<-c(Titles,temp)
  }
}
# Extract each publishing date from the lines containing "publishedAt"
Dates<-NULL
for(i in 1:length(videos)){
  x<-videos[i]
  if(grepl("publishedAt",x)){
    temp<-unlist(strsplit(x,": "))[2]
    temp<-gsub('"',"",temp)
    # Keep only the date part of the timestamp (before the "T")
    temp<-unlist(strsplit(temp,"T"))[1]
    Dates<-c(Dates,temp)
  }
}
# Each video id appears twice in the response (in snippet and contentDetails), so keep every other one
v.dta<-tibble(Ids=Ids[seq(1,99,2)],Titles=Titles,Dates=Dates)
v.dta
## # A tibble: 50 × 3
## Ids Titles Dates
## <chr> <chr> <chr>
## 1 aualTazKEeI 高燒加虛脫! 鴿的身體發生什麼事?, 2023-10-02
## 2 5Hq56D7qaR8 緩解胃痛的方法, 2023-09-30
## 3 qoj1hW36CSY 吃柚子禁忌! 這幾種藥物併用有危險!, 2023-09-29
## 4 gLqvTYwicI0 親吻眼鏡蛇 不幸身亡!, 2023-09-27
## 5 Ydxy4wbEjhg 腰斬後 現代醫學救得活嗎?, 2023-09-25
## 6 weLqxtzgats 爆肝的跡象 熬夜的人小心了?, 2023-09-23
## 7 qivTlzIeuVM 吃點心也能輕鬆瘦? 四種健康點心推薦!, 2023-09-22
## 8 9T0439Tpseo 校園偷抽電子煙 結局令人意外..., 2023-09-20
## 9 I7c4laahW3s 我不當急診醫生的原因...整天跟病人吵架!?, 2023-09-18
## 10 x9SoAAPrKyQ 肝癌發生的原因, 2023-09-16
## # ℹ 40 more rows
We can calculate how frequently this channel updates its videos. It looks like this channel uploads a new video every 1.714 days on average.
v.dta$Dates<-as.Date(v.dta$Dates,"%Y-%m-%d")
# Differences between consecutive upload dates
days<-v.dta$Dates[1:(nrow(v.dta)-1)]-v.dta$Dates[2:nrow(v.dta)]
mean(days)
## Time difference of 1.714286 days
Of course, we can scrape more than 50 videos. We need to find the nextPageToken, which here is EAAaIVBUOkNESWlFRGhETlVaQlJUWkNNVFkwT0RFelF6Z29BUQ. We then reuse the same API we used for the first 50 videos, adding "&pageToken={nextPageToken}" at the end, as sketched below. Now we should be able to scrape another 50 videos.
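The lines below are our own sketch (not the original script): they pull the nextPageToken out of the text we already scraped and request the next page of up to 50 videos.
# A minimal sketch: find the nextPageToken in the scraped text and request the next page
token.line<-videos[grepl("nextPageToken",videos)]
next.token<-gsub('[",]',"",unlist(strsplit(token.line,": "))[2])
url5b<-paste0(url5,"&pageToken=",next.token)
videos2<-suppressWarnings(readLines(url5b))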
Finally, how do we scrape the comments on a video? We can use the API below to scrape the comments on the video published on 2023-09-29.
url6<-"https://www.googleapis.com/youtube/v3/commentThreads"
request<-"?part=snippet,replies"
url6<-paste0(url6,request,"&videoId=",v.dta$Ids[3],"&key=",key)
reply<-suppressWarnings(readLines(url6))
# Extract the displayed text of each comment
replies<-NULL
for(i in 1:length(reply)){
  if(grepl("textDisplay",reply[i])){
    temp<-unlist(strsplit(reply[i],": "))[2]
    temp<-gsub('\"',"",temp)
    replies<-c(replies,temp)
  }
}
# Extract the author name of each comment
reply.authors<-NULL
for(i in 1:length(reply)){
  if(grepl("authorDisplayName",reply[i])){
    temp<-unlist(strsplit(reply[i],": "))[2]
    temp<-gsub('\"',"",temp)
    temp<-gsub(',','',temp)
    reply.authors<-c(reply.authors,temp)
  }
}
reply.dta<-tibble(author=reply.authors,reply=replies)
reply.dta[1:20,]
## # A tibble: 20 × 2
## author reply
## <chr> <chr>
## 1 蒼藍鴿的醫學天地 "*蒼藍鴿使用的保健品牌「藥師健生活」,折扣碼「bluepig」享全…
## 2 王琮傑 "很簡單,不吃柚子就沒事了,反正麻煩也沒很愛,可是加工過的我…
## 3 鄧喬韓 "醫師、醫師,到底要多沒人格?才可以在台大醫院當精神科主治醫…
## 4 鄧喬韓 "醫師、醫師,你有人格嘛?是不是沒人格,就可以考上台大醫學系…
## 5 jokerc0728 "我得了CML好幾年了, 天天吃第1代的標耙藥, 盒子外面就寫明了不…
## 6 Ryou "下次請提早一個星期發布 好讓我轉發給所有親朋好友😂\\u003cbr…
## 7 山嵐Motor "柚子藥物交互作用,"
## 8 汛豪 陳 "印象中羅漢果就是使用柚子製作的樣子,"
## 9 April Yang "問過藥劑師,基本上三高患者柚子、葡萄柚都在禁忌名單中。坊間…
## 10 Wang Johnny "舉手發問,請問Minoxdil也有降血壓的功效,可以跟柚子一起吃嗎…
## 11 阿輝遊戲直播 "感謝蒼藍鴿,我只知道不能和葡萄柚一起吃,而且剛好有在吃三高…
## 12 huifen Guo "葡萄柚跟柚子真的是不能跟某些藥物一起吃\\u003cbr\\u003e從吃…
## 13 起司 "爆破魔法,"
## 14 陳嘉誠 "請問吃抗組織胺是不是也會影響,"
## 15 劉致甫 "鴿可以唱歌了,"
## 16 唐白 "開頭笑到不行😂,"
## 17 Yang Wang "咦 我的a酸藥單上也有說不能吃葡萄柚之類的,"
## 18 Jennifer Hsu "我剛邊看影片邊吃完一整顆\\u003cbr\\u003e才看到熱量控制來不…
## 19 阮冠博 "蒼藍鴿你好,可以請問莫德那的XBB.1.5你建議打嗎?,"
## 20 暄暄 "這顆柚子有刺青,"
Usually, we are interested in some statistics of a video. How can we get them? For example, if we want to know how many views and likes the latest video on the channel 蒼藍鴿 has received, we can use the following API.
url7<-"https://www.googleapis.com/youtube/v3/videos?id=aualTazKEeI"
key<-"AIzaSyDxoB-hMmtGa4N3cZvsVBj3GFMCSDE6f8A"
request<-"&part=statistics"
url7<-paste0(url7,"&key=",key,request)
content<-suppressWarnings(readLines(url7))
viewCount<-NULL
likeCount<-NULL
commentCount<-NULL
# Pull viewCount, likeCount, and commentCount out of the returned text
for(i in 1:length(content)){
  if(grepl("viewCount",content[i])){
    v<-unlist(strsplit(content[i],": "))[2]
    v<-gsub(',',"",v)
    v<-gsub('\"',"",v)
    viewCount<-c(viewCount,v)
  }
  if(grepl("likeCount",content[i])){
    l<-unlist(strsplit(content[i],": "))[2]
    l<-gsub(",","",l)
    l<-gsub('\"',"",l)
    likeCount<-c(likeCount,l)
  }
  if(grepl("commentCount",content[i])){
    c<-unlist(strsplit(content[i],": "))[2]
    c<-gsub(",","",c)
    c<-gsub('\"',"",c)
    commentCount<-c(commentCount,c)
  }
}
c(viewCount,likeCount,commentCount)
## [1] "40844" "1466" "105"
We can check out the statistics of the 50 videos that we just scraped. The script below defines a function that retrieves the view count, like count, and comment count of each video.
GStats<-function(vId){
  viewCount<-NULL
  likeCount<-NULL
  commentCount<-NULL
  for(i in 1:length(vId)){
    # Create API
    url<-"https://www.googleapis.com/youtube/v3/videos?id="
    key<-"AIzaSyDxoB-hMmtGa4N3cZvsVBj3GFMCSDE6f8A"
    request<-"&part=statistics"
    url<-paste0(url,vId[i],"&key=",key,request)
    web.content<-readLines(url)
    for(j in 1:length(web.content)){
      if(grepl("viewCount",web.content[j])){
        v<-unlist(strsplit(web.content[j],": "))[2]
        v<-gsub(',',"",v)
        v<-gsub('\"',"",v)
        viewCount<-c(viewCount,v)
      }
      if(grepl("likeCount",web.content[j])){
        l<-unlist(strsplit(web.content[j],": "))[2]
        l<-gsub(",","",l)
        l<-gsub('\"',"",l)
        likeCount<-c(likeCount,l)
      }
      if(grepl("commentCount",web.content[j])){
        c<-unlist(strsplit(web.content[j],": "))[2]
        c<-gsub(",","",c)
        c<-gsub('\"',"",c)
        commentCount<-c(commentCount,c)
      }
    }
  }
  return(cbind(viewCount,likeCount,commentCount))
}
Stats<-GStats(v.dta$Ids)
We can organize our data by incorporating these counts and then check the view counts of the latest 50 videos.
v.dta<-v.dta %>% mutate(viewCount=as.numeric(Stats[,1]),
                        likeCount=as.numeric(Stats[,2]),
                        commentCount=as.numeric(Stats[,3]))
v.dta %>% ggplot(aes(as.Date(Dates,"%Y-%m-%d"),viewCount))+geom_line()+
  geom_point(color="tomato")+xlab("Date")