Import Data from Web Pages

In addition to importing the data displayed on a web page, it is more common to download data files from one. For example, if we want daily data about COVID-19, such as confirmed cases, we can go to Our World in Data. On this web page, we can download the daily-updated COVID-19 data: check the lower-left area of the page and choose which type of data you want to download.
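If you prefer not to click through the page, the file can also be fetched from within R. Below is a minimal sketch using download.file( ); the URL is an assumption based on where the download button pointed at the time of writing, so verify it on the page itself:

# Fetch the Excel file directly; mode="wb" keeps the binary file intact on Windows
# (the URL is an assumed download link and may change)
download.file("https://covid.ourworldindata.org/data/owid-covid-data.xlsx",
              destfile="owid-covid-data.xlsx",mode="wb")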

People are used to using Excel to manage their data, so we can download the .xlsx file; the full file name is owid-covid-data.xlsx. As read.table( ) can only deal with plain-text files, we need another function to import the data in an .xlsx file. The package {readxl} provides functions for R users to read .xlsx files; here we use read_xlsx( ) to import the data. We then use head( ) to take a glance at the data set.

library(readxl)
covid.dta<-read_xlsx("owid-covid-data.xlsx")
head(covid.dta)
## # A tibble: 6 × 67
##   iso_code continent location    date   total_cases new_cases new_cases_smoothed
##   <chr>    <chr>     <chr>       <chr>        <dbl>     <dbl>              <dbl>
## 1 AFG      Asia      Afghanistan 2020-…          NA         0                 NA
## 2 AFG      Asia      Afghanistan 2020-…          NA         0                 NA
## 3 AFG      Asia      Afghanistan 2020-…          NA         0                 NA
## 4 AFG      Asia      Afghanistan 2020-…          NA         0                 NA
## 5 AFG      Asia      Afghanistan 2020-…          NA         0                 NA
## 6 AFG      Asia      Afghanistan 2020-…          NA         0                  0
## # ℹ 60 more variables: total_deaths <dbl>, new_deaths <dbl>,
## #   new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## #   new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## #   total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## #   new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## #   icu_patients <lgl>, icu_patients_per_million <lgl>, hosp_patients <lgl>,
## #   hosp_patients_per_million <lgl>, weekly_icu_admissions <lgl>, …

As shown in the output above, this is a tibble with 67 columns (or variables). Suppose we want to draw the curve of confirmed cases from the first date in this data set until now. The date column is currently of mode character. We can convert it to R's Date class with the function as.Date( ) and save the result as a new variable, giving us continuous date values.

mode(covid.dta$date[1])
## [1] "character"
covid.dta$date[1]
## [1] "2020-01-03"
covid.dta$date1<-as.Date(covid.dta$date,"%Y-%m-%d")
covid.dta$date1[1]
## [1] "2020-01-03"
mode(covid.dta$date1)
## [1] "numeric"

We focus on the new confirmed cases in the column new_cases_per_million. Let’s plot this variable by continent with the script below. Clearly, 2022 was the most serious period of COVID-19. Also, compared with Europe, Oceania, and South America, COVID-19 in Asia was relatively mild. To my surprise, North America seemed to keep COVID-19 under better control than Europe.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
covid.dta %>% filter(!is.na(continent)) %>%   # drop aggregate rows with no continent
  ggplot(aes(date1,new_cases_per_million,color=continent))+geom_line()+
  xlab("Date")
## Warning: Removed 50 rows containing missing values (`geom_line()`).

One might want to report the data in a sliding window of 7 days to make the curves smoother. To this end, the number of cases on each day is replaced by the average across the past 7 days (including the current day).

africa.dta<-covid.dta %>% filter(continent=='Africa') %>% select(date1,new_cases_per_million)
asia.dta<-covid.dta %>% filter(continent=='Asia') %>% select(date1,new_cases_per_million)
europe.dta<-covid.dta %>% filter(continent=='Europe') %>% select(date1,new_cases_per_million)
na.dta<-covid.dta %>% filter(continent=='North America') %>% select(date1,new_cases_per_million)
ocean.dta<-covid.dta %>% filter(continent=='Oceania') %>% select(date1,new_cases_per_million)
sa.dta<-covid.dta %>% filter(continent=='South America') %>% select(date1,new_cases_per_million)
  
covid_smooth<-function(dta){
   # Trailing 7-day average: keep the first 6 days as they are, since a
   # full window is not yet available there
   temp<-dta$new_cases_per_million[1:6]
   for(i in 7:nrow(dta)){
       # mean over the current day and the 6 preceding days
       v<-mean(dta$new_cases_per_million[(i-6):i],na.rm=TRUE)
       temp<-c(temp,v)
   }
   return(temp)
}
smooth.dta<-tibble(date=c(africa.dta$date1,asia.dta$date1,europe.dta$date1,
                          na.dta$date1,ocean.dta$date1,sa.dta$date1),
                   mean_case=c(covid_smooth(africa.dta),
                               covid_smooth(asia.dta),
                               covid_smooth(europe.dta),
                               covid_smooth(na.dta),
                               covid_smooth(ocean.dta),
                               covid_smooth(sa.dta)),
                   continent=c(rep('Africa',nrow(africa.dta)),
                               rep('Asia',nrow(asia.dta)),
                               rep('Europe',nrow(europe.dta)),
                               rep('North America',nrow(na.dta)),
                               rep('Oceania',nrow(ocean.dta)),
                               rep('South America',nrow(sa.dta))))

We first separated the data set by continent and computed the smoothed cases for each. Thereafter, we assembled them into one tibble containing the dates, the average cases, and the continent labels, and made a plot based on this new tibble. On this plot, we can clearly see the peaks in every continent. For example, there were two peaks in South America in 2022: one near January and the other around May. Also, starting from 2023, the confirmed cases of COVID-19 declined quickly in most continents, except Oceania.

smooth.dta %>% ggplot(aes(date,mean_case,color=continent))+geom_line()+
  ylab('7-Day Average of New Cases per Million')
## Warning: Removed 2 rows containing missing values (`geom_line()`).
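For reference, the same kind of smoothing can be done more compactly. Below is a minimal sketch assuming the {zoo} package is installed; note that, unlike the loop above, it first averages across countries within a continent on each date before applying the 7-day window, so the resulting tibble (here called smooth.dta2) may differ slightly.

library(zoo)
smooth.dta2<-covid.dta %>% filter(!is.na(continent)) %>%
  group_by(continent,date1) %>%
  summarise(daily=mean(new_cases_per_million,na.rm=TRUE),.groups="drop_last") %>%
  arrange(date1,.by_group=TRUE) %>%
  # rollapplyr( ) is the right-aligned rolling apply; partial=TRUE lets the
  # first few days use whatever shorter window is available
  mutate(mean_case=rollapplyr(daily,width=7,FUN=mean,na.rm=TRUE,partial=TRUE)) %>%
  ungroup()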

Considering the contagion of a pandemic, the confirmed case number in one continent might be correlated with that in another. To check this idea, we can make a scatter plot of the new cases between different continents. Suppose we would like to compare the confirmed cases in Europe and North America. The current format is not suitable: we need to transform the current long format into a wide format, with the Europe and North America cases as two different columns. The first part of the following script demonstrates how to turn a long format into a wide format. The key idea is to group by date and continent first, so that for each date there will be two means: one for Europe and the other for North America. The reason we need to group by these variables is that on some dates there is only one observation from a continent, or no data at all; thus, we need to compute the cases date by date. Thereafter, the function spread( ) is used to transform the long format into the wide format. (Conversely, gather( ) converts a wide format into a long format.) The second part of the following script makes a scatter plot of the confirmed cases in Europe against those in North America. The diagonal line has slope = 1 and intercept = 0.

library(tidyr)
temp<-smooth.dta %>% filter(continent=='Europe' | continent=='North America') %>% 
  group_by(date,continent) %>% summarise(case=mean(mean_case,na.rm=T)) %>% 
  spread(continent,case)
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
temp %>% ggplot(aes(Europe,`North America`))+geom_point(shape=1,size=1)+
  geom_abline(slope=1,intercept=0)
## Warning: Removed 2 rows containing missing values (`geom_point()`).

with(temp,cor.test(Europe,`North America`))
## 
##  Pearson's product-moment correlation
## 
## data:  Europe and North America
## t = 47.44, df = 1360, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7686247 0.8087107
## sample estimates:
##       cor 
## 0.7895082

The third part of the above script computes the correlation coefficient between Europe and North America on their confirmed cases. Apparently, the correlation is significant, \(r=.79\), \(p<.001\). Although a high correlation is found between these two continents, the scatter plot shows that the contagion of COVID-19 was more serious in Europe than in North America. Similarly, we can compare the confirmed cases in Europe and Asia. The result is similar to that of the Europe-North America comparison, except that Europe had far more cases than Asia.
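A side note on reshaping: spread( ) and gather( ) still work, but recent versions of {tidyr} recommend pivot_wider( ) and pivot_longer( ) instead. A sketch of the equivalent reshaping step for the Europe-North America comparison:

temp<-smooth.dta %>% filter(continent=='Europe' | continent=='North America') %>%
  group_by(date,continent) %>% summarise(case=mean(mean_case,na.rm=TRUE)) %>%
  pivot_wider(names_from=continent,values_from=case)   # the newer tidyr verb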

temp<-smooth.dta %>% filter(continent=='Europe' | continent=='Asia') %>% 
  group_by(date,continent) %>% summarise(case=mean(mean_case,na.rm=T)) %>% 
  spread(continent,case)
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
temp %>% ggplot(aes(Europe,Asia))+geom_point(shape=1,size=1)+
  geom_abline(slope=1,intercept=0)

with(temp,cor.test(Europe,Asia))
## 
##  Pearson's product-moment correlation
## 
## data:  Europe and Asia
## t = 69.057, df = 1360, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8697337 0.8933586
## sample estimates:
##       cor 
## 0.8820996

Perhaps the high correlation arises because Europe, North America, and Asia are closely connected with each other by travel and communication. What about Oceania and Europe? The script below provides a picture different from the previous two scatter plots. Although the correlation is still high, it is lower than the previous two.

temp<-smooth.dta %>% filter(continent=='Europe' | continent=='Oceania') %>% 
  group_by(date,continent) %>% summarise(case=mean(mean_case,na.rm=T)) %>% 
  spread(continent,case)
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
temp %>% ggplot(aes(Europe,Oceania))+geom_point(shape=1,size=1)+
  geom_abline(slope=1,intercept=0)
## Warning: Removed 4 rows containing missing values (`geom_point()`).

with(temp,cor.test(Europe,Oceania))
## 
##  Pearson's product-moment correlation
## 
## data:  Europe and Oceania
## t = 22.126, df = 1356, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4748445 0.5530722
## sample estimates:
##     cor 
## 0.51503

To understand why the correlation drops, we hypothesize that the outbreaks of COVID-19 might not have been synchronized. Thus, we make a line plot of the confirmed cases in Europe and Oceania from 2020-01-01 until now. Apparently, the outbreaks of COVID-19 occurred later in Oceania than in Europe, which explains why the correlation drops.

smooth.dta %>% filter(continent=="Europe" | continent=="Oceania") %>%
  ggplot(aes(date,mean_case,color=continent))+geom_line()
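To quantify this lag rather than eyeball it, one option is the cross-correlation function ccf( ) in base R. A minimal sketch, reusing the Europe-Oceania temp tibble from above (na.omit( ) drops dates missing in either series; the lag.max value is an arbitrary choice):

# The lag (in days) with the largest correlation suggests how far
# Oceania trails Europe
with(na.omit(temp),ccf(Europe,Oceania,lag.max=120))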

Another Example

In addition to importing open data from overseas web pages, we can also find open data on the official web pages of local governments. For example, on this link, we can find data about fires in Taipei. On this web page, you will find a keyword API. When you click on it, you will see instructions on how to use this API to request data from the server of the Taipei government. The API is just like a web address: if you enter it in your web browser, you will see a long string of letters and numbers, which are actually arranged in the JSON format. Thus, we need a tool to transform the JSON data into something we can manage in R. The package {rjson} can transform a JSON data set into a list. Now we can import data from this web page without downloading a .csv file. The idea is that (1) we use the function readLines( ) to read the contents of the web page referenced by the API, and (2) we transform the content format (i.e., JSON) into a list in R. According to the instructions of the API, the upper limit is 1000, meaning we can request at most 1000 records at a time; we do so by appending “&limit=1000” to the API.

library(rjson)
url<-"https://data.taipei/api/v1/dataset/7f02febe-86b6-4d3d-b06a-dcdfa3eacacf?scope=resourceAquire&limit=1000"
post<-suppressWarnings(readLines(url))             # raw JSON text from the API
fire<-fromJSON(post)                               # parse JSON into a nested list
months<-sapply(fire$result$results,"[[","年月別")  # extract the year-month field
dates<-sapply(months,function(x){
     # Convert a Minguo-calendar string such as "87年1月" into a Date
     temp<-unlist(strsplit(x,"年"))
     year<-as.numeric(temp[1])+1911                # Minguo year + 1911 = Gregorian year
     month<-as.numeric(gsub("月","",temp[2]))
     time<-paste0(year,"-",month,"-",1)
     time<-as.Date(time,tryFormats="%Y-%m-%d")
     return(time)
})
dates[1:2]
## 87年1月 87年2月 
##   10227   10258

When we get the contents in JSON format, we use fromJSON( ) to transform them into a list, which we call fire. You can check the structure of the list fire. One column in this data set records the month (in the Minguo calendar), so we use sapply( ) to extract all the months and turn them into Western-style dates. The resulting dates look like a series of strange numbers. Don’t worry: sapply( ) simplifies the Date objects to their underlying numeric values (days since 1970-01-01), so we just need to supply that origin date to as.Date( ) to recover the true dates. Now we can extract the numbers of fires from this data set and build a data frame containing the fire events in every month. Clearly, there is a jump in the fire numbers around mid-2017. This is intriguing. Do you have any interpretation for this?

events<-as.numeric(sapply(fire$result$results,"[[","火災發生次數[次]"))  # monthly fire counts
fire.dta<-tibble(Date=as.Date(dates,origin="1970-01-01"),  # recover Dates from the day counts
                 Events=events)
fire.dta %>% ggplot(aes(Date,Events))+geom_point(shape=1,size=1)+geom_line()+
  ylab("Fire Accidents")