Integration of multiple data bases

In order to answer a question, we might need to get data from different sources. Suppose we would like to know the relationships between mental health and some macro economic indices. To this end, we need to get the data which can represent mental health. Statista is a website containing data in many different respects. However, you need to pay for using the data they provide on that website. Do not worry. The NCCU library has bought their services and you only need to sign in the library with your own account and password. Thereafter, you can search for this data base by entering statista in search for data base.

Now suppose you have already accessed the data base. You can have a quick glance at what it provides. Alternatively, you can enter some keywords to search for the data which you might need. For example, I type in depression and I get a full web page showing depressive disorder related information. I further restrict the search scope to Taiwan only and get some charts about the revenue for depressive disorder and so on. The revenue for a mental disorder means the cost that we spent on the drug for that mental disorder. Thus, it is reasonable that the revenue for a mental disorder can represent the prevalence of it. We can download the data in the format we like.

Since I have already downloaded the data file, I just import the data by read.csv( ) to a data frame. This data frame is organized in a wide format.

mh.dta<-read.csv("mental_health_taiwan.csv",header=T)

Suppose we want to know the trend of the average revenue per patient for depressive disorders from 2016 to 2022. We firstly get the average revenue per patient in USD for depressive disorders from 2016 to 2022. Similarly, we get the same revenue for anxiety. We can also get the data for psychotic disorders. At last, ll these revenues are put together as a tibble.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
dd<-mh.dta %>% filter(Market=="Depressive Disorders" & Chart=="Average Revenue per Patient" &
                  Name=="Total",Unit=="USD (US$)") %>% select(X2016,X2017,X2018,X2019,X2020,X2021,
                                                             X2022)
ad<-mh.dta %>% filter(Market=="Anxiety and Related Sleep Disorders" & Chart=="Average Revenue per Patient" & Name=="Total",Unit=="USD (US$)") %>% select(X2016,X2017,X2018,X2019,X2020,X2021,
                                                             X2022)
pd<-mh.dta %>% filter(Market=="Psychotic Disorders" & Chart=="Average Revenue per Patient" & Name=="Total",Unit=="USD (US$)") %>% select(X2016,X2017,X2018,X2019,X2020,X2021,
                                                             X2022)
days<-sapply(seq(2016,2022),function(x){
  temp<-paste0(x,"-08-01")
  return(as.Date(temp,"%Y-%m-%d"))
})
all.dta<-tibble(date=rep(days,3),type=rep(c("dd","ad","pd"),each=7),
                revenue=c(unlist(dd),unlist(ad),unlist(pd)))
all.dta %>% ggplot(aes(as.Date(date,origin="1970-01-01"),
                       revenue,color=type))+geom_line()+geom_point()+
            xlab("Year")

As we would like to find out what macro economic indices would be correlated with our mental health, we collected the CPI (Consumer Price index) data and Salary data from here. Since there is not too much data, I just copied and pasted them here to create another data set. Apparently, CPI increases all the way through the past years. Correlation analysis shows that CPI is positive correlated with the revenue for depressive disorders as well as that for psychotic disorders, but not the revenue for anxiety disorders.

days<-sapply(seq(2004,2023),function(x){
  temp<-paste0(x,"-08-01")
  return(as.Date(temp,"%Y-%m-%d"))
})
cpi.dta<-data.frame(cpi=c(82.84,84.75,85.26,86.79,89.86,89.07,89.93,91.21,92.97,
                 93.71,94.83,94.54,95.86,96.45,97.76,98.30,98.07,100.00,102.95,105.11),
                 year=as.Date(days,origin="1970-01-01"))
cpi.dta %>% ggplot(aes(year,cpi))+geom_line()+geom_point(color="tomato")

cor.test(cpi.dta$cpi[13:19],all.dta$revenue[all.dta$type=="dd"])

## 
##  Pearson's product-moment correlation
## 
## data:  cpi.dta$cpi[13:19] and all.dta$revenue[all.dta$type == "dd"]
## t = 5.2329, df = 5, p-value = 0.003374
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5414591 0.9882641
## sample estimates:
##       cor 
## 0.9195646

cor.test(cpi.dta$cpi[13:19],all.dta$revenue[all.dta$type=="ad"])

## 
##  Pearson's product-moment correlation
## 
## data:  cpi.dta$cpi[13:19] and all.dta$revenue[all.dta$type == "ad"]
## t = 0.76362, df = 5, p-value = 0.4796
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5681540  0.8655781
## sample estimates:
##       cor 
## 0.3231759

cor.test(cpi.dta$cpi[13:19],all.dta$revenue[all.dta$type=="pd"])

## 
##  Pearson's product-moment correlation
## 
## data:  cpi.dta$cpi[13:19] and all.dta$revenue[all.dta$type == "pd"]
## t = 2.8114, df = 5, p-value = 0.03749
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07202579 0.96622764
## sample estimates:
##       cor 
## 0.7826341

We also collected the data of average salary, which has been downloaded as this file. As this data starts from 1980 until 2022, we need to trim it, so that we can do correlation tests with other data, such as cpi. The average salary is highly correlated with cpi.

salary<-read.csv("salary.csv",header=T)
mean_salary<-salary %>% group_by(Period) %>% mutate(ms=mean(Val,na.rm=T)) %>%
                          with(unique(ms))
cor.test(cpi.dta$cpi[1:19],mean_salary[25:43])

## 
##  Pearson's product-moment correlation
## 
## data:  cpi.dta$cpi[1:19] and mean_salary[25:43]
## t = 10.391, df = 17, p-value = 8.804e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8225586 0.9729426
## sample estimates:
##       cor 
## 0.9294956

temp.dta<-tibble(cpi=cpi.dta$cpi[1:19],salary=mean_salary[25:43])
summary(lm(scale(cpi)~scale(salary),temp.dta))

## 
## Call:
## lm(formula = scale(cpi) ~ scale(salary), data = temp.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70305 -0.29000  0.08389  0.31851  0.42391 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.144e-15  8.707e-02    0.00        1    
## scale(salary)  9.295e-01  8.946e-02   10.39  8.8e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3795 on 17 degrees of freedom
## Multiple R-squared:  0.864,  Adjusted R-squared:  0.856 
## F-statistic:   108 on 1 and 17 DF,  p-value: 8.804e-09

temp.dta %>% ggplot(aes(salary,cpi))+geom_point(col="red")+geom_smooth(method="lm")

## `geom_smooth()` using formula = 'y ~ x'

Additionally, we collected the data of unemployment rate from the same website. Here we have downloaded them as a csv file. The unemployment rate is negatively correlated with the average salary. However, it is not significantly correlated with the CPI. The pain index is the sum of cpi and unemployment rate. The unemployment rate is not correlated with any of the revenue for depressive disorders, anxiety disorders, and psychotic disorders. With no doubt, the revenue for depressive disorders is highly correlated with the pain index.

unemp.dta<-read.csv("unemp.csv",header=T)
cor.test(temp.dta$salary,unemp.dta$失業率...)

## 
##  Pearson's product-moment correlation
## 
## data:  temp.dta$salary and unemp.dta$失業率...
## t = -2.9957, df = 17, p-value = 0.008129
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8224305 -0.1822414
## sample estimates:
##        cor 
## -0.5877955

cor.test(temp.dta$cpi,unemp.dta$失業率...)

## 
##  Pearson's product-moment correlation
## 
## data:  temp.dta$cpi and unemp.dta$失業率...
## t = -2.0796, df = 17, p-value = 0.05301
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.750944035  0.004865585
## sample estimates:
##        cor 
## -0.4503389

cor.test(all.dta$revenue[all.dta$type=="dd"][1:7],unemp.dta$失業率...[13:19])

## 
##  Pearson's product-moment correlation
## 
## data:  all.dta$revenue[all.dta$type == "dd"][1:7] and unemp.dta$失業率...[13:19]
## t = -0.25914, df = 5, p-value = 0.8059
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7989192  0.6984905
## sample estimates:
##        cor 
## -0.1151225

cor.test(all.dta$revenue[all.dta$type=="ad"][1:7],unemp.dta$失業率...[13:19])

## 
##  Pearson's product-moment correlation
## 
## data:  all.dta$revenue[all.dta$type == "ad"][1:7] and unemp.dta$失業率...[13:19]
## t = -0.8377, df = 5, p-value = 0.4404
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8731915  0.5466588
## sample estimates:
##        cor 
## -0.3508198

cor.test(all.dta$revenue[all.dta$type=="pd"][1:7],unemp.dta$失業率...[13:19])

## 
##  Pearson's product-moment correlation
## 
## data:  all.dta$revenue[all.dta$type == "pd"][1:7] and unemp.dta$失業率...[13:19]
## t = -0.12035, df = 5, p-value = 0.9089
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7754199  0.7288098
## sample estimates:
##        cor 
## -0.0537461

temp.dta$pain<-temp.dta$cpi+unemp.dta$失業率...
temp.dta$dates<-as.Date(days,origin="1970-01-01")[-20]
temp.dta %>% ggplot(aes(dates,pain))+geom_line()+geom_point(col="red")

cor.test(all.dta$revenue[all.dta$type=="dd"][1:7],temp.dta$pain[13:19])

## 
##  Pearson's product-moment correlation
## 
## data:  all.dta$revenue[all.dta$type == "dd"][1:7] and temp.dta$pain[13:19]
## t = 5.5646, df = 5, p-value = 0.002579
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5803562 0.9895174
## sample estimates:
##       cor 
## 0.9278882

summary(lm(all.dta$revenue[all.dta$type=="dd"][1:7]~temp.dta$cpi[13:19]+
             unemp.dta$失業率...[13:19]))

## 
## Call:
## lm(formula = all.dta$revenue[all.dta$type == "dd"][1:7] ~ temp.dta$cpi[13:19] + 
##     unemp.dta$失業率...[13:19])
## 
## Residuals:
##    X2016    X2017    X2018    X2019    X2020    X2021    X2022 
## -3.69793  1.44974  0.70656 -0.08753  1.81392  1.58746 -1.77223 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                -210.8501    68.2939  -3.087  0.03667 * 
## temp.dta$cpi[13:19]           2.5896     0.4582   5.651  0.00483 **
## unemp.dta$失業率...[13:19]   12.7847    10.0353   1.274  0.27166   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 4 degrees of freedom
## Multiple R-squared:  0.8902, Adjusted R-squared:  0.8352 
## F-statistic: 16.21 on 2 and 4 DF,  p-value: 0.01206

At last, we collected the stock price data from yahoo finance. A useful R package {quantmod} provides an easy way to collect the stock price data in the time range defined by us. Suppose we want to get the stock price of Taiwan, whose code is ^TWII from 2010-01-01 to 2022-12-31. We can use the codes below to achieve this goal. The row names of this data frame are the dates. However, we cannot directly retrieve them, unless we transform this data frame to a matrix first. Thereafter, we reoranize a tibble for containing dates and stock prices.

library(quantmod)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## Loading required package: TTR

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Sdata<-get(getSymbols(paste0("^TWII"),src="yahoo",from="2010-01-01",to="2022-12-31"))

## Warning: ^TWII contains missing values. Some functions will not work if objects
## contain missing values in the middle of the series. Consider using na.omit(),
## na.approx(), na.fill(), etc to remove or replace them.

stock.dta<-as.matrix(Sdata)
dates<-rownames(stock.dta)
stock.dta<-cbind(stock.dta,dates)
stock.dta<-as_tibble(stock.dta)

Now we compute the average stock price per year.

# Compute mean price of ^TWII per year
stock.dta$year<-sapply(stock.dta$dates,function(x){
  return(unlist(strsplit(x,"-"))[1])
})
stock.dta %>% group_by(year) %>% mutate(mprice=mean(as.numeric(TWII.Close),na.rm=T)) %>% 
  ggplot(aes(as.Date(year,"%Y"),mprice))+geom_line()+geom_point(color="tomato")+
  ylab("TWII price")+xlab("Year")

TWII<-stock.dta %>% group_by(year) %>% mutate(mprice=mean(as.numeric(TWII.Close),na.rm=T)) %>% 
  with(unique(mprice))

We know that the revenue for depressive disorders, the CPI, and the stock price are all positively correlated with each other. In order to clarify their interrelationships, we conduct mediation analysis with the stock price as the mediator, the revenue as the dependent variable, and the CPI as the independent variable. The idea is that the revenue goes up, because the consumer price index goes up with the stock price as the mediator. The below analysis supports this idea. First, we regress the revenue on the CPI. Second, we regress the CPI on the stock price. Third, we regress the revenue on the CPI and the stock price. If the stock price is the mediator in between the relationship between the CPI and the revenue, then when the CPI alone as the independent variable, its regression coefficient is significant (e.g., m1). However, when it together with the stock price as the predictors, its regression coefficient should become small or even not significant. In this case, it is not significant any more. Thus, we can draw a conclusion that when the consumer price levels up, the stock price goes up too, that predicts an increase on the revenue for the depressive disorders.

m1<-lm(scale(all.dta$revenue[1:7])~scale(temp.dta$cpi[13:19]))
m2<-lm(scale(TWII[7:13])~scale(temp.dta$cpi[13:19]))
m3<-lm(scale(all.dta$revenue[1:7])~scale(temp.dta$cpi[13:19])+scale(TWII[7:13]))
summary(m1)

## 
## Call:
## lm(formula = scale(all.dta$revenue[1:7]) ~ scale(temp.dta$cpi[13:19]))
## 
## Residuals:
##    X2016    X2017    X2018    X2019    X2020    X2021    X2022 
## -0.43115  0.08915 -0.09211 -0.16175  0.38606  0.61816 -0.40836 
## attr(,"scaled:center")
## [1] 92.75
## attr(,"scaled:scale")
## [1] 6.188
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                6.637e-15  1.627e-01   0.000  1.00000   
## scale(temp.dta$cpi[13:19]) 9.196e-01  1.757e-01   5.233  0.00337 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4304 on 5 degrees of freedom
## Multiple R-squared:  0.8456, Adjusted R-squared:  0.8147 
## F-statistic: 27.38 on 1 and 5 DF,  p-value: 0.003374

summary(m2)

## 
## Call:
## lm(formula = scale(TWII[7:13]) ~ scale(temp.dta$cpi[13:19]))
## 
## Residuals:
##        1        2        3        4        5        6        7 
## -0.17470  0.09182 -0.24418 -0.38377  0.12588  1.04383 -0.45888 
## attr(,"scaled:center")
## [1] 12146
## attr(,"scaled:scale")
## [1] 3011
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                5.559e-15  2.112e-01   0.000    1.000  
## scale(temp.dta$cpi[13:19]) 8.601e-01  2.282e-01   3.769    0.013 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5589 on 5 degrees of freedom
## Multiple R-squared:  0.7397, Adjusted R-squared:  0.6876 
## F-statistic: 14.21 on 1 and 5 DF,  p-value: 0.01303

summary(m3)

## 
## Call:
## lm(formula = scale(all.dta$revenue[1:7]) ~ scale(temp.dta$cpi[13:19]) + 
##     scale(TWII[7:13]))
## 
## Residuals:
##    X2016    X2017    X2018    X2019    X2020    X2021    X2022 
## -0.31378  0.02746  0.07194  0.09609  0.30149 -0.08312 -0.10007 
## attr(,"scaled:center")
## [1] 92.75
## attr(,"scaled:scale")
## [1] 6.188
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                2.902e-15  8.893e-02   0.000   1.0000  
## scale(temp.dta$cpi[13:19]) 3.417e-01  1.883e-01   1.815   0.1437  
## scale(TWII[7:13])          6.718e-01  1.883e-01   3.569   0.0234 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2353 on 4 degrees of freedom
## Multiple R-squared:  0.9631, Adjusted R-squared:  0.9446 
## F-statistic: 52.19 on 2 and 4 DF,  p-value: 0.001362

Application of Data from Web Pages in Psychology

Lee-Xieng Yang

Class 5

Integration of multiple data bases