Introduction to R

Export Data

It is almost impossible to avoid saving your data in a file for future use. To show how to export data to a text file, let us first create some data. Suppose you measured the respiratory exchange ratio (RER) of 18 subjects after exercise. Before exercising, half of the subjects took a capsule containing pure caffeine and the other half took a placebo. The difference in RER between these two groups can tell us whether caffeine has an effect on muscle metabolism.

dta<-data.frame(group=rep(c("C","P"),each=9),
                rer=c(96,99,94,89,96,93,88,105,88,
                      105,119,100,97,96,101,94,95,98))

We can take a quick look at the distribution of RER in these two groups. The caffeine group (C) appears to have a lower mean RER than the placebo group (P).

library(ggplot2)
ggplot(dta,aes(group,rer,color=group))+
  geom_boxplot()+
  geom_jitter(shape=16,position=position_jitter(0.1))
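
To back up the visual impression with numbers, the group means and standard deviations can be computed directly. The following is a minimal sketch that re-creates the data frame so it runs on its own; the names grp_mean and grp_sd are introduced here for illustration only.

```r
# Re-create the example data so this block is self-contained
dta <- data.frame(group = rep(c("C", "P"), each = 9),
                  rer = c(96, 99, 94, 89, 96, 93, 88, 105, 88,
                          105, 119, 100, 97, 96, 101, 94, 95, 98))

# Mean and standard deviation of RER within each group
grp_mean <- tapply(dta$rer, dta$group, mean)
grp_sd   <- tapply(dta$rer, dta$group, sd)

round(grp_mean, 2)  # C: 94.22, P: 100.56
round(grp_sd, 2)
```

tapply( ) applies a function to the values of one variable within each level of another, which is often enough for a quick check before a formal test.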

An independent two-group t-test shows that the difference in RER between these groups is only marginally significant, \(p=.06\), falling just short of the \(\alpha=.05\) criterion. The direction of the difference agrees with the visual inspection.

with(dta,t.test(rer~group,var.equal=TRUE))

## 
##  Two Sample t-test
## 
## data:  rer by group
## t = -1.9948, df = 16, p-value = 0.06339
## alternative hypothesis: true difference in means between group C and group P is not equal to 0
## 95 percent confidence interval:
##  -13.0639064   0.3972398
## sample estimates:
## mean in group C mean in group P 
##        94.22222       100.55556
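
The printed summary is convenient for reading, but the individual numbers can also be pulled out of the object that t.test( ) returns. The sketch below assumes the same data as above (re-created here so the block runs alone); res and welch are names chosen for illustration. It also runs the test without var.equal, in which case R uses the Welch correction by default.

```r
dta <- data.frame(group = rep(c("C", "P"), each = 9),
                  rer = c(96, 99, 94, 89, 96, 93, 88, 105, 88,
                          105, 119, 100, 97, 96, 101, 94, 95, 98))

# Student's t-test with pooled variance, as in the text
res <- t.test(rer ~ group, data = dta, var.equal = TRUE)

res$statistic  # the t value
res$parameter  # the degrees of freedom (16 here)
res$p.value    # about 0.063
res$conf.int   # 95% confidence interval for the mean difference

# Without var.equal = TRUE, R runs Welch's t-test instead
welch <- t.test(rer ~ group, data = dta)
welch$parameter  # degrees of freedom are no longer an integer in general
```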

Now if we want to save these data for future use, we need to export them to a file. Normally, we use write.table( ) to export a data frame to a text file. A file name without a path is written to the current working directory. After saving, you can inspect the file with any text editor, such as Windows Notepad.

write.table(dta,"rer.txt")
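
By default, write.table( ) also writes the row names and puts quotes around character values. If a plainer file is preferred, these can be switched off. The sketch below writes to a temporary file (out_file is a hypothetical path used only for this illustration) and prints the first lines of the result.

```r
dta <- data.frame(group = rep(c("C", "P"), each = 9),
                  rer = c(96, 99, 94, 89, 96, 93, 88, 105, 88,
                          105, 119, 100, 97, 96, 101, 94, 95, 98))

# A temporary file, so nothing is left behind in the working directory
out_file <- tempfile(fileext = ".txt")

# Tab-separated, no row names, no quotes around strings
write.table(dta, out_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Look at the first three lines of the file as raw text
first_lines <- readLines(out_file, n = 3)
first_lines  # "group\trer" "C\t96" "C\t99"
```

A file written this way has to be read back with a matching separator, for example read.table(out_file, header = TRUE, sep = "\t").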

Import Data from a File

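The most basic way to import data is to read a text file with read.table( ). This function is suitable for data in a spreadsheet-like format, with rows as cases and columns as variables. Setting the argument header to TRUE tells R that the first line contains the variable names, and an empty sep (the default) treats any run of white space as the column separator.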

dta<-read.table("rer.txt",header=TRUE,sep="")
dim(dta)

## [1] 18  2

dta

##    group rer
## 1      C  96
## 2      C  99
## 3      C  94
## 4      C  89
## 5      C  96
## 6      C  93
## 7      C  88
## 8      C 105
## 9      C  88
## 10     P 105
## 11     P 119
## 12     P 100
## 13     P  97
## 14     P  96
## 15     P 101
## 16     P  94
## 17     P  95
## 18     P  98
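
After importing, it is worth checking how R has stored each column. The following sketch uses the text argument of read.table( ) so that it runs without any file; the object small and the labels "Caffeine" and "Placebo" are assumptions made for this illustration.

```r
# A few lines in the same layout as rer.txt, given directly as text
txt <- c("group rer",
         "C 96",
         "C 99",
         "P 105",
         "P 119")
small <- read.table(text = txt, header = TRUE)

# Storage class of each column (the result for group depends on the R version)
sapply(small, class)

# Make the grouping variable an explicit factor with readable labels
small$group <- factor(small$group, levels = c("C", "P"),
                      labels = c("Caffeine", "Placebo"))
levels(small$group)  # "Caffeine" "Placebo"
```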

Import Data from a URL

In R, it is very easy to import data from a URL. For example, the call below reads the data stored at the given address. The argument skip indicates how many lines should be skipped before reading begins. On that web page, the first 4 lines describe the data rather than contain them. Thus, we need to skip the first 4 lines in order to get the data.

dta1<-read.table("http://lib.stat.cmu.edu/jcgs/tu",skip=4,header=TRUE)
dim(dta1)

## [1] 136   7

names(dta1)

## [1] "XL"   "XR"   "ZL"   "ZR"   "AGE"  "MULT" "TRT"
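
The effect of skip can be tried offline with a small made-up example. In the sketch below, raw is a hypothetical file content with four description lines followed by a header line and three rows of data; none of it comes from the data set above.

```r
# Four description lines, then the header, then the data
raw <- c("Hypothetical data set",
         "Collected for illustration only",
         "Column id: subject number",
         "Column score: test score",
         "id score",
         "1 10",
         "2 12",
         "3 9")

# Skip the four description lines; the fifth line supplies the column names
d <- read.table(text = raw, skip = 4, header = TRUE)
dim(d)    # 3 2
names(d)  # "id" "score"
```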

Now we can start analyzing the imported data. Suppose we would like to examine whether XR differs between the two age groups. First, we can make a box plot for the XR data.

ggplot(dta1,aes(as.factor(AGE),XR,color=as.factor(AGE)))+
  geom_boxplot()+
  geom_dotplot(binaxis="y",stackdir="center",binwidth=0.8,dotsize=0.5)

It looks like XR might differ between the two age groups. A t-test supports this impression: the difference is significant at \(\alpha=.05\), although only barely (\(p=.049\)).

with(dta1,t.test(XR~AGE,var.equal=TRUE))

## 
##  Two Sample t-test
## 
## data:  XR by AGE
## t = -1.9839, df = 134, p-value = 0.04931
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -2.93260189 -0.00452457
## sample estimates:
## mean in group 1 mean in group 2 
##        13.19101        14.65957

Import Unstructured Data from a File

Sometimes we need to import unstructured data from a file. However, read.table( ) only accepts structured data (i.e., spreadsheet-like data). What can we do? In R, you can use the function scan( ) to import such data. For example, the URL below points to a post copied from Dcard. These data are textual and unstructured. Thus, we use scan( ) instead of read.table( ) to import them.

url<-"http://140.119.175.225:880/PsyCoding/dcard_post1.txt"
post<-scan(url,what=character())
mode(post)

## [1] "character"

length(post)

## [1] 27

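We assign the textual data read from the web page to a variable named post. This variable is a character vector with 27 elements. We can check what the elements are. Here the elements are the lines of the post, one by one. Note, however, that scan( ) splits its input at any white space by default, so a line containing spaces would be broken into several elements.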

post

##  [1] "我曾經是一個愛玩交友軟體會去夜店跑趴的女生"
##  [2] "會跟網友出去"                              
##  [3] "至於做什麼我也不用多講"                    
##  [4] "現任男友也是交友軟體認識的"                
##  [5] "見面彼此感覺都很好決定交往"                
##  [6] "我完全不介意他的過去"                      
##  [7] "但他卻非常介意我以前做過的事"              
##  [8] "我真的很後悔交往前什麼都跟他講"            
##  [9] "他常常說一直想起我跟別人的畫面"            
## [10] "讓他覺得很煩很難過為此已爭吵多次"          
## [11] "我幾乎都有安撫道歉"                        
## [12] "但有一次他說我很髒"                        
## [13] "我真的心都碎了"                            
## [14] "他後來有道歉"                              
## [15] "他說他知道我現在不會再做這些事情"          
## [16] "但過去就是無法改變他就是很介意"            
## [17] "除了分手還有什麼辦法嗎我真的很愛他"        
## [18] "為了他可以不跟任何男生出去都沒問題"        
## [19] "夜店也是不可能再去的"                      
## [20] "今天他又提起過去的事情"                    
## [21] "我真的很難過但還是跟他講"                  
## [22] "只要他願意我不會放棄"                      
## [23] "在此想強調"                                
## [24] "愛玩的當下不要以為你能對自己負責就好"      
## [25] "我曾經也是這麼狂妄"                        
## [26] "但真正遇到一個你愛的人他也愛你"            
## [27] "看見他為了這些事難過你也絕對好過不到哪"
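
When the text does contain spaces, the default behaviour of scan( ) gives one element per word rather than per line. The sketch below illustrates the difference with a small temporary file (tmp and its two lines are made up for this purpose) and shows two ways of getting one element per line.

```r
# A two-line text file whose lines contain spaces
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line here", "second line"), tmp)

# Default: split at any white space, so each word becomes an element
tokens <- scan(tmp, what = character(), quiet = TRUE)
length(tokens)  # 5

# Splitting only at line breaks keeps each line whole
by_line <- scan(tmp, what = character(), sep = "\n", quiet = TRUE)
length(by_line)  # 2

# readLines( ) is another common way to read a file line by line
same_lines <- readLines(tmp)
identical(by_line, same_lines)  # TRUE
```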

Each element of this character vector is a string. A string is whatever is enclosed in a pair of double quotes (" and ") or single quotes (' and '). Note that length( ) counts the elements of a vector, not the characters in a string, so the length of a single string is always 1. The first line is longer than the second line, yet both have a length of 1.

post[1]

## [1] "我曾經是一個愛玩交友軟體會去夜店跑趴的女生"

length(post[1])

## [1] 1

post[2]

## [1] "會跟網友出去"

length(post[2])

## [1] 1

The number of characters in a string can be checked with nchar( ). The first line has more characters than the second line.

nchar(post[1])

## [1] 21

nchar(post[2])

## [1] 6
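
The difference between length( ) and nchar( ) is easiest to see on a vector with several strings. The vector fruits below is a made-up example, not part of the Dcard data.

```r
fruits <- c("apple", "kiwi", "banana")

# length( ) counts the elements of the vector
n_elements <- length(fruits)
n_elements  # 3

# nchar( ) is vectorized: it counts the characters of each element
n_chars <- nchar(fruits)
n_chars  # 5 4 6

# The longest string can be found by combining the two ideas
fruits[which.max(n_chars)]  # "banana"
```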

Strings

Some useful functions for strings are worth introducing. First, we can split a string at a character or a sequence of characters with strsplit( ). The value returned by strsplit( ) is a list. A list is a particular data structure in R. You have already seen a special case of a list, namely the data frame. A data frame can hold variables of different modes, and so can a list. I declare a list named A to store the value returned by strsplit( ).

strsplit(post[1],"交友軟體")

## [[1]]
## [1] "我曾經是一個愛玩"   "會去夜店跑趴的女生"

A<-strsplit(post[1],"交友軟體")
mode(A)

## [1] "list"

length(A)

## [1] 1

A

## [[1]]
## [1] "我曾經是一個愛玩"   "會去夜店跑趴的女生"

This list has only one component, which is a character vector of two elements. To access an element of a vector, we index it by position with single brackets; for example, v[2] is the second element of a vector v. To extract a component of a list, you have to use double brackets, namely [[ ]]; single brackets return a sub-list instead. Take A as an example. Although A contains the two segments of the first line of the Dcard post, they are stored in the list as a single vector. Thus, A[1] is not “我曾經是一個愛玩” but a list of length 1. For comparison, you can check what A[[1]] returns.

A[1]

## [[1]]
## [1] "我曾經是一個愛玩"   "會去夜店跑趴的女生"

A[[1]]

## [1] "我曾經是一個愛玩"   "會去夜店跑趴的女生"

In this case, if you want to get the first segment after splitting the first line, what should you do? Compare the results of the two expressions below. The first returns all elements, whereas the second returns what we really want, namely the first segment.

A[1][1]

## [[1]]
## [1] "我曾經是一個愛玩"   "會去夜店跑趴的女生"

A[[1]][1]

## [1] "我曾經是一個愛玩"
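
The same distinction between single and double brackets can be checked with class( ). The list L below is a hypothetical example with two named components; names are not required, but they allow the $ shortcut as well.

```r
# A list with a character vector and a number as its components
L <- list(words = c("alpha", "beta"), n = 2)

class(L[1])    # "list": single brackets return a sub-list
class(L[[1]])  # "character": double brackets return the component itself

# Once the component is extracted, ordinary vector indexing applies
second_word <- L[[1]][2]
second_word    # "beta"

# For a named list, $ is equivalent to double brackets with the name
identical(L$words, L[["words"]])  # TRUE
```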

We can also declare a list to store a vector. The code below declares a list containing only one vector. Although the vector is numeric, you cannot do arithmetic on the list directly. You first need to extract the vector with double brackets; then you can do calculations with it.

B<-list(c(1,2,3,5,8))
B

## [[1]]
## [1] 1 2 3 5 8

# You can try B+3 and you will get an error message
B[[1]]+10

## [1] 11 12 13 15 18

Now back to the Dcard post. We can turn the list A into an ordinary vector with unlist( ). Note that “交友軟體” appears in neither element of this vector, because strsplit( ) removes the pattern used for splitting.

unlist(A)

## [1] "我曾經是一個愛玩"   "會去夜店跑趴的女生"
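
The reason strsplit( ) returns a list becomes clearer when it is given several strings at once: each string may be split into a different number of pieces. The sketch below uses a made-up vector txt and the hyphen as the separator.

```r
txt <- c("a-b-c", "d-e", "f")

# One list component per input string
parts <- strsplit(txt, "-", fixed = TRUE)
length(parts)   # 3
lengths(parts)  # 3 2 1

# Take the first piece of every string
first_piece <- sapply(parts, `[`, 1)
first_piece     # "a" "d" "f"

# Flatten everything into a single character vector
all_pieces <- unlist(parts)
all_pieces      # "a" "b" "c" "d" "e" "f"
```

Here sapply( ) applies the indexing function `[` to every component of the list, which avoids writing a loop.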

In addition to splitting a string, we can check whether a string contains a particular sequence of characters. The function grep( ) is normally used for this job. It returns the positions of the elements that contain the pattern. Since post[1] is a vector with a single element, grep( ) returns 1 when the pattern is found; when it is not found, grep( ) returns an empty integer vector, integer(0), rather than 0. Of course, you can also use grepl( ) if you want a logical value of TRUE or FALSE.

grep('交友軟體',post[1])

## [1] 1

grepl('交友軟體',post[1])

## [1] TRUE
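
The behaviour of grep( ) and grepl( ) is easier to see on a vector with more than one element. The vector x below is a made-up example.

```r
x <- c("apple pie", "banana", "pineapple")

# grep( ) returns the positions of the matching elements
hits <- grep("apple", x)
hits              # 1 3

# grepl( ) returns one logical value per element
is_hit <- grepl("apple", x)
is_hit            # TRUE FALSE TRUE

# No match gives an empty integer vector, not 0
none <- grep("cherry", x)
length(none)      # 0

# value = TRUE returns the matching strings themselves
grep("apple", x, value = TRUE)  # "apple pie" "pineapple"
```

Because grepl( ) always returns a vector as long as its input, it is usually the safer choice inside if( ) conditions or when filtering rows.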

Often we substitute a part of a string with sub( ) or gsub( ). sub( ) replaces only the first occurrence of the target, whereas gsub( ) replaces all occurrences. Thus, if the target character(s) appear in the string more than once and you want to replace them all, you should use gsub( ). See the string tt. There are two b’s. If we want to substitute H for all b’s, gsub( ) should be used. For comparison, you can check the outcome returned by sub( ) for the same substitution.

sub("交友軟體","Tinder",post[1])

## [1] "我曾經是一個愛玩Tinder會去夜店跑趴的女生"

tt<-"a b c b d e"
gsub("b","H",tt)

## [1] "a H c H d e"

sub("b","H",tt)

## [1] "a H c b d e"
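
One point to keep in mind is that sub( ) and gsub( ), like grep( ) and strsplit( ), treat the pattern as a regular expression by default. Characters such as the dot then have a special meaning. The sketch below, with a made-up string ver, shows how this can surprise and two ways to match the dot literally.

```r
ver <- "a.b.c"

# In a regular expression, "." matches any character
gsub(".", "-", ver)                # "-----"

# fixed = TRUE treats the pattern as plain text
gsub(".", "-", ver, fixed = TRUE)  # "a-b-c"

# Alternatively, escape the dot in the regular expression
gsub("\\.", "-", ver)              # "a-b-c"

# sub( ) still replaces only the first occurrence
sub(".", "-", ver, fixed = TRUE)   # "a-b.c"
```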

Import Data from a File Created by Other Software

From Excel

If you want to import a *.csv file created by Excel, you can simply treat it as a text file and use read.table( ) to import the data in it. Note that the argument sep is now set to “,” because the columns in a csv file are separated by commas. Alternatively, you can use read.csv( ), which uses this separator by default. Compare read.table( ) and read.csv( ) when importing data from the same csv file. The warning in the output below indicates that the last line of the file does not end with a newline character; the data are still read correctly.

dta3<-read.table("col.csv",header=T,sep=",")

## Warning in read.table("col.csv", header = T, sep = ","): incomplete final line
## found by readTableHeader on 'col.csv'

dta3

##   Col1 Col2 Col3
## 1  100   a1  b1 
## 2  200   a2  b2 
## 3  300   a3   b3

read.csv("col.csv",header=T)

## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'col.csv'

##   Col1 Col2 Col3
## 1  100   a1  b1 
## 2  200   a2  b2 
## 3  300   a3   b3
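
The trailing spaces visible in the Col3 values above can be removed while reading. The sketch below builds a small csv file with similar content in a temporary location (csv_file is a hypothetical path, and the file is an imitation of col.csv, not the original) and reads it with strip.white set to TRUE.

```r
# Imitate the layout of the csv file, including trailing spaces
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Col1,Col2,Col3",
             "100,a1,b1 ",
             "200,a2,b2 ",
             "300,a3,b3"), csv_file)

# strip.white = TRUE removes leading and trailing spaces from unquoted fields
d <- read.csv(csv_file, strip.white = TRUE)

col3 <- as.character(d$Col3)
col3         # "b1" "b2" "b3"
nchar(col3)  # 2 2 2
```

Because writeLines( ) ends every line, including the last, with a newline, reading this file does not trigger the incomplete final line warning.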

Second, the data in an xls file can be imported with read.xls( ) in the package {gdata}. The argument sheet in read.xls( ) indicates which sheet the data will be retrieved from. However, it seems that this function cannot properly display Chinese characters (see the Name column of the second sheet below).

library(gdata)

## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

##

## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

## 
## Attaching package: 'gdata'

## The following object is masked from 'package:stats':
## 
##     nobs

## The following object is masked from 'package:utils':
## 
##     object.size

## The following object is masked from 'package:base':
## 
##     startsWith

dta4<-read.xls("col.xls",sheet=1)
dta4

##   Col1 Col2 Col3.
## 1  100   a1   b1 
## 2  200   a2   b2 
## 3  300   a3    b3

read.xls("col.xls",sheet=2)

##   Name Eng Math
## 1   __  60   84
## 2   __  65   76
## 3   __  78   56
## 4   __  98   45

To import an xlsx file, you can use read_excel( ) in the package {readxl}. The object returned by read_excel( ) is a tibble, which is similar to a data frame, and you can just treat it as a data frame.

library(readxl)
dta6<-read_excel("col.xlsx",sheet=1)
dta6

## # A tibble: 3 x 3
##    Col1 Col2  Col3 
##   <dbl> <chr> <chr>
## 1   100 a1    b1   
## 2   200 a2    b2   
## 3   300 a3    b3
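
If the sheet names of a workbook are not known in advance, they can be listed before reading. The sketch below does not use col.xlsx; it assumes that {readxl} is installed and relies on one of the example workbooks shipped with that package. The names path, sheets, tb and df are chosen for illustration.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook that comes with the readxl package
  path <- readxl::readxl_example("datasets.xlsx")

  # List the sheets, then read the first one by name
  sheets <- readxl::excel_sheets(path)
  tb <- readxl::read_excel(path, sheet = sheets[1])

  # Convert the tibble to a plain data frame if one is needed
  df <- as.data.frame(tb)
  class(df)  # "data.frame"
}
```

The sheet argument accepts either a position, as in the text, or a sheet name, as here.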

From SPSS

Although not recommended, many people love to use SPSS for statistical analysis. An SPSS data file always has the extension .sav. If you want to import the data from a .sav file, you need to use the function read.spss( ) in the package {foreign}. Note that the argument to.data.frame is set to TRUE, so that the returned object is a data frame. We can check the first 6 rows of this data frame.

library(foreign)
nlr.dta<-read.spss("p025a.sav",to.data.frame=TRUE)
head(nlr.dta)

##    Y  X
## 1  1 -7
## 2 14 -6
## 3 25 -5
## 4 34 -4
## 5 41 -3
## 6 46 -2
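
To see what to.data.frame changes, the same file can be read with and without it. The sketch below does not use p025a.sav; it assumes that the {foreign} package ships the example file electric.sav in its files folder, as used in the package's own help page, and it does nothing if that file is not found. The names sav, as_list and as_df are chosen for illustration.

```r
if (requireNamespace("foreign", quietly = TRUE)) {
  # Example SPSS file assumed to be installed with the foreign package
  sav <- system.file("files", "electric.sav", package = "foreign")

  if (nzchar(sav)) {
    # Without to.data.frame, the result is a plain list of variables
    as_list <- foreign::read.spss(sav)
    class(as_list)  # "list"

    # With to.data.frame = TRUE, the result is a data frame
    as_df <- foreign::read.spss(sav, to.data.frame = TRUE)
    class(as_df)    # "data.frame"
    head(as_df)
  }
}
```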