R is a statistical programming language, and RStudio is an IDE (Integrated Development Environment) for R. Both can be found easily by searching for their names on Google. Students are expected to have R and RStudio installed on their computers before this class. Now let’s get started.
As in other programming languages, variables and functions are the building blocks of R. Variables have at least two modes: numeric and character. We use the assignment operator <- to create a variable. The command below creates a variable x whose value is 2.
x<-2
x
## [1] 2
Of course, numeric values are not limited to whole numbers; decimal (floating-point) numbers are also acceptable. Here we create a variable y that holds a decimal number.
y<-3.141592
y
## [1] 3.141592
As x and y are both numeric, we can apply arithmetic operations to them.
x+y
## [1] 5.141592
x-y
## [1] -1.141592
x*y
## [1] 6.283184
x/y
## [1] 0.6366199
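Beyond the four basic operators, base R also offers exponentiation, integer division, the remainder, and functions such as sqrt( ) and round( ). A short sketch, reusing the values of x and y from above:

```r
x <- 2
y <- 3.141592
x^3            # exponentiation
## [1] 8
7 %/% x        # integer division
## [1] 3
7 %% x         # remainder of the division
## [1] 1
sqrt(x)        # square root
## [1] 1.414214
round(y, 2)    # round to two decimal places
## [1] 3.14
class(x)       # whole numbers and decimals are both stored as "numeric"
## [1] "numeric"
```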
In addition to numeric, character is another common mode of variable in R. When a character variable is defined, its content must be enclosed in a pair of double quotes (") or single quotes ('). Note that we cannot apply arithmetic operations to character variables; R will give you an error message.
a<-"a"
b<-"b"
a
## [1] "a"
b
## [1] "b"
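To see the error without stopping a script, the failing call can be wrapped in tryCatch( ). This is only a sketch; the message text shown is what R reports in an English locale, and the variable name msg is made up for illustration.

```r
a <- "a"
# Capture the error message as a string instead of halting execution
msg <- tryCatch(a + 1, error = function(e) conditionMessage(e))
msg
## [1] "non-numeric argument to binary operator"
# Character values are joined with paste0( ), not with +
paste0(a, "b")
## [1] "ab"
# nchar( ) counts the characters in a string
nchar("hello")
## [1] 5
```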
Variables can vary in structure: a single value, a vector, a matrix, or an array. We can use the function c( ) to create a vector. All elements of a vector must be of the same mode: all numeric or all character. If numeric and character elements are mixed in a vector, R converts every element to character.
x<-c(1,3,5,7,9)
x
## [1] 1 3 5 7 9
y<-c(x,'a')
y
## [1] "1" "3" "5" "7" "9" "a"
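The conversion can be checked with class( ), and as.numeric( ) converts back where possible. A minimal sketch (z is a name introduced here for illustration):

```r
x <- c(1, 3, 5, 7, 9)
y <- c(x, "a")
class(x)
## [1] "numeric"
class(y)
## [1] "character"
# "a" cannot be read as a number, so it becomes NA;
# R would normally warn about this, which we silence here
z <- suppressWarnings(as.numeric(y))
z
## [1]  1  3  5  7  9 NA
```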
We can retrieve an element of a vector by its position. Positions run from 1 to the length of the vector. In the commands below, length( ) counts the elements of a vector (or of a matrix or array).
x[3]
## [1] 5
length(x)
## [1] 5
x[length(x)]
## [1] 9
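The brackets accept more than a single position. The sketch below shows a few other common forms of indexing on the same vector:

```r
x <- c(1, 3, 5, 7, 9)
x[c(1, 3)]     # several positions at once
## [1] 1 5
x[2:4]         # a range of positions
## [1] 3 5 7
x[-1]          # a negative index drops that position
## [1] 3 5 7 9
x[x > 4]       # a logical condition keeps the matching elements
## [1] 5 7 9
```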
We can also do arithmetic on vectors, just as on scalars, as long as (1) they are all numeric and, in the usual case, (2) they have the same length. Note that these operations are all element-wise.
a<-1:5
b<-(1:5)*2
a
## [1] 1 2 3 4 5
b
## [1]  2  4  6  8 10
a+b
## [1]  3  6  9 12 15
b-a
## [1] 1 2 3 4 5
a*b
## [1]  2  8 18 32 50
b/a
## [1] 2 2 2 2 2
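The equal-length condition is the common case rather than a strict rule: when the lengths differ, R recycles the shorter vector. A minimal sketch, with vectors made up for illustration:

```r
a <- 1:6
b <- c(10, 20)
a + b      # b is reused as 10 20 10 20 10 20
## [1] 11 22 13 24 15 26
a * 2      # a scalar is a vector of length 1, reused for every element
## [1]  2  4  6  8 10 12
```

If the longer length is not a multiple of the shorter one, R still computes a result but issues a warning, so it is safest to keep the lengths compatible.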
A matrix is a two-dimensional data structure. We can create one with matrix( ). Note that, by default, the elements fill the matrix column by column. Setting byrow=T fills it row by row instead.
v1<-matrix(1:9,3,3)
v1
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
v2<-matrix(1:9,3,3,byrow=T)
v2
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
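Elements of a matrix are retrieved with two indices, row first and then column. A short sketch using v1 from above:

```r
v1 <- matrix(1:9, 3, 3)
dim(v1)        # number of rows and columns
## [1] 3 3
v1[2, 3]       # row 2, column 3
## [1] 8
v1[2, ]        # the whole second row
## [1] 2 5 8
v1[, 3]        # the whole third column
## [1] 7 8 9
```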
Of course, we can apply arithmetic operations to numeric matrices. A special case is matrix multiplication, which is carried out by the operator %*%. We can use the function t( ) to transpose a matrix.
m<-matrix(c(1,0,0,1),2,2)
m
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
n<-matrix(1,2,4)
n
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    1    1    1    1
m%*%n
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    1    1    1    1
t(n)%*%m
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    1    1
## [4,]    1    1
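It is easy to confuse %*% with the ordinary * operator, which multiplies matrices element by element. The sketch below contrasts the two on small matrices p and q, both made up for illustration:

```r
p <- matrix(1:4, 2, 2)              # columns are (1, 2) and (3, 4)
q <- matrix(c(0, 1, 1, 0), 2, 2)    # swaps the two columns when multiplied from the right
p * q        # element-wise product
##      [,1] [,2]
## [1,]    0    3
## [2,]    2    0
p %*% q      # matrix product
##      [,1] [,2]
## [1,]    3    1
## [2,]    4    2
```

For %*% the number of columns of the left matrix must equal the number of rows of the right matrix, which is why n had to be transposed before being multiplied by m above.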
An array has more dimensions than a matrix. However, arrays are not used frequently in this class, so we will not introduce them. One characteristic of vectors and matrices is that all their elements must be of the same mode. What if we want to keep contents of different types together? A type of variable called a list is suitable for this. In the example below, we create a list called h that contains a numeric scalar, a numeric vector, and a character vector. If we want to retrieve the first element of this list, h[1] does not work, because single brackets [ ] return a list that still wraps the element. We need double brackets to retrieve the content itself, that is, h[[1]]; we can then apply arithmetic operations to h[[1]].
h<-list(1,c(2,3,4),c('john','marry','steve'))
h
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2 3 4
##
## [[3]]
## [1] "john" "marry" "steve"
h[1]
## [[1]]
## [1] 1
h[[1]]+10
## [1] 11
Similarly, to retrieve the second element of the character vector in h, we can use h[[3]][2]. The extra pair of [ ] protects the contents of a list from users’ careless operations.
h[[3]][2]
## [1] "marry"
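List elements can also be given names, which makes them easier to retrieve than by position. A minimal sketch; the list h2 and its element names are made up for illustration:

```r
h2 <- list(id = 1, scores = c(2, 3, 4), people = c("john", "marry", "steve"))
h2$scores            # same as h2[["scores"]]
## [1] 2 3 4
h2[["people"]][2]    # second element of the character vector
## [1] "marry"
class(h2[1])         # single brackets return a list
## [1] "list"
class(h2[[1]])       # double brackets return the content itself
## [1] "numeric"
```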
A data frame is a special case of a list; it is essentially a spreadsheet for holding data. There are many ways to create a data frame. A common way is to use the function data.frame( ).
dta<-data.frame(name=c('John','May','Steve'),math=c(90,80,70),eng=c(60,70,80))
dta
##    name math eng
## 1  John   90  60
## 2   May   80  70
## 3 Steve   70  80
A data frame has a spreadsheet structure, just like a single sheet of an Excel file. We can use the function summary( ) to get a quick look at each column of a data frame. As shown in the output below, summary statistics are computed for the numeric columns. For a character column, however, no statistics can be calculated other than counting how often each value occurs.
summary(dta)
##      name                math         eng
##  Length:3           Min.   :70   Min.   :60
##  Class :character   1st Qu.:75   1st Qu.:65
##  Mode  :character   Median :80   Median :70
##                     Mean   :80   Mean   :70
##                     3rd Qu.:85   3rd Qu.:75
##                     Max.   :90   Max.   :80
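A few other base R functions complement summary( ). The sketch below counts the values in the character column with table( ) and computes the means of the numeric columns with colMeans( ); the variable name counts is introduced only for illustration.

```r
dta <- data.frame(name = c("John", "May", "Steve"),
                  math = c(90, 80, 70),
                  eng = c(60, 70, 80))
counts <- table(dta$name)   # how many times each name occurs (once each here)
dim(dta)                    # number of rows and columns
## [1] 3 3
colMeans(dta[, c("math", "eng")])   # means of the two numeric columns
## math  eng 
##   80   70 
```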
A tibble is a modern reimagining of the data.frame. It is still a data frame, but it is more compatible with the functions data scientists use in packages such as {dplyr}, {tidyr}, and the {tidyverse} collection. In the example below, we first load the package {dplyr} with library( ). Then we convert dta to a tibble.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dta2<-as_tibble(dta)
dta2
## # A tibble: 3 × 3
##   name   math   eng
##   <chr> <dbl> <dbl>
## 1 John     90    60
## 2 May      80    70
## 3 Steve    70    80
Whether we have a data.frame or a tibble, we can always access the data in a column with the operator $. The data in a numeric column can, of course, be used directly in arithmetic.
dta2$math
## [1] 90 80 70
dta2$name
## [1] "John" "May" "Steve"
dta2$math+10
## [1] 100 90 80
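The $ operator can also create a new column or feed a column into a function. A minimal sketch on the plain data frame dta (the same commands should work on the tibble); the column name diff is made up for illustration:

```r
dta <- data.frame(name = c("John", "May", "Steve"),
                  math = c(90, 80, 70),
                  eng = c(60, 70, 80))
mean(dta$math)                    # average Mathematics score
## [1] 80
dta$diff <- dta$math - dta$eng    # new column computed from two existing ones
dta$diff
## [1]  30  10 -10
dta$name[dta$math > 75]           # names of students scoring above 75 in Mathematics
## [1] "John" "May"
```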
In R, the plotting functions are always a highlight. The package {ggplot2} is my only suggestion for making professional figures with R. For example, we can make a scatter plot of the Mathematics and English scores of these three students. We can also add a regression line to show the linear relation between these two scores.
library(ggplot2)
ggplot(dta,aes(eng,math))+geom_point(shape=1)+geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
Since we have loaded the package {dplyr}, we can use the special operator called the pipe, %>%, to pass data from one function to the next. Suppose we want to draw box plots of these two scores in a single figure. First we need to change dta from the wide format to the long format. We can use the function gather( ) in the package {tidyr}.
library(tidyr)
dta %>% gather(topic,score,math,eng)
##    name topic score
## 1  John  math    90
## 2   May  math    80
## 3 Steve  math    70
## 4  John   eng    60
## 5   May   eng    70
## 6 Steve   eng    80
dta %>% gather(topic,score,math,eng) %>%
ggplot(aes(topic,score))+geom_boxplot()
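In recent versions of {tidyr}, gather( ) is documented as superseded by pivot_longer( ). The sketch below shows what should be the equivalent reshaping, assuming a {tidyr} version that provides pivot_longer( ); the object name long is made up for illustration.

```r
library(tidyr)
dta <- data.frame(name = c("John", "May", "Steve"),
                  math = c(90, 80, 70),
                  eng = c(60, 70, 80))
# Stack the math and eng columns: the old column names go into "topic",
# the values go into "score"
long <- pivot_longer(dta, cols = c(math, eng),
                     names_to = "topic", values_to = "score")
long
```

The rows come out grouped by student rather than by topic, but the content is the same, and the result can be passed to ggplot( ) in the same way.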
We can save a data.frame to a file for future use. For example, to save the data in dta, we can call write.table( ). We set row.names to F so that there are no row numbers in the output file. In R, F and T are shorthand for the logical values FALSE and TRUE respectively. Once the data are saved in a text file, we can import them from this file with the function read.table( ).
write.table(dta,file="mydata.txt",row.names=F)
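To check that the file can be read back, a round trip can be sketched as follows. The temporary path from tempfile( ) and the object name back are choices made here for illustration; any writable file name, such as mydata.txt, would do.

```r
dta <- data.frame(name = c("John", "May", "Steve"),
                  math = c(90, 80, 70),
                  eng = c(60, 70, 80))
path <- tempfile(fileext = ".txt")          # hypothetical location for the file
write.table(dta, file = path, row.names = FALSE)
# header = TRUE tells R that the first line holds the column names
back <- read.table(path, header = TRUE)
back
##    name math eng
## 1  John   90  60
## 2   May   80  70
## 3 Steve   70  80
```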
There are many other ways to import data. One useful way is to import the data directly from a web page. For example, we have a data file shared on the server of my lab. We can use read.table( ) to import the data in it.
url<-"http://cml.nccu.edu.tw/lxyang/Statistics/BodyWeights2.dat"
dta2<-read.table(url,header=T,sep="")
dta2
##    Age20 Age8 Age17
## 1     58   31    69
## 2     52   25    50
## 3     61   29    69
## 4     57   20    39
## 5     93   40    90
## 6     63   32    82
## 7     68   27    65
## 8     71   33    72
## 9     73   28    66
## 10    48   21    43
A look at this data set suggests that these are the body weights of 10 participants at ages 8, 17, and 20. A quick research question might be whether body weights at a younger and an older age are positively correlated. Thus, we focus on the two columns for age 8 and age 20 and test the correlation coefficient between them. The result supports our expectation.
cor.test(dta2$Age20,dta2$Age8)
##
## Pearson's product-moment correlation
##
## data: dta2$Age20 and dta2$Age8
## t = 3.9692, df = 8, p-value = 0.004123
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3792010 0.9545561
## sample estimates:
## cor
## 0.8143881
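The object returned by cor.test( ) is a list, so individual numbers can be pulled out with $ instead of being read off the printout. In the sketch below the two columns are typed in from the table above so that it runs without the web file; the names bw, res, and res_pos are made up for illustration.

```r
bw <- data.frame(Age20 = c(58, 52, 61, 57, 93, 63, 68, 71, 73, 48),
                 Age8  = c(31, 25, 29, 20, 40, 32, 27, 33, 28, 21))
res <- cor.test(bw$Age20, bw$Age8)
res$estimate     # the correlation coefficient
res$p.value      # the two-sided p-value
res$conf.int     # the 95% confidence interval
# Since we expect a positive correlation, a one-sided test is also possible
res_pos <- cor.test(bw$Age20, bw$Age8, alternative = "greater")
res_pos$p.value
```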
Of course, we can make a scatter plot for the body weights at these two ages.
dta2 %>% ggplot(aes(Age8,Age20))+geom_point(shape=16,color="tomato")+
geom_smooth(method="lm",color="blue")
## `geom_smooth()` using formula = 'y ~ x'
If you want, we can also run a regression analysis with the body weight at age 20 as the dependent variable and the body weight at age 8 as the independent variable. The slope is significantly larger than 0, indicating a positive relation between body weights at these two ages. \(R^2\) is about 0.66, meaning that about 66% of the variance in body weight at age 20 can be explained by body weight at age 8.
m1<-lm(Age20~Age8,dta2)
summary(m1)
##
## Call:
## lm(formula = Age20 ~ Age8, data = dta2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -10.641  -5.555  -2.072   7.455   9.660
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  13.8588    12.9756   1.068  0.31666
## Age8          1.7672     0.4452   3.969  0.00412 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.894 on 8 degrees of freedom
## Multiple R-squared: 0.6632, Adjusted R-squared: 0.6211
## F-statistic: 15.75 on 1 and 8 DF, p-value: 0.004123
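The fitted model can also be queried directly. The sketch below extracts the coefficients and \(R^2\) and predicts the weight at age 20 for a hypothetical child weighing 30 at age 8. The data are typed in from the table above so that the code runs on its own; bw, r2, and pred are names introduced for illustration.

```r
bw <- data.frame(Age20 = c(58, 52, 61, 57, 93, 63, 68, 71, 73, 48),
                 Age8  = c(31, 25, 29, 20, 40, 32, 27, 33, 28, 21))
m1 <- lm(Age20 ~ Age8, data = bw)
coef(m1)                        # intercept and slope
r2 <- summary(m1)$r.squared     # R-squared as a plain number
r2
# Predicted weight at age 20 for a weight of 30 at age 8,
# i.e. intercept + slope * 30
pred <- predict(m1, newdata = data.frame(Age8 = 30))
pred
```

With the estimates shown in the output above, the prediction is roughly 13.86 + 1.77 × 30, or about 66.9.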