Variables and Functions

As in other programming languages, variables and functions are the building blocks of R. Variables have at least two modes: numeric and character. We use the operator <- to assign a value to a variable. The command below creates a variable x whose value is 2.

x<-2
x

## [1] 2

Numeric values need not be whole numbers; floating-point values are acceptable as well. Here we declare a floating-point variable y.

y<-3.141592
y

## [1] 3.141592
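As a small illustrative sketch (the variable names here are made up for this example), typeof( ) shows how R stores a number internally. A plain 2 is stored as a double; the L suffix requests an integer.

```r
# Illustrative names; not part of the original example.
whole   <- 2          # stored as a double by default
counted <- 2L         # the L suffix requests an integer
decimal <- 3.141592

typeof(whole)         # "double"
typeof(counted)       # "integer"
typeof(decimal)       # "double"

# All three count as numeric, so arithmetic works on any of them.
all_numeric <- is.numeric(whole) && is.numeric(counted) && is.numeric(decimal)
all_numeric           # TRUE
```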

As x and y are both numeric, we can apply arithmetic operators to them.

x+y

## [1] 5.141592

x-y

## [1] -1.141592

x*y

## [1] 6.283184

x/y

## [1] 0.6366199

In addition to numeric, character is another common mode of variable in R. When a character variable is defined, its content must be enclosed in a pair of double quotes (") or single quotes ('). Note that we cannot apply arithmetic operations to character variables; R will return an error message.

a<-"a"
b<-"b"
a

## [1] "a"

b

## [1] "b"
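The error mentioned above can be seen safely with tryCatch( ), which captures the message instead of stopping the script. This is only a sketch; the names s and msg are chosen for illustration.

```r
s <- "a"

# Adding a number to a character value fails; tryCatch() returns the
# error message as a string instead of halting execution.
msg <- tryCatch(s + 1, error = function(e) conditionMessage(e))
msg

# mode() confirms that s is a character variable.
mode(s)                  # "character"

# To combine character values, use paste() rather than +.
joined <- paste(s, "b")
joined                   # "a b"
```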

Vectors

Variables can differ in structure: a single value, a vector, a matrix, or an array. We can use the function c( ) to create a vector. All elements of a vector must be of the same mode: all numeric or all character. If numeric and character elements are mixed in a vector, R converts every element to character.

x<-c(1,3,5,7,9)
x

## [1] 1 3 5 7 9

y<-c(x,'a')
y

## [1] "1" "3" "5" "7" "9" "a"
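A short sketch of this conversion, using illustrative names: mode( ) reports the mode before and after mixing, and as.numeric( ) converts back the elements that look like numbers.

```r
nums  <- c(1, 3, 5)
mixed <- c(nums, "a")

mode(nums)     # "numeric"
mode(mixed)    # "character"

# Converting back: digit strings become numbers, anything else becomes NA.
# R normally warns about the NA; the warning is suppressed here for brevity.
back <- suppressWarnings(as.numeric(mixed))
back           # 1 3 5 NA
```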

We can retrieve an element of a vector by its position. Positions run from 1 to the length of the vector. In the commands below, length( ) counts the elements of a vector (or of a matrix or array).

x[3]

## [1] 5

length(x)

## [1] 5

x[length(x)]

## [1] 9
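Positions can also be given as a vector, which retrieves several elements at once. The sketch below uses an illustrative vector v and shows a few common indexing forms.

```r
v <- c(1, 3, 5, 7, 9)

first_third <- v[c(1, 3)]   # positions 1 and 3
first_third                 # 1 5

middle <- v[2:4]            # positions 2 through 4
middle                      # 3 5 7

no_first <- v[-1]           # a negative position drops that element
no_first                    # 3 5 7 9

large <- v[v > 4]           # a logical condition keeps matching elements
large                       # 5 7 9
```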

We can also perform arithmetic operations on vectors, just as on scalars, as long as (1) they are all numeric and (2) the vectors have the same length. (If the lengths differ, R recycles the shorter vector, which is rarely what we intend.) These operations are all element-wise.

a<-1:5
b<-(1:5)*2
a

## [1] 1 2 3 4 5

b

## [1]  2  4  6  8 10

a+b

## [1]  3  6  9 12 15

b-a

## [1] 1 2 3 4 5

a*b

## [1]  2  8 18 32 50

b/a

## [1] 2 2 2 2 2
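R does not stop when two numeric vectors differ in length; it reuses (recycles) the shorter one. The sketch below, with illustrative names, shows why equal lengths are the safe choice. Multiplying a vector by a single number is the same mechanism at work.

```r
long  <- 1:6
short <- c(10, 20)

# short is reused as 10 20 10 20 10 20
recycled <- long + short
recycled            # 11 22 13 24 15 26

# A scalar is a vector of length 1, recycled over every element.
doubled <- long * 2
doubled             # 2 4 6 8 10 12
```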

A matrix is a two-dimensional data structure. We can create one with matrix( ). Note that, by default, the elements fill the matrix column by column. Setting byrow=T fills it row by row instead.

v1<-matrix(1:9,3,3)
v1

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

v2<-matrix(1:9,3,3,byrow=T)
v2

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
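As a brief sketch with illustrative names, an element of a matrix is retrieved by giving its row and column, and dim( ) reports the number of rows and columns. The last line checks that filling by row gives the transpose of filling by column (t( ) transposes a matrix).

```r
by_col <- matrix(1:9, nrow = 3, ncol = 3)                # filled column by column
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)  # filled row by row

dim(by_col)        # 3 3

by_col[2, 3]       # row 2, column 3 -> 8
by_row[2, 3]       # row 2, column 3 -> 6

# Filling by row yields the transpose of filling by column.
same <- all(by_row == t(by_col))
same               # TRUE
```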

We can apply arithmetic operations to numeric matrices as well. Matrix multiplication is a special case: it is carried out by the operator %*%, whereas * multiplies element by element. We can use the function t( ) to transpose a matrix.

m<-matrix(c(1,0,0,1),2,2)
m

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

n<-matrix(1,2,4)
n

##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    1    1    1    1

m%*%n

##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    1    1    1    1

t(n)%*%m

##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    1    1
## [4,]    1    1
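The difference between * and %*% is easiest to see side by side. The following sketch uses two illustrative 2 x 2 matrices; it also shows that %*% requires the number of columns of the left matrix to equal the number of rows of the right matrix.

```r
p <- matrix(1:4, 2, 2)               # rows: (1, 3) and (2, 4)
q <- matrix(c(0, 1, 1, 0), 2, 2)     # rows: (0, 1) and (1, 0)

elementwise <- p * q                 # multiplies matching cells
elementwise                          # rows: (0, 3) and (2, 0)

product <- p %*% q                   # matrix product; here it swaps the columns of p
product                              # rows: (3, 1) and (4, 2)

# A 2 x 4 matrix cannot be multiplied by a 2 x 2 matrix (4 columns vs 2 rows).
err <- tryCatch(matrix(1, 2, 4) %*% matrix(1, 2, 2),
                error = function(e) conditionMessage(e))
err
```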

List

An array has more dimensions than a matrix. However, arrays are not frequently used in this class, so we will not introduce them. One characteristic of vectors and matrices is that all their elements must be of the same mode. What if we want to store contents of different types together? A list is the suitable type of variable. In the example below, we declare a list called h that contains a numeric scalar, a numeric vector, and a character vector. If we want to retrieve the first element of this list, h[1] does not work: single brackets return a list of length one that still wraps the element. We need double brackets to extract the content itself, that is, h[[1]], and we can then apply mathematical operations to h[[1]].

h<-list(1,c(2,3,4),c('john','marry','steve'))
h

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2 3 4
## 
## [[3]]
## [1] "john"  "marry" "steve"

h[1]

## [[1]]
## [1] 1

h[[1]]+10

## [1] 11

Similarly, to retrieve the second element of the character vector in h, we can use h[[3]][2]. This extra pair of [ ] protects the contents of a list from inattentive operations by the user.

h[[3]][2]

## [1] "marry"
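The difference between single and double brackets can be checked with class( ). The sketch below uses illustrative lists; the element names id and scores are made up for this example and show that list elements can also be named and then reached with $.

```r
rec <- list(1, c(2, 3, 4), c("x", "y", "z"))

class(rec[1])       # "list": still wrapped in a list of length one
class(rec[[1]])     # "numeric": the content itself

# Elements can be given names (hypothetical here) and accessed with $ or [[ ]].
named <- list(id = 1, scores = c(2, 3, 4))
second_score <- named$scores[2]
second_score        # 3
shifted <- named[["id"]] + 10
shifted             # 11
```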

Data frame

A data frame is a special case of a list; in practice it works as a spreadsheet for maintaining data. There are many ways to create a data frame. A common way is to use the function data.frame( ).

dta<-data.frame(name=c('John','May','Steve'),math=c(90,80,70),eng=c(60,70,80))
dta

##    name math eng
## 1  John   90  60
## 2   May   80  70
## 3 Steve   70  80

A data frame has a spreadsheet structure, just like a single sheet of an Excel file. We can use the function summary( ) to get a quick overview of each column of a data frame. As shown in the output below, summary statistics are computed for the numeric columns. For a character column, however, no such statistics can be calculated; summary( ) reports only its length, class, and mode.

summary(dta)

##      name                math         eng    
##  Length:3           Min.   :70   Min.   :60  
##  Class :character   1st Qu.:75   1st Qu.:65  
##  Mode  :character   Median :80   Median :70  
##                     Mean   :80   Mean   :70  
##                     3rd Qu.:85   3rd Qu.:75  
##                     Max.   :90   Max.   :80
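A few base R functions complement summary( ). The sketch below rebuilds the same small data set under the illustrative name scores, confirms that a data frame is a list, and uses table( ) to count how often each value occurs in a character column.

```r
scores <- data.frame(name = c("John", "May", "Steve"),
                     math = c(90, 80, 70),
                     eng  = c(60, 70, 80))

is.list(scores)     # TRUE: a data frame is a list of equal-length columns
dim(scores)         # 3 3 (rows, columns)
str(scores)         # compact overview of every column

# Counting is the one summary available for a character column.
name_counts <- table(scores$name)
name_counts

math_mean <- mean(scores$math)
math_mean           # 80
```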

Tibble

A tibble is a modern reimagining of the data.frame: it is still a data frame, but it works more smoothly with the functions used by data scientists in packages such as {dplyr} and {tidyr}, which belong to the {tidyverse}. In the example below, we first load the package {dplyr} with library( ). Then we convert dta to a tibble.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

dta2<-as_tibble(dta)
dta2

## # A tibble: 3 × 3
##   name   math   eng
##   <chr> <dbl> <dbl>
## 1 John     90    60
## 2 May      80    70
## 3 Steve    70    80

For both data.frame and tibble, we can always access the data in a column with the operator $. The data in a numeric column can, of course, be processed mathematically right away.

dta2$math

## [1] 90 80 70

dta2$name

## [1] "John"  "May"   "Steve"

dta2$math+10

## [1] 100  90  80
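Column access combines naturally with assignment and row selection. The following is a minimal base R sketch on a plain data.frame; the name scores and the new column total are illustrative and not part of the original example.

```r
scores <- data.frame(name = c("John", "May", "Steve"),
                     math = c(90, 80, 70),
                     eng  = c(60, 70, 80))

# [[ ]] with a column name is equivalent to $.
same_column <- identical(scores[["math"]], scores$math)
same_column                       # TRUE

# Assigning to a new name with $ adds a column.
scores$total <- scores$math + scores$eng
scores$total                      # 150 150 150

# A logical condition on one column selects rows.
high_math <- scores[scores$math >= 80, "name"]
high_math                         # "John" "May"
```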

Plots

Plotting is one of the highlights of R. The package {ggplot2} is the only one I suggest for making professional figures with R. For example, we can make a scatter plot of the Mathematics and English scores of these three students. We can also add a regression line to show the linear relation between the two scores.

library(ggplot2)
ggplot(dta,aes(eng,math))+geom_point(shape=1)+geom_smooth(method="lm")

## `geom_smooth()` using formula = 'y ~ x'

Since we have loaded the package {dplyr}, we can use the special operator called the pipe, %>%, to pass data from one function to the next. Suppose we want to make box plots of these two scores in one single figure. First we need to change dta from the wide format to the long format. We can use the function gather( ) in the package {tidyr}.

library(tidyr)
dta %>% gather(topic,score,math,eng)

##    name topic score
## 1  John  math    90
## 2   May  math    80
## 3 Steve  math    70
## 4  John   eng    60
## 5   May   eng    70
## 6 Steve   eng    80

dta %>% gather(topic,score,math,eng) %>% 
        ggplot(aes(topic,score))+geom_boxplot()

Import and Export Data

We can save a data.frame variable to a file for future use. For example, to save the data in the data.frame dta, we can call write.table( ). We set row.names to F so that the output file contains no row numbers. In R, F and T are shorthand for the logical values FALSE and TRUE respectively. Once the data are saved in a text file, we can import them again with the function read.table( ).

write.table(dta,file="mydata.txt",row.names=F)
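The round trip can be sketched as follows. To avoid touching the working directory, this example writes to a temporary file created by tempfile( ) rather than to mydata.txt; the names scores, path, and restored are illustrative.

```r
scores <- data.frame(name = c("John", "May", "Steve"),
                     math = c(90, 80, 70),
                     eng  = c(60, 70, 80))

# A temporary path stands in for "mydata.txt" in this sketch.
path <- tempfile(fileext = ".txt")
write.table(scores, file = path, row.names = FALSE)

# header = TRUE tells read.table() that the first line holds the column names.
restored <- read.table(path, header = TRUE)
restored

# Clean up the temporary file.
unlink(path)
```

Without header = TRUE, read.table( ) would treat the column names as the first row of data.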

There are many other ways to import data. One useful way is to import data directly from a web page. For example, a data file is shared on the server of my lab. We can use read.table( ) to import it.

url<-"http://cml.nccu.edu.tw/lxyang/Statistics/BodyWeights2.dat"
dta2<-read.table(url,header=T,sep="")
dta2

##    Age20 Age8 Age17
## 1     58   31    69
## 2     52   25    50
## 3     61   29    69
## 4     57   20    39
## 5     93   40    90
## 6     63   32    82
## 7     68   27    65
## 8     71   33    72
## 9     73   28    66
## 10    48   21    43

Visual inspection of this data set suggests that these are the body weights of 10 participants at ages 8, 17, and 20. A quick research question might be whether body weights at a young age and at an older age are positively correlated. Thus, we focus on the two columns Age8 and Age20 and test the correlation coefficient between them. The result supports our expectation: the correlation is positive and significant (r = 0.81, p = 0.004).

cor.test(dta2$Age20,dta2$Age8)

## 
##  Pearson's product-moment correlation
## 
## data:  dta2$Age20 and dta2$Age8
## t = 3.9692, df = 8, p-value = 0.004123
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3792010 0.9545561
## sample estimates:
##       cor 
## 0.8143881
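The test statistic reported by cor.test( ) can be reproduced by hand from the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. In this sketch the two columns are typed in from the printed table above, so it runs without access to the web server; the vector names are illustrative.

```r
# Values copied from the printed data set (columns Age8 and Age20).
age8  <- c(31, 25, 29, 20, 40, 32, 27, 33, 28, 21)
age20 <- c(58, 52, 61, 57, 93, 63, 68, 71, 73, 48)

n <- length(age8)
r <- cor(age20, age8)                        # Pearson correlation

# t statistic and two-sided p-value with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(r = r, t = t_stat, p = p_val), 4)    # r 0.8144, t 3.9692, p 0.0041
```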

Of course, we can make a scatter plot for the body weights at these two ages.

dta2 %>% ggplot(aes(Age8,Age20))+geom_point(shape=16,color="tomato")+
                                 geom_smooth(method="lm",color="blue")

## `geom_smooth()` using formula = 'y ~ x'

We can also run a regression analysis with the body weight at age 20 as the dependent variable and the body weight at age 8 as the independent variable. The slope is significantly larger than 0, suggesting a positive relation between the body weights at these two ages. $R^2$ is about 0.66, suggesting that about 66% of the variance in the body weight at age 20 can be explained by the body weight at age 8.

m1<-lm(Age20~Age8,dta2)
summary(m1)

## 
## Call:
## lm(formula = Age20 ~ Age8, data = dta2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.641  -5.555  -2.072   7.455   9.660 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  13.8588    12.9756   1.068  0.31666   
## Age8          1.7672     0.4452   3.969  0.00412 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.894 on 8 degrees of freedom
## Multiple R-squared:  0.6632, Adjusted R-squared:  0.6211 
## F-statistic: 15.75 on 1 and 8 DF,  p-value: 0.004123
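Individual results can be pulled out of a fitted model instead of being read off the printed summary. The sketch below refits the same regression on values typed in from the printed table (names are illustrative), checks that $R^2$ equals the squared correlation in a simple regression, and predicts the weight at age 20 for a hypothetical weight of 30 at age 8.

```r
# Values copied from the printed data set (columns Age8 and Age20).
age8  <- c(31, 25, 29, 20, 40, 32, 27, 33, 28, 21)
age20 <- c(58, 52, 61, 57, 93, 63, 68, 71, 73, 48)

fit <- lm(age20 ~ age8)

coef(fit)                              # intercept about 13.86, slope about 1.77

r2 <- summary(fit)$r.squared
r2                                     # about 0.663

# With one predictor, R-squared is the squared Pearson correlation.
r2_from_cor <- cor(age20, age8)^2

# Predicted weight at age 20 for a hypothetical weight of 30 at age 8.
pred <- predict(fit, newdata = data.frame(age8 = 30))
pred                                   # about 66.9
```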

Application of Data from Web Pages in Psychology

Lee-Xieng Yang

Class 2