Basic Operations in R

The main purpose of this postgraduate-level course, Experimental Design, is to teach students how to appropriately conduct statistical tests on the data collected in psychological studies. Thus, this course focuses on the methods of data analysis more than on data management, such as how to design an experiment, how to counterbalance the effects of variables, or how to eliminate errors (or make them equal across conditions).

To this end, the relevant statistical concepts and tests will be introduced. Students are expected to learn the concepts of probability distributions, the z test, the t test, correlation between two variables, simple (and multiple) linear regression analysis, and analysis of variance (including a priori and post hoc comparisons).

Most importantly, students are required to learn how to use R to conduct all statistical tests introduced in this course. There are many benefits of using R for data analysis: it provides high-precision computation, publication-quality graphics, and high flexibility to link to other applications (e.g., markdown), and it is an open source project that is free and updated very frequently. Students will be asked to do class assignments with R. Thus, before the formal classes begin, students are encouraged to learn R by themselves, and this R tutorial is designed for that purpose. For students who are beginners in R, this tutorial is a good starting point for learning the basic variables, functions, and how to compose scripts in R. For those who already know R quite well, this tutorial provides an opportunity to learn how to use R to learn statistics. Because the editing environment that comes with R is not very handy, the RStudio IDE (Integrated Development Environment), which is also free, is strongly suggested instead.

RStudio

Before we start the R tutorial, we need to get to know RStudio, which is a useful IDE for R. The figure below shows what RStudio looks like. There are four panes in RStudio. The lower-left pane is called the console, where you can key in a command and run it by pressing Enter. R will immediately return the result of your command. The upper-left pane is where you compose your script. A script consists of lines of commands that work together to complete a task. As shown in the figure below, there are two lines of commands. Unlike in the console, the code you type here is not executed when you press Enter. This is because we do not want feedback from R, which can interrupt our programming, before the script is finished. If you want to test a part of your script while you are programming, you can highlight the lines of code you want and press Ctrl+Enter to see the output R returns for them.

The upper-right pane is the global environment, where you can find the variables you have declared so far. As shown in the figure, two variables, x and y, have been declared. The variable x is the scalar 10, whereas y is a vector from 1 to 10. These two variables can be declared in the console pane, or declared in the script pane and executed with Ctrl+Enter. The last pane is the lower-right one, which shows the directory tree on your computer, the plots you make, the packages you have installed, and the help documents for functions. For example, you can type ?q in the console pane to ask for information about the function q(), which is then shown in the lower-right pane. Note that the way to call a function in R is to key in the function name followed by parentheses. Thus, q is the name of the function used to quit R, and if you type q() in the console pane, R will be terminated.

Variables

Like other well-developed programming languages, R provides many kinds of variables for temporarily storing data. For example, we can declare a variable named x and assign the scalar 100 to it. Here, the operator <- assigns the value/variable on the right to the variable on the left. Some people use = instead of <-, which is also valid in R. Once you declare a variable, you can check it by entering its name in the console pane, and R will return its content.

x<-100
x

## [1] 100

There are basically two major modes of variable in R. One is numeric, which covers integers and floating-point numbers alike. (Complex numbers with an imaginary part, such as 2+3i, have their own mode, complex.)

a<-3.4
b<- -0.89
c<-23
a

## [1] 3.4

b

## [1] -0.89

c

## [1] 23

The other major mode of variable in R is character. For example, the word "hello" can be stored as a variable h. In R, anything enclosed in double quotes "" (or single quotes) is of the mode character. Thus, after a<-"3.4", a is no longer numeric but character.

h<-"hello"
h

## [1] "hello"

a<-"3.4"
a

## [1] "3.4"
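The difference between a number and a quoted number can be checked with a short sketch like the one below. The variable names num_value, chr_value, and converted are made up for illustration; mode() and as.numeric() are base R functions.

```r
# Two values that look alike but have different modes
num_value <- 3.4      # numeric
chr_value <- "3.4"    # character, because of the quotes

mode(num_value)       # "numeric"
mode(chr_value)       # "character"

# as.numeric() converts a string that looks like a number back to numeric,
# after which arithmetic works again
converted <- as.numeric(chr_value)
converted + 1         # 4.4
```

Calling chr_value + 1 directly would stop with an error, because arithmetic is not defined for character values.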

In addition to a single word, multiple words enclosed in "" are regarded as one string and can be stored in a variable.

d<-"Hello World!"
d

## [1] "Hello World!"

Of course, it is valid to do arithmetic operations with numeric variables. Note that variable 1 <- variable 2 replaces the value of variable 1 with that of variable 2. Similarly, variable 1 <- variable 2 + variable 3 assigns the sum of variable 2 and variable 3 to variable 1, creating variable 1 if it does not exist yet or overwriting its old value otherwise.

b*2

## [1] -1.78

c/10

## [1] 2.3

b+c

## [1] 22.11

a<-b+c
a

## [1] 22.11

Vector

Variables in R can have different classes according to their structures. In the previous section, the content of each variable was a single value, whatever its mode. In fact, we can combine many values into a vector with c(). The code below declares a variable A. We can check its mode with mode(), its structure with str(), and its length with length(). The message returned by str() shows the mode of the variable as num, its structure (a one-dimensional vector whose 6 components are indexed from 1 to 6, denoted by [1:6]), its length (the upper bound in 1:6), and the first few of its contents.

A<-c(1,3,-7,21,35,45.23)
A

## [1]  1.00  3.00 -7.00 21.00 35.00 45.23

mode(A)

## [1] "numeric"

str(A)

##  num [1:6] 1 3 -7 21 35 ...

length(A)

## [1] 6

A character vector contains character components. For example, the code below declares a character vector B, in which 4 components/words are stored. The sentence “This is a book” can also be stored in a variable C. B is a vector of four strings, whereas C holds a single string (in R, a character vector of length 1); the difference is easy to see by checking their lengths with length().

B<-c("This","is","a","book")
B

## [1] "This" "is"   "a"    "book"

C<-"This is a book"
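The length check mentioned above can be sketched as follows. The names words and sentence are illustrative stand-ins for B and C, and nchar() is a base R function that counts the characters in a string.

```r
words <- c("This", "is", "a", "book")   # four separate strings
sentence <- "This is a book"            # one single string

length(words)      # 4
length(sentence)   # 1

# nchar() counts characters (spaces included), not components
nchar(sentence)    # 14
```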

Again, arithmetic operations can be applied to numeric vectors, but not to character vectors. For example, a vector of the values from 1 to 20 can be created with the colon operator :, which starts at 1 and keeps adding 1 until it reaches 20.

X<-1:20
X

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Also, you can check for yourself what R returns when you call X+10, X-20, X*2, and X/2. Note that the same arithmetic operation is applied to every component of a vector. One vector can be added to, multiplied by, subtracted from, or divided by another vector in an element-wise manner, most straightforwardly when both have the same length. In the code below, X and Y have the same length, so X+Y pairs up their components one by one. However, Z has a different length from X and Y. You can check whether Z can be added to X or Y and what results you get if you do so.

Y<-X*100
X+Y

##  [1]  101  202  303  404  505  606  707  808  909 1010 1111 1212 1313 1414
## [15] 1515 1616 1717 1818 1919 2020

Z<-c(4,5,7)
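As a small sketch of element-wise arithmetic, the block below uses shorter vectors than X and Y so that the results are easy to verify by hand. The names p, q_vec, and short are made up for this illustration.

```r
p <- 1:6
q_vec <- c(10, 20, 30, 40, 50, 60)

p + 10       # 11 12 13 14 15 16  (10 is added to every component)
p * 2        # 2 4 6 8 10 12
p + q_vec    # 11 22 33 44 55 66  (components are paired by position)

# Recycling: a shorter vector is repeated to match the longer one
short <- c(100, 200)
p + short    # 101 202 103 204 105 206
```

Here the longer length (6) is a multiple of the shorter length (2). What happens when it is not a multiple is exactly what the exercise with Z above asks you to find out.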

A specific component of a vector can be located by its position. For example, X[5] retrieves the value at the fifth position in the vector X. The square brackets are the operator used to retrieve values by their addresses (i.e., positions). The code below shows several ways to retrieve the value(s) in a vector.

Y[1]

## [1] 100

Y[c(2,3,5)]

## [1] 200 300 500

Y[4:8]

## [1] 400 500 600 700 800
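Beyond the positive positions used above, square brackets in R also accept negative positions and logical conditions. The sketch below goes slightly beyond what the examples above show; scores is an illustrative name holding the same values as Y.

```r
scores <- (1:20) * 100     # 100, 200, ..., 2000

# A negative position drops that component instead of selecting it
length(scores[-1])         # 19: everything except the first value

# A logical condition keeps only the components for which it is TRUE
scores[scores > 1800]      # 1900 2000

# length() gives the last position, so this is the final value
scores[length(scores)]     # 2000
```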

Matrix

A matrix is a two-dimensional structure for storing values. There are a number of ways to declare a matrix variable. For example, we can combine vectors by rows into a matrix with rbind() or by columns with cbind(). In the code below, M and N are a 2x3 and a 3x2 matrix, respectively. The dimensions of a matrix can be retrieved by dim(), with the first returned value denoting the number of rows and the second denoting the number of columns.

M<-rbind(1:3,-1:-3)
N<-cbind(1:3,-1:-3)
M

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]   -1   -2   -3

N

##      [,1] [,2]
## [1,]    1   -1
## [2,]    2   -2
## [3,]    3   -3

dim(M)

## [1] 2 3

dim(N)

## [1] 3 2

As with vectors, the value(s) in a matrix can be retrieved by calling its/their position(s). For example, M[2,3] gets the value at row 2 and column 3 of M, and N[3,2] gets the value at row 3 and column 2 of N. Both return -3. This is not a coincidence. M and N are composed of the same vectors, which have 6 components in total, and in both matrices the final one is -3. Thus, M[2,3] and N[3,2] each point to the last of the 6 values, namely -3. In fact, R stores a matrix as a one-dimensional vector, and you can retrieve any value by its serial position. Thus, calling M[6] is valid, and it is in fact the same as calling M[2,3].

M[2,3]

## [1] -3

N[3,2]

## [1] -3

M[6]

## [1] -3

N[6]

## [1] -3
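The link between a (row, column) position and its serial position can be written out explicitly. Because R counts positions column by column, the serial position of row r and column k in a matrix with nrow rows is (k - 1) * nrow + r. The sketch below checks this for the example above; M_demo, r, k, and serial are illustrative names.

```r
M_demo <- rbind(1:3, -1:-3)   # same content as M: 2 rows, 3 columns

r <- 2                        # row
k <- 3                        # column

# Columns before column k contribute (k - 1) * nrow positions
serial <- (k - 1) * nrow(M_demo) + r
serial                        # 6

M_demo[serial] == M_demo[r, k]   # TRUE: both are -3
```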

When a value is retrieved by its serial position in a matrix, the positions are by default ordered column by column. For example, matrix() arranges a sequence from 1 to 9 as a 3x3 matrix by filling the columns first. Of course, you can change the order and fill the values by row instead with byrow. (Note that the code below reuses the name Z and thus overwrites the vector Z declared earlier.) Now, only the values on the diagonal of K and Z (1, 5, and 9) are the same and the others are different.

K<-matrix(1:9,nrow=3,ncol=3)
K

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Z<-matrix(1:9,3,3,byrow=TRUE)
Z

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Note that the nth value in a matrix means the value at the nth position when the positions are counted down the columns first. Thus, K[4] is the value at row 1 and column 2, and so is Z[4]. Since K and Z are formed by arranging the values in different manners, K[4] and Z[4] are not the same value although they point to the same position in a matrix.

K[4]

## [1] 4

Z[4]

## [1] 2
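The relationship between the two fill orders can be made concrete with a short sketch: filling by row gives the transpose of filling by column. K_demo and Z_demo are illustrative copies of K and Z; t() and which() are base R functions.

```r
K_demo <- matrix(1:9, nrow = 3, ncol = 3)                 # filled by column
Z_demo <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)   # filled by row

# The row-filled matrix is the transpose of the column-filled one
identical(Z_demo, t(K_demo))   # TRUE

# Serial positions at which the two matrices hold the same value
which(K_demo == Z_demo)        # 1 5 9, i.e. the diagonal
```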

Just like a vector, a matrix can be of the mode character. The code below puts the words in the sentence “This is a book” into a 2x2 matrix.

E<-matrix(c("This","is","a","book"),2,2)
E

##      [,1]   [,2]  
## [1,] "This" "a"   
## [2,] "is"   "book"

Of course, a component of this matrix can be retrieved by calling its position directly.

E[1,2]

## [1] "a"

E[4]

## [1] "book"

Note that all components of a vector, matrix, or array (the generalization of a matrix to more than two dimensions) must have the same mode. Thus, when components of different modes are put together in a vector or matrix, R coerces them to the most flexible mode among them, following the order logical, integer, double (numeric), complex, character. For example, A is a sequence created by seq() from 1 to 9 with 2 as the interval. When the character "Ha" is appended to it, the result is a character vector, no longer a numeric vector. This is because any number can be written as a character string, whereas "Ha" cannot be represented as a number. Therefore, it is strongly recommended to keep values of the same mode together in a variable at all times. What if you need to put data of different modes together in one variable? You need a list or, for tabular data, a data frame to contain them.

A<-seq(1,9,2)
A

## [1] 1 3 5 7 9

c(A,"Ha")

## [1] "1"  "3"  "5"  "7"  "9"  "Ha"
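The coercion that happens when modes are mixed can be observed directly. In the sketch below, c() always ends up with the most flexible mode among its arguments, whereas list() keeps each component as it is. The name mixed is made up for illustration.

```r
# Mixing modes in c() coerces everything to the most flexible mode
mode(c(TRUE, 2))       # "numeric": TRUE becomes 1
mode(c(2, "Ha"))       # "character": 2 becomes "2"
typeof(c(1L, 2.5))     # "double": the integer 1L becomes 1.0

# A list keeps each component in its own mode
mixed <- list(1, "Ha", TRUE)
sapply(mixed, mode)    # "numeric" "character" "logical"
```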

Data Frame

In a data frame, which is a special kind of list, values of different modes can be contained. Suppose we have a vector of numbers and a vector of words. We can put these two variables together in one such object. The code below shows a vector of the scores of five students in a math examination and a vector of their names.

Math<-c(57,68,24,96,78)
Names<-c("John","Mary","April","Cindy","Tom")
Math

## [1] 57 68 24 96 78

Names

## [1] "John"  "Mary"  "April" "Cindy" "Tom"

We can put these two variables together as a data frame without changing the numeric mode of the scores. Data is like a spreadsheet in which each column represents a variable and each row represents a subject. Each column has its own independent mode. In this case, Math is numeric whereas Names holds text (kept as character in R 4.0.0 and later, but converted to a factor by default in earlier versions).

Data<-data.frame(Math,Names)
Data

##   Math Names
## 1   57  John
## 2   68  Mary
## 3   24 April
## 4   96 Cindy
## 5   78   Tom
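The size of a data frame and the class of each column can be inspected with a few base R functions. The sketch below rebuilds the same data under the illustrative name scores_df and sets stringsAsFactors = FALSE explicitly, which is an addition not present in the code above, so that the names stay character regardless of the R version.

```r
scores_df <- data.frame(
  Math  = c(57, 68, 24, 96, 78),
  Names = c("John", "Mary", "April", "Cindy", "Tom"),
  stringsAsFactors = FALSE   # keep Names as character in every R version
)

nrow(scores_df)            # 5 rows, one per student
ncol(scores_df)            # 2 columns, one per variable

# Each column keeps its own class
sapply(scores_df, class)   # Math: "numeric", Names: "character"
```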

In this data frame, the column names can be retrieved by names(). As shown in the result of calling names() below, Math and Names are the names of the columns in Data. The row names of Data are displayed as the first column when you print Data, and they can be retrieved separately by rownames(). For a data frame, the function colnames() has the same effect as names().

names(Data)

## [1] "Math"  "Names"

rownames(Data)

## [1] "1" "2" "3" "4" "5"

colnames(Data)

## [1] "Math"  "Names"

However, names() can also be used to retrieve or set the names of the components of a vector. For example, we can declare a sequence from 1 to 10 and attach the letters from a to j to this vector as the names.

x<-1:10
names(x)<-letters[1:10]
x

##  a  b  c  d  e  f  g  h  i  j 
##  1  2  3  4  5  6  7  8  9 10
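Once a vector has names, its components can be retrieved by name as well as by position. A minimal sketch, using the illustrative name v for a vector built the same way as x above:

```r
v <- 1:10
names(v) <- letters[1:10]

v["c"]             # the component named c, whose value is 3
v[c("a", "j")]     # the components named a and j: 1 and 10

# unname() drops the names and leaves only the values
unname(v["c"])     # 3
```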

In a data frame, each column is a variable, which can be accessed by calling the column name. For example, if we want to retrieve the math scores in Data, we can call Data$Math. Note that a data frame is a list object, which is a multilevel structure with the name of the data frame as the highest-level node and the variables as the second-level nodes underneath it. Anything on the left of $ is the parent of anything on the right of $. Thus, if we want to retrieve the data of a variable in a data frame, we need to call the data frame first and then access the variable with $. The Levels line in the output of Data$Names below indicates that the column is stored as a factor, which was the default for text columns in data.frame() before R 4.0.0; in later versions the names are printed as quoted character strings instead.

Data$Math

## [1] 57 68 24 96 78

Data$Names

## [1] John  Mary  April Cindy Tom  
## Levels: April Cindy John Mary Tom
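Besides $, square brackets can also be used to reach into a data frame. The sketch below rebuilds the data under the illustrative name scores_df (with stringsAsFactors = FALSE added so that the names are plain character strings) and shows a few equivalent or related ways of access.

```r
scores_df <- data.frame(
  Math  = c(57, 68, 24, 96, 78),
  Names = c("John", "Mary", "April", "Cindy", "Tom"),
  stringsAsFactors = FALSE
)

# [[ ]] with the column name returns the same vector as $
identical(scores_df[["Math"]], scores_df$Math)   # TRUE

# [row, column] works as it does for a matrix
scores_df[2, "Names"]                            # "Mary"

# A logical condition on one column can select values from another
scores_df$Names[scores_df$Math >= 60]            # "Mary" "Cindy" "Tom"
```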

Of course, we can add variables into a data frame. For example, we can add the scores of these five students in an English examination. Now there are three variable names in Data corresponding to three columns.

Data$Eng<-c(79,88,94,35,56)
names(Data)

## [1] "Math"  "Names" "Eng"

Data

##   Math Names Eng
## 1   57  John  79
## 2   68  Mary  88
## 3   24 April  94
## 4   96 Cindy  35
## 5   78   Tom  56

The math scores in Data are numeric, so arithmetic operations can be applied to them. For example, we can compute the mean and standard deviation of the math scores, as shown in the following code. You can also get the maximum and minimum of these scores. You can check for yourself what you get if you call median(Data$Math).

mean(Data$Math)

## [1] 64.6

sd(Data$Math)

## [1] 26.84772

max(Data$Math)

## [1] 96

min(Data$Math)

## [1] 24

A useful function, summary(), returns some statistics of the distribution of the data. These statistics include the minimum, maximum, mean, median, and the first and third quartiles. However, the statistical mode is not included, and we cannot call mode() to get it, because mode() returns the storage mode (type) of the data instead.

summary(Data$Math)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    24.0    57.0    68.0    64.6    78.0    96.0

mode(Data$Math)

## [1] "numeric"
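The individual statistics reported by summary() can also be obtained one at a time. The sketch below uses the base R functions quantile(), range(), and var() on the same five scores, stored under the illustrative name math.

```r
math <- c(57, 68, 24, 96, 78)

# Minimum, quartiles, and maximum in one call
quantile(math)     # 0%: 24, 25%: 57, 50%: 68, 75%: 78, 100%: 96

# Minimum and maximum together
range(math)        # 24 96

# The variance is the square of the standard deviation
var(math)          # 720.8
```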

Exercise:

Suppose we have a sequence of three 2s, two 1s, and one 3. Of course, the statistical mode of this sequence must be 2. However, there is no function in R which can directly get you this answer. Please figure out how to get it in R. Hint: table().