Introduction to R

Value Access in Vector and Matrix

To access the value at a particular position in a vector, you index the vector by that position. Suppose you store the 26 lowercase English letters in a variable called abc. You will see all the letters stored in abc if you simply call abc. However, if you want the 5th value in this vector, you need to call abc[5], where the number in the square brackets is the position of the value in the vector. Here letters is a predefined constant in R. Similarly, you can try LETTERS to see what you will get.

abc<-letters
abc

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

abc[5]

## [1] "e"
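As a quick check of the suggestion to try LETTERS, the following sketch indexes that built-in constant in the same way as abc. The variable name up is only an illustrative choice and does not appear elsewhere in this tutorial.

```r
# LETTERS is the built-in upper-case counterpart of letters
up <- LETTERS      # 'up' is a hypothetical name used only in this sketch
length(up)         # 26 letters in total
up[5]              # the 5th letter, "E"
```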

If you want the values at more than one position, you need to pass a vector of positions. For example, you can access the values at positions 1, 2, 4, and 10 by calling abc[c(1,2,4,10)]. Sometimes you want the values at a range of positions, say positions 8 to 17. You can access these values by calling abc[8:17].

abc[c(1,2,4,10)]

## [1] "a" "b" "d" "j"

abc[8:17]

##  [1] "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
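The positions inside the square brackets are themselves an ordinary vector, so they can be stored in a variable first. The sketch below illustrates this; the name pos is an assumption made for the example.

```r
abc <- letters
pos <- c(1, 2, 4, 10)     # 'pos' is a hypothetical name for the positions
abc[pos]                  # same result as abc[c(1,2,4,10)]
length(abc[8:17])         # positions 8 to 17 give 10 values
# 8:17 is shorthand for the sequence 8, 9, ..., 17
identical(abc[8:17], abc[seq(8, 17)])
```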

As a matrix is a two-dimensional data structure, you need to specify both the row and the column position in order to get the value in a particular cell. Suppose you declare a $3\times3$ matrix called m, filled with the numbers from 1 to 9. By default, matrix( ) fills the cells column by column. Then, m[2,3] is the value in the cell at the second row and the third column. Similarly, you can access the value in the cell at the third row and the first column by m[3,1]. If you want to retrieve a whole row from a matrix, you only need to specify the row number and leave the column number empty. For example, m[2,] will return the values in the second row. In the same way, you can retrieve the values in a specific column, say m[,2].

m<-matrix(1:9,3,3)
m

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

m[2,3]

## [1] 8

m[3,1]

## [1] 3

m[2,]

## [1] 2 5 8

m[,2]

## [1] 4 5 6
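Row and column positions can also be vectors, which returns a sub-matrix instead of a single cell. The sketch below shows this, together with the byrow argument of matrix( ), which fills the cells row by row. The name mr is an assumption made for the example.

```r
m <- matrix(1:9, 3, 3)                  # filled column by column (the default)
m[1:2, c(1, 3)]                         # rows 1 to 2, columns 1 and 3
mr <- matrix(1:9, 3, 3, byrow = TRUE)   # 'mr' is hypothetical; filled row by row
mr[2, 3]                                # 6 here, whereas m[2,3] is 8
```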

Array

A matrix is a two-dimensional data structure. An array generalizes it to any number of dimensions. Arrays are used less often, but you may still need one. The code below declares a 3-dimensional array with two positions along each dimension. Thus, in total there are $2\times2\times2=8$ data points. The values in an array are retrieved in the same way as in a matrix. For example, for an array A, if you want all values in the first and second dimensions at the first position of the third dimension, you can call A[,,1].

A<-array(1:8,c(2,2,2))
A

## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8

A[,,1]

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
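The same bracket notation reaches a single cell or a slice along another dimension. The following is a minimal sketch using the same array A.

```r
A <- array(1:8, c(2, 2, 2))
dim(A)        # the size of each dimension: 2 2 2
A[1, 2, 2]    # row 1, column 2 of the second slice: 7
A[, 2, ]      # the second column of every slice, returned as a 2 x 2 matrix
```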

Data Frame

Note that the values in a vector, matrix, or array must all be of the same mode. Take a vector as an example. A numeric vector contains only numeric data. Similarly, a character vector contains only character data. If you mix numbers and characters in a vector, all values will be converted to character mode. Matrices and arrays behave in the same way.

v<-c(1,2,4,"a","b","c")
v

## [1] "1" "2" "4" "a" "b" "c"
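You can confirm the conversion with mode( ). The sketch below also shows as.numeric( ), which turns the character values that look like numbers back into numbers.

```r
v <- c(1, 2, 4, "a", "b", "c")
mode(c(1, 2, 4))       # "numeric": all values are numbers
mode(v)                # "character": the numbers were converted
as.numeric(v[1:3])     # back to the numbers 1 2 4
```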

If you want to put numeric data and character data together, a better way is to declare a data frame for them. There are a number of ways to declare one. You can directly use the function data.frame( ). See the code below, which creates a data frame storing five students' data, including their names, English scores, and Mathematics scores. A data frame is basically a spreadsheet, with each column representing a variable and each row representing a case. As shown in the code below, I create a data frame with three columns, named Name, Eng, and Math respectively. If you want to get the values of a column, you need to call it by its name. However, you cannot call it by its name alone. Instead, you need to go through the data frame in which the variable is stored. For example, if you want the names of these five students, you need to call dd$Name. This is because a data frame is a hierarchical data structure. A variable under a data frame cannot be accessed directly. It has to be accessed through its parent. The operator $ indicates that the variable on the right belongs to the variable on the left. Thus, dd$Name means that Name is under dd. A data frame is the parent of the variables underneath it, and those variables are the children of the data frame.

dd<-data.frame(Name=c("Appril","Ben","Carter","David","Eric"),
               Eng=c(84,73,62,90,77),Math=c(43,67,85,89,75))
dd

##     Name Eng Math
## 1 Appril  84   43
## 2    Ben  73   67
## 3 Carter  62   85
## 4  David  90   89
## 5   Eric  77   75

dd$Name

## [1] "Appril" "Ben"    "Carter" "David"  "Eric"

Name<-c("Robert","Mary")
Name

## [1] "Robert" "Mary"

A data frame is like a module. The variables in a module cannot be directly accessed by their names. If you have another variable which is also called Name, then when you call Name, R will return the contents of Name in the current working environment instead of the contents of Name in the data frame dd. In other words, it is totally okay to give the same name to a variable in a data frame and a variable in the working environment. When you want the contents of the variable in the data frame, you just need to call it through the data frame. Also, the variables in a data frame do not need to be of the same mode. As shown by the code below, the mode of the variable Name in the data frame is character, whereas the mode of the other two is numeric. The data frame keeps their original modes from being changed.

dd$Name

## [1] "Appril" "Ben"    "Carter" "David"  "Eric"

dd$Eng

## [1] 84 73 62 90 77

dd$Math

## [1] 43 67 85 89 75
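The following sketch checks the mode of every column at once with sapply( ) and shows two other standard ways of reaching into a data frame. The argument stringsAsFactors = FALSE is added here as an assumption so that the result is the same on R versions before 4.0, where character columns were turned into factors by default.

```r
dd <- data.frame(Name = c("Appril", "Ben", "Carter", "David", "Eric"),
                 Eng = c(84, 73, 62, 90, 77), Math = c(43, 67, 85, 89, 75),
                 stringsAsFactors = FALSE)   # assumption: keep names as character
sapply(dd, mode)    # the mode of each column: character, numeric, numeric
dd[["Eng"]]         # the same column as dd$Eng
dd[2, "Eng"]        # a single cell: the second student's English score, 73
```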

Also, you can put some column vectors together as a matrix and transform this matrix into a data frame. For example, a new variable Gender is created for storing the genders of Robert and Mary, with M for male and F for female. You can combine the vectors Name and Gender into a matrix with the function cbind( ). Now you have a new matrix called Ss, which has two rows and two columns. You can turn this matrix into a data frame by using data.frame( ). See the difference between the returned outcomes. One is a matrix and the other is a data frame.

Gender<-c("M","F")
Ss<-cbind(Name,Gender)
mode(Ss)

## [1] "character"

dim(Ss)

## [1] 2 2

Ss

##      Name     Gender
## [1,] "Robert" "M"   
## [2,] "Mary"   "F"

Ss<-data.frame(Ss)
Ss

##     Name Gender
## 1 Robert      M
## 2   Mary      F
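The difference is not only in how the two objects are printed. As a rough sketch, the code below checks the type of each object and shows how a column is reached by name in each case. The names mat and Ss_df are assumptions made for the example.

```r
Name <- c("Robert", "Mary")
Gender <- c("M", "F")
mat <- cbind(Name, Gender)      # 'mat' is a hypothetical name for the matrix
is.matrix(mat)                  # TRUE
Ss_df <- data.frame(mat)        # 'Ss_df' is a hypothetical name for the data frame
is.data.frame(Ss_df)            # TRUE
mat[, "Name"]                   # a matrix column is reached with square brackets
Ss_df$Name                      # the $ operator works on the data frame only
```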

You can add a variable (or a column) to a data frame. For example, you can add the ages of these two students as a variable in Ss. Now you can check what has changed in Ss. You can use the function names( ) to check the captions of the columns (or variables) in a data frame. Of course, if you want to give new captions to the columns, you can use this function too. See the demonstration below.

Ss$Age<-c(23,18)
Ss

##     Name Gender Age
## 1 Robert      M  23
## 2   Mary      F  18

names(Ss)

## [1] "Name"   "Gender" "Age"

names(dd)

## [1] "Name" "Eng"  "Math"

names(Ss)<-c("name","sex","year")
Ss

##     name sex year
## 1 Robert   M   23
## 2   Mary   F   18
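If only one caption needs to change, names( ) can be combined with a position in square brackets. The following is a minimal sketch on a fresh copy of the data, stored under the hypothetical name S2 so that Ss itself is not touched.

```r
S2 <- data.frame(Name = c("Robert", "Mary"), Gender = c("M", "F"),
                 Age = c(23, 18))   # 'S2' is a hypothetical name
names(S2)[3] <- "year"              # rename only the third column
names(S2)                           # "Name" "Gender" "year"
```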

Some Descriptive Statistics

The function summary( ) computes some descriptive statistics. For example, you can apply it to the English scores of those five students. The returned outcome shows the minimum, first quartile, median, mean, third quartile, and maximum of the English scores. You can also apply summary( ) directly to a data frame. As you can see in the results below, summary( ) returns these statistics for the numeric variables. However, since no calculation can be done on a character variable, summary( ) reports only its length, class, and mode.

summary(dd$Eng)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    62.0    73.0    77.0    77.2    84.0    90.0

summary(dd)

##      Name                Eng            Math     
##  Length:5           Min.   :62.0   Min.   :43.0  
##  Class :character   1st Qu.:73.0   1st Qu.:67.0  
##  Mode  :character   Median :77.0   Median :75.0  
##                     Mean   :77.2   Mean   :71.8  
##                     3rd Qu.:84.0   3rd Qu.:85.0  
##                     Max.   :90.0   Max.   :89.0
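Each number reported by summary( ) can also be obtained from its own function. The sketch below recomputes them for the five English scores, which are copied into a vector called eng (a name assumed for this example).

```r
eng <- c(84, 73, 62, 90, 77)      # the five English scores from dd
mean(eng)                         # 77.2
median(eng)                       # 77
quantile(eng, c(0.25, 0.75))      # first and third quartiles: 73 and 84
range(eng)                        # minimum and maximum: 62 90
```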

You can also expand a data frame by appending the rows of another with the function rbind( ), as long as the two data frames have the same columns. In the code below, the combined data frame rr has the data of seven students. Normally, if you want to check how many cases (i.e., rows) and variables (i.e., columns) a data frame has, you can use dim( ), just as for a matrix. The first element in the returned vector shows the number of rows and the second the number of columns.

dd1<-data.frame(Name=Ss$name,Eng=c(85,90),Math=c(55,73))
dd1

##     Name Eng Math
## 1 Robert  85   55
## 2   Mary  90   73

rr<-rbind(dd,dd1)
rr

##     Name Eng Math
## 1 Appril  84   43
## 2    Ben  73   67
## 3 Carter  62   85
## 4  David  90   89
## 5   Eric  77   75
## 6 Robert  85   55
## 7   Mary  90   73

dim(rr)

## [1] 7 3
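To illustrate the requirement of identical columns, the sketch below binds two small data frames with matching column names and then tries a third whose second column is named differently. All data frames and names in it are made up for the example, and tryCatch( ) is used only so that the error does not stop the script.

```r
a <- data.frame(Name = "Ann", Eng = 80)          # hypothetical data
b <- data.frame(Name = "Bob", Eng = 70)          # same column names as a
ab <- rbind(a, b)
dim(ab)                                          # 2 rows, 2 columns
bad <- data.frame(Name = "Cat", English = 60)    # second column named differently
res <- tryCatch(rbind(a, bad),
                error = function(e) "column names do not match")
res                                              # the message from the handler
```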

Of course, you can check the number of rows with nrow( ) and the number of columns with ncol( ). Often you only need a glance at the content of a data frame without seeing all of it. You can use head( ) to extract the first six rows.

nrow(rr)

## [1] 7

ncol(rr)

## [1] 3

head(rr)

##     Name Eng Math
## 1 Appril  84   43
## 2    Ben  73   67
## 3 Carter  62   85
## 4  David  90   89
## 5   Eric  77   75
## 6 Robert  85   55
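The number of rows returned by head( ) can be changed with its second argument, and tail( ) does the same from the bottom. The following sketch uses a made-up data frame x with seven rows.

```r
x <- data.frame(id = 1:7, sq = (1:7)^2)   # hypothetical data with 7 rows
nrow(head(x))                             # 6 rows by default
head(x, 3)                                # only the first three rows
tail(x, 2)                                # the last two rows
```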

Data Transformation

If you want to compare the seven students' English scores with their Mathematics scores, you should not compare the raw scores directly, as the scales of the two scores might be quite different. A better way is to transform both the English scores and the Mathematics scores into Z scores. Z scores have a mean of 0 and a standard deviation of 1. Then, how do you make this transformation? One way is to write your own R code to compute $Z_i=\frac{X_i-\bar{X}}{SD(X)}$. See the first line in the code below. Alternatively, you can use the function scale( ). See the two lines below that call scale( ). You will find that the data frame rr gets two more columns, storing the Z scores of the English and Mathematics scores.

Z.Eng<-(rr$Eng-mean(rr$Eng))/sd(rr$Eng)
Z.Eng

## [1]  0.3785708 -0.7010571 -1.7806849  0.9674587 -0.3084651  0.4767188  0.9674587
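As a check that the formula and scale( ) agree, the sketch below computes the Z scores both ways for the seven English scores. The names eng, z_manual, and z_scale are assumptions made for the example. Note that scale( ) returns a one-column matrix, so as.numeric( ) is used to turn it into a plain vector before comparing.

```r
eng <- c(84, 73, 62, 90, 77, 85, 90)        # the seven English scores in rr
z_manual <- (eng - mean(eng)) / sd(eng)     # the formula written out by hand
z_scale <- scale(eng)                       # a one-column matrix
all.equal(z_manual, as.numeric(z_scale))    # TRUE: both give the same Z scores
round(mean(z_manual), 10)                   # the mean is 0 up to rounding error
sd(z_manual)                                # the standard deviation is 1
```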

rr$Z.Eng<-scale(rr$Eng)
rr$Z.Math<-scale(rr$Math)
rr