Introduction to R

Function

Normally, a variable refers to a set of data, while a function refers to a procedure that processes data in a specific way or controls input and output. For instance, c( ) combines its arguments into a vector and q( ) quits R. Users can declare as many variables as they want to refer to their data. Similarly, users can declare their own functions for their particular goals. A function is declared by calling function( ). For example, a function that returns the mode of a sequence of numbers can be declared as follows.

MODE<-function(v){
  v.t<-table(v)
  flag<-which.max(v.t)
  return(as.numeric(names(v.t)[flag]))
}
MODE(c(1,1,2,3,4,5,5,4,3,3,3))

## [1] 3
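To see how the body of MODE( ) arrives at this answer, the intermediate objects can be inspected one at a time. The following is only a step-by-step sketch of the same three calls, run outside the function on the same input.

```r
v <- c(1,1,2,3,4,5,5,4,3,3,3)
v.t <- table(v)                # frequency of each distinct value: 3 occurs 4 times
flag <- which.max(v.t)         # position of the largest count (the first one if tied)
names(v.t)[flag]               # the most frequent value, stored as the string "3"
as.numeric(names(v.t)[flag])   # converted back to the number 3
```

Because table( ) stores the distinct values as names, that is, as character strings, the final as.numeric( ) step is what turns the result back into a number.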

Although this function can get the mode of a numeric sequence, it is not suitable for character vectors. The example below shows that when the argument is a character vector, the function cannot return the mode of its elements, because as.numeric( ) cannot convert a name such as “a” to a number and returns NA instead.

MODE(c("a","B","b","a","a"))

## Warning in MODE(c("a", "B", "b", "a", "a")): NAs introduced by coercion

## [1] NA

How can we prevent this situation? We can add an if-else check on the mode of the vector inside the function. When the input vector is not numeric, the function prints a warning message and returns the string “NA” instead of attempting the computation. With this check, we can prevent the function from making logical errors.

MODE<-function(v){
  if(mode(v)!="numeric"){
    message("Warning: The sequence must be numeric!")
    return("NA")
  }else{
    v.t<-table(v)
    flag<-which.max(v.t)
    return(as.numeric(names(v.t)[flag]))
  }
}
MODE(c(1,1,2,3,4,5,5,4,3,3,3))

## [1] 3

MODE(c("a","B","b","a","a"))

## Warning: The sequence must be numeric!

## [1] "NA"

This example shows how we can define a function. The structure of a function includes the function name, the arguments, and the body, which contains the procedures the function will execute. In the above example, the function name is MODE, the argument should be a vector, and the body consists of several commands which together compute the mode of the input vector. When we declare a function, we can add a description of it as a reminder for ourselves. As shown in the example below, the first line of the body, which starts with #, is a comment describing the goal of this function.

MODE<-function(v){
  # Compute the statistical mode of a sequence of numbers
  if(mode(v)!="numeric"){
    message("Warning: The sequence must be numeric!")
    return("NA")
  }else{
    v.t<-table(v)
    flag<-which.max(v.t)
    return(as.numeric(names(v.t)[flag]))
  }
}
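Rejecting non-numeric input is one design choice. Another possibility, sketched below, is to let the function work for character vectors as well by converting the result back to a number only when the input was numeric. The name MODE2 is hypothetical and used only for this sketch; it is not part of the original code.

```r
# MODE2 is a hypothetical name used only for this sketch
MODE2 <- function(v) {
  # Return the most frequent value of v (the first one in sorted order if tied)
  v.t <- table(v)
  top <- names(v.t)[which.max(v.t)]
  if (is.numeric(v)) {
    return(as.numeric(top))   # convert back only when the input was numeric
  }
  return(top)                 # otherwise keep the value as a character string
}
MODE2(c(1,1,2,3,4,5,5,4,3,3,3))   # 3
MODE2(c("a","B","b","a","a"))     # "a"
```

Which behavior is preferable depends on how the function will be used: the stricter version makes unexpected input visible, while this one is more general.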

Using a user-defined function to handle data import and export is common. Suppose we want to create a function that reads data from a file, processes them, and exports the results to another file. First, we can create a data set by sampling random points in a 2-dimensional space from a multivariate normal distribution. The mean of this 2-dimensional normal distribution is c(0,0) and the variance-covariance matrix is \(\Sigma=\begin{bmatrix}1 & 0.75\\0.75 & 1\end{bmatrix}\).

library(mvtnorm)
set.seed(3456)
data<-rmvnorm(100,mean=rep(0,2),sigma=matrix(c(1,0.75,0.75,1),2,2))
data<-data.frame(data)
names(data)<-c("x","y")
write.table(data,file="cor.txt")
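A file written by write.table( ) can be read back with read.table( ). The sketch below illustrates the round trip. So that it runs on its own, it uses a small made-up data frame and a temporary file rather than the simulated data and “cor.txt”; the names d, d2 and f are assumptions of this sketch.

```r
# Small made-up data frame, used only to illustrate the round trip
d <- data.frame(x = c(0.1, -0.5, 1.2), y = c(0.3, -0.2, 0.9))
f <- tempfile(fileext = ".txt")      # temporary file instead of "cor.txt"
write.table(d, file = f)             # writes a header line plus row names
d2 <- read.table(f, header = TRUE)   # read the file back as a data frame
names(d2)                            # column names are recovered from the header
all.equal(d$x, d2$x)                 # the values survive the round trip
```

By default write.table( ) also stores the row names in the first column of the file, and read.table( ) recognizes them because the header line has one field fewer than the data lines.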

Second, we can declare a function to analyze these data. The first argument is the data to be analyzed, which by default is NULL. The second argument is the method to apply to the data. Its default value is the string “summary”, indicating that the function will summarize the data in each column. If we simply call the function without supplying a data set, the function prints the message “No data imported!” and then stops.

analy<-function(data=NULL,method="summary"){
  # Analyze data
  if(length(data)==0){
    message("No data imported!")
  }else{
    # Scatter plot
    library(ggplot2)
    Names<-names(data)
    fig<-ggplot(data,aes(eval(parse(text=Names[1])),eval(parse(text=Names[2]))))+
      geom_point()+
      geom_smooth(method="lm")+
      ylab(Names[2])+xlab(Names[1])
    # Data analysis
    results<-switch(method,
                    "summary"=summary(data),
                    "cor"=cor.test(data[,1],data[,2]))
    return(list(fig=fig,results=results))
  }
}
analy()

## No data imported!
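The choice between the two analyses is made by switch( ), which evaluates the branch whose name matches the string in method. The sketch below isolates that mechanism. The helper pick( ) is hypothetical, and the final stop( ) branch is an addition that analy( ) does not have: without such a fallback, switch( ) silently returns NULL when no name matches.

```r
# pick() is a hypothetical helper that mirrors the switch() call in analy()
pick <- function(x, method = "summary") {
  switch(method,
         "summary" = summary(x),
         "cor"     = "correlation branch",
         stop("Unknown method: ", method))  # unnamed last argument = fallback
}
pick(c(1, 2, 3))          # default argument: the summary branch is evaluated
pick(c(1, 2, 3), "cor")   # returns the string "correlation branch"
```

Calling pick( ) with any other method, for example pick(c(1, 2, 3), "anova"), stops with an error instead of returning nothing.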

Now we can pass in the data that we just created and see what happens. The function returns a list with two components. The first, fig, is the scatter plot. The second, results, is the result of the analysis, which by default is the summary of the data.

analy(data)

## $fig

## `geom_smooth()` using formula 'y ~ x'

## 
## $results
##        x                  y            
##  Min.   :-2.55920   Min.   :-2.404216  
##  1st Qu.:-0.66390   1st Qu.:-0.591246  
##  Median :-0.01138   Median :-0.008994  
##  Mean   :-0.05673   Mean   : 0.022552  
##  3rd Qu.: 0.49815   3rd Qu.: 0.657836  
##  Max.   : 3.02051   Max.   : 3.243133

If we want to run the correlation analysis, we can call analy(data,"cor").

analy(data,"cor")

## $fig

## `geom_smooth()` using formula 'y ~ x'

## 
## $results
## 
##  Pearson's product-moment correlation
## 
## data:  data[, 1] and data[, 2]
## t = 9.5535, df = 98, p-value = 1.127e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5766578 0.7839159
## sample estimates:
##       cor 
## 0.6944215
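Because the function returns a list, individual pieces of the result can be extracted with $ rather than read off the printed output. The sketch below shows this for a correlation test. To stay self-contained it builds a list of the same shape from made-up data with base R only; the seed, the variable names and the data are assumptions of this sketch.

```r
set.seed(1)                              # arbitrary seed for this sketch
x <- rnorm(50)
y <- 0.75 * x + rnorm(50, sd = 0.5)      # made-up correlated data
out <- list(results = cor.test(x, y))    # same shape as the $results component
out$results$estimate                     # the sample correlation
out$results$p.value                      # the p-value of the test
out$results$conf.int                     # the 95 percent confidence interval
```

With the function above, the same components would be reached through the object returned by analy(data,"cor").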

We can also declare functions to save and read multiple files. The code below shows two such functions. The first one generates data for hypothetical students, one student per file. In this function, the argument seq is a vector holding the students’ numbers. The default main file name is “subject_”. That is, if no file name is specified, the names of all generated files start with “subject_”, followed by the subject numbers stored in the vector seq. For the sake of demonstration, I simply define this function to create students’ scores on math and English. All students’ math scores are sampled from a normal distribution with mean = 75 and SD = 8. Likewise, the students’ English scores are sampled from a normal distribution with mean = 80 and SD = 8. In the for loop, I first assemble the file name for each student with paste0( ), which glues smaller strings together into a single one. Thereafter, I export the math and English scores of each student to the file with the name just created. Let’s generate three students’ scores and export them to three text files.

OutputData<-function(mainfilename="subject_",seq=1){
  # Save data to multiple files
  
  # First create the data
  num.subj<-length(seq)
  set.seed(1115)
  math<-round(rnorm(num.subj,75,8))
  eng<-round(rnorm(num.subj,80,8))
  data<-cbind(math,eng)
  data<-data.frame(data)
  for(x in 1:length(seq)){
    s<-seq[x]
    filename<-paste0(mainfilename,s,".txt")
    write.table(data[x,],filename)
    message(paste0(filename," is created."))
  }
  return("Job is done.")
}
OutputData(seq=c(201,204,205))

## subject_201.txt is created.

## subject_204.txt is created.

## subject_205.txt is created.

## [1] "Job is done."
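As a side note on paste0( ), it is vectorized: when one of its arguments is a vector, it returns one string per element. A minimal sketch, in which the names ids and files are assumptions of this example:

```r
ids <- c(201, 204, 205)
files <- paste0("subject_", ids, ".txt")   # one file name per student number
files
# "subject_201.txt" "subject_204.txt" "subject_205.txt"
```

The loop in OutputData( ) builds the same names one at a time, which keeps each name next to the write.table( ) call that uses it.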

We can declare a second function to import the scores of these three students. This function retrieves the data of each student and stacks them into a single table with one row per student.

GetData<-function(mainfilename="main",suffix=".txt",seq=1){
   # Retrieve data from multiple files
   dd<-sapply(1:length(seq),function(x){
      file<-paste0(mainfilename,"_",seq[x],suffix)  # use the suffix argument
      return(read.table(file,header=TRUE))
   })
   dd<-t(dd)
   return(dd)
}
GetData("subject",seq=c(201,204,205))

##      math eng
## [1,] 73   69 
## [2,] 80   78 
## [3,] 68   86
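The object printed above is a matrix whose cells are list elements, because sapply( ) simplifies the one-row data frames into a list-based matrix. If an ordinary data frame is wanted, one possible variation is to collect the pieces with lapply( ) and stack them with rbind( ). In the sketch below, the function name GetData2, the dir argument, and the made-up scores and student numbers are all assumptions added so that the example runs on its own.

```r
# GetData2 is a hypothetical name; dir is an argument added for this sketch
GetData2 <- function(mainfilename = "subject", suffix = ".txt", seq = 1, dir = ".") {
  # Read one file per student number into a list of one-row data frames
  pieces <- lapply(seq, function(s) {
    file <- file.path(dir, paste0(mainfilename, "_", s, suffix))
    read.table(file, header = TRUE)
  })
  dd <- do.call(rbind, pieces)   # stack the rows into one data frame
  rownames(dd) <- seq            # label each row with the student number
  return(dd)
}

# Made-up scores written to a temporary folder so the sketch is self-contained
tmp <- tempdir()
scores <- data.frame(math = c(70, 82), eng = c(88, 75))
ids <- c(301, 302)
for (i in seq_along(ids)) {
  write.table(scores[i, ], file.path(tmp, paste0("subject_", ids[i], ".txt")))
}
dd <- GetData2("subject", seq = ids, dir = tmp)
is.data.frame(dd)   # TRUE
```

A data frame keeps each column as a plain numeric vector, so later steps such as mean(dd$math) work without further conversion.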

Of course, we can call one function inside another. Suppose we declare a function to compute the descriptive statistics of a distribution. The code below creates a random sample of 100 values from the normal distribution with mean = 100 and SD = 15. In addition to the indices provided by the R function summary( ), we also add the mode by using our own function MODE( ).

Summary<-function(data){
  temp<-summary(data)
  Mode<-MODE(data)
  names(Mode)<-"Mode"
  return(c(temp,Mode))
}
set.seed(1115)
data<-round(rnorm(100,100,15))
Summary(data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    Mode 
##   64.00   90.00  100.00  100.22  109.00  146.00  103.00


Lee-Xieng Yang
