What is Data?

Data is a set of values of subjects with respect to qualitative or quantitative variables (see Wikipedia). The picture below shows the votes cast for the candidates in an election for class representative in a primary school. According to the Wikipedia definition, does this picture show data? There are no numbers in this chart. Once the votes are converted to counts, we will have data.

Look at Table 2.1 in Chapter 2. Does this table show data? You might be inclined to answer “Yes”. There are many numbers in this table, which are the reaction times (i.e., the subject) in 100ths of a second.

What is a Variable?

A variable represents a property of an object or event that can take on different values. For instance, mass, size, or density can be a variable for a physical object. Similarly, we can have variables for psychological attributes, such as IQ, the extent of anxiety, well-being at work, and so on. A variable, thus, can be a psychological attribute, which is normally measured with an appropriate instrument, such as a questionnaire, a psychological test, or an experimental apparatus (e.g., PC, eye tracker, MRI, etc.).

Types of Data Scale

Nominal scale. The values of a nominal variable are just labels for the levels of that variable. For instance, the number of a baseball player simply stands for that player, just like his or her name.

Ordinal scale. The simplest true scale is the ordinal scale, which orders people, objects, or events along some continuum. The values represent only rank order, for instance examination ranks (i.e., prize 1 > 2 > 3, etc.). However, the intervals between adjacent ranks are not required to be the same.

Interval scale. With an interval scale, we have a measurement scale in which we can legitimately speak of differences between scale points. The intervals between every two successive values on an interval scale are always the same. For instance, the Celsius and Fahrenheit scales of temperature are interval scales. The difference in temperature between \(10^{\circ}\) C and \(20^{\circ}\) C equals the difference between \(90^{\circ}\) C and \(100^{\circ}\) C. However, ratios are not meaningful: \(30^{\circ}\) C does not feel twice as hot as \(15^{\circ}\) C. Also, \(\frac{40^{\circ} F}{80^{\circ}F}\neq\frac{4.4^{\circ}C}{26.7^{\circ}C}\), although both fractions describe the same pair of temperatures. The 0 point on an interval scale does not mean the absence of the quantity. Clearly, \(0^{\circ}\) C does not mean no temperature. In fact, the absolute zero of temperature is \(-273.15^\circ\) C.

Ratio scale. A ratio scale is one that has a true zero point. Examples are the common physical scales of length, volume, time, and so on. Ten seconds is twice as long as 5 seconds (i.e., \(\frac{10}{5}=2\)). This ratio is the same as that of 20 seconds to 10 seconds.
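
To make the contrast between interval and ratio scales concrete, the arithmetic behind the temperature and time examples can be worked out as follows. This is only a sketch: the conversion \(C=\frac{5}{9}(F-32)\) is the standard formula and is assumed here, as it is not stated in the text.

\[\frac{40^{\circ}F}{80^{\circ}F}=0.5,\qquad \frac{\frac{5}{9}(40-32)}{\frac{5}{9}(80-32)}=\frac{8}{48}\approx 0.17\]

\[\frac{10\ \text{s}}{5\ \text{s}}=2,\qquad \frac{10000\ \text{ms}}{5000\ \text{ms}}=2\]

The ratio of two temperatures changes with the unit because the two scales place their zero points differently, whereas the ratio of two durations is the same in any unit because every time scale shares the same true zero.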

Note that the type of scale you use often depends on your purpose in using it. For instance, the Celsius scale can be treated as an interval scale when measuring molecular activity. However, this is not the case when measuring comfort. The underlying measurement scale is not crucial in our choice of statistical techniques.

Descriptive Statistics

The first step in understanding your data is to describe them. Normally, we can count the frequency of observed values. For instance, suppose we observed 10 students coming out of a school gate: 7 female and 3 male. In this case, student gender is the variable of interest, which has two levels or values (e.g., 1 for male and 0 for female). What scale is the gender data on? We can plot the frequency of each gender. Based on this bar plot, we can at least say that there are more female students. The gender value here is nominal, so no arithmetic on it is meaningful. However, the frequency of each gender satisfies the definition of a ratio scale.

Data Visualization

In the classic study of short-term memory by Sternberg (1962), the question was whether the length of the study list influences the reaction time for recognizing the study items. In the learning phase, a series of digits of one of three lengths (i.e., 1, 3, or 5) was presented to participants, who were told simply to memorize them. In the test phase, participants were given a new set of digits and asked to respond “yes” or “no” to the question of whether a particular number, say “6”, was one of the digits just seen. The response time (or reaction time), in units of 10 ms, was the dependent variable in the data analysis. The collected data can be seen in Table 2.1.

We can count the frequency of each observed RT (reaction time) and list the counts in a table. See Table 2.2. You might notice that the relatively short (e.g., 360 ms) and long (e.g., 950 ms) RT’s are less frequent than those in the middle. This pattern is easier to see in Figure 2.1. We can also plot the counts of the RT data by intervals, which gives a histogram. See Figure 2.2 for the histogram of the RT’s in Table 2.1. Although Figure 2.2 loses some of the detail of Figure 2.1, the main pattern remains.

How to Make Bar Plot and Histogram with R?

There are at least two ways to make graphics in R. I would particularly recommend using the functions in the package {ggplot2}. In order to do so, you need to install {ggplot2} first. Before that, let’s start learning how to use R and RStudio.

Simple Tutorial for R and RStudio

Launch RStudio. There are four panes. The lower-left pane is called the console, where you can enter commands and get feedback immediately. For example, in the console pane, we can assign the number 10 to a variable x with the symbol <-, which is made up of < and -. We can get the value of x by simply typing its name. Here x is the variable and 10 is its value.

x<-10
x
## [1] 10

In addition to a single value, a variable can be assigned more than one value. For example, the variable a in the code below is actually a vector containing four numbers. The function c( ) combines its arguments (i.e., those four numbers in the example below). Note that in R, a function call must be followed by a pair of parentheses whereas a variable name must not. Similarly, b in the example below is a vector containing five character components. With mode(a), we can tell the mode of the variable a, which is numeric here. In contrast, b’s mode is character.

a<-c(10,-2,0.4,-0.05)
a
## [1] 10.00 -2.00  0.40 -0.05
b<-c("John","Mary","Smith","Mike","Vicky")
b
## [1] "John"  "Mary"  "Smith" "Mike"  "Vicky"
mode(a)
## [1] "numeric"
mode(b)
## [1] "character"

A data frame is a special type of variable in R, which records values in different columns (i.e., variables) with one row representing one case. For example, cars is a data frame that comes with R, in which there are 2 columns (speed and dist) and 50 rows. You can use the function dim( ) to check the numbers of rows and columns. For a quick glance at the contents of this data frame, you can use head( ) to show the data in the first 6 rows.

dim(cars)
## [1] 50  2
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
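
A few other base R functions are handy for inspecting a data frame. The sketch below is only illustrative: nrow( ), ncol( ), names( ), tail( ), and the $ operator are not used in the text above, but they all belong to base R.

```r
# cars ships with R (package datasets), so nothing needs to be imported
nrow(cars)        # number of rows (cases): 50
ncol(cars)        # number of columns (variables): 2
names(cars)       # column names: "speed" "dist"
cars$speed[1:6]   # first six values of the column speed: 4 4 7 7 8 9
tail(cars, 3)     # the last three rows
```

The $ operator pulls out a single column as a vector, which is often the form needed by functions that compute statistics.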

We can create our own data frame with the function data.frame( ). For example, the code below creates a data frame dta1, containing five students’ names, English scores, and Mathematics scores. Note that a data frame can contain variables of different modes. In this case, the variables English and Math are numeric, but names is a character variable. The mode of a data frame is list, which can be viewed as a variable containing variables of different modes.

names<-b
English<-c(60,55,78,89,93)
Math<-c(45,75,82,64,66)
dta1<-data.frame(names=names,English=English,Math=Math)
dta1
##   names English Math
## 1  John      60   45
## 2  Mary      55   75
## 3 Smith      78   82
## 4  Mike      89   64
## 5 Vicky      93   66
mode(dta1)
## [1] "list"
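
To see that the columns of a data frame really do keep their own modes, one possible check is sketched below. It rebuilds dta1 so that it runs on its own. The argument stringsAsFactors = FALSE and the row filter are additions for illustration, not part of the example above.

```r
# Rebuild the small data frame from the example above
dta1 <- data.frame(names   = c("John", "Mary", "Smith", "Mike", "Vicky"),
                   English = c(60, 55, 78, 89, 93),
                   Math    = c(45, 75, 82, 64, 66),
                   stringsAsFactors = FALSE)  # keeps names as character in older R versions

col_modes <- sapply(dta1, mode)   # apply mode() to every column
col_modes                         # "character" "numeric" "numeric"

dta1$English                      # one column, returned as a numeric vector
high_math <- dta1[dta1$Math > 65, ]   # rows of students whose Math score exceeds 65
high_math
```

Here sapply( ) applies mode( ) to each column in turn, and the logical condition inside the square brackets selects rows.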

In most situations, the data to be analyzed are stored on disk as a file. Therefore, we need to import the data from a file before we can do any statistical analysis. The following code shows how to import the data from a text file chap2_sternberg.txt. In fact, this text file contains the data listed in Table 2.1. The column names start with V for variable, followed by a number showing the length of the study list and a character Y or N for the response. For example, V3Y means that the set size is 3 and the response is Yes.

dta2<-read.table("chap2_sternberg.txt",header=TRUE,sep="")
dim(dta2)
## [1] 52  6
head(dta2)
##   V1Y V1N V3Y V3N V5Y V5N
## 1  40  52  73  73  39  66
## 2  41  45  83  47  65  53
## 3  47  74  55  63  53  61
## 4  38  56  59  63  46  74
## 5  40  53  51  56  78  76
## 6  37  59  65  66  60  69
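
If chap2_sternberg.txt is not at hand, the import step can still be tried on a small stand-in file. The sketch below assumes the file is whitespace-delimited with a header line, which is what the call above implies. The temporary file and the name small are hypothetical, and only the first three rows printed by head(dta2) are used.

```r
# Write a tiny whitespace-delimited text file with a header line
tmp <- tempfile(fileext = ".txt")   # hypothetical stand-in, not chap2_sternberg.txt
writeLines(c("V1Y V1N V3Y V3N V5Y V5N",
             "40 52 73 73 39 66",
             "41 45 83 47 65 53",
             "47 74 55 63 53 61"), tmp)

# header = TRUE: the first line holds the column names
# sep = "":      columns are separated by any amount of white space
small <- read.table(tmp, header = TRUE, sep = "")
dim(small)    # 3 rows and 6 columns
str(small)    # column names and types
unlink(tmp)   # remove the temporary file
```

Writing header=TRUE in full rather than header=T is slightly safer, because T is an ordinary variable that can be overwritten whereas TRUE cannot.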

Histogram

If we simply want to plot the frequency distribution (or histogram) of the RT’s in this study, we can pool the data into a single variable in a new data frame RTs. The function unlist( ) turns a list into a vector. Thus, the variable RT in RTs contains all the RT’s in the data frame dta2.

Before we can plot a histogram, we need to install the package {ggplot2}. You can find the Install button under the Packages tab in the lower-right pane. Just click it and enter the name of the package that you want to install, here ggplot2. Once {ggplot2} is installed successfully, you need to load it with the function library( ) with ggplot2 as its argument.

Now you can make your histogram by calling ggplot( ) and geom_histogram( ). The function ggplot( ) defines where the data come from (RTs in this case) by setting data=RTs, and which variables are mapped to the dimensions of the figure with aes(RT). Here RT is the only data dimension. The function geom_histogram( ) draws the histogram; its argument color sets the color of the outline of each bin, fill sets the color of the body of each bin, and binwidth sets the width of each bin. For a fuller description of the arguments of geom_histogram( ), you can enter ?geom_histogram in the console pane. The result will be shown under the Help tab in the lower-right pane. In fact, ? followed by a function name is a quick way to get the help file for any function.

RTs<-data.frame(RT=unlist(dta2))
head(RTs)
##      RT
## V1Y1 40
## V1Y2 41
## V1Y3 47
## V1Y4 38
## V1Y5 40
## V1Y6 37
library(ggplot2)
ggplot(data=RTs,aes(RT))+
  geom_histogram(color="black",fill="gray",binwidth=6)
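
The row labels such as V1Y1 in the output above come from unlist( ). A small sketch of what it does to a data frame follows. The two-column data frame small is a made-up excerpt of dta2, used only so that the code runs without the data file.

```r
# Two columns and two rows, taken from the head of dta2 shown earlier
small <- data.frame(V1Y = c(40, 41), V1N = c(52, 45))

flat <- unlist(small)   # a named numeric vector, columns stacked one after another
flat
names(flat)             # "V1Y1" "V1Y2" "V1N1" "V1N2"

RTs_small <- data.frame(RT = flat)   # same construction as RTs above
nrow(RTs_small)                      # 2 columns x 2 rows = 4 RT's
```

Each name is the column name followed by the position of the value within that column, so the condition an RT came from can still be read off the row label.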

With the code above, we replicate Figure 2.2. Based on this histogram, we can describe this RT data set. For instance, we know that the extreme RT’s are less frequent than the typical ones. Further, we can fit a curve to this frequency distribution to get a neater description of the data. The figure below shows the histogram with a fitted curve. Actually, this curve is a combination of many normal curves, which we call kernels, according to the principle shown in Figure 2.5. Note that in Figure 2.4 only one normal curve is fitted to the data, whereas Figure 2.6 is made from many normal curves assembled as in Figure 2.5. This is why the figure below looks more similar to Figure 2.6.

ggplot(data=RTs,aes(RT))+
  geom_histogram(aes(y=after_stat(density)),color="black",fill="tomato",alpha=0.3,binwidth=6)+
  geom_density()
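
The idea of assembling many normal curves can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not necessarily how geom_density( ) is implemented: the six RT’s are taken from the first rows of column V1Y shown earlier, the bandwidth h is an arbitrary choice, and kde_at is a hypothetical helper name.

```r
# Six RT's (in units of 10 ms) from the first rows of column V1Y
rt <- c(40, 41, 47, 38, 40, 37)
h  <- 3   # standard deviation of each kernel (bandwidth); arbitrary, for illustration

# Density estimate at one point x0: the average height at x0 of the
# normal curves centred on each observation
kde_at <- function(x0, data, bw) mean(dnorm(x0, mean = data, sd = bw))

step <- 0.1
grid <- seq(20, 65, by = step)                    # points at which to evaluate the curve
est  <- sapply(grid, kde_at, data = rt, bw = h)   # height of the curve at each grid point

area <- sum(est) * step   # approximate area under the curve, close to 1
area
```

Calling plot(grid, est, type="l") draws the resulting curve. Base R’s density(rt, bw=h) computes a similar estimate, and a larger h gives a smoother curve.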

Caution with Histogram

Using a histogram to describe data is straightforward. However, some caution is needed. Specifically, the choice of breaks used to group the data can alter the shape of the distribution, which may lead you to a different conclusion. As shown in Figure 2.3, the right panel looks more like a normal distribution than the left, although both are plotted from the same data. How and why? Let’s create a data set to replicate this observation.

The data frame dta3 is created to simulate the data set used for Figure 2.3. The function rep( ) repeats its argument a given number of times. Thus, rep(82,4) means 82, 82, 82, and 82. More details about rep( ) can be found by calling for help in R. The argument breaks in geom_histogram( ) defines the start and end points of the intervals. The function seq(a,b,by=c) creates a series of numbers from a to b with a step size of c.

In the following figures, the first uses breaks from 72.5 to 97.5 with an interval size of 5, whereas the second uses breaks from 75 to 100, again with an interval size of 5. However, the shapes of these two histograms look quite different. The first figure would lead us to conclude that most lengths are larger than the mean length, whereas the second would support the conclusion that most lengths are around the mean length.

dta3<-data.frame(length=c(75,rep(82,4),rep(85,6),rep(87.5,6),
                          rep(90,12),rep(92.5,11),rep(96,4)))
ggplot(data=dta3,aes(length))+
  geom_histogram(breaks=seq(72.5,97.5,by=5),color="black",fill="forestgreen",alpha=0.3)

ggplot(data=dta3,aes(length))+
  geom_histogram(breaks=seq(75,100,by=5),color="black",fill="tomato",alpha=0.3)
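
The effect of the breaks can also be checked numerically, without drawing anything. The sketch below counts the values falling in each interval with base R’s cut( ) and table( ). It assumes that the intervals are closed on the right, which is, to my knowledge, also the default in geom_histogram( ). The helper count_bins is a hypothetical name.

```r
len <- c(75, rep(82, 4), rep(85, 6), rep(87.5, 6),
         rep(90, 12), rep(92.5, 11), rep(96, 4))

# cut() assigns each value to an interval (lower, upper];
# include.lowest = TRUE also keeps a value equal to the first break
count_bins <- function(x, breaks) {
  as.vector(table(cut(x, breaks = breaks, include.lowest = TRUE)))
}

counts1 <- count_bins(len, seq(72.5, 97.5, by = 5))
counts2 <- count_bins(len, seq(75, 100, by = 5))
counts1     # 1 4 12 23 4
counts2     # 1 10 18 11 4
mean(len)   # about 89.08
```

With the first set of breaks more than half of the 44 values land in a single interval near the top, while the second set spreads the same values more evenly around the middle interval.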

Describing Distributions

A distribution can be described in several respects. See Figure 2.10. Panel (a) shows a distribution that peaks at the center and descends at a roughly equal rate towards the two ends. This distribution is called a normal distribution. It is symmetric and unimodal. In contrast, panel (b) shows a distribution with two peaks, although it is symmetric. It is called bimodal and is not a normal distribution.

Panel (c) shows a long tail towards the minimum, whereas panel (d) shows a long tail towards the maximum. The former is called negatively skewed and the latter positively skewed. Although these two distributions have only one peak, they are not symmetric. Thus, they are not normal distributions by definition. Panel (e) is called platykurtic, namely having less kurtosis (or peakedness) than the normal distribution. On the contrary, panel (f) has more kurtosis than the normal distribution and is called leptokurtic.

Measures of Central Tendency

A distribution gives us a visual sense of the general magnitude of the numbers in the data set. There are several statistics that can be used to represent the “center” of the distribution.

Mean. The most common measure of central tendency is the mean, \(\bar{X}=\frac{\sum X}{N}\), where \(N\) is the total number of scores.

Mode. The mode can be defined simply as the most common score. For instance, in a series of numbers [2,2,4,4,4,1,7,9], 4 is the most frequent one, namely the mode.

Median. The median is the score that corresponds to the point at or below which 50% of the scores fall when the data are arranged in numerical order. For example, consider the numbers [5,8,3,7,15]. Sorted in numerical order, they become [3,5,7,8,15], so the middle score is 7. In general, the median location is \(\frac{N+1}{2}\). If there is an even number of scores, this location falls halfway between two scores, and the median is the average of those two middle scores.

How to use R to Get Measures of Central Tendency?

Suppose we have a series of numbers [5,8,3,7,15]. We can compute the mean and median with summary( ). We can then run this function on another series with an even number of scores.

x1<-c(5,8,3,7,15)
summary(x1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     7.0     7.6     8.0    15.0
x2<-c(5,11,3,7,15,14)
summary(x2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.500   9.000   9.167  13.250  15.000
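
The median reported by summary( ) can be reproduced from the median location \(\frac{N+1}{2}\) described earlier. The function below is a minimal sketch with a hypothetical name, median_by_location. It is not how R’s own median( ) is written, but it should agree with it for vectors without missing values.

```r
median_by_location <- function(x) {   # hypothetical helper name
  s   <- sort(x)                      # arrange the scores in numerical order
  loc <- (length(s) + 1) / 2          # median location (N + 1) / 2
  # floor and ceiling coincide when N is odd and
  # pick the two middle scores when N is even
  mean(s[c(floor(loc), ceiling(loc))])
}

x1 <- c(5, 8, 3, 7, 15)
x2 <- c(5, 11, 3, 7, 15, 14)
median_by_location(x1)   # 7
median_by_location(x2)   # 9, the average of 7 and 11
sum(x2) / length(x2)     # the mean of x2, 55 / 6
```

For x2 the median location is 3.5, so the third and fourth sorted scores are averaged.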

One strange thing in R is that summary( ) does not output the mode of a variable. In addition, the name mode( ) is already taken by the function that shows the storage type of a variable. See the code below. Then, how can we get the statistical mode in R?

The function table( ) counts the frequencies of the values in a variable. Thus, tx3 holds the counts of the values instead of the values themselves. The function max( ) returns the maximum among the values of a variable, so max(tx3) returns the largest count. The expression tx3==max(tx3) checks which counts in tx3 are equal to that maximum (the operator == means equal to), and which( ) returns the position where this is true. Thus, mod.x3 first holds the position of the largest count (3, the lower number in the output) together with its name (4, the upper number). Since the names of tx3 are the values of x3, the name of mod.x3 is the value with the highest frequency. However, names in R are stored as characters, so we convert the name of mod.x3 to a number with as.numeric( ).

x3<-c(2,2,4,4,4,1,7,9)
mode(x3)
## [1] "numeric"
tx3<-table(x3)
tx3
## x3
## 1 2 4 7 9 
## 1 2 3 1 1
mod.x3<-which(tx3==max(tx3))
mod.x3
## 4 
## 3
mod.x3<-as.numeric(names(mod.x3))
mod.x3
## [1] 4
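
The steps above can be wrapped into a small reusable function. This is a sketch rather than a standard R function: stat_mode is a hypothetical name, and it returns every value that ties for the highest count.

```r
stat_mode <- function(x) {   # hypothetical helper, not part of base R
  counts <- table(x)         # frequency of each distinct value
  # keep the names (i.e., the values) whose count equals the largest count
  as.numeric(names(counts)[counts == max(counts)])
}

stat_mode(c(2, 2, 4, 4, 4, 1, 7, 9))   # 4
stat_mode(c(1, 1, 2, 2, 3))            # 1 2 (two values tie)
```

Returning all tied values avoids silently reporting only one mode for a bimodal set of scores.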

Note that in a normal distribution, the mean, the median, and the mode are the same, as it is symmetric and unimodal. Then, in a negatively skewed distribution, what is the rank order of these three measures? Although the answer depends on the extent of the skew, it can be expected that Mean \(\leq\) Median \(\leq\) Mode, as the mean is pulled strongly towards the extreme values. In a negatively skewed distribution, the extreme values are much smaller than most of the values. Similarly, you can think about the rank order of these measures in the case of a positively skewed distribution.
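
A quick numerical check of this ordering is sketched below. The scores in y are made up for illustration and contain one extreme low value, which gives a negatively skewed set.

```r
y <- c(1, 8, 9, 9, 10, 10, 10)   # made-up scores with one extreme low value

counts <- table(y)                                          # frequency of each value
y_mode <- as.numeric(names(counts)[counts == max(counts)])  # most frequent value

mean(y)     # 57 / 7, about 8.14
median(y)   # 9
y_mode      # 10
```

The single low score pulls the mean below the median, while the mode stays at the peak of the distribution.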

Measures of Variability

In addition to the central tendency, we would like to know the degree to which individual observations deviate from the average value, or how diverse the observed values are.

Range. The range of a series of values is simply the distance from the lowest to the highest score. The range of the data in x3 is 9-1=8.

Interquartile Range. The interquartile range represents an attempt to circumvent the problem of the range’s heavy dependence on extreme scores. An interquartile range is obtained by discarding the upper 25% and the lower 25% of the distribution and taking the range of what remains. The point that cuts off the lowest 25% of the distribution is called the first quartile or Q1. Similarly, the point that cuts off the upper 25% of the distribution is called the third quartile or Q3. The median is the second quartile or Q2. In fact, summary( ) provides these measures.

summary(x3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   4.000   4.125   4.750   9.000
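
The range and the interquartile range can also be obtained directly. The sketch below uses the base R functions range( ), diff( ), quantile( ), and IQR( ), which are not mentioned in the text above. The quartiles follow R’s default method, the same one summary( ) uses.

```r
x3 <- c(2, 2, 4, 4, 4, 1, 7, 9)

range(x3)         # lowest and highest score: 1 9
diff(range(x3))   # the range as a single number: 8

q <- quantile(x3, c(0.25, 0.75))   # Q1 and Q3: 2 and 4.75
q
unname(q[2] - q[1])   # interquartile range: 2.75
IQR(x3)               # the same value from the built-in function
```

Other textbooks and programs may define the quartiles slightly differently, so small discrepancies with hand calculations are possible.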

Variance. The variance is a measure of how scores are dispersed around the mean. The variance is computed as \(\frac{\sum(X-\bar{X})^2}{N}\), where \(N\) is the size of the data set. In statistics, the population variance is normally denoted as \(\sigma^2\) and the sample variance as \(s^2\). Take the variable x3 as an example. The code below computes the variance of x3. The function length( ) returns the number of components in a variable. In this example, since x3 has 8 components, the length of x3 is 8.

ssquare<-sum((x3-mean(x3))^2)/length(x3)
ssquare
## [1] 6.359375

Standard Deviation. The standard deviation is defined as the positive square root of the variance. Thus, \(s\) and \(\sigma\) refer to the sample standard deviation and the population standard deviation, respectively. The sample standard deviation is defined as \(s_X=\sqrt{\frac{\sum(X-\bar{X})^2}{N-1}}\). The following R code computes the standard deviation of x3 and its square. Note that the squared standard deviation of x3 (7.27) is larger than the variance of x3 computed above (6.36). This is because sd( ) divides the sum of squared deviations not by \(N\) but by \(N-1\). Why? An explanation is provided on this web page.

sd(x3)
## [1] 2.695896
sd(x3)^2
## [1] 7.267857
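
The two denominators can be compared side by side. The sketch below recomputes both versions from the sum of squared deviations and also uses var( ), a base R function not used in the text above, which divides by \(N-1\) just as sd( ) does.

```r
x3 <- c(2, 2, 4, 4, 4, 1, 7, 9)
n  <- length(x3)
ss <- sum((x3 - mean(x3))^2)   # sum of squared deviations: 50.875

ss / n         # divides by N:     6.359375
ss / (n - 1)   # divides by N - 1: 7.267857
var(x3)        # R's var() also divides by N - 1

var(x3) * (n - 1) / n   # converts back to the N version: 6.359375
```

The two values differ by the factor \(\frac{N}{N-1}\), which approaches 1 as the data set grows.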

Sample and Population

The distribution of the sample data is assumed to be representative of the population distribution. Therefore, the sample mean and standard deviation (or variance) should be unbiased estimators of the population parameters (i.e., the population mean and standard deviation). However, when the population mean is unknown and the sample mean is used in its place, the sample variance computed with \(n\) in the denominator is a biased estimator of the population variance: on average it is too small. Bessel thus suggested multiplying this variance by \(\frac{n}{n-1}\) as a correction, where \(n\) is the sample size. This is equivalent to dividing the sum of squared deviations by \(n-1\) instead of \(n\).

The quantity \(n-1\) is also called the degrees of freedom. Suppose we want to estimate each of \(n\) data points. Without any prior theory, a straightforward way is to use \(n\) parameters to represent them. Now suppose we already know the mean of these \(n\) data points. We only need to estimate \(n-1\) parameters, since the \(n^{th}\) data point can be computed as \(n\bar{X}-(X_1+X_2+\cdots+X_{n-1})\). With one parameter corresponding to one degree of freedom, the total degrees of freedom in this case is \(n-1\).
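
Both points can be checked on a toy example. The sketch below assumes a made-up population of three scores and lists every possible sample of size 2 drawn with replacement, so no random numbers are involved. The names pop, var_n, and var_nm1 are hypothetical.

```r
pop    <- c(1, 2, 3)                   # a made-up population
sigma2 <- mean((pop - mean(pop))^2)    # population variance: 2/3

# Every ordered sample of size n = 2 drawn with replacement (9 samples)
samples <- expand.grid(first = pop, second = pop)

var_n   <- apply(samples, 1, function(s) mean((s - mean(s))^2))  # divide by n
var_nm1 <- apply(samples, 1, var)                                # divide by n - 1

mean(var_n)     # 1/3: too small on average
mean(var_nm1)   # 2/3: equals the population variance

# Degrees of freedom: once the mean is known, the last score is fixed by the others
x    <- c(5, 8, 3, 7, 15)
n    <- length(x)
last <- n * mean(x) - sum(x[-n])
last            # 15, the fifth score
```

Averaged over all samples, the version with \(n\) in the denominator gives \(\frac{n-1}{n}\sigma^2\), here half of the population variance, which is exactly what the factor \(\frac{n}{n-1}\) corrects.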