Applications of the Standard Normal Distribution

One application of the standard normal distribution (\(z\) distribution) is to set probable limits on an observation. For instance, because the area between -1.96 and 1.96 on the \(z\) distribution is 0.95, 95% of \(z\) scores fall within the limits of -1.96 and 1.96. However, we generally want to express our answers in terms of raw scores rather than \(z\) scores, so we must do some transformation. Recall that \(z=\frac{x-\mu}{\sigma}\), where \(x\), \(\mu\), and \(\sigma\) are respectively a raw score, the population mean, and the population standard deviation. Suppose we have an IQ test with a population mean of 50 and a population standard deviation of 10. What are the limits within which 95% of the IQ scores fall?

Because IQ follows a normal distribution, every IQ score can be transformed to a \(z\) score given the population mean and standard deviation. We also know that 95% of \(z\) scores fall between -1.96 and 1.96. Solving \(z=\frac{x-\mu}{\sigma}\) for \(x\) gives \(x=\mu+z\sigma\). When \(z=1.96\), \(x=\mu+1.96\sigma=50+1.96(10)=69.6\). Similarly, when \(z=-1.96\), \(x=\mu-1.96\sigma=50-1.96(10)=30.4\). Therefore, 95% of IQ scores chosen at random would fall between 30.4 and 69.6.

Note that if the raw scores do not follow a normal distribution, the transformed \(z\) scores will not either. In the example below, 100 raw scores are sampled randomly from an exponential distribution and then transformed to \(z\) scores. Because the \(z\) transformation is linear, it shifts and rescales the scores without changing the shape of their distribution: the histograms of the raw scores and the \(z\) scores show the same positively skewed shape.

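library(ggplot2)
# Draw 100 raw scores from an exponential distribution (rate = 1)
x <- rexp(100, 1)
# Standardize with the sample mean and standard deviation
z <- (x - mean(x)) / sd(x)
# Stack raw and z scores in one data frame, labelled by type
dta <- data.frame(x = c(x, z), c = rep(c("x", "z"), each = 100))
ggplot(data = dta, aes(x, fill = c)) +
  geom_histogram(color = "white", bins = 30)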
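Returning to the IQ limits worked out above, the same arithmetic can be checked numerically. The following is a minimal sketch that assumes the population mean of 50 and standard deviation of 10 from the text; the variable names (`mu`, `sigma`, `z_limits`, `x_limits`) are chosen here for illustration only.

```r
# Population parameters taken from the IQ example in the text
mu <- 50
sigma <- 10

# z scores that cut off the central 95% of a standard normal distribution
z_limits <- qnorm(c(0.025, 0.975))   # approximately -1.96 and 1.96

# Back-transform to raw scores: x = mu + z * sigma
x_limits <- mu + z_limits * sigma
round(x_limits, 1)
# 30.4 69.6

# The same limits can be requested directly on the raw-score scale
qnorm(c(0.025, 0.975), mean = mu, sd = sigma)
```

Using `qnorm()` rather than typing 1.96 avoids rounding the critical value by hand, although the result is the same to one decimal place.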

Suppose a student’s English score is 84 and Mathematics score is 76. Can we say that this student performed better on the English test than on the Mathematics test? No, we cannot, because we lack information about how all students performed on the two tests. Suppose further that the mean and standard deviation on the English test are 90 and 6, while the mean and standard deviation on the Mathematics test are 70 and 6. Assuming, as is commonly the case, that test performance follows a normal distribution, we can transform these two test scores to \(z\) scores. The \(z\) score for the English score is \(z_{Eng}=\frac{84-90}{6}=-1\). In contrast, the \(z\) score for the Mathematics score is \(z_{Math}=\frac{76-70}{6}=1\). Now \(z_{Eng}<z_{Math}\), suggesting that this student performed better on the Mathematics test than on the English test. Why? The probability below -1 on the \(z\) distribution (.16) is smaller than that below 1 (.84). This is equivalent to saying that this student’s Mathematics score is higher than those of 84% of all students, whereas his/her English score is higher than those of only 16% of all students.

pnorm(1,0,1)
## [1] 0.8413447
pnorm(-1,0,1)
## [1] 0.1586553
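The whole comparison, from raw scores to percentiles, can be put together in a few lines. This is only a sketch: `z_score()` is a hypothetical helper defined here rather than a built-in R function, and the means and standard deviations are the ones assumed in the text.

```r
# Hypothetical helper: convert a raw score to a z score
z_score <- function(x, mu, sigma) (x - mu) / sigma

z_eng  <- z_score(84, mu = 90, sigma = 6)   # -1
z_math <- z_score(76, mu = 70, sigma = 6)   #  1

# Proportion of students scoring below each z score,
# assuming both tests are normally distributed
p_eng  <- pnorm(z_eng)
p_math <- pnorm(z_math)
round(c(English = p_eng, Mathematics = p_math), 2)
# English 0.16, Mathematics 0.84
```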

However, if the English mean and standard deviation are 80 and 4, whereas the Mathematics mean and standard deviation are 70 and 10, then \(z_{Eng}=\frac{84-80}{4}=1\) and \(z_{Math}=\frac{76-70}{10}=0.6\). In this case the student performed better on the English test than on the Mathematics test.

The \(z\) distribution is often used as a normalized scale for comparing scores measured on different metrics. Suppose a 3-year-old child is 95 cm tall and weighs 80 kg. It is normally meaningless to compare body height and body weight directly. However, we can normalize the height and the weight to \(z\) scores in order to understand this child’s developmental status on each. As in the case of the two test scores, we need to know in advance the means and standard deviations of the population distributions for height and weight. Such a population distribution is also called a norm. A norm for the body height of 3-year-old children is the distribution consisting of a large number of heights of 3-year-old children. How large is large enough? The more the better, but we seldom see a norm with \(N<1000\). Since body height is approximately normally distributed, the norm for height is a normal distribution. This is also true for other human features such as arm length, foot length, body weight, IQ, and so on.
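For the second English and Mathematics scenario above (means 80 and 70, standard deviations 4 and 10), the \(z\) scores can be computed for both tests at once with named vectors. This is a sketch under the normality assumption stated in the text; the vector names are illustrative.

```r
# Raw scores and assumed population parameters for each test
scores <- c(English = 84, Mathematics = 76)
means  <- c(English = 80, Mathematics = 70)
sds    <- c(English = 4,  Mathematics = 10)

# Element-wise standardization: one z score per test
z <- (scores - means) / sds
z
# English 1.0, Mathematics 0.6

# Test on which the student stands higher relative to the population
names(which.max(z))
# "English"

# Proportion of students expected to score below this student on each test
pnorm(z)
```

The same pattern would apply to height and weight once the norm means and standard deviations are known.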

Assessing Whether Data Are Normally Distributed

To assess whether collected data follow a normal distribution, we can superimpose a normal curve on a histogram and get some idea of how close the histogram is to a normal distribution. A far better approach is to use Q-Q plots (quantile-quantile plots). The file “foot.txt” contains the foot lengths of 3,982 US soldiers (1,774 males and 2,208 females). We can check how close these data are to a normal distribution. First, let us check the histograms. The figure below shows one histogram for each sex. Visual inspection suggests that both distributions are approximately normal.

# Read the whitespace-delimited data file
foot.dta <- read.table("foot.txt", header = TRUE, sep = "")
names(foot.dta) # Check the column names
## [1] "footlength" "sex"
dim(foot.dta) # Number of rows and columns
## [1] 3982    2
ggplot(data = foot.dta, aes(x = footlength, fill = sex)) +
  geom_histogram(color = "white", alpha = 0.5, bins = 30)
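The first approach mentioned above, superimposing a normal curve on a histogram, might be sketched as follows. To keep the example runnable without “foot.txt”, it uses simulated data and base R graphics instead of ggplot2; the mean and standard deviation passed to `rnorm()` are hypothetical values, not estimates from the foot-length data.

```r
set.seed(2)
# Simulated stand-in for one group of measurements; the mean and sd
# are hypothetical and are NOT taken from foot.txt
y <- rnorm(2000, mean = 27, sd = 1.3)

# Histogram on the density scale so that it is comparable with a density curve
h <- hist(y, breaks = 30, freq = FALSE,
          main = "Histogram with normal curve", xlab = "y")

# Normal curve using the sample mean and standard deviation
curve(dnorm(x, mean = mean(y), sd = sd(y)), add = TRUE, lwd = 2)

# Numerical companion to the picture: normal density at each bin midpoint
expected <- dnorm(h$mids, mean = mean(y), sd = sd(y))
cor(h$density, expected)   # close to 1 when the histogram tracks the curve
```

The histogram must be drawn with `freq = FALSE`; on the count scale the bars and the density curve would not share a vertical axis.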

Then, we can make a Q-Q plot to check how close these two distributions are to a normal distribution. In the figure below, the abscissa shows the theoretical \(z\) scores and the ordinate shows the observed scores. The observed foot lengths (i.e., the dots) lie quite close to the Q-Q line. Thus, both sets of data are approximately normally distributed.

ggplot(data=foot.dta,aes(sample=footlength,color=sex))+
  stat_qq()+stat_qq_line()
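To see what a Q-Q plot contains, the points and the reference line can be built by hand in base R. This is a sketch on simulated data with hypothetical parameters, not on “foot.txt”. It assumes the usual construction: sorted observations plotted against normal quantiles at `ppoints(n)`, and a line through the first and third quartiles, as in base R's `qqnorm()` and `qqline()`.

```r
set.seed(1)
# Simulated stand-in for one group of foot lengths (hypothetical mean and sd)
y <- rnorm(500, mean = 27, sd = 1.3)
n <- length(y)

# Abscissa: z scores expected under normality at n evenly spread probabilities
theoretical <- qnorm(ppoints(n))
# Ordinate: the observed scores in increasing order
observed <- sort(y)

# Reference line through the first and third quartiles of both axes
qy <- quantile(y, c(0.25, 0.75))
qx <- qnorm(c(0.25, 0.75))
slope <- unname(diff(qy) / diff(qx))
intercept <- unname(qy[1] - slope * qx[1])

plot(theoretical, observed, xlab = "Theoretical z scores", ylab = "Observed scores")
abline(a = intercept, b = slope)

# Rough summary of how straight the point pattern is
r <- cor(theoretical, observed)
```

For normally distributed data the points hug the line and `r` is very close to 1. Skewed data such as the exponential sample earlier would bend away from the line at one end.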