A Little Game of Probability

Suppose you are standing in front of 3 doors. Behind one of them there is an automobile, but you do not know which one. The host of the game invites you to choose one door and says that if you choose the correct door, you can have the car behind it. Now suppose you have chosen a door. Before the host opens the door you chose, he decides to do you an extra favor. He opens one of the 2 doors you did not choose and shows that there is nothing behind it. Now he asks whether you want to change your choice. Will you change your choice? The R code below shows how this game can be solved by computer simulation.

The user-defined function game( ) computes the probability of winning the prize if you change your choice (i.e., index=1) and if you do not change your choice (index=0). The variable data stores the answer matrix, with each column representing one round of the game and each row representing a door; in each column, 0 denotes no prize and 1 denotes the prize. The results show that the winning probability is larger when you decide to change your choice.

game<-function(data,index){
   set.seed(9)
   results<-sapply(1:ncol(data),function(j){
             temp1<-data[,j]
             names(temp1)<-1:3
             # Assume that the first choice is always the first door
             resp1<-1
             names(resp1)<-1
             # Get the correct answer
             ans<-as.numeric(names(temp1[temp1==1]))
             if(ans!=1){
                # The first choice is not the correct answer
                temp2<-temp1[c(resp1,ans)]
                if(index==1){
                    # Make a change
                    resp2<-temp2[-1]
                    return(sum(as.numeric(names(resp2))==ans))
                }else{
                    # No change
                    resp2<-resp1
                    return(sum(as.numeric(names(resp2))==ans))
                }
             }else{
                # The first choice is the correct answer
                flag<-rbinom(1,1,0.5)+2
                temp2<-temp1[-flag]
                if(index==1){
                    # Make a change
                    resp2<-temp2[-1]
                    return(sum(as.numeric(names(resp2))==ans))
                }else{
                    # No change
                    resp2<-resp1
                    return(sum(as.numeric(names(resp2))==ans))
                }
             }
   })
   return(results)
}
set.seed(123)
data<-rmultinom(10,size=1,prob=c(1/3,1/3,1/3))
mean(game(data,1))
## [1] 0.6
mean(game(data,0))
## [1] 0.4
set.seed(123)
data<-rmultinom(100,size=1,prob=c(1/3,1/3,1/3))
mean(game(data,1))
## [1] 0.67
mean(game(data,0))
## [1] 0.33
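The same conclusion can be reached with a much more compact simulation (a sketch, independent of the game( ) function above): switching wins exactly when the first pick misses the car, which happens with probability 2/3.

```r
# Compact check: switching wins whenever the first pick misses the car
set.seed(123)
n_trials <- 10000
prize <- sample(1:3, n_trials, replace = TRUE)  # door hiding the car
pick  <- sample(1:3, n_trials, replace = TRUE)  # contestant's first choice
mean(prize != pick)  # winning rate when switching, close to 2/3
mean(prize == pick)  # winning rate when staying, close to 1/3
```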

Basic Concepts of Probability

What is probability? Probability is simply how likely something is to happen. The analysis of events governed by probability is called statistics. For example, tossing a coin is an event with two possible outcomes: Head or Tail. The probability of an event can then be defined as the ratio of the number of ways it can happen to the total number of possible outcomes. In the case of tossing a coin, \(p(H)=\frac{1}{2}\), and so is \(p(T)\). Similarly, the probability of each outcome of throwing a die is \(\frac{1}{6}\).

Probability Axioms

  • The probability of an event A is \(0\leq P(A) \leq 1\)
  • The opposite of an event A is the event [not A], denoted as \(\neg A\) or \(\sim A\)
  • \(P(A)+P(\neg A)=1\)

Probability Interpretations

There are two types of probability interpretations. One is the objectivist view, in which the probability of an outcome is its relative frequency of occurrence. For example, you toss a coin 100 times and observe 30 Heads. Then, the probability of Heads = \(\frac{30}{100}=0.3\). The other is the subjectivist view, in which the probability of an outcome is the degree of your belief about how likely this outcome is to happen. An example is the probability of precipitation you estimate for tomorrow.
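The objectivist (relative-frequency) interpretation can be illustrated with a quick simulation (a sketch; the sample size of 10,000 is an arbitrary choice): as the number of tosses grows, the relative frequency of Heads approaches the true probability.

```r
set.seed(1)
tosses <- rbinom(10000, 1, 0.5)  # 1 = Head, 0 = Tail, for a fair coin
mean(tosses)                     # relative frequency of Heads, close to 0.5
```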

Joint Probability and Conditional Probability

The below table shows the number of cases in each of the four combinations across the occurrence of coronary heart disease and the type of diet.

Coronary Heart Disease   Cholesterol Diet (H)   Cholesterol Diet (L)   Marginal Sum
Y                        11                     2                      13
N                        4                      6                      10
Marginal Sum             15                     8                      Total: 23

The probability of taking a high cholesterol diet is \(\frac{15}{23}=0.65\) and the probability of having a coronary heart disease is \(\frac{13}{23}=0.57\). If two events A and B occur on a single performance of an experiment, this is called the joint probability of A and B, denoted as \(p(A,B)\). In the above table, the probability for a patient to have a high cholesterol diet (HCD) and a coronary heart disease (CHD) is \(p(CHD,HCD)=\frac{11}{23}=0.48\). The conditional probability is the probability that an event occurs given that another event has occurred; the conditional probability of the event A given the event B is denoted as \(p(A|B)\). For example, the probability of getting a coronary heart disease for a patient who has had a high cholesterol diet is \(p(CHD|HCD)=\frac{11}{11+4}=0.73\).

The joint probability = the conditional probability \(\times\) the marginal probability. For example, \(p(CHD,HCD)=p(CHD|HCD)\times p(HCD)\). That is, \(\frac{11}{23}=\frac{11}{15}\times\frac{15}{23}\). Alternatively, the same joint probability can be viewed as the product of the conditional probability of having a high cholesterol diet given that the patient has a coronary heart disease and the probability of having a coronary heart disease, namely \(p(CHD,HCD)=p(HCD|CHD)\times p(CHD)\). That is, \(\frac{11}{23}=\frac{11}{13}\times\frac{13}{23}\).
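The identity joint probability = conditional probability \(\times\) marginal probability can be verified numerically from the table above (a small sketch in R):

```r
# Contingency table from the text: rows = CHD (Y/N), columns = diet (H/L)
tab <- matrix(c(11, 4, 2, 6), nrow = 2,
              dimnames = list(CHD = c("Y", "N"), Diet = c("H", "L")))
n <- sum(tab)                               # 23 patients in total
p_joint <- tab["Y", "H"] / n                # p(CHD,HCD) = 11/23
p_cond  <- tab["Y", "H"] / sum(tab[, "H"])  # p(CHD|HCD) = 11/15
p_marg  <- sum(tab[, "H"]) / n              # p(HCD)     = 15/23
all.equal(p_joint, p_cond * p_marg)         # TRUE
```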

Bayes’ Theorem

Suppose we have two events A and B. According to the definition of the joint probability,

\[\begin{align*} p(A,B)&=p(A|B)p(B)\\ &=p(B|A)p(A). \end{align*}\]

Therefore, \[\begin{align*} p(A|B)p(B)&=p(B|A)p(A)\\ p(A|B)&=\frac{p(B|A)p(A)}{p(B)}. \end{align*}\]

Also,

\[\begin{align*} p(B)&=p(B,A)+p(B,\neg A)\\ &=p(B|A)p(A)+p(B|\neg A)p(\neg A)\\ &=\sum_i p(B|A_i)p(A_i). \end{align*}\]

Now we change the terms by substituting \(A\) with \(\theta\) and \(B\) with \(D\) and we get

\[\begin{align*} p(\theta|D)=\frac{p(D|\theta)p(\theta)}{p(D)}=\frac{p(D|\theta)p(\theta)}{\sum_i p(D|\theta_i)p(\theta_i)}, \end{align*}\]

where \(p(\theta)\) is called the prior probability of the event, \(p(D|\theta)\) is called the likelihood, \(p(\theta|D)\) is called the posterior probability of the event, and \(p(D)\) is called the evidence.
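A tiny numeric illustration (with made-up priors and likelihoods, not from the text) shows how the theorem turns a prior into a posterior:

```r
# Hypothetical two-hypothesis example with equal priors p(A1) = p(A2) = 0.5
prior <- c(A1 = 0.5, A2 = 0.5)
lik   <- c(A1 = 0.8, A2 = 0.2)       # p(B|A1) and p(B|A2)
evidence  <- sum(lik * prior)        # p(B) by the law of total probability
posterior <- lik * prior / evidence  # p(A1|B) = 0.8, p(A2|B) = 0.2
sum(posterior)                       # posterior probabilities sum to 1
```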

Continuous Probability Distribution

A continuous probability distribution is a probability distribution whose cumulative distribution function is continuous. That is, \(p(a \leq X \leq b)=\int_{a}^{b}f(x)dx\), where \(f(x)\) is called the probability density function. Also, \(p(-\infty \leq X \leq \infty)=\int_{-\infty}^{\infty} f(x) dx=1\). That means the sum of the probabilities of all possible outcomes equals 1. See the below illustration.
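Such integrals can be checked numerically in R with integrate( ); for example, for the standard normal density (a sketch):

```r
# Area under the standard normal density
integrate(dnorm, -1, 1)$value      # p(-1 <= X <= 1), about 0.683
integrate(dnorm, -Inf, Inf)$value  # total probability, equal to 1
```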

We can extend Bayes’ theorem to the case where the occurrence of an event follows the continuous probability distributions (e.g., body weight). Then the Bayes’ theorem is written as

\[\begin{align*} p(\theta|D)=\frac{P(D|\theta)p(\theta)}{P(D)}=\frac{P(D|\theta)p(\theta)}{\int P(D|\theta) p(\theta) d\theta}. \end{align*}\]

Here, \(\theta\) is the parameter of a model. The likelihood \(P(D|\theta)\) is computed by the likelihood function given the parameter \(\theta\). The likelihood function is actually the model used to predict what data we will observe given the parameters. Suppose your model is a normal probability density function. Given the two parameters mean = 0 and SD = 1, the likelihood of observing 0 is 0.399. That is, \(P(D=0|\theta,M)=0.399\) with \(M\) = normal probability density function and \(\theta=[0,1]\). However, the likelihood \(P(D|\theta)=0.242\) with \(\theta=[1,1]\). Thus, given 0 as the observed data, the mean is more likely to be 0 than 1. The evidence \(P(D)\) is also called the marginal likelihood, as it is the sum of the likelihoods of the same \(D\) over all possible \(\theta\)s, weighted by their prior probabilities.
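These two likelihood values can be reproduced in R with dnorm( ):

```r
dnorm(0, mean = 0, sd = 1)  # likelihood of observing 0 with theta = [0, 1]: 0.399
dnorm(0, mean = 1, sd = 1)  # likelihood of observing 0 with theta = [1, 1]: 0.242
```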

Numerical Example I

Suppose a coin is tossed 10 times and the outcomes are 7 Heads and 3 Tails. What is your estimate for the probability of Head for this coin, \(\theta\)? Presumably, there are infinitely many possible values between 0 and 1 for \(\theta\), and we do not know what the real value is. Since we have no particular preference for which is the real value, we can say that all values between 0 and 1 are equally likely to be the answer. In the terminology of probability, this is equivalent to saying that \(\theta\) follows a uniform distribution. Thus, the prior distribution of \(\theta\) is uniform, namely

\[\begin{align*} \theta \sim U(0,1). \end{align*}\]

Then, suppose that we take 11 equally spaced candidate values in \([0,1]\) for \(\theta\), namely \(\theta=[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]\). As the prior distribution is uniform, all these values are equally plausible, so we assign each the same prior weight, \(p(\theta_i)=1, i=1,\cdots,11\). For the likelihood function, the binomial distribution is a suitable model, as each toss has exactly 2 possible outcomes and the tosses are independent of one another. For a fixed number of trials, the binomial distribution has only one parameter, which in this example is the probability of Head. Therefore, the likelihood is

\[\begin{align*} P(D|\theta)=Bin(k,n,\theta), \end{align*}\]

where \(n=10\) is the total number of tosses, \(k=7\) is the number of Heads observed, and \(\theta\) is the probability of Head. Now we can compute \(P(D|\theta)\) using R. It turns out that when \(\theta=0.7\), the likelihood of getting 7 Heads in 10 tosses of this coin is the largest.

# Set up all 11 values for theta
theta<-seq(0,1,0.1)
# Compute the likelihoods of a binomial distribution given the theta values
L<-dbinom(7,10,theta)
L
##  [1] 0.000000000 0.000008748 0.000786432 0.009001692 0.042467328 0.117187500
##  [7] 0.214990848 0.266827932 0.201326592 0.057395628 0.000000000

Then, the marginal likelihood can be computed as \(\sum_i P(D|\theta_i)p(\theta_i)\). Since all \(p(\theta_i)=1\), the marginal likelihood is simply the sum of these likelihoods.

ML<-sum(L)
ML
## [1] 0.9099927

Now we can compute the posterior probability of \(\theta\), namely the probability of Head. Since we have 11 candidate values for \(\theta\) and each leads to a posterior probability, the posterior probabilities together form a distribution; thus, \(p(\theta|D)\) is the posterior distribution. The maximum posterior probability occurs when \(\theta=0.7\), so we can report 0.7 as the most likely probability of Head (the maximum a posteriori, or MAP, estimate). Alternatively, we can report the mean of this distribution as the estimate for the probability of Head, which is 0.67.

postTheta<-L/ML
postTheta
##  [1] 0.000000e+00 9.613264e-06 8.642179e-04 9.892049e-03 4.666777e-02
##  [6] 1.287785e-01 2.362556e-01 2.932199e-01 2.212398e-01 6.307262e-02
## [11] 0.000000e+00
max<-which(postTheta==max(postTheta))
theta[max]
## [1] 0.7
sum(postTheta*theta)
## [1] 0.6669622

We can plot these posterior probabilities with the package {ggplot2} in R.

library(ggplot2)
dd<-data.frame(x=theta,y=postTheta)
ggplot(dd,aes(x,y))+
  geom_line()

In fact, there are infinitely many possible values for \(\theta\), so the above demonstration is greatly simplified. Normally, we would sample \(m\) values from the prior distribution for \(r\) rounds, compute the posterior distribution in each round, and then aggregate all these posterior distributions into the posterior distribution generated by our model. The mean or mode of this distribution is then the most likely value for the parameter. However, doing such a tedious thing by brute force is practically impossible, so there are other approaches to this problem. The first is to seek an analytic solution. The above example actually has one: the posterior distribution is a beta distribution whose two parameters are the number of Heads + 1 and the number of Tails + 1, denoted as \(Beta(N_H+1,N_T+1)\). The MAP estimate for \(\theta\) is the same as the maximum-likelihood estimate, \(\frac{N_H}{N_H+N_T}\), which is 0.7. The posterior mean is slightly different, namely \(\frac{N_H+1}{N_H+N_T+2}\); for the above example, the mean of this posterior distribution is \(\frac{8}{12}=0.67\).
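The agreement between the grid approximation above and the analytic \(Beta(8,4)\) posterior can be checked in R (a sketch reusing the same 11-point grid):

```r
theta <- seq(0, 1, 0.1)
# Grid posterior, as computed above (uniform prior cancels out)
post <- dbinom(7, 10, theta) / sum(dbinom(7, 10, theta))
# Analytic Beta(7 + 1, 3 + 1) posterior, evaluated and renormalized on the same grid
beta_post <- dbeta(theta, 8, 4)
beta_post <- beta_post / sum(beta_post)
all.equal(post, beta_post)  # TRUE: the two posteriors coincide on the grid
```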

Second, we can use the Markov chain Monte Carlo (MCMC) method to get an approximate solution for the parameter \(\theta\). MCMC is a class of algorithms for sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample from the desired distribution.

Markov Chain Monte Carlo

The below figures are examples of MCMC. For estimating the parameter a, three chains are used and each chain runs for 2000 iterations. Likewise, three chains are used for estimating the parameter b.
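To make the idea concrete, here is a minimal random-walk Metropolis sampler for the coin example above (a sketch, not production code; the proposal standard deviation of 0.1, the chain length, and the burn-in of 1000 iterations are all arbitrary choices):

```r
set.seed(123)
n_iter <- 5000
chain  <- numeric(n_iter)
theta  <- 0.5  # starting value of the chain
# Log posterior: binomial log-likelihood for 7 Heads in 10 tosses, with a flat prior
log_post <- function(t) {
  if (t <= 0 || t >= 1) return(-Inf)
  7 * log(t) + 3 * log(1 - t)
}
for (i in 1:n_iter) {
  prop <- theta + rnorm(1, 0, 0.1)  # random-walk proposal
  # Accept with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(prop) - log_post(theta)) theta <- prop
  chain[i] <- theta
}
mean(chain[-(1:1000)])  # posterior mean after burn-in, close to 8/12 = 0.67
```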

How to Use R to Do Bayesian Analysis?