Modeling with maximum likelihood as the index of goodness-of-fit

The index of how well a model fits a set of data is called goodness of fit, and it commonly takes one of two forms. The first is a measure of the distance between the observed data points and the corresponding predictions made by a model. For example, suppose we have a data set of 10 numbers, \([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\), and a vector of corresponding model predictions, \([0.8, 1.5, 3.2, 3.8, 5.4, 6.7, 6.8, 7.5, 9, 9.8]\). The total distance between these value pairs tells us how well the model predicts the observed data. If we compute the absolute distance between each data point and its model prediction, the total distance is the sum of all the absolute distances, which is 3.1. However, a sum might not be a good index; normally we prefer a mean, which in this case is \(\frac{3.1}{10}=0.31\).

obs<-1:10                                                   # observed data
preds<-c(0.8, 1.5, 3.2, 3.8, 5.4, 6.7, 6.8, 7.5, 9, 9.8)   # model predictions
dist<-abs(obs-preds)                                        # absolute distance for each pair
sum(dist)
## [1] 3.1
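
As noted above, averaging instead of summing gives 0.31; with the vector dist already computed, this is simply:

mean(dist)
## [1] 0.31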

A more commonly used index of the discrepancy between model predictions and observed data is the RMSD (root mean squared deviation), computed as \(\sqrt{\frac{\sum (y-\hat{y})^2}{N}}\).

sqrt(mean((obs-preds)^2))   # RMSD
## [1] 0.3674235

However, unlike the correlation coefficient, which is bounded between \(-1\) and \(1\), the discrepancy index is not a standardized measure. It is hard to tell whether a given discrepancy value is small enough for us to claim a good fit, and there is no suitable statistical test for judging how good a discrepancy index is. Nonetheless, the discrepancy index can serve as a goodness of fit when multiple models are compared on how well they fit the same data, as sketched below.
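
For instance, if a hypothetical second model produced its own predictions for the same data (the vector preds2 below is made up purely for illustration), the model with the smaller RMSD would be judged to fit better:

preds2<-c(1.2, 2.4, 2.9, 4.5, 5.1, 5.8, 7.3, 8.2, 9.4, 10.3)   # hypothetical predictions from a second model
rmsd1<-sqrt(mean((obs-preds)^2))    # RMSD of the first model
rmsd2<-sqrt(mean((obs-preds2)^2))   # RMSD of the second model
c(rmsd1, rmsd2)                     # the smaller value indicates the better-fitting model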

The other often-used goodness of fit is the maximum likelihood, a measure of how likely it is that a model predicts the observed data. Suppose we have one observed value, which is 1. If a model is used to generate this value repeatedly, its predicted values should presumably cluster around 1, forming a normal distribution \(\hat{y}\sim N(1,\sigma_\epsilon)\). Now suppose a predicted value is 0.1. The likelihood of this prediction can be computed from the normal density with mean = 1 and SD = \(\sigma_\epsilon\). With \(\sigma_\epsilon=0.5\), the likelihood is about 0.16. Suppose we change the parameter values of the model and the predicted value becomes 0.9 accordingly. The likelihood now increases to about 0.78, nearly five times as large as for the previous predicted value. That is, the parameter values at the second iteration make the model better able to predict the observed data, so we should keep the parameter values used at the second iteration as the best parameter values.

l1<-dnorm(0.1,1,0.5)   # likelihood of the prediction 0.1 under N(1, 0.5)
l1
## [1] 0.1579003
l2<-dnorm(0.9,1,0.5)   # likelihood of the prediction 0.9 under N(1, 0.5)
l2
## [1] 0.7820854
library(ggplot2)
library(dplyr)
data.frame(x=seq(-1,3,0.05),y=dnorm(seq(-1,3,0.05),1,0.5)) %>% 
  ggplot(aes(x,y))+geom_line()+
  geom_segment(aes(x=x1,y=y1,xend=xend,yend=yend),
               data=data.frame(x1=0.1,xend=0.1,y1=0,yend=l1),color="red",lty=2)+
  geom_segment(aes(x=x1,y=y1,xend=xend,yend=yend),
               data=data.frame(x1=0.9,xend=0.9,y1=0,yend=l2),color="red",lty=2)+
  xlab("Predicted Values")+ylab("Likelihood")

The likelihood of a model's prediction for data point \(i\) is \(l_i\). The total likelihood of the model over all \(n\) data points is then \(l_1 \times l_2 \times l_3 \times \cdots \times l_n=\prod_{i=1}^{n}l_i\). However, multiplying a long series of numbers that are mostly smaller than 1 quickly underflows to zero on a computer. Thus, we normally use the sum of the logarithms of all the likelihoods as the goodness of fit, \(\mathrm{LogL}=\sum_{i=1}^{n} \log(l_i)\). As this sum is usually negative, we turn it into \(-\mathrm{LogL}\) in order to get a positive number. The famous goodness of fit AIC is actually \(-2\,\mathrm{LogL}+2N\), where \(N\) is the number of freely estimated parameters. The smaller the AIC, the better the model performs.
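
To make this concrete, the summed log-likelihood and the AIC for the obs/preds example above can be computed as follows; the residual SD of 0.5 and the assumption of two freely estimated parameters are choices made only for this sketch:

sigma_e<-0.5                              # assumed residual SD, as in the earlier illustration
l<-dnorm(preds, mean=obs, sd=sigma_e)     # likelihood of each predicted value
LogL<-sum(log(l))                         # summed log-likelihood
-LogL                                     # the positive quantity -LogL
N<-2                                      # assumed number of freely estimated parameters
-2*LogL+2*N                               # AIC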

Gibbs sampling

The Bayesian framework is very suitable for implementing parameter estimation. Recall \(p(\theta|D)=\frac{P(D|\theta)p(\theta)}{\int P(D|\theta)p(\theta)\,d\theta}\), where \(p(\theta|D)\) is the posterior distribution of \(\theta\) given the observed data \(D\). Naturally, the higher the posterior density at a value, the more likely that value is to be the best estimate of \(\theta\). The question now is how to obtain this posterior distribution.
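
For a single parameter, this relationship can be illustrated by brute force: with one observed value of 1, a model \(N(\theta, 0.5)\), and a vague normal prior on \(\theta\) (all of these choices are assumptions made only for this sketch), the right-hand side can be evaluated over a grid of candidate \(\theta\) values.

theta_grid<-seq(-2, 4, 0.01)                       # candidate values of theta
prior<-dnorm(theta_grid, 0, 10)                    # an assumed vague normal prior
likelihood<-dnorm(1, mean=theta_grid, sd=0.5)      # P(D | theta) for the single observation 1
unnorm_post<-likelihood*prior                      # numerator of Bayes' rule
posterior<-unnorm_post/sum(unnorm_post*0.01)       # approximate the integral by a grid sum
theta_grid[which.max(posterior)]                   # theta with the highest posterior density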

The Gibbs sampling algorithm is often used to obtain the posterior distribution of a parameter, and its steps can be laid out in the code below.
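
The sketch below is one minimal R implementation of such a sampler, not a definitive one: it assumes the obs data from the earlier example, a simple two-parameter normal model with flat priors inside LogL( ), and arbitrary choices for the starting values, the proposal SD, and the number of iterations. LogL( ) returns the logarithm of \(P(D|\theta)p(\theta)\), and the matrix out records the sampled parameter values.

# LogL( ) returns log(P(D|theta) * p(theta)); flat priors are assumed, so only the
# log-likelihood of the data under N(theta[1], theta[2]) is computed here.
LogL<-function(theta){
  if(theta[2]<=0) return(-Inf)                     # the SD must be positive
  sum(dnorm(obs, mean=theta[1], sd=theta[2], log=TRUE))
}

n_iter<-5000                          # number of iterations (arbitrary)
n_par<-2                              # number of parameters
out<-matrix(NA, n_iter, n_par)        # one column per parameter
current<-c(0, 1)                      # starting values (arbitrary)

for(i in 1:n_iter){
  for(j in 1:n_par){
    proposed<-current
    proposed[j]<-rnorm(1, current[j], 0.2)         # propose a new value for one parameter
    ratio<-exp(LogL(proposed)-LogL(current))       # ratio of the two product likelihoods
    if(ratio>runif(1)) current<-proposed           # accept if the ratio beats a uniform draw
  }
  out[i,]<-current                                 # record the current set of parameters
}

Updating one parameter at a time in the inner loop is what gives the sampler its Gibbs-like structure.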

Because \(p(\theta|D)\propto P(D|\theta)p(\theta)\), the best estimate of \(\theta\) is the value that makes the product of the likelihood and the prior on the right-hand side the largest. In the above code, the function LogL( ) is therefore used to compute \(P(D|\theta)p(\theta)\) (in practice, its logarithm). We first compute this product likelihood for the current set of parameters. We then propose a new value for one parameter, so that the proposed set of parameters differs from the current set in exactly one value, and compute the product likelihood for the proposed set. Finally, we compare the two product likelihoods as a ratio. If this ratio is larger than 1, the proposed set of parameters replaces the current set. If the ratio is smaller than 1, the proposed set still replaces the current set when the ratio is larger than a random number drawn from a uniform distribution between 0 and 1; otherwise the current set is kept. (This accept/reject rule is known as the Metropolis rule.)

The Gibbs sampling algorithm repeats these steps over all iterations. The values recorded for a parameter across all iterations form the posterior distribution of that parameter, that is, the distribution of the values in one column of the matrix out in the above code. The number of columns corresponds to the number of parameters, and each parameter has its own posterior distribution. Thus, we can get the best-fit estimate of a parameter simply by searching its posterior distribution for the value at which the density is highest.
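
Continuing the sketch above, the posterior samples in one column of out can be summarized to obtain such an estimate; discarding the first 1000 iterations as burn-in is an arbitrary choice for illustration:

samples<-out[-(1:1000), 1]        # posterior samples of the first parameter, burn-in discarded
dens<-density(samples)            # kernel density estimate of the posterior
dens$x[which.max(dens$y)]         # the value with the highest posterior density (the mode)
mean(samples)                     # the posterior mean, another common point estimate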