
Given a parametric pdf $f(x;\lambda)$ and a set of data $\{ x_k \}_{k=1}^n$, here are two ways of formulating the problem of selecting an optimal parameter vector $\lambda^*$ for fitting the distribution to the data. The first is maximum likelihood estimation (MLE):

$$\lambda^* = \arg \max_\lambda \prod_{k=1}^n f(x_k;\lambda)$$

where this product is called the likelihood function.
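
To make the comparison concrete, here is a minimal numerical sketch of MLE, assuming a single normal family $f(x;\mu,\sigma)$ rather than the question's mixture; the function names, the simulated data, and the use of `scipy.optimize.minimize` are illustrative choices, not anything from the question.

```python
# Minimal MLE sketch for a single normal family f(x; mu, sigma).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(params, data):
    mu, log_sigma = params            # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)   # illustrative synthetic data

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)              # should land near 5.0 and 2.0
```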

The second is least squares CDF fitting:

$$\lambda^*=\arg \min_\lambda \| E(x)-F(x;\lambda) \|_{L^2(dx)}$$

where $F(x;\lambda)$ is the CDF corresponding to $f(x;\lambda)$ and $E(x)$ is the empirical CDF: $E(x)=\frac{1}{n} \sum_{k=1}^n 1_{x_k \leq x}$. (One could also consider more general $L^p$ CDF fitting, but let's not go there for now.)
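
A matching sketch of the least-squares CDF criterion, under the same illustrative normal family; approximating the $L^2(dx)$ norm by a Riemann sum on a finite grid is my own assumption for the sake of a runnable example.

```python
# Minimal sketch of least-squares CDF fitting for a single normal family.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def empirical_cdf(grid, data):
    # E(x) = (1/n) * sum of indicators {x_k <= x}, evaluated on the grid
    return np.searchsorted(np.sort(data), grid, side="right") / len(data)

def cdf_l2_objective(params, data, grid):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    resid = empirical_cdf(grid, data) - norm.cdf(grid, loc=mu, scale=sigma)
    dx = grid[1] - grid[0]
    return np.sum(resid**2) * dx      # Riemann approximation of the squared L^2(dx) norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)
grid = np.linspace(data.min() - 5, data.max() + 5, 2000)

result = minimize(cdf_l2_objective, x0=[0.0, 0.0], args=(data, grid))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)
```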

In the experiments I have done, these two methods give similar but noticeably different results. For example, in a bimodal normal mixture fit, one method gave one of the standard deviations as about $12.6$ while the other gave about $11.6$. This isn't a huge difference, but it is large enough to see easily in a graph; a sketch of this kind of comparison follows.
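
The sketch below fits a two-component normal mixture with both criteria so the resulting parameters can be compared side by side. The mixture used to generate the data, the parameterization, and the starting point are all made up for illustration; they are not the data or the settings from my experiments.

```python
# Compare MLE and least-squares CDF fitting on a two-component normal mixture.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def unpack(params):
    w = 1.0 / (1.0 + np.exp(-params[0]))            # mixture weight constrained to (0, 1)
    mu1, mu2 = params[1], params[2]
    s1, s2 = np.exp(params[3]), np.exp(params[4])   # positive standard deviations
    return w, mu1, mu2, s1, s2

def mixture_pdf(x, params):
    w, mu1, mu2, s1, s2 = unpack(params)
    return w * norm.pdf(x, mu1, s1) + (1 - w) * norm.pdf(x, mu2, s2)

def mixture_cdf(x, params):
    w, mu1, mu2, s1, s2 = unpack(params)
    return w * norm.cdf(x, mu1, s1) + (1 - w) * norm.cdf(x, mu2, s2)

def neg_log_lik(params, data):
    return -np.sum(np.log(mixture_pdf(data, params) + 1e-300))

def cdf_l2(params, data, grid):
    ecdf = np.searchsorted(np.sort(data), grid, side="right") / len(data)
    resid = ecdf - mixture_cdf(grid, params)
    return np.sum(resid**2) * (grid[1] - grid[0])

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 5, 600), rng.normal(40, 12, 400)])
grid = np.linspace(data.min() - 10, data.max() + 10, 3000)
x0 = [0.0, -5.0, 30.0, np.log(3.0), np.log(10.0)]   # rough starting guess

opts = {"maxiter": 20000, "xatol": 1e-6, "fatol": 1e-9}
mle = minimize(neg_log_lik, x0, args=(data,), method="Nelder-Mead", options=opts)
lsq = minimize(cdf_l2, x0, args=(data, grid), method="Nelder-Mead", options=opts)
print("MLE:       ", unpack(mle.x))
print("CDF L2 fit:", unpack(lsq.x))
```

In runs like this the two fits typically agree on the component means but differ somewhat in the standard deviations, which is the kind of discrepancy described above.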

What is the intuition for the difference in these two "goodness of fit" metrics? An example answer would be something along the lines of "MLE cares more about data points in the tail of the distribution than least squares CDF fit" (I make no claims on the validity of this statement). An answer discussing other metrics of fitting parametric distributions to data would also be of some use.

  • I have had this thought for a long time as well. What I can tell you is that ML maximizes the Fisher information, which guarantees some properties that I don't think the other method can. – Royi Aug 28 '17 at 13:59

1 Answer


In my eyes, the intuitive explanation is that ML estimates the conditional mode (the maximum of the distribution), while least squares estimates the conditional mean. In the case where the errors are perfectly Gaussian distributed, these estimates are equal.

[Figure: distribution of errors]