Given a parametric pdf $f(x;\lambda)$ and a set of data $\{ x_k \}_{k=1}^n$, here are two ways of formulating the problem of selecting an optimal parameter vector $\lambda^*$ to fit to the data. The first is maximum likelihood estimation (MLE):
$$\lambda^* = \arg \max_\lambda \prod_{k=1}^n f(x_k;\lambda)$$
where this product is called the likelihood function.
The second is least squares CDF fitting:
$$\lambda^*=\arg \min_\lambda \| E(x)-F(x;\lambda) \|_{L^2(dx)}$$
where $F(x;\lambda)$ is the CDF corresponding to $f(x;\lambda)$ and $E(x)$ is the empirical CDF: $E(x)=\frac{1}{n} \sum_{k=1}^n 1_{x_k \leq x}$. (One could also consider more general $L^p$ CDF fitting, but let's not go there for now.)
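For concreteness, one way to evaluate the second objective numerically is to approximate the $L^2(dx)$ norm by a Riemann sum on a grid. A minimal sketch (assuming NumPy/SciPy, and a single-normal family purely for illustration):

```python
import numpy as np
from scipy import stats

def empirical_cdf(x, t):
    """E(t) = (1/n) * #{k : x_k <= t}, evaluated at each point in t."""
    return np.searchsorted(np.sort(x), t, side="right") / len(x)

def cdf_l2_objective(lam, x, grid):
    """Squared L^2(dx) distance between E and F(.; lam), approximated by a
    Riemann sum on `grid` (same minimizer as the norm itself).
    Here F is a single normal CDF with lam = (mu, sigma), purely for illustration."""
    mu, sigma = lam
    if sigma <= 0:
        return np.inf
    dx = grid[1] - grid[0]
    return np.sum((empirical_cdf(x, grid) - stats.norm.cdf(grid, mu, sigma)) ** 2) * dx
```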
In the experiments I have done, these two methods give similar but still significantly different results. For example, in a bimodal normal mixture fit, one method gave one of the standard deviations as about $12.6$ while the other gave it as about $11.6$. This isn't a huge difference, but it is large enough to see easily in a graph.
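Here is a self-contained sketch of the kind of comparison I mean, on synthetic data from a two-component normal mixture (not my actual data); the SciPy/Nelder-Mead setup and starting values are incidental, not part of the question.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
# Synthetic stand-in for the data: a two-component normal mixture.
x = np.concatenate([rng.normal(0.0, 5.0, 600), rng.normal(40.0, 12.0, 400)])

def unpack(lam):
    w, m1, s1, m2, s2 = lam
    return w, m1, s1, m2, s2

def valid(lam):
    w, _, s1, _, s2 = unpack(lam)
    return 0.0 < w < 1.0 and s1 > 0.0 and s2 > 0.0

def mixture_pdf(t, lam):
    w, m1, s1, m2, s2 = unpack(lam)
    return w * stats.norm.pdf(t, m1, s1) + (1.0 - w) * stats.norm.pdf(t, m2, s2)

def mixture_cdf(t, lam):
    w, m1, s1, m2, s2 = unpack(lam)
    return w * stats.norm.cdf(t, m1, s1) + (1.0 - w) * stats.norm.cdf(t, m2, s2)

def neg_log_likelihood(lam, x):
    """MLE objective: -log of the likelihood product."""
    if not valid(lam):
        return np.inf
    return -np.sum(np.log(np.maximum(mixture_pdf(x, lam), 1e-300)))  # floor avoids log(0)

def cdf_l2_objective(lam, x, grid):
    """Squared L^2(dx) distance between E and F(.; lam), Riemann sum on `grid`."""
    if not valid(lam):
        return np.inf
    ecdf = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    dx = grid[1] - grid[0]
    return np.sum((ecdf - mixture_cdf(grid, lam)) ** 2) * dx

grid = np.linspace(x.min() - 20.0, x.max() + 20.0, 4000)
start = [0.5, 0.0, 5.0, 40.0, 10.0]  # (w, m1, s1, m2, s2), chosen near the truth
opts = {"maxiter": 20000, "maxfev": 20000, "xatol": 1e-6, "fatol": 1e-9}

mle = optimize.minimize(neg_log_likelihood, start, args=(x,), method="Nelder-Mead", options=opts)
ls = optimize.minimize(cdf_l2_objective, start, args=(x, grid), method="Nelder-Mead", options=opts)

print("MLE    (w, m1, s1, m2, s2):", np.round(mle.x, 3))
print("CDF-L2 (w, m1, s1, m2, s2):", np.round(ls.x, 3))
```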
What is the intuition for the difference between these two "goodness of fit" metrics? An example answer would be something along the lines of "MLE cares more about data points in the tail of the distribution than a least squares CDF fit does" (I make no claim about the validity of this statement). An answer discussing other metrics for fitting parametric distributions to data would also be of some use.
