
I came across the article “MSE is Cross Entropy at Heart: Maximum Likelihood Estimation Explained”, which states:

"When training a neural network, we are trying to find the parameters of a probability distribution which is as close as possible to the distribution of the training set."

This makes sense when the model is learning the unconditional distribution of the data, assuming that the true data-generating process is IID. In that case, we can write the average log likelihood as the expectation of the model's log-probability with respect to the empirical distribution of the data:

$$ \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(x_i) \quad \text{or equivalently} \quad \mathbb{E}_{\hat{p}_{\text{data}}}[\log p_{\theta}(x)] $$
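
(To make sure I am reading the notation correctly, here is a toy numerical check, using a hypothetical fixed-parameter Gaussian as $p_{\theta}$, that the sample average and the expectation under the empirical distribution are the same number.)

```python
import numpy as np
from collections import Counter
from scipy.stats import norm

# Toy sample (the repeated value gives the empirical distribution a non-uniform weight)
x = np.array([0.5, 1.0, 1.0, 2.0])

# Hypothetical fixed model parameters for p_theta(x) = Normal(mu, sigma)
mu, sigma = 1.0, 1.0

# Left form: average log likelihood over the samples
avg_ll = np.mean(norm.logpdf(x, loc=mu, scale=sigma))

# Right form: expectation of log p_theta(x) under the empirical distribution p_hat_data
counts = Counter(x.tolist())
expect_ll = sum((c / len(x)) * norm.logpdf(v, loc=mu, scale=sigma) for v, c in counts.items())

print(np.isclose(avg_ll, expect_ll))  # True: the two expressions are the same number
```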

For conditional models, we typically write a similar expression using conditional probabilities:

$$ \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(y_i \mid x_i) \quad \text{or equivalently} \quad \mathbb{E}_{\hat{p}_{\text{data}}}[\log p_{\theta}(y \mid x)] $$
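
(Concretely, I understand this to be the familiar cross-entropy training loss, up to sign, averaged over the observed pairs; a minimal sketch with a hypothetical fixed logistic model:)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy pairs (x_i, y_i) and fixed parameters for p_theta(y=1|x) = sigmoid(w*x + b)
x = np.array([-1.0, 0.3, 1.5, 2.0])
y = np.array([0, 0, 1, 1])
w, b = 1.2, -0.4

p1 = sigmoid(w * x + b)                               # p_theta(y = 1 | x_i)
log_lik = y * np.log(p1) + (1 - y) * np.log(1 - p1)   # log p_theta(y_i | x_i)

avg_cond_ll = log_lik.mean()        # (1/N) * sum_i log p_theta(y_i | x_i)
print(avg_cond_ll, -avg_cond_ll)    # the negative is the usual binary cross-entropy loss
```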

However, I have a couple of questions regarding this formulation:

  1. Conditional Independence and Cross-Entropy Equivalence:
    For conditional models, we often only assume conditional independence (see this discussion). Does this imply that, in the conditional case, the log likelihood is not always equivalent to the cross-entropy with the empirical data distribution unless the data-generating process is IID? Is my understanding correct?

  2. Log Likelihood and Conditional Empirical Distributions:
    In general, why is the log likelihood not calculated with respect to a conditional empirical data distribution for conditional models? In other words, why do we directly use the expectation:

    $$ \mathbb{E}_{\hat{p}_{\text{data}}(x,y)}[\log p_{\theta}(y \mid x)] $$

    rather than formulating it in terms of a conditional empirical distribution $\hat{p}_{\text{data}}(y \mid x)$?

Any insights or references that could help clarify these points would be much appreciated!

spie227

1 Answer


I'll try to explain based on my understanding:

1. Conditional Independence and Cross-Entropy: If the model assumes conditional independence (i.e., the outputs are independent of one another given the inputs), the joint conditional log likelihood factorizes as

$$ \log p_{\theta}(y_1, \dots, y_N \mid x_1, \dots, x_N) = \sum_{i=1}^{N} \log p_{\theta}(y_i \mid x_i), $$

which, divided by $N$, is (up to sign) the cross-entropy between the empirical distribution and the model. If the true data-generating process is not strictly IID or has dependencies across examples, this factorization no longer gives the exact log likelihood of the data, so the cross-entropy may not perfectly match the true log likelihood; in that sense your understanding is correct. In practice, however, we treat each $(x_i, y_i)$ as an independent sample and optimize cross-entropy under that assumption.

2. Using $\mathbb{E}_{\hat{p}_{\text{data}}(x,y)}[\log p_{\theta}(y \mid x)]$: We compute $\log p_{\theta}(y_i \mid x_i)$ for each observed pair rather than explicitly estimating a conditional distribution $\hat{p}_{\text{data}}(y \mid x)$. With continuous or high-dimensional inputs, most $x_i$ occur only once in the training set, so an empirical conditional would be degenerate (a point mass at the observed $y_i$). Taking the expectation with respect to the empirical joint distribution avoids that estimation problem and directly uses the available pairs for maximum likelihood estimation; see the sketch after this list.
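
To make the second point concrete, here is a minimal sketch (toy regression data and a hypothetical Gaussian-output model; all names and parameters are illustrative). Because every $x_i$ is distinct, an empirical conditional $\hat{p}_{\text{data}}(y \mid x_i)$ would put all its mass on the single observed $y_i$, whereas the expectation under the empirical joint is simply the average of the per-pair log likelihoods.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy regression data: x is continuous, so every x_i is (almost surely) unique
x = rng.normal(size=5)
y = 2.0 * x + rng.normal(scale=0.5, size=5)

# Hypothetical conditional model: y | x ~ Normal(w * x, sigma)
w, sigma = 1.8, 0.6

# Expectation under the empirical joint p_hat_data(x, y): each observed pair has weight 1/N,
# so it reduces to the plain average of the per-pair log likelihoods.
avg_cond_ll = np.mean(norm.logpdf(y, loc=w * x, scale=sigma))
print(avg_cond_ll)

# An empirical conditional p_hat_data(y | x_i) built from this data alone would be a point
# mass at the single observed y_i, since no x value repeats; that is why the per-pair form
# is what we actually compute.
```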

Marzi Heidari