I recently came across the definition of a conjugate distribution: for a given likelihood $f(x|\theta)$, a prior $g(\theta)$ is called conjugate if the prior $g(\theta)$ and the posterior $g(\theta|x)$ belong to the same family of functions (or, loosely speaking, have the same form).
This made me wonder: how did statisticians deduce the form of the conjugate distributions?
My first thought was that they literally solved the functional equation
$$g(\theta|x)\propto f(x|\theta)g(\theta)$$
for $g$, given $f(x|\theta)$, but doing so does not seem straightforward to me.
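To make the equation concrete, here is a small worked example of my own (not taken from any reference) with a Bernoulli likelihood $f(x|\theta)=\theta^x(1-\theta)^{1-x}$. If we guess a prior of the form $g(\theta)\propto\theta^{a-1}(1-\theta)^{b-1}$, i.e. a Beta$(a,b)$ density, then
$$g(\theta|x)\propto \theta^x(1-\theta)^{1-x}\,\theta^{a-1}(1-\theta)^{b-1}=\theta^{a+x-1}(1-\theta)^{b+(1-x)-1},$$
which is again a Beta density, now with parameters $(a+x,\;b+1-x)$. So at least in this case the functional equation is "solved" simply by matching the algebraic form of the likelihood as a function of $\theta$.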
My second thought is that statisticians estimated the parameters of a distribution from samples and wanted to characterize the uncertainty of that estimate with another distribution: for example, they studied the sample covariance of a multivariate normal distribution and arrived at the Wishart distribution (whose inverse form, the inverse Wishart, turns out to be the conjugate prior for the covariance matrix). They then noticed that the distribution characterizing the estimate of a parameter has an interesting relationship with the original distribution, and so defined conjugacy. If so, why does such a relationship exist for so many distributions? Is it a coincidence (as it currently appears to me), or is there some underlying logic?
I want to understand the logic by which the theory of conjugacy was developed, rather than memorize a big table.
P.S. For convenience, we may restrict our discussion to exponential families.
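For what it's worth, in the exponential-family setting there is a standard construction (textbook material, which may already partly answer my own question): write the likelihood as
$$f(x|\theta)=h(x)\exp\{\eta(\theta)^\top T(x)-A(\theta)\}.$$
Then the family
$$g(\theta|\tau,\nu)\propto\exp\{\eta(\theta)^\top\tau-\nu A(\theta)\}$$
is conjugate by inspection: multiplying by the likelihood simply updates $\tau\to\tau+T(x)$ and $\nu\to\nu+1$, so the posterior stays in the same family. Within exponential families, then, conjugacy is not a coincidence; it follows from the likelihood depending on the data only through a sufficient statistic $T(x)$ that enters the exponent linearly.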
@Mittens: My reply is too long for a comment, so I add it here.
Your answer seems very formal and rigorous (as do both articles you referred to). Before I dive into it, I want to ask: was the theory developed all at once? I am actually more interested in whether there was a stage at which people simply discovered that the distribution governing the estimated parameter and the original distribution have a nice property under Bayes' law, and only then defined conjugacy (rather than starting by defining the prior on the parameter to have a conjugate form).
Take again the example of the multivariate normal distribution. Without knowing the theory of conjugate distributions, we could still derive the distribution of the sample covariance matrix, right? (At least that is my understanding here.)
> The Wishart distribution arises as the distribution of the sample covariance matrix for a sample from a multivariate normal distribution.
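That statement can be sanity-checked by simulation, without any conjugacy theory. The sketch below (my own; the dimensions, sample size, and seed are arbitrary choices) checks one consequence of it: for $x_1,\dots,x_n \sim N(0,\Sigma)$, the scatter matrix $S=\sum_i x_i x_i^\top$ is Wishart$(n,\Sigma)$, so its Monte Carlo mean should be close to $n\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)

# True covariance of a 2-d normal (values are an arbitrary choice for the demo)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
n = 20        # sample size per replication
reps = 20000  # Monte Carlo replications

# Draw `reps` independent samples of size n from N(0, Sigma);
# resulting array has shape (reps, n, 2)
x = rng.multivariate_normal(np.zeros(2), Sigma, size=(reps, n))

# Average scatter matrix S = sum_i x_i x_i^T over replications
S_mean = np.einsum('rni,rnj->ij', x, x) / reps

# If S ~ Wishart(n, Sigma), then E[S] = n * Sigma = [[40, 12], [12, 20]]
print(S_mean)
print(n * Sigma)
```

This only checks the mean, of course, not the full distribution, but it illustrates that the Wishart law of the sample covariance is an empirical fact one can stumble on independently of Bayesian considerations.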
So I wonder whether it is a coincidence that this distribution happens to give rise to a conjugate prior, that is to say, that when we apply Bayes' law, the posterior and the prior have the same form.
What I want to ask, exactly, is how you explain this coincidence. If it is not a coincidence, how would you prove (or just show intuitively, without worrying too much about rigor) that this property holds across the whole exponential family? I have no doubt about the rigor of the theory; I just wonder how it developed.
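One way to see the "coincidence" concretely, with no theory at all, is to compute a posterior by brute force on a grid and observe that it matches a member of the prior's family. Here is a sketch with a Beta prior and a binomial likelihood (my choice of example, just to illustrate the point numerically):

```python
import numpy as np
from scipy.stats import beta, binom

# Prior Beta(a, b) and observed data: k successes in n Bernoulli trials
a, b = 2.0, 3.0
n, k = 10, 7

# Brute-force Bayes on a grid: posterior ∝ likelihood × prior,
# normalized by a simple Riemann sum
theta = np.linspace(1e-6, 1 - 1e-6, 10001)
dtheta = theta[1] - theta[0]
unnorm = binom.pmf(k, n, theta) * beta.pdf(theta, a, b)
posterior = unnorm / (unnorm.sum() * dtheta)

# The conjugate update predicts the closed form Beta(a + k, b + n - k)
predicted = beta.pdf(theta, a + k, b + n - k)

# Maximum pointwise discrepancy should be small (grid error only)
print(np.max(np.abs(posterior - predicted)))
```

The grid computation knows nothing about conjugacy, yet the result lands exactly on a Beta density; the exponential-family argument is one way to explain why this keeps happening.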