
I came to know the definition of a conjugate distribution: for a particular likelihood $f(x|\theta)$, a conjugate prior $g(\theta)$ is defined by (or satisfies) the property that the prior $g(\theta)$ and the posterior $g(\theta|x)$ belong to the same family of functions (or, loosely speaking, have the same form).

I then wondered: how did statisticians deduce the form of the conjugate distributions?

My first thought was that they literally solved a functional equation

$$g(\theta|x)\propto f(x|\theta)g(\theta)$$

given $f(x|\theta)$, but I do not find it obvious how to solve such an equation in general.
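
To make concrete what I mean by solving it (a standard example, the Bernoulli case): with $f(x|\theta)=\theta^x(1-\theta)^{1-x}$ for $x\in\{0,1\}$, the ansatz $g(\theta)\propto\theta^{\alpha-1}(1-\theta)^{\beta-1}$ works, since $$g(\theta|x)\propto \theta^x(1-\theta)^{1-x}\cdot\theta^{\alpha-1}(1-\theta)^{\beta-1}=\theta^{(\alpha+x)-1}(1-\theta)^{(\beta+1-x)-1},$$ which is again of the same (Beta) form. But it is unclear to me how to find such an ansatz systematically for a general $f(x|\theta)$.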

A second thought is that they estimated the parameters of a distribution from samples and wanted to characterize the uncertainty of that estimation by another distribution; e.g., they studied the sample covariance of a (multivariate) normal distribution and arrived at the inverse Wishart distribution for that characterization. They then noticed that this characterization of the sample estimate of a parameter has an interesting relationship with the original distribution, and defined conjugacy accordingly. If so, why does such a relationship exist for so many distributions? Is it a coincidence (to me it currently looks like one), or is there some underlying logic?

I want to understand the logic by which the theory of conjugacy developed, rather than memorize a big table.

P.S. For convenience, we may restrict our discussion to exponential families.


@Mittens: It's too long to reply in a comment, so I add it here.

Your answer seems a very formal and rigorous one (as do both articles you referred to). Before I dive into it, I want to ask: was the theory developed all at once? Actually, I'm more interested in whether there was a stage at which people simply discovered that the distribution governing the estimated parameter and the original distribution have a nice property under Bayes' rule, and only then defined conjugacy (instead of defining the prior distribution of the parameter to be of conjugate form from the start).

Take again the example of the multivariate normal distribution. Without knowing the theory of conjugate distributions, we could still work out the distribution of the sample covariance matrix, right? (At least that is my understanding.)

The Wishart distribution arises as the distribution of the sample covariance matrix for a sample from a multivariate normal distribution

So I wonder whether it is a coincidence that this distribution happens to be a conjugate prior, that is to say, that when Bayes' law is applied, the posterior has the same form as the prior.

What I want to ask, exactly, is how you explain this coincidence. If it is not a coincidence, how could one prove (or just show intuitively, without worrying too much about rigor) that this property holds over the whole exponential family? I have no doubt about the rigor of the theory; I just wonder how it developed.

  • Other good references are the papers "Diaconis, P. and Ylvisaker, D., Conjugate priors for exponential families, Annals of Statistics, 1979, Vol. 7, pp. 269–281" and "Gutiérrez-Peña, E. and Smith, A.F.M., Exponential and Bayesian conjugate families: Review and extensions (with discussion), Test, 1997, Vol. 6, pp. 1–90". – Mittens Jan 03 '24 at 19:38

1 Answer


Conjugate priors are defined mostly for exponential families of distributions. This is due to the fact that for such a family the parameters that define it form a convex set, and the maximum of the log-likelihood takes the form of a Fenchel–Legendre transform. The theory relies heavily on convex analysis. The monograph "Brown, L.D., Fundamentals of Statistical Exponential Families, IMS Lecture Notes, Monograph Series 9, 1986" has a detailed account of this. Working knowledge of the basics of convex analysis is needed to really appreciate the technicalities.

For exponential families, finding conjugate priors works as follows.

  • Suppose $\mu$ is a Borel measure on $(\mathbb{R}^d,\mathscr{B}(\mathbb{R}^d))$. Consider a family of distributions $\{P_\theta:\theta\in\Theta\}$ of the form \begin{align} P_\theta(dx)=e^{\theta\cdot x-\Lambda(\theta)}\,\mu(dx)\tag{1}\label{one} \end{align} where $e^{-\Lambda(\theta)}$ is a normalizing factor, and $$\Theta=\{\theta\in\mathbb{R}^d:\int_{\mathbb{R}^d}e^{\theta\cdot x}\mu(dx)<\infty\}$$ Using Hölder's inequality, it is easy to check that $\Theta$ is a convex subset of $\mathbb{R}^d$.

  • The conjugate family of prior distributions is defined as the family of probability measures $\{\Pi_{a,\tau}:(a,\tau)\in E\}$ on $(\Theta,\mathscr{B}(\Theta))$ (here $\mathscr{B}(\Theta)$ is the Borel $\sigma$-algebra of subsets of $\Theta$), given by \begin{align} \Pi_{a, \tau}(d\theta)=D(a, \tau)e^{\tau\cdot \theta- a\Lambda(\theta)}\lambda(d\theta)\tag{2}\label{two} \end{align} where $\lambda$ is the Lebesgue measure restricted to $\mathscr{B}(\Theta)$, $D(a,\tau)$ is a normalizing factor, and $$E=\{(a,\tau)\in\mathbb{R}\times\mathbb{R}^d: \int_{\Theta}e^{\tau\cdot \theta- a\Lambda(\theta)}\,\lambda(d\theta)<\infty\}$$ Again, using Hölder's inequality, one checks that $E$ is a convex subset of $\mathbb{R}\times\mathbb{R}^d$.

  • Under the Bayesian paradigm, if the law of $X$ given $\boldsymbol{\theta}$ is given by \eqref{one}, then under a prior of the form \eqref{two} the posterior distribution of $\boldsymbol{\theta}$ given $X$ has density (w.r.t. Lebesgue measure $\lambda$) given by \begin{align} \frac{d\mathbb{P}[\boldsymbol{\theta}\in d\theta|X]}{d\lambda}\propto e^{\theta\cdot(x+\tau) -(1+a)\Lambda(\theta)} \end{align} That is, the posterior distribution of $\boldsymbol{\theta}$ given $X=x$ is $\Pi_{1+a,x+\tau}$, which is again of the form \eqref{two}. This justifies the use of the term conjugate prior: the posterior distribution belongs to the same family of distributions as the prior. (A concrete illustration follows this list.)
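
To see the recipe in action, here is a minimal worked example (the standard Bernoulli case, not spelled out in the references above). Take $d=1$ and let $\mu$ be the counting measure on $\{0,1\}$. Then \eqref{one} becomes the Bernoulli family in its natural parametrization, $$P_\theta(\{x\})=e^{\theta x-\Lambda(\theta)},\qquad \Lambda(\theta)=\log(1+e^{\theta}),\qquad \Theta=\mathbb{R},$$ with success probability $p=e^{\theta}/(1+e^{\theta})$. The prior \eqref{two} is $$\Pi_{a,\tau}(d\theta)\propto e^{\tau\theta}(1+e^{\theta})^{-a}\,d\theta,$$ which, after the change of variables $p=e^{\theta}/(1+e^{\theta})$ (so $d\theta=dp/(p(1-p))$), is exactly the $\operatorname{Beta}(\tau,\,a-\tau)$ distribution on $p$; in particular one needs $0<\tau<a$. Observing $x\in\{0,1\}$ updates $\Pi_{a,\tau}$ to $\Pi_{1+a,\,x+\tau}$, i.e. $\operatorname{Beta}(\tau+x,\,a-\tau+1-x)$, which is the familiar Beta–Bernoulli update.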

There are some technical considerations, which I state now as a theorem.

Theorem: Consider the exponential family $\{P_\theta:\theta\in\Theta\}$ as in \eqref{one}. If

  • $\Theta$ has nonempty interior,
  • for all $(r,v)\in(\mathbb{R}\times\mathbb{R}^d)\setminus\{\mathbf{0}\}$, $\mu(\{x\in\mathbb{R}^d: v\cdot x\neq r\})>0$ (that is, the coordinates of the r.v. $X(x)=x$ together with the constant $1$ are a.s. linearly independent),
  • and $C=\overline{\operatorname{co}}\big(\operatorname{supp}(\mu)\big)$ (the closure of the convex hull of the support of $\mu$) has nonempty interior,
    then for $a>0$, $$E_a:=\{\tau\in\mathbb{R}^d: (a,\tau)\in E\}=\{\tau\in\mathbb{R}^d: a^{-1}\tau\in\operatorname{int}(C)\}$$
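
As a sanity check, in the Bernoulli illustration above $\operatorname{supp}(\mu)=\{0,1\}$, so $C=[0,1]$ and $\operatorname{int}(C)=(0,1)$; the theorem then predicts $E_a=\{\tau: \tau/a\in(0,1)\}=(0,a)$ for $a>0$. This agrees with the direct computation: the integrand $e^{\tau\theta}(1+e^{\theta})^{-a}$ behaves like $e^{(\tau-a)\theta}$ as $\theta\to+\infty$ and like $e^{\tau\theta}$ as $\theta\to-\infty$, so $\int_{\mathbb{R}}e^{\tau\theta}(1+e^{\theta})^{-a}\,d\theta<\infty$ iff $0<\tau<a$, i.e. iff both Beta parameters $\tau$ and $a-\tau$ are positive.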

The Wishart distribution can be obtained along the lines discussed above; however, since it involves distributions on the space of positive semidefinite matrices, harmonic analysis provides a derivation of this distribution. A detailed analysis appears in Eaton, M.L., Multivariate Statistics: A Vector Space Approach, John Wiley and Sons, 1983, Chapter 8.

Definition: A probability measure $\nu$ on the space of $p\times p$ real symmetric matrices is a Wishart distribution with parameters $\Sigma$ (a $p\times p$ positive semidefinite matrix), $p$, and $n$ if there exist i.i.d. vectors $X_1,\ldots,X_n$ in $\mathbb{R}^p$ with $X_j\sim N(\mathbf{0},\Sigma)$ such that $$\nu=\operatorname{law}\big(\sum^n_{j=1}X_jX^\intercal_j\big)$$

Denoting by $W(\Sigma,p,n)$ the Wishart distribution on $\mathcal{S}_p$ (the linear space of $p\times p$ real symmetric matrices), we have:

Theorem: $W(\Sigma,p,n)\ll\lambda_{p(p+1)/2}$ iff $n\geq p$ and $\Sigma$ is positive definite. Furthermore, the density of $W(\Sigma,p,n)$ w.r.t. the measure $\lambda_{p(p+1)/2}(dS)/\big(\operatorname{det}(S)\big)^{(p+1)/2}$ is given by $$p(S|\Sigma)=c(n,p)\big(\operatorname{det}(\Sigma^{-1}S)\big)^{n/2} \exp\big(-\tfrac12\operatorname{tr}(\Sigma^{-1}S)\big) \mathbb{1}_{\mathcal{S}^+_p}(S)$$ where $\mathcal{S}^+_p$ is the set of positive definite symmetric $p\times p$ matrices, and $$c(n,p)^{-1}=\int_{\mathcal{S}^+_p}\big(\operatorname{det}(S)\big)^{(n-p-1)/2}\exp\big(-\tfrac12\operatorname{tr}(S)\big)\,\lambda_{p(p+1)/2}(dS)$$
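
As a quick sanity check of the formula, consider the scalar case $p=1$ with $\Sigma=\sigma^2$: then $W(\sigma^2,1,n)$ is the law of $\sum_{j=1}^n X_j^2$ with $X_j\sim N(0,\sigma^2)$ i.i.d., i.e. of $\sigma^2\chi^2_n$. The theorem gives $c(n,1)^{-1}=\int_0^\infty s^{n/2-1}e^{-s/2}\,ds=2^{n/2}\Gamma(n/2)$, and the density w.r.t. Lebesgue measure becomes $$\frac{p(s|\sigma^2)}{s}=\frac{s^{n/2-1}e^{-s/(2\sigma^2)}}{(2\sigma^2)^{n/2}\Gamma(n/2)},\qquad s>0,$$ which is indeed the $\operatorname{Gamma}\big(\tfrac n2,\,2\sigma^2\big)$, i.e. $\sigma^2\chi^2_n$, density.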

Notice that $p(S|\Sigma)$ defines an exponential family of the form \eqref{one}: the sufficient statistic is $S$ and the natural parameter is $-\frac12\Sigma^{-1}$ (with the trace inner product on $\mathcal{S}_p$).
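
Consequently, the general recipe \eqref{two} applies here as well. A sketch (glossing over the exact choice of dominating measure on the parameter space): with natural parameter $\eta=-\frac12\Sigma^{-1}$, one has $\Lambda(\eta)=-\frac n2\log\operatorname{det}(-2\eta)$ up to an additive constant, and the conjugate prior takes the form $$\Pi_{a,T}(d\Sigma)\propto \big(\operatorname{det}(\Sigma^{-1})\big)^{an/2}\exp\big(-\tfrac12\operatorname{tr}(\Sigma^{-1}T)\big)\,(\text{reference measure}),$$ which is an inverse-Wishart kernel in $\Sigma$ (equivalently, a Wishart kernel in $\Sigma^{-1}$). In this sense it is not a coincidence that the (inverse-)Wishart family is conjugate for the covariance of a multivariate normal: it is exactly what the construction \eqref{two} produces for this exponential family.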

Mittens
  • I've updated my reply in the OP because it's too long to fit here. Thank you for your attention. – sicheng mao Jan 04 '24 at 18:18
  • @sichengmao: The development of conjugacy was not spontaneous. As you can see, the computation of $\pi(\theta|x)$ and $m(x)=\int_\Theta p(x|\theta)\pi(d\theta)$ is not simple (it often requires numerical methods: quadrature or MCMC). That was especially hard before the advent of modern computers. A large part of the Bayesian literature was (less now, but still) devoted to finding prior distributions for which $\pi(\theta|x)$ can be easily obtained. Conjugate priors were introduced in Raiffa, H. and Schlaifer, R., Applied Statistical Decision Theory, Grad. School of Business, Harvard, 1961. – Mittens Jan 04 '24 at 18:41
  • @sichengmao: There is some debate about whether their simplicity (not only for exponential families but also for some families of mixture distributions) comes at the price of only roughly approximating "most reasonable" priors. For initial analyses, of course, they are very useful. Now, the mechanics of how one obtains them are simple, just as in my posting. Finding the domain of the hyperparameters is more delicate and, at least for exponential families, requires some convex analysis. – Mittens Jan 04 '24 at 18:46
  • I think I had a false understanding before: that the Wishart distribution arises directly as the distribution of the sample covariance matrix, without any theory of conjugacy. My question finally reduces to whether that understanding is true or false. So nothing here is a coincidence; everything is established deliberately by the theory of conjugacy. Applied Statistical Decision Theory is quite insightful. Thank you for such a comprehensive answer. – sicheng mao Jan 08 '24 at 15:57