
Suppose that $X$ is a finite set with a probability measure $P$. I want to find the subset $A \subset X$ so that the information gain of conditioning on $\{A, A^c\}$ is maximal. That is, I want to find $A$ that maximizes

$$H(X) -H(X|\{A,A^c\}) = H(X) - (P(A)H(A) + P(A^c) H(A^c)),$$

where $H(A)$ refers to the entropy of the conditional probability distribution $\mu(B) = P(B \cap A) / P(A)$. (If $P(A) = 0$, then set $P(A) H(A) = 0$; in any case, such an $A$ won't be a maximizer.)

Since this splits $X$ into two sets, I am calling this a dichotomy, and the question is of finding the dichotomy that produces the largest information gain.
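For concreteness, here is a minimal Python sketch of the quantity I want to maximize (the helper names `entropy` and `info_gain` are mine, purely for illustration; entropies are in bits):

```python
import math

def entropy(weights):
    """Shannon entropy (in bits) of a weight vector, normalized internally."""
    total = sum(weights)
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0)

def info_gain(p, A):
    """Information gain H(X) - H(X | {A, A^c}) of the dichotomy {A, A^c}.

    p : dict mapping each element of X to its probability
    A : set of elements of X
    """
    pA = sum(p[x] for x in A)
    gain = entropy(p.values())
    if pA > 0:   # P(A) H(A), with the P(A) H(A) = 0 convention when P(A) = 0
        gain -= pA * entropy([p[x] for x in A])
    if pA < 1:   # P(A^c) H(A^c)
        gain -= (1 - pA) * entropy([p[x] for x in p if x not in A])
    return gain

p = {"a": 0.5, "b": 0.25, "c": 0.25}
print(info_gain(p, {"a"}))  # 1.0 bit, since P(A) = 1/2
```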

Question: What is the complexity of this problem? (As D.W. points out below, a reasonable corresponding decision problem is: for $t \geq 0$, is there an $A$ such that the information gain is $\geq t$? This decision problem is in NP, and we can ask whether it is NP-hard, etc.) Is there a good heuristic algorithm for making this choice? What if I ask for a probably approximately correct algorithm?

I am asking this question since I'm studying decision trees in machine learning and also coding theory, and this seems like a basic question in both settings.

Elle Najt

1 Answer


The information gain in this setting depends only on the mass of $A$, and is maximized when $P(A)=\frac{1}{2}$. This probably shows why this definition of information gain is not very interesting.

Suppose $X=\left\{x_1,...,x_n\right\}$ and $P=\left(p_1,...,p_n\right)$.

The information gain is defined as

$$IG(A)=H(X)-\left(P(A)H(A)+\left(1-P(A)\right)H\left(X\setminus A\right)\right),$$

where $H(A)$ is the entropy of the random variable which takes values in $A$ with probabilities $q_x = \mathbb{1}_{x\in A}\frac{p_x}{P(A)}$.

Now let's write out $IG(A)$ explicitly:

$$\begin{align*} IG(A)&= H(X)+P(A)\sum\limits_{x\in A} \frac{p_x}{P(A)}\log \frac{p_x}{P(A)}+ \left(1-P(A)\right)\sum\limits_{x\in X\setminus A} \frac{p_x}{1-P(A)}\log\frac{p_x}{1-P(A)} \\ &= H(X)+\sum\limits_{x\in A} p_x\log \frac{p_x}{P(A)}+ \sum\limits_{x\in X\setminus A} p_x\log\frac{p_x}{1-P(A)} \\ &= H(X)+\sum\limits_{x\in X} p_x\log p_x -\sum\limits_{x\in A} p_x \log P(A) - \sum\limits_{x\in X\setminus A}p_x\log \left(1-P(A)\right) \\ &= -P(A)\log P(A) - \left(1- P(A)\right)\log(1-P(A)), \end{align*}$$

where the last step uses $H(X)=-\sum_{x\in X} p_x\log p_x$ (so the first two terms cancel) and $\sum_{x\in A}p_x = P(A)$.

So setting $y=P(A)$, you seek to maximize the binary entropy $f(y)=-y\log y-(1-y)\log(1-y)$ over $y\in[0,1]$, and the maximum is achieved at $y=\frac{1}{2}$.
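A quick numerical check of this identity, brute-forcing every dichotomy of a small example (a sketch; the distribution is arbitrary, chosen with exact binary fractions, and the exhaustive search is of course exponential):

```python
import math
from itertools import chain, combinations

def binary_entropy(y):
    """f(y) = -y log y - (1 - y) log(1 - y), in bits, with f(0) = f(1) = 0."""
    if y <= 0.0 or y >= 1.0:
        return 0.0
    return -y * math.log2(y) - (1 - y) * math.log2(1 - y)

p = [0.5, 0.25, 0.125, 0.125]          # exact binary fractions, sum to 1
n = len(p)
H_X = -sum(q * math.log2(q) for q in p)

# verify IG(A) == f(P(A)) for every subset A of X
for A in chain.from_iterable(combinations(range(n), k) for k in range(n + 1)):
    pA = sum(p[i] for i in A)
    H_A = -sum(p[i] / pA * math.log2(p[i] / pA) for i in A) if pA > 0 else 0.0
    rest = [i for i in range(n) if i not in A]
    H_Ac = (-sum(p[i] / (1 - pA) * math.log2(p[i] / (1 - pA)) for i in rest)
            if pA < 1 else 0.0)
    IG = H_X - (pA * H_A + (1 - pA) * H_Ac)
    assert abs(IG - binary_entropy(pA)) < 1e-12
```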

The optimization problem (or corresponding decision problem) is NP-hard, meaning that unless $P=NP$, you can't find a subset $A$ minimizing $\left| P(A)-\frac{1}{2}\right|$ in polynomial time. This is exactly the partition problem (a special case of subset sum), except that here the weights are constrained to sum to $1$. See this question for some information about the hardness of the partition problem.
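Worth noting, though: partition is only weakly NP-hard. If the probabilities are rationals with a common denominator $D$ (e.g. empirical frequencies over $D$ samples), the textbook subset-sum dynamic program finds the achievable mass closest to $\frac{1}{2}$ in pseudo-polynomial $O(nD)$ time. A sketch (the integer-weight encoding and the function name are my own framing):

```python
def best_dichotomy_mass(weights, D):
    """Given integer weights summing to D (so p_i = weights[i] / D),
    return the achievable subset sum closest to D / 2, via the
    classic subset-sum DP. Runs in O(n * D) time and O(D) space.
    """
    reachable = [False] * (D + 1)
    reachable[0] = True
    for w in weights:
        # iterate downward so each weight is used at most once
        for s in range(D, w - 1, -1):
            if reachable[s - w]:
                reachable[s] = True
    # pick the reachable sum minimizing |s - D/2|
    return min((s for s in range(D + 1) if reachable[s]),
               key=lambda s: abs(2 * s - D))

# e.g. p = (3/10, 3/10, 2/10, 2/10): best mass is 5/10 = 1/2 exactly
print(best_dichotomy_mass([3, 3, 2, 2], 10))  # -> 5
```

Recovering the subset itself only requires keeping parent pointers in the DP; and for probabilities of unbounded precision, greedy or Karmarkar–Karp-style differencing heuristics typically produce well-balanced, hence near-optimal, dichotomies.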

Ariel