23

$\newcommand{\on}[1]{\operatorname{#1}}$ I have recently noticed this behavior:

  • Let $\on{P}$ be a discrete probability distribution $$ \on{P} = \left\{p_{1},\ldots, p_{n} \right\}\ \mbox{where}\ p_{1} + \cdots + p_{n} = 1,\quad p_{i} > 0\ \forall\ i $$ The entropy of $\on{P}$ is $H = -\sum_{i}p_{i}\log \left(p_{i}\right)$.
  • You can build a new distribution $$ \on{P}'= \left\{-\left[p_{1}\log\left(p_{1}\right)\right] /H,\ldots, -\left[p_{n}\log\left(p_n\right)\right]/H\right\} $$ and calculate its entropy again.
  • And you can keep iterating the same process, which is basically to replace each probability $p$ by $-p \log\left(p\right)/H$, where $H$ is the total entropy of the previous step.
  • I have reason to believe that the limit of this process is the uniform distribution where $p = 1/n$ for all the $p$'s.
  • However, I am having a hard time proving it. (If any $p_{i} = 0$, it would be replaced with $0\log\left(0\right) = 0$, which is why I required $p_{i} > 0$.)

Have any of you heard about this or a similar result? Any help will be appreciated.
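
To make the process concrete, here is a minimal R sketch of the iteration (just an illustration; the starting distribution below is arbitrary):

# One step of the map: replace each p_i by -p_i*log(p_i)/H, where H is the entropy of p
step <- function(p) {
  H <- -sum(p * log(p))   # entropy in nats
  -p * log(p) / H
}

p <- c(0.70, 0.15, 0.10, 0.05)   # an arbitrary starting distribution (an "unfair die")
for (k in 1:50) p <- step(p)
print(p)                          # numerically very close to rep(1/4, 4), i.e. the "fair die"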

  • 2
    One can use the convention $0\log 0=0$ to be able to define the map $T$ on the simplex $S_n=\{\mathbf{p}\in\mathbb{R}^n_+: \|\mathbf{p}\|_1=1\}$ by $T\mathbf{p}=-\frac{1}{\sum_j p_j\log p_j}(p_1\log p_1,\ldots, p_n\log p_n)$. The median $m^*=\frac{1}{n}(1,\ldots,1)$ is a fixed point (there are others). It seems that the OP suggests that $m^*$ is an attractor for the dynamical system $\mathbf{p}_{n+1}=T\mathbf{p}_n$ where $\mathbf{p}_0$ is in the relative interior of $S_n$. – Mittens Jul 25 '24 at 20:09
  • Hello @Mittens, thanks. Yes, m* is a fixed point of T and also the maximum of the entropy. What I'm trying to prove is that no matter where you start in the simplex, the sequence always drives you to m*, or the attractor of T as you put it. – Pedro Paiva Jul 25 '24 at 20:21
  • If you want to take a dynamical systems approach, linearizing about the uniform fixed point or using $V(p) = \sum_{i=1}^n p_i \ln p_i + \ln n$ as a Lyapunov function may prove fruitful. It still remains to show that applying this procedure actually increases entropy, which seems messy – whpowell96 Jul 26 '24 at 00:57
  • Why can such a property be important? – Amir Jul 26 '24 at 07:55
  • This is a super interesting question but I'm left wondering about the context it originated in. Do you have an interpretation/application that you're looking at this procedure for? – user3716267 Jul 26 '24 at 12:38
  • 3
    Simulations indicate that convergence to the Uniform distribution occurs for many other functions $f$ besides $f(p)=-p\log p$, i.e. with $p'_i=f(p_i)/F$ where $F=\sum_{i=1}^n f(p_i)$, and corresponding limits occur for the iterated $F$ values. E.g. there is convergence to Uniform when $f(p)=p^\alpha$ with $0<\alpha<1$, the limiting $F$-value being $n^{1-\alpha}$. The Uniform distribution also seems to result even with $f(p)=-\log(p)$. – r.e.s. Jul 26 '24 at 15:09
  • 2
    I find myself upvoting in spite of the completely inappropriate (click-bait?) title. The question apparently has nothing to do with "the entropy of entropy", nor with "how to fix unfair dice"! – r.e.s. Jul 29 '24 at 18:20
  • 1
    A proof that can be also used for other functions is now available (@r.e.s.). – Amir Jul 30 '24 at 09:02
  • I'm thinking this is at least a book title if not a screenplay :-) – uhoh Jul 30 '24 at 10:20
  • @r.e.s., I apologize if the title struck you as "click-bait". I don't do this. I just tried to make the question a bit colorful. It has to do with the entropy of entropy, because it's a recursive application of the entropy definition onto itself. It's about taking any probability distribution (the unfair dice) and taking it to the uniform distribution (the fair dice). I'm sorry that it was misunderstood. – Pedro Paiva Jul 30 '24 at 11:14
  • @ZoeAllen, as far as I can see your proof is correct and creative. Many thanks. At the same time, I am still struggling a bit to follow your proof of the inequalities. I hope I can be done understanding them soon. Many thanks! – Pedro Paiva Jul 30 '24 at 11:16
  • 1
    @user3716267, thanks for asking. I came to this question as part of my quest for mathematical processes that result in integers or rational numbers. This in turn is part of my attempt to "build" the natural numbers from the real numbers. Crazy as it may sound, that would be a much longer story. – Pedro Paiva Jul 30 '24 at 11:24
  • @PedroPaiva I see; that context makes sense. It's definitely an interesting observation: a given outcome's "entropy share" is always less-extreme (more typical?) than its probability. I wonder if there's any simple proof from convexity? – user3716267 Jul 30 '24 at 15:14
  • @Mittens please undo your edit (I don't have the rep to do it directly). It is one die, many dice (see the second definition, for die as a noun, not a verb). If you want to keep dice, you need to also change the "an unfair" to "unfair" or otherwise remove the explicit singular. You can't have "a dice", only "a die" or "many dice". – terdon Aug 23 '24 at 21:33
  • @terdon: https://www.oed.com/dictionary/die_n1?tab=meaning_and_use#6735383 . If it really kills you and you are going to die about this, I will make a change later (or someone else can); I am closing my shop for today... – Mittens Aug 23 '24 at 21:57
  • @Mittens "Dice" is plural. "Die" is singular. Fire subject-verb agreement, it should be "due" in the title. – Xander Henderson Aug 23 '24 at 21:57
  • @XanderHenderson: Again, you can see this https://www.oed.com/dictionary/die_n1?tab=meaning_and_use#6735383. It does not matter to me anyway. – Mittens Aug 23 '24 at 21:58
  • @Mittens So far as I can tell, that link says exactly what I just said. – Xander Henderson Aug 23 '24 at 21:59
  • @XanderHenderson: My point is that "dice" used as a singular is fine. Eventually it will be more common than "die". It is hard to avoid engaging in these useless byzantine discussions about the gender of the angels... – Mittens Aug 23 '24 at 22:30
  • 1
    @Mittens Editing a post to change one word (a word that is definitely correct in the context in which it was used), and to replace it with a different word which is of debatable correctness, is really not appropriate. "Die" is singular---no one is going to argue that. The sentence in which it was used called for a singular noun. There is absolutely no call to edit the post just to change that one word. – Xander Henderson Aug 24 '24 at 00:04
  • Dear colleagues, I'd like to thank you for your contributions. I continue to work on this question, building on your suggestions and ideas. – Pedro Paiva Aug 25 '24 at 14:55
  • Meanwhile, I have posted a video on YouTube to share what I have seen in my numerical simulation of the process. In case you are interested, this is the link. – Pedro Paiva Aug 25 '24 at 15:03
  • @r.e.s., so sorry. My mistake. I'm new to YouTube. I have fixed it. Can you please try again? – Pedro Paiva Aug 25 '24 at 15:11
  • @r.e.s., thank you for alerting me! – Pedro Paiva Aug 25 '24 at 15:12

5 Answers

9

We're interested in the dynamics $$ p^{t+1}_i = \frac{-p^t_i \log p^t_i}{\sum_j -p^t_j \log p^t_j}. $$

The following argument shows that if $p^0_{\max} := \max_i p^0_i \le 1/e,$ then $p^t \to \mathbf{1}/n$. To this end, first observe that since $z \mapsto -z \log z$ is monotone increasing over $[0,1/e],$ it follows that if $p^t_i > p^t_j,$ then $p^{t+1}_i > p^{t+1}_j$. Further, since $-z \log z \le 1/e$ and $\sum_i p_i (-\log p_i) \ge \sum_i p_i(-\log p_\max) = - \log p_\max,$ which is at least $1$ when $p_\max^t \le 1/e,$ it follows that if $p_\max^t \le 1/e,$ then $p_\max^{t+1} \le (1/e)/1 = 1/e$ (thanks to @ImbalanceDream for pointing out that this was missing). As a result, the indices of the maximum and minimum entries of $p^t$ remain constant throughout the dynamics, which in turn means that for the map $$\theta(p) := \frac{\max_i p_i}{ \min_i p_i} -1,$$ we have the dynamics $$ \theta(p^{t+1}) = \frac{p^t_{\max} \log p^t_{\max}}{p^t_{\min} \log p^t_{\min}} -1.$$

For succinctness, let me write $\theta = \theta(p^t)$ and $\theta_+ = \theta(p^{t+1})$ and $p = p^t$. Then we can further develop the relation $$ \theta_+ = (\theta + 1) \left( \frac{\log p_{\max} }{ \log p_{\max} - \log (\theta+1)} \right) -1 = \theta - (\theta + 1) \frac{\log (\theta +1)}{\log(\theta+1) - \log p_{\max}}.$$ Further observe that $\theta \ge 0,$ and $p_{\max} \ge 1/n$. Now, for any $c > 0,$ if $$\frac{\log (\theta + 1)}{ \log(\theta + 1) + \log n} \ge c \iff \theta \ge n^{c/(1-c)} - 1,$$ then we can conclude that $\theta_+ \le (1-c)\theta$. Due to this contraction, and the fact that $\theta_+ \le \theta,$ we can conclude that for any $c > 0,$ $$ \limsup_{t \to \infty} \theta(p^t) \le n^{c/(1-c)} -1,$$ and by taking a limit as $c \to 0,$ we conclude that $\limsup \theta(p^t) = 0 \implies \theta(p^t) \to 0.$ But for any $p$, $$ 0 \le p_{\max} - p_{\min} = p_{\max} \frac{\theta(p)}{\theta(p)+1} \le \theta(p),$$ and so we must conclude that $\lim_t p_{\max}^t - p_{\min}^t = 0,$ or equivalently, that $p^t \to \mathbf{1}/n$. I suppose that by studying this in a more refined way, one can also derive some rates, but I don't really want to go down that rabbit hole.


Of course, this analysis remains incomplete, because one needs to show that if $p_\max^0 > 1/e,$ then $p^t_\max$ eventually drops below $1/e$. This might need a different attack; certainly the expression for $\theta_+$ would no longer be so simple.
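
(Not part of the argument, just a numerical aside in R; the starting point below is my own arbitrary choice.) For this particular starting point, $p^t_{\max}$ already falls below $1/e$ after a single step:

step <- function(p) { H <- -sum(p * log(p)); -p * log(p) / H }

n <- 6
p <- c(0.99, rep(0.01 / (n - 1), n - 1))   # p_max^0 = 0.99 > 1/e
for (t in 1:5) {
  p <- step(p)
  cat(sprintf("t = %d  p_max = %.4f  (1/e = %.4f)\n", t, max(p), exp(-1)))
}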

  • Have you checked the validity of the claim for $p_\max^0 > 1/e$ numerically? – Amir Jul 26 '24 at 07:53
  • 1
    @Amir I tried a couple of examples, and it works for those, but I don't really know what sort of shape of $p$ would be "worst" for this dynamics, so it's hard to draw intuition from this. I think one might have to spend some time figuring this out before simulations start being helpful. The examples I did try were of the form $p^0_1 = 0.9999$, and $p^0_{2:n} = (1-p^0_1)\, q$, where $q$ was an exponentially decaying law. This might also be a bad example because if one is too aggressive about the decay, or $n$ is large, then one runs out of precision and gets a bunch of NaNs everywhere. – stochasticboy321 Jul 26 '24 at 13:09
  • Can you explain a little bit more on $c$? I have trouble following the steps after the introduction of $c$. The inequality $\theta\geq n^{c/(1-c)}-1$ can't hold for constant $c$ since finally you proved $\theta\to0$ – ImbalanceDream Aug 05 '24 at 01:59
  • 1
    @ImbalanceDream $c$ is just a parameter. The flow of the argument is as follows. 1) For all $t$, $\theta_{t+1} \le \theta_t$. 2) Take any $c > 0.$ If $\theta_t > n^{c/(1-c)} - 1$, then $\theta_{t+1} \le (1-c) \theta_t$. 3) Therefore, for every $c > 0,$ it holds that $\limsup_{t \to \infty} \theta_t \le n^{c/(1-c)} - 1.$ 4) Hence, $\limsup_{t \to \infty} \theta_t \le \inf_{c > 0} n^{c/(1-c)} - 1.$ 5) Finally observe that $\lim_{c \to 0} n^{c/(1-c)} - 1 = 0,$ so the infimum in part 4 is $0$. – stochasticboy321 Aug 05 '24 at 02:59
  • 1
    Oh, I suppose it should be $c \in (0,1)$ rather than $c > 0$, but this doesn't change the argument at all. In case you wanted a clarification on point 3), the idea is as follows: if for a nonincreasing sequence $x_t \ge 0$ there are $A > 0$ and $\rho \in (0,1)$ such that $x_t \ge A \implies x_{t+1} \le (1-\rho) x_t,$ then notice that $x_t \le \max( (1-\rho)^t x_0, A).$ You can show this inductively if you like. But as $t \to \infty, (1-\rho)^t x_0 \to 0,$ so this gives $\limsup_{t} x_t \le A$. – stochasticboy321 Aug 05 '24 at 03:05
  • Thank you for the great explanation. – ImbalanceDream Aug 05 '24 at 03:32
  • 1
    I have one more question. In order to keep the order it is required that $p_\max^t\leq1/\mathrm e$ for all $t$, which I think is not necessarily true. How is it guaranteed during the iteration? – ImbalanceDream Aug 05 '24 at 03:59
  • That's an excellent point, which I didn't think about. However, I think it should work out here: basically, $z \mapsto -z \ln(z)$ is bounded by $1/e$. So it suffices that the entropy (in nats) is at least $1$ if $p_\max \le 1/e$. Numerically, for $n = 3,$ the minimum value is about $1.08$ nats. My intuition says that the minimum entropy law should be supported on only 3 letters, but I'd have to show that. – stochasticboy321 Aug 05 '24 at 04:29
  • @ImbalanceDream Actually, no it's much simpler: $p_i \le p_{\max} \iff - \log p_i \ge - \log p_{\max},$ and so $H = \sum p_i (-\log p_i) \ge \sum p_i (-\log(1/e)) = 1.$ – stochasticboy321 Aug 05 '24 at 04:32
8

I think it simplifies the problem to consider $a_i=-\ln p_i$. Then our iteration is* $$(a_1, \dots, a_n) \mapsto (a_1', \dots, a_n') = (a_1 - \ln a_1 + \lambda, \dots, a_n - \ln a_n + \lambda)$$ Now consider the distance between any pair of coordinates $$|a_i' - a_j'| = |a_i - a_j - (\ln a_i - \ln a_j)|$$ There are two cases: either $|\ln a_i - \ln a_j| \le |a_i - a_j|$ or the step 'overshoots' and $|\ln a_i - \ln a_j| > |a_i - a_j|$

Firstly, it never overshoots by more than the original distance. To overshoot by more than the original distance we need: $$|\ln a_i - \ln a_j| \ge 2|a_i - a_j|$$ $$\frac{|\ln a_i - \ln a_j|}{|a_i - a_j|} = \frac{\ln a_i - \ln a_j}{a_i - a_j} \ge 2$$ Now there are values of $a_i$ and $a_j$ that would satisfy this inequality, but we have an additional constraint. $p_i + p_j \le 1$ tells us $$e^{-a_i} + e^{-a_j} \le 1$$ and there are no values that satisfy both inequalities, as you can see by plotting them both in Desmos.

Edit: I have since proven the inequality here and made a more detailed graph to go along with my proof.

You can see that even if we lower $2$ to $1.99$ this still holds, and we can lower it to approximately $1.443$. This means not only can an overshoot not increase the distance, but it has to decrease the distance by at least a constant proportion. (If we prove the inequality for $1.99$ that shows the distance has to decrease by at least 1%).

Now the undershoot case:

We can pick some small distance $\varepsilon > 0$, and as long as $|a_i - a_j| \ge \varepsilon$ the distance will decrease by at least (assuming $a_i < a_j$) $\ln(a_i + \varepsilon) - \ln a_i$. Suppose we pick $a_i$ to be the smallest value (so $p_i$ is the largest value); then $p_i \ge \frac1n \implies a_i \le \ln n$, meaning the decrease in distance is at least $\ln(\ln n + \varepsilon) - \ln \ln n > 0$.

Putting this together: at each step no distance can increase, and if the largest distance is $> \varepsilon$ it must either decrease by a constant or decrease by a constant ratio. If we repeatedly do this to any one distance it will drop below $\varepsilon$, so as there are only finitely many distances, in a finite number of iterations they will all drop below $\varepsilon$. As we can choose $\varepsilon$ as small as we like, this proves the distances all limit to $0$, from which your claim immediately follows.

To be more precise, we can sum up all $\frac12 n(n-1)$ distances, and that sum, $S$, must decrease by either a constant or a constant ratio each step until $S < \frac12 n(n-1) \varepsilon$.

Now that the inequality is proven, this is a complete proof.


*Let $p_i = e^{-a_i}$ and $p_i' = e^{-a_i'}$. Then $p_i' = -H^{-1} p_i \ln p_i$ gives us $$e^{-a_i'} = H^{-1}a_i e^{-a_i}.$$ Taking logarithms of both sides and negating: $$a_i' = a_i - \ln a_i + \ln H,$$ and then we take $\lambda = \ln H$.
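
As a numerical illustration of the contraction (my own R sketch, not part of the proof, and the starting distribution is arbitrary), one can track the largest gap $\max_i a_i - \min_i a_i$ in the coordinates $a_i = -\ln p_i$ along the iteration and watch it decrease monotonically:

step <- function(p) { H <- -sum(p * log(p)); -p * log(p) / H }
gap  <- function(p) { a <- -log(p); max(a) - min(a) }   # largest pairwise distance |a_i - a_j|

p <- c(0.6, 0.25, 0.1, 0.05)   # an arbitrary starting distribution
for (t in 0:15) {
  cat(sprintf("t = %2d  max gap = %.6f\n", t, gap(p)))
  p <- step(p)
}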

Zoe Allen
  • 7,939
  • Dear colleagues, I'd like to thank you all for your contributions and interest. I've been exploring your suggestions and will continue to do so. I hope to comment on each of them soon. So far, I have not managed to make progress on my own. Many thanks. If you happen to have new insights, please keep sharing them and feel free to share this question with others too. – Pedro Paiva Jul 27 '24 at 15:26
  • Hello @ZoeAllen, many thanks. I'm reading your proof now. Really appreciated. – Pedro Paiva Jul 29 '24 at 12:28
  • @Mittens $a_j$ isn't just the log, it's the negation of the log. Does that answer your question? – Zoe Allen Jul 29 '24 at 16:41
  • @Mittens $p \mapsto H^{-1} p \ln p$ and because we choose $\lambda$ we get a degree of freedom to multiply through by what we like, so that we make the $a_j$ inside the log positive (and it has to be positive for the log to be well defined). – Zoe Allen Jul 29 '24 at 16:53
  • @Mittens I've added a note explaining it, and also reversed the sign of $\lambda$, which I think is clearer. – Zoe Allen Jul 29 '24 at 17:09
  • 1
    @Mittens I shouldn't have described $\lambda$ as a constant in an earlier version. I only meant that it is the same for each coordinate, not each time step. It doesn't need to be the same for each time step, as the fact that it is the same for each coordinate causes it to cancel out of the distance expression. – Zoe Allen Jul 29 '24 at 17:36
  • (+1) @ZoeAllen In the more-general case of $p_i^{\prime}=f(p_i)/\sum_{j=1}^nf(p_j)$, can you see any convenient choice(s) of $a_i=g(p_i)$ that would allow a similar approach to work, for example when $f(p)=-\log p,$ or $f(p)=p^\alpha(0<\alpha<1)$? – r.e.s. Jul 29 '24 at 18:05
  • @ZoeAllen: I see what you are doing: you are comparing the components (in log-scale) of the vector $T(\mathbf{p})$ and your goal is to show that the iterations will give differences that are small. – Mittens Jul 29 '24 at 18:46
  • @r.e.s. No. I'm not entirely satisfied with this proof in large part for that reason. – Zoe Allen Jul 29 '24 at 19:15
  • The high level idea works directly for $f(p) = p^\alpha.$ Set $a_i = -\log p_i.$ Then under the iteration, $a_i^+ = -\alpha \log p_i + \log\sum_j p_j^\alpha,$ and so $|a_i^+ - a_j^+| = \alpha | a_i - a_j|,$ and $\alpha < 1$ means we're immediately done. For $f(p) = -\log p,$ crudely one can see that $|a_i^+ - a_j^+| \le |a_i - a_j|/\max(|a_i|, |a_j|), $ which means that as long as at most one entry of $p^0$ is $> 1/e,$ and the rest are strictly $< 1/e,$ we again get contraction. Maybe one can say more by actually considering the constraint $e^{a_i} + e^{a_j} \le 1$. – stochasticboy321 Aug 05 '24 at 03:27
  • 1
    Note that the same applies if $g(a) := \log f(\exp(a))$ is $(1-c)$-Lipschitz for some $c > 0$. The strategy reminds me of the design philosophy behind multiplicative weights, which in a certain sense essentially says that the nice way to do gradient updates over the probability simplex is to do them on the logarithm of the probability vector, a la the approach here, and one can perhaps view this answer in that vein. The potential function typically used in these analyses is the normalisation itself, maybe that can yield a wider characterisation of what $f$ are easily captured? Not sure. – stochasticboy321 Aug 05 '24 at 03:32
5

This is not a complete answer but a possible path for the OP:

The entropy function $H(\mathbf{p})=-\sum^n_{j=1}p_j\log p_j$ satisfies $0\leq H(\mathbf{p})\leq \log n$.

Define the map $T$ on the cube $(0,1)^n$ as $$T\mathbf{p}=-\frac1{H(\mathbf{p})}(p_1\log p_1,\ldots, p_n\log p_n)$$ $T$ is in fact defined on $D_n=[0,1]^n\setminus V_n$ where $V_n$ is the set of vertices of $[0,1]^n$. Further, $T$ maps $D_n$ onto the simplex $S^*:=S\setminus\{\mathbf{e}_1,\ldots, \mathbf{e}_n\}$ ($\mathbf{e}_j(i)=\delta_{ij}$ for $1\leq i,j\leq n$).

The OP suggests that the median $\mathbf{p}^*=\frac{1}{n}(1,\ldots,1)$, a fixed point of $T$ on $(0,1)^n$, is an attractor for the dynamical system $\mathbf{p}_{m+1}=T\mathbf{p}_m$ where $\mathbf{p}_0$ is in $(0,1)^n$. Numerical evidence supports this. (See the R code below, which simulates the system $\mathbf{p}_{m+1}=T\mathbf{p}_m$.)

Denote by $T_i(\mathbf{p})=-\frac{p_i\log p_i}{H(\mathbf{p})}$. Then, $T$ is differentiable as a function on $(0,1)^n$ and the Jacobian matrix $T'(\mathbf{p})$ has entries given by \begin{align} \partial_jT_i &=-\frac{(1+\log p_j) p_i \log p_i}{H^2(\mathbf{p})}\\ \partial_iT_i&=-\frac{\big(1+\log p_i\big)\big(H(\mathbf{p})+p_i\log p_i\big)}{H^2(\mathbf{p})} \end{align} for $1\leq i,j\leq n$, $i\neq j$. It follows that for each $i$, $$ \partial_jT_i(\mathbf{p})\left\{\begin{array}{lcr} <0 & \text{if} & j\neq i, \,0<p_j<e^{-1}\\ >0 &\text{if} &j\neq i, \,e^{-1}<p_j<1\\ <0 &\text{if} & j=i, \,e^{-1}<p_i<1\\ >0 &\text{if} & j=i, \,0<p_i<e^{-1} \end{array} \right. \tag{*}\label{sign} $$

At $\mathbf{p}^*$ \begin{align} T'(\mathbf{p}^*)=\frac{1}{\log n}\begin{pmatrix} (\log n -1)\frac{n-1}{n} & \frac{1-\log n}{n} & \frac{1-\log n}{n}& \ldots & \frac{1-\log n}{n}\\ \frac{1-\log n}{n} & (\log n -1)\frac{n-1}{n} & \frac{1-\log n}{n}&\ldots & \frac{1-\log n}{n}\\ \vdots &\vdots &\ddots & \ldots &\vdots\\ \frac{1-\log n}{n} & \frac{1-\log n}{n}&\frac{1-\log n}{n} &\ldots& (\log n -1)\frac{n-1}{n} \end{pmatrix} \end{align} It is clear that $\lambda_1=0$ is an eigenvalue with one-dimensional eigenspace spanned by $\mathbf{u}:=[1,\ldots,1]^\intercal$. There is another eigenvalue, $\lambda_2=\frac{\log n-1}{\log n}$, whose corresponding $(n-1)$-dimensional eigenspace is spanned by the vectors $\mathbf{e}_1-\mathbf{e}_j$, $j=2,\ldots, n$. As $|\lambda_k|<1$, $k=1,2$, we conclude that indeed, $\mathbf{p}^*$ is a (local) attractor.
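
As a sanity check on the spectrum (my own numerical sketch in R; the choice $n=5$ and the finite-difference step are arbitrary), a numerically approximated Jacobian of $T$ at $\mathbf{p}^*$ indeed has eigenvalues close to $0$ and $(\log n-1)/\log n$:

Tmap <- function(p){ H <- -sum(p*log(p)); -p*log(p)/H }

n  <- 5
ps <- rep(1/n, n)                 # the fixed point p* = (1/n, ..., 1/n)
h  <- 1e-6
J  <- matrix(NA, n, n)
for (j in 1:n) {                  # forward-difference approximation of the Jacobian T'(p*)
  e <- rep(0, n); e[j] <- h
  J[, j] <- (Tmap(ps + e) - Tmap(ps)) / h
}
J <- (J + t(J)) / 2               # T'(p*) is symmetric; symmetrizing removes finite-difference noise
round(eigen(J)$values, 4)         # ~ (log(n)-1)/log(n) with multiplicity n-1, and ~ 0
(log(n) - 1) / log(n)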

It remains to show that it is a global attractor for $T$ on $(0,1)^n$. This, perhaps, can be deduced from \eqref{sign}, which at first sight implies that the orbit of any point $\mathbf{p}_0$ in the relative interior of $S^*$ moves inwards, i.e. away from the boundary $\partial S^*$.

Along lower-dimensional faces of the simplex $S^*$, there are other (hyperbolic) attractors given by the medians of those faces.


This code simulates the dynamical system $\mathbf{p}_{m+1}=T(\mathbf{p}_m)$.

slog <- function(x){
  ifelse(x==0,0,x*log(x))
}
slog <- Vectorize(slog,vectorize.args = 'x')
entropy <- function(p){
  -sum(slog(p))
}

myTfun <- function(p){
  H <- entropy(p)
  return(-slog(p)/H)
}

set.seed(143)
library(mcmc)   # note: rdirichlet() is also provided by packages such as MCMCpack or gtools

p0 <- c(0.7153878, 0.01568887, 0.1950531, 0.07387031)

p0 <- rdirichlet(1, c(1,1,1,1))                # overwrite p0 with a random Dirichlet draw
Tp <- matrix(NA, nrow = 30, ncol = length(p0)) # rows = iterates of the dynamical system

Tp[1,] <- p0
for(n in 2:nrow(Tp)){
  Tp[n,] <- myTfun(Tp[n-1,])
}

# maximum componentwise distance of each iterate from the uniform point (1/n, ..., 1/n)
residual <- apply(Tp - rep(1/ncol(p0), ncol(p0)), 1, function(x){max(abs(x))})
plot(1:nrow(Tp), residual, type = 'o', col = 'blue', xlab = 'n')
Tp

Mittens
  • 46,352
4

A way to prove the claim is to use the following inequality:

$$\left \|\frac{-\boldsymbol{p} \circ \log \boldsymbol{p}}{H(\boldsymbol{p})} -\frac{1}{n} \boldsymbol{1} \right \|_\alpha<\left \|\boldsymbol{p} -\frac{1}{n} \boldsymbol{1} \right \|_\alpha \tag {1} $$ with $\alpha\ge 1$ for any $p_1,\dots,p_n > 0, \sum_{i=1}^n p_i=1$ excepting $p_i=\frac{1}{n}, i=1,\dots,n$.

A partial proof of (1) based on majorization, also covering the general result (2) below, is given in my answer to this MSE post, which I asked after guessing the inequality. A full proof of this inequality follows from the validity of this recent conjecture $\frac{-\boldsymbol{p}\log \boldsymbol{p}}{H(\boldsymbol{p})} \prec \boldsymbol{p}$, completely proved in this MO post.

Let $\boldsymbol{p}^k, k\in \mathbb N$ denote the probability vectors generated sequentially based on the dynamical system described in the OP. Then, from (1), $\left \|\boldsymbol{p}^k -\frac{1}{n} \boldsymbol{1} \right \|_\alpha, k\in \mathbb N$ is a strictly decreasing sequence of positive numbers, and thus it converges to some constant $c\ge 0$ as $k\to \infty$. Therefore, the sequence $\boldsymbol{p}^k, k\in \mathbb N$ converges to an attracting fixed point, which is the simplex center $\frac{1}{n}\boldsymbol{1}$, since we show below that no other fixed point (those on the boundary of the simplex) is attracting.

Any attracting point $y$ on the boundary of the simplex, such as $\small \left(\frac{1}{n-1},\dots,\frac{1}{n-1} ,0 \right)$, is the center of its underlying face, which is another simplex of smaller dimension. Hence, the face that includes $y$ is tangent at $y$ to the intersection of the ball $B_{\alpha}(r_{y})$ and the simplex (the intersection is denoted by $SB_{\alpha}(r_{y})$), where $B_{\alpha}(r_{y})$ is the $l_\alpha$ ball centered at $\frac{1}{n}\boldsymbol{1}$ with radius $r_{y}=\left \|y -\frac{1}{n} \boldsymbol{1} \right \|_\alpha$. Thus, any sequence of points in the relative interior of the simplex that converges to $y$ necessarily crosses the relative interior of $SB_{\alpha}(r_{y})$ for some $\alpha\ge 1$ (otherwise, the points of the sequence cannot all be in the relative interior of the simplex and some of them must be on the face including $y$); see the orange point, which is a fixed point on the boundary, by rotating the figure in the illustration.

This observation results in a contradiction based on (1) as follows. Assume that $x_1,x_2,\dots$ is a sequence of relatively interior points that does not converge to $\frac{1}{n} \boldsymbol{1}$ and converges to another fixed point $y$, which must be on the boundary of the simplex. Then, based on the observation just made, there are $\alpha\ge 1$ and a point $x_m$ in the sequence, sufficiently close to $y$, such that the $l_\alpha$ distance of $x_m$ from the center $\frac{1}{n}\boldsymbol{1}$ is strictly less than the $l_\alpha$ distance of $y$ from $\frac{1}{n}\boldsymbol{1}$, i.e., $r_y=\left \|y -\frac{1}{n} \boldsymbol{1} \right \|_\alpha$. Then, from (1), the distances of the points $x_m, x_{m+1},\dots $ from the center successively become smaller, so they converge to some point whose distance from the center is strictly smaller than $r_y$, which is a contradiction because then $x_m, x_{m+1},\dots $ can never converge to $y$, whose distance from the center is $r_y$.


Extension

The inequality (1) can be extended to any increasing function $f:[p_{(n)},p_{(1)}] \to \mathbb R_+$ with $\frac{f(x)}{x}$ strictly decreasing, as follows:

$$\left \|\frac{f (\boldsymbol{p})}{\sum_{i=1}^n f (p_i)} -\frac{1}{n} \boldsymbol{1} \right \|_\alpha<\left \|\boldsymbol{p} -\frac{1}{n} \boldsymbol{1} \right \|_\alpha \tag {2} $$ with $\alpha\ge 1$ for any $p_1,\dots,p_n > 0, \sum_{i=1}^n p_i=1$ excepting $p_i=\frac{1}{n}, i=1,\dots,n$.

Hence, the same convergence result can be obtained for other continuous functions $f$, such as $f(x)=x^a$ with $a\in (0,1)$, with similar attracting points, as suggested in a comment by @r.e.s. based on numerical experiments.
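
For what it's worth, here is a small randomized spot-check of (1) and (2) in R (my own sketch; the sample size, the dimension $n=5$, and the norms $\alpha\in\{1,2,5\}$ are arbitrary choices):

lp <- function(x, a) sum(abs(x)^a)^(1/a)      # the l_alpha norm

check <- function(f, n = 5, trials = 2000, alphas = c(1, 2, 5)) {
  u  <- rep(1/n, n)
  ok <- TRUE
  for (k in 1:trials) {
    p <- rexp(n); p <- p / sum(p)             # a random interior point of the simplex
    q <- f(p) / sum(f(p))                     # the mapped distribution
    for (a in alphas) ok <- ok && (lp(q - u, a) < lp(p - u, a))
  }
  ok
}

check(function(p) -p * log(p))   # inequality (1): expect TRUE
check(function(p) sqrt(p))       # inequality (2) with f(x) = x^(1/2): expect TRUE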

Amir
  • 11,124
  • 3
    It may be obvious, but all you have from (1) is that $a_k:=\|T\mathbf{p}_k-\frac1n\mathbf{1}\|_2$ converges, being a decreasing sequence of positive numbers. How do you establish that the limit is $0$? – Mittens Jul 30 '24 at 14:42
  • Mittens and BorisPerezPrado: You may notice that the inequality implies that the points move successively towards the center of the simplex, and that each point is farther from the boundary than the previous point. Hence, the center is the only global attracting fixed point. – Amir Jul 30 '24 at 21:05
  • @Mittens Let $x_1,x_2,\dots$ be a sequence of relatively interior points in the simplex that converges to some fixed point $y$ on the boundary of the simplex. Then, there is an $x_m$ in the sequence near $y$ and a $p\ge1$ such that the $l_p$ distance of $x_m$ from the center is less than the $l_p$ distance of $y$ from the center. Hence, from (1), which holds for any $l_p$ norm, $x_m, x_{m+1},\dots $ converge to the center, and not $y$, which is a contradiction. – Amir Jul 31 '24 at 02:17
  • In this 3D figure, you can see that as $p$ increases, any point in the neighbor of $(1/2,1/2,0)$ can be covered by an $l_p$ ball around the center with radius $\small \left(2|1/2-1/3|^p + |0-1/3|^p \right)^{1/p}.$ – Amir Jul 31 '24 at 02:18
  • Your proof in the comments is not correct. I know things should be true; after all, Zoe's proof points in that direction, but your argument is not quite correct. One should take a look at the derivatives to see that things move away from the boundary. – Mittens Jul 31 '24 at 02:54
  • 1
    @Amir: I am still having problems following your argument. I appreciate your effort and your edit. I have removed my (-1). Thanks! – Boris PerezPrado Aug 01 '24 at 18:40
2

Just some late ideas to provide insights. NOT a proof.


With some techniques similar to the data processing inequality, we may obtain $$ D_{\mathrm {KL}}(P_{i+1}\|U)\leqslant D_{\mathrm {KL}}(P_{i}\|U), $$where $U$ is the uniform distribution. This also indicates that the entropy of $P_i$ is increasing.

If $P_i$ converges to some $P=(p_1,\dots,p_n)$, we may expect that $$ p_j=-p_j\log p_j\mathbin/ H(P),\quad j=1,\dots,n, $$ which means $-\log p_1=\dots=-\log p_n=H(P)$, so $P$ is the uniform distribution.
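
As a quick numerical illustration of the suggested monotonicity (my own R sketch, not a proof, with an arbitrary starting distribution), $D_{\mathrm{KL}}(P_i\|U)$ does appear to decrease at every step along a simulated trajectory:

step <- function(p) { H <- -sum(p * log(p)); -p * log(p) / H }
klu  <- function(p) sum(p * log(p * length(p)))   # D_KL(p || U) = log(n) - H(p), in nats

p <- c(0.5, 0.3, 0.15, 0.05)   # an arbitrary starting distribution
for (t in 0:12) {
  cat(sprintf("t = %2d  KL(P_t || U) = %.6f\n", t, klu(p)))
  p <- step(p)
}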

ImbalanceDream
  • 1,081
  • many thanks for the insight. Have you managed to establish the KL inequality? – Pedro Paiva Aug 03 '24 at 12:29
  • @PedroPaiva Haven't tried too much. Even with the inequality, it will be difficult to prove "$P_i$ converges to some $P$" because the convergence of KL divergence can't simply imply the convergence of $P_i$. Anyway, I hope the consideration of KL divergence would be helpful. – ImbalanceDream Aug 04 '24 at 09:55