
Suppose we would like to compute the Euclidean projection of an arbitrary matrix $A$ onto the Birkhoff polytope, the set of doubly-stochastic matrices.

  1. Under some conditions on $A$, Sinkhorn's algorithm returns two diagonal matrices $D_1, D_2$ such that $D_1 A D_2$ is a doubly stochastic matrix (and this matrix is unique). Is this matrix the same as the Euclidean projection, and if not, does it correspond to the projection under some other metric?
  2. Are there known methods for solving this efficiently besides just formulating the problem as a quadratic program and applying a generic QP algorithm?
cong
  • I think this is related as well: https://math.stackexchange.com/questions/4936790. – Royi Jun 25 '24 at 19:53

3 Answers


It is known that the (symmetric) Sinkhorn algorithm in fact minimizes the KL divergence (2), (3). In (1), the authors present a method that instead minimizes the Euclidean distance, called BBS (Bregmanian Bi-Stochastication). They report that BBS with the Euclidean distance tends to produce noticeably better clustering results than the Sinkhorn-Knopp (SK) algorithm, while SK performs better in terms of how close the result is to being doubly stochastic.
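
For reference, here is a minimal NumPy sketch of plain Sinkhorn-Knopp balancing (the KL-divergence solution discussed above), not the BBS algorithm of (1). The function name and stopping rule are my own choices, and the method assumes a matrix with strictly positive entries.

```python
import numpy as np

def sinkhorn_balance(A, iters=1000, tol=1e-9):
    # Alternately rescale rows and columns of a strictly positive matrix
    # until it is (approximately) doubly stochastic; this is equivalent to
    # finding diagonal D1, D2 such that D1 A D2 is doubly stochastic.
    X = np.asarray(A, dtype=float).copy()
    if np.any(X <= 0):
        raise ValueError("Sinkhorn balancing assumes strictly positive entries")
    for _ in range(iters):
        X = X / X.sum(axis=1, keepdims=True)   # row sums -> 1
        X = X / X.sum(axis=0, keepdims=True)   # column sums -> 1
        if np.abs(X.sum(axis=1) - 1.0).max() < tol:
            break
    return X
```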

(1) Fan Wang, Ping Li, and Arnd Christian König. Learning a Bi-Stochastic Data Similarity Matrix.

(2) J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.

(3) G. W. Soules. The rate of convergence of Sinkhorn balancing. Linear Algebra and its Applications, 150:3 – 40, 1991.

Regarding the projection of an arbitrary (possibly non-positive) matrix: it is quite hard to project a matrix with negative entries directly. I would first project it onto the orthogonal matrices and then round that result to the Birkhoff polytope. It is not a great procedure, but reasonable approaches are given in (4) and (5); a small sketch of this two-stage idea follows the references.

(4) Alexander Barvinok. Approximating Orthogonal Matrices by Permutation Matrices, 2005.

(5) Reparameterizing the Birkhoff Polytope for Variational Permutation Inference - See Doubly Stochastic Matrices in Stan.
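
A minimal sketch of the two-stage idea, assuming NumPy and SciPy are available: project onto the orthogonal matrices via the polar factor of the SVD, then "round" to the nearest permutation matrix (a vertex of the Birkhoff polytope) by solving a linear assignment problem, in the spirit of (4). The function names are mine, and this is only a heuristic, not the Euclidean projection.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def nearest_orthogonal(A):
    # Frobenius-norm projection onto the orthogonal matrices:
    # the polar factor U V^T of the SVD A = U S V^T.
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt

def nearest_permutation(Q):
    # Closest permutation matrix to Q in Frobenius norm, i.e. the one
    # maximizing <P, Q>, found via the linear assignment problem.
    rows, cols = linear_sum_assignment(Q, maximize=True)
    P = np.zeros_like(Q)
    P[rows, cols] = 1.0
    return P

# Usage: P = nearest_permutation(nearest_orthogonal(A))
```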

Royi
Tolga Birdal
  • The original post seeks to find a bi-stochastication of an arbitrary matrix, but this paper seems to rely on the input being a similarity matrix. Are you aware of a more general algorithm? – RMurphy May 07 '20 at 17:21
  • Updated my reply, but unfortunately I am not aware of "the way" to do it. – Tolga Birdal May 07 '20 at 20:28
  • for what it's worth, I've done some toy experiments with method [1] inputting matrices with negative elements, etc. So far, the method does seem to converge quite reliably, but as of right now I don't know whether it scales well. I will take a look at the second paper you mentioned. Thanks! – RMurphy May 07 '20 at 20:52
  • I'm surprised. Please keep us posted on your findings. – Tolga Birdal May 08 '20 at 18:53
  • @TolgaBirdal, In (1), they say they use Dykstra's algorithm, yet they actually use alternating projections. Since the constraint sets are not subspaces, I don't think convergence to the projection point can be guaranteed (see http://pfister.ee.duke.edu/courses/ece586/altproj.pdf). Are you aware of an extension of the method to the case in the paper? – Royi Jun 26 '24 at 20:19
  • @RMurphy, The paper states the optimization in a general way that assumes nothing about the matrix, so it should converge for any given square matrix. They use a similarity matrix because the work is motivated by clustering. I think the algorithm could be derived fairly easily as projected gradient descent. I also think they made an assumption about their step that is not valid (at least they did not justify it). – Royi Jun 26 '24 at 20:26

Defining the problem as:

$$ \begin{align*} \arg \min_{\boldsymbol{X}} \quad & \frac{1}{2} {\left\| \boldsymbol{X} - \boldsymbol{Y} \right\|}_{F}^{2} \\ \text{subject to} \quad & \boldsymbol{X} \boldsymbol{1} = \boldsymbol{1}, \quad \boldsymbol{1}^{T} \boldsymbol{X} = \boldsymbol{1}^{T}, \quad {X}_{i, j} \geq 0, \quad \boldsymbol{X} = \boldsymbol{X}^{T} \end{align*} $$

One could use orthogonal projection onto the intersection of convex sets (for instance, Dykstra's algorithm) in order to calculate the projection. If a general, not necessarily symmetric, doubly stochastic matrix is wanted, simply drop the symmetry constraint.

Each projection on its own is pretty simple; a code sketch combining them is given after the list:

  • $ \mathcal{C} = \left\{ \boldsymbol{X} \in \mathbb{R}^{n \times n} \mid \boldsymbol{X} \boldsymbol{1} = \boldsymbol{1} \right\} \Rightarrow {P}_{\mathcal{C}} \left( \boldsymbol{Y} \right) = \boldsymbol{Y} - \frac{1}{n} \left( \boldsymbol{Y} \boldsymbol{1} - \boldsymbol{1} \right) \boldsymbol{1}^{T} $.
  • $ \mathcal{C} = \left\{ \boldsymbol{X} \in \mathbb{R}^{n \times n} \mid \boldsymbol{1}^{T} \boldsymbol{X} = \boldsymbol{1} \right\} \Rightarrow {P}_{\mathcal{C}} \left( \boldsymbol{Y} \right) = \boldsymbol{Y} - \frac{1}{n} \boldsymbol{1} \left( \boldsymbol{1}^{T} \boldsymbol{Y} - \boldsymbol{1}^{T} \right) $.
  • $ \mathcal{C} = \left\{ \boldsymbol{X} \in \mathbb{R}^{n \times n} \mid {X}_{i, j} \geq 0 \right\} \Rightarrow {P}_{\mathcal{C}} \left( \boldsymbol{Y} \right) = \max \left\{ \boldsymbol{Y}, \boldsymbol{0} \right\} $ where $\boldsymbol{0}$ is a matrix of zeros and the maximum is taken elementwise.
  • $ \mathcal{C} = \left\{ \boldsymbol{X} \in \mathbb{R}^{n \times n} \mid \boldsymbol{X} = \boldsymbol{X}^{T} \right\} \Rightarrow {P}_{\mathcal{C}} \left( \boldsymbol{Y} \right) = 0.5 \left( \boldsymbol{Y} + \boldsymbol{Y}^{T} \right) $.
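
As an illustration, here is a minimal NumPy sketch that combines the projections above with Dykstra's algorithm, one standard way to compute the projection onto an intersection of convex sets. The function names, iteration count, and stopping rule are my own choices, and this is not the code from the repository linked below.

```python
import numpy as np

def proj_row_sums(Y):
    # Projection onto {X : X 1 = 1}
    n = Y.shape[1]
    return Y - (Y.sum(axis=1, keepdims=True) - 1.0) @ np.ones((1, n)) / n

def proj_col_sums(Y):
    # Projection onto {X : 1^T X = 1^T}
    n = Y.shape[0]
    return Y - np.ones((n, 1)) @ (Y.sum(axis=0, keepdims=True) - 1.0) / n

def proj_nonneg(Y):
    # Projection onto {X : X_ij >= 0}
    return np.maximum(Y, 0.0)

def proj_symmetric(Y):
    # Projection onto {X : X = X^T}; drop this one if a general
    # (non-symmetric) doubly stochastic matrix is wanted.
    return 0.5 * (Y + Y.T)

def dykstra(Y, projections, iters=1000, tol=1e-9):
    # Dykstra's algorithm: cyclic projections with correction increments,
    # converging to the Euclidean projection onto the intersection.
    X = Y.astype(float).copy()
    increments = [np.zeros_like(X) for _ in projections]
    for _ in range(iters):
        X_prev = X.copy()
        for i, proj in enumerate(projections):
            Z = proj(X + increments[i])
            increments[i] = X + increments[i] - Z
            X = Z
        if np.linalg.norm(X - X_prev) < tol:
            break
    return X

# Usage:
# X = dykstra(A, [proj_row_sums, proj_col_sums, proj_nonneg, proj_symmetric])
```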

The full code is available on my StackExchange Mathematics GitHub Repository (Look at the Mathematics\Q838813 folder).


Royi

Besides the Sinkhorn-Knopp algorithm, I believe it is possible to apply the alternating projection algorithm to the matrix $A$. Specifically, I use the projection onto the probability simplex as described by Wang et al. (2013), which is easy to implement. The process involves alternately projecting each row and then each column onto the probability simplex until convergence. Empirically, it converges very quickly and yields results that are close to those of Dykstra's algorithm, which guarantees the exact projection onto the intersection of convex sets.
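
A minimal NumPy sketch of that scheme, assuming the standard sort-based simplex projection (Wang et al., 2013). The function names and the stopping rule are mine, and the loop below does plain alternating projections rather than Dykstra's algorithm, so it yields a doubly stochastic matrix that is only approximately the Euclidean projection.

```python
import numpy as np

def project_simplex(y):
    # Euclidean projection of a vector onto the probability simplex
    # (sort-based algorithm described by Wang et al., 2013).
    u = np.sort(y)[::-1]
    cssv = np.cumsum(u) - 1.0
    idx = np.arange(1, y.size + 1)
    rho = np.nonzero(u - cssv / idx > 0)[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

def alternating_simplex_projection(A, iters=500, tol=1e-9):
    # Alternately project every row, then every column, onto the simplex.
    X = np.asarray(A, dtype=float).copy()
    for _ in range(iters):
        X_prev = X.copy()
        X = np.apply_along_axis(project_simplex, 1, X)  # rows
        X = np.apply_along_axis(project_simplex, 0, X)  # columns
        if np.linalg.norm(X - X_prev) < tol:
            break
    return X
```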