Suppose we have a set of points in $\mathbb{R}^d$, and for a given constant $\epsilon>0$ we want to find a hyperplane that divides the dataset into two balanced partitions while minimizing the number of points that are $\epsilon$-close to the hyperplane.
I came up with a formulation that seems to capture all these constraints and turns this into a continuous optimization problem.
Suppose we are given $n$ points $x_1,\dots,x_n\in\mathbb{R}^d$ as our dataset. We parametrize the hyperplane by $w^\top x - b = 0$, where $w\in\mathbb{R}^d$ is the unit normal vector of the hyperplane and $b$ is the offset. Moreover, we label the points by $y_i\in\{0,1,-1\}$.
Here is an illustration for $d=2$. Blue, red, and black dots correspond to labels $y_i=1$, $y_i=-1$, and $y_i=0$, respectively. The goal is to find a line such that the number of blue points equals the number of red points and the number of black points is minimized.
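(For anyone who wants to play with the setup, here is a minimal matplotlib sketch that produces such a picture; the dataset, $(w,b)$, and $\epsilon$ are arbitrary placeholders I picked for the plot, and the coloring ignores the balance constraint, it only shows the $\epsilon$-band assignment.)

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 2))              # toy dataset in R^2
w = np.array([1.0, 0.5]); w /= np.linalg.norm(w)  # unit normal
b, eps = 0.1, 0.15                                # arbitrary offset and band width

m = X @ w - b                                     # signed distances to the hyperplane
y = np.where(np.abs(m) <= eps, 0, np.sign(m)).astype(int)

for label, color in [(1, "blue"), (-1, "red"), (0, "black")]:
    P = X[y == label]
    plt.scatter(P[:, 0], P[:, 1], c=color, label=f"$y_i={label}$")
p, t = b * w, np.array([-w[1], w[0]])             # point on the line + direction along it
plt.axline(p, p + t, color="gray")
plt.legend(); plt.gca().set_aspect("equal"); plt.show()
```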

We can formulate this optimization problem as follows: $$\max_{w,b,y} \sum_i y_i^2 \qquad (1)$$ such that $$\forall i\in[n]: y_i^2(w^\top x_i - b - \epsilon)(w^\top x_i - b + \epsilon) \ge 0 \quad (2)$$ $$\forall i\in[n]: y_i(w^\top x_i - b)\ge 0 \qquad (3)$$ $$y_i\in\{0,1,-1\}, \quad \sum_{i=1}^n y_i = 0 \qquad (4)$$ In words, we have two parallel hyperplanes with offsets $b-\epsilon$ and $b+\epsilon$, and we want points between the two to be assigned label $y_i=0$, while points on the left and right of this band are assigned $y_i=-1$ and $y_i=1$, respectively. The maximization in (1) means we minimize the number of points that are assigned $y_i=0$, which is what we want. The constraint $\sum_{i=1}^n y_i=0$ in (4) ensures that the numbers of points on the left and right sides are balanced. Constraint (3) makes sure points to the left and right don't get the wrong sign (it's similar to the SVM constraints). Finally, constraint (2) forces points between the two shifted hyperplanes to have label $y_i=0$: for these points the product $(w^\top x_i - b - \epsilon)(w^\top x_i - b + \epsilon)$ is negative, and $y_i^2$ is non-negative, so the only way to make the left-hand side non-negative is $y_i=0$.
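To make the semantics of (1)-(4) concrete: once $(w,b)$ is fixed, the labels are essentially forced, which the following sketch computes (`best_labels` is my own illustrative helper, not part of any library):

```python
import numpy as np

def best_labels(X, w, b, eps):
    """Best labeling for a *fixed* hyperplane (w, b): maximize (1)
    subject to (2)-(4)."""
    m = X @ w - b
    # (2) forces y_i = 0 strictly inside the eps-band; exactly on the
    # boundary the product in (2) is zero, so those points may keep their sign
    y = np.where(np.abs(m) < eps, 0, np.sign(m)).astype(int)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == -1)
    surplus = len(pos) - len(neg)
    # (4): zero out the excess on the majority side; since the objective
    # only counts nonzero labels, any subset of the right size works
    if surplus > 0:
        y[pos[:surplus]] = 0
    elif surplus < 0:
        y[neg[:-surplus]] = 0
    return y, int(np.sum(y ** 2))
```

This also shows that the hard part of the problem is entirely the choice of $(w,b)$; the labels follow.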
One nice property of this formulation is that we can remove $y_i\in\{0,1,-1\}$ and replace it by $-1\le y_i \le 1$ without the risk of getting non-integral solutions. The sketch of the integrality proof is: suppose we have a non-integral optimal solution $w^*, b^*, y^*$ such that for some $i$ we have $0 < y^*_i < 1$ (or $-1<y^*_i < 0$). Because of $\sum_{i=1}^n y_i =0$, there must be some other index $j$ with $0<y^*_j<1$ or $-1<y^*_j < 0$, and there is always a small constant $c$ such that setting $y_i^*\leftarrow y^*_i+c$ and $y^*_j\leftarrow y_j^* - c$ violates no constraints while improving the objective (details omitted for brevity).
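In fact the exchange argument can be turned into a concrete rounding routine: for a fixed $(w^*,b^*)$, shifting mass between two fractional coordinates never decreases the objective, and each shift makes at least one coordinate integral. Here is a sketch, assuming the sign pattern $s_i\in\{-1,0,1\}$ permitted by constraints (2)-(3) has already been computed for each point (so $y_i$ ranges over $[-1,0]$, $\{0\}$, or $[0,1]$):

```python
import numpy as np

def round_labels(y, s, tol=1e-9):
    """Push a fractional feasible y (sum(y) = 0, each y_i in its allowed
    interval) to an integral one. An exchange y_i += c, y_j -= c changes
    the objective by 2c(y_i - y_j) + 2c^2 > 0, so it never decreases."""
    y = y.astype(float).copy()
    lo = np.minimum(s, 0).astype(float)   # allowed interval [lo_i, hi_i]
    hi = np.maximum(s, 0).astype(float)

    def fractional():
        return np.flatnonzero(np.abs(y - np.round(y)) > tol)

    frac = fractional()
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        # move in the direction that increases the objective, as far as
        # the box [lo, hi] allows; one of i, j hits an integer endpoint
        if y[i] >= y[j]:
            c = min(hi[i] - y[i], y[j] - lo[j])
        else:
            c = -min(y[i] - lo[i], hi[j] - y[j])
        y[i] += c
        y[j] -= c
        frac = fractional()
    return y
```

(If all but one coordinate is integral, the balance constraint forces the last one to be integral too, so the loop terminates with an integral vector.)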
However, constraint (2) is highly non-convex, and it has degree $4$ in the variables (degree $2$ in $y_i$ times degree $2$ in $(w,b)$). Hence the problem can't be solved with off-the-shelf quadratic programming. Are there any ideas on how this problem can be reformulated to obtain an optimal solution in polynomial time?
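For experimentation (not as an answer to the question), the relaxation with $-1\le y_i\le 1$ can at least be handed to a generic local solver. Here is a sketch using `scipy.optimize.minimize` with SLSQP; it only finds stationary points and comes with no optimality guarantee:

```python
import numpy as np
from scipy.optimize import minimize

def solve_relaxation(X, eps, seed=0):
    """Local solve of (1)-(4) with y relaxed to [-1, 1] and ||w|| = 1.
    Variable vector z = (w, b, y)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)

    def unpack(z):
        return z[:d], z[d], z[d + 1:]          # w, b, y

    def neg_obj(z):                            # maximize sum_i y_i^2
        return -np.sum(unpack(z)[2] ** 2)

    def strip(z):                              # (2): y^2 (m^2 - eps^2) >= 0
        w, b, y = unpack(z)
        m = X @ w - b
        return y ** 2 * (m ** 2 - eps ** 2)

    def signs(z):                              # (3): y * m >= 0
        w, b, y = unpack(z)
        return y * (X @ w - b)

    cons = [
        {"type": "ineq", "fun": strip},
        {"type": "ineq", "fun": signs},
        {"type": "eq", "fun": lambda z: np.sum(unpack(z)[2])},              # (4)
        {"type": "eq", "fun": lambda z: unpack(z)[0] @ unpack(z)[0] - 1.0},  # ||w|| = 1
    ]
    bounds = [(None, None)] * (d + 1) + [(-1.0, 1.0)] * n
    z0 = np.concatenate([rng.normal(size=d), [0.0], rng.uniform(-1, 1, n)])
    res = minimize(neg_obj, z0, method="SLSQP", bounds=bounds, constraints=cons)
    return unpack(res.x), res
```

Note that $y=0$ (with any unit $w$ and any $b$) is feasible with objective $0$, so a bad starting point can stall at the trivial solution; restarting from several random seeds helps.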
UPDATE I've added a bounty because this problem has very important applications if it can be solved efficiently. To briefly mention one: if we have a set of data points that need to be distributed over an HPC cluster, partitioning them in a balanced way such that as few points as possible lie close to the cut would minimize the communication cost between the nodes. If the exact solution cannot be found, any approximation algorithm that achieves a reasonable bound is also acceptable.