4

I am an applied math undergraduate student. On my project, I come across an integer linear programming question as follow:

Given $x_0,x_1,...,x_n$: $\forall$ i $\in$ [0,n], $x_i$ = 0 or 1

min Z = $\sum_{i=0}^n x_i $

with m numbers of constraints in the format $\sum x_j \ge 1$: j $\in$ [0,n].

I know this is generally a problem with NP hardness but I was wondering if there is an effective algorithm to solve this problem.

Kuifje
  • 9,802
  • This appears to be a form of Set Cover problem. The $n+1$ variables $x_i$ play the role of the sets, and the $m$ constraints are the points to be covered. Choosing the minimum size subset of variables (so that at least one appears in each constraint) amounts to solving your problem. – hardmath Oct 28 '16 at 23:41
  • Yes but is there a P algorithm for this special case? I have around 100k of variables and 400k of constraints and if I use NP algorithm it would nearly be impossible for the code to terminate – Dat Tran Oct 29 '16 at 00:29
  • I posted my answer, that for constraints of the general type you describe, the minimization of a binary linear program is NP-hard. It remains to consider if the constraints in your application have a special structure that makes the problem tractable. – hardmath Oct 29 '16 at 19:29
  • If it is impractical to describe any special structure that your constraints might have, perhaps you can instead discuss the importance of an exact solution as opposed to an approximate one, for the application you have in mind. – hardmath Oct 30 '16 at 19:47
  • That is a great suggestion. I would keep the question open so that I can continue to receive suggestions from others. I am current studying about minimum flow since I haven't figure out how network simplex could give me an integer solution. – Dat Tran Oct 31 '16 at 02:29
  • I certainly don't want the Question to be removed or closed; it will likely be useful to future Readers. But my two previous Comments convey that without more specifics of one kind or the other, little is possible in the way of helping you further. Possibly you should post a new Question that links to this one and adds those further details about the intended application. – hardmath Nov 01 '16 at 16:08
  • I am currently studying the structure of the coefficient matrix of the constraints in my problem. Thank you for the help. However I would like to use the proof of NP hardness for this problem, do you want me to refer the proof to you as the author? @hardmath – Dat Tran Nov 03 '16 at 02:05
  • Keep in mind my Answer details that your binary integer program (in the full generality you stated) is equivalent to Minimum Set Cover, the NP-hardness of which is known (see the Wikipedia article for some references, or I can add some to my Answer). Use of my Answer is licensed by Creative Commons 3.0 with attribution required. Attribution can be satisfied by providing the URL either to my Answer or to your Question. Referring to me as "hardmath" is fine. – hardmath Nov 03 '16 at 05:51

2 Answers2

2

Actually, this is not a difficult problem, it can be solved efficiently by interpreting it as a network flow problem. Suppose you have the following problem: $$ \min \; x_1+x_2+x_3 $$ subject to $$ x_1+x_2 \ge 1\\ x_1+x_3 \ge 1 \\ x_1,x_2,x_3 \in \{0,1\} $$

Now, create the following graph:enter image description here

Add a unitary cost on arcs $(s,1), (s,2), (s,3)$, and impose that the flow $f$ on all arcs ending in node $t$ is at least equal to $1$. Use any minimum cost flow algorithm to solve the problem and you are done.


EDIT $1$: There is a contradiction between this answer and @hardmath's answer: one of the two is not possible. The problem cannot be NP-hard and be solved in polynomial time. I suspect there is something wrong with my flow model, but I cannot figure out why. Ahuja and Magnanti (Networks Flows) have proposed shortest path algorithms for this problem if there are consecutive 1's in the columns of the matrix, but not for the general case. This is a hint that I believe something must be wrong with my flow model for the general case. Below is their network flow model if there are consecutive 1's in the columns of the matrix. See their book for more details.

enter image description here


EDIT $2$:

As @Erwin Kalvelagen pointed out, the above network flow model is wrong. The reason for that is that the inflow in node $s$ does not equal the objective function (the number of active variables). It counts the total number of times each variable is used, which is not what we want.

In summary:

  1. As proved below by @hardmath, the problem is NP-hard, so in general, no polynomial time algorithm can solve it (unless $P=NP$).
  2. You have two options left:
    • you can use approximation algorithms as mentioned by @hardmath
    • if your set covering problem has a particular type of structure, specifically, if all $1's$ are consecutive in the matrix columns, then the problem is no longer NP-hard and you can use the approach described above in Ahuja and Magnanti's book.
Kuifje
  • 9,802
  • The problem is that I have about 100k variables and 400k constraints so I want to find a more effective algorithm than the tree flow which would require exponential time running. – Dat Tran Oct 29 '16 at 00:19
  • Nope, you can do this in polynomial time. And even the network simplex algorithm should be efficient here although its computational complexity is exponential. Such algorithms are available in the Networkx (Python) library, see here: https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.flow.html. – Kuifje Oct 29 '16 at 00:25
  • Consider the slightly different problem $\min Z = x_1 + x_2 + x_3$ where $$x_1 + x_2 \ge 1$$ $$x_1 + x_3 \ge 1$$ $$x_2 + x_3 \ge 1$$ The optimal solution among nonnegative $x_1,x_2,x_3 \ge 0$ is $Z = 3/2$, but the optimal solution among binary integers is $Z = 2$. The continuous relaxation does not necessarily produce integer solutions. – hardmath Oct 30 '16 at 00:55
  • With my approach, I do not solve the continuous relaxation of the linear problem. I solve the linear relaxation of the network flow model associated to the initial linear problem. – Kuifje Oct 30 '16 at 16:27
  • may I know how did you create your graph? (i.e what software/app, Tikz?) – Anonymous Oct 30 '16 at 17:13
  • 1
    @Anonymous I used Tikz :) – Kuifje Oct 30 '16 at 17:19
  • The integrality I am familiar with for network flow models requires total unimodularity of the constraint matrix. One cannot adjoin arbitrary further constraints without risk of breaking that property. – hardmath Oct 30 '16 at 17:39
  • I think integrality actually isn't the point here, because there are tons of polynomial algorithms that solve the( integer version of the) network flow problem. I think the model itself is what should be focused on, and not which polynomial time running algorithm one should pick to solve it. The debate on the integrality property assumes one has chosen to solve it with by considering the continuous relaxation of the network flow model. – Kuifje Oct 30 '16 at 17:48
  • TiKz requires a bit of effort, (+1) for it. – Anonymous Oct 30 '16 at 18:29
  • 2
    I am not sure this network correctly implements the set covering problem. I guess the outflow at t is the number of sets (equations, i.e. 2). What is the inflow at s? A variable inflow indicating the number of active variables? That would be 1. – Erwin Kalvelagen Oct 31 '16 at 03:39
  • @ErwinKalvelagen: Yes exactly, the inflow at $s$ would be the number of active variables, i.e. $1$ here (variable $x_1$). – Kuifje Oct 31 '16 at 13:06
  • 2
    So where is this extra flow coming from? One in at s and two out at t. – Erwin Kalvelagen Oct 31 '16 at 13:08
  • 1
    You are absolutely right. This is where the problem comes from. This network flow model does not minimize the number of active variables. Thank you Erwin for finding the error here. Your expertise is definitely valuable. – Kuifje Oct 31 '16 at 13:22
2

The reduction of your problem to an instance of SetCover is perhaps clear to you. We will show that the reverse is also possible, to formulate any SetCover instance as a binary integer program of the kind you are asking about, and this will establish that the resulting minimization computation is NP-hard.

A good survey of approximate methods for such problems (including rounding from linear programming) is given by Fatema Akhter (2015), "A Heuristic Approach for Minimum Set Cover Problem", where a modified hill climbing algorithm is proposed and its performance compared on a library of SetCover problems.

First let's clear up the notation used for the constraints, because what was written in the Question: $$m \text{ numbers of constraints in the form } \sum x_j \ge 1: j \in [0,n]$$

does not adequately distinguish among different constraints. For each constraint there should be a subset $C_k \subseteq \{0,\ldots,n\}$ of indices over which the sum is taken:

$$ \sum_{j \in C_k} x_j \ge 1 $$

For convenience we will refer to this latter constraint by its index subset $C_k$, so that the collection of $m$ constraints is $\mathscr{U} = \{C_1,\ldots,C_m\}$.

This $\mathscr{U}$ will be the universe of our SetCover instance, and the family of sets from which a minimum cover will be found is $\mathscr{S} = \{S_0,S_1,\ldots,S_n\}$ where:

$$ S_i = \{ C_k : x_i \in C_k \} \text{ for } i=0,1,\ldots,n $$

Note that the sets $S_i$ are in one-to-one correspondence with your variables $x_i$. Let a minimum cover $\mathscr{S}' \subseteq \mathscr{S}$ of the universe be found, and set $x_i = 1$ if and only if $S_i \in \mathscr{S}'$, $x_i = 0$ otherwise. A little thought should convince us this assignment minimizes the objective:

$$ Z = \sum_{i=0}^n x_i $$

subject to the constraints $C_k$, which amount requiring each subset $C_k$ contains at least one variable $x_i$ which is assigned a value $1$.

In other words we have described how to choose which variables $x_i=1$ so that their sum $Z$ is minimized.

With a little further thought and explanation, this reduction can be reversed so than any SetCover instance can be expressed as a similar binary integer programming problem. Going in this direction we are given a finite universe $\mathscr{U}$ which we suggestively write as a set of $m$ points:

$$ \mathscr{U} = \{c_1,\ldots,c_m\} $$

and a finite nonempty family of subsets $\mathscr{S} = \{S_0,S_1,\ldots,S_n\}$ whose union is all of $\mathscr{U}$.

Define binary "indicator" variables $x_i \in \{0,1\}$ corresponding to the subsets $S_i$, so that for any choice of subfamily $\mathscr{S}' \subseteq \mathscr{S}$:

$$ x_i = \begin{cases} 1 & \text{if } S_i \in \mathscr{S}' \\ 0 & \text{otherwise} \end{cases} $$

Now as before we ask to minimize $Z = \sum_{i=0}^n x_i$ subject to the $m$ constraints:

$$ \sum_{i \in C_k} x_i \ge 1 $$

for $k = 1,\ldots,m$ where $C_k = \{ i : c_k \in S_i \}$.

A feasible solution $x_0,x_1,\ldots,x_n$ provides us a cover $\mathscr{S}'$ of the universe through the settings of indicator variables $x_i$. That is, since every constraint is satisfied, each $c_k \in \mathscr{U}$ will belong to at least one $S_i \in \mathscr{S}'$. A solution that minimizes $Z$ will at the same time minimize the size of $\mathscr{S}'$, since $Z = |\mathscr{S}'|$.


Having shown that your problem, at least in the generality described above, is NP-hard, it may be of interest to point out algorithms for approximate solutions to SetCover which produce in polynomial-time, covers that are within some factor of the true minimum cover size.

In the context of a universe of size $m$, the greedy algorithm (choose at each step a subset $S_i$ which contains the most points not yet covered) produces a solution whose size is at most $\log m$ times the true minimum size.

The Wikipedia article on SetCover which we linked above gives reference to a fairly recent paper by Dinur and Steurer (2013) which proves that polynomial-time algorithms cannot give much better approximation factors unless P = NP. This is the culmination of various investigations, showing "that the greedy algorithm is essentially the best-possible polynomial time approximation algorithm for set cover to lower order terms."

hardmath
  • 37,715
  • There is a contradiction between your answer and mine. You proved that the problem is NP-hard, and I proposed a polynomial time running algorithm for the problem. One of us is wrong, unless P=NP. – Kuifje Oct 29 '16 at 20:52
  • @Kuifje: One omission in your Answer is the distinction between continuous and discrete solutions. I will dig up an earlier post that shows how the continuous optimum can be quite far from a discrete (restricted to lattice) optimum. Anyway, something to think about. – hardmath Oct 29 '16 at 21:00
  • Good point, but as a flow problem, the continuous relaxation of the problem should give an integer solution. – Kuifje Oct 29 '16 at 21:01
  • I am trying to read through both of these solutions since I don't have a lot of background on this. Can you explain how the flow problem can give you integer solution? @Kuifje – Dat Tran Oct 30 '16 at 00:53
  • This is called the integrality property and is well known for such network flow models. See for example https://www.amazon.com/Network-Flows-Theory-Algorithms-Applications/dp/013617549X – Kuifje Oct 30 '16 at 16:29
  • This being said there is still a contradiction between both answers: one of the two is not possible. The problem cannot be NP-hard and be solved in polynomial time (or else we are millionaires). I suspect there is something wrong with my flow model, but I cannot figure out why. Ahuja and Magnanti (Networks Flows) have proposed shortest path algorithms for this problem if there are consecutive 1's in the columns of the matrix, but not for the general case. This is a hint that I believe you are right. – Kuifje Oct 30 '16 at 17:27
  • 1
    The consecutive one's case results in a totally unimodular matrix, and the LP optimal BFS is thus integral. When the matrix isn't totally unimodular, you can end up with fractional optimal solutions. I belief that's where Kuifje's argument breaks down. – Brian Borchers Oct 30 '16 at 17:51
  • @BrianBorchers Again, this only shows that the continuous relaxation of the model will not give an integral solution. It does not show that any other minimum cost flow algorithm does not correctly solve the model. – Kuifje Oct 30 '16 at 18:20