
I have a sparse $60000\times10000$ matrix in which each element is either a $1$ or a $0$, as follows.

$$M=\begin{bmatrix}1 & 0 & 1 & \cdots & 1 \\1 & 1 & 0 & \cdots & 1 \\0 & 1 & 0 & \cdots & 1 \\\vdots & \vdots & \vdots & \ddots & \vdots \\0 & 0 & 0 & \cdots & 0 \end{bmatrix}$$

Each row represents a different point in time, such that the first row is $t=1$ and the last row is $t=60000$. Each column in the matrix is a different combination of signals (i.e., of $1$s and $0$s). I want to choose five column vectors $c_1, c_2, c_3, c_4, c_5$ from $M$ and take their Hadamard (i.e., element-wise) product:

$$s = c_1 \circ c_2 \circ c_3 \circ c_4 \circ c_5$$

The result is a new vector $s$ in which a $1$ remains in a row only if every one of $c_1, c_2, c_3, c_4, c_5$ has a $1$ in that row. I need an algorithm that finds the strategy vector $s$ that is most similar to the optimal strategy vector $s^{*}$. The optimal strategy vector $s^{*}$ is fixed and has the same dimensions as $s$.
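To make the operation concrete, here is a minimal NumPy/SciPy sketch of the column product; the random matrix, its density, and the column indices are stand-ins for illustration only:

```python
import numpy as np
from scipy import sparse

# Stand-in for the 60000 x 10000 sparse 0/1 matrix M.
M = (sparse.random(60000, 10000, density=0.05, format="csc") > 0).astype(np.int8)

def column_product(M, cols):
    """Hadamard product of the selected columns, as a dense 0/1 vector s."""
    s = np.ones(M.shape[0], dtype=np.int8)
    for j in cols:
        # A 1 survives in s only where every chosen column has a 1.
        s = s * M[:, j].toarray().ravel()
    return s

s = column_product(M, [3, 17, 256, 1024, 9999])  # hypothetical choice of c_1..c_5
```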

The easiest solution is to test every possible combination; however, choosing $5$ columns from $10000$ with repetition allowed gives on the order of $10^{20}$ ordered selections. Note that the five vectors chosen from $M$ need not be distinct. As such, I want to find an algorithm that returns the five vectors $c_{1}$ through $c_{5}$ whose product $s$ is as close as possible to $s^{*}$.
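As a quick sanity check of the scale (my own arithmetic, not part of the problem statement): since $c\circ c=c$, repeats and order don't change the product, so the effective search space is the set of unordered $5$-subsets, which is still far too large to enumerate:

```python
from math import comb

print(10_000 ** 5)      # ordered choices with repetition: 10^20
print(comb(10_000, 5))  # unordered distinct choices: about 8.3e17
```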

I am not sure how to approach this problem, but I believe it is a discrete optimisation problem in which I want to maximise the similarity $\operatorname{sim}(s,s^{*})$. I am defining the similarity using the Jaccard Index below, as it rewards $1$s in the correct places but gives no credit for shared $0$s. This suits me because I'd rather have a false negative than a false positive. The metric is bounded between $0$ and $1$, where $1$ denotes maximum similarity. It is commonly used for binary vectors, so it seems ideal here, as each element is either a $1$ or a $0$.

$$\operatorname{sim}(s,s^{*})=\frac{s \cdot s^{*}}{\sum_{i=1}^{60000}\left (s_{i} + s_{i}^{*} \right ) - s \cdot s^{*}}$$

Note that the operation ($\cdot$) refers to the dot product and not the Hadamard product.
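As a minimal sketch, this similarity can be computed for 0/1 NumPy vectors as follows; the convention for two all-zero vectors is my own choice (per the comments below, $s^{*}\neq\vec{0}$ in practice):

```python
import numpy as np

def jaccard(s, s_star):
    """Jaccard index |intersection| / |union| for 0/1 vectors."""
    inter = int(s @ s_star)                        # ones shared by both vectors
    union = int(s.sum()) + int(s_star.sum()) - inter
    return inter / union if union > 0 else 1.0     # convention: sim(0, 0) := 1
```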

I posted a similar question before but it was removed for lack of context. I have added more context now but let me know if anything is unclear. I am asking this question to aid me in a personal project I am working on.

FYI, the algorithm will be run on a computer. Any help is much appreciated -- thanks!

  • Shouldn't $s\circ s=s$? Will $s^*$ also only have $0,1$ as entries? – Sudix Aug 02 '21 at 15:59
  • You are right for the first question. I meant to write the sum of all elements in $s$. I'll fix that. $s^{*}$ is derived from elementwise operation of the other vectors (which exclusively have $0$s and $1$s) so it too only has $0$, $1$ entries. – Alex Pharaon Aug 02 '21 at 16:12
  • The core part of your question lies within the chosen similarity measure (which, formally speaking, isn't a metric). If we had a classic metric (e.g. the Euclidean or taxicab metric) and were to minimize the distance of $s$ and $s^*$, there would most likely exist results already, and with them algorithms. Therefore it's critical to make sure the measure you chose actually does what you want. Two points here are: 1. Is the non-linearity essential? 2. What do you do if $s^*=\vec{0}$? – Sudix Aug 02 '21 at 16:55
  • For my purposes, $s^*\neq\vec{0}$. In regards to the metric @Sudix, I first saw it on the article link and upon further research now see that it is known as the Jaccard Index/Similarity (although in the paper they refer to it as the Tanimoto coefficient). This metric is regarded as better for 'binary vectors' (i.e., only 1s and 0s). – Alex Pharaon Aug 02 '21 at 18:00
  • Your dimensions seem to be a bit off. You start with a $100000 \times 60000$ matrix, then say it has $60000$ rows and that you are choosing five columns from $10000$ (not $100000$). – prubin Aug 03 '21 at 19:02
  • Good catch @prubin, it should say a $60000 \times 10000$ matrix. – Alex Pharaon Aug 03 '21 at 21:39

2 Answers


Let $m:=10000$ (the number of columns) and $n:=60000$ (the number of rows).

We'll show that just deciding whether or not there exists an $s_\text{OPT}$ with $s_\text{OPT}=s^*$ is NP-hard.

Let $s_\text{OPT} :=c_1 \circ c_2 \circ c_3 \circ c_4 \circ c_5$ denote the optimal solution, and $c_1,...,c_5$ a combination that produces this optimal solution.

If for any $i$ we have $(c_i)_x=0$, then $(s_\text{OPT})_x=0$. Therefore, if $s_\text{OPT}=s^*$, then whenever $s^*_x=1$, it must hold for all $i$ that $(c_i)_x=1$.

Let $M_{\downarrow i}$ denote the $i$-th column of $M$.

So if, for each column $M_{\downarrow i}$, we look at the index set $S_i\subseteq \{1,\dots,n\}$ that records which entries of $M_{\downarrow i}$ are $0$, then the Hadamard product of columns corresponds to the union of their index sets.

Formally:
For each column $M_{\downarrow i}$ define $S_i$ via the equivalence $\forall x \in \{1,...,n\}:\quad x\in S_i\Leftrightarrow (M_{\downarrow i})_x=0$.

Then the following holds: $$ (M_{\downarrow i}\circ M_{\downarrow j} )_x=0 \Leftrightarrow x\in (S_i \cup S_j). $$
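A tiny numerical check of this correspondence (the vectors are chosen arbitrarily for illustration):

```python
import numpy as np

c_i = np.array([1, 0, 1, 1, 0])
c_j = np.array([1, 1, 0, 1, 0])

S_i = set(np.flatnonzero(c_i == 0))    # zero positions of c_i
S_j = set(np.flatnonzero(c_j == 0))    # zero positions of c_j

zeros_of_product = set(np.flatnonzero(c_i * c_j == 0))
assert zeros_of_product == S_i | S_j   # zeros of the product = union of zero sets
```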

Using this mapping of our column vectors to sets, we can now reduce this problem to a known one (technically I'd have to show the reduction the other way around, but the reduction in the other direction works similarly):

Let $S_*$ be the set corresponding to $s^*$.

If we assume that $s_\text{OPT}=s^*$, then we only have to look at columns that have a one everywhere $s^*$ has a one, so we discard all sets $S_i$ with $S_i \not\subseteq S_*$.

We now have an instance of Maximum $k$-Set Cover, a variant of the Set Cover Problem in which we want to choose at most $k$ sets whose union is as large as possible.

This problem, however, is $W[2]$-hard when parameterized by $k$, so no algorithm that is efficient in this parameter is expected to exist; see this link.

If you want an exact algorithm, your best chance is to look for an FPT algorithm whose running time is exponential only in $\|s_\text{OPT}-s^*\|_1$.

Otherwise you'll probably have to live with a heuristic. While not exactly your problem, I think the ideas of heuristics for the Maximum Coverage Problem should be somewhat transferable.
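For instance, a greedy heuristic in the spirit of those maximum-coverage methods might look like the sketch below. The scoring rule (re-evaluating the Jaccard similarity after tentatively adding each column) and all names are my assumptions rather than anything from the paper, $M$ is assumed to be a dense 0/1 NumPy array, and the inner loop over all columns would need vectorizing or pruning at the full $60000\times10000$ scale:

```python
import numpy as np

def jaccard(a, b):
    inter = int(a @ b)
    union = int(a.sum()) + int(b.sum()) - inter
    return inter / union if union > 0 else 1.0

def greedy_columns(M, s_star, k=5):
    """Pick k columns one at a time; at each step keep the column whose
    product with the running vector s is most similar to s_star."""
    s = np.ones(M.shape[0], dtype=np.int64)
    chosen = []
    for _ in range(k):
        scores = [jaccard(s * M[:, j], s_star) for j in range(M.shape[1])]
        j_best = int(np.argmax(scores))
        chosen.append(j_best)        # re-picking a column is harmless: c∘c = c
        s = s * M[:, j_best]
    return chosen, s
```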

Sudix
  • Thanks for your response and suggestions @Sudix, I'll take a deeper look into the links you provided a bit later. I'm a bit confused about some of the things you wrote. First, why do we only need to look at rows that have a one everywhere $s^*$ has a one to find the exact solution? Wouldn't one need to be sure that other rows in $s_\text{OPT}$ don't also have ones? Also, you mentioned that the Set Cover Problem seeks to find the biggest union. Am I not trying to maximise the intersection here? – Alex Pharaon Aug 03 '21 at 13:22
  • For your first point, you're right. I was a bit too imprecise there. For your second point: If we were looking at the ones, then we'd look at the maximum intersection. However, if we're looking at zeroes, then if there is just one component with a zero at index $x$, then there's a zero in the final Hadamard product as well, so we're looking at the union instead. – Sudix Aug 03 '21 at 14:23
  • Having taken a better look at your answer now @Sudix, there are still some things I don't fully understand. First, why are you creating sets along each row? Should not each column be a set, as I want to choose 5 different columns? Also, here you take the Hadamard product of two different rows: $(M_i\circ M_j )_x=0 \Leftrightarrow x\in (S_i \cup S_j)$. Should it not be of different columns? – Alex Pharaon Aug 05 '21 at 09:04
  • Yes, you're right. Seems I misread "column" as "row" in your question – Sudix Aug 05 '21 at 12:43
  • I'm having difficulties seeing how to apply the concepts from the maximum k-set cover problem, as it will never be the case that $s_\text{OPT}=s^{*}$ for me. I merely want to find the $s_\text{OPT}$ most similar to $s^{*}$ as defined by the equation above (or by an easier one if needed, like $\|s-s^{*}\|$). – Alex Pharaon Aug 08 '21 at 23:05
  • Yes, that was to be expected. The point of the $k$-set-cover problem was that we already know that it's a hard problem, and we used it to show that your problem is hard as well. This tells us then that no matter how much time we look for algorithms that solve the problem, we won't find an algorithm that always runs in polynomial time (for the parameter "number of columns"). Since your number of columns is high, this then implies that you'll probably have to accept some kind of compromise: The algorithm either has a long running time, solves only a small subset of instances, or is a heuristic. – Sudix Aug 09 '21 at 03:55
  • Thanks for your response @Sudix, a heuristic seems like the best way to go for my problem. Heuristics from the maximum coverage problem would seek to maximise the number of zeros produced in the union of the sets without any regard to whether a 0 should be in a row or not, as defined by $s^*$. Incorporating a cost of choosing a set as a function of the number of misplaced 0s seems reasonable but I am not too sure how I would do such an adaptation to the heuristic algorithm... Whether you know how to do such a thing or not, I am grateful for all the help you have already given me :) – Alex Pharaon Aug 09 '21 at 12:32
  • The methods given in the paper I linked are more or less standard heuristic methods: Greedy Search, k-swap methods, and tabu search methods. At least the first two, but more probably all of them, can be readily adapted to your problem. I'd start with the greedy approach and test how fast it is. If it's already borderline slow, most other heuristics will be too slow. Once you've picked a general method for a heuristic, it's up to you to optimize its average speed and error, e.g. by sorting your columns by some criterion and/or discarding irrelevant columns. – Sudix Aug 09 '21 at 16:40

You can formulate this as a binary linear program and exploit sparsity if you are willing to settle for a somewhat different objective function (your chosen similarity measure being nonlinear).

Let $O_i = \lbrace j : M_{ij} = 1\rbrace$ be the set of column indices where row $i$ has a one. Let $x_j\in \lbrace 0, 1 \rbrace$ be 1 if column $j$ is selected and 0 if not, and let $s_i \in [0,1]$ for each row index $i$. (You can make the $s_i$ binary or not; it's hard to say which way will work better with respect to solver time. Note, though, that with continuous $s_i$ and a maximizing objective, a row where only four chosen columns are 1 can take $s_i = 4/5$, so binary $s_i$ is the safer choice for exactness.) Now add the constraints
\begin{align*} \sum_{j}x_{j} & =5\\ \sum_{j\in O_{i}}x_{j} & \le4+s_{i}\quad\forall i\\ \sum_{j\in O_{i}}x_{j} & \ge5s_{i}\quad\forall i. \end{align*}
The first says that you have to select exactly five columns. The second says that if $s_i$ is to be 0, you cannot select more than four columns with value 1 in row $i$. The third says that if $s_i$ is to be 1, you must choose five columns that are all 1 in row $i$. Note that the number of constraints is approximately twice the number of rows, and the only dense constraint is the first one.

That brings us to the objective function. If you want to maximize the number of ones in $s^*$ that get matched, you can just maximize $$\sum_{i:s^*_i=1}s_i.$$ If you want to minimize the number of times $s^*$ is 0 but $s$ is 1 (false positives), you can minimize $$\sum_{i:s^*_i=0} s_i.$$ To minimize $\|s-s^*\|_1$, just minimize $$\sum_{i:s^*_i=0}s_i + \sum_{i:s^*_i=1}(1-s_i).$$
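As an illustration, here is one way this model could be written with PuLP; the variable names and the choice of the $\|s-s^{*}\|_1$ objective are my assumptions, not part of the answer above, and I make the $s_i$ binary for safety:

```python
import numpy as np
import pulp

def build_model(M, s_star, k=5):
    """Binary LP: x_j = 1 iff column j is chosen; s_i is linked to the
    Hadamard product of the chosen columns by the two row constraints."""
    n_rows, n_cols = M.shape
    prob = pulp.LpProblem("column_selection", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x_{j}", cat="Binary") for j in range(n_cols)]
    s = [pulp.LpVariable(f"s_{i}", cat="Binary") for i in range(n_rows)]

    prob += pulp.lpSum(x) == k                              # choose exactly k columns
    for i in range(n_rows):
        O_i = [j for j in range(n_cols) if M[i, j] == 1]    # columns with a 1 in row i
        prob += pulp.lpSum(x[j] for j in O_i) <= (k - 1) + s[i]
        prob += pulp.lpSum(x[j] for j in O_i) >= k * s[i]

    # L1 objective: s_i = 1 where s*_i = 0 costs 1; s_i = 0 where s*_i = 1 costs 1.
    prob += pulp.lpSum(s[i] if s_star[i] == 0 else 1 - s[i]
                       for i in range(n_rows))
    return prob, x, s

# Usage sketch: prob.solve(), then read pulp.value(x[j]) for the chosen columns.
```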

prubin
  • Would the final function to minimise not be $\sum_{i:s^*_i=0}s_i - \sum_{i:s^*_i=1}s_i$ instead, since $\sum_{i:s^*_i=1}s_i$ is not bounded by 0 and 1 @prubin? Also, to clarify, is $s_i$ just whether $s^{*}=0$ at row $i$? If so, then does it work to have your second and third constraints for ALL rows $i$ when there is no solution $s=s^{*}$ in my problem? – Alex Pharaon Aug 05 '21 at 09:23