
Question:

Is there an established procedure or theory for generating code that efficiently performs a matrix-vector multiplication when the matrix is dense and filled with only zeros and ones? Ideally, the optimized code would make systematic use of previously computed information to reduce duplicated work.

In other words, I have a matrix $M$ and I want to do some precomputation based upon $M$, that will make computing $Mv$ as efficient as possible when I later receive the vector $v$.

$M$ is a rectangular dense binary matrix that is known at "compile time", whereas $v$ is an unknown real vector that is only known at "run time".

Example 1: (sliding window)

Let me use an easy small example to illustrate my point. Consider the matrix, $$M = \begin{bmatrix}1 & 1 & 1 & 1 & 1\\ & 1 & 1 & 1 & 1 & 1 \\ & & 1 & 1 & 1 & 1 & 1\\ & & & 1 & 1 & 1 & 1 & 1\end{bmatrix}.$$ Suppose we apply this matrix to a vector $v$ to get $w = Mv$. Then the entries of the result are \begin{align} w_1 &= v_1 + v_2 + v_3 + v_4 + v_5\\ w_2 &= v_2 + v_3 + v_4 + v_5 + v_6\\ w_3 &= v_3 + v_4 + v_5 + v_6 + v_7\\ w_4 &= v_4 + v_5 + v_6 + v_7 + v_8 \end{align}

Doing a standard matrix-vector multiplication will compute exactly this way. However, a lot of this work is redundant. We could do the same matrix computation at less cost by keeping track of a "running total", and adding/subtracting to get the next number: \begin{align} w_1 &= v_1 + v_2 + v_3 + v_4 + v_5\\ w_2 &= w_1 + v_6 - v_1\\ w_3 &= w_2 + v_7 - v_2\\ w_4 &= w_3 + v_8 - v_3 \end{align}
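A minimal sketch of this running-total evaluation (Python/NumPy; the window length and sizes are just those of this example):

```python
import numpy as np

def sliding_window_apply(v, window=5, rows=4):
    """Apply the 'sliding window' 0-1 matrix of this example to v using a
    running total: row i sums v[i], ..., v[i+window-1], and each new row
    reuses the previous total (O(rows + window) work instead of O(rows*window))."""
    w = np.empty(rows)
    total = v[:window].sum()                   # w_1 = v_1 + ... + v_5
    w[0] = total
    for i in range(1, rows):
        total += v[i + window - 1] - v[i - 1]  # slide the window by one entry
        w[i] = total
    return w

# sanity check against the dense product
M = np.array([[1 if 0 <= j - i < 5 else 0 for j in range(8)] for i in range(4)])
v = np.random.rand(8)
assert np.allclose(sliding_window_apply(v), M @ v)
```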

Example 2: (hierarchical structure)

In the previous example, we could just keep track of a running total. However, usually one would need to create and store a tree of intermediate results. For example, consider $$M = \begin{bmatrix}1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1 \\ & & & & 1 & 1 & 1 & 1\\ 1 & 1 \\ & & & & 1 & 1 \\ & & 1 & 1 \\ & & & & & & 1 & 1\end{bmatrix}$$ One could compute $w = Mv$ efficiently using a tree of intermediate results:

  1. Compute $w_5$ and $w_7$, and add them to get $w_3$.
  2. Compute $w_4$ and $w_6$, and add them to get $w_2$.
  3. Add $w_2$ and $w_3$ to get $w_1$ (a code sketch of this tree follows the list).
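A minimal sketch of this tree of partial sums (Python/NumPy, 0-based indices; the dense matrix is written out only to check the result):

```python
import numpy as np

def apply_example2(v):
    """Apply the 7x8 matrix of this example via a tree of partial sums
    (7 additions instead of the 17 of the naive product)."""
    w = np.empty(7)
    w[4] = v[4] + v[5]   # w_5 = v_5 + v_6
    w[6] = v[6] + v[7]   # w_7 = v_7 + v_8
    w[2] = w[4] + w[6]   # w_3 = w_5 + w_7
    w[3] = v[0] + v[1]   # w_4 = v_1 + v_2
    w[5] = v[2] + v[3]   # w_6 = v_3 + v_4
    w[1] = w[3] + w[5]   # w_2 = w_4 + w_6
    w[0] = w[1] + w[2]   # w_1 = w_2 + w_3
    return w

M = np.array([[1,1,1,1,1,1,1,1],
              [1,1,1,1,0,0,0,0],
              [0,0,0,0,1,1,1,1],
              [1,1,0,0,0,0,0,0],
              [0,0,0,0,1,1,0,0],
              [0,0,1,1,0,0,0,0],
              [0,0,0,0,0,0,1,1]])
v = np.random.rand(8)
assert np.allclose(apply_example2(v), M @ v)
```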

The structure in the examples above is easy to see, but for the actual matrices I'm interested in, the structure is not so simple.

Example 3: (low rank)

To clear up some confusion: the matrices are generally not sparse. In particular, a method for this problem needs to be able to find efficient ways to apply matrices in which large blocks are filled with ones. For example, consider

$$M = \begin{bmatrix}1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & & \\ 1 & 1 & 1 & 1 & & \\ 1 & 1 & 1 & 1 & & \end{bmatrix}.$$

This matrix can be decomposed as a difference of two rank-1 matrices, $$M = \begin{bmatrix}1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1\end{bmatrix} - \begin{bmatrix}& & & & & \\ & & & & & \\ & & & & 1 & 1 \\ & & & & 1 & 1 \\ & & & & 1 & 1\end{bmatrix}$$

so its action on a vector $w := Mv$ can be computed efficiently by, \begin{align} w_1 &= v_1 + v_2 + v_3 + v_4 + v_5 + v_6 \\ w_2 &= w_1 \\ w_3 &= w_2 - v_5 - v_6 \\ w_4 &= w_3 \\ w_5 &= w_4. \end{align}
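A corresponding sketch (Python/NumPy) that applies $M$ through this rank-1 decomposition instead of entry by entry:

```python
import numpy as np

def apply_example3(v):
    """Apply the 5x6 matrix of this example as (all-ones matrix) minus
    (ones in rows 3-5, columns 5-6), so only one full sum is ever formed."""
    total = v.sum()        # action of the all-ones matrix: every w_i gets this
    corner = v[4] + v[5]   # v_5 + v_6, the part subtracted from rows 3-5
    w = np.full(5, total)
    w[2:] -= corner
    return w

M = np.array([[1,1,1,1,1,1],
              [1,1,1,1,1,1],
              [1,1,1,1,0,0],
              [1,1,1,1,0,0],
              [1,1,1,1,0,0]])
v = np.random.rand(6)
assert np.allclose(apply_example3(v), M @ v)
```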

Motivation:

I'm working on a numerical method for some image processing, and there are several large dense $0-1$ matrices with different structures that are fixed for all time. Later these matrices will need to be applied to many unknown vectors $v_i$ that will depend on the user's input. Right now I'm using pencil-and-paper to come up with efficient code on a per-matrix basis, but I'm wondering if the process can be automated.

Edit: (postscript)

All of the answers here so far (as of 9/5/15) are interesting, but none answer the question as satisfactorily as I had hoped. Probably it turns out that this is a hard research question, and no one knows a good answer.

Since time has run out, I am awarding the bounty to EvilJS's answer, since it addresses the right question. However, I wish the answer contained clearer and more detailed explanations.

tranisstor's answer makes a connection between this question and the Online Boolean Matrix-Vector Multiplication (OMv) problem, but the connection is not exactly what this question is asking. In particular, the following assumption doesn't really fit (bold emphasis mine),

Now assume that for all $n \leq n_0$ and **all $n \times n$ matrices $M$** we know an algorithm $A_{n,M}$, that for all vectors $v$ computes $Mv$ in truly subquadratic time, i.e. in time $O(n^{2 - \varepsilon})$ for some $\varepsilon > 0$.

Whether or not there exist subquadratic algorithms for all matrices is orthogonal to the question of finding an algorithm for a specific matrix that is as fast as possible. Most 0-1 matrices look like random noise and (if I were to guess) probably don't have subquadratic algorithms. However, the fact that there are really bad matrices out there doesn't prevent me from finding a fast algorithm on a good matrix, for example, a "sliding window" matrix.

vzn's answers (the first and the second below) are interesting (and in my opinion don't deserve so many downvotes), but they don't apply to the question, for reasons discussed in the comments there.

Nick Alger

4 Answers


This is related to an open research question, which is known as the "Online Boolean Matrix-Vector Multiplication (OMv) problem". This problem reads as follows (see [1]): Given a binary $n \times n$ matrix $M$ and $n$ binary column vectors $v_1, \dots, v_n$, we need to compute $M v_i$ before $v_{i+1}$ arrives.

Notice that the problem from the question is somewhat more general: It allows for $m \times n$ matrices and real-valued vectors. Observe that the problem with $n \times n$ matrices and Boolean vectors is "easier", as it presents a special case.

Clearly, the naïve algorithm for the Online Boolean Matrix-Vector Multiplication problem (which just uses standard matrix-vector multiplication) takes time $O(n^3)$. There is a conjecture (see e.g. [1]) that this cannot be done truly faster than $O(n^3)$. (In more detail, this conjecture goes as follows: There exists no truly subcubic algorithm that solves the Online Boolean Matrix-Vector Multiplication problem, i.e. there is no algorithm with running time $O(n^{3 - \varepsilon})$ for any $\varepsilon > 0$.)
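For concreteness, a minimal sketch of the naïve online algorithm in Python (Boolean semiring, i.e. AND for multiplication and OR for addition), which spends $O(n^2)$ per vector and $O(n^3)$ in total:

```python
def naive_omv(M, vectors):
    """Naive online Boolean matrix-vector multiplication: answer each M v_i
    (an OR of ANDs per entry) before looking at the next vector.
    Cost: O(n^2) per vector, O(n^3) for all n vectors."""
    n = len(M)
    answers = []
    for v in vectors:  # vectors arrive one at a time
        w = [any(M[i][j] and v[j] for j in range(n)) for i in range(n)]
        answers.append(w)  # this answer must be produced before the next vector is revealed
    return answers
```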

It is known that Williams's algorithm solves this problem in time $O(n^3 / \log^2 n)$. See [2] for more details.

It would be a breakthrough in the area of conditional lower bounds, if one could prove or disprove the above conjecture.

[1] Henzinger, Krinninger, Nanongkai, and Saranurak. "Unifying and Strengthening Hardness for Dynamic Problems via the Online Matrix-Vector Multiplication Conjecture." http://eprints.cs.univie.ac.at/4351/1/OMv_conjecture.pdf

[2] Williams. "Matrix-Vector Multiplication in Sub-Quadratic Time (Some Preprocessing Required)." http://dl.acm.org/citation.cfm?id=1283383.1283490

Update

One of the questions in the comments was as follows: We know $M$ at compile time. Can't we adjust our algorithm to suit $M$, so the OMv problem (conjecture) does not apply? We will see that this is not the case, unless the OMv conjecture fails.

The proof idea is simple: Assume we could give fast algorithms for all matrices up to a certain size (e.g. by distinguishing all possible cases). Beyond this size we use divide and conquer.

Here are the details:
Fix some $n_0 \in \mathbb{N}$, which (without loss of generality) is a power of 2 and bigger than 2. Now assume that for all $n \leq n_0$ and all $n \times n$ matrices $M$ we know an algorithm $A_{n,M}$, that for all vectors $v$ computes $Mv$ in truly subquadratic time, i.e. in time $O(n^{2 - \varepsilon})$ for some $\varepsilon > 0$. (Notice that this allows an individual algorithm for each matrix up to size $n_0 \times n_0$.)

Now we will solve OMv in truly subcubic time:
Given a binary matrix $M$ of size $n \times n$, where $n = 2^k$ for some $k$ and $n > n_0$, we use a divide and conquer strategy. We divide $M$ into four submatrices $M_1, M_2, M_3, M_4$ of sizes $2^{k-1} \times 2^{k-1}$. If $2^{k-1} \leq n_0$, then we use algorithm $A_{2^{k-1},M_i}$, otherwise, we recurse. (As $n_0$ is some fixed number, we can pick the correct algorithm in constant time.)
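A sketch of this divide-and-conquer reduction in Python (here `N0` and `base_algorithm` are hypothetical placeholders for the fixed threshold $n_0$ and the assumed per-matrix algorithms $A_{n,M}$; the stand-in simply falls back to the dense product):

```python
import numpy as np

N0 = 4  # the fixed threshold n_0 below which per-matrix algorithms A_{n,M} are assumed to exist

def base_algorithm(M):
    """Hypothetical lookup of the assumed truly subquadratic algorithm A_{n,M}
    for this particular small matrix; as a stand-in it just returns the dense product."""
    return lambda v: M @ v

def divide_and_conquer_apply(M, v):
    """Compute M v by splitting M into four half-size blocks and recursing,
    as in the reduction above (n is assumed to be a power of 2)."""
    n = M.shape[0]
    if n <= N0:
        return base_algorithm(M)(v)
    h = n // 2
    top = (divide_and_conquer_apply(M[:h, :h], v[:h])
           + divide_and_conquer_apply(M[:h, h:], v[h:]))
    bottom = (divide_and_conquer_apply(M[h:, :h], v[:h])
              + divide_and_conquer_apply(M[h:, h:], v[h:]))
    return np.concatenate([top, bottom])
```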

Notice that we will need at most $O(\log n)$ recursion steps. Also, for $n$ vectors $v_1, \dots, v_n$, we will need $n$ such computations. Thus, to process all matrix-vector multiplications we will need a total computation time of $O(n^{3 - \varepsilon} \log n)$.

It is well known that the logarithm grows slower than any polynomial (in particular slower than any root). Fixing some $\tilde \varepsilon > 0$ with $\tilde \varepsilon < \varepsilon$, we see that our total computation is running in truly subcubic time (in particular, in time $O(n^{3 - \tilde \varepsilon})$). Thus, the OMv conjecture would be wrong.

(If $M$ has size $m \times n$ and $m$ and $n$ are not powers of 2, then the bounds on the running times still apply, as we could just increase $n$ and $m$ to the next powers of 2.)

Conclusion: If you could make use of case distinctions on the input matrices to derive fast algorithms, then you could disprove the OMv conjecture.

tranisstor

If it is possible, try to exploit the banded (e.g. tridiagonal-like) structure of the matrix.
Otherwise, if the matrix contains only a constant number of distinct values (which is certainly true of a binary matrix), you should try the Mailman algorithm (by Edo Liberty and Steven W. Zucker, Yale University technical report #1402): it is optimized for matrices over a finite dictionary.
Common subexpression elimination (CSE) has been known for some time from Multiple Constant Multiplication, and going down to gate level is also an option; the patterns used there could serve as a solution on their own or be merged with other methods. See "Improving Common Sub-expression Elimination Algorithm with A New Gate-Level Delay Computing Method" by Ning Wu, Xiaoqiang Zhang, Yunfei Ye, and Lidong Lan, in Proceedings of the World Congress on Engineering and Computer Science 2013, Vol. II, WCECS 2013, 23-25 October 2013, San Francisco, USA (gate-level CSE).

There is also a crude but working method: generate a symbolic matrix with constants and a vector of variables, and feed the resulting expressions to a compiler's Static Single Assignment (SSA) based optimizations, which automates the process of writing the code for each matrix by hand.
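A rough sketch of that crude approach (Python emitting straight-line C; the helper name `emit_c_code` is made up for illustration, and the actual sharing of sub-sums is left to the compiler's SSA-based CSE/GVN passes):

```python
def emit_c_code(M, name="apply_M"):
    """Emit straight-line C code computing w = M v from a 0-1 matrix M
    (given as a list of rows). An optimizing C compiler can then run its
    SSA-based CSE/GVN passes over the result to share repeated sub-sums."""
    n = len(M[0])
    lines = [f"void {name}(const double *v, double *w) {{"]
    for i, row in enumerate(M):
        terms = [f"v[{j}]" for j in range(n) if row[j]]
        lines.append(f"    w[{i}] = {' + '.join(terms) if terms else '0.0'};")
    lines.append("}")
    return "\n".join(lines)

# example: the sliding-window matrix of the question
M = [[1 if 0 <= j - i < 5 else 0 for j in range(8)] for i in range(4)]
print(emit_c_code(M))
```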

New algorithm prototype
What you did with running sum: \begin{align} w_1 &= v_1 + v_2 + v_3 + v_4 + v_5 \\ w_2 &= w_1 + v_6 - v_1 \\ w_3 &= w_2 + v_7 - v_2 \\ w_4 &= w_3 + v_8 - v_3 \end{align}
This gives 10 operations, and my initial idea of using the Thomas algorithm is equivalent in cost.
For now I am still writing and testing the new algorithm (and the runtimes are nasty), but the first test gave me a surprising result:

\begin{align} tmp_1 &= v_2 + v_3 + v_4 + v_5 \\ w_1 &= v_1 + tmp_1 \\ w_2 &= tmp_1 + v_6 \\ w_3 &= w_2 + v_7 - v_2 \\ w_4 &= w_3 + v_8 - v_3 \end{align}
This gives 9 operations, counting each + or - as 1 operation and each = as 0.

For Example 3, your solution was:

\begin{align} w_1 &= v_1 + v_2 + v_3 + v_4 + v_5 + v_6 \\ w_2 &= w_1 \\ w_3 &= w_2 - v_5 - v_6 \\ w_4 &= w_3 \\ w_5 &= w_4. \end{align}

This gives 7 operations, while my algorithm's result was:
\begin{align} tmp_1 &= v_1 + v_2 + v_3 + v_4 \\ tmp_2 &= v_5 + v_6 \\ w_1 &= tmp_1 + tmp_2 \\ w_2 &= w_1 \\ w_3 &= w_2 - tmp_2 \\ w_4 &= w_3 \\ w_5 &= w_4. \end{align}
This gives 6 operations. For now I can say that I am using the Hamming distance, bitwise & and | operations, counting usages, and building something like Cocke–Younger–Kasami (CYK): "a parsing algorithm for context-free grammars, named after its inventors, John Cocke, Daniel Younger and Tadao Kasami. It employs bottom-up parsing and dynamic programming." (from Wikipedia). This is the same bottom-up technique I use to build blocks of variables.
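As an illustration only, here is a minimal greedy pairwise CSE sketch in Python that automates this kind of sub-sum sharing (the function `greedy_pair_cse` and its row representation are invented for this sketch and are not necessarily the algorithm described above):

```python
from collections import Counter
from itertools import combinations

def greedy_pair_cse(rows):
    """Greedy pairwise sub-sum sharing for 0-1 rows (each row is the set of
    column indices it sums). Repeatedly replace the most frequently shared
    pair of symbols with a temporary, as in MCM-style common subexpression
    elimination. A sketch of the general idea only."""
    rows = [set(r) for r in rows]
    temps, next_tmp = {}, 0
    while True:
        pair_counts = Counter()
        for r in rows:
            pair_counts.update(combinations(sorted(r, key=str), 2))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:               # no pair shared by two rows: stop
            break
        t = f"tmp{next_tmp}"
        next_tmp += 1
        temps[t] = pair             # t = pair[0] + pair[1]
        for r in rows:
            if pair[0] in r and pair[1] in r:
                r -= set(pair)
                r.add(t)
    return temps, rows

# Example 1's rows: w_i sums v_i ... v_{i+4} (0-based column indices here)
rows = [frozenset(range(i, i + 5)) for i in range(4)]
temps, reduced = greedy_pair_cse(rows)
print(temps)    # temporaries, each the sum of two earlier symbols
print(reduced)  # each w_i is now the sum of the symbols in reduced[i]
```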

Evil

This is essentially research-level CS. The problem is studied in at least two guises: multiplication by sparse matrices (see the example paper cited), and the special case of "binary sparse matrices". The second case is known to be related to optimizing straight-line programs. Minimal programs may also be viewed as DAGs with two types of "gates", addition and multiplication, so some of the circuit-minimization literature may connect with this, and possibly "off the shelf" software could be adapted for the purpose. There is a specific reference on the second case, and the same question on cstheory includes some basic initial empirical study.

vzn

Not sure whether this problem has been studied in exactly this form, but this research is related and seems a reasonable lead/start. It looks at hypergraph decomposition for sparse matrix multiplication; binary matrices are a special case of this approach. This approach will find better strategies than the "straight" multiplication method, and further optimizations (within this framework) might be possible based on the binary-matrix property.

vzn