I am looking for a parallel algorithm to compute ranks of huge, sparse binary matrices over $\Bbb F_2$, say, $10^5 \times 10^5$ with $10^7$ ones in total.
Currently, I am doing this by packing $64$ bits in a long and applying Gaussian elimination with XOR, swapping unneeded parts of the matrix to hard disk. While this goes rather fast (even such huge matrices take only a few days) I feel it's a pity that:
I'm not using the fact that it's sparse, I feel this could give a huge improvement;
I'm not using the fact that I have multiple processors available ($1000$+ in a GPU)
I wondered if anyone is aware of any better methods? I saw this paper: Link but only computing the characteristic polynomial seems to be much more expensive than what I'm doing now. Or am I missing something? How should I compute this determinant polynomial efficiently?
Another approach would be working via https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.1632 (Wayne Eberly, Erich Kaltofen: On Randomized Lanczos Algorithms) and its derivates such as http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.1619 (Jean-Guillaume Dumas, Gilles Villard: Computing the Rank of Large Sparse Matrices over Finite Fields), but this doesn't work for ${\rm GF}(2)$ (which is not even mentioned in the paper).