Given a 2D lattice with coordinates $1 \leq x \leq c$ and $1 \leq y \leq d$, we define $f(x, y) = xy$.

We wish to find a boolean function $I(x,y)$ that determines in $O(1)$ time whether or not $(x,y)$ belongs to a set of points $S$ of size $k$ whose sum $Z(S) = \sum_{(x,y) \in S} f(x,y)$ is less than or equal to that of any other set of size $k$. One may use $O(c+d+k)$ time and space to construct $I$.

Is this possible? Is this a known problem (my search turned up nothing)? Can the divisor summatory function (https://en.wikipedia.org/wiki/Divisor_summatory_function) and its approximations help us?

Motivation: I work in NLP and am trying to find an optimal way of storing part of a word-word cooccurrence matrix in memory and part on disk. This matrix is very sparse. I'm making the simplifying assumption that the probability of two words co-occurring is proportional to the product of their unigram frequencies. Ranking words by frequency gives the coordinates $x$ and $y$ (bounded by $c$ and $d$). The smaller the ranks $x$ and $y$ of two words, the more likely the words are to co-occur, so their entry should be stored in memory. Since there will be billions of lookups, $I$ needs to be fast.

Thanks!

Alexandre

1 Answer

The set $S$ must have the following form: there must exist a threshold $t$ such that:

  • If $xy < t$, then $(x,y)$ is in $S$.
  • Some or all of the points $(x,y)$ with $xy=t$ are in $S$.
  • If $xy > t$, then $(x,y)$ is not in $S$.

Moreover, in the second bullet, it doesn't matter which points with $xy=t$ you put into $S$; all that matters is how many of them you put into $S$. Finally, notice that the locus of points $(x,y)$ such that $xy=t$ forms a hyperbola.
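
As a small illustration (the values $c=d=3$ and $k=4$ are chosen here for the example and do not come from the question), the nine products $xy$ in sorted order are

$$1,\ 2,\ 2,\ 3,\ 3,\ 4,\ 6,\ 6,\ 9.$$

The threshold is $t=3$: the three points with $xy<3$, namely $(1,1)$, $(1,2)$, $(2,1)$, must be in $S$, and exactly one of the two points with $xy=3$, namely $(1,3)$ or $(3,1)$, fills the remaining slot. Either choice gives $Z(S) = 1+2+2+3 = 8$.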

It follows that one way to parametrize the set is as

$$S = \{(x,y) : xy < t \text{ or } (xy=t \text{ and } x<u)\},$$

for some parameters $t,u$. It's clear that, given $t,u$, membership can be tested in $O(1)$ time, i.e., your oracle will run in $O(1)$ time, as required.
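
A minimal sketch of such an oracle in Python, assuming $t$ and $u$ have already been chosen. The name `make_oracle` and the explicit bounds check are illustrative additions, not part of the problem statement.

```python
def make_oracle(t, u, c, d):
    """Return I(x, y) for S = {(x, y) : xy < t or (xy == t and x < u)}.

    `make_oracle` is a hypothetical helper name; t and u are assumed to
    have been computed during precomputation.
    """
    def I(x, y):
        # Points outside the lattice are never in S (an added safety check).
        if not (1 <= x <= c and 1 <= y <= d):
            return False
        p = x * y
        # One multiplication and two comparisons: O(1) per lookup.
        return p < t or (p == t and x < u)
    return I


# Example with illustrative values c = d = 3, t = 3, u = 2.
I = make_oracle(t=3, u=2, c=3, d=3)
members = [(x, y) for x in range(1, 4) for y in range(1, 4) if I(x, y)]
```

With these values, `members` contains the three points with $xy<3$ plus the single boundary point $(1,3)$.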

So, we simply need to select suitable values of $t,u$ during the precomputation to construct $I$. How can we do that?

One easy way is to use binary search. It's easy to count the number of lattice points $(x,y)$ such that $xy < t$ in $O(c)$ time: just iterate over $x=1,2,\dots,c$ and count the number of values $y$ with $1 \le y \le d$ such that $xy < t$, i.e., such that $y < t/x$; there are $\min(d, \lfloor (t-1)/x \rfloor)$ of them, so we sum that over $x=1,2,\dots,c$. This count is nondecreasing in $t$, so (assuming $k < cd$; otherwise every point belongs to $S$) you can binary search on $t$ until you find the largest $t$ such that the number of such points is at most $k$. This gives you the value of $t$. It also gives you the number of points $(x,y)$ such that $xy<t$; call that number $k_0$.
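
The counting and binary-search step could be sketched as follows. The function names `count_below` and `find_threshold` are made up for this example, and the search range $[1, cd+1]$ is an assumption chosen so that the case $k \ge cd$ also works.

```python
def count_below(t, c, d):
    """Number of lattice points with 1 <= x <= c, 1 <= y <= d and x*y < t."""
    # For each x, the valid y are 1..floor((t-1)/x), capped at d.
    return sum(min(d, (t - 1) // x) for x in range(1, c + 1))


def find_threshold(k, c, d):
    """Return (t, k0): the largest t in [1, c*d + 1] with
    count_below(t) <= k, and k0 = count_below(t)."""
    lo, hi = 1, c * d + 1          # count_below(1) == 0 <= k always holds
    while lo < hi:
        mid = (lo + hi + 1) // 2   # round up so the loop always shrinks
        if count_below(mid, c, d) <= k:
            lo = mid               # mid is feasible, try larger t
        else:
            hi = mid - 1           # too many points, try smaller t
    return lo, count_below(lo, c, d)


# Illustrative values: c = d = 3, k = 4.
t, k0 = find_threshold(4, 3, 3)
```

Each call to `count_below` costs $O(c)$, and the binary search makes $O(\log(cd))$ such calls in this sketch.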

Once you know the value of $t$, the next step is to choose a suitable value for $u$. We'll need to choose $u$ so that there are exactly $k-k_0$ lattice points $(x,y)$ such that $xy=t$ and $x<u$. Finding $u$ can be done by iterating over $x=1,2,\dots,c$; for each $x$, check whether $y=t/x$ is an integer with $y \le d$, keep track of how many lattice points you've found, and stop once you've found $k-k_0$ of them. If the last point found has $x = x^*$, take $u = x^*+1$ (and $u=1$ if $k-k_0=0$).
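
A sketch of the scan for $u$, assuming $t$ and the number of boundary points still needed ($k-k_0$, called `need` here) are already known. The name `find_u` and the error raised when too few points exist are choices made for this example.

```python
def find_u(t, need, c, d):
    """Smallest u such that exactly `need` lattice points satisfy
    x*y == t and x < u (with 1 <= x <= c, 1 <= y <= d)."""
    if need == 0:
        return 1  # no x satisfies x < 1, so no boundary point is admitted
    found = 0
    for x in range(1, c + 1):
        # Each x contributes at most one point on the hyperbola xy = t.
        if t % x == 0 and t // x <= d:
            found += 1
            if found == need:
                return x + 1  # admit every boundary point up to this x
    raise ValueError("fewer than `need` lattice points lie on xy = t")


# Illustrative values: c = d = 3, t = 3, and one boundary point needed.
u = find_u(3, 1, 3, 3)
S = [(x, y) for x in range(1, 4) for y in range(1, 4)
     if x * y < 3 or (x * y == 3 and x < u)]
```

For these values the resulting set has four points and its sum of products matches the sum of the four smallest products on the $3 \times 3$ lattice.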

In the end, the precomputation requires something like $O(\min(c,d) \log k)$ time and the oracle can be computed in $O(1)$ time. The precomputation can probably be done even faster through implementation tricks and/or some number theory, but given your problem statement, I'm guessing this might already be good enough.

D.W.