The standard binary search algorithm runs in O(log N) time, where N is the total number of elements in the array. When the array has duplicates, I don't see how you could detect those duplicates ahead of time. (Iterating through the array takes N time, which is too much.) Consequently, how do you improve the performance from O(log N) to O(log R), where R is the number of distinct elements?
4 Answers
There is no such algorithm. Here's an information-theoretic proof that it can't be done, inspired by gnasher79's answer.
Let's focus on the special case where $R=3$. Suppose there is a constant $c$ and an algorithm that examines at most $c$ elements of the array, regardless of $n$. Then this algorithm can't possibly solve your problem.
In particular, consider all arrays of the form $[0,\dots,0,1,2,\dots,2]$, i.e., containing some number of 0's, followed by a single 1, followed by 2's. Let's imagine what the algorithm does when you ask it to find a $1$. There are $n-1$ different possible positions for the $1$. However, after doing $c$ probes, the algorithm learns only $c \lg 3$ bits of information about the position of the $1$. Thus, if $n-1 > 2^{c \lg 3}$, the algorithm can't possibly work correctly for all such arrays: there exists some pair of arrays where the algorithm outputs the same thing in both cases, but the correct answer differs, so the algorithm must be wrong for at least one of those two cases. In other words, if we take $n$ to be sufficiently large (namely, $n> 3^c + 1$), the algorithm will be incorrect.
This implies there is no algorithm for the case $R=3$ whose running time is upper-bounded by a constant.
It follows that there is no algorithm whose running time is $O(\lg R)$ (and whose running time doesn't depend on $n$): if there were, plugging in $R=3$ would give us an $O(1)$-time algorithm for the special case where $R=3$... but I just proved that no such algorithm can exist.
However, the special case $R=2$ can be solved in $O(1)$ time: simply look at the first and last element of the array, and pick whichever one matches the value you're looking for.
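For instance, a minimal C++ sketch of that $R=2$ special case (the function name and setup are mine, not from the answer; it assumes a non-empty sorted array with exactly two distinct values):

```cpp
#include <vector>

// In a non-empty sorted array with exactly two distinct values, the first
// element is the smaller value and the last element is the larger one, so
// two comparisons suffice to locate one occurrence of `target` in O(1).
int find_with_two_distinct(const std::vector<int>& a, int target) {
    if (a.front() == target) return 0;                  // target is the smaller value
    if (a.back() == target) return (int)a.size() - 1;   // target is the larger value
    return -1;                                          // target does not occur
}
```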
It can't be done.
Assume you are looking for an element x, and it is guaranteed that x is an array element. Assume there are n array elements. You read array elements until you know one that equals x. I change array elements to create the worst case for you. Assume R ≥ 3 and x occurs exactly once.
If n ≥ 2 then you don't know which one equals x. If n ≥ 4 then whatever element you examine, there are at least two elements either to the left or to the right of it, so you may be left with a range of ≥ 2 elements containing x; hence you can't solve the problem by examining only 1 element. If n ≥ 8 then whatever element you examine, there are at least 4 elements either to the left or to the right of it, so you can't solve it in two steps, and so on.
If $n ≥ 2^k$ then you need to examine at least k array elements in the worst case, for any R ≥ 3. Since k is unbounded, finding an array element in a guaranteed O(log R) number of steps is impossible.
So there's a little bit missing: there is no upper limit on the number of steps needed if R = 3. However, O(log R) means: fewer than c · log R steps for some constant c and all sufficiently large R, so the case R = 3 alone doesn't settle anything.
But it's simple: take an arbitrary R ≥ 3. Take an array containing the numbers 1 to R-3 once each, then an arbitrary number of items equal to R-2, one item equal to R-1, and an arbitrary number of items equal to R, and look for the item R-1. (For R = 10, for example, the array is [1, 2, ..., 7, 8, ..., 8, 9, 10, ..., 10], and finding the lone 9 is exactly the R = 3 problem from above.) Even in this special case, even if I tell you the structure of the array, you still have to solve a problem equivalent to the case R = 3, which cannot be done within any fixed number of steps, and which therefore cannot be done in c · log R steps.
If each item repeats $N/R$ times, this result is automatic, since binary search will take $\log(N/R)$ fewer steps to find a value. Specifically, if a searched value repeats $C$ times, the lower and upper brackets cannot get closer than $C$ without locating the match, so $\log C$ iterations are avoided, even if the searcher doesn't know $C$ (no a priori knowledge is needed).
$\log N - \log(N/R) = \log R$
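To see the uniform case concretely, here is a small C++ sketch of my own (the choices of $N$ and $R$ are arbitrary): a binary search that stops as soon as it probes the target needs only about $\log_2 R + 1$ probes when each of $R$ values repeats $N/R$ times.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Binary search that stops as soon as it probes an element equal to the
// target; returns the number of probes used.
static int probes(const std::vector<int>& a, int target) {
    int lo = 0, hi = static_cast<int>(a.size()) - 1, count = 0;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        ++count;
        if (a[mid] == target) break;
        if (a[mid] < target) lo = mid + 1; else hi = mid - 1;
    }
    return count;
}

int main() {
    const int N = 1 << 20, R = 16;           // each value repeats N/R times
    std::vector<int> a(N);
    for (int i = 0; i < N; ++i) a[i] = i / (N / R);

    int worst = 0;
    for (int v = 0; v < R; ++v) worst = std::max(worst, probes(a, v));
    // Expect roughly log2(R) + 1 = 5 probes, not log2(N) = 20.
    std::printf("worst-case probes over all %d values: %d\n", R, worst);
}
```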
In the more general case where each item $i$ occurs $c_i$ times, if the items are searched in proportion to their frequency in the array, the average cost is
$\sum \limits_{i=1}^{R}{{c_i \over N}(\log N -\log c_i )} = \sum{-{c_i \over N}\log {c_i \over N}} = \sum -p_i \log p_i$
which is the entropy of the multiset. (There's an extra $1$ in there which I dropped for complexity-class purposes; e.g., entropy $0$ still takes one probe.)
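For a concrete (made-up) instance: with $N = 8$ and counts $(c_1,c_2,c_3,c_4) = (4,2,1,1)$, the probabilities are $(\tfrac12,\tfrac14,\tfrac18,\tfrac18)$, and the average cost is

$\tfrac12(\log 8-\log 4)+\tfrac14(\log 8-\log 2)+\tfrac18(\log 8-\log 1)+\tfrac18(\log 8-\log 1) = 1.75$

probes, versus $\log_2 8 = 3$ for a search that ignores the duplicates.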
> how do you improve the performance from log N to log R?
As you can see from the answers above, this is impossible. But it's possible with a little bit of cheating.
Assumptions:
- The array doesn't change.
- Queries for each element are equally likely.
Whenever you get a query for some element $x$, compute the range of indices where it occurs. This can be done in $O(\log n)$ time using binary search (e.g., lower_bound and upper_bound in C++). Keep $x$ and the range of locations where it appears in the main array, and maintain a balanced binary tree with all of the elements seen so far and their ranges of locations (one leaf node per distinct element).
After $O(R\log R)$ queries (in expectation, by assumption 2; this is the coupon-collector bound) you will have seen every element at least once. Now you know the range of locations of every element of the array, and you can answer any query in $O(\log R)$ time by a lookup in the binary tree.
Of course you don't have to wait until you've seen all the elements to start using the binary tree. Any time you want to find a value $x$, you can first look it up in the binary tree; if it's not there, you can look it up in the main array and then add it to the binary tree. But once you've made about $O(R \log R)$ queries, you should expect each future query to take $O(\log R)$ time.
If you plan to make at least $\Omega(R \log N)$ queries, the amortized running time will be $O(\log R)$ time per query.
This could be useful in the real world if $R \ll n$.
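A minimal C++ sketch of the lazy-caching idea described above (the class name and layout are mine; it assumes the array is sorted and never changes):

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Sorted, immutable array plus a cache that maps each value queried so far
// to its half-open index range [first, last) in the array.
class CachedSearcher {
public:
    using Range = std::pair<std::ptrdiff_t, std::ptrdiff_t>;

    explicit CachedSearcher(std::vector<int> sorted) : a_(std::move(sorted)) {}

    // Returns the index range of x (an empty range if x is absent).
    // The first lookup of a value costs O(log n); repeats cost O(log R'),
    // where R' is the number of distinct values cached so far.
    Range find(int x) {
        auto hit = cache_.find(x);                    // O(log R') tree lookup
        if (hit != cache_.end()) return hit->second;

        auto lo = std::lower_bound(a_.begin(), a_.end(), x);  // O(log n)
        auto hi = std::upper_bound(a_.begin(), a_.end(), x);  // O(log n)
        Range range{lo - a_.begin(), hi - a_.begin()};
        cache_.emplace(x, range);                     // remember for next time
        return range;
    }

private:
    std::vector<int> a_;
    std::map<int, Range> cache_;   // balanced BST, one node per distinct value
};
```

Once every distinct value has been queried at least once, every call reduces to the single `cache_.find` lookup, i.e., $O(\log R)$ per query, which is the amortized behaviour argued above.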