
Short version: What is the most space-efficient lossy dictionary (with given false positive and false negative rates)?

Long version: A lossy dictionary is a data structure $D$ that encodes a set $S$ of $n$ key-value pairs $(k_i, v_i)$, together with a query operation. Let $K = \{k_1,\dots{},k_n\}$. The query operation is defined as:

$$\mathrm{query}_D(k_i) = v_i \quad \text{for } k_i \in K, \qquad \mathrm{query}_D(k) = \bot \quad \text{for } k \not\in K.$$

However, false positives (returning some $v_i$ for a $k \not\in K$) and false negatives (returning $\bot$ for some $k_i \in K$), occurring with probabilities $p_p$ and $p_n$ respectively, are acceptable to some degree.

Many efficient techniques exist, but I am having trouble figuring out the most efficient one. I am chiefly interested in space efficiency (as a function of $p_p$ and $p_n$). $query$ efficiency is not a main concern (but should remain sublinear in $n$). I assume that keys and values have fixed sizes $s_k$ and $s_v$ respectively. I also assume that keys (both those in $S$ and those queried) are uniformly distributed in $\{0,1\}^{s_k}$.
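For concreteness, a minimal sketch of the interface I have in mind (Python; the names are mine, not from any existing library):

```python
from typing import List, Optional, Tuple


class LossyDict:
    """Hypothetical interface: stores n fixed-size key-value pairs and answers
    queries, tolerating false positives (rate p_p) and false negatives (rate p_n)."""

    def __init__(self, pairs: List[Tuple[bytes, bytes]], p_p: float, p_n: float) -> None:
        raise NotImplementedError

    def query(self, key: bytes) -> Optional[bytes]:
        """Return the stored value for key, or None (standing for ⊥)."""
        raise NotImplementedError
```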

I know of some examples:

  • Cuckoo hashing
  • Bloomier filters
  • Minimal perfect hash function + signature (to alleviate false positives; a rough sketch follows this list)
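
For instance, here is a rough sketch of the third option; the `build_mphf` helper is hypothetical and stands in for any minimal-perfect-hashing construction:

```python
import hashlib


def _sig(key: bytes, bits: int) -> int:
    """Short signature of the key (a truncated hash)."""
    h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    return h & ((1 << bits) - 1)


class MphfSignatureDict:
    """MPHF + signature: a minimal perfect hash function maps the n stored keys
    bijectively onto slots 0..n-1; each slot keeps only a short signature and
    the value. False positive rate is about 2^-sig_bits; no false negatives."""

    def __init__(self, pairs, build_mphf, sig_bits=16):
        # build_mphf is hypothetical: given the stored keys, it must return a
        # function mapping each of them to a distinct slot in range(n).
        self.n = len(pairs)
        self.sig_bits = sig_bits
        self.mphf = build_mphf([k for k, _ in pairs])
        self.sigs = [0] * self.n
        self.vals = [None] * self.n
        for k, v in pairs:
            i = self.mphf(k)
            self.sigs[i] = _sig(k, sig_bits)
            self.vals[i] = v

    def query(self, key):
        i = self.mphf(key)  # a key outside S still lands somewhere
        if 0 <= i < self.n and self.sigs[i] == _sig(key, self.sig_bits):
            return self.vals[i]
        return None  # ⊥
```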

They are, however, not trivial to compare, and I am not sure there isn't something better out there. Any insight would be appreciated!

doc

1 Answer


Let's see what we can do. The values are random and hence incompressible, but the keys can be sorted and then delta-encoded. So: sort the data, apply a difference-compression scheme to the keys, and use binary search for lookups.
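
Something like this sketch, where keys are delta-coded inside blocks and the first key of each block is kept in full so that binary search still works (a real implementation would bit-pack or varint-encode the deltas rather than keep Python ints):

```python
from bisect import bisect_right


class SortedDeltaDict:
    """Sorted keys + difference compression + binary search (sketch)."""

    def __init__(self, pairs, block=64):
        pairs = sorted(pairs)                      # sort by integer key
        self.anchors = []                          # full first key of each block
        self.deltas = []                           # per-block key deltas (first is 0)
        self.values = []                           # per-block values
        for i in range(0, len(pairs), block):
            chunk = pairs[i:i + block]
            self.anchors.append(chunk[0][0])
            prev = chunk[0][0]
            d, v = [], []
            for k, val in chunk:
                d.append(k - prev)                 # small gaps -> few bits each
                prev = k
                v.append(val)
            self.deltas.append(d)
            self.values.append(v)

    def query(self, key):
        if not self.anchors:
            return None
        b = bisect_right(self.anchors, key) - 1    # binary search over block anchors
        if b < 0:
            return None
        running = self.anchors[b]
        for i, d in enumerate(self.deltas[b]):
            running += d                           # undo the delta coding
            if running == key:
                return self.values[b][i]
            if running > key:
                break
        return None                                # key not stored
```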

This puts your $n$ keys+values into close to the minimum number of bits required to hold that amount of information.
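
For reference, the bound this is aiming at: storing an $n$-element subset of the $2^{s_k}$ possible keys plus the values needs at least about

$$\log_2 \binom{2^{s_k}}{n} + n\,s_v \;\approx\; n\left(s_k - \log_2 n + \log_2 e\right) + n\,s_v \text{ bits}$$

(for $n \ll 2^{s_k}$), and sorted, delta-coded keys come close: the average gap is about $2^{s_k}/n$, i.e. roughly $s_k - \log_2 n$ bits per key.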

You might further improve the compression ratio by dropping the less compressible items.

EDIT: As mentioned here, you can compress the keys down to smaller values. For example, you can use a Bloom filter to check that the probed key is in the set at all, and then use a compressed 32-bit hash of the key to look it up in a table of 64K entries with pretty high fidelity. NOTE: on closer analysis I realized that the Bloom filter will occupy at least as much space as you save by shrinking the keys stored in the sorted table.
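
Concretely, something like this sketch (the 32-bit hash width and ~10 filter bits per key are placeholder parameters, and keys are assumed to be byte strings):

```python
import hashlib
from bisect import bisect_left


def _h(key: bytes, bits: int) -> int:
    """Compressed (truncated) hash of the key."""
    full = int.from_bytes(hashlib.blake2b(key, digest_size=16).digest(), "big")
    return full & ((1 << bits) - 1)


class BloomFilter:
    """Small Bloom filter sketch (k probe positions via double hashing)."""

    def __init__(self, keys, m_bits, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)
        for key in keys:
            for i in self._positions(key):
                self.bits[i // 8] |= 1 << (i % 8)

    def _positions(self, key):
        d = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def __contains__(self, key):
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._positions(key))


class BloomPlusHashTable:
    """Sketch of the idea above: the Bloom filter answers 'is this key stored?',
    and the sorted table is keyed by a short hash of the key instead of the
    full key, so less key material is stored (hash collisions among stored
    keys can return a wrong value, i.e. a false positive)."""

    def __init__(self, pairs, hash_bits=32, bloom_bits_per_key=10):
        keys = [k for k, _ in pairs]
        self.hash_bits = hash_bits
        self.bloom = BloomFilter(keys, m_bits=bloom_bits_per_key * max(1, len(keys)))
        rows = sorted((_h(k, hash_bits), v) for k, v in pairs)
        self.hashes = [h for h, _ in rows]
        self.values = [v for _, v in rows]

    def query(self, key):
        if key not in self.bloom:
            return None                        # filtered out as not stored
        h = _h(key, self.hash_bits)
        i = bisect_left(self.hashes, h)
        if i < len(self.hashes) and self.hashes[i] == h:
            return self.values[i]
        return None
```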

PS: You can use any rehashing scheme, including cuckoo hashing, to cut off rehash chains at some point, but either way a sorted sequence should be more compressible than any hash table containing unused cells.

Perfect hashing just helps you build a hash table without holes. I don't know what signature you mean here.

A Bloom filter by itself only checks key presence; you still need to store the entire key+value pair.

Bulat