Cryptographic data structure: sparse array without membership test

Question

I would like a data structure supporting two operations: $\mathsf{set}(k,v)$ and $\mathsf{get}(k)$, with the following properties:

$k$ can be any number between $0$ and something impressively large, like $2^{256}-1$.
If there is some $v$ such that $\mathsf{set}(k,v)$ has been called then $\mathsf{get}(k)$ returns the most recent such $v$.
There is no way to learn anything about which $k$ and $v$ are such that $\mathsf{set}(k,v)$ has been called (but it might be possible to determine information that does not have to do with specific $k$ and $v$, e.g., it might be possible to learn how many times set has been called or on how many different $k$ it has been called).
Space and time complexity are bounded by reasonable functions of the total number of calls to set and get (or better yet by a reasonable function of the number of keys set has been called on).

An attacker would learn everything I know, so the solution cannot be based on me retaining any secrets. But an attacker would not see the sequence of $\mathsf{set}$ and $\mathsf{get}$ operations since that would allow knowing what $\mathsf{set}$ has been called on. An example of a data structure that would meet all my demands except for the computational complexity ones is to create at time zero an array of size $2^{256}$ populated by random data and then update and query it as a normal array.
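To make the reference structure concrete, here is a minimal sketch of the "array of random data" idea, scaled down so that it actually fits in memory. The class name `DenseRandomArray` and the parameters `key_bits` and `value_bytes` are made up for illustration; the real thing would need `key_bits = 256`, which is exactly what makes it infeasible.

```python
import secrets


class DenseRandomArray:
    """Toy model of the ideal structure: every slot starts out as random data."""

    def __init__(self, key_bits: int = 16, value_bytes: int = 4):
        self.value_bytes = value_bytes
        # One random value per possible key. Feasible for 16 bits, not for 256.
        self.slots = [secrets.token_bytes(value_bytes) for _ in range(2 ** key_bits)]

    def set(self, k: int, v: bytes) -> None:
        if len(v) != self.value_bytes:
            raise ValueError("values have a fixed length")
        self.slots[k] = v

    def get(self, k: int) -> bytes:
        # A key that was never set simply returns its initial random contents.
        return self.slots[k]


arr = DenseRandomArray(key_bits=8)
arr.set(42, b"\x01\x02\x03\x04")
stored = arr.get(42)     # the value just written
never_set = arr.get(7)   # random bytes, same length as a real value
```

If the stored values themselves look random, nothing in `slots` distinguishes position 42 from position 7, which is the property the question asks for.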

I assume a fixed field length for $v$ and also that the data being stored in $v$ is effectively random. The meaning of the last statement is a bit subtle. If $v$ is a 256-bit field then storing all zeros in a place in the array will effectively give away that it has been set. If $v$ is a 1-bit field then storing any particular value in any particular place will not give anything away, but other patterns, like storing more zeros than ones might. Imagine that the attacker has a correct guess about what the set of set keys is. Depending on how the array was used, he might be able to become more certain of his guess by querying the array. I simply want a data structure that doesn't give up any more information than strictly necessary beyond the infeasible full array of length $2^{256}$ initially stocked with random data.

Obviously it will have to give up a little more information because it will get bigger depending on how much data it has in it (compare black holes), but ideally the approximate number of keys set is all the information it will leak. Actually, it will not have to progressively give up more information if the maximum amount of data the structure can contain is fixed at creation time. And the partial solutions to the problem that are given below by otus and myself require the size to be fixed in this way anyway.

Is there such a data structure? Also, what is a good reference for this kind of question?

The attacker only gets to see the final data structure. The concept of history-independent data structures looks really interesting, but I am skeptical that the literature on it will help with my problem. Among other things, the couple of papers I looked at seemed to use a threat model where the attacker has infinite computational resources, and it is impossible to solve my problem under that threat model.

score 2 · Answer 1 · edited Apr 13 '17 at 12:48

This answer builds on the answer supplied by @otus. There is also an important worry about whether the main solution works at all as advertised. I am posting it now to make clearer to @otus a question I asked in a comment. I will eventually either delete it or substantially revise it.

First off, let me emphasize that if the table is going to be used to store arbitrary data, then an attacker who suspects $k$ is a key may be able to test the suspicion by checking whether $\mathsf{get}(k)$ is data-like. So the natural guarantee in this case is just that an attacker cannot recover any information from the table without possessing the key to that data. If this guarantee is all we require, the problem can be completely solved using @otus's idea of hashing the keys and using the hash as a symmetric encryption key. Let me explain this solution as a preliminary.

Assume, as @otus does, that an initial hash has been performed. (Questions: do standard cryptographic hash functions actually come with a guarantee that the hashes are indistinguishable from draws from the uniform distribution? Also, if we are willing to hash to a larger space than the domain, can we be guaranteed that the hash function is injective?)

A simple version of the data structure is as follows. We maintain a list of data items. Each item in the list is of the form $E_k(k,v)$ for a $k$ that is unique in the list. To retrieve the value associated with $k$, we decrypt each list member using $k$ and look for one that decrypts to a pair of the form $(k,v)$. To store $(k,v)$, we first check whether $k$ already has a value associated with it and, if so, remove the corresponding data item from the list. Then we add $E_k(k,v)$ to the list.
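The list version can be sketched as follows. This is only an illustration under some assumptions of mine: keys are already 32-byte hashes, and $E_k$ is a toy stream cipher built from SHA-256 in counter mode (the names `keystream`, `xor_crypt` and `EncryptedList` are hypothetical, and a real implementation would use a vetted cipher).

```python
import hashlib

KEY_BYTES = 32  # assumed fixed key width (keys are already hashed to 256 bits)


def keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. For illustration only."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]


def xor_crypt(key: bytes, data: bytes) -> bytes:
    """XOR with the keystream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


class EncryptedList:
    def __init__(self):
        self.items = []  # each item is E_k(k, v)

    def _find(self, k: bytes):
        # Linear scan: decrypt every item with k and look for the prefix k.
        for i, item in enumerate(self.items):
            plain = xor_crypt(k, item)
            if plain[:KEY_BYTES] == k:
                return i, plain[KEY_BYTES:]
        return None, None

    def set(self, k: bytes, v: bytes) -> None:
        i, _ = self._find(k)
        if i is not None:
            del self.items[i]  # drop the old value for this key
        self.items.append(xor_crypt(k, k + v))

    def get(self, k: bytes):
        return self._find(k)[1]  # None if k was never stored


store = EncryptedList()
alice = hashlib.sha256(b"alice").digest()
bob = hashlib.sha256(b"bob").digest()
store.set(alice, b"v1")
store.set(alice, b"value2")  # overwrites the earlier entry
```

Note that `get` returning `None` for an unknown key is precisely the membership test that this preliminary version does not try to hide.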

The simple version has linear time overhead, but it is straightforward to eliminate that inefficiency by using hash tables with collision resolution.

Now back to the full version of the problem, which is mainly interesting in the case where the stored data will be indistinguishable from random bit strings.

@otus gives a solution to this problem, but it has three drawbacks: it is probabilistic, it requires super-quadratic space to have low error probability, and the eventual size of the data structure must be known in advance. The first and third drawbacks seem to me inescapable, though I have no proof of this conjecture. I have an idea about how to eliminate the second drawback, but it depends on the existence of a certain kind of error-correcting code.

The data structure is an array of $N$ bits initialized to random values. To store $(k,v)$, we use $k$ to seed a pseudo-random number generator that generates a sequence of indices into the table. We blow up $E_k(v)$ to double its original size using an error-correcting code, and we store the successive bits of this blown-up version of $E_k(v)$ at the successive pseudo-random indices into the table. This code can correct errors in up to half of the bits, so as long as the probability of a collision at any bit is significantly less than a half, we can be virtually certain of recovering the data. If the table is four times the size of the data stored in it, then the expected number of bits stored at each index will be $1/2$. We can think of the number of collisions at a location as following a Poisson distribution with mean $1/2$, so the probability of no collisions at a location will be about $e^{-1/2} \approx 0.61$. Assuming the data is long, there will be an overwhelmingly high probability that we can recover it from the code. And the total space requirement is only four times the amount of space we need to hold the data.
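As a sanity check on the 0.61 figure, here is a small sketch that computes the Poisson value and compares it with a seeded simulation. The function names and the table sizes are arbitrary choices for illustration; the load of 1/2 comes from writing twice the data size into a table four times the data size.

```python
import math
import random


def no_collision_probability(load: float) -> float:
    """P(Poisson(load) == 0): chance that no other bit lands on a given slot."""
    return math.exp(-load)


def simulate(num_slots: int, num_writes: int, seed: int = 0) -> float:
    """Fraction of writes whose slot was hit by no other write."""
    rng = random.Random(seed)
    targets = [rng.randrange(num_slots) for _ in range(num_writes)]
    counts = {}
    for t in targets:
        counts[t] = counts.get(t, 0) + 1
    alone = sum(1 for t in targets if counts[t] == 1)
    return alone / num_writes


# Load 1/2: encoded bits written = 2 * data size, table size = 4 * data size.
analytic = no_collision_probability(0.5)
empirical = simulate(num_slots=40_000, num_writes=20_000)
```

The two numbers should agree to within sampling noise, which supports the estimate but says nothing about whether a suitable code exists.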

The only problem is that with normal error-correcting codes, it is easy to tell whether a bit string is a code word or not, and it is probably also easy to tell if a bit string is close to a code word in the Hamming metric. So an attacker in possession of $k$ would be able to confirm that $k$ was likely a key, which is precisely what we don't want. I am not sure whether this problem dooms the whole approach I am contemplating here or whether there are special codes that do not have this problem. Now with any code there will be some bit strings that are easy to recognize as non-code words because they have short Hamming distance to bit strings that decode to something different. But maybe such bit strings could be rare, or even impossible to find, and bit strings that do not have that property would be indistinguishable from code words.

score 2 · Answer 2 · answered Sep 28 '14 at 22:42

I'm looking forward to reading other answers -- a structure with these properties would be very useful for plausibly deniable encryption.

Perhaps something like this might be adequate:

// untested pseudocode
initialize():
    global_current_index = 0
    global_array = []
    for each word in a dictionary of plausible words:
        set( word, get_random_value() ) // dummy data

set(k,v):
    // would the winner of the Password Hashing Competition be overkill
    // for this one-way function?
    tag = one-way-function( k )
    ciphertext = encrypt( key=k, data=v )
    location = binary_search( global_array, key=tag )
    // forget and overwrite old value if it exists; otherwise append new value
    if not_found(location):
        location = global_current_index
        global_current_index = global_current_index + 1
    global_array[location] = (tag, ciphertext)
    dummy_k = get_random_value()
    tag = one-way-function( dummy_k )
    ciphertext = encrypt( key=dummy_k, data=get_random_value() )
    global_array[global_current_index] = (tag, ciphertext) // dummy data
    global_current_index = global_current_index + 1
    sort( global_array )

get(k):
    tag = one-way-function( k )
    location = binary_search( global_array, key=tag )
    if not_found(location):
        log( "k: " + k + " not detected: unauthorized guessing?" )
        return get_random_value()
    else:
        (tag, ciphertext) = global_array[ location ]
        plaintext = decrypt( key=k, data=ciphertext)
        return plaintext
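
For anyone who wants to experiment, here is a rough Python sketch of the same idea. It is not a vetted implementation: `one_way` and `xor_crypt` are stand-ins I made up (SHA-256 with a prefix, and a one-block XOR stream), values are assumed to be a fixed 16 bytes, and entries are inserted at their sorted position instead of appending and re-sorting.

```python
import bisect
import hashlib
import secrets

VALUE_BYTES = 16  # assumed fixed value length (must be <= 32 for xor_crypt)


def one_way(k: bytes) -> bytes:
    """Stand-in one-way function for tags (assumption: plain SHA-256)."""
    return hashlib.sha256(b"tag" + k).digest()


def xor_crypt(k: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR with one SHA-256 block; encrypts and decrypts."""
    stream = hashlib.sha256(b"enc" + k).digest()
    return bytes(a ^ b for a, b in zip(data, stream))


class TaggedStore:
    def __init__(self, decoy_keys=()):
        self.entries = []  # list of (tag, ciphertext), kept sorted by tag
        for word in decoy_keys:
            self.set(word, secrets.token_bytes(VALUE_BYTES))  # dummy data

    def _locate(self, tag: bytes):
        i = bisect.bisect_left(self.entries, (tag, b""))
        found = i < len(self.entries) and self.entries[i][0] == tag
        return i, found

    def _put(self, k: bytes, v: bytes) -> None:
        tag = one_way(k)
        i, found = self._locate(tag)
        if found:
            self.entries[i] = (tag, xor_crypt(k, v))  # overwrite old value
        else:
            self.entries.insert(i, (tag, xor_crypt(k, v)))

    def set(self, k: bytes, v: bytes) -> None:
        if len(v) != VALUE_BYTES:
            raise ValueError("values have a fixed length")
        self._put(k, v)
        # Every real write is accompanied by one dummy entry.
        self._put(secrets.token_bytes(32), secrets.token_bytes(VALUE_BYTES))

    def get(self, k: bytes) -> bytes:
        i, found = self._locate(one_way(k))
        if not found:
            return secrets.token_bytes(VALUE_BYTES)  # unknown key: random
        return xor_crypt(k, self.entries[i][1])


store = TaggedStore(decoy_keys=[b"password"])
store.set(b"my key", b"0123456789abcdef")
store.set(b"my key", b"fedcba9876543210")  # overwrite; adds one more dummy
```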

Space complexity: After the initial dummy dictionary, the size of the array increases by 2 entries each time set(k,v) is called with some new unique k value -- one real entry and one dummy entry. Each entry is relatively small -- the size of the encrypted v value (perhaps exactly the same length as the plaintext v value with classic deterministic encryption algorithms, or perhaps a little longer to handle an IV, padding, etc) plus the size of the tag entry (perhaps 256 bits if we use SHA256() as the one-way function).

Time complexity: get() uses a binary search, so it's O(log n), where n is the number of previous calls to set.

set() also uses a binary search, but it re-sorts the array afterwards, so as written it is O(n log n).

If an attacker gets a single copy of this array of size N and knows you've been using this algorithm, he knows you've stored somewhere between 1 entry of actual data (overwritten about N times) and at most N/2 entries of actual data.

However, if you are using a non-broken one-way function, and a non-broken encryption routine, then it seems difficult to extract what data you've stored or what keys you stored them under without brute-force guessing.

This structure does leak a lot of data about what keys you've never used -- the attacker can make a guess, and if the one-way-function of that guess is not stored as a tag in the array, then the attacker knows you've never stored anything under that guess.

On the other hand, if you know ahead of time what guesses the attacker might make, and you've pre-loaded them in the "dictionary of plausible words", then the attacker might not be able to tell whether you've stored something under that guess or not.

score 1 · Answer 3 · answered Sep 23 '14 at 06:02

You can create a lossy version of the data structure using a hash table. If the $k$ values are not uniformly randomly distributed, you would initially hash them $k' = H(k)$ and use that as $k$ below.

Initialize the table of size $N$ with random data. When inserting $(k,v)$, encrypt $v$ using $k$ as the key with some symmetric encryption algorithm $E_k(v)$. Then write that ciphertext into the table position $k \bmod N$.

Now you can query the key $k$ using $D_k(x)$, where $x$ is the data in the position $k \bmod N$. That always returns some data, so an attacker wouldn't know if it was previously inserted or not. (This has to be the case for the encryption algorithm, so many padding modes wouldn't work. Any unauthenticated stream cipher would.)
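A minimal sketch of this table, under a few assumptions that are mine rather than part of the scheme: keys are byte strings, SHA-256 with different prefixes stands in for both the initial hash and the stream cipher, and values are a fixed 16 bytes. The class name `LossyTable` is made up.

```python
import hashlib
import secrets


class LossyTable:
    def __init__(self, num_slots: int, value_bytes: int = 16):
        if value_bytes > 32:
            raise ValueError("toy cipher below covers at most 32 bytes")
        self.n = num_slots
        self.value_bytes = value_bytes
        # Every slot starts as random data, so unused slots look like used ones.
        self.slots = [secrets.token_bytes(value_bytes) for _ in range(num_slots)]

    def _slot(self, k: bytes) -> int:
        # Position depends only on the key: H(k) mod N.
        return int.from_bytes(hashlib.sha256(b"slot" + k).digest(), "big") % self.n

    def _crypt(self, k: bytes, data: bytes) -> bytes:
        # Unauthenticated toy stream cipher: XOR with a key-derived block.
        stream = hashlib.sha256(b"enc" + k).digest()
        return bytes(a ^ b for a, b in zip(data, stream))

    def set(self, k: bytes, v: bytes) -> None:
        if len(v) != self.value_bytes:
            raise ValueError("values have a fixed length")
        self.slots[self._slot(k)] = self._crypt(k, v)

    def get(self, k: bytes) -> bytes:
        # Always returns something; there is no "not found" result.
        return self._crypt(k, self.slots[self._slot(k)])


table = LossyTable(num_slots=1024)
table.set(b"some key", b"sixteen byte val")
hit = table.get(b"some key")
miss = table.get(b"other key")  # garbage, but indistinguishable in form
```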

However, if two $k$ values collide, the second $(k',v')$ will overwrite the first value $v$ with garbage data. You can avoid this only by preallocating a large enough hash table that collisions are unlikely. I.e. if you need it to handle $2^n$ values without collision, you would need it to be larger than $2^{2n}$ items – how much larger depending on how unlikely collisions must be.
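To put numbers on "how much larger", here is a small sketch using the standard birthday approximation $1 - e^{-m(m-1)/(2N)}$ for the probability that $m$ keys thrown into $N$ slots produce at least one collision. The function name and the example sizes are just for illustration.

```python
import math


def collision_probability(num_keys: int, num_slots: int) -> float:
    """Birthday approximation for P(at least one collision)."""
    x = num_keys * (num_keys - 1) / (2.0 * num_slots)
    return -math.expm1(-x)  # 1 - exp(-x), accurate for small x


m = 2 ** 10                                          # 2^n keys with n = 10
p_square = collision_probability(m, m ** 2)          # table of 2^(2n) slots
p_bigger = collision_probability(m, 1024 * m ** 2)   # table of 2^(2n+10) slots
```

With exactly $2^{2n}$ slots the chance of some collision is still about 0.39, and each extra factor of two in table size roughly halves it.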

I don't see a way to extend this approach to a non-lossy/probabilistic data structure.

The choice of where to find the value in memory must only depend on $k$, since it has to return the same item even if every other item in the data structure has been changed. There can't be any way to verify the chosen item was the correct one either (like using MAC-then-encrypt) or else an attacker can test which $k$ have not been inserted. This means anything that needs comparisons, like trees or hash table collision resolution, won't work.
