Secure content-defined chunking

Question

Intro

For deduplication purposes, I need to split a stream of plaintext bytes into variable-sized chunks. Traditionally this is done with a rolling hash function defined over some window $w$ (e.g. 48 bytes). The window "slides" along the byte stream and the function is evaluated at each byte position; whenever its value satisfies, say, $f(w) \equiv 0\ (\textrm{mod}\ 1024)$, a cut point is made, marking the end of the current chunk and the beginning of the next one. The hash function is extremely fast to evaluate when the following are known:

The value of the rolling hash function at some window position
The byte entering the window
The byte leaving the window
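A minimal sketch of such a constant-time update, assuming a plain polynomial hash modulo $2^{32}$ rather than any particular production rolling hash. The names `BASE`, `hash_window`, `roll` and `is_cut` are hypothetical and chosen only for illustration:

```python
WINDOW = 48          # window size in bytes, as in the example above
BASE = 257           # hypothetical multiplier, chosen only for illustration
MASK32 = 0xFFFFFFFF  # keep the hash in 32 bits

# BASE**WINDOW mod 2**32: the weight the leaving byte carries after one more step
BASE_POW_WINDOW = pow(BASE, WINDOW, 1 << 32)


def hash_window(window: bytes) -> int:
    """Hash a full window from scratch, O(WINDOW) work."""
    h = 0
    for b in window:
        h = (h * BASE + b) & MASK32
    return h


def roll(h: int, byte_in: int, byte_out: int) -> int:
    """Slide the window by one byte using only the three known values."""
    return (h * BASE + byte_in - byte_out * BASE_POW_WINDOW) & MASK32


def is_cut(h: int, modulus: int = 1024) -> bool:
    """Cut-point rule from the text: hash congruent to 0 modulo 1024."""
    return h % modulus == 0
```

Calling `roll` once per input byte gives the same value as `hash_window` on the shifted window, which is what makes evaluating the rule at every byte position affordable.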

The issue

Now if I encrypt each chunk and upload it to the server, I already leak some information about the plaintext:

The adversary knows that the hash value (a simple CRC32-like checksum) of the final window of each plaintext chunk is $\equiv 0\ (\textrm{mod}\ 1024)$
Every other window hashes to a value $\not\equiv 0\ (\textrm{mod}\ 1024)$.
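To make the leak concrete, here is a rough sketch of how an observer could use it. It assumes the hash is public and unkeyed, that chunk lengths are visible from the ciphertext sizes, and that the attacker holds a list of candidate plaintexts. The hash and the names `cut_points` and `consistent_candidates` are hypothetical stand-ins:

```python
WINDOW, BASE, MASK32 = 48, 257, 0xFFFFFFFF  # illustrative public parameters


def window_hash(window: bytes) -> int:
    """Public, unkeyed stand-in for the CRC32-like checksum."""
    h = 0
    for b in window:
        h = (h * BASE + b) & MASK32
    return h


def cut_points(data: bytes, modulus: int = 1024) -> list:
    """End offsets of all windows that satisfy the public cut rule."""
    return [end for end in range(WINDOW, len(data) + 1)
            if window_hash(data[end - WINDOW:end]) % modulus == 0]


def consistent_candidates(observed_cuts, candidates, modulus=1024):
    """Keep only the candidates whose cut points match the observed ones."""
    return [c for c in candidates if cut_points(c, modulus) == observed_cuts]
```

Any candidate that would have been cut at different offsets is ruled out without decrypting anything, which is why the cut rule itself should depend on a secret.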

Here is a similar question, but it seems people there did not realize that the hash value (the output of the rolling hash function) is never published or known to the attacker. The only information disclosed is whether the hash value is $\equiv 0\ (\textrm{mod}\ 1024)$ (true or false) for every window position in the input plaintext stream. The answers there recommend using SipHash, or testing $AES(f(w)) \equiv 0\ (\textrm{mod}\ 1024)$ instead, but both solutions are just too slow compared to using the plain rolling hash.

There seem to be similarities with the hash-table flooding attacks, as the hash value is also never directly published. Besides the already-mentioned SipHash, Microsoft has proposed Marvin32 for this particular issue.

Questions

Besides the already mentioned solutions, what is the proper way to perform chunking without leaking information about the plaintext? Performance is very critical here though.
Would UMAC or Universal hashing have any application here?

Would this rolling hash function solve the information leakage on its own without encrypting the hash value with AES/SipHash? If not, what can be done to fix the information leakage? Pseudo-code for this hash function:

// My secret key, never shared or disclosed to anyone
var key = ....; 

// Generate substitution table
var substitution = new uint32[256];
var substitutionKey = HMAC(key, SUBSTITUTION_SALT);
for (uint32 i = 0; i <= 255; i++)
{
    var randomBytes = HMAC(substitutionKey, UInt32ToBytes(i));
    substitution[i] = BytesToUInt32(randomBytes);
}

// Generate random irreducible polynomial of degree 32 over GF(2) and store
// it in a 32-bit register (33 coefficients, but the leading one is implied
// because it's always 1)
// The algorithm is not shown here due to its complexity, but let's assume
// the polynomial is derived from the key
var polynomialKey = HMAC(key, POLYNOMIAL_SALT);
uint32 polynomial = RandomIrreduciblePolynomial33(polynomialKey);

// Hash an input window
uint8[] window = new uint8[48];
LoadPlaintext(ref window);
uint32 hash = 0;
for (uint32 i = 0; i < 48; i++)
{
    // Galois multiplication by x and subsequent reduction
    hash = (hash << 1) ^ ((hash >> 31) * polynomial);
    // Add the substitution polynomial
    hash = hash ^ substitution[window[i]];
}

...

// Now if we want to move the window forward by a single byte, there
// exists a very efficient algorithm that does that, but it's not
// relevant to the issue at hand
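For reference, one way the single-byte slide can work for this kind of hash is sketched below in Python. It relies on the hash being linear over GF(2): the leaving byte's contribution is its substitution value multiplied by $x^{48}$, which can be precomputed per byte value. Everything here is an assumption for illustration: `POLY` is the CRC-32 generator polynomial used as a fixed stand-in rather than a key-derived one, and `make_table`, `make_out_table` and `roll` are hypothetical names.

```python
import hashlib
import hmac

MASK32 = 0xFFFFFFFF
WINDOW = 48
POLY = 0x04C11DB7  # stand-in polynomial (implied x^32 term), NOT key-derived


def make_table(key: bytes) -> list:
    """Keyed substitution table: 32 bits of HMAC-SHA256 per byte value."""
    return [int.from_bytes(
                hmac.new(key, i.to_bytes(4, "big"), hashlib.sha256).digest()[:4],
                "big")
            for i in range(256)]


def mul_x(h: int, poly: int) -> int:
    """Multiply by x in GF(2)[x] and reduce modulo the degree-32 polynomial."""
    carry = h >> 31
    h = (h << 1) & MASK32
    return h ^ poly if carry else h


def hash_window(window: bytes, table: list, poly: int) -> int:
    """Same loop as the pseudo-code above, hashing a window from scratch."""
    h = 0
    for b in window:
        h = mul_x(h, poly) ^ table[b]
    return h


def make_out_table(table: list, poly: int) -> list:
    """out_table[b] = table[b] * x^WINDOW mod poly, the leaving byte's share."""
    out = []
    for v in table:
        for _ in range(WINDOW):
            v = mul_x(v, poly)
        out.append(v)
    return out


def roll(h, byte_in, byte_out, table, out_table, poly):
    """Slide by one byte: shift, add the entering byte, cancel the leaving one."""
    return mul_x(h, poly) ^ table[byte_in] ^ out_table[byte_out]
```

The slide costs one shift and three XORs per byte plus two table lookups, so keying the table and the polynomial does not by itself slow the rolling hash down.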

score 5 · Answer 1 · answered Sep 14 '15 at 07:08

UMAC uses AES, so it will not lead to a faster algorithm than the rolling hash + AES solution suggested in the previous question. Universal hashing based MACs like UMAC and Poly1305 usually use a PRF (which you could just use directly) to hide the hash. There are universal hashing MACs that do not, but as far as I know they are even slower. (Further, I am not sure if a secure MAC would even be enough, or if you actually need a PRF.)

Without leaking any more information, there are not many options. Rolling hash + SipHash is probably faster anywhere that you lack AES hardware, but that may be the best you can do.

If you find the performance cost unacceptable, you might want to allow some information to leak. For example, if you end a chunk only where parity(w) = 0 and PRF(k, f(w)) mod (n / 2) = 0, and evaluate the PRF only at positions that pass the parity test, you halve the number of PRF calls while leaking only one extra bit of information about the window at each cut point (and even less for those positions where you did not end a chunk).
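A rough sketch of that parity pre-filter, under several assumptions: HMAC-SHA256 stands in for the PRF (SipHash or AES would be the realistic choices), `f` is a simple illustrative hash recomputed per window rather than rolled, and parity(w) is read as the XOR of all bits in the window. All names are hypothetical:

```python
import hashlib
import hmac

WINDOW, BASE, MASK32 = 48, 257, 0xFFFFFFFF  # illustrative parameters


def f(window: bytes) -> int:
    """Stand-in for the rolling hash (recomputed here for brevity)."""
    h = 0
    for b in window:
        h = (h * BASE + b) & MASK32
    return h


def parity(window: bytes) -> int:
    """XOR of all bits in the window: one possible reading of parity(w)."""
    x = 0
    for b in window:
        x ^= b
    return bin(x).count("1") & 1


def prf(key: bytes, value: int) -> int:
    """HMAC-SHA256 truncated to 64 bits, used only as a stand-in PRF."""
    mac = hmac.new(key, value.to_bytes(4, "big"), hashlib.sha256).digest()
    return int.from_bytes(mac[:8], "big")


def cut_points(data: bytes, key: bytes, n: int = 1024):
    """Return (cut offsets, number of PRF calls) for the filtered rule."""
    cuts, prf_calls = [], 0
    for end in range(WINDOW, len(data) + 1):
        w = data[end - WINDOW:end]
        if parity(w) != 0:
            continue  # cheap public test failed, so no PRF call is made
        prf_calls += 1
        if prf(key, f(w)) % (n // 2) == 0:
            cuts.append(end)
    return cuts, prf_calls
```

On typical data about half the positions pass the parity test, so the PRF runs about half as often, and using `n // 2` as the modulus keeps the expected chunk size close to `n`. In a real implementation the parity could be maintained incrementally like the rolling hash.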

You can of course reveal even more information to reduce the number of calls further, but in the end you are back at something that leaks as much as just using f(w) mod n = 0.
