
I was reading up on chunking and found that Rabin fingerprinting is also used to break files into chunks. After reading about the algorithm, I understood how the rolling hash is computed using a Rabin fingerprint; that part is explained almost everywhere. What I don't get is how, in Rabin fingerprinting, we know where to end one chunk and start another. And how does an insertion not affect the original chunks? Some sources mention the following, which I didn't get. (Although it's given with reference to LBFS, I think chunking for de-duplication is done in a similar way.)

When the low 13 bits of the fingerprint are zero LBFS calls those 48 bytes a breakpoint and ends the current block and begins a new one. Since the output of Rabin fingerprints are pseudo-random the probability of any given 48 bytes being a breakpoint is 2^{-13}. This has the effect of shift-resistant variable size blocks.

We check whether the low 13 bits are 0 because the fingerprint is basically a remainder after dividing by an irreducible polynomial, right? But even then, how does inserting something into chunk 1 still keep the chunking shift-resistant?

Thanks in advance for the help!

Edit: I am talking about the application of Rabin fingerprinting to breaking files into chunks for de-duplication. (Not for string search; I understood that one.)

user270386

3 Answers


Note that any hash algorithm can be used to break files into chunks such that the chunk boundaries don't change much after inserts/deletes. Rabin fingerprints (and Rabin-Karp rollsums) are just hash functions that are well suited to this purpose.

The trick is that you calculate the hash of a small window of data bytes (say 64, though I've heard even as few as 16 works fine) at every single byte offset, and you place a chunk boundary wherever the last N bits of the hash are zero. This gives you an average block size of 2^N bytes. Typically you also enforce a min and max block size, which lets you skip ahead by the min block size after each boundary and avoids blocks that are too small or too large.
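For concreteness, here is a minimal sketch of that scheme in Python. Everything in it is an illustrative assumption: the window size, the mask width, the min/max limits, and the simple polynomial rolling hash, which stands in for a real Rabin fingerprint over an irreducible polynomial.

```python
# Content-defined chunking sketch. All constants and the polynomial
# rolling hash are illustrative assumptions, not LBFS's parameters
# or Rabin's irreducible-polynomial construction.
WINDOW = 48            # bytes hashed at each offset
MASK = (1 << 13) - 1   # boundary when the low 13 bits are zero
MIN_SIZE = 2 * 1024    # suppress boundaries that would make tiny chunks
MAX_SIZE = 64 * 1024   # force a boundary so chunks can't grow unbounded
BASE = 257             # multiplier of the polynomial hash
MOD = (1 << 61) - 1    # large prime modulus
POW_OUT = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window

def chunk_ends(data: bytes) -> list[int]:
    """Return the end offset of each chunk in data."""
    ends = []
    start = 0   # start of the current chunk
    h = 0       # rolling hash of the last WINDOW bytes
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW_OUT) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                        # bring in the new byte
        size = i - start + 1
        if (size >= MIN_SIZE and h & MASK == 0) or size >= MAX_SIZE:
            ends.append(i + 1)
            start = i + 1
    if start < len(data):
        ends.append(len(data))  # final partial chunk
    return ends
```

With a 13-bit mask the expected chunk size is about 2^13 = 8 KiB, before the min/max limits bias it.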

The main reason Rabin (or Rabin-Karp, which I believe is a more general form that highlights the rolling property) is useful for this is that it can be updated efficiently: a cheap, fast update rotates the oldest byte out of the small window and the next byte in. This makes it very cheap to calculate the hash at every byte offset.
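Here is a toy demonstration of that rolling property in Python (the constants are again my own illustrative choices). The assertion checks that the O(1) roll gives exactly the same value as rehashing the window from scratch.

```python
# Toy demonstration of the rolling property (Rabin-Karp style).
# BASE, MOD and WINDOW are illustrative constants, not from any real system.
BASE = 257
MOD = (1 << 61) - 1
WINDOW = 48
POW_OUT = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window

def hash_window(window: bytes) -> int:
    """Hash a window from scratch in O(len(window))."""
    h = 0
    for b in window:
        h = (h * BASE + b) % MOD
    return h

def roll(h: int, out_byte: int, in_byte: int) -> int:
    """Slide the window one byte in O(1): remove out_byte, append in_byte."""
    h = (h - out_byte * POW_OUT) % MOD
    return (h * BASE + in_byte) % MOD

data = bytes(range(256)) * 4
h = hash_window(data[:WINDOW])
for i in range(1, len(data) - WINDOW + 1):
    h = roll(h, data[i - 1], data[i - 1 + WINDOW])
    assert h == hash_window(data[i:i + WINDOW])  # O(1) roll == full rehash
```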

The bit you may have missed in your LBFS example is that the Rabin fingerprint covers only the last 48-byte window of each chunk. Inserting a byte anywhere before that window does not change the fingerprint of that window, so it will still be a chunk break point.

It is possible, given a min block size and a coincidence of multiple break points in the data, that an insert/delete exposes a different break point that affects the following one (typically by masking it, because it falls within the min block size), but in practice the break points quickly re-synchronize after that. The demo below illustrates the basic shift-resistance.
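This is a small self-contained demo of my own construction: it uses blake2b over each window rather than a rolling hash, with a tiny window and mask so the effect shows up on a small input. Boundaries before the edit are untouched, and every boundary past it reappears shifted by exactly one byte.

```python
# Demo: an insertion only disturbs chunk boundaries near the edit.
# blake2b stands in for the rolling hash; WINDOW and MASK are tiny
# illustrative values so the demo runs on a small input.
import hashlib
import random

WINDOW = 16
MASK = (1 << 8) - 1  # boundary when the low 8 bits are zero (~256-byte chunks)

def boundaries(data: bytes) -> list[int]:
    """Offsets where the hash of the trailing WINDOW bytes has low bits zero."""
    cuts = []
    for i in range(WINDOW, len(data) + 1):
        h = int.from_bytes(hashlib.blake2b(data[i - WINDOW:i]).digest()[:8], "big")
        if h & MASK == 0:
            cuts.append(i)
    return cuts

random.seed(0)
original = bytes(random.randrange(256) for _ in range(20000))
edited = original[:5000] + b"X" + original[5000:]  # insert one byte at offset 5000

before = boundaries(original)
after = boundaries(edited)
# Boundaries before the edit are identical...
assert [b for b in before if b <= 5000] == [a for a in after if a <= 5000]
# ...and boundaries past the edit are the old ones shifted by one byte.
assert [b + 1 for b in before if b >= 5000 + WINDOW] == \
       [a for a in after if a >= 5001 + WINDOW]
```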

Donovan Baarda

I know I'm very late to this question, but I want to try to provide an answer for anyone who stumbles across it as I did. I went through the original paper by Michael Rabin (http://www.xmailserver.org/rabin.pdf) and found that the Rabin fingerprint was designed NOT to be shift-resistant. In his paper he used the fingerprint to detect any data modification in files, since the fingerprint is different if any bit is changed. So I have no idea why the LBFS paper claims to use fingerprinting for dynamic chunk borders, because that is exactly what it is not meant for. Look for a different rolling hash if you want shift-resistant chunk borders. That being said, fingerprinting would be a great way to check whether two chunks are the same or not.


It depends on what you are using the Rabin rolling hash for. If you want to check whether a string P of length k is contained in a large string S, then the Rabin algorithm computes the Rabin hash of all substrings of S of length k. So, there are no decisions to make: every substring of length k gets hashed.
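For concreteness, here is a minimal sketch of that use (the constants and the verification step are my own assumptions, not a specific published variant): hash P once, roll a length-k window across S, and compare.

```python
# Minimal Rabin-Karp substring search sketch; BASE and MOD are
# illustrative choices, and matches are verified to rule out collisions.
BASE = 257
MOD = (1 << 61) - 1

def rabin_karp(P: bytes, S: bytes) -> int:
    """Return the first index of P in S, or -1 if P does not occur."""
    k = len(P)
    if k == 0 or k > len(S):
        return -1
    pow_out = pow(BASE, k - 1, MOD)  # weight of the byte leaving the window
    hp = hs = 0
    for j in range(k):  # hash P and the first window of S
        hp = (hp * BASE + P[j]) % MOD
        hs = (hs * BASE + S[j]) % MOD
    for i in range(len(S) - k + 1):
        if hs == hp and S[i:i + k] == P:  # verify to guard against collisions
            return i
        if i + k < len(S):  # roll the window one byte to the right
            hs = ((hs - S[i] * pow_out) * BASE + S[i + k]) % MOD
    return -1

assert rabin_karp(b"needle", b"haystack with a needle in it") == 16
```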

Other systems may use the Rabin hash for other purposes. There's more than one algorithm that uses this hash function. We can't tell you how "the algorithm" decides where to begin/end chunks without knowing what algorithm you are referring to.

D.W.