Rabin Karp rolling hash - dynamic sized chunks based on hashed content
The traditional computation of the hashes of overlapping (sliding) windows generally works like this:
Input: [a,b,c,d,e]
Window size: 3
Hash#1: hash([a,b,c])
Hash#2: hash([b,c,d])
Hash#3: hash([c,d,e])
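In code, this naive approach would look something like the sketch below (illustrative only, not from the original post); every window is rehashed from scratch, costing O(window size) work per position:

import java.util.Arrays;

static int[] naiveWindowHashes(byte[] data, int windowSize) {
    int[] hashes = new int[data.length - windowSize + 1];
    for (int start = 0; start < hashes.length; start++) {
        // hash([a,b,c]), hash([b,c,d]), ... - each window is hashed in full
        hashes[start] = Arrays.hashCode(Arrays.copyOfRange(data, start, start + windowSize));
    }
    return hashes;
}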
The Rabin-Karp implementation very efficiently reuses the previous hash to calculate the new one:
Hash#1 = hash([a,b,c])
Hash#2 = Hash#1 – hash(a) + hash(d)
Hash#3 = Hash#2 – hash(b) + hash(e)
So, to calculate the next window’s hash, all we need to do is remove the window’s first element's ‘hash value’ from the current hash and add the new element's to it.
Even though a simple mathematical addition would do fine, it’s usually not preferred because it can cause a high collision rate among the hash values:
a+b+c = a+c+b = b+a+c = b+c+a = c+a+b = c+b+a !
Instead, the Rabin-Karp algorithm typically works with the powers of a prime number to calculate the hash of a window [a[0], a[1], ..., a[n-1]]:
p^(n-1)*a[0] + p^(n-2)*a[1] + ... + p*a[n-2] + a[n-1]
Let’s say the prime is 7 (seven is not too good a choice for hashing in the MOD 2^32 number space):
49*a + 7*b + c <> 49*b + 7*a + c <> ….
[note: of course, for some specific values the inequality can still fail - collisions just become much rarer]
Using the previous example, if we know the hash of [a,b,c] the next hash window is straightforward:
Hash#2 = (Hash#1 – 49*a ) * 7 + d
= (49a + 7b + c - 49a) * 7 + d
= (7b + c) * 7 + d
= 49b + 7c + d
The nice thing about it is that it requires a constant number of operations and does not depend on the size of the window.
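To make the update rule concrete, here is a minimal Java sketch (illustrative only; the method and parameter names are mine, and the full streaming version appears in the sample implementation below). Java's wrap-around int arithmetic provides the MOD 2^32 behaviour for free:

static int initHash(int[] window, int p) {
    int hash = 0;
    for (int v : window) {
        hash = hash * p + v; // builds p^(n-1)*a[0] + ... + p*a[n-2] + a[n-1]
    }
    return hash;
}

static int rollHash(int hash, int outgoing, int incoming, int p, int pPow) {
    // pPow is p^(n-1), the weight of the element leaving the window
    hash -= pPow * outgoing; // Hash#1 - 49*a in the example above
    hash *= p;               // shift the remaining terms by one power of p
    hash += incoming;        // + d
    return hash;
}

With the window [a,b,c], p = 7 and pPow = 49, rollHash(initHash(...), a, d, 7, 49) gives 49b + 7c + d, matching the derivation above.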
Pattern matching
One of the (not too exciting) usages of the Rabin-Karp algorithm is pattern matching: finding a short string sample in a large text corpus. The reason it’s not too exciting is that there are many other dedicated text-searching algorithms whose running time is generally better than Rabin-Karp’s (the latter runs in O(len(text) + len(pattern)) on average, but the worst case is O(len(text) * len(pattern)) when spurious hash matches are too frequent).
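For completeness, here is a rough sketch of Rabin-Karp pattern matching (illustrative, not from the original post; it reuses the 69069 multiplier from the chunking code below and relies on Java's wrap-around int arithmetic for the MOD 2^32 behaviour):

static int indexOf(String text, String pattern) {
    int n = text.length(), m = pattern.length();
    if (m == 0 || m > n) return m == 0 ? 0 : -1;

    final int p = 69069;  // same multiplier as in the chunking sample below
    int pPow = 1;         // p^(m-1), the weight of the outgoing character
    for (int i = 1; i < m; i++) pPow *= p;

    int patternHash = 0, windowHash = 0;
    for (int i = 0; i < m; i++) {
        patternHash = patternHash * p + pattern.charAt(i);
        windowHash  = windowHash  * p + text.charAt(i);
    }

    for (int start = 0; ; start++) {
        // hash equality is only a hint - verify to rule out collisions
        if (windowHash == patternHash && text.regionMatches(start, pattern, 0, m))
            return start;
        if (start + m >= n) return -1;
        windowHash -= pPow * text.charAt(start); // remove the outgoing character
        windowHash *= p;                         // shift by one power of p
        windowHash += text.charAt(start + m);    // add the incoming character
    }
}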
Variable file chunk size slicing
However, content-based file chunking is a very interesting application of the Rabin-Karp algorithm. Consider the following problem:
One’s got a large file which has been modified at some location(s). A modification can be an addition/removal/replacement. Check which sections of the file have been modified.
Original: This is text. This will be removed. This will be modified. The end.
Modified: This is text. This has been modified. The end.
If we create fixed-size windows (10 bytes) to check the file, then apart from the first window ('This is te') every other window appears to have been modified, even though it’s clear that this is not what happened.
However, we can say we are aiming for a window size of about 16 bytes. This means that if we used a 4-bit counter, we would want to create a new window whenever all of its bits are zeroes after incrementing it:
0000 – Window start
0001
…
1111
0000 – New window start
We can do exactly the same with the Rabin-Karp rolling hash: start a new window when the lowest bits of the hash are all zeroes!
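In code, the boundary test is a single mask comparison (a tiny sketch; hash is the current rolling hash value and k is the number of low bits to test - both placeholders):

// a new chunk starts when the lowest k bits of the rolling hash are all zero;
// the expected average chunk size is roughly 2^k bytes
int mask = (1 << k) - 1;
if ((hash & mask) == 0) {
    // chunk boundary - close the current chunk here and start a new one
}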
If we are aiming for a ~8 KB window size, we’d need to use the lowest 13 bits of the hash (2^13 = 8192) to decide when to start a new window. This method gives very reasonable window sizes (tested on a ~300 MB binary installer file):
Min window: 1 byte
Max window: 81287 bytes
Average window: 8132 bytes
Median window: 5658 bytes
With this method, even if some parts of the file were modified, these modifications can only affect the current and potentially the next window; the rest would be identical slices.
So, we have a lot of reasonably sized chunks of a file. How do we compare them to the original chunks? The trivial answer would be a byte-by-byte comparison, but that is way too expensive. Simple hashing can be used to compare the content instead.
Comparison by hashing
Generally it’s very natural to say that just because two hash values are the same, it does not mean that the original values are the same too. Mathematically this is perfectly correct, although in practice large hash outputs (128+ bits) collide so rarely that it is safe to assume that if the hashes are the same, the original values were the same too.
Google published a memory hardware error study in 2009, and they found that memory errors are very common - much more so than I'd have thought. The uncorrectable error rate (when the content of the memory is corrupted irreversibly) is quite significant too:
"…while uncorrectable errors are less common than correctable errors, they do happen at a significant rate. Across the entire fleet, 1.3% of machines are affected by uncorrectable errors per year, with some platforms seeing as many as 2-4% affected…"
Adding up all kinds of hardware errors, it is safe to say that SHA-1 (160-bit) or SHA-2 (224+ bit) hashes are orders of magnitude less likely to collide than a hardware error is to happen, so it is safe and sufficient to compare the hash values of these chunks - there is no need to compare the contents byte-by-byte.
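As an illustration (a sketch using the JDK's MessageDigest with SHA-256, one member of the SHA-2 family mentioned above; the helper names are mine), chunk comparison reduces to a digest lookup:

import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// store a strong hash of every chunk of the original file...
static Set<String> chunkDigests(List<byte[]> chunks) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    Set<String> digests = new HashSet<>();
    for (byte[] chunk : chunks) {
        digests.add(Base64.getEncoder().encodeToString(md.digest(chunk)));
    }
    return digests;
}

// ...then a chunk of the modified file counts as unchanged if its digest is already known
static boolean unchanged(Set<String> originalDigests, byte[] chunk) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    return originalDigests.contains(Base64.getEncoder().encodeToString(md.digest(chunk)));
}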
Sample implementation
Below is a very simple Java implementation of the Rabin-Karp hash based chunking, producing ~8 KB chunk sizes (the code is not optimized!). Look out for the 4 key points:
get the 13 bit mask: int mask = 1 << 13; mask--;
get the initial 1024 bytes hash: int hash = inithash(1024);
calculate the next hash from the input stream: hash = nexthash(hash);
check the lowest 13 bits are all zeroes: if ((hash & mask) == 0)
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RabinKarpChunker { // wrapping class added for completeness

    public static final int hconst = 69069; // good hash multiplier for MOD 2^32

    public int mult = 1; // this will hold the p^(n-1) value (weight of the outgoing byte)

    int[] buffer; // circular buffer - reading from file stream
    int buffptr = 0;
    InputStream is;

    public void displayChunks() {
        String filelocation = "file.bin";

        int mask = 1 << 13;
        mask--; // 13 bits of '1's

        File f = new File(filelocation);
        FileInputStream fs;
        try {
            fs = new FileInputStream(f);
            // BufferedInputStream is faster to read byte-by-byte from
            BufferedInputStream bis = new BufferedInputStream(fs);
            this.is = bis;

            // note: available() is used as the stream length in this simple sample
            long length = bis.available();
            long curr = length;

            // get the initial 1k hash window
            int hash = inithash(1024);

            curr -= bis.available();

            while (curr < length) {
                if ((hash & mask) == 0) {
                    // window found - hash it,
                    // compare it to the other file chunk list
                }

                // next window's hash
                hash = nexthash(hash);

                curr++;
            }

            fs.close();
        } catch (Exception e) {
            // error handling omitted in this sample
        }
    }

    public int nexthash(int prevhash) throws IOException {
        int c = is.read(); // next byte from stream

        prevhash -= mult * buffer[buffptr]; // remove the last value
        prevhash *= hconst;                 // multiply the whole chain with the prime
        prevhash += c;                      // add the new value

        buffer[buffptr] = c; // circular buffer, 1st pos == last pos
        buffptr++;
        buffptr = buffptr % buffer.length;

        return prevhash;
    }

    public int inithash(int length) throws IOException {
        return inithash(length - 1, 0);
    }

    public int inithash(int from, int to) throws IOException {
        buffer = new int[from - to + 1]; // create circular buffer
        int hash = 0;

        int cnt = to;
        while (cnt != 0) { // skip first characters if needed
            int c = is.read();
            if (c == -1)
                throw new IOException();
            cnt--;
        }

        // calculate the hash: sum of p^i * a[x]
        for (int i = 0; i <= from - to; i++) {
            int c = is.read();
            if (c == -1) // file is shorter than the required window size
                break;

            // store the byte so we can remove it from the hash later
            buffer[buffptr] = c;
            buffptr++;
            buffptr = buffptr % buffer.length;

            hash *= hconst; // multiply the current hash with the constant
            hash += c;      // add the byte to the hash

            if (i > 0) // calculate the large p^(n-1) value for later usage
                mult *= hconst;
        }

        return hash;
    }
}
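A possible entry point (a hypothetical usage sketch; the main method and the RabinKarpChunker class name are additions for completeness, not part of the original post):

public static void main(String[] args) {
    // "file.bin" is hard-coded in displayChunks() above
    new RabinKarpChunker().displayChunks();
}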