Wednesday, October 24, 2012

Rabin Karp rolling hash - dynamic sized chunks based on hashed content

The Rabin-Karp rolling hash algorithm is excellent at finding a pattern in a large file or stream, but there is an even more interesting use case: creating content based chunks of a file to detect the changed blocks without doing a full byte-by-byte comparison.

The traditional computation of the hashes of overlapping windows generally works like this:

Input: [a,b,c,d,e]
Window size: 3
Hash#1: hash([a,b,c])
Hash#2: hash([b,c,d])
Hash#3: hash([c,d,e])

The Rabin-Karp implementation very efficiently reuses the previous hash to calculate the new one:

Hash#1 = hash([a,b,c])
Hash#2 = Hash#1 – hash(a) + hash(d)
Hash#3 = Hash#2 – hash(b) + hash(e)

So, to calculate the next window’s hash, all we need to do is remove the window’s first element’s ‘hash value’ from the current hash and add the new element’s to it.
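This additive reuse can be sketched in a few lines (a toy illustration with names of my own choosing; the real Rabin-Karp hash, as we'll see, is multiplicative):

```java
public class RollingSumDemo {
    // naive hash: sum of the byte values in the window
    static int hash(int[] a, int from, int len) {
        int h = 0;
        for (int i = from; i < from + len; i++)
            h += a[i];
        return h;
    }

    public static void main(String[] args) {
        int[] input = {'a', 'b', 'c', 'd', 'e'};
        int hash1 = hash(input, 0, 3);           // hash([a,b,c])
        int hash2 = hash1 - input[0] + input[3]; // drop 'a', add 'd'
        int hash3 = hash2 - input[1] + input[4]; // drop 'b', add 'e'
        System.out.println(hash2 == hash(input, 1, 3)); // true
        System.out.println(hash3 == hash(input, 2, 3)); // true
    }
}
```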
Even though a simple mathematical addition would do fine, it’s usually not preferred because it causes a high collision rate among the hash values:

a+b+c = a+c+b = b+a+c = b+c+a = c+a+b = c+b+a !

This is why the Rabin-Karp algorithm typically works with the powers of a prime number to calculate the hash:

p^n*a[n] + p^(n-1)*a[n-1] + ... + p*a[1] + a[0]

Let’s say the prime is 7 (seven is not a great choice for hashing in the MOD 2^32 number space, but it keeps the example readable):

49*a + 7*b + c ≠ 49*b + 7*a + c ≠ …
[note: of course, the inequality does not hold for every possible choice of values]

Using the previous example, if we know the hash of [a,b,c] the next hash window is straightforward:

Hash#2 = (Hash#1 – 49*a ) * 7 + d
       = (49a + 7b + c - 49a) * 7 + d
       = (7b + c) * 7 + d
       = 49b + 7c + d
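The derivation above is easy to sanity-check with concrete byte values (a quick sketch using the byte values of 'a'..'d'):

```java
public class PolyRollDemo {
    // one rolling step with prime p = 7 and window size 3:
    // newHash = (oldHash - 49 * outgoing) * 7 + incoming
    static int roll(int oldHash, int outgoing, int incoming) {
        return (oldHash - 49 * outgoing) * 7 + incoming;
    }

    public static void main(String[] args) {
        int a = 'a', b = 'b', c = 'c', d = 'd';
        int hash1 = 49 * a + 7 * b + c; // hash([a,b,c])
        int hash2 = roll(hash1, a, d);  // slide the window by one
        System.out.println(hash2 == 49 * b + 7 * c + d); // true
    }
}
```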

The nice thing about it is that it requires a constant number of operations per step, independent of the size of the window.

Pattern matching

One of the (not too exciting) usages of the Rabin-Karp algorithm is pattern matching: finding a short string sample in a large text corpus. The reason it’s not too exciting is that there are many other text-specific search algorithms whose running time is generally better than Rabin-Karp’s (the latter runs in O(len(text) + len(pattern)) on average, but the worst case is O(len(text) * len(pattern)) when hash equality is too frequent).
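For completeness, here is a simplified pattern-matching sketch (variable names and multiplier are my own choices; on a hash match it still verifies byte-by-byte, which is exactly what makes the worst case quadratic when collisions are frequent):

```java
public class RabinKarpSearch {
    static final int P = 69069; // multiplier; arithmetic wraps MOD 2^32

    // returns the first index of pattern in text, or -1
    static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m > n) return -1;

        int ph = 0, th = 0, mult = 1;
        for (int i = 0; i < m; i++) {
            ph = ph * P + pattern.charAt(i); // pattern hash
            th = th * P + text.charAt(i);    // first window hash
            if (i > 0) mult *= P;            // P^(m-1): outgoing char's coefficient
        }
        for (int i = 0; ; i++) {
            // equal hashes only *suggest* a match - confirm byte-by-byte
            if (ph == th && text.regionMatches(i, pattern, 0, m)) return i;
            if (i + m >= n) return -1;
            th = (th - mult * text.charAt(i)) * P + text.charAt(i + m);
        }
    }

    public static void main(String[] args) {
        System.out.println(indexOf("hello rolling hash", "rolling")); // 6
        System.out.println(indexOf("hello rolling hash", "xyz"));     // -1
    }
}
```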

Variable file chunk size slicing

Content-based file chunking, however, is a very interesting application of the Rabin-Karp algorithm. Consider the following problem:

One has a large file which has been modified at some location(s). A modification can be an add, remove, or replace. Check which sections of the file have been modified.

Original

This is text.
This will be removed.
This will be modified.
The end.

Modified

This is text.
This has been modified.
The end.


If we create fixed-size windows (10 bytes) to check the file, then apart from the first window ('This is te') every other one appears modified, even though it’s clear that this is not what happened.

However, suppose we are aiming for a window size of about 16 bytes. If we used a 4-bit counter, we would want to start a new window whenever its bits are all zeroes after incrementing it:

0000 – Window start
0001

1111
0000 – New window start

We can do exactly the same with the Rabin-Karp rolling hash: start a new window when the lowest bits of the hash are all zeroes!
If we are aiming for a ~8k window size, we need to use the lowest 13 bits of the hash to decide when to start a new window (the boundary condition fires with probability 2^-13, so the expected window size is 2^13 = 8192 bytes). This method gives very reasonable window sizes (tested on a ~300 MB binary installer file):

Min window:      1
Max window:      81287
Average window:  8132
Median window:   5658
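A quick simulation, with uniformly random values standing in for the rolling hash output (a sketch; the names are mine), reproduces the ~8k average:

```java
import java.util.Random;

public class WindowSizeSim {
    // average distance between positions whose lowest 13 bits are all zero
    static double averageWindow(int samples, long seed) {
        int mask = (1 << 13) - 1;
        Random rnd = new Random(seed);
        int boundaries = 0;
        for (int i = 0; i < samples; i++)
            if ((rnd.nextInt() & mask) == 0) // stand-in for the rolling hash
                boundaries++;
        return (double) samples / boundaries;
    }

    public static void main(String[] args) {
        System.out.println(averageWindow(10_000_000, 42)); // close to 8192
    }
}
```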



With this method, even if some parts of the file are modified, the modifications can only affect the current and potentially the next window; the rest would be identical slices.

So, we have a lot of reasonably sized chunks of a file. How do we compare them to the original windows? The trivial answer would be a byte-by-byte comparison, but that is way too expensive. Simple hashing can be used to compare the content instead.

Comparison by hashing

Generally it’s very natural to say that just because two hash values are the same, it does not mean the original values are the same too. Mathematically this is perfectly correct, although in practice the larger hash outputs (128+ bits) make collisions so unlikely that it is safe to assume that if the hashes are equal, the original values were equal too.

Google published a memory hardware error study in 2009 and found that memory errors are very common - much more common than I’d have thought. The uncorrectable error rate (when the content of the memory corrupts irreversibly) is quite significant too:

"…while uncorrectable errors are less common than correctable errors, they do happen at a significant rate. Across the entire fleet, 1.3% of machines are affected by uncorrectable errors per year, with some platforms seeing as many as 2-4% affected…"

Adding up all kinds of hardware errors, it is safe to say that a SHA-1 (160-bit) or SHA-2 (224+ bit) collision is orders of magnitude less likely than a hardware error happening, so it is safe and sufficient to compare the hash values of the chunks; there is no need to compare the contents byte-by-byte.
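In Java, the standard MessageDigest class is enough for the chunk comparison (a sketch; the chunk contents here are made up):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class ChunkCompare {
    // SHA-256 digest of one chunk - 256 bits, so collision odds are far
    // below the hardware error rates quoted above
    static byte[] digest(byte[] chunk) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(chunk);
    }

    public static void main(String[] args) throws Exception {
        byte[] original  = "This is text.".getBytes(StandardCharsets.UTF_8);
        byte[] unchanged = "This is text.".getBytes(StandardCharsets.UTF_8);
        byte[] modified  = "This has been modified.".getBytes(StandardCharsets.UTF_8);

        // equal digests => treat the chunks as identical, skip byte-by-byte
        System.out.println(Arrays.equals(digest(original), digest(unchanged))); // true
        System.out.println(Arrays.equals(digest(original), digest(modified)));  // false
    }
}
```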

Sample implementation

Below is a very simple Java implementation of the Rabin-Karp hash based chunking, producing ~8k chunk sizes (the code is not optimized!). Look out for the four key points:
get the 13 bit mask: int mask = 1 << 13; mask--;
get the initial 1024 bytes hash: int hash = inithash(1024); 
calculate the next hash from the input stream: hash = nexthash(hash);
check the lowest 13 bits are all zeroes: if ((hash & mask) == 0) 

import java.io.*;

public class RabinKarpChunker {

public static final int hconst = 69069; // good hash multiplier for MOD 2^32
public int mult = 1; // this will hold the p^(n-1) value
int[] buffer; // circular buffer - reading from file stream
int buffptr = 0;
InputStream is;

public void displayChunks() {
 String filelocation = "file.bin";
 int mask = 1 << 13;
 mask--; // 13 bit of '1's

 File f = new File(filelocation);
 FileInputStream fs;
 try {
  fs = new FileInputStream(f);
  BufferedInputStream bis = new BufferedInputStream(fs); 
  // BufferedInputStream is faster to read byte-by-byte from
  this.is = bis;

  long length = f.length(); // more reliable than available()
  long curr = length;

  // get the initial 1k hash window //
  int hash = inithash(1024); 
  ////////////////////////////////////

  curr -= bis.available();

  while (curr < length) {
   if ((hash & mask) == 0) {
    // window found - hash it, 
    // compare it to the other file chunk list
   }
   // next window's hash  //
   hash = nexthash(hash);
   /////////////////////////
   curr++;
  }
  fs.close();
 } catch (Exception e) {
  e.printStackTrace(); // keep the sample simple - just report
 }
}

public int nexthash(int prevhash) throws IOException {
 int c = is.read(); // next byte from stream

 prevhash -= mult * buffer[buffptr]; // remove the oldest byte's contribution
 prevhash *= hconst; // multiply the whole chain with prime
 prevhash += c; // add the new value
 buffer[buffptr] = c; // circular buffer, 1st pos == lastpos
 buffptr++;
 buffptr = buffptr % buffer.length;

 return prevhash;
}

public int inithash(int length) throws IOException {
 return inithash(length - 1, 0);
}

public int inithash(int from, int to) throws IOException {
 buffer = new int[from - to + 1]; // create circular buffer

 int hash = 0;

 int cnt = to;
 while (cnt != 0) { // skip first characters if needed
  int c = is.read();
  if (c == -1)
   throw new IOException();
  cnt--;
 }

 // calculate the hash sum of p^n * a[x]
 for (int i = 0; i <= from - to; i++) { 
  int c = is.read();
  if (c == -1) // file is shorter than the required window size
   break;

  // store byte so we can remove it from the hash later
  buffer[buffptr] = c; 
  buffptr++;
  buffptr = buffptr % buffer.length;

  hash *= hconst; // multiply the current hash with constant

  hash += c; // add byte to hash
   
  if (i > 0) // accumulate p^(n-1) for later use in nexthash
   mult *= hconst;
 }

 return hash;
}

} 

