Rabin Karp rolling hash - dynamic sized chunks based on hashed content
The Rabin-Karp rolling hash algorithm is excellent at finding a pattern in a large file or stream, but there is an even more interesting use case: creating content based chunks of a file to detect the changed blocks without doing a full byte-by-byte comparison. The traditional computation of the hashes of interleaving windows generally works like this: Input: [a,b,c,d,e] Window size: 3 Hash#1: hash([a,b,c]) Hash#2: hash([b,c,d]) Hash#3: hash([c,d,e]) The Rabin-Karp implementation is very efficiently reuses the previous hash to calculate the new one: Hash#1 = hash([a,b,c]) Hash#2 = Hash#1 – hash(a) + hash(d) Hash#3 = Hash#2 – hash(b) + hash(e) So, to calculate the next window’s hash all we need to do is to remove the window’s first element's ‘hash value’ from the current hash and add the new to it. Even though a simple mathematical addition would do fine, it’s usually not preferred because it can cause high collision rate among the hash values: a+b+c = a+