August 9th, 2011, 02:22 PM
Calculating a hash code for a large file in parallel
I would like to improve the performance of hashing large files, say for example in the tens of gigabytes in size.
Normally, you hash the bytes of the file sequentially with a hash function (say, for example, SHA-256, although I will most likely use Skein, so hashing will be slow compared to reading the file from an SSD). Let's call this Method 1.
The idea is to hash 1 MB blocks of the file individually (in parallel on 8 CPUs) and then hash the concatenated hashes into a single final hash. Let's call this Method 2.
I would like to know if this idea is sound and how much "security" is lost (in terms of collisions being more probable) vs doing a single hash over the span of the entire file.
Let's use the SHA-256 variant of SHA-2, and say the file size is 2^35 = 34,359,738,368 bytes (32 GiB).
So using a simple single pass (Method 1), I would get a 256-bit hash for the entire file.
Compare this with:
Using the parallel hashing (i.e., Method 2), I would break the file into 32,768 blocks of 1 MB each, hash those blocks with SHA-256 into 32,768 digests of 256 bits (32 bytes) each, concatenate the digests, and do a final hash of the resulting 1,048,576-byte concatenation to get my final 256-bit hash for the entire file.
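The two methods above can be sketched in Python with hashlib (a minimal sketch, not the poster's actual implementation; the 1 MiB block size and 8 workers are taken from the post, and a thread pool is used because hashlib releases the GIL while hashing large buffers):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1 << 20  # 1 MiB blocks, as in the scheme above

def hash_whole(data: bytes) -> str:
    """Method 1: one sequential SHA-256 pass over the entire input."""
    return hashlib.sha256(data).hexdigest()

def hash_of_hashes(data: bytes, workers: int = 8) -> str:
    """Method 2: hash fixed-size blocks in parallel, then hash the
    concatenation of the per-block digests into the final value."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = pool.map(lambda b: hashlib.sha256(b).digest(), blocks)
    return hashlib.sha256(b"".join(digests)).hexdigest()
```

Note that Method 2 produces a different (but still deterministic) 256-bit value than Method 1 for the same file, so the two are not interchangeable once a hash has been published.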
Is Method 2 any less secure than Method 1, in terms of collisions being more possible and/or probable? Perhaps I should rephrase the question: does Method 2 make it easier for an attacker to create a file that hashes to the same value as the original file? (Set aside the trivial fact that a brute-force attack becomes cheaper, since the hash can be computed in parallel on N CPUs.)
August 10th, 2011, 05:11 AM
It seems to be a difficult problem.
Maybe try a "probabilistic" approach:
for a given hash function, one can estimate the probability of a collision.
That should give some estimate of the collision probability of the hash of hashes.
August 11th, 2011, 05:16 AM
Just a remark:
for the simple hash function h(T) = (T1*...*Tn) mod p,
the hash of hashes is the same number.
Here T = T1...Tn is the (say, ASCII) sequence of bytes.
August 26th, 2011, 07:11 PM
It probably weakens the hash.
But that raises the question: what are you worried about?
Without doing a lot of complex analysis, I would guess that it's very important that you always hash the mini-hashes in the same order.
The improvement in performance may well be worth the small amount of weakening.