#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2011
    Posts
    1
    Rep Power
    0

    Calculating a hash code for a large file in parallel


    I would like to improve the performance of hashing large files, say in the tens of gigabytes.

    Normally, you hash the bytes of the file sequentially using a hash function (say SHA-256, although I will most likely use Skein, so hashing will be slow compared to reading the file from an SSD). Let's call this Method 1.

    The idea is to hash 1 MB blocks of the file individually (in parallel on 8 CPUs) and then hash the concatenated hashes into a single final hash. Let's call this Method 2.

    I would like to know whether this idea is sound and how much "security" is lost (in terms of collisions becoming more probable) versus doing a single hash over the entire file.

    For example:

    Let's use the SHA-256 variant of SHA-2 and a file size of 2^35 = 34,359,738,368 bytes.

    So using a simple single pass (Method 1), I would get a 256-bit hash for the entire file.

    Compare this with:

    Using the parallel hashing (i.e., Method 2), I would break the file into 32,768 blocks of 1 MB, hash those blocks using SHA-256 into 32,768 hashes of 256 bits (32 bytes) each, concatenate the hashes, and do a final hash of the resulting 1,048,576-byte concatenation to get my final 256-bit hash for the entire file.
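
    In code, Method 2 might look roughly like the sketch below (just a minimal Python illustration using hashlib and a thread pool; the block size, worker count, and function names are mine, not anything standard):

    Code:
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1024 * 1024   # 1 MB blocks, as in the example above
WORKERS = 8                # one worker per CPU

def hash_block(block):
    # CPython's hashlib releases the GIL while hashing large buffers,
    # so threads can run the SHA-256 computations in parallel.
    return hashlib.sha256(block).digest()

def method2_hash(path):
    """Hash fixed-size blocks in parallel, then hash the concatenated digests."""
    outer = hashlib.sha256()
    with open(path, "rb") as f, ThreadPoolExecutor(max_workers=WORKERS) as pool:
        while True:
            # Read a bounded batch of blocks so the whole file never sits in memory.
            batch = [b for b in (f.read(BLOCK_SIZE) for _ in range(WORKERS * 4)) if b]
            if not batch:
                break
            # map() yields digests in input order, so the concatenation
            # preserves the block order of the file.
            for digest in pool.map(hash_block, batch):
                outer.update(digest)
    return outer.hexdigest()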

    Is Method 2 any less secure than Method 1, in terms of collisions being more likely? Perhaps I should rephrase the question: does Method 2 make it easier for an attacker to create a file that hashes to the same value as the original file, apart of course from the trivial fact that a brute-force attack would be cheaper, since the hash can be calculated in parallel on N CPUs?

    Thanks
  2. #2
  3. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2011
    Posts
    313
    Rep Power
    0
    It seems to be a difficult problem.

    Maybe try a "probabilistic" approach:
    For a given hash function one can estimate the probability of a collision.
    That should give some estimate of the collision probability for the hash of hashes.
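
    As a rough illustration (treating SHA-256 as if it were a random 256-bit function, and using the 32,768 block hashes from the example above), a back-of-the-envelope birthday bound in Python:

    Code:
from math import comb

n = 32768          # number of 1 MB block hashes in the example
space = 2 ** 256   # number of possible SHA-256 outputs

# Birthday bound: the probability that at least two of the n block hashes
# collide is at most C(n, 2) / 2^256.
p_collision = comb(n, 2) / space
print(p_collision)   # roughly 5e-69, i.e. negligible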
  4. #3
  5. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2011
    Posts
    313
    Rep Power
    0
    Just a remark:

    for the simple hash function h(T) = (T1 * ... * Tn) mod p,
    the hash of hashes is the same number.

    Here T = T1 ... Tn is the byte sequence (ASCII, say).
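
    A quick way to check that remark (a small Python sketch with a made-up prime p; the identity holds because the mod-p product over the whole sequence just regroups into per-block products):

    Code:
from functools import reduce

p = 1000003  # made-up prime, just for illustration

def h(values):
    # multiplicative hash: (v1 * v2 * ... * vn) mod p
    return reduce(lambda acc, v: (acc * v) % p, values, 1)

data = b"some example bytes to split into blocks"
blocks = [data[i:i + 8] for i in range(0, len(data), 8)]

direct = h(data)                        # hash of the whole byte sequence
of_hashes = h([h(b) for b in blocks])   # hash of the per-block hashes

print(direct, of_hashes, direct == of_hashes)   # same number twice, then True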
  6. #4
  7. Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Sep 2007
    Location
    outside Washington DC
    Posts
    2,642
    Rep Power
    3699
    It probably weakens the hash.

    But that raises the question: what are you worried about?

    Without doing a lot of complex analysis, I would guess that it's very important that you always hash the results of the mini-hashes in the same order.

    The improvement in performance may well be worth the small amount of weakening.
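
    For what it's worth, the ordering point is easy to demonstrate (assuming SHA-256 as the outer hash; the block contents here are made up):

    Code:
import hashlib

# Two mini-hashes of adjacent blocks (contents are made up).
h1 = hashlib.sha256(b"block 0").digest()
h2 = hashlib.sha256(b"block 1").digest()

# Swapping the order of the mini-hashes before the outer hash changes the result entirely.
print(hashlib.sha256(h1 + h2).hexdigest())
print(hashlib.sha256(h2 + h1).hexdigest())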
