Trust, but verify [Free tool (and source code) for computing commonly used hash codes!]

Tuesday, January 13, 2009

Internet downloads (particularly large ones) are often published with an associated checksum that can be used to verify that the file was successfully downloaded. While transmission errors are relatively rare, the popularity of file sharing and malware introduce the possibility that what you get isn't always exactly what you wanted. The posting of checksums by the original content publisher attempts to solve this problem by giving the end user an easy way to validate the downloaded file. Checksums are typically computed by applying a cryptographic hash function to the original file; the hash function computes a small (10-30 character) textual "snapshot" of the file that satisfies the following properties (quoting from Wikipedia):

It is easy to compute the hash for any given data

It is extremely difficult to construct a [file] that has a given hash

It is extremely difficult to modify a given [file] without changing its hash

It is extremely unlikely that two different [files] will have the same hash

Because of these four properties, a user can be fairly confident a file has not been tampered with or garbled as long as the checksum computed on their own machine matches what the publisher posted. I say fairly confident because there's always the possibility of a hash collision - two different files with the same checksum. No matter how good the hash function is, the pigeonhole principle guarantees there will ALWAYS be the possibility of collisions. The point of a good hash function is to make this possibility so unlikely as to be impossible for all intents and purposes.

Popular hash functions in use today are MD5 and SHA-1, with CRC-32 rapidly losing favor. Speaking in very broad terms, one might say that the quality of CRC-32 is "not good", MD5 is "good", and SHA-1 is "very good". For now, that is; research is always under way that could render any of these algorithms useless tomorrow... (For more information about the weaknesses of each algorithm, refer to the links above.)

In order for published checksums to be useful, the user needs an easy way to calculate them. I looked around a bit and didn't a lot of free tools for computing these popular hash functions that I was comfortable with, so I wrote my own using .NET. Here it is:

ComputeFileHashes
   Computes CRC32, MD5, and SHA1 hashes for the specified file(s)
   Version 2009-01-12
   http://blogs.msdn.com/Delay/

Syntax: ComputeFileHashes FileOrPattern [FileOrPattern [...]]
   Each FileOrPattern can specify a single file or a set of files

Examples:
   ComputeFileHashes Image.iso
   ComputeFileHashes *.iso
   ComputeFileHashes *.iso *.vhd

[Click here to download ComputeFileHashes along with its complete source code.]

Here's the output of running ComputeFileHashes on the recently released Windows 7 Beta ISO images. Anyone can verify the checksums below on the MSDN Subscriber Downloads site. (Note: You need to be a subscriber to download files from there; everyone else can download from the official Windows 7 link.)

C:\T>ComputeFileHashes.exe "M:\Windows 7\*"

M:\Windows 7\en_windows_7_beta_dvd_x64_x15-29074.iso
CRC32: 8E2FAD39
MD5: 773FC9CC60338C612AF716A2A14F177D
SHA1: E09FDBC1CB3A92CF6CC872040FDAF65553AB62A5

M:\Windows 7\en_windows_7_beta_dvd_x86_x15-29073.iso
CRC32: AABA5A48
MD5: F9DCE6EBD0A63930B44D8AE802B63825
SHA1: 6071184282B2156FF61CDC5260545C078CCA31EE

One of the things that was important to me when writing ComputeFileHashes was performance. Nobody likes to wait, and I'm probably even less patient than the average bear. One of the things I wanted my program to do was take advantage of multi-processing and the multi-core CPUs that are so prevalent these days. So ComputeFileHashes runs the three hash functions in parallel with each other and with loading the next bytes of the file. Theoretically, this can take advantage of four different cores - though in practice my limited testing suggests there's just not enough work to saturate them all. :)

While I paid attention to performance at a macro level (i.e., algorithm design), I didn't worry about it at a micro level (i.e., focused optimization). And I used .NET of all things. (Cue the doubters: "ZOMG, EPIC FAIL!!") If it uses .NET, it must be slow, right? Well, let's do the numbers:

I took some informal measurements of the time to compute hashes for the Windows 7 Beta x64 ISO image. ComputeFileHashes took just under 60 seconds to compute the CRC-32, MD5, and SHA-1 hashes of the 3.15 GB file on my home machine. Not instantaneous by any means, but also pretty reasonable for that amount of data.
Next up was my favorite program ever, Altap Salamander. I use Salamander for all my file management and it comes with a handy plugin for calculating/verifying checksums. Salamander is fully native code, so it should have a distinct performance advantage. Salamander took about 40 seconds to compute the CRC-32 and MD5 checksums in a single pass. This performance is noticeably better than ComputeFileHashes's, but it also doesn't include SHA-1 which is probably the most computationally intensive algorithm. I doubt adding SHA-1 would double the time, but it would almost certainly take longer. It's hard to balance missing features against better performance, so maybe this one is too close to call. :)
Last up was an internal tool I had called gethash. gethash is also native code and supports MD5, SHA-1, and some other hash types - but not CRC-32. However, gethash computes only one checksum a time. So I ran it once for SHA-1 in a time of just over 35 seconds and again for MD5 for another time of just over 35 seconds. The total time for gethash was therefore about 70 seconds and just like before we're missing one of the checksum types. In this case, one could argue that ComputeFileHashes wins the race because it's 10 seconds faster and delivers more value. Of course, if you only need one type of checksum, then you don't care that ComputeFileHashes generated others and gethash can do a single checksum in close to half the time it takes ComputeFileHashes. Again, no clear winner for me.

So depending on how you want to look at things, ComputeFileHashes is either the best or the worst of the lot. :) But recall that we're pitching an unoptimized .NET application against two solid native-code implementations. ComputeFileHashes pretty clearly held its ground and I'd say the doubters don't have much to complain about here.

Implementation notes:

.NET includes the HashAlgorithm abstract class and a variety of concrete subclasses for calculating cryptographic hashes. MD5 and SHA-1 are implemented by the Framework along with a handful of other useful hash functions. (In fact, it's trivial to add support for another HashAlgorithm to ComputeFileHashes simply by editing a table in the source code.) CRC-32 is not part of the Framework (due, I suspect, to its weaknesses), so I wrote my own CRC-32 HashAlgorithm subclass. I'll post more about this tomorrow.
The most efficient multi-hashing implementation would probably update each algorithm together as bytes are read from the file. However, the overhead of doing that with the .NET HashAlgorithm implementation (which sensibly expects blocks of data) seemed prohibitively expensive - and impractical to parallelize. An alternative would be to let different threads hash the complete file on their own. This parallelizes easily, but because the algorithms operate at different speeds, they'd soon be reading different parts of the file - and the disk thrashing would probably ruin performance. So I went with a compromise: ComputeFileHashes loads a 1 MB chunk of the file into memory, lets each of the algorithms process that block to completion on separate threads, then loads the next 1 MB of data and repeats. There's going to be some wasted time after the fastest algorithm completes and the slowest is still working, and there's probably some memory cache thrashing (for similar reasons as the disk) - but as the numbers above indicate, the performance of this approach is really quite reasonable. Of course, it's entirely possible that a different buffer size that would be even faster - so if you do some testing and find something better, please let me know! :)
ComputeFileHashes creates dedicated Threads for each of the hash functions it uses. Those threads stay around for the life of the application and are synchronized by a general-purpose helper class I wrote, WaitingRoom. I'll blog more about this in a couple of days.

ComputeFileHashes is a simple tool intended to make verifying checksums easy for anyone. I've wanted a good, fast, free, all-in-one solution I could trust for some time now, and the recent release of Windows 7 finally prompted me to write my own. I hope others to put ComputeFileHashes to use for themselves - perhaps even for the Windows 7 Beta! :)

Tags: Technical Utilities