Posts from January 2009

Thank goodness for reference implementations [Low-overhead .NET MD5 implementation (source code and tests) works great on Silverlight!]

Wednesday, January 28, 2009

In yesterday's post announcing ComputeFileHashes's new support for MD5 on Silverlight, I promised to share some details about my experience getting an MD5 HashAlgorithm implementation for Silverlight. Recall that an MD5 class is available in the desktop .NET Framework, but is not part of Silverlight 2's subset of the .NET Framework. (Probably in order to save space by excluding one of the less-secure cryptographic hash functions - a completely sensible tradeoff.) Because I didn't want to write my own code for MD5 (it's a non-trivial algorithm), the challenge was to find something freely available that I could just drop in and take advantage of. So I was very interested when I found out about Reid Borsuk's managed implementation of an MD5 HashAlgorithm for Silverlight because it sounded perfect for my needs.

The first step of incorporating something like this is to check the license: this code is under Ms-PL, so there were no problems there. The next step is to skim the code and get a general feel for how it works - and it was while doing this that I realized I wouldn't be able to use this code as-is...

To understand why, a little background is required:

The HashAlgorithm abstract class requires that derived classes implement the following methods: Initialize, HashCore, and HashFinal. Initialize gets called once at the start of hashing, then HashCore is called many times (being passed a different block of data each time), and then HashFinal is called once at the end of hashing to finalize any computations and return the computed hash value. It's a straightforward model and is flexible enough to accommodate a wide variety of checksum algorithms. Other than maintaining a few bytes of internal state across calls, there's no need for the hash algorithm to allocate anything: the data flows in from the user, gets processed, and is immediately forgotten about.

Or at least that's how it's supposed to work...

What I found when I looked at the aforementioned MD5 implementation was that it would allocate an internal buffer during Initialize, repeatedly re-allocate that buffer and append new data to the end of it during every call to HashCore, and then process the entire buffer all at once in HashFinal. While this approach works fine for fairly small inputs, it was completely impractical for ComputeFileHashes which is expected to process multi-gigabyte files as a matter of course. All those reallocations and the large internal buffer would quickly exhaust the physical memory of virtually any system in use today on something like the Windows 7 Beta ISO images I've been using for my examples. (In fact, it's a bit more dire than it initially seems: this technique requires twice the memory of the original data: that last call to HashCore needs to copy the nearly-full-sized buffer into a new full-sized buffer.)

Okay, so if I couldn't use the code as-is; the next step is to see what it would take to modify it so that it would work for my scenario. Well, this implementation uses a small HashAlgorithm wrapper class around a core MD5 implementation class, I wondered if it would be a simple matter of changing the way the wrapper called into the core. But looking at things a bit more closely, it seemed the core was not structured for that - and separating things like I wanted might not be trivial.

It's decision time: Do I start changing the structure of this code to work the way I need it to, or do I investigate other options? I considered the implications of both approaches, but it was something a coworker said that convinced me to spend a bit of time looking elsewhere. He asked, "If you don't like a fundamental part of the implementation and feel the need to fix it, why do you think you won't be compelled to make changes to the rest of it as well?" That question expressed my concerns pretty well, so I decided to look into other options for a bit. After all, I could always come back to this if nothing panned out.

The obvious place to start was the MD5 specification: RFC1321, The MD5 Message-Digest Algorithm. The body of this document describes the algorithm in great detail and would be a great place to start writing my own implementation if I was willing to spend a considerable amount of time developing and testing. But the real gem is in the appendix: a reference implementation of MD5 written in C! Fortunately, C's not so different from C# - and I've ported things before - so I had a decent idea what to expect. And it sure is hard to beat the reference implementation from the point of view of obtaining an accurate, (typically) bug-free, chunk of code. There is an accompanying license, but it's open (this is a public specification, after all) and primarily requires that derivative works identify themselves as being "derived from the RSA Data Security, Inc. MD5 Message-Digest Algorithm." (like I just did). So things seemed promising!

I decided to spend a bus ride porting the reference implementation and see how far I got. As it happens (and I'm sure this is no accident), the reference implementation uses exactly the Initialize/HashCore/HashFinal pattern that HashAlgorithm expects. Consequently, each of my own HashAlgorithm wrapper methods simply makes a single call into the ported reference implementation - and all of a sudden concerns about memory exhaustion are a thing of the past! By the end of the bus ride, I had successfully ported the reference implementation to C# and had it passing the seven test cases that are part of the specification.

My mind was pretty much made up at this point: I'd use my port of the MD5 reference implementation for the Silverlight version of ComputeFileHashes. This was code from a reliable source, code I had become familiar with, and code that I'd feel comfortable debugging or tuning if necessary. I beefed up the test cases a bit by exercising all of them for all the possible chunk sizes, addressed a couple of code analysis warnings, and had something ready in a jiffy. I added the MD5Managed class to the Silverlight build of ComputeFileHashes and - yep - it just worked. :)

So here's a(nother) completely managed MD5 implementation that anyone is free to use (subject to the reference implementation's license). I haven't spent time optimizing it - but that was kind of the point (see the class comments below for more). I'd started out trying to avoid writing my own MD5 implementation and I only partly succeeded - but I'm glad with how this worked out and maybe some of you can benefit from what I've done. Even if all you do is run the Silverlight version of ComputeFileHashes from time to time, I feel like my relatively minimal investment was worthwhile! :)

[Click here to download a Visual Studio solution containing the source code for a Silverlight-ready managed implementation of MD5 along with the simple test cases discussed above.]

using System;
using System.Diagnostics.CodeAnalysis;
using System.Security.Cryptography;

namespace Delay
{
    /// <summary>
    /// MD5Managed: A HashAlgorithm implementation that acts as a thin wrapper
    /// around a C# translation of the MD5 reference implementation. The C code
    /// has been translated as closely as possible so that most of the original
    /// structure remains and comparisons between the two are straightforward.
    /// </summary>
    /// <remarks>
    /// Derived from the RSA Data Security, Inc. MD5 Message-Digest Algorithm.
    /// 
    /// Specification:
    /// RFC1321 - The MD5 Message-Digest Algorithm
    /// http://www.faqs.org/rfcs/rfc1321.html
    /// 
    /// Original license:
    /// Copyright (C) 1991-2, RSA Data Security, Inc. Created 1991. All
    /// rights reserved.
    /// 
    /// License to copy and use this software is granted provided that it
    /// is identified as the "RSA Data Security, Inc. MD5 Message-Digest
    /// Algorithm" in all material mentioning or referencing this software
    /// or this function.
    /// 
    /// License is also granted to make and use derivative works provided
    /// that such works are identified as "derived from the RSA Data
    /// Security, Inc. MD5 Message-Digest Algorithm" in all material
    /// mentioning or referencing the derived work.
    /// 
    /// RSA Data Security, Inc. makes no representations concerning either
    /// the merchantability of this software or the suitability of this
    /// software for any particular purpose. It is provided "as is"
    /// without express or implied warranty of any kind.
    /// 
    /// These notices must be retained in any copies of any part of this
    /// documentation and/or software.
    /// </remarks>
    public class MD5Managed : HashAlgorithm
    {
        // Current context
        private readonly MD5_CTX _context = new MD5_CTX();
        // Last hash result
        private readonly byte[] _digest = new byte[16];
        // True if HashCore has been called
        private bool _hashCoreCalled;
        // True if HashFinal has been called
        private bool _hashFinalCalled;

        /// <summary>
        /// Initializes a new instance.
        /// </summary>
        public MD5Managed()
        {
            InitializeVariables();
        }

        /// <summary>
        /// Initializes internal state.
        /// </summary>
        public override void Initialize()
        {
            InitializeVariables();
        }

        /// <summary>
        /// Initializes variables.
        /// </summary>
        private void InitializeVariables()
        {
            MD5Init(_context);
            _hashCoreCalled = false;
            _hashFinalCalled = false;
        }

        /// <summary>
        /// Updates the hash code with the data provided.
        /// </summary>
        /// <param name="array">Data to hash.</param>
        /// <param name="ibStart">Start position.</param>
        /// <param name="cbSize">Number of bytes.</param>
        protected override void HashCore(byte[] array, int ibStart, int cbSize)
        {
            if (null == array)
            {
                throw new ArgumentNullException("array");
            }

            if (_hashFinalCalled)
            {
                throw new CryptographicException("Hash not valid for use in specified state.");
            }
            _hashCoreCalled = true;

            MD5Update(_context, array, (uint)ibStart, (uint)cbSize);
        }

        /// <summary>
        /// Finalizes the hash code and returns it.
        /// </summary>
        /// <returns></returns>
        protected override byte[] HashFinal()
        {
            _hashFinalCalled = true;
            MD5Final(_digest, _context);
            return Hash;
        }

        /// <summary>
        /// Returns the hash as an array of bytes.
        /// </summary>
        [SuppressMessage("Microsoft.Design", "CA1065:DoNotRaiseExceptionsInUnexpectedLocations", Justification = "Matching .NET behavior by throwing here.")]
        [SuppressMessage("Microsoft.Usage", "CA2201:DoNotRaiseReservedExceptionTypes", Justification = "Matching .NET behavior by throwing NullReferenceException.")]
        public override byte[] Hash
        {
            get
            {
                if (!_hashCoreCalled)
                {
                    throw new NullReferenceException();
                }
                if (!_hashFinalCalled)
                {
                    // Note: Not CryptographicUnexpectedOperationException because that can't be instantiated on Silverlight 4
                    throw new CryptographicException("Hash must be finalized before the hash value is retrieved.");
                }

                return _digest;
            }
        }

        // Return size of hash in bits.
        public override int HashSize
        {
            get
            {
                return _digest.Length * 8;
            }
        }

        ///////////////////////////////////////////////
        // MD5 reference implementation begins here. //
        ///////////////////////////////////////////////

        /* MD5 context. */
        private class MD5_CTX
        {
            public readonly uint[] state;   /* state (ABCD) */
            public readonly uint[] count;   /* number of bits, modulo 2^64 (lsb first) */
            public readonly byte[] buffer;  /* input buffer */

            public MD5_CTX()
            {
                state = new uint[4];
                count = new uint[2];
                buffer = new byte[64];
            }

            public void Clear()
            {
                Array.Clear(state, 0, state.Length);
                Array.Clear(count, 0, count.Length);
                Array.Clear(buffer, 0, buffer.Length);
            }
        }

        /* Constants for MD5Transform routine. */
        private const int S11 = 7;
        private const int S12 = 12;
        private const int S13 = 17;
        private const int S14 = 22;
        private const int S21 = 5;
        private const int S22 = 9;
        private const int S23 = 14;
        private const int S24 = 20;
        private const int S31 = 4;
        private const int S32 = 11;
        private const int S33 = 16;
        private const int S34 = 23;
        private const int S41 = 6;
        private const int S42 = 10;
        private const int S43 = 15;
        private const int S44 = 21;

        private static byte[] PADDING;

        [SuppressMessage("Microsoft.Performance", "CA1810:InitializeReferenceTypeStaticFieldsInline", Justification = "More compact this way")]
        static MD5Managed()
        {
            PADDING = new byte[64];
            PADDING[0] = 0x80;
        }

        /* F, G, H and I are basic MD5 functions. */
        private static uint F(uint x, uint y, uint z) { return (((x) & (y)) | ((~x) & (z))); }
        private static uint G(uint x, uint y, uint z) { return (((x) & (z)) | ((y) & (~z))); }
        private static uint H(uint x, uint y, uint z) { return ((x) ^ (y) ^ (z)); }
        private static uint I(uint x, uint y, uint z) { return ((y) ^ ((x) | (~z))); }

        /* ROTATE_LEFT rotates x left n bits. */
        private static uint ROTATE_LEFT(uint x, int n) { return (((x) << (n)) | ((x) >> (32 - (n)))); }

        /* FF, GG, HH, and II transformations for rounds 1, 2, 3, and 4.
           Rotation is separate from addition to prevent recomputation. */
        private static void FF(ref uint a, uint b, uint c, uint d, uint x, int s, uint ac)
        {
            (a) += F((b), (c), (d)) + (x) + (uint)(ac);
            (a) = ROTATE_LEFT((a), (s));
            (a) += (b);
        }
        private static void GG(ref uint a, uint b, uint c, uint d, uint x, int s, uint ac)
        {
            (a) += G((b), (c), (d)) + (x) + (uint)(ac);
            (a) = ROTATE_LEFT((a), (s));
            (a) += (b);
        }
        private static void HH(ref uint a, uint b, uint c, uint d, uint x, int s, uint ac)
        {
            (a) += H((b), (c), (d)) + (x) + (uint)(ac);
            (a) = ROTATE_LEFT((a), (s));
            (a) += (b);
        }
        private static void II(ref uint a, uint b, uint c, uint d, uint x, int s, uint ac)
        {
            (a) += I((b), (c), (d)) + (x) + (uint)(ac);
            (a) = ROTATE_LEFT((a), (s));
            (a) += (b);
        }

        /* MD5 initialization. Begins an MD5 operation, writing a new context. */
        private static void MD5Init(MD5_CTX context)  /* context */
        {
            context.count[0] = context.count[1] = 0;

            /* Load magic initialization constants. */
            context.state[0] = 0x67452301;
            context.state[1] = 0xefcdab89;
            context.state[2] = 0x98badcfe;
            context.state[3] = 0x10325476;
        }

        /* MD5 block update operation. Continues an MD5 message-digest
           operation, processing another message block, and updating the
           context. */
        private static void MD5Update(MD5_CTX context,  /* context */
                                      byte[] input,     /* input block */
                                      uint inputIndex,  // Starting index for input block
                                      uint inputLen)    /* length of input block */
        {
            /* Compute number of bytes mod 64 */
            uint index = (uint)((context.count[0] >> 3) & 0x3F);

            /* Update number of bits */
            if ((context.count[0] += ((uint)inputLen << 3)) < ((uint)inputLen << 3))
            {
                context.count[1]++;
            }
            context.count[1] += ((uint)inputLen >> 29);

            uint partLen = 64 - index;

            /* Transform as many times as possible. */
            uint i = 0;
            if (inputLen >= partLen)
            {
                Buffer.BlockCopy(input, (int)inputIndex, context.buffer, (int)index, (int)partLen);
                MD5Transform(context.state, context.buffer, 0);

                for (i = partLen; i + 63 < inputLen; i += 64)
                {
                    MD5Transform(context.state, input, inputIndex + i);
                }

                index = 0;
            }

            /* Buffer remaining input */
            Buffer.BlockCopy(input, (int)(inputIndex + i), context.buffer, (int)index, (int)(inputLen - i));
        }

        /* MD5 finalization. Ends an MD5 message-digest operation, writing the
           the message digest and zeroizing the context. */
        private static void MD5Final(byte[] digest,    /* message digest */
                                     MD5_CTX context)  /* context */
        {
            byte[] bits = new byte[8];

            /* Save number of bits */
            Encode(bits, context.count, 8);

            /* Pad out to 56 mod 64. */
            uint index = (uint)((context.count[0] >> 3) & 0x3f);
            uint padLen = (index < 56) ? (56 - index) : (120 - index);
            MD5Update(context, PADDING, 0, padLen);

            /* Append length (before padding) */
            MD5Update(context, bits, 0, 8);

            /* Store state in digest */
            Encode(digest, context.state, 16);

            /* Zeroize sensitive information. */
            context.Clear();
        }

        /* MD5 basic transformation. Transforms state based on block. */
        private static void MD5Transform(uint[] state,
                                         byte[] block,
                                         uint blockIndex)
        {
            uint a = state[0], b = state[1], c = state[2], d = state[3];
            uint[] x = new uint[16];

            Decode(x, block, blockIndex, 64);

            /* Round 1 */
            FF(ref a, b, c, d, x[0],  S11, 0xd76aa478); /* 1 */
            FF(ref d, a, b, c, x[1],  S12, 0xe8c7b756); /* 2 */
            FF(ref c, d, a, b, x[2],  S13, 0x242070db); /* 3 */
            FF(ref b, c, d, a, x[3],  S14, 0xc1bdceee); /* 4 */
            FF(ref a, b, c, d, x[4],  S11, 0xf57c0faf); /* 5 */
            FF(ref d, a, b, c, x[5],  S12, 0x4787c62a); /* 6 */
            FF(ref c, d, a, b, x[6],  S13, 0xa8304613); /* 7 */
            FF(ref b, c, d, a, x[7],  S14, 0xfd469501); /* 8 */
            FF(ref a, b, c, d, x[8],  S11, 0x698098d8); /* 9 */
            FF(ref d, a, b, c, x[9],  S12, 0x8b44f7af); /* 10 */
            FF(ref c, d, a, b, x[10], S13, 0xffff5bb1); /* 11 */
            FF(ref b, c, d, a, x[11], S14, 0x895cd7be); /* 12 */
            FF(ref a, b, c, d, x[12], S11, 0x6b901122); /* 13 */
            FF(ref d, a, b, c, x[13], S12, 0xfd987193); /* 14 */
            FF(ref c, d, a, b, x[14], S13, 0xa679438e); /* 15 */
            FF(ref b, c, d, a, x[15], S14, 0x49b40821); /* 16 */

            /* Round 2 */
            GG(ref a, b, c, d, x[1],  S21, 0xf61e2562); /* 17 */
            GG(ref d, a, b, c, x[6],  S22, 0xc040b340); /* 18 */
            GG(ref c, d, a, b, x[11], S23, 0x265e5a51); /* 19 */
            GG(ref b, c, d, a, x[0],  S24, 0xe9b6c7aa); /* 20 */
            GG(ref a, b, c, d, x[5],  S21, 0xd62f105d); /* 21 */
            GG(ref d, a, b, c, x[10], S22, 0x02441453); /* 22 */
            GG(ref c, d, a, b, x[15], S23, 0xd8a1e681); /* 23 */
            GG(ref b, c, d, a, x[4],  S24, 0xe7d3fbc8); /* 24 */
            GG(ref a, b, c, d, x[9],  S21, 0x21e1cde6); /* 25 */
            GG(ref d, a, b, c, x[14], S22, 0xc33707d6); /* 26 */
            GG(ref c, d, a, b, x[3],  S23, 0xf4d50d87); /* 27 */
            GG(ref b, c, d, a, x[8],  S24, 0x455a14ed); /* 28 */
            GG(ref a, b, c, d, x[13], S21, 0xa9e3e905); /* 29 */
            GG(ref d, a, b, c, x[2],  S22, 0xfcefa3f8); /* 30 */
            GG(ref c, d, a, b, x[7],  S23, 0x676f02d9); /* 31 */
            GG(ref b, c, d, a, x[12], S24, 0x8d2a4c8a); /* 32 */

            /* Round 3 */
            HH(ref a, b, c, d, x[5],  S31, 0xfffa3942); /* 33 */
            HH(ref d, a, b, c, x[8],  S32, 0x8771f681); /* 34 */
            HH(ref c, d, a, b, x[11], S33, 0x6d9d6122); /* 35 */
            HH(ref b, c, d, a, x[14], S34, 0xfde5380c); /* 36 */
            HH(ref a, b, c, d, x[1],  S31, 0xa4beea44); /* 37 */
            HH(ref d, a, b, c, x[4],  S32, 0x4bdecfa9); /* 38 */
            HH(ref c, d, a, b, x[7],  S33, 0xf6bb4b60); /* 39 */
            HH(ref b, c, d, a, x[10], S34, 0xbebfbc70); /* 40 */
            HH(ref a, b, c, d, x[13], S31, 0x289b7ec6); /* 41 */
            HH(ref d, a, b, c, x[0],  S32, 0xeaa127fa); /* 42 */
            HH(ref c, d, a, b, x[3],  S33, 0xd4ef3085); /* 43 */
            HH(ref b, c, d, a, x[6],  S34, 0x04881d05); /* 44 */
            HH(ref a, b, c, d, x[9],  S31, 0xd9d4d039); /* 45 */
            HH(ref d, a, b, c, x[12], S32, 0xe6db99e5); /* 46 */
            HH(ref c, d, a, b, x[15], S33, 0x1fa27cf8); /* 47 */
            HH(ref b, c, d, a, x[2],  S34, 0xc4ac5665); /* 48 */

            /* Round 4 */
            II(ref a, b, c, d, x[0],  S41, 0xf4292244); /* 49 */
            II(ref d, a, b, c, x[7],  S42, 0x432aff97); /* 50 */
            II(ref c, d, a, b, x[14], S43, 0xab9423a7); /* 51 */
            II(ref b, c, d, a, x[5],  S44, 0xfc93a039); /* 52 */
            II(ref a, b, c, d, x[12], S41, 0x655b59c3); /* 53 */
            II(ref d, a, b, c, x[3],  S42, 0x8f0ccc92); /* 54 */
            II(ref c, d, a, b, x[10], S43, 0xffeff47d); /* 55 */
            II(ref b, c, d, a, x[1],  S44, 0x85845dd1); /* 56 */
            II(ref a, b, c, d, x[8],  S41, 0x6fa87e4f); /* 57 */
            II(ref d, a, b, c, x[15], S42, 0xfe2ce6e0); /* 58 */
            II(ref c, d, a, b, x[6],  S43, 0xa3014314); /* 59 */
            II(ref b, c, d, a, x[13], S44, 0x4e0811a1); /* 60 */
            II(ref a, b, c, d, x[4],  S41, 0xf7537e82); /* 61 */
            II(ref d, a, b, c, x[11], S42, 0xbd3af235); /* 62 */
            II(ref c, d, a, b, x[2],  S43, 0x2ad7d2bb); /* 63 */
            II(ref b, c, d, a, x[9],  S44, 0xeb86d391); /* 64 */

            state[0] += a;
            state[1] += b;
            state[2] += c;
            state[3] += d;

            /* Zeroize sensitive information. */
            Array.Clear(x, 0, x.Length);
        }

        /* Encodes input (UINT4) into output (unsigned char). Assumes len is
           a multiple of 4. */
        private static void Encode(byte[] output,
                                   uint[] input,
                                   uint len)
        {
            for (uint i = 0, j = 0; j < len; i++, j += 4)
            {
                output[j] = (byte)(input[i] & 0xff);
                output[j + 1] = (byte)((input[i] >> 8) & 0xff);
                output[j + 2] = (byte)((input[i] >> 16) & 0xff);
                output[j + 3] = (byte)((input[i] >> 24) & 0xff);
            }
        }

        /* Decodes input (unsigned char) into output (UINT4). Assumes len is
           a multiple of 4. */
        private static void Decode(uint[] output,
                                   byte[] input,
                                   uint inputIndex,
                                   uint len)
        {
            for (uint i = 0, j = 0; j < len; i++, j += 4)
            {
                output[i] = ((uint)input[inputIndex + j]) |
                    (((uint)input[inputIndex + j + 1]) << 8) |
                    (((uint)input[inputIndex + j + 2]) << 16) |
                    (((uint)input[inputIndex + j + 3]) << 24);
            }
        }
    }
}

Updated 2009-02-16: Call MD5Init from the constructor for consistency with the Framework's HashAlgorithm classes where a call to Initialize is not necessary for a newly constructed instance.

Updated 2010-12-06: Added missing inputIndex offset to MD5Update method.

Tags: Technical Silverlight

Cross-platform feature parity: achieved [Silverlight version of ComputeFileHashes now includes MD5!]

Tuesday, January 27, 2009

I was very happy with last week's release of ComputeFileHashes supporting the command-line, WPF, Silverlight, *and* ClickOnce. Only one thing bothered me: the Silverlight version didn't do MD5 due to the lack of support for that type of checksum by Silverlight 2. Recall that I'd fairly happily implemented my own CRC-32 class because none of the platforms supported it. [Also, it was relatively simple and had a good reference implementation. :) ] But because MD5 is a more complex algorithm and was only missing on Silverlight, I was reluctant to do the same thing for MD5...

What I really wanted was a freely available, Silverlight compatible HashAlgorithm-based MD5 implementation that I could trivially drop into my code and use on Silverlight. So I was excited when kind reader (and teammate!) Jeff Wilcox left a comment pointing to something that sounded perfect for my needs. I told Jeff I'd add MD5 for Silverlight and mentally breathed a sigh of relief that all four of ComputeFileHashes's supported platforms would provide the same set of checksums.

As it turns out, after a bit of research I decided not to use that MD5 implementation. (I'll explain why in my next post.) However, now that I'd fully bought in to the idea of MD5 on Silverlight, I was reluctant to let it go... So I spent some time working on an alternate solution and developed something I'm quite happy with. So I'm able to release an update to ComputeFileHashes that offers MD5 support on Silverlight!

The latest version of ComputeFileHashes is now 2009-01-26. I've updated all the binaries in order to avoid version number confusion - but the only real change here is the addition of MD5 for Silverlight. (FYI, I only updated the screenshot of the Silverlight version below.)

If you're using Silverlight to run ComputeFileHashes, you'll automatically get the new version next time you run ComputeFileHashes.
If you're using ClickOnce to run ComputeFileHashes, the application will automatically update itself after you run it a couple of times.
If you're using the WPF or command-line versions, you'll need to download the new binaries and update manually.

Please refer to the original release announcement for more information about supported platforms, source code, implementation, etc..

Click here or on the image below to run the Silverlight version of ComputeFileHashes in your browser.

Click here or on the image below to download the command-line and WPF versions of ComputeFileHashes - along with the ClickOnce and Silverlight versions AND the complete source code for everything!

I've said that "ComputeFileHashes is a simple tool intended to make verifying checksums easy for anyone.". And in some ways, I think the Silverlight version is the easiest option of all because there's no need to install it on your machine and it runs everywhere Silverlight 2 does (PC, Mac, (Linux soon!), Internet Explorer, Firefox, Safari, ...). So I'm really glad to add MD5 support to ComputeFileHashes for Silverlight - I hope you enjoy the new functionality!

Tags: Technical Silverlight WPF Utilities

Math is hard, let's go shopping. [Minor bug fix for free CRC-32 HashAlgorithm implementation for .NET]

Friday, January 23, 2009

While working on code for an upcoming blog post, I found myself dealing with the HashAlgorithm.HashSize property again and realized I'd made a silly mistake a few days ago... :(

I'm pretty sure I remember consulting the documentation when implementing this method for my free CRC-32 HashAlgorithm implementation, and I obviously believed the correct behavior was to return the size in bytes because that's what my comment says and that's what my code does. However, the documentation seems pretty clear on the matter: Gets the size, in bits, of the computed hash code." (emphasis mine). So my initial implementation of this property was wrong. Fortunately, none of the four implementations of ComputeFileHashes (command-line, WPF, ClickOnce, Silverlight) make use of HashSize, so they're not affected by this bug. Unfortunately, anyone who decided to use my CRC-32 implementation and referenced this property would see the wrong value. For their sake, I've just made the trivial fix to the code (multiplying the byte count by 8 to get bit count), updated the comment for that property, added a note about the update to the bottom of the original post, and republished it.

I'm very sorry for the error and any trouble this may have caused.

PS - The version of CRC32.cs in the ComputeFileHashes source code download has not been updated - but will be in a few days as part of an upcoming post.

Tags: Technical

Gratuitous platform support [ComputeFileHashes works on the command-line, on WPF, on Silverlight, and via ClickOnce!]

Wednesday, January 21, 2009

Last week, I released the ComputeFileHashes tool for calculating file checksums. (To read more about what checksums are and why they're useful, please refer to that post.) ComputeFileHashes is a fairly simple .NET command-line application for calculating the MD5, SHA-1, and CRC-32 hashes of one or more files. It takes advantage of the multi-processing capabilities of today's hardware to complete that task quickly - roughly on par with native-code implementations. ComputeFileHashes works quite well and I happily used it to verify the recently released Windows 7 Beta ISO images I'd downloaded.

Because not everybody is a fan of command-line tools, I thought it would be nice to use WPF to create a more user-friendly version of ComputeFileHashes. Once I'd done that, I knew it would be a trivial matter to publish the WPF version via ClickOnce to enable an absurdly easy install scenario. From there, porting to Silverlight would be straightforward and would offer an install-free, completely web-based solution with cross-platform (ex: PC/Mac), cross-browser (ex: IE/Firefox/Safari) appeal. What's more, because all of these platforms are built on .NET, so I expected to be able to take significant advantage of code sharing!

Click here or on the image below to run the Silverlight version of ComputeFileHashes in your browser.

Implementation notes:

The command-line version of ComputeFileHashes is a standard .NET 2.0 application and should work pretty much everywhere. The Silverlight version requires Silverlight 2 which is tiny and can be completely installed in less than a minute start-to-finish. The WPF/ClickOnce versions are a little more advanced and require .NET 3.5 SP1 (conveniently pre-installed on all Windows 7 machines!). If you don't already have .NET 3.5 SP1 (and you may not because Windows Update still doesn't seem to offer it), you can get the .NET 3.5 SP1 installer from here. Unfortunately, the only indication of .NET 3.5 SP1 not being installed seems to be an application crash immediately after starting the stand-alone WPF version. :( Fortunately, the ClickOnce version knows about the .NET 3.5 SP1 prerequisite and should offer to install it automatically if it's not already present.
None of the computation or file processing is performed on the main user interface thread under WPF or Silverlight, so ComputeFileHashes remains responsive even when working on a large file. Additional files can be queued for processing or the application/browser can be closed without the user having to wait.
As I hoped, I was able to achieve a very high degree of code sharing. By refactoring the original ComputeFileHashes code slightly, I pulled the core implementation out into a common class/file that everything shares. Then I put nearly all of the user interface functionality into another class/file that the WPF and Silverlight implementations both share. The XAML for the WPF and Silverlight versions is separate, but very similar. (There are enough slight differences between the two versions that I deliberately did not attempt to share the same XAML file.)

The source code structure looks like this:

`ComputeFileHashesCore.cs`	Core implementation of the file hashing code shared by all implementations. Makes use of multiple threads to perform hash calculations in parallel.
`ComputeFileHashesUI.cs`	User interface code shared by the WPF and Silverlight implementations. Makes use of a worker thread to push all computation off of the user interface thread and keep the application responsive. Defers to ComputeFileHashesCore for hashing functionality.
`CRC32.cs WaitingRoom.cs`	Custom CRC-32 HashAlgorithm implementation and synchronization object shared by all implementations.
`HashFileInfo.cs BlockingQueue.cs`	Data object for tracking state and custom Queue subclass that are shared by the WPF and Silverlight implementations.
`ComputeFileHashesCL\` `ComputeFileHashesCL.cs`	Command-line interface for handling arguments and displaying progress. Defers to ComputeFileHashesCore for hashing functionality.
`ComputeFileHashesWPF\` `Window1.xaml` `Window1.xaml.cs`	WPF definition of the application window. Defers to ComputeFileHashesUI for nearly all functionality.
`ComputeFileHashesSL\` `Page.xaml` `Page.xaml.cs`	Silverlight definition of the application window. Defers to ComputeFileHashesUI for nearly all functionality.

A look at the screenshots above reveals a few differences between the WPF and Silverlight implementations:
- The DataGrids look different. On WPF, ComputeFileHashes makes use of the WPFToolkit's DataGrid; on Silverlight it uses the DataGrid in the SDK. The two are very similar to use from a developer perspective, but they draw themselves differently and have some slightly user-level functionality changes due to platform differences. This was actually my first experience with either DataGrid and I was happy to find that they both worked the same - and pretty much the way I expected them to!
- The Silverlight version does not calculate the MD5 hash. This is because Silverlight's .NET doesn't implement the MD5 HashAlgorithm subclass while the desktop's .NET does. MD5 is not trivial to write, so I wasn't too interested in developing my own implementation like I did for CRC-32 (which isn't supported on any .NET platform).
- The Silverlight version includes a "Details" column that's not present on the WPF version. On WPF, it's trivial to create a ToolTip for a DataGrid column, but on Silverlight the ToolTipService class must be used and my attempts to set its attached property with a Style were met with ... resistance. So if an exception is thrown when processing a file, the WPF version will show the exception text in a ToolTip while the Silverlight version shows it in the Details column.
- The Silverlight version does not support drag-and-drop from the Windows Explorer. Running within a browser imposes certain limitations on Silverlight; an inability to integrate quite as richly with the operating system is one of them.
- The hyperlinks aren't quite the same. Silverlight ships with HyperlinkButton which is exactly the right control for the job here. WPF doesn't have that control, so the similar Hyperlink control is made to behave as desired with a small bit of code.

In the original release announcement, I wrote that "ComputeFileHashes is a simple tool intended to make verifying checksums easy for anyone.". Well, that's still the case - and adding support for WPF, ClickOnce, and Silverlight should make it even easier for everyone to use. Just decide what kind of user interface you prefer, and start using that version of ComputeFileHashes for all your checksumming needs! :)

Tags: Technical Silverlight WPF Utilities

The doctor will see you now... [WaitingRoom is a reusable synchronization object for .NET]

Thursday, January 15, 2009

In my notes for the release of the ComputeFileHashes tool (and source code), I mentioned that I'd written a synchronization object for managing the parallel computation of checksums across multiple threads. I called this class WaitingRoom (after failing to come up with anything better) and thought I'd write about the pattern it implements so others might use it as well.

To understand the motivation behind the WaitingRoom class, it's helpful to understand a bit about how ComputeFileHashes works. For those who haven't read the implementation notes, the basic goal is to enable parallel computation of multiple checksum algorithms for sequential chunks of a file. So there's a primary thread which is responsible for opening the file and making consecutive blocks of it available for processing and there are one or more worker threads which are responsible for processing the data in each block that becomes available. For performance reasons, there are two buffers for these blocks and they're swapped repeatedly: one is the "current" block (which has valid data) and the other is the "next" block (which is being filled-in). It's important to ensure the worker threads only access a block when it's valid, so the primary thread needs a way to tell the worker threads when it's safe to start as well as a way to find out when they've all finished with a block.

The WaitingRoom class makes this synchronization process easy by exposing three methods: Wait, Release, and Arrive. Both Wait and Release are used by the primary thread: it calls Wait to block until all worker threads are "in the waiting room" (during which time they're all blocked) and then calls Release to "open the waiting room door" and unblock the worker threads. [Okay, the analogy breaks down a bit here because most doctors admit only *one* patient at a time. :) ] The worker threads call Arrive to signal they've "entered the waiting room" and automatically block until the primary thread releases them. What's great about using WaitingRoom is that the order of the threads doesn't matter: the primary thread can be ready first, or the worker threads can be ready first, or the primary thread can be ready after some - but not all - of the worker threads are ready. Whatever the order, WaitingRoom coordinates things smoothly!

As it happens, ComputeFileHashes uses two WaitingRoom instances. The first instance is used to wait for the worker threads to be ready to process a new file. When this is going on, the primary thread has nothing else to do, so it calls Wait and then immediately calls Release when the wait completes. The second instance is used to wait for the worker threads to finish processing a block of data. In this case, the primary thread has some work it can do in parallel with the workers: it reads the next block of data from disk and updates the status display. So it makes a call to Release (which gets the worker threads going), does its own work while they're busy, then calls Wait to wait for them to finish. After that, it's safe to swap the "current" and "next" buffers and repeat the same steps with the next block.

The complete implementation of WaitingRoom is below. It uses only the standard .NET Monitor class - along with the C# lock statement to simplify the syntax a bit. There are a few Debug.Asserts in there to try to keep callers honest, but it's up to the developer to ensure that Wait and Release are only called by the primary thread and that Arrive is only called by the worker threads.

Here's what it looks like:

using System.Diagnostics;
using System.Threading;

namespace ComputeFileHashes
{
    /// <summary>
    /// Implements a synchronization object that allows an owner
    /// thread to synchronize with its worker threads.
    /// </summary>
    class WaitingRoom
    {
        // Number of worker threads
        private readonly int _capacity;
        // Object on which to lock for entry
        private readonly object _entryLock = new object();
        // Object on which to lock for exit
        private readonly object _exitLock = new object();
        // Current count of worker threads
        private int _count;
        // "Sign" of owner/worker threads
        private bool _sign;

        /// <summary>
        /// Initializes a new instance.
        /// </summary>
        /// <param name="capacity">Number of worker threads.</param>
        public WaitingRoom(int capacity)
        {
            Debug.Assert(0 < capacity);
            _capacity = capacity;
        }

        /// <summary>
        /// Waits for all worker threads to call the Arrive method.
        /// </summary>
        public void Wait()
        {
            // Claim entry lock
            lock (_entryLock)
            {
                Debug.Assert((0 <= _count) && (_count <= _capacity));

                // Block if all worker threads have not arrived
                if (_count < _capacity)
                {
                    Monitor.Wait(_entryLock);
                }
            }
        }

        /// <summary>
        /// Signals the presence/availability of a worker thread.
        /// </summary>
        public void Arrive()
        {
            // Claim the entry lock
            bool sign;
            lock (_entryLock)
            {
                Debug.Assert((0 <= _count) && (_count < _capacity));

                // Capture sign
                sign = _sign;

                // Wake owner thread if all worker threads present
                _count++;
                if (_count == _capacity)
                {
                    Monitor.Pulse(_entryLock);
                }
            }

            // Claim the exit lock
            lock (_exitLock)
            {
                // Block if owner has not yet released the worker threads
                if (sign == _sign)
                {
                    Monitor.Wait(_exitLock);
                }
            }
        }

        public void Release()
        {
            // Claim the exit lock
            lock (_exitLock)
            {
                Debug.Assert(_count == _capacity);

                // Reset count and flip sign
                _count = 0;
                _sign = !_sign;

                // Wake worker threads
                Monitor.PulseAll(_exitLock);
            }
        }
    }
}

There's probably an official name for this kind of synchronization primitive, but I don't know what it is. :| I poked around a bit on Wikipedia's "Currency Control" page just now and found something called room synchronization that sounds close... I think WaitingRoom may be a type of room synchronization - with some specialized restrictions for the specifics of the ComputeFileHashes scenario. But whatever you call it, WaitingRoom is a pretty handy class to work with. Feel free to use WaitingRoom in your own code - maybe even drop me a note if you find other interesting uses. And if anyone reading knows the real name for what it's doing, please let me know! :)

Tags: Technical

Free hash [A reusable CRC-32 HashAlgorithm implementation for .NET]

Wednesday, January 14, 2009

In the notes for yesterday's release of the ComputeFileHashes tool (and source code), I mentioned that I'd written my own .NET HashAlgorithm class to compute CRC-32 hash values. The complete implementation can be found below and should behave just like every other HashAlgorithm subclass (ex: MD5 or SHA1). The code here is based on the CRC-32 reference implementation provided in Annex D of the PNG specification and pretty much "just worked". It implements the necessary Initialize, HashCore, and HashFinal methods as well as the technically optional (but practically necessary) Hash and HashSize properties. There's no test code to speak of, though it's worth pointing out that I've run tens of gigabytes of data through my ComputeFileHashes tool and have verified the correctness of the computed CRC-32 value for each test file. :)

Without further ado:

using System;
using System.Security.Cryptography;

namespace Delay
{
    /// <summary>
    /// HashAlgorithm implementation for CRC-32.
    /// </summary>
    [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Naming",
        "CA1709:IdentifiersShouldBeCasedCorrectly", MessageId = "CRC",
        Justification = "Matching algorithm acronym.")]
    public class CRC32 : HashAlgorithm
    {
        // Shared, pre-computed lookup table for efficiency
        private static readonly uint[] _crc32Table;

        /// <summary>
        /// Initializes the shared lookup table.
        /// </summary>
        [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Performance",
            "CA1810:InitializeReferenceTypeStaticFieldsInline", Justification =
            "Table values must be computed; not possible to remove the static constructor.")]
        static CRC32()
        {
            // Allocate table
            _crc32Table = new uint[256];

            // For each byte
            for (uint n = 0; n < 256; n++)
            {
                // For each bit
                uint c = n;
                for (int k = 0; k < 8; k++)
                {
                    // Compute value
                    if (0 != (c & 1))
                    {
                        c = 0xedb88320 ^ (c >> 1);
                    }
                    else
                    {
                        c = c >> 1;
                    }
                }

                // Store result in table
                _crc32Table[n] = c;
            }
        }

        // Current hash value
        private uint _crc32Value;

        // True if HashCore has been called
        private bool _hashCoreCalled;

        // True if HashFinal has been called
        private bool _hashFinalCalled;

        /// <summary>
        /// Initializes a new instance.
        /// </summary>
        public CRC32()
        {
            InitializeVariables();
        }

        /// <summary>
        /// Initializes internal state.
        /// </summary>
        public override void Initialize()
        {
            InitializeVariables();
        }

        /// <summary>
        /// Initializes variables.
        /// </summary>
        private void InitializeVariables()
        {
            _crc32Value = uint.MaxValue;
            _hashCoreCalled = false;
            _hashFinalCalled = false;
        }

        /// <summary>
        /// Updates the hash code for the provided data.
        /// </summary>
        /// <param name="array">Data.</param>
        /// <param name="ibStart">Start position.</param>
        /// <param name="cbSize">Number of bytes.</param>
        protected override void HashCore(byte[] array, int ibStart, int cbSize)
        {
            if (null == array)
            {
                throw new ArgumentNullException("array");
            }

            if (_hashFinalCalled)
            {
                throw new CryptographicException(
                    "Hash not valid for use in specified state.");
            }
            _hashCoreCalled = true;

            for (int i = ibStart; i < ibStart + cbSize; i++)
            {
                byte index = (byte)(_crc32Value ^ array[i]);
                _crc32Value = _crc32Table[index] ^ ((_crc32Value >> 8) & 0xffffff);
            }
        }

        /// <summary>
        /// Finalizes the hash code and returns it.
        /// </summary>
        /// <returns></returns>
        protected override byte[] HashFinal()
        {
            _hashFinalCalled = true;
            return Hash;
        }

        /// <summary>
        /// Returns the hash as an array of bytes.
        /// </summary>
        [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Design",
            "CA1065:DoNotRaiseExceptionsInUnexpectedLocations", Justification =
            "Matching .NET behavior by throwing here.")]
        [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Usage",
            "CA2201:DoNotRaiseReservedExceptionTypes", Justification =
            "Matching .NET behavior by throwing NullReferenceException.")]
        public override byte[] Hash
        {
            get
            {
                if (!_hashCoreCalled)
                {
                    throw new NullReferenceException();
                }
                if (!_hashFinalCalled)
                {
                    // Note: Not CryptographicUnexpectedOperationException because
                    // that can't be instantiated on Silverlight 4
                    throw new CryptographicException(
                        "Hash must be finalized before the hash value is retrieved.");
                }

                // Convert complement of hash code to byte array
                byte[] bytes = BitConverter.GetBytes(~_crc32Value);

                // Reverse for proper endianness, and return
                Array.Reverse(bytes);
                return bytes;
            }
        }

        // Return size of hash in bits.
        public override int HashSize
        {
            get
            {
                return 4 * 8;
            }
        }
    }
}

The CRC32 class presented here is nothing fancy, but it should be a pretty solid implementation of the once-popular CRC-32 algorithm that's ripe for reuse. Thanks to .NET's HashAlgorithm base class, it's easy to drop in CRC32 anywhere hashes are already being computed. I hope you find it useful!

Updated 2009-01-22: Corrected implementation of HashSize to return hash size in bits instead of bytes.

Updated 2009-02-16: Initialize the _crc32Value variable to its starting value for consistency with the Framework's HashAlgorithm classes where a call to Initialize is not necessary for a newly constructed instance.

Updated 2010-12-06: Added missing ibStart offset to HashCore loop.

Tags: Technical

Trust, but verify [Free tool (and source code) for computing commonly used hash codes!]

Tuesday, January 13, 2009

Internet downloads (particularly large ones) are often published with an associated checksum that can be used to verify that the file was successfully downloaded. While transmission errors are relatively rare, the popularity of file sharing and malware introduce the possibility that what you get isn't always exactly what you wanted. The posting of checksums by the original content publisher attempts to solve this problem by giving the end user an easy way to validate the downloaded file. Checksums are typically computed by applying a cryptographic hash function to the original file; the hash function computes a small (10-30 character) textual "snapshot" of the file that satisfies the following properties (quoting from Wikipedia):

It is easy to compute the hash for any given data

It is extremely difficult to construct a [file] that has a given hash

It is extremely difficult to modify a given [file] without changing its hash

It is extremely unlikely that two different [files] will have the same hash

Because of these four properties, a user can be fairly confident a file has not been tampered with or garbled as long as the checksum computed on their own machine matches what the publisher posted. I say fairly confident because there's always the possibility of a hash collision - two different files with the same checksum. No matter how good the hash function is, the pigeonhole principle guarantees there will ALWAYS be the possibility of collisions. The point of a good hash function is to make this possibility so unlikely as to be impossible for all intents and purposes.

Popular hash functions in use today are MD5 and SHA-1, with CRC-32 rapidly losing favor. Speaking in very broad terms, one might say that the quality of CRC-32 is "not good", MD5 is "good", and SHA-1 is "very good". For now, that is; research is always under way that could render any of these algorithms useless tomorrow... (For more information about the weaknesses of each algorithm, refer to the links above.)

In order for published checksums to be useful, the user needs an easy way to calculate them. I looked around a bit and didn't a lot of free tools for computing these popular hash functions that I was comfortable with, so I wrote my own using .NET. Here it is:

ComputeFileHashes
   Computes CRC32, MD5, and SHA1 hashes for the specified file(s)
   Version 2009-01-12
   http://blogs.msdn.com/Delay/

Syntax: ComputeFileHashes FileOrPattern [FileOrPattern [...]]
   Each FileOrPattern can specify a single file or a set of files

Examples:
   ComputeFileHashes Image.iso
   ComputeFileHashes *.iso
   ComputeFileHashes *.iso *.vhd

[Click here to download ComputeFileHashes along with its complete source code.]

Here's the output of running ComputeFileHashes on the recently released Windows 7 Beta ISO images. Anyone can verify the checksums below on the MSDN Subscriber Downloads site. (Note: You need to be a subscriber to download files from there; everyone else can download from the official Windows 7 link.)

C:\T>ComputeFileHashes.exe "M:\Windows 7\*"

M:\Windows 7\en_windows_7_beta_dvd_x64_x15-29074.iso
CRC32: 8E2FAD39
MD5: 773FC9CC60338C612AF716A2A14F177D
SHA1: E09FDBC1CB3A92CF6CC872040FDAF65553AB62A5

M:\Windows 7\en_windows_7_beta_dvd_x86_x15-29073.iso
CRC32: AABA5A48
MD5: F9DCE6EBD0A63930B44D8AE802B63825
SHA1: 6071184282B2156FF61CDC5260545C078CCA31EE

One of the things that was important to me when writing ComputeFileHashes was performance. Nobody likes to wait, and I'm probably even less patient than the average bear. One of the things I wanted my program to do was take advantage of multi-processing and the multi-core CPUs that are so prevalent these days. So ComputeFileHashes runs the three hash functions in parallel with each other and with loading the next bytes of the file. Theoretically, this can take advantage of four different cores - though in practice my limited testing suggests there's just not enough work to saturate them all. :)

While I paid attention to performance at a macro level (i.e., algorithm design), I didn't worry about it at a micro level (i.e., focused optimization). And I used .NET of all things. (Cue the doubters: "ZOMG, EPIC FAIL!!") If it uses .NET, it must be slow, right? Well, let's do the numbers:

I took some informal measurements of the time to compute hashes for the Windows 7 Beta x64 ISO image. ComputeFileHashes took just under 60 seconds to compute the CRC-32, MD5, and SHA-1 hashes of the 3.15 GB file on my home machine. Not instantaneous by any means, but also pretty reasonable for that amount of data.
Next up was my favorite program ever, Altap Salamander. I use Salamander for all my file management and it comes with a handy plugin for calculating/verifying checksums. Salamander is fully native code, so it should have a distinct performance advantage. Salamander took about 40 seconds to compute the CRC-32 and MD5 checksums in a single pass. This performance is noticeably better than ComputeFileHashes's, but it also doesn't include SHA-1 which is probably the most computationally intensive algorithm. I doubt adding SHA-1 would double the time, but it would almost certainly take longer. It's hard to balance missing features against better performance, so maybe this one is too close to call. :)
Last up was an internal tool I had called gethash. gethash is also native code and supports MD5, SHA-1, and some other hash types - but not CRC-32. However, gethash computes only one checksum a time. So I ran it once for SHA-1 in a time of just over 35 seconds and again for MD5 for another time of just over 35 seconds. The total time for gethash was therefore about 70 seconds and just like before we're missing one of the checksum types. In this case, one could argue that ComputeFileHashes wins the race because it's 10 seconds faster and delivers more value. Of course, if you only need one type of checksum, then you don't care that ComputeFileHashes generated others and gethash can do a single checksum in close to half the time it takes ComputeFileHashes. Again, no clear winner for me.

So depending on how you want to look at things, ComputeFileHashes is either the best or the worst of the lot. :) But recall that we're pitching an unoptimized .NET application against two solid native-code implementations. ComputeFileHashes pretty clearly held its ground and I'd say the doubters don't have much to complain about here.

Implementation notes:

.NET includes the HashAlgorithm abstract class and a variety of concrete subclasses for calculating cryptographic hashes. MD5 and SHA-1 are implemented by the Framework along with a handful of other useful hash functions. (In fact, it's trivial to add support for another HashAlgorithm to ComputeFileHashes simply by editing a table in the source code.) CRC-32 is not part of the Framework (due, I suspect, to its weaknesses), so I wrote my own CRC-32 HashAlgorithm subclass. I'll post more about this tomorrow.
The most efficient multi-hashing implementation would probably update each algorithm together as bytes are read from the file. However, the overhead of doing that with the .NET HashAlgorithm implementation (which sensibly expects blocks of data) seemed prohibitively expensive - and impractical to parallelize. An alternative would be to let different threads hash the complete file on their own. This parallelizes easily, but because the algorithms operate at different speeds, they'd soon be reading different parts of the file - and the disk thrashing would probably ruin performance. So I went with a compromise: ComputeFileHashes loads a 1 MB chunk of the file into memory, lets each of the algorithms process that block to completion on separate threads, then loads the next 1 MB of data and repeats. There's going to be some wasted time after the fastest algorithm completes and the slowest is still working, and there's probably some memory cache thrashing (for similar reasons as the disk) - but as the numbers above indicate, the performance of this approach is really quite reasonable. Of course, it's entirely possible that a different buffer size that would be even faster - so if you do some testing and find something better, please let me know! :)
ComputeFileHashes creates dedicated Threads for each of the hash functions it uses. Those threads stay around for the life of the application and are synchronized by a general-purpose helper class I wrote, WaitingRoom. I'll blog more about this in a couple of days.

ComputeFileHashes is a simple tool intended to make verifying checksums easy for anyone. I've wanted a good, fast, free, all-in-one solution I could trust for some time now, and the recent release of Windows 7 finally prompted me to write my own. I hope others to put ComputeFileHashes to use for themselves - perhaps even for the Windows 7 Beta! :)

Tags: Technical Utilities

The source code IS the executable [Releasing CSI, a C# interpreter (with source and tests) for .NET]

Wednesday, January 7, 2009

A few years ago I found myself spending a lot of time writing batch files to perform a variety of relatively simple tasks. For those who aren't familiar, you can do some surprisingly powerful things with batch files - but it usually involves a lot of trickery and arcane knowledge. I wanted a simpler way of doing things, but I didn't want to create a bunch of compiled executables and lose the elegant, easy-to-modify transparency that batch files offer. These days, I'd probably look to PowerShell for the solution, but back then there was no such thing...

What I did have was the power of .NET and some inspiration from a coworker's tool that was able to take .NET 1.1 source code and execute it "on the fly" without the need for compilation. What I did not have was something similar for .NET 2.0 or access to the source code for that tool (because the author had left the company). So I wrote my own:

============================================
==  CSI: C# Interpreter                   ==
==  Delay (http://blogs.msdn.com/Delay/)  ==
============================================


Summary
=======
CSI: C# Interpreter
     Version 2009-01-06 for .NET 3.5
     http://blogs.msdn.com/Delay/

Enables the use of C# as a scripting language by executing source code files
directly. The source code IS the executable, so it is easy to make changes and
there is no need to maintain a separate EXE file.

CSI (CodeFile)+ (-d DEFINE)* (-r Reference)* (-R)? (-q)? (-c)? (-a Arguments)?
   (CodeFile)+      One or more C# source code files to execute (*.cs)
   (-d DEFINE)*     Zero or more symbols to #define
   (-r Reference)*  Zero or more assembly files to reference (*.dll)
   (-R)?            Optional 'references' switch to include common references
   (-q)?            Optional 'quiet' switch to suppress unnecessary output
   (-c)?            Optional 'colorless' switch to suppress output coloring
   (-a Arguments)?  Zero or more optional arguments for the executing program

The list of common references included by the -R switch is:
   System.dll
   System.Data.dll
   System.Drawing.dll
   System.Windows.Forms.dll
   System.Xml.dll
   PresentationCore.dll
   PresentationFramework.dll
   WindowsBase.dll
   System.Core.dll
   System.Data.DataSetExtensions.dll
   System.Xml.Linq.dll

CSI's return code is 2147483647 if it failed to execute the program or 0 (or
whatever value the executed program returned) if it executed successfully.

Examples:
   CSI Example.cs
   CSI Example.cs -r System.Xml.dll -a ArgA ArgB -Switch
   CSI ExampleA.cs ExampleB.cs -d DEBUG -d TESTING -R


Notes
=====
CSI was inspired by net2bat, an internal .NET 1.1 tool whose author had left
Microsoft. CSI initially added support for .NET 2.0 and has now been extended
to support .NET 3.0 and .NET 3.5. Separate executables are provided to
accommodate environments where the latest version of .NET is not available.


Version History
===============

Version 2009-01-06
Initial public release

Version 2005-12-15
Initial internal release

[Click here to download CSI for .NET 3.5, 3.0, 2.0, and 1.1 - along with the complete source code and test suite.]

Using CSI is quite simple. Here's an example from the release package:

C:\T\CSI>TYPE Samples\HelloWorld.cs
using System;

public class HelloWorld
{
    public static void Main(string[] args)
    {
        Console.WriteLine("Hello world.");
    }
}

C:\T\CSI>CSI.exe Samples\HelloWorld.cs
Hello world.

And if you use the included batch files to register the .CSI file type (more on this in the notes below), it's even easier:

C:\T\CSI>TYPE Samples\Greetings.csi
using System;

public class Greetings
{
    public static void Main(string[] args)
    {
        Console.WriteLine("Hello {0}", string.Join(" ", args));
    }
}

C:\T\CSI>Samples\Greetings.csi out there world.
Hello out there world.

Notes:

Surprisingly, it's fairly straightforward to do what CSI does thanks to the CSharpCodeProvider class that's part of .NET. (In fact, the majority of the code for CSI is related to processing arguments, managing the display, and handling errors!) Behind the scenes, there's no real magic: CSharpCodeProvider is just a wrapper for the C# compiler which generates a temporary assembly that CSI calls into. Everything should work exactly the same under CSI as it would if compiled normally - the only difference is the convenience. :)
I hinted at the benefits of avoiding compilation above, but wanted to explain a bit more. In the usual way of doing things, there is typically (at least) a .CS file and a .EXE file for each tool used by, say a build process. The build process runs the .EXE file, and people refer to the .CS file. If anything ever needs to be changed, someone updates the .CS file, recompiles to generate a new .EXE, and checks both files in together. Usually... But sometimes they forget and only check in the .EXE - or there's some other .EXE-only modification that throws the synchronization of the two files out of whack and then it's just a mess to understand what's going on because the source code no longer matches the executable. CSI avoids that problem by allowing the build process to "run" the .CS file directly. Because there's only one file to manage, there's no chance of it getting out of sync!
Two simple scripts are included that can be used to RegisterCSI.cmd or UnregisterCSI.cmd. When run with administrative privileges, these scripts set up (or remove) a file association for .CSI files (which are just renamed .CS files) that automatically invokes CSI for the specified file. So, as shown in the earlier example, it's possible to run C# code files by name in a command window or by double-clicking them in the Windows Explorer!
There's a separate version of CSI compiled for each major .NET version that's in use today: 1.1, 2.0, 3.0, and 3.5. Aside from some minor functional limitations with the 1.1 version (due to non-existent features in that version of .NET), they're all the same CSI with the same options, etc.. The most obvious difference is the list of default assemblies included by the -R option: each version matches what Visual Studio 2008 provides as the default set of assemblies for a new project targeting the corresponding version of the Framework. The other big difference is that the .NET 3.5 version of CSI uses the C# 3.0 compiler (which can be a little tricky to do until someone shows you how), so it also supports new language features like var and all the LINQy syntax goodness that's available in .NET 3.5.
Editing standalone .CS files can be a bit annoying because the lack of an associated .CSPROJ file means that Visual Studio's IntelliSense isn't available. So I'll usually begin by creating a project in Visual Studio, taking advantage of its rich editing and debugging facilities to get everything working the way I want - then I pull out the resulting .CS file and run it with CSI. If there are any subsequent modifications to the code, they're usually trivial enough that the lack of IntelliSense isn't an issue. Alternatively, you could create a new project and add a link to the existing file in order to enable IntelliSense. OR you could use something like Snippet Compiler which offers a surprisingly rich Visual Studio-like editing and execution environment (with IntelliSense!) and works well with standalone .CS files.
Given that I've already mentioned PowerShell, what's the point of CSI? Well, maybe there isn't one... :) But I think there's still value here - even in a PowerShell world. (In fact, what prompted me to release CSI was an internal request to add support for .NET 3.5, so it seems at least one other person agrees with me.) Aside from knowing a bit about PowerShell and how it works, I haven't used it, so I won't try to do a feature comparison here. Some of the things PowerShell does are very cool and quite useful - and I think it's considerably better than .CMD files for automating typical tasks. Also, I love that it's tightly integrated with the .NET Framework. That said, it's another "language" to learn and not everybody has the time for that. Additionally, I don't think the current development and debugging options are quite as rich as Visual Studio already is. And PowerShell isn't available everywhere, whereas CSI offers a no-touch, install-free solution anywhere the .NET Framework can be found. Lastly, CSI is small (really small!) and very simple.
Though its name bears passing resemblance to a popular TV series, that's unintentional: once you know the C# compiler is named CSC.exe, it's obvious the C# interpreter should be named CSI.exe.

CSI was a fun project in its day and I've done quite a bit with it that would have been fairly tedious otherwise. Today's world offers plenty of alternatives - but if you're comfortable with C# and prefer to stick to one language for all your programming needs, CSI is pretty handy to have around. Because you never know when the urge to code will strike! :)

Tags: Technical Utilities