Safe X (ml parsing with XLINQ) [XLinqExtensions helps make XML parsing with .NET's XLINQ a bit safer and easier]

Thursday, May 26, 2011

XLINQ (aka LINQ-to-XML) is a set of classes that make it simple to work with XML by exposing the element tree in a way that's easy to manipulate using standard LINQ queries. So, for example, it's trivial to write code to select specific nodes for reading, create well-formed XML fragments, or transform an entire document. Because of its query-oriented nature, XLINQ makes it easy to ignore parts of a document that aren't relevant: if you don't query for them, they don't show up! Because it's so handy and powerful, I encourage folks who aren't already familiar to find out more.

Aside: As usual, flexibility comes with a cost and it is often more efficient to read and write XML with the underlying XmlReader and XmlWriter classes because they don't expose the same high-level abstractions. However, I'll suggest that the extra productivity of developing with XLINQ will often outweigh the minor computational cost it incurs.

When I wrote the world's simplest RSS reader as a sample for my post on WebBrowserExtensions, I needed some code to parse the RSS feed for my blog and dashed off the simplest thing possible using XLINQ. Here's a simplified version of that RSS feed for reference:

<rss version="2.0">
  <channel>
    <title>Delay's Blog</title>
    <item>
      <title>First Post</title>
      <pubDate>Sat, 21 May 2011 13:00:00 GMT</pubDate>
      <description>Post description.</description>
    </item>
    <item>
      <title>Another Post</title>
      <pubDate>Sun, 22 May 2011 14:00:00 GMT</pubDate>
      <description>Another post description.</description>
    </item>
  </channel>
</rss>

The code I wrote at the time looked a lot like the following:

private static void NoChecking(XElement feedRoot)
{
    var version = feedRoot.Attribute("version").Value;
    var title = feedRoot.Element("channel").Element("title").Value;
    ShowFeed(version, title);
    foreach (var item in feedRoot.Element("channel").Elements("item"))
    {
        title = item.Element("title").Value;
        var publishDate = DateTime.Parse(item.Element("pubDate").Value);
        var description = item.Element("description").Value;
        ShowItem(title, publishDate, description);
    }
}

Not surprisingly, running it on the XML above leads to the following output:

Delay's Blog (RSS 2.0)
  First Post
    Date: 5/21/2011
    Characters: 17
  Another Post
    Date: 5/22/2011
    Characters: 25

That code is simple, easy to read, and obvious in its intent. However (as is typical for sample code tangential to the topic of interest), there's no error checking or handling of malformed data. If anything within the feed changes, it's quite likely the code I show above will throw an exception (for example: because the result of the Element method is null when the named element can't be found). And although I don't expect changes to the format of this RSS feed, I'd be wary of shipping code like that because it's so fragile.

Aside: Safely parsing external data is a challenging task; many exploits take advantage of parsing errors to corrupt a process's state. In the discussion here, I'm focusing mainly on "safety" in the sense of "resiliency": the ability of code to continue to work (or at least not throw an exception) despite changes to the format of the data it's dealing with. Naturally, more resilient parsing code is likely to be less vulnerable to hacking, too - but I'm not specifically concerned with making code hack-proof here.

Adding the necessary error-checking to get the above snippet into shape for real-world use isn't particularly hard - but it does add a lot more code. Consequently, readability suffers; although the following method performs exactly the same task, its implementation is decidedly harder to follow than the original:

private static void Checking(XElement feedRoot)
{
    var version = "";
    var versionAttribute = feedRoot.Attribute("version");
    if (null != versionAttribute)
    {
        version = versionAttribute.Value;
    }
    var channelElement = feedRoot.Element("channel");
    if (null != channelElement)
    {
        var title = "";
        var titleElement = channelElement.Element("title");
        if (null != titleElement)
        {
            title = titleElement.Value;
        }
        ShowFeed(version, title);
        foreach (var item in channelElement.Elements("item"))
        {
            title = "";
            titleElement = item.Element("title");
            if (null != titleElement)
            {
                title = titleElement.Value;
            }
            var publishDate = DateTime.MinValue;
            var pubDateElement = item.Element("pubDate");
            if (null != pubDateElement)
            {
                if (!DateTime.TryParse(pubDateElement.Value, out publishDate))
                {
                    publishDate = DateTime.MinValue;
                }
            }
            var description = "";
            var descriptionElement = item.Element("description");
            if (null != descriptionElement)
            {
                description = descriptionElement.Value;
            }
            ShowItem(title, publishDate, description);
        }
    }
}

It would be nice if we could somehow combine the two approaches to arrive at something that reads easily while also handling malformed content gracefully... And that's what the XLinqExtensions extension methods are all about!

Using the naming convention SafeGet* where "*" can be Element, Attribute, StringValue, or DateTimeValue, these methods are simple wrappers that avoid problems by always returning a valid object - even if they have to create an empty one themselves. In this manner, calls that are expected to return an XElement always do; calls that are expected to return a DateTime always do (with a user-provided fallback value for scenarios where the underlying string doesn't parse successfully). To be clear, there's no magic here - all the code is very simple - but by pushing error handling into the accessor methods, the overall experience feels much nicer.

To see what I mean, here's what the same code looks like after it has been changed to use XLinqExtensions - note how similar it looks to the original implementation that used the simple "write it the obvious way" approach:

private static void Safe(XElement feedRoot)
{
    var version = feedRoot.SafeGetAttribute("version").SafeGetStringValue();
    var title = feedRoot.SafeGetElement("channel").SafeGetElement("title").SafeGetStringValue();
    ShowFeed(version, title);
    foreach (var item in feedRoot.SafeGetElement("channel").Elements("item"))
    {
        title = item.SafeGetElement("title").SafeGetStringValue();
        var publishDate = item.SafeGetElement("pubDate").SafeGetDateTimeValue(DateTime.MinValue);
        var description = item.SafeGetElement("description").SafeGetStringValue();
        ShowItem(title, publishDate, description);
    }
}

Not only is the XLinqExtensions version almost as easy to read as the simple approach, it has all the resiliancy benefits of the complex one! What's not to like?? :)

[Click here to download the XLinqExtensions sample application containing everything shown here.]

I've found the XLinqExtensions approach helpful in my own projects because it enables me to parse XML with ease and peace of mind. The example I've provided here only scratches the surface of what's possible (ex: SafeGetIntegerValue, SafeGetUriValue, etc.), and is intended to set the stage for others to adopt a more robust approach to XML parsing. So if you find yourself parsing XML, please consider something similar!

PS - The complete set of XLinqExtensions methods I use in the sample is provided below. Implementation of additional methods to suit custom scenarios is left as an exercise to the reader. :)

/// <summary>
/// Class that exposes a variety of extension methods to make parsing XML with XLINQ easier and safer.
/// </summary>
static class XLinqExtensions
{
    /// <summary>
    /// Gets the named XElement child of the specified XElement.
    /// </summary>
    /// <param name="element">Specified element.</param>
    /// <param name="name">Name of the child.</param>
    /// <returns>XElement instance.</returns>
    public static XElement SafeGetElement(this XElement element, XName name)
    {
        Debug.Assert(null != element);
        Debug.Assert(null != name);
        return element.Element(name) ?? new XElement(name, "");
    }

    /// <summary>
    /// Gets the named XAttribute of the specified XElement.
    /// </summary>
    /// <param name="element">Specified element.</param>
    /// <param name="name">Name of the attribute.</param>
    /// <returns>XAttribute instance.</returns>
    public static XAttribute SafeGetAttribute(this XElement element, XName name)
    {
        Debug.Assert(null != element);
        Debug.Assert(null != name);
        return element.Attribute(name) ?? new XAttribute(name, "");
    }

    /// <summary>
    /// Gets the string value of the specified XElement.
    /// </summary>
    /// <param name="element">Specified element.</param>
    /// <returns>String value.</returns>
    public static string SafeGetStringValue(this XElement element)
    {
        Debug.Assert(null != element);
        return element.Value;
    }

    /// <summary>
    /// Gets the string value of the specified XAttribute.
    /// </summary>
    /// <param name="attribute">Specified attribute.</param>
    /// <returns>String value.</returns>
    public static string SafeGetStringValue(this XAttribute attribute)
    {
        Debug.Assert(null != attribute);
        return attribute.Value;
    }

    /// <summary>
    /// Gets the DateTime value of the specified XElement, falling back to a provided value in case of failure.
    /// </summary>
    /// <param name="element">Specified element.</param>
    /// <param name="fallback">Fallback value.</param>
    /// <returns>DateTime value.</returns>
    public static DateTime SafeGetDateTimeValue(this XElement element, DateTime fallback)
    {
        Debug.Assert(null != element);
        DateTime value;
        if (!DateTime.TryParse(element.Value, out value))
        {
            value = fallback;
        }
        return value;
    }
}

Tags: Technical