The blog of dlaa.me

Looks the same - with half the overhead! [Update to free ConvertClipboardRtfToHtmlText tool and source code gives more compact output; Can you do better?]

I recently updated my ConvertClipboardRtfToHtmlText tool to work with Visual Studio 2010 (Beta 2). This utility takes the RTF that Visual Studio puts on the clipboard, converts it into HTML, and substitutes the converted text for pasting into web pages, blog posts, etc. It works great and I use it all the time for my blog.

In the comments to that post, kind reader Sameer pointed out that the converted HTML was more verbose than it needed to be - and I quickly replied that it wasn't my fault. :) Here's the example Sameer gave (which is particularly inefficient):

public partial class

And here's the corresponding HTML (on multiple lines because it's so long):

<pre>
<span style='color:#000000'></span>
<span style='color:#0000ff'>public</span>
<span style='color:#000000'> </span>
<span style='color:#0000ff'>partial</span>
<span style='color:#000000'> </span>
<span style='color:#0000ff'>class</span>
</pre>

Yup, that's almost obnoxiously inefficient: there's a useless black span at the beginning and a bunch of pointless color swapping for both of the space characters. Something more along the lines of the following would be much better:

<pre style='color:#0000ff'>public partial class</pre>

The HTML for both examples ends up looking exactly the same in a web browser, so wouldn't it be nice if the tool produced the second, more compact form?

I thought so, too!

 

[Click here to download the ConvertClipboardRtfToHtmlText tool along with its complete source code.]

 

I had a bit of spare time the other night and decided to make a quick attempt at optimizing the output of ConvertClipboardRtfToHtmlText according to some ideas I'd been playing around with. Specifically, instead of outputting the converted text as it gets parsed, the new code builds an in-memory representation of the entire clipboard contents and associated color changes. After everything has been loaded, it performs some basic optimization steps to remove unnecessary color changes: whitespace-only runs are folded into the run that follows them, and adjacent runs that share a color are merged. Once that's been done, the optimized HTML is placed on the clipboard just like before.

Here's what the relevant code looks like (recall that this tool compiles for .NET 2.0, so it can't use LINQ):

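// Walk the runs from last to first so that removing a run never shifts one that hasn't been visited yet
int j = runs.Count - 1;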
while (0 <= j)
{
    Run run = runs[j];

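    // Remove color changes for whitespace runs by folding each one into the run that follows it
    // (a whitespace run at the very end has nothing to fold into and is dropped)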
    if (0 == run.Text.Trim().Length)
    {
        runs.RemoveAt(j);
        if (j < runs.Count)
        {
            runs[j].Text = run.Text + runs[j].Text;
        }
        else
        {
            j--;
        }
        continue;
    }

    // Remove redundant color changes
    if ((j + 1 < runs.Count) && (run.Color == runs[j + 1].Color))
    {
        runs.RemoveAt(j);
        runs[j].Text = run.Text + runs[j].Text;
    }

    j--;
}

// Find most common color
Dictionary<Color, int> colorCounts = new Dictionary<Color, int>();
foreach (Run run in runs)
{
    if (!colorCounts.ContainsKey(run.Color))
    {
        colorCounts[run.Color] = 0;
    }
    colorCounts[run.Color]++;
}
Color mostCommonColor = Color.Empty;
int mostCommonColorCount = 0;
foreach (Color color in colorCounts.Keys)
{
    if (mostCommonColorCount < colorCounts[color])
    {
        mostCommonColor = color;
        mostCommonColorCount = colorCounts[color];
    }
}

...

// Build HTML for run stream
sb.Length = 0;
sb.AppendFormat("<pre style='color:#{0:x2}{1:x2}{2:x2}'>", mostCommonColor.R, mostCommonColor.G, mostCommonColor.B);
foreach (Run run in runs)
{
    if (run.Color != mostCommonColor)
    {
        sb.AppendFormat("<span style='color:#{0:x2}{1:x2}{2:x2}'>", run.Color.R, run.Color.G, run.Color.B);
    }
    sb.Append(run.Text);
    if (run.Color != mostCommonColor)
    {
        sb.Append("</span>");
    }
}
sb.Append("</pre>");

 

The code comments explain what's going on and it's all pretty straightforward. The one sneaky thing is the part that finds the most commonly used color and makes that the default color of the entire block. By doing so, the number of span elements can be reduced significantly: switching to that common color becomes as simple as exiting the current span (which needed to happen anyway).
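To see those steps working together, here is a minimal, self-contained sketch run against Sameer's example. It is an illustration rather than the tool's actual source: the `Run` class shape, the `RunOptimizer` and `Demo` names, and the use of an "rrggbb" string in place of `System.Drawing.Color` are all assumptions, and the run text is assumed to be HTML-encoded already.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the tool's run type: a piece of text plus its color.
class Run
{
    public string Text;
    public string Color; // "rrggbb" hex string (an assumption that avoids System.Drawing)
    public Run(string text, string color) { Text = text; Color = color; }
}

static class RunOptimizer
{
    // Applies the same two rules as the loop shown earlier, walking from the last run to the first.
    public static void Collapse(List<Run> runs)
    {
        int j = runs.Count - 1;
        while (0 <= j)
        {
            Run run = runs[j];
            if (0 == run.Text.Trim().Length)
            {
                // Whitespace shows no color: fold it into the following run
                // (a whitespace run at the very end is simply dropped).
                runs.RemoveAt(j);
                if (j < runs.Count) { runs[j].Text = run.Text + runs[j].Text; }
                else { j--; }
                continue;
            }
            if ((j + 1 < runs.Count) && (run.Color == runs[j + 1].Color))
            {
                // Neighbors share a color: merge them into one run.
                runs.RemoveAt(j);
                runs[j].Text = run.Text + runs[j].Text;
            }
            j--;
        }
    }

    // Returns the color used by the most runs, or black if there are no runs.
    public static string MostCommonColor(List<Run> runs)
    {
        Dictionary<string, int> counts = new Dictionary<string, int>();
        string best = "000000";
        int bestCount = 0;
        foreach (Run run in runs)
        {
            int count;
            counts.TryGetValue(run.Color, out count); // count stays 0 for a new color
            counts[run.Color] = ++count;
            if (bestCount < count) { best = run.Color; bestCount = count; }
        }
        return best;
    }

    // Emits a <pre> in the most common color and a <span> only for runs that differ from it.
    public static string ToHtml(List<Run> runs)
    {
        string common = MostCommonColor(runs);
        StringBuilder sb = new StringBuilder();
        sb.AppendFormat("<pre style='color:#{0}'>", common);
        foreach (Run run in runs)
        {
            if (run.Color != common) { sb.AppendFormat("<span style='color:#{0}'>{1}</span>", run.Color, run.Text); }
            else { sb.Append(run.Text); }
        }
        sb.Append("</pre>");
        return sb.ToString();
    }
}

static class Demo
{
    // Sameer's example: an empty black run, then blue keywords separated by black spaces.
    public static List<Run> SameerExample()
    {
        List<Run> runs = new List<Run>();
        runs.Add(new Run("", "000000"));
        runs.Add(new Run("public", "0000ff"));
        runs.Add(new Run(" ", "000000"));
        runs.Add(new Run("partial", "0000ff"));
        runs.Add(new Run(" ", "000000"));
        runs.Add(new Run("class", "0000ff"));
        return runs;
    }

    static void Main()
    {
        List<Run> runs = SameerExample();
        RunOptimizer.Collapse(runs);
        // Prints: <pre style='color:#0000ff'>public partial class</pre>
        Console.WriteLine(RunOptimizer.ToHtml(runs));
    }
}
```

The six original runs collapse into a single blue run, so no span is needed at all and the output matches the compact form shown near the top of the post.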

So was this coding exercise worth the effort? Is the resulting HTML noticeably smaller, or was this all just superficial messing around? To answer that, let's look at some statistics for converting the entire ConvertClipboardRtfToHtmlText.cs file:

                                          Normal   Optimized   Change
Character count of .CS file               11,996      11,996      N/A
Character count converted HTML            32,091      21,158     -34%
Extra characters for HTML representation  20,095       9,162     -54%

Hey, those are pretty good results for just an hour's effort! And not only is the new representation significantly smaller, it's also less cluttered and easier to read - so it's easier to deal with, too. I'm happy with the improvement and switched to the new version of ConvertClipboardRtfToHtmlText a couple of posts ago. So if you notice my blog posts loading slightly faster than before, this could be why... :)
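For anyone who wants to check the arithmetic behind those percentages, here is a small sketch. The `OverheadStats` and `PercentChange` names are made up for illustration; the only inputs are the three character counts reported above.

```csharp
using System;

static class OverheadStats
{
    // Percent change from 'before' to 'after', rounded to the nearest whole percent.
    public static int PercentChange(int before, int after)
    {
        return (int)Math.Round(100.0 * (after - before) / before);
    }

    static void Main()
    {
        const int source = 11996;        // characters in the .CS file
        const int normalHtml = 32091;    // converted HTML before the optimization
        const int optimizedHtml = 21158; // converted HTML after the optimization

        // "Extra characters" is whatever the HTML adds on top of the source text.
        int normalExtra = normalHtml - source;       // 20095
        int optimizedExtra = optimizedHtml - source; // 9162

        Console.WriteLine("HTML size change: {0}%", PercentChange(normalHtml, optimizedHtml)); // -34
        Console.WriteLine("Overhead change: {0}%", PercentChange(normalExtra, optimizedExtra)); // -54
    }
}
```

The same 10,933 characters are saved in both rows; the overhead figure is larger only because it is measured against the 20,095 extra characters rather than the full 32,091.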

 

A challenge just for fun: I haven't thought about it too much (which could be my downfall), but I'll suggest that the output of the new approach is just about optimal for what it's doing. Every color change is now necessary, and they're about as terse as they can be. Unless I decide to throw away some information (ex: by using the 3-character HTML color syntax) or change the design (ex: by creating a bunch of 1-character CSS classes), I don't think things can get much better than this and still accurately reproduce the appearance of the original content in Visual Studio. Therefore, if you can reduce the overhead for this version of ConvertClipboardRtfToHtmlText.cs by an additional 5% (without resorting to invalid HTML), I will credit you and your technique in a future blog post! :)
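As a starting point for anyone taking up the challenge, here is one hedged illustration of the 3-character idea. CSS expands `#rgb` to `#rrggbb` by doubling each digit, so a color whose channels are all doubled hex digits (multiples of 0x11) can be shortened without losing anything. The `ShortHex` class and `ToCss` method are hypothetical; this is not something the tool does, and whether it is enough to reach the 5% mark is untested here.

```csharp
using System;

static class ShortHex
{
    // Returns "#rgb" when every channel is a doubled hex digit (a multiple of 0x11),
    // and falls back to the full "#rrggbb" form otherwise.
    public static string ToCss(byte r, byte g, byte b)
    {
        if ((r % 17 == 0) && (g % 17 == 0) && (b % 17 == 0))
        {
            return string.Format("#{0:x}{1:x}{2:x}", r / 17, g / 17, b / 17);
        }
        return string.Format("#{0:x2}{1:x2}{2:x2}", r, g, b);
    }

    static void Main()
    {
        Console.WriteLine(ToCss(0x00, 0x00, 0xff)); // #00f
        Console.WriteLine(ToCss(0x2b, 0x91, 0xaf)); // #2b91af (cannot be shortened exactly)
    }
}
```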