Another tool in the fight against spelling errors. [Added HtmlStringExtractor for pulling strings, attributes, and comments from HTML/XML files and URLs]
When I blogged simple string extraction utilities for C# and JavaScript code, I thought I'd covered the programming languages that matter most to me. Then I realized my focus was too narrow; I spend a lot of time working with markup languages and want content there to be correctly spelled as well.
So I dashed off another string extractor based on the implementation of JavaScriptStringExtractor:
HtmlStringExtractor - Runs on Node.js. Dependencies via npm: htmlparser2 for parsing, glob for globbing, and request for web access. Wildcard matching includes subdirectories when ** is used. Pass a URL to extract strings from the web.
HtmlStringExtractor works just like its predecessors, outputting all strings/attributes/comments to the console for redirection or processing.
But it has an additional power: the ability to access content directly from the internet via URL.
Because so much of the web is HTML, it seemed natural to support live-extraction - which in turn makes it easier to spell-check a web site and be sure you're including "hidden" text (like title and alt attributes) that copy+paste doesn't cover.
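To make the strings/attributes/comments idea concrete, here is a minimal sketch of how that kind of extraction could be wired up with htmlparser2's callback-based Parser. It is not the actual HtmlStringExtractor source: the names createCollector and extractStrings are made up for illustration, and the sketch makes no attempt to skip script or style content. htmlparser2 is required inside extractStrings so the handlers can be exercised without the dependency installed.

```javascript
"use strict";

// Hypothetical helper: builds htmlparser2-style handlers that collect
// attribute values, text, and comments into a single array.
function createCollector() {
  const strings = [];
  // Skip whitespace-only fragments so the output stays readable.
  const add = (value) => {
    const trimmed = value.trim();
    if (trimmed) {
      strings.push(trimmed);
    }
  };
  const handlers = {
    // Called for every opening tag; attribs maps attribute names to values.
    onopentag(name, attribs) {
      Object.keys(attribs).forEach((key) => add(attribs[key]));
    },
    // Called for text content (possibly in more than one chunk per node).
    ontext(text) {
      add(text);
    },
    // Called with the body of each comment.
    oncomment(data) {
      add(data);
    },
  };
  return { strings, handlers };
}

// Hypothetical entry point: parses an HTML string and returns what was found.
function extractStrings(html) {
  // Assumes "npm install htmlparser2" has been run.
  const { Parser } = require("htmlparser2");
  const collector = createCollector();
  const parser = new Parser(collector.handlers, { decodeEntities: true });
  parser.write(html);
  parser.end();
  return collector.strings;
}
```

Calling extractStrings on a snippet and printing each result with console.log gives one string per line, which is the shape a spell checker or a redirect to a file wants.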
Its HTML parser is very forgiving, so HtmlStringExtractor will happily work with HTML-like languages such as XML, ASPX, and PHP.
Of course, the utility of doing so decreases as the language gets further removed from HTML, but for many scenarios the results are quite acceptable.
In the specific case of XML, output should be entirely meaningful, filtering out all element and attribute metadata and leaving just the "real" data for review.
In keeping with the theme of "small and simple", I didn't add an option to exclude attributes by name - but you can imagine that filtering out things like id, src, and href would do a lot to reduce noise.
Who knows, maybe I'll support that in a future update. :)
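For the curious, here is a rough sketch of what excluding attributes by name might look like. Nothing like this exists in HtmlStringExtractor today; filterAttributes and DEFAULT_EXCLUDED are hypothetical names, and the default list is just the three attributes mentioned above. The attribs argument is assumed to be a plain object mapping attribute names to values.

```javascript
"use strict";

// Hypothetical default exclusion list, taken from the examples in the text.
const DEFAULT_EXCLUDED = ["id", "src", "href"];

// Returns the attribute values worth reviewing, skipping excluded names.
// Name comparison is case-insensitive because HTML attribute names are.
function filterAttributes(attribs, excluded = DEFAULT_EXCLUDED) {
  const skip = new Set(excluded.map((name) => name.toLowerCase()));
  return Object.keys(attribs)
    .filter((name) => !skip.has(name.toLowerCase()))
    .map((name) => attribs[name]);
}
```

Passing a different list as the second argument (for example, adding class or style) would let the noise level be tuned per project.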
For now, things are simple. The StringExtractors GitHub repository has the complete code for all three extractors.
Enjoy!
Aside: As I wrote this post, I realized there's another "language" I use regularly: JSON. Because of its simple structure, I don't think there's a need for JsonStringExtractor - but if you feel otherwise, please let me know! (It'd be easy to create.)
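To show how little a JSON version would need, here is a minimal sketch. It is purely illustrative: extractJsonStrings is a hypothetical name, and the sketch assumes only string values are interesting (property names are ignored).

```javascript
"use strict";

// Recursively collects every string value from parsed JSON data.
// Numbers, booleans, and null are skipped; property names are not collected.
function extractJsonStrings(value, out = []) {
  if (typeof value === "string") {
    out.push(value);
  } else if (Array.isArray(value)) {
    value.forEach((item) => extractJsonStrings(item, out));
  } else if (value !== null && typeof value === "object") {
    Object.keys(value).forEach((key) => extractJsonStrings(value[key], out));
  }
  return out;
}
```

Feeding it the result of JSON.parse and printing each entry with console.log would match the console-output style of the other extractors.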