So I dashed off another string extractor based on the implementation of
HtmlStringExtractor runs on Node.js. Dependencies come via npm: htmlparser2 for parsing, glob for globbing, and request for web access. Wildcard matching includes subdirectories when ** is used. Pass a URL to extract strings from the web.
HtmlStringExtractor works just like its predecessors, outputting all strings/attributes/comments to the console for redirection or processing.
But it has an additional power: the ability to access content directly from the internet via URL.
Because so much of the web is HTML, it seemed natural to support live-extraction - which in turn makes it easier to spell-check a web site and be sure you're including "hidden" text (like alt attributes) that copy+paste doesn't cover.
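To illustrate the URL-versus-file distinction, here is a minimal sketch of how a tool like this might decide whether an argument should be fetched from the web or treated as a file pattern. The helper name isWebUrl is hypothetical and this is not taken from the HtmlStringExtractor source; it only assumes the global URL class that ships with Node.js.

```javascript
// Hypothetical helper (not from HtmlStringExtractor itself): decides whether a
// command-line argument should be fetched from the web or treated as a local
// file/glob pattern.
function isWebUrl(arg) {
  try {
    const { protocol } = new URL(arg);
    // Only http(s) counts as "web". This also keeps Windows-style paths with a
    // drive letter (which parse with a protocol like "c:") on the file side.
    return protocol === "http:" || protocol === "https:";
  } catch {
    // Relative paths and glob patterns are not absolute URLs, so URL() throws.
    return false;
  }
}

console.log(isWebUrl("https://example.com/")); // true
console.log(isWebUrl("**/*.html"));            // false
```

Checking the protocol explicitly, rather than just whether the argument parses, is what stops a drive-letter path from being mistaken for a URL.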
Its HTML parser is very forgiving, so HtmlStringExtractor will happily work with HTML-like languages such as
Of course, the utility of doing so decreases as the language gets further removed from HTML, but for many scenarios the results are quite acceptable.
In the specific case of XML, output should be entirely meaningful, filtering out all element and attribute metadata and leaving just the "real" data for review.
In keeping with the theme of "small and simple", I didn't add an option to exclude attributes by name - but you can imagine that filtering out things like href would do a lot to reduce noise.
Who knows, maybe I'll support that in a future update. :)
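For a sense of what such an option could look like, here is a rough sketch of filtering attributes by name. The { kind, name, value } item shape and the excludeAttributes function are assumptions made up for this example; HtmlStringExtractor itself writes plain strings to the console and has no such option today.

```javascript
// Hypothetical item shape: { kind, name, value }, where kind is "text",
// "comment", or "attribute", and name is only present for attributes.
function excludeAttributes(items, excludedNames) {
  // HTML attribute names are case-insensitive, so compare in lower case.
  const excluded = new Set(excludedNames.map((n) => n.toLowerCase()));
  return items.filter(
    (item) => item.kind !== "attribute" || !excluded.has(item.name.toLowerCase())
  );
}

const items = [
  { kind: "text", value: "Welcome to my site" },
  { kind: "attribute", name: "alt", value: "Company logo" },
  { kind: "attribute", name: "href", value: "https://example.com/about" },
  { kind: "comment", value: "TODO: update the footer" },
];

// Prints the text, the alt attribute, and the comment, but not the href.
for (const item of excludeAttributes(items, ["href", "src"])) {
  console.log(item.value);
}
```

Text and comments pass through untouched; only attributes whose names appear in the exclusion list are dropped.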
For now, things are simple. The StringExtractors GitHub repository has the complete code for all three extractors.
Aside: As I wrote this post, I realized there's another "language" I use regularly: JSON. Because of its simple structure, I don't think there's a need for JsonStringExtractor - but if you feel otherwise, please let me know! (It'd be easy to create.)
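As a rough indication of how little code that would take, here is a minimal sketch of what a JSON extractor's core might look like. The extractJsonStrings name is hypothetical, and skipping property names (treating them as structure rather than data) is an assumption for this example, not a design decision from the repository.

```javascript
// Collects every string value found anywhere in a JSON document.
// Property names are skipped on the assumption that they are structure,
// not "real" data worth reviewing.
function extractJsonStrings(jsonText) {
  const strings = [];
  const visit = (value) => {
    if (typeof value === "string") {
      strings.push(value);
    } else if (Array.isArray(value)) {
      value.forEach((element) => visit(element));
    } else if (value !== null && typeof value === "object") {
      Object.values(value).forEach((member) => visit(member));
    }
    // Numbers, booleans, and null carry no text, so they are ignored.
  };
  visit(JSON.parse(jsonText));
  return strings;
}

const sample = '{"title": "Hello", "tags": ["a", "b"], "count": 3}';
console.log(extractJsonStrings(sample)); // [ 'Hello', 'a', 'b' ]
```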