The blog of dlaa.me

Posts from October 2020

"If you can't measure it, you can't manage it." [A brief analysis of markdownlint rule popularity]

From time to time, discussions of a markdownlint rule come up where the popularity of one of the rules is questioned. There are about 45 rules right now, so there's a lot of room for debate about whether a particular one goes too far or isn't generally applicable. By convention, all rules are enabled for linting by default, though it is easy to disable any rules that you disagree with or that don't fit with a project's approach. But until recently, I had no good way of knowing what the popularity of these rules was in practice.

If only there were an easy way to collect the configuration files for the most popular repositories in GitHub, I could do some basic analysis to get an idea what rules were used or ignored in practice. Well, that's where Google's BigQuery comes in - specifically its database of public GitHub repositories that is available for anyone to query. I developed a basic understanding of the database and came up with the following query to list the most popular repositories with a markdownlint configuration file:

SELECT files.repo_name
FROM `bigquery-public-data.github_repos.files` as files
INNER JOIN `bigquery-public-data.github_repos.sample_repos` as repos
  ON files.repo_name = repos.repo_name
WHERE files.path = ".markdownlint.json" OR files.path = ".markdownlint.yaml"
ORDER BY repos.watch_count DESC
LIMIT 100

Aside: While this resource was almost exactly what I needed, it turns out the data that's available is off by an order of magnitude, so this analysis is not perfect. However, it's an approximation anyway, so this should not be a problem. (For context, follow this Twitter thread with @JustinBeckwith.)

The query above returns about 60 repository names. The next step was to download and process the relevant configuration files and output a simple CSV file recording which repositories used which rules (based on configuration defaults, customizations, and deprecations). This was fairly easily accomplished with a bit of code I've published in the markdownlint-analyze-config repository. You can run it if you'd like, but I captured the output as of early October, 2020 in the file analyze-config.csv for convenience.

Importing that data into the Numbers app and doing some simple aggregation produced the following representation of how common each rule is across the data set:

Bar chart showing how common each  rule is

Some observations:

  • There are 8 rules that are used by every project whose data is represented here. These should be uncontroversial. Good job, rules!
  • The two rules at the bottom with less than 5% use are both deprecated because they have been replaced by more capable rules. It's interesting some projects have explicitly enabled them, but they can be safely ignored.
  • More than half of the rules are used in at least 95% of the scenarios. These seem pretty solid as well, and are probably not going to see much protest.
  • All but 4 (of the non-deprecated) rules are used in at least 80% of the scenarios. Again, pretty strong, though there is some room for discussion in the lower ranges of this category.
  • Of those 4 least-popular rules that are active, 3 are used between 70% and 80% of the time. That's not shabby, but it's clear these rules are checking for things that are less universally applicable and/or somewhat controversial.
  • The least popular (non-deprecated) rule is MD013/line-length at about 45% popularity. This is not surprising, as there are definitely good arguments for and against manually wrapping lines at an arbitrary column. This rule is already disabled by default for the VS Code markdownlint extension because it is noisy in projects that use long lines (where nearly every line could trigger a violation).

Overall, this was a very informative exercise. The data source isn't perfect, but it's a good start and I can always rerun the numbers if I get a better list of repositories. Rules seem to be disabled less often in practice than I would have guessed. This is nice to see - and a good reminder to be careful about introducing controversial rules that many people end up wanting to turn off. The next time a discussion about rule popularity comes up, I'll be sure to reference this post!

If one is good, two must be better [markdownlint-cli2 is a new kind of command-line interface for markdownlint]

About 5 years ago, Igor Shubovych and I discussed the idea of writing a CLI for markdownlint. I wasn't ready at the time, so Igor created markdownlint-cli and it has been a tremendous help for the popularity of the library. I didn't do much with it at first, but for the past 3+ years I have been the primary contributor to the project. This CLI is the primary way that many users interact with the markdownlint library, so I think it is important to maintain it.

However, I've always felt a little bit like a guest in someone else's home and while I have added new features, there were always some things I wasn't comfortable changing. A few months ago, I decided to address this by creating my own CLI - and approaching the problem from a slightly different/unusual perspective so as not to duplicate the fine work that had already been done. My implementation is named markdownlint-cli2 and you can find it here:

markdownlint-cli2 on GitHub
markdownlint-cli2 on npm

markdownlint-cli2 has a few principles that motivate its interface and behavior:

  • Faster is better. There are three phases of execution: globbing/configuration parsing, linting of each configuration set, and summarizing results. Each of these phases takes full advantage of asynchronous function calls to execute operations concurrently and make the best use of Node.js's single-threaded architecture. Because it's inefficient to enumerate files and directories that end up being ignored by a filter, all glob patterns for input (inclusive and exclusive) are expected to be passed on the command-line so they can be used by the glob library to optimize file system access.

    How much faster does it run? Well, it depends. :) In many cases, probably only a little bit faster - all the same Markdown files need to be processed by the same library code. That said, linting is done concurrently, so slow disk scenarios offer one opportunity for speed-ups. In testing, an artificial 5 millisecond delay for every file access was completely overcome by this concurrency. In situations that play to the strengths of the new implementation - such as with many ignored files and few Markdown files (common for Node.js packages with deep node_modules) - the difference can be significant. One early user reported times exceeding 100 seconds dropped to less than 1 second.

  • Configuration should be flexible. Command line arguments are never as expressive as data or code, so all configuration for markdownlint-cli2 is specified via appropriately-named JSON, YAML, or JavaScript files. These options files can live anywhere and automatically apply to their part of the directory tree. Settings cascade and inherit, so it's easy to customize a particular scenario without repeating yourself. Other than two necessary exceptions, all options (including custom rules and parser plugins) can be set or changed in any directory being linted.

    It's unconventional for a command-line tool not to allow configuration via command-line arguments, but this model keeps the input (glob patterns) separate from the configuration (files) and allows easier sharing of settings across tools (like the markdownlint extension for VS Code). It's also good for scenarios where the user may not have the ability to alter the command line (such as GitHub's Super-Linter action).

    In addition to support for custom rules, it's possible to provide custom markdown-it plugins - for each directory if desired. This can be necessary for scenarios that involve custom rendering and make use of non-standard CommonMark syntax. By using an appropriate plugin, the custom syntax gets parsed correctly and the linting rules can work with the intended structure of the document. A common scenario is when embedding TeX math equations with the $ math $ or $$ math $$ syntax and a plugin such as markdown-it-texmath.

    Although the default output format to stderr is identical to that of markdownlint-cli (making it easy to switch between CLI's), there are lots of ways to display results and so any number of output formatters can be configured to run after linting. I've provided stock implementations for default, JSON, JUnit, and summarized results, but anyone can provide their own formatter if they want something else.

  • Dependencies should be few. As with the markdownlint library itself, package dependencies are kept to a minimum. Fewer dependencies mean less code to install, parse, audit, and maintain - which makes everything easier.

So, which CLI should you use? Well, whichever you want! If you're happy with markdownlint-cli, there's no need to change. If you're looking for a bit more flexibility or want to see if markdownlint-cli2 is faster in your scenario, give it a try. At this point, markdownlint-cli2 supports pretty much everything markdownlint-cli does, so you're free to experiment and shouldn't need to give up any features if you switch.

What does this mean for the future of the original markdownlint-cli? Nothing! It's a great tool and it's used by many projects. I will continue to update it as I release new versions of the markdownlint library. However, I expect that my own time working on new features will be focused on markdownlint-cli2 for now.

Whether you use markdownlint-cli2 or markdownlint-cli, I hope you find it useful!