Posts tagged "Node.js"

En Provence [Some thoughts about npm package provenance - and why I have not enabled it]

Thursday, July 25, 2024

Last year, the GitHub blog outlined efforts to secure the Node.js package repository by Introducing npm package provenance. They write:

In order to increase the level of trust you have in the npm packages you download from the registry you must have visibility into the process by which the source was translated into the published artifact.

This requirement is addressed by npm package provenance:

What we need is a way to draw a direct line from the npm package back to the exact source code commit from which it was derived.

The npm documentation on Generating provenance statements explains the implications:

When a package in the npm registry has established provenance, it does not guarantee the package has no malicious code. Instead, npm provenance provides a verifiable link to the package's source code and build instructions, which developers can then audit and determine whether to trust it or not.

It is important to call out that provenance does NOT:

Establish trustworthiness: A package published with provenance could contain code to delete personal files when installed or imported. And if a package did so, it would be inappropriate to remove provenance as that is not a statement about trust.
Enable reproducible builds: The process used to produce a package can (and frequently does) reference artifacts outside its own repository when creating the package. A common scenario is installing other npm packages without a lock file. Even using a lock file offers no guarantee because dependent packages could reference ephemeral content like a URL, local state, randomness, etc..
Avoid the need to audit package contents: Knowing which repository commit was used to generate a package offers no guarantee about what is actually in the package. So it is necessary to manually audit every file in the package if security is important and trustworthiness needs to be established.
Avoid the need to audit package contents for every new version: Similarly, just because one version of a package was found to be trustworthy, there is no guarantee the next version will be. Therefore, it is necessary to perform a complete audit for every update.

It's also notable that provenance DOES:

Require giving an npm publish token to GitHub: As outlined in the documentation on Publishing packages to the npm registry, provenance requires granting GitHub permission to publish packages on your behalf. This creates a new opportunity for integrity to be compromised (and makes GitHub an attractive target for attackers).
Require bypassing two-factor authentication (2FA) for npm package publish: Multi-factor authentication is widely considered a baseline security practice. As outlined in the documentation on Requiring 2FA for package publishing and settings modification, this security measure must be disabled to give GitHub permission to publish on your behalf (because GitHub does not have access to the second factor).
Require defining and maintaining a GitHub Actions workflow for package publish: This additional effort should not be overwhelming for a package maintainer, but it represents ongoing time and attention that does not add (direct) value for a package's users. Furthermore, this workflow must be replicated across each package.

Considering the advantages and disadvantages, it does not seem to me that introducing npm provenance offers compelling enough benefits (for package consumers or for producers) to offset the cost for a maintainer. (Especially for a maintainer like myself with multiple packages and limited free time.) For now, my intent is to continue using git tags to identify which repository commit is associated with published package versions (e.g., markdownlint tags).

Tags: Node.js Technical

"Hang loose" is for surfers, not developers [Why I pin dependency versions in Node.js packages]

Sunday, March 17, 2024

A few days ago, I posted a response to a question I get asked about open-source project management. Here we go again - this time the topic is dependency versioning.

What is a package dependency?

In the Node.js ecosystem, packages (a.k.a. projects) can make use of other packages by declaring them as a dependency in package.json and specifying the range of supported versions. When a package gets installed, the package manager (typically npm) makes sure appropriate versions of all dependencies are included.

How are dependency versions specified?

The Node community uses semantic versioning of the form major.minor.patch. There are many ways to specify a version range, but the most common is to identify a specific version (typically the most recent) and prefix it with a tilde or caret to signify that later versions which differ only by patch or minor.patch are also acceptable. For example: ~1.2.3 and ^1.2.3. This is what is meant by "loose" versioning.

Why does the community use "loose" versioning?

The intent of loose versioning is to automatically benefit from bug fixes and non-breaking changes to dependent packages. Any time an install is run or an update is performed, the latest (allowable) version of such dependencies will be included and the user will seamlessly benefit from any bug fixes that were made since the named version was published.

What is a "pinned" dependency version?

A pinned dependency version specifies a particular major.minor.patch version and does not include any modifiers. The only version that satisfies this range is the exact version listed. For example: 1.2.3. Bug fixes to such package dependencies will not be used until a new version of the package that references them is published (with updated references).

Why is pinning a better versioning strategy?

Pinning ensures that users only run a package with the set of dependencies it has been tested with. While this doesn't rule out the possibility of bugs, it's far safer and more predictable than loose versioning, which allows users to run with an unpredictable set of dependencies. In the loose versioning worst case, every install of a package could have a different set of dependencies. This is a nightmare for quality and reliability. With pinning, behavior changes only show up when the user decides to update versions. If anything breaks, the upgrade can be skipped while the issue is investigated. Loose versioning doesn't allow "undo"; when something breaks, you're stuck until a fix gets published.

What's so bad about running untested configurations?

As much as developers may try to ensure consistent behavior across minor- and patch-level version updates, any change - no matter how small - has the possibility of altering behavior and causing failures. Worse, such behavior changes show up unexpectedly and unpredictably and can be difficult to track down, especially for users who may not even realize the broken package was being used. I've had to investigate such issues on multiple occasions and think it is a waste of time for users and package maintainers alike.

Are popular projects safer to version loosely?

Well-run projects with thorough testing are probably less likely to cause problems then single-person hobby projects. But the underlying issue is the same: any change to dependency code can change runtime behavior and cause problems.

What about missing out on security bug fixes due to pinning?

While the urgency to include a security bug fix may be higher than a normal bug fix, the same challenges apply. There's no general-purpose way to identify a security fix from a normal fix from a breaking change.

Could pinning lead to larger install sizes?

Yes, because the package manager doesn't have as much freedom to choose among package versions that are shared by multiple dependencies. However, this is a speculative optimization with limited benefit in practice as disk space is comparatively inexpensive. Correctness and predictability are far more important.

Isn't pinning pointless if dependent packages version loosely?

No, though it's less effective because those transitive dependencies can change/break at any time. My opinion is that every package should use pinning, but I can only enforce that policy for my own packages. (But maybe by setting a good example, I can be the change I want to see in the world...)

Is there a way to force a dependency update for a pinned package?

Yes, by updating a project's package.json to use overrides (npm) or resolutions (yarn). This means users who are worried about a specific dependency version can make sure that version is used in their scenario - and any resulting problems are their responsibility to deal with.

Does pinning versions create more work for a maintainer?

No, maintainers should already be updating package dependencies as part of each release. This can be done manually or automatically through the use of a tool like Dependabot.

"DRINK ME" [Why I do not include npm-shrinkwrap.json in Node.js tool packages]

Thursday, March 7, 2024

I maintain a few open source projects and get asked some of the same questions from time to time. I wrote the explanation below in August of 2023 and posted it as a GitHub Gist; I am capturing it here for easier reference.

Background:

For historical purposes and possible future reference, here are my notes on why I backed out a change to use npm-shrinkwrap.json in markdownlint-cli2.

The basic problem is that npm will include platform-specific packages in npm-shrinkwrap.json. Specifically, if one generates npm-shrinkwrap.json on Mac, it may include components (like fsevents) that are only supported on Mac. Attempts to use a published package with such a npm-shrinkwrap.json on a different platform like Linux or Windows fails with EBADPLATFORM. This seems (to me, currently) like a fundamental and fatal flaw with the way npm implements npm-shrinkwrap.json. And while there are ways npm might address this problem, the current state of things seems unusably broken.

To make this concrete, the result of running rm npm-shrinkwrap.json && npm install && npm shrinkwrap for this project on macOS can be found here: https://github.com/DavidAnson/markdownlint-cli2/blob/v0.9.0/npm-shrinkwrap.json. Note that fsevents is an optional Mac-only dependency: https://github.com/DavidAnson/markdownlint-cli2/blob/66b36d1681566451da8d56dcef4bb7a193cdf302/npm-shrinkwrap.json#L1955-L1958. Including it is not wrong per se, but sets the stage for failure as reproduced via GitHub Codespaces:

@DavidAnson > /workspaces/temp (main) $ ls
@DavidAnson > /workspaces/temp (main) $ node --version
v20.5.1
@DavidAnson > /workspaces/temp (main) $ npm --version
9.8.0
@DavidAnson > /workspaces/temp (main) $ npm install markdownlint-cli2@v0.9.0
npm WARN deprecated date-format@0.0.2: 0.x is no longer supported. Please upgrade to 4.x or higher.

added 442 packages in 4s

9 packages are looking for funding
  run `npm fund` for details
@DavidAnson > /workspaces/temp (main) $ npm clean-install
npm ERR! code EBADPLATFORM
npm ERR! notsup Unsupported platform for fsevents@2.3.3: wanted {"os":"darwin"} (current: {"os":"linux"})
npm ERR! notsup Valid os:  darwin
npm ERR! notsup Actual os: linux

npm ERR! A complete log of this run can be found in: /home/codespace/.npm/_logs/2023-08-27T18_24_58_585Z-debug-0.log
@DavidAnson > /workspaces/temp (main) $

Note that the initial package install succeeded, but the subsequent attempt to use clean-install failed due to the platform mismatch. This is a basic scenario and the user is completely blocked at this point.

Because this is a second-level failure, it is not caught by most reasonable continuous integration configurations which work from the current project directory instead of installing and testing via the packed .tgz file. However, attempts to reproduce this failure in CI via .tgz were unsuccessful: https://github.com/DavidAnson/markdownlint-cli2/commit/f9bcd599b3e6dbc8d2ebc631b13e922c5d0df8c0. From what I can tell, npm install of a local .tgz file is handled differently than when that same (identical) file is installed via the package repository.

While there are some efforts to test the .tgz scenario better (for example: https://github.com/boneskull/midnight-smoker), better testing does not solve the fundamental problem that npm-shrinkwrap.json is a platform-specific file that gets used by npm in a cross-platform manner.

Unrelated, but notable: npm installs ALL package dependencies when npm-shrinkwrap.json is present - even in a context where it would normally NOT install devDependencies. Contrast the 442 packages installed above vs. the 40 when --omit=dev is used explicitly:

@DavidAnson > /workspaces/temp (main) $ npm install markdownlint-cli2@v0.9.0 --omit=dev

added 40 packages in 1s

9 packages are looking for funding
  run `npm fund` for details
@DavidAnson > /workspaces/temp (main) $

But the default behavior of a dependency install in this manner is not to include devDependencies as seen when installing a version of this package without npm-shrinkwrap.json:

@DavidAnson > /workspaces/temp (main) $ npm install markdownlint-cli2@v0.9.2

added 35 packages in 2s

7 packages are looking for funding
  run `npm fund` for details
@DavidAnson > /workspaces/temp (main) $

References:

Request and discussion: Commit package-lock.json #186
Problem and investigation: v0.9.0: shinkwrap causes all dev dependencies to be installed #198

Tags: Node.js Technical

"If you can't measure it, you can't manage it." [A brief analysis of markdownlint rule popularity]

Monday, October 26, 2020

From time to time, discussions of a markdownlint rule come up where the popularity of one of the rules is questioned. There are about 45 rules right now, so there's a lot of room for debate about whether a particular one goes too far or isn't generally applicable. By convention, all rules are enabled for linting by default, though it is easy to disable any rules that you disagree with or that don't fit with a project's approach. But until recently, I had no good way of knowing what the popularity of these rules was in practice.

If only there were an easy way to collect the configuration files for the most popular repositories in GitHub, I could do some basic analysis to get an idea what rules were used or ignored in practice. Well, that's where Google's BigQuery comes in - specifically its database of public GitHub repositories that is available for anyone to query. I developed a basic understanding of the database and came up with the following query to list the most popular repositories with a markdownlint configuration file:

SELECT files.repo_name
FROM `bigquery-public-data.github_repos.files` as files
INNER JOIN `bigquery-public-data.github_repos.sample_repos` as repos
  ON files.repo_name = repos.repo_name
WHERE files.path = ".markdownlint.json" OR files.path = ".markdownlint.yaml"
ORDER BY repos.watch_count DESC
LIMIT 100

Aside: While this resource was almost exactly what I needed, it turns out the data that's available is off by an order of magnitude, so this analysis is not perfect. However, it's an approximation anyway, so this should not be a problem. (For context, follow this Twitter thread with @JustinBeckwith.)

The query above returns about 60 repository names. The next step was to download and process the relevant configuration files and output a simple CSV file recording which repositories used which rules (based on configuration defaults, customizations, and deprecations). This was fairly easily accomplished with a bit of code I've published in the markdownlint-analyze-config repository. You can run it if you'd like, but I captured the output as of early October, 2020 in the file analyze-config.csv for convenience.

Importing that data into the Numbers app and doing some simple aggregation produced the following representation of how common each rule is across the data set:

Bar chart showing how common each rule is

Some observations:

There are 8 rules that are used by every project whose data is represented here. These should be uncontroversial. Good job, rules!
The two rules at the bottom with less than 5% use are both deprecated because they have been replaced by more capable rules. It's interesting some projects have explicitly enabled them, but they can be safely ignored.
More than half of the rules are used in at least 95% of the scenarios. These seem pretty solid as well, and are probably not going to see much protest.
All but 4 (of the non-deprecated) rules are used in at least 80% of the scenarios. Again, pretty strong, though there is some room for discussion in the lower ranges of this category.
Of those 4 least-popular rules that are active, 3 are used between 70% and 80% of the time. That's not shabby, but it's clear these rules are checking for things that are less universally applicable and/or somewhat controversial.
The least popular (non-deprecated) rule is MD013/line-length at about 45% popularity. This is not surprising, as there are definitely good arguments for and against manually wrapping lines at an arbitrary column. This rule is already disabled by default for the VS Code markdownlint extension because it is noisy in projects that use long lines (where nearly every line could trigger a violation).

Overall, this was a very informative exercise. The data source isn't perfect, but it's a good start and I can always rerun the numbers if I get a better list of repositories. Rules seem to be disabled less often in practice than I would have guessed. This is nice to see - and a good reminder to be careful about introducing controversial rules that many people end up wanting to turn off. The next time a discussion about rule popularity comes up, I'll be sure to reference this post!

Tags: Node.js Technical

If one is good, two must be better [markdownlint-cli2 is a new kind of command-line interface for markdownlint]

Wednesday, October 14, 2020

About 5 years ago, Igor Shubovych and I discussed the idea of writing a CLI for markdownlint. I wasn't ready at the time, so Igor created markdownlint-cli and it has been a tremendous help for the popularity of the library. I didn't do much with it at first, but for the past 3+ years I have been the primary contributor to the project. This CLI is the primary way that many users interact with the markdownlint library, so I think it is important to maintain it.

However, I've always felt a little bit like a guest in someone else's home and while I have added new features, there were always some things I wasn't comfortable changing. A few months ago, I decided to address this by creating my own CLI - and approaching the problem from a slightly different/unusual perspective so as not to duplicate the fine work that had already been done. My implementation is named markdownlint-cli2 and you can find it here:

markdownlint-cli2 on GitHub
markdownlint-cli2 on npm

markdownlint-cli2 has a few principles that motivate its interface and behavior:

Faster is better. There are three phases of execution: globbing/configuration parsing, linting of each configuration set, and summarizing results. Each of these phases takes full advantage of asynchronous function calls to execute operations concurrently and make the best use of Node.js's single-threaded architecture. Because it's inefficient to enumerate files and directories that end up being ignored by a filter, all glob patterns for input (inclusive and exclusive) are expected to be passed on the command-line so they can be used by the glob library to optimize file system access.

How much faster does it run? Well, it depends. :) In many cases, probably only a little bit faster - all the same Markdown files need to be processed by the same library code. That said, linting is done concurrently, so slow disk scenarios offer one opportunity for speed-ups. In testing, an artificial 5 millisecond delay for every file access was completely overcome by this concurrency. In situations that play to the strengths of the new implementation - such as with many ignored files and few Markdown files (common for Node.js packages with deep node_modules) - the difference can be significant. One early user reported times exceeding 100 seconds dropped to less than 1 second.
Configuration should be flexible. Command line arguments are never as expressive as data or code, so all configuration for markdownlint-cli2 is specified via appropriately-named JSON, YAML, or JavaScript files. These options files can live anywhere and automatically apply to their part of the directory tree. Settings cascade and inherit, so it's easy to customize a particular scenario without repeating yourself. Other than two necessary exceptions, all options (including custom rules and parser plugins) can be set or changed in any directory being linted.

It's unconventional for a command-line tool not to allow configuration via command-line arguments, but this model keeps the input (glob patterns) separate from the configuration (files) and allows easier sharing of settings across tools (like the markdownlint extension for VS Code). It's also good for scenarios where the user may not have the ability to alter the command line (such as GitHub's Super-Linter action).

In addition to support for custom rules, it's possible to provide custom markdown-it plugins - for each directory if desired. This can be necessary for scenarios that involve custom rendering and make use of non-standard CommonMark syntax. By using an appropriate plugin, the custom syntax gets parsed correctly and the linting rules can work with the intended structure of the document. A common scenario is when embedding TeX math equations with the $ math $ or $$ math $$ syntax and a plugin such as markdown-it-texmath.

Although the default output format to stderr is identical to that of markdownlint-cli (making it easy to switch between CLI's), there are lots of ways to display results and so any number of output formatters can be configured to run after linting. I've provided stock implementations for default, JSON, JUnit, and summarized results, but anyone can provide their own formatter if they want something else.
Dependencies should be few. As with the markdownlint library itself, package dependencies are kept to a minimum. Fewer dependencies mean less code to install, parse, audit, and maintain - which makes everything easier.

So, which CLI should you use? Well, whichever you want! If you're happy with markdownlint-cli, there's no need to change. If you're looking for a bit more flexibility or want to see if markdownlint-cli2 is faster in your scenario, give it a try. At this point, markdownlint-cli2 supports pretty much everything markdownlint-cli does, so you're free to experiment and shouldn't need to give up any features if you switch.

What does this mean for the future of the original markdownlint-cli? Nothing! It's a great tool and it's used by many projects. I will continue to update it as I release new versions of the markdownlint library. However, I expect that my own time working on new features will be focused on markdownlint-cli2 for now.

Whether you use markdownlint-cli2 or markdownlint-cli, I hope you find it useful!

Tags: Node.js Technical

Don't just complain - offer solutions! [Enabling markdownlint rules to fix the violations they report]

Thursday, August 27, 2020

In October of 2017, an issue was opened in the markdownlint repository on GitHub asking for the ability to automatically fix rule violations. (Background: markdownlint is a Node.js style checker and lint tool for Markdown/CommonMark files.) I liked the idea, but had some concerns about how to implement it effectively. I had recently added the ability to fix simple violations to the vscode-markdownlint extension for VS Code based entirely on regular expressions and it was primitive, but mostly sufficient.

Such was the state of things for about two years, with 15 of the 44 linting rules having regular expression-based fixes in VS Code that usually worked. Then, in August of 2019, I overcame my reservations about the feature and added fix information as one of the things a rule can report with a linting violation. In doing so, the road was paved for an additional 9 rules to become auto-fixable. What's more, it became possible for custom rules written by others to offer fixes as well.

Implementation notes

The way a rule reports fix information for a violation is via an object that looks like this in TypeScript:

/**
 * Fix information for RuleOnErrorInfo.
 */
type RuleOnErrorFixInfo = {
    /**
     * Line number (1-based).
     */
    lineNumber?: number;
    /**
     * Column of the fix (1-based).
     */
    editColumn?: number;
    /**
     * Count of characters to delete.
     */
    deleteCount?: number;
    /**
     * Text to insert (after deleting).
     */
    insertText?: string;
};

Aside: markdownlint now includes a TypeScript declaration file for all public APIs and objects!

The "fix information" object identifies a single edit that fixes the corresponding violation. All the properties shown above are optional, but in practice there will always be 2 or 3. lineNumber defaults to the line of the corresponding violation and almost never needs to be set. editColumn points to the location in the line to edit. deleteCount says how many characters to delete (the value -1 means to delete the entire line); insertText provides the characters to add. If delete and insert are both specified, the delete is applied before the insert. This simple format is easy for callers of the markdownlint API to apply, so the structure is proxied to them pretty much as-is when returning violations.

Aside: With the current design, a violation can only include a single fixInfo object. This could be limiting, but has proven adequate for all scenarios so far.

Practical matters

Considered in isolation, a single fix is easy to reason about and apply. However, when dealing with an entire document, there can be multiple violations for a line and therefore multiple fixes with potential to overlap and conflict. The first strategy to deal with this is to make fixes simple; the change represented by a fix should alter as little as possible. The second strategy is to apply fixes in the order least likely to create conflicts - that's right-to-left on a line with detection of overlaps that may cause the application of the second fix to be skipped. Finally, overlapping edits of different kinds that don't conflict are merged into one. This process isn't especially tricky, but there are some subtleties and so there are helper methods in the markdownlint-rule-helpers package for applying a single fix (applyFix) or multiple fixes (applyFixes).

Aside: markdownlint-rule-helpers is an undocumented, unsupported collection of functions and variables that helps author rules and utilities for markdownlint. The API for this package is ad-hoc, but everything in it is used by the core library and part of the 100% test coverage that project has.

Availability

Automatic fix behavior is available in markdownlint-cli, markdownlint-cli2, and the vscode-markdownlint extension for VS Code. Both CLIs can fix multiple files at once; the VS Code extension limits fixes to the current file (and includes undo support). Fixability is also available to consumers of the library via the markdownlint-rule-helpers package mentioned earlier. Not all rules are automatically fixable - in some cases the resolution is ambiguous and needs human intervention. However, rules that offer fixes can dramatically improve the quality of a document without the user having to do any work!

Let's go to the video tape! [tape-player is a simple, terse, in-process reporter for the tape test harness for Node.js]

Monday, February 24, 2020

I've been a happy user of the nodeunit test harness for a long time, but it was deprecated a few years ago. Recently, I went looking for a similar Node.js test harness to replace it. I prefer small, simple packages and settled on the tape test harness. I enjoy nearly everything about it, but didn't like having to pipe output to a formatter (more on this below). So I wrote a quick bit of code to create an in-process reporter. Then I realized what I'd done could have broader applicability (in my own projects, if nowhere else!) and published a reusable package after adding scenario tests to ensure formatted output for all of the the tape primitives is reasonable.

If this seems interesting, the README goes into more detail:

The Test Anything Protocol (TAP) used by many test harnesses is versatile, but it's not much to look at - or rather, it's too much to look at. There are many custom formatters that work with the tape test harness, but most work by piping process output. This is a useful technique, but interferes with the exit status of the test harness which is a problem in scripts that are meant to fail when tests fail (like npm test). (Though there are workarounds for this, they are shell- and platform-specific.)

Fortunately, tape offers an alternative logging mechanism via its createStream API. This technique is easy to use and runs in-process so it doesn't interfere with the exit status of the test harness. tape-player takes advantage of this to produce a concise test log that's easy to enable.

You can find directions to install and enable tape-player on the GitHub project for tape-player.

Tags: Miscellaneous Node.js

Oops, I did it again [A rewrite of my website/blog platform - now public, open-source, and on GitHub as simple-website-with-blog]

Monday, December 3, 2018

Almost 5 years ago, I moved my blog (and website) from a hosted environment to a Node.js implementation of my own creation. At the time, I was new to Node and this was a great way to learn. That code has served me well, but over time I've found a few things I wanted to change - in part due to the rapid growth of the Node platform. Some of the changes were foundational, so I chose to do a rewrite instead of multiple revisions.

The rewrite ended up taking about twice as long as I anticipated, but I'm glad I did it. The new architecture is similar to what it was before - and the user interface almost identical - but there are some nice improvements and modernizations under the hood. Of note, the rendering was separated so it's easy to customize (there are samples for a text blog and a photo blog; the unit tests are implemented as a blog, too), search is much more powerful (with inclusion, exclusion, partial matches, etc.), and the code is simplified by the removal of some undue complexity (mostly features I thought were neat but that added little). Finally, because the new project was written to be reused for other purposes and by other people, I'm able to share it.

For context, the project goals are:

An easy way to create a simple, secure website with a blog
Support for text-based and photo-based blog formats
Easy authoring in HTML, Markdown (with code formatting), or JSON
Ordering of posts by publish date or content date
Easy customization of site layout and formatting
High resolution (2x) support for photo blog images
Support for Windows and Linux hosting with Node.js
Simple post format that separates content and metadata
Ability to author hidden posts and schedule a publish date
Ability to create posts that never show up in the timeline
Support for archive links and tagging of posts by category
Quick search of post content, including simple search queries
Automatic Twitter and Open Graph metadata for social media
Automatic cross-linking of related posts
No JavaScript requirement for client browsers

To learn more about the code, its dependencies, and how to use it, please visit: simple-website-with-blog on GitHub

To see it in action, just browse this blog and site. :)

Tags: Miscellaneous Node.js

It depends... [A look at the footprint of popular Node.js command-line parsing packages]

Tuesday, November 28, 2017

In a recent discussion of the Node.js ecosystem, I opined that packages with a large number of dependencies contribute to excessive disk space use by apps that reference them.

But I didn't have data to back that claim up, so I made some measurements to find out. Command-line argument parsing is a common need and there are a variety of packages to make it easier. I found nine of the most popular and installed each into a new, blank project as a standard dependency item in package.json via npm install. Then I counted the number of direct dependencies for that package, the total (transitive) number of packages that end up being installed, and the size (in bytes) of disk space consumed (on Windows). I tabulated the results below and follow with a few observations.

Important: I made no attempt to assess the quality or usefulness of these packages. They are all popular and each offers a different approach to the problem. Some are feature-rich, while others offer a simple API. I am not promoting or critiquing any of them; rather, I am using the aggregate as a source of data.

Package	Popularity	Direct Dependencies	Transitive Dependencies	Size on Disk
argparse	494	1	2	152,661
commander	18865	0	1	48,328
command-line-args	677	3	5	237,789
dashdash	156	1	2	94,377
meow	2344	10	43	455,525
minimatch	2335	1	4	57,803
minimist	8490	0	1	31,151
nomnom	549	2	6	119,237
yargs	7516	12	44	576,724

These metrics were captured on 2017-11-25 and may have changed by the time you read this.

Notes:

The two most popular packages are the smallest on disk and have no dependencies; the third and fourth most popular are the biggest and have the most dependencies.
Packages with fewer dependencies tend to have the smallest size; those with the most dependencies have the largest.
The difference between the extremes of direct dependency count is about 10x.
The difference between extremes for transitive dependency count is about 40x.
The difference between disk space extremes is about 20x.

While this was a simple experiment that doesn't represent the whole Node ecosystem, it seems reasonable to conclude that:

Similar packages can exhibit differences of an order of magnitude (or more) in dependency count and size. If that matters for your scenario, measure before you choose!

For my part, I tend to resist taking on additional dependencies when possible and prefer using dependencies that adhere to the same principle. Reinventing the wheel is wasteful, of course - but sometimes less is more and it's good to keep complexity to a minimum.

Tags: Node.js Technical

Binary Log OBjects, gotta download 'em all! [A simple tool to download blobs from an Azure container]

Wednesday, August 10, 2016

The latest in a series of "I didn't want to write a thing, but couldn't find another thing that already did exactly what I wanted, which is probably because I'm too picky, but whatever" projects, azure-blob-container-download (a.k.a. abcd) is a simple, command-line tool to download all the blobs in an Azure storage container. Here's how it's described in the README:

A simple, cross-platform tool to bulk-download blobs from an Azure storage container.

Though limited in scope, it does a specific set of things vs. the official tools:

AzCopy is not cross-platform

Azure CLI does not bulk-download

Azure PowerShell is not cross-platform

Azure Portal does not bulk-download

The motivation for this project was the same as with my previous post about getting an HTTPS certificate: I've migrated my website from a virtual machine to an Azure Web App. And while it's easy to enable logging for a Web App and get hourly log files in the W3C Extended Log File Format, it wasn't obvious to me how to parse those logs offline to measure traffic, referrers, etc.. (Although that's not something I've bothered with up to now, it's an ability I'd like to have.) What I wanted was a trustworthy, cross-platform tool to download all those log files to a local machine - but the options I investigated each seemed to be missing something.

So I wrote a simple Node.JS CLI and gave it a few extra features to make my life easier. The code is fairly compact and straightforward (and the dependencies minimal), so it's easy to audit. The complete options for downloading and filtering are:

Usage: abcd [options]

Options:
  --account           Storage account (or set AZURE_STORAGE_ACCOUNT)  [string]
  --key               Storage access key (or set AZURE_STORAGE_ACCESS_KEY)  [string]
  --containerPattern  Regular expression filter for container names  [string]
  --blobPattern       Regular expression filter for blob names  [string]
  --startDate         Starting date for blobs  [string]
  --endDate           Ending date for blobs  [string]
  --snapshots         True to include blob snapshots  [boolean]
  --version           Show version number  [boolean]
  --help              Show help  [boolean]

Download blobs from an Azure container.
https://github.com/DavidAnson/azure-blob-container-download

Azure Web Apps create a new log file every hour, so they add up quickly; abcd's date filtering options make it easy to perform incremental downloads. The default directory structure (based on / separators) is collapsed during download, so all files end up in the same directory (named by container) and ordered by date. The tool limits itself to one download at a time, so things proceed at a steady, moderate pace. Once blobs have finished downloading, you're free to do with them as you please. :)

Find out more on the GitHub project page for azure-blob-container-download.

Tags: Node.js Technical Utilities

The blog of dlaa.me