Monday, February 29, 2016

Managing NPM flat module installations in a Nix build environment

Some time ago, I reengineered npm2nix and described some of its underlying concepts in a blog post. In the reengineered version, I ported the implementation from CoffeeScript to JavaScript, refactored and modularized the code, and improved the implementation to more accurately simulate NPM's dependency organization, including many of its odd traits.

I have observed that NPM's behaviour has changed significantly in the latest Node.js (the 5.x series). To cope with this, I did yet another major reengineering effort. In this blog post, I will describe the path that has led to the latest implementation.

The first attempt


Getting a few commonly used NPM packages deployed with Nix is not particularly challenging, but making it work completely right turns out to be quite difficult -- the early npm2nix implementations generated Nix expressions that build every package and all of its dependencies in separate derivations (in other words: each package and dependency translates to a separate Nix store path). To allow a package to find its dependencies, the build script creates a node_modules/ sub folder containing symlinks that refer to the Nix store paths of the packages that it requires.

NPM packages have loose dependency specifiers, e.g. wildcards and version ranges, whereas Nix package dependencies are exact, i.e. they bind to packages that are identified by unique hash codes derived from all build-time dependencies. npm2nix makes this translation by "snapshotting" the latest conforming version and turning that into a Nix package.

For example, one of my own software projects (NiJS) has the following package configuration file:

{
  "name" : "nijs",
  "version" : "0.0.23",
  "dependencies" : {
    "optparse" : ">= 1.0.3",
    "slasp": "0.0.4"
  }
  ...
}

The package configuration states that it requires optparse version 1.0.3 or higher, and slasp version 0.0.4. Running npm install results in the following directory structure of dependencies:

nijs/
  ...
  package.json
  node_modules/
    optparse/
      package.json
      ...
    slasp/
      package.json
      ...

A node_modules/ folder gets created in which each sub directory represents an NPM package that is a dependency of NiJS. In the older versions of npm2nix, it gets translated as follows:

/nix/store/ab12pq...-nijs-0.0.23/
  ...
  package.json
  node_modules/
    optparse -> /nix/store/4pq1db...-optparse-1.0.5
    slasp -> /nix/store/8j12qp...-slasp-0.0.4
/nix/store/4pq1db...-optparse-1.0.5/
  ...
  package.json
/nix/store/8j12qp...-slasp-0.0.4/
  ...
  package.json

Each involved package is stored in its own private folder in the Nix store. The NiJS package has a node_modules/ folder containing symlinks to its dependencies. For many packages, this approach works well enough, as it at least provides a conforming version for each dependency that it requires.
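
The "snapshotting" step described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not npm2nix's actual code: the satisfies helper understands nothing but exact versions, "*" and ">= x.y.z" (a real implementation would rely on the semver library), and the registry contents are made up for the example.

```javascript
// Compare two "x.y.z" version strings numerically; returns <0, 0 or >0.
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return 0;
}

// Simplified stand-in for semver: supports "*", ">= x.y.z" and exact versions only.
function satisfies(version, range) {
  const spec = range.trim();
  if (spec === "*") return true;
  if (spec.startsWith(">=")) {
    return compareVersions(version, spec.slice(2).trim()) >= 0;
  }
  return compareVersions(version, spec) === 0;
}

// Pin a loose specifier to the latest published version that conforms to it.
function snapshot(range, publishedVersions) {
  const conforming = publishedVersions.filter((v) => satisfies(v, range));
  if (conforming.length === 0) return null;
  conforming.sort(compareVersions);
  return conforming[conforming.length - 1];
}

// Hypothetical registry contents, for illustration only.
const registry = {
  optparse: ["1.0.3", "1.0.4", "1.0.5"],
  slasp: ["0.0.3", "0.0.4"],
};

// The dependencies section of the NiJS package configuration shown earlier.
const dependencies = { optparse: ">= 1.0.3", slasp: "0.0.4" };

const pinned = {};
for (const [name, range] of Object.entries(dependencies)) {
  pinned[name] = snapshot(range, registry[name]);
}
console.log(pinned); // { optparse: '1.0.5', slasp: '0.0.4' }
```

Each pinned version then corresponds to exactly one Nix store path, which is what makes the loose NPM specifier usable as an exact Nix dependency.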

Unfortunately, it is possible to run into oddities as well. For example, a package that does not work properly in such a model is ironhorse.

Suppose we declare both mongoose and ironhorse as dependencies of a project:

{
  "name": "myproject",
  "version": "0.0.1",
  "dependencies": {
    "mongoose": "3.8.5",
    "ironhorse": "0.0.11"
  }
}

Ironhorse has an overlapping dependency with the project's dependencies -- it also depends on mongoose, as shown in the following package configuration:

{
  "name": "ironhorse",
  "version": "0.0.11",
  "license" : "MIT",
  "dependencies" : {
    "underscore": "~1.5.2",
    "mongoose": "*",
    "temp": "*",
    ...
  },
  ...
}

Running 'npm install' at the project level yields the following directory structure:

myproject/
  ...
  package.json
  node_modules/
    mongoose/
      ...
    ironhorse/
      ...
      package.json
      node_modules/
        underscore/
        temp/

Note that mongoose appears only once in the hierarchy of node_modules/ folders, even though it has been declared as a dependency twice.

In contrast, when using an older version of npm2nix, the following directory structure gets generated:

/nix/store/67ab07...-myproject-0.0.1
  ...
  package.json
  node_modules/
    mongoose -> /nix/store/ec704c...-mongoose-3.8.5
    ironhorse -> /nix/store/3ee85e...-ironhorse-0.0.11
/nix/store/3ee85e...-ironhorse-0.0.11
  ...
  package.json
  node_modules/
    underscore -> /nix/store/10af96...-underscore-1.5.2
    mongoose -> /nix/store/a37f75...-mongoose-4.4.5
    temp -> /nix/store/fae379...-temp-0.8.3
/nix/store/ec704c...-mongoose-3.8.5
  package.json
  ...
/nix/store/a37f75...-mongoose-4.4.5
  package.json
  ...
/nix/store/10af96...-underscore-1.5.2
/nix/store/fae379...-temp-0.8.3

In the above directory structure, we can observe that two different versions of mongoose have been deployed -- version 3.8.5 (as a dependency for the project) and version 4.4.5 (as a dependency for ironhorse). Having two different versions of mongoose deployed typically leads to problems.

npm2nix produces a different result because whenever NPM encounters a dependency specification, it first searches the node_modules/ folders of the parent directories for a conforming version. If a version has been found that fits within the version range of the dependency, it will not be included again. This is also the reason why NPM can "handle" cyclic dependencies (despite the fact that they are a bad practice) -- when a dependency is encountered a second time, it will not be deployed again, causing NPM to break the cycle.

npm2nix did not implement this kind of behaviour -- it always binds a dependency to the latest conforming version, but as can be observed in the last example, this is not what NPM always does -- it could also bind to a shared dependency that may be older than the latest version in the NPM registry (as a sidenote: I wonder how many NPM users actually know about this detail!).
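
The difference can be made concrete with a small model of the parent lookup. The sketch below is a simplification for illustration, not NPM's real implementation: packages are plain objects with a parent link and a bundled map (standing in for a node_modules/ folder), satisfies only understands "*" and exact versions, and the latest conforming version is passed in by the caller rather than fetched from a registry.

```javascript
// Simplified stand-in for semver: only "*" and exact versions are understood.
function satisfies(version, range) {
  return range === "*" || range === version;
}

// Walk up the chain of includers and return the first conforming copy that one
// of them already bundles, or null when nobody provides one.
function findInParents(pkg, depName, range) {
  for (let p = pkg.parent; p !== null; p = p.parent) {
    const candidate = p.bundled[depName];
    if (candidate && satisfies(candidate.version, range)) return candidate;
  }
  return null;
}

// Bind a dependency: reuse a parent's copy when possible, otherwise bundle the
// latest conforming version privately in the requiring package.
function addDependency(pkg, depName, range, latestConforming) {
  const shared = findInParents(pkg, depName, range);
  if (shared !== null) return shared;
  const dep = { name: depName, version: latestConforming, parent: pkg, bundled: {} };
  pkg.bundled[depName] = dep;
  return dep;
}

// The myproject example: mongoose is pinned by the project, ironhorse accepts "*".
const project = { name: "myproject", version: "0.0.1", parent: null, bundled: {} };
addDependency(project, "mongoose", "3.8.5", "3.8.5");
const ironhorse = addDependency(project, "ironhorse", "0.0.11", "0.0.11");
const bound = addDependency(ironhorse, "mongoose", "*", "4.4.5");

console.log(bound.version);                   // 3.8.5
console.log("mongoose" in ironhorse.bundled); // false
```

In this model ironhorse ends up bound to the project's mongoose 3.8.5 even though 4.4.5 would be the latest conforming version, which mirrors the directory structure that NPM produced above.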

Second attempt: simulating shared dependency behaviour


One of the main objectives of the reengineered version (as described in my previous blog post) is to more accurately mimic NPM's shared dependency behaviour, as the old behaviour was particularly problematic for packages having cyclic dependencies -- Nix does not allow them, and they cause the evaluation of the entire Nixpkgs set on the Hydra build server to fail.

The reengineered version worked, but the solution was quite expensive and controversial -- I compose Nix expressions of all packages involved, in which each dependency resolves to the latest corresponding version.

Each time a package includes a dependency, I propagate an attribute set to its build function telling it which dependencies have already been resolved by any of the parents. Resolved dependencies get excluded as a dependency.

To check whether a resolved dependency fits within a version range specifier, I have to consult semver. Because semver is unsupported in the Nix expression language, I use a trick in which I import Nix expressions generated by a build script (that invokes the semver command-line utility) to figure out which dependencies have been resolved already.

Besides consulting semver, I used another hack -- packages that have been resolved by any of the parents must be excluded as a dependency. However, NPM packages in Nix are deployed independently from each other in separate build functions, and the build will fail because NPM expects the excluded packages to be present. To solve this problem, I create shims for the excluded packages by substituting them with empty packages that have the same name and version, and I remove them after the package has been built.

Symlinking the dependencies also no longer worked reliably -- the CommonJS module system dereferences the location of the includer first and looks in the parent directories for shared dependencies relative to that location. This means that in case of a symlink, it incorrectly resolves to a Nix store path that has no meaningful parent directories. The only solution I could think of is copying dependencies instead of symlinking them.

To summarize: the new solution worked more accurately than the original version (and can cope with cyclic dependencies) but it is quite inefficient as well -- making copies of dependencies causes a lot of duplication (which wastes disk space) and building Nix expressions in the instantiation phase makes the process quite slow.
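
To see why dereferencing breaks the lookup, it helps to list the node_modules/ folders that get searched from a given location. The function below is a rough approximation of that search order, written for this post rather than taken from Node.js itself; the /home/user/myproject location is a hypothetical checkout path.

```javascript
// List the node_modules/ folders that are searched for a module required from a
// file located in `dir` (an absolute POSIX-style path), nearest folder first.
function lookupPaths(dir) {
  const parts = dir.split("/").filter((p) => p !== "");
  const result = [];
  for (let i = parts.length; i >= 0; i--) {
    // A directory that is itself called node_modules is skipped.
    if (i > 0 && parts[i - 1] === "node_modules") continue;
    const prefix = parts.slice(0, i).join("/");
    result.push((prefix === "" ? "" : "/" + prefix) + "/node_modules");
  }
  return result;
}

// ironhorse copied into the project: the project's node_modules/ is searched.
const viaCopy = lookupPaths("/home/user/myproject/node_modules/ironhorse");
console.log(viaCopy.includes("/home/user/myproject/node_modules")); // true

// ironhorse symlinked: the includer's location is dereferenced to the Nix store
// path first, so the search never reaches the project's node_modules/ folder.
const viaSymlink = lookupPaths("/nix/store/3ee85e...-ironhorse-0.0.11");
console.log(viaSymlink);
```

For the dereferenced store path, the only candidates are the package's own node_modules/ folder, /nix/store/node_modules, /nix/node_modules and /node_modules, none of which contains the shared mongoose.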

Third attempt: computing the dependency graph ahead of time


Apart from the earlier described inefficiencies, the main reason that I had to do yet another major revision is that Node.js 5.x (that includes npm 3.x) performs so-called "flat module installations". The idea is that when a package includes a dependency, it will be stored in a node_modules/ folder as high in the directory structure as possible without breaking any dependencies.

This new approach has a number of implications. For example, deploying the Disnix virtual hosts test web application with the old npm 2.x used to yield the following directory structure:

webapp/
  ...
  package.json
  node_modules/
    express/
      ...
      package.json
      node_modules/
        accepts/
        array-flatten/
        content-disposition/
        ...
    ejs/
      ...
      package.json

As can be observed in the structure above, the test webapp depends on two packages: express and ejs. Express has dependencies of its own, such as accepts, array-flatten, content-disposition. Because no parent node_modules/ folder provides them, they are included privately for the express package.

Running 'npm install' with the new npm 3.x yields the following directory structure:

webapp/
  ...
  package.json
  node_modules/
    accepts/
    array-flatten/
    content-disposition/
    express/
      ...
      package.json
    ejs/
      ...
      package.json

Since the libraries that express requires do not conflict with the includer's dependencies, they have been moved one level up to the parent package's node_modules/ folder.

Flattening the directory structure makes deploying an NPM project even more imperative -- previously, the dependencies that were included with a package depended on the state of the includer. Now we must also modify the entire directory hierarchy of dependencies by moving packages up in the directory structure. It also makes the resulting dependency graph less predictable. For example, the order in which dependencies are installed matters -- unless all dependencies are discarded and reinstalled from scratch, it may result in different kinds of dependency graphs.

If this flat module approach has all kinds of oddities, why would NPM use such an approach, you may wonder? It turns out that the only reason is: better Windows support. On Windows, there is a limit on the length of paths and flattening the directory structure helps to prevent hitting it. Unfortunately, it comes at the price of making deployments more imperative and less predictable.

To simulate this flattening strategy, I had to revise npm2nix again. Because of its previous drawbacks and the fact that we have to perform even more imperative operations, I have decided to implement a new strategy in which I compute the entire dependency graph ahead of time by the generator, instead of hacking it into the evaluation phase of the Nix expressions.

Supporting private and shared dependencies works exactly the same as in the old implementation, but is now performed ahead of time. Additionally, I simulate the flat dependency structure as follows:

  • When a package requires a dependency: I check whether the parent directory has a conflicting dependency. This means: it either has a dependency bundled with the same name and a different version or indirectly binds to another parent that provides a conflicting version.
  • If the dependency conflicts: bundle the dependency in the current package.
  • If the dependency does not conflict: bind the package to the dependency (but do not include it) and consult the parent package one level higher.
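
The steps above can be sketched as a small placement routine. This is a simplified model and not the actual npm2nix generator: packages are plain objects with a parent link and a bundled map, versions are compared for equality only (no version ranges), and all version numbers in the example are made up.

```javascript
// Find the version of `name` visible from `pkg`: its own node_modules/ first,
// then those of its includers.
function visibleVersion(pkg, name) {
  for (let p = pkg; p !== null; p = p.parent) {
    if (name in p.bundled) return p.bundled[name].version;
  }
  return null;
}

// Place a dependency as high in the hierarchy as possible without conflicts.
// Returns the package whose node_modules/ folder ends up providing it.
function placeDependency(pkg, name, version) {
  let current = pkg;
  for (;;) {
    const own = current.bundled[name];
    if (own && own.version === version) return current; // already provided here
    const parent = current.parent;
    if (parent === null) break;                          // reached the top level
    const seen = visibleVersion(parent, name);
    if (seen !== null && seen !== version) break;        // conflict: bundle in `current`
    current = parent;                                    // no conflict: go one level up
  }
  current.bundled[name] = { name, version, parent: current, bundled: {} };
  return current;
}

// The test webapp: express and ejs, plus three of express' own dependencies.
const webapp = { name: "webapp", version: "0.0.1", parent: null, bundled: {} };
placeDependency(webapp, "express", "4.0.0");
placeDependency(webapp, "ejs", "2.0.0");
const express = webapp.bundled["express"];
for (const dep of ["accepts", "array-flatten", "content-disposition"]) {
  placeDependency(express, dep, "1.0.0");
}
console.log(Object.keys(webapp.bundled));
console.log(Object.keys(express.bundled)); // []

// A hypothetical package requiring a conflicting version keeps its copy private.
placeDependency(webapp, "other", "1.0.0");
const holder = placeDependency(webapp.bundled["other"], "accepts", "2.0.0");
console.log(holder.name); // other
```

In this model the three libraries end up next to express and ejs in the webapp's node_modules/ folder, while the conflicting accepts version stays bundled with the package that requires it.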

Besides computing the dependency graph ahead of time, I also deploy the entire dependency graph in one Nix build function -- because including dependencies is stateful, it no longer makes sense to build them as individual Nix packages, that are supposed to be pure.

I have made the flattening algorithm optional. By default, the new npm2nix generates Nix expressions for the Node.js 4.x release (using the old npm 2.x):

$ npm2nix

By appending the -5 parameter, it generates Nix expressions for usage with Node.js 5.x (using the new npm 3.x with flat module installations):

$ npm2nix -5

I have tested the new approach on many packages including my public projects. The good news is: they all seem to work!

Unfortunately, despite the fact that I could get many packages working, the approach is not perfect and hard to get 100% right. For example, in a private project I have encountered bundled dependencies (dependencies that are statically included with a package). NPM also moves them up, while npm2nix merely generates an expression composing the dependency graph (that reflects flat module installations as much as possible). To fix this issue, we must also run a post-processing step that moves up dependencies that are in the wrong places. This step has not been implemented in npm2nix yet.

Another issue is that we want Nix to obtain all dependencies instead of NPM. To prevent NPM from consulting external resources, we substitute some version specifiers (such as Git repositories) with a wildcard: *. These substituted version specifiers sometimes confuse NPM, despite the fact that the directory structure matches NPM's dependency structure.

To cope with these imperfections, I have also added an option to npm2nix that prevents it from running npm install -- in many cases, packages still work fine despite NPM being confused. Moreover, the npm install step in the Nix builder environment merely serves as a validation step -- the Nix builder script is responsible for actually providing the dependencies.

Discussion


In this blog post, I have described the path that has led to a second reengineered version of npm2nix. The new version computes dependency graphs ahead of time and can mostly handle npm 3.x's flat module installations. Moreover, compared to the previous version, it no longer relies on very expensive and ugly hacks.

Despite the fact that I can now more or less handle flat installations, I am still not quite happy. Some things that bug me are:

  • The habit of "reusing" modules that have been bundled with any of the includers, makes it IMO difficult and counter-intuitive to predict which version will actually be used in a certain context. In some cases, packagers might expect that the latest version of a version range will be used, but this is not guaranteed to be the case. This could, for example, reintroduce security and stability issues without end users noticing (or expecting) it.
  • Flat module installations are less deterministic and make it really difficult to predict what a dependency graph looks like -- the dependencies that appear at a certain level in the directory structure depend on the order in which dependencies are installed. Therefore, I do not consider this an improvement over npm 2.x.

Because of these drawbacks, I expect that NPM will reconsider some of its concepts again in the future causing npm2nix to break again.

I would recommend that the NPM developers use the following approach:

  • All involved packages should be stored in a single node_modules/ folder instead of multiple nested hierarchies of node_modules/ folders.
  • When a module requests another module, the module loader should consult the package.json configuration file of the package to which the includer module belongs. It should take the latest conforming version in the central node_modules/ folder. I consider taking the latest version of a version range to be less counter-intuitive than taking any conforming version.
  • To be able to store multiple versions of packages in a single node_modules/ folder, a better directory naming convention should be adopted. Currently, NPM only identifies modules by name in a node_modules/ folder, making it impossible to store two versions next to each other in one directory.

    If NPM were, for example, to use both the name and version number in the directory names, more things would be possible. Adding more properties to the path names makes sharing even better -- for example, a package with a name and version number could originate from various sources, e.g. the NPM registry or a Git repository -- reflecting this in the path makes it possible to store more variants next to each other in a reliable way.

    Naming things to improve shareability is not really rocket science -- Nix uses hash codes (that are derived from all build-time dependencies) to uniquely identify packages and the .NET Global Assembly Cache uses so-called strong names that include various naming attributes, such as cryptographic keys, to ensure that no libraries conflict. I am convinced that adopting a better naming convention for storing NPM packages would be quite beneficial as well.
  • To cope with cyclic dependencies: I would simply say that it suffices to disallow them. Packages are supposed to be units of reuse, and if two packages mutually depend on each other, then they should be combined into one package.
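
To illustrate the recommendations above, here is a minimal sketch of what such a central folder could look like. Everything in it is hypothetical: the "<name>-<version>-<source>" naming scheme, the install and resolve helpers and the Map that stands in for a single node_modules/ folder are assumptions made for this example, and the accepts predicate is a stand-in for a real semver range check.

```javascript
// Hypothetical naming scheme: encode name, version and origin in the folder name.
function directoryName({ name, version, source }) {
  return `${name}-${version}-${source}`;
}

// A single, central node_modules/ folder, modelled as a map from folder name to package.
const nodeModules = new Map();
function install(pkg) {
  nodeModules.set(directoryName(pkg), pkg);
}

// Compare two "x.y.z" version strings numerically; returns <0, 0 or >0.
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return 0;
}

// Pick the latest installed version of `name` that the includer accepts.
function resolve(name, accepts) {
  const candidates = [...nodeModules.values()]
    .filter((p) => p.name === name && accepts(p.version))
    .sort((a, b) => compareVersions(a.version, b.version));
  return candidates.length > 0 ? candidates[candidates.length - 1] : null;
}

// Two versions of mongoose can now live next to each other in one folder.
install({ name: "mongoose", version: "3.8.5", source: "registry" });
install({ name: "mongoose", version: "4.4.5", source: "registry" });

// myproject pins 3.8.5, whereas ironhorse accepts any version ("*").
console.log(directoryName(resolve("mongoose", (v) => v === "3.8.5"))); // mongoose-3.8.5-registry
console.log(directoryName(resolve("mongoose", () => true)));           // mongoose-4.4.5-registry
```

With such a scheme, the version that a module gets depends only on the package.json of its own package and on what is installed, not on the position of the package in a directory hierarchy.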

Availability


The second reengineered npm2nix version can be obtained from my GitHub page. The code resides in the reengineering2 branch.