Sander van der Burg's blog: A test framework for the Nix process management framework

As already explained in many previous blog posts, the Nix process management framework adds new ideas to earlier service management concepts explored in Nixpkgs and NixOS:

It makes it possible to deploy services on any operating system that can work with the Nix package manager, including conventional Linux distributions, macOS and FreeBSD. It also works on NixOS, but NixOS is not a requirement.
It allows you to construct multiple instances of the same service, by using constructor functions that identify conflicting configuration parameters. These constructor functions can be invoked in such a way that these configuration properties no longer conflict.
We can target multiple process managers from the same high-level deployment specifications. These high-level specifications are automatically translated to parameters for a target-specific configuration function for a specific process manager.

It is also possible to override or augment the generated parameters, to work with configuration properties that are not universally supported.
There is a configuration option that conveniently allows you to disable user changes, making it possible to deploy services as an unprivileged user.

Although the above features are interesting, one particular challenge is that the framework cannot guarantee that all possible variations will work after writing a high-level process configuration. The framework facilitates code reuse, but it is not a write once, run anywhere approach.

To make it possible to validate multiple service variants, I have developed a test framework on top of the NixOS test driver. The NixOS test driver can deploy and test a network of NixOS QEMU virtual machines with very little storage and RAM overhead.

In this blog post, I will describe how the test framework can be used.

Automating tests

Before developing the test framework, I was mostly testing all my packaged services manually. Because a manual test process is tedious and time consuming, I did not have any test coverage for anything but the most trivial example services. As a result, I frequently ran into configuration breakages.

Typically, when I want to test a process instance, or a system that is composed of multiple collaborative processes, I perform the following steps:

First, I need to deploy the system for a specific process manager and configuration profile, e.g. for a privileged or unprivileged user, in an isolated environment, such as a virtual machine or container.
Then I need to wait for all process instances to become available. Readiness checks are critical and typically more complicated than expected -- for most services, there is a time window between the successful invocation of a process and its availability to carry out its primary task, such as accepting network connections. Executing tests before a service is ready typically results in errors.

Although some process managers can deal with this problem (e.g. systemd has the sd_notify protocol, and s6 has its own protocol and an sd_notify wrapper), there is no standardized, widely adopted protocol, so I still need to implement readiness checks manually.

(As a sidenote: the only readiness check protocol that is standardized is for traditional System V services that daemonize on their own. The calling parent process should terminate almost immediately, but still wait until the spawned daemon child process notifies it that it is ready.

As described in an earlier blog post, this notification aspect is more complicated to implement than I thought. Moreover, not all traditional System V daemons follow this protocol.)
When all process instances are ready, I can check whether they properly carry out their tasks, and whether the integration of these processes works as expected.
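The readiness check in the second step can be sketched in Python, the language in which NixOS test driver scripts are written. The function below polls a TCP port until a connection succeeds. Its name mirrors the wait_for_open_port instruction used later in this post, but the implementation, the host, timeout and interval parameters, and their default values are assumptions made for illustration. It is not the test driver's actual code.

```python
import socket
import time


def wait_for_open_port(port, host="127.0.0.1", timeout=30.0, interval=0.1):
    """Block until a TCP connection to host:port succeeds.

    Raises TimeoutError if the port does not accept connections in time.
    All parameters except `port` are hypothetical, for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the service is bound and listening.
            with socket.create_connection((host, port), timeout=interval):
                return
        except OSError:
            # Connection refused or timed out: the service is not ready yet.
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"port {port} on {host} did not open within {timeout}s"
                )
            time.sleep(interval)
```

A check like this only proves that the port accepts connections. Services that need additional warm-up time after binding their socket would need a more specific check.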

An example

I have developed a Nix function, testService, that automates the above process using the NixOS test driver -- I can use this function to create a test suite for systems that are made out of running processes, such as the webapps example described in my previous blog posts about the Nix process management framework.

The example system consists of a number of webapp processes with an embedded HTTP server returning HTML pages displaying their identities. Nginx reverse proxies forward incoming connections to the appropriate webapp processes by using their corresponding virtual host header values:

{ pkgs ? import <nixpkgs> { inherit system; }
, system ? builtins.currentSystem
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, libDir ? "${stateDir}/lib"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager
}:

let
  sharedConstructors = import ../../../examples/services-agnostic/constructors/constructors.nix {
    inherit pkgs stateDir runtimeDir logDir cacheDir libDir tmpDir forceDisableUserChange processManager;
  };

  constructors = import ../../../examples/webapps-agnostic/constructors/constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir forceDisableUserChange processManager;
    webappMode = null;
  };
in
rec {
  webapp1 = rec {
    port = 5000;
    dnsName = "webapp1.local";

    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "1";
    };
  };

  webapp2 = rec {
    port = 5001;
    dnsName = "webapp2.local";

    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "2";
    };
  };

  webapp3 = rec {
    port = 5002;
    dnsName = "webapp3.local";

    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "3";
    };
  };

  webapp4 = rec {
    port = 5003;
    dnsName = "webapp4.local";

    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "4";
    };
  };

  nginx = rec {
    port = if forceDisableUserChange then 8080 else 80;
    webapps = [ webapp1 webapp2 webapp3 webapp4 ];

    pkg = sharedConstructors.nginxReverseProxyHostBased {
      inherit port webapps;
    } {};
  };

  webapp5 = rec {
    port = 5004;
    dnsName = "webapp5.local";

    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "5";
    };
  };

  webapp6 = rec {
    port = 5005;
    dnsName = "webapp6.local";

    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "6";
    };
  };

  nginx2 = rec {
    port = if forceDisableUserChange then 8081 else 81;
    webapps = [ webapp5 webapp6 ];

    pkg = sharedConstructors.nginxReverseProxyHostBased {
      inherit port webapps;
      instanceSuffix = "2";
    } {};
  };
}

The processes model shown above (processes-advanced.nix) defines the following process instances:

There are six webapp process instances, each running an embedded HTTP service, returning HTML pages with their identities. The dnsName property specifies the DNS domain name value that should be used as a virtual host header to make the forwarding from the reverse proxies work.
There are two nginx reverse proxy instances. The first, nginx, forwards incoming connections to the first four webapp instances. The second, nginx2, forwards incoming connections to webapp5 and webapp6.

With the following command, I can connect to webapp2 through the first nginx reverse proxy (here deployed as an unprivileged user, so that it listens on port 8080 rather than 80):

$ curl -H 'Host: webapp2.local' http://localhost:8080
<!DOCTYPE html>
<html>
  <head>
    <title>Simple test webapp</title>
  </head>
  <body>
    Simple test webapp listening on port: 5001
  </body>
</html>
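The host-based forwarding used in the request above can be modelled as a lookup from virtual host name to backend port. The Python sketch below is only a conceptual model of that routing decision, not how Nginx implements it. The WEBAPPS list repeats the dnsName and port values of the first four webapp instances from the processes model. The function names build_routing_table and resolve_backend are hypothetical.

```python
# Mirrors the webapps list of the first nginx instance in the processes model.
WEBAPPS = [
    {"dnsName": "webapp1.local", "port": 5000},
    {"dnsName": "webapp2.local", "port": 5001},
    {"dnsName": "webapp3.local", "port": 5002},
    {"dnsName": "webapp4.local", "port": 5003},
]


def build_routing_table(webapps):
    """Map each virtual host name to the port of its webapp process."""
    return {webapp["dnsName"]: webapp["port"] for webapp in webapps}


def resolve_backend(routing_table, host_header):
    """Return the backend port for a Host header value, or None if unknown.

    An optional ":port" suffix is stripped and the name is compared
    case-insensitively. IPv6 literals are not handled in this sketch.
    """
    host = host_header.split(":", 1)[0].strip().lower()
    return routing_table.get(host)
```

With this model, resolve_backend(build_routing_table(WEBAPPS), "webapp2.local") returns 5001, the port reported in the HTML page above.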

Creating a test suite

I can create a test suite for the web application system as follows:

{ pkgs, testService, processManagers, profiles }:

testService {
  exprFile = ./processes.nix;

  readiness = {instanceName, instance, ...}:
    ''
      machine.wait_for_open_port(${toString instance.port})
    '';

  tests = {instanceName, instance, ...}:
    pkgs.lib.optionalString (instanceName == "nginx" || instanceName == "nginx2")
      (pkgs.lib.concatMapStrings (webapp: ''
        machine.succeed(
            "curl --fail -H 'Host: ${webapp.dnsName}' http://localhost:${toString instance.port} | grep ': ${toString webapp.port}'"
        )
      '') instance.webapps);

  inherit processManagers profiles;
}

The Nix expression above invokes testService with the following parameters:

processManagers refers to a list of names of all the process managers that should be tested.
profiles refers to a list of configuration profiles that should be tested. Currently, it supports privileged for privileged deployments, and unprivileged for unprivileged deployments in an unprivileged user's home directory, without changing user permissions.
The exprFile parameter refers to the processes model of the system: processes-advanced.nix shown earlier.
The readiness parameter refers to a function that does a readiness check for each process instance. In the above example, it checks whether each service is actually listening on the required TCP port.
The tests parameter refers to a function that executes tests for each process instance. In the above example, it ignores all but the nginx instances, because explicitly testing a webapp instance is a redundant operation.

For each nginx instance, it checks whether all webapp instances can be reached from it, by running the curl command.
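To make the expansion of the tests function more concrete, the following Python sketch emulates what the combination of optionalString and concatMapStrings produces: an empty string for every instance that is not an nginx instance, and one machine.succeed instruction per webapp otherwise. The function name generate_tests and the dictionary layout are assumptions for illustration, and the whitespace of the generated text may differ from what Nix's indented strings produce.

```python
def generate_tests(instance_name, instance):
    """Emulate the `tests` function of the test suite shown above.

    Returns the test driver instructions for one process instance as text.
    """
    # pkgs.lib.optionalString yields "" when its condition is false.
    if instance_name not in ("nginx", "nginx2"):
        return ""

    snippets = []
    # pkgs.lib.concatMapStrings maps a function over a list and
    # concatenates the resulting strings.
    for webapp in instance["webapps"]:
        command = (
            f"curl --fail -H 'Host: {webapp['dnsName']}' "
            f"http://localhost:{instance['port']} | grep ': {webapp['port']}'"
        )
        snippets.append(f'machine.succeed(\n    "{command}"\n)\n')
    return "".join(snippets)
```

For the nginx2 instance, which has two webapps, this yields two machine.succeed instructions. For any webapp instance it yields nothing, which is why the webapps are only tested indirectly through the reverse proxies.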

The readiness and tests functions take the following parameters: instanceName identifies the process instance in the processes model, and instance refers to the attribute set containing its configuration.

Furthermore, they can refer to global process model configuration parameters:

stateDir: The directory in which state files are stored (typically /var for privileged deployments).
runtimeDir: The directory in which runtime files are stored (typically /var/run for privileged deployments).
forceDisableUserChange: Indicates whether to disable user changes (for unprivileged deployments) or not.

In addition to writing tests at the instance level, it is also possible to write tests at the system level, with the following parameters (not shown in the example):

initialTests: instructions that run right after deploying the system, but before the readiness checks and instance-level tests.
postTests: instructions that run after the instance-level tests.

The above functions also accept the same global configuration parameters, as well as processes, which refers to the entire processes model.

We can also configure other properties useful for testing:

systemPackages: installs additional packages into the system profile of the test virtual machine.
nixosConfig: defines a NixOS module with configuration properties that will be added to the NixOS configuration of the test machine.
extraParams: propagates additional parameters to the processes model.

Composing test functions

The Nix expression above is not self-contained. It is a function definition that needs to be invoked with all required parameters including all the process managers and profiles that we want to test for.

We can compose tests in the following Nix expression:

{ pkgs ? import <nixpkgs> { inherit system; }
, system ? builtins.currentSystem
, processManagers ? [ "supervisord" "sysvinit" "systemd" "disnix" "s6-rc" ]
, profiles ? [ "privileged" "unprivileged" ]
}:

let
  testService = import ../../nixproc/test-driver/universal.nix {
    inherit system;
  };
in
{

  nginx-reverse-proxy-hostbased = import ./nginx-reverse-proxy-hostbased {
    inherit pkgs processManagers profiles testService;
  };

  docker = import ./docker {
    inherit pkgs processManagers profiles testService;
  };

  ...
}

The above partial Nix expression (default.nix) invokes the function defined in the previous Nix expression that resides in the nginx-reverse-proxy-hostbased directory and propagates all required parameters. It also composes other test cases, such as docker.

The parameters of the composition expression allow you to globally configure all the desired service variants:

processManagers allows you to select the process managers you want to test for.
profiles allows you to select the configuration profiles.

With the following command, we can test our system as a privileged user, using systemd as a process manager:

$ nix-build -A nginx-reverse-proxy-hostbased.privileged.systemd

We can also run the same test as an unprivileged user:

$ nix-build -A nginx-reverse-proxy-hostbased.unprivileged.systemd

In addition to systemd, any configured process manager that works on NixOS can be used. The following command runs a privileged test of the same service for sysvinit:

$ nix-build -A nginx-reverse-proxy-hostbased.privileged.sysvinit
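The attribute paths used in these commands follow the pattern test name, profile, process manager. The Python sketch below enumerates all variants that the default parameter values of the composition expression give rise to. The function name variant_attribute_paths is hypothetical, and the pattern is inferred from the three commands above.

```python
from itertools import product

# Default values of the composition expression's parameters.
PROCESS_MANAGERS = ["supervisord", "sysvinit", "systemd", "disnix", "s6-rc"]
PROFILES = ["privileged", "unprivileged"]


def variant_attribute_paths(test_name, profiles, process_managers):
    """List one attribute path per (profile, process manager) combination."""
    return [
        f"{test_name}.{profile}.{process_manager}"
        for profile, process_manager in product(profiles, process_managers)
    ]


paths = variant_attribute_paths(
    "nginx-reverse-proxy-hostbased", PROFILES, PROCESS_MANAGERS
)
for path in paths:
    print(f"nix-build -A {path}")
```

With five process managers and two profiles, this prints ten commands for a single test case. Restricting the process managers to two of them reduces that to four.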

Results

With the test driver in place, I have managed to expand my repository of example services, provided test coverage for them and fixed quite a few bugs in the framework caused by regressions.

Below is a screenshot of Hydra, the Nix-based continuous integration service, showing an overview of test results for all kinds of variants of a service:

So far, the following services work multi-instance, with multiple process managers, and (optionally) as an unprivileged user:

Apache HTTP server. In the services repository, there are multiple constructors for deploying an Apache HTTP server: to deploy static web applications or dynamic web applications with PHP, and to use it as a reverse proxy (via HTTP and AJP) with HTTP basic authentication optionally enabled.
Apache Tomcat.
Nginx. For Nginx we also have multiple constructors. One to deploy a configuration for serving static web apps, and two for setting up reverse proxies using paths or virtual hosts to forward incoming requests to the appropriate services.

The reverse proxy constructors can also generate configurations that will cache the responses of incoming requests.
MySQL/MariaDB.
PostgreSQL.
InfluxDB.
MongoDB.
OpenSSH.
svnserve.
xinetd.
fcron. By default, the fcron user and group are hardwired into the executable. To facilitate unprivileged user deployments, we automatically create a package build override to propagate the --with-run-non-privileged configuration flag so that it can run as an unprivileged user. Similarly, for multiple instances we create an override to use a different user and group that does not conflict with the primary instance.
supervisord
s6-svscan

The following service also works with multiple instances and multiple process managers, but not as an unprivileged user:

Docker. In theory, Docker supports rootless deployments, but this is still highly experimental and I find it very cumbersome to set up.

The following services work with multiple process managers, but not multi-instance or as an unprivileged user:

D-Bus
Disnix
nix-daemon
Hydra

In theory, the above services could be adjusted to work as an unprivileged user, but doing so is not very useful -- for example, the nix-daemon's purpose is to facilitate multi-user package deployments. As an unprivileged user, you only want to facilitate package deployments for yourself.

Moreover, in my opinion, the multi-instance aspect is also not very useful to explore for these services. For example, I cannot think of a useful scenario for having two Hydra instances running next to each other.

Discussion

The test framework described in this blog post is an important feature addition to the Nix process management framework -- it allowed me to package more services and fix quite a few bugs caused by regressions.

I can now finally show that it is doable to package services and make them work under nearly all possible conditions that the framework supports (e.g. multiple instances, multiple process managers, and unprivileged user installations).

One limitation of the test framework is that it is not operating system agnostic -- the NixOS test driver that serves as its foundation only works, as its name implies, with NixOS, which itself is a Linux distribution. As a result, we cannot automatically test bsdrc scripts, launchd daemons, and cygrunsrv services.

In theory, it is also possible to make a more generalized test driver that works with multiple operating systems. The NixOS test driver is a combination of ideas (e.g. a shared Nix store between the host and guest system, an API to control QEMU, and an API to manage services). We could also dissect these ideas and run them on conventional QEMU VMs running different operating systems (with the Nix package manager).

Although making a more generalized test driver is interesting, it is beyond the scope of the Nix process management framework (which is about managing process instances, not entire systems).

Another drawback is that while it is possible to test all possible service variants on Linux, it may be very expensive to do so.

However, full process manager coverage is often not required to get a reasonable level of confidence. For many services, it typically suffices to implement the following strategy:

Pick two process managers: one that prefers foreground processes (e.g. supervisord) and one that prefers daemons (e.g. sysvinit). This is the most significant difference (from a configuration perspective) between all these different process managers.
If a service supports multiple configuration variants, and multiple instances, then create a processes model that concurrently deploys all these variants.

Implementing the above strategy only requires you to test four variants (two process managers, each with the privileged and the unprivileged profile), providing a high degree of certainty that it will work with all other process managers as well.

Future work

Most of the interesting functionality required to work with the Nix process management framework is now implemented. I still need to make more changes to improve its robustness, and to "dog food" it on as many of my own problems as possible.

Moreover, the docker backend still requires a bit more work to make it more usable.

Eventually, I plan to think about an RFC for upstreaming the interesting bits of the framework into Nixpkgs.

Availability

The Nix process management framework repository as well as the example services repository can be obtained from my GitHub page.

Monday, April 26, 2021