Sunday, November 29, 2020

Constructing a simple alerting system with well-known open source projects


Some time ago, I have been experimenting with all kinds of monitoring and alerting technologies. For example, with the following technologies, I can develop a simple alerting system with relative ease:

  • Telegraf is an agent that can be used to gather measurements and transfer the corresponding data to all kinds of storage solutions.
  • InfluxDB is a time series database platform that can store, manage and analyze timestamped data.
  • Kapacitor is a real-time streaming data process engine, than be used for a variety of purposes. I can use Kapacitor to analyze measurements and see if a threshold has been exceeded so that an alert can be triggered.
  • Alerta is a monitoring system that can store, de-duplicate alerts, and arrange black outs.
  • Grafana is a multi-platform open source analytics and interactive visualization web application.

These technologies appear to be quite straight forward to use. However, as I was learning more about them, I discovered a number of oddities, that may have big implications.

Furthermore, testing and making incremental changes also turns out to be much more challenging than expected, making it very hard to diagnose and fix problems.

In this blog post, I will describe how I built a simple monitoring and alerting system, and elaborate about my learning experiences.

Building the alerting system


As described in the introduction, I can combine several technologies to create an alerting system. I will explain them more in detail in the upcoming sections.

Telegraf


Telegraf is a plugable agent that gathers measurements from a variety of inputs (such as system metrics, platform metrics, database metrics etc.) and sends them to a variety of outputs, typically storage solutions (database management systems such as InfluxDB, PostgreSQL or MongoDB). Telegraf has a large plugin eco-system that provides all kinds integrations.

In this blog post, I will use InfluxDB as an output storage backend. For the inputs, I will restrict myself to capturing a sub set of system metrics only.

With the following telegraf.conf configuration file, I can capture a variety of system metrics every 10 seconds:

[agent]
  interval = "10s"

[[outputs.influxdb]]
  urls = [ "http://test1:8086" ]
  database = "sysmetricsdb"
  username = "sysmetricsdb"
  password = "sysmetricsdb"

[[inputs.system]]
  # no configuration

[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states.
  report_active = true

[[inputs.mem]]
  # no configuration

With the above configuration file, I can collect the following metrics:
  • System metrics, such as the hostname and system load.
  • CPU metrics, such as how much the CPU cores on a machine are utilized, including the total CPU activity.
  • Memory (RAM) metrics.

The data will be stored in an InfluxDB database name: sysmetricsdb hosted on a remote machine with host name: test1.

InfluxDB


As explained earlier, InfluxDB is a timeseries platform that can store, manage and analyze timestamped data. In many ways, InfluxDB resembles relational databases, but there are also some notable differences.

The query language that InfluxDB uses is called InfluxQL (that shares many similarities with SQL).

For example, with the following query I can retrieve the first three data points from the cpu measurement, that contains the CPU-related measurements collected by Telegraf:

> precision rfc3339
> select * from "cpu" limit 3

providing me the following result set:

name: cpu
time                 cpu       host  usage_active       usage_guest usage_guest_nice usage_idle        usage_iowait        usage_irq usage_nice usage_softirq       usage_steal usage_system      usage_user
----                 ---       ----  ------------       ----------- ---------------- ----------        ------------        --------- ---------- -------------       ----------- ------------      ----------
2020-11-16T15:36:00Z cpu-total test2 10.665258711721098 0           0                89.3347412882789  0.10559662090813073 0         0          0.10559662090813073 0           8.658922914466714 1.79514255543822
2020-11-16T15:36:00Z cpu0      test2 10.665258711721098 0           0                89.3347412882789  0.10559662090813073 0         0          0.10559662090813073 0           8.658922914466714 1.79514255543822
2020-11-16T15:36:10Z cpu-total test2 0.1055966209080346 0           0                99.89440337909197 0                   0         0          0.10559662090813073 0           0                 0

As you may probably notice by looking at the output above, every data point has a timestamp and a number of fields capturing CPU metrics:

  • cpu identifies the CPU core.
  • host contains the host name of the machine.
  • The remainder of the fields contain all kinds of CPU metrics, e.g. how much CPU time is consumed by the system (usage_system), the user (usage_user), by waiting for IO (usage_iowait) etc.
  • The usage_active field contains the total CPU activity percentage, which is going to be useful to develop an alert that will warn us if there is too much CPU activity for a long period of time.

Aside from the fact that all data is timestamp based, data in InfluxDB has another notable difference compared to relational databases: an InfluxDB database is schemaless. You can add an arbitrary number of fields and tags to a data point without having to adjust the database structure (and migrating existing data to the new database structure).

Fields and tags can contain arbitrary data, such as numeric values or strings. Tags are also indexed so that you can search for these values more efficiently. Furthermore, tags can be used to group data.

For example, the cpu measurement collection has the following tags:

> SHOW TAG KEYS ON "sysmetricsdb" FROM "cpu";
name: cpu
tagKey
------
cpu
host

As shown in the above output, the cpu and host fields are tags in the cpu measurement.

We can use these tags to search for all data points related to a CPU core and/or host machine. Moreover, we can use these tags for grouping allowing us to compute aggregate values, sch as the mean value per CPU core and host.

Beyond storing and retrieving data, InfluxDB has many useful additional features:

  • You can also automatically sample data and run continuous queries that generate and store sampled data in the background.
  • Configure retention policies so that data is no longer stored for an indefinite amount of time. For example, you can configure a retention policy to drop raw data after a certain amount of time, but retain the corresponding sampled data.

InfluxDB has a "open core" development model. The free and open source edition (FOSS) of InfluxDB server (that is MIT licensed) allows you to host multiple databases on a multiple servers.

However, if you also want horizontal scalability and/or high assurance, then you need to switch to the hosted InfluxDB versions -- data in InfluxDB is partitioned into so-called shards of a fixed size (the default shard size is 168 hours).

These shards can be distributed over multiple InfluxDB servers. It is also possible to deploy multiple read replicas of the same shard to multiple InfluxDB servers improving read speed.

Kapacitor


Kapacitor is a real-time streaming data process engine developed by InfluxData -- the same company that also develops InfluxDB and Telegraf.

It can be used for all kinds of purposes. In my example cases, I will only use it to determine whether some threshold has been exceeded and an alert needs to be triggered.

Kapacitor works with customly implemented tasks that are written in a domain-specific language called the TICK script language. There are two kinds of tasks: stream and batch tasks. Both task types have advantages and disadvantages.

We can easily develop an alert that gets triggered if the CPU activity level is high for a relatively long period of time (more than 75% on average over 1 minute).

To implement this alert as a stream job, we can write the following TICK script:

dbrp "sysmetricsdb"."autogen"

stream
    |from()
        .measurement('cpu')
        .groupBy('host', 'cpu')
        .where(lambda: "cpu" != 'cpu-total')
    |window()
        .period(1m)
        .every(1m)
    |mean('usage_active')
    |alert()
        .message('Host: {{ index .Tags "host" }} has high cpu usage: {{ index .Fields "mean" }}')
        .warn(lambda: "mean" > 75.0)
        .crit(lambda: "mean" > 85.0)
        .alerta()
            .resource('{{ index .Tags "host" }}/{{ index .Tags "cpu" }}')
            .event('cpu overload')
            .value('{{ index .Fields "mean" }}')

A stream job is built around the following principles:

  • A stream task does not execute queries on an InfluxDB server. Instead, it creates a subscription to InfluxDB -- whenever a data point gets inserted into InfluxDB, the data points gets forwarded to Kapacitor as well.

    To make subscriptions work, both InfluxDB and Kapacitor need to be able to connect to each other with a public IP address.
  • A stream task defines a pipeline consisting of a number of nodes (connected with the | operator). Each node can consume data points, filter, transform, aggregate, or execute arbitrary operations (such as calling an external service), and produce new data points that can be propagated to the next node in the pipeline.
  • Every node also has property methods (such as .measurement('cpu')) making it possible to configure parameters.

The TICK script example shown above does the following:

  • The from node consumes cpu data points from the InfluxDB subscription, groups them by host and cpu and filters out data points with the the cpu-total label, because we are only interested in the CPU consumption per core, not the total amount.
  • The window node states that we should aggregate data points over the last 1 minute and pass the resulting (aggregated) data points to the next node after one minute in time has elapsed. To aggregate data, Kapacitor will buffer data points in memory.
  • The mean node computes the mean value for usage_active for the aggregated data points.
  • The alert node is used to trigger an alert of a specific severity level (WARNING if the mean activity percentage is bigger than 75%) and (CRITICAL if the mean activity percentage is bigger than 85%). In the remainder of the case, the status is considered OK. The alert is sent to Alerta.

It is also possible to write a similar kind of alerting script as a batch task:

dbrp "sysmetricsdb"."autogen"

batch
    |query('''
        SELECT mean("usage_active")
        FROM "sysmetricsdb"."autogen"."cpu"
        WHERE "cpu" != 'cpu-total'
    ''')
        .period(1m)
        .every(1m)
        .groupBy('host', 'cpu')
    |alert()
        .message('Host: {{ index .Tags "host" }} has high cpu usage: {{ index .Fields "mean" }}')
        .warn(lambda: "mean" > 75.0)
        .crit(lambda: "mean" > 85.0)
        .alerta()
            .resource('{{ index .Tags "host" }}/{{ index .Tags "cpu" }}')
            .event('cpu overload')
            .value('{{ index .Fields "mean" }}')

The above TICK script looks similar to the stream task shown earlier, but instead of using a subscription, the script queries the InfluxDB database (with an InfluxQL query) for data points over the last minute with a query node.

Which approach for writing a CPU alert is best, you may wonder? Each of these two approaches have their pros and cons:

  • Stream tasks offer low latency responses -- when a data point appears, a stream task can immediately respond, whereas a batch task needs to query every minute all the data points to compute the mean percentage over the last minute.
  • Stream tasks maintain a buffer for aggregating the data points making it possible to only send incremental updates to Alerta. Batch tasks are stateless. As a result, they need to update the status of all hosts and CPUs every minute.
  • Processing data points is done synchronously and in sequential order -- if an update round to Alerta takes too long (which is more likely to happen with a batch task), then the next processing run may overlap with the previous, causing all kinds of unpredictable results.

    It may also cause Kapacitor to eventually crash due to growing resource consumption.
  • Batch tasks may also miss data points -- while querying data over a certain time window, it may happen that a new data point gets inserted in that time window (that is being queried). This new data point will not be picked up by Kapacitor.

    A subscription made by a stream task, however, will never miss any data points.
  • Stream tasks can only work with data points that appear from the moment Kapacitor is started -- it cannot work with data points in the past.

    For example, if Kapacitor is restarted and some important event is triggered in the restart time window, Kapacitor will not notice that event, causing the alert to remain in its previous state.

    To work effectively with stream tasks, a continuous data stream is required that frequently reports on the status of a resource. Batch tasks, on the other hand, can work with historical data.
  • The fact that nodes maintain a buffer may also cause the RAM consumption of Kapacitor to grow considerably, if the data volumes are big.

    A batch task on the other hand, does not buffer any data and is more memory efficient.

    Another compelling advantage of batch tasks over stream tasks is that InfluxDB does all the work. The hosted version of InfluxDB can also horizontally scale.
  • Batch tasks can also aggregate data more efficiently (e.g. computing the mean value or sum of values over a certain time period).

I consider neither of these script types the optimal solution. However, for implementing the alerts I tend to have a slight preference for stream jobs, because of its low latency, and incremental update properties.

Alerta


As explained in the introduction, Alerta is a monitoring system that can store and de-duplicate alerts, and arrange black outs.

The Alerta server provides a REST API that can be used to query and modify alerting data and uses MongoDB or PostgreSQL as a storage database.

There are also a variety of Alerta clients: there is the alerta-cli allows you to control the service from the command-line. There is also a web user interface that I will show later in this blog post.

Running experiments


With all the components described above in place, we can start running experiments to see if the CPU alert will work as expected. To gain better insights in the process, I can install Grafana that allows me to visualize the measurements that are stored in InfluxDB.

Configuring a dashboard and panel for visualizing the CPU activity rate was straight forward. I configured a new dashboard, with the following variables:


The above variables allow me to select for each machine in the network, which CPU core's activity percentage I want to visualize.

I have configured the CPU panel as follows:


In the above configuration, I query the usage_activity from the cpu measurement collection, using the dashboard variables: cpu and host to filter for the right target machine and CPU core.

I have also configured the field unit to be a percentage value (between 0 and 100).

When running the following command-line instruction on a test machine that runs Telegraf (test2), I can deliberately hog the CPU:

$ dd if=/dev/zero of=/dev/null

The above command reads zero bytes (one-by-one) and discards them by sending them to /dev/null, causing the CPU to remain utilized at a high level:


In the graph shown above, it is clearly visible that CPU core 0 on the test2 machine remains utilized at 100% for several minutes.

(As a sidenote, we can also hog both the CPU and consume RAM at the same time with a simple command line instruction).

If we keep hogging the CPU and wait for at least a minute, the Alerta web interface dashboard will show a CRITICAL alert:


If we stop the dd command, then the TICK script should eventually notice that the mean percentage drops below the WARNING threshold causing the alert to go back into the OK state and disappearing from the Alerta dashboard.

Developing test cases


Being able to trigger an alert with a simple command-line instruction is useful, but not always convenient or effective -- one of the inconveniences is that we always have to wait at least one minute to get feedback.

Moreover, when an alert does not work, it is not always easy to find the root cause. I have encountered the following problems that contribute to a failing alert:

  • Telegraf may not be running and, as a result, not capturing the data points that need to be analyzed by the TICK script.
  • A subscription cannot be established between InfluxDB and Kapacitor. This may happen when Kapacitor cannot be reached through a public IP address.
  • There are data points collected, but only the wrong kinds of measurements.
  • The TICK script is functionally incorrect.

Fortunately, for stream tasks it is relatively easy to quickly find out whether an alert is functionally correct or not -- we can generate test cases that almost instantly trigger each possible outcome with a minimal amount of data points.

An interesting property of stream tasks is that they have no notion of time -- the .window(1m) property may suggest that Kapacitor computes the mean value of the data points every minute, but that is not what it actually does. Instead, Kapacitor only looks at the timestamps of the data points that it receives.

When Kapacitor sees that the timestamps of the data points fit in the 1 minute time window, then it keeps buffering. As soon as a data point appears that is outside this time window, the window node relays an aggregated data point to the next node (that computes the mean value, than in turn is consumed by the alert node deciding whether an alert needs to be raised or not).

We can exploit that knowledge, to create a very minimal bash test script that triggers every possible outcome: OK, WARNING and CRITICAL:

influxCmd="influx -database sysmetricsdb -host test1"

export ALERTA_ENDPOINT="http://test1"

### Trigger CRITICAL alert

# Force the average CPU consumption to be 100%
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=100   0000000000"
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=100  60000000000"
# This data point triggers the alert
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=100 120000000000"

sleep 1
actualSeverity=$(alerta --output json query | jq '.[0].severity')

if [ "$actualSeverity" != "critical" ]
then
     echo "Expected severity: critical, but we got: $actualSeverity" >&2
     false
fi
      
### Trigger WARNING alert

# Force the average CPU consumption to be 80%
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=80 180000000000"
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=80 240000000000"
# This data point triggers the alert
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=80 300000000000"

sleep 1
actualSeverity=$(alerta --output json query | jq '.[0].severity')

if [ "$actualSeverity" != "warning" ]
then
     echo "Expected severity: warning, but we got: $actualSeverity" >&2
     false
fi

### Trigger OK alert

# Force the average CPU consumption to be 0%
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=0 300000000000"
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=0 360000000000"
# This data point triggers the alert
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=0 420000000000"

sleep 1
actualSeverity=$(alerta --output json query | jq '.[0].severity')

if [ "$actualSeverity" != "ok" ]
then
     echo "Expected severity: ok, but we got: $actualSeverity" >&2
     false
fi

The shell script shown above automatically triggers all three possible outcomes of the CPU alert:

  • CRITICAL is triggered by generating data points that force a mean activity percentage of 100%.
  • WARNING is triggered by a mean activity percentage of 80%.
  • OK is triggered by a mean activity percentage of 0%.

It uses the Alerta CLI to connect to the Alerta server to check whether the alert's severity level has the expected value.

We need three data points to trigger each alert type -- the first two data points are on the boundaries of the 1 minute window (0 seconds and 60 seconds), forcing the mean value to become the specified CPU activity percentage.

The third data point is deliberately outside the time window (of 1 minute), forcing the alert node to be triggered with a mean value over the previous two data points.

Although the above test strategy works to quickly validate all possible outcomes, one impractical aspect is that the timestamps in the above example start with 0 (meaning 0 seconds after the epoch: January 1st 1970 00:00 UTC).

If we also want to observe the data points generated by the above script in Grafana, we need to configure the panel to go back in time 50 years.

Fortunately, I can also easily adjust the script to start with a base timestamp, that is 1 hour in the past:

offset="$(($(date +%s) - 3600))"

With this tiny adjustment, we should see the following CPU graph (displaying data points from the last hour) after running the test script:


As you may notice, we can see that the CPU activity level quickly goes from 100%, to 80%, to 0%, using only 9 data points.

Although testing stream tasks (from a functional perspective) is quick and convenient, testing batch tasks in a similar way is difficult. Contrary to the stream task implementation, the query node in the batch task does have a notion of time (because of the WHERE clause that includes the now() expression).

Moreover, the embedded InfluxQL query evaluates the mean values every minute, but the test script does not exactly know when this event triggers.

The only way I could think of to (somewhat reliably) validate the outcomes is by creating a test script that continuously inserts data points for at least double the time window size (2 minutes) until Alerta reports the right alert status (if it does not after a while, I can conclude that the alert is incorrectly implemented).

Automating the deployment


As you may probably have already guessed, to be able to conveniently experiment with all these services, and to reliably run tests in isolation, some form of deployment automation is an absolute must-have.

Most people who do not know anything about my deployment technology preferences, will probably go for Docker or docker-compose, but I have decided to use a variety of solutions from the Nix project.

NixOps is used to automatically deploy a network of NixOS machines -- I have created a logical and physical NixOps configuration that deploys two VirtualBox virtual machines.

With the following command I can create and deploy the virtual machines:

$ nixops create network.nix network-virtualbox.nix -d test
$ nixops deploy -d test

The first machine: test1 is responsible for hosting the entire monitoring infrastructure (InfluxDB, Kapacitor, Alerta, Grafana), and the second machine (test2) runs Telegraf and the load tests.

Disnix (my own deployment tool) is responsible for deploying all services, such as InfluxDB, Kapacitor, Alarta, and the database storage backends. Contrary to docker-compose, Disnix does not work with containers (or other Docker objects, such as networks or volumes), but with arbitrary deployment units that are managed with a plugin system called Dysnomia.

Moreover, Disnix can also be used for distributed deployment in a network of machines.

I have packaged all the services and captured them in a Disnix services model that specifies all deployable services, their types, and their inter-dependencies.

If I combine the services model with the NixOps network models, and a distribution model (that maps Telegraf and the test scripts to the test2 machine and the remainder of the services to the first: test1), I can deploy the entire system:

$ export NIXOPS_DEPLOYMENT=test
$ export NIXOPS_USE_NIXOPS=1

$ disnixos-env -s services.nix \
  -n network.nix \
  -n network-virtualbox.nix \
  -d distribution.nix

The following diagram shows a possible deployment scenario of the system:


The above diagram describes the following properties:

  • The light-grey colored boxes denote machines. In the above diagram, we have two of them: test1 and test2 that correspond to the VirtualBox machines deployed by NixOps.
  • The dark-grey colored boxes denote containers in a Disnix-context (not to be confused with Linux or Docker containers). These are environments that manage other services.

    For example, a container service could be the PostgreSQL DBMS managing a number of PostgreSQL databases or the Apache HTTP server managing web applications.
  • The ovals denote services that could be any kind of deployment unit. In the above example, we have services that are running processes (managed by systemd), databases and web applications.
  • The arrows denote inter-dependencies between services. When a service has an inter-dependency on another service (i.e. the arrow points from the former to the latter), then the latter service needs to be activated first. Moreover, the former service also needs to know how the latter can be reached.
  • Services can also be container providers (as denoted by the arrows in the labels), stating that other services can be embedded inside this service.

    As already explained, the PostgreSQL DBMS is an example of such a service, because it can host multiple PostgreSQL databases.

Although the process components in the diagram above can also be conveniently deployed with Docker-based solutions (i.e. as I have explained in an earlier blog post, containers are somewhat confined and restricted processes), the non-process integrations need to be managed by other means, such as writing extra shell instructions in Dockerfiles.

In addition to deploying the system to machines managed by NixOps, it is also possible to use the NixOS test driver -- the NixOS test driver automatically generates QEMU virtual machines with a shared Nix store, so that no disk images need to be created, making it possible to quickly spawn networks of virtual machines, with very small storage footprints.

I can also create a minimal distribution model that only deploys the services required to run the test scripts -- Telegraf, Grafana and the front-end applications are not required, resulting in a much smaller deployment:


As can be seen in the above diagram, there are far fewer components required.

In this virtual network that runs a minimal system, we can run automated tests for rapid feedback. For example, the following test driver script (implemented in Python) will run my test shell script shown earlier:

test2.succeed("test-cpu-alerts")

With the following command I can automatically run the tests on the terminal:

$ nix-build release.nix -A tests

Availability


The deployment recipes, test scripts and documentation describing the configuration steps are stored in the monitoring playground repository that can be obtained from my GitHub page.

Besides the CPU activity alert described in this blog post, I have also developed a memory alert that triggers if too much RAM is consumed for a longer period of time.

In addition to virtual machines and services, there is also deployment automation in place allowing you also easily deploy Kapacitor TICK scripts and Grafana dashboards.

To deploy the system, you need to use the very latest version of Disnix (version 0.10) that was released very recently.

Acknowledgements


I would like to thank my employer: Mendix for writing this blog post. Mendix allows developers to work two days per month on research projects, making projects like these possible.

Saturday, October 31, 2020

Building multi-process Docker images with the Nix process management framework

Some time ago, I have described my experimental Nix-based process management framework that makes it possible to automatically deploy running processes (sometimes also ambiguously called services) from declarative specifications written in the Nix expression language.

The framework is built around two concepts. As its name implies, the Nix package manager is used to deploy all required packages and static artifacts, and a process manager of choice (e.g. sysvinit, systemd, supervisord and others) is used to manage the life-cycles of the processes.

Moreover, it is built around flexible concepts allowing integration with solutions that are not qualified as process managers (but can still be used as such), such as Docker -- each process instance can be deployed as a Docker container with a shared Nix store using the host system's network.

As explained in an earlier blog post, Docker has become such a popular solution that it has become a standard for deploying (micro)services (often as a utility in the Kubernetes solution stack).

When deploying a system that consists of multiple services with Docker, a typical strategy (and recommended practice) is to use multiple containers that have only one root application process. Advantages of this approach is that Docker can control the life-cycles of the applications, and that each process is (somewhat) isolated/protected from other processes and the host system.

By default, containers are isolated, but if they need to interact with other processes, then they can use all kinds of integration facilities -- for example, they can share namespaces, or use shared volumes.

In some situations, it may also be desirable to deviate from the one root process per container practice -- for some systems, processes may need to interact quite intensively (e.g. with IPC mechanisms, shared files or shared memory, or a combination these) in which the container boundaries introduce more inconveniences than benefits.

Moreover, when running multiple processes in a single container, common dependencies can also typically be more efficiently shared leading to lower disk and RAM consumption.

As explained in my previous blog post (that explores various Docker concepts), sharing dependencies between containers only works if containers are constructed from images that share the same layers with the same shared libraries. In practice, this form of sharing is not always as efficient as we want it to be.

Configuring a Docker image to run multiple application processes is somewhat cumbersome -- the official Docker documentation describes two solutions: one that relies on a wrapper script that starts multiple processes in the background and a loop that waits for the "main process" to terminate, and the other is to use a process manager, such as supervisord.

I realised that I could solve this problem much more conveniently by combining the dockerTools.buildImage {} function in Nixpkgs (that builds Docker images with the Nix package manager) with the Nix process management abstractions.

I have created my own abstraction function: createMultiProcessImage that builds multi-process Docker images, managed by any supported process manager that works in a Docker container.

In this blog post, I will describe how this function is implemented and how it can be used.

Creating images for single root process containers


As shown in earlier blog posts, creating a Docker image with Nix for a single root application process is very straight forward.

For example, we can build an image that launches a trivial web application service with an embedded HTTP server (as shown in many of my previous blog posts), as follows:

{dockerTools, webapp}:  

dockerTools.buildImage {
  name = "webapp";
  tag = "test";

  runAsRoot = ''
    ${dockerTools.shadowSetup}
    groupadd webapp
    useradd webapp -g webapp -d /dev/null
  '';

  config = {
    Env = [ "PORT=5000" ];
    Cmd = [ "${webapp}/bin/webapp" ];
    Expose = {
      "5000/tcp" = {};
    };
  };
}

The above Nix expression (default.nix) invokes the dockerTools.buildImage function to automatically construct an image with the following properties:

  • The image has the following name: webapp and the following version tag: test.
  • The web application service requires some state to be initialized before it can be used. To configure state, we can run instructions in a QEMU virual machine with root privileges (runAsRoot).

    In the above deployment Nix expression, we create an unprivileged user and group named: webapp. For production deployments, it is typically recommended to drop root privileges, for security reasons.
  • The Env directive is used to configure environment variables. The PORT environment variable is used to configure the TCP port where the service should bind to.
  • The Cmd directive starts the webapp process in foreground mode. The life-cycle of the container is bound to this application process.
  • Expose exposes TCP port 5000 to the public so that the service can respond to requests made by clients.

We can build the Docker image as follows:

$ nix-build

load it into Docker with the following command:

$ docker load -i result

and launch a container instance using the image as a template:

$ docker run -it -p 5000:5000 webapp:test

If the deployment of the container succeeded, we should get a response from the webapp process, by running:

$ curl http://localhost:5000
<!DOCTYPE html>
<html>
  <head>
    <title>Simple test webapp</title>
  </head>
  <body>
    Simple test webapp listening on port: 5000
  </body>
</html>

Creating multi-process images


As shown in previous blog posts, the webapp process is part of a bigger system, namely: a web application system with an Nginx reverse proxy forwarding requests to multiple webapp instances:

{ pkgs ? import <nixpkgs> { inherit system; }
, system ? builtins.currentSystem
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager
}:

let
  sharedConstructors = import ../services-agnostic/constructors.nix {
    inherit pkgs stateDir runtimeDir logDir cacheDir tmpDir forceDisableUserChange processManager;
  };

  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir forceDisableUserChange processManager;
  };
in
rec {
  webapp = rec {
    port = 5000;
    dnsName = "webapp.local";

    pkg = constructors.webapp {
      inherit port;
    };
  };

  nginx = rec {
    port = 8080;

    pkg = sharedConstructors.nginxReverseProxyHostBased {
      webapps = [ webapp ];
      inherit port;
    } {};
  };
}

The Nix expression above shows a simple processes model variant of that system, that consists of only two process instances:

  • The webapp process is (as shown earlier) an application that returns a static HTML page.
  • nginx is configured as a reverse proxy to forward incoming connections to multiple webapp instances using the virtual host header property (dnsName).

    If somebody connects to the nginx server with the following host name: webapp.local then the request is forwarded to the webapp service.

Configuration steps


To allow all processes in the process model shown to be deployed to a single container, we need to execute the following steps in the construction of an image:

  • Instead of deploying a single package, such as webapp, we need to refer to a collection of packages and/or configuration files that can be managed with a process manager, such as sysvinit, systemd or supervisord.

    The Nix process management framework provides all kinds of Nix function abstractions to accomplish this.

    For example, the following function invocation builds a configuration profile for the sysvinit process manager, containing a collection of sysvinit scripts (also known as LSB Init compliant scripts):

    profile = import ../create-managed-process/sysvinit/build-sysvinit-env.nix {
      exprFile = ./processes.nix;
      stateDir = "/var";
    };
        

  • Similar to single root process containers, we may also need to initialize state. For example, we need to create common FHS state directories (e.g. /tmp, /var etc.) in which services can store their relevant state files (e.g. log files, temp files).

    This can be done by running the following command:

    nixproc-init-state --state-dir /var
        
  • Another property that multiple process containers have in common is that they may also require the presence of unprivileged users and groups, for security reasons.

    With the following commands, we can automatically generate all required users and groups specified in a deployment profile:

    ${dysnomia}/bin/dysnomia-addgroups ${profile}
    ${dysnomia}/bin/dysnomia-addusers ${profile}
        
  • Instead of starting a (single root) application process, we need to start a process manager that manages the processes that we want to deploy. As already explained, the framework allows you to pick multiple options.

Starting a process manager as a root process


From all process managers that the framework currently supports, the most straight forward option to use in a Docker container is: supervisord.

To use it, we can create a symlink to the supervisord configuration in the deployment profile:

ln -s ${profile} /etc/supervisor

and then start supervisord as a root process with the following command directive:

Cmd = [
  "${pkgs.pythonPackages.supervisor}/bin/supervisord"
  "--nodaemon"
  "--configuration" "/etc/supervisor/supervisord.conf"
  "--logfile" "/var/log/supervisord.log"
  "--pidfile" "/var/run/supervisord.pid"
];

(As a sidenote: creating a symlink is not strictly required, but makes it possible to control running services with the supervisorctl command-line tool).

Supervisord is not the only option. We can also use sysvinit scripts, but doing so is a bit tricky. As explained earlier, the life-cycle of container is bound to a running root process (in foreground mode).

sysvinit scripts do not run in the foreground, but start processes that daemonize and terminate immediately, leaving daemon processes behind that remain running in the background.

As described in an earlier blog post about translating high-level process management concepts, it is also possible to run "daemons in the foreground" by creating a proxy script. We can also make a similar foreground proxy for a collection of daemons:

#!/bin/bash -e

_term()
{
    nixproc-sysvinit-runactivity -r stop ${profile}
    kill "$pid"
    exit 0
}

nixproc-sysvinit-runactivity start ${profile}

# Keep process running, but allow it to respond to the TERM and INT
# signals so that all scripts are stopped properly

trap _term TERM
trap _term INT

tail -f /dev/null & pid=$!
wait "$pid"

The above proxy script does the following:

  • It first starts all sysvinit scripts by invoking the nixproc-sysvinit-runactivity start command.
  • Then it registers a signal handler for the TERM and INT signals. The corresponding callback triggers a shutdown procedure.
  • We invoke a dummy command that keeps running in the foreground without consuming too many system resources (tail -f /dev/null) and we wait for it to terminate.
  • The signal handler properly deactivates all processes in reverse order (with the nixproc-sysvinit-runactivity -r stop command), and finally terminates the dummy command causing the script (and the container) to stop.

In addition supervisord and sysvinit, we can also use Disnix as a process manager by using a similar strategy with a foreground proxy.

Other configuration properties


The above configuration properties suffice to get a multi-process container running. However, to make working with such containers more practical from a user perspective, we may also want to:

  • Add basic shell utilities to the image, so that you can control the processes, investigate log files (in case of errors), and do other maintenance tasks.
  • Add a .bashrc configuration file to make file coloring working for the ls command, and to provide a decent prompt in a shell session.

Usage


The configuration steps described in the previous section are wrapped into a function named: createMultiProcessImage, which itself is a thin wrapper around the dockerTools.buildImage function in Nixpkgs -- it accepts the same parameters with a number of additional parameters that are specific to multi-process configurations.

The following function invocation builds a multi-process container deploying our example system, using supervisord as a process manager:

let
  pkgs = import <nixpkgs> {};

  createMultiProcessImage = import ../../nixproc/create-multi-process-image/create-multi-process-image.nix {
    inherit pkgs system;
    inherit (pkgs) dockerTools stdenv;
  };
in
createMultiProcessImage {
  name = "multiprocess";
  tag = "test";
  exprFile = ./processes.nix;
  stateDir = "/var";
  processManager = "supervisord";
}

After building the image, and deploying a container, with the following commands:

$ nix-build
$ docker load -i result
$ docker run -it --network host multiprocessimage:test

we should be able to connect to the webapp instance via the nginx reverse proxy:

$ curl -H 'Host: webapp.local' http://localhost:8080
<!DOCTYPE html>
<html>
  <head>
    <title>Simple test webapp</title>
  </head>
  <body>
    Simple test webapp listening on port: 5000
  </body>
</html>

As explained earlier, the constructed image also provides extra command-line utilities to do maintenance tasks, and control the life-cycle of the individual processes.

For example, we can "connect" to the running container, and check which processes are running:

$ docker exec -it mycontainer /bin/bash
# supervisorctl
nginx                            RUNNING   pid 11, uptime 0:00:38
webapp                           RUNNING   pid 10, uptime 0:00:38
supervisor>

If we change the processManager parameter to sysvinit, we can deploy a multi-process image in which the foreground proxy script is used as a root process (that starts and stops sysvinit scripts).

We can control the life-cycle of each individual process by directly invoking the sysvinit scripts in the container:

$ docker exec -it mycontainer /bin/bash
$ /etc/rc.d/init.d/webapp status
webapp is running with Process ID(s) 33.

$ /etc/rc.d/init.d/nginx status
nginx is running with Process ID(s) 51.

Although having extra command-line utilities to do administration tasks is useful, a disadvantage is that they considerably increase the size of the image.

To save storage costs, it is also possible to disable interactive mode to exclude these packages:

let
  pkgs = import <nixpkgs> {};

  createMultiProcessImage = import ../../nixproc/create-multi-process-image/create-multi-process-image.nix {
    inherit pkgs system;
    inherit (pkgs) dockerTools stdenv;
  };
in
createMultiProcessImage {
  name = "multiprocess";
  tag = "test";
  exprFile = ./processes.nix;
  stateDir = "/var";
  processManager = "supervisord";
  interactive = false; # Do not install any additional shell utilities
}

Discussion


In this blog post, I have described a new utility function in the Nix process management framework: createMultiProcessImage -- a thin wrapper around the dockerTools.buildImage function that can be used to convienently build multi-process Docker images, using any Docker-capable process manager that the Nix process management framework supports.

Besides the fact that we can convienently construct multi-process images, this function also has the advantage (similar to the dockerTools.buildImage function) that Nix is only required for the construction of the image. To deploy containers from a multi-process image, Nix is not a requirement.

There is also a drawback: similar to "ordinary" multi-process container deployments, when it is desired to upgrade a process, the entire container needs to be redeployed, also requiring a user to terminate all other running processes.

Availability


The createMultiProcessImage function is part of the current development version of the Nix process management framework that can be obtained from my GitHub page.

Thursday, October 8, 2020

Transforming Disnix models to graphs and visualizing them

In my previous blog post, I have described a new tool in the Dynamic Disnix toolset that can be used to automatically assign unique numeric IDs to services in a Disnix service model. Unique numeric IDs can represent all kinds of useful resources, such as TCP/UDP port numbers, user IDs (UIDs), and group IDs (GIDs).

Although I am quite happy to have this tool at my disposal, implementing it was much more difficult and time consuming than I expected. Aside from the fact that the problem is not as obvious as it may sound, the main reason is that the Dynamic Disnix toolset was originally developed as a proof of concept implementation for a research paper under very high time pressure. As a result, it has accumulated quite a bit of technical debt, that as of today, is still at a fairly high level (but much better than it was when I completed the PoC).

For the ID assigner tool, I needed to make changes to the foundations of the tools, such as the model parsing libraries. As a consequence, all kinds of related aspects in the toolset started to break, such as the deployment planning algorithm implementations.

Fixing some of these algorithm implementations was much more difficult than I expected -- they were not properly documented, not decomposed into functions, had little to no reuse of common concepts and as a result, were difficult to understand and change. I was forced to re-read the papers that I used as a basis for these algorithms.

To prevent myself from having to go through such a painful process again, I have decided to revise them in such a way that they are better understandable and maintainable.

Dynamically distributing services


The deployment models in the core Disnix toolset are static. For example, the distribution of services to machines in the network is done in a distribution model in which the user has to manually map services in the services model to target machines in the infrastructure model (and optionally to container services hosted on the target machines).

Each time a condition changes, e.g. the system needs to scale up or a machine crashes and the system needs to recover, a new distribution model must be configured and the system must be redeployed. For big complex systems that need to be reconfigured frequently, manually specifying new distribution models becomes very impractical.

As I have already explained in older blog posts, to cope with the limitations of static deployment models (and other static configuration aspects), I have developed Dynamic Disnix, in which various configuration aspects can be automated, including the distribution of services to machines.

A strategy for dynamically distributing services to machines can be specified in a QoS model, that typically consists of two phases:

  • First, a candidate target selection must be made, in which for each service the appropriate candidate target machines are selected.

    Not all machines are capable of hosting a certain service for functional and non-functional reasons -- for example, a i686-linux machine is not capable of running a binary compiled for a x86_64-linux machine.

    A machine can also be exposed to the public internet, and as a result, may not be suitable to host a service that exposes privacy-sensitive information.
  • After the suitable candidate target machines are known for each service, we must decide to which candidate machine each service gets distributed.

    This can be done in many ways. The strategy that we want to use is typically based on all kinds of non-functional requirements.

    For example, we can optimize a system's reliability by minimizing the amount of network links between services, requiring a strategy in which services that depend on each other are mapped to the same machine, as much as possible.

Graph-based optimization problems


In the Dynamic Disnix toolset, I have implemented various kinds of distribution algorithms/strategies for all kinds of purposes.

I did not "invent" most of them -- for some, I got inspiration from papers in the academic literature.

Two of the more advanced deployment planning algorithms are graph-based, to accomplish the following goals:

  • Reliable deployment. Network links are a potential source making a distributed system unreliable -- connections may fail, become slow, or could be interrupted frequently. By minimizing the amount of network links between services (by co-locating them on the same machine), their impact can be reduced. To not make deployments not too expensive, it should be done with a minimal amount of machines.

    As described in the paper: "Reliable Deployment of Component-based Applications into Distributed Environments" by A. Heydarnoori and F. Mavaddat, this problem can be transformed into a graph problem: the multiway cut problem (which is NP-hard).

    It can only be solved in polynomial time with an approximation algorithm that comes close to the optimal solution, unless a proof that P = NP exists.
  • Fragile deployment. Inspired by the above deployment problem, I also came up with the opposite problem (as my own "invention") -- how can we make any connection between a service a true network link (not local), so that we can test a system for robustness, using a minimal amount of machines?

    This problem can be modeled as a graph coloring problem (that is a NP-hard problem as well). I used one of the approximation algorithms described in the paper: "New Methods to Color the Vertices of a Graph" by D. Brélaz to implement a solution.

To work with these graph-based algorithms, I originally did not apply any transformations -- because of time pressure, I directly worked with objects from the Disnix models (e.g. services, target machines) and somewhat "glued" these together with generic data structures, such as lists and hash tables.

As a result, when looking at the implementation, it is very hard to get an understanding of the process and how an implementation aspect relates to a concept described in the papers shown above.

In my revised version, I have implemented a general purpose graph library that can be used to solve all kinds of general graph related problems.

Aside from using a general graph library, I have also separated the graph-based generation processes into the following steps:

  • After opening the Disnix input models (such as the services, infrastructure, and distribution models) I transform the models to a graph representing an instance of the problem domain.
  • After the graph has been generated, I apply the approximation algorithm to the graph data structure.
  • Finally, I transform the resolved graph back to a distribution model that should provide our desired distribution outcome.

This new organization provides better separation of concerns, common concepts can be reused (such as graph operations), and as a result, the implementations are much closer to the approximation algorithms described in the papers.

Visualizing the generation process


Another advantage of having a reusable graph implementation is that we can easily extend it to visualize the problem graphs.

When I combine these features together with my earlier work that visualizes services models, and a new tool that visualizes infrastructure models, I can make the entire generation process transparent.

For example, the following services model:

{system, pkgs, distribution, invDistribution}:

let
  customPkgs = import ./pkgs { inherit pkgs system; };
in
rec {
  testService1 = {
    name = "testService1";
    pkg = customPkgs.testService1;
    type = "echo";
  };

  testService2 = {
    name = "testService2";
    pkg = customPkgs.testService2;
    dependsOn = {
      inherit testService1;
    };
    type = "echo";
  };

  testService3 = {
    name = "testService3";
    pkg = customPkgs.testService3;
    dependsOn = {
      inherit testService1 testService2;
    };
    type = "echo";
  };
}

can be visualized as follows:

$ dydisnix-visualize-services -s services.nix


The above services model and corresponding visualization capture the following properties:

  • They describe three services (as denoted by ovals).
  • The arrows denote inter-dependency relationships (the dependsOn attribute in the services model).

    When a service has an inter-dependency on another service means that the latter service has to be activated first, and that the dependent service needs to know how to reach the former.

    testService2 depends on testService1 and testService3 depends on both the other two services.

We can also visualize the following infrastructure model:

{
  testtarget1 = {
    properties = {
      hostname = "testtarget1";
    };
    containers = {
      mysql-database = {
        mysqlPort = 3306;
      };
      echo = {};
    };
  };

  testtarget2 = {
    properties = {
      hostname = "testtarget2";
    };
    containers = {
      mysql-database = {
        mysqlPort = 3306;
      };
    };
  };

  testtarget3 = {
    properties = {
      hostname = "testtarget3";
    };
  };
}

with the following command:

$ dydisnix-visualize-infra -i infrastructure.nix

resulting in the following visualization:


The above infrastructure model declares three machines. Each target machine provides a number of container services (such as a MySQL database server, and echo that acts as a testing container).

With the following command, we can generate a problem instance for the graph coloring problem using the above services and infrastructure models as inputs:

$ dydisnix-graphcol -s services.nix -i infrastructure.nix \
  --output-graph

resulting in the following graph:


The graph shown above captures the following properties:

  • Each service translates to a node
  • When an inter-dependency relationship exists between services, it gets translated to a (bi-directional) link representing a network connection (the rationale is that a service that has an inter-dependency on another service, interact with each other by using a network connection).

Each target machine translates to a color, that we can represent with a numeric index -- 0 is testtarget1, 1 is testtarget2 and so on.

The following command generates the resolved problem instance graph in which each vertex has a color assigned:

$ dydisnix-graphcol -s services.nix -i infrastructure.nix \
  --output-resolved-graph

resulting in the following visualization:


(As a sidenote: in the above graph, colors are represented by numbers. In theory, I could also use real colors, but if I also want that the graph to remain visually appealing, I need to solve a color picking problem, which is beyond the scope of my refactoring objective).

The resolved graph can be translated back into the following distribution model:

$ dydisnix-graphcol -s services.nix -i infrastructure.nix
{
  "testService2" = [
    "testtarget2"
  ];
  "testService1" = [
    "testtarget1"
  ];
  "testService3" = [
    "testtarget3"
  ];
}

As you may notice, every service is distributed to a separate machine, so that every network link between a service is a real network connection between machines.

We can also visualize the problem instance of the multiway cut problem. For this, we also need a distribution model that, declares for each service, which target machine is a candidate.

The following distribution model makes all three target machines in the infrastructure model a candidate for every service:

{infrastructure}:

{
  testService1 = [ infrastructure.testtarget1 infrastructure.testtarget2 infrastructure.testtarget3 ];
  testService2 = [ infrastructure.testtarget1 infrastructure.testtarget2 infrastructure.testtarget3 ];
  testService3 = [ infrastructure.testtarget1 infrastructure.testtarget2 infrastructure.testtarget3 ];
}

With the following command we can generate a problem instance representing a host-application graph:

$ dydisnix-multiwaycut -s services.nix -i infrastructure.nix \
  -d distribution.nix --output-graph

providing me the following output:


The above problem graph has the following properties:

  • Each service translates to an app node (prefixed with app:) and each candidate target machine to a host node (prefixed with host:).
  • When a network connection between two services exists (implicitly derived from having an inter-dependency relationship), an edge is generated with a weight of 1.
  • When a target machine is a candidate target for a service, then an edge is generated with a weight of n2 representing a very large number.

The objective of solving the multiway cut problem is to cut edges in the graph in such a way that each terminal (host node) is disconnected from the other terminals (host nodes), in which the total weight of the cuts is minimized.

When applying the approximation algorithm in the paper to the above graph:

$ dydisnix-multiwaycut -s services.nix -i infrastructure.nix \
  -d distribution.nix --output-resolved-graph

we get the following resolved graph:


that can be transformed back into the following distribution model:

$ dydisnix-multiwaycut -s services.nix -i infrastructure.nix \
  -d distribution.nix
{
  "testService2" = [
    "testtarget1"
  ];
  "testService1" = [
    "testtarget1"
  ];
  "testService3" = [
    "testtarget1"
  ];
}

As you may notice by looking at the resolved graph (in which the terminals: testtarget2 and testtarget3 are disconnected) and the distribution model output, all services are distributed to the same machine: testtarget1 making all connections between the services local connections.

In this particular case, the solution is not only close to the optimal solution, but it is the optimal solution.

Conclusion


In this blog post, I have described how I have revised the deployment planning algorithm implementations in the Dynamic Disnix toolset. Their concerns are now much better separated, and the graph-based algorithms now use a general purpose graph library, that can also be used for generating visualizations of the intermediate steps in the generation process.

This revision was not on my short-term planned features list, but I am happy that I did the work. Retrospectively, I regret that I never took the time to finish things up properly after the submission of the paper. Although Dynamic Disnix's quality is still not where I want it to be, it is quite a step forward in making the toolset more usable.

Sadly, it is almost 10 years ago that I started Dynamic Disnix and still there is no offical release yet and the technical debt in Dynamic Disnix is one of the important reasons that I never did an official release. Hopefully, with this step I can do it some day. :-)

The good news is that I made the paper submission deadline and that the paper got accepted for presentation. It brought me to the SEAMS 2011 conference (co-located with ICSE 2011) in Honolulu, Hawaii, allowing me to take pictures such as this one:


Availability


The graph library and new implementations of the deployment planning algorithms described in this blog post are part of the current development version of Dynamic Disnix.

The paper: "A Self-Adaptive Deployment Framework for Service-Oriented Systems" describes the Dynamic Disnix framework (developed 9 years ago) and can be obtained from my publications page.

Acknowledgements


To generate the visualizations I used the Graphviz toolset.

Thursday, September 24, 2020

Assigning unique IDs to services in Disnix deployment models

As described in some of my recent blog posts, one of the more advanced features of Disnix as well as the experimental Nix process management framework is to deploy multiple instances of the same service to the same machine.

To make running multiple service instances on the same machine possible, these tools rely on conflict avoidance rather than isolation (typically used for containers). To allow multiple services instances to co-exist on the same machine, they need to be configured in such a way that they do not allocate any conflicting resources.

Although for small systems it is doable to configure multiple instances by hand, this process gets tedious and time consuming for larger and more technologically diverse systems.

One particular kind of conflicting resource that could be configured automatically are numeric IDs, such as TCP/UDP port numbers, user IDs (UIDs), and group IDs (GIDs).

In this blog post, I will describe how multiple service instances are configured (in Disnix and the process management framework) and how we can automatically assign unique numeric IDs to them.

Configuring multiple service instances


To facilitate conflict avoidance in Disnix and the Nix process management framework, services are configured as follows:

{createManagedProcess, tmpDir}:
{port, instanceSuffix ? "", instanceName ? "webapp${instanceSuffix}"}:

let
  webapp = import ../../webapp;
in
createManagedProcess {
  name = instanceName;
  description = "Simple web application";
  inherit instanceName;

  # This expression can both run in foreground or daemon mode.
  # The process manager can pick which mode it prefers.
  process = "${webapp}/bin/webapp";
  daemonArgs = [ "-D" ];

  environment = {
    PORT = port;
    PID_FILE = "${tmpDir}/${instanceName}.pid";
  };
  user = instanceName;
  credentials = {
    groups = {
      "${instanceName}" = {};
    };
    users = {
      "${instanceName}" = {
        group = instanceName;
        description = "Webapp";
      };
    };
  };

  overrides = {
    sysvinit = {
      runlevels = [ 3 4 5 ];
    };
  };
}

The Nix expression shown above is a nested function that describes how to deploy a simple self-contained REST web application with an embedded HTTP server:

  • The outer function header (first line) specifies all common build-time dependencies and configuration properties that the service needs:

    • createManagedProcess is a function that can be used to define process manager agnostic configurations that can be translated to configuration files for a variety of process managers (e.g. systemd, launchd, supervisord etc.).
    • tmpDir refers to the temp directory in which temp files are stored.
  • The inner function header (second line) specifies all instance parameters -- these are the parameters that must be configured in such a way that conflicts with other process instances are avoided:

    • The instanceName parameter (that can be derived from the instanceSuffix) is a value used by some of the process management backends (e.g. the ones that invoke the daemon command) to derive a unique PID file for the process. When running multiple instances of the same process, each of them requires a unique PID file name.
    • The port parameter specifies to which TCP port the service binds to. Binding the service to a port that is already taken by another service, causes the deployment of this service to fail.
  • In the function body, we invoke the createManagedProcess function to construct configuration files for all supported process manager backends to run the webapp process:

    • As explained earlier, the instanceName is used to configure the daemon executable in such a way that it allocates a unique PID file.
    • The process parameter specifies which executable we need to run, both as a foreground process or daemon.
    • The daemonArgs parameter specifies which command-line parameters need to be propagated to the executable when the process should daemonize on its own.
    • The environment parameter specifies all environment variables. The webapp service uses these variables for runtime property configuration.
    • The user parameter is used to specify that the process should run as an unprivileged user. The credentials parameter is used to configure the creation of the user account and corresponding user group.
    • The overrides parameter is used to override the process manager-agnostic parameters with process manager-specific parameters. For the sysvinit backend, we configure the runlevels in which the service should run.

Although the convention shown above makes it possible to avoid conflicts (assuming that all potential conflicts have been identified and exposed as function parameters), these parameters are typically configured manually:

{ pkgs, system
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager ? "sysvinit"
, ...
}:

let
  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir forceDisableUserChange processManager;
  };

  processType = import ../../nixproc/derive-dysnomia-process-type.nix {
    inherit processManager;
  };
in
rec {
  webapp1 = rec {
    name = "webapp1";
    port = 5000;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "1";
    };
    type = processType;
  };

  webapp2 = rec {
    name = "webapp2";
    port = 5001;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "2";
    };
    type = processType;
  };
}

The above Nix expression shows both a valid Disnix services as well as a valid processes model that composes two web application process instances that can run concurrently on the same machine by invoking the nested constructor function shown in the previous example:

  • Each webapp instance has its own unique instance name, by specifying a unique numeric instanceSuffix that gets appended to the service name.
  • Every webapp instance binds to a unique TCP port (5000 and 5001) that should not conflict with system services or other process instances.

Previous work: assigning port numbers


Although configuring two process instances is still manageable, the configuration process becomes more tedious and time consuming when the amount and the kind of processes (each having their own potential conflicts) grow.

Five years ago, I already identified a resource that could be automatically assigned to services: port numbers.

I have created a very simple port assigner tool that allows you to specify a global ports pool and a target-specific pool pool. The former is used to assign globally unique port numbers to all services in the network, whereas the latter assigns port numbers that are unique to the target machine where the service is deployed to (this is to cope with the scarcity of port numbers).

Although the tool is quite useful for systems that do not consist of too many different kinds of components, I ran into a number limitations when I want to manage a more diverse set of services:

  • Port numbers are not the only numeric IDs that services may require. When deploying systems that consist of self-contained executables, you typically want to run them as unprivileged users for security reasons. User accounts on most UNIX-like systems require unique user IDs, and the corresponding users' groups require unique group IDs.
  • We typically want to manage multiple resource pools, for a variety of reasons. For example, when we have a number of HTTP server instances and a number of database instances, then we may want to pick port numbers in the 8000-9000 range for the HTTP servers, whereas for the database servers we want to use a different pool, such as 5000-6000.

Assigning unique numeric IDs


To address these shortcomings, I have developed a replacement tool that acts as a generic numeric ID assigner.

This new ID assigner tool works with ID resource configuration files, such as:

rec {
  ports = {
    min = 5000;
    max = 6000;
    scope = "global";
  };

  uids = {
    min = 2000;
    max = 3000;
    scope = "global";
  };

  gids = uids;
}

The above ID resource configuration file (idresources.nix) defines three resource pools: ports is a resource that represents port numbers to be assigned to the webapp processes, uids refers to user IDs and gids to group IDs. The group IDs' resource configuration is identical to the users' IDs configuration.

Each resource attribute refers the following configuration properties:

  • The min value specifies the minimum ID to hand out, max the maximum ID.
  • The scope value specifies the scope of the resource pool. global (which is the default option) means that the IDs assigned from this resource pool to services are globally unique for the entire system.

    The machine scope can be used to assign IDs that are unique for the machine where a service is distributed to. When the latter option is used, services that are distributed two separate machines may have the same ID.

We can adjust the services/processes model in such a way that every service will use dynamically assigned IDs and that each service specifies for which resources it requires a unique ID:

{ pkgs, system
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager ? "sysvinit"
, ...
}:

let
  ids = if builtins.pathExists ./ids.nix then (import ./ids.nix).ids else {};

  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir forceDisableUserChange processManager ids;
  };

  processType = import ../../nixproc/derive-dysnomia-process-type.nix {
    inherit processManager;
  };
in
rec {
  webapp1 = rec {
    name = "webapp1";
    port = ids.ports.webapp1 or 0;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "1";
    };
    type = processType;
    requiresUniqueIdsFor = [ "ports" "uids" "gids" ];
  };

  webapp2 = rec {
    name = "webapp2";
    port = ids.ports.webapp2 or 0;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "2";
    };
    type = processType;
    requiresUniqueIdsFor = [ "ports" "uids" "gids" ];
  };
}

In the above services/processes model, we have made the following changes:

  • In the beginning of the expression, we import the dynamically generated ids.nix expression that provides ID assignments for each resource. If the ids.nix file does not exists, we generate an empty attribute set. We implement this construction (in which the absence of ids.nix can be tolerated) to allow the ID assigner to bootstrap the ID assignment process.
  • Every hardcoded port attribute of every service is replaced by a reference to the ids attribute set that is dynamically generated by the ID assigner tool. To allow the ID assigner to open the services model in the first run, we provide a fallback port value of 0.
  • Every service specifies for which resources it requires a unique ID through the requiresUniqueIdsFor attribute. In the above example, both service instances require unique IDs to assign a port number, user ID to the user and group ID to the group.

The port assignments are propagated as function parameters to the constructor functions that configure the services (as shown earlier in this blog post).

We could also implement a similar strategy with the UIDs and GIDs, but a more convenient mechanism is to compose the function that creates the credentials, so that it transparently uses our uids and gids assignments.

As shown in the expression above, the ids attribute set is also propagated to the constructors expression. The constructors expression indirectly composes the createCredentials function as follows:

{pkgs, ids ? {}, ...}:

{
  createCredentials = import ../../create-credentials {
    inherit (pkgs) stdenv;
    inherit ids;
  };

  ...
}

The ids attribute set is propagated to the function that composes the createCredentials function. As a result, it will automatically assign the UIDs and GIDs in the ids.nix expression when the user configures a user or group with a name that exists in the uids and gids resource pools.

To make these UIDs and GIDs assignments go smoothly, it is recommended to give a process instance the same process name, instance name, user and group names.

Using the ID assigner tool


By combining the ID resources specification with the three Disnix models: a services model (that defines all distributable services, shown above), an infrastructure model (that captures all available target machines) and their properties and a distribution model (that maps services to target machines in the network), we can automatically generate an ids configuration that contains all ID assignments:

$ dydisnix -s services.nix -i infrastructure.nix -d distribution.nix \
  --id-resources idresources.nix --output-file ids.nix

The above command will generate an ids configuration file (ids.nix) that provides, for each resource in the ID resources model, a unique assignment to services that are distributed to a target machine in the network. (Services that are not distributed to any machine in the distribution model will be skipped, to not waste too many resources).

The output file (ids.nix) has the following structure:

{
  "ids" = {
    "gids" = {
      "webapp1" = 2000;
      "webapp2" = 2001;
    };
    "uids" = {
      "webapp1" = 2000;
      "webapp2" = 2001;
    };
    "ports" = {
      "webapp1" = 5000;
      "webapp2" = 5001;
    };
  };
  "lastAssignments" = {
    "gids" = 2001;
    "uids" = 2001;
    "ports" = 5001;
  };
}

  • The ids attribute contains for each resource (defined in the ID resources model) the unique ID assignments per service. As shown earlier, both service instances require unique IDs for ports, uids and gids. The above attribute set stores the corresponding ID assignments.
  • The lastAssignments attribute memorizes the last ID assignment per resource. Once an ID is assigned, it will not be immediately reused. This is to allow roll backs and to prevent data to incorrectly get owned by the wrong user accounts. Once the maximum ID limit is reached, the ID assigner will start searching for a free assignment from the beginning of the resource pool.

In addition to assigning IDs to services that are distributed to machines in the network, it is also possible to assign IDs to all services (regardless whether they have been deployed or not):

$ dydisnix -s services.nix \
  --id-resources idresources.nix --output-file ids.nix

Since the above command does not know anything about the target machines, it only works with an ID resources configuration that defines global scope resources.

When you intend to upgrade an existing deployment, you typically want to retain already assigned IDs, while obsolete ID assignment should be removed, and new IDs should be assigned to services that have none yet. This is to prevent unnecessary redeployments.

When removing the first webapp service and adding a third instance:

{ pkgs, system
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager ? "sysvinit"
, ...
}:

let
  ids = if builtins.pathExists ./ids.nix then (import ./ids.nix).ids else {};

  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir forceDisableUserChange processManager ids;
  };

  processType = import ../../nixproc/derive-dysnomia-process-type.nix {
    inherit processManager;
  };
in
rec {
  webapp2 = rec {
    name = "webapp2";
    port = ids.ports.webapp2 or 0;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "2";
    };
    type = processType;
    requiresUniqueIdsFor = [ "ports" "uids" "gids" ];
  };
  
  webapp3 = rec {
    name = "webapp3";
    port = ids.ports.webapp3 or 0;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "3";
    };
    type = processType;
    requiresUniqueIdsFor = [ "ports" "uids" "gids" ];
  };
}

And running the following command (that provides the current ids.nix as a parameter):

$ dydisnix -s services.nix -i infrastructure.nix -d distribution.nix \
  --id-resources idresources.nix --ids ids.nix --output-file ids.nix

we will get the following ID assignment configuration:

{
  "ids" = {
    "gids" = {
      "webapp2" = 2001;
      "webapp3" = 2002;
    };
    "uids" = {
      "webapp2" = 2001;
      "webapp3" = 2002;
    };
    "ports" = {
      "webapp2" = 5001;
      "webapp3" = 5002;
    };
  };
  "lastAssignments" = {
    "gids" = 2002;
    "uids" = 2002;
    "ports" = 5002;
  };
}

As may be observed, since the webapp2 process is in both the current and the previous configuration, its ID assignments will be retained. webapp1 gets removed because it is no longer in the services model. webapp3 gets the next numeric IDs from the resources pools.

Because the configuration of webapp2 stays the same, it does not need to be redeployed.

The models shown earlier are valid Disnix services models. As a consequence, they can be used with Dynamic Disnix's ID assigner tool: dydisnix-id-assign.

Although these Disnix services models are also valid processes models (used by the Nix process management framework) not every processes model is guaranteed to be compatible with a Disnix service model.

For process models that are not compatible, it is possible to use the nixproc-id-assign tool that acts as a wrapper around dydisnix-id-assign tool:

$ nixproc-id-assign --id-resources idresources.nix processes.nix

Internally, the nixproc-id-assign tool converts a processes model to a Disnix service model (augmenting the process instance objects with missing properties) and propagates it to the dydisnix-id-assign tool.

A more advanced example


The webapp processes example is fairly trivial and only needs unique IDs for three kinds of resources: port numbers, UIDs, and GIDs.

I have also developed a more complex example for the Nix process management framework that exposes several commonly used system services on Linux systems, such as:

{ pkgs ? import <nixpkgs> { inherit system; }
, system ? builtins.currentSystem
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager
}:

let
  ids = if builtins.pathExists ./ids.nix then (import ./ids.nix).ids else {};

  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir cacheDir forceDisableUserChange processManager ids;
  };
in
rec {
  apache = rec {
    port = ids.httpPorts.apache or 0;

    pkg = constructors.simpleWebappApache {
      inherit port;
      serverAdmin = "root@localhost";
    };

    requiresUniqueIdsFor = [ "httpPorts" "uids" "gids" ];
  };

  postgresql = rec {
    port = ids.postgresqlPorts.postgresql or 0;

    pkg = constructors.postgresql {
      inherit port;
    };

    requiresUniqueIdsFor = [ "postgresqlPorts" "uids" "gids" ];
  };

  influxdb = rec {
    httpPort = ids.influxdbPorts.influxdb or 0;
    rpcPort = httpPort + 2;

    pkg = constructors.simpleInfluxdb {
      inherit httpPort rpcPort;
    };

    requiresUniqueIdsFor = [ "influxdbPorts" "uids" "gids" ];
  };
}

The above processes model exposes three service instances: an Apache HTTP server (that works with a simple configuration that serves web applications from a single virtual host), PostgreSQL and InfluxDB. Each service requires a unique user ID and group ID so that their privileges are separated.

To make these services more accessible/usable, we do not use a shared ports resource pool. Instead, each service type consumes port numbers from their own resource pools.

The following ID resources configuration can be used to provision the unique IDs to the services above:

rec {
  uids = {
    min = 2000;
    max = 3000;
  };

  gids = uids;

  httpPorts = {
    min = 8080;
    max = 8085;
  };

  postgresqlPorts = {
    min = 5432;
    max = 5532;
  };

  influxdbPorts = {
    min = 8086;
    max = 8096;
    step = 3;
  };
}


The above ID resources configuration defines a shared UIDs and GIDs resource pool, but separate ports resource pools for each service type. This has the following implications if we deploy multiple instances of each service type:

  • All Apache HTTP server instances get a TCP port assignment between 8080-8085.
  • All PostgreSQL server instances get a TCP port assignment between 5432-5532.
  • All InfluxDB server instances get a TCP port assignment between 8086-8096. Since an InfluxDB allocates two port numbers: one for the HTTP server and one for the RPC service (the latter's port number is the base port number + 2). We use a step count of 3 so that we can retain this convention for each InfluxDB instance.

Conclusion


In this blog post, I have described a new tool: dydisnix-id-assign that can be used to automatically assign unique numeric IDs to services in Disnix service models.

Moreover, I have described: nixproc-id-assign that acts a thin wrapper around this tool to automatically assign numeric IDs to services in the Nix process management framework's processes model.

This tool replaces the old dydisnix-port-assign tool in the Dynamic Disnix toolset (described in the blog post written five years ago) that is much more limited in its capabilities.

Availability


The dydisnix-id-assign tool is available in the current development version of Dynamic Disnix. The nixproc-id-assign is part of the current implementation of the Nix process management framework prototype.