Saturday, October 31, 2020

Building multi-process Docker images with the Nix process management framework

Some time ago, I described my experimental Nix-based process management framework, which makes it possible to automatically deploy running processes (sometimes also ambiguously called services) from declarative specifications written in the Nix expression language.

The framework is built around two concepts. As its name implies, the Nix package manager is used to deploy all required packages and static artifacts, and a process manager of choice (e.g. sysvinit, systemd, supervisord and others) is used to manage the life-cycles of the processes.

Moreover, the framework is built around flexible concepts that also allow integration with solutions that do not qualify as process managers (but can still be used as such), such as Docker -- each process instance can be deployed as a Docker container with a shared Nix store, using the host system's network.

As explained in an earlier blog post, Docker has become so popular that it is now a de facto standard for deploying (micro)services (often as a utility in the Kubernetes solution stack).

When deploying a system that consists of multiple services with Docker, a typical strategy (and recommended practice) is to use multiple containers that each have only one root application process. The advantages of this approach are that Docker can control the life-cycles of the applications, and that each process is (somewhat) isolated/protected from other processes and the host system.

By default, containers are isolated, but if they need to interact with other processes, then they can use all kinds of integration facilities -- for example, they can share namespaces, or use shared volumes.

In some situations, it may also be desirable to deviate from the one-root-process-per-container practice -- in some systems, processes may need to interact quite intensively (e.g. with IPC mechanisms, shared files or shared memory, or a combination of these), in which case the container boundaries introduce more inconveniences than benefits.

Moreover, when running multiple processes in a single container, common dependencies can typically be shared more efficiently, leading to lower disk space and RAM consumption.

As explained in my previous blog post (that explores various Docker concepts), sharing dependencies between containers only works if containers are constructed from images that share the same layers with the same shared libraries. In practice, this form of sharing is not always as efficient as we want it to be.

Configuring a Docker image to run multiple application processes is somewhat cumbersome -- the official Docker documentation describes two solutions: one relies on a wrapper script that starts multiple processes in the background and a loop that waits for the "main process" to terminate; the other uses a process manager, such as supervisord.
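
As an illustration of the first approach, the following expression is a hypothetical sketch (it is not part of the framework, and webapp and anotherService are placeholder packages) that generates such a wrapper script with writeShellScript and uses it as the image's command:

{dockerTools, writeShellScript, webapp, anotherService}:

let
  # Wrapper script that starts both processes in the background and stops
  # the container as soon as one of them terminates
  startAll = writeShellScript "start-all" ''
    ${webapp}/bin/webapp &
    ${anotherService}/bin/anotherService &

    # Wait for the first process to exit and propagate its exit status
    wait -n
    exit $?
  '';
in
dockerTools.buildImage {
  name = "naive-multiprocess";
  tag = "test";

  config = {
    Cmd = [ "${startAll}" ];
  };
}

A drawback of this approach is that the wrapper script does not supervise the processes it has spawned: when one of them crashes, nothing restarts it. Tasks like these are exactly what a process manager takes care of.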

I realised that I could solve this problem much more conveniently by combining the dockerTools.buildImage {} function in Nixpkgs (that builds Docker images with the Nix package manager) with the Nix process management abstractions.

I have created my own abstraction function, createMultiProcessImage, that builds multi-process Docker images managed by any supported process manager that works in a Docker container.

In this blog post, I will describe how this function is implemented and how it can be used.

Creating images for single root process containers


As shown in earlier blog posts, creating a Docker image with Nix for a single root application process is very straightforward.

For example, we can build an image that launches a trivial web application service with an embedded HTTP server (as shown in many of my previous blog posts), as follows:

{dockerTools, webapp}:  

dockerTools.buildImage {
  name = "webapp";
  tag = "test";

  runAsRoot = ''
    ${dockerTools.shadowSetup}
    groupadd webapp
    useradd webapp -g webapp -d /dev/null
  '';

  config = {
    Env = [ "PORT=5000" ];
    Cmd = [ "${webapp}/bin/webapp" ];
    Expose = {
      "5000/tcp" = {};
    };
  };
}

The above Nix expression (default.nix) invokes the dockerTools.buildImage function to automatically construct an image with the following properties:

  • The image is named webapp and has the version tag test.
  • The web application service requires some state to be initialized before it can be used. To configure state, we can run instructions in a QEMU virtual machine with root privileges (runAsRoot).

    In the above deployment Nix expression, we create an unprivileged user and group named webapp. For production deployments, it is typically recommended to drop root privileges for security reasons.
  • The Env directive is used to configure environment variables. The PORT environment variable is used to configure the TCP port where the service should bind to.
  • The Cmd directive starts the webapp process in foreground mode. The life-cycle of the container is bound to this application process.
  • Expose declares that the container listens on TCP port 5000. To make the service reachable by clients, the port still needs to be published, e.g. with docker run -p (as shown below).

We can build the Docker image as follows:

$ nix-build

load it into Docker with the following command:

$ docker load -i result

and launch a container instance using the image as a template:

$ docker run -it -p 5000:5000 webapp:test

If the deployment of the container succeeded, we should get a response from the webapp process by running:

$ curl http://localhost:5000
<!DOCTYPE html>
<html>
  <head>
    <title>Simple test webapp</title>
  </head>
  <body>
    Simple test webapp listening on port: 5000
  </body>
</html>

Creating multi-process images


As shown in previous blog posts, the webapp process is part of a bigger system, namely: a web application system with an Nginx reverse proxy forwarding requests to multiple webapp instances:

{ pkgs ? import <nixpkgs> { inherit system; }
, system ? builtins.currentSystem
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager
}:

let
  sharedConstructors = import ../services-agnostic/constructors.nix {
    inherit pkgs stateDir runtimeDir logDir cacheDir tmpDir forceDisableUserChange processManager;
  };

  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir forceDisableUserChange processManager;
  };
in
rec {
  webapp = rec {
    port = 5000;
    dnsName = "webapp.local";

    pkg = constructors.webapp {
      inherit port;
    };
  };

  nginx = rec {
    port = 8080;

    pkg = sharedConstructors.nginxReverseProxyHostBased {
      webapps = [ webapp ];
      inherit port;
    } {};
  };
}

The Nix expression above shows a simple variant of that system's processes model, which consists of only two process instances:

  • The webapp process is (as shown earlier) an application that returns a static HTML page.
  • nginx is configured as a reverse proxy to forward incoming connections to multiple webapp instances using the virtual host header property (dnsName).

    If somebody connects to the nginx server with the host name webapp.local, then the request is forwarded to the webapp service.

Configuration steps


To allow all processes in the process model shown to be deployed to a single container, we need to execute the following steps in the construction of an image:

  • Instead of deploying a single package, such as webapp, we need to refer to a collection of packages and/or configuration files that can be managed with a process manager, such as sysvinit, systemd or supervisord.

    The Nix process management framework provides all kinds of Nix function abstractions to accomplish this.

    For example, the following function invocation builds a configuration profile for the sysvinit process manager, containing a collection of sysvinit scripts (also known as LSB Init compliant scripts):

    profile = import ../create-managed-process/sysvinit/build-sysvinit-env.nix {
      exprFile = ./processes.nix;
      stateDir = "/var";
    };
        

  • Similar to single root process containers, we may also need to initialize state. For example, we need to create common FHS state directories (e.g. /tmp, /var etc.) in which services can store their relevant state files (e.g. log files, temp files).

    This can be done by running the following command:

    nixproc-init-state --state-dir /var
        
  • Another property that multi-process containers have in common is that they may also require the presence of unprivileged users and groups, for security reasons.

    With the following commands, we can automatically generate all required users and groups specified in a deployment profile:

    ${dysnomia}/bin/dysnomia-addgroups ${profile}
    ${dysnomia}/bin/dysnomia-addusers ${profile}
        
  • Instead of starting a (single root) application process, we need to start a process manager that manages the processes that we want to deploy. As already explained, the framework allows you to pick from multiple options.

Starting a process manager as a root process


Of all the process managers that the framework currently supports, the most straightforward option to use in a Docker container is supervisord.

To use it, we can create a symlink to the supervisord configuration in the deployment profile:

ln -s ${profile} /etc/supervisor

and then start supervisord as a root process with the following command directive:

Cmd = [
  "${pkgs.pythonPackages.supervisor}/bin/supervisord"
  "--nodaemon"
  "--configuration" "/etc/supervisor/supervisord.conf"
  "--logfile" "/var/log/supervisord.log"
  "--pidfile" "/var/run/supervisord.pid"
];

(As a sidenote: creating a symlink is not strictly required, but makes it possible to control running services with the supervisorctl command-line tool).
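
Putting these pieces together with the configuration steps from the previous section, a manually composed supervisord-managed image could roughly look as follows. This is a simplified, hypothetical sketch rather than the framework's actual createMultiProcessImage implementation -- it assumes that profile refers to the generated configuration profile described earlier, and that the framework's tools are reachable from the image construction environment:

{pkgs, dockerTools, dysnomia, profile}:

dockerTools.buildImage {
  name = "multiprocess-sketch";
  tag = "test";

  runAsRoot = ''
    ${dockerTools.shadowSetup}

    # Create common FHS state directories, such as /var/log and /var/run
    # (assuming nixproc-init-state is available on the PATH of the build VM)
    nixproc-init-state --state-dir /var

    # Create the unprivileged users and groups declared in the profile
    ${dysnomia}/bin/dysnomia-addgroups ${profile}
    ${dysnomia}/bin/dysnomia-addusers ${profile}

    # Make the supervisord configuration available on a well-known location,
    # so that supervisorctl can find it
    mkdir -p /etc
    ln -s ${profile} /etc/supervisor
  '';

  config = {
    Cmd = [
      "${pkgs.pythonPackages.supervisor}/bin/supervisord"
      "--nodaemon"
      "--configuration" "/etc/supervisor/supervisord.conf"
      "--logfile" "/var/log/supervisord.log"
      "--pidfile" "/var/run/supervisord.pid"
    ];
  };
}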

Supervisord is not the only option. We can also use sysvinit scripts, but doing so is a bit tricky. As explained earlier, the life-cycle of a container is bound to a running root process (in foreground mode).

sysvinit scripts do not run in the foreground, but start processes that daemonize and terminate immediately, leaving daemon processes behind that remain running in the background.

As described in an earlier blog post about translating high-level process management concepts, it is also possible to run "daemons in the foreground" by creating a proxy script. We can also make a similar foreground proxy for a collection of daemons:

#!/bin/bash -e

_term()
{
    nixproc-sysvinit-runactivity -r stop ${profile}
    kill "$pid"
    exit 0
}

nixproc-sysvinit-runactivity start ${profile}

# Keep process running, but allow it to respond to the TERM and INT
# signals so that all scripts are stopped properly

trap _term TERM
trap _term INT

tail -f /dev/null & pid=$!
wait "$pid"

The above proxy script does the following:

  • It first starts all sysvinit scripts by invoking the nixproc-sysvinit-runactivity start command.
  • Then it registers a signal handler for the TERM and INT signals. The corresponding callback triggers a shutdown procedure.
  • Then it invokes a dummy command that keeps running in the foreground without consuming many system resources (tail -f /dev/null) and waits for it to terminate.
  • The signal handler properly deactivates all processes in reverse order (with the nixproc-sysvinit-runactivity -r stop command), and finally terminates the dummy command, causing the script (and the container) to stop (a sketch of how this script can be used as the container's root process follows below).
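
Note that the ${profile} reference in the proxy script implies that the script itself is generated with Nix. The following expression is a hypothetical sketch (not the framework's actual implementation) that shows how such a proxy could be generated with writeShellScript; the resulting store path can then be used as the image's Cmd, in the same way as the supervisord example shown earlier:

{writeShellScript, profile}:

writeShellScript "sysvinit-foreground-proxy" ''
  # Abort on errors (equivalent to the -e flag of the script shown above)
  set -e

  _term()
  {
      # Deactivate all processes in reverse order and stop the dummy command
      nixproc-sysvinit-runactivity -r stop ${profile}
      kill "$pid"
      exit 0
  }

  # Start all sysvinit scripts in the profile
  # (assuming nixproc-sysvinit-runactivity is on the PATH in the container)
  nixproc-sysvinit-runactivity start ${profile}

  # Respond to the TERM and INT signals so that all scripts are stopped properly
  trap _term TERM
  trap _term INT

  # Dummy foreground process that keeps the container alive
  tail -f /dev/null & pid=$!
  wait "$pid"
''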

In addition to supervisord and sysvinit, we can also use Disnix as a process manager, by using a similar strategy with a foreground proxy.

Other configuration properties


The above configuration properties suffice to get a multi-process container running. However, to make working with such containers more practical from a user perspective, we may also want to:

  • Add basic shell utilities to the image, so that we can control the processes, investigate log files (in case of errors), and carry out other maintenance tasks (see the sketch after this list).
  • Add a .bashrc configuration file to make file coloring work for the ls command, and to provide a decent prompt in a shell session.
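
For example, basic utilities could be added through the contents parameter of dockerTools.buildImage. The following fragment is a hypothetical sketch of that idea (not how createMultiProcessImage implements it):

contents = [
  pkgs.bashInteractive # interactive shell for docker exec sessions
  pkgs.coreutils       # ls, cat and other basic utilities
  pkgs.gnugrep         # searching through log files
  pkgs.less            # paging through log files
];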

Usage


The configuration steps described in the previous sections are wrapped into a function named createMultiProcessImage, which itself is a thin wrapper around the dockerTools.buildImage function in Nixpkgs -- it accepts the same parameters, plus a number of additional parameters that are specific to multi-process configurations.

The following function invocation builds a multi-process container deploying our example system, using supervisord as a process manager:

let
  pkgs = import <nixpkgs> {};

  createMultiProcessImage = import ../../nixproc/create-multi-process-image/create-multi-process-image.nix {
    inherit pkgs system;
    inherit (pkgs) dockerTools stdenv;
  };
in
createMultiProcessImage {
  name = "multiprocess";
  tag = "test";
  exprFile = ./processes.nix;
  stateDir = "/var";
  processManager = "supervisord";
}

After building the image and deploying a container with the following commands:

$ nix-build
$ docker load -i result
$ docker run -it --name mycontainer --network host multiprocess:test

we should be able to connect to the webapp instance via the nginx reverse proxy:

$ curl -H 'Host: webapp.local' http://localhost:8080
<!DOCTYPE html>
<html>
  <head>
    <title>Simple test webapp</title>
  </head>
  <body>
    Simple test webapp listening on port: 5000
  </body>
</html>

As explained earlier, the constructed image also provides extra command-line utilities to do maintenance tasks, and control the life-cycle of the individual processes.

For example, we can "connect" to the running container, and check which processes are running:

$ docker exec -it mycontainer /bin/bash
# supervisorctl
nginx                            RUNNING   pid 11, uptime 0:00:38
webapp                           RUNNING   pid 10, uptime 0:00:38
supervisor>

If we change the processManager parameter to sysvinit, we can deploy a multi-process image in which the foreground proxy script is used as a root process (that starts and stops sysvinit scripts).

We can control the life-cycle of each individual process by directly invoking the sysvinit scripts in the container:

$ docker exec -it mycontainer /bin/bash
$ /etc/rc.d/init.d/webapp status
webapp is running with Process ID(s) 33.

$ /etc/rc.d/init.d/nginx status
nginx is running with Process ID(s) 51.

Although having extra command-line utilities to do administration tasks is useful, a disadvantage is that they considerably increase the size of the image.

To save storage costs, it is also possible to disable interactive mode to exclude these packages:

let
  pkgs = import <nixpkgs> {};

  createMultiProcessImage = import ../../nixproc/create-multi-process-image/create-multi-process-image.nix {
    inherit pkgs system;
    inherit (pkgs) dockerTools stdenv;
  };
in
createMultiProcessImage {
  name = "multiprocess";
  tag = "test";
  exprFile = ./processes.nix;
  stateDir = "/var";
  processManager = "supervisord";
  interactive = false; # Do not install any additional shell utilities
}

Discussion


In this blog post, I have described a new utility function in the Nix process management framework: createMultiProcessImage -- a thin wrapper around the dockerTools.buildImage function that can be used to conveniently build multi-process Docker images, using any Docker-capable process manager that the Nix process management framework supports.

Besides the fact that we can conveniently construct multi-process images, this function also has the advantage (similar to the dockerTools.buildImage function) that Nix is only required to construct the image. Nix is not a requirement for deploying containers from a multi-process image.

There is also a drawback: similar to "ordinary" multi-process container deployments, when a single process needs to be upgraded, the entire container must be redeployed, which also requires the user to terminate all other running processes.

Availability


The createMultiProcessImage function is part of the current development version of the Nix process management framework that can be obtained from my GitHub page.

Thursday, October 8, 2020

Transforming Disnix models to graphs and visualizing them

In my previous blog post, I described a new tool in the Dynamic Disnix toolset that can be used to automatically assign unique numeric IDs to services in a Disnix services model. Unique numeric IDs can represent all kinds of useful resources, such as TCP/UDP port numbers, user IDs (UIDs), and group IDs (GIDs).

Although I am quite happy to have this tool at my disposal, implementing it was much more difficult and time-consuming than I expected. Aside from the fact that the problem is not as obvious as it may sound, the main reason is that the Dynamic Disnix toolset was originally developed as a proof-of-concept implementation for a research paper under very high time pressure. As a result, it has accumulated quite a bit of technical debt that, as of today, is still at a fairly high level (but much better than it was when I completed the PoC).

For the ID assigner tool, I needed to make changes to the foundations of the toolset, such as the model parsing libraries. As a consequence, all kinds of related aspects in the toolset started to break, such as the deployment planning algorithm implementations.

Fixing some of these algorithm implementations was much more difficult than I expected -- they were not properly documented, not decomposed into functions, and had little to no reuse of common concepts; as a result, they were difficult to understand and change. I was forced to re-read the papers that I used as a basis for these algorithms.

To prevent myself from having to go through such a painful process again, I have decided to revise them in such a way that they are easier to understand and maintain.

Dynamically distributing services


The deployment models in the core Disnix toolset are static. For example, the distribution of services to machines in the network is done in a distribution model in which the user has to manually map services in the services model to target machines in the infrastructure model (and optionally to container services hosted on the target machines).

Each time a condition changes, e.g. the system needs to scale up or a machine crashes and the system needs to recover, a new distribution model must be configured and the system must be redeployed. For big complex systems that need to be reconfigured frequently, manually specifying new distribution models becomes very impractical.

As I have already explained in older blog posts, to cope with the limitations of static deployment models (and other static configuration aspects), I have developed Dynamic Disnix, in which various configuration aspects can be automated, including the distribution of services to machines.

A strategy for dynamically distributing services to machines can be specified in a QoS model, which typically consists of two phases:

  • First, a candidate target selection must be made, in which for each service the appropriate candidate target machines are selected.

    Not all machines are capable of hosting a certain service for functional and non-functional reasons -- for example, an i686-linux machine is not capable of running a binary compiled for an x86_64-linux machine.

    A machine can also be exposed to the public internet, and as a result, may not be suitable to host a service that exposes privacy-sensitive information.
  • After the suitable candidate target machines are known for each service, we must decide to which candidate machine each service gets distributed.

    This can be done in many ways. The strategy that we want to use is typically based on all kinds of non-functional requirements.

    For example, we can optimize a system's reliability by minimizing the amount of network links between services, requiring a strategy in which services that depend on each other are mapped to the same machine, as much as possible.

Graph-based optimization problems


In the Dynamic Disnix toolset, I have implemented a variety of distribution algorithms/strategies for all kinds of purposes.

I did not "invent" most of them -- for some, I got inspiration from papers in the academic literature.

Two of the more advanced deployment planning algorithms are graph-based, to accomplish the following goals:

  • Reliable deployment. Network links are a potential source of unreliability in a distributed system -- connections may fail, become slow, or be interrupted frequently. By minimizing the number of network links between services (by co-locating them on the same machine), their impact can be reduced. To keep deployments from becoming too expensive, this should be done with a minimal number of machines.

    As described in the paper: "Reliable Deployment of Component-based Applications into Distributed Environments" by A. Heydarnoori and F. Mavaddat, this problem can be transformed into a graph problem: the multiway cut problem (which is NP-hard).

    Unless a proof that P = NP exists, it can only be solved in polynomial time with an approximation algorithm that comes close to the optimal solution.
  • Fragile deployment. Inspired by the above deployment problem, I also came up with the opposite problem (as my own "invention") -- how can we make every connection between services a true network link (not local), so that we can test a system for robustness, using a minimal number of machines?

    This problem can be modeled as a graph coloring problem (which is NP-hard as well). I used one of the approximation algorithms described in the paper: "New Methods to Color the Vertices of a Graph" by D. Brélaz to implement a solution.

To work with these graph-based algorithms, I originally did not apply any transformations -- because of time pressure, I directly worked with objects from the Disnix models (e.g. services, target machines) and somewhat "glued" these together with generic data structures, such as lists and hash tables.

As a result, when looking at the implementation, it is very hard to get an understanding of the process and how an implementation aspect relates to a concept described in the papers shown above.

In my revised version, I have implemented a general-purpose graph library that can be used to solve all kinds of graph-related problems.

Aside from using a general graph library, I have also separated the graph-based generation processes into the following steps:

  • After opening the Disnix input models (such as the services, infrastructure, and distribution models), I transform them to a graph representing an instance of the problem domain.
  • After the graph has been generated, I apply the approximation algorithm to the graph data structure.
  • Finally, I transform the resolved graph back to a distribution model that should provide our desired distribution outcome.

This new organization provides better separation of concerns, allows common concepts (such as graph operations) to be reused, and, as a result, keeps the implementations much closer to the approximation algorithms described in the papers.

Visualizing the generation process


Another advantage of having a reusable graph implementation is that we can easily extend it to visualize the problem graphs.

When I combine these features with my earlier work that visualizes services models, and a new tool that visualizes infrastructure models, I can make the entire generation process transparent.

For example, the following services model:

{system, pkgs, distribution, invDistribution}:

let
  customPkgs = import ./pkgs { inherit pkgs system; };
in
rec {
  testService1 = {
    name = "testService1";
    pkg = customPkgs.testService1;
    type = "echo";
  };

  testService2 = {
    name = "testService2";
    pkg = customPkgs.testService2;
    dependsOn = {
      inherit testService1;
    };
    type = "echo";
  };

  testService3 = {
    name = "testService3";
    pkg = customPkgs.testService3;
    dependsOn = {
      inherit testService1 testService2;
    };
    type = "echo";
  };
}

can be visualized as follows:

$ dydisnix-visualize-services -s services.nix


The above services model and corresponding visualization capture the following properties:

  • They describe three services (as denoted by ovals).
  • The arrows denote inter-dependency relationships (the dependsOn attribute in the services model).

    When a service has an inter-dependency on another service, it means that the latter service has to be activated first, and that the dependent service needs to know how to reach it.

    testService2 depends on testService1, and testService3 depends on both of the other two services.

We can also visualize the following infrastructure model:

{
  testtarget1 = {
    properties = {
      hostname = "testtarget1";
    };
    containers = {
      mysql-database = {
        mysqlPort = 3306;
      };
      echo = {};
    };
  };

  testtarget2 = {
    properties = {
      hostname = "testtarget2";
    };
    containers = {
      mysql-database = {
        mysqlPort = 3306;
      };
    };
  };

  testtarget3 = {
    properties = {
      hostname = "testtarget3";
    };
  };
}

with the following command:

$ dydisnix-visualize-infra -i infrastructure.nix

resulting in the following visualization:


The above infrastructure model declares three machines. Each target machine provides a number of container services (such as a MySQL database server, and echo that acts as a testing container).

With the following command, we can generate a problem instance for the graph coloring problem using the above services and infrastructure models as inputs:

$ dydisnix-graphcol -s services.nix -i infrastructure.nix \
  --output-graph

resulting in the following graph:


The graph shown above captures the following properties:

  • Each service translates to a node.
  • When an inter-dependency relationship exists between two services, it gets translated to a (bi-directional) link representing a network connection (the rationale is that services with an inter-dependency relationship interact with each other over a network connection).

Each target machine translates to a color, which we can represent with a numeric index -- 0 is testtarget1, 1 is testtarget2 and so on.

The following command generates the resolved problem instance graph in which each vertex has a color assigned:

$ dydisnix-graphcol -s services.nix -i infrastructure.nix \
  --output-resolved-graph

resulting in the following visualization:


(As a sidenote: in the above graph, colors are represented by numbers. In theory, I could also use real colors, but if I also want the graph to remain visually appealing, I would need to solve a color picking problem, which is beyond the scope of my refactoring objective).

The resolved graph can be translated back into the following distribution model:

$ dydisnix-graphcol -s services.nix -i infrastructure.nix
{
  "testService2" = [
    "testtarget2"
  ];
  "testService1" = [
    "testtarget1"
  ];
  "testService3" = [
    "testtarget3"
  ];
}

As you may notice, every service is distributed to a separate machine, so that every network link between services is a real network connection between machines.

We can also visualize the problem instance of the multiway cut problem. For this, we also need a distribution model that declares, for each service, which target machines are candidates.

The following distribution model makes all three target machines in the infrastructure model a candidate for every service:

{infrastructure}:

{
  testService1 = [ infrastructure.testtarget1 infrastructure.testtarget2 infrastructure.testtarget3 ];
  testService2 = [ infrastructure.testtarget1 infrastructure.testtarget2 infrastructure.testtarget3 ];
  testService3 = [ infrastructure.testtarget1 infrastructure.testtarget2 infrastructure.testtarget3 ];
}

With the following command we can generate a problem instance representing a host-application graph:

$ dydisnix-multiwaycut -s services.nix -i infrastructure.nix \
  -d distribution.nix --output-graph

providing me the following output:


The above problem graph has the following properties:

  • Each service translates to an app node (prefixed with app:) and each candidate target machine to a host node (prefixed with host:).
  • When a network connection between two services exists (implicitly derived from having an inter-dependency relationship), an edge is generated with a weight of 1.
  • When a target machine is a candidate target for a service, then an edge is generated with a weight of n² representing a very large number.

The objective of solving the multiway cut problem is to cut edges in the graph in such a way that each terminal (host node) is disconnected from the other terminals (host nodes), while the total weight of the cut edges is minimized.

When applying the approximation algorithm in the paper to the above graph:

$ dydisnix-multiwaycut -s services.nix -i infrastructure.nix \
  -d distribution.nix --output-resolved-graph

we get the following resolved graph:


that can be transformed back into the following distribution model:

$ dydisnix-multiwaycut -s services.nix -i infrastructure.nix \
  -d distribution.nix
{
  "testService2" = [
    "testtarget1"
  ];
  "testService1" = [
    "testtarget1"
  ];
  "testService3" = [
    "testtarget1"
  ];
}

As you may notice by looking at the resolved graph (in which the terminals testtarget2 and testtarget3 are disconnected) and the distribution model output, all services are distributed to the same machine, testtarget1, making all connections between the services local connections.

In this particular case, the solution is not only close to the optimal solution, but it is the optimal solution.

Conclusion


In this blog post, I have described how I have revised the deployment planning algorithm implementations in the Dynamic Disnix toolset. Their concerns are now much better separated, and the graph-based algorithms now use a general purpose graph library, that can also be used for generating visualizations of the intermediate steps in the generation process.

This revision was not on my short-term planned features list, but I am happy that I did the work. Retrospectively, I regret that I never took the time to finish things up properly after the submission of the paper. Although Dynamic Disnix's quality is still not where I want it to be, it is quite a step forward in making the toolset more usable.

Sadly, it has been almost 10 years since I started Dynamic Disnix and there is still no official release; the technical debt in Dynamic Disnix is one of the important reasons why I never did one. Hopefully, with this step I can do an official release some day. :-)

The good news is that I made the paper submission deadline and that the paper got accepted for presentation. It brought me to the SEAMS 2011 conference (co-located with ICSE 2011) in Honolulu, Hawaii, allowing me to take pictures such as this one:


Availability


The graph library and new implementations of the deployment planning algorithms described in this blog post are part of the current development version of Dynamic Disnix.

The paper: "A Self-Adaptive Deployment Framework for Service-Oriented Systems" describes the Dynamic Disnix framework (developed 9 years ago) and can be obtained from my publications page.

Acknowledgements


To generate the visualizations I used the Graphviz toolset.