Sander van der Burg's blog: On the improvement of software deployment processes and some definitions

Some time ago, I wrote a blog post about techniques and lessons to improve software deployment processes. The take-home message of this blog post was that in order to improve deployment processes, you must automate everything from the very beginning in a software development process and properly decompose the process into sub units to make the process more manageable and efficient.

In this blog post, I'd like to dive a bit deeper into the latter aspect by exploring some definitions of "decomposition units" in the literature and by deriving mappings between them.

Software projects

The first "definition" that I want to mention is the software project, for which I (interestingly enough) could not find anything in the literature. The reason why I start with this term is that software deployment issues often already appear in the early stages of a software development process.

The term "software project" is something which is hard to define formally IMHO. To me they typically manifest themselves as directories of files that I can divide into the following categories:

Executable code. Files typically containing code implementing a program that performs computation and manipulates data.
Resources/data. Files not implementing anything that is executed, which are used or referenced by the program, such as images, configuration files, video, audio, HTML pages, etc.
Build configuration files. Configuration files used by a build system that transform or change the files belonging to the earlier two categories.

For example, executable code is often implemented in higher level programming languages and must be compiled to object code so that the program can be executed. Also many kinds of other processing steps can be executed, such as scaling images to lower resolutions, obfuscating/minifying code, running a style checker, bundling object code and resources etc.

Sometimes it is hard to draw a hard line between executable code and data files. For example, it may be possible that a data artifact (e.g. an HTML page) includes executable code (e.g. embedded JavaScript), and the other way around, such as assembly code containing strings in their code sections for efficiency.

Software projects can often be conveniently created by an Integrated Development Environment (IDE) that typically provides useful templates and automatically fills in many boilerplate settings. However, for small projects, people frequently create software projects manually, for example, by manually creating a directory of source files with a Makefile.

It is probably obvious to notice that dealing with software deployment complexity requires automation and files belonging to the third category (build configuration files) must be provided. Yet, I have seen quite a few projects in the past in which nothing is automated and people still rely on manually executing executing build tasks in an IDE, which is often tedious, time consuming and error prone.

Software modules

An automated build process of a software project provides a basic and typically faster means of (re)producing releases of a software product and is often less error prone than a manual build process.

However, besides build process automation there could still be many other issues. For example, if a software project has a monolithic build structure in which nothing can be built separately, deployment times become unnecessarily long and their configurations often have a huge maintenance complexity. Also, upgrading an existing deployment is typically difficult, expensive and unreliable.

To improve the efficiency of build processes, we need to decompose them into units that can be built separately. An import prerequisite to accomplish build decomposition is functional separation of important aspects of a software project.

A relatively simple concept supporting functional separation is the software module. According to Clemens Szyperski's "Component Software" book, a software module is a unit that has the following characteristics:

A module implements an ADT (Abstract Data Type).
Encapsulates multiple entities, often classes, but sometimes other kinds of entities, such as functions.
Have no concept of instantiation, in other words: there is one and only one instance of a module.

Several programming languages have a notion of modules, such as Module-2, Ada, C# and Java (since version 9). Sometimes the module concept is named differently in these languages. For example, in Ada modules are called packages and in C# they are called assemblies.

Not all programming languages support modular programming. Sometimes external facilities must be used, such as CommonJS in JavaScript. Moreover, modules can also be "simulated" in various ways, such as with static classes or singleton objects.

Encapsulating functionality into modules also typically imposes a certain filesystem structure for organizing the source code files. In some contexts, a module must correspond to a single file (e.g. in CommonJS) and in others to directories of files following a certain convention (e.g. in Java the names of directories should correspond to the package names, and the names of regular files to the name of the enclosing type in the code). Sometimes files belonging to a module can also be bundled into a single archive, such as a Zip container (e.g. a JAR file) or library file (e.g. *.dll or *.so files).

Refactoring a monolithic codebase into modules in a meaningful way is all but trivial. According to the paper "On the criteria to be used in decomposing systems into modules" written by David Parnas, it is a good practice to minimize coupling between modules (i.e. the dependencies between modules should be minimized) and maximize cohesion within modules (i.e. strongly related things should belong to the same module).

Software components

The biggest benefit of modularization is that parts of the code can be effectively reused. Reuse of software assets can be improved even further by turning modules (that typically work on code level) into software components that work on system level. Clemens Szyperski's "Component Software" book says the following about them:

The characteristic properties of a component are that it:

is a unit of independent deployment

is a unit of third-party composition

has no (externally) observable state

The above characteristics have several implications:

Independent deployment means that a component is well separated from the environment and other components, never deployed partially and third parties should not require access to its construction details.
To allow third-party composition a component must be sufficiently self contained and have clear specifications of what it provides and what it requires. In other words, they interact with the environment with well defined interfaces.
No externally observable state means that no distinction can be made between multiple copies of components.

So in what way are components different than modules? From my point of view, modularization is a prerequisite for componentization and some modules may already qualify themselves as minimal components.

However, some notable differences between modules and components is that the former are allowed to have observable state (e.g. having global variables that are imperatively modified) and dependencies on implementations rather than interfaces.

Furthermore, to implement software components standardized component models are frequently used, such as CORBA, COM, EJB, or web services (e.g. SOAP, WSDL, UDDI) that provide various kinds of facilities, such as (some sort of) a platform independent interface, lookup and discovery. Modules typically use the interface facilities provided by a programming language.

Build-Level Components

Does functional separation of a monolithic codebase into modules and/or components also improve deployment? According to Merijn de Jonge's IEEE TSE paper titled: "Build-Level components" this is not necessarily true.

For example, it may still be possible that source code files implementing modules or components on a functional level, are scattered across directories of source code files. For example, between the directories in a codebase, many references may exist (strong coupling) and directories often contain too many files (weak cohesion).

According to the paper, strong coupling and weak cohesion on the build level have the following disadvantages:

potentially reusable code, contained in some of the entangled modules, cannot easily be made available for reuse;

the fixed nature of directory hierarchies makes it hard to add or to remove functionality;

the build system will easily break when the directory structure changes, or when files are removed or renamed.

In the paper, the author shows that Component-Based Software Engineering (CBSE) principles can be applied to the build level as well. Build-Level components can be formed by directories of source files and serve as a unit of composition. Access occurs via build, configuration, and requires interfaces:

The build interface defines which build operations to execute. In a GNU Autotools project following the GNU Coding Standards (used in the paper), these operations correspond to a number standardized make targets, e.g. make all, make install, make dist.
The configuration interface defines which variability points and parameters can be enabled or disabled. In a GNU Autotools project, this interface correspond to the --enable-foo and --disable-foo parameters passed to the configure script -- each enable or disable parameter defines a certain feature that can be enabled or disabled.
The requires interface can be used to bind dependencies to components. In a GNU Autotools project, this interface correspond to the --with-foo and --without-foo parameters passed to the configure script that take the paths to the corresponding dependencies as parameters allowing the configuration script to find it.

Although the paper only uses GNU Autotools-based for implementation purposes, build-level components are not restricted to any build technology -- the only thing that matters is that the operations for these three interfaces are standardized so that any component can be configured, composed, and built uniformly.

The paper describes a collection of smells and some refactor patterns that need to be applied to turn directories of source files into build level components. The rules mentioned in the paper are the following:

Components with directory granularity

Circular dependencies should be prevented

Software building via standardized build interface

Compile-time variability binding via standardized configuration interface

Late binding of dependencies via require interface

Build process definition per component

Configuration process definition per component

Component deployment with build level-packages

Automated component composition

Software packages

As described in the previous sections, functional separation is a prerequisite to compose build level components. One important aspect of build-level components is that build processes of modules and components are separated. But how does build separation affect the overall deployment process (to which the build phase also belongs)?

Many deployment processes are typically carried out by tools called package managers. Package managers install units that are called software packages. According to the paper: "Package Upgrades in FOSS Distributions: Details and Challenges" written by Di Cosmo et al (HotSWUp 2008), a software package can be defined as follows:

Packages are abstractions defining the granularity at which users can act (add, remove, upgrade, etc.) on available software.

According to the paper a package is typically a bundle of 3 parts:

Set of files. Contains all kinds of files that must be copied somewhere to the host system to make the software work, such as scripts, binaries, resources etc.
Set of valued meta-information. Contains various kinds of meta attributes, such as the name of the package, the version, a description and its license. Most importantly, it contains information about the inter-package relationships which includes a set of dependencies on other packages and a set of conflicts with other packages. Package managers typically install its required dependencies automatically and refuses to install if a conflict has been encountered.
Executable configuration scripts (also known as maintainer scripts). These are basically scripts that imperatively "glue" files from the package to files already residing on the system. For example, after a certain package has been installed, some configuration files of the host system are automatically adapted so that it can be used properly.

Getting a software project packaged typically involves defining the meta data (including the dependencies/conflicts on external packages), bundling the build process (for source package managers) or the resulting build artifacts (for binary package managers), and composing maintainer scripts taking care of the remaining bits to make the package work (although I would personally not recommend using these kinds of scripts).

This process already works for big monolithic software projects. However, it has several drawbacks for these kinds of projects. Since it needs to deploy a big project as a whole, deployment is typically an expensive process. Not only a fresh installation of a package takes time, but also upgrading, since it has to replace an existing installation as a whole instead of the affected areas only.

Moreover, upgrading is also quite dangerous. Many package managers typically replace and remove files belonging to a package that reside in global locations on the filesystem, such as /usr/bin, /usr/lib (on Linux) or C:\WINDOWS\SYSTEM32 (on Windows). If an upgrade process gets interrupted, the system might reach an inconsistent state for which it might be difficult (or impossible) to do a rollback. The bigger a project is the more severe the potential damage becomes.

Packaging smaller units of a software project (e.g. a build-level component) is typically more work, but also has great benefits. It allows certain, smaller pieces of a software projects to be replaced separately, significantly increasing the efficiency and reliability of the upgrades. Moreover, the dependencies of software components and build-level components have already been identified and only need to be translated to the corresponding packages that provide them.

Nix packages

I typically use the Nix package manager (and related tools) for deployment activities. It borrows concepts from purely functional programming languages to make deployment reliable, reproducible and efficient.

In what way do packages deployed by Nix conform to the definition of software package shown earlier?

Deployment in Nix is driven by build recipes (called Nix expressions) that build packages including all its dependencies from source. Every package build (indirectly) invokes the derivation {} function that composes an isolated environment in which builds are executed in such a way that only the declared dependencies can be found and anything else cannot influence the build. The function arguments include package metadata, such as a description, license, maintainer etc. and the package dependencies.

References to dependencies in Nix are exact meaning that they bind to specific builds of other Nix packages. Conventional package managers, software components and build-level components typically use nominal version specifications consisting of the names and version numbers of the packages, which are less strict. Mapping nominal dependencies to exact dependencies is not always trivial. For example, nominal version ranges are unsupported in Nix and must be snapshotted. In an earlier blog post that describes how to deploy NPM packages with Nix has more details about this.

Another notable trait of Nix is that is has no notion of conflicts. In Nix, any package can coexist with another because they are all stored in isolated directories. However, conflicts may also indicate runtime conflicts between two packages. These kinds of issues need to be solved by other means.

Finally, Nix packages have no configuration (or maintainer) scripts, because they imperatively modify the system's state which conflicts with its underlying purely functional deployment model. Many things that configuration scripts typically do are accomplished in a different way if Nix is used for deployment. For example, configuration files are not adapted, but generated in a Nix expression and deployed as a Nix package. Service activation is typically done by generating a job description file (e.g. init script or systemd job) that starts and stops it.

NixOS configurations, Disnix configurations, Hydra jobsets

If something is packaged in a Nix expression you could easily broaden the application area of deployment:

With a few small modifications (mainly encapsulating several packages into a jobset), a Nix package can be turned into a Hydra jobset, so that a project can be integrated and tested continuously.
A package can be referenced from a NixOS module that, for example, automatically starts and stops a package on startup and shutdown. NixOS can be used to deploy entire system configurations from a single declarative specification in which the module is enabled.
A collection of NixOS configurations can also be deployed in a network of physical or virtual machines through NixOps.
A package can be turned into service by adding a specification of inter-dependencies (services that may reside on other machines in a network). These services can be used to compose a Disnix configuration that deploys services to machines in a network.

Summary

I can summarize all the terms described in this blog post and the activities that need to be performed to implement them in the following chart:

Concluding remarks

In this blog post, I have described some terminology and potential mappings between them with the purpose of defining a reengineering process that makes deployment processes more manageable and efficient.

The terms and mappings used in this blog post are quite abstract. However, if we make a number of concrete technology choices, e.g. a programming language (Java), component technology (web services), package manager (Nix), we can define a more concrete process allowing someone to make considerable improvements.

Moreover, the terms described in this blog post are idealistic. In practice, most units that are called modules or components do not fully qualify themselves as such, while it is still possible to package and deploy them individually. Perhaps, it would also be useful to make "weaker" definitions of some of the terms described in this blog post and to look for their corresponding minimum requirements.

Finally, we can also look into more refactor/reengineering patterns for the other terms and possible automation of them.

Sander van der Burg's blog

Tuesday, December 30, 2014

On the improvement of software deployment processes and some definitions