I have observed that in organizations like these, configuration management (CM) is typically nobody's (full) responsibility. People prefer to stick to their primary responsibilities and typically carry out change activities in an ad-hoc and unstructured way.
Not properly implementing changes has a number of serious implications. For example, some problems I have encountered are:
- Delays. There are many factors that unnecessarily increase the time it takes to implement a change. Many of my big delays were caused by having to search for all the relevant installation artifacts, such as documentation, installation discs, and so on. I have also encountered many times that artifacts were missing, requiring me to obtain copies elsewhere.
- Errors. Any change could potentially lead to errors, for many kinds of reasons. For example, implementing a set of changes in the wrong order could break a system. Also, the components of which a system consists may have complex dependencies on other components that have to be met. Quite often, it is not fully clear what the dependencies of a component or system are, especially when documentation is incomplete or missing.
Moreover, after having solved an error, you need to remember many arbitrary details, such as workarounds, that tend to become forgotten knowledge over time.
- Disruptions. When implementing changes, a system may be partially or fully unavailable until all changes have been implemented. Preferably, this time window should be as short as possible. Unfortunately, this window of inconsistency is likely to become quite large when the configuration management process is suboptimal or error-prone.
Whether an organization is small or big, these problems cost valuable time and money. To alleviate them, it is in my opinion unavoidable to have a structured way of carrying out changes, so that a system maintains its integrity.
Big organizations typically have information systems, people and management procedures to support structured configuration management, because failures are typically too costly for them. There are also standards available (such as the IEEE 828-2012 Standard for Configuration Management in Systems and Software Engineering) that they may use as a reference for implementing a configuration management process.
However, in small organizations people typically refrain from thinking about a process at all while they keep suffering from the consequences, because they think they are too small for it. Furthermore, they find it too costly to invest in people or an information system supporting configuration management procedures. Consulting a standard is something that is generally considered a leap too far.
In this blog post, I will describe a very basic software configuration management process I have implemented at my current employer.
The IEEE Standard for Configuration Management
As crazy as this may sound, I have used the IEEE 828-2012 standard as a reference for my implementation. Besides the fact that using an existing and reasonably well-established reference is a good idea, I consulted this standard because I was already familiar with it from my previous background as a PhD researcher in software deployment.
The IEEE standard defines a framework of so-called "lower-level processes" from which a configuration management process can be derived. The most important lower-level processes are:
- CM identification, which concerns identifying, naming, describing, versioning, storing and retrieving configuration items (CIs) and their baselines.
- CM change control is about planning, requesting, approval and verification of change activities.
- CM status accounting is about identifying the status of CIs and change requests.
- CM auditing concerns identifying, tracing and reporting discrepancies with regards to the CIs and their baselines.
- CM release management is about planning, defining a format for distribution, delivery, and archival of configuration items.
All the other lower-level processes have references to the lower-level processes listed above. CM planning is basically about defining a plan for how to carry out the above activities. CM management is about actually installing the tools involved, executing the process, monitoring its progress and status, and revising/optimizing the plan if any problem occurs.
The remaining two lower-level processes concern outside involvement -- supplier configuration item control concerns CIs that are provided by external parties, while interface control concerns the implications of configuration changes for external parties. I did not take these two lower-level processes into account in my implementation.
Implementing a configuration management process
Implementing a configuration management process (according to the IEEE standard) starts with developing a configuration management plan. The standard mentions many planning aspects, such as identifying the information needs, the reporting needs, the reporting frequency, and the information needed to manage CM activities. However, from my perspective, in a small organization many of these aspects are difficult to determine in advance, in particular the information needs.
As a rule of thumb, when you do not exactly know what is needed, I think you should assume that the worst thing could happen -- the office burns down and everything gets lost. What does it take to reproduce the entire configuration from scratch?
This is how I have implemented the main lower-level processes:
CM identification
A configuration item is any structural unit that is distinguishable and configurable. In our situation, the most important kinds of configuration items are machine configurations (e.g. a physical machine or a virtual machine hosted in an IaaS environment, such as Amazon EC2) and container configurations (such as an Apache Tomcat container or a PaaS service, such as Elastic Beanstalk).
Machines/containers belong to an environment. Some examples of environments that we currently maintain are: production, containing the configurations of the production machines of our service; test, containing the configurations of the test machines; and internal, containing the configurations of our internal IT infrastructure, such as our internal Hydra build cluster, and other peripherals, such as routers and printers.
Machines run a number of applications that may have complex installation procedures. Moreover, identical/similar application configurations may have to be deployed to multiple machines.
For storing the configurations of the CIs, I have set up a Git repository that follows a specific directory structure of three levels:
<environment>/<machine | container>/<application>
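For example, a repository following this structure (with hypothetical machine and application names) could be organized as follows:

README.md
production/
  README.md
  webserver/
    README.md
    tomcat/
      README.md
      server.xml
test/
  README.md
internal/
  README.md
  _common/
    backup-key.pem
  buildmachine/
    README.md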
Each directory contains all artifacts (e.g. keys, data files, configuration files, scripts, documents etc.) required to reproduce a CI's configuration from scratch. Moreover, each directory has a README.md markdown file:
- The top-level README.md describes which environments are available and what their purposes are.
- The environment-level README.md describes which machines/containers are part of it, a brief description of their purpose, and a picture showing how they are connected. I typically use Dia to draw them, because the tool is simple and free.
- The machine-level README.md describes the purpose of the machine and the activities that must be carried out to reproduce its configuration from scratch.
- The application-level README.md captures the steps that must be executed to reproduce an application's configuration.
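To give an impression, an application-level README.md (with hypothetical steps and file names) could look like this:

# Apache Tomcat

This directory contains the artifacts needed to reproduce the Apache Tomcat configuration of the production web server.

To reproduce the configuration from scratch:

1. Install Apache Tomcat with the distribution's package manager
2. Copy server.xml into the conf/ subdirectory of the Tomcat installation
3. Restart the Tomcat service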
When storing artifacts and writing README.md files, I try to avoid duplication as much as possible, because it makes it harder to keep the repository consistent and maintainable:
- README.md files are not supposed to be tool manuals. I mention the steps that must be executed and what their purposes are, but I avoid explaining how a tool works. That is the purpose of the tool's manual.
- When there are common files used among machines, applications or environments, I do not duplicate them. Instead, I use a _common/ folder that I put one directory level higher. For example, the _common/ folder in the internal/ directory contains shared artifacts that are supposed to be reused among all machines belonging to our internal IT infrastructure.
- I also capture common configuration steps in a separate README.md and refer to it, instead of duplicating the same steps in multiple README.md files.
Because I use Git, versioning, storage and retrieval of configurations are implicit. I do not have to invent something myself or think too much about it. For example, I do not have to manually assign version numbers to CIs, because Git already computes a unique hash for every change that is committed. Moreover, because I use textual representations for most of the artifacts, I can easily compare versions of the configurations (e.g. with git diff).
Furthermore, besides capturing and storing all the prerequisites to reproduce a configuration, I also try to automate this process as much as possible. For most of the automation aspects, I use tools from the Nix project, such as the Nix package manager for individual packages, NixOS for system configurations, Disnix for distributed services, and NixOps for networks of machines.
Tools in the Nix project are driven by declarative specifications -- a specification captures the structure of a system, e.g. its components and their dependencies. From this specification, the entire deployment process is derived, such as building the components from source code, distributing them to the right machines in the network, and activating them in the right order.
Using a declarative deployment approach saves me from writing down the activities to carry out, because they are implicit. Also, there is no need to describe the structure of the system, because it is already captured in the deployment specification.
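To give an impression, here is a minimal sketch of a NixOps network specification; the machine name and target address are hypothetical:

{
  network.description = "Hypothetical production network";

  # A machine serving our web application; the target host address is made up
  webserver = { config, pkgs, ... }: {
    deployment.targetHost = "192.0.2.10";
    services.tomcat.enable = true;
    networking.firewall.allowedTCPPorts = [ 8080 ];
  };
}

Running nixops deploy on such a specification carries out all the activities needed to realize the configuration on the target machine. After changing the specification, running the same command again performs the activities needed to reach the new configuration.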
Unfortunately, not all machines' deployment processes can be fully automated with the Nix deployment tools (e.g. non-Linux machines and special-purpose peripherals, such as routers), still requiring me to carry out some configuration activities manually.
CM change control
Implementing changes may cause disruptions costing time and money. That is why the right people must be informed and approval is needed. Big organizations typically have sophisticated management procedures including request and approval forms, but in a small organization it typically suffices to notify people informally before implementing a change.
Besides notifying people, I also take the following things into account while implementing changes:
- Some configuration changes require validation, including review and testing, before they can actually be implemented in production. I typically keep the master Git branch in a releasable state, meaning that it is ready to be deployed into production. Any changes that require explicit validation go into a different branch first. Moreover, when using tools from the Nix project, it is relatively easy to reliably test changes first by deploying a system into a test environment, or by spawning virtual NixOS machines in which integration tests can be executed (see the sketch after this list).
- Sometimes you need to make an analysis of the impact and costs that a change would bring. Up-to-date and consistent documentation of the CIs, including their dependencies, makes this process more reliable. Furthermore, with tools from the Nix project you can make better estimations by executing a dry-run deployment -- the dry run shows what activities will be carried out without actually executing them or bringing the system down.
- After a change has been deployed, we also need to validate whether the new configuration is correct. Typically, this requires testing. Tools from the Nix project support congruent deployment, meaning that if the deployment succeeds, the actual configuration is guaranteed to match the deployment specification for the static parts of a system, giving better guarantees about its validity.
- You also have to pick the right moment to implement potentially disruptive changes. For example, it is typically a bad idea to do this while your service is under peak load.
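As an impression of what such an integration test looks like, here is a minimal sketch of a NixOS VM test; the service under test and the test steps are assumptions for illustration, and the exact test interface depends on the nixpkgs version:

import <nixpkgs/nixos/tests/make-test-python.nix> ({ pkgs, ... }: {
  name = "webserver-test";

  # A virtual machine running the (assumed) service under test
  nodes.webserver = { config, pkgs, ... }: {
    services.tomcat.enable = true;
    networking.firewall.allowedTCPPorts = [ 8080 ];
  };

  # The test script starts the VM and checks its observable behavior
  testScript = ''
    webserver.wait_for_unit("tomcat.service")
    webserver.wait_for_open_port(8080)
  '';
})

Building this expression with nix-build spawns the virtual machine, runs the test script, and fails the build if any check does not pass.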
CM status accounting
It is also highly desirable to know what the status of the CIs and the change activities is. The IEEE standard goes quite far in this regard. For example, the overall state of the system may converge in a certain direction (e.g. in terms of completed features, error ratios, etc.), which you continuously want to measure and report about. I think that in a small organization these kinds of status indicators are typically too complex to define and measure, in particular in the early stages of a development process.
However, I think the most important status indicator that you never want to lose track of is the following: does everything (still) work?
There are two facilities that help me out a lot in keeping a system in working state:
- Automating deployment with tools from the Nix project ensures that the static parts of a deployed system are congruent with the deployment specification, and that upgrades are atomic -- a deployment is either in the old configuration or in the new configuration, but never in an inconsistent mix of the two. As a result, we end up with fewer broken configurations when (re)deploying a system.
- We must also observe a system's runtime behavior and take action before things grow out of hand, for example, when a machine runs out of system resources. Monitoring services, such as Zabbix or Datadog, help me a lot in accomplishing this. They can also be used to configure alarms that warn you when things become critical.
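For the NixOS machines, enabling such a monitoring agent can itself be captured in the declarative configuration. A minimal sketch (the monitoring server address is made up):

{ config, pkgs, ... }:

{
  # Run the Zabbix agent and point it at the (hypothetical) monitoring server
  services.zabbixAgent = {
    enable = true;
    server = "monitor.example.org";
  };
}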
CM auditing
Another important aspect is the integrity of the configurations repository. How can we be sure that what is stored inside the repository matches the actual configurations and that the configuration procedures still work?
Fortunately, because we use tools from the Nix project, there is relatively little audit work to do. With Nix-related tools, the deployment process is fully automated. As a consequence, we must adapt the deployment specification whenever we intend to make changes. Moreover, since the Nix-related tools support congruent deployment, we know that the static parts of a system are guaranteed to match the actual configuration if the (re)deployment process succeeded.
However, for non-NixOS machines and other peripherals, we must still manually check once in a while whether the intended configuration matches the actual one. I have made it a habit to go through them once a month and to adjust the documentation if any discrepancies are found.
CM release management
When updating a configuration, we must also release the corresponding new versions of the configuration items. The IEEE standard describes many concerns, such as approval procedures, requirements on the distribution mechanism, and so on.
For me, most of these concerns are unimportant, especially in a small organization. The only thing that matters to me is that the release process is fully automated, reliable, and reproducible. Fortunately, the deployment tools from the Nix project support these properties quite well.
Discussion
In this blog post, I have described a basic configuration management process that I have implemented in a small organization.
Some people will probably argue that defining a CM process in a small organization sounds crazy: they think they do not need a process, that it is too much of an investment, and that following an IEEE standard is a leap too far.
In my opinion, however, the barrier to implementing a CM process is actually not that high. From my experience, the biggest investment is setting up a configuration management repository. Although big organizations typically have sophisticated information systems, I have also shown that, using a simple filesystem layout and a collection of free and open source tools (e.g. Git, Dia, Nix), a simple variant of such a repository can be set up with relatively little effort.
I also observed that automating CM tasks helps a lot, in particular using a declarative and congruent deployment approach, such as Nix. With a declarative approach, configuration activities are implicit (they are a consequence of applying a change in the deployment specification) and do not have to be documented. Furthermore, because Nix's deployment models are congruent, the static aspects of a configuration are guaranteed to match the deployment specifications. Moreover, the deployment model serves as the documentation, because it captures the structure of a system.
So how beneficial is setting up a CM process in a small organization? I observed many benefits. For example, a major benefit is that I can carry out many CM tasks much faster. I no longer have to waste much of my time looking for configuration artifacts and documentation. Also, because the steps to carry out are documented or automated, there are fewer things I need to (re)discover or solve while implementing a change.
Another benefit is that I can more reliably estimate the impact of implementing changes, because the CIs and their relationships are known. More knowledge also means fewer errors.
Although a simple CM approach provides benefits and many aspects can be automated, it always requires discipline from all people involved. For example, when errors are discovered and configurations must be modified in a stressful situation, it is very tempting to bypass updating the documentation.
Moreover, communication is also an important aspect. For example, when notifying people of a potentially disrupting change, clear communication is required. Typically, non-technical stakeholders must also be informed. Eventually, you will have to develop formalized procedures to properly handle decision processes.
Finally, the CM approach described in this blog post is obviously too limited if a company grows. If an organization gets bigger, a more sophisticated and more formalized CM approach will be required.