Saturday, February 15, 2020

A declarative process manager-agnostic deployment framework based on Nix tooling

In a previous blog post, written two months ago, I introduced a new experimental Nix-based process framework that provides the following features:

  • It uses the Nix expression language for configuring running process instances, including their dependencies. The configuration process is based on only a few simple concepts: function definitions to define constructors that generate process manager configurations, function invocations to compose running process instances, and Nix profiles to make collections of process configurations accessible from a single location.
  • The Nix package manager delivers all packages and configuration files and isolates them in the Nix store, so that they never conflict with other running processes and packages.
  • It identifies process dependencies, so that a process manager can ensure that processes are activated and deactivated in the right order.
  • The ability to deploy multiple instances of the same process, by making conflicting resources configurable.
  • Deploying processes/services as an unprivileged user.
  • Advanced concepts and features, such as namespaces and cgroups, are not required.

Another objective of the framework is that it should work with a variety of process managers on a variety of operating systems.

In my previous blog post, I deliberately used sysvinit scripts (also known as LSB Init compliant scripts) to manage the life-cycle of running processes as a starting point, because they are universally supported on Linux and self-contained -- sysvinit scripts only require the right packages to be installed; they do not rely on external programs that manage the processes' life-cycle. Moreover, sysvinit scripts can also be conveniently used as an unprivileged user.

I have also developed a Nix function that can be used to more conveniently generate sysvinit scripts. Traditionally, these scripts are written by hand, basically requiring the implementer to write the same boilerplate code over and over again, such as the activities that start and stop the process.

The sysvinit script generator function can also be used to directly specify the implementation of all activities that manage the life-cycle of a process, such as:

{createSystemVInitScript, nginx, stateDir}:
{configFile, dependencies ? [], instanceSuffix ? ""}:

let
  instanceName = "nginx${instanceSuffix}";
  nginxLogDir = "${stateDir}/${instanceName}/logs";
in
createSystemVInitScript {
  name = instanceName;
  description = "Nginx";
  activities = {
    start = ''
      mkdir -p ${nginxLogDir}
      log_info_msg "Starting Nginx..."
      loadproc ${nginx}/bin/nginx -c ${configFile} -p ${stateDir}
      evaluate_retval
    '';
    stop = ''
      log_info_msg "Stopping Nginx..."
      killproc ${nginx}/bin/nginx
      evaluate_retval
    '';
    reload = ''
      log_info_msg "Reloading Nginx..."
      killproc ${nginx}/bin/nginx -HUP
      evaluate_retval
    '';
    restart = ''
      $0 stop
      sleep 1
      $0 start
    '';
    status = "statusproc ${nginx}/bin/nginx";
  };
  runlevels = [ 3 4 5 ];

  inherit dependencies instanceName;
}

In the above Nix expression, we specify five activities to manage the life-cycle of Nginx, a free/open source web server:

  • The start activity initializes the state of Nginx and starts the process (as a daemon that runs in the background).
  • stop stops the Nginx daemon.
  • reload instructs Nginx to reload its configuration.
  • restart restarts the process.
  • status shows whether the process is running or not.

Besides directly implementing activities, the Nix function invocation shown above can also be used at a much higher level -- sysvinit scripts typically follow the same conventions. Nearly all sysvinit scripts implement the activities described above to manage the life-cycle of a process, and these implementations typically need to be written over and over again.

We can also generate the implementations of these activities automatically from a high level specification, such as:

{createSystemVInitScript, nginx, stateDir}:
{configFile, dependencies ? [], instanceSuffix ? ""}:

let
  instanceName = "nginx${instanceSuffix}";
  nginxLogDir = "${stateDir}/${instanceName}/logs";
in
createSystemVInitScript {
  name = instanceName;
  description = "Nginx";
  initialize = ''
    mkdir -p ${nginxLogDir}
  '';
  process = "${nginx}/bin/nginx";
  args = [ "-c" configFile "-p" stateDir ];
  runlevels = [ 3 4 5 ];

  inherit dependencies instanceName;
}

You could basically say that the above createSystemVInitScript function invocation makes the configuration process of a sysvinit script "more declarative" -- you do not need to specify the activities that need to be executed to manage processes, but instead, you specify the relevant characteristics of a running process.

From this high level specification, the implementations for all required activities will be derived, using conventions that are commonly used to write sysvinit scripts.

After completing the initial version of the process management framework that works with sysvinit scripts, I have also been investigating other process managers. I discovered that their configuration processes have many things in common with the sysvinit approach. As a result, I have decided to explore these declarative deployment concepts a bit further.

In this blog post, I will describe a declarative process manager-agnostic deployment approach that we can integrate into the experimental Nix-based process management framework.

Writing declarative deployment specifications for managed running processes


As explained in the introduction, I have also been experimenting with process managers other than sysvinit. For example, instead of generating a sysvinit script that manages the life-cycle of a process, such as the Nginx server, we can also generate a supervisord configuration file to define Nginx as a program that can be managed with supervisord:

{createSupervisordProgram, nginx, stateDir}:
{configFile, dependencies ? [], instanceSuffix ? ""}:

let
  instanceName = "nginx${instanceSuffix}";
  nginxLogDir = "${stateDir}/${instanceName}/logs";
in
createSupervisordProgram {
  name = instanceName;
  command = "mkdir -p ${nginxLogDir}; "+
    "${nginx}/bin/nginx -c ${configFile} -p ${stateDir}";
  inherit dependencies;
}

Invoking the above function will generate a supervisord program configuration file, instead of a sysvinit script.
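For illustration, a supervisord program configuration is an INI-style section. A sketch of what the generated file could roughly look like (the store paths are placeholders and the exact output depends on the generator):

[program:nginx]
command=mkdir -p /var/nginx/logs; /nix/store/...-nginx/bin/nginx -c /nix/store/...-nginx.conf -p /var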

With the following Nix expression, we can generate a systemd unit file so that Nginx's life-cycle can be managed by systemd:

{createSystemdService, nginx, stateDir}:
{configFile, dependencies ? [], instanceSuffix ? ""}:

let
  instanceName = "nginx${instanceSuffix}";
  nginxLogDir = "${stateDir}/${instanceName}/logs";
in
createSystemdService {
  name = instanceName;
  Unit = {
    Description = "Nginx";
  };
  Service = {
    ExecStartPre = "+mkdir -p ${nginxLogDir}";
    ExecStart = "${nginx}/bin/nginx -c ${configFile} -p ${stateDir}";
    Type = "simple";
  };

  inherit dependencies;
}

What you will probably notice when comparing the above two Nix expressions with the last sysvinit example (that captures process characteristics instead of activities) is that they all contain very similar properties. Their main difference is a slightly different organization and naming convention, because each abstraction function is tailored towards the configuration conventions of its target process manager.
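To make the correspondence concrete: the systemd invocation above could roughly translate into a unit file along these lines (a sketch with placeholder store paths):

[Unit]
Description=Nginx

[Service]
ExecStartPre=+mkdir -p /var/nginx/logs
ExecStart=/nix/store/...-nginx/bin/nginx -c /nix/store/...-nginx.conf -p /var
Type=simple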

As discussed in my previous blog post about declarative programming and deployment, declarativity is a spectrum -- the above specifications are (somewhat) declarative because they do not capture the activities to manage the life-cycle of the process (the how). Instead, they specify what process we want to run. The process manager derives and executes all activities to bring that process in a running state.

sysvinit scripts themselves are not declarative, because they specify all activities (i.e. shell commands) that need to be executed to accomplish that goal. supervisord configurations and systemd service configuration files are (somewhat) declarative, because they capture process characteristics -- the process manager derives and executes all required activities to bring the process in a running state.

Despite the fact that I am not specifying any process management activities, these Nix expressions could still be considered somewhat of a "how specification", because each configuration is tailored towards a specific process manager. A process manager, such as sysvinit, is a means to accomplish something else: getting a running process whose life-cycle can be conveniently managed.

If I would revise the above specifications to only express what kind of running process I want, disregarding the process manager, then I could simply write:

{createManagedProcess, nginx, stateDir}:
{configFile, dependencies ? [], instanceSuffix ? ""}:

let
  instanceName = "nginx${instanceSuffix}";
  nginxLogDir = "${stateDir}/${instanceName}/logs";
in
createManagedProcess {
  name = instanceName;
  description = "Nginx";
  initialize = ''
    mkdir -p ${nginxLogDir}
  '';
  process = "${nginx}/bin/nginx";
  args = [ "-c" configFile" -p" "${stateDir}/${instanceName}" ];

  inherit dependencies instanceName;
}

The above Nix expression simply states that we want to run a managed Nginx process (using certain command-line arguments) and that, before starting the process, we want to initialize the state by creating the log directory, if it does not exist yet.

I can translate the above specification to all kinds of configuration artifacts that can be used by a variety of process managers to accomplish the same outcome. I have developed six kinds of generators, allowing me to target the following process managers:

  • sysvinit scripts
  • BSD rc scripts
  • supervisord
  • systemd
  • launchd
  • cygrunsrv

Translating the properties of the process manager-agnostic configuration to process manager-specific properties is quite straightforward for most concepts -- in many cases, there is a direct mapping from a property in the process manager-agnostic configuration to a process manager-specific property.

For example, when we intend to target supervisord, then we can translate the process and args parameters to a command invocation. For systemd, we can translate process and args to the ExecStart property that refers to a command-line instruction that starts the process.

Although the process manager-agnostic abstraction function supports enough features to get some well-known system services working (e.g. Nginx, the Apache HTTP server, PostgreSQL, MySQL etc.), it does not facilitate all possible features of each process manager -- it provides a reasonable set of common features to get a process running and to impose some restrictions on it.

It is still possible to work around the feature limitations of process manager-agnostic deployment specifications. We can influence the generation process by defining overrides to get process manager-specific properties supported:

{createManagedProcess, nginx, stateDir}:
{configFile, dependencies ? [], instanceSuffix ? ""}:

let
  instanceName = "nginx${instanceSuffix}";
  nginxLogDir = "${stateDir}/${instanceName}/logs";
in
createManagedProcess {
  name = instanceName;
  description = "Nginx";
  initialize = ''
    mkdir -p ${nginxLogDir}
  '';
  process = "${nginx}/bin/nginx";
  args = [ "-c" configFile" -p" "${stateDir}/${instanceName}" ];

  inherit dependencies instanceName;

  overrides = {
    sysvinit = {
      runlevels = [ 3 4 5 ];
    };
  };
}

In the above example, we have added an override specifically for sysvinit to tell the init system that the process should be started in runlevels 3, 4 and 5 (which implies that the process should be stopped in the remaining runlevels: 0, 1, 2, and 6). The other process managers that I have worked with have no notion of runlevels.

Similarly, we can use an override to, for example, use systemd-specific features to run a process in a Linux namespace etc.
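For example, a hypothetical override could enable a systemd-specific sandboxing feature. The nesting below mirrors the createSystemdService parameters shown earlier, but the exact attribute layout is an assumption:

overrides = {
  systemd = {
    Service = {
      # systemd-specific: give the process its own private /tmp namespace
      PrivateTmp = true;
    };
  };
};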

Simulating process manager-agnostic concepts with no direct equivalents


For some process manager-agnostic concepts, process managers do not always have direct equivalents. In such cases, there is still the possibility to apply non-trivial simulation strategies.

Foreground processes or daemons


What all deployment specifications shown in this blog post have in common is that their main objective is to bring a process in a running state. How these processes are expected to behave is different among process managers.

sysvinit and BSD rc scripts expect processes to daemonize -- on invocation, a process spawns another process that keeps running in the background (the daemon process). After the initialization of the daemon process is done, the parent process terminates. If a process does not daemonize, the execution of the startup script blocks indefinitely.

Daemons introduce another complexity from a process management perspective -- when invoking an executable from a shell session in background mode, the shell can tell you its process ID, so that it can be stopped when it is no longer necessary.

With daemons, an invoked process forks another child process (or, when it is supposed to really behave well: it double forks) that becomes the daemon process. The daemon process gets adopted by the init system, and thus remains in the background even if the shell session ends.

The shell that invokes the executable does not know the PID of the resulting daemon process, because that value is only propagated to the daemon's parent process, not to the calling shell session. To still be able to control it, a well-behaving daemon typically writes its process ID to a so-called PID file, so that it can be reliably terminated by a shell command when it is no longer required.

sysvinit and BSD rc scripts extensively use PID files to control daemons. By using a process' PID file, the managing sysvinit/BSD rc script can tell you whether a process is running or not and reliably terminate a process instance.
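For example, assuming a daemon that writes its PID to /var/run/nginx.pid (a hypothetical path), a script can check and terminate it as follows:

$ kill -0 "$(cat /var/run/nginx.pid)" && echo "still running"
$ kill "$(cat /var/run/nginx.pid)"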

"More modern" process managers, such as launchd, supervisord, and cygrunsrv, do not work with processes that daemonize -- instead, these process managers are daemons themselves that invoke processes that work in "foreground mode".

One of the advantages of this approach is that services can be controlled more reliably -- because their PIDs are directly propagated to the controlling daemon from the fork() library call, it is no longer required to work with PID files, which may not always work reliably (for example: a process might abruptly terminate and never clean up its PID file, giving the system the false impression that it is still running).

systemd improves process control even further by using Linux cgroups -- although foreground processes may be controlled more reliably than daemons, they can still fork other processes (e.g. a web service that creates a process per connection). When the controlling parent process terminates and does not properly terminate its own child processes, they may keep running in the background indefinitely. With cgroups it is possible for the process manager to retain control over all processes spawned by a service and terminate them when the service is no longer needed.

systemd has another unique advantage over the other process managers -- it can work both with foreground processes and daemons, although foreground processes seem to have the preference according to the documentation, because they are much easier to control and develop.

Many common system services, such as OpenSSH, MySQL or Nginx, have the ability to both run as a foreground process and as a daemon, typically by providing a command-line parameter or defining a property in a configuration file.

To provide an optimal user experience for all supported process managers, it is typically a good thing in the process manager-agnostic deployment specification to specify both how a process can be used as a foreground process and as a daemon:

{createManagedProcess, nginx, stateDir, runtimeDir}:
{configFile, dependencies ? [], instanceSuffix ? ""}:

let
  instanceName = "nginx${instanceSuffix}";
  nginxLogDir = "${stateDir}/${instanceName}/logs";
in
createManagedProcess {
  name = instanceName;
  description = "Nginx";
  initialize = ''
    mkdir -p ${nginxLogDir}
  '';
  process = "${nginx}/bin/nginx";
  args = [ "-p" "${stateDir}/${instanceName}" "-c" configFile ];
  foregroundProcessExtraArgs = [ "-g" "daemon off;" ];
  daemonExtraArgs = [ "-g" "pid ${runtimeDir}/${instanceName}.pid;" ];

  inherit dependencies instanceName;

  overrides = {
    sysvinit = {
      runlevels = [ 3 4 5 ];
    };
  };
}

In the above example, we have revised the Nginx expression to specify both how the process can be started as a foreground process and as a daemon. The only thing that needs to be configured differently is one global directive in the Nginx configuration file -- by default, Nginx runs as a daemon, but by adding the daemon off; directive to the configuration we can run it in foreground mode.

When we run Nginx as a daemon, we configure a PID file that refers to the instance name, so that multiple instances can co-exist.

To make this conveniently configurable, the above expression does the following:

  • The process parameter specifies the process that needs to be started both in foreground mode and as a daemon. The args parameter specifies common command-line arguments that both the foreground and daemon process will use.
  • The foregroundProcessExtraArgs parameter specifies additional command-line arguments that are only used when the process is started in foreground mode. In the above example, it is used to provide Nginx the global directive that disables the daemon setting.
  • The daemonExtraArgs parameter specifies additional command-line arguments that are only used when the process is started as a daemon. In the above example, it is used to provide Nginx a global directive with a PID file path that uniquely identifies the process instance.

For custom software and services implemented in languages other than C, e.g. Node.js, Java or Python, it is far less common that they have the ability to daemonize -- they can typically only be used as foreground processes.

Nonetheless, we can still daemonize foreground-only processes, by using an external tool, such as libslack's daemon command:

$ daemon -U -i myforegroundprocess

The above command daemonizes the foreground process and creates a PID file for it, so that it can be managed by the sysvinit/BSD rc utility scripts.

The opposite kind of "simulation" is also possible -- if a process can only be used as a daemon, then we can use a proxy process to make it appear as a foreground process:

export _TOP_PID=$$

# Handle the SIGTERM and SIGINT signals and forward them to the daemon process
_term()
{
    trap "exit 0" TERM # Exit cleanly when the TERM signal sent below arrives
    kill -TERM "$pid"  # Forward the terminate signal to the daemon
    kill $_TOP_PID     # Terminate this proxy script itself
}

_interrupt()
{
    kill -INT "$pid"
}

trap _term SIGTERM
trap _interrupt SIGINT

# Start process in the background as a daemon
${executable} "$@"

# Wait for the PID file to become available.
# Useful to work with daemons that don't behave well enough.
count=0

while [ ! -f "${_pidFile}" ]
do
    if [ $count -eq 10 ]
    then
        echo "It does not seem that there isn't any pid file! Giving up!"
        exit 1
    fi

    echo "Waiting for ${_pidFile} to become available..."
    sleep 1

    count=$((count + 1))
done

# Determine the daemon's PID by using the PID file
pid=$(cat ${_pidFile})

# Wait in the background for the PID to terminate
${if stdenv.isDarwin then ''
  lsof -p $pid +r 3 &>/dev/null &
'' else if stdenv.isLinux || stdenv.isCygwin then ''
  tail --pid=$pid -f /dev/null &
'' else if stdenv.isBSD || stdenv.isSunOS then ''
  pwait $pid &
'' else
  throw "Don't know how to wait for process completion on system: ${stdenv.system}"}

# Wait for the blocker process to complete.
# We use wait, so that bash can still
# handle the SIGTERM and SIGINT signals that may be sent to it by
# a process manager
blocker_pid=$!
wait $blocker_pid

The idea of the proxy script shown above is that it runs as a foreground process as long as the daemon process is running and relays any relevant incoming signals (e.g. a terminate and interrupt) to the daemon process.

Implementing this proxy was a bit tricky:

  • In the beginning of the script we configure signal handlers for the TERM and INT signals so that the process manager can terminate the daemon process.
  • We must start the daemon and wait for it to become available. Although the parent process of a well-behaving daemon should only terminate when the initialization is done, this turns out not to be a hard guarantee -- to make the process a bit more robust, we deliberately wait for the PID file to become available before we attempt to wait for the termination of the daemon.
  • Then we wait for the PID to terminate. The bash shell has an internal wait command that can be used to wait for a background process to terminate, but this only works with processes in the same process group as the shell. Daemons are in a new session (with different process groups), so they cannot be monitored by the shell by using the wait command.

    From this Stackoverflow article, I learned that we can use the tail command of GNU Coreutils, or lsof on macOS/Darwin, and pwait on BSDs and Solaris/SunOS to monitor processes in other process groups.
  • When a command is being executed by a shell script (e.g. in this particular case: tail, lsof or pwait), the shell script can no longer respond to signals until the command completes. To still allow the script to respond to signals while it is waiting for the daemon process to terminate, we must run the previous command in background mode, and we use the wait instruction to block the script. While a wait command is running, the shell can respond to signals.

The generator function will automatically pick the best solution for the selected target process manager -- this means that when our target process manager is sysvinit or BSD rc scripts, the generator automatically picks the configuration settings to run the process as a daemon. For the remaining process managers, the generator will pick the configuration settings that run it as a foreground process.

If a desired process model is not supported, then the generator will automatically simulate it. For instance, if we have a foreground-only process specification, then the generator will automatically configure a sysvinit script to call the daemon executable to daemonize it.

A similar process happens when a daemon-only process specification is deployed for a process manager that cannot work with it, such as supervisord.

State initialization


Another important aspect in process deployment is state initialization. Most system services require the presence of state directories in which they can store their PID, log and temp files. If these directories do not exist, the service may not work and refuse to start.

To cope with this problem, I typically make processes self-initializing -- before starting the process, I check whether the state has been initialized (e.g. check if the state directories exist) and initialize the state first if needed.

With most process managers, state initialization is easy to facilitate. For sysvinit and BSD rc scripts, we just use the generator to first execute the shell commands to initialize the state before the process gets started.

Supervisord allows you to execute multiple shell commands in a single command directive -- we can just execute a script that initializes the state before we execute the process that we want to manage.

systemd has an ExecStartPre directive that can be used to specify shell commands to execute before the main process starts.

Apple launchd and cygrunsrv, however, do not have a generic shell execution mechanism or some facility allowing you to execute things before a process starts. Nonetheless, we can still ensure that the state is going to be initialized by creating a wrapper script -- first the wrapper script does the state initialization and then executes the main process.

If a state initialization procedure was specified and the target process manager does not support scripting, then the generator function will transparently wrap the main process into a wrapper script that supports state initialization.
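A minimal sketch of such a generated wrapper (with hypothetical paths) could look like this:

#!/bin/sh
# Initialize the state first...
mkdir -p /var/nginx/logs
# ...then replace this shell with the actual service process
exec /nix/store/...-nginx/bin/nginx -c /nix/store/...-nginx.conf -p /var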

Process dependencies


Another important generic concept is process dependency management. For example, Nginx can act as a reverse proxy for another web application process. To provide a functional Nginx service, we must be sure that the web application process gets activated as well, and that the web application is activated before Nginx.

If the web application process is activated after Nginx or missing completely, then Nginx is (temporarily) unable to redirect incoming requests to the web application process causing end-users to see bad gateway errors.

The process managers that I have experimented with all have a different notion of process dependencies.

sysvinit scripts can optionally declare dependencies in their comment sections. Tools that know how to interpret these dependency specifications can use them to decide the right activation order. Systems using sysvinit typically ignore this specification, however. Instead, they work with sequence numbers in the script file names -- each runlevel configuration directory contains symlinks whose names consist of a prefix (S or K) followed by two numeric digits that define the start or stop order.
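For example, a runlevel directory typically contains symlinks such as the following (the sequence numbers are hypothetical):

/etc/rc3.d/S20nginx -> ../init.d/nginx  # started in runlevel 3, order 20
/etc/rc0.d/K80nginx -> ../init.d/nginx  # stopped in runlevel 0, order 80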

supervisord does not work with dependency specifications, but every program can optionally provide a priority setting that can be used to order the activation and deactivation of programs -- lower priority numbers have precedence over higher priority numbers.

From dependency specifications in a process management expression, the generator function can automatically derive sequence numbers for process managers that require it.

Similar to sysvinit scripts, BSD rc scripts can also declare dependencies in their comment sections. Contrary to sysvinit scripts, BSD rc scripts can use the rcorder tool to parse these dependencies from the comment sections and automatically derive the order in which the BSD rc scripts need to be activated.
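These dependency annotations are plain comments that rcorder understands. For example, the Nginx rc script could declare (service names hypothetical):

#!/bin/sh
# PROVIDE: nginx
# REQUIRE: webapp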

cygrunsrv also allows you to directly specify process dependencies. The Windows service manager makes sure that the services get activated in the right order and that all process dependencies are activated first. The only limitation is that cygrunsrv only allows up to 16 dependencies to be specified per service.

To simulate process dependencies with systemd, we can use two properties. The Wants property can be used to tell systemd that another service needs to be activated first. The After property can be used to specify the ordering.
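For example, to express that the Nginx unit depends on the web application process, the generated unit could contain (assuming the dependency is deployed as webapp.service):

[Unit]
Wants=webapp.service
After=webapp.service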

Sadly, it seems that launchd has no notion of process dependencies at all -- processes can be activated by certain events, e.g. when a kernel module was loaded or through socket activation, but it does not seem to have the ability to configure process dependencies or the activation ordering. When our target process manager is launchd, then we simply have to inform the user that proper activation ordering cannot be guaranteed.

Changing user privileges


Another general concept, that has subtle differences in each process manager, is changing user privileges. Typically for the deployment of system services, you do not want to run these services as root user (that has full access to the filesystem), but as an unprivileged user.

sysvinit and BSD rc scripts have to change users through the su command. The su command can be used to change the user ID (UID), and will automatically adopt the primary group ID (GID) of the corresponding user.
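For example, a generated sysvinit script could start the daemon under an unprivileged user roughly as follows (user name and paths hypothetical):

su nginx -c '/nix/store/...-nginx/bin/nginx -c /nix/store/...-nginx.conf -p /var'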

Supervisord and cygrunsrv can also only change user IDs (UIDs), and will adopt the primary group ID (GID) of the corresponding user.

Systemd and launchd can change both the user IDs and group IDs of the processes that they invoke.

Because only changing UIDs is universally supported amongst process managers, I did not add a configuration property that allows you to change GIDs in a process manager-agnostic way.

Deploying process manager-agnostic configurations


With a processes Nix expression, we can define which process instances we want to run (and how they can be constructed from source code and their dependencies):

{ pkgs ? import <nixpkgs> { inherit system; }
, system ? builtins.currentSystem
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager
}:

let
  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir;
    inherit forceDisableUserChange processManager;
  };
in
rec {
  webapp = rec {
    port = 5000;
    dnsName = "webapp.local";

    pkg = constructors.webapp {
      inherit port;
    };
  };

  nginxReverseProxy = rec {
    port = 8080;

    pkg = constructors.nginxReverseProxy {
      webapps = [ webapp ];
      inherit port;
    } {};
  };
}

In the above Nix expression, we compose two running process instances:

  • webapp is a trivial web application process that will simply return a static HTML page by using the HTTP protocol.
  • nginxReverseProxy is an Nginx server configured as a reverse proxy server. It will forward incoming HTTP requests to the appropriate web application instance, based on the virtual host name. If a virtual host name is webapp.local, then Nginx forwards the request to the webapp instance.

To generate the configuration artifacts for the process instances, we refer to a separate constructors Nix expression. Each constructor will call the createManagedProcess function abstraction (as shown earlier) to construct a process configuration in a process manager-agnostic way.

With the following command-line instruction, we can generate sysvinit scripts for the webapp and Nginx processes declared in the processes expression, and run them as an unprivileged user with the state files managed in our home directory:

$ nixproc-build --process-manager sysvinit \
  --state-dir /home/sander/var \
  --force-disable-user-change processes.nix

By adjusting the --process-manager parameter we can also generate artefacts for a different process manager. For example, the following command will generate systemd unit config files instead of sysvinit scripts:

$ nixproc-build --process-manager systemd \
  --state-dir /home/sander/var \
  --force-disable-user-change processes.nix

The following command will automatically build and deploy all processes, using sysvinit as a process manager:

$ nixproc-sysvinit-switch --state-dir /home/sander/var \
  --force-disable-user-change processes.nix

We can also run a life-cycle management activity on all previously deployed processes. For example, to retrieve the statuses of all processes, we can run:

$ nixproc-sysvinit-runactivity status

We can also traverse the processes in reverse dependency order. This is particularly useful to reliably stop all processes, without breaking any process dependencies:

$ nixproc-sysvinit-runactivity -r stop

Similarly, there are command-line tools to use the other supported process managers. For example, to deploy systemd units instead of sysvinit scripts, you can run:

$ nixproc-systemd-switch processes.nix

Distributed process manager-agnostic deployment with Disnix


As shown in the previous process management framework blog post, it is also possible to deploy processes to machines in a network and have inter-dependencies between processes. These kinds of deployments can be managed by Disnix.

Compared to the previous blog post (in which we could only deploy sysvinit scripts), we can now also use any process manager that the framework supports. The Dysnomia toolset provides plugins that support all process managers that this framework supports:

{ pkgs, distribution, invDistribution, system
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager ? "sysvinit"
}:

let
  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir;
    inherit forceDisableUserChange processManager;
  };

  processType =
    if processManager == "sysvinit" then "sysvinit-script"
    else if processManager == "systemd" then "systemd-unit"
    else if processManager == "supervisord" then "supervisord-program"
    else if processManager == "bsdrc" then "bsdrc-script"
    else if processManager == "cygrunsrv" then "cygrunsrv-service"
    else throw "Unknown process manager: ${processManager}";
in
rec {
  webapp = rec {
    name = "webapp";
    port = 5000;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
    };
    type = processType;
  };

  nginxReverseProxy = rec {
    name = "nginxReverseProxy";
    port = 8080;
    pkg = constructors.nginxReverseProxy {
      inherit port;
    };
    dependsOn = {
      inherit webapp;
    };
    type = processType;
  };
}

In the above expression, we have extended the previously shown processes expression into a Disnix service expression, in which every attribute in the attribute set represents a service that can be distributed to a target machine in the network.

The type attribute of each service indicates which Dysnomia plugin needs to manage its life-cycle. We can automatically select the appropriate plugin for our desired process manager by deriving it from the processManager parameter.

The above Disnix expression has a drawback -- in a heterogeneous network of machines (that run multiple operating systems and/or process managers), we need to compose all desired variants of each service with configuration files for each process manager that we want to use.

It is also possible to have target-agnostic services, by delegating the translation steps to the corresponding target machines. Instead of directly generating a configuration file for a process manager, we generate a JSON specification containing all parameters that are passed to createManagedProcess. We can use this JSON file to build the corresponding configuration artefacts on the target machine:

{ pkgs, distribution, invDistribution, system
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager ? null
}:

let
  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir;
    inherit forceDisableUserChange processManager;
  };
in
rec {
  webapp = rec {
    name = "webapp";
    port = 5000;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
    };
    type = "managed-process";
  };

  nginxReverseProxy = rec {
    name = "nginxReverseProxy";
    port = 8080;
    pkg = constructors.nginxReverseProxy {
      inherit port;
    };
    dependsOn = {
      inherit webapp;
    };
    type = "managed-process";
  };
}

In the above services model, we have set the processManager parameter to null, causing the generator to emit JSON representations of the function parameters passed to createManagedProcess.

The managed-process type refers to a Dysnomia plugin that consumes the JSON specification and invokes the createManagedProcess function to convert the JSON configuration to a configuration file used by the preferred process manager.
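Such a JSON specification is essentially a serialization of the createManagedProcess parameters. A sketch of what it could look like (the exact field names are an assumption, and the store paths are placeholders):

{
  "name": "nginx",
  "description": "Nginx",
  "initialize": "mkdir -p /var/nginx/logs",
  "process": "/nix/store/...-nginx/bin/nginx",
  "args": [ "-c", "/nix/store/...-nginx.conf", "-p", "/var" ]
}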

In the infrastructure model, we can configure the preferred process manager for each target machine:

{
  test1 = {
    properties = {
      hostname = "test1";
    };
    containers = {
      managed-process = {
        processManager = "sysvinit";
      };
    };
  };

  test2 = {
    properties = {
      hostname = "test2";
    };
    containers = {
      managed-process = {
        processManager = "systemd";
      };
    };
  };
}

In the above infrastructure model, the managed-process container on the first machine (test1) has been configured to use sysvinit scripts to manage processes. On the second test machine (test2), the managed-process container is configured to use systemd to manage processes.

If we distribute the services in the services model to targets in the infrastructure model as follows:

{infrastructure}:

{
  webapp = [ infrastructure.test1 ];
  nginxReverseProxy = [ infrastructure.test2 ];
}

and then deploy the system as follows:

$ disnix-env -s services.nix -i infrastructure.nix -d distribution.nix

Then the webapp process will be distributed to the test1 machine in the network and will be managed with a sysvinit script.

The nginxReverseProxy will be deployed to the test2 machine and managed as a systemd job. The Nginx reverse proxy forwards incoming connections to the webapp.local domain name to the web application process hosted on the first machine.

Discussion


In this blog post, I have introduced a process manager-agnostic function abstraction making it possible to target all kinds of process managers on a variety of operating systems.

By using a single set of declarative specifications, we can:

  • Target six different process managers on four different kinds of operating systems.
  • Implement various kinds of deployment scenarios: production deployments, test deployments as an unprivileged user.
  • Construct multiple instances of processes.

In a distributed context, the advantage is that we can uniformly target all supported process managers and operating systems in a heterogeneous environment from a single declarative specification.

This is particularly useful to facilitate technology diversity -- for example, one of the key selling points of Microservices is that "any technology" can be used to implement them. In many cases, technology diversity is "restricted" to frameworks, programming languages, and storage technologies.

One particular aspect that is rarely changed is the choice of operating systems, because of the limitations of deployment tools -- most deployment solutions for Microservices are container-based and heavily rely on Linux-only concepts, such as Namespaces and cgroups.

With this process management framework and the recent Dysnomia plugin additions for Disnix, it is possible to target all kinds of operating systems that support the Nix package manager, making the operating system a selectable component as well. This, for example, allows you to pick the best operating system to implement a certain requirement -- when performance is important you might pick Linux, and when there is a strong emphasis on security, you could pick OpenBSD to host a mission critical component.

Limitations


The following table summarizes the differences between the process manager solutions that I have investigated:

sysvinit:
  Process type: daemon
  Process control method: PID files
  Scripting support: yes
  Process dependency management: Numeric ordering
  User changing capabilities: user
  Unprivileged user deployments: yes*
  Operating system support: Linux

bsdrc:
  Process type: daemon
  Process control method: PID files
  Scripting support: yes
  Process dependency management: Dependency-based
  User changing capabilities: user
  Unprivileged user deployments: yes*
  Operating system support: FreeBSD, OpenBSD, NetBSD

supervisord:
  Process type: foreground
  Process control method: Process PID
  Scripting support: yes
  Process dependency management: Numeric ordering
  User changing capabilities: user and group
  Unprivileged user deployments: yes
  Operating system support: many UNIX-like systems: Linux, macOS, FreeBSD, Solaris

systemd:
  Process type: foreground, daemon
  Process control method: cgroups
  Scripting support: yes
  Process dependency management: Dependency-based + dependency loading
  User changing capabilities: user and group
  Unprivileged user deployments: yes*
  Operating system support: Linux (+glibc) only

launchd:
  Process type: foreground
  Process control method: Process PID
  Scripting support: no
  Process dependency management: None
  User changing capabilities: user and group
  Unprivileged user deployments: no
  Operating system support: macOS (Darwin)

cygrunsrv:
  Process type: foreground
  Process control method: Process PID
  Scripting support: no
  Process dependency management: Dependency-based + dependency loading
  User changing capabilities: user
  Unprivileged user deployments: no
  Operating system support: Windows (Cygwin)

Although we can facilitate lifecycle management from a common specification with a variety of process managers, only the most important common features are supported.

Not every concept can be supported in a process manager-agnostic way. For example, we cannot generically isolate any resources (except for packages, because we use Nix). It is difficult to generalize such concepts, because they are not standardized -- e.g. the POSIX standard does not describe namespaces and cgroups (or similar concepts).

Furthermore, most process managers (with the exception of supervisord) are operating system specific. As a result, it still matters what process manager is picked.

Related work


Process manager-agnostic deployment is not entirely a new idea. Dysnomia has had a target-agnostic 'process' plugin for quite a while, which translates a simple deployment specification (consisting of key-value pairs) to a systemd unit configuration file or sysvinit script.

The features of Dysnomia's process plugin are much more limited compared to the createManagedProcess abstraction function described in this blog post. It does not support any process managers other than sysvinit and systemd, and it can only work with foreground processes.

Furthermore, target-agnostic configurations cannot be easily extended -- it is possible to (ab)use the templating mechanism, but it has no first-class override facilities.

I also found a project called pleaserun that has the objective to generate configuration files for a variety of process managers (both my approach and pleaserun support sysvinit scripts, systemd and launchd).

It seems to use template files to generate the configuration artefacts, and it does not seem to have a generic extension mechanism. Furthermore, it provides no framework to configure the location of shared resources, automatically install package dependencies or to compose multiple instances of processes.

Some remaining thoughts


Although the Nix package manager (not the NixOS distribution) should be portable amongst a variety of UNIX-like systems, it turns out that the only two operating systems that are well supported are Linux and macOS. Nix was reported to work on a variety of other UNIX-like systems in the past, but recently many things seem to have broken.

To make Nix work on FreeBSD 12.1, I have used the latest stable Nix package manager version with patches from this repository. It turns out that there is still a patch missing to work around a bug in FreeBSD that incorrectly kills all processes in a process group. Fortunately, when we run Nix as an unprivileged user, this bug does not seem to cause any serious problems.

Recent versions of Nixpkgs turn out to be horribly broken on FreeBSD -- the FreeBSD stdenv does not seem to work at all. I tried switching back to stdenv-native (a stdenv environment that impurely uses the host system's compiler and core executables), but that also no longer seems to work in the last three major releases -- the Nix expression evaluation breaks in several places. Due to the intense amount of changes and assumptions that the stdenv infrastructure currently makes, it was as good as impossible for me to fix the infrastructure.

As another workaround, I reverted to a very old version of Nixpkgs (version 17.03 to be precise) that still has a working stdenv-native environment. With some tiny adjustments (e.g. adding some shell aliases for the GNU variants of certain shell executables to stdenv-native), I have managed to get some basic Nix packages working, including Nginx on FreeBSD.

Surprisingly, running Nix on Cygwin was less painful than FreeBSD (because of all the GNUisms that Cygwin provides). Similar to FreeBSD, recent versions of Nixpkgs also appear to be broken, including the Cygwin stdenv environment. By reverting back to release-18.03 (that still has a somewhat working stdenv for Cygwin), I have managed to build a working Nginx version.

As a future improvement to Nixpkgs, I would like to propose a testing solution for stdenv-native. Although I understand that it is difficult to dedicate manpower to maintaining all unconventional Nix/Nixpkgs ports, stdenv-native is something that we can also conveniently test on Linux and prevent from breaking in the future.

Availability


The latest version of my experimental Nix-based process framework, that includes the process manager-agnostic configuration function described in this blog post, can be obtained from my GitHub page.

In addition, the repository also contains some example cases, including the web application system described in this blog post, and a set of common system services: MySQL, Apache HTTP server, PostgreSQL and Apache Tomcat.

Monday, January 6, 2020

Writing a well-behaving daemon in the C programming language

Slightly over one month ago, I wrote a blog post about a new experimental Nix-based process management framework that I have been developing. For this framework, I need to experiment with processes that run in the foreground (i.e. they block the shell of the user that invokes it as long as it is running), and daemons -- processes that run in the background and are not directly controlled by the user.

Daemons are (still) a common practice in the UNIX world (although this is changing nowadays with process managers, such as systemd and launchd) to make system services available to end users, such as web servers, the secure shell, and FTP.

To make experimentation more convenient, I wanted to write a very simple service that can run both in the foreground and as a daemon. Initially, I thought writing a daemon would be straightforward, but this turned out to be much more difficult than I initially anticipated.

I have learned that daemonizing a process is quite simple, but writing a well-behaving daemon is quite complicated. I have been studying a number of sources on how to properly write one, and none of them provided me with all the information that I needed. As a result, I have decided to do some investigation myself and write a blog post about my findings.

The basics


As I have stated earlier, the basics of writing a daemon in the C programming language are simple. For example, I can write a very trivial service whose only purpose is to print Hello! on the terminal every second until it receives a terminate or interrupt signal:

#include <stdio.h>
#include <unistd.h>
#include <signal.h>

#define TRUE 1
#define FALSE 0

volatile int terminated = FALSE;

static void handle_termination(int signum)
{
    terminated = TRUE;
}

static void init_service(void)
{
    signal(SIGINT, handle_termination);
    signal(SIGTERM, handle_termination);
}

static void run_main_loop(void)
{
    while(!terminated)
    {
        fprintf(stderr, "Hello!\n");
        sleep(1);
    }
}

The following trivial main method allows us to let the service run in "foreground mode":

int main()
{
    init_service();
    run_main_loop();
    return 0; 
}

The above main method initializes the service (that configures the signal handlers) and invokes the main loop (as defined in the previous code example). The main loop keeps running until it receives a terminate (SIGTERM) or interrupt (SIGINT) signal that unblocks the main loop.

When we run the above program in a shell session, we should observe:

$ ./simpleservice
Hello!
Hello!
Hello!

We will see that the service prints Hello! every second until it gets terminated. Moreover, we will notice that the shell is blocked from receiving user input until we terminate the process. Furthermore, if we terminate the shell (for example by sending it a TERM signal from another shell session), the service gets terminated as well.

We can easily change the main method, shown earlier, to turn our trivial service (that runs in foreground mode) into a daemon:

#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>

int main()
{
    pid_t pid = fork();

    if(pid == -1)
    {
        fprintf(stderr, "Can't fork daemon process!\n");
        return 1;
    }
    else if(pid == 0)
        run_main_loop();

    return 0;
}

The above code forks a child process, the child process executes the main loop, and the parent process terminates immediately.

When running the above program on the terminal, we should see that the ./simpleservice command returns almost immediately and a daemon process keeps running in the background. Stopping our shell session (e.g. with the exit command or killing it by sending a TERM signal to it), does not cause the daemon process to be stopped.

This behaviour can be easily explained -- because the shell only waits for the completion of the process that it invokes (the parent process), it will no longer block indefinitely, because it terminates directly after forking the child process.

The daemon process keeps running (even if we end our shell session), because it gets orphaned from the parent and adopted by the process that runs at PID 1 -- the init system.

Writing a well-behaving daemon


The above code fragments probably look very trivial. Is this really sufficient to create a daemon? You could probably already guess that the answer is: no.

To learn more about properly writing a daemon, I studied various sources. The first source I consulted was the Linux Daemon HOWTO, but that document turned out to be a bit outdated (to be precise: it was last updated in 2004). This document basically shows how to implement a very minimalistic version of a well-behaving daemon. It does much more than just forking a child process, for reasons that I will explain later in this blog post.

After some additional searching, I stumbled on systemd's recommendations for writing a traditional SysV daemon (this information can also be found by opening the following manual page: man 7 daemon). systemd's daemon manual page specifies even more steps. Contrary to the Linux Daemon HOWTO, it does not provide any code examples.

Despite the fact that the HOWTO implements more requirements than just a simple fork, it still looked quite simple. Implementing all systemd recommendations, however, turned out to be much more complicated than I expected.

It also made me realize: why is all this stuff needed? None of the sources that I had studied so far explain why all these additional steps need to be implemented.

After some thinking, I believe I understand why: a well-behaving daemon needs to be fully detached from user control, be controllable from an external program, and act safely and predictably.

In the following sections I will explain what I believe is the rationale for each step described in the systemd daemon manual page. Moreover, I will describe the means that I used to implement each requirement:

Closing all file descriptors, except the standard ones: stdin, stdout, stderr


Closing all but the standard file descriptors is a good practice, because the daemon process inherits all open files from the calling process (e.g. the shell session from which the daemon is invoked).

Not closing any additional open file descriptors may cause these file descriptors to remain open for an indefinite amount of time, making it impossible to cleanly unmount the partition where these files may have been stored. Moreover, it also keeps file descriptors unnecessarily allocated.

The daemon manual page describes two strategies to implement closing these non-standard file descriptors. On Linux, it is possible to iterate over the content of /proc/self/fd. A portable, but less efficient, way is to iterate from file descriptor 3 to the limit returned by getrlimit for RLIMIT_NOFILE.

I ended up implementing this step with the following function:

#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

static int close_non_standard_file_descriptors(void)
{
    unsigned int i;
    struct rlimit rlim;

    /* getrlimit() returns 0 on success; the limit itself is in rlim.rlim_cur */
    if(getrlimit(RLIMIT_NOFILE, &rlim) == -1)
        return FALSE;

    for(i = 3; i < rlim.rlim_cur; i++)
        close(i);

    return TRUE;
}
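The Linux-only /proc/self/fd strategy mentioned above could be implemented along these lines (a sketch that assumes the TRUE/FALSE macros defined earlier):

#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

static int close_non_standard_file_descriptors_proc(void)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *entry;

    if(dir == NULL)
        return FALSE;

    while((entry = readdir(dir)) != NULL)
    {
        int fd = atoi(entry->d_name); /* "." and ".." yield 0, skipped below */

        /* Keep stdin/stdout/stderr and the directory's own file descriptor */
        if(fd > 2 && fd != dirfd(dir))
            close(fd);
    }

    closedir(dir);
    return TRUE;
}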

Resetting all signal handlers to their defaults


Similar to file descriptors, the daemon process also inherits the signal handler configuration of the caller process. If signal handlers have been altered, then the daemon process may behave in a non-standard and unpredictable way.

For example, the TERM signal handler could have been overridden so that the daemon no longer cleanly shuts down when it receives a TERM signal. As a countermeasure, the signal handlers must be reset to their default behaviour.

The systemd daemon manual page suggests iterating over all signals up to the limit of _NSIG and resetting them to SIG_DFL.

I did some investigation and it seems that this method is not standardized by e.g. POSIX -- _NSIG is a constant that glibc defines, and there is no guarantee that other libc implementations will provide the same constant.

I ended up implementing the following function:

#include <signal.h>

static int reset_signal_handlers_to_default(void)
{
#if defined _NSIG
    unsigned int i;

    for(i = 1; i < _NSIG; i++)
    {
         if(i != SIGKILL && i != SIGSTOP)
             signal(i, SIG_DFL);
    }
#endif
    return TRUE;
}

The above implementation iterates from the first signal number up until the maximum signal number. It will ignore SIGKILL and SIGSTOP because they cannot be overridden.

Unfortunately, this implementation will not work with libc implementations that lack the _NSIG constant. I am really curious if somebody could suggest me a standards compliant way to reset all signal handlers.

Resetting the signal mask


It is also possible to completely block certain signals by adjusting the signal mask. The signal mask also gets inherited by the daemon from the calling process. To make a daemon act predictably, e.g. it should do a proper shutdown when it receives the TERM signal, it is a good thing to reset the signal mask to the default configuration.

I ended up implementing this requirement with the following function:

#include <signal.h>

static int clear_signal_mask(void)
{
    sigset_t set;

    return((sigemptyset(&set) == 0)
      && (sigprocmask(SIG_SETMASK, &set, NULL) == 0));
}

Sanitizing the environment block


Another property that a daemon process inherits from the caller are the environment variables. Some environment variables might negatively affect the behaviour of the daemon. Furthermore, environment variables may also contain privacy-sensitive information that could get exposed if the security of a daemon gets compromised.

As a countermeasure, it would be good to sanitize the environment block, for example by removing environment variables with clearenv() or by using a whitelisting approach.

For my trivial example case, I did not need to sanitize the environment block because no environment variables are used.
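For daemons that do need it, a minimal sketch could look like this (it assumes the TRUE/FALSE macros from earlier; note that clearenv() is a glibc extension, not POSIX):

#include <stdlib.h>

static int sanitize_environment(void)
{
    /* Wipe the entire environment block */
    if(clearenv() != 0)
        return FALSE;

    /* Re-add only variables that the daemon genuinely needs
     * (the PATH value below is just an example)
     */
    return setenv("PATH", "/usr/bin:/bin", 1) == 0;
}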

Forking a background process


After closing all non-standard file descriptors, and resetting the signal handlers to their default behaviour, we can fork a background process. The primary reason to fork a background process, as explained earlier, is to get it orphaned from the parent so that it gets adopted by PID 1, the init system, and stays in the background.

We must actually fork twice, as I will explain later. First, I fork a child process that I will call the helper process. The helper process does some more housekeeping work and forks another child process, which becomes our daemon process.
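
Schematically, the double fork boils down to the following sketch (error handling and all other housekeeping steps are omitted; run_daemon() is a hypothetical function that initializes the daemon and runs its main loop):

pid_t pid = fork();

if(pid == 0)
{
    /* Helper process: detach from the controlling terminal... */
    setsid();

    /* ...and fork again to create the real daemon process */
    if(fork() == 0)
    {
        run_daemon(); /* Hypothetical: initialize and run the main loop */
        _exit(0);
    }

    /* Exit the helper process, so that the daemon gets adopted by PID 1 */
    exit(0);
}

/* The parent process continues here, e.g. waiting for a notification message */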

Detaching from the terminal


The child process is still attached to the terminal of the caller process: it can still read input from the terminal and send output to it. To completely detach it from the terminal (and any user interaction), we must start a new session:

if(setsid() == -1)
{
    /* Do some error handling */
}

and then we must fork again, so that the daemon can never acquire a controlling terminal again. The second fork creates the real daemon process. The helper process should terminate, so that the newly created daemon process gets adopted by the init system (that runs as PID 1):

if(fork_daemon_process(pipefd[1],
  pid_file,
  data,
  initialize_daemon,
  run_main_loop) == -1)
{
    /* Do some error handling */
}

/*
 * Exit the helper process,
 * so that the daemon process gets adopted by PID 1
 */
exit(0);

Connecting /dev/null to standard input, output and error in the daemon process


Since we have detached from the terminal, we should connect /dev/null to the standard file descriptors in the daemon process, because these file descriptors are still connected to the terminal from which we have detached.

I implemented this requirement with the following function:

#include <fcntl.h>
#include <unistd.h>

#define NULL_DEV_FILE "/dev/null"

static int attach_standard_file_descriptors_to_null(void)
{
    int null_fd_read, null_fd_write;

    return(((null_fd_read = open(NULL_DEV_FILE, O_RDONLY)) != -1)
      && (dup2(null_fd_read, STDIN_FILENO) != -1)
      && ((null_fd_write = open(NULL_DEV_FILE, O_WRONLY)) != -1)
      && (dup2(null_fd_write, STDOUT_FILENO) != -1)
      && (dup2(null_fd_write, STDERR_FILENO) != -1));
}

Resetting the umask to 0 in the daemon process


The umask (a per-process setting that masks the permission bits of newly created files and directories) may have been adjusted by the calling process, causing directories and files created by the daemon to have unpredictable file permissions.

As a countermeasure, we should reset the umask to 0 with the following function call:

umask(0);

Changing current working directory to / in the daemon process


The daemon process also inherits the current working directory of the caller process. It may happen that the current working directory refers to an external drive or partition, with the result that the partition can no longer be cleanly unmounted while the daemon is running.

To prevent this from happening, we should change the current working directory to the root directory, because the root filesystem is the only one that is guaranteed to stay mounted while the system is running:

if(chdir("/") == -1)
{
    /* Do some error handling */
}

Creating a PID file in the daemon process


Because a program that daemonizes forks another process and terminates immediately, there is no way for the caller (e.g. the shell) to know what the process ID (PID) of the daemon process is. The caller only knows the PID of the parent process, which terminates right after setting up the daemon.

A common practice to expose the PID of the daemon process is to write a PID file that contains its process ID (PID). A PID file can be used to reliably terminate the service when it is no longer needed.

According to the systemd recommendations, a PID file must be created in a race-free fashion, e.g. when a daemon has already been started, it should not attempt to create another PID file with the same name.

I ended up implementing this requirement as follows:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

static int create_pid_file(const char *pid_file)
{
    pid_t my_pid = getpid();
    char my_pid_str[16];
    int fd;

    /* pid_t is not guaranteed to be an int, so cast it explicitly */
    snprintf(my_pid_str, sizeof(my_pid_str), "%d", (int)my_pid);

    if((fd = open(pid_file, O_CREAT | O_EXCL | O_WRONLY, S_IRUSR | S_IWUSR)) == -1)
        return FALSE;

    if(write(fd, my_pid_str, strlen(my_pid_str)) == -1)
    {
        close(fd); /* Do not leak the file descriptor on a failed write */
        return FALSE;
    }

    close(fd);

    return TRUE;
}

In the above implementation, the O_EXCL flag, in combination with O_CREAT, makes open() fail when the PID file already exists, e.g. one generated by a previous instance. As a result, the initialization of the daemon fails if a PID file is already present.

Dropping privileges in the daemon process, if applicable


Since daemons are long-running processes that are typically started by the super user (root), they are also a potential security risk. If a process is started as root, the daemon process also runs with root privileges, and a compromised daemon gives an attacker full access to the entire filesystem.

For this reason, it is typically a good idea to drop privileges in the daemon process. There are a variety of restrictions you can impose, such as changing the ownership of the process to an unprivileged user. Note that the group must be dropped before the user: once setuid() has succeeded, the process is no longer allowed to change its group:

if(setgid(100) == 0 && setuid(1000) == 0)
{
    /* Execute some code with restrictive user permissions */
    ...
}
else
{
    fprintf(stderr, "Cannot change user permissions!\n");
    exit(1);
}
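
Instead of hard-coding numeric IDs, such as GID 100 and UID 1000 in the example above, we could also resolve the IDs of an unprivileged user by name. A sketch, assuming a hypothetical user account named daemonuser:

#include <pwd.h>

/* Look up the unprivileged account in the user database */
struct passwd *pwd = getpwnam("daemonuser");

if(pwd == NULL || setgid(pwd->pw_gid) == -1 || setuid(pwd->pw_uid) == -1)
{
    fprintf(stderr, "Cannot change user permissions!\n");
    exit(1);
}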

In my trivial example case, I had no such requirement.

Notifying the parent process when the initialization of the daemon is complete


Another practical problem you may run into with daemons is that you do not know (for sure) when they are ready to be used. Because the parent process terminates immediately and delegates most of the work, including the initialization steps, to the daemon process (that runs in the background), you may already attempt to use the service before its initialization is done. For example, if the daemon provides a network service, then connections may still fail right after starting it, simply because the daemon is not listening yet.

Furthermore, there is no way to know for sure how long it takes before all the daemon's services become available. This is particularly inconvenient for scripting.

For me personally, notification was the most complicated requirement to implement.

systemd's daemon manual page suggests using an unnamed pipe. I ended up with an implementation that looks as follows:

Before doing any forking, I create a pipe and pass the corresponding file descriptors to the utility function that creates the helper process, as described earlier:

int pipefd[2];

if(pipe(pipefd) == -1)
    return STATUS_CANNOT_CREATE_PIPE;
else
{
    if(fork_helper_process(pipefd, pid_file, data, initialize_daemon, run_main_loop) == -1)
        return STATUS_CANNOT_FORK_HELPER_PROCESS;
    else
    {
        /* Wait for a notification message from the helper or daemon process */
    }
}

The helper and daemon process will use the write end of the pipe to send notification messages. I ended up using it as follows:

static pid_t fork_helper_process(int pipefd[2],
  const char *pid_file,
  void *data,
  int (*initialize_daemon) (void *data),
  int (*run_main_loop) (void *data))
{
    pid_t pid = fork();

    if(pid == 0)
    {
        close(pipefd[0]); /* Close unneeded read-end */

        if(setsid() == -1)
        {
            notify_parent_process(pipefd[1], STATUS_CANNOT_SET_SID);
            exit(STATUS_CANNOT_SET_SID);
        }

        /* Fork again, so that the terminal can not be acquired again */
        if(fork_daemon_process(pipefd[1], pid_file, data, initialize_daemon, run_main_loop) == -1)
        {
            notify_parent_process(pipefd[1], STATUS_CANNOT_FORK_DAEMON_PROCESS);
            exit(STATUS_CANNOT_FORK_DAEMON_PROCESS);
        }

        exit(0); /* Exit the helper process, so that the daemon process gets adopted by PID 1 */
    }

    return pid;
}

If something fails, or the entire initialization process completes successfully, the helper and daemon processes invoke the notify_parent_process() function to send a message over the write end of the pipe to notify the parent. In case of an error, the helper or daemon process also terminates with the same exit status.

I implemented the notification function as follows:

static void notify_parent_process(int writefd, DaemonStatus message)
{
    char byte = (char)message;

    /* Repeat the write if nothing was written yet */
    while(write(writefd, &byte, 1) == 0);

    close(writefd);
}

The above function simply sends a message (of only one byte in size) over the pipe and then closes the write end. The possible messages are encoded in the following enumeration:

typedef enum
{
    STATUS_INIT_SUCCESS                  = 0x0,
    STATUS_CANNOT_ATTACH_STD_FDS_TO_NULL = 0x1,
    STATUS_CANNOT_CHDIR                  = 0x2,
    ...
    STATUS_CANNOT_SET_SID                = 0xc,
    STATUS_CANNOT_FORK_DAEMON_PROCESS    = 0xd,
    STATUS_UNKNOWN_DAEMON_ERROR          = 0xe
}
DaemonStatus;

The parent process will not terminate immediately, but waits for a notification message from the helper or daemon processes:

DaemonStatus exit_status;

close(pipefd[1]); /* Close unneeded write end */
exit_status = wait_for_notification_message(pipefd[0]);
close(pipefd[0]);
return exit_status;

When the parent receives a notification message, it simply propagates the value as an exit status, which is 0 if everything succeeds and non-zero when the process fails somewhere. A non-zero exit status corresponds to a value in the enumeration (shown earlier), allowing us to trace the origin of the error.

The function that waits for the notification of the daemon process is implemented as follows:

static DaemonStatus wait_for_notification_message(int readfd)
{
    char buf[BUFFER_SIZE];
    ssize_t bytes_read = read(readfd, buf, 1);

    if(bytes_read == -1)
        return STATUS_CANNOT_READ_FROM_PIPE;
    else if(bytes_read == 0)
        return STATUS_UNKNOWN_DAEMON_ERROR;
    else
        return buf[0];
}

The above function reads from the pipe, blocks as long as no data was sent and the write end of the pipe was not closed, and returns the byte that it has received. When the write end gets closed before any data was sent (bytes_read is 0), we know that the daemon process has terminated without reporting its status, which translates to STATUS_UNKNOWN_DAEMON_ERROR.

Exiting the parent process after the daemon initialization is done


This requirement overlaps with the previous requirement and can be met by calling exit() after the notification message was sent (and/or the write end of the pipe was closed).
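
For example, a hypothetical top-level daemonize() function that bundles all the steps described in this blog post could be used as follows, with the parent propagating the received status value as its exit status:

int status = daemonize(pid_file, data, initialize_daemon, run_main_loop);

if(status != STATUS_INIT_SUCCESS)
    fprintf(stderr, "Cannot initialize the daemon! Status: %d\n", status);

/* The parent exits only after the daemon has reported its status */
exit(status);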

Discussion


I really did not expect that writing a well-behaving daemon (that follows systemd's recommendations) would be so difficult. I ended up writing 206 LOC to implement all the functionality listed above. Maybe I could reduce this amount a bit with some clever programming tricks, but my objective was to keep the code clear, decompose it into functions, and make it understandable.

There are solutions that alleviate the burden of creating a daemon. A prominent example is BSD's daemon() function (which is also included with glibc). It is a single function call that can be used to automatically daemonize a process. Unfortunately, it does not seem to meet all requirements that systemd specifies.
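
For comparison, using daemon() is trivial. A minimal sketch: passing 0 for both parameters makes it change the current working directory to / and redirect the standard file descriptors to /dev/null:

#include <unistd.h>

if(daemon(0, 0) == -1)
{
    /* Do some error handling */
}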

I also looked at many Stack Overflow posts, and although they correctly cite systemd's daemon manual page with the requirements for a well-behaving daemon, none of the solutions that I could find fully meet all of them -- in particular, I could not find any good examples that implement a protocol that notifies the parent process when the daemon process was successfully initialized.

Because none of these Stack Overflow posts provide what I need, I have decided not to use any of them as an example, but to start from scratch and look up all the relevant pieces myself.

One aspect that still puzzles me is how to "properly" iterate over all signal handlers. The solution hinted at by systemd is non-standard, because it requires a glibc-specific constant. Some sources say that there is no standardized equivalent, so I am still curious whether there is a recipe that can reset all signal handlers to their default behaviour in a standards-compliant way.

In the introduction section, I mentioned that daemons are still a common practice in UNIX-like systems, such as Linux, but that this is changing. IMO this is for a good reason -- services typically need to reimplement the same kind of functionality over and over again. Furthermore, I have noticed that not all daemons meet all requirements and some behave incorrectly. For example, there is no guarantee that a daemon correctly writes a PID file containing the PID of the daemon process.

For these reasons, systemd's daemon manual page also describes "new style daemons", which are considerably easier to implement and require less boilerplate code. Apple has similar recommendations for launchd.

With "new style daemons", processes just spawn in foreground mode, and the process manager (e.g. systemd, launchd or supervisord) takes care of all "housekeeping tasks" -- the process manager makes sure that it runs in the background, drops user privileges etc.

Furthermore, because the process manager directly invokes the daemon process (and as a result knows its PID), controlling a daemon is also less fragile -- the requirement that a PID file needs to be properly created is also dropped.
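
To illustrate how much simpler this is: a new-style daemon just runs in the foreground, and a process manager configuration, such as the hypothetical systemd unit file shown below, takes care of the rest:

[Unit]
Description=Example web application

[Service]
ExecStart=/path/to/webapp
Type=simple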

Availability


The daemonize infrastructure described in this blog post is used by the example webapp that can be found in my experimental Nix process framework repository.