Sander van der Burg's blog: Dynamic analysis of build processes to discover license constraints

As I have explained earlier in the blog post about software deployment complexity, software is rarely self-contained nowadays, but typically use many off the shelf components. Reuse has advantages, such as the fact that productivity increases and products can be finished more quickly. One of the disadvantages is an increasingly more complicated deployment process.

Apart from productivity and deployment aspects, the usage of components under Free and Open Source licenses is very popular. This is probably due to the fact that the source code is available, can be adapted and most software packages are available for free through the internet (free in price, a.k.a. gratis).

What a lot of vendors don't realize, is that most Free and Open-Source components are not in the public domain. They are in fact copyrighted and distributed under licenses ranging from simple permissive ones (allowing you to do almost everything you want including keeping modifications secret) to more complicated copyleft licenses imposing requirements on derived works. The GNU General Public License (GPL) is the most famous copyleft license around.

Because of the obligations that these licenses impose on users, licenses have become a very important non-functional requirement of software systems. Not obeying these licenses could result in costly lawsuits by copyright holders. Busybox is a well-know example of a software package which has been defended successfully in court several times.

Some clarification

Before you read on, I first want to give some clarification to readers unfamiliar with Free and Open Source software. I have written an earlier blog post about Free and Open-Source software explaining what it is and what it is not. In this blog post I also try to clarify a number of common misconceptions.

Most outsiders think that these lawsuits are about money, which is not true. These lawsuits are not held because people include FOSS components in their commercial products and ask money for these products. As I have explained earlier, selling free and open-source software is fine.

These lawsuits are held because some copyleft licenses require that the source code of the derived products or parts thereof are available under the conditions of the same license, which includes access to the source code. Typically, many vendors refrain from publishing source code and do not obey the obligations that these licenses specify. In many cases, vendors are unaware of this.

Background

The research I have done about this subject has a bit of history, which I'd like to explain here :-)

A couple of years ago, when I was in my first year as a PhD student, I've attended ICSE 2009 and there was one talk that I found very inspiring and gave me a lot of ideas. The paper was titled: 'License Integration Patterns: Addressing License Mismatches in Component-Based Development' and presented by Ahmed E. Hassan, which basically covered a large number of FOSS licenses and described patterns how to combine components governed under various licenses in a proper way.

Although their paper covers a great amount of license issues, they were still looking into automation of their patterns, for example to automatically verify derived works. This process turns out te be quite challenging because automating such processes require you to have powerful deployment tools and a complete notion of all dependencies involved in producing an artifact, such as a binary. Fortunately, deployment research is our expertise and our tools are designed for such purposes. For a while, I had several ideas about a possible solution in mind, but I never implemented anything.

Some time later at ASE 2010, which I also attended, there was another talk related to this subject titled: 'A sentence-matching method for automatic license identification of source code files' and given by Yuki Manabe. In this paper a tool was developed, called Ninka, which can be used to analyse sentences in comments of the source code, to determine under which license a source file is governed. I asked Yuki whether the tool was available somewhere, but unfortunately the idea quickly appeared on the bottom of my todo list and I forgot about this (which is a shame).

A while later Daniel German, who is involved in all the publications I have mentioned, was invited by Eelco Dolstra to visit our group in Delft. That visit resulted in an eventual collaboration between me, Eelco Dolstra (from our group), Daniel German, Julius Davies (from University of Victoria) and Armijn Hemel (who is from the gpl-violations.org project as well as owner of Tjaldur, a company specialised in software governance and license compliance engineering).

Motivation

In order to say something about the rights and obligations of software systems, you must know the following things:

How are source ﬁles combined into a final executable (e.g. static linking, dynamic linking)?
What licenses govern the (re)use of source ﬁles that were used to build the binary?
How can we derive the license of the resulting binary from the build graph containing the source ﬁles?

We together wrote a paper to provide an answer for the first question.

Approach

To provide a good answer for that question, we have crafted a method which traces system calls of build processes (essentially the involved processes and what files go in and out) and we produce build graphs out of these traces. Furthermore, the traces of each package are stored in a central database, so that inter package-dependencies can be determined.

We have used the Nix package manager to manage all the build processes. Nix is a very convenient instrument, as it has a number of good features, such as the fact that builds are pure (so no undeclared dependencies can affect the reliability of our results), that it guarantees dependency completeness (so that we are certain that no crucial dependencies affecting the license of the result binary are missed) and because Nix stores all packages in isolation in separate directories, we can easily identify inter-package relationships by looking at absolute file names. Furthermore, the Nix expression language allows us to modify the standard builder environment, without changing any package build specifications.

Tracing system calls

We trace the following system calls:

File related system calls, e.g. open(), execve()
Process related system calls, e.g. fork(), clone(), vfork()

Apart from capturing traces, there were a number issues we had to deal with:

We have to translate all relative paths to absolute paths
In Linux, pids wrap around if they exceed 32767, so we have to use a different attribute to distinguish between processes
Cycles appear in the graph, if files are read/written multiple times, which we have to remove
The are coarse grained processes, such as the install process, which install multiple files in one run. In the resulting graph it looks like the resulting artifact is dependent on all other artifiacts installed by the same process, which is not true. We have to identify these processes and rewrite them.
We don't want to know anything about the dependencies of the build tools themselves, because these are not considered derived work.

I have kept the mechanics intentionally brief here, because I don't want to explain them again here. The exact details can be found in the paper.

Build trace graphs

Below I have included a graph of cupsd, an executable belonging to the CUPS package, which we have derived with our tool:

The SVG pictures of this graph as well as several other graphs, can be found here.

By using a graph such as the one of cupsd and by using Ninka to analyse the source files in this graph, one can say something about the license under which the resulting binary is covered. In the paper, we have found an interesting problem with a well known free/open-source package, which I'm not going to reveal in this blog post :-)

Discussion

A reader with Nix experience may probably wonder why we have implemented an additional tracing approach, next to the Nix package manager. The answer is that Nix works on package level, but licenses do not always cover complete packages. There are packages in which individual files are covered under several licenses. Therefore, a more fine-grained tracing approach is required.

Unfortunately, the paper was rejected from ICSE 2012, which I was a bit disappointed about a while ago (although I'm still there anyway because I have to present another paper at HotSWUp). The fact that a paper is not "good enough" is not really what bothers me, but what bothers me is that it is a bit unclear whether this contribution is useful or useless and the fact that the solution is seen as 'too simple' (which is NOT a bad thing IMHO).

Perhaps it may indeed be too simple for a top general conference, but I also have no idea to what other type of conference or journal I could send this. And if this solution is too "practical", would it then perhaps be useful for a 'Software Engineering in Practice Track / Experience Track' at some conference? Although I have heard somebody talking about "engineering perspective", I haven't heard any reviewers suggesting about submitting to another track type.

The only thing that becomes clear to me from the reviews I have received, is that they are not really critical about the contents (although certain details can be strengthened of course), but rather about the significance of the contribution.

I have also noticed that the goal of this paper is generally misunderstood. People think that we are actually solving the complete licensing problem, but instead we provide an important ingredient, which is not there yet. Realising these build graphs, which cover complete build processes are already complicated enough, although the idea of using system call tracing for various purposes is not new. Nobody, however, has used system call tracing for this purpose yet (and therefore had to solve several problems as well). And furthermore, because we're using Nix, the process of experimenting with builds, suddenly becomes much simpler, which with conventional solutions will take significantly more effort.

If you look to the three questions I have given earlier, the paper is about the first question. The ASE 2010 paper provides a solution for the second. The third question is still future work, for an eventual license calculus. But in order to develop such a license calculus the ingredient of complete build trace graphs is required. I'm pretty sure that if I would talk to software deployment people about this, that this story will be appreciated. Unfortunately, as I have explained before, software deployment is a very cold research subject, without a real community.

I'm still thinking what to do with this paper, but I have no idea yet. Furthermore, the amount of time that I have left, is pretty limited. I have decided to put it online and announce it through this blog anyway. Normally, I always report about papers after they have been accepted, but not everything in research can be a 'success story'. Of course, I'm always open for all suggestions.

References

The paper is titled: 'Discovering Software License Constraints: Identifying a Binary's Sources by Tracing Build Processes'. As always, papers can be obtained from my publications page.

The techniques described in this blog post are becoming part of the service portfolio of Tjaldur, a company specialised in software governance. Furthermore, one day I expect that this tool is also going to be integrated in the Nix project.

UPDATE: Never give up! In the meantime, an updated version of this paper titled: "Tracing software build processes to uncover license compliance inconsistencies" has been accepted for ASE 2014! I owe a big thanks to Shane McIntosh, who did some major efforts in improving the paper, and he showed me that there are always new possibilities. Sometimes, it's good to be wrong about something! :-)

Sander van der Burg's blog

Thursday, April 26, 2012

Dynamic analysis of build processes to discover license constraints