Sander van der Burg's blog: April 2012

Thursday, April 26, 2012

Dynamic analysis of build processes to discover license constraints

As I have explained earlier in the blog post about software deployment complexity, software is rarely self-contained nowadays, but typically uses many off-the-shelf components. Reuse has advantages: productivity increases and products can be finished more quickly. One of the disadvantages is an increasingly complicated deployment process.

Apart from productivity and deployment aspects, the usage of components under Free and Open Source licenses is very popular. This is probably due to the fact that the source code is available, can be adapted and most software packages are available for free through the internet (free in price, a.k.a. gratis).

What a lot of vendors don't realize is that most Free and Open Source components are not in the public domain. They are in fact copyrighted and distributed under licenses ranging from simple permissive ones (allowing you to do almost everything you want, including keeping modifications secret) to more complicated copyleft licenses imposing requirements on derived works. The GNU General Public License (GPL) is the most famous copyleft license around.

Because of the obligations that these licenses impose on users, licenses have become a very important non-functional requirement of software systems. Not obeying these licenses could result in costly lawsuits by copyright holders. Busybox is a well-known example of a software package which has been defended successfully in court several times.

Some clarification

Before you read on, I first want to give some clarification to readers unfamiliar with Free and Open Source software. I have written an earlier blog post about Free and Open-Source software explaining what it is and what it is not. In this blog post I also try to clarify a number of common misconceptions.

Most outsiders think that these lawsuits are about money, which is not true. These lawsuits are not held because people include FOSS components in their commercial products and charge money for these products. As I have explained earlier, selling free and open-source software is fine.

These lawsuits are held because some copyleft licenses require that derived products, or parts thereof, are made available under the conditions of the same license, which includes access to the source code. Typically, many vendors refrain from publishing source code and do not obey the obligations that these licenses specify. In many cases, vendors are unaware of this.

Background

The research I have done about this subject has a bit of history, which I'd like to explain here :-)

A couple of years ago, in my first year as a PhD student, I attended ICSE 2009, where one talk inspired me and gave me a lot of ideas. The paper was titled 'License Integration Patterns: Addressing License Mismatches in Component-Based Development' and was presented by Ahmed E. Hassan. It covered a large number of FOSS licenses and described patterns for properly combining components governed by various licenses.

Although their paper covers a great number of license issues, they were still looking into automation of their patterns, for example to automatically verify derived works. This process turns out to be quite challenging, because automating such processes requires powerful deployment tools and a complete notion of all dependencies involved in producing an artifact, such as a binary. Fortunately, deployment research is our expertise and our tools are designed for such purposes. For a while, I had several ideas about a possible solution in mind, but I never implemented anything.

Some time later at ASE 2010, which I also attended, there was another talk related to this subject, titled 'A sentence-matching method for automatic license identification of source code files' and given by Yuki Manabe. In this paper a tool called Ninka was developed, which analyses sentences in the comments of source code to determine by which license a source file is governed. I asked Yuki whether the tool was available somewhere, but unfortunately the idea quickly ended up at the bottom of my todo list and I forgot about it (which is a shame).

A while later Daniel German, who is involved in all the publications I have mentioned, was invited by Eelco Dolstra to visit our group in Delft. That visit resulted in an eventual collaboration between me, Eelco Dolstra (from our group), Daniel German, Julius Davies (from University of Victoria) and Armijn Hemel (who is from the gpl-violations.org project as well as owner of Tjaldur, a company specialised in software governance and license compliance engineering).

Motivation

In order to say something about the rights and obligations of software systems, you must know the following things:

How are source files combined into a final executable (e.g. static linking, dynamic linking)?
What licenses govern the (re)use of source files that were used to build the binary?
How can we derive the license of the resulting binary from the build graph containing the source files?

Together, we wrote a paper that provides an answer to the first question.

Approach

To provide a good answer to that question, we have crafted a method which traces the system calls of build processes (essentially the involved processes and what files go in and out) and produces build graphs out of these traces. Furthermore, the traces of each package are stored in a central database, so that inter-package dependencies can be determined.
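To make the idea of a build graph a bit more concrete, here is a minimal sketch in Python. It is not the tool from the paper: the event format (process key, access mode, absolute path) and the names `TRACE`, `build_graph` and `sources_of` are assumptions I made up for illustration. Files that are read become inputs of a process, files that are written become its outputs, and walking the edges backwards from an artifact yields every file that flowed into it.

```python
from collections import defaultdict

# Hypothetical, heavily simplified trace events:
# (process key, access mode, absolute path). The format is an assumption.
TRACE = [
    ("p1", "read", "/src/hello.c"),
    ("p1", "write", "/build/hello.o"),
    ("p2", "read", "/build/hello.o"),
    ("p2", "write", "/build/hello"),
]


def build_graph(events):
    """Return edges file -> process (reads) and process -> file (writes)."""
    edges = defaultdict(set)
    for proc, mode, path in events:
        if mode == "read":
            edges[path].add(proc)
        elif mode == "write":
            edges[proc].add(path)
    return edges


def sources_of(graph, artifact):
    """Collect all files that transitively flow into `artifact`."""
    # Invert the edges so we can walk from the artifact back to its inputs.
    reverse = defaultdict(set)
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].add(src)

    seen, stack = set(), [artifact]
    while stack:
        node = stack.pop()
        for pred in reverse[node]:
            if pred not in seen:
                seen.add(pred)
                stack.append(pred)

    # File nodes are absolute paths; everything else is a process node.
    return {node for node in seen if node.startswith("/")}
```

With the example trace, `sources_of(build_graph(TRACE), "/build/hello")` yields both the object file and the C source file it was compiled from.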

We have used the Nix package manager to manage all the build processes. Nix is a very convenient instrument, as it has a number of good features: builds are pure (so no undeclared dependencies can affect the reliability of our results), it guarantees dependency completeness (so that we are certain that no crucial dependencies affecting the license of the resulting binary are missed) and, because Nix stores all packages in isolation in separate directories, we can easily identify inter-package relationships by looking at absolute file names. Furthermore, the Nix expression language allows us to modify the standard builder environment, without changing any package build specifications.
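As a rough illustration of why absolute file names are enough to spot inter-package relationships: every Nix package lives under its own `/nix/store/<hash>-<name>` prefix, so the owning package can be read off a traced path. The sketch below is my own simplification, not code from our tool; the regular expression is looser than the real hash alphabet and the function names are hypothetical.

```python
import re

# Store paths look like /nix/store/<hash>-<name>/...; the hash part is
# 32 characters long. The character class is a simplification.
STORE_PATH = re.compile(r"^(/nix/store/[0-9a-z]{32}-([^/]+))(/.*)?$")


def package_of(path):
    """Return (store_path, package_name), or None for paths outside the store."""
    match = STORE_PATH.match(path)
    if match is None:
        return None
    return match.group(1), match.group(2)


def external_packages(own_store_path, accessed_paths):
    """Names of all other packages whose files were accessed during a build."""
    deps = set()
    for path in accessed_paths:
        pkg = package_of(path)
        if pkg is not None and pkg[0] != own_store_path:
            deps.add(pkg[1])
    return deps
```

Paths that do not live in the store (for example files in the temporary build directory) simply yield `None` and are treated as belonging to the package being built.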

Tracing system calls

We trace the following system calls:

File-related system calls, e.g. open(), execve()
Process-related system calls, e.g. fork(), clone(), vfork()

Apart from capturing traces, there were a number of issues we had to deal with:

We have to translate all relative paths to absolute paths
In Linux, pids wrap around if they exceed 32767 (by default), so we have to use a different attribute to distinguish between processes
Cycles appear in the graph if files are read/written multiple times, which we have to remove
There are coarse-grained processes, such as the install process, which install multiple files in one run. In the resulting graph it looks like the resulting artifact is dependent on all other artifacts installed by the same process, which is not true. We have to identify these processes and rewrite them.
We don't want to know anything about the dependencies of the build tools themselves, because these are not considered derived work.
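The first two issues in the list above can be sketched as follows. This is only an illustration under my own assumptions: the paper does not prescribe a counter-based key or this class, and `ProcessTable` and its methods are hypothetical names. The idea is that every process creation event gets a fresh key, so a reused pid never gets confused with an earlier process, and that relative paths are resolved against the working directory of the process that used them.

```python
import posixpath
from itertools import count


class ProcessTable:
    """Give every traced process a key that survives pid reuse."""

    def __init__(self):
        self._next_key = count(1)
        self._live = {}  # pid -> key of the process currently using that pid
        self._cwd = {}   # key -> current working directory of that process

    def spawn(self, pid, cwd):
        # Called for a fork()/clone()/vfork() event. The pid may have been
        # used by an earlier process, so always hand out a fresh key.
        key = next(self._next_key)
        self._live[pid] = key
        self._cwd[key] = cwd
        return key

    def chdir(self, pid, path):
        # Keep track of directory changes, otherwise later relative paths
        # would be resolved against the wrong directory.
        key = self._live[pid]
        self._cwd[key] = self.resolve(pid, path)

    def resolve(self, pid, path):
        # Absolute paths pass through; relative ones are joined with the cwd.
        key = self._live[pid]
        return posixpath.normpath(posixpath.join(self._cwd[key], path))
```

For example, after `table.spawn(100, "/build")`, the call `table.resolve(100, "src/../main.c")` gives `/build/main.c`, and a second `spawn` with pid 100 returns a different key than the first.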

I have intentionally kept the mechanics brief here. The exact details can be found in the paper.

Build trace graphs

Below I have included a graph of cupsd, an executable belonging to the CUPS package, which we have derived with our tool:

The SVG pictures of this graph as well as several other graphs, can be found here.

By using a graph such as the one of cupsd and by using Ninka to analyse the source files in this graph, one can say something about the license under which the resulting binary is covered. In the paper, we have found an interesting problem with a well known free/open-source package, which I'm not going to reveal in this blog post :-)
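To give an impression of what combining the two ingredients looks like, here is a small sketch. It assumes that the set of source files behind a binary has already been extracted from the build graph and that some license identification step (Ninka in our case) has produced a per-file result. The dictionary format, the file names and the function `licenses_of` are hypothetical; this is not Ninka's actual output format or API.

```python
def licenses_of(sources, license_of_file):
    """Group the source files behind a binary by their detected license."""
    by_license = {}
    for path in sorted(sources):
        # Files without a detection result are reported explicitly,
        # because they need manual inspection.
        license_name = license_of_file.get(path, "UNKNOWN")
        by_license.setdefault(license_name, []).append(path)
    return by_license


# Hypothetical example data, not taken from a real package.
sources = {"/src/main.c", "/src/util.c", "/src/vendor/parser.c"}
detected = {"/src/main.c": "GPLv2+", "/src/util.c": "GPLv2+"}
report = licenses_of(sources, detected)
```

Note that such a report only lists which licenses are involved. Deciding what license the resulting binary is covered by is a different question.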

Discussion

A reader with Nix experience may wonder why we have implemented an additional tracing approach next to the Nix package manager. The answer is that Nix works on package level, but licenses do not always cover complete packages. There are packages in which individual files are covered by several licenses. Therefore, a more fine-grained tracing approach is required.

Unfortunately, the paper was rejected from ICSE 2012, which I was a bit disappointed about a while ago (although I will still be there anyway, because I have to present another paper at HotSWUp). That a paper is not "good enough" is not really what bothers me. What bothers me is that it is a bit unclear whether this contribution is useful or useless, and that the solution is seen as 'too simple' (which is NOT a bad thing IMHO).

Perhaps it may indeed be too simple for a top general conference, but I also have no idea to what other type of conference or journal I could send this. And if this solution is too "practical", would it then perhaps be useful for a 'Software Engineering in Practice Track / Experience Track' at some conference? Although I have heard somebody talking about an "engineering perspective", I haven't heard any reviewers suggest submitting to another track type.

The only thing that becomes clear to me from the reviews I have received, is that they are not really critical about the contents (although certain details can be strengthened of course), but rather about the significance of the contribution.

I have also noticed that the goal of this paper is generally misunderstood. People think that we are solving the complete licensing problem, but instead we provide an important ingredient that was not there yet. Realising these build graphs, which cover complete build processes, is already complicated enough, although the idea of using system call tracing for various purposes is not new. Nobody, however, has used system call tracing for this purpose yet (and we therefore had to solve several problems as well). Furthermore, because we're using Nix, experimenting with builds suddenly becomes much simpler, whereas with conventional solutions it would take significantly more effort.

If you look at the three questions I have given earlier, the paper is about the first question. The ASE 2010 paper provides a solution for the second. The third question is still future work, for an eventual license calculus. But in order to develop such a license calculus, the ingredient of complete build trace graphs is required. I'm pretty sure that if I talked to software deployment people about this, the story would be appreciated. Unfortunately, as I have explained before, software deployment is a very cold research subject, without a real community.

I'm still thinking about what to do with this paper, but I have no idea yet. Furthermore, the amount of time that I have left is pretty limited. I have decided to put it online and announce it through this blog anyway. Normally, I always report about papers after they have been accepted, but not everything in research can be a 'success story'. Of course, I'm always open to suggestions.

References

The paper is titled: 'Discovering Software License Constraints: Identifying a Binary's Sources by Tracing Build Processes'. As always, papers can be obtained from my publications page.

The techniques described in this blog post are becoming part of the service portfolio of Tjaldur, a company specialised in software governance. Furthermore, one day I expect that this tool is also going to be integrated in the Nix project.

UPDATE: Never give up! In the meantime, an updated version of this paper titled: "Tracing software build processes to uncover license compliance inconsistencies" has been accepted for ASE 2014! I owe a big thanks to Shane McIntosh, who put major effort into improving the paper and showed me that there are always new possibilities. Sometimes, it's good to be wrong about something! :-)

Tuesday, April 17, 2012

Software engineering fractions, collaborations and "The System"

As I have explained many times on this blog, software engineering is complicated. It happens quite often, that in order to solve problems or to gain knowledge, people collaborate with each other by various means. Nowadays, we have a wide range of means to share information and to collaborate. A few examples are:

Academic research publication libraries: ACM Digital Library, IEEE Xplore, SpringerLink
Conferences:
- Academic: ICSE, ASE, ISSRE, SPLASH
- Industrial:
- Free and Open-Source software related: FOSDEM, EclipseCon, LinuxCon
Internet services:
- Question and answer websites: Stack Overflow
- Technology related websites: Slashdot, Phoronix, Ars Technica
- Social news websites: Reddit
- Source code sharing and collaboration: SourceForge, GitHub
- (Micro)blogs: Twitter, Blogger
- Messaging: IRC, Mailing Lists

As you may notice, some of these means are quite common in certain fractions of the software engineering community and very uncommon in others. I can roughly divide the software engineering community into three fractions, having a number of distinct characteristics, interests and peculiarities (beware that I'm using stereotypes here and these fractions are not necessarily mutually exclusive):

Academic fraction

This is the group to which I currently belong. Academics are people working for a university and their main goal is doing scientific research. As I have explained earlier, scientific research within the software engineering domain is a bit strange, because there is no clear consensus on what this actually means. Earlier, I have dived into the literature and I have found a definition, which I will repeat here once more:

Research in engineering is directed towards the efficient accomplishment of specific tasks and towards the development of tools that will enable classes of tasks to be accomplished more efficiently.

The "deliverables" that academic people produce are, in principle, papers published in peer-reviewed academic conference proceedings and scientific journals. Paper submissions are typically in competition with each other and only papers that are good enough are eligible for publication. The software engineering domain is a bit exceptional compared to many other scientific disciplines, because conference papers are more popular than journal papers.

Academic conferences are primarily about presenting research papers. Most of the attendees of these conferences are other academic people. According to some sources, there used to be a high participation degree of industry people in the past, but this is no longer true, unfortunately.

Industry fraction

The primary goal of industry is (not surprisingly) to develop and sell software (as a product or as a service) or to provide IT services and make profit, which is (preferably) as high as possible.

In order to achieve that goal, industry typically wants to be as cost efficient as possible. Therefore, companies don't want to invest too much time and effort in secondary goals, such as developing software tools, as these tools cost money and do not immediately give them any profits. Rather, they want to focus on their primary goals as much as possible.

The conferences that industry people attend, are often related to the technology they are using. For example, companies using Java typically attend JavaOne, or companies using Microsoft technology may attend Microsoft TechEd. Companies may also participate in "trade show" conferences, such as CeBIT, to advertise their products and to attract potential new employees.

Community project fraction

Another fraction is what I call the "community project" fraction. Typically, a lot of people refer to this fraction as "Open-Source projects", but I don't want to refer to them like that. While most community projects are distributing their software under free and open-source licenses, there are also commercial parties doing this, without outside involvement. I have written an earlier blog post about Free and Open Source software explaining what this is all about.

Community projects are usually formed by various individuals with various affiliations, who share a common interest and work together on common goals. There are many prominent examples of community projects around, such as The Linux Kernel and KDE.

Another notable trait of community projects is that nearly all contributors are also users of the software. Many community projects have an informal organization structure and copyright is owned by each individual contributor. Some community projects are also governed by a legal entity owning the entire codebase, such as the Free Software Foundation, Apache Software Foundation and the Eclipse Foundation.

There are also a number of free and open source conferences around, such as FOSDEM and LinuxCon. Most attendees of these conferences are either users or contributors to community projects and very much interested in new capabilities of software.

"The System"

The industrial and academic fractions have "targets" which they have to reach within a certain period of time. Usually this period is short term. These fractions also want to grow, improve and perform better than the competition.

People in both fractions are periodically assessed by some kind of measurement standard, indicating whether the results (according to this standard) are satisfactory and have increased enough. As a consequence, people in these fractions have a tendency to do as much as possible to improve these numbers according to this standard, rather than doing what has to be done. I call this phenomenon "The System".

Implications of "The System" on the academic fraction

In the academic world, publication records are used as the main productivity / efficiency measurement unit. This often means that the more papers you produce, the better you are as a researcher. Furthermore, various other publication quality attributes are typically taken into account, such as the ranking of the conference or journal and the number of citations that you have. Some metrics that measure a researcher's productivity are the g-index and the h-index.
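For readers unfamiliar with these metrics, they are easy to compute from a list of citation counts. The sketch below uses the common definitions (the h-index is the largest h such that h papers have at least h citations each; the g-index is the largest g such that the g most cited papers together have at least g squared citations). The function names are my own, and I assume the variant of the g-index that is capped at the number of papers.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers have at least g*g citations in total."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g
```

For a researcher with five papers cited 10, 8, 5, 4 and 3 times, this gives an h-index of 4 and a g-index of 5. Both numbers only count papers and citations, which is exactly why they say nothing about tools, case studies or blog posts.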

Because publications are the primary (or sometimes the only) measurement unit of research, many researchers primarily work to improve these numbers. In my opinion, this is a bad thing.

As a consequence, many researchers spend most of their time aiming at a collection of academic conferences and journals. Each conference and journal has particular requirements, boundaries, traits and peculiarities, such as the allowed subjects, page limits, evaluation methods, stuff that you know will be well received by the Program Committee members and stuff that won't. One of the "tactics" researchers use to get their paper accepted is 'identify the champion', which means that you look at the list of Program Committee members and write your paper to please at least one of them, so that they will probably vote in favor of accepting your paper.

Sometimes, I get the impression that doing research this way looks like a darts game, in which you keep aiming at a fixed number of sections, and keep modifying your arrows until you hit the right score.

If you look at the definition of 'research in engineering' I have given earlier, writing publications and improving publication records is not the only thing that needs to be done. For example, sometimes also "uncommon" aspects have to be investigated, which do not necessarily produce great results, but are nonetheless worth knowing.

Furthermore, new software engineering concepts typically result in tools which have to be developed. Eventually, ideas developed in the academic world have to reach a broader audience and I think tools are the primary way to achieve that goal.

A lot of researchers refrain from taking these steps, because "The System" forces them to. I have heard some people saying: "You shouldn't spend so much time on development. Just develop a prototype and then move on to your next goal!". As a consequence, lots of papers tend to become "forgotten knowledge" that the rest of the software engineering community doesn't care about, and perhaps somebody reinvents the wheel some day, but implements it in a much crappier way.

Implications of "The System" on the industry fraction

In industry, quite often developers are seen as "code production units" and they have to be used as efficiently as possible. In order to be as efficient as possible, it is desired to reduce as many costs as possible and employees should focus on the primary goals of the company as much as possible. Some of the 'solutions' companies implement are to outsource labour to countries in which salaries are lower, to delegate certain tasks to other companies, or to rely completely on an external vendor to provide a solution.

In my opinion this is a bad thing for the following reasons (I have taken these arguments from my colleague Rini van Solingen's video log: vlog episode 2 (in Dutch) ):

Developing software is not necessarily a process that merely costs some amount of money. Software development is an investment and also gives benefits. The benefits are often overlooked by a lot of managers. Reducing costs, e.g. by hiring cheaper employees with less knowledge, may typically result in fewer benefits.
In order to reach your primary goals, you also have to reach secondary goals. For example, banks are not really software companies, but they have to become software companies because their organization depends completely on software.

I think the same is true for software companies doing software engineering. People developing financial software, may not want to think about build management complexity issues, but they have to, because otherwise it is impossible to properly engineer systems for end-users.

Many companies have the tendency to delegate secondary problems as much as possible, with the assumption that others can do it better, cheaper and more efficiently. In my opinion this is not always true. Sometimes secondary problems are so specific to a particular industry that there is no general solution provided by an external vendor or by somebody else willing to provide a solution for this.

In such cases, you have to solve these secondary problems yourself, but many organizations refrain from doing so. They decide to keep living with the burden and prefer to be inefficient. I have worked for several companies (which I'm not going to mention here) and I can speak from experience. I had several unconventional ideas in mind back then, and the only thing I was doing was fighting resistance, while I could have already solved many secondary problems.

Another trait I have frequently encountered is that some companies are afraid to participate in other communities, because of the potential advantages their competitors could get. I think for most secondary goals this is not really an issue.

Implications of "The System" on community projects

As far as I can tell, there is no "System" for community projects, as these projects are typically not bound to deadlines or formal assessment procedures. Community projects basically have to keep themselves and the community as a whole happy. Furthermore, these projects are not composed of members of a single organization, but of various individuals from all over the world. However, community projects also have a few peculiarities that I'd like to mention.

Quite often, because the developers are also the users of a particular application, it is difficult for non-technical users to get involved and have their problems solved. Sometimes applications delivered by community projects are seen as very unfriendly by non-technical users.

Also, there are "social issues" in community projects, such as developers who immediately feel offended when they receive criticism (either about them personally or about the project in general) and developers pissing off non-technical users by claiming that they are stupid and don't need a particular feature. The Linux Haters Blog often elaborates on this issue. Another famous phenomenon is "bikeshedding", in which big discussions are held over relatively minor problems, while important bigger problems are overlooked.

Discussion

"The System" of each of these fractions has arisen with the intention of improving that fraction, but personally I think that these "Systems" actually conflict with each other and make things worse, not better.

In principle, the academic fraction (which investigates software engineering techniques) solves secondary problems for the industry fraction. But in order to completely "solve" a secondary problem, usable tools have to be developed, which academic people refrain from doing because it is a waste of time according to their "System".

Second, industry has to focus on its primary goal (because its "System" requires that) and doesn't want to spend too much effort on secondary goals, such as working together with academics to successfully apply research or getting involved with community projects to share knowledge.

It almost looks a bit like a prisoner's dilemma to me. Collaboration between these fractions obviously requires several small sacrifices, but it also offers all parties benefits. I also see community projects as a good means of collaboration between academia and industry (and anyone else who is interested). Although this is obvious, all fractions stick to their "System" and, as a consequence, they diverge from each other and don't benefit from each other at all.

I have a few concrete examples of this:

The Dutch government, as well as industry in the Netherlands, spends a relatively small amount of money on research, while they want our country to be a 'knowledge economy'. It's actually the least of all countries in the European Union. The government is planning to reduce these investments even more. They expect that companies invest more in research, but in my experience, apart from a few notable exceptions, most companies are reluctant or have no clue about what is going on in the research world. We have very good researchers in the Netherlands, but their work isn't applied that well in industry. In my opinion, that's a shame and a waste. (I have used this blog post as a reference (in Dutch))
As mentioned in an earlier blog post, industry participation at academic conferences is low. Industry people often have different interests than academic people. Nowadays, they rather attend technology related conferences. For industry people, the application and benefits of tools and technology are important, more so than a mathematical proof or an evaluation showing numbers that they don't understand.

I have encountered several boring presentations at academic conferences, showing lots of "Greek symbols" and all kinds of complicated things I didn't understand. I'm pretty sure that if attendees from industry had been present, they would have had no clue what the presenter was talking about and would quickly have lost interest.
Sometimes people reinvent the wheel, but in a crappier way, which they have to maintain themselves. In my research, for example, I have seen many custom build systems that have a significant maintenance burden. People usually stick to these suboptimal solutions for a long time, while there are many solutions available that are more convenient, more powerful and easier to maintain.

Recommendations

In order to address the problem of diverging fractions, I think all fractions have to cooperate better with each other (which is obvious of course) and let go of their "System" to some degree (or better: make sure that the "System" changes). I have a few recommendations:

I think "The System" of academic researchers shouldn't be merely about publications. Furthermore, it shouldn't be merely about this fixed collection of conferences/journals, each having their own "borders". Publishing stuff is not a bad thing in my opinion, as concepts must be properly explained, evaluated and peer-reviewed. But concepts are useless without any deliverables that can be applied.

"Playing darts" by only aiming at "sections on a dart board" is bad for research. For example, Edsger Dijkstra, one of the most famous computer scientists, published a lot, but most of his publications were his "EWDs": manuscripts which he wrote about any subject he wanted, whenever he wanted, without aiming at anything or keeping some kind of "System" satisfied.
As an academic software engineering researcher, you are typically investigating stuff for some kind of "audience" (e.g. developers or testers) and it may be related to a certain kind of technology (e.g. Java, Eclipse, Linux etc.). Therefore, I think it's also important to directly work with people from these fractions, address them regularly (e.g. visit them and participate in their technology-related conferences) and see whether you can make your work interesting for them.
It's also a good thing to make your tools available by some means. Perhaps joining a community project or starting your own community project can be a good thing.
Companies must know that developing software is an investment and that secondary goals have to be reached in order to achieve primary goals. Some secondary problems cannot be solved efficiently by third parties, as they are too specific to the domain of the company.
Companies should not be afraid of participating in the means that other fractions use to collaborate. In fact, they should be more eager to find out what other fractions have to offer them.

These recommendations may look challenging, but I think the barrier to build better "bridges" between software engineering fractions isn't that high. I have outlined a list of various means in the beginning of this blog post that may help you, and I think they are relatively easy to apply without any additional costs.

For example, besides publishing academic papers, I also have this blog and I use Twitter to regularly report about results, findings and other stuff. Furthermore, the tools we are developing are made available to everyone through a community project, called the Nix project.

Apart from academic conferences, I have also presented at an industrial conference as well as FOSDEM, the biggest Free and Open-Source conference in Europe. Actually, our work has been very well received there. Far better than any academic conference I have attended so far.

Why am I writing this?

With the work that I'm doing as a PhD student, I'm trying to address the complete software engineering community, not just a small subset. It would also be a shame if 4 years of hard work became forgotten knowledge that nobody cares about.

Furthermore, I think that for many reasons, building "bridges" between these software engineering fractions is essential and gives all fractions benefits. But the "Systems" of all these fractions (which are basically there to improve them) drive these fractions apart.

Besides publishing papers, I have spent a considerable amount of time on the development of tools, case studies and examples. For example, according to the COCOMO estimation method of ohloh.net, I have spent 2 man-years of effort on Disnix (and I'm the sole author). Furthermore, I have also developed several extensions to Disnix and I have made many contributions to other Nix projects. Apart from development, I'm also maintaining this blog in which, apart from my research, I report about several other technical issues and fun projects.

While all this work is very much appreciated by the people I work with and talk with, it is actually a waste of time according to "The System" of my fraction. If I had stuck to "The System", this blog would never have existed. Moreover, I would have never produced the following blog posts, because I don't know how to submit them to any academic conference or journal:

These blog posts are very useful and appreciated. Perhaps not for academic people, but certainly for the other fractions! Finally, these blog posts have attracted many more readers than all my academic papers combined.

P.S. If anyone knows how to "sell" this stuff "scientifically" and knows how to make a "dart" out of this which I can throw in a good "section", please let me know! I'd like to integrate this stuff in my PhD thesis, which is primarily about research! ;)

Conclusion

In this blog post, I have identified three fractions within the software engineering community. Each fraction has its distinct characteristics. The academic and industrial fractions each have a "System", which has arisen to improve the individual fraction, but which drives the fractions apart from each other. I have proposed a few recommendations to "bridge" these fractions, but in order to achieve that goal, they have to let go of their "System", which is not easy.

As a final point, I'd like to point out that I have used stereotypes to describe these fractions. These descriptions do not always accurately reflect what happens in the real world. In practice, not every academic researcher is completely focused on writing papers. I know many researchers besides me who write tools and make them publicly available. Actually, my supervisors encourage me to work on tooling and they appreciate the work I'm doing. Furthermore, many of my colleagues also have very good collaborations with other fractions, and often present at non-academic events, which I'm very happy about.

I also said that academic presentations are boring. While I have attended quite a number of boring presentations, I have also seen many good ones, which I liked very much. It's a shame that the rest of the software community does not know about these.

Furthermore, companies aren't necessarily completely driven by making profits. They also care about their customers to some degree and think about improvements in technology. And yes: there are collaborations between these fractions which sometimes produce good results.

But nonetheless, although the real world is a bit better than the "stereotype" world I have described, I still see a lot of room for improvement in "bridging" these fractions.