Thursday, December 21, 2023

On reading research papers and maintaining knowledge

Ten years ago, I obtained my PhD degree and made my (somewhat gradual) transition from academia to industry. Although I made this transition a long time ago, I still often get questions from people who are considering doing a PhD.

Most of the discussions that I typically have with such people are about writing -- I have already explained plenty about writing in the past, including a recommendation to start a blog so that writing becomes a habit. Having a blog allows you to break up your work into manageable pieces and build up an audience for your work.

Recently, I have been thoroughly reorganizing the files on my hard drive, a tedious task that I often do at the end of the year. This year, I have also been restructuring my private collection of research papers.

Reading research papers became a habit while working on my master's thesis and doing my PhD. Although I left academia a long time ago, I have retained the habit, even though the number of papers and articles that I read today is much lower than in my PhD days. I no longer need to study research works much, but I still use the habit to absorb existing knowledge and put things into context whenever I intend to do something new, for example for my blog posts or software projects.

In 2020, during the first year of the COVID pandemic, my interest in research papers increased somewhat, because I had to revise some of the implementations of algorithms in the Dynamic Disnix framework that were based on work done by other researchers. Fortunately, the ACM temporarily opened their entire digital library to the public for free, so I could get access to quite a number of interesting papers without having to pay.

In addition to writing, reading in academic research is also very important, for the following reasons:

  • To expand your knowledge about your research domain.
  • To put your own work into context. If you want to publish a paper about your work, having a cool idea is not enough -- you have to explain what your research contributes: what the innovation is. As a result, you need to relate to earlier work and (optionally) to studies that motivate the relevance of your work. Furthermore, you cannot simply take credit for work that has already been done by others. As a consequence, you need to investigate very carefully what is out there.
  • You may have to peer review papers for acceptance in conference proceedings and journals.

Reading papers is not an easy job -- it often takes me quite a bit of time and dedication to fully grasp a paper.

Moreover, when studying a research paper, you may also have to dive into related work (by examining a paper's references, and the references of these) to fully understand it. You may have to dive several levels deep to gain enough understanding, which is not a straightforward job.
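Following references several levels deep is essentially a depth-limited breadth-first walk over a citation graph. The following Python fragment is only a minimal sketch of that idea: the function name, the paper identifiers and the REFERENCES dictionary are made up for illustration, and in practice the citation data would have to be collected by hand from the reference lists of the papers.

```python
from collections import deque


def collect_references(start, references, max_depth):
    """Walk a citation graph breadth-first, at most max_depth levels deep.

    `references` maps a paper to the papers that it cites.
    Returns a dictionary that maps each discovered paper to its depth.
    """
    depth_of = {start: 0}
    queue = deque([start])

    while queue:
        paper = queue.popleft()
        if depth_of[paper] == max_depth:
            continue  # deep enough: do not follow this paper's references
        for cited in references.get(paper, []):
            if cited not in depth_of:  # skip papers that were already found
                depth_of[cited] = depth_of[paper] + 1
                queue.append(cited)

    return depth_of


# Hypothetical citation graph, for illustration only
REFERENCES = {
    "paper-a": ["paper-b", "paper-c"],
    "paper-b": ["paper-d"],
    "paper-c": ["paper-d", "paper-a"],
    "paper-d": ["paper-e"],
}

reading_list = collect_references("paper-a", REFERENCES, max_depth=2)
```

With a maximum depth of 2, the reading list contains paper-a up to and including paper-d, but not paper-e, which is three levels away. Recording which papers were already found also prevents papers that cite each other from being visited endlessly.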

In this blog post, I want to share my personal experiences with reading papers and maintaining knowledge.

My personal history with reading


I have an interesting relationship with reading. Already at a young age, I used to study programming books to expand my programming knowledge.

With only limited education and knowledge, I was very practically minded -- I would relentlessly study books and magazines to figure out how to get things done, but as soon as I figured out enough to get something done I stopped reading, something that I consider a huge drawback of the younger version of me.

For example, I still vividly remember how I used to program 2D side scroller games for the Commodore Amiga 500 using the AMOS BASIC programming language. I figured out many basic concepts by reading books, magazines and the help pages, such as how to load IFF/ILBM pictures as backgrounds, load ProTracker modules as background music, use blitter objects for moving actors, use a double buffer to smoothly draw graphics, implement side scrolling, respond to joystick input etc.

Although I managed to make somewhat playable games by figuring out these concepts, the games were often plagued by bugs and very slow performance. A big reason for these failures is that I stopped reading after mastering the basics.

For example, to improve performance, I should have disabled the autoback feature that automatically swaps the physical and logical screens on every drawing command, and done the screen swap manually after all drawing instructions were completed. I knew that using a double screen buffer would take graphics glitches away, but I never bothered to study the concepts behind it.

As I grew older and entered middle school, I became more critical of myself. For example, I learned that it is essential to properly cite where you get your knowledge from rather than "pretending" that you are pulling something out of your own hat. :)

Fast forwarding to my studies at the university: reading papers from the academic literature became something that I commonly had to do. For example, I still remember the real-time systems and software architecture courses.

The end goal of the former course was to write your own research paper about a subject in the real-time systems domain. In this course, I learned, in addition to real-time system concepts, how academic research works: writing a research paper is not just about writing down a cool idea (with references that you used as an inspiration), but you also need to put your work into context -- a research paper is typically based on work already done by others, and your paper typically serves as an ingredient that can be picked up by other researchers.

In the latter course, I had to read quite a few papers in the software architecture domain, write summaries and discuss my findings with other students. Here, I learned that reading papers is anything but a trivial job:

  • Papers are often densely written. As a result, I get overwhelmed with information and it requires quite a bit of energy from my side to consume all of it.
  • The formatting of many papers is not always helpful. Papers are typically written for print, not for reading from a screen. Also, the formatting of papers is not always good for displaying code fragments or diagrams.
  • There is often quite a bit of unexplained jargon in a paper. To get a better understanding you need to dive deeper into the literature, such as also studying the references of the papers or books that are related to the subject.
  • Authors sometimes use many multi-syllable words.
  • It is also not uncommon for authors to use logic and formulas to formalize concepts and mathematically prove their contributions. Although formalization helps to do this, reading formulas is often a tough job for me -- there is typically a huge load of information and Greek symbols. IMO, these symbols do not always make it easy to see which concepts they represent.
  • Authors often tend to elaborately stress the caveats of their contributions, making things hard to read.

Although I have read many papers in the last 16 years and have gotten better at it, reading remains a tough job for the above reasons.

In the final year of my master's, I had to do a literature survey before starting the work on my master's thesis. The first time I heard about this, I felt scared, because of my past experiences with reading papers.

Fortunately, my former supervisor, Eelco Visser, was very practically minded about the process -- he wanted us to first work on practical aspects of the research projects, such as WebDSL: a domain-specific language for developing web applications with a rich data model and related tools, such as Stratego/XT and the Nix package manager.

After mastering the practical concepts of these projects, doing a literature survey felt much easier -- instinctively, while using these tools in practice, I became more interested in learning about the concepts behind them. Many of their underlying concepts were described in research papers published by my colleagues in the same research department. While studying these papers, I also became more motivated to dive deeper into the academic literature by studying the papers' references and searching for related subjects in the digital libraries of the ACM, IEEE, USENIX, Springer, Elsevier etc.

During my PhD, reading research papers became even more important. In the first six months of my PhD, I had a very good start. I published a paper about an important aspect of my master's thesis: atomic upgrading of the static parts of a distributed system, and a paper about the overall objective of the research project that I was in. I have to admit that, despite having these papers instantly accepted, I still had the wrong mindset -- I was basically just "selling my cool ideas" and finding support in the academic literature, rather than critically studying what is out there.

For my third paper, which covers a new implementation of Disnix (the third major revision to be precise), I learned an important/hard lesson. The first version of the paper got badly rejected by the program committee, because of the "advertising cool ideas" mindset that I always used to have -- I failed to study the academic literature well enough to explain what the innovation of my paper is in comparison to other deployment solutions. As a consequence, I got some very harsh criticism from the reviewers.

Fortunately, they gave me good feedback. For example, I had to study papers from the Working Conference on Component Deployment. I addressed their criticisms and the revised paper got accepted. I learned what I had to do in the future -- it is a requirement to also study the academic literature well enough to explain what your contribution is and demonstrate its relevance.

This rejection also changed how I deal with research papers. Previously, after my work for a paper was done, I would typically discard the artifacts that I no longer needed, including the papers that I used as a reference. After this rejection, I learned that I need to build my own personal knowledge base so that for future work, I can always relate to the things that I have read previously.

Reading research papers


I have already explained that for various reasons, reading research papers is anything but an easy job. For some papers, in particular the ones in my former research domain: software deployment, I got better as I grew more familiar with the research domain.

Nonetheless, I still sometimes find reading papers challenging. For example, studying algorithmic papers is extremely hard IMO. In 2021, I had to revise my implementations of approximation solutions for the multi-way cut and graph coloring problems in the Dynamic Disnix framework. I had to re-read the corresponding papers. Because they were so hard to grasp, I wrote a blog post that explains how I practically applied them.

To fully grasp a paper, reading it a single time is often not enough. The algorithmic papers that I mentioned earlier, in particular, I had to read many times.

Interestingly enough, I learned that reading papers is also a subject of study. A couple of years ago I discovered a paper titled: "How to Read a Paper" that explains a strategy for reading research papers using a three-pass approach:

  • First pass: bird's eye view. Study the title, abstract, introduction, headings, conclusions. A single pass is often already enough to decide whether a paper is relevant to read or not.
  • Second pass: study in greater detail, but ignore details such as mathematical proofs.
  • Third pass: read everything in detail by attempting to virtually re-implement the paper.

After discovering this paper, I have also been using the three-pass approach. I have studied most of the papers in my collection in two passes, and some of them in detail in three passes.

Another thing that I discovered by accident is that to extensively study literature, a continuous approach works better for me (e.g. reserving certain timeslots in a week) than just reserving longer periods of time that consist of only reading papers.

Also, regularly discussing papers with your colleagues helps. During my PhD days, I did not do it that often (we had no formal "process" for it) but there were several good sessions, such as a program committee simulation organized by Arie van Deursen, head of our research group.

In this simulation, we organized a program committee meeting of the ICSE conference in which the members of the department represented program committee members. We discussed submitted papers and voted for acceptance or rejection. Moreover, we also had to leave the room if there was a conflict of interest.

I also learned that Edsger Dijkstra, a famous Dutch computer scientist, organized the ETAC (Eindhoven Tuesday Afternoon Club) and ATAC (Austin Tuesday Afternoon Club), in which reading and discussing research papers was, amongst other activities, a recurring activity.

Building up your personal knowledge base


As I have explained earlier, I used to throw away my downloaded papers when the work for a paper was done, but I changed that habit after that hard paper rejection.

There are many good reasons to keep and organize the papers that you have read, even if they do not seem to be directly relevant to your work:

  • As I have already explained, in addition to reading individual papers and writing your own research papers, you need to maintain your knowledge base so that you can put both into context.
  • It is not always easy to obtain papers. Many of them are behind a paywall. Without a subscription you cannot access them, so once you have obtained them it is better to think twice before you discard them. Fortunately, open access is becoming more common, but it still remains a challenge. Arie van Deursen has written a variety of blog posts about open access.
  • Although many papers are challenging to read, I also started to appreciate certain research papers.

My own personal paper collection has evolved in an interesting way. In the beginning, I just used to put any paper that I obtained into a single folder called: papers, until it grew large enough that I had to start classifying them.

Initially, there was a one-level folder structure, consisting of categories such as: deployment, operating systems, programming languages, DSL engineering etc. At some point, the content of some of these folders grew large enough and I introduced a second level directory structure.

For example, the sub folder for my former research domain: software deployment (the process that consists of all activities to make a software system available for use) contains the largest number of papers. Currently, I have collected 168 deployment papers that I have divided over the following sub categories:

  • Deployment models. Papers whose main contribution is a means to model various deployment aspects of a system, such as the structure of a system and deployment activities.
  • Deployment planning. Papers whose main contributions are algorithms that decide a suitable/optimal deployment architecture based on functional and non-functional requirements of a system.
  • Empirical studies. Papers containing empirical studies about deployment in practice.
  • Execution. Papers in which the main contribution is executing deployment activities. I have also sub categorized this folder into technology-specific solutions (e.g. a solution is specific to a programming language, such as Java or component technology, such as CORBA) and generic solutions.
  • Practice reports. Papers that report on the use of deployment technologies in practice.
  • Surveys. Papers that analyse the literature and draw conclusions from it.

A hierarchical directory structure is not perfect for organizing papers -- for many papers there is an overlap between multiple sub domains in the software engineering domain. For example, deployment may also be related to a certain component technology, in service of optimizing the architecture of a system, related to other configuration management activities (versioning, status accounting, monitoring etc.) or an ingredient in integration testing. If there is an overlap, I typically look at the strongest kind of contribution that the paper makes.

For example, in the deployment domain, Eelco Dolstra wrote a paper about maximal laziness, an important implementation aspect of the Nix expression language. The Nix package manager is a deployment solution, but the contribution of the paper is not deployment, but making the implementation of a purely functional DSL efficient. As a result, I have categorized the paper under DSL engineering rather than deployment.

The organization of my paper collection is always in motion. Sometimes I gain new insights causing me to adjust the classifications, or when a collection of papers for a sub domain grows, I may introduce a second-level classification.
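Deciding when a folder has grown large enough to deserve a second-level classification can be supported by a small script. The following Python fragment is a minimal sketch, not something I claim to use as-is: the function name, the assumption that all papers are stored as PDF files and the threshold value are made up for illustration.

```python
from pathlib import Path


def folders_to_split(root, threshold):
    """Report folders that directly contain more than `threshold` PDF files.

    Returns a dictionary that maps each such folder (as a path relative
    to `root`) to the number of PDF files stored directly inside it.
    """
    root = Path(root)
    # Consider the root folder itself and every folder below it
    folders = [root] + [path for path in root.rglob("*") if path.is_dir()]
    result = {}

    for folder in folders:
        # Only count PDF files stored directly in this folder,
        # not the ones in its sub folders
        count = sum(
            1
            for entry in folder.iterdir()
            if entry.is_file() and entry.suffix.lower() == ".pdf"
        )
        if count > threshold:
            result[folder.relative_to(root).as_posix()] = count

    return result
```

For example, calling folders_to_split("papers", 30) on a hypothetical papers folder lists every category that directly holds more than 30 papers. Whether such a folder should really be split remains a judgment call, because the number of papers alone says nothing about whether meaningful sub categories exist.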

Some practical tips to get familiar with a certain research subject


So what is my recommended way to get familiar with a certain research subject in the software engineering domain?

I would start by doing something practical first. In the software engineering research domain, the goal is often to develop or examine tools. Start by using these tools first and see if you can contribute to them from a practical point of view -- for example, by improving features, fixing bugs etc.

As soon as I have mastered the practical aspects, I typically already get motivated to dive into their underlying concepts by studying the papers that cover them. Then I will apply the three-pass reading strategy and eventually study the references of the papers to get a better understanding.

After my orientation phase has finished, the next thing I would typically look at is the conferences/venues that are directly related to the subject. For software deployment, for example, there used to be only one subject-related conference: the Working Conference On Component Deployment (that unfortunately was no longer organized after 2005). It is typically a good thing to have examined all the papers of the related conferences/venues, by at least using a first-pass approach.

Then a potential next step is to search for "early defining papers" in that research area. In my experience, many research papers are improving on concepts pioneered by these papers, so it is IMO a good thing to know where it all started.

For example, in the software deployment domain the paper: "A Characterization Framework for Software Deployment Technologies" is such an early defining paper, covering a deployment solution called "The Software Dock". The paper comes with a definition for the term: "software deployment" that is considered the canonical definition in academic research.

Alternatively, the paper: "Software Deployment, Past, Present and Future" is a more recent yet defining paper covering newer deployment technologies and also offers its own definition of the term software deployment.

For unknown reasons, I always seem to like early defining papers in various software engineering domains. These are some of my recommendations of early defining papers in other software engineering domains:


After studying all these kinds of papers, your knowledge level should already be decent enough to find your way to study the remaining papers that are out there.

Literature surveys


In addition to research papers that need to put themselves into context, extensive literature surveys can also be quite valuable to the research community. During my PhD, I learned that it is also possible to publish a paper about a literature survey.

For example, some of my former colleagues did an extensive and systematic literature survey in the dynamic analysis domain. In addition to the results, the authors also explain their methodology, which consists of searching on keywords, looking for appropriate conferences and journals and following the papers' references. From these results they have derived an attribute framework and classified all the papers into this attribute framework.

I have kept the paper as a reference for myself, because I like the methodology. I am not so interested in dynamic analysis or program comprehension from a research perspective.

Literature surveys also exist in my former research domain, such as a survey of deployment solutions for distributed systems.

Conclusions


In this blog post, I have shared my experiences with reading papers and maintaining knowledge. In research, both are quite important and you need to take them seriously.

Fortunately, during my PhD I have learned a lot. In summary, my recommendations are:

  • Archive your papers and build up a personal knowledge base
  • Start with something practical
  • Follow paper references
  • Study early defining papers
  • Find people to discuss with
  • Study continuously in small steps

Although I never did an extensive literature survey in the software deployment domain (it is not needed for submitting papers that contribute new techniques) I can probably even write a paper about software deployment literature myself. The only problem is that I am not quite up to date with work that has been published in the last few years, because I no longer have access to these digital libraries.

Moreover, I also need to find the time and energy to do it, if I really want to :)
