Sander van der Burg's blog: September 2019

I have written two data exchange libraries -- not so long ago, I have created libnixxml that can be used to work with XML data following the so-called NixXML convention, which is useful to facilitate integration with tools in the Nix ecosystem, while still having meaningful XML data formats that can be used independently.

Many years ago, I wrote libiff that makes it possible to parse Interchange File Format (IFF) files that use so-called "chunks" to structure and organize binary files.

The goal of these two data exchange libraries is not to only facilitate data interchange -- in addition, they have also been designed to assist the user in constructing domain models in the C (or C++) programming language.

With the term: domain model, I am basically referring to an organization of data structures, e.g. structs, classes and abstract data structures, such as hash tables, lists, maps and trees, that have a strong connection to a (well-defined) problem domain (expressible in a natural language). Deriving such an organization is an important ingredient in object oriented design, but not restricted to object orientation only.

In addition to implementing a domain model in a C or C++ application with an understandable mapping to the problem domain, I also typically want the implementation to provide one or more of the following non-functional and cross-functional properties:

The data integrity should be maintained as much as possible. It should be difficult, for example, to mutate properties of an object in such a way that they have representations that cannot be interpreted by the program. For example, if an object requires the presence of another object, then it should be difficult to construct objects that have dangling references.
We may want to read a representation of the domain model from an external source, such as a file, and construct a domain model from it. Because external sources cannot be trusted, we also want this process to be safe.
In addition to reading, we may also want to write a data representation of a domain model to an external source, such as a file or the standard output. We also want to write a file in such a way that it can be safely consumed again.
We may want want to check the integrity of the data model and have decent error reporting in case an inconsistency was found.
It should not take too much effort in maintaining and adjusting the implementation of a domain model.

To implement the above properties, I have slowly adopted a number of conventions that I will describe in this blog post.

In addition to C and C++, these conventions also have some relevance to other programming languages, although most "higher level" languages, such as Java and Python, already have many facilities in their standard APIs to implement the above properties, whereas in C and C++ this is mostly the implementer's responsibility and requires a developer to think more consciously.

Constructing objects

When constructing objects, there are two concerns that stand out for me the most -- first, when constructing an object, I want to make sure that they never have inconsistent properties. To facilitate that, the solution is probably obvious -- by creating a constructor function that takes all mandatory properties as parameters that uses these parameters to configure the object members accordingly.

My second concern is memory allocation -- in C and C++ objects (instances of a struct or class) can be allocated both on the stack or the heap. Each approach has their own advantages and disadvantages.

For example, working with stack memory is generally faster and data gets automatically discarded when the scope of a block terminates. A disadvantage is that sizes of the data structures must be known at compile time, and some platforms have a limit of how much data can be allocated on the stack.

Heap memory, can be dynamically allocated (e.g. the size of memory to be allocated does not need to know at compile time), but is slower to allocate, and it is the implementer's responsibility to free up the allocated data when it is no longer needed.

What I generally do is for simple data structures (that do not contain too many fields, or members referring to data structures that require heap memory), I provide an initializer function that can be used on an object that is allocated on the stack to initialize its members.

Whenever a data structure is more complex, i.e. when it has many fields or members that require heap memory, I will create a constructor function that allocates the right amount of heap memory in addition to initializing its members.

Destructing objects

When an object is constructed, it may typically have resources allocated that need to be freed up. An obvious resource is heap memory -- as described earlier, when heap memory was previously allocated (e.g. for the data structure itself, but also for some of its members), it, at a later point in time, also needs to be freed up. Not freeing up memory causes memory leaks eventually causing a program to run out of memory.

Another kind of resource -- that is IMO often overlooked -- are file descriptors. Whenever a file has been opened, it also needs to be explicitly closed to allow the operating system to assign it to another process. Some operating systems have a very limited amount of file descriptors that can be allocated resulting in problems if a program is running for longer periods of time.

To maintain consistency and keep an API understandable, I will always create a destructor function when a constructor function exists -- in some cases (in particular with objects that have no members that require heap memory), it is very tempting to just tell (or simply expect) the API consumer to call free() explicitly (because that is essentially the only thing that is required). To avoid confusion, I always define a destructor explicitly.

Parsing an object from external source

As suggested in the introduction, I (quite frequently) do not only want to construct an object from memory, but I want it to be constructed from a definition originating from an external resource, such as a file on disk. As a rule of thumb (for integrity and security reasons), external input cannot be trusted -- as a result, it needs to be reliably parsed and checked, for which the data interchange libraries I developed provide a solution.

There is a common pitfall that I have encountered quite frequently in the process of constructing an object -- I typically assign default values to primitive members (e.g. integers) and NULL pointers to members that have a pointer type. The most important reason why I want all member fields to be initialized is to prevent them from staying garbage leading to unpredictable results if they are used by accident. In C, C++ when using malloc() or new() memory is allocated, but not automatically cleared, to, for example, zero bytes.

By using NULL pointers, I can later check whether all mandatory properties have been set and raise an error if this is not case.

A really tricky case with NULL pointers are pointers referring to data structures that encapsulate data collections, such as arrays, lists or tables. In some cases, it is fine that the input file does not define any data elements. The result should be an empty data collection. However, following the strategy to assign a NULL pointer by default introduces a problem -- in locations where a data collection is expected, the program will typically crash caused by a segmentation fault, because the program attempts to dereference a NULL pointer.

When assigning NULL pointers, I will always ask myself the question what kind of meaning NULL has. If I cannot provide an explanation, then I will make sure that a value is initialized with some other value than NULL. In practice, this means when members are referring to data collections, I will construct an empty data collection instead of assigning a NULL pointer. For data elements (e.g. strings), assigning NULL pointers to check whether they have been set is fine.

Finally, I also have the habit to make it possible to read from any file descriptor. In UNIX and UNIX-like operating systems everything is a file, and a generic file descriptor interface makes it possible to consume data from any resource that exposes itself as a file, such as a network connection.

Serializing/exporting objects to an external resource

In addition to retrieving and parsing objects from external resources, it is often desirable to do the opposite as well: serializing/exporting objects to an external resource, such as a file on disk.

Data that is consumed from an external source cannot be trusted, but if the output is not generated properly, the output most likely cannot be trusted either and reliably be consumed again.

For example, when generating JSON data with strings, a string that contains a double quote: " needs to be properly escaped, which is very easily overlooked when using basic string manipulation operations. The data exchange libraries provide convenience functions to reliably print and escape values.

We may also want to pretty print the output, e.g. adding indention, so that it can also be read by humans. Typically I add facilities for pretty printing to the functions that generate output.

Similar to the assigning a NULL pointer "dilemma" for empty data collections, we also face the dilemma to print an empty data collection or no elements at all. Typically, I would pick the option to print an empty data structure instead of omitting it, but I have no hard requirements for either of these choices.

As with reading and parsing data from external sources, I also typically facilitate writing to file descriptors so that it is possible to write data to any kind of file, such as the standard output or a remote network resource.

Checking the integrity of objects

Generally, I use constructor functions or mutation functions to prevent breaking the integrity of objects, but it is not always fully possible to fully avoid problems, for example, while parsing data from external resources. In such scenarios, I also typically implement functionality that checks the integrity of an object.

One of the primary responsibilities of a checking function is to examine the validity of all data elements. For example, to check whether a mandatory field has been set (i.e. it is not NULL) and whether they have the right format.

In addition to checking validity of all data elements, I typically also recursively traverse the data structure members and check their validity. When an error has been encountered in an abstract data structure, I will typically indicate which element (e.g. the array index number, or hash table key) is the problem, so that it can be more easily diagnosed by the end user.

When all fields of an object have been considered valid, I may also want to check whether the object's relationships are valid. For example, an object should not have a dangling reference to a non-existent object, that could result in segmentation faults caused by dereferencing NULL pointers.

Instead of invoking a check function explicitly, it is also possible to make a check function an integral part of a parse or constructor function, but I prefer to keep a check function separate, for the following reasons:

We do not need to perform a check if we are certain that the operations that we carry out, changed in any data in the wrong way.
We may want to perform checks in various stages of program, such as after parsing, after construction or after certain critical updates.

Comparing objects

Another important concern is the ability to compare objects for equality and/or ordering. I also typically implement a comparison function for each data structure.

In theory, recursively comparing a structure of objects could become quite expensive, especially if there are many nested data structures with many data elements. As an optimization, it may be possible to maintain integrity hashes and only check values if these hashes change, but so far I have never had run into any situations in which performance is really a bottleneck.

Naming

When developing data structures and function, I also try to follow a consistent naming convention for data structures and functions. For example, I may want to use: create_<ds_name> for a function creating a data structure and delete_<ds_name> for a function deleting a data structure.

Furthermore, I try to give meaningful names to data structures that have a correspondence with the problem domain.

Modularity

Although not mandatory in C or C++, I also typically try to use one header and one implementation file per data structure and functions that are related to it -- similarly, I follow the same convention for abstract data structure usages.

My main motivation to do this is to keep things understandable -- a module with many responsibilities is typically more difficult to maintain and harder to read.

Furthermore, I try to make all functions that have no relevance to be exposed publicly static.

Discussion

The conventions described in this blog post work particularly well for my own projects -- I have been able to considerably improve the reliability and maintainability of my programs and the error reporting.

However, they are not guaranteed to be the "silver bullet" for all coding problems. Some limitations that I see are:

Finding a well-defined description of a domain and implementing a corresponding domain model sounds conceptually simple, but is typically much harder than expected. It typically takes me several iterations to get it (mostly) right.
The conventions only makes sense for programs/code areas that are primarily data driven. Workflows that are primarily computationally driven may often have different kinds of requirements, e.g. for performance reasons, and most likely require a different organization.
The conventions are not there to facilitate high performance (but also do not always necessarily work against it). For example, splitting up data structures and corresponding functions into modules, makes it impossible to apply certain compiler optimizations that are possible if code would not have been separated into sepearte compilation units. Integrity, security, and maintenance are properties I consider to have higher priority over performance.

Sander van der Burg's blog

Sunday, September 8, 2019

Some personal conventions for implementing domain models in C/C++ applications