Sander van der Burg's blog: June 2021

I have written many blog posts about software deployment and configuration management. For example, a couple of years ago, I have discussed a very basic configuration management process for small organizations, in which I explained that one of the worst things that could happen is that a machine breaks down and everything that it provides gets lost.

Fortunately, good configuration management practices and deployment tools (such as Nix) can help you to restore a machine's configuration with relative ease.

Another problem is managing a machine's data, which in many ways is even more important and complicated -- software packages can be typically obtained from a variety of sources, but data is typically unique (and therefore more valuable).

Even if a machine stays operational, the data that it stores can still be at risk -- it may get deleted by accident, or corrupted (for example, by the user, or a hardware problem).

It also does not matter whether a machine is used for business (for example, storing data for information systems) or personal use (for example, documents, pictures, and audio files). In both cases, data is valuable, and as a result, needs to be protected from loss and corruption.

In addition to recovery, the availability of data is often also very important -- many users (including me) typically own multiple devices (e.g. a desktop PC, laptop and phone) and typically want access to the same data from multiple places.

Because of the importance of data, I sometimes get questions from non-technical users that want to know how I manage my personal data (such as documents, images and audio files) and what tools I would recommend.

Similar to most computer users, I too have faced my own share of reliability problems -- of all the desktop computers I owned, I ended up with a completely broken hard drive three times, and a completely broken laptop once. Furthermore, I have also worked with all kinds of external media (e.g. floppy disks, CD-ROMs etc.) each having their own share of reliability problems.

To cope with data availability and loss, I came up with a custom script that I have been conveniently using to create backups and synchronize my data between the machines that I use.

In this blog post, I will explain how this script works.

About storage media

To cope with the potential loss of data, I have always made it a habit to transfer data to external media. I have worked with a variety of them, each having their advantages and disadvantages:

In the old days, I used floppy disks. Most people who are (at the time reading this blog post) in their early twenties or younger, may probably have no clue what I am talking about (for those people perhaps the 'Save icon' used in many desktop applications looks familiar).

Roughly 25 years ago, floppy disks were a common means to exchange data between computers.

Although they were common, they had many drawbacks. Probably the biggest drawback was their limited storage capacity -- I used to own 5.25 inch disks that (on PCs) were capable of storing ~360 KiB (if both sides are used), and the more sturdy 3.5 inch disks providing double density (720 KiB) and high density capacity (1.44 MiB).

Furthermore, floppy disks were also quite slow and could be easily damaged, for example, by toughing the magnetic surface.
When I switched from the Commodore Amiga to the PC, I also used tapes for a while in addition to floppy disks. They provided a substantial amount of storage capacity (~500 MiB in 1996). As of 2019 (and this probably still applies to today), tapes are still considered very cheap and reliable media for archival of data.

What I found impractical about tapes is that they are difficult to use as random access memory -- data on a tape is stored sequentially. As a consequence, it is typically very slow to find files or to "update" existing files. Typically, a backup tool needs to scan the tape from the beginning to the end or maintain a database with known storage locations.

Many of my personal files (such as documents) are regularly updated and older versions do not have to be retained. Instead, they should be removed to clear up storage space. With tapes this is very difficult to do.
When writable CD/DVDs became affordable, I used them as a backup media for a while. Similar to tapes, they also have substantial storage capacity. Furthermore, they are very fast and convenient to read.

A similar disadvantage is that they are not a very convenient medium for updating files. Although it is possible to write multi-sessions discs, in which files can be added, overwritten, or made invisible (essentially a "soft delete"), it remained inconvenient because you can not clear up the storage space that a deleted file used to occupy.

I also learned the hard way that writable discs (and in particular rewritable discs) are not very reliable for long term storage -- I have discarded many old writable discs (10 years or older) that can no longer be read.

Nowadays, I use a variety of USB storage devices (such as memory sticks, hard drives) as backup media. They are relatively cheap, fast, have more than enough storage capacity, and I can use them as random access memory -- it is no problem at all to update and delete data existing data.

To cope with the potential breakage of USB storage media, I always make sure that I have at least two copies of my important data.

About data availability

As already explained in the introduction, I have multiple devices for which I want the same data to be available. For example, on both my desktop PC and company laptop, I want to have access to my music and research papers collection.

A possible solution is to use a shared storage medium, such as a network drive. The advantage of this approach is that there is a single source of truth and I only need to maintain a single data collection -- when I add a new document it will immediately be available to both devices.

Although a network drive may be a possible solution, it is not a good fit for my use cases -- I typically use laptops for traveling. When I am not at home, I can no longer access my data stored on the network drive.

Another solution is to transfer all required files to the hard drive on my laptop. Doing a bulk transfer for the first time is typically not a big problem (in particular, if you use orthodox file managers), but keeping collections of files up-to-date between machines is in my experience quite tedious to do by hand.

Automating data synchronization

For both backing up and synchronizing files to other machines I need to regularly compare and update files in directories. In the former case, I need to sync data between local directories, and for the latter I need to sync data between directories on remote machines.

Each time I want make updates to my files, I want to inspect what has changed, and see which files require updating before actually doing it, so that I do not end up wasting time or risk modifying the wrong files.

Initially, I started to investigate how to implement a synchronization tool myself, but quite quickly I realized that there is already a tool available that is quite suitable for the job: rsync.

rsync is designed to efficiently transfer and synchronize files between drivers and machines across networks by comparing the modification times and sizes of files.

The only thing that I consider a drawback is that it is not fully optimized to conveniently automate my personal workflow -- to accomplish what I want, I need to memorize all the relevant rsync command-line options and run multiple command-line instructions.

To alleviate this problem, I have created a custom script, that evolved into a tool that I have named: gitlike-rsync.

Usage

gitlike-rsync is a tool that facilitates synchronisation of file collections between directories on local or remote machines using rsync and a workflow that is similar to managing Git projects.

Making backups

For example, if we have a data directory that we want to back up to another partition (for example, that refers to an external USB drive), we can open the directory:

$ cd /home/sander/Documents

and configure a destination directory, such as a directory on a backup drive (/media/MyBackupDrive/Documents):

$ gitlike-rsync destination-add /media/MyBackupDrive/Documents

By running the following command-line instruction, we can create a backup of the Documents folder:

$ gitlike-rsync push
sending incremental file list
.d..tp..... ./
>f+++++++++ bye.txt
>f+++++++++ hello.txt

sent 112 bytes  received 25 bytes  274.00 bytes/sec
total size is 10  speedup is 0.07 (DRY RUN)
Do you want to proceed (y/N)? y
sending incremental file list
.d..tp..... ./
>f+++++++++ bye.txt
              4 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=1/3)
>f+++++++++ hello.txt
              6 100%    5.86kB/s    0:00:00 (xfr#2, to-chk=0/3)

sent 202 bytes  received 57 bytes  518.00 bytes/sec
total size is 10  speedup is 0.04

The output above shows me the following:

When no additional command-line parameters have been provided, the script will first do a dry run and show the user what it intends to do. In the above example, it shows me that it wants to transfer the contents of the Documents folder that consists of only two files: hello.txt and bye.txt.
After providing my confirmation, the files in the destination directory will be updated -- the backup drive that is mounted on /media/MyBackupDrive.

I can conveniently make updates in my documents folder and update my backups.

For example, I can add a new file to the Documents folder named: greeting.txt, and run the push command again:

$ gitlike-rsync push
sending incremental file list
.d..t...... ./
>f+++++++++ greeting.txt

sent 129 bytes  received 22 bytes  302.00 bytes/sec
total size is 19  speedup is 0.13 (DRY RUN)
Do you want to proceed (y/N)? y
sending incremental file list
.d..t...... ./
>f+++++++++ greeting.txt
              9 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=1/4)

sent 182 bytes  received 38 bytes  440.00 bytes/sec
total size is 19  speedup is 0.09

In the above output, only the greeting.txt file is transferred to backup partition, leaving the other files untouched, because they have not changed.

Restoring files from a backup

In addition to the push command, gitlike-rsync also supports pull that can be used to sync data from the configured destination folders. The pull command can be used as a means to restore data from a backup partition.

For example, if I accidentally delete a file from the Documents folder:

$ rm hello.txt

and run the pull command:

$ gitlike-rsync pull
sending incremental file list
.d..t...... ./
>f+++++++++ hello.txt

sent 137 bytes  received 22 bytes  318.00 bytes/sec
total size is 19  speedup is 0.12 (DRY RUN)
Do you want to proceed (y/N)? y
sending incremental file list
.d..t...... ./
>f+++++++++ hello.txt
              6 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=0/4)

sent 183 bytes  received 38 bytes  442.00 bytes/sec
total size is 19  speedup is 0.09

the script is able to detect that hello.txt was removed and restore it from the backup partition.

Synchronizing files between machines in a network

In addition to local directories, that are useful for back ups, the gitlike-rsync script can also be used in a similar way to exchange files between machines, such as my desktop PC and office laptop.

With the following command-line instruction, I can automatically clone the Documents folder from my desktop PC to the Documents folder on my office laptop:

$ gitlike-rsync clone sander@desktop-pc:/home/sander/Documents

The above command connects to my desktop PC over SSH and retrieves the content of the Documents/ folder. It will also automatically configure the destination directory to synchronize with the Documents folder on the desktop PC.

When new documents have been added on the desktop PC, I just have to run the following command on my office laptop to update it:

$ gitlike-rsync pull

I can also modify the contents of the Documents folder on my office laptop and synchronize the changed files to my desktop PC with a push:

$ gitlike-rsync push

About versioning

As explained in the beginning of this blog post, in addition to the recovery of failing machines and equipment, another important reason to create backups is to protect yourself against accidental modifications.

Although gitlike-rsync can detect and display file changes, it does not do any versioning of any kind. This feature is deliberately left unimplemented, for very good reasons.

For most of my personal files (e.g. images, audio, video) I do not need any versioning. As soon as they are organized, they are not supposed to be changed.

However, for certain kinds of files I do need versioning, such as software development projects. Whenever I need versioning, my answer is very simple: I use the "ordinary" Git, even for projects that are private and not supposed to be shared on a public hosting service, such as GitHub.

As seasoned Git users may probably already know, you can turn any local directory into a Git repository, by running:

$ git init

The above command creates a local .git folder that tracks and stores changes locally.

When using a public hosting service, such as GitHub, and cloning a repository from GitHub, a remote: origin has been automatically configured to automatically push and pull changes to and from GitHub.

It is also possible to synchronize Git changes between arbitrary computers using a private SSH connection. I can, for example, configure a remote for a private repository, as follows:

$ git remote add origin sander@desktop-pc:/home/sander/Development/private-project

the above command configures the Git project that is stored in the /home/sander/Development/private-project directory on my desktop PC as a remote.

I can pull changes from the remote repository, by running:

$ git pull origin

and push locally stored changes, by running:

$ git push origin

As you may probably have already noticed, the above workflow is very similar to exchanging documents, shown earlier in this blog post.

What about backing up private Git repositories? To do this, I typically create tarballs of the Git project directories and sync them to my backup media with gitlike-rsync. The presence of the .git folder suffices to retain a project's history.

Conclusion

In this blog post, I have described gitlike-rsync, a simple opinionated wrapper script for exchanging files between local directories (for backups) and remote directories (for data exchange between machines).

As its name implies, it heavily builds on top of rsync for efficient data exchange, and the concepts of git as an inspiration for the workflow.

I have been conveniently using this script for over ten years, and it works extremely well for my own use cases and a variety of operating systems (Linux, Windows, macOS and FreeBSD).

My solution is obviously not rocket science -- my contribution is only the workflow automation. The "true credits" should go the developers of rsync and Git.

I also have to thank the COVID-19 crisis that allowed me to finally find the time to polish the script, document it and give it a name. In the Netherlands, as of today, there are still many restrictions, but the situation is slowly getting better.

Availability

I have added the gitlike-rsync script described in this blog post to my custom-scripts repository that can be obtained from my GitHub page.

Sander van der Burg's blog

Tuesday, June 1, 2021

An unconventional method for creating backups and exchanging files