The 7 Deadly Sins of Versioning (Part 2)

Fred Simon

In the first post of this series, we talked about the problems with SemVer. In this post, we move on to hash versioning.

Hash Versioning

We define hash versioning as creating a version that is based partly on the hash of a set of data (typically, the source files).

Hashing in Git

Git is built on hashing and computes a SHA-1 checksum for every commit made to a repository. The hash is a unique stamp that represents both the state of the code within Git and the history that led to it, including any merges. When a branch is created from a given commit and no new commit is added, the hash remains the same. Thus, a Git hash is not tied to a single branch: several branches can point to the same commit. Additionally, a merge generates a new checksum even if the final state of the source code is identical to that of the original branch. This can be avoided through the correct use of fast-forward merges and rebasing, but very few people have the level of mastery needed to manage Git with such precision.

Why is hashing used?

Non-pre-release SemVer is structured sequentially in a major.minor.patch number format. It presumes, for example, that version A.B.(C+1) is necessarily newer (and maybe better) than A.B.C and, similarly, that A.(B+1).C is a more feature-rich or advanced version than A.B.C. SemVer also allows non-release versions to be represented by appending a pre-release part to the version number (e.g., A.B.C-buildnumber). This can pose a significant problem in the age of continuous integration and continuous delivery (CI/CD). Not only might there be thousands of interim builds between releases, but development today is not sequential: it’s conducted in parallel, by different teams, working on different branches.
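The sequential assumption is easy to see in code. The following sketch compares release versions numerically, field by field, which is the ordering SemVer presumes. It handles only plain major.minor.patch strings (no pre-release part), and the function name parse_release is a hypothetical helper.

```python
def parse_release(version: str) -> tuple[int, int, int]:
    """Split 'major.minor.patch' into integers so comparison is numeric."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


versions = ["1.10.0", "1.2.10", "1.2.9", "2.0.0"]

# Numeric ordering: each field is compared as a number, left to right.
ordered = sorted(versions, key=parse_release)
print(ordered)  # ['1.2.9', '1.2.10', '1.10.0', '2.0.0']

# Plain string sorting gets it wrong ('1.10.0' sorts before '1.2.9').
print(sorted(versions))
```

This ordering only makes sense when one counter is incremented along one line of development, which is exactly what parallel branches break.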

In a continuous environment, SemVer alone is inadequate because a larger version number cannot be assumed to identify the build that contains a new and appropriately tested feature. When parallel builds are running or parallel branches are continuously being built, hash versioning is preferred: every parallel stream can generate its own hash automatically, with no need for a centralized counter to produce a sequential semantic version number.

So, how is hash versioning a deadly sin?

It isn’t in itself. The best practice is to use both SemVer and a hash, so the sin lies in generating and using hash versions improperly. The biggest problem arises when SemVer isn’t used at all.

It’s far better to replace the pre-release build number in the SemVer layout with the hash. This gives the best of both worlds: a human-readable part, which retains the good aspect of SemVer, and a machine-readable part, which gains the benefits of automation through hashing. The human part evolves much more slowly, which allows it to be managed by humans. Meanwhile, the machine part is produced rapidly and in parallel.
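As a rough illustration, the combined version could be assembled like this. The function name hash_version and the "g" prefix are assumptions made for this sketch (the prefix mirrors what git describe prints before an abbreviated hash). The prefix also sidesteps a SemVer rule: a pre-release identifier made only of digits must not have leading zeros, and an abbreviated hash such as 0123456 would otherwise break it.

```python
import re

# major.minor.patch with no leading zeros, as SemVer requires.
RELEASE = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")
# A lowercase hex commit hash, abbreviated or full.
COMMIT = re.compile(r"[0-9a-f]{7,40}")


def hash_version(release: str, commit: str, length: int = 7) -> str:
    """Combine a human-managed release number with a machine-made hash."""
    if not RELEASE.fullmatch(release):
        raise ValueError(f"not a major.minor.patch version: {release!r}")
    if not COMMIT.fullmatch(commit):
        raise ValueError(f"not a hex commit hash: {commit!r}")
    # The 'g' prefix keeps the identifier alphanumeric even for all-digit hashes.
    return f"{release}-g{commit[:length]}"


print(hash_version("1.4.2", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"))
# 1.4.2-g3b18e51
```

The release part changes only when a human decides it should, while the suffix changes with every build on every branch.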

By way of example, in our 2018 book, Liquid Software: How to Achieve Trusted Continuous Updates in the DevOps World, we discuss the application of liquid software in the automotive industry. To extend that discussion, let’s consider the manufacture of a specific car model intended for sale in a particular country. Source code (representing the car design) is built and packed into a binary (the car), which now has an identifier (i.e., the hash of the source code). The hash identifies all facets of production that require testing and validation. However, the developer neglects other variables that should also go into that hash even though they have no impact on the functionality, performance, quality, or safety of the automobile, such as color. The result is different binaries (cars in different colors) that all carry the same hash.

In practical software development terms, color can represent a different packing algorithm, a different pre-configuration default, or a special deployment setting. It can be one of several parameters that affect package content but not the testing and other validation processes to which the software is subjected. The lack of unique hashes to denote color variations generates unnecessary expenditures of time and money, because human beings must get involved in order fulfillment to make sure that auto dealerships receive the vehicles they need in the colors their customers want. The objective is to avoid repetitive, manual installation and configuration of software.

Absolution from sin

Hashing is at the heart of the Git architecture. The hash of the source code is generated automatically and is ready to use in the version of any binary file that’s built from a given Git state. The color dilemma arises when developers use only the Git hash as part of a file’s version number. The solution to our car color conundrum is either to add the color variable to the hashing function or to add the color variable to the source code.
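The first option, adding the variable to the hashing function, might look like the sketch below. The function name variant_hash, the use of SHA-256, and the JSON encoding of the parameters are all assumptions made for illustration. What matters is that the source hash and the build parameters are hashed together in a canonical form, so each "color" gets its own stable identifier.

```python
import hashlib
import json


def variant_hash(source_hash: str, params: dict[str, str]) -> str:
    """Hash the source hash together with the build parameters."""
    # Sorted keys and fixed separators give the same bytes for the same
    # parameters, regardless of the order they were supplied in.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    digest.update(source_hash.encode())
    digest.update(b"\0")  # separator between the two inputs
    digest.update(canonical.encode())
    return digest.hexdigest()


source = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"  # example Git hash
red = variant_hash(source, {"color": "red", "market": "FR"})
blue = variant_hash(source, {"color": "blue", "market": "FR"})

print(red[:12], blue[:12])
print(red != blue)  # same source, different variants, different hashes
```

Because the source hash is an input rather than something the parameters change, tests keyed on the source hash alone do not need to be rerun for every variant.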

If our theoretical auto-manufacturing developer doesn’t include color as part of the source code, Git cannot know to create a specific hash for that variation. However, if color were added to the code base, it would trigger unnecessary testing and validation processes, as this build variable is irrelevant to product operability or quality.

A liquid software pipeline can solve the “problem” of such inconsequential automobile variables needing to influence the hash without being part of the source code. After the first part of the pipeline has handled manufacturing, testing, and validation without color information, the pipeline screens for additional parameters and creates specific binaries for them. These binaries are then used to calculate the hashes that identify each variation. For liquid software, this is very important because versioning should always be handled by machines, and hash versioning is the perfect type of machine version.

Questions of order and prioritization remain, however. Even with the best hashing system in place, the hash itself won’t make clear which version is preferable to another. For example, with hashes constantly being generated as problems are detected and fixed, how can the developer know that a new hash is actually the best one to use? How can hash versions be sorted for quality? We might presume that the answer lies in chronology (i.e., the latest version is better than a prior version). However, hashes don’t contain chronological information.

Liquid software solves this problem through the deployment of a metadata server that’s part of a good CI/CD platform and version control system (VCS). The server is programmed with a set of filters and queries meant to identify desired, traceable parameters for a given situation. Similarly, because it’s very difficult to recreate a build and its environment from a hash version alone, an automatic link between a hash version and the software it identifies offers an efficient way to debug a service that has already been deployed to production.
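A toy version of such a query is sketched below. The record fields, the sample data, and the function name latest_good_build are all hypothetical, and a real metadata server would expose this through its own query interface rather than an in-memory list. The idea is that the ordering and quality information lives in metadata stored next to the hash version, not in the hash.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class BuildRecord:
    version: str          # human part plus hash, e.g. '1.4.2-g3b18e51'
    branch: str
    built_at: datetime    # chronology comes from metadata, not the hash
    tests_passed: bool


def latest_good_build(records: list[BuildRecord], branch: str) -> Optional[BuildRecord]:
    """Filter by branch and test status, then pick the most recent build."""
    candidates = [r for r in records if r.branch == branch and r.tests_passed]
    return max(candidates, key=lambda r: r.built_at, default=None)


def at(hour: int, minute: int) -> datetime:
    """Timestamp on an arbitrary example day."""
    return datetime(2024, 3, 1, hour, minute, tzinfo=timezone.utc)


# Made-up sample data for the demo.
RECORDS = [
    BuildRecord("1.4.2-g3b18e51", "main", at(9, 0), True),
    BuildRecord("1.4.2-g9fceb02", "main", at(11, 30), True),
    BuildRecord("1.4.2-g0a1b2c3", "main", at(12, 15), False),
    BuildRecord("1.4.2-ge69de29", "feature/paint", at(12, 40), True),
]

best = latest_good_build(RECORDS, "main")
print(best.version if best else "no passing build")
```

The same records also answer the debugging question: given a hash version running in production, the matching record leads back to the branch and build that produced it.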

Trains and platforms

When it comes to modern software development, SemVer forces everyone to work within the context of the same sequential train, which can be, well…a train wreck. In a continuous pipeline flow, it’s a senseless constraint whose usefulness is limited in scope and which creates bottlenecks in development processes. SemVer is a tool designed for human readability, which works against our need to let machines do the work of versioning.

With an exponential rise in binary creation and a need for speed in the creation and deployment of updates, automation is well served by Git, which is a tremendous advancement. Liquid software gives us a next-mile advantage. It builds on Git’s success by further refining the development, deployment, analysis, and prioritization steps through continuous processes that ensure the best hashes are always being used.

Upcoming posts in our Deadly Sins of Versioning series will address multiple packages for the same version and versioning in the code.
