A Technical Guide to Version Control System (VCS) Internals

Summary

In this article, we'll provide a technical comparison of some of the most historically significant Version Control Systems (VCS). We will discuss the following six VCS, grouped by generation (we plan to add others in the future):

  1. First Generation: SCCS and RCS

  2. Second Generation: CVS and Subversion (SVN)

  3. Third Generation: Git and Mercurial

The first generation VCS were intended to track changes for individual files, and checked-out files could only be edited locally by one user at a time. They were built on the assumption that all users would log into the same shared Unix host with their own accounts. The second generation VCS introduced networking, which led to centralized repositories that contained the 'official' versions of their projects. This was good progress, since it allowed multiple users to check out and work with the code at the same time, but they would all be committing back to the same central repository. Furthermore, network access was required to make commits. The third generation comprises the distributed VCS. In a distributed VCS, all copies of the repository are created equal - there is no central copy of the repository. This opens the path for commits, branches, and merges to be created locally without network access and pushed to other repositories as needed.

VCS Release History Timeline

For context, here is a timeline of the creation of these VCS tools:

Figure 1: Timeline of the Creation of Version Control Systems

SCCS - Source Code Control System - First Generation

Background

SCCS is considered to be one of the first successful VCS tools. It was developed by Marc Rochkind at Bell Labs in 1972. The original implementation ran on an IBM System/370, and SCCS was later rewritten in C for Unix. It was created to solve the problem of tracking source file revisions, and it made it significantly easier to track down the source of bugs introduced into a program. SCCS is worth understanding at a basic level because it is the seed of the set of modern VCS tools that are so important to developers today.

Architecture

Like most modern day VCS, SCCS has a set of commands that allow developers to work with versioning of their files. These commands are used to:

  1. Check in files to have their history tracked using SCCS
  2. Check out specific file revisions for review or compilation
  3. Check out specific file revisions for editing
  4. Check in new file revisions along with a comment explaining the changes
  5. Revert changes made in a checked out file
  6. Basic branching and merging of changes
  7. Provide a log of a file's revision history

A special type of file called an s-file or a history file is created when a file is added for tracking with SCCS. This file is named using the original file name prefixed with s. and is stored in a subdirectory called SCCS. So a file called test.txt would get a history file created in the ./SCCS/ directory with a name of s.test.txt. On creation, the history file contains the initial content of the original file as well as some metadata to assist with version tracking. Checksums are stored in the history files to verify that the content has not been tampered with. Unlike in the later generation VCS, as we will see, the history file content is not compressed or encoded.

Since the content of the original file is now stored in the history file, it can be retrieved into the working directory for review, compilation, or editing. Further changes made to the file such as line additions, modifications, and removals can be checked back into the history file, which increments its revision number.

Subsequent SCCS checkins store only the deltas, or changes, to a file as opposed to the entire file content each time. This keeps the history file from growing by a full copy of the file on every checkin. Each time a checkin is made, the delta is recorded in a structure known as a delta table inside the history file. As previously mentioned, the actual file content is more or less copied verbatim, with special control sequences marking the start and end of sections of added and removed content. Since SCCS history files don't use compression, they are typically larger than the file being tracked. SCCS uses a delta method known as interleaved deltas. This is beneficial because every revision is reconstructed in a single pass over the history file, so checkout time does not depend on how old the revision is - i.e. older revisions don't take longer to check out than newer revisions.
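To make the idea of interleaved deltas more concrete, here is a minimal Python sketch of a single-pass checkout. It is not SCCS's actual code. The tuple-based weave, the checkout function, and the assumption of a strictly linear history (revision N contains every delta with a serial number up to N) are simplifications made for illustration. The records loosely mirror the ^AI (insert), ^AD (delete), and ^AE (end) control lines in the sample history file further down.

```python
def checkout(weave, revision):
    """Rebuild one revision from a weave in a single pass.

    Assumes a linear history, so `revision` contains every delta whose
    serial number is <= revision (a simplification for this sketch).
    """
    open_blocks = {}  # serial number -> "I" or "D" for blocks not yet closed
    lines = []
    for kind, value in weave:
        if kind in ("I", "D"):
            open_blocks[value] = kind        # start of an insert/delete block
        elif kind == "E":
            del open_blocks[value]           # end of the block for that serial
        else:                                # "T": one line of file text
            inserted = all(s <= revision
                           for s, k in open_blocks.items() if k == "I")
            deleted = any(s <= revision
                          for s, k in open_blocks.items() if k == "D")
            if inserted and not deleted:
                lines.append(value)
    return lines


# Hypothetical weave: delta 1 adds a line, delta 2 adds a second line,
# and delta 3 replaces that second line.
weave = [
    ("I", 1), ("T", "Hi there"), ("E", 1),
    ("I", 2), ("D", 3), ("T", "This is a test of SCCS"), ("E", 2), ("E", 3),
    ("I", 3), ("T", "A test of SCCS"), ("E", 3),
]

for rev in (1, 2, 3):
    print(rev, checkout(weave, rev))
```

Every checkout walks the whole weave exactly once, whichever revision is requested, which is why old and new revisions cost about the same to retrieve.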

One important thing to note is that all files are tracked and checked in separately in SCCS. There is no way to check in changes to multiple files as part of one atomic unit - like a commit in Git. Each tracked file has a corresponding history file which stores its revision history. In general, this means that the version numbers of different files in a project will not usually match each other. However, matching revision numbers can be achieved by editing every file in the project at once (even if not all of the files have real changes) and checking them all in at one time. This will increment the revision number for all the files to keep them consistent, but note that this is NOT the same as including multiple files in a single commit like in Git. In SCCS, this makes an individual checkin in each history file, as opposed to one big commit including all the changes at once.

When a file is checked out for editing in SCCS, a lock is placed on the file so it cannot be edited by anyone else. This prevents changes from being overwritten by other users, but also limits development since only one user can work with a given file at a time.

SCCS has support for branches that can store sequences of changes within a specific file. Branches can be merged back in with the original versions or merged with other branched versions of the same parent.

Basic Commands

Below is a list of the most common SCCS commands.

sccs create <filename.ext>: Check in a new file to SCCS and create a new history file for it (in the ./SCCS/ directory by default).
sccs get <filename.ext>: Check out a file from its corresponding history file and place it in the working directory in read-only mode.
sccs edit <filename.ext>: Check out a file from the corresponding history file for editing. Locks the history file so no other users can modify it.
sccs delta <filename.ext>: Check in the modifications to the specified file. Will prompt for a comment, store the changes in the history file, and remove the lock.
sccs prt <filename.ext>: Display the revision log for a tracked file.
sccs diffs <filename.ext>: Display the differences between the current working copy of a file and the state of the file when it was checked out.

For more information on SCCS internals, see Eric Allman's guide and this Oracle guide on programming utilities.

Sample SCCS History File

^Ah20562
^As 00001/00001/00002
^Ad D 1.3 19/11/26 14:37:08 jack 3 2 
^Ac Here is a comment.
^Ae
^As 00002/00000/00001
^Ad D 1.2 19/11/26 14:36:00 jack 2 1 
^Ac No. 
^Ae
^As 00001/00000/00000
^Ad D 1.1 19/11/26 14:35:27 jack 1 0 
^Ac date and time created 19/11/26 14:35:27 by jack
^Ae
^Au
^AU
^Af e 0 
^At
^AT
^AI 1
Hi there
^AE 1
^AI 2
^AD 3
This is a test of SCCS
^AE 2
^AE 3
^AI 3
A test of SCCS
^AE 3

RCS - Revision Control System - First Generation

Background

RCS was written in C by Walter Tichy in 1982 as an alternative to SCCS, which wasn't open source at the time.

Architecture

RCS shares many traits with its predecessor, including:

  • Handling revisions on a file-by-file basis
  • Changes across multiple files can't be grouped together into an atomic commit
  • Tracked files are intended to be modified by one user at a time
  • No network functionality
  • Revisions for each tracked file are stored in a corresponding history file
  • Basic branching and merging of revisions within individual files.

When a file is checked into RCS for the first time, a corresponding history file is created for it in the local ./RCS/ directory. This file is suffixed with ,v so a file named test.txt would be tracked by a file called test.txt,v.

RCS uses a reverse-delta scheme for storing file changes. When a file is checked in, a full snapshot of the file's content is stored in the history file. When the file is modified and checked in again, a delta is calculated against the existing history file content. The old snapshot is discarded and the new one is saved, along with the delta needed to get back to the older state. This is called reverse-delta because, to check out an older revision, RCS starts with the newest version of the file and applies consecutive deltas until the older revision is reached. This method allows for very quick checkouts of current revisions since the full snapshot of the current revision is always available. However, the older the checked-out revision, the longer the checkout takes, since an increasing number of deltas need to be applied to the current snapshot.
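The reverse-delta step can be sketched in a few lines of Python. The sketch assumes the edit-script notation that appears in the sample RCS history file below: dN M deletes M lines starting at line N, and aN M appends the following M lines after line N, with line numbers referring to the text the delta is applied to. The function name apply_edit_script and the sample inputs are made up for illustration. This is not RCS's own implementation.

```python
def apply_edit_script(base, script):
    """Apply an RCS-style edit script to `base` (a list of lines).

    `script` is a list of lines: commands such as "d1 3" or "a3 1",
    where each "a" command is followed by the lines it inserts.
    Commands are expected in increasing line order.
    """
    out = []
    pos = 0  # how many lines of `base` have been consumed so far
    i = 0
    while i < len(script):
        command = script[i]
        i += 1
        start, count = (int(n) for n in command[1:].split())
        if command[0] == "d":
            out.extend(base[pos:start - 1])   # keep lines before the deletion
            pos = start - 1 + count           # skip the deleted lines
        elif command[0] == "a":
            out.extend(base[pos:start])       # keep lines up to the insert point
            pos = max(pos, start)
            out.extend(script[i:i + count])   # add the new lines
            i += count
        else:
            raise ValueError(f"unknown command: {command!r}")
    out.extend(base[pos:])                    # keep whatever is left
    return out


# Newest revision is stored in full; the delta leads back to the older one.
newest = ["hi there, you are my bud.", "You are so cool!", "The end."]
reverse_delta = ["d1 3", "a3 1", "hi there"]
print(apply_edit_script(newest, reverse_delta))
```

Checking out a revision that is k steps behind the head means running a function like this k times in a row, which is why older revisions take longer to retrieve.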

This is not the case with SCCS which takes the same amount of time to fetch any revision. In addition, no checksum is stored in RCS history files so file integrity cannot be ensured.

Basic Commands

Below is a list of the most common RCS commands:

ci <filename.ext>: Check in a new file to RCS and create a new history file for it (in the ./RCS/ directory by default).
co <filename.ext>: Check out a file from its corresponding history file and place it in the working directory in read-only mode.
co -l <filename.ext>: Check out a file from the corresponding history file for editing. Locks the history file so no other users can modify it.
ci <filename.ext>: Check in file changes and create a new revision for it in its corresponding history file.
merge <file-to-merge-into.ext> <parent.ext> <file-to-merge-from.ext>: Merge changes from two modified children of the same parent file.
rcsdiff <filename.ext>: Display the differences between the current working copy of a file and the state of the file when it was checked out.
rcsclean: Removes working files that don't have locks.

For more information on RCS internals, see the GNU RCS manual.

Sample RCS History File

head    1.2;
access;
symbols;
locks; strict;
comment @# @;

1.2 date 2019.11.25.05.51.55; author jstopak; state Exp; branches; next 1.1;
1.1 date 2019.11.25.05.49.02; author jstopak; state Exp; branches; next ;

desc @This is a test. @

1.2
log
@Edited the file.
@
text
@hi there, you are my bud.
You are so cool!
The end.
@

1.1
log
@Initial revision
@
text
@d1 5
a5 1
hi there
@

CVS - Concurrent Versions System - Second Generation

Background

CVS was created by Dick Grune in 1986 with the goal of adding a networking element to version control. Grune's original version was a collection of shell scripts, and it was later rewritten in C. CVS kicked off the second generation of VCS tools which allowed geographically dispersed development teams to work on projects together.

Architecture

CVS is a frontend for RCS - it provides a set of commands for interacting with files in a project, but uses the RCS history file format and commands behind the scenes. For the first time in VCS history, CVS allowed multiple developers to check out and work on the same files simultaneously. It did this by using a centralized repository model. The first step is to set up a centralized repository on a remote server using CVS. Projects can then be imported into the repository. When a project is imported into CVS, each file is converted into a ,v history file and stored in a central directory known as a module. The repository generally lives on a remote server which is accessible over a local network or the Internet.

A developer checks out a copy of the module, which is copied to a working directory on their local machine. No files are locked in this process, so there is no limit to the number of developers that can check out the module at one time. Developers can modify their checked-out files and commit their changes as needed. If a developer commits a change, other developers will need to update their working copies via a (usually) automated merge process before committing their changes. Occasionally merge conflicts will need to be manually resolved before the commit can be made. CVS also provides the ability to create and merge branches.
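The "update before you commit" rule can be illustrated with a small Python model. It is only a sketch of the workflow described above. The CentralRepository class, its method names, and the StaleWorkingCopy exception are hypothetical and do not correspond to anything in CVS itself.

```python
class StaleWorkingCopy(Exception):
    """Raised when a commit is based on a revision that is no longer the head."""


class CentralRepository:
    def __init__(self):
        self.history = {}  # file name -> list of contents, one per revision

    def head(self, name):
        """Return the newest revision number of a file (0 if it has none)."""
        return len(self.history.get(name, []))

    def checkout(self, name):
        """Return (revision, content) of the newest revision; takes no lock."""
        rev = self.head(name)
        return rev, self.history[name][rev - 1]

    def commit(self, name, base_rev, content):
        """Accept a commit only if it was made against the current head."""
        if base_rev != self.head(name):
            raise StaleWorkingCopy(f"{name}: update from r{base_rev} first")
        self.history.setdefault(name, []).append(content)
        return self.head(name)


repo = CentralRepository()
repo.commit("hi.txt", 0, "hi there\n")            # initial import, revision 1
alice_rev, _ = repo.checkout("hi.txt")            # both developers check out
bob_rev, _ = repo.checkout("hi.txt")              # revision 1 concurrently
repo.commit("hi.txt", alice_rev, "hi, Alice\n")   # Alice commits revision 2
try:
    repo.commit("hi.txt", bob_rev, "hi, Bob\n")   # Bob is now out of date
except StaleWorkingCopy as err:
    print("rejected:", err)
```

In this model, Bob would check out again (the equivalent of an update and merge) and then commit against revision 2.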

Basic Commands

export CVSROOT=<path/to/repository>: Sets the CVS repository root directory so it doesn't need to be specified in each command.
cvs import -m 'Import module' <module-name> <vendor-tag> <release-tag>: Import a directory of files into a CVS module. Before running this, change into the root directory of the project you want to import.
cvs checkout <module-name>: Copy a module to the working directory.
cvs commit <filename.ext>: Commit a changed file back to the module in the central repository.
cvs add <filename.txt>: Add a new file to track revisions for.
cvs update: Update the working copy by merging in committed changes that exist in the central repository but not the working copy.
cvs status: Show general information about the checked out working copy of a module.
cvs tag <tag-name> <files>: Add an identifying tag to a single file or set of files.
cvs tag -b <new-branch-name>: Create a new branch in the repository (must be checked out before working on it locally).
cvs checkout -r <branch-name>: Check out an existing branch to the working directory.
cvs update -j <branch-to-merge>: Merge an existing branch into the local working copy.

For more information on CVS internals, see the GNU CVS manual and Dick Grune's article.

Sample CVS History File

head     1.1;
branch   1.1.1;
access   ;   
symbols  start:1.1.1.1 jack:1.1.1;
locks    ; strict;
comment  @# @;

1.1 date 2019.11.26.18.45.07; author jstopak; state Exp; branches 1.1.1.1; next ; commitid zsEBhVyPc4lonoMB;
1.1.1.1 date 2019.11.26.18.45.07; author jstopak; state Exp; branches ; next ; commitid zsEBhVyPc4lonoMB;

desc @@


1.1
log
@Initial revision
@
text
@hi there
@

1.1.1.1
log
@Imported sources
@
text
@@

SVN - Subversion - Second Generation

Background

Subversion was created in 2000 by CollabNet, Inc. and is now maintained by the Apache Software Foundation. It is written in C and was designed to be a more robust centralized solution than CVS.

Architecture

Like CVS, Subversion uses a centralized repository model. Remote users must have a working network connection to commit their changes to the central repository.

Subversion introduced the functionality of atomic commits which ensured that a commit would either fully succeed, or be completely abandoned if an issue occurred. In CVS, if a commit operation failed midway, for example due to a network outage, the repository could be left in a corrupted and inconsistent state. Furthermore, a commit or revision in Subversion can include multiple files and directories. This is important since it allows users to track sets of related changes together as a grouped unit, instead of the past storage models that track changes separately for each file.
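One common way to get this all-or-nothing behavior is to write the whole revision somewhere private and then publish it in a single atomic step. The Python sketch below shows that general technique. It is not Subversion's actual implementation. The commit_atomically function, the JSON encoding, and the txn- prefix are assumptions made for illustration, and the sequentially numbered files in a revs directory only loosely mimic the layout described later in this section.

```python
import json
import os
import tempfile


def commit_atomically(repo_dir, changes):
    """Publish {path: content} as the next numbered revision, all or nothing."""
    revs = os.path.join(repo_dir, "revs")
    os.makedirs(revs, exist_ok=True)
    numbers = (int(name) for name in os.listdir(revs) if name.isdigit())
    next_rev = 1 + max(numbers, default=0)

    # Write the whole revision to a private temporary file first.
    fd, tmp_path = tempfile.mkstemp(dir=repo_dir, prefix="txn-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(changes, tmp)  # may fail partway through
        # One rename publishes every file change in the commit together.
        os.replace(tmp_path, os.path.join(revs, str(next_rev)))
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)      # abandon the half-written transaction
        raise
    return next_rev


with tempfile.TemporaryDirectory() as repo:
    rev = commit_atomically(repo, {"trunk/hi.txt": "hi there\n",
                                   "trunk/hey.txt": "hey\n"})
    print("committed revision", rev)
```

If anything goes wrong before the rename, the revs directory is left untouched, so readers never see a partially written revision.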

The current storage model that Subversion uses for tracked files is called FSFS or File System atop the File System. This name was chosen since it creates its database structure using a file and directory structure that matches the operating system filesystem it is running on. The unique feature of the Subversion filesystem is that it is designed to track not only the files and the directories it contains, but also the different versions of these files and directories as they change over time. It is a filesystem with an added time dimension. In addition, folders are first class citizens in Subversion. Empty folders can be committed in Subversion, whereas the other systems covered here (even Git) do not track empty folders.

When a Subversion repository is created, a (nearly) empty database of files and folders is created as a part of it. A directory called db/revs is created in which all revision tracking information for the checked-in (committed) files is stored. Each commit (which can include changes to multiple files) is stored in a new file in the revs directory and is named with a sequential numeric identifier starting with 1. When a file is committed for the first time, its full content is stored. Future commits of the same file will store only the changes - also called the diffs or deltas - in order to conserve space. In addition, the deltas are compressed using lz4 or zlib compression algorithms to further reduce their size.

This is only true up to a point, however. Although storing file deltas instead of the whole file each time does save on storage space, it adds time to checkout and commit operations since all the deltas need to be strung together in order to recreate the current state of the file. For this reason, by default Subversion stores up to 1023 deltas per file before storing a new full copy of the file. This achieves a nice balance of both storage and speed.

SVN does not use a conventional branching and tagging system. A normal Subversion repository layout is to have three folders in the root:

  • trunk/
  • branches/
  • tags/

The trunk/ folder is used for the production version of the application. The branches/ folder is used to store subfolders that correspond to individual branches. The tags/ folder is used to store tags which represent specific (usually significant) project revisions.

Basic Commands

svnadmin create <path-to-repository>: Create a new, empty repository shell in the specified directory.
svn import <path-to-project> <svn-url>: Import a directory of files into the specified Subversion repository path.
svn checkout <svn-path> <path-to-checkout>: Copy a stored repository path to the desired working directory.
svn commit -m 'Commit message': Commit a set of changed files and folders along with a descriptive commit message.
svn add <filename.txt>: Add a new file to track revisions for.
svn update: Update the working copy by merging in committed changes that exist in the central repository but not the working copy.
svn status: Show a list of tracked files that have been changed in the working directory (if any).
svn info: Show a list of general details about the checked-out copy.
svn copy <branch-to-copy> <new-branch-path-and-name>: Create a new branch by copying an existing one.
svn switch <existing-branch>: Switch the working directory to an existing branch. This will check out the specified branch.
svn merge <existing-branch>: Merge the specified branch into the current branch checked out in the working directory. Note this needs to be committed afterwards.
svn log: Show the commit history and associated descriptive messages for the active branch.

For more information on SVN internals, see the Version Control with Subversion book.

Sample SVN Revision File

DELTA
SVN^B^@^@   ^B  
^A<89>  hi there
ENDREP
id: 2-1.0.r1/4
type: file
count: 0
text: 1 3 21 9 12f6bb1941df66b8f138a446d4e8670c 279d9035886d4c0427549863c4c2101e4a63e041 0-0/_4
cpath: /trunk/hi.txt
copyroot: 0 / 
DELTA
SVN^B^@^@$^B%^A¤$K 6
hi.txt
V 15
file 2-1.0.r1/4
END
ENDREP
id: 0-1.0.r1/6
type: dir
count: 0
text: 1 5 48 36 d84cb1c29105ee7739f3e834178e6345 - -
cpath: /trunk
copyroot: 0 /
DELTA
SVN^B^@^@'^B#^A¢'K 5
trunk
V 14
dir 0-1.0.r1/6
END
ENDREP
id: 0.0.r1/2
type: dir
pred: 0.0.r0/2
count: 1
text: 1 7 46 34 1d30e888ec9e633100992b752c2ff4c2 - -
cpath: /
copyroot: 0 /
_0.0.t0-0 add-dir false false false /trunk
_2.0.t0-0 add-file true false false /trunk/hi.txt

L2P-INDEX ^A<80>@^A^A^A^M^H^@ä^H÷^Aé^FDÎ^Bzè^AP2L-INDEX ^A<91>^E<80><80>@^A?^@'2^@<8d>»Ý<90>^C§^A^X^@õ ½½^N= ^@ü<8d>Ôã^Ft^V^@<92><9a><89>Ã^E; ^@<8a>®åw|I^@<88><83>Î<93>^L`^M^@ù­<92>À^Eïú?^[^@^@657 6aad60ec758d121d5181ea4b81a9f5f4 688 75f59082c8b5ab687ae87708432ca406I

Git - Third Generation

Background

Git was created in 2005 by Linus Torvalds (also the creator of Linux) and is written primarily in C combined with some shell scripts. It is widely valued for its features, flexibility, and speed. Linus Torvalds originally wrote it for the Linux codebase and it has grown to become the most popular VCS in use today.

Architecture

Git is a distributed VCS. This means that no copy of the repository needs to be designated as the centralized copy - all copies are created equal. This is in stark contrast to the second generation VCS, which rely on a centralized copy for users to check in to and check out from. What this means is that developers can share changes with each other directly before merging their changes into an official branch.

Furthermore, developers can commit their changes to their local copy of the repository without any other repositories knowing about it. This means that commits can be made without any network or Internet connection. Developers can work locally offline until they are ready to share their work with others. At that point, the changes can be pushed to other repositories for review, testing, or deployment.

When a file is added for tracking with Git, Git prepends a short header (the object type and the content size) to the file content and hashes the result using the SHA-1 hash function. This yields a hash value that corresponds specifically to the content of that file. The header and content are then compressed using the zlib compression algorithm and stored in an object database located in the hidden .git/objects folder. The object is filed under its hash value: the first two hexadecimal characters name a subdirectory and the remaining 38 name the file, which contains the compressed content. These objects are called blobs, and one is created each time a new file (or a changed version of an existing file) is added to the repository.
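The following Python sketch shows how a blob could be hashed and prepared for storage by hand. It follows Git's loose-object format for SHA-1 repositories as we understand it: the object ID is the SHA-1 of a "blob <size>" header, a NUL byte, and the uncompressed content, and the zlib-compressed bytes are what end up on disk. The helper names git_blob_id and loose_object are made up for this example, and nothing is written to a real repository.

```python
import hashlib
import zlib


def git_blob_id(data: bytes) -> str:
    """Return the object ID Git would assign to a blob with this content."""
    header = f"blob {len(data)}".encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def loose_object(data: bytes):
    """Return (relative path, compressed bytes) for storing the blob loosely."""
    object_id = git_blob_id(data)
    header = f"blob {len(data)}".encode() + b"\0"
    path = f".git/objects/{object_id[:2]}/{object_id[2:]}"
    return path, zlib.compress(header + data)


path, stored = loose_object(b"hi there\n")
print(path)
print(len(stored), "bytes after compression")
```

Running git hash-object on a file with the same content is an easy way to compare the result against a real Git installation.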

Git implements a staging index which acts as an intermediate area for changes that are getting ready to be committed. As new changes are staged for commit, their blobs are referenced in a special index file, which lists each staged path along with its blob hash and file mode. A tree is a Git object that connects blob objects to their real file names and file permissions and links to other trees, and in this way represents the state of a particular set of files and directories. Once all related changes are staged for commit, Git writes tree objects from the index and creates a commit object in the Git object database. A commit references the root tree for a particular revision as well as the commit author, email address, date, and a descriptive commit message. Each commit also stores a reference to its parent commit(s), and so over time a history of project development is established.
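To show how blobs, trees, and commits reference each other, here is a rough Python sketch that builds the three object types in memory. It assumes Git's SHA-1 object format: a tree entry is the mode, a space, the file name, a NUL byte, and the 20 raw bytes of the referenced object ID, and a commit is a short block of text headers followed by a blank line and the message. The function names, the author string, and the flat one-file tree are illustrative assumptions, and real Git has extra sorting rules for subdirectories that are ignored here.

```python
import hashlib


def object_id(kind: str, payload: bytes) -> str:
    """Hash an object the way Git does: '<kind> <size>' + NUL + payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries):
    """Build a tree body from (mode, name, hex object ID) tuples.

    Entries are sorted by name, which is enough for a flat list of files.
    """
    return b"".join(
        f"{mode} {name}".encode() + b"\0" + bytes.fromhex(oid)
        for mode, name, oid in sorted(entries, key=lambda entry: entry[1])
    )


def commit_payload(tree_id, parents, author, timestamp, message):
    """Build a commit body; `author` is used for author and committer alike."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {parent}" for parent in parents]
    lines += [f"author {author} {timestamp}",
              f"committer {author} {timestamp}",
              "",
              message]
    return ("\n".join(lines) + "\n").encode()


blob_id = object_id("blob", b"hi there\n")
tree_id = object_id("tree", tree_payload([("100644", "hi.txt", blob_id)]))
commit_id = object_id("commit", commit_payload(
    tree_id, [], "A. Developer <dev@example.com>", "1574915303 -0800",
    "Initial commit"))
print(blob_id, tree_id, commit_id, sep="\n")
```

Because each object's ID depends on the IDs it references, changing one byte in a file changes the blob, every tree above it, and the commit.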

As mentioned, all Git objects - blobs, trees, and commits - are hashed, compressed, and stored in the object database based on their hash value. These are called loose objects. At this point no diffs have been utilized to save space, which makes Git very fast since the full content of each file revision is accessible as a loose object. However, certain operations such as pushing commits to a remote repository, storing too many objects, or manually running Git's garbage collection command can cause Git to repackage the objects into pack files. In the packing process, reverse diffs are taken and compressed to eliminate redundant content and reduce size. This process results in .pack files containing the object content, each with a corresponding .idx (or index) file containing a reference of the packed objects and their locations in the pack file.

These pack files are transferred over the network when branches are pushed to or pulled from remote repositories. When pulling or fetching branches, the received pack files may be unpacked into loose objects in the object database.

Basic Commands

git init: Initialize the current directory as a Git repository (creates the hidden .git folder and its contents).
git clone <git-url>: Download a copy of the Git repository at the specified URL.
git add <filename.ext>: Add an untracked file or changed file to the staging area (creates corresponding entries in the object database).
git commit -m 'Commit message': Commit a set of changed files and folders along with a descriptive commit message.
git status: Show information related to the state of the working directory, current branch, untracked files, modified files, etc.
git branch <new-branch>: Create a new branch based on the current checked-out branch.
git checkout <branch>: Check out the specified branch into the working directory.
git merge <branch>: Merge the specified branch into the current branch checked out in the working directory.
git pull: Update the working copy by merging in committed changes that exist in the remote repository but not the working copy.
git push: Pack loose objects for local active branch commits into pack files and transfer to remote repository.
git log: Show the commit history and associated descriptive messages for the active branch.
git stash: Save all uncommitted changes in the working directory to a cache so that they can be retrieved later.

If you're interested in learning how Git's code works, check out our Baby Git Guidebook for Developers. For more information on Git internals, see the Pro Git book chapter on Git's internals.

Sample Git Blob, Tree, Commit Files

A blob file with hash value 37d4e6c5c48ba0d245164c4e10d5f41140cab980:

hi there

A tree object with hash value b769f35b07fbe0076dcfc36fd80c121d747ccc04:

100644 blob 37d4e6c5c48ba0d245164c4e10d5f41140cab980    hi.txt

A commit object with hash value dc512627287a61f6111705151f4e53f204fbda9b:

tree b769f35b07fbe0076dcfc36fd80c121d747ccc04
author Jacob Stopak  1574915303 -0800
committer Jacob Stopak  1574915303 -0800

Initial commit

Mercurial - Third Generation

Background

Mercurial was created in 2005 by Matt Mackall and is written primarily in Python. It was also started with the goal of hosting the codebase for Linux, but Git was chosen instead. It is the second most popular distributed VCS after Git, but is used far less often.

Architecture

Like Git, Mercurial is a distributed version control system that allows any number of developers to work with their own copy of a project independently from others. Mercurial leverages many of the same technologies as Git, such as compression and SHA-1 hashing, but does so in different ways.

When a new file is committed for tracking in Mercurial, a corresponding revlog file is created for it in the hidden directory .hg/store/data/. You can think of a revlog (or revision log) file as a modernized version of the history files used by the older VCS like CVS, RCS, and SCCS. Unlike Git, which creates a new blob for every version of every staged file, Mercurial simply creates a new entry in the revlog for that file. To conserve space, each new entry only contains the delta (changes) from the previous version. Once a threshold number of deltas is reached, a full snapshot of the file is stored again. This reduces the lookup time when applying many deltas to reconstruct a particular file revision.
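The delta-plus-periodic-snapshot idea can be sketched with a toy revision log in Python. This is not Mercurial's revlog format. The TinyRevlog class, the max_chain threshold, and the copy/data delta encoding built on difflib are assumptions chosen to keep the example short, and compression is left out entirely.

```python
import difflib


def make_delta(old, new):
    """Describe `new` as copies of ranges of `old` plus literal new lines."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:                       # "replace" or "insert"
            ops.append(("data", new[j1:j2]))
    return ops


def apply_delta(old, ops):
    """Rebuild the newer revision from the older one and a delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class TinyRevlog:
    def __init__(self, max_chain=3):
        self.max_chain = max_chain  # deltas allowed before a new snapshot
        self.entries = []           # ("snapshot", lines) or ("delta", ops)
        self._chain = 0

    def add(self, lines):
        """Append a revision and return its revision number."""
        if not self.entries or self._chain >= self.max_chain:
            self.entries.append(("snapshot", list(lines)))
            self._chain = 0
        else:
            previous = self.get(len(self.entries) - 1)
            self.entries.append(("delta", make_delta(previous, list(lines))))
            self._chain += 1
        return len(self.entries) - 1

    def get(self, rev):
        """Find the nearest earlier snapshot, then replay deltas forward."""
        base = rev
        while self.entries[base][0] != "snapshot":
            base -= 1
        lines = list(self.entries[base][1])
        for _, ops in self.entries[base + 1:rev + 1]:
            lines = apply_delta(lines, ops)
        return lines


log = TinyRevlog(max_chain=2)
versions = [["a"], ["a", "b"], ["a", "b", "c"], ["x", "b", "c"], ["x", "c"]]
for version in versions:
    log.add(version)
print([kind for kind, _ in log.entries])
```

With max_chain set to 2, no revision is ever more than two deltas away from a full snapshot, which bounds the work needed to reconstruct any revision.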

These file revlogs are named to match the files that they track, but are suffixed with .i and .d extensions. The .d files contain the compressed delta content. The .i files are used as indexes to quickly track down different revisions inside the .d files. For small files with low numbers of revisions, both the indexes and content are stored in .i files. Revlog file entries are compressed for performance and hashed for identification. The hash values are referred to as nodeids.

Whenever a new commit is made, Mercurial tracks all the file revisions in that commit in something called the manifest. The manifest is also a revlog file - it stores entries that correspond to particular states of the repository. However, instead of storing individual file content like the file revlogs, the manifest stores a list of filenames and nodeids that specify which file revision entries exist in each revision of the project. These manifest entries are also compressed and hashed. The hash values are again referred to as nodeids.

Lastly, Mercurial uses one more type of revlog called a changelog. The changelog contains a list of entries that associate each commit with the following information:

  • Manifest nodeid: Identifies the full set of file revisions that exist at a particular time.
  • Parent commit nodeid(s): This allows Mercurial to establish a timeline or branch of project history. One or two parent IDs are stored depending on the type of commit (normal vs merge).
  • Commit author
  • Commit date
  • Commit message

Each changelog entry also generates a hash known as its nodeid.
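As a rough illustration of how a nodeid can be derived, the Python sketch below hashes an entry's content together with the nodeids of its parents. To our understanding this matches how Mercurial's revlogs compute nodeids with SHA-1 (the two parent IDs are sorted and hashed before the text, and a missing parent is represented by 20 zero bytes), but treat it as an assumption instead of a specification. The node_id function name and the sample inputs are invented for this example.

```python
import hashlib

NULL_ID = b"\0" * 20  # stands in for "no parent"


def node_id(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> str:
    """Hash the sorted parent IDs followed by the entry's content."""
    first, second = sorted((p1, p2))
    return hashlib.sha1(first + second + text).hexdigest()


root = node_id(b"hi there\n")
child = node_id(b"hi there, friend\n", p1=bytes.fromhex(root))
print(root)
print(child)
```

Mixing the parents into the hash means that identical content reached through different histories gets different nodeids.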

Basic Commands

hg init: Initialize the current directory as a Mercurial repository (creates the hidden .hg folder and its contents).
hg clone <hg-url>: Download a copy of the Mercurial repository at the specified URL.
hg add <filename.ext>: Add a new file for revision tracking.
hg commit -m 'Commit message': Commit a set of changed files and folders along with a descriptive commit message.
hg status: Show information related to the state of the working directory, untracked files, modified files, etc.
hg update <revision>: Check out the specified revision (or branch) into the working directory.
hg merge <branch>: Merge the specified branch into the current branch checked out in the working directory.
hg pull: Download new revisions from remote repository but don't merge them into the working directory.
hg push: Transfer new revisions to remote repository.
hg log: Show the commit history and associated descriptive messages for the active branch.

Sample Mercurial Files

Manifest revlog entry:

hey.txt208b6e0998e8099b16ad0e43f036ec745d58ec04
hi.txt74568dc1a5b9047c8041edd99dd6f566e78d3a42

Changelog revlog entry:

b8ee947ce6f25b84c22fbefecab99ea918fc0969
Jacob Stopak 
1575082451 28800
hey.txt
Add hey.txt

For more information on Mercurial internals, check out the following links:

Conclusion

In this article, we provided a technical comparison of some historically relevant version control systems. If you have any questions or comments, feel free to reach out to jacob@initialcommit.io.

A special thanks to Reddit user u/Teknikal_Domain, who provided expert details and insight that greatly contributed to the writing of this article.