2 An Overview of Git's Original Codebase

As mentioned in Chapter 1, Git's initial commit is a rudimentary version of the Git version control application that is widely used today. Git's original version refers to the set of commands that were the forerunner of the present-day Git application. The program is written in the C language and has a total of 7 commands that are run on the command line.

The original Git program enables a user to set up a local repository to which the user can add files to be tracked and subsequently update the repository when changes are made to those files. Other functionalities include the ability to list the files that have been committed to the repository, to display the contents of those files, and to show the differences between the cached snapshots of the files and the working versions of those files.

As stated in Git’s original README file, the original Git program was designed as an efficient directory management system. The key to achieving this goal was to make use of file deflation (compression) in combination with a hash function, specifically the SHA-1 hash function, to map arbitrary file contents to unique hash values. Figure 1.4 illustrates a high-level overview of the data compression and decompression process that Git makes use of.

Figure 1.4: File Compression

The 'Original data' box represents the code files in a software project. These files are compressed to a size that makes them more efficient to work with and stored in a local repository. When the data is retrieved for use, it is decompressed to yield the original contents.
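
The round trip described above can be sketched with Python's zlib module. This is only an illustration: the sample content is a hypothetical stand-in for a source file, and the compression settings are zlib's defaults rather than anything taken from Git's code.

```python
import zlib

# Hypothetical stand-in for the contents of a source file.
original = b"int main(void) { return 0; }\n" * 50

deflated = zlib.compress(original)     # compress ("deflate") for storage
inflated = zlib.decompress(deflated)   # decompress ("inflate") on retrieval

print(len(original), len(deflated))    # the deflated copy is smaller
print(inflated == original)            # the round trip is lossless
```

Repetitive content such as source code tends to compress well, which is why the stored copy is usually much smaller than the working file.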

Let us now move on to hash functions. A hash function is a one-way function: it can easily be run in one direction to convert an input to an output, but it cannot easily be reversed to convert the output value back into the input. In addition, for practical purposes each distinct input produces a unique output, because two different inputs are extremely unlikely to yield the same hash value. Consider a file with the text ‘hello world’ inside of it. We can take that text and pump it through the SHA-1 hash function to yield a unique output code. If we update the text to read ‘hello cruel world’ and re-hash it, we will get a new unique hash output value since the content has changed. As we will see, all of Git's original objects are indexed and referenced through their respective SHA-1 hash values. Figure 1.5 illustrates how hash functions work.
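
To make the ‘hello world’ example concrete, here is a minimal sketch using Python's hashlib module. The helper name sha1_hex is an assumption made for this illustration, and the text is hashed as plain UTF-8 bytes without any of Git's object formatting.

```python
import hashlib

def sha1_hex(text: str) -> str:
    """Return the SHA-1 digest of UTF-8 encoded text as 40 hex digits."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

first = sha1_hex("hello world")
second = sha1_hex("hello cruel world")

print(first)
print(second)
print(first != second)   # a small change in input gives a different hash
```

Hashing the same text again always gives the same value, which is what makes a hash usable as a name for content.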

Figure 1.5: Hash Function

As an aside, hash functions are also a core part of the HTTPS protocol that secures Internet traffic, as well as of cryptocurrency technologies like Bitcoin. In HTTPS, a certificate is digitally signed over a hash of its contents, so any tampering changes the hash and invalidates the signature. In Bitcoin, a public address is derived by hashing a public key, which allows the address to be shared publicly without exposing the private key behind it.

In the case of SHA-1, a hash value has a length of 160 bits, or 20 bytes. In hexadecimal representation, this value is rendered as a sequence of 40 digits, with each 2-digit hexadecimal number having possible values from 00 to FF, or 0 to 255 in decimal representation. Here is an example of a SHA-1 hash value in hexadecimal representation:

47a013e660d408619d894b20806b1d5086aab03b
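
The relationship between the 20 raw bytes and the 40 hexadecimal digits can be checked directly. The sketch below uses only the example value above; the variable names are illustrative.

```python
hex_name = "47a013e660d408619d894b20806b1d5086aab03b"

raw = bytes.fromhex(hex_name)   # each pair of hex digits becomes one byte

print(len(hex_name))            # number of hexadecimal digits
print(len(raw), len(raw) * 8)   # number of bytes and bits
print(raw[0] == 0x47)           # the first byte comes from the digits "47"
print(raw.hex() == hex_name)    # converting back gives the same string
```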

Since all objects in Git are indexed and referenced through their hash values, a hash rendered in this particular hexadecimal format is the basic naming unit that users of Git will be working with to manage their repository.

In general terms, an initial workflow for adding and committing files to a Git repository consists of the following steps:

  1. Initialize the repository database
  2. Add user files to the repository cache
  3. Write the cache to the repository database
  4. Commit the current changeset to the repository database

When changes are subsequently made to the tracked files, the user can then update the repository database and cache with these new changes.

Components of Git's original repository

Git's original repository has four basic components:

  1. Objects
  2. An object database
  3. A current directory cache
  4. A working directory

Objects

An object is an abstraction of data and metadata. It is an arbitrary set of content – it could be text content in a file, or audio data, or any arbitrary byte stream – that is indexed and referenced through its hash value. In other words, the name of an object is its unique hash value, and this hash value is used to refer to and 'look up' that specific piece of content. It is important to note that an object's hash value is the hash of the deflated (compressed) data and metadata represented by the object, not of the inflated data and metadata.

The general structure of a stored object is as follows:

object tag
' '                   (single space)
size of object data   (in bytes)
'\0'                  (null character)
object binary data

As you can see, the first part of an object consists of the object metadata, and the second part consists of the object data (object binary data).

The metadata consists of a string containing an object tag that indicates the object type and a string containing the size of the object data in bytes before deflation.

There are three types of objects in Git's original version: A blob object, a tree object, and a commit or changeset object. Thus, the object tag in the structure above can be 'blob', 'tree' or 'commit'.
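
The general object structure can be sketched as follows. This is a simplified illustration, not Git's actual C code: the function name make_object is hypothetical, and zlib's default compression settings are assumed. Because the name is the hash of the deflated bytes, different compression settings would produce different names for the same content.

```python
import hashlib
import zlib
from typing import Tuple

def make_object(tag: bytes, data: bytes) -> Tuple[str, bytes]:
    """Build a stored object and return (hex name, deflated bytes)."""
    # Metadata: object tag, single space, size of the data before
    # deflation (as a decimal string), null character.
    header = tag + b" " + str(len(data)).encode("ascii") + b"\0"
    deflated = zlib.compress(header + data)
    # The object is named by the hash of the deflated content.
    name = hashlib.sha1(deflated).hexdigest()
    return name, deflated

name, deflated = make_object(b"blob", b"Hello world!\n")
print(name)
print(zlib.decompress(deflated))
```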

Blob object

A Git blob (Binary Large OBject) object is a general abstraction for user data. In practical use, this type of object contains the contents of a digital file that a user has added to the repository. In other words, a blob object is a type of object that corresponds to a user file in binary form – it could be any type of file the user wishes to add to the repository, such as a plaintext file, video file, Word document, etc. The Git program will generate a blob object for each file that a user adds to the repository, and this blob object will be named, indexed, and referenced through the deflated blob object's SHA-1 hash value.

A blob object has the following structure:

'blob'         (blob object tag)
' '            (single space)
size of blob   (in bytes)
'\0'           (null character)
blob data
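
Reading a blob back reverses these steps: inflate the stored bytes, then split the metadata from the data at the first null character. The sketch below assumes the layout shown above; read_object is a hypothetical helper, not a function from Git's source.

```python
import zlib
from typing import Tuple

def read_object(deflated: bytes) -> Tuple[str, int, bytes]:
    """Inflate a stored object and return (tag, size, data)."""
    inflated = zlib.decompress(deflated)
    # Everything before the first null character is metadata.
    header, _, data = inflated.partition(b"\0")
    tag, _, size_text = header.partition(b" ")
    size = int(size_text)
    if size != len(data):
        raise ValueError("size in metadata does not match data length")
    return tag.decode("ascii"), size, data

stored = zlib.compress(b"blob 13\0Hello world!\n")
print(read_object(stored))
```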

Tree object

A tree object contains a list of files that have been added by a user to the repository. A tree’s purpose is to correlate file names and other file metadata with blobs representing the actual file content that has been added to the repository. As with any object, a tree object is named, indexed, and referenced through the deflated tree object's SHA-1 hash value.

A tree object has the following structure:

'tree'              (tree object tag)
' '                 (single space)
size of tree        (in bytes)
'\0'                (null character)
file 1 mode         (octal number)
' '
file 1 name
'\0'
file 1 SHA-1 hash   (hash of file's deflated contents)
file 2 mode
' '
file 2 name
'\0'
file 2 SHA-1 hash
...
file N mode
' '
file N name
'\0'
file N SHA-1 hash

In the structure above, file 1, file 2, etc. refer to files that the user has added to the repository. For each file, the mode, name, and hash value are stored. The file mode is an octal number whose value gives the file permissions and type, as defined in Unix-like systems. The file name is the file's regular path, including its name. The SHA-1 hash is the hash value of the deflated blob object in the object database that corresponds to the file.

The file information is sorted lexicographically by file path. The size of the tree that's recorded in the tree object metadata is the sum of the sizes of the file information entries in the tree object data.
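
A sketch of how the tree object data could be assembled from a list of entries is shown below. It rests on two assumptions that are not stated in the text: each hash is stored as 20 raw bytes rather than 40 hexadecimal digits, and paths are compared byte by byte when sorting. The function name tree_data and the second entry (a.txt, with a placeholder hash) are hypothetical.

```python
from typing import List, Tuple

def tree_data(entries: List[Tuple[int, str, str]]) -> bytes:
    """Build tree object data from (mode, path, 40-digit hex hash) entries."""
    out = b""
    # Entries are sorted lexicographically by file path.
    for mode, path, hex_hash in sorted(entries, key=lambda e: e[1].encode()):
        # mode in octal, single space, path, null character, then the hash
        # (assumed here to be stored as 20 raw bytes).
        out += b"%o %s\0" % (mode, path.encode()) + bytes.fromhex(hex_hash)
    return out

entries = [
    (0o100664, "hello.txt", "694cdbbdce662aa1060d07cba1d4e0cfeb822bee"),
    (0o100664, "a.txt", "47a013e660d408619d894b20806b1d5086aab03b"),
]
data = tree_data(entries)
# The size recorded in the metadata is the total length of the entries.
tree_object = b"tree " + str(len(data)).encode("ascii") + b"\0" + data
print(len(data))
```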

Commit object

A commit or changeset object is the result of committing a tree object to the repository database. It contains the hash value of the tree object being committed, the hash values of any parent commit objects specified by the user, metadata about the user who committed the tree, the time and date when the commit was made, and a user-supplied comment.

By enabling a user to specify the parent commits of the tree object being committed, a commit object makes it possible to keep a history of the committed tree object. Like any other object, a commit object is named, indexed, and referenced through the deflated commit object's SHA-1 hash value.

A commit object has the following structure:

'commit'                     (commit object tag)
' '                          (single space)
size of data                 (in bytes)
'\0'                         (null character)
'tree' SHA-1 hash            (hash value of committed tree)
'parent' SHA-1 hash          (hash value of first parent commit)
'parent' SHA-1 hash          (hash value of second parent commit)
...
'author' ID email date
'committer' ID email date
                             (empty line)
comment

As usual, the hash values indicated above all refer to hash values of deflated objects. Note that it is possible to specify up to 16 parent commit objects in Git's original version.
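
The commit object data can be sketched as a few lines of text. The function name commit_data is hypothetical, and separating the fields with newline characters is an assumption based on the layout shown above. The sample values are taken from the command synopsis later in this chapter.

```python
from typing import List

MAX_PARENTS = 16  # limit on parents in Git's original version

def commit_data(tree: str, parents: List[str], author: str,
                committer: str, comment: str) -> bytes:
    """Assemble commit object data (without the 'commit' metadata)."""
    if len(parents) > MAX_PARENTS:
        raise ValueError("too many parents")
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines.append("author " + author)
    lines.append("committer " + committer)
    lines.append("")          # empty line before the comment
    lines.append(comment)
    return "\n".join(lines).encode("utf-8")

who = "Crusoe,,, <rcrusoe@island> Fri Apr 27 13:41:01 2018"
body = commit_data("60471dd2f6a67990755795708b2a09a5b3da505c",
                   [], who, who, "Initial commit.")
print(body.decode("utf-8"))
```

An initial commit has no parents, so its data contains no 'parent' lines at all.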

The object database

Another component of a Git repository is the object database. This database is simply an organized set of folders and files used for locally storing blob, tree, and commit objects for use with Git.

The first step in setting up a Git repository is to initialize the object database. The user can specify an existing objects directory through an environment variable. If not specified by the user, the program defaults to creating .dircache/objects in the current folder to represent the object database directory.

Under the objects directory, the program creates 256 subfolders that are named from 00 to ff, corresponding to the 256 possible values of a two-digit hexadecimal number. In other words, the database directories will be as follows:

.dircache/objects/00
.dircache/objects/01
.dircache/objects/02
...
.dircache/objects/fd
.dircache/objects/fe
.dircache/objects/ff
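
The directory layout above could be created along the following lines. This is a sketch rather than a port of init-db: the function name init_object_db is hypothetical, and the directories are created inside a temporary folder so that the example does not touch a real repository.

```python
import os
import tempfile

def init_object_db(root: str) -> list:
    """Create .dircache/objects/00 through ff under root; return the names."""
    objects = os.path.join(root, ".dircache", "objects")
    # Two-digit lowercase hexadecimal names: 00, 01, ..., fe, ff.
    names = [format(i, "02x") for i in range(256)]
    for name in names:
        os.makedirs(os.path.join(objects, name), exist_ok=True)
    return names

with tempfile.TemporaryDirectory() as tmp:
    names = init_object_db(tmp)
    created = sorted(os.listdir(os.path.join(tmp, ".dircache", "objects")))

print(len(created), created[0], created[-1])
```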

When Git needs to store an object, it is stored under the directory whose name is the same as the first two digits of the object's SHA-1 hash value rendered in a 40-digit hexadecimal representation. The remaining 38 digits of an object's hash value in hexadecimal representation are then used as the base filename of that object. For example, an object (let’s assume it’s a blob object) with a hash value of:

47a013e660d408619d894b20806b1d5086aab03b

will have a path in the database that is equal to

.dircache/objects/47/a013e660d408619d894b20806b1d5086aab03b

Note how the first two characters of the hash value represent the name of the subfolder to store the object in, and the remaining characters of the hash value specify the file name. Thus, an object's hash value completely specifies the path and name of the object in the object database.
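
The mapping from a hash value to a database path is simple enough to express in a few lines. The function name object_path and its default objects directory are assumptions made for this sketch.

```python
def object_path(hex_hash: str, objects_dir: str = ".dircache/objects") -> str:
    """Return the database path for an object's 40-digit hex hash."""
    if len(hex_hash) != 40 or any(c not in "0123456789abcdef" for c in hex_hash):
        raise ValueError("expected 40 lowercase hexadecimal digits")
    # First two digits name the subfolder; the remaining 38 name the file.
    return "/".join([objects_dir, hex_hash[:2], hex_hash[2:]])

print(object_path("47a013e660d408619d894b20806b1d5086aab03b"))
```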

The current directory cache

The current directory cache is the equivalent of the modern-day Git 'staging area'. It is a binary file that contains information, or metadata, about files that have been added by the user to the repository. When a user adds a file to the repository, the program adds a corresponding blob object to the object database, and the current directory cache is also updated to contain the file's metadata and the blob object's hash value.

File information in the current directory cache is sorted lexicographically by file path. All the information needed for creating a tree object is stored in the cache. The cache also holds additional file metadata that is not used in tree objects. The current directory cache can thus be thought of as an intermediate representation of a tree object.

File metadata and blob object hashes stored in the cache do not have to be consistent with the working versions of the corresponding user files. Edits and changes to the user files are not automatically reflected in the cache. However, the cache is used to provide an efficient way of obtaining any differences between the file information stored in it and the working versions of the files.

The current directory cache is stored in a file called .dircache/index in the user’s current working directory and has the following structure:

cache header
cache entry 1
cache entry 2
cache entry 3
...
cache entry N

In this structure, cache entry 1, cache entry 2, etc. refer respectively to information about file 1, file 2, etc. Each cache entry stores the following information about a user file:

file's last status change time
file's last modification time
ID of device containing the file
file inode number
file mode (permissions and type)
file user ID
file group ID
file size
SHA-1 hash value of file's deflated contents
file name length
file name

Note that several of these are file properties that are specific to Unix-like systems, namely the status change time, device ID, inode number, mode, user ID, and group ID.
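
Most of these fields come straight from the file's stat information. The sketch below models a cache entry as a Python dataclass and shows one way the stored metadata could be compared against the working file to detect a change. The names CacheEntry, entry_from_file, and looks_unchanged are hypothetical, the all-zero hash is a placeholder, and the comparison is a simplification rather than the exact check performed by Git.

```python
import os
import tempfile
from dataclasses import dataclass

@dataclass
class CacheEntry:
    ctime: int    # last status change time (nanoseconds)
    mtime: int    # last modification time (nanoseconds)
    dev: int      # ID of device containing the file
    ino: int      # inode number
    mode: int     # permissions and type
    uid: int      # user ID
    gid: int      # group ID
    size: int     # file size in bytes
    sha1: bytes   # hash of the file's deflated contents
    name: str     # file name (its length is stored alongside it)

def entry_from_file(path: str, name: str, sha1: bytes) -> CacheEntry:
    st = os.stat(path)
    return CacheEntry(st.st_ctime_ns, st.st_mtime_ns, st.st_dev, st.st_ino,
                      st.st_mode, st.st_uid, st.st_gid, st.st_size, sha1, name)

def looks_unchanged(entry: CacheEntry, path: str) -> bool:
    """Compare cached metadata with the working file, without reading it."""
    fresh = entry_from_file(path, entry.name, entry.sha1)
    return fresh == entry

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.txt")
    with open(path, "w") as f:
        f.write("Hello world!\n")
    entry = entry_from_file(path, "hello.txt", bytes(20))  # placeholder hash
    before = looks_unchanged(entry, path)
    with open(path, "a") as f:
        f.write("more\n")
    after = looks_unchanged(entry, path)

print(before, after)
```

Comparing metadata first is what makes the check cheap: the file contents only need to be examined when the metadata no longer matches.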

The cache structure also contains a cache header, which consists of the following information:

signature (common to all cache headers) 
version number 
number of cache entries, N, in the cache 
SHA-1 hash value of the cache header and cache entries

Working directory

The working directory refers to the base user directory containing the user files that the user wants to track and add to the repository. Only files residing in this directory or below it can be added to the repository, and not all existing user files have to be added to the repository.

The working directory also contains the .dircache directory created by the Git program to store the index file (current directory cache), as well as the objects directory (object database) if an existing object database was not specified by the user. The .dircache directory and its contents should not be edited by the user, except possibly for moving the object database to another directory.

User files and folders in the working directory and below it can be freely edited. The .dircache/index cache file in the working directory is used by the Git program to determine if changes have been made to files in the working directory since those files were last added to or updated in the cache.

Figure 1.6 below illustrates the components of Git's original repository.

Figure 1.6: Components of a Git Repository

Synopsis of Git's original commands

The 7 original Git commands are:

init-db
update-cache
write-tree
commit-tree
read-tree
cat-file
show-diff

The sequence of commands below is a synopsis of the use of these commands and is shown here to provide the reader a general picture of the commands. A more detailed tutorial can be found in Chapter 4.

In the commands below, the files hello.txt and changelog are both user files. The file hello.txt contains the text 'Hello world!' and the file changelog contains the text 'Initial commit.' Notice the 40-digit hash values that are output by some of the commands.

$ init-db
defaulting to private storage area
$
$ update-cache hello.txt
$
$ write-tree
60471dd2f6a67990755795708b2a09a5b3da505c
$
$ commit-tree 60471dd2f6a67990755795708b2a09a5b3da505c < changelog
Committing initial tree 60471dd2f6a67990755795708b2a09a5b3da505c
deae7e7a5111831cd90e4f10379d798e135f734a
$
$ cat-file deae7e7a5111831cd90e4f10379d798e135f734a
temp_git_file_LarhKM: commit
$
$ cat temp_git_file_LarhKM
tree 60471dd2f6a67990755795708b2a09a5b3da505c
author Crusoe,,, <rcrusoe@island> Fri Apr 27 13:41:01 2018
committer Crusoe,,, <rcrusoe@island> Fri Apr 27 13:41:01 2018
Initial commit.
$
$ read-tree 60471dd2f6a67990755795708b2a09a5b3da505c
100664 hello.txt (694cdbbdce662aa1060d07cba1d4e0cfeb822bee)
$
$ cat-file 694cdbbdce662aa1060d07cba1d4e0cfeb822bee
temp_git_file_kEtFzM: blob
$
$ cat temp_git_file_kEtFzM
Hello world!
$
$ show-diff 
hello.txt: ok

The reader can download the original Git package that accompanies this guidebook from either of these links:
