1 Introduction

Git is a tool that is widely used by software development teams to track and manage changes in software projects as they evolve over time. The term 'software project' may sound intimidating and vague, but really it is very simple. As illustrated in Figure 1.1 below, a software project is nothing more than a set of files and folders containing code.

Figure 1.1: Structure of a Software Project


The full set of directories and files that make up a software project is called a codebase. The 'Project Root' is the highest-level folder in the project’s directory tree. Code files can be included directly in the project root or organized into multiple levels of folders.

As shown in the second step of Figure 1.1, when the codebase is ready for testing or deployment it can be 'built' into the program that will actually run on your computer. The 'build' process can include one or more steps that convert the code written by humans into a form that your computer’s processor can execute. Once the code is built, your program is ready to run on your specific operating system, such as Linux, macOS, or Windows.

Over time, developers update the project code to add new features, fix bugs, implement security updates, and more. In general, there are three ways developers can make these changes to a software project:

  1. Add new files and folders to the project
  2. Edit the code in existing files and folders
  3. Delete existing files and folders
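The three kinds of change listed above can be made concrete with a short sketch. It is written in Python for brevity and is only an illustration: the function name classify_changes, the file names, and the idea of representing a project as a dictionary of paths and contents are assumptions made for this example, not part of Git or Baby Git.

```python
def classify_changes(old, new):
    """Compare two project states given as {path: content} dictionaries."""
    # Paths present only in the new state were added.
    added = sorted(p for p in new if p not in old)
    # Paths present only in the old state were deleted.
    deleted = sorted(p for p in old if p not in new)
    # Paths present in both states with different content were edited.
    edited = sorted(p for p in new if p in old and old[p] != new[p])
    return {"added": added, "edited": edited, "deleted": deleted}


# Hypothetical project states before and after a round of development.
before = {"main.c": "int main() { return 0; }", "README": "v1", "old.c": "x"}
after = {"main.c": "int main() { return 1; }", "README": "v1", "util.c": "y"}

changes = classify_changes(before, after)
print(changes)
```

Here main.c is reported as edited, util.c as added, and old.c as deleted, while the unchanged README appears in none of the lists.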

This raises the question: 'How the heck do all these developers, who may be spread out all around the world, keep track of their software project code in such a way that they can work together on a single project?' Development teams need a way to keep track of exactly what changes were made to the code, which files or folders were affected, and who made each change. They also need a way for each developer to obtain the updates from all other developers. Figure 1.2 illustrates a simplified development scenario with three team members: Mat, Jack, and Karina. Git provides a way to accomplish all of this and more. Tools that provide this ability are called Version Control Systems, or VCS for short.

Figure 1.2: Collaborative Development Efforts


Git is versatile and not limited to the field of software development. It can be used to accurately track changes in most digital files and provides a convenient means for keeping a history of those changes. A group of scientists, for example, could use Git to write a scientific paper collaboratively. The edits made to the Word document draft of this guidebook as it is being written are tracked through Git as well.

Version control systems like Git are tools that enable a user to take a snapshot of the state of a project at chosen times, whether it is a project that consists of a single file or a larger project consisting of a cascade of directories. Perhaps more accurately, the snapshot that is recorded is the set of changes made to the project since the previous snapshot. In VCS jargon, the user commits the changes to a repository of those changes.

An accurate record of the state of a project at particular points in the past is useful in several ways. Besides record keeping and providing a potential backup source, such a system makes it possible, for example, to fix mistakes in the current state of a project by reverting it to a previously working state.

Each snapshot also serves as a potential jumping-off point for a new line of project development, that is, for a new branch of the project. The accurate tracking made possible by a versioning system also enables collaboration: multiple contributors can make their respective changes to the project simultaneously and later merge them into a coherent version of the project.

The basic functionalities of a VCS like Git can then be summarized as providing users the ability to:

  • Add project files and folders to a repository
  • Commit changes in the files to the repository at user-chosen times, i.e. save snapshots of the project
  • Access the history of the changes committed to the repository
  • See the difference in the state of the project at different commit times
  • Branch from a particular committed version
  • Merge changes from different branches
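Several of the abilities listed above can be sketched as a toy, in-memory repository. This is a minimal illustration written in Python, not a description of how Git or Baby Git stores data: the class ToyRepo, its method names, and the use of integer commit ids are all hypothetical, and merging is left out to keep the sketch short.

```python
class ToyRepo:
    """Hypothetical in-memory repository, for illustration only."""

    def __init__(self):
        self.commits = []               # each commit: id, parent, message, files
        self.branches = {"main": None}  # branch name -> id of its latest commit
        self.current = "main"           # branch that new commits are recorded on
        self.staged = {}                # files added but not yet committed

    def add(self, path, content):
        """Stage a file so that the next commit includes it."""
        self.staged[path] = content

    def commit(self, message):
        """Save a snapshot: the parent's files updated with the staged files."""
        parent = self.branches[self.current]
        files = dict(self.commits[parent]["files"]) if parent is not None else {}
        files.update(self.staged)
        self.staged = {}
        commit_id = len(self.commits)
        self.commits.append(
            {"id": commit_id, "parent": parent, "message": message, "files": files}
        )
        self.branches[self.current] = commit_id
        return commit_id

    def log(self):
        """Return commit messages on the current branch, newest first."""
        messages = []
        commit_id = self.branches[self.current]
        while commit_id is not None:
            messages.append(self.commits[commit_id]["message"])
            commit_id = self.commits[commit_id]["parent"]
        return messages

    def diff(self, a, b):
        """Return the paths whose content differs between commits a and b."""
        files_a, files_b = self.commits[a]["files"], self.commits[b]["files"]
        return sorted(
            p for p in set(files_a) | set(files_b) if files_a.get(p) != files_b.get(p)
        )

    def branch(self, name, start):
        """Create a branch starting at commit `start` and switch to it."""
        self.branches[name] = start
        self.current = name


repo = ToyRepo()
repo.add("paper.txt", "draft 1")
first = repo.commit("first draft")
repo.add("paper.txt", "draft 2")
second = repo.commit("revise")

# Start a new line of development from the first snapshot.
repo.branch("experiment", first)
repo.add("notes.txt", "idea")
third = repo.commit("add notes")

print(repo.log())
print(repo.diff(first, second))
```

The history of the experiment branch contains only the first draft and the notes commit, because the branch started from the first snapshot rather than from the revision on main.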

Created by Linus Torvalds (the creator of Linux) in 2005, Git has evolved over more than a decade to become the sophisticated, convenient, and ubiquitous tool that it is today. This manual, however, is about the precursor to this evolved version of Git. It is about a set of commands that were, in the parlance of versioning, the Initial Commit of the Git application. The authors of this manual have coined the term Baby Git to refer to this germinal version of Git.

Baby Git could be thought of as a first rudimentary version of Git; Figure 1.3 illustrates its origins. Although much less sophisticated and convenient than its grown-up version, Baby Git nevertheless encapsulates the core ideas behind modern-day Git. Moreover, a primary motivation behind Baby Git was to implement these repository functionalities efficiently. It did so by deflating (compressing) data and by using a cryptographic hash function called SHA-1 to map data to hash values, also called message digests. Data deflation and hashing are the central concepts that underlie the algorithms used in Baby Git.

Figure 1.3: Origins of Baby Git

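The combination of deflation and SHA-1 hashing mentioned above can be sketched in a few lines. The example uses Python's standard hashlib and zlib modules rather than C, and it is only a sketch under stated assumptions: the function names put and get and the dictionary used as a store are hypothetical, and the choice to hash the uncompressed bytes is made for this example and is not a claim about what Baby Git does.

```python
import hashlib
import zlib


def put(store, data):
    """Store data compressed, keyed by the SHA-1 digest of its contents."""
    # Assumption of this sketch: the key is the hash of the uncompressed bytes.
    key = hashlib.sha1(data).hexdigest()
    store[key] = zlib.compress(data)  # deflate before storing
    return key


def get(store, key):
    """Look up a stored object by its key and inflate it again."""
    return zlib.decompress(store[key])


store = {}  # hypothetical stand-in for an on-disk object store
key = put(store, b"hello world\n")
again = put(store, b"hello world\n")

print(key)
print(get(store, key))
```

Because the key is derived from the content, storing the same bytes twice yields the same key and a single stored entry, which is one reason hashing suits a versioning tool.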

Baby Git is written in the C programming language and consists of about 1,000 lines of code and a total of 7 commands, all of which actually work. The simplicity and small size of the code make Baby Git an ideal codebase for curious developers to study in order to learn how the code works. It is remarkable that arguably the most popular and important tool for collaborative software development is simple enough for a novice developer to understand directly from its initial code.

The goals of this Baby Git guidebook are twofold:

  1. It aims to introduce the reader to the concepts and components behind a Baby Git repository and the Baby Git commands, and to provide a tutorial of how these commands are used in practice.
  2. It aims to use Baby Git as a tool for exploring the underpinnings of the Git versioning system. By exploring the concepts and implementation behind this rudimentary program, it is hoped that the reader will gain insights into how a much larger application like Git is programmed at conception.

This manual is divided roughly along these lines into two parts. Part I is a general user guide to Baby Git, comprising Chapters 1 through 4. In Chapter 2, we discuss the general concepts and components of Baby Git. In Chapter 3, we provide a guide for installing Baby Git on a local machine. And in Chapter 4, we provide a tutorial for using the 7 Baby Git commands.

Part II of this manual, consisting of Chapters 5 through 13, delves into the actual Baby Git code. Each of the 7 Baby Git commands is discussed in detail in its own chapter and we look under the hood at the more salient parts of the command's underlying C code.

This guidebook is aimed at readers who have some experience using the Git application and who are interested in knowing more about its origins, underlying concepts, and how it is implemented at the code level. Some programming experience would be an advantage but is not necessary. Readers with no programming experience, for example, might be more interested in Part I of this manual. None of these are requirements, of course, and an innate curiosity about how things work might be sufficient reason for perusing this guidebook.

The reader can download the Baby Git package that accompanies this guidebook from either of these links:
