The Definitive Guide to Git's Code
Over the past 15 years, Git has grown from a tiny program written by a single developer into the most popular version control system (VCS) on the planet. Developers use it to share code and collaborate on software projects, and knowing how to use it is now expected of anyone joining a team project.
But what does Git actually do? Git's core functionality can be simplified into 3 basic parts:
- Allow developers to keep track of updates to their code over time
- Allow developers to easily combine their code updates with previous updates or new updates made by other people
- Allow developers to easily share code over the Internet
There are many tutorials on how to use Git spread across the Internet, and eventually we hope to build a library of those here at Initial Commit, but for now we'll focus on the inner workings of Git's code to help curious developers understand how it functions.
Why Learn How Git's Code Works?
Before diving in, we should address the question "Who in their right mind would want to spend time learning about how Git's code works?" Here are a few reasons to learn about Git's code:
- Git's codebase – at least in its initial form – is a manageable size to wrap your head around. Git's initial commit comes packaged in only 10 files and comprises fewer than 1,000 lines in total. This is tiny compared to most codebases of any scope and maximizes the knowledge-to-effort ratio of this endeavor.
- Git's code actually runs in its initial form. Later on we'll walk through the steps to download Git's full codebase, retrieve its initial form, and run its original commands.
- Git's creator and original author Linus Torvalds is very picky about design principles, so understanding how he built this thing offers useful knowledge for structuring your own software projects in the future.
- The code itself is not that hard to grasp. If you have a basic or intermediate knowledge of programming, you should be able to follow along with the detailed inline code comments in our Baby Git project.
- Curiosity – In my experience as a software developer, I've found that each new programming language, tool, or project that I've integrated into my repertoire has expanded my skill set and correspondingly the set of opportunities that I have in my professional and hobbyist careers. Sometimes exploring a topic in depth purely due to curiosity is a good enough reason!
What language(s) is Git written in?
As of March 27, 2019, Git's code is made up of the following programming languages, as seen on Git's GitHub page:
Figure 1: Distribution of Git's Programming Languages
From this we can see that almost 50% of Git's code is written in C, so knowledge of the C programming language is important for understanding how Git functions. In fact, Git's original codebase (its initial commit) is written entirely in C, besides the Makefile. If you're familiar with other statically typed languages like Java or C++, you shouldn't have too many problems reading C code. However, there are 2 major differences between C and Java/C++ that you will need to grasp:
- C doesn't have classes. That's right: C is not an object-oriented language, which is part of why C++ was created. The closest structure C has to a class is the struct, which groups related data fields but has no methods. You can read more about this on my guest post here.
- C uses pointers often. (C++ does too, so if you are familiar with them, that's great!) You can think of a pointer as a variable that holds the memory address of another variable you are working with. This makes accessing variable addresses and values a bit different than in higher-level languages like Java and Python.
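To make these two differences concrete, here is a minimal sketch of a struct being filled in and modified through pointers. The file_entry type and both function names are made up for illustration and are not taken from Git's source.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record type for illustration; not part of Git's code.
 * A struct groups related fields, but unlike a class it has no methods. */
struct file_entry {
    char name[32];
    unsigned int size;
};

/* Fills in a struct through a pointer, so the caller's copy is changed. */
void init_entry(struct file_entry *entry, const char *name, unsigned int size)
{
    assert(entry != NULL && name != NULL);
    /* snprintf truncates safely if name is longer than the buffer. */
    snprintf(entry->name, sizeof(entry->name), "%s", name);
    entry->size = size;
}

/* The '->' operator reads or writes a member through a pointer. */
void grow_entry(struct file_entry *entry, unsigned int extra_bytes)
{
    assert(entry != NULL);
    entry->size += extra_bytes;
}
```

A caller would declare `struct file_entry e;` and pass `&e` (the address of `e`) to each function. Writing `entry->size` is shorthand for `(*entry).size`: follow the pointer, then pick the field.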
Is accessing, downloading, reading, editing, and sharing Git's code even legal?
YES! Git's code is free and open source under the GNU General Public License version 2.
Where to Find Git's Code?
Git's code is stored on Git's GitHub page. You can download the ZIP file directly from GitHub or open a terminal window and clone the repository using the following command:
git clone https://github.com/git/git.git
Navigate into the freshly downloaded git directory and run the git log command to take a peek at the latest commits made by the Git development community.
If you take a look at the files and directories in the project root (i.e. in the main git directory), you'll see a large collection of C header files (files ending in the .h extension, such as blob.h) and source code files (files ending in the .c extension, such as blob.c). The .h files contain declarations to be shared among multiple source files using the #include preprocessing directive. The .c files contain the actual code that makes Git tick.
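The split between .h and .c files can be sketched in a single listing. The example below is only an illustration: greeting.h, greeting.c, and greeting_length are hypothetical names, and in a real project the two halves would live in separate files.

```c
/* ---- Contents of a hypothetical header, greeting.h ---- */
#ifndef GREETING_H
#define GREETING_H

#include <stddef.h>

/* Declaration only: tells other source files the function exists. */
size_t greeting_length(const char *name);

#endif /* GREETING_H */

/* ---- Contents of a hypothetical source file, greeting.c ---- */
/* In a real project this file would begin with: #include "greeting.h" */
#include <assert.h>
#include <string.h>

/* Definition: the code that actually runs when the function is called. */
size_t greeting_length(const char *name)
{
    assert(name != NULL);
    return strlen("Hello, ") + strlen(name);
}
```

Any other .c file that includes the header can call greeting_length without seeing its body. The #ifndef guard keeps the declarations from being processed twice if the header is included more than once.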
You'll also notice some files ending in the .sh extension, which are shell scripts, and some files ending in the .perl extension, which are Perl scripts. In general, each of these files corresponds to a particular Git feature, command, or object (more info on Git objects here).
However, analyzing this current version of Git's codebase would get unwieldy fast, simply because there are so many files and folders to go through. Let's break this problem down into one of a more manageable size.
The Initial Version of Git: Git's Initial Commit
As mentioned above, Git's initial commit is small in size, and it actually works – so how do we retrieve it? We can do that by running the following commands in a terminal window in the git directory:
git log --reverse
This command displays Git's commit history starting from the very first commit instead of the most recent one. Note that the very first commit in the list has an ID of e83c5163316f89bfbde7d9ab23ca2e25604af290. Check out that commit by running:
git checkout e83c5163316f89bfbde7d9ab23ca2e25604af290
Now if you examine the contents of the git directory, you'll notice almost all of the files have disappeared! In fact, there are only 10 files left (11 if you include the README).
Feel free to look through these files for a peek at how Git works under the hood. The Initial Commit team has thoroughly documented this codebase with inline code comments.