What is the most popular initial commit message in Git?

Introduction

In this article, we will explore which initial commit messages are the most popular. We will do this by analyzing a public GitHub dataset from Google BigQuery that contains data from almost 3 million Git repositories. We will leverage this data to compile a list of the 20 most commonly used initial commit messages in the dataset.

If you aren't familiar with initial commits, check out our article What is an Initial Commit in Git? before reading this one.

We'll start off by briefly explaining what Google BigQuery is.

Some background on Google BigQuery

Google BigQuery is a data warehouse hosted on the Google Cloud Platform that is accessible over web services. It was designed to host big data in a cloud environment and provide fast, convenient access to that data over the Internet.

As a part of the service, Google provides numerous public datasets for developers to experiment with. Many of these are useful to the scientific and research communities, including data from the Food and Drug Administration, the US Census, the National Oceanic and Atmospheric Administration, and of course, GitHub.

We'll be using the GitHub dataset in this article, which contains data from millions of public GitHub repositories. This data includes repository names, committed file names, commit messages, author names, timestamps, and more. We'll mainly be using the commit messages to gain some insight into the most common initial commit messages that developers use.

For more information on the GitHub dataset, see Google's overview of the GitHub Activity Data. If you're interested, this link includes a button to access the data via the BigQuery console. Note that you'll have to provide billing account details even though your access will be free.

Now let's jump in and query some Git data!

Writing SQL queries against Google BigQuery

Here is what Google's BigQuery interface looks like:

[Screenshot: Google BigQuery Console]

The bottom-left panel in the console lists the different datasets that are available. Note that we have selected the github_repos dataset and expanded it. Each item in the expanded list represents a table in the dataset. We'll be making use of the commits table, which contains the commit message information we'll need. This information is stored in the message column of the table. The commits table contains data from the 1990s all the way through the present.

The big white panel in the top-center of the screen is an editor that we can use to write and execute SQL queries against the datasets. Here is the query I used to fetch, group, and count the commit messages:

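-- Count parentless commits by normalized (lowercased, trimmed) message.
SELECT TRIM(LOWER(message)) AS normalized_message, COUNT(*) AS commit_count
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800      -- 2000-01-01 00:00:00 UTC
  AND author.date.seconds <= 1585800000     -- 2020-04-02 04:00:00 UTC
  AND ARRAY_LENGTH(parent) = 0              -- commits with no parents
  AND LENGTH(TRIM(LOWER(message))) > 0      -- skip empty messages
GROUP BY normalized_message
ORDER BY commit_count DESC
LIMIT 15999;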

This query groups together all commits with the same commit message and counts how many commits contained each message. The commits table has a field called parent, which stores the parent commit IDs for each commit, if any. I filtered on this field using the clause ARRAY_LENGTH(parent) = 0, since initial commits don't have any parents. Note that a repository's first commit is not the only kind of parentless commit: a commit that starts an orphan branch (for example, one created with git checkout --orphan) has no parents either. Such commits can be manually excluded based on the commit message content.
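To make the grouping logic concrete, here is a minimal Python sketch of the same steps applied to a handful of made-up rows. The sample data and the function name count_root_messages are assumptions for illustration only; the real query runs inside BigQuery, and Python's strip() only approximates TRIM().

```python
from collections import Counter

# Hypothetical sample rows; the keys mirror the columns used in the query.
commits = [
    {"message": "Initial commit", "parent": []},
    {"message": "  initial commit\n", "parent": []},
    {"message": "First commit", "parent": []},
    {"message": "Fix typo", "parent": ["a1b2c3"]},  # has a parent, so it is skipped
    {"message": "   ", "parent": []},                # empty after trimming, so it is skipped
]


def count_root_messages(rows):
    """Count normalized messages of commits that have no parents."""
    counts = Counter()
    for row in rows:
        if len(row["parent"]) != 0:
            continue  # same idea as ARRAY_LENGTH(parent) = 0
        normalized = row["message"].lower().strip()
        if normalized:  # same idea as LENGTH(...) > 0
            counts[normalized] += 1
    # Highest count first, like ORDER BY COUNT(*) DESC.
    return counts.most_common()


ranking = count_root_messages(commits)
print(ranking)  # [('initial commit', 2), ('first commit', 1)]
```

The two differently cased and padded variants of "initial commit" end up in the same group, which is the effect that LOWER() and TRIM() have in the query.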

I used the LOWER() function to ensure that uppercase/lowercase differences didn't prevent the same messages from being grouped together. I also used the TRIM() function to remove any leading or trailing whitespace before grouping. I added a filter on the author.date.seconds column, which holds Unix timestamps in seconds, to bring back commits made between January 1st, 2000 and April 2nd, 2020. I ordered the resulting commit messages by their frequency of occurrence, from highest to lowest. Finally, I limited the results to 15999 records since that is the maximum number that can be conveniently downloaded from the console interface (and that is far more than we need).
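The two integer bounds in the query are Unix timestamps (seconds since January 1st, 1970 UTC). As a quick sanity check, the following Python sketch converts them to readable dates. The constant and function names are my own and are not part of the query.

```python
from datetime import datetime, timezone

# Bounds copied from the WHERE clause of the query.
LOWER_BOUND = 946684800
UPPER_BOUND = 1585800000


def to_utc(seconds):
    """Convert Unix seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


start = to_utc(LOWER_BOUND)
end = to_utc(UPPER_BOUND)
print(start.isoformat())  # 2000-01-01T00:00:00+00:00
print(end.isoformat())    # 2020-04-02T04:00:00+00:00
```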

Now let's move on to the findings!

The top 20 most popular initial commit messages

After running the above query, I simply picked out the top 20 results in the ranking. Here are the top 20 commit messages ranked by frequency of occurrence in the dataset:

| Commit Message | Count | % of Total |
| --- | --- | --- |
| initial commit | 1957096 | 86 |
| first commit | 151042 | 7 |
| init | 39357 | 2 |
| initial commit. | 36616 | 2 |
| initial | 17882 | 1 |
| initial import | 14735 | 1 |
| create readme.md | 11510 | 1 |
| init commit | 9686 | <1 |
| update license.md | 6606 | <1 |
| first | 6029 | <1 |
| first commit. | 5689 | <1 |
| initial version | 5326 | <1 |
| create license.md | 3968 | <1 |
| inital commit | 3852 | <1 |
| initial import. | 3460 | <1 |
| create gh-pages branch via github | 3372 | <1 |
| initial release | 3347 | <1 |
| initial checkin | 3185 | <1 |
| initial commit to add default .gitignore and .gitattribute files. | 2967 | <1 |
| Total | 2285725 | 100 |

These top 20 initial commit messages combined for a total of 2,285,725 commits.
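The percentage column is each message's count divided by the table total, rounded to a whole number. Here is a small Python sketch of that arithmetic for the three most common messages. The helper name share_of_total is an assumption for illustration, and the shares are relative to the table total rather than to every parentless commit in the dataset.

```python
# Counts copied from the table above for the three most common messages.
top_counts = {
    "initial commit": 1957096,
    "first commit": 151042,
    "init": 39357,
}
TABLE_TOTAL = 2285725  # the "Total" row of the table


def share_of_total(count, total):
    """Return count as a percentage of total, rounded to a whole number."""
    return round(100 * count / total)


shares = {msg: share_of_total(n, TABLE_TOTAL) for msg, n in top_counts.items()}
print(shares)  # {'initial commit': 86, 'first commit': 7, 'init': 2}
```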

From the results, we can see that initial commit is by far the most popular message used, representing 86% of the commits in the table. The second most popular message is first commit, representing 7%. The third-ranked message is init with 2%. Clearly the percentages drop off extremely quickly. The remaining results were mostly made up of the word initial or init mixed with another term like import, version, release, or checkin. A few of the remaining messages were related to including the readme.md, license.md, .gitignore, and .gitattribute files. Lastly, there was the message create gh-pages branch via github, likely indicating that the GitHub Pages feature is gaining traction.

So from what we can tell, the initial commit messages used in this dataset are strongly weighted toward initial commit, with a small minority favoring first commit and a smattering of other options. As you can probably tell by the name of my website, I run with the herd on this one and favor initial commit!

Conclusion

In this article we analyzed a public GitHub dataset from Google BigQuery to explore the most popular initial commit messages used by developers working with Git. If you found this topic interesting, check out our analysis that uses Google BigQuery to estimate the percentage of commit messages that use the imperative mood.