What is the most popular initial commit message in Git?

Introduction

In this article, we will explore which initial commit messages are the most popular. We will do this by analyzing a public GitHub dataset from Google BigQuery that contains data from almost 3 million Git repositories. We will leverage this data to compile a list of the 20 most commonly used initial commit messages in the dataset.

If you aren't familiar with initial commits, check out our article What is an Initial Commit in Git? before reading this one.

We'll start off by briefly explaining what Google BigQuery is.

Some background on Google BigQuery

Google BigQuery is a data warehouse hosted on the Google Cloud Platform that is accessible over web services. It was designed to host big data in a cloud environment and provide fast, convenient access to that data over the Internet.

As a part of the service, Google provides numerous public datasets for developers to experiment with. Many of these are useful to the scientific and research communities, including data from the Food and Drug Administration, the US Census, the National Oceanic and Atmospheric Administration, and of course, GitHub.

We'll be using the GitHub dataset in this article, which contains data from millions of public GitHub repositories. This data includes repository names, committed file names, commit messages, author names, timestamps, and more. We'll mainly be using the commit messages to try and gain some insight into the most common initial commit messages that developers use.

For more information on the GitHub dataset, see Google's overview of the GitHub Activity Data. If you're interested, that page includes a button to access the data via the BigQuery console. Note that you'll have to provide billing account details even though your access will be free.

Now let's jump in and query some Git data!

Writing SQL queries against Google BigQuery

Here is what Google's BigQuery interface looks like:

[Screenshot: Google BigQuery Console]

The bottom-left panel in the console lists the different datasets that are available. Note that we have selected the github_repos dataset and expanded it. Each item in the expanded list represents a table in the dataset. We'll be making use of the commits table, which contains the commit message information we'll need. The commit messages are stored in the message column of this table. The commits table contains data from the 1990s all the way through the present.

The big white panel in the top-center of the screen is an editor that we can use to write and execute SQL queries against the datasets. Here is the query I used to fetch, group, and count the commit messages:

-- Count commits per normalized message, most frequent first.
SELECT TRIM(LOWER(message)) AS commit_message, COUNT(*) AS commit_count
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800 AND author.date.seconds <= 1585800000
GROUP BY commit_message
ORDER BY commit_count DESC
LIMIT 15999;

This query simply groups together all commits with the same commit message and counts how many commits contained each message. I used the LOWER() function to ensure that uppercase/lowercase differences didn't prevent identical messages from being grouped together. I also used the TRIM() function to remove any leading or trailing whitespace before grouping. I added a filter on the author.date.seconds column, which holds Unix timestamps, to bring back commits made between January 1st 2000 (946684800) and April 2nd 2020 (1585800000). I ordered the resulting commit messages by their frequency of occurrence, from highest to lowest. Finally, I limited the results to 15999 records since that is the maximum number that can be conveniently downloaded from the console interface (and that is far more than we need).
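To make the query's logic concrete, here is a minimal Python sketch of the same steps applied to an in-memory list of commits. It is only an illustration: the sample rows, the function names to_utc and top_messages, and the (message, seconds) tuple layout are assumptions, and Python's strip() is only an approximation of BigQuery's TRIM(). The sketch also converts the two Unix timestamps from the WHERE clause into UTC dates.

```python
from collections import Counter
from datetime import datetime, timezone

# Bounds copied from the WHERE clause of the query (Unix seconds).
START_SECONDS = 946684800
END_SECONDS = 1585800000


def to_utc(seconds):
    """Convert Unix seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def top_messages(commits, limit=15999):
    """Mimic the query: filter by date, normalize, group, count, sort, limit.

    `commits` is an iterable of (message, author_seconds) tuples, a
    hypothetical stand-in for rows of the commits table.
    """
    counts = Counter(
        message.lower().strip()                      # TRIM(LOWER(message))
        for message, seconds in commits
        if START_SECONDS <= seconds <= END_SECONDS   # WHERE clause
    )
    return counts.most_common(limit)                 # ORDER BY ... DESC, LIMIT


# Made-up sample rows, only to exercise the function.
sample_commits = [
    ("Initial commit", 1500000000),
    ("  initial commit\n", 1500000001),
    ("First commit", 1400000000),
    ("initial commit", 900000000),  # dated before 2000, so filtered out
]

if __name__ == "__main__":
    print(to_utc(START_SECONDS).isoformat())
    print(to_utc(END_SECONDS).isoformat())
    print(top_messages(sample_commits))
```

Running it shows that the lower bound is midnight UTC on January 1, 2000, that the upper bound falls on April 2, 2020 at 04:00 UTC, and that the two differently formatted "initial commit" rows are counted as one message.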

Now let's move on to the findings!

The top 20 most popular initial commit messages

After running the above query, I looked through the first couple hundred results manually and picked out the ones that seemed related to initial commits. This, of course, was not an exact science. Here are the top 20 commit messages ranked by frequency of use:
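The selection for this article was done by eye, but a first pass could be automated. The following is a rough sketch of one way to do that, not the method actually used here: it keeps messages that contain one of a few keywords as a whole word. The keyword list and function names are assumptions, and a filter like this would still need manual review, since it would miss messages such as create readme.md.

```python
# Hypothetical keywords suggesting an initial commit; not an exhaustive list.
KEYWORDS = ("initial", "init", "first")


def looks_like_initial_commit(message):
    """Return True if any whitespace-separated word is one of KEYWORDS."""
    words = message.lower().split()
    return any(word in KEYWORDS for word in words)


def filter_candidates(rows):
    """Keep (message, count) rows whose message looks like an initial commit."""
    return [(msg, n) for msg, n in rows if looks_like_initial_commit(msg)]
```

Matching on whole words rather than substrings avoids picking up unrelated messages that merely contain the letters "init", at the cost of missing variants such as "initialize".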

| Commit Message | Count | % of Total |
|---|---|---|
| initial commit | 1,809,852 | 77 |
| first commit | 176,573 | 8 |
| create readme.md | 153,176 | 7 |
| init | 69,004 | 3 |
| initial | 30,011 | 1 |
| initial import | 26,914 | 1 |
| init commit | 14,184 | 1 |
| initial version | 12,926 | 1 |
| initial release | 8,390 | <1 |
| first version | 7,921 | <1 |
| new version | 7,498 | <1 |
| initial checkin | 6,471 | <1 |
| first release | 5,447 | <1 |
| initial revision | 4,981 | <1 |
| init project | 4,908 | <1 |
| initial upload | 3,379 | <1 |
| initial code commit | 3,311 | <1 |
| first working version | 3,055 | <1 |
| initial implementation | 2,938 | <1 |
| initial code | 2,392 | <1 |
| Total | 2,353,331 | 100 |

These top 20 commit messages account for a total of 2,353,331 commits. This number probably overestimates the number of initial commits in the sample, since some of these may actually have been later commits rather than initial commits. There is no real way to know, but it is probably pretty close.

From the results, we can see that initial commit is by far the most popular message used, representing 77% of the sample. The second most popular message is first commit, representing 7.5%. The third-ranked message is create readme.md with 6.5%. This one is likely overestimated, since a good chunk of those commits were probably not the repository's initial commit. The remaining results were mostly made up of the word initial or init combined with another term like import, version, release, checkin, upload, revision, implementation, or code.
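The percentages quoted above can be reproduced from the counts in the table. Here is a small Python sketch of that arithmetic; the counts and the total are copied from the table, while the function name percent_of_total and the choice to round to one decimal place are assumptions made for illustration.

```python
# Counts copied from the results table (top three rows and the total).
TOTAL = 2_353_331
counts = {
    "initial commit": 1_809_852,
    "first commit": 176_573,
    "create readme.md": 153_176,
}


def percent_of_total(count, total=TOTAL):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {message: percent_of_total(n) for message, n in counts.items()}

if __name__ == "__main__":
    for message, share in shares.items():
        print(f"{message}: {share}%")
```

This gives 76.9%, 7.5%, and 6.5% for the three most common messages, which round to the 77, 8, and 7 shown in the table.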

So from what we can tell, the initial commit messages used in this dataset are strongly weighted toward initial commit, with a small minority favoring first commit and a smattering of other options. As you can probably tell by the name of my website, I run with the herd on this one and favor initial commit!

Conclusion

In this article we analyzed a public GitHub dataset from Google BigQuery to explore the most popular initial commit messages used by developers working with Git.