What is the most popular initial commit message in Git?
In this article, we will explore which initial commit messages are the most popular. We will do this by analyzing a public GitHub dataset from Google BigQuery that contains data from almost 3 million Git repositories. We will leverage this data to compile a list of the 20 most commonly used initial commit messages in the dataset.
If you aren't familiar with initial commits, check out our article What is an Initial Commit in Git? before reading this one.
We'll start off by briefly explaining what Google BigQuery is.
Some background on Google BigQuery
Google BigQuery is a data warehouse hosted on the Google Cloud Platform that is accessible over web services. It was designed to host big data in a cloud environment and provide fast, convenient access to that data over the Internet.
As a part of the service, Google provides numerous public datasets for developers to experiment with. Many of these are useful to the scientific and research communities, including data from the Food and Drug Administration, the US Census, the National Oceanic and Atmospheric Administration, and of course, GitHub.
We'll be using the GitHub dataset in this article, which contains data from millions of public GitHub repositories. This data includes repository names, committed file names, commit messages, author names, timestamps, and more. We'll mainly be using the commit messages to try and gain some insight into the most common initial commit messages that developers use.
For more information on the GitHub dataset, see Google's overview of the GitHub Activity Data. If you're interested, that page includes a button to access the data via the BigQuery console. Note that you'll have to provide billing account details even though your access will be free.
Now let's jump in and query some Git data!
Writing SQL queries against Google BigQuery
Here is what Google's BigQuery interface looks like:
The bottom-left panel in the console lists the datasets that are available. Note that we have selected the github_repos dataset and expanded it. Each item in the expanded list represents a table in the dataset. We'll be making use of the commits table, which contains the commit message information we'll need. The commit message information is stored in the message column in this table. The commits table contains data from the 1990s all the way through the present.
The big white panel in the top-center of the screen is an editor that we can use to write and execute SQL queries against the datasets. Here is the query I used to fetch, group, and count the commit messages:
SELECT
  TRIM(LOWER(message)),
  COUNT(*)
FROM bigquery-public-data.github_repos.commits
WHERE author.date.seconds >= 946684800
  AND author.date.seconds <= 1585800000
  AND ARRAY_LENGTH(parent) = 0
  AND LENGTH(TRIM(LOWER(message))) > 0
GROUP BY TRIM(LOWER(message))
ORDER BY COUNT(*) DESC
LIMIT 15999;
This query groups together all commits that share the same commit message and counts how many commits used each message. The commits table has a field called parent, which stores the parent commit IDs for each commit, if any. I filtered on this field using the clause ARRAY_LENGTH(parent) = 0, since initial commits don't have any parents. Note that the first commit on an orphan branch (for example, one created with git checkout --orphan) also has no parents, but these can be manually excluded based on the commit message content.
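To make the parent filter concrete, here is a minimal Python sketch of the same idea applied to a few in-memory records. The Commit class and the sample history are hypothetical; they only mimic the shape of the commits table, where parent holds a list of parent commit IDs.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """Hypothetical stand-in for one row of the commits table."""
    sha: str
    message: str
    parent: list = field(default_factory=list)  # parent commit IDs, if any


def root_commits(commits):
    """Keep only commits with no parents, like ARRAY_LENGTH(parent) = 0."""
    return [c for c in commits if len(c.parent) == 0]


# Made-up sample history: one root commit, one ordinary commit, one merge.
history = [
    Commit("a1", "Initial commit"),
    Commit("b2", "Add README", parent=["a1"]),
    Commit("c3", "Merge branch 'dev'", parent=["b2", "d4"]),
]

roots = root_commits(history)
print([c.sha for c in roots])
```

An ordinary commit has one parent and a merge commit has two or more, so only the parentless commit survives the filter.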
I used the LOWER() function to ensure that differences in letter case didn't prevent identical messages from being grouped together. I also used the TRIM() function to remove any leading or trailing whitespace before grouping. I added a filter on the author.date.seconds column to bring back commits made between January 1st 2000 and April 2nd 2020 (the dates that the two epoch-second values in the query correspond to in UTC). I ordered the resulting commit messages by their frequency of occurrence, from highest to lowest. Finally, I limited the results to 15999 records, since that is the most that can be conveniently downloaded from the console interface (and far more than we need).
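For readers who prefer to see the logic outside of SQL, the sketch below mirrors the query's steps in Python: convert the epoch-second bounds to dates, normalize each message, drop empty ones, then count and rank. The function names and the sample rows are assumptions made for illustration; only the two numeric bounds and the limit come from the query.

```python
from collections import Counter
from datetime import datetime, timezone

# Epoch-second bounds copied from the query's WHERE clause.
START_SECONDS = 946684800
END_SECONDS = 1585800000


def to_utc(seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def top_messages(rows, limit=15999):
    """Normalize, filter, count, and rank messages like the query does.

    rows is an iterable of (message, author_seconds) pairs.
    """
    counts = Counter()
    for message, seconds in rows:
        if not START_SECONDS <= seconds <= END_SECONDS:
            continue  # outside the date window
        normalized = message.lower().strip()  # LOWER() then TRIM()
        if normalized:  # skip messages that are empty after trimming
            counts[normalized] += 1
    return counts.most_common(limit)  # highest count first


# Made-up sample rows, not taken from the dataset.
sample_rows = [
    ("Initial commit", 1500000000),
    ("  initial commit\n", 1500000001),
    ("first commit", 1500000002),
    ("   ", 1500000003),             # empty after trimming
    ("Initial commit", 900000000),   # before the start bound
]

print(to_utc(START_SECONDS).isoformat())
print(to_utc(END_SECONDS).isoformat())
print(top_messages(sample_rows))
```

Python's str.strip() and BigQuery's TRIM() may not treat every whitespace character identically, so this is an approximation of the query rather than an exact port.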
Now let's move on to the findings!
The top 20 most popular initial commit messages
After running the above query, I took the first 20 rows of the result. Here they are, ranked by frequency of occurrence in the dataset:
| Commit Message | Count | % of Total |
|---|---|---|
| create gh-pages branch via github | 3372 | <1 |
| initial commit to add default .gitignore and .gitattribute files. | 2967 | <1 |
These top 20 initial commit messages account for a combined total of 2,285,725 commits.
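As a rough check on the percentage column, the short Python sketch below computes each message's share of that combined total. The percent_label helper is hypothetical, and the rounding rule (whole percentages, with anything under 1% shown as "<1") is an assumption inferred from how the table displays its values.

```python
def percent_label(count, total):
    """Format a count's share of the total in the style of the table."""
    share = 100 * count / total
    return "<1" if share < 1 else str(round(share))


TOTAL = 2_285_725  # combined count of the top 20 messages, from the article

# The two counts that appear in the table rows above.
print(percent_label(3372, TOTAL))
print(percent_label(2967, TOTAL))
```

Both counts are a small fraction of one percent of the total, which is why the table shows them as "<1".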
From the results, we can see that "initial commit" is by far the most popular message, representing 86% of the sample. The second most popular message is "first commit", representing 7%. The third-ranked message is "init", with 2%. Clearly, the percentages drop off extremely quickly. The remaining results were mostly made up of the word "init" mixed with another term like "checkin". A few of the remaining messages were related to including the .gitignore and .gitattribute files. Lastly, there was the message "create gh-pages branch via github", likely indicating that the GitHub Pages feature is gaining traction.
So from what we can tell, the initial commit messages used in this dataset are strongly weighted toward "initial commit", with a small minority favoring "first commit" and a smattering of other options. As you can probably tell by the name of my website, I run with the herd on this one and favor "initial commit".
In this article we analyzed a public GitHub dataset from Google BigQuery to explore the most popular initial commit messages used by developers working with Git. If you found this topic interesting, check out our analysis that uses Google BigQuery to estimate the percentage of commit messages that use the imperative mood.