What is the most popular initial commit message in Git?
In this article, we will explore which initial commit messages are the most popular. We will do this by analyzing a public GitHub dataset from Google BigQuery that contains data from almost 3 million Git repositories. We will leverage this data to compile a list of the 20 most commonly used initial commit messages in the dataset.
If you aren't familiar with initial commits, check out our article What is an Initial Commit in Git? before reading this one.
We'll start off by briefly explaining what Google BigQuery is.
Some background on Google BigQuery
Google BigQuery is a data warehouse hosted on the Google Cloud Platform that is accessible over web services. It was designed to host big data in a cloud environment and provide fast, convenient access to that data over the Internet.
As a part of the service, Google provides numerous public datasets for developers to experiment with. Many of these are useful to the scientific and research communities, including data from the Food and Drug Administration, the US Census, the National Oceanic and Atmospheric Administration, and of course, GitHub.
We'll be using the GitHub dataset in this article, which contains data from millions of public GitHub repositories. This data includes repository names, committed file names, commit messages, author names, timestamps, and more. We'll mainly be using the commit messages to gain some insight into the most common initial commit messages that developers use.
For more information on the GitHub dataset, see Google's overview of the GitHub Activity Data. If you're interested, this link includes a button to access the data via the BigQuery console. Note that you'll have to provide billing account details even though your access will be free.
Now let's jump in and query some Git data!
Writing SQL queries against Google BigQuery
Here is what Google's BigQuery interface looks like:
The bottom-left panel in the console lists the datasets that are available. Note that we have selected the github_repos dataset and expanded it. Each item in the expanded list represents a table in the dataset. We'll be using the commits table, which stores the commit message information we need in its message column. The commits table contains data from the 1990s all the way through the present.
The big white panel in the top-center of the screen is an editor that we can use to write and execute SQL queries against the datasets. Here is the query I used to fetch, group, and count the commit messages:
SELECT TRIM(LOWER(message)) AS commit_message, COUNT(*) AS commit_count
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800 AND author.date.seconds <= 1585800000
GROUP BY commit_message
ORDER BY commit_count DESC
LIMIT 15999;
This query groups together all commits with the same commit message and counts how many commits used each message. I used the LOWER() function so that differences in letter case didn't prevent the same messages from being grouped together. I also used the TRIM() function to remove any leading or trailing whitespace before grouping. I added a filter on the author.date.seconds column, which is compared against Unix epoch seconds, to bring back commits made between January 1st 2000 and early April 2020 (the upper bound of 1585800000 falls on April 2nd 2020 in UTC). I ordered the resulting commit messages by their frequency of occurrence, from highest to lowest. Finally, I limited the results to 15999 records since that is the maximum number that can be conveniently downloaded from the console interface (and far more than we need).
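To make the logic of the query concrete, here is a minimal local sketch in Python of the same steps: filter by the epoch-second window, normalize each message, count, and sort by frequency. It is not how the analysis was run (that was done in BigQuery); the function names and the (message, seconds) row shape are assumptions made for illustration, and the sample rows are made up.

```python
from collections import Counter
from datetime import datetime, timezone

# Epoch-second bounds copied from the WHERE clause of the query.
START_SECONDS = 946684800
END_SECONDS = 1585800000


def to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def top_messages(commits, limit=15999):
    """Count normalized commit messages that fall inside the date window.

    `commits` is an iterable of (message, author_seconds) pairs. This shape
    is a stand-in for the real table, not its actual schema.
    """
    counts = Counter(
        message.lower().strip()  # plays the role of TRIM(LOWER(message))
        for message, seconds in commits
        if START_SECONDS <= seconds <= END_SECONDS
    )
    # most_common() sorts by count, highest first, like ORDER BY ... DESC.
    return counts.most_common(limit)


# Made-up sample rows for illustration only.
sample = [
    ("Initial commit", 1500000000),
    ("  initial commit\n", 1500000001),
    ("First commit", 1500000002),
    ("initial commit", 900000000),  # before the window, so excluded
]

print(to_utc(START_SECONDS), to_utc(END_SECONDS))
print(top_messages(sample))
```

Printing the two bounds with to_utc() is a quick way to check which dates the numeric filter actually covers.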
Now let's move on to the findings!
The top 20 most popular initial commit messages
After running the above query, I looked through the first couple hundred results manually and picked out the ones that seemed related to initial commits. This was, of course, not an exact science. Here are the top 20 commit messages ranked by frequency of use:
|Commit Message|Count|% of Total|
|initial code commit|3,311|<1|
|first working version|3,055|<1|
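The selection behind this table was done by eye. As a rough illustration of how such a manual pass could be narrowed down first, here is a small keyword pre-filter in Python. The keyword list and function names are hypothetical and were not part of the original analysis, and the first two sample rows are made up.

```python
# Hypothetical keywords; the article's selection was done manually.
KEYWORDS = ("init", "first", "readme")


def looks_like_initial_commit(message):
    """Return True if the normalized message contains any keyword."""
    normalized = message.lower().strip()
    return any(keyword in normalized for keyword in KEYWORDS)


def shortlist(ranked):
    """Keep (message, count) rows worth a manual look, preserving order."""
    return [row for row in ranked if looks_like_initial_commit(row[0])]


# The first two rows are made up; the last two come from the table above.
ranked = [
    ("update readme.md", 50000),
    ("fix typo", 40000),
    ("initial code commit", 3311),
    ("first working version", 3055),
]

print(shortlist(ranked))
```

Note that "update readme.md" passes the filter even though it is unlikely to be an initial commit, which is why a keyword match can only produce a shortlist and a manual review is still needed.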
These top 20 commit messages account for a combined total of 2,353,331 commits. This number probably overestimates the number of initial commits in the sample, since some of these messages may have been used on later commits rather than initial ones. There is no real way to know, but it is probably pretty close.
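Each percentage in the table is a message's count divided by this combined total. A minimal Python sketch of that arithmetic follows, using only figures quoted in the article. The one-decimal formatting for shares of 1% or more is an assumption, and the function names are illustrative.

```python
TOTAL = 2_353_331  # combined count of the top 20 messages, from the article


def percent_of_total(count, total=TOTAL):
    """Return count as a percentage of total."""
    return 100 * count / total


def format_share(count, total=TOTAL):
    """Format a share, showing "<1" for values under one percent."""
    share = percent_of_total(count, total)
    # One decimal place for larger shares is an assumed display choice.
    return "<1" if share < 1 else f"{share:.1f}"


rows = [("initial code commit", 3311), ("first working version", 3055)]
for message, count in rows:
    print(message, format_share(count))
```

Both rows come out at roughly 0.1% of the total, which is why the table reports them as "<1".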
From the results, we can see that "initial commit" is by far the most popular message, representing 77% of the sample. The second most popular message is "first commit", representing 7.5%. The third-ranked message is "create readme.md" with 6.5%. This one could well be overestimated, since a good chunk of those commits were likely not the initial commit. The remaining results were mostly made up of the word "init" combined with another term.
So from what we can tell, the initial commit messages used in this dataset are strongly weighted toward "initial commit", with a small minority favoring "first commit" and a smattering of other options. As you can probably tell by the name of my website, I run with the herd on this one and favor "initial commit".
In this article we analyzed a public GitHub dataset from Google BigQuery to explore the most popular initial commit messages used by developers working with Git.