Trucks and Beer 🍺

Inspired by a post on Big-ish Data, I’ve started working on a textual analysis of popular country music.

More specifically, I scraped a list of the top female and male country artists of the last 100 years and used my Python wrapper for the Genius API to download the lyrics to every song by each artist on the list. After my script ran for about six hours, I was left with the lyrics to 12,446 songs by 83 artists, stored in a 105 MB JSON file. As a bit of an outsider to the world of country music, I was curious whether some of my preconceived notions about the genre were true.

Some pertinent questions:

You can find my code for this project on GitHub.


If you like beer, you may also like…

I’m interested in whether an artist’s tendencies to use certain terms are correlated. For example, if an artist is more likely to mention beer in their songs, are they also more likely to mention trucks?

It turns out that yes, they are.

Each point on the two plots below represents a single artist. The value for each point is the percentage of the artist’s songs that mention a particular term. For example, I had 46 songs by Cole Swindell in my database, and he mentioned beer in 24 of them, which works out to 52%.
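Here is a minimal sketch of that per-artist calculation. It is an illustration rather than the exact code behind the plots: the function names are made up, and I’m assuming a song "mentions" a term if the word (or its simple plural) appears anywhere in the lyrics as a whole word.

```python
import re


def mentions(lyrics: str, term: str) -> bool:
    """True if the term (or its simple plural) appears as a whole word.

    Assumption: whole-word, case-insensitive matching; the post does not
    say exactly how a "mention" was detected.
    """
    pattern = rf"\b{re.escape(term)}s?\b"
    return re.search(pattern, lyrics, flags=re.IGNORECASE) is not None


def mention_rate(songs: list[str], term: str) -> float:
    """Percentage of songs that mention the term at least once."""
    if not songs:
        return 0.0
    hits = sum(mentions(lyrics, term) for lyrics in songs)
    return 100 * hits / len(songs)


# Toy strings standing in for one artist's catalogue (not real lyrics).
songs = [
    "Cold beer on a Friday night",
    "My truck and two beers",
    "Just a love song",
]
print(round(mention_rate(songs, "beer"), 1))
```

With 24 matching songs out of 46, the same function gives the 52% quoted for Cole Swindell.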

Think about that. Cole Swindell mentions beer in more than half of his songs. You can also count on him referencing trucks in roughly one of every five songs. Dustin Lynch turned out to be the artist who sang about trucks most often, with the word truck appearing in 23.8% of his songs.
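To put a number on how strongly two terms move together across artists, one option is the Pearson correlation of the per-artist percentages. The sketch below is an assumption on my part about how you might measure it (the post doesn’t name a specific statistic), and the percentages in it are invented placeholders, not values from the dataset.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-artist percentages (one entry per artist).
beer_pct = [52.0, 10.0, 0.0, 30.0]
truck_pct = [20.0, 5.0, 0.0, 12.0]

r = pearson(beer_pct, truck_pct)
print(f"beer vs. truck: r = {r:.2f}")
```

A value near 1 means artists who mention one term more often also tend to mention the other; a value near -1 would describe the opposite pattern.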


I’ve also added the artists’ genders to the plot. I’ll need to do some more analysis, but there does appear to be a relationship between an artist’s gender and their tendency to sing about trucks and beer. It’s hard to tell from this plot, but roughly half of the female artists in my dataset are actually stacked on top of each other at the origin, meaning they didn’t mention beer or trucks in any of their songs.

Check out the relationship between an artist’s use of the words girl and love. There’s an even more obvious trend with gender here. The more often a male country singer uses the word girl in his songs, the less likely he is to mention love. Interesting.


Love is falling out of fashion

I also wanted to look at how vocabulary changes over time for all country artists. We know from the plot above that the more often a male country artist sings about girls, the less likely he is to sing about love. The above plots combined songs from all years. I wonder if we’ll see different effects after separating songs by the year they were published.

The plot below displays the percentage of songs mentioning a given term for each year, excluding years that had fewer than ten songs in my database. It looks like it’s becoming less common for country artists to sing about love. The correlation isn’t all that strong, but there is a noticeable downward trend.
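The yearly numbers can be sketched like this: group songs by year, drop any year with fewer than ten songs, and compute the share of songs that mention the term. The input format (a list of `(year, lyrics)` pairs) and the function name are assumptions for illustration, not the structure of my actual JSON file.

```python
import re
from collections import defaultdict


def yearly_mention_rate(songs, term, min_songs=10):
    """Return {year: percent of that year's songs mentioning the term}.

    songs: iterable of (year, lyrics) pairs (assumed format).
    Years with fewer than min_songs songs are left out.
    """
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, lyrics in songs:
        totals[year] += 1
        if pattern.search(lyrics):
            hits[year] += 1
    return {
        year: 100 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year] >= min_songs
    }


# Toy data: ten songs from 2015, only three from 1950.
songs = (
    [(2015, "love you forever")] * 6
    + [(2015, "my old truck")] * 4
    + [(1950, "love me")] * 3
)
print(yearly_mention_rate(songs, "love"))
```

In the toy data, 1950 is dropped because it has only three songs, which is the same filter described above.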


I also found it funny that as love gets mentioned less frequently, it’s becoming more common for country songs to include the word girl. And, if we remember from the plot of girl vs. love above, we know it’s primarily men who are driving the rise in popularity of this term. The popularity of the word boy hasn’t changed much over time, appearing in about 12% of songs each year.

Country music is getting more repetitive

After reading Kaylin Walker’s excellent post “50 Years of Pop Music”, I decided to look at how vocabulary sizes have changed over time. Similar to what Kaylin found when looking at Billboard hits from the last 50 years, it appears that, while the correlation is weak, the average total word count for country songs has increased with time. The average unique word count has also increased with time, but only slightly.


In other words, country lyrics have become more repetitive over the years.

To dig a little deeper, I took a look at the lexical diversity of the lyrics over time. Lexical diversity is a fancy term for a simple concept: what percentage of the words in a body of text are unique? A body of text where each word is only used once would have the highest possible lexical diversity with a value of 1.
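In code, lexical diversity is just the number of unique words divided by the total number of words. The sketch below assumes a very simple tokenizer (lowercase, letters and straight apostrophes only); a different tokenizer would shift the exact values a little.

```python
import re


def tokenize(lyrics: str) -> list[str]:
    """Split lyrics into lowercase words (assumed, deliberately simple)."""
    return re.findall(r"[a-z']+", lyrics.lower())


def lexical_diversity(lyrics: str) -> float:
    """Unique words divided by total words; 0.0 for empty lyrics."""
    words = tokenize(lyrics)
    if not words:
        return 0.0
    return len(set(words)) / len(words)


print(lexical_diversity("one two three"))        # every word is unique
print(lexical_diversity("yeah yeah yeah yeah"))  # one word repeated four times
```

The first call returns the maximum value of 1, while the second shows how quickly repetition drags the score down.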


Sure enough, if we look from 2002 to the present (where the effect is most pronounced), there is a clear downward trend in lexical diversity. This trend holds for pop music in general, as illustrated by this excellent post from The Pudding on the repetitive nature of pop lyrics. Might this be an indication that country music has gotten more poppy over time?

And the most repetitive artist is…

At the start of this post I asked which artists have the most and least diverse lyrics. Lexical diversity can give us some insight into this question. For each artist, I calculated the average lexical diversity across all of their songs. The figure below shows the distribution of these lexical diversity values. The average country artist has a lexical diversity of 0.50. There are a number of ways to arrive at 0.50, but one intuition for the value is that about half the words in a song are unique.
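Roughly, the per-artist score is the mean of the per-song scores, and the most and least diverse artists are the two ends of that list. The sketch below uses made-up artists and lyrics, and the dictionary layout is an assumption for illustration.

```python
import re
from statistics import mean


def lexical_diversity(lyrics: str) -> float:
    """Unique words divided by total words (simple assumed tokenizer)."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    return len(set(words)) / len(words) if words else 0.0


def artist_diversity(songs_by_artist: dict[str, list[str]]) -> dict[str, float]:
    """Average per-song lexical diversity for each artist with songs."""
    return {
        artist: mean([lexical_diversity(song) for song in songs])
        for artist, songs in songs_by_artist.items()
        if songs
    }


# Hypothetical artists and toy lyrics, not data from the post.
songs_by_artist = {
    "Artist A": ["yeah yeah yeah yeah", "na na hey hey"],
    "Artist B": ["one two three", "go go home now"],
}

scores = artist_diversity(songs_by_artist)
least = min(scores, key=scores.get)
most = max(scores, key=scores.get)
print(least, scores[least])
print(most, scores[most])
```

Averaging per song (rather than pooling all of an artist’s lyrics into one text) keeps artists with large catalogues from being penalized for reusing common words across songs.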


So, to answer the question, “whose lyrics are most diverse?”, we can just look at the extremes of the distribution. Kane Brown comes in at the bottom with a lexical diversity score of 0.39, and Kellie Pickler is sitting on top with 0.59. To try and get a sense for what the two artists sing about, I generated word clouds from their respective sets of lyrics.


It’s difficult to draw much from the word clouds by themselves, but it’s a fun way to get a quick impression of which words an artist uses most often. Kane Brown’s favorite words are girl, back, know, and yeah. Kellie Pickler’s are love, know, want, and go.
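A word cloud is essentially a picture of word frequencies, so a plain counter gives the same "favorite words" information. This is a sketch under assumptions: the stop-word list here is a tiny hypothetical one, and the post doesn’t specify which words were filtered out before the clouds were drawn.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"i", "you", "the", "a", "and", "to", "it", "my", "me", "in", "on", "of"}


def top_words(songs: list[str], n: int = 4) -> list[str]:
    """The n most frequent non-stop-words across a set of lyrics."""
    counts = Counter()
    for lyrics in songs:
        words = re.findall(r"[a-z']+", lyrics.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


# Toy lyrics, not real songs.
songs = ["Girl you know I want you back", "Yeah girl yeah", "I know girl"]
print(top_words(songs, 3))
```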

Days of the week


This one is pretty straightforward. Looks like we’re all just livin’ for the weekend.
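One way to get these counts is to tally how many songs mention each day of the week at least once. That counting rule is an assumption for this sketch (counting every occurrence would be another reasonable choice), and the lyrics below are toy strings.

```python
import re
from collections import Counter

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
# Whole-word match on a day name, optionally pluralized ("Fridays").
DAY_PATTERN = re.compile(r"\b(" + "|".join(DAYS) + r")s?\b", re.IGNORECASE)


def songs_per_day(songs: list[str]) -> dict[str, int]:
    """Number of songs mentioning each day at least once."""
    counts = Counter({day: 0 for day in DAYS})
    for lyrics in songs:
        # A set ensures each day is counted once per song.
        for day in {match.lower() for match in DAY_PATTERN.findall(lyrics)}:
            counts[day] += 1
    return dict(counts)


# Toy lyrics, not real songs.
songs = [
    "Friday night and Saturday too, Friday again",
    "Fridays are for the boys",
    "Sunday morning",
]
print(songs_per_day(songs))
```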

Making predictions

So, we’ve collected a bunch of country lyrics, and we’ve started to notice some interesting trends in the data. Can these trends provide any further insight into the genre? Might we be able to use them to make inferences about the artists? Could we, for example, given a set of song lyrics, guess the gender of the artist who wrote them?

To make these sorts of predictions and inferences from the lyrics, we’ll need to identify useful features and train a classifier, both of which I’ll go into detail on in my next post. Stay tuned!

To my complete surprise, this post was picked as a winner in the 2018 Pudding Cup!

If you aren’t already familiar with The Pudding’s incredible data storytelling work, go check them out.



Perpetually inquisitive Data Scientist.

John W. Miller © 2024