Can you predict song popularity?

This week we presented our personal projects at Zipfian Academy. It was a lot of fun to see the diversity of topics and analyses.

And here is my project:

Can you predict popularity?

Like many pop-culture-driven fields, pop music is driven by hits, and hits follow a particular pattern:

[Figure: Screen Shot 2013-11-21 at 3.30.31 PM]

Many companies (Hit Song Science, Mixcloud, MusicXray, and BandMetrics) use Music Information Retrieval (MIR) techniques to quantify the audio features that make a song a hit. For my project, I wanted to see what could be done by analyzing lyrics alone. Writing is the beginning of the hit song production pipeline. By identifying hits at this first stage, one could save money by not wasting time in the studio recording demos with a producer.

To decide what counts as a hit, one needs a metric. To measure popularity, I first looked at The Echo Nest Taste Profile Subset, which has over 48 million (user, song, play count) triplets from undisclosed third-party internet radio companies. While I found some interesting statistics regarding plays, I couldn’t find an adequate signal for measuring popularity. So I went to the US gold standard, the Billboard 500. Billboard 500 uses Nielsen SoundScan to integrate music sales (from those of us who still buy music), radio play, and the myriad music-based activities in the cloud.
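For anyone who wants to poke at the play data themselves, here is a minimal sketch of how the triplets could be summarized per song. It assumes each record is a tab-separated line of user, song, and play count; the function name `play_stats` is made up for illustration and is not the code I ran for the project.

```python
from collections import Counter

def play_stats(lines):
    """Aggregate 'user<TAB>song<TAB>play_count' triplets per song.

    Assumed layout: one triplet per line, tab-separated.
    """
    total_plays = Counter()  # song -> summed play count
    listeners = Counter()    # song -> number of triplets (one per user if pairs are unique)
    for line in lines:
        user, song, plays = line.rstrip("\n").split("\t")
        total_plays[song] += int(plays)
        listeners[song] += 1
    return total_plays, listeners

# Tiny made-up sample, not real Taste Profile data
sample = ["u1\tsA\t3\n", "u2\tsA\t1\n", "u1\tsB\t5\n"]
total_plays, listeners = play_stats(sample)
print(total_plays.most_common(1))
```

Counts like these describe how often a song was played, which is not necessarily the same thing as being a hit.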

The data looks like this:

[Figure: the data]

Echo Nest graciously released a million songs with Echo Nest attributes for research. Since I’m interested in lyrics, I looked at the musiXmatch lyric set of over 230k lyrics, provided in stemmed, bag-of-words format. I found Billboard 500 rankings from the 1920s to 2013. Of the ranked songs, around 5,000 were represented in the musiXmatch dataset. These became my positive training examples for popularity. For the negative training examples, I chose songs at random from the larger musiXmatch set.
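The labeling step can be sketched roughly as follows. This assumes the bag-of-words file uses lines starting with `#` for comments, one line starting with `%` for the vocabulary, and records of the form `track_id,mxm_id,index:count,...` for songs. The names `parse_bow_line`, `build_training_set`, and `hit_ids` are hypothetical, and drawing as many negatives as positives is an assumption made for this illustration, not a detail of the project.

```python
import random

def parse_bow_line(line):
    """Parse one 'track_id,mxm_id,idx:cnt,...' record into (track_id, {idx: cnt})."""
    parts = line.strip().split(",")
    track_id = parts[0]
    counts = {}
    for pair in parts[2:]:  # parts[1] is the musiXmatch ID, unused here
        idx, cnt = pair.split(":")
        counts[int(idx)] = int(cnt)
    return track_id, counts

def build_training_set(lines, hit_ids, seed=0):
    """Label charting tracks 1 and a random sample of the other tracks 0."""
    positives, others = [], []
    for line in lines:
        # Skip blank lines, comments (#), and the vocabulary line (%)
        if not line.strip() or line[0] in "#%":
            continue
        track_id, counts = parse_bow_line(line)
        (positives if track_id in hit_ids else others).append(counts)
    rng = random.Random(seed)
    # Assumption: balanced classes, one negative per positive where possible
    negatives = rng.sample(others, min(len(positives), len(others)))
    return [(c, 1) for c in positives] + [(c, 0) for c in negatives]

# Made-up records for illustration only
sample = ["# comment", "%i,the,you,love", "TR1,101,1:3,4:2", "TR2,102,2:5", "TR3,103,3:1,4:1"]
data = build_training_set(sample, hit_ids={"TR1"})
```

Here `hit_ids` stands in for the set of track IDs that matched a chart entry. Matching chart titles to track IDs is its own problem and is not shown.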

[Figure: Screen Shot 2013-11-22 at 4.46.43 PM]

Overall, popular songs have fewer word occurrences even though song lengths are comparable. First and second person are used more frequently, as is the word “love”. This is not surprising. I used the posterior probabilities to find the features that best distinguish popular from unpopular songs. “Louder” was more frequent in popular songs. Research has shown that popular songs are getting louder; I didn’t realize that they are “louder” as well.
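One way to rank distinguishing words is sketched below. It assumes a multinomial naive Bayes view of the bag-of-words counts with Laplace smoothing, which is a stand-in chosen for illustration and not necessarily the exact model behind the numbers above. It scores each word by the log ratio of its smoothed probability in popular versus unpopular songs; with equal class priors, that ratio is also the posterior odds for a single word. The function names are hypothetical.

```python
import math
from collections import Counter

def word_log_odds(examples, alpha=1.0):
    """Return {word: log P(word | popular) - log P(word | unpopular)}.

    examples: iterable of (counts, label), where counts is a {word: count}
    dict and label is 1 (popular) or 0 (unpopular).
    alpha: Laplace smoothing constant (an assumed default).
    """
    totals = {0: Counter(), 1: Counter()}
    for counts, label in examples:
        totals[label].update(counts)  # adds the per-song counts to the class totals
    vocab = set(totals[0]) | set(totals[1])
    denom = {y: sum(totals[y].values()) + alpha * len(vocab) for y in (0, 1)}
    return {
        w: math.log((totals[1][w] + alpha) / denom[1])
           - math.log((totals[0][w] + alpha) / denom[0])
        for w in vocab
    }

def top_words(log_odds, n=10):
    """Words most indicative of the popular class, strongest first."""
    return sorted(log_odds, key=log_odds.get, reverse=True)[:n]

# Made-up toy counts, not results from the project
examples = [
    ({"love": 3, "you": 2}, 1),
    ({"louder": 2, "love": 1}, 1),
    ({"rain": 4, "you": 1}, 0),
    ({"road": 2, "love": 1}, 0),
]
scores = word_log_odds(examples)
print(top_words(scores, n=2))
```

Sorting the same scores in ascending order gives the words most indicative of unpopular songs.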