What is mainstream music?

Million Songs Dataset Exploration

Jingying Zhou, Yibo Zhu, Yimin Zhang, Ziyue Jin, Ziyue Wu

April 27, 2016

Data and Methodology

Data Integration

We mainly integrated four datasets:

We mainly analyzed clusters of songs to extract insights into mainstream music and its potential further applications.

  1. We applied sound characteristics (tempo, loudness, pitch, and 100 other features) to build a hierarchical clustering of songs.
  2. We used LDA to build a topic model of the lyrics, which gave clusters of songs based on their bag-of-words descriptions.
  3. We used the user play count data to work out the pairwise "distance" between songs, and from it calculated a user-defined similarity. Feature selection against this similarity then yields user-defined clusters.
  4. We visualized and discussed the statistically significant link between the lyrics and the sound characteristics of songs.
  5. Our results could form the basis of a good song recommendation system that incorporates user preference, sound characteristics, and lyrics preference.

Explore Our Dataset

Overview of our songs

Following our previous EDA, we were very curious about why mainstream music changes over time. What are the characteristics of a mainstream song? How does niche music differ from mainstream music? Do lyrics contribute to the difference? What are the potential applications of a better understanding of the Million Song Dataset?

Mainstream?

So here is a question: do you think "Little Apple" is a mainstream song?

How to Define Mainstream Music?

Sound Features + Lyrics = Music

Part1: Sound Cluster Analysis
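
As a rough illustration of hierarchical clustering on sound features, here is a minimal sketch using SciPy. The feature matrix is synthetic (two made-up groups of songs described only by tempo and loudness), and the Ward linkage and the cut into two clusters are assumptions for the example, not the settings used in our analysis.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Synthetic stand-in for per-song sound features: columns are tempo (BPM) and loudness (dB).
quiet_slow = rng.normal(loc=[90.0, -15.0], scale=[5.0, 1.0], size=(20, 2))
loud_fast = rng.normal(loc=[140.0, -5.0], scale=[5.0, 1.0], size=(20, 2))
features = np.vstack([quiet_slow, loud_fast])

# Standardize each feature so tempo and loudness contribute on the same scale.
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

# Build the cluster tree (Ward linkage is an assumption) and cut it into two clusters.
tree = linkage(standardized, method="ward")
sound_clusters = fcluster(tree, t=2, criterion="maxclust")

print(sound_clusters)
```

With real data, the rows would be songs and the columns the full set of sound features; the number of clusters would be chosen by inspecting the dendrogram.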

Part2: Lyrics Cluster Analysis

Topic modeling by 15 topics
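
A 15-topic model of this kind can be sketched with scikit-learn's `LatentDirichletAllocation`. The word-count matrix below is random and only stands in for a real bag-of-words matrix; assigning each song to its most probable topic is our assumed way of turning topics into clusters.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Synthetic bag-of-words matrix: 40 songs x 30 vocabulary words (word counts per song).
word_counts = rng.poisson(lam=1.0, size=(40, 30))

# Fit an LDA topic model with 15 topics, as in the analysis.
lda = LatentDirichletAllocation(n_components=15, random_state=0)
song_topics = lda.fit_transform(word_counts)  # rows: songs, columns: topic proportions

# Assumed clustering rule: each song belongs to its most probable topic.
lyric_clusters = song_topics.argmax(axis=1)

print(lyric_clusters)
```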

Comparison Between Sound Clusters and Lyrics Clusters

How are these two kinds of music clusters related?

The Chi-squared test of independence between the two sets of clusters gave a p-value of 2.2e-16, so we reject independence: lyrics are strongly correlated with sound characteristics.
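
The test can be reproduced in outline as follows. This is a minimal sketch with made-up cluster labels (2 sound clusters, 3 lyric topics, 60 songs); only the procedure, cross-tabulating the two labelings and calling `scipy.stats.chi2_contingency`, mirrors the analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical labels for 60 songs: a sound cluster and a lyric topic per song.
sound_cluster = np.repeat([1, 2], 30)
lyric_topic = np.concatenate([
    np.repeat([1, 2, 3], [20, 5, 5]),   # songs in sound cluster 1
    np.repeat([1, 2, 3], [5, 5, 20]),   # songs in sound cluster 2
])

# Cross-tabulate the two labelings into a contingency table of song counts.
sound_ids, sound_idx = np.unique(sound_cluster, return_inverse=True)
topic_ids, topic_idx = np.unique(lyric_topic, return_inverse=True)
table = np.zeros((len(sound_ids), len(topic_ids)), dtype=int)
np.add.at(table, (sound_idx, topic_idx), 1)

# Chi-squared test of independence between sound cluster and lyric topic.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(chi2, dof, p_value)
```

The same contingency table is what a bubble chart of cluster overlap displays: each cell count becomes the size of one bubble.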

The strong correlation between lyrics and sound helps us dig out insights on mainstream music. In the bubble figure above (classifier overlap), the topic-model clusters and the hierarchical sound clusters overlap most heavily in cells (1,1), (1,6), and (1,10). The largest bubbles are the most likely to represent mainstream music, because those clusters contain more songs that are similar to one another. This is how we define mainstream music. Furthermore, the co-occurrence means that mainstream songs share common characteristics in both sound and lyrics!

Here is a deeper interpretation of mainstream music.

Sound Features of Mainstream Music

  • From the sound feature analysis, we find that mainstream music tends to be louder and faster (higher loudness and higher tempo). It does not differ much in duration or pitch.

Lyrics of Mainstream Music

  • Mostly, the words that appear in mainstream music are common words.
  • In other words, if the lyrics of a song contain more common words, the song is more likely to be mainstream music, and therefore also more likely to be louder and faster.

Verification of Mainstream Music by Adding User Information

    Above, we used sound features and lyrics to build the clusters and then identified "mainstream music" from them. But are these songs really mainstream? We need user play count information to confirm it. If users treat a song as similar to many other songs, it is indeed mainstream music. So our goal in this section is to verify that users regard songs with common words, higher loudness, and higher tempo as mainstream music.

    Sound Features vs User defined Similarity

    User defined similarity monitoring model:
    If two songs are listened to by the same user, we take this as evidence of similarity between the two songs. But we need to distinguish similarity from popularity. So here is the modelling function:

    After calculating the similarity, we found that the similarity matrix is very large but sparse. So we used case-control sampling to cut down the computation while keeping the sample representative of the original song-to-song similarity matrix.
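
One common way to separate similarity from popularity is to normalize co-listening counts by the number of listeners of each song (cosine normalization). The sketch below uses that normalization purely as an assumed stand-in; it is not the modelling function used in our analysis. The play counts are made up, and the 1:1 ratio of cases (nonzero pairs) to controls (zero pairs) is also an assumption.

```python
import numpy as np

# Hypothetical play counts: rows are users, columns are songs.
play_counts = np.array([
    [3, 1, 0, 0],
    [5, 2, 0, 1],
    [1, 0, 4, 0],
    [2, 0, 0, 6],
])

# A user "listened" to a song if the play count is positive.
listened = (play_counts > 0).astype(float)

# co_listens[i, j] = number of users who listened to both song i and song j.
co_listens = listened.T @ listened
listeners = np.diag(co_listens)

# Assumed normalization: divide by sqrt(listeners_i * listeners_j) so that
# a song heard by everyone is not automatically similar to everything.
similarity = co_listens / np.sqrt(np.outer(listeners, listeners))
np.fill_diagonal(similarity, 0.0)

# Case-control subsetting of the sparse pairs: keep every nonzero pair (cases)
# and an equally sized random sample of zero pairs (controls), when available.
rows, cols = np.triu_indices(similarity.shape[0], k=1)
pair_values = similarity[rows, cols]
case_idx = np.flatnonzero(pair_values > 0)
zero_idx = np.flatnonzero(pair_values == 0)
rng = np.random.default_rng(0)
n_controls = min(len(case_idx), len(zero_idx))
control_idx = rng.choice(zero_idx, size=n_controls, replace=False)
sampled_pairs = np.concatenate([case_idx, control_idx])

print(np.round(similarity, 3))
print(list(zip(rows[sampled_pairs], cols[sampled_pairs])))
```

In this toy example, song 0 is heard by every user, yet its similarity to song 2 is only 0.5, because the normalization discounts its popularity.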

  • Sound features: independent variables
  • User-defined similarity: dependent variable
  • Case control to subset the sparse data

    Sound Feature Selection Based on User Play Count

    We use user play counts to select features, and then use the selected features to build the hierarchical clustering of songs. The result reflects user opinions well.
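
A minimal sketch of this two-stage idea, using LASSO for the selection step, might look like the following. Everything here is synthetic: the song features, the simulated similarity (which by construction depends only on features 0 and 1), the use of absolute feature differences between song pairs as predictors, and the penalty `alpha=0.05` are all assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_songs, n_features = 60, 5
song_features = rng.normal(size=(n_songs, n_features))

# One row per song pair: absolute difference in each sound feature.
i, j = np.triu_indices(n_songs, k=1)
pair_diffs = np.abs(song_features[i] - song_features[j])

# Simulated user-defined similarity: falls as features 0 and 1 differ more.
noise = rng.normal(scale=0.1, size=len(i))
pair_similarity = -1.0 * pair_diffs[:, 0] - 0.5 * pair_diffs[:, 1] + noise

# LASSO shrinks the coefficients of uninformative features to exactly zero.
lasso = Lasso(alpha=0.05).fit(pair_diffs, pair_similarity)
selected = np.flatnonzero(lasso.coef_)

# Hierarchical clustering of songs on the selected features only.
tree = linkage(song_features[:, selected], method="ward")
user_clusters = fcluster(tree, t=3, criterion="maxclust")

print(selected)
print(user_clusters)
```

A random forest can play the same role by ranking features by importance instead of zeroing out coefficients.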

    LASSO

    Random Forest:

    Most of the tracks that match our mainstream profile (louder, faster, common words) fall into the largest clusters of the models above (1154 of 1175, about 98%), which means those songs are indeed mainstream music.

    Conclusion

    1. Lyrics and sound features are highly correlated (p-value = 2.2e-16).
    2. Mainstream music tends to be louder and faster than niche music.

      Thus, this can also explain why jazz has become less popular over time.
    3. Adding user-defined similarity reinforced conclusion 2.
    4. Based on the models we built, we have three dimensions (sound, lyrics, and user-defined similarity) to describe a piece of music. These three dimensions can be used to recommend similar music.
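
To make the last point concrete, here is a minimal sketch of how three similarity matrices could be blended into a recommendation. The `recommend` helper, the equal weights, and the toy 4-song matrices are all hypothetical; we did not build or evaluate such a system.

```python
import numpy as np

def recommend(song, sound_sim, lyric_sim, user_sim, weights=(1/3, 1/3, 1/3), top_k=2):
    """Return the top_k songs most similar to `song` under a weighted blend."""
    combined = weights[0] * sound_sim + weights[1] * lyric_sim + weights[2] * user_sim
    scores = combined[song].copy()
    scores[song] = -np.inf  # never recommend the query song itself
    return np.argsort(scores)[::-1][:top_k].tolist()

# Toy song-by-song similarity matrices for 4 songs (values are made up).
sound_sim = np.array([[1.0, 0.9, 0.2, 0.1],
                      [0.9, 1.0, 0.3, 0.2],
                      [0.2, 0.3, 1.0, 0.8],
                      [0.1, 0.2, 0.8, 1.0]])
lyric_sim = np.array([[1.0, 0.8, 0.1, 0.3],
                      [0.8, 1.0, 0.2, 0.1],
                      [0.1, 0.2, 1.0, 0.7],
                      [0.3, 0.1, 0.7, 1.0]])
user_sim = np.array([[1.0, 0.7, 0.4, 0.2],
                     [0.7, 1.0, 0.1, 0.3],
                     [0.4, 0.1, 1.0, 0.9],
                     [0.2, 0.3, 0.9, 1.0]])

print(recommend(0, sound_sim, lyric_sim, user_sim))  # [1, 2]
```

The weights control how much each dimension matters; equal weights are only a starting point and would need tuning against real listening data.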