From Data to Dance: Analyzing Spotify Songs for Playlist Creation

    Background

    The Significance of Data-Driven Playlist Curation

    Using machine learning for playlist curation can create a coherent experience by maintaining the flow of the music and the emotional state it is meant to evoke. In addition, a machine learning model can simplify this potentially complex task by helping with the following aspects of playlist curation:

    • Personalization: With user inputs and/or listening data, playlists can be tailored to personal preferences. The same approach applies to a group setting, so that everyone can enjoy the party playlist and no one is left out.
    • Efficiency: Machine learning models simplify the work of building playlists with good flow, variety, and a consistent mode and tone.
    • Exploration: Songs that are less popular but relatively similar to the songs our employees enjoy can be introduced into the playlists, giving the participants of the dance party a positive, novel experience.
    • Variety: Although flow is important, a playlist without enough variety in genre, artists, or the music itself becomes redundant, which can make the party boring and lead to bad vibes.
    • Supporting Lesser-Known Artists: Supporting artists, especially good ones, is a goal of any audiophile, but some good artists have not attained popularity yet. By intermittently introducing songs that are similar to the popular ones throughout the playlist, we can help members of the dancing community discover new artists!

    Executive Summary

    Through exploratory analysis, we examined the similarities and dissimilarities between song characteristics as well as their distributions, and applied different clustering methods to group similar songs, genres, and artists together. Finally, a predictive model was created to determine the danceability of a song, and it was embedded into a recommendation system that uses user inputs to help users build a playlist.

    Methodologies of the Analysis

    In this analysis, a recommendation system will be created to make playlists for individuals. The analysis is divided into three sections. The first explores the distributions of important variables regarding the characteristics of a song and its popularity, and its insights guide the other sections. Next, the similarity between genres is explored in order to add variety to the playlists, and the same is done for artists.

    A Deep Dive into Song Characteristics

    Distribution of song popularity

    In the distribution of the variable Popularity, there is clearly a large disparity in song popularity, with most songs having a popularity of around 0. This means that some songs will share traits with popular songs while still being unpopular, which creates an opportunity to introduce these less popular songs to a set of possible listeners.

    When looking at the distribution of danceability, the data is relatively close to a normal distribution with a possibly negligible skew to the left, despite the values being bounded between 0 and 1. The majority of songs have a danceability score between 0.5 and 0.75.
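    The shape described above can be checked numerically as well as visually. The following is a minimal sketch, assuming the danceability scores are available as a numeric array; the data here is synthetic (drawn from a beta distribution purely for illustration), and the helper name `summarize_distribution` is hypothetical rather than part of the original analysis.

```python
import numpy as np


def summarize_distribution(values, low=0.5, high=0.75):
    """Summarize a bounded score: center, skewness, and share inside [low, high]."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    # Moment-based skewness: negative values indicate a longer left tail.
    skewness = np.mean(centered ** 3) / x.std() ** 3
    return {
        "mean": float(x.mean()),
        "median": float(np.median(x)),
        "skewness": float(skewness),
        "share_in_range": float(np.mean((x >= low) & (x <= high))),
    }


# Synthetic stand-in for the danceability column (an assumption, not the report's data).
rng = np.random.default_rng(0)
danceability = rng.beta(5, 3, size=10_000)

summary = summarize_distribution(danceability)
print(summary)
```

    A skewness close to zero combined with a large share of values in the 0.5 to 0.75 band would be consistent with the description above.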

    Relationships Between Song Characteristics

    The correlation matrix presented above shows the linear correlations between the song characteristics. Most of the correlations are statistically significant, although only a handful of variable pairs move relatively strongly together or against each other:

    • energy and loudness (0.765)
    • acousticness and energy (-0.736)
    • danceability and valence (0.478)

    Moreover, since none of the variables are very highly correlated, none of them will be removed due to collinearity during the model-building process.

    In addition, the scatter plots (shown in the lower triangular part of the matrix) show a lot of variation within each variable, and in most cases there doesn't appear to be a visually distinct relationship between pairs of variables.
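    As a rough illustration of how the strongest pairs can be pulled out of a correlation matrix, here is a small sketch using NumPy. The data is synthetic and only mimics the direction of the relationships reported above; the `strong_pairs` helper and the 0.45 threshold are assumptions made for this example.

```python
import numpy as np


def strong_pairs(data, names, threshold=0.45):
    """Return (name_a, name_b, r) for variable pairs with |Pearson r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = float(corr[i, j])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], round(r, 3)))
    # Strongest relationships first, regardless of sign.
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Synthetic features (assumption): they copy the sign of the reported
# relationships, not the actual Spotify values.
rng = np.random.default_rng(1)
energy = rng.normal(size=5_000)
loudness = 0.8 * energy + rng.normal(scale=0.6, size=5_000)
acousticness = -0.7 * energy + rng.normal(scale=0.6, size=5_000)
valence = rng.normal(size=5_000)

names = ["energy", "loudness", "acousticness", "valence"]
data = np.column_stack([energy, loudness, acousticness, valence])
pairs = strong_pairs(data, names)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r}")
```

    The same function could also flag collinearity candidates by raising the threshold, for example to 0.9.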

    Further Exploration of Variables Correlated with Danceability

    After taking a closer look at some of the song characteristics plotted against danceability, some of them may need a transformation, depending on the models being used, as there appear to be non-linear relationships between danceability and the variables acousticness, energy, and tempo. Other interesting observations:

    • Energy and acousticness appear to have a curvilinear relationship with danceability.
    • Speechiness appears to have two visible groups. The smaller group, with speechiness values close to 1, sits near the peak of the danceability distribution (between 0.5 and 0.75).
    • There is an interesting curvilinear relationship between tempo and danceability: most of the danceable songs have a tempo between roughly 100 and 150 beats per minute, and the relationship almost creates a lower bound for danceability.
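    One simple way to test whether a curvilinear term is worth adding is to compare the fit of a straight line against a quadratic. The sketch below does this with NumPy on synthetic tempo and danceability values; the peak near 125 BPM is an assumption chosen only to imitate the shape described above, and `r_squared` is a hypothetical helper.

```python
import numpy as np


def r_squared(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = (x - x.mean()) / x.std()  # standardize for a well-conditioned fit
    predicted = np.polyval(np.polyfit(z, y, degree), z)
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic data (assumption): danceability peaks near 125 BPM and falls off
# on both sides, with some noise added.
rng = np.random.default_rng(2)
tempo = rng.uniform(60, 200, size=2_000)
danceability = 0.7 - ((tempo - 125) / 100) ** 2 + rng.normal(scale=0.05, size=2_000)

r2_linear = r_squared(tempo, danceability, degree=1)
r2_quadratic = r_squared(tempo, danceability, degree=2)
print(f"linear R^2 = {r2_linear:.3f}, quadratic R^2 = {r2_quadratic:.3f}")
```

    A large gap between the two scores suggests that a squared term, or a model that handles non-linearity natively, may suit that variable better than a purely linear one.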

    Analyzing Song Characteristics using PCA

    Principal component analysis (PCA) was also conducted to look at the relationships between the different variables and how much influence each one has on the principal components. The variables with the most importance (in order of influence):

    • Energy
    • Acousticness
    • Loudness
    • Danceability
    • Valence
    • Instrumentalness

    The variables that move together:

    • Danceability, valence, and popularity
    • Loudness, Energy, tempo and liveness

    The variables that move against each other:

    • Acousticness and (Loudness and Energy)
    • Instrumentalness and (Danceability and valence)

    In addition, looking at the amount of variance captured by each component, the first three components capture more than 50% of the variance within the song characteristics, while six components capture 80%. The components will be added to the final model depending on their ability to improve the predictive models.
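    The cumulative variance figures quoted above can be computed directly from the singular values of the standardized data. The following sketch implements PCA through an SVD in NumPy, which is one common way to do it and not necessarily the method used in the hidden cells; the data is synthetic and the function names are hypothetical.

```python
import numpy as np


def pca_explained_variance(data):
    """PCA via SVD on standardized columns; returns variance ratios and loadings."""
    x = np.asarray(data, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    _, singular_values, components = np.linalg.svd(z, full_matrices=False)
    variances = singular_values ** 2
    return variances / variances.sum(), components


def components_needed(ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(ratios)
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(ratios))


# Synthetic data (assumption): 10 features driven by 3 latent factors plus noise.
rng = np.random.default_rng(3)
latent = rng.normal(size=(2_000, 3))
mixing = rng.normal(size=(3, 10))
features = latent @ mixing + rng.normal(scale=0.8, size=(2_000, 10))

ratios, loadings = pca_explained_variance(features)
n_half = components_needed(ratios, 0.50)
n_most = components_needed(ratios, 0.80)
print(f"components for 50%: {n_half}, for 80%: {n_most}")
```

    Each row of `loadings` gives the weight of every original variable on one component, which is where statements about variables moving together or against each other come from.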

    Clustering and Visualization of Song Characteristics

    In the realm of music, there are many different genres, and each artist puts their own artistic flavor into each song they produce. Moreover, multiple artists can produce a song together. For these reasons, songs can have similar characteristics despite belonging to different genres, being created by different musicians and artists, or being created in different eras or at different points in an artist's life. Clustering methods can capture these similarities, making it possible to suggest similar music despite these differences.

    The elbow plot above was produced by fitting 15 k-means models with 1 to 15 centers. Although there is no distinct elbow in the plot, it does appear that after 8 clusters each additional cluster yields a smaller decrease in the within-cluster sum of squares. Thus, 8 clusters will be used for the song-level clustering.
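    The elbow procedure described above can be sketched with scikit-learn as follows. This is an illustration on synthetic blobs, not the original code; standardizing the features first, the `n_init` value, and the `elbow_curve` helper are all assumptions of this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def elbow_curve(features, max_k=15, seed=0):
    """Within-cluster sum of squares (inertia) for k = 1..max_k."""
    scaled = StandardScaler().fit_transform(features)
    inertias = []
    for k in range(1, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)
        inertias.append(float(model.inertia_))
    return inertias


# Synthetic data (assumption): 8 Gaussian blobs in 5 dimensions.
rng = np.random.default_rng(4)
centers = rng.uniform(-6, 6, size=(8, 5))
features = np.vstack([c + rng.normal(size=(100, 5)) for c in centers])

inertias = elbow_curve(features)
drops = -np.diff(inertias)  # improvement gained by each additional cluster
for k, wcss in enumerate(inertias, start=1):
    print(k, round(wcss, 1))
```

    Plotting `inertias` against k gives the elbow plot; `drops` makes it easier to see where the gains from adding another cluster start to level off.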

    Although the preferred clustering method in this situation was originally DBSCAN (Density-Based Spatial Clustering of Applications with Noise), it proved infeasible, even after downsampling the data, because of the computational time, the size of the dataset, and the difficulty of determining a proper epsilon.
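    For context on the epsilon problem mentioned above, a common heuristic is to sort each point's distance to its k-th nearest neighbor and look for a knee in that curve. The sketch below applies the heuristic to a downsample of synthetic data with scikit-learn; taking the 95th percentile as epsilon is a crude assumption standing in for reading the knee off a plot, and none of this is taken from the original analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (self included)."""
    neighbors = NearestNeighbors(n_neighbors=k).fit(points)
    distances, _ = neighbors.kneighbors(points)
    return np.sort(distances[:, -1])


# Synthetic data (assumption): two dense groups, then a random downsample.
rng = np.random.default_rng(5)
full = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(3_000, 2)),
    rng.normal(loc=3.0, scale=0.3, size=(3_000, 2)),
])
sample = full[rng.choice(len(full), size=1_000, replace=False)]

min_samples = 5
curve = k_distances(sample, k=min_samples)
eps = float(np.percentile(curve, 95))  # crude stand-in for a visually chosen knee
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(sample)
n_clusters = len(set(labels) - {-1})  # label -1 marks noise points
print(f"eps = {eps:.3f}, clusters found = {n_clusters}")
```

    On a large, high-dimensional dataset the neighbor search itself becomes expensive, which is consistent with the feasibility concerns raised above.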
