It’s Party Time

    1. Introduction

    It's that vibrant time of year again - Summer has arrived (for those of us in the Northern Hemisphere at least)! There's an energy in the air that inspires us to get up and move. In sync with this exuberance, your company has decided to host a dance party to celebrate. And you, with your unique blend of creativity and analytical expertise, have been entrusted with the crucial task of curating a dance-themed playlist that will set the perfect mood for this electrifying night. The question then arises - How can you identify the songs that would make the attendees dance their hearts out? This is where your coding skills come into play.

    2. Executive Summary

    N.B. Please do not use reader mode; some outputs will not be displayed.

    2.1 The task

    The task is to devise an analytically-backed, dance-themed playlist for the company's summer party.

    • Use descriptive statistics and data visualization techniques to explore the audio features and understand their relationships.
    • Develop and apply a machine learning model that predicts a song's danceability.
    • Interpret the model outcomes and utilize the data-driven insights to curate your ultimate dance party playlist of the top 50 songs according to the model.
    • The choices must be justified with a comprehensive report explaining the methodology and reasoning.

    2.2 The methodology

    Dataset Inspection and Cleaning

    The analysis began with a comprehensive review of the provided dataset, which involved standard procedures for extracting information, inspecting basic statistics, and cleaning the data. The dataset comprises 20 columns and 113,027 entries, and a diligent assessment was made to ensure data integrity and validity. This preliminary phase involved a series of pivotal observations and proactive measures:

    • Handling Missing Values: A mere three missing values were identified, all associated with track_id = 1kR4gIb7nGxHPI3D2ifs59. These missing entries were promptly removed to ensure data integrity.
    • Duplicate Rows: A total of 444 duplicate rows were detected and subsequently removed from the dataset.
    • Duplicate Tracks by the Same Artist: An examination was conducted to ascertain whether identical tracks were featured in multiple albums attributed to the same artist. This analysis identified 32,009 duplicated songs that appeared on multiple albums by the same artist. These duplicates were removed to maintain a single entry per song in the dataframe.
    • Erroneous Data: Certain columns, namely tempo and time_signature, exhibited values equal to zero, which is implausible. These entries were deemed erroneous and were consequently eliminated.
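    The cleaning steps above can be sketched in pandas roughly as follows. The dataframe here is a toy stand-in for spotify.csv, and the column pair used for deduplication (artists plus track_name) is an assumption about how "same track by the same artist" was identified:

```python
import pandas as pd

# Toy stand-in for spotify.csv (column names follow the report).
df = pd.DataFrame({
    "track_id": ["a", "a", "b", "c", "c", "d"],
    "artists": ["X", "X", "Y", "Z", "Z", None],
    "track_name": ["s1", "s1", "s2", "s3", "s3", "s4"],
    "album_name": ["A1", "A1", "A2", "A3", "A4", "A5"],
    "tempo": [120.0, 120.0, 0.0, 98.5, 98.5, 110.0],
    "time_signature": [4, 4, 4, 3, 3, 4],
})

# 1. Drop rows with missing values.
df = df.dropna()

# 2. Drop exact duplicate rows.
df = df.drop_duplicates()

# 3. Keep a single entry per (artists, track_name) pair, i.e. drop the
#    same song released on multiple albums by the same artist.
df = df.drop_duplicates(subset=["artists", "track_name"])

# 4. Remove implausible zero values in tempo and time_signature.
df = df[(df["tempo"] > 0) & (df["time_signature"] > 0)]

print(len(df))
```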

    Numeric Column Distribution Analysis

    A detailed analysis of the distribution of numeric columns was conducted, as depicted in Section 4.2 Fig 1. Key findings from this analysis include:

    • None of the columns appeared to follow a normal distribution.
    • duration_ms and speechiness exhibited right-skewed distributions, while loudness skewed to the left.
    • Columns such as popularity, speechiness, acousticness, instrumentalness, and liveness displayed positive asymmetry (right skew).
    • Conversely, danceability and energy showcased negative asymmetry (left skew).
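    Skew direction can be checked numerically rather than only visually. A sketch with synthetic stand-ins for the real columns (the distributions are invented purely to show the sign convention):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One right-skewed column (long right tail, positive skew) and one
# left-skewed column (long left tail, negative skew).
df = pd.DataFrame({
    "speechiness": rng.exponential(scale=0.1, size=1000),
    "loudness": -rng.exponential(scale=5.0, size=1000),
})

skew = df.skew()
print(skew.round(2))
```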

    Handling Outliers

    The analysis identified a substantial number of outliers, totaling 57,322 instances, accounting for approximately 71% of the data post-cleaning. Given the significant information loss that their wholesale removal would cause, outliers were removed only where they constituted 5% or less of the data, and the remainder were retained in the dataframe.

    Following this outlier management strategy, the revised dataframe contained 76,136 entries.
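    A common way to quantify outliers is Tukey's IQR fence. The sketch below applies it to a toy column and removes the flagged values only when they make up at most 5% of the data, mirroring the threshold described above; the report does not show its exact detection rule, so the 1.5×IQR fence is an assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# 980 typical values plus 20 injected extreme values.
s = pd.Series(np.concatenate([rng.normal(0, 1, 980), rng.normal(15, 1, 20)]))

# Tukey's rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
share = mask.mean()

# Remove outliers only when they make up at most 5% of the column.
cleaned = s[~mask] if share <= 0.05 else s
print(round(share, 3), len(cleaned))
```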

    Bivariate Analysis

    A bivariate analysis was conducted to explore potential linear relationships among variables; no strong linear relationships emerged, with the notable exception of a moderate positive relationship between energy and loudness (Section 4.4 Fig 2).

    Given the aim of constructing a predictive model for track danceability, an inspection of its correlation with other features in the dataset was executed using the Pearson correlation matrix (Section 4.4 Fig 3). Key findings included:

    • Moderate positive correlation with valence.
    • Slight positive correlation with loudness.
    • Slight negative correlation with acousticness and instrumentalness.
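    These correlations come from a Pearson matrix such as the one sketched below. The data is synthetic, with the signs of the relationships wired in purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500

# Toy features: valence pushes danceability up, acousticness moves against it.
valence = rng.uniform(0, 1, n)
danceability = np.clip(0.4 * valence + 0.3 + rng.normal(0, 0.15, n), 0, 1)
acousticness = np.clip(0.9 - 0.5 * danceability + rng.normal(0, 0.2, n), 0, 1)

df = pd.DataFrame({"danceability": danceability,
                   "valence": valence,
                   "acousticness": acousticness})

# Pearson correlation of every feature against the target, sorted.
corr = df.corr(method="pearson")["danceability"].sort_values(ascending=False)
print(corr.round(2))
```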

    Model Selection

    Considering danceability as a continuous numeric variable, a regression model was deemed suitable. Three distinct model types were selected:

    1. ElasticNet: A linear regression model combining L1 (Lasso) and L2 (Ridge) regularization techniques. It offers flexibility in feature selection and managing correlated features.
    2. SupportVectorRegressor: A machine learning model tailored for regression tasks, particularly effective with non-linear data due to its use of kernel functions.
    3. XGBoostRegressor: An ensemble learning model renowned for its performance, which leverages the iterative construction of decision trees through gradient boosting to rectify errors from preceding iterations to produce accurate and robust predictions.

    Data Preprocessing

    The dataset was preprocessed by standardizing the features using StandardScaler, ensuring a distribution with a mean of 0 and a standard deviation of 1. This standardization is essential as many machine learning algorithms perform optimally on standardized data. Subsequently, the data was split into train and test sets for model training.
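    A minimal version of this preprocessing step on random stand-in data. The sketch fits the scaler on the training split only, a common precaution against test-set leakage, whatever order the original notebook used:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))   # toy feature matrix
y = rng.uniform(0, 1, 200)                          # toy danceability target

# Split first, then fit the scaler on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Each column now has mean 0 and standard deviation 1 on the training set.
print(X_train_scaled.mean().round(6), X_train_scaled.std().round(6))
```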

    Model Training

    The base models underwent training as follows:

    • ElasticNet: Employed ElasticNetCV for hyperparameter tuning, determining optimal values for the L1 and L2 regularization parameters.
    • SupportVectorRegressor: Utilized default parameters, specifically the Radial Basis Function (rbf) kernel to capture non-linear relationships.
    • XGBoostRegressor: Trained with default parameters, specifying the objective parameter as reg:squarederror for regression tasks.

    Model Evaluation

    The performance of the models was assessed using the following metrics:

    • R-squared: A measure of goodness of fit with a maximum of 1, indicating how much variance in the dependent variable is explained by the model (it can fall below 0 for models that fit worse than simply predicting the mean).
    • Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values, reflecting prediction accuracy.
    • Mean Absolute Error (MAE): Provides the average absolute difference between predicted and actual values, offering insight into prediction error.
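    All three metrics are available directly in scikit-learn. A small worked example with hand-picked values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([0.2, 0.5, 0.7, 0.9])
y_pred = np.array([0.25, 0.45, 0.75, 0.85])   # every prediction off by 0.05

r2 = r2_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)   # mean of squared errors = 0.0025
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors = 0.05
print(round(r2, 3), round(mse, 4), round(mae, 3))
```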

    Model Selection and Fine-Tuning

    The XGBoostRegressor emerged as the top-performing model, exhibiting the highest R-squared and the lowest MSE and MAE on both the train and test sets (Section 5.3 Table 1). These results were further validated by cross-validation (Section 5.3 Fig 4). However, a slight discrepancy between train and test set scores indicated a tendency toward overfitting.

    To mitigate overfitting, a fine-tuning process was executed using RandomizedSearchCV, resulting in a 3.8% improvement in R-squared on the test set. Additionally, the gap between train and test set performance narrowed by approximately 20.6% on R-squared, 19% on MSE, and 18.3% on MAE, indicating reduced overfitting.

    Residual Analysis

    The residual analysis (Section 5.4.1 Fig 5) indicated that the model is generally well-specified, with residuals randomly distributed around the regression line. However, it was noted that the model tends to underestimate observed values, suggesting the possibility of unaccounted factors influencing danceability. Introducing a penalty on negative residuals (underestimations) could potentially enhance model performance.
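    The underestimation noted above corresponds to residuals (observed minus predicted) with a positive mean. A tiny illustration with invented values:

```python
import numpy as np

# Observed danceability vs. hypothetical model predictions.
y_true = np.array([0.6, 0.7, 0.5, 0.8, 0.9])
y_pred = np.array([0.55, 0.65, 0.48, 0.74, 0.86])

# Residuals = observed - predicted; a mean above zero means the model
# underestimates on average.
residuals = y_true - y_pred
print(residuals.mean().round(3))
```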

    2.3 The result

    Our model was used to curate a playlist by predicting the danceability of songs. Given our desire to foster a festive and upbeat atmosphere, and considering the positive correlation between danceability and valence (which indicates the positivity of a song), we filtered the dataset to items with a valence score greater than or equal to 1.5 times the average valence score. This criterion yielded a pool of the most euphoric songs. Next, we selected the top 50 songs based on their predicted danceability and examined each song's genre to ensure its alignment with the desired dance-oriented theme.
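    The curation logic described above can be sketched as follows; the track names and predicted scores are synthetic placeholders:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 500

# Toy candidate pool with hypothetical predicted danceability scores.
songs = pd.DataFrame({
    "track_name": [f"song_{i}" for i in range(n)],
    "valence": rng.uniform(0, 1, n),
    "predicted_danceability": rng.uniform(0, 1, n),
})

# Keep euphoric tracks: valence >= 1.5x the average valence...
threshold = 1.5 * songs["valence"].mean()
euphoric = songs[songs["valence"] >= threshold]

# ...then take the 50 tracks with the highest predicted danceability.
playlist = euphoric.nlargest(50, "predicted_danceability")
print(len(playlist), round(playlist["predicted_danceability"].min(), 3))
```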

    For a complete view of the top 50 songs playlist and further details, please refer to section 6.1 of this report.

    Click here to play the Summer Party Top 50 songs playlist

    2.4 Conclusion

    In conclusion, the methodology employed in this analysis demonstrates a comprehensive and systematic approach to handling a complex dataset to predict track danceability. The data underwent meticulous inspection, cleaning, and preprocessing to ensure its quality and suitability for modeling. We explored various regression models, with the XGBoostRegressor emerging as the most effective choice, showcasing superior predictive performance. While the model exhibited some degree of overfitting initially, a fine-tuning process mitigated this issue, resulting in significant enhancements in model generalization. Residual analysis suggests that the model is well-specified but tends to slightly underestimate observed values, implying room for further refinement. Overall, this study underscores the importance of rigorous data preparation, model selection, and fine-tuning to achieve robust predictive modeling outcomes in the realm of music track analysis.

    2.5 Recommendations

    • Refine Model with Additional Features: To improve the model's predictive accuracy, consider incorporating additional relevant features that might influence track danceability. Exploring domain-specific music features or external data sources could yield valuable insights for enhancing model performance.
    • Regularization for Improved Fit: As observed in the residual analysis, the model tends to slightly underestimate danceability values. Implementing a regularization technique, such as adding a penalty for negative predictions, may help align the model's predictions more closely with the actual values.
    • Data Augmentation: To further enrich the dataset, explore techniques for data augmentation. This can involve generating synthetic data points or creating variations of existing data to increase the diversity and representativeness of the dataset, potentially improving model robustness.
    • Continuous Monitoring and Updating: Music preferences and trends evolve over time, so it's crucial to continuously monitor model performance and update it as new data becomes available. Regularly retraining the model with fresh data ensures that it remains relevant and effective in predicting danceability for contemporary music.
    • User Feedback Integration: Consider incorporating user feedback and domain expertise into the model evaluation and refinement process. Input from music experts or enthusiasts can help fine-tune the model to better align with human perceptions of danceability.

    3. The Data

    You have assembled information on more than 125 genres of Spotify music tracks in a file called spotify.csv, with each genre containing approximately 1000 tracks. All tracks, from all time, have been taken into account without any time period limitations. However, the data collection was concluded in October 2022. Each row represents a track that has some audio features associated with it.

    • track_id: The Spotify ID number of the track.
    • artists: Names of the artists who performed the track, separated by a ; if there is more than one.
    • album_name: The name of the album that includes the track.
    • track_name: The name of the track.
    • popularity: A numerical value ranging from 0 to 100, with 100 being the highest popularity. It is calculated from the number of times the track has been played recently, with more recent plays contributing more to the score. Duplicate tracks are scored independently.
    • duration_ms: The length of the track, measured in milliseconds.
    • explicit: Indicates whether the track contains explicit lyrics; true means it does, false means it does not or it is unknown.
    • danceability: A score between 0.0 and 1.0 representing the track's suitability for dancing, calculated algorithmically from factors like tempo, rhythm stability, beat strength, and regularity.
    • energy: A score between 0.0 and 1.0 indicating the track's intensity and activity level. Energetic tracks tend to be fast, loud, and noisy.
    • key: The key the track is in. Integers map to pitches using standard pitch class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
    • loudness: The overall loudness, measured in decibels (dB).
    • mode: The modality of the track, represented as 1 for major and 0 for minor.
    • speechiness: Measures the amount of spoken words in a track. A value close to 1.0 denotes speech-based content, while 0.33 to 0.66 indicates a mix of speech and music, like rap. Values below 0.33 are usually music and non-speech tracks.
    • acousticness: A confidence measure ranging from 0.0 to 1.0, with 1.0 representing the highest confidence that the track is acoustic.
    • instrumentalness: Estimates the likelihood of a track being instrumental. Non-lyrical sounds such as "ooh" and "aah" are considered instrumental, whereas rap or spoken-word tracks are classified as "vocal". A value closer to 1.0 indicates a higher probability that the track lacks vocal content.
    • liveness: A measure of the probability that the track was performed live. Scores above 0.8 indicate a high likelihood of a live recording.
    • valence: A score from 0.0 to 1.0 representing the track's positiveness. High scores suggest a more positive or happier track.
    • tempo: The track's estimated tempo, measured in beats per minute (BPM).
    • time_signature: An estimate of the track's time signature (meter), a notational convention specifying how many beats are in each bar (or measure). Values range from 3 to 7, indicating time signatures from 3/4 to 7/4.
    • track_genre: The genre of the track.

    Source (data has been modified)

    4. Exploratory Data Analysis (EDA)

    Spotify Dataframe

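    The hidden cell presumably loads and inspects the file. A minimal sketch, using a small in-memory CSV in place of the real spotify.csv:

```python
import io
import pandas as pd

# Toy two-row CSV standing in for spotify.csv (the real file has 20 columns).
csv = io.StringIO(
    "track_id,track_name,danceability,valence\n"
    "t1,song_a,0.8,0.6\n"
    "t2,song_b,0.55,0.4\n"
)

spotify = pd.read_csv(csv)   # in the notebook: pd.read_csv("spotify.csv")
print(spotify.shape)
print(spotify.dtypes)
print(spotify.head())
```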

    4.1 Data inspection and cleaning process


    We shall eliminate null values, duplicate rows, and instances of identical tracks featured in multiple albums attributed to the same artist, resulting in a singular, consolidated entry per track within the dataframe.
