Come on... let's pydance!

# A few installs to keep everything nice and tidy
!pip install --upgrade scikit-learn numpy --quiet
!pip install edaeasy scikit-plot --quiet

Which songs are most suitable for a dancing party?

📖 Background

It's that vibrant time of year again - Summer has arrived (for those of us in the Northern Hemisphere at least)! There's an energy in the air that inspires us to get up and move. In sync with this exuberance, your company has decided to host a dance party to celebrate. And you, with your unique blend of creativity and analytical expertise, have been entrusted with the crucial task of curating a dance-themed playlist that will set the perfect mood for this electrifying night. The question then arises - How can you identify the songs that would make the attendees dance their hearts out? This is where your coding skills come into play.

🕶️ Executive summary

This analysis dives into Spotify data with the aim of creating a statistically perfect playlist. Here are the key takeaways:

Data Cleaning and Preparation: The initial step involved cleaning the dataset, addressing missing values, and removing duplicate songs. This process ensures the dataset's integrity and accuracy for further analysis.
Feature Selection: Several features such as track_id, artists, album_name, and track_genre were deemed less important for creating a statistically perfect playlist and were set aside.
Exploratory Analysis: A thorough exploratory analysis revealed insights into the distribution and characteristics of various numerical and categorical variables in the dataset. Notably, none of the variables exhibited a normal distribution.
Correlation Analysis: Correlation analysis highlighted the relationships between variables, with notable findings including the positive correlation between danceability and valence, and the strong correlation between energy and loudness.
Modeling: Four models (Linear Regression, HuberRegressor, SVM, and Random Forest) were explored to predict danceability, which is a regression problem. Random Forest emerged as the most effective model for this task.
Data Preprocessing: Data preprocessing involved feature scaling, particularly for variables with non-standard ranges or significant outliers, to enhance model performance.
Feature Importance: The top five features that contribute most to predicting danceability were identified: Valence, Tempo, Acousticness, Energy, and Speechiness.
Playlist Curation: The Random Forest model was used to assign new danceability scores to songs, allowing for the selection of the best 50 tracks for an ultra-curated playlist.

This analysis provides valuable insights into creating a statistically perfect playlist by leveraging Spotify data. It demonstrates the importance of data cleaning, exploratory analysis, and model selection in achieving this goal. The top features contributing to danceability were identified, allowing for the curation of an exceptional playlist tailored to the desired characteristics.

✏️ Introduction

Danceability analysis of songs is an exciting and essential aspect in the field of music analysis. It refers to the measure of how suitable a song is for dancing based on its rhythm, tempo, and beat. By examining the danceability of songs, we can gain insights into the overall energy and groove of the music, allowing us to understand its potential impact on listeners.

Understanding the danceability of songs is important for various reasons. Firstly, it helps music producers and DJs in selecting the perfect tracks for dance events, clubs, or parties, ensuring an enjoyable and engaging experience for the audience. Additionally, danceability analysis can aid in music recommendation systems, where songs with similar danceability scores can be suggested to users based on their preferences. Moreover, it provides valuable information for artists and musicians to create music that resonates with the desired mood and movement.

Overall, the analysis of danceability in songs plays a significant role in the music industry, enabling better curation, recommendation, and creation of music that caters to the diverse tastes and preferences of listeners.

Let's dive into the exploratory analysis to uncover key features and relationships, follow it with some quick data preprocessing, and finish with some cool modeling to find the best tunes for that party!

Let's get started!

💾 The Data

You have assembled information on more than 125 genres of Spotify music tracks in a file called spotify.csv, with each genre containing approximately 1000 tracks. All tracks, from all time, have been taken into account without any time period limitations. However, the data collection was concluded in October 2022. Each row represents a track that has some audio features associated with it.

Column	Description
`track_id`	The Spotify ID number of the track.
`artists`	Names of the artists who performed the track, separated by a `;` if there's more than one.
`album_name`	The name of the album that includes the track.
`track_name`	The name of the track.
`popularity`	A numerical value ranging from `0` to `100`, with `100` being the highest popularity. It is calculated from the number of times the track has been played recently, with more recent plays contributing more to the score. Duplicate tracks are scored independently.
`duration_ms`	The length of the track, measured in milliseconds.
`explicit`	Indicates whether the track contains explicit lyrics. `true` means it does, `false` means it does not or it's unknown.
`danceability`	A score between `0.0` and `1.0` that represents the track's suitability for dancing. It is calculated by an algorithm and is determined by factors like tempo, rhythm stability, beat strength, and regularity.
`energy`	A score between `0.0` and `1.0` indicating the track's intensity and activity level. Energetic tracks tend to be fast, loud, and noisy.
`key`	The key the track is in. Integers map to pitches using standard pitch class notation, e.g. `0 = C`, `1 = C♯/D♭`, `2 = D`, and so on. If no key was detected, the value is `-1`.
`loudness`	The overall loudness, measured in decibels (dB).
`mode`	The modality of a track, represented as `1` for major and `0` for minor.
`speechiness`	Measures the amount of spoken words in a track. A value close to `1.0` denotes speech-based content, while `0.33` to `0.66` indicates a mix of speech and music like rap. Values below `0.33` are usually music and non-speech tracks.
`acousticness`	A confidence measure ranging from `0.0` to `1.0`, with `1.0` representing the highest confidence that the track is acoustic.
`instrumentalness`	Instrumentalness estimates the likelihood of a track being instrumental. Non-lyrical sounds such as "ooh" and "aah" are considered instrumental, whereas rap or spoken word tracks are classified as "vocal". A value closer to `1.0` indicates a higher probability that the track lacks vocal content.
`liveness`	A measure of the probability that the track was performed live. Scores above `0.8` indicate a high likelihood of the track being live.
`valence`	A score from `0.0` to `1.0` representing the track's positiveness. High scores suggest a more positive or happier track.
`tempo`	The track's estimated tempo, measured in beats per minute (BPM).
`time_signature`	An estimate of the track's time signature (meter), a notational convention that specifies how many beats are in each bar (or measure). The value ranges from `3` to `7`, indicating time signatures of `3/4` to `7/4`.
`track_genre`	The genre of the track.
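
To make the `key` encoding above concrete, here is a minimal sketch of turning the integer into a pitch name. The helper `key_to_pitch` and the list `PITCH_CLASSES` are illustrative names, not part of the dataset or of any library, and labelling `-1` as "unknown" is an assumption made for readability.

```python
# Pitch-class names indexed by the integer `key` (0 = C, 1 = C♯/D♭, 2 = D, ...).
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def key_to_pitch(key: int) -> str:
    """Return the pitch name for a key value; -1 (no key detected) maps to 'unknown'."""
    if 0 <= key < len(PITCH_CLASSES):
        return PITCH_CLASSES[key]
    return "unknown"


print(key_to_pitch(0), key_to_pitch(1), key_to_pitch(-1))
```

A mapping like this is only needed for display; for modeling, the integer (or a categorical version of it) is usually enough.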

Source (data has been modified)

import pandas as pd
spotify = pd.read_csv('data/spotify.csv')
spotify

🤔 Know your data!

First impressions

Let's dive into some exciting insights from the Spotify data! We'll use the column descriptions above, together with the data types reported below, to decide what to explore and what to leave behind.

spotify.info()

🚀 First, we've noticed that there's just one song missing some crucial information about the artist, track, and album names. Let's eliminate that song and tidy up our dataset! 🧹
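
One way to locate such rows is sketched below on a small toy DataFrame; the values are made up for illustration and only mimic the shape of the real data.

```python
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "track_id": ["a1", "b2", "c3"],
    "artists": ["Artist A", None, "Artist C"],
    "album_name": ["Album A", None, "Album C"],
    "track_name": ["Song A", None, "Song C"],
    "danceability": [0.71, 0.50, 0.64],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Select the rows that have at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Inspecting the affected rows before dropping them helps confirm that a single track is responsible for all the missing names.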

🔍 Now, we're on the hunt for duplicated songs, as suggested by the data description. Keep in mind, a key rule here is that the same track should have the same track_id, and their metrics should match. We suspect that the same song, but in different albums, should have different track_ids. We'll confirm that shortly! 🎶

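# Drop the single row with missing artist, album, and track names
spotify.dropna(inplace=True)
# Count rows whose track_id already appeared earlier in the data
duplicated_songs = spotify.duplicated(subset=['track_id']).sum()
print(f'Number of duplicated records: {duplicated_songs}')
# Show every row that shares its track_id with another row
display(spotify[spotify.duplicated(subset=['track_id'], keep=False)])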

The data is rich, but we've got a whopping 24,132 duplicated records in our dataset! 📊 Most of these duplicates occur because Spotify lists the same track under several genres (yes, songs can have multiple genres), with one entry per genre. By contrast, the same song released on different albums gets a different track_id for each release, and its metrics, especially popularity, can vary between releases. We should keep those per-album entries and clear out the repeated track_ids. Removing the repeats is crucial, as they can skew our analysis and hurt our model's performance. Let's keep it clean! 🧼
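
The genre-driven duplication can be checked with a sketch like the following. It runs on a toy DataFrame with invented values, so the counts it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy data: tracks a1 and c3 are listed once per genre (values are invented).
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2", "c3", "c3", "c3"],
    "track_genre": ["pop", "dance", "rock", "latin", "reggaeton", "pop"],
    "danceability": [0.8, 0.8, 0.4, 0.9, 0.9, 0.9],
})

# How many distinct genres is each track_id listed under?
genres_per_track = toy.groupby("track_id")["track_genre"].nunique()

# Rows that repeat an earlier track_id (these are the ones that get dropped).
n_duplicates = toy.duplicated(subset=["track_id"]).sum()

# Keep the first row per track_id (the default keep="first").
deduped = toy.drop_duplicates(subset=["track_id"])

print(genres_per_track)
print(n_duplicates, len(deduped))
```

Note that keeping the first row also keeps only one of the track's genre labels, which is acceptable here only if the genre column is not used later.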

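# Keep the first row for each track_id and drop the rest
spotify.drop_duplicates(subset=['track_id'], inplace=True)
# Verify that no repeated track_id remains
new_dupl_val = spotify.duplicated(subset=['track_id']).sum()
print(f'Final duplicated songs number: {new_dupl_val}')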

Perfecto! Now, let's talk about what's really important for our statistically perfect playlist. It turns out that track_id, artists, album_name, track_name, and track_genre may not carry as much weight as the other variables. They often don't tell us much about the unique characteristics of each song: many artists show diverse styles within the same album or across their careers, and songs of the same genre can be very far apart from each other. So let's let the numbers carry that burden.

Time for some optimization! Working with object types isn't the most efficient way to go, especially when some of them contain numerical data. We can make our dataset more efficient by converting certain columns to more suitable data types, such as a boolean or categorical type for columns like key and mode.

💡 We'll create a new working DataFrame to preserve the original Spotify data for future use.
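
A minimal sketch of that step, on a toy DataFrame with invented values, might look like this. The specific dtype choices (categorical `key`, boolean `mode` and `explicit`) are assumptions for illustration, not necessarily the conversions used in the rest of the analysis.

```python
import pandas as pd

# Toy stand-in for the cleaned data (values are invented for illustration).
original = pd.DataFrame({
    "track_id": ["a1", "b2"],
    "artists": ["Artist A", "Artist B"],
    "album_name": ["Album A", "Album B"],
    "track_name": ["Song A", "Song B"],
    "track_genre": ["pop", "rock"],
    "explicit": [True, False],
    "key": [0, 7],
    "mode": [1, 0],
    "danceability": [0.71, 0.50],
})

# Identifier and label columns that the text sets aside.
id_columns = ["track_id", "artists", "album_name", "track_name", "track_genre"]

# drop() returns a new DataFrame, so `original` is left untouched.
working = original.drop(columns=id_columns)

# Assumed dtype choices: categorical `key`, boolean `mode` and `explicit`.
working = working.astype({"key": "category", "mode": "bool", "explicit": "bool"})

print(working.dtypes)
```

Keeping the original frame around makes it easy to join track names and artists back onto the predictions when the playlist is assembled.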
