Which songs are most suitable for a dancing party?
📖 Background
It's that vibrant time of year again - Summer has arrived (for those of us in the Northern Hemisphere at least)! There's an energy in the air that inspires us to get up and move. In sync with this exuberance, your company has decided to host a dance party to celebrate. And you, with your unique blend of creativity and analytical expertise, have been entrusted with the crucial task of curating a dance-themed playlist that will set the perfect mood for this electrifying night. The question then arises - How can you identify the songs that would make the attendees dance their hearts out? This is where your coding skills come into play.
💾 The Data
You have assembled information on more than 125 genres of Spotify music tracks in a file called spotify.csv, with each genre containing approximately 1000 tracks. All tracks, from all time, have been taken into account without any time period limitations. However, the data collection was concluded in October 2022.
Each row represents a track that has some audio features associated with it.
Column | Description |
---|---|
track_id | The Spotify ID number of the track. |
artists | Names of the artists who performed the track, separated by a ; if there's more than one. |
album_name | The name of the album that includes the track. |
track_name | The name of the track. |
popularity | A numerical value ranging from 0 to 100, with 100 being the highest popularity. This is calculated based on the number of times the track has been played recently, with more recent plays contributing more to the score. Duplicate tracks are scored independently. |
duration_ms | The length of the track, measured in milliseconds. |
explicit | Indicates whether the track contains explicit lyrics. true means it does, false means it does not or it's unknown. |
danceability | A score between 0.0 and 1.0 that represents the track's suitability for dancing. It is calculated by an algorithm and is determined by factors like tempo, rhythm stability, beat strength, and regularity. |
energy | A score between 0.0 and 1.0 indicating the track's intensity and activity level. Energetic tracks tend to be fast, loud, and noisy. |
key | The key the track is in. Integers map to pitches using standard pitch class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
loudness | The overall loudness, measured in decibels (dB). |
mode | The modality of a track, represented as 1 for major and 0 for minor. |
speechiness | Measures the amount of spoken words in a track. A value close to 1.0 denotes speech-based content, while 0.33 to 0.66 indicates a mix of speech and music like rap. Values below 0.33 are usually music and non-speech tracks. |
acousticness | A confidence measure ranging from 0.0 to 1.0, with 1.0 representing the highest confidence that the track is acoustic. |
instrumentalness | Instrumentalness estimates the likelihood of a track being instrumental. Non-lyrical sounds such as "ooh" and "aah" are considered instrumental, whereas rap or spoken word tracks are classified as "vocal". A value closer to 1.0 indicates a higher probability that the track lacks vocal content. |
liveness | A measure of the probability that the track was performed live. Scores above 0.8 indicate a high likelihood of the track being live. |
valence | A score from 0.0 to 1.0 representing the track's positiveness. High scores suggest a more positive or happier track. |
tempo | The track's estimated tempo, measured in beats per minute (BPM). |
time_signature | An estimate of the track's time signature (meter), which is a notational convention to specify how many beats are in each bar (or measure). The value ranges from 3 to 7, indicating time signatures from 3/4 to 7/4. |
track_genre | The genre of the track. |
Source (data has been modified)
import pandas as pd
spotify = pd.read_csv('data/spotify.csv')
spotify
💪 Challenge
Your task is to devise an analytically-backed, dance-themed playlist for the company's summer party. Your choices must be justified with a comprehensive report explaining your methodology and reasoning. Below are some suggestions on how you might want to start curating the playlist:
- Use descriptive statistics and data visualization techniques to explore the audio features and understand their relationships.
- Develop and apply a machine learning model that predicts a song's danceability.
- Interpret the model outcomes and utilize your data-driven insights to curate your ultimate dance party playlist of the top 50 songs according to your model.
Executive Summary
Our dataset consists of 20 columns and 113027 rows. The value we should predict is 'danceability'. Our unique identifier is 'track_id' and we have 88895 unique tracks.
1. Cleaning the data.
I eliminated all columns that merely describe the tracks (for example, album names) and dropped all duplicates and rows with null values. All rows with values outside the acceptable ranges were deleted as outliers. The cleaned dataset has 86978 rows.
2. Feature selection and engineering.
- I deleted the 'key' feature because it doesn't have much influence on danceability.
- 'time_signature' feature was converted into the new boolean feature 'TimeSig_4' because only tracks with time signature 4 influence danceability.
- 'liveness' was converted into a boolean feature according to the threshold 0.8.
- 'speechiness' was cut into 3 categories 'speech', 'mix', 'music'. Only mix tracks improve danceability so a new boolean feature was created - 'Speech_mix'.
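The conversions listed above can be sketched on a toy table. This is only an illustration under stated assumptions: the sample values are made up, the intermediate column name `cut_speechiness` is hypothetical, and the treatment of values that fall exactly on a bin edge is my own choice.

```python
import pandas as pd

# Toy stand-in for the cleaned track table; the values are invented for illustration.
demo = pd.DataFrame({
    "time_signature": [4, 3, 4, 5],
    "liveness": [0.10, 0.85, 0.79, 0.95],
    "speechiness": [0.05, 0.40, 0.70, 0.20],
})

# Boolean flag: is the track in 4/4 time?
demo["TimeSig_4"] = demo["time_signature"] == 4

# Boolean flag: liveness above the 0.8 "probably live" threshold.
demo["cut_liveness"] = demo["liveness"] > 0.8

# Three speechiness bands taken from the data dictionary:
# up to 0.33 -> music, 0.33 to 0.66 -> mix, above 0.66 -> speech.
# Assumption: bins are right-inclusive, so exactly 0.33 would count as music.
demo["cut_speechiness"] = pd.cut(
    demo["speechiness"],
    bins=[0.0, 0.33, 0.66, 1.0],
    labels=["music", "mix", "speech"],
    include_lowest=True,
)

# Boolean flag kept as a model feature: only the "mix" band is marked True.
demo["Speech_mix"] = demo["cut_speechiness"] == "mix"

print(demo[["TimeSig_4", "cut_liveness", "Speech_mix"]])
```

Only the three boolean columns would go on to the model; the helper column exists just to make the banding visible.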
The main goal of my investigation is to select tracks with a high danceability rating, not to predict its exact value. I therefore decided to divide all tracks into danceable and not danceable using a threshold of 0.8. Danceable tracks are labeled with 1, so we now have a classification problem.
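A minimal sketch of that labeling step, on made-up scores. The column name `danceable` is hypothetical, and I assume here that a track must score strictly above 0.8 to be labeled 1; the text does not say which side the boundary falls on.

```python
import pandas as pd

# Made-up danceability scores for five tracks.
demo = pd.DataFrame({"danceability": [0.35, 0.80, 0.81, 0.92, 0.60]})

# Assumption: strictly greater than 0.8 means danceable (label 1), otherwise 0.
demo["danceable"] = (demo["danceability"] > 0.8).astype(int)

# Share of positive labels, useful for checking how unbalanced the classes are.
positive_share = demo["danceable"].mean()
print(demo)
print(f"Share of danceable tracks: {positive_share:.0%}")
```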
3. Choosing the baseline model.
Our dataset is highly unbalanced, with only 8% danceable tracks. To choose a baseline model I downsampled the non-danceable (False) class. I also upsampled the danceable (True) class; I use that version later for tuning the model.
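Both balancing strategies can be sketched with `sklearn.utils.resample`, which is imported in the notebook. The toy table, the `danceable` column name, and the choice to balance the classes exactly 1:1 are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.utils import resample

seed = 123

# Toy unbalanced table: 8 danceable rows out of 100, mirroring the 8% share.
demo = pd.DataFrame({
    "energy": [i / 100 for i in range(100)],
    "danceable": [1] * 8 + [0] * 92,
})
pos = demo[demo["danceable"] == 1]
neg = demo[demo["danceable"] == 0]

# Downsampling: keep as many negatives as there are positives (no replacement).
neg_down = resample(neg, replace=False, n_samples=len(pos), random_state=seed)
downsampled = pd.concat([pos, neg_down])

# Upsampling: draw positives with replacement until they match the negatives.
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=seed)
upsampled = pd.concat([pos_up, neg])

print(downsampled["danceable"].value_counts())
print(upsampled["danceable"].value_counts())
```

Downsampling throws away most of the majority class, which keeps baseline comparisons fast. Upsampling keeps every negative row but repeats positive rows, so any train/test split should be made with that duplication in mind.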
I chose 6 basic models:
- LogisticRegression
- KNeighborsClassifier
- Support Vector Machine
- DecisionTreeClassifier
- RandomForestClassifier
- GradientBoostingClassifier
I split all data into training and test (30%) sets, and for the first 3 models I also scaled the data. The baseline model is GradientBoostingClassifier with 80.7% accuracy and 77.6% precision.
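The comparison can be sketched as a loop over the six estimators. This is an illustration on synthetic data from `make_classification`, not the Spotify table. All models use default hyperparameters, and wrapping the first three in a `StandardScaler` pipeline is my assumption about how the scaling was applied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

seed = 123

# Synthetic, roughly balanced binary problem standing in for the downsampled data.
X, y = make_classification(n_samples=400, n_features=8, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=seed
)

# Distance- and margin-based models get a scaler; tree-based models do not need one.
models = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNeighborsClassifier": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVC": make_pipeline(StandardScaler(), SVC(random_state=seed)),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=seed),
    "RandomForestClassifier": RandomForestClassifier(random_state=seed),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=seed),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = (
        accuracy_score(y_test, pred),
        precision_score(y_test, pred, zero_division=0),
    )

for name, (acc, prec) in scores.items():
    print(f"{name:28s} accuracy={acc:.3f} precision={prec:.3f}")
```

Putting the scaler inside a pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the model.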
4. Tuning and validating the model.
After investigating feature importances I eliminated 2 features: 'cut_liveness' and 'mode'. Then I searched for the best hyperparameters of the model. The accuracy score improved to 95%, and the mean cross-validation accuracy was also 95%. This accuracy looks good for our balanced dataset.
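A hedged sketch of the tuning and validation step, using `RandomizedSearchCV`, `KFold`, and `cross_val_score` from the notebook's imports. The search space, the number of iterations, and the fold counts below are placeholders chosen to keep the example fast; they are not the settings used for the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score

seed = 123

# Synthetic balanced data standing in for the upsampled training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=seed)

# Hypothetical search space; real ranges would depend on the data.
param_distributions = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=seed),
    param_distributions=param_distributions,
    n_iter=4,            # number of sampled parameter combinations
    cv=3,                # folds used inside the search
    scoring="accuracy",
    random_state=seed,
)
search.fit(X, y)

# Independent 5-fold check of the selected configuration.
cv = KFold(n_splits=5, shuffle=True, random_state=seed)
cv_scores = cross_val_score(search.best_estimator_, X, y, cv=cv, scoring="accuracy")

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy: {cv_scores.mean():.3f}")
```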
I also decided to check scores for the initial unbalanced dataset:
The final model is the tuned GradientBoostingClassifier, with 98% accuracy (inflated by the 'accuracy paradox' of unbalanced data) and 83% precision on the initial dataset.
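The accuracy paradox is easy to reproduce with a toy example: on data with 8% positives, a "model" that never predicts a danceable track still reaches 92% accuracy. The labels below are made up to illustrate the effect and are not output from the tuned model.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 100 tracks, 8 of them truly danceable (label 1).
y_true = [1] * 8 + [0] * 92

# A useless classifier that labels every track as not danceable.
y_never = [0] * 100

acc_never = accuracy_score(y_true, y_never)                       # 92 of 100 correct
prec_never = precision_score(y_true, y_never, zero_division=0)    # no positive predictions
recall_never = recall_score(y_true, y_never)                      # finds none of the 8

print(f"accuracy={acc_never:.2f} precision={prec_never:.2f} recall={recall_never:.2f}")
```

This is why precision is the more informative figure on the unbalanced data: it measures how many of the tracks flagged as danceable really are.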
5. 50 top tracks and recommendations.
According to the predicted probabilities of my model, I sorted all tracks and selected the best 50. I compared characteristics of the full dataset vs. the predicted danceable tracks and vs. the top 50 tracks, and based on that I recommend choosing tracks that are more positive (higher valence), less acoustic and less instrumental.
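Ranking by predicted probability can be sketched as follows. The data is synthetic, the column names `feature_0` to `feature_5` and `proba` are hypothetical, and the model is fitted and scored on the same rows purely to keep the example short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

seed = 123

# Synthetic features and labels standing in for the engineered track table.
X, y = make_classification(n_samples=300, n_features=6, random_state=seed)
tracks = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
tracks["track_id"] = [f"track_{i}" for i in range(len(tracks))]

feature_cols = [c for c in tracks.columns if c != "track_id"]
model = GradientBoostingClassifier(random_state=seed).fit(tracks[feature_cols], y)

# Column 1 of predict_proba holds the probability of the positive class (label 1).
tracks["proba"] = model.predict_proba(tracks[feature_cols])[:, 1]

# Keep the 50 tracks the model is most confident about, highest first.
top50 = tracks.nlargest(50, "proba")
print(top50[["track_id", "proba"]].head())
```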
1 Cleaning the data
First of all I imported all required libraries and set a random state.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
import time
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score, KFold, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix,\
roc_curve,roc_auc_score,classification_report, auc, ConfusionMatrixDisplay
seed=123
Next I made a copy of the dataframe and looked at the data shape.
sf=spotify.copy()
print(f'Our dataset has {sf.shape[0]} rows and {sf.shape[1]} columns')
To better understand the range of values we have, I converted the duration of tracks into minutes. I also divided popularity by 100 so that it is on the same 0-1 scale as the other ratings, and converted some columns into more appropriate data types.
sf['duration_m']=round(sf['duration_ms']/60000,2)
sf.drop('duration_ms',axis=1,inplace=True)
sf['popularity']=sf['popularity']/100.0
sf['mode']=sf['mode'].astype(bool)
sf[['key','time_signature']]=sf[['key','time_signature']].astype('category')
column_names=sf.columns
For further analysis I classified all columns into the following categories:
- to_predict - 'danceability'
- boolean features
- category features
- ratings
- specific features of the track
- unique_id - 'track_id'
I also set acceptable ranges of values according to the task or to common sense.
Now let's look at our data:
to_predict=['danceability']
boolean=['explicit','mode']
category=['key','time_signature']
ratings=['popularity','energy', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence']
measures=['loudness', 'tempo', 'duration_m']
specific=['artists', 'album_name', 'track_name','track_genre']
unique_id=['track_id']
classification=[]
for i in sf.columns:
    if i in to_predict:
        classification.append('to_predict')
    elif i in boolean:
        classification.append('boolean')
    elif i in category:
        classification.append('category')
    elif i in unique_id:
        classification.append('unique_id')
    elif i in measures:
        classification.append('measure')
    elif i in specific:
        classification.append('specific')
    else:
        classification.append('rating')
acceptable_range=[]
for i in sf.columns:
    if i in ratings or i in to_predict:
        acceptable_range.append('0.0-1.0')
    elif i=='duration_m':
        acceptable_range.append('1-10 min')
    elif i=='tempo':
        acceptable_range.append('30-240 BPM')
    elif i=='loudness':
        acceptable_range.append('-30 - -1 dB')
    elif i=='time_signature':
        acceptable_range.append('3-7')
    elif i=='key':
        acceptable_range.append('-1-11')
    else:
        acceptable_range.append('-')
types=sf.dtypes
notnull_values=sf.notna().sum()
null_values=sf.isna().sum()
unique_values=[len(sf[i].unique()) for i in column_names]
mins=[min(sf[i]) if sf[i].dtype!='object' else '-' for i in column_names]
maxs=[max(sf[i]) if sf[i].dtype!='object' else '-' for i in column_names]
c=pd.DataFrame(zip(column_names, types, notnull_values, null_values, unique_values,classification, mins,maxs,acceptable_range),
columns=['name','type','not_null','null','unique','classification','min','max','acceptable_range'])
c.sort_values('classification',ascending=False)
We have 88 895 unique track_ids. I dropped specific columns which describe the tracks and after that dropped duplicated rows.
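A minimal sketch of this step on a toy table. The rows are invented. The example assumes that a track can appear more than once because it is listed under several genres, and it adds the `dropna()` call mentioned in the executive summary.

```python
import pandas as pd

# Toy table: track "a" appears twice under different genres, track "c" has a missing value.
demo = pd.DataFrame({
    "track_id": ["a", "a", "b", "c"],
    "artists": ["X", "X", "Y", "Z"],
    "album_name": ["A1", "A1", "B1", "C1"],
    "track_name": ["Song A", "Song A", "Song B", "Song C"],
    "track_genre": ["pop", "dance", "rock", "jazz"],
    "danceability": [0.7, 0.7, 0.5, 0.9],
    "energy": [0.8, 0.8, 0.4, None],
})

# Same list of descriptive columns as defined earlier in the notebook.
specific = ["artists", "album_name", "track_name", "track_genre"]

# Once the descriptive columns are gone, the two "a" rows become identical duplicates.
cleaned = demo.drop(columns=specific).drop_duplicates().dropna()
print(cleaned)
```

Dropping the descriptive columns first matters: with `track_genre` still present, the two rows for track "a" would differ and `drop_duplicates()` would keep both.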
Next we need to correct our data according to acceptable intervals and eliminate outliers.
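The range filter can be sketched with `Series.between`, using the bounds from the acceptable-range table above. The sample rows are invented, and treating both bounds as inclusive is an assumption.

```python
import pandas as pd

# Invented rows: only the first one is inside every acceptable range.
demo = pd.DataFrame({
    "tempo": [120.0, 0.0, 95.0, 250.0],
    "loudness": [-6.5, -45.0, -10.2, -5.0],
    "duration_m": [3.5, 0.4, 11.0, 4.2],
})

# (lower, upper) bounds: 30-240 BPM, -30 to -1 dB, 1-10 minutes.
limits = {"tempo": (30, 240), "loudness": (-30, -1), "duration_m": (1, 10)}

# Start with every row accepted, then require each column to be within its bounds.
mask = pd.Series(True, index=demo.index)
for col, (low, high) in limits.items():
    mask &= demo[col].between(low, high)  # inclusive on both ends by default

in_range = demo[mask]
print(in_range)
```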
Here is our cleaned dataset: