Song Genre Classifier

    Spotify Music Data

    Photo by Anthony DELANOIX on Unsplash

    Introduction

    This dataset consists of about 600 songs that ranked among the top songs of the year from 2010 to 2019 (as measured by Billboard). It includes song data pulled from Spotify, such as the beats per minute, amount of spoken words, loudness, and energy of every song. The purpose of this project is to complete the following tasks:

    • πŸ—ΊοΈ Explore: Which artists and genres are the most popular?
    • πŸ“Š Visualize: Visualize the numeric values as a time-series by year. Can you spot any changes over the years?
    • πŸ”Ž Analyze: Train and build a classifier to predict a song's genre based on columns 3 to 13.

    Data dictionary

    Index  Variable   Explanation
    0      title      The title of the song
    1      artist     The artist of the song
    2      top genre  The genre of the song
    3      year       The year the song appeared on the Billboard chart
    4      bpm        Beats per minute: the tempo of the song
    5      nrgy       The energy of the song: higher values mean more energetic (fast, loud)
    6      dnce       The danceability of the song: higher values mean it's easier to dance to
    7      dB         Decibel: the loudness of the song
    8      live       Liveness: likelihood the song was recorded with a live audience
    9      val        Valence: higher values mean a more positive sound (happy, cheerful)
    10     dur        The duration of the song
    11     acous      The acousticness of the song: likelihood the song is acoustic
    12     spch       Speechiness: higher values mean more spoken words
    13     pop        Popularity: higher values mean more popular

    Source of dataset.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import tensorflow as tf
    df = pd.read_csv("spotify_top_music.csv", index_col=0)
    df
    df.info()
    df['top genre'].value_counts()
    df.describe()
    # Check Duplicates and Missing Values
    print(f'Number of duplicates: {df.duplicated().sum()}')
    print(f'Number of missing values per column:\n{df.isna().sum()}')

    1. Who is the most popular artist?

    The dataset contains a column, pop, which holds integers that measure the popularity of a song. We can use this column as one metric for identifying the most popular artist. We can also use the number of times an artist appears in the dataset: the more often an artist appears, the more songs they have had on the Billboard charts, which is a clear sign of their popularity. We will compare the following two metrics of popularity:

    1. Average pop score per artist
    2. Number of songs on the Billboard charts (that is, how many times the artist appears in the dataset)
    # 1. Average Popularity Score
    artists = df.groupby('artist')['pop'].mean()
    print("Metric 1- Average pop score per artist")
    artists.sort_values(ascending=False).head(10)
    # 2. Number of songs on billboard charts
    df['artist'].value_counts().head(10).plot(kind='bar')
    plt.title('Top 10 Artists with most Billboard charting singles')
    plt.ylabel('Number of songs')
    plt.xlabel('Artist')
    plt.show()

    Which metric is better?

    Using metric (2.), we see names such as Katy Perry, Justin Bieber, Rihanna, and Lady Gaga, all prolific artists during the 2010s. Using metric (1.), we instead see names such as Lewis Capaldi, SHAED, Lizzo, and MABEL. These artists are less widely known and have a more niche audience. Their popularity score is high because they may have had only 1 or 2 very popular songs on the Billboard chart, so the average is taken over very few songs and is skewed upward. Hence, metric (2.) is the better measure of who is the most popular artist, and by that metric Katy Perry is the most popular artist in this dataset.
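To make the skew concrete, it can help to put both metrics side by side in one table. The following is a minimal sketch on a small invented table: the artist names and scores are hypothetical, not taken from the dataset, and the helper name `summarize_artists` is an assumption introduced here for illustration.

```python
import pandas as pd


def summarize_artists(df: pd.DataFrame) -> pd.DataFrame:
    """Return song count and mean popularity per artist, most songs first."""
    return (
        df.groupby("artist")["pop"]
        .agg(n_songs="count", mean_pop="mean")
        .sort_values(["n_songs", "mean_pop"], ascending=False)
    )


# Hypothetical toy data: one artist with three songs, one with a single hit.
toy = pd.DataFrame({
    "artist": ["Artist A", "Artist A", "Artist A", "Artist B"],
    "pop": [70, 60, 80, 95],
})

summary = summarize_artists(toy)
print(summary)
```

In this toy table the single-hit artist has the highest mean score while the artist with three songs ranks first by count, which mirrors the disagreement between the two metrics. On the real data the same function could be called as `summarize_artists(df)`.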

    2. How have the numeric values changed over the years?

    To answer this question we must first convert the year column datatype from integers to datetime objects. Then, we will plot the change of each numeric column over time.

    df['year'] = pd.to_datetime(df['year'], format = '%Y')
    df['year'].value_counts()

    Further preprocessing

    To visualize how the numerical columns have changed over time, we will compute the median of each numerical column per year and plot the medians as a time series.
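One possible way to carry out this step is sketched below. It is not necessarily the code used in the original notebook: the function names `yearly_medians` and `plot_yearly_medians` are assumptions, and the toy values are invented so that the example runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt


def yearly_medians(df: pd.DataFrame, year_col: str = "year") -> pd.DataFrame:
    """Median of every numeric column for each year, sorted by year."""
    return df.groupby(year_col).median(numeric_only=True).sort_index()


def plot_yearly_medians(medians: pd.DataFrame):
    """Draw one line chart per numeric column, sharing the year axis."""
    axes = medians.plot(
        subplots=True, sharex=True, marker="o", legend=False,
        figsize=(8, 2 * len(medians.columns)),
    )
    for ax, col in zip(axes, medians.columns):
        ax.set_ylabel(col)
    axes[-1].set_xlabel("Year")
    return axes


# Hypothetical toy data with a datetime year column and two numeric columns.
toy = pd.DataFrame({
    "title": ["s1", "s2", "s3", "s4", "s5"],
    "year": pd.to_datetime(["2010", "2010", "2010", "2011", "2011"], format="%Y"),
    "bpm": [100, 120, 140, 90, 110],
    "nrgy": [50, 70, 60, 80, 40],
})

medians = yearly_medians(toy)
print(medians)
axes = plot_yearly_medians(medians)
```

The median is less sensitive to outlier songs than the mean, which is presumably why it was chosen here. Non-numeric columns such as the title are dropped by `numeric_only=True`. In a notebook the figure renders automatically; in a script, call `plt.show()` afterwards.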
