Building a movie recommendation system with NLP

    Overview

    In this project, we build a movie recommendation system using Natural Language Processing (NLP) techniques, specifically TF-IDF and cosine similarity. The goal is to recommend movies whose plots are similar to the plot of a given movie. Representing each plot as a TF-IDF vector lets us compare the text of any two movies numerically and rank the closest matches.

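    # Modules
    import numpy as np
    import pandas as pd
    from IPython.display import display  # built in to notebooks; imported so the code also runs as a script
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel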
    
    # Function to get an overview of a dataframe
    def overview(df):
        # First 5 rows
        display(df.head())
        # Shape    
        display(df.shape)
        # Duplicates
        print("Duplicated:", df.duplicated().sum())
        # Missing values: print False if there are none, otherwise the count per column
        if not df.isnull().values.any():
            print("Missing:", False)
        else:
            print("\nMissing:\n\n", df.isnull().sum())

    Data Cleaning

    We will clean and preprocess the movie dataset, removing any irrelevant information and handling missing values.

    Import the dataset.

    # Columns to consider
    usecols_ = ["id", "title", "overview"]
    
    # Read the data
    movies = pd.read_csv("movies.csv", usecols=usecols_)
    
    # Get an overview
    overview(movies)

    Take a look at the movies without an overview and at the duplicated records.

    # No overview
    print("No overview:")
    display(movies[movies["overview"].isnull()])
    
    # Duplicated
    print("Duplicated:")
    display(movies[movies.duplicated()])

    Remove missing values and duplicates.

    # Remove missing values
    movies.dropna(inplace=True)
    
    # Remove duplicates
    movies.drop_duplicates(inplace=True)
    
    # Reset the index
    movies.reset_index(drop=True, inplace=True)
    
    overview(movies)
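
    To make the effect of these three steps concrete, here is a minimal sketch on a toy dataframe. The rows are hypothetical and only mimic the columns of movies.csv; the chained, non-inplace calls are assumed to be equivalent to the inplace version above.

```python
import pandas as pd

# Hypothetical toy data standing in for movies.csv (not the real dataset)
toy = pd.DataFrame(
    {
        "id": [1, 2, 2, 3],
        "title": ["Alpha", "Beta", "Beta", "Gamma"],
        "overview": [
            "A heist goes wrong.",
            "Two friends travel.",
            "Two friends travel.",
            None,
        ],
    }
)

# Drop rows with any missing value, then exact duplicate rows,
# and finally renumber the index from 0
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

print(cleaned)
print("Rows before:", len(toy), "after:", len(cleaned))
```

    Resetting the index matters later on: if row positions are used to look up similarity scores, the index has to run from 0 to n-1 without gaps.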

    Check data types to make sure they are correct.

    display(movies.dtypes)
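
    If a column turned out to have the wrong type, for example an id column read as text, it could be converted along these lines. This is only a sketch on hypothetical data; nothing here implies that movies.csv actually has this problem.

```python
import pandas as pd

# Hypothetical frame where "id" was read as text and one value is malformed
toy = pd.DataFrame(
    {"id": ["10", "20", "not-a-number"], "title": ["A", "B", "C"]}
)

# errors="coerce" turns unparseable values into NaN instead of raising
toy["id"] = pd.to_numeric(toy["id"], errors="coerce")

# Drop rows whose id could not be parsed, then store the ids as integers
toy = toy.dropna(subset=["id"]).astype({"id": "int64"}).reset_index(drop=True)

print(toy.dtypes)
```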

    NLP

    We will build an NLP pipeline that computes tf-idf vectors for the movie plots and the cosine similarity between them. Then, we will create a function that returns the 10 most similar movies for a given title.
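
    The modules imported at the top include linear_kernel rather than cosine_similarity. The sketch below, on three hypothetical plots, illustrates why the two should agree here: with its default settings, TfidfVectorizer scales every row to unit length, so the plain dot product of two rows is already their cosine similarity. The variable names and the stop_words="english" setting are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical plots, not taken from the dataset
plots = [
    "A young wizard attends a school of magic.",
    "A wizard and a hobbit travel to destroy a magic ring.",
    "Two detectives hunt a serial killer in a rainy city.",
]

# The default norm="l2" scales every row of the matrix to unit length
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(plots)

# Dot products between all pairs of rows
dot = linear_kernel(tfidf_matrix, tfidf_matrix)
# Cosine similarities between all pairs of rows
cos = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(np.allclose(dot, cos))
print(dot.round(2))
```

    Since linear_kernel skips the normalization step that cosine_similarity performs, it is a reasonable choice when the vectors are already normalized.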

    tf-idf stands for Term Frequency - Inverse Document Frequency. It is a method for measuring how much a word contributes to characterizing a document: a word scores high when it occurs often in that document but rarely in the rest of the collection. Here is the formula:
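
    In its common textbook form (an assumption here, since variants differ in logarithm base and smoothing), the weight of a term t in a document d is:

```latex
\[
\operatorname{tf\text{-}idf}(t, d) = \operatorname{tf}(t, d) \cdot \operatorname{idf}(t),
\qquad
\operatorname{idf}(t) = \log \frac{N}{\operatorname{df}(t)}
\]
```

    Here tf(t, d) is the number of times t occurs in d, N is the number of documents, and df(t) is the number of documents that contain t. Note that scikit-learn's TfidfVectorizer, with its default settings, uses the smoothed variant idf(t) = ln((1 + N) / (1 + df(t))) + 1 and then normalizes each document vector to unit length, so its values will differ from the textbook formula.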

    Create a tf-idf matrix with a tf-idf vectorizer.