Building a movie recommendation system with NLP

Overview

In this project, we develop a movie recommendation system using Natural Language Processing (NLP) techniques, specifically TF-IDF and cosine similarity. The goal is to recommend movies based on their plots: each plot overview is turned into a numeric vector, and the movies whose vectors are most similar to that of a given title are returned as recommendations.

# Modules
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Function to get an overview of a dataframe
def overview(df):
    # First 5 rows
    display(df.head())
    # Shape
    display(df.shape)
    # Duplicates
    print("Duplicated:", df.duplicated().sum())
    # Missing values: print False when there are none, else per-column counts
    if not df.isnull().values.any():
        print("Missing:", False)
    else:
        print("\nMissing:\n\n", df.isnull().sum())

Data Cleaning

We will clean and preprocess the movie dataset, removing any irrelevant information and handling missing values.

Import the dataset.

# Columns to consider
usecols_ = ["id", "title", "overview"]

# Read the data
movies = pd.read_csv("movies.csv", usecols=usecols_)

# Get an overview
overview(movies)

Take a look at the movies without an overview and at the duplicated records.

# No overview
print("No overview:")
display(movies[movies.overview.isnull()])

# Duplicated
print("Duplicated:")
display(movies[movies.duplicated()])

Remove missing values and duplicates.

# Remove missing values
movies.dropna(inplace=True)

# Remove duplicates
movies.drop_duplicates(inplace=True)

# Reset the index
movies.reset_index(drop=True, inplace=True)

overview(movies)

Check data types to make sure they are correct.

display(movies.dtypes)

NLP

We will build an NLP pipeline that computes tf-idf vectors for the movie plots and the cosine similarity between them. Then, we will create a function to get 10 recommended movies for a given title.

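tf-idf stands for Term Frequency - Inverse Document Frequency. It is a method for measuring how much a word contributes to characterizing a document: a word scores high when it is frequent in that document but rare across the whole collection. In its common textbook form, tf-idf(t, d) = tf(t, d) × idf(t), where tf(t, d) is the frequency of term t in document d and idf(t) = log(N / df(t)), with N the number of documents and df(t) the number of documents that contain t. scikit-learn's TfidfVectorizer uses a smoothed variant of the idf term and L2-normalizes each document vector by default.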
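To make the idea concrete, here is a minimal pure-Python sketch of one common textbook variant: tf is the term count divided by the document length, and idf is log(N / df). The three toy documents and the helper names (`tf`, `idf`, `tf_idf`) are assumptions made for illustration, and scikit-learn's TfidfVectorizer uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical plots, for illustration only)
docs = [
    "a spy chases a thief across europe",
    "a thief steals a famous painting",
    "a painter falls in love",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def tf(term, tokens):
    # Term frequency: share of the document's tokens equal to `term`
    return Counter(tokens)[term] / len(tokens)

def idf(term):
    # Inverse document frequency: log(N / df).
    # Assumes `term` occurs in at least one document (df > 0).
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df)

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

for term in ["a", "thief", "spy"]:
    print(term, round(tf_idf(term, tokenized[0]), 4))
```

In the first document, "a" appears in every plot and therefore scores zero, "thief" appears in two plots and gets a small weight, and "spy" appears only here and gets the largest weight of the three.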

Create a tf-idf matrix with a tf-idf vectorizer.
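The following is a minimal sketch of how these steps could look, not necessarily the original notebook's code. It assumes a dataframe with `title` and `overview` columns like the cleaned `movies` above, replaced here by four made-up rows so the example runs on its own. The `stop_words="english"` setting, the `indices` lookup, and the function name `get_recommendations` are illustrative choices. Because TfidfVectorizer L2-normalizes each row by default, `linear_kernel` (a plain dot product) yields the cosine similarity.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in for the cleaned `movies` dataframe (hypothetical rows)
movies = pd.DataFrame({
    "title": ["Space Quest", "Star Voyage", "Love in Paris", "Paris Romance"],
    "overview": [
        "An astronaut crew travels through space to save a distant planet.",
        "A starship crew explores space and discovers a new planet.",
        "Two strangers fall in love during a rainy summer in Paris.",
        "A chef and a painter find love and romance in Paris.",
    ],
})

# Build the tf-idf matrix: one row per movie, one column per term
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(movies["overview"])

# Rows are L2-normalized, so the dot product equals the cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Map each title to its row position (first occurrence if a title repeats)
indices = pd.Series(movies.index, index=movies["title"])
indices = indices[~indices.index.duplicated(keep="first")]

def get_recommendations(title, n=10):
    """Return up to `n` titles whose overviews are most similar to `title`."""
    idx = indices[title]  # raises KeyError for an unknown title
    scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    top = [i for i, _ in scores[:n]]
    return movies["title"].iloc[top].tolist()

print(get_recommendations("Space Quest", n=2))
```

On the full dataset, the similarity matrix has one row and one column per movie, so memory use grows quadratically with the number of movies; computing the similarities for a single row on demand is a possible alternative when that becomes a problem.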

