Building a movie recommendation system with NLP
Overview
In this project, we develop a movie recommendation system using Natural Language Processing (NLP) techniques, in particular TF-IDF and cosine similarity. The goal is to recommend movies based on their plots: each plot overview is converted to a TF-IDF vector, and the movies whose vectors are most similar to that of a given title are returned as recommendations.
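# Modules
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# display is available by default in Jupyter; imported explicitly so the code also runs elsewhere
from IPython.display import display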
# Function to get an overview of a dataframe
def overview(df):
    # First 5 rows
    display(df.head())
    # Shape
    display(df.shape)
    # Duplicates
    print("Duplicated:", df.duplicated().sum())
    # Missing values: print False if there are none, otherwise the count per column
    if not df.isnull().values.any():
        print("Missing:", df.isnull().values.any())
    else:
        print("\nMissing:\n\n", df.isnull().sum())
Data Cleaning
We will clean and preprocess the movie dataset, removing any irrelevant information and handling missing values.
Import the dataset.
# Columns to consider
usecols_ = ["id", "title", "overview"]
# Read the data
movies = pd.read_csv("movies.csv", usecols=usecols_)
# Get an overview
overview(movies)
Take a look at movies without an overview and at duplicated records.
# No overview
print("No overview:")
display(movies[movies.overview.isnull()])
# Duplicated
print("Duplicated:")
display(movies[movies.duplicated()])
Remove missing values and duplicates.
# Remove missing values
movies.dropna(inplace=True)
# Remove duplicates
movies.drop_duplicates(inplace=True)
# Reset the index
movies.reset_index(drop=True, inplace=True)
overview(movies)
Check data types to make sure they are correct.
display(movies.dtypes)
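The visual check above can also be done programmatically. The following is a minimal sketch, assuming that `id` should be an integer column and that `title` and `overview` should hold strings; these expectations, the `sample` dataframe, and the `check_dtypes` helper are illustrative and not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_string_dtype

# Hypothetical stand-in for the cleaned `movies` dataframe
sample = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "overview": ["First plot.", "Second plot."],
})

def check_dtypes(df):
    """Map each column to whether it has the (assumed) expected kind of dtype."""
    return {
        "id": is_integer_dtype(df["id"]),
        "title": is_string_dtype(df["title"]),
        "overview": is_string_dtype(df["overview"]),
    }

print(check_dtypes(sample))
```

A column that reports `False` here would need converting (for example with `astype`) before moving on.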
NLP
We will build an NLP pipeline that computes tf-idf vectors for the movie plots and the cosine similarity between them. Then, we will create a function that returns the 10 most similar movies for a given title.
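To make the plan concrete, here is a minimal sketch of the whole pipeline on a tiny made-up corpus. The `toy_movies` dataframe and the `recommend` function are hypothetical names used only for illustration; the real notebook works on the cleaned `movies` dataframe and may structure the function differently.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy data standing in for the cleaned `movies` dataframe
toy_movies = pd.DataFrame({
    "title": ["Space Quest", "Star Voyage", "Love in Paris", "Paris Romance"],
    "overview": [
        "Astronauts travel through space to save a distant planet.",
        "A crew of astronauts explores deep space and finds a new planet.",
        "Two strangers fall in love during a summer in Paris.",
        "A romance blossoms between two artists in Paris.",
    ],
})

# Each row of the matrix is the tf-idf vector of one overview
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(toy_movies["overview"])

# TfidfVectorizer L2-normalizes rows by default, so the plain dot product
# computed by linear_kernel equals the cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

def recommend(title, n=10):
    """Return up to n titles whose overviews are most similar to `title`."""
    matches = toy_movies.index[toy_movies["title"] == title]
    if len(matches) == 0:
        raise KeyError(f"Unknown title: {title}")
    idx = matches[0]
    # Rank all movies by similarity to the query, highest first
    ranked = pd.Series(cosine_sim[idx], index=toy_movies.index).sort_values(ascending=False)
    ranked = ranked.drop(idx)  # exclude the query movie itself
    return toy_movies.loc[ranked.index[:n], "title"].tolist()

print(recommend("Space Quest", n=1))
```

Using `linear_kernel` instead of a dedicated cosine-similarity function is a shortcut that relies on the rows already being unit length; if the vectorizer's normalization were turned off, the two would no longer agree.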
tf-idf stands for Term Frequency - Inverse Document Frequency. It is a weighting scheme that measures how much a word contributes to characterizing a document: a term scores highly when it appears often in that document but rarely in the rest of the collection.
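In one common textbook form, the weight of a term t in a document d can be written as follows. This is a sketch rather than the exact computation used later: implementations differ in smoothing and normalization, and scikit-learn's `TfidfVectorizer` by default uses a smoothed idf and L2-normalizes each document vector.

```latex
\operatorname{tf\text{-}idf}(t, d) = \operatorname{tf}(t, d) \times \operatorname{idf}(t),
\qquad
\operatorname{idf}(t) = \log \frac{N}{\operatorname{df}(t)}
```

Here tf(t, d) is the number of times t occurs in d, N is the total number of documents, and df(t) is the number of documents that contain t.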
Create a tf-idf matrix with a tf-idf vectorizer.