Predicting Christmas movie grossings

📖 Background

Imagine harnessing the power of data science to unveil the hidden potential of movies before they even hit the silver screen! As a data scientist at a forward-thinking cinema, you're at the forefront of an exhilarating challenge: crafting a cutting-edge system that doesn't just predict movie revenues, but reshapes the entire landscape of cinema profitability. This isn't just about numbers; it's about blending art with analytics to revolutionize how movies are marketed, chosen, and celebrated.

Your mission? To architect a predictive model that dives deep into the essence of a movie - from its title and running time to its genre, captivating description, and star-studded cast. And what better way to sprinkle some festive magic on this project than by focusing on a dataset brimming with Christmas movies? A highly-anticipated Christmas movie is due to launch soon, but the cinema has some doubts. It wants you to predict its success, so it can decide whether to go ahead with the screening or not. It's a unique opportunity to blend the cheer of the holiday season with the rigor of data science, creating insights that could guide the success of tomorrow's blockbusters. Ready to embark on this cinematic adventure?

💪 Competition challenge

Create a report that covers the following:

Exploratory data analysis of the dataset with informative plots. It's up to you what to include here! Some ideas could include:
- Analysis of the genres
- Descriptive statistics and histograms of the grossings
- Word clouds
Develop a model to predict the movie's domestic gross based on the available features.
- Remember to preprocess and clean the data first.
- Think about what features you could define (feature engineering), e.g.:
  - number of times a director appeared in the top 1000 movies list,
  - highest grossing for lead actor(s),
  - decade released
Evaluate your model using appropriate metrics.
Explain some of the limitations of the models you have developed. What other data might help improve the model?
Use your model to predict the grossing of the following fictitious Christmas movie:

Title: The Magic of Bellmonte Lane

Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.

Director: Greta Gerwig

Cast:

Emma Thompson as Emily, a kind-hearted and curious woman
Ian McKellen as Mr. Grayson, the stern corporate developer
Tom Hanks as George, the wise and elderly owner of the local cafe
Zoe Saldana as Sarah, Emily's supportive best friend
Jacob Tremblay as Timmy, a young boy with a special Christmas wish

Runtime: 105 minutes

Genres: Family, Fantasy, Romance, Holiday

Production budget: $25M
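
To feed this movie to a model, the specification above first has to become a row of data. A minimal sketch of one way to do that, assuming a pandas DataFrame with one column per field: the names `genre` and `runtime` match the columns used later in this notebook, while `title`, `director`, `cast`, and `budget` are assumed names for illustration (the description is left out for brevity).

```python
import pandas as pd

# 'genre' and 'runtime' follow the notebook's column names;
# 'title', 'director', 'cast' and 'budget' are assumed names.
new_movie = pd.DataFrame([{
    'title': 'The Magic of Bellmonte Lane',
    'director': 'Greta Gerwig',
    'cast': 'Emma Thompson, Ian McKellen, Tom Hanks, Zoe Saldana, Jacob Tremblay',
    'runtime': 105,                                  # minutes
    'genre': 'Family, Fantasy, Romance, Holiday',    # comma-separated, like the dataset
    'budget': 25_000_000,                            # USD
}])

# Split the genre string the same way the EDA code does
genres = new_movie.loc[0, 'genre'].split(', ')
print(new_movie.T)
print(genres)
```

Keeping the genres as a comma-separated string means the same preprocessing used on the training data can be applied to this row unchanged.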

🧾 Executive Summary

🎯 Aim:

To create a report about Christmas movies and build a machine learning model that predicts their domestic gross.

🛠 Method:

Explore the dataset with informative plots.
Develop a model to predict a movie's domestic gross.
Evaluate the model using appropriate metrics.
Explain the limitations of the models developed and what other data might improve them.
Use the model to predict the gross of the given movie.
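
The summary does not name the metrics used in the evaluation step. As an illustration only, here is a sketch of three common regression metrics (MAE, RMSE, and R²) computed with NumPy; the choice of metrics is an assumption, and the numbers are made up, not taken from the dataset.

```python
import numpy as np

# Made-up values for illustration (e.g. gross in $M); not from the dataset
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))         # root mean squared error
ss_res = np.sum(errors ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                     # coefficient of determination

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

MAE and RMSE are in the same units as the gross, which makes them easy to compare against a prediction; RMSE penalizes large misses more heavily than MAE.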

🗒 Results:

Statistics:

On average, War movies have the longest runtimes, whereas Animated movies have the shortest.
On average, Sci-Fi and Action movies are the most successful (highest gross), whereas Mystery and Horror are the least successful (lowest gross).
Fantasy is the genre with the earliest film in the dataset, whereas Game shows are the most recent genre to appear.
The words "Christmas" and "Family" are the most common in the descriptions.
The actors who appear most often in the dataset are Lacey Chabert and Alicia Witt.
More recent movies tended to gross more.

Machine Learning Model Experimentation

Ensembles in which each model corrects the mistakes of its predecessor (as in boosting) had higher accuracy on average.
Having 6,000+ data points helped keep the model from overfitting.
Including the decade and Movie columns made the model worse.
Predicted gross for the given movie: $54,815,900

(Data validation is performed below, in the hidden cells.)
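
The first point above reads like a description of boosting, where each new model is fitted to the errors left by the models before it. A minimal sketch of that idea on synthetic data, assuming squared-error loss and one-split "stump" learners; `fit_stump` and `boost` are hypothetical helpers written for this illustration, not the models used in the notebook.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    # Candidate thresholds; dropping the largest value keeps the right side non-empty
    for t in np.unique(x)[:-1]:
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_val, right_val = best
    return lambda q: np.where(q <= t, left_val, right_val)

def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Fit stumps sequentially to the current residuals; return training MSE per round."""
    pred = np.full(len(y), y.mean())          # start from the mean prediction
    history = [np.mean((y - pred) ** 2)]
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)        # each stump targets the remaining errors
        pred = pred + learning_rate * stump(x)
        history.append(np.mean((y - pred) ** 2))
    return history

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) * 20 + 50 + rng.normal(0, 2, 200)

mse_history = boost(x, y)
print(f"Training MSE before boosting: {mse_history[0]:.1f}, after: {mse_history[-1]:.1f}")
```

The training error can only go down as rounds are added, which is also why boosted models need a held-out set to check for overfitting.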

💪 Challenge

📊 Exploratory data analysis of the dataset

First, I compute summary statistics for each genre to see how the genres differ.

The table below is the basis for the plots that follow.

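import numpy as np
import pandas as pd

# xmas_movies is loaded in an earlier cell.
# Split the comma-separated genre string, then explode to one row per (movie, genre) pair
temp = xmas_movies.copy()
temp['genre'] = temp['genre'].str.split(', ')
all_genres = temp.explode(column='genre')

# "Average" here is the midpoint of the mean and the median (NaNs ignored), rounded to 1 d.p.
mean_of_avgs = lambda arr: np.round(np.mean([np.nanmean(arr), np.nanmedian(arr)]), 1)
by_genre = all_genres.groupby('genre').agg(
    first_produced=pd.NamedAgg(column='release_year', aggfunc='min'),
    avg_runtime=pd.NamedAgg(column='runtime', aggfunc=mean_of_avgs),
    avg_gross=pd.NamedAgg(column='gross', aggfunc=mean_of_avgs)
)
by_genre.sort_values('first_produced')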
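
To make the split-and-explode step concrete, here is a small sketch on a made-up four-movie table (the titles and values are invented for illustration). A movie listed under two genres contributes one row to each, and the "average" is the midpoint of the mean and the median, as in `mean_of_avgs` above.

```python
import numpy as np
import pandas as pd

# Invented toy data, not from the dataset
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre': ['Comedy, Family', 'Family', 'Comedy, Romance', 'Family'],
    'release_year': [1990, 2005, 2015, 1985],
    'runtime': [90.0, 100.0, 120.0, 140.0],
})

# One row per (movie, genre) pair
toy['genre'] = toy['genre'].str.split(', ')
exploded = toy.explode('genre')

# Midpoint of mean and median, ignoring NaNs, rounded to 1 decimal place
mean_of_avgs = lambda s: np.round(np.mean([np.nanmean(s), np.nanmedian(s)]), 1)

summary = exploded.groupby('genre').agg(
    first_produced=('release_year', 'min'),
    avg_runtime=('runtime', mean_of_avgs),
)
print(summary)
```

For Family, the runtimes are 90, 100, and 140: the mean is 110 and the median is 100, so the reported average is 105. Blending the two dampens the pull of outliers without ignoring them entirely.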

⏱ Average runtime for each genre

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
# Order the genres by average runtime so the bars are sorted
sort_runtime = by_genre.sort_values('avg_runtime')

plt.figure(figsize=(10,6))
sns.barplot(y=sort_runtime.index, x=sort_runtime['avg_runtime'], orient='h', alpha=0.7, palette='spring', saturation=1, zorder=1)
sns.despine(bottom=True)

# Draw faint vertical guide lines every 10 minutes, behind the bars (zorder=0)
for i in range(0,101,10):
    plt.plot([i,i], [-0.5,27], color='white', alpha=0.3, linewidth=1, zorder=0)

plt.ylim(-0.5, 27)
plt.title('Average runtime for each genre')
plt.show()
