Predicting Christmas movie grossings

    📖 Background

    Imagine harnessing the power of data science to unveil the hidden potential of movies before they even hit the silver screen! As a data scientist at a forward-thinking cinema, you're at the forefront of an exhilarating challenge: crafting a cutting-edge system that doesn't just predict movie revenues, but reshapes the entire landscape of cinema profitability. This isn't just about numbers; it's about blending art with analytics to revolutionize how movies are marketed, chosen, and celebrated.

    Your mission? To architect a predictive model that dives deep into the essence of a movie - from its title and running time to its genre, captivating description, and star-studded cast. And what better way to sprinkle some festive magic on this project than by focusing on a dataset brimming with Christmas movies? A highly anticipated Christmas movie is due to launch soon, but the cinema has some doubts. It wants you to predict the movie's success so it can decide whether to go ahead with the screening. It's a unique opportunity to blend the cheer of the holiday season with the rigor of data science, creating insights that could guide the success of tomorrow's blockbusters. Ready to embark on this cinematic adventure?

    💪 Competition challenge

    Create a report that covers the following:

    1. Exploratory data analysis of the dataset with informative plots. It's up to you what to include here! Some ideas could include:
      • Analysis of the genres
      • Descriptive statistics and histograms of the grossings
      • Word clouds
    2. Develop a model to predict the movie's domestic gross based on the available features.
      • Remember to preprocess and clean the data first.
      • Think about what features you could define (feature engineering), e.g.:
        • number of times a director appeared in the top 1000 movies list,
        • highest grossing for lead actor(s),
        • decade released
    3. Evaluate your model using appropriate metrics.
    4. Explain some of the limitations of the models you have developed. What other data might help improve the model?
    5. Use your model to predict the grossing of the following fictitious Christmas movie:

    Title: The Magic of Bellmonte Lane

    Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.

    Director: Greta Gerwig

    Cast:

    • Emma Thompson as Emily, a kind-hearted and curious woman
    • Ian McKellen as Mr. Grayson, the stern corporate developer
    • Tom Hanks as George, the wise and elderly owner of the local cafe
    • Zoe Saldana as Sarah, Emily's supportive best friend
    • Jacob Tremblay as Timmy, a young boy with a special Christmas wish

    Runtime: 105 minutes

    Genres: Family, Fantasy, Romance, Holiday

    Production budget: $25M
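
    As a rough illustration of two of the feature ideas from step 2 (decade released and number of director appearances), here is a minimal sketch. The records, the field names `title` and `director`, and the helper functions are hypothetical, and the appearance count is taken over the toy list itself as a stand-in for a top-1000 list.

```python
from collections import Counter

# Hypothetical toy records; names and values are illustrative, not from the dataset.
toy_movies = [
    {"title": "Movie A", "director": "Director X", "release_year": 1994},
    {"title": "Movie B", "director": "Director Y", "release_year": 2003},
    {"title": "Movie C", "director": "Director X", "release_year": 2018},
]


def decade_released(year: int) -> int:
    """Map a release year to the first year of its decade, e.g. 1994 -> 1990."""
    return (year // 10) * 10


def director_counts(records):
    """Count how many times each director appears in a list of movie records."""
    return Counter(r["director"] for r in records)


counts = director_counts(toy_movies)

# One engineered-feature row per movie.
features = [
    {
        "title": m["title"],
        "decade": decade_released(m["release_year"]),
        "director_appearances": counts[m["director"]],
    }
    for m in toy_movies
]
```

    In a real pipeline the counts would presumably come from a separate reference list rather than from the training rows themselves.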

    🧾 Executive Summary

    🎯 Aim:

    To write a report on Christmas movies and build a machine learning model that predicts a movie's gross.

    🛠 Method:

    1. Explore the dataset with informative plots.
    2. Develop a model to predict a movie's domestic gross.
    3. Evaluate the model using appropriate metrics.
    4. Explain the model's limitations and what other data might improve it.
    5. Use the model to predict the gross of the given movie.
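
    The report does not say here which metrics count as "appropriate", so the following is only a sketch of three common regression metrics (MAE, RMSE and R²) in plain Python. The function name `regression_metrics` and the numbers are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # R^2 is undefined when every true value is identical.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Made-up grosses in millions of dollars, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(actual, predicted)
```

    MAE and RMSE are in the same units as the gross, which makes them easy to read next to a dollar prediction; R² is unitless and compares the model against always predicting the mean.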

    🗒 Results:

    Statistics:
    • On average, War movies have the longest runtime, whereas Animated movies have the shortest.
    • On average, Sci-Fi and Action movies are the most successful (highest gross), whereas Mystery and Horror are the least successful (lowest gross).
    • Fantasy is the genre with the earliest release in the dataset, whereas Game shows is the genre that first appeared most recently.
    • The words "Christmas" and "Family" are the most common in the descriptions.
    • The actors who appear most often in the dataset are Lacey Chabert and Alicia Witt.
    • Movies became more successful (higher gross) over the years.
    Machine Learning Model Experimentation:
    • Ensembles in which each model corrects the mistakes of the model before it had higher accuracy on average.
    • Having 6,000+ data points helped keep the model from overfitting.
    • Including the decade and Movie columns made the model worse.
    • Predicted gross for the given movie (The Magic of Bellmonte Lane): $54,815,900
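
    The idea of models that correct the mistakes of their predecessors can be sketched with a tiny boosting loop. This is not the model used for the prediction above; it is a plain-Python illustration under the assumption of squared-error loss, one-split "stumps" and a single feature, with made-up numbers and hypothetical function names.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) for the best single split."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest x cannot be a split point
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Return training predictions after n_rounds of residual fitting."""
    preds = [sum(ys) / len(ys)] * len(ys)  # round 0: predict the mean
    for _ in range(n_rounds):
        # Each new stump is fitted to what the current ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up numbers, for illustration only.
budgets = [1, 2, 3, 4, 5, 6]
grosses = [2, 3, 7, 8, 15, 16]
mse_by_rounds = {n: mse(grosses, boost(budgets, grosses, n)) for n in (0, 1, 20)}
```

    On this toy data the training error shrinks as rounds are added. That says nothing about overfitting, which would need a held-out set to check.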

    (Data validation is performed in the hidden cells below.)

    💪 Challenge

    📊 Exploratory data analysis of the dataset

    First, I compute summary statistics for each genre to see how the genres differ.

    The table below holds those statistics and is the basis for the plots that follow.

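    import numpy as np
    import pandas as pd

    # xmas_movies is the dataset DataFrame, defined outside this cell.
    # Work on a copy and give each genre of a movie its own row.
    temp = xmas_movies.copy()
    temp['genre'] = temp['genre'].str.split(', ')
    all_genres = temp.explode(column='genre')

    # "Average" here is the midpoint of the mean and the median (NaNs ignored), rounded to 1 decimal.
    mean_of_avgs = lambda arr: np.round(np.mean([np.nanmean(arr), np.nanmedian(arr)]), 1)
    by_genre = all_genres.groupby('genre').agg(
        first_produced=pd.NamedAgg(column='release_year', aggfunc='min'),
        avg_runtime=pd.NamedAgg(column='runtime', aggfunc=mean_of_avgs),
        avg_gross=pd.NamedAgg(column='gross', aggfunc=mean_of_avgs),
    )
    by_genre.sort_values('first_produced')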
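
    To make the split-and-explode step concrete, here is a small self-contained sketch on made-up rows. Only the column names `genre` and `runtime` mirror the cell above; the `title` column, the values and the `n_movies` aggregate are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; a movie can list several genres in one string.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Comedy, Family", "Comedy", "Family, Fantasy"],
    "runtime": [90, 100, 110],
})

# Split the comma-separated string into a list, then give each genre its own row.
toy["genre"] = toy["genre"].str.split(", ")
exploded = toy.explode(column="genre")

# Midpoint of the mean and the median, ignoring NaNs, rounded to one decimal.
mean_of_avgs = lambda arr: np.round(np.mean([np.nanmean(arr), np.nanmedian(arr)]), 1)

toy_by_genre = exploded.groupby("genre").agg(
    n_movies=pd.NamedAgg(column="title", aggfunc="count"),
    avg_runtime=pd.NamedAgg(column="runtime", aggfunc=mean_of_avgs),
)
```

    Because a multi-genre movie is counted once per genre after `explode`, the per-genre statistics overlap: movie A contributes to both Comedy and Family here.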

    ⏱ Average runtime for each genre

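    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set()
    # Order the genres from shortest to longest average runtime.
    sort_runtime = by_genre.sort_values('avg_runtime')

    plt.figure(figsize=(10,6))
    # Horizontal bars: one per genre, bar length = avg_runtime.
    sns.barplot(y=sort_runtime.index, x=sort_runtime['avg_runtime'], orient='h', alpha=0.7, palette='spring', saturation=1, zorder=1)
    sns.despine(bottom=True)

    # Faint vertical guide lines every 10 units from 0 to 100, drawn behind the bars.
    for i in range(0,101,10):
        plt.plot([i,i], [-0.5,27], color='white', alpha=0.3, linewidth=1, zorder=0)

    # Hard-coded y-range (27 must be at least the number of genres); ascending limits put the shortest genre at the bottom.
    plt.ylim(-0.5, 27)
    plt.title('Average runtime for each genre')
    plt.show()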