Competition - Christmas movie grossings

    Predicting Christmas Movie Grossings

    📖 Background

    Imagine harnessing the power of data science to unveil the hidden potential of movies before they even hit the silver screen! As a data scientist at a forward-thinking cinema, you're at the forefront of an exhilarating challenge: crafting a cutting-edge system that doesn't just predict movie revenues, but reshapes the entire landscape of cinema profitability. This isn't just about numbers; it's about blending art with analytics to revolutionize how movies are marketed, chosen, and celebrated.

    Your mission? To architect a predictive model that dives deep into the essence of a movie - from its title and running time to its genre, captivating description, and star-studded cast. And what better way to sprinkle some festive magic on this project than by focusing on a dataset brimming with Christmas movies? A highly-anticipated Christmas movie is due to launch soon, but the cinema has some doubts. It wants you to predict its success, so it can decide whether to go ahead with the screening or not. It's a unique opportunity to blend the cheer of the holiday season with the rigor of data science, creating insights that could guide the success of tomorrow's blockbusters. Ready to embark on this cinematic adventure?

    💾 The data

    We're providing you with a dataset of 788 Christmas movies, with the following columns:

    • christmas_movies.csv
    Variable | Description
    title | the title of the movie
    release_year | year the movie was released
    description | short description of the movie
    type | the type of production, e.g. Movie, TV Episode
    rating | the rating/certificate, e.g. PG
    runtime | the movie runtime in minutes
    imdb_rating | the IMDB rating
    genre | list of genres, e.g. Comedy, Drama etc.
    director | the director of the movie
    stars | list of actors in the movie
    gross | the domestic gross of the movie in US dollars (what we want to predict)

    You may also use an additional dataset of 1000 high-rated movies, with the following columns:

    • imdb_top1k.csv
    Variable | Description
    title | the title of the movie
    release_year | year the movie was released
    description | short description of the movie
    type | the type of production, e.g. Movie, TV Episode
    rating | the rating/certificate, e.g. PG
    runtime | the movie runtime in minutes
    imdb_rating | the IMDB rating
    genre | list of genres, e.g. Comedy, Drama etc.
    director | the director of the movie
    stars | list of actors in the movie
    gross | the domestic gross of the movie in US dollars (what we want to predict)

    Finally, you have access to a dataset of movie production budgets for over 6,000 movies, with the following columns:

    • movie_budgets.csv
    Variable | Meaning
    year | year the movie was released
    date | date the movie was released
    title | title of the movie
    production budget | production budget in US dollars

    Note: while you may augment the Christmas movies with the general movie data, the model should be developed to predict the gross of Christmas movies only.

    import pandas as pd

    # Load the three datasets; the *_df names are the ones used in the rest of the report
    christmas_movies_df = pd.read_csv('data/christmas_movies.csv')
    imdb_top1k_df = pd.read_csv('data/imdb_top1k.csv')
    movie_budgets_df = pd.read_csv('data/movie_budgets.csv')

    # Preview the Christmas movies (a notebook displays the last expression of a cell)
    christmas_movies_df.head()
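
One way to augment the Christmas movies with production budgets is a left join on title and release year. The sketch below uses two tiny made-up frames in place of the real files. The column names follow the tables above, the renamed `production_budget` column is introduced here for convenience, and exact title matching is an assumption that the real data may not satisfy.

```python
import pandas as pd

# Tiny made-up stand-ins for christmas_movies.csv and movie_budgets.csv
xmas = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "release_year": [2001, 2010, 2015],
})
budgets = pd.DataFrame({
    "year": [2001, 2015, 1999],
    "title": ["Movie A", "Movie C", "Movie D"],
    "production budget": [30_000_000, 12_000_000, 5_000_000],
})

# Align the key names, then left-join so every Christmas movie is kept;
# movies without a matching budget row get NaN
merged = xmas.merge(
    budgets.rename(columns={"year": "release_year",
                            "production budget": "production_budget"}),
    on=["title", "release_year"],
    how="left",
)
print(merged)
```

Titles often differ in punctuation or casing between sources, so it is worth checking how many rows end up without a budget before relying on this feature.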

    💪 Competition challenge

    Create a report that covers the following:

    1. Exploratory data analysis of the dataset with informative plots. It's up to you what to include here! Some ideas could include:
      • Analysis of the genres
      • Descriptive statistics and histograms of the grossings
      • Word clouds
    2. Develop a model to predict the movie's domestic gross based on the available features.
      • Remember to preprocess and clean the data first.
      • Think about what features you could define (feature engineering), e.g.:
        • number of times a director appeared in the top 1000 movies list,
        • highest grossing for lead actor(s),
        • decade released
    3. Evaluate your model using appropriate metrics.
    4. Explain some of the limitations of the models you have developed. What other data might help improve the model?
    5. Use your model to predict the grossing of the following fictitious Christmas movie:

    Title: The Magic of Bellmonte Lane

    Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.

    Director: Greta Gerwig

    Cast:

    • Emma Thompson as Emily, a kind-hearted and curious woman
    • Ian McKellen as Mr. Grayson, the stern corporate developer
    • Tom Hanks as George, the wise and elderly owner of the local cafe
    • Zoe Saldana as Sarah, Emily's supportive best friend
    • Jacob Tremblay as Timmy, a young boy with a special Christmas wish

    Runtime: 105 minutes

    Genres: Family, Fantasy, Romance, Holiday

    Production budget: $25M

    Distribution of Genres in Christmas Movies:

    This bar plot illustrates the frequency of each genre in the dataset, showing which genres are most common among Christmas movies.

    Histogram of Gross Earnings for Christmas Movies:

    The histogram displays the distribution of gross earnings. It shows the typical financial success of Christmas movies and the ranges in which gross earnings most often fall.

    Word Cloud for Movie Descriptions:

    The word cloud is generated from the movie descriptions. It highlights the most frequently used words, giving a visual summary of common themes and elements in Christmas movie descriptions.
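
The counts behind the genre bar plot and the word cloud can be computed roughly as follows. This is a minimal sketch on made-up rows, assuming genres are stored as comma-separated strings. The small stop-word set is an illustrative choice, not the one used for the original plots.

```python
from collections import Counter
import re

import pandas as pd

# Made-up rows standing in for christmas_movies_df
df = pd.DataFrame({
    "genre": ["Comedy, Family", "Drama, Romance", "Comedy, Romance"],
    "description": [
        "A family saves Christmas.",
        "A Christmas romance in a small town.",
        "Two friends find love at Christmas.",
    ],
})

# One row per (movie, genre) pair, then count how often each genre occurs
genre_counts = (
    df["genre"].str.split(",").explode().str.strip().value_counts()
)

# Word frequencies: lowercase, keep alphabetic tokens, drop a few stop words
stop_words = {"a", "in", "at", "the", "and", "of"}
words = re.findall(r"[a-z']+", " ".join(df["description"]).lower())
word_counts = Counter(w for w in words if w not in stop_words)

print(genre_counts)
print(word_counts.most_common(3))
```

With matplotlib installed, `genre_counts.plot(kind="bar")` draws the bar plot from these counts, and the `word_counts` mapping can be handed to a word cloud library.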

    Step 1: Data Preprocessing and Cleaning

    Load Datasets: The datasets imdb_top1k.csv, christmas_movies.csv, and movie_budgets.csv are loaded into pandas dataframes.

    Data Cleaning:

    Gross Conversion: In the christmas_movies dataset, the gross column, which represents the movie's earnings, is converted from a string format (with symbols like $ and M) to a numerical format for easier analysis. Because the M suffix is stripped rather than multiplied out, the cleaned values are in millions of US dollars, assuming every value carries that suffix.

    Release Year Conversion: The release_year column is converted to a numeric format to facilitate mathematical operations and comparisons. Values that cannot be parsed become missing (NaN).

    Additional data cleaning steps might be necessary depending on the specific characteristics of the datasets, such as handling missing values or outliers.

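    # Preprocess and clean the datasets
    # Convert gross to a numerical value by stripping '$', ',' and 'M'
    # (raw string avoids an invalid-escape warning; 'M' is removed, not multiplied out,
    # so gross_clean stays in millions of dollars)
    christmas_movies_df['gross_clean'] = christmas_movies_df['gross'].replace(r'[\$,M]', '', regex=True).astype(float)
    
    # Convert the release year to a numeric value; unparseable entries become NaN
    christmas_movies_df['release_year'] = pd.to_numeric(christmas_movies_df['release_year'], errors='coerce')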
    
    # You might need additional preprocessing based on the specific characteristics of your datasets
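
The additional checks mentioned above (missing values and outliers) could look something like the following. It is a sketch on made-up numbers, and Tukey's 1.5 × IQR fence is one common convention for flagging outliers, not a rule taken from this report.

```python
import numpy as np
import pandas as pd

# Made-up gross values in millions of USD, with one gap and one extreme value
df = pd.DataFrame({"gross_clean": [1.2, 3.5, np.nan, 2.8, 4.1, 260.0]})

# Share of missing values per column
missing_share = df.isna().mean()

# Flag values above Q3 + 1.5 * IQR (Tukey's rule) as potential outliers
q1, q3 = df["gross_clean"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
df["is_outlier"] = df["gross_clean"] > upper_fence

print(missing_share)
print(df)
```

Flagged rows are not necessarily errors. Box-office data is typically right-skewed, so a few very large values may be genuine hits that deserve a closer look, not automatic removal.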
    

    Step 2: Feature Engineering

    Director's Popularity: A new feature director_top1k_count is created to represent the number of times a director appears in the top 1000 movies list (imdb_top1k.csv). This is based on the assumption that more frequently listed directors might be associated with higher grossing movies.

    Lead Actor's Highest Grossing: Assuming the first actor listed in the stars column of the christmas_movies dataset is the lead actor, this feature represents the highest gross among the top 1000 movies in which that actor is billed first. This requires cross-referencing with the imdb_top1k dataset (the movie_budgets dataset has no cast or gross columns). Actors who do not appear there receive 0. The assumption here is that actors who have been in high-grossing movies in the past might contribute to higher grossings in new movies.

    Decade Released: Another feature created is decade, which categorizes movies based on the decade they were released. This can capture trends or preferences in movie grossing over different time periods.

    # Feature 1: Number of times a director appeared in the top 1000 movies list
    director_counts = imdb_top1k_df['Director'].value_counts()
    christmas_movies_df['director_top1k_count'] = christmas_movies_df['director'].map(director_counts).fillna(0)
    
    # Feature 2: Highest grossing for lead actor(s)
    # Extract the lead actor from the 'stars' column in christmas_movies_df
    christmas_movies_df['lead_actor'] = christmas_movies_df['stars'].str.split(',').str[0].str.strip()
    
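    # For imdb_top1k_df, we will consider 'Star1' as the lead actor
    # (this code uses the 'Director', 'Star1' and 'Gross' column names, which differ from the
    # lowercase names listed in the data description; check imdb_top1k_df.columns if it fails)
    imdb_top1k_df['lead_actor'] = imdb_top1k_df['Star1'].str.strip()
    
    # Clean the 'Gross' column in imdb_top1k_df to be numeric (raw string avoids an invalid-escape warning)
    # Only '$' and ',' are stripped here, so the unit may differ from gross_clean in christmas_movies_df
    imdb_top1k_df['Gross_clean'] = imdb_top1k_df['Gross'].replace(r'[\$,]', '', regex=True).astype(float)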
    
    # Calculate the highest grossing for each lead actor in imdb_top1k_df
    actor_gross = imdb_top1k_df.groupby('lead_actor')['Gross_clean'].max()
    
    # Map this information back to the christmas_movies_df
    christmas_movies_df['lead_actor_highest_gross'] = christmas_movies_df['lead_actor'].map(actor_gross).fillna(0)
    
    # Feature 3: Decade released
    christmas_movies_df['decade'] = (christmas_movies_df['release_year'] // 10) * 10
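
To see what the three features look like on concrete values, here is the same logic applied to a few made-up rows. The names and numbers are invented for illustration, and the column names mirror the ones used in the code above.

```python
import pandas as pd

# Made-up stand-ins for imdb_top1k_df and christmas_movies_df
top1k = pd.DataFrame({
    "Director": ["Dir X", "Dir X", "Dir Y"],
    "Star1": ["Actor P", "Actor Q", "Actor P"],
    "Gross_clean": [100.0, 50.0, 300.0],
})
xmas = pd.DataFrame({
    "director": ["Dir X", "Dir Z"],
    "stars": ["Actor P, Actor R", "Actor S, Actor P"],
    "release_year": [1994, 2007],
})

# Count of top-1000 appearances per director; unseen directors get 0
xmas["director_top1k_count"] = (
    xmas["director"].map(top1k["Director"].value_counts()).fillna(0)
)

# Highest gross among top-1000 films in which the lead actor was billed first
lead = xmas["stars"].str.split(",").str[0].str.strip()
best_gross = top1k.groupby("Star1")["Gross_clean"].max()
xmas["lead_actor_highest_gross"] = lead.map(best_gross).fillna(0)

# Floor the release year to the start of its decade
xmas["decade"] = (xmas["release_year"] // 10) * 10

print(xmas)
```

In the second row, Actor P is billed second and therefore contributes nothing, which shows how much this feature depends on the first-billed assumption. Filling with 0 also makes "not in the top 1000 list" indistinguishable from a genuinely low value.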
    

    Step 3: Model Development

    Feature Selection: A subset of features is selected for the model. These include runtime, IMDb rating, director's count in top 1000 movies, lead actor's highest grossing movie, and the decade of release.

    Preparing the Data: The dataset is split into a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate its performance.

    Model Training: A Linear Regression model is chosen for its simplicity and interpretability. The model is trained using the training data.

    Model Evaluation: The model's performance is evaluated using the Root Mean Squared Error (RMSE) metric on the test data. RMSE provides a sense of how far off the predictions are from the actual values (lower values are better).

    This code serves as a foundational approach to predictive modeling in this context. Depending on the data's quality and the results from initial modeling, further refinement, such as advanced feature engineering, handling of categorical variables, or trying different machine learning algorithms, might be necessary.

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    import numpy as np
    
    # Selecting features and target variable for christmas_movies_df
    features = ['runtime', 'imdb_rating', 'director_top1k_count', 'lead_actor_highest_gross', 'decade']
    target = 'gross_clean'  # Ensure this is the correct column name for gross earnings in christmas_movies_df
    
    # Drop rows with a missing target or missing feature values (LinearRegression cannot handle NaN)
    christmas_movies_df = christmas_movies_df.dropna(subset=features + [target])
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(christmas_movies_df[features], christmas_movies_df[target], test_size=0.2, random_state=42)
    
    # Initialize and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Make predictions and evaluate the model
    predictions = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    print(f"Root Mean Squared Error: {rmse}")
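
Challenge item 5 asks for a prediction for "The Magic of Bellmonte Lane". A minimal sketch of how a fitted model could be applied to it follows. The training data here is synthetic so that the block runs on its own. The IMDb rating, the decade, and the two lookup features are labeled assumptions or placeholders, since the brief does not supply them, and the printed number therefore says nothing about the real movie.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["runtime", "imdb_rating", "director_top1k_count",
            "lead_actor_highest_gross", "decade"]

# Synthetic training data, for illustration only (not the real dataset)
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "runtime": rng.integers(80, 130, size=50),
    "imdb_rating": rng.uniform(4.0, 8.5, size=50),
    "director_top1k_count": rng.integers(0, 5, size=50),
    "lead_actor_highest_gross": rng.uniform(0, 300, size=50),
    "decade": rng.choice([1990, 2000, 2010, 2020], size=50),
})
train["gross_clean"] = rng.uniform(0, 150, size=50)

model = LinearRegression().fit(train[features], train["gross_clean"])

# Single-row feature frame for the fictitious movie
new_movie = pd.DataFrame([{
    "runtime": 105,                                # given in the brief
    "imdb_rating": train["imdb_rating"].median(),  # ASSUMPTION: unreleased, no rating yet
    "director_top1k_count": 0,                     # PLACEHOLDER: look up Greta Gerwig
    "lead_actor_highest_gross": 0.0,               # PLACEHOLDER: look up Emma Thompson
    "decade": 2020,                                # ASSUMPTION: released in the 2020s
}])[features]

predicted_gross = model.predict(new_movie)[0]
print(f"Predicted gross (same unit as gross_clean): {predicted_gross:.2f}")
```

In the notebook itself, the two placeholders would come from the lookups built in Step 2, for example `director_counts.get("Greta Gerwig", 0)` and `actor_gross.get("Emma Thompson", 0)`. Note that the given $25M production budget is not used, because budget is not among the model's features.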
    

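    Error Magnitude: RMSE is the square root of the mean squared residual, where a residual is the difference between an actual value and the model's prediction. When the residuals average out to roughly zero, it is close to their standard deviation. An RMSE of 44.6764 means that the model's predictions are typically about 44.6764 units away from the actual gross earnings. Because errors are squared before averaging, a few large misses raise RMSE more than they would raise a plain average of absolute errors.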
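
RMSE and the mean absolute error (MAE) are related but not identical, and a small worked example makes the difference concrete. The numbers below are made up. One large miss is enough to pull RMSE well above MAE.

```python
import numpy as np

# Made-up actual and predicted values (same unit, e.g. millions of USD)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 20.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))          # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error

print(f"MAE:  {mae:.2f}")   # MAE:  6.75
print(f"RMSE: {rmse:.2f}")  # RMSE: 10.21
```

Reporting both metrics side by side gives a quick sense of whether the error is spread evenly or dominated by a few badly predicted movies.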

    Unit of Measure: RMSE is expressed in the same unit as the target, so the unit of the gross earnings matters. If the cleaned gross values are in millions of dollars, as the stripped M suffix in the cleaning step suggests, an RMSE of 44.6764 corresponds to a typical error of approximately $44.68 million per movie. This could be significant depending on the typical gross earnings range of the movies in your dataset.

    Interpretation Relative to the Data: The interpretation of the RMSE value depends heavily on the scale and distribution of the gross earnings in your dataset. If the typical gross earnings for movies in your dataset range from a few million to several hundred million dollars, an RMSE of 44.6764 might indicate moderate prediction accuracy. However, if your dataset includes many low-grossing movies (e.g., independent, smaller budget films), this RMSE might indicate poor model performance.

    Model Performance: An RMSE value alone cannot definitively determine if the model is good or bad. It should be compared against a baseline model or against the RMSE of other predictive models on the same dataset. Lower RMSE values indicate better fit to the data, but the acceptability of the error magnitude is context-dependent.
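
A simple baseline for this comparison is a model that always predicts the training mean. The sketch below uses synthetic data and scikit-learn's `DummyRegressor`. The helper `rmse` is defined here for illustration, and the resulting numbers apply only to this toy setup.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: one informative feature plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 5.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def rmse(model):
    """Fit on the training split and return RMSE on the test split."""
    model.fit(X_train, y_train)
    return np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

baseline_rmse = rmse(DummyRegressor(strategy="mean"))  # always predicts the training mean
linear_rmse = rmse(LinearRegression())

print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Linear RMSE:   {linear_rmse:.2f}")
```

If the real model's RMSE is not clearly below the baseline's on the same test split, the chosen features add little beyond the average gross.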

    Improvement Possibilities: A relatively high RMSE suggests there's room for improvement in the model. This could involve more sophisticated feature engineering, trying different modeling techniques, or acquiring additional relevant data.
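
One such modeling variation is to fit on a log-transformed target, which often suits right-skewed revenue data, and to compare candidates with cross-validation instead of a single split. This is a sketch on synthetic data. Whether the transform helps on the real Christmas movie data is an open question that the sketch does not answer.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic right-skewed target for illustration (not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.exp(1.0 + X @ np.array([0.8, 0.3, 0.0]) + rng.normal(0, 0.3, size=200))

candidates = {
    "linear": LinearRegression(),
    # Fit on log1p(y), then map predictions back to the original scale with expm1
    "linear_log_target": TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
    ),
}

# 5-fold cross-validated RMSE; scikit-learn returns it negated, so flip the sign
cv_rmse = {
    name: -cross_val_score(
        est, X, y, cv=5, scoring="neg_root_mean_squared_error"
    ).mean()
    for name, est in candidates.items()
}
print(cv_rmse)
```

Because both scores are computed on the original scale of the target, they can be compared directly with each other and with a baseline.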

    In summary, while an RMSE of 44.6764 provides a general understanding of the model's predictive accuracy, its significance should be interpreted in the context of the gross earnings range of the movies in your dataset and the goals of your predictive modeling.
