Predicting Google Apps Sentiment Score with Naïve Bayes

    Predicting Google Apps Sentiment with Naïve Bayes

    This dataset consists of web-scraped data from 60,000 app reviews, including the text of each review and its sentiment scores. We will try to predict sentiment from the review text with a naïve Bayes model.
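As a rough illustration of the approach, the sketch below turns a few reviews into word counts and fits a multinomial naïve Bayes classifier on them. The toy reviews, labels, and variable names are invented for this example and are not taken from the dataset; the choice of `MultinomialNB` here is an assumption, since several naïve Bayes variants could be used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy reviews invented for illustration; they are not taken from the dataset
toy_reviews = [
    "great app love it",
    "love this great game",
    "terrible app crashes constantly",
    "bad update crashes",
]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = not positive

# Turn each review into a vector of word counts (bag of words)
vectorizer = CountVectorizer()
X_toy = vectorizer.fit_transform(toy_reviews)

# Fit a multinomial naive Bayes classifier on the word counts
toy_model = MultinomialNB()
toy_model.fit(X_toy, toy_labels)

# Score unseen text using the vocabulary learned above;
# words outside that vocabulary (such as "after") are ignored
new_text = ["love this app", "crashes after update"]
predictions = toy_model.predict(vectorizer.transform(new_text))
print(predictions)
```

The classifier treats every word as independent given the class, so a review's score is driven by how often its words appeared in positive versus not-positive training reviews.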

    Data Dictionary

    variable               | class     | description
    App                    | character | The application name
    Translated_Review      | character | User review (translated to English)
    Sentiment              | character | The sentiment of the user - Positive/Negative/Neutral
    Sentiment_Polarity     | character | The sentiment polarity score
    Sentiment_Subjectivity | character | The sentiment subjectivity score

    Source of dataset.

    # Modules
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
    
    # Seaborn parameters for data visualization
    sns.set(rc={"figure.figsize":(15, 7)})
    sns.set_context("notebook")
    sns.set_style("white")
    
    # Load only the columns used in this analysis
    reviews = pd.read_csv('review_data.csv', usecols = ["App", "Translated_Review", "Sentiment"])
    
    display(reviews.head())
    display(reviews.shape)

    Data Validation

    # Check for missing values
    display(reviews.isnull().sum())
    
    # Percentage of null cells in the entire dataframe
    percentage_null = round(reviews.isnull().sum().sum() / reviews.size * 100, 2)
    
    print("\nNull percentage in the entire dataframe: ", percentage_null)
    # Remove null values
    reviews.dropna(inplace=True)

    Exploratory Data Analysis

    # Count values in Sentiment column
    display(reviews.Sentiment.value_counts())
    
    # Plot (rot=0 keeps the x tick labels horizontal)
    ax = reviews.Sentiment.value_counts().plot(kind="bar", color="cadetblue", rot=0)
    ax.set_xlabel("Sentiment")
    ax.set_ylabel("Count")
    ax.set_title("Sentiment Frequency");

    Since neutral sentiment is neither positive nor negative, we can merge the negative and neutral reviews into a single "not positive" class and work with a binary target variable (1 = positive, 0 = not positive).

    # Recode Sentiment as a binary target: 1 for "Positive", 0 for "Negative" or "Neutral"
    reviews["Sentiment"] = [1 if x == "Positive" else 0 for x in reviews.Sentiment]
    # Count values in adjusted Sentiment column
    display(reviews.Sentiment.value_counts())
    
    # Plot
    reviews.Sentiment.value_counts().plot(kind="bar", color="cadetblue")
    plt.xlabel("Sentiment")
    plt.ylabel("Count")
    plt.title("Sentiment Frequency Adjusted");
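The binary recoding can be checked on a small example. The sketch below uses a toy series invented for illustration (`toy_sentiment` is not part of the dataset) and a vectorized comparison, which is one common way to express the same mapping in pandas.

```python
import pandas as pd

# Toy labels invented for illustration; they are not taken from the dataset
toy_sentiment = pd.Series(["Positive", "Negative", "Neutral", "Positive", "Neutral"])

# Boolean comparison cast to int: "Positive" -> 1, "Negative" and "Neutral" -> 0
toy_binary = (toy_sentiment == "Positive").astype(int)

print(toy_binary.tolist())
print(toy_binary.value_counts().to_dict())
```

Both the neutral and the negative entries end up in class 0, so the count of class 0 equals the sum of the two original counts.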

    The Sentiment variable is now binary, so the positive-review rate for each app equals the mean of Sentiment for that app. Let's look at the 10 top-rated apps, ranked by positive rate and then by number of reviews.

    top_rated_app = reviews.groupby("App").Sentiment.agg(["count", "mean"]) \
                            .sort_values(["mean", "count"], ascending=False).reset_index().head(10)
    
    fig, ax = plt.subplots()
    sns.barplot(x="count", y="App", data=top_rated_app, color="cadetblue", ax=ax)
    ax2 = ax.twiny()
    ax2.plot(top_rated_app["mean"], top_rated_app.App, color="red", linestyle="--")
    ax2.set_xlim(0, 1.05)  # the rate axis runs from 0 to 1, with a small margin
    ax.set_xlabel("Reviews", color="cadetblue")
    ax2.set_xlabel("Positive sentiment rate", color="red")
    ax.tick_params("x", colors="cadetblue")
    ax2.tick_params("x", colors="red")
    fig.suptitle("Top 10 rated apps", y= 1);
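To see why the mean of a 0/1 column is the positive-review rate, here is a minimal sketch on a toy frame. The app names and values are invented for illustration; only the `groupby(...).agg(["count", "mean"])` pattern mirrors the code above.

```python
import pandas as pd

# Toy data invented for illustration: app "A" has 3 positives out of 4 reviews,
# app "B" has 1 positive out of 2
toy = pd.DataFrame({
    "App": ["A", "A", "A", "A", "B", "B"],
    "Sentiment": [1, 1, 1, 0, 1, 0],
})

# Same aggregation as in the notebook: number of reviews and mean of Sentiment per app
summary = toy.groupby("App").Sentiment.agg(["count", "mean"])

# Manual rate: number of positive reviews divided by total reviews per app
grouped = toy.groupby("App").Sentiment
manual_rate = grouped.sum() / grouped.count()

print(summary)
print(manual_rate)
```

Because each positive review contributes 1 and every other review contributes 0, the sum counts the positives and the mean divides that count by the number of reviews.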