Predicting Google Apps Sentiment with Naïve Bayes
This dataset consists of web-scraped data from 60,000 app reviews, including the text of each review and its sentiment scores. We will try to predict sentiment from the review text with a Naïve Bayes model.
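As a rough illustration of the idea, the sketch below fits a multinomial Naïve Bayes classifier on bag-of-words counts. The six reviews and their labels are invented for this example and are not rows from the dataset; the choice of `CountVectorizer` with `MultinomialNB` is an assumption about one reasonable setup, not necessarily the final model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy reviews invented for illustration; they are not taken from the dataset.
texts = [
    "great app love it",
    "love the new design",
    "works great very useful",
    "terrible app keeps crashing",
    "crashing all the time hate it",
    "useless and full of ads",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = not positive

# Turn each review into a vector of word counts (bag of words).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Multinomial Naive Bayes models per-class word frequencies
# (with Laplace smoothing, alpha=1 by default).
model = MultinomialNB()
model.fit(X, labels)

# New text must be transformed with the same fitted vectorizer.
new_reviews = ["love this great app", "hate the ads"]
predictions = model.predict(vectorizer.transform(new_reviews))
print(predictions)
```

Words the vectorizer never saw during fitting (such as "this" above) are simply ignored at prediction time.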
Data Dictionary
variable | class | description |
---|---|---|
App | character | The application name |
Translated_Review | character | User review (translated to English) |
Sentiment | character | The sentiment of the user - Positive/Negative/Neutral |
Sentiment_Polarity | character | The sentiment polarity score |
Sentiment_Subjectivity | character | The sentiment subjectivity score |
Source of dataset.
# Modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
# Seaborn parameters for data visualization
sns.set(rc={"figure.figsize":(15, 7)})
sns.set_context("notebook")
sns.set_style("white")
reviews = pd.read_csv('review_data.csv', usecols = ["App", "Translated_Review", "Sentiment"])
display(reviews.head())
display(reviews.shape)
Data Validation
# Check for missing values
display(reviews.isnull().sum())
# Percentage of null cells in the whole dataframe
percentage_null = round(100 * reviews.isnull().sum().sum() / reviews.size, 2)
print(f"\nNull percentage in the entire dataframe: {percentage_null}%")
# Remove null values
reviews.dropna(inplace=True)
Exploratory Data Analysis
# Count values in Sentiment column
display(reviews.Sentiment.value_counts())
# Plot
ax = reviews.Sentiment.value_counts().plot(kind="bar", color="cadetblue")
ax.set_xlabel("Sentiment")
ax.tick_params(axis="x", labelrotation=0)  # keep the class labels horizontal
ax.set_ylabel("Count")
ax.set_title("Sentiment Frequency");
Since neutral sentiment is not positive, we can merge the negative and neutral classes into a single "not positive" class and work with a binary target variable (1 = positive, 0 = not positive).
# Encode Sentiment as a binary target: 1 = Positive, 0 = Negative or Neutral
reviews["Sentiment"] = (reviews["Sentiment"] == "Positive").astype(int)
# Count values in adjusted Sentiment column
display(reviews.Sentiment.value_counts())
# Plot
reviews.Sentiment.value_counts().plot(kind="bar", color="cadetblue")
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Frequency Adjusted");
The Sentiment variable is now binary, so the share of positive reviews for each app equals the mean of Sentiment for that app. Let's see which 10 apps have the highest positive-review rate.
top_rated_app = reviews.groupby("App").Sentiment.agg(["count", "mean"]) \
.sort_values(["mean", "count"], ascending=False).reset_index().head(10)
fig, ax = plt.subplots()
sns.barplot(x="count", y="App", data=top_rated_app, color="cadetblue", ax=ax)
ax2 = ax.twiny()
ax2.plot(top_rated_app["mean"], top_rated_app.App, color="red", linestyle="--")
ax2.set_xlim(0, 1.05)  # the rate axis runs from 0 to 1
ax.set_xlabel("Reviews", color="cadetblue")
ax2.set_xlabel("Positive sentiment rate", color="red")
ax.tick_params("x", colors="cadetblue")
ax2.tick_params("x", colors="red")
fig.suptitle("Top 10 rated apps", y= 1);
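The claim that the mean of a binary column equals the positive rate can be checked on a small example. The sketch below uses a toy dataframe with hypothetical app names "A" and "B" (not from the dataset) and compares the group mean with the count of ones divided by the number of rows.

```python
import pandas as pd

# Toy data invented for illustration; the app names are hypothetical.
toy = pd.DataFrame({
    "App": ["A", "A", "A", "A", "B", "B"],
    "Sentiment": [1, 1, 0, 1, 0, 1],
})

# Per-app review count, number of positive reviews, and mean of the 0/1 column.
summary = toy.groupby("App").Sentiment.agg(["count", "sum", "mean"])

# The mean of a 0/1 column is (number of ones) / (number of rows).
summary["rate"] = summary["sum"] / summary["count"]
print(summary)
```

Note that sorting by the mean alone can favor apps with very few reviews, which is why the ranking above uses the review count as a tiebreaker.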