Data-Driven Merchandise Sales Strategy: Exploring Pokemon Choices
Scenario: Business Task and Objectives
I am a junior data analyst working for a business intelligence consultant. I have been assigned a project for a new client, a retail store specializing in anime and game merchandise. My task is to conduct a data-driven analysis to provide recommendations for curating a compelling lineup of Pokemon merchandise to kickstart the retail store's offerings. The objective is to optimize the selection of Pokemon plushies that will resonate with customers, align with their preferences, and maximize sales potential.
To achieve this, my analysis will consider various factors, including:
- Analyzing Ash's Journey and Pokemon Win Rates: By examining Ash's journey in the Pokemon animated series and games, I will identify the Pokemon that have been consistently used by Ash and have achieved high win rates in battles. These Pokemon hold significant popularity and can serve as strong candidates for inclusion in the merchandise lineup.
- Investigating Reddit Survey Data: I will explore data from a Reddit survey that captures the favorite Pokemon of Reddit users. This survey provides valuable insights into the preferences and beloved Pokemon among online communities. Analyzing the survey results will help identify Pokemon that garner significant attention and are highly regarded by fans.
- Examining Japan Survey Data for Pokemon Scarlet/Violet: Additionally, I will investigate data from a Japan survey specifically focused on the favorite Pokemon of players in the Pokemon Scarlet/Violet games. This survey provides insights into the preferences and favorite Pokemon of the Japanese market, which is a significant audience for Pokemon merchandise. Incorporating these findings will help ensure that the Pokemon lineup appeals to the Japanese customer base.
- Creating a Ranking of Popularity: A key aspect of this project is to create a comprehensive ranking of Pokemon popularity, derived from the analysis of the datasets above. Incorporating this fourth objective ensures a holistic approach, capturing the evolving preferences and popularity of Pokemon across different contexts.
Data Sources
The analysis will involve the following datasets:
- Ash's Pokemon Win Percentages: Kanto-Alola (2020)
  - This dataset records Pokemon captured and used by Ash Ketchum in battles, including both official and unofficial battles, up to Generation 7.
  - Source: Ash Pokemon rank by win percentage
- Complete Pokemon Dataset (2022)
  - This comprehensive dataset contains information on Pokemon up to Generation 9 and was compiled from pokemondb.net.
  - Source: Kaggle dataset
- Reddit's Favourite Pokemon Survey (2019)
  - This dataset originates from a Reddit survey and includes the favorite Pokemon of 52,836 users.
  - Source: Reddit's Favourite Pokemon Survey
- Japan's Favorite Pokemon Survey (2022)
  - This dataset comprises the results of a survey conducted in Japan, highlighting the favorite Pokemon among 177,829 respondents.
  - Source: Favorite Pokémon in 2022
In the 'ash_pokemon_kanto-alola.csv' file, the following information is available:
- name: The name of each Pokemon.
- battles: The total number of battles for each Pokemon.
- wins: The number of battles won by each Pokemon.
- losses: The number of battles lost by each Pokemon.
From the 'Pokedex_Cleaned.csv' file, I will import the following information:
- type: The primary type of each Pokemon.
From the 'reddit_survey_data.csv' file:
- Pokemon: The name of each Pokemon.
- Number of votes: The number of votes received by each Pokemon in the survey.
Lastly, from the 'japan_survey_data.csv' file:
- name: The name of each Pokemon.
- votes: The number of votes received by each Pokemon in the survey.
Data Preparation and Transformation
In this section, several actions were performed to prepare and transform the data for further analysis and recommendation generation. These actions included:
- Cleaning and Standardization: I ensured data integrity and accuracy by checking for missing values, handling duplicate entries, and resolving inconsistencies within the datasets.
- Feature Engineering: I derived new features from the Ash Pokemon dataset, including win percentage and rank variables computed from each of the datasets.
- Data Integration: I incorporated the Pokemon type data from the Pokedex dataset into each respective dataset to enhance the visualizations.
- Color Palette Setup: I established a color palette based on Pokemon types. This palette is used in the visualizations in a later section, enhancing the visual representation of Pokemon attributes.
- Rank-Based Dataset Creation and Recommendation List: I created a comprehensive dataset that ranks each Pokemon by number of battles, win percentage, Reddit survey votes, and Japan survey votes. From this rank-based dataset, I generated a recommendation list highlighting the top-performing Pokemon across categories, so that the recommendations rest on objective, comparable criteria.
# Importing libraries
library(tidyverse)
# Loading Ash pokemon data
data_raw <- read.csv("data/ash_pokemon_kanto-alola.csv")
head(data_raw)
# Loading Pokedex data
pokedex_data <- read.csv("data/Pokedex_Cleaned.csv", skip = 1, header = FALSE, stringsAsFactors = FALSE)
# Exclude the first column
pokedex_data <- pokedex_data[, -1]
# Remove the additional columns
pokedex_data <- pokedex_data[, 1:2]
# Rename columns
colnames(pokedex_data) <- c("name", "type_1")
# Print the modified dataset
head(pokedex_data)
# Loading Reddit Survey data
reddit_survey_data <- read.csv("data/reddit_survey_data.csv", skip = 1, header = TRUE)
# Rename columns and drop the ones I don't need
colnames(reddit_survey_data) <- c("name", "votes")
reddit_survey_data <- subset(reddit_survey_data, select = c("name", "votes"))
head(reddit_survey_data)
# Loading Japan Survey data
japan_survey_data <- read.csv("data/japan_survey_data.csv")
head(japan_survey_data)
# Removing duplicate data
pokedex_data <- pokedex_data[!duplicated(pokedex_data$name), ]
# Merging data
merged_pokemon <- left_join(data_raw, pokedex_data, by = "name") %>%
  select(name, battles, wins, losses, type_1)
# Data preview
head(merged_pokemon)
# Adding another column
ash_pokemon <- merged_pokemon %>% mutate(win_pct = wins / battles)
# Renaming column type_1 to type
ash_pokemon <- ash_pokemon %>% rename(type = type_1)
head(ash_pokemon)
# Check for missing values in each column for all Pokemon
pokemon_with_missing <- apply(ash_pokemon, 1, function(row) any(is.na(row)))
# Subset the dataset to show Pokemon with missing values
pokemon_missing_values <- ash_pokemon[pokemon_with_missing, ]
print(pokemon_missing_values)
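Rows flagged here typically come from name mismatches with the Pokedex file, which leave `type` as NA. A minimal sketch of one way to patch such gaps so the rows survive later grouping and plotting (the data frame and the "Unknown" placeholder are illustrative, not from the actual datasets):

```r
# Toy stand-in for ash_pokemon: one row is missing its type
df <- data.frame(name = c("Pikachu", "Mystery"),
                 type = c("Electric", NA),
                 stringsAsFactors = FALSE)
# Tag unmatched rows with a placeholder instead of dropping them
df$type[is.na(df$type)] <- "Unknown"
df
```

Alternatively, the correct type can be filled in by hand once the mismatched spelling is identified.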
# Setting up color palette based on pokemon types
type_colors <- c("Bug" = "#A8B820", "Dark" = "#705848", "Dragon" = "#7038F8",
"Electric" = "#F8D030", "Fairy" = "#EE99AC", "Fighting" = "#C03028",
"Fire" = "#F08030", "Flying" = "#A890F0", "Ghost" = "#705898",
"Grass" = "#78C850", "Ground" = "#E0C068", "Ice" = "#98D8D8",
"Normal" = "#A8A878", "Poison" = "#A040A0", "Psychic" = "#F85888",
"Rock" = "#B8A038", "Steel" = "#B8B8D0", "Water" = "#6890F0")
# Create a new dataset with ranks for battles and win percentage
new_dataset <- data.frame(
name = ash_pokemon$name,
battle_rank = rank(-ash_pokemon$battles, na.last = "keep"),
win_pct_rank = rank(-ash_pokemon$win_pct, na.last = "keep")
)
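The `rank(-x, na.last = "keep")` idiom used above ranks in descending order (rank 1 goes to the largest value) while leaving NAs as NA rather than pushing them to the end. A quick illustration:

```r
x <- c(10, 30, NA, 20)
rank(-x, na.last = "keep")
# → 3 1 NA 2  (30 gets rank 1, the NA stays NA)
```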
# Creating rank for the reddit survey data
reddit_survey_data$votes_rank <- rank(-reddit_survey_data$votes, na.last = "keep")
# Merging the new dataset with the reddit survey results
new_dataset <- merge(new_dataset, reddit_survey_data[, c("name", "votes_rank")], by = "name", all = TRUE)
# Creating rank for the japan survey data
japan_survey_data$japan_votes_rank <- rank(-japan_survey_data$votes, na.last = "keep")
# Merging the new dataset with the japan survey results
new_dataset <- merge(new_dataset, japan_survey_data[, c("name", "japan_votes_rank")], by = "name", all = TRUE)
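Because `merge(..., all = TRUE)` keeps every row even when names do not line up exactly across sources, spelling differences silently become NA ranks rather than errors. dplyr's `anti_join` is a handy way to surface such mismatches before merging; a toy sketch (the frames and the spelling discrepancy are illustrative):

```r
library(dplyr)

# Two toy sources that disagree on one spelling
ranks <- data.frame(name = c("Pikachu", "Mr. Mime"), battle_rank = c(1, 2))
votes <- data.frame(name = c("Pikachu", "Mr.Mime"), votes = c(10, 5))

# Rows of `ranks` that would end up with NA survey ranks after merging
anti_join(ranks, votes, by = "name")
```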
# Print the new dataset with rank
head(new_dataset)
# Calculate the ranking scores for battles
battle_score <- ifelse(is.na(new_dataset$battle_rank), 0,
ifelse(new_dataset$battle_rank <= 5, 50,
ifelse(new_dataset$battle_rank <= 10, 40,
ifelse(new_dataset$battle_rank <= 20, 30,
ifelse(new_dataset$battle_rank <= 50, 20,
ifelse(new_dataset$battle_rank <= 100, 10, 5))))))
# Calculate the ranking scores for win percentage
win_pct_score <- ifelse(is.na(new_dataset$win_pct_rank), 0,
ifelse(new_dataset$win_pct_rank <= 5, 50,
ifelse(new_dataset$win_pct_rank <= 10, 40,
ifelse(new_dataset$win_pct_rank <= 20, 30,
ifelse(new_dataset$win_pct_rank <= 30, 20,
ifelse(new_dataset$win_pct_rank <= 100, 10, 5))))))
# Calculate the ranking scores for Reddit survey
reddit_score <- ifelse(is.na(new_dataset$votes_rank), 0,
ifelse(new_dataset$votes_rank <= 5, 50,
ifelse(new_dataset$votes_rank <= 10, 40,
ifelse(new_dataset$votes_rank <= 20, 30,
ifelse(new_dataset$votes_rank <= 50, 20,
ifelse(new_dataset$votes_rank <= 100, 10, 5))))))
# Calculate the ranking scores for Japan survey (tiers decrease monotonically with rank)
japan_score <- ifelse(is.na(new_dataset$japan_votes_rank), 0,
ifelse(new_dataset$japan_votes_rank == 1, 70,
ifelse(new_dataset$japan_votes_rank <= 3, 65,
ifelse(new_dataset$japan_votes_rank <= 5, 60,
ifelse(new_dataset$japan_votes_rank <= 10, 55,
ifelse(new_dataset$japan_votes_rank <= 20, 52,
ifelse(new_dataset$japan_votes_rank <= 30, 51,
ifelse(new_dataset$japan_votes_rank <= 50, 50,
ifelse(new_dataset$japan_votes_rank <= 100, 1, 0)))))))))
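Nested `ifelse` ladders like these work, but the tiers are easy to misorder. dplyr's `case_when` evaluates conditions top to bottom and reads much closer to a tier table. A sketch of the same battle-rank tiers (the helper name `score_tier` is my own, not from the original code):

```r
library(dplyr)

# Same tiers as the battle_rank ladder, written as a readable lookup
score_tier <- function(r) {
  case_when(
    is.na(r) ~ 0,
    r <= 5   ~ 50,
    r <= 10  ~ 40,
    r <= 20  ~ 30,
    r <= 50  ~ 20,
    r <= 100 ~ 10,
    TRUE     ~ 5
  )
}

score_tier(c(3, 15, NA, 200))
# → 50 30 0 5
```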
# Calculate the aggregated score
aggregated_score <- battle_score + win_pct_score + reddit_score + japan_score
# Combine Pokémon data with aggregated score
pokemon_scores <- data.frame(name = new_dataset$name, score = aggregated_score)
# Sort the Pokémon based on the aggregated score
sorted_pokemon <- pokemon_scores[order(pokemon_scores$score, decreasing = TRUE), ]
# Select the top-ranked Pokémon as recommendations
top_recommendations <- head(sorted_pokemon, 25) # <-- Adjust this number to change the size of the recommendations table
# Initialize an empty data frame to store the details
pokemon_details <- data.frame()
# Iterate over each Pokemon in the top_recommendations list
for (pokemon in top_recommendations$name) {
  # Get the details for the current Pokemon
  pokemon_details <- rbind(pokemon_details, new_dataset[new_dataset$name == pokemon, ])
}
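Growing a data frame with `rbind` inside a loop is quadratic in the number of rows; since names are unique after deduplication, the same lookup can be done in one vectorized step with `match`. A toy sketch (the frames stand in for `new_dataset` and `top_recommendations`):

```r
# Toy stand-ins for new_dataset and top_recommendations
details <- data.frame(name = c("A", "B", "C"),
                      battle_rank = c(2, 1, 3),
                      stringsAsFactors = FALSE)
top <- data.frame(name = c("B", "C"), stringsAsFactors = FALSE)

# Pull the matching rows in the order of the recommendation list
picked <- details[match(top$name, details$name), ]
picked$name
# → "B" "C"
```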
# Create a new data frame with the desired columns
pokemon_table <- data.frame(
Rank = 1:nrow(pokemon_details),
pokemon_name = pokemon_details$name,
battle_rank = ifelse(!is.na(pokemon_details$battle_rank), paste0("#", pokemon_details$battle_rank), "-"),
win_pct_rank = ifelse(!is.na(pokemon_details$win_pct_rank), paste0("#", pokemon_details$win_pct_rank), "-"),
reddit_survey = ifelse(!is.na(pokemon_details$votes_rank), paste0("#", pokemon_details$votes_rank), "-"),
japan_survey = ifelse(!is.na(pokemon_details$japan_votes_rank),
paste0("#", pokemon_details$japan_votes_rank),
"-"),
stringsAsFactors = FALSE)
Supporting Visualizations and Key Findings
In this section, the objective is to delve into the data and gain insights into the top-performing Pokemon across different categories and datasets.
# Plotting Top 15 Ash pokemon by win percentage
# Preparing the data
top_win_pct <- ash_pokemon %>%
  select(name, battles, wins, win_pct, type) %>%
  slice_max(win_pct, n = 15) %>%
  arrange(desc(win_pct))
# Plotting the data
ggplot(top_win_pct, aes(x = win_pct, y = reorder(name, win_pct), fill = type)) +
geom_col() +
geom_text(aes(label = paste0(round(win_pct * 100, 1), "%")), size = 3, position = position_stack(vjust = 0.5)) +
scale_x_continuous(limits = c(0, 1), expand = c(0, 0), labels = scales::percent_format(accuracy = 1)) +
xlab('Win percentage') +
ylab('Pokemon') +
ggtitle('Top 15 Pokemon by win percentage') +
scale_fill_manual(values = type_colors) +
theme(axis.text.x = element_text(size = 10), axis.text.y = element_text(size = 8), plot.title = element_text(hjust = 0.5))