Skip to content
0

Understanding Lego sets popularity

📖 Background

We are going use the provided dataset to understand the popularity of different Lego sets and themes. The idea is to become familiarized with the data to be ready for an interview with a business stakeholder.

We will be utilizing PostgreSQL queries to perform data analysis and gain valuable insights from our data set. This will enable us to make informed decisions and improve our operations.

💾 The data

We got a database with the following tables. You can also see above a visualization of how the tables are related to each other. (source):
inventory_parts
  • "inventory_id" - id of the inventory the part is in (as in the inventories table)
  • "part_num" - unique id for the part (as in the parts table)
  • "color_id" - id of the color
  • "quantity" - the number of copies of the part included in the set
  • "is_spare" - whether or not it is a spare part
parts
  • "part_num" - unique id for the part (as in the inventory_parts table)
  • "name" - name of the part
  • "part_cat_id" - part category id (as in part_catagories table)
part_categories
  • "id" - part category id (as in parts table)
  • "name" - name of the category the part belongs to
colors
  • "id" - id of the color (as in inventory_parts table)
  • "name" - color name
  • "rgb" - rgb code of the color
  • "is_trans" - whether or not the part is transparent/translucent
inventories
  • "id" - id of the inventory the part is in (as in the inventory_sets and inventory_parts tables)
  • "version" - version number
  • "set_num" - set number (as in sets table)
inventory_sets
  • "inventory_id" - id of the inventory the part is in (as in the inventories table)
  • "set_num" - set number (as in sets table)
  • "quantity" - the quantity of sets included
sets
  • "set_num" - unique set id (as in inventory_sets and inventories tables)
  • "name" - the name of the set
  • "year" - the year the set was published
  • "theme_id" - the id of the theme the set belongs to (as in themes table)
  • "num_parts" - the number of parts in the set
themes
  • "id" - the id of the theme (as in the sets table)
  • "name" - the name of the theme
  • "parent_id" - the id of the larger theme, if there is one

Acknowledgments: Rebrickable.com

1. What is the average number of Lego sets released per year?

Spinner
DataFrameas
df
variable
WITH set_count_per_year AS (
    SELECT year, COUNT(set_num) AS set_count
    FROM sets
    GROUP BY year
)
SELECT AVG(set_count) AS avg_sets_per_year
FROM set_count_per_year;

/* This query first uses a Common Table Expression (CTE) 
to group the sets by year and count the number of sets per year. 
Then it uses the AVG function to calculate the average number of 
sets across all the years. */

2. What is the average number of Lego parts per year?

Spinner
DataFrameas
avg_num_parts
variable
SELECT 
    year, 
    AVG(num_parts) as avg_num_parts 
FROM sets 
GROUP BY year
ORDER BY AVG(num_parts) DESC, year;

/* The query will return the average number of Lego parts 
for each year and it will be ordered by the average number of parts 
in descending order and then by year. */

3. Create a visualization for item 2.

Current Type: Bar
Current X-axis: year
Current Y-axis: avg_num_parts
Current Color: None

The average number of Lego parts per year

4. What are the 5 most popular colors used in Lego parts?

Spinner
DataFrameas
popular_colors
variable
WITH color_count AS
  (SELECT color_id,
          COUNT(color_id) AS color_count
   FROM inventory_parts
   GROUP BY color_id)
SELECT c.name,
       cc.color_count
FROM color_count cc
JOIN colors c ON cc.color_id = c.id
ORDER BY cc.color_count DESC
LIMIT 5;

/*   This query first uses a Common Table Expression (CTE) to group the parts by color_id 
and count the number of parts per color. Then it joins the color_count CTE with the colors table 
to retrieve the name of the color. Finally, it sorts the results by color_count in descending order 
and limits the results to the top 5. This query will give us the 5 most popular colors 
and the count of parts used in that color. */

5. What proportion of Lego parts are transparent?

Spinner
DataFrameas
df
variable
WITH transparent_parts AS (
    SELECT COUNT(DISTINCT ip.part_num)::float as transparent_parts
    FROM inventory_parts ip
    LEFT JOIN colors c ON ip.color_id = c.id
    WHERE c.is_trans = true
),
all_parts AS (
    SELECT COUNT(DISTINCT part_num)::float as all_parts
    FROM inventory_parts
)
SELECT (transparent_parts/all_parts * 100)::numeric(5,1) || '%' as percentage_transparent
FROM transparent_parts, all_parts;

/*     This query first uses two Common Table Expressions (CTEs) to count the number of distinct parts 
that are transparent and the number of all distinct parts. By using the '::float' operator, it ensures 
that the columns data type are converted to float. Then it selects the proportion of transparent parts 
by dividing the number of transparent parts by the total number of parts, then it multiplies it by 100, 
rounds it to 1 decimal place and concatenates the percent sign. */