Estimating the probability of being born a woman: Case Study from Spain.

    Background

    In this article we are going to test whether the probability of being born a woman is 50%, using birth data from Spain across its 17 regions from 1975 to 2020. Along the way we will import and clean the data, run an exploratory data analysis and apply statistical inference.

    Introduction

    A few days ago, I was reading a really nice book called Understanding Probability: Chance Rules in Everyday Life (Tijms, 2021), where the author introduces basic statistical concepts using approachable explanations and nice examples. In section 5.7.2 of the book, about creating confidence intervals for probabilities, one example he brings up is the common belief that there is a 50 percent chance of being born either a man or a woman. Using a sample of 585,609 births from the Netherlands during the years 1989, 1990 and 1991, he estimates the probability of being born a woman as 48.86%. Moreover, he builds a 95% confidence interval around this estimate, resulting in (0.4873, 0.4899), which doesn't contain 0.5, suggesting that the true probability of being a woman at birth is not 50%.

    In this article, I want to test the same hypothesis, but using this dataset from Spain's National Institute of Statistics (Instituto Nacional de Estadística in Spanish), with a much bigger sample size (a total of 41,802,854 births) spread across 17 different regions of the country. Let's begin!
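    As a quick sanity check on the interval quoted from the book, here is a minimal sketch of the calculation. It assumes the usual normal-approximation (Wald) interval for a proportion and works from the rounded figures given above; the book's exact procedure may differ in detail.

```r
# Assumption: normal-approximation (Wald) interval for a proportion,
# computed from the rounded figures quoted in the text.
n     <- 585609   # births in the Dutch sample
p_hat <- 0.4886   # estimated probability of being born a woman (rounded)

z      <- qnorm(0.975)                        # about 1.96 for a 95% interval
margin <- z * sqrt(p_hat * (1 - p_hat) / n)   # margin of error
ci     <- c(lower = p_hat - margin, upper = p_hat + margin)

round(ci, 4)
```

    With these inputs the bounds should round to the (0.4873, 0.4899) interval quoted above, and 0.5 falls outside it.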

    Exploratory Analysis and Cleaning

    Let's explore what we have:

    # Installing necessary packages
    install.packages("nortest")
    
    # Importing packages
    library(tidyverse)
    library(broom)
    library(nortest)
    
    # Importing data
    # read_csv2() expects ";" as the delimiter and "," as the decimal mark
    nacidos_serie_raw <- read_csv2("nacimientos_ccaa_sexo_seriehistorica.csv",
                                   locale = locale(encoding = "UTF-8"))
    
    head(nacidos_serie_raw)

    The glossary of this data frame (whose column names are in Spanish) is as follows:

    Variable                          Translation  Type       Description
    --------------------------------  -----------  ---------  -------------------------------------------------------
    Comunidades y Ciudades Autónomas  Regions      character  The largest administrative regions of Spain
    Sexo                              Gender       character  The gender of the newborns
    Periodo                           Period       double     The year of birth
    Total                             Total        double     Number of births in the given region, gender and period

    # Brief exploration
    summary(nacidos_serie_raw)
    glimpse(nacidos_serie_raw)

    # Distinct categories of the character variables
    nacidos_serie_raw %>%
      select(where(is.character)) %>%
      map(unique)

    The numbers next to the region names serve no useful purpose for our study and can be confusing, especially when we sort regions by sample size, as we will in the following sections. Let's remove them:

    nacidos_serie <- nacidos_serie_raw %>%
      # Strip away numbers from CCAA names
      mutate(`Comunidades y Ciudades Autónomas` = str_remove(`Comunidades y Ciudades Autónomas`, "\\d+ "))
    
    head(nacidos_serie %>% filter(`Comunidades y Ciudades Autónomas` != "Total"))
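
    To see what the regular expression does in isolation, here is a small sketch on made-up labels. The strings below are hypothetical examples in a "number, space, name" style, not values copied from the real file.

```r
library(stringr)

# Hypothetical labels (assumption): a numeric code, a space, then the name
ccaa_raw <- c("01 Andalucía", "13 Madrid, Comunidad de", "Total")

# "\\d+ " matches one or more digits followed by a single space;
# str_remove() drops only the first match in each string and leaves
# strings without a match (such as "Total") unchanged
ccaa_clean <- str_remove(ccaa_raw, "\\d+ ")
ccaa_clean
```

    Because only the first match is removed, a name that happened to contain digits later in the string would keep them.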

    Which are the biggest and the smallest values in the dataset?

    # Ten biggest values
    nacidos_serie %>%
      arrange(desc(Total)) %>%
      head(10)

    # Ten smallest values
    nacidos_serie %>%
      arrange(Total) %>%
      head(10)

    Among the smallest values, we can see that in 1995 the region Extranjero (which refers to people born abroad) recorded a total of just 2 births. This might be a recording issue, so we will treat it as an outlier and remove it from the dataset:

    # Let's remove the Extranjero observation from 1995 by
    # keeping only rows with more than 30 births
    nacidos_serie <- nacidos_serie %>%
      filter(Total > 30)
    
    nacidos_serie %>%
      arrange(Total) %>%
      head()

    # How big is each region in terms of births?
    nacidos_serie %>%
      # Keep only the "Total" gender rows to avoid double counting
      filter(Sexo == "Total") %>%
      group_by(`Comunidades y Ciudades Autónomas`) %>%
      summarise(Total_births = sum(Total),
                Mean_per_year_births = mean(Total)) %>%
      arrange(desc(Total_births))
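
    The filter on the gender column matters because each region and year has an aggregate row alongside the per-gender rows. A toy sketch of the double-counting problem follows; the counts and the labels "Hombres" and "Mujeres" are assumptions made for illustration, not values taken from the dataset.

```r
library(dplyr)

# Hypothetical rows for one region and one year (not real data)
toy <- tibble(
  Sexo  = c("Total", "Hombres", "Mujeres"),
  Total = c(100, 52, 48)
)

# Summing every row counts each birth twice
naive_sum <- sum(toy$Total)

# Keeping only the aggregate row gives the real number of births
filtered_sum <- toy %>%
  filter(Sexo == "Total") %>%
  summarise(Total_births = sum(Total)) %>%
  pull(Total_births)

c(naive = naive_sum, filtered = filtered_sum)
```

    Here the naive sum is twice the filtered one, which is the inflation the filter is there to prevent.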