Richard Pallangyo
Water Quality Project (Classification Project)

    Introduction

    Arsenic occurs naturally in groundwater sources around the world. Arsenic contamination of groundwater affects millions of people in many countries, including the United States, Nicaragua, Argentina, China, Mexico, Chile, Bangladesh, India, and Vietnam (Smith et al. 2000; Amini et al. 2008; Lin et al. 2017). The World Health Organization (WHO 2018a) estimates that over 140 million people in 50 countries are exposed to drinking water with arsenic concentrations above the WHO guideline of 10 µg/L. Health effects of arsenic exposure include numerous types of cancer and other disorders.

    This project follows an analysis of a public health study performed in rural Bangladesh (Gelman et al. 2004). In this study, wells used for drinking water were analyzed for arsenic contamination and labeled as safe or unsafe accordingly. The study recorded whether households switched the well they used for drinking water. Several additional variables thought to influence the decision to switch wells were also measured.

    This project will investigate how accurately we can predict whether a household will switch wells based on the measured variables, particularly arsenic levels. The central goal of our analysis and prediction is to determine whether households changed the wells they were using after wells were labeled as either safe or unsafe, based on measured arsenic levels.

    Data Collection

    See Gelman et al. (2004) for a discussion of data collection. Briefly, arsenic levels were measured in Araihazar, Bangladesh, from 1999 to 2000. Additional information was collected by a survey:

    1. Whether or not the household switched wells.
    2. The distance (in meters) to the closest known safe well.
    3. Whether any members of the household are involved in community organizations.
    4. The highest education level in the household.

    Load necessary packages

    
    #skimr provides a nice summary of a data set
    library(skimr)
    #tidyverse contains packages we will use for processing and plotting data
    library(tidyverse)
    #GGally has a nice pairs plotting function
    library(GGally)
    #tidymodels has a nice workflow for many models. We will use it for XGBoost
    library(tidymodels)
    #xgboost lets us fit XGBoost models
    library(xgboost)
    #vip is used to visualize the importance of predictors in XGBoost models
    library(vip)
    
    #Set the plotting theme
    theme_set(theme_bw())
    

    Load the data set contained in the file wells.dat and name the data frame df.

    df <- read.table('wells.dat')
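
The call above relies on read.table detecting the header line on its own. As a minimal sketch, assuming wells.dat starts with a header line naming the five columns, the same read can be made explicit and followed by a quick sanity check. The temporary file and its three rows are made up for illustration and are not taken from the study data.

```r
# Hypothetical stand-in for wells.dat: a header line plus three made-up rows.
# The layout of the real file is an assumption here.
tmp <- tempfile(fileext = ".dat")
writeLines(c(
  "switch arsenic dist assoc educ",
  "1 2.36 16.8 0 0",
  "0 0.71 47.3 1 10",
  "1 1.15 21.0 0 4"
), tmp)

# State the header explicitly instead of relying on auto-detection
toy <- read.table(tmp, header = TRUE)

# Sanity checks: expected column names and no missing values
expected_cols <- c("switch", "arsenic", "dist", "assoc", "educ")
stopifnot(identical(names(toy), expected_cols))
stopifnot(!anyNA(toy))
```

If either check fails on the real file, the header or separator settings probably need adjusting before any modeling starts.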

    Explore the contents of the data set

    Look at the first few rows of the data frame.

    head(df)
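
Beyond the first few rows, it can help to check dimensions, column types, and missing values before modeling. The following is a sketch in base R on a small made-up data frame (toy) that only mimics the column layout of the data set; the values are illustrative, not real measurements.

```r
# Made-up rows in the same column layout as the data set (illustrative only)
toy <- data.frame(
  switch  = c(1, 0, 1, 1, 0),
  arsenic = c(2.36, 0.71, 2.07, 1.15, 1.10),
  dist    = c(16.8, 47.3, 21.0, 21.5, 40.9),
  assoc   = c(0, 0, 0, 1, 1),
  educ    = c(0, 0, 10, 12, 14)
)

# Number of rows and columns
print(dim(toy))

# Column types and a preview of the values
str(toy)

# Per-column numeric summaries (min, quartiles, mean, max)
print(summary(toy))

# Count of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

The same calls can be applied to df once the file has been read.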

    Explore the columns

    The variables in the data set are:

    1. switch: An indicator of whether a household switched wells.

    2. arsenic: The arsenic level of the household’s well.

    3. dist: The distance (in meters) to the closest known safe well.

    4. assoc: An indicator of whether any members of the household are involved in community organizations.

    5. educ: The highest education level in the household.

    What variable(s) do we want to predict? We are interested in whether households switched the wells they were using after wells were labeled as either safe or unsafe, based on measured arsenic levels. So, we are trying to predict switch.
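
Because switch is the outcome of a classification problem, it is usually stored as a factor before modeling, and it is worth checking how balanced the two classes are. The sketch below assumes switch is coded 0/1; the labels "no" and "yes" are our own choice, and the data frame toy is made up for illustration.

```r
# Made-up rows in the same column layout as the data set (illustrative only)
toy <- data.frame(
  switch  = c(1, 0, 1, 1, 0),
  arsenic = c(2.36, 0.71, 2.07, 1.15, 1.10),
  dist    = c(16.8, 47.3, 21.0, 21.5, 40.9),
  assoc   = c(0, 0, 0, 1, 1),
  educ    = c(0, 0, 10, 12, 14)
)

# Recode the 0/1 indicator as a factor (assumes 0 = did not switch, 1 = switched)
toy$switch <- factor(toy$switch, levels = c(0, 1), labels = c("no", "yes"))

# Class balance: proportion of households in each outcome class
class_share <- prop.table(table(toy$switch))
print(class_share)
```

A strongly unbalanced outcome would make plain accuracy a less informative measure of prediction quality, so this check is useful before choosing evaluation metrics.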