Richard Pallangyo
Introduction
Arsenic occurs naturally in groundwater around the world. Arsenic contamination of groundwater affects millions of people in countries including the United States, Nicaragua, Argentina, China, Mexico, Chile, Bangladesh, India, and Vietnam (Smith et al. 2000; Amini et al. 2008; Lin et al. 2017). The World Health Organization (WHO 2018a) estimates that over 140 million people in 50 countries are exposed to drinking water with arsenic levels above the WHO guideline of 10 µg/L. Health effects of arsenic exposure include numerous types of cancer and other disorders.
This project follows an analysis of a public health study performed in rural Bangladesh (Gelman et al. 2004). In this study, wells used for drinking water were tested for arsenic contamination and labeled as safe or unsafe accordingly. The study recorded whether households switched the well they used for drinking water. Additionally, several variables thought to influence the decision to switch wells were measured.
This project will investigate how accurately we can predict whether or not a household will switch wells based on the measured variables, particularly arsenic levels. The central goal of our analysis and prediction is to determine whether households changed the wells they were using after wells were labeled as either safe or unsafe, based on measured arsenic levels.
Data Collection
See Gelman et al. (2004) for a discussion of data collection. Briefly, arsenic levels were measured in Araihazar, Bangladesh during the years 1999 - 2000. Additional information was collected by a survey:
- Whether or not the household switched wells.
- The distance (in meters) to the closest known safe well.
- Whether any members of the household are involved in community organizations.
- The highest education level in the household.
Load necessary packages
#skimr provides a nice summary of a data set
library(skimr)
#tidyverse contains packages we will use for processing and plotting data
library(tidyverse)
#GGally has a nice pairs plotting function
library(GGally)
#tidymodels has a nice workflow for many models. We will use it for XGBoost
library(tidymodels)
#xgboost lets us fit XGBoost models
library(xgboost)
#vip is used to visualize the importance of predictors in XGBoost models
library(vip)
#Set the plotting theme
theme_set(theme_bw())
Load the data set contained in the file wells.dat and name the data frame df.
df <- read.table('wells.dat')
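The call above does not set header, so read.table() decides for itself whether the first line holds column names. The sketch below illustrates that behavior on a small temporary file. It assumes that wells.dat has a header row with one fewer field than the data rows, whose first field is a row name; the file contents and the names toy_lines, toy_path, and toy_df are made up for illustration and are not taken from the real data.

```r
# Hypothetical miniature of wells.dat (assumed layout, made-up values):
# a header row with one fewer field than the data rows, whose first
# field is a row name.
toy_lines <- c(
  '"switch" "arsenic" "dist" "assoc" "educ"',
  '"1" 1 2.36 16.8 0 0',
  '"2" 1 0.71 47.3 0 0',
  '"3" 0 2.07 21.0 0 10'
)
toy_path <- tempfile(fileext = ".dat")
writeLines(toy_lines, toy_path)

# With header left unspecified, read.table() treats the first line as a
# header when it has one fewer field than the following rows.
toy_df <- read.table(toy_path)

# Inspect column names and types before any modeling
str(toy_df)
```

If the real file turned out to have no row-name column, passing header = TRUE explicitly would be the safer choice.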
Explore the contents of the data set
Look at the first few rows of the data frame.
head(df)
Explore the columns
The variables in the data set are:
- switch: An indicator of whether a household switches wells.
- arsenic: The arsenic level of the household’s well.
- dist: The distance (in meters) to the closest known safe well.
- assoc: An indicator of whether any members of the household are involved in community organizations.
- educ: The highest education level in the household.
What variable(s) do we want to predict? We are interested in whether households switched the wells they were using after wells were labeled as either safe or unsafe, based on measured arsenic levels. So, we are trying to predict switch.
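Because switch is an indicator, predicting it is a classification problem, and classification tools in R generally expect the outcome to be a factor. The following is a minimal sketch of that preparation step on a made-up data frame with the same column names. It assumes switch is coded 0/1; the values, the name toy_df, and the labels "no" and "yes" are hypothetical choices for illustration.

```r
library(dplyr)

# Made-up rows with the same column names as the data set (illustration only)
toy_df <- data.frame(
  switch  = c(1, 0, 1, 1, 0),
  arsenic = c(2.36, 0.71, 2.07, 1.15, 1.10),
  dist    = c(16.8, 47.3, 21.0, 21.5, 40.9),
  assoc   = c(0, 0, 0, 1, 1),
  educ    = c(0, 0, 10, 12, 14)
)

# Recode the assumed 0/1 indicator as a factor so it is treated as a class label
toy_df <- toy_df %>%
  mutate(switch = factor(switch, levels = c(0, 1), labels = c("no", "yes")))

# Class balance: how many households switched versus stayed
toy_df %>% count(switch)
```

Checking the class balance early is useful because a strongly imbalanced outcome would change how prediction accuracy should be interpreted.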