William Agyapong’s workbooks















Getting Tree Selection Right for Manhattan, New York City

Introduction

This report is in response to the following prompt:

You work for a nonprofit organization advising the planning department on ways to improve the quantity and quality of trees in New York City. The urban design team believes tree size (using trunk diameter as a proxy for size) and health are the most desirable characteristics of city trees. The city would like to learn more about which tree species are the best choice to plant on the streets of Manhattan.

Street trees provide multiple benefits that improve quality of life in New York City. Trees purify and cool the air, reduce stormwater runoff, and conserve energy. They increase property values, beautify neighborhoods, and improve human health and well-being.

In this report, we address the following items, the last of which is the main objective:

  • What are the most common tree species in Manhattan?
  • Which are the neighborhoods with the most trees?
  • A visualization of Manhattan's neighborhoods and tree locations.
  • Which ten tree species should the city plant in the future?

Key Findings and Recommendations

  • Honeylocust, Callery pear, ginkgo, pin oak, and Sophora, in decreasing order of frequency, are the five most common tree species in Manhattan.
  • The Upper West Side neighborhood has the most trees, followed by Upper East Side-Carnegie Hill, West Village, and Central Harlem North-Polo Grounds to complete the top four.
  • Large trees are generally healthy.
  • The majority of trees in Manhattan are 20 inches or less in diameter (DBH). Thus, the trees are mostly small or medium in size.
  • Generally, trees in Manhattan are not adversely affected by root-, trunk-, and branch-related problems. The highest problem count is 6 out of 9, which comes from Callery pear and honeylocust trees.
  • Root-related problems are the most frequent, with a proportion of occurrence of approximately 0.58, while trunk- and branch-related problems are less frequent and occur at roughly equal rates (0.20 and 0.22, respectively).
  • Our analysis suggests the following 10 species as the best choices to plant on the streets of Manhattan, ranked from best: American Elm, Willow Oak, Black Locust, Pin Oak, Honeylocust, Sophora, London planetree, Golden Raintree, Green Ash, and Sawtooth Oak.
#----------- let's create some custom functions -------------

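# install a package quietly, but only if it is not already available
install_pkg <- function(pkg_name) {
    if (!requireNamespace(pkg_name, quietly = TRUE)) {
        suppressWarnings(suppressMessages(install.packages(pkg_name, quiet = TRUE)))
    }
}

# load a package by name (given as a string) without startup messages
load_pkg <- function(pkg_name) {
    suppressPackageStartupMessages(library(pkg_name, character.only = TRUE))
}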

# custom ggplot theme
my_ggtheme <- function (legend='none', center = 0.5, face='bold', yaxis='on') {
    theme(
        plot.title = element_text(hjust = center, face=face),
        plot.subtitle = element_text(hjust = center, face=face),
        legend.position = legend
    )
}
#-------- install and load required packages
install_pkg('naniar')
install_pkg('ggridges')

load_pkg('tidyverse')
load_pkg('ggridges')
load_pkg('scales')
load_pkg('naniar')
load_pkg('sf')
load_pkg('plotly')

#-------- change some default settings
theme_set(theme_classic())
options(dplyr.summarise.inform = FALSE)
options(warn=-1) # suppress warnings, might want to remove this when in development mode.
# tree health colors
health_col <- c("#5c4033", "lightgreen", "darkgreen")
# import the trees data
trees <- readr::read_csv('data/trees.csv', show_col_types = FALSE)

# import the neighborhoods information
neighborhoods <- st_read("data/nta.shp", quiet=TRUE)
# plot(neighborhoods)

# get map data for the Manhattan Borough and subset the necessary columns
man_neighborhoods <- neighborhoods %>% filter(boroname=='Manhattan') %>%
	select(boroname, ntacode, ntaname, geometry)

Highlights

# prepare highlights
n_spc <- length(unique(trees$spc_common))
n_trees <- nrow(trees)
n_neigh <- length(unique(trees$nta_name))
alive_dead <- trees %>%
    group_by(status) %>%
    summarise(n = n()) %>%
    mutate(prop = percent(n/sum(n), accuracy = 0.1),
           label = paste0(number(n,big.mark = ','), ' (',prop,')')) %>%
    pull(label)
health_stat <- trees %>%
	# filter(!is.na(health)) %>%
	group_by(health) %>%
	summarise(n = n()) %>%
    mutate(prop = percent(n/sum(n), 0.1),
           label = paste0(number(n,big.mark = ','), ' (',prop,')')) %>%
    pull(label)
curb_stat <- trees %>%
	# filter(!is.na(health)) %>%
	group_by(curb_loc) %>%
	summarise(n = n()) %>%
    mutate(prop = percent(n/sum(n), 0.1),
           label = paste0(number(n,big.mark = ','), ' (',prop,')')) %>%
    pull(label)

size_stat <- trees %>%
	mutate(below_20inc = if_else(tree_dbh < 20, "Yes", "No")) %>%
	group_by(below_20inc) %>%
	summarise(n = n()) %>%
    mutate(prop = percent(n/sum(n), 0.1),
           label = paste0(number(n,big.mark = ','), ' (',prop,')')) %>%
    pull(label)

prevalent_spc <- trees %>%
	filter(!is.na(spc_common)) %>%
	group_by(spc_common) %>%
	summarise(n = n()) %>%
    mutate(prop = percent(n/sum(n), 0.1),
           label = paste0(number(n,big.mark = ','), ' (',prop,')')) %>%
   arrange(desc(n)) %>% slice(1) %>%
    pull(label)


largest_spc <- trees %>%
	filter(status == 'Alive') %>%
	group_by(spc_common) %>%
	summarise(median_dbh = median(tree_dbh)) %>%
   arrange(desc(median_dbh)) %>% slice(1) 


# seq(0.2, 0.8, by=0.2), seq(0.2, 0.8, by=0.2)
# draw facts on plot
fact_df <- data.frame(
    x= rep(seq(0.2, 0.8, by=0.2), 4),
    y = c(rep(0.92, 4), rep(0.64, 4), rep(0.36, 4), rep(0.08, 4)),
    fact = c(paste(n_neigh, '\n neighborhoods'),
             paste(n_spc-1, '\n species'),
             paste(number(n_trees,big.mark = ','), '\nstreet trees'),
             paste('Alive:\n',alive_dead[1],'\nDead:\n',alive_dead[2]),
             
             paste(health_stat[2], '\n trees in\n Good health'),
             paste(health_stat[1], '\n trees in\n Fair health'),
             paste(health_stat[3], '\n trees in\n Poor health'),
             paste(health_stat[4], '\n trees\n health unknown'),
             
             paste(size_stat[2], '\n trees below 20"\n DBH in size'),
             paste(size_stat[1], '\n trees above 20"\n DBH in size'),
             paste(curb_stat[2], '\n tree beds \nalong\n the curb'),
             paste(curb_stat[1], '\n tree beds \n offset from\n the curb'),
             
             paste('Honeylocust:\n largest pop.\n', prevalent_spc),
             paste('318 DBH\n largest tree size\n from Pin Oak'),
             paste('Weeping Willow:\n largest median \nsize of\n 14 DBH'),
             paste('Most common\n problem: \n root related\n (58%)')
             )
)

ggplot(fact_df, aes(x, y, label=fact, fill=fact)) +
    geom_point(size = 50, shape = 21, show.legend = F,
               color='grey', alpha=0.8) +
    geom_text(vjust = 0.4) +
    xlim(0,1) + ylim(0,1) + labs(x='', y='') +
   theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks.y = element_blank(),
        axis.ticks.x = element_blank())

Apart from the image of the Manhattan tree map at the beginning of this report, which was extracted from the NYC Tree Map, all displays and outputs were generated directly with the R statistical programming language. The corresponding R code can be found in the Appendix for reference and full reproducibility of the results.

Data Description

We have access to the 2015 tree census and geographical information on New York City neighborhoods (trees, neighborhoods) as described below:

Tree Census data

  • "tree_id" - Unique id of each tree.
  • "tree_dbh" - The diameter of the tree in inches measured at 54 inches above the ground, abbreviated DBH.
  • "curb_loc" - Location of the tree bed in relation to the curb. Either along the curb (OnCurb) or offset from the curb (OffsetFromCurb).
  • "spc_common" - Common name for the species.
  • "status" - Indicates whether the tree is alive or standing dead.
  • "health" - Indication of the tree's health (Good, Fair, and Poor).
  • "root_stone" - Indicates the presence of a root problem caused by paving stones in the tree bed.
  • "root_grate" - Indicates the presence of a root problem caused by metal grates in the tree bed.
  • "root_other" - Indicates the presence of other root problems.
  • "trunk_wire" - Indicates the presence of a trunk problem caused by wires or rope wrapped around the trunk.
  • "trnk_light" - Indicates the presence of a trunk problem caused by lighting installed on the tree.
  • "trnk_other" - Indicates the presence of other trunk problems.
  • "brch_light" - Indicates the presence of a branch problem caused by lights or wires in the branches.
  • "brch_shoe" - Indicates the presence of a branch problem caused by shoes in the branches.
  • "brch_other" - Indicates the presence of other branch problems.
  • "postcode" - Five-digit zip code where the tree is located.
  • "nta" - Neighborhood Tabulation Area (NTA) code from the 2010 US Census for the tree.
  • "nta_name" - Neighborhood name.
  • "latitude" - Latitude of the tree, in decimal degrees.
  • "longitude" - Longitude of the tree, in decimal degrees.
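The nine problem indicators listed above (three each for roots, trunk, and branches) can be combined into a per-tree problem count and a per-category share of all problems. The sketch below is a minimal base R illustration on a made-up four-tree data frame. It assumes the indicators are stored as "Yes"/"No" strings, which the description above does not state, so the comparison should be adjusted if the actual coding differs.

```r
# Toy data: four hypothetical trees; the "Yes"/"No" coding is an assumption.
toy <- data.frame(
  tree_id    = 1:4,
  root_stone = c("Yes", "No",  "Yes", "No"),
  root_grate = c("No",  "No",  "Yes", "No"),
  root_other = c("No",  "No",  "No",  "No"),
  trunk_wire = c("No",  "No",  "Yes", "No"),
  trnk_light = c("No",  "No",  "No",  "No"),
  trnk_other = c("No",  "Yes", "No",  "No"),
  brch_light = c("No",  "No",  "No",  "No"),
  brch_shoe  = c("No",  "No",  "No",  "No"),
  brch_other = c("No",  "No",  "Yes", "No"),
  stringsAsFactors = FALSE
)

# Group the nine indicator columns by the part of the tree they describe.
problem_cols <- list(
  root   = c("root_stone", "root_grate", "root_other"),
  trunk  = c("trunk_wire", "trnk_light", "trnk_other"),
  branch = c("brch_light", "brch_shoe", "brch_other")
)

# Number of problems recorded on each tree (0 to 9).
toy$n_problems <- rowSums(toy[unlist(problem_cols)] == "Yes")

# Total problems per category, then each category's share of all problems.
category_totals <- sapply(problem_cols, function(cols) sum(toy[cols] == "Yes"))
category_share  <- category_totals / sum(category_totals)
category_share
```

On the real data, the same two steps would be applied to `trees` instead of `toy`; the maximum of `n_problems` within each species then gives a figure comparable to the "out of 9" count quoted in the key findings.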

Neighborhoods' geographical information

  • "ntacode" - NTA code (matches Tree Census information).
  • "ntaname" - Neighborhood name (matches Tree Census information).
  • "geometry" - Polygon that defines the neighborhood.

Data Preparation

In this section, we load the available data and inspect it for irregularities before delving into the main analysis.

# view first few observations
head(trees, 100)
# check the dimension of the trees data
# dim(trees)
# length(unique(trees$tree_id))

The trees data available to us contains 64,229 trees, each with 20 characteristics or attributes as described above. The data represent all the trees on the streets of Manhattan, New York City, per the 2015-2016 tree census that was dubbed TreesCount! 2015.

# display neighborhoods data
man_neighborhoods

# check how many neighborhoods there are.
# dim(man_neighborhoods)
# length(unique(man_neighborhoods$ntaname))

The table above shows 29 different neighborhoods for the Manhattan Borough in the neighborhoods data set, which does not match the 28 neighborhoods available in the tree census data set. Our investigation revealed that the extra neighborhood without tree data is named park-cemetery-etc-Manhattan, with Neighborhood Tabulation Area (NTA) code MN99. Thus, it appears that trees in parks were not surveyed. Later in this report, where we visualize the neighborhoods, we will be able to identify the specific areas in Manhattan that constitute the park-cemetery-etc-Manhattan neighborhood.
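A mismatch of this kind can be located with a set difference on the NTA codes. The sketch below uses base R on small stand-in vectors: only MN99 and its name come from the text, while the other codes and names are placeholders. With the real data, the inputs would presumably be `man_neighborhoods$ntacode` and `trees$nta`.

```r
# Stand-in neighborhood table: MN99 is from the report, the rest are placeholders.
toy_neighborhoods <- data.frame(
  ntacode = c("MN-A", "MN-B", "MN99"),
  ntaname = c("Placeholder A", "Placeholder B", "park-cemetery-etc-Manhattan"),
  stringsAsFactors = FALSE
)

# Stand-in NTA codes attached to individual trees (hypothetical values).
toy_tree_nta <- c("MN-A", "MN-B", "MN-A", "MN-B", "MN-B")

# Codes present in the neighborhood table but absent from the tree data.
missing_codes <- setdiff(toy_neighborhoods$ntacode, unique(toy_tree_nta))

# Look up the names of the neighborhoods without any surveyed trees.
toy_neighborhoods[toy_neighborhoods$ntacode %in% missing_codes, ]
```

Running the difference in the other direction (tree codes minus neighborhood codes) is a cheap extra check that every tree falls in a known neighborhood.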

Are there any missing values?

At this point, we inspect the data for missing values in any of the attributes.

Amount of missing values in the tree census data

# check for missing values
trees %>%
	miss_var_summary() %>%
	rename(`Variable name` = variable,
           `Number of missing values` = n_miss,
           `Percent missing` = pct_miss)
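
For readers without naniar installed, roughly the same per-variable summary can be sketched in base R. The example below runs on a made-up three-column data frame, not the census data. The output column names mirror those renamed above, and the percentage is computed on a 0 to 100 scale, which is an assumption about how the summary above reports it.

```r
# Toy data with some missing values (hypothetical, not the census data).
toy <- data.frame(
  tree_id    = 1:5,
  spc_common = c("honeylocust", NA, "ginkgo", NA, "pin oak"),
  health     = c("Good", NA, "Fair", "Good", "Poor"),
  stringsAsFactors = FALSE
)

# Count missing values in every column.
n_miss <- colSums(is.na(toy))

missing_summary <- data.frame(
  variable = names(n_miss),
  n_miss   = unname(n_miss),
  pct_miss = 100 * unname(n_miss) / nrow(toy),
  stringsAsFactors = FALSE
)

# Sort so the variables with the most missing values come first.
missing_summary <- missing_summary[order(-missing_summary$n_miss), ]
rownames(missing_summary) <- NULL
missing_summary
```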


