Competition - City Tree Species

    Which tree species should the city plant?

    📖 Background

    You work for a nonprofit organization advising the planning department on ways to improve the quantity and quality of trees in New York City. The urban design team believes tree size (using trunk diameter as a proxy for size) and health are the most desirable characteristics of city trees.

    The city would like to learn more about which tree species are the best choice to plant on the streets of Manhattan.
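
One way to make the two desirable characteristics concrete is to summarize each species by its typical trunk diameter and by the share of its trees rated "Good". The sketch below uses only the standard library and a handful of made-up rows; the function name `summarize_species` and the sample values are assumptions for illustration, not part of the census data or of the analysis in this report.

```python
from collections import defaultdict
from statistics import median

# Made-up sample rows: (spc_common, tree_dbh in inches, health)
sample_trees = [
    ("honeylocust", 10, "Good"),
    ("honeylocust", 14, "Good"),
    ("honeylocust", 6, "Poor"),
    ("pin oak", 20, "Good"),
    ("pin oak", 18, "Fair"),
    ("ginkgo", 5, "Good"),
]

def summarize_species(rows):
    """Return {species: (median diameter, share of trees rated Good)}."""
    diameters = defaultdict(list)
    good_counts = defaultdict(int)
    for species, dbh, health in rows:
        diameters[species].append(dbh)
        if health == "Good":
            good_counts[species] += 1
    return {
        species: (median(values), good_counts[species] / len(values))
        for species, values in diameters.items()
    }

summary = summarize_species(sample_trees)
for species, (median_dbh, good_share) in sorted(summary.items()):
    print(f"{species}: median dbh {median_dbh} in, {good_share:.0%} in good health")
```

How the two numbers should be weighed against each other is a judgment call; the sketch only reports them side by side.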

    💾 The data

    The team has provided access to the 2015 tree census and geographical information on New York City neighborhoods (trees, neighborhoods):

    Tree Census
    • "tree_id" - Unique id of each tree.
    • "tree_dbh" - The diameter of the tree in inches measured at 54 inches above the ground.
    • "curb_loc" - Location of the tree bed in relation to the curb. Either along the curb (OnCurb) or offset from the curb (OffsetFromCurb).
    • "spc_common" - Common name for the species.
    • "status" - Indicates whether the tree is alive or standing dead.
    • "health" - Indication of the tree's health (Good, Fair, and Poor).
    • "root_stone" - Indicates the presence of a root problem caused by paving stones in the tree bed.
    • "root_grate" - Indicates the presence of a root problem caused by metal grates in the tree bed.
    • "root_other" - Indicates the presence of other root problems.
    • "trunk_wire" - Indicates the presence of a trunk problem caused by wires or rope wrapped around the trunk.
    • "trnk_light" - Indicates the presence of a trunk problem caused by lighting installed on the tree.
    • "trnk_other" - Indicates the presence of other trunk problems.
    • "brch_light" - Indicates the presence of a branch problem caused by lights or wires in the branches.
    • "brch_shoe" - Indicates the presence of a branch problem caused by shoes in the branches.
    • "brch_other" - Indicates the presence of other branch problems.
    • "postcode" - Five-digit zip code where the tree is located.
    • "nta" - Neighborhood Tabulation Area (NTA) code from the 2010 US Census for the tree.
    • "nta_name" - Neighborhood name.
    • "latitude" - Latitude of the tree, in decimal degrees.
    • "longitude" - Longitude of the tree, in decimal degrees.
    Neighborhoods' geographical information
    • "ntacode" - NTA code (matches Tree Census information).
    • "ntaname" - Neighborhood name (matches Tree Census information).
    • "geometry" - Polygon that defines the neighborhood.

    Tree census and neighborhood information from the City of New York NYC Open Data.
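
Because "nta" in the tree census is described as matching "ntacode" in the neighborhood file, it may be worth checking that every tree's code actually has a neighborhood polygon before mapping. The sketch below shows one possible check with a left merge and pandas' `indicator=True` option. The two small dataframes, their NTA codes, and the area names are made-up stand-ins, not real census values.

```python
import pandas as pd

# Made-up stand-ins for the two datasets; only the key columns are shown.
trees_sample = pd.DataFrame({
    "tree_id": [1, 2, 3, 4],
    "nta": ["MN01", "MN01", "MN03", "MN99"],
})
neighborhoods_sample = pd.DataFrame({
    "ntacode": ["MN01", "MN03", "MN04"],
    "ntaname": ["Area A", "Area B", "Area C"],
})

# A left merge with indicator=True adds a "_merge" column; "left_only"
# marks trees whose NTA code has no matching neighborhood row.
merged = trees_sample.merge(
    neighborhoods_sample,
    left_on="nta",
    right_on="ntacode",
    how="left",
    indicator=True,
)
unmatched_codes = sorted(merged.loc[merged["_merge"] == "left_only", "nta"].unique())
print(unmatched_codes)
```

An empty list would mean every tree can be placed in a neighborhood; any codes that do appear would need to be investigated before plotting.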

    💪 Challenge

    Create a report that covers the following:

    • What are the most common tree species in Manhattan?
    • Which are the neighborhoods with the most trees?
    • A visualization of Manhattan's neighborhoods and tree locations.
    • What ten tree species would you recommend the city plant in the future?
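
The first two questions are frequency counts, so they can be sketched with `collections.Counter`, which is among the libraries imported below. The rows here are made-up placeholders (the area names in particular are not real "nta_name" values), and this is only one possible approach; a pandas `value_counts()` call on the real dataframe would serve the same purpose.

```python
from collections import Counter

# Made-up sample rows: (spc_common, nta_name)
sample_rows = [
    ("honeylocust", "Area A"),
    ("ginkgo", "Area A"),
    ("honeylocust", "Area B"),
    ("pin oak", "Area A"),
    ("honeylocust", "Area B"),
    ("ginkgo", "Area C"),
]

# Count how often each species and each neighborhood appears.
species_counts = Counter(species for species, _ in sample_rows)
neighborhood_counts = Counter(name for _, name in sample_rows)

# most_common(n) returns the n highest counts as (value, count) pairs.
top_species = species_counts.most_common(2)
top_neighborhoods = neighborhood_counts.most_common(2)
print(top_species)
print(top_neighborhoods)
```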

    ⌛️ Time is ticking. Good luck!

    Preparing the data

    Collect the column names of each dataframe in an organized Python structure for visualization and reporting.

    Import the needed libraries

    
    import pandas as pd
    import geopandas as gpd
    import matplotlib.pyplot as plt
    import plotly.express as px
    import plotly.graph_objects as go
    from collections import Counter
    

    Read the trees and neighborhoods files

    
    trees = pd.read_csv('data/trees.csv')
    trees
    neighborhoods = gpd.read_file('data/nta.shp')
    neighborhoods

    Check general information about each dataset: column names, column types, number of null values, and number of unique values.

    data_types = {
        "numeric": ["int64", "int32", "float64", "float32", "bool", "timedelta", "complex", "decimal"],
        "temporal": ["datetime", "timestamp", "period"],
        "categorical": ["category"],
        "other": ["object", "ExtensionDtype"],
    }
    # Map each column's dtype to one of the broad categories in data_types
    def get_column_info(df, data_types):
        # Create a DataFrame containing the names and dtypes of the columns in the input DataFrame
        column_info_df = df.dtypes.reset_index().rename(columns={'index': 'name', 0: 'dtype'})
    
        # Add a new column to the DataFrame, mapping the dtypes to the categories in the dictionary
        column_info_df['category'] = column_info_df['dtype'].apply(lambda x: next((key for key, values in data_types.items() if x in values), None))
    
        return column_info_df
    
    
    # Get the global variable name(s) bound to an object, as a list of strings
    def get_variable_name(variable):
        globals_dict = globals()
    
        return [var_name for var_name in globals_dict if globals_dict[var_name] is variable]
    
    
    
    # Build a per-column summary and a one-row column count for a dataframe
    def info_df(df):
        dfs = []
        highlight = get_variable_name(df)
        df_info = pd.DataFrame(columns=['columns_name_'+highlight[0], 'columns_type_'+highlight[0]])
        df_info['columns_name_'+highlight[0]] = list(df.columns)
        # df_info['columns_type_'+highlight] = list(df.dtypes)
        column_info_df = get_column_info(df, data_types)
        df_info['columns_type_'+highlight[0]] = column_info_df['dtype']
        df_info['columns_main_type_'+highlight[0]] = column_info_df['category']
        df_info['null_values_'+highlight[0]] = list(df.isnull().sum())
        df_info['unique_values_num_'+highlight[0]] = [len(df[item].unique()) for item in list(df_info['columns_name_'+highlight[0]].unique())]
        df_info['unique_values_list_'+highlight[0]] = [df[item].unique() for item in list(df_info['columns_name_'+highlight[0]].unique())]
        df_shape = pd.DataFrame(columns=['columns_num_'+highlight[0]])
        df_shape['columns_num_'+highlight[0]] =  [len(df_info['columns_name_'+highlight[0]][df_info['columns_name_'+highlight[0]].notnull()])]
        dfs.append(df_info)
        dfs.append(df_shape)
        return dfs
        
    # Call info_df once per dataframe and unpack both returned dataframes
    df_infos_trees, df_data_tree = info_df(trees)
    df_infos_neighborhoods, df_data_neighborhoods = info_df(neighborhoods)
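
For comparison, the same per-column facts (dtype, null count, distinct count) can be gathered without looking up the variable name in `globals()`. The sketch below is a simplified alternative, not a replacement for `info_df`: the helper name `summarize_columns` and the sample dataframe are assumptions for illustration. It uses `nunique(dropna=False)` so that missing values count as a distinct value, which mirrors `len(df[col].unique())` above.

```python
import pandas as pd

def summarize_columns(df):
    """Return one row per column with dtype, null count, and distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_values": df.isnull().sum(),
        # dropna=False counts a missing value as its own distinct value
        "unique_values": df.nunique(dropna=False),
    })

# Made-up sample with one missing health rating
sample = pd.DataFrame({
    "tree_dbh": [4, 10, 10, 7],
    "health": ["Good", None, "Fair", "Good"],
})
summary = summarize_columns(sample)
print(summary)
```

The result is indexed by column name, so a single column's facts can be read with, for example, `summary.loc["health"]`.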