Oscar Herrada/

Visualizing Denver to find Expansion Neighborhoods


Where to open a new coffee shop in the Denver area?

The top 3 locations for a Coffee Shop:

- Central Park

- Capitol Hill

- Hampden

The labels for 'Central Park' and 'Capitol Hill' do not appear in the visual above because they overlap with other labels; 'Hampden' is toward the bottom-right, above 'Southmoor Park'.

📖 Background

You are helping a client who owns coffee shops in Colorado. The company's coffee shops serve high-quality and responsibly sourced coffee, pastries, and sandwiches. They operate three locations in Fort Collins and want to expand into Denver.

Your client believes that the ideal location for a new store is close to affluent households and that the store appeals to the 20-35 year old demographic.

Your team collected geographical and demographic information about Denver's neighborhoods to assist the search. They also collected data for Starbucks stores in Denver. Starbucks and the new coffee shops do not compete for the same clients; the team included their location as a reference.

💾 The data

You have assembled information from three different sources (locations, neighborhoods, demographics):

Starbucks locations in Denver, Colorado
  • "StoreNumber" - Store Number as assigned by Starbucks
  • "Name" - Name identifier for the store
  • "PhoneNumber" - Phone number for the store
  • "Street 1, 2, and 3" - Address for the store
  • "PostalCode" - Zip code of the store
  • "Longitude, Latitude" - Coordinates of the store
Neighborhoods' geographical information
  • "NBHD_ID" - Neighborhood ID (matches the census information)
  • "NBHD_NAME" - Name of the statistical neighborhood
  • "Geometry" - Polygon that defines the neighborhood
Demographic information
  • "NBHD_ID" - Neighborhood ID (matches the geographical information)
  • "NBHD_NAME" - Neighborhood name
  • "POPULATION_2010" - Population in 2010
  • "AGE_ " - Number of people in each age bracket (< 18, 18-34, 35-65, and > 65)
  • "NUM_HOUSEHOLDS" - Number of households in the neighborhood
  • "FAMILIES" - Number of families in the neighborhood
  • "NUM_HHLD_100K+" - Number of households with income above 100 thousand USD per year

Starbucks locations were scraped from the Starbucks store locator webpage by Chris Meller.
Statistical Neighborhood information from the City of Denver Open Data Catalog, CC BY 3.0 license.
Census information from the United States Census Bureau. Publicly available information.

pip install geopandas
# Load libraries
import re
import inspect
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.offline as pyo 
import geopandas as gpd
import statistics
from scipy import stats

pio.templates.default = "simple_white"


%matplotlib inline

#To run functions from external folder
%run -i



Cleaning of denver.csv

# Load 'denver.csv' as "coffee_shops" and take a peek at the data.
coffee_shops = pd.read_csv('./data/denver.csv')

From the initial look, there are a few areas to address:
  • The "PhoneNumber" column is not useful; contact info will not help with determining a good location.
  • "StoreNumber" should be checked to see whether the first half of the number is enough to keep each store's identity, or whether the whole number is needed, so that the column can be converted to an int.
  • Not every street column may be necessary; if "Street1" is sufficient, dropping the other 2 columns trims some unnecessary data.
  • Truncate "PostalCode" to its first 5 digits, the format already used by 2 of the first observations.
  • Confirm that "Longitude" and "Latitude" are already of type float, with no major deviations from what appears in this early peek.

Three things stand out here:
  • "Street2" and "Street3" have only 11-12 unique values, while "Street1" has the same number of unique values as "StoreNumber" and "Name".
  • "Name" and "Street1" both have 78 unique values and in essence represent the same thing, the coffee shop. Dropping "Name" leaves "Street1", which is the more usable column and would need less conversion if required.
  • "PostalCode" has too many unique values for data covering a single city's neighborhoods.
"Street2" and "Street3" have only 11 and 15 values in total, respectively. The heatmap shows how much of their data is missing compared with "PhoneNumber", which is missing only one value. This confirms that these columns can be dropped without losing valuable data.
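The unique-value and missing-value observations above can be reproduced with a few pandas calls. The following is a minimal sketch on a made-up frame, not the notebook's original cell: only the column names and the store number `27708-240564` come from this document, and every other value is illustrative.

```python
import pandas as pd

# Tiny stand-in for denver.csv; addresses and phone numbers are invented
sample = pd.DataFrame({
    "StoreNumber": ["27708-240564", "75828-94910", "74105-23807"],
    "Street1": ["100 Example St", "200 Sample Ave", "300 Placeholder Rd"],
    "Street2": [None, "Suite 100", None],
    "Street3": [None, None, None],
    "PhoneNumber": ["303-555-0100", None, "303-555-0102"],
})

# Distinct non-null values per column
unique_counts = sample.nunique()

# Missing values per column, as counts and as a share of all rows
missing_counts = sample.isna().sum()
missing_share = sample.isna().mean().round(2)

summary = pd.DataFrame({
    "unique": unique_counts,
    "missing": missing_counts,
    "missing_share": missing_share,
})
print(summary)
```

Columns with a high `missing_share` and few unique values, such as "Street2" and "Street3" here, are the natural candidates to drop.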

# Drop unnecessary columns
coffee_shops.drop(['PhoneNumber', 'Name', 'Street2', 'Street3'], axis = 1, inplace = True)

# Convert "StoreNumber" into just the first half of the number -- ex.27708-240564 = 27708
coffee_shops['StoreNumber'] = [int(store[0]) for store in coffee_shops['StoreNumber'].str.split(pat = '-')]
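As an alternative sketch, the same conversion can be written with the pandas `.str` accessor, together with a check that the shortened number is still unique per store. The series below is made up (only `27708-240564` appears in this document), and the uniqueness check is an added safeguard, not something the original cell is known to do.

```python
import pandas as pd

# Made-up store numbers in the same "prefix-suffix" shape as the example above
store_numbers = pd.Series(["27708-240564", "75828-94910", "74105-23807"])

# Keep the part before the first hyphen and convert it to an integer
prefix = store_numbers.str.split("-").str[0].astype(int)

# Dropping the suffix is only safe if the prefix alone still identifies each store
assert prefix.is_unique, "store number prefixes collide; keep the full number"

print(prefix.tolist())  # [27708, 75828, 74105]
```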

# Converting "PostalCode" to just the first 5 digits per observation
coffee_shops['PostalCode'] = [int(re.findall(r'\d{5}', postal)[0]) for postal in coffee_shops['PostalCode'].astype(str)]
# Check to see that the postal codes are within an acceptable range
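A range check along these lines could look like the sketch below. The postal codes are illustrative, and the bounds `ZIP_MIN` and `ZIP_MAX` are an assumption: Denver ZIP codes begin with 802, and the range here is deliberately a little wider.

```python
import pandas as pd

# Illustrative postal codes: 5-digit, ZIP+4, and ZIP+4 without a hyphen
postal = pd.Series(["80204", "80202-5304", "802316401"])

# First run of five digits, matching the conversion above
zip5 = postal.astype(str).str.extract(r"(\d{5})")[0].astype(int)

# ASSUMPTION: acceptable range for Denver-area ZIP codes
ZIP_MIN, ZIP_MAX = 80000, 80999

# Rows outside the assumed range deserve a manual look
out_of_range = zip5[~zip5.between(ZIP_MIN, ZIP_MAX)]

print(zip5.tolist())       # [80204, 80202, 80231]
print(out_of_range.empty)  # True
```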


# Take a peek at the dataframe
# Check unique value per column again
Now it seems we have filtered our data to include only what can be used, and nothing appears to be unusual.

# Check to ensure both "Longitude" and "Latitude" are within an acceptable range as well.
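One way to run this check is to compare each coordinate against a bounding box. In the sketch below the coordinates are made up, and `LON_RANGE` and `LAT_RANGE` are an assumed, rough box around Denver rather than official city limits.

```python
import pandas as pd

# Made-up coordinates; the last row has latitude and longitude swapped
coords = pd.DataFrame({
    "Longitude": [-104.99, -104.87, 39.74],
    "Latitude": [39.74, 39.65, -104.99],
})

# Both columns should already be floats
assert coords["Longitude"].dtype == float and coords["Latitude"].dtype == float

# ASSUMPTION: rough bounding box around Denver
LON_RANGE = (-105.2, -104.5)
LAT_RANGE = (39.5, 40.0)

in_box = coords["Longitude"].between(*LON_RANGE) & coords["Latitude"].between(*LAT_RANGE)

# Rows outside the box are likely swapped, mistyped, or outside the city
suspect = coords[~in_box]
print(suspect)
```

`describe()` on the two columns gives a quicker overview, but an explicit box makes the accepted range visible in the notebook.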






# Extract cleaned df "coffee_shops" as "coffee_shops_c"
coffee_shops.to_csv('./data/coffee_shops_c.csv', index=False)

Cleaning neighborhoods.shp

# Loading 'neighborhoods.shp' as "neighborhoods", and a peek at the data.
neighborhoods = gpd.read_file('data/neighborhoods.shp')