Workspace
Maximiliano Castelli/

Bikeshare Insights: Summer in the Windy City

0
Beta
Spinner

Bikeshare Insights: Summer in the Windy City

This dataset contains information on Divvy Bikes, a bikeshare program that provides residents and visitors of Chicago with a convenient way to explore the city.

The workspace is set up with one CSV file containing bikeshare activities at the peak of the summer-July 2023. Columns include ride ID, bike type, start and end times, station names and IDs, location coordinates, and member type. Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.

Source: Divvy Bikes

๐ŸŒŽ Some guiding questions to help you explore this data:

  1. How many observations are in the dataset? Are there null values?
  2. How would you clean and prepare the data for analysis?
  3. Which bike types are popular and which ones aren't? Check if being a member or casual rider makes a difference in bike choice.
  4. Time check! What are the peak and off-peak riding times during the day?

๐Ÿ“Š Visualization ideas

  • Bar chart: Display the number of times each bike type is used to identify the most and least used bikes.
  • Grouped bar chart: Compare bike usage by member type (member vs. casual) to see if it affects bike choice.
  • Heatmap: Vividly illustrate the popularity of bikes at different times during the day and week.

You can query the pre-loaded CSV files using SQL directly. Hereโ€™s a sample query:

Unknown integration
DataFrameavailable as
df
variable
SELECT distinct rideable_type, start_station_name
FROM '202307-divvy-tripdata.parquet'
LIMIT 100
This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.
Unknown integration
DataFrameavailable as
df1
variable
SELECT start_station_name,start_station_id,count(start_station_name)
FROM '202307-divvy-tripdata.parquet'
GROUP BY start_station_name,start_station_id
having count(start_station_name)>=1
This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.
Unknown integration
DataFrameavailable as
df2
variable
select *
FROM '202307-divvy-tripdata.parquet'
where start_station_id is null
This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.
import pandas as pd
divvy_jan2023 = pd.read_parquet("202307-divvy-tripdata.parquet")
divvy_jan2023.head(100)

Ready to share your work?

Click "Share" in the upper right corner, copy the link, and share it! You can also add this workspace to your DataCamp Portfolio

divvy_jan2023.describe()
divvy_jan2023.dtypes
divvy_jan2023.isna().sum()
NA_divvy_jan2023 = divvy_jan2023[divvy_jan2023['start_station_id'].isna()]

#divvy_jan2023[NA_divvy_jan2023['start_lat']==divvy_jan2023['start_lat']]
#na_start_lat = NA_divvy_jan2023['start_lat'].isin(divvy_jan2023['start_lat'])

#na_start_lat

#NA_divvy_jan2023.head()
#divvy_jan2023[(divvy_jan2023['start_lat']=='41.88') & (divvy_jan2023['start_lng']=='-87.63')]
#divvy_jan2023[(divvy_jan2023['start_lat']=='41.88')]
#divvy_jan2023[(divvy_jan2023['start_lng']=='-87.63')]

divvy_jan2023[(divvy_jan2023['start_lat'].isin(NA_divvy_jan2023['start_lat'])) & (divvy_jan2023['start_lng'].isin(NA_divvy_jan2023['start_lng'])) | divvy_jan2023['start_station_id'].isna()]

#6013F950DD94D2D0 41.88 -87.63

  1. Split into completeness and missingness datasets
  2. Round numerical columns to new ones rounded in Missing
  3. Compare Completeness rounded to new rounded in Missing dset
  4. Find match stations id to start and end missing ones
#round(divvy_jan2023['start_lat'],1)
#divvy_jan2023[(round(divvy_jan2023['start_lat'],1)==41.9) & (divvy_jan2023['start_station_id'].isna())].head(100)
divvy_jan2023[(round(divvy_jan2023['start_lat'],3)==41.88)]
import missingno as msno
import matplotlib.pyplot as plt
# Visualize missingness

#1. Split into completeness and missingness datasets

q_isna_start_station_id = (divvy_jan2023['start_station_id'].isna())
q_isna_end_station_id = (divvy_jan2023['end_station_id'].isna())
q_isna_end_lat = (divvy_jan2023['end_lat'].isna())
q_isna_end_lng = (divvy_jan2023['end_lng'].isna())

#divvy_jan2023[q_isna_start_station_id | q_isna_end_station_id]
divvy_jan2023_missing= divvy_jan2023[q_isna_start_station_id | q_isna_end_station_id | q_isna_end_lat | q_isna_end_lng]

#divvy_jan2023_missing

divvy_jan2023_complete = divvy_jan2023[~divvy_jan2023['ride_id'].isin(divvy_jan2023_missing['ride_id'])]

msno.matrix(divvy_jan2023_missing.sort_values(by=['start_station_id','end_station_id','end_lat','end_lng']))
plt.show()
#2. Round numerical columns to new ones rounded in Missing

divvy_jan2023_complete['start_lat_rnd'] = round(divvy_jan2023_complete['start_lat'],2)
divvy_jan2023_complete['start_lng_rnd'] = round(divvy_jan2023_complete['start_lng'],2)
divvy_jan2023_complete['end_lat_rnd'] = round(divvy_jan2023_complete['end_lat'],2)
divvy_jan2023_complete['end_lng_rnd'] = round(divvy_jan2023_complete['end_lng'],2)

#3. Compare Completeness rounded to lat & long values in Missing dset

match_start_lat = (divvy_jan2023_complete['start_lat_rnd'].isin(divvy_jan2023_missing['start_lat']))
match_start_lng = (divvy_jan2023_complete['start_lng_rnd'].isin(divvy_jan2023_missing['start_lng']))
match_end_lat = (divvy_jan2023_complete['end_lat_rnd'].isin(divvy_jan2023_missing['end_lat']))
match_end_lng = (divvy_jan2023_complete['end_lng_rnd'].isin(divvy_jan2023_missing['end_lng']))

#divvy_jan2023_start_st_NEW = divvy_jan2023[match_start_lat | match_start_lng] 
#divvy_jan2023_complete[match_start_lat & match_start_lng]

match_start_fields = ['start_lat_rnd','start_lng_rnd']
match_end_fields = ['end_lat_rnd','end_lat']
#divvy_jan2023_start_st_NEW = divvy_jan2023_complete.merge(divvy_jan2023_missing,left_on='start_lat_rnd',right_on='start_lat',how='right',suffixes=('_comp','_miss'))
#divvy_jan2023_start_st_NEW

#122910
divvy_jan2023_missing[(divvy_jan2023_missing['start_lat'].isin(divvy_jan2023_complete['start_lat_rnd'])) & (divvy_jan2023_missing['start_lng'].isin(divvy_jan2023_complete['start_lng_rnd']))]

#63228
#divvy_jan2023_missing[(divvy_jan2023_missing['end_lat'].isin(divvy_jan2023_complete['end_lat'])) & (divvy_jan2023_missing['end_lng'].isin(divvy_jan2023_complete['end_lng']))]