Skip to content
🤔 How Much of the World Has Access to the Internet?
↴ Here are some answers
⚙️ First of all we need tools for our research, lets add some
# Import pandas for data manipulation
import pandas as pd
import matplotlib.pyplot as plt
# Import seaborn for data visualisation
import seaborn as sns
import plotly.express as px
👌 Ok, now it time to get and tidy up data
# Read input tables and prepare data, simplify column names
internet = pd.read_csv('data/internet.csv').rename(columns={'Internet_Usage':'Usage'})
people = pd.read_csv('data/people.csv').rename(columns={'Users':'Users'})
broadband = pd.read_csv('data/broadband.csv').rename(columns={'Broadband_Subscriptions':'Broadband'})
# Read additional table (fresh data from the world bank database regarding countries regions)
regions_prep = pd.read_csv('add_data/world-regions-according-to-the-world-bank.csv')
regions = regions_prep.drop(columns=['Year']).rename(columns={'World Region according to the World Bank':'Region'})
# Merge Data into single dataframe
ip = pd.merge(internet, people, how='outer', left_on=['Entity','Code','Year'], right_on = ['Entity','Code','Year'])
ipb = pd.merge(ip, broadband, how='outer', left_on=['Entity','Code','Year'], right_on = ['Entity','Code','Year'])
ipbr = pd.merge(ipb, regions, how='outer', left_on=['Entity','Code'], right_on = ['Entity','Code'])
# Simplify name for dataframe
df = ipbr.sort_values(['Region','Entity','Year'], ascending=[True,True,True])
df.head()
# Create some beautiful vis
map_df = df.groupby(by=['Region','Entity','Code'])
map_df = map_df.apply(lambda x: x.ffill().bfill()['Usage'].max()).reset_index().rename(columns={0:'Usage'}).dropna()
fig = px.scatter_geo(map_df, locations="Code", color="Region",
hover_name="Entity", size="Usage",
scope='world', size_max=20, fitbounds=False, labels=False,
width=1000, height=500)
fig.show()
# Region code added for better visual representation, such long names don't fit in the screen
usage_top_r_code = df['Region'].replace({'East Asia and Pacific': 'EAP',
'Europe and Central Asia': 'ECA',
'Latin America and Caribbean': 'LAC',
'Middle East and North Africa': 'MENA',
'North America': 'NA',
'South Asia': 'SA'
})
df = df.join(usage_top_r_code, lsuffix='', rsuffix='_Code') # Adding new column
df.head() # See what we got
◉ 1. What are the top 5 countries with the highest internet use (by population share)?
🔍 Lets try to figure out
# Get top 5 countries by internet usage
entity_group = df.groupby(['Entity']) # Group df by country names
usage_max = entity_group['Usage','Region','Region_Code'].apply(lambda x: x.max()).reset_index() # Get max values in each
usage_top = usage_max.sort_values(by='Usage', ascending=False).head(5) # Get top 5 rows
usage_top # See what we got
📝 We can also note that 4 of the top 5 countries are from the Middle East and North Africa
◉ 2. How many people had internet access in those countries in 2019?
🔍 Let's find out
# Preparings
select_date = (df['Year'] == 2019)
usage_top_list = usage_top['Entity'].tolist()
# Select Date, remove columns and fill NAs with previous values
df_by_date = df[select_date].drop(columns=['Usage', 'Broadband', 'Region', 'Code','Region_Code'])#.fillna(method='ffill')
df_by_date = df_by_date[df_by_date['Entity'].isin(usage_top_list)].sort_values(['Users'], ascending=[False])
df_by_date # See what we got
⚠️ As we can see, unfortunately we don't have any record of Liechtenstein internet users for 2019. But no need to worry. At this point we can just take the value from previous years, let's do that below.
# Select Date, remove columns and fill NAs with previous values
df_by_date = df[select_date].drop(columns=['Usage', 'Broadband', 'Region', 'Code']).fillna(method='ffill')
df_by_date = df_by_date[df_by_date['Entity'].isin(usage_top_list)].sort_values(['Users'], ascending=[False])
df_by_date # See what we got