Skip to content
World Internet Access. Briefly and simply about the complex
  • AI Chat
  • Code
  • Report
  • Spinner

    🤔 How Much of the World Has Access to the Internet?

    ↴ Here are some answers

    ⚙️ First of all we need tools for our research, lets add some

    # Import pandas for data manipulation
    import pandas as pd
    import matplotlib.pyplot as plt
    # Import seaborn for data visualisation
    import seaborn as sns
    import plotly.express as px
    

    👌 Ok, now it time to get and tidy up data

    # Read input tables and prepare data, simplify column names
    internet  	 = pd.read_csv('data/internet.csv').rename(columns={'Internet_Usage':'Usage'})
    people    	 = pd.read_csv('data/people.csv').rename(columns={'Users':'Users'})
    broadband 	 = pd.read_csv('data/broadband.csv').rename(columns={'Broadband_Subscriptions':'Broadband'})
    
    # Read additional table (fresh data from the world bank database regarding countries regions)
    regions_prep = pd.read_csv('add_data/world-regions-according-to-the-world-bank.csv')
    regions 	 = regions_prep.drop(columns=['Year']).rename(columns={'World Region according to the World Bank':'Region'})
    # Merge Data into single dataframe
    ip 			= pd.merge(internet, people,    how='outer', left_on=['Entity','Code','Year'], 	right_on = ['Entity','Code','Year'])
    ipb 		= pd.merge(ip,       broadband, how='outer', left_on=['Entity','Code','Year'], 	right_on = ['Entity','Code','Year'])
    ipbr 		= pd.merge(ipb,      regions,   how='outer', left_on=['Entity','Code'], 		right_on = ['Entity','Code'])
    
    # Simplify name for dataframe
    df        	= ipbr.sort_values(['Region','Entity','Year'], ascending=[True,True,True])
    df.head()
    # Create some beautiful vis
    
    map_df = df.groupby(by=['Region','Entity','Code'])
    map_df = map_df.apply(lambda x: x.ffill().bfill()['Usage'].max()).reset_index().rename(columns={0:'Usage'}).dropna()
    
    fig = px.scatter_geo(map_df, locations="Code", color="Region",
    					hover_name="Entity", size="Usage",
    					scope='world', size_max=20, fitbounds=False, labels=False,
    					width=1000, height=500)
    fig.show()
    # Region code added for better visual representation, such long names don't fit in the screen
    usage_top_r_code = df['Region'].replace({'East Asia and Pacific': 'EAP',
                                             'Europe and Central Asia': 'ECA', 
                                             'Latin America and Caribbean': 'LAC', 
                                             'Middle East and North Africa': 'MENA',
                                             'North America': 'NA',
                                             'South Asia': 'SA'
                                            })
    
    df = df.join(usage_top_r_code, lsuffix='', rsuffix='_Code') # Adding new column
    
    df.head() # See what we got

    ◉ 1. What are the top 5 countries with the highest internet use (by population share)?

    🔍 Lets try to figure out

    # Get top 5 countries by internet usage
    entity_group 	= df.groupby(['Entity']) 													# Group df by country names
    usage_max 		= entity_group['Usage','Region','Region_Code'].apply(lambda x: x.max()).reset_index() # Get max values in each
    usage_top 		= usage_max.sort_values(by='Usage', ascending=False).head(5) 				# Get top 5 rows
    
    usage_top # See what we got

    📝 We can also note that 4 of the top 5 countries are from the Middle East and North Africa

    ◉ 2. How many people had internet access in those countries in 2019?

    🔍 Let's find out

    # Preparings
    select_date 	= (df['Year'] == 2019)
    usage_top_list 	= usage_top['Entity'].tolist()
    
    # Select Date, remove columns and fill NAs with previous values
    df_by_date = df[select_date].drop(columns=['Usage', 'Broadband', 'Region', 'Code','Region_Code'])#.fillna(method='ffill')
    df_by_date = df_by_date[df_by_date['Entity'].isin(usage_top_list)].sort_values(['Users'], ascending=[False])
    
    df_by_date # See what we got

    ⚠️ As we can see, unfortunately we don't have any record of Liechtenstein internet users for 2019. But no need to worry. At this point we can just take the value from previous years, let's do that below.

    # Select Date, remove columns and fill NAs with previous values
    df_by_date = df[select_date].drop(columns=['Usage', 'Broadband', 'Region', 'Code']).fillna(method='ffill')
    df_by_date = df_by_date[df_by_date['Entity'].isin(usage_top_list)].sort_values(['Users'], ascending=[False])
    
    df_by_date # See what we got