Internet: A Global Phenomenon

internet_users.csv
- users - The number of people who have used the internet in the last three months
- share - The share of the entity's population who have used the internet in the last three months
adoption.csv
- fixed_telephone_subs - The number of people who have a telephone landline connection
- fixed_telephone_subs_share - The share of the entity's population who have a telephone landline connection
- fixed_broadband_subs - The number of people who have a broadband internet landline connection
- fixed_broadband_subs_share - The share of the entity's population who have a broadband internet landline connection
- mobile_cell_subs - The number of people who have a mobile subscription
- mobile_cell_subs_share - The share of the entity's population who have a mobile subscription

Both data files are indexed on the following 3 attributes:

entity - The name of the country, region, or group.
code - Unique id for the country (null for other entities).
year - Year from 1990 to 2020.
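Because both files share the same three key columns, they can be combined on those keys. The following is a minimal sketch on made-up toy rows (the `_demo` frames, entity names, codes, and values are all hypothetical, not taken from the real files); it assumes each (entity, code, year) combination appears at most once per file, which `validate` checks.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; every value here is invented for illustration.
internet_users_demo = pd.DataFrame({
    "entity": ["Country A", "Country A", "Country B"],
    "code": ["AAA", "AAA", "BBB"],
    "year": [2019, 2020, 2020],
    "users": [100, 120, 300],
})
adoption_demo = pd.DataFrame({
    "entity": ["Country A", "Country B", "Country B"],
    "code": ["AAA", "BBB", "BBB"],
    "year": [2020, 2019, 2020],
    "mobile_cell_subs": [150, 280, 310],
})

keys = ["entity", "code", "year"]

# An outer merge keeps rows that exist in only one file;
# validate raises an error if the keys are not unique on either side.
combined = internet_users_demo.merge(
    adoption_demo, on=keys, how="outer", validate="one_to_one"
)
print(combined)
```

Rows present in only one of the two frames end up with NaN in the columns that come from the other frame.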


Source: Our World In Data

1. Initial Exploration:

1.1. Load the datasets:

First, we need to import the necessary libraries and load the datasets:

import pandas as pd
import numpy as np

# Load the datasets
internet_users = pd.read_csv('internet_users.csv')
adoption = pd.read_csv('adoption.csv')

1.2. Basic Information:

Now, let's gather some basic information about the datasets:

# For the internet_users dataset
# info() prints its summary directly and returns None, so it is not wrapped in print()
internet_users.info()
print(internet_users.head())

# For the adoption dataset
adoption.info()
print(adoption.head())

1.3. Check for missing values:

It's essential to know whether the datasets contain missing values and to decide how to handle them:

# Check for missing values in both datasets
print("Missing values in internet_users:\n", internet_users.isnull().sum())
print("\nMissing values in adoption:\n", adoption.isnull().sum())
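Raw counts are easier to judge next to the size of the table. As a small sketch on a made-up frame (`demo` and its values are hypothetical), the share of missing values per column can be computed by averaging the boolean mask:

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as internet_users; values are invented.
demo = pd.DataFrame({
    "entity": ["Country A", "Country A", "Region X", "Region X"],
    "code": ["AAA", "AAA", None, None],
    "year": [2019, 2020, 2019, 2020],
    "share": [10.0, np.nan, 40.0, 45.0],
})

# isnull() yields True/False per cell; the column mean of that mask
# is the fraction of missing values, shown here as a percentage.
missing_pct = demo.isnull().mean().mul(100).round(1)
print(missing_pct)
```

The same expression works unchanged on `internet_users` and `adoption`.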

1.4. Recommendations for missing values:

1.4.1. For `code` in both datasets:

The `code` column appears to hold the country code. Many datasets include broader regions, continents, or other non-country groupings so that aggregate figures can be analysed. Because these entities are not countries, they have no ISO country code, and `code` is null for those rows. I decided to replace these nulls with a single placeholder code, OTH (for "Other").

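# Replace NaN values in the 'code' column with 'OTH' in both datasets.
# Assigning the result back avoids fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and which may not update the DataFrame.
internet_users['code'] = internet_users['code'].fillna('OTH')
adoption['code'] = adoption['code'].fillna('OTH')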
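Since OTH puts every non-country entity under one code, it can help to check which entities are affected and to keep a flag for them. This is a minimal sketch on invented rows; the `demo` frame, the entity names, and the `is_aggregate` column are hypothetical additions, not part of the original files.

```python
import pandas as pd

# Toy frame; entity names and codes are invented for illustration.
demo = pd.DataFrame({
    "entity": ["Country A", "Region X", "Region X", "Income group Y"],
    "code": ["AAA", None, None, None],
    "year": [2020, 2019, 2020, 2020],
})

# Record which rows had no code before the placeholder overwrites that information.
demo["is_aggregate"] = demo["code"].isna()

# List the affected entities to confirm they really are regions or groups.
entities_without_code = sorted(demo.loc[demo["is_aggregate"], "entity"].unique())
print(entities_without_code)

# Apply the same replacement as above.
demo["code"] = demo["code"].fillna("OTH")
print(demo)
```

The boolean flag makes it possible to separate countries from aggregates later without relying on the placeholder string.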

1.4.2. For the `share` column in internet_users:

The 'share' column is a numerical feature: the percentage of the entity's population that has used the internet in the last three months.

Filling missing or non-applicable numerical data with a distinct value such as -1 has several advantages. First, the column stays numeric, so later operations and analytics are not disrupted by unexpected data types. Second, because -1 cannot occur naturally in a percentage or proportion column, placeholders are easy to tell apart from genuine data, which allows quick filtering or conditional operations during analysis. Third, the placeholder is simple to exclude when computing statistics such as averages or sums. It does have to be excluded, though: left in, it would skew those results.

# Assign the result back rather than using inplace=True on a column selection
internet_users['share'] = internet_users['share'].fillna(-1)
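To illustrate why the -1 placeholder has to be filtered out before aggregating, here is a small sketch on made-up numbers (the `demo` frame and its values are hypothetical). It compares a mean that includes the placeholder with one that excludes it.

```python
import numpy as np
import pandas as pd

# Toy data: one of three entities has no reported share.
demo = pd.DataFrame({
    "entity": ["Country A", "Country B", "Country C"],
    "share": [20.0, np.nan, 60.0],
})
demo["share"] = demo["share"].fillna(-1)

# Including the placeholder drags the average down.
naive_mean = demo["share"].mean()

# Keep only genuine values (a share can never be negative) before averaging.
valid_share = demo.loc[demo["share"] >= 0, "share"]
filtered_mean = valid_share.mean()

print("Mean with placeholder:", naive_mean)
print("Mean without placeholder:", filtered_mean)
```

An equivalent option is to turn the placeholder back into NaN with `demo["share"].replace(-1, np.nan)`, since pandas skips NaN in `mean()` and `sum()` by default.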

1.4.3. The rest of the values:
