Internet: A Global Phenomenon
This dataset contains information on internet access around the world.
The workspace is set up with two CSV files containing information on global internet access for years ranging from 1990 to 2020.
internet_users.csv
users
- The number of people who have used the internet in the last three months
share
- The share of the entity's population who have used the internet in the last three months
adoption.csv
fixed_telephone_subs
- The number of people who have a telephone landline connection
fixed_telephone_subs_share
- The share of the entity's population who have a telephone landline connection
fixed_broadband_subs
- The number of people who have a broadband internet landline connection
fixed_broadband_subs_share
- The share of the entity's population who have a broadband internet landline connection
mobile_cell_subs
- The number of people who have a mobile subscription
mobile_cell_subs_share
- The share of the entity's population who have a mobile subscription
Both data files are indexed on the following 3 attributes:
entity
- The name of the country, region, or group.
code
- Unique id for the country (null for other entities).
year
- Year from 1990 to 2020.
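Because both files share the keys entity, code, and year, they can be combined on those three columns. The following is a minimal sketch, not part of the original workspace: it uses small made-up rows (a fictional entity "Atlantis") in place of the real CSV files, and the variable names are illustrative only.

```python
import io
import pandas as pd

# Toy stand-ins for the two CSV files; all values are made up for illustration.
users_csv = io.StringIO(
    "entity,code,year,users,share\n"
    "Atlantis,ATL,2019,100,10.0\n"
    "Atlantis,ATL,2020,150,15.0\n"
)
adoption_csv = io.StringIO(
    "entity,code,year,mobile_cell_subs,mobile_cell_subs_share\n"
    "Atlantis,ATL,2019,300,30.0\n"
    "Atlantis,ATL,2020,320,32.0\n"
)

# The three shared key columns described above.
keys = ["entity", "code", "year"]
users_df = pd.read_csv(users_csv)
adoption_df = pd.read_csv(adoption_csv)

# An outer join keeps rows that appear in only one of the two files.
merged = users_df.merge(adoption_df, on=keys, how="outer")
print(merged)
```

An outer join is chosen here so that no entity-year pair is silently dropped; an inner join would be the alternative if only rows present in both files are wanted.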
Source: Our World In Data
1. Initial Exploration:
1.1. Load the datasets:
First, we need to import the necessary libraries and load the datasets:
import pandas as pd
import numpy as np
# Load the datasets
internet_users = pd.read_csv('internet_users.csv')
adoption = pd.read_csv('adoption.csv')
1.2. Basic Information:
Now, let's gather some basic information about the datasets:
# For the internet_users dataset
print(internet_users.info())
print(internet_users.head())
# For the adoption dataset
print(adoption.info())
print(adoption.head())
1.3. Check for missing values:
It's essential to know if there are any missing values in the datasets and handle them:
# Check for missing values in both datasets
print("Missing values in internet_users:\n", internet_users.isnull().sum())
print("\nMissing values in adoption:\n", adoption.isnull().sum())
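Raw counts are easier to judge next to the share of rows they represent. As a sketch under the assumption that the frames look like the toy one below (the rows are invented for illustration), the fraction of missing values per column can be computed with isnull().mean():

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for internet_users; values are made up.
toy = pd.DataFrame({
    "entity": ["World", "Atlantis", "Atlantis"],
    "code": [np.nan, "ATL", "ATL"],
    "year": [2020, 2019, 2020],
    "share": [59.0, np.nan, 15.0],
})

# isnull() gives booleans; their mean is the fraction missing per column.
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```

The same expression can be applied to internet_users and adoption directly.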
1.4. Recommendations for missing values:
1.4.1. For code in both datasets:
The code column appears to hold the country code. Datasets like this one often include broader regions, continents, and other non-country groupings for aggregate analysis. Because these entities are not countries, they have no ISO country code, so code is null for those rows. I decided to replace these nulls with a single placeholder code, OTH (for "Other").
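# Replace NaN values in the 'code' column with 'OTH' in both datasets.
# Assigning the result back avoids fillna(inplace=True) on a column selection, which
# triggers a warning in recent pandas versions and may not update the DataFrame.
internet_users['code'] = internet_users['code'].fillna('OTH')
adoption['code'] = adoption['code'].fillna('OTH')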
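The claim that null codes belong only to regions and groups is worth checking before filling them. A minimal sketch, assuming a frame shaped like the toy one below (the entities and codes are invented for illustration), lists the entities without a code and then applies the fill:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the real data the null-code rows should be aggregates.
toy = pd.DataFrame({
    "entity": ["World", "Europe", "Atlantis"],
    "code": [np.nan, np.nan, "ATL"],
})

# Entities that have no code, listed once each for manual inspection.
no_code = sorted(toy.loc[toy["code"].isna(), "entity"].unique())
print(no_code)

# Fill only after confirming the list contains no actual countries.
toy["code"] = toy["code"].fillna("OTH")
```

If an actual country shows up in that list, it probably needs its real code rather than the OTH placeholder.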
1.4.2. For the share column in internet_users:
The share column is a numerical feature: the percentage or proportion of the entity's population that has used the internet in the last three months.
Filling missing or non-applicable numerical values with a distinct placeholder such as -1 has several advantages. First, it keeps the column numeric, so later operations are not disrupted by unexpected data types. Second, because -1 cannot occur naturally in a percentage or proportion column, genuine data is easy to tell apart from placeholders, which makes filtering and conditional operations straightforward. The placeholder rows still have to be filtered out before calculating averages or sums; otherwise the -1 values skew the results.
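# Mark missing shares with the placeholder -1 (assign back instead of using inplace=True)
internet_users['share'] = internet_users['share'].fillna(-1)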
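To illustrate why the placeholder has to be excluded before aggregating, here is a small sketch with made-up numbers (the frame and variable names are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy column in which -1 marks a missing value; the other values are invented.
toy = pd.DataFrame({"share": [10.0, -1.0, 30.0]})

# Averaging everything lets the placeholder distort the result.
naive_mean = toy["share"].mean()

# Keep only genuine values (a share can never be negative), then average.
valid = toy.loc[toy["share"] >= 0, "share"]
clean_mean = valid.mean()

print(naive_mean, clean_mean)
```

An alternative design would be to leave the values as NaN, since pandas aggregations such as mean() skip NaN by default; the -1 placeholder trades that convenience for an explicit, filterable marker.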
1.4.3 The rest of the values