# Data Manipulation with pandas


Run the hidden code cell below to import the data used in this course.

### Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

The following methods and attributes help you understand a DataFrame:

• `.info()` shows information on each of the columns, such as the data type and number of missing values.
• `.shape` returns the number of rows and columns of the DataFrame.
• `.describe()` calculates a few summary statistics for each column.
• `.values`: a two-dimensional NumPy array of values.
• `.columns`: an index of columns: the column names.
• `.index`: an index for the rows: either row numbers or row names.
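A minimal sketch of these inspection tools in action, using a small made-up DataFrame (the `name`/`score` columns are just for illustration):

```python
import pandas as pd

# Small example DataFrame to illustrate the inspection methods above
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1.0, 2.5, None]})

df.info()            # data type and non-null count for each column
print(df.shape)      # number of (rows, columns)
print(df.describe()) # summary statistics for numeric columns
print(df.values)     # the underlying two-dimensional NumPy array
print(df.columns)    # the column names
print(df.index)      # the row index
```

Note that `.info()` also reports missing values indirectly: a non-null count lower than the row count means the column has NaNs.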

### Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

• Print the highest weekly sales for each `department` in the `walmart` DataFrame. Limit your results to the top five departments, in descending order.
• What was the total `nb_sold` of organic avocados in 2017 in the `avocado` DataFrame?
• Create a bar plot of the total number of homeless people by region in the `homelessness` DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart.
• Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines.
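A possible sketch of the first exercise, assuming `walmart` has `department` and `weekly_sales` columns; the toy data below stands in for the real DataFrame loaded by the hidden cell:

```python
import pandas as pd

# Hypothetical stand-in for the course's walmart DataFrame
walmart = pd.DataFrame({
    'department': ['A', 'B', 'A', 'C', 'B', 'C'],
    'weekly_sales': [100.0, 250.0, 300.0, 50.0, 400.0, 75.0],
})

# Highest weekly sales per department, top five in descending order
top5 = (walmart.groupby('department')['weekly_sales']
               .max()
               .sort_values(ascending=False)
               .head(5))
print(top5)
```

The same `groupby` → aggregate → `sort_values` → `head` pattern applies to the avocado exercise, with `.sum()` on `nb_sold` instead of `.max()`.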
```python
# Sort rows by a specific column in ascending order
df.sort_values('column_name', ascending=True, inplace=True)

# Sort rows by multiple columns in descending order
df.sort_values(['column_name1', 'column_name2'], ascending=[False, False], inplace=True)
```
```python
# Subset a single column
df['column_name']

# Subset multiple columns
df[['column_name1', 'column_name2']]
```
```python
# Subset rows where a specific column meets a condition
subset = df[df['column_name'] > 10]

# Subset rows where multiple columns meet conditions
subset = df[(df['column_name1'] > 10) & (df['column_name2'] == 'value')]
```
```python
# Subset rows where a specific column has certain categorical values
subset = df[df['column_name'].isin(['value1', 'value2', 'value3'])]
```
```python
# Create a new column from a list, array, or Series of values
df['new_column_name'] = values

# Derive new columns with arithmetic on existing columns
df['total'] = df['column1'] + df['column2']

df['average'] = df['column1'] / df['column1'].mean()
```

### Use Case for Data Manipulation

```python
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]

# See the result
print(result)
```
```python
# Import the necessary libraries
import pandas as pd

# Create a DataFrame
data = {'column_name': [5, 10, 15, 20, 25],
        'column_name1': [5, 10, 15, 20, 25],
        'column_name2': ['value1', 'value2', 'value3', 'value4', 'value5']}
df = pd.DataFrame(data)

# Subset rows where a specific column meets a condition
subset = df[df['column_name'] > 10]

# Subset rows where multiple columns meet conditions
subset = df[(df['column_name1'] > 10) & (df['column_name2'] == 'value5')]

# Subset rows where a specific column has certain categorical values
subset = df[df['column_name2'].isin(['value1', 'value2', 'value3'])]

# Create a new column with calculated values
df['new_column_name'] = df['column_name'] * 2

# Perform arithmetic operations on columns
df['total'] = df['column_name'] + df['column_name1']
df['average'] = df['column_name'] / df['column_name'].mean()

# Manipulate data using pandas functions
# (homelessness is loaded by the hidden cell at the top of this workspace)
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)
result = high_homelessness_srt[["state", "indiv_per_10k"]]
```
```python
# Import the necessary libraries
import pandas as pd
import numpy as np
import scipy.stats as stats

# Create a DataFrame
data = {'column_name': [5, 10, 15, 20, 25],
        'column_name1': [5, 10, 15, 20, 25],
        'column_name2': ['value1', 'value2', 'value3', 'value4', 'value5']}
df = pd.DataFrame(data)

# Calculate descriptive statistics
mean = df['column_name'].mean()
median = df['column_name'].median()
mode = df['column_name'].mode().values
std_dev = df['column_name'].std()
variance = df['column_name'].var()

# Perform hypothesis testing (one-sample t-test against a mean of 0)
t_stat, p_value = stats.ttest_1samp(df['column_name'], 0)

# Perform correlation analysis
correlation = df['column_name'].corr(df['column_name1'])

# Perform linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['column_name'], df['column_name1'])
```
```python
# Define a custom function to calculate the inter-quartile range (IQR)
def calculate_iqr(column):
    q3 = column.quantile(0.75)
    q1 = column.quantile(0.25)
    iqr = q3 - q1
    return iqr

# Apply the custom function to the 'column_name' column of the DataFrame
iqr = df['column_name'].agg(calculate_iqr)
```
```python
# Pass a list to .agg() to apply several aggregations at once:
# calculate_iqr and np.median, in that order (requires numpy imported as np)
iqr_median = df['column_name'].agg([calculate_iqr, np.median])
```
```python
# Calculate cumulative statistics for the 'column_name' column
cumulative_sum = df['column_name'].cumsum()
cumulative_min = df['column_name'].cummin()
cumulative_max = df['column_name'].cummax()
cumulative_prod = df['column_name'].cumprod()

# Display the cumulative statistics
cumulative_sum, cumulative_min, cumulative_max, cumulative_prod
```
```python
# Drop fully duplicated rows (returns a new DataFrame)
df.drop_duplicates()

# Drop rows duplicated on a subset of columns
df.drop_duplicates(subset=['column_name1', 'column_name2'])

# Drop duplicates in place, modifying df itself
df.drop_duplicates(inplace=True)
```
```python
# Count the number of unique values in the 'column_name2' column of the 'df' DataFrame
unique_values = df['column_name2'].nunique()

# Count the frequency of each unique value in the 'column_name2' column of the 'df' DataFrame
value_counts = df['column_name2'].value_counts()

# Display the number of unique values and the frequency of each unique value
unique_values, value_counts
```
```python
# Calculate the inter-quartile range (IQR) for the 'column_name' column of the 'df' DataFrame
q3 = df['column_name'].quantile(0.75)
q1 = df['column_name'].quantile(0.25)
iqr = q3 - q1

# Calculate the median for the 'column_name' column of the 'df' DataFrame
median = df['column_name'].median()

# Calculate the cumulative sum, minimum, maximum, and product for the 'column_name' column
cumulative_sum = df['column_name'].cumsum()
cumulative_min = df['column_name'].cummin()
cumulative_max = df['column_name'].cummax()
cumulative_prod = df['column_name'].cumprod()

# Drop duplicate rows from the 'df' DataFrame
df.drop_duplicates(inplace=True)

# Count the number of unique values in the 'column_name2' column of the 'df' DataFrame
unique_values = df['column_name2'].nunique()

# Count the frequency of each unique value in the 'column_name2' column of the 'df' DataFrame
value_counts = df['column_name2'].value_counts()
```
```python
# Count the number of rows in the 'df' DataFrame
num_rows = len(df)

# Count the number of columns in the 'df' DataFrame
num_columns = len(df.columns)

# Calculate the sum of values in the 'column_name' column of the 'df' DataFrame
column_sum = df['column_name'].sum()

# Calculate the mean of values in the 'column_name' column of the 'df' DataFrame
column_mean = df['column_name'].mean()

# Calculate the standard deviation of values in the 'column_name' column of the 'df' DataFrame
column_std = df['column_name'].std()

# Calculate the minimum and maximum values in the 'column_name' column of the 'df' DataFrame
column_min = df['column_name'].min()
column_max = df['column_name'].max()

# Calculate the correlation between the numeric 'column_name' and 'column_name1' columns
# (correlation requires numeric columns; 'column_name2' holds strings)
column_corr = df['column_name'].corr(df['column_name1'])

# Display the results
num_rows, num_columns, column_sum, column_mean, column_std, column_min, column_max, column_corr
```
```python
# Group 'df' by the 'group_column' column and calculate the sum, mean, and
# standard deviation of the 'value_column' column for each group
grouped_df = df.groupby('group_column')['value_column'].agg(['sum', 'mean', 'std'])

# Display the grouped DataFrame
grouped_df
```