
Live Training - Working with Categorical Data in Python (Webinar)

The General Social Survey (GSS)

About Dataset

The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years.

The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

Altogether the GSS is the single best source for sociological and attitudinal trend data covering the United States. It allows researchers to examine the structure and functioning of society in general as well as the role played by relevant subgroups and to compare the United States to other nations. (Source)

This dataset is a CSV version of the Cumulative Data File, a cross-sectional sample of the GSS from 1972 to the present.

https://www.kaggle.com/datasets/norc/general-social-survey?select=gss.csv

# Import packages
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# Read in csv as a DataFrame and preview it
df = pd.read_csv('gss_sub.csv')
df

Data Validation and Cleaning

df.info()

Above we see that our DataFrame contains a float64 column (numerical data) as well as a number of object columns. Object columns typically contain strings.

Inspecting individual columns

df['environment'].value_counts(normalize = True)
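As a minimal illustration of what `normalize=True` does (using a made-up series rather than the GSS data):

```python
import pandas as pd

# A toy series standing in for a survey column
responses = pd.Series(["TOO MUCH", "TOO MUCH", "TOO LITTLE", "ABOUT RIGHT"])

# normalize=True returns relative frequencies instead of raw counts
proportions = responses.value_counts(normalize=True)
print(proportions)
# "TOO MUCH" appears in 2 of the 4 rows, so its proportion is 0.5
```

The proportions always sum to 1, which makes it easy to compare the distribution of answers across columns with different numbers of missing values.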

Manipulating categorical data

  • The categorical data type can be useful for several reasons:
    • It saves memory when a column contains only a few distinct values.
    • You can specify a precise order for the categories when the default (alphabetical) order is incorrect.
    • It is compatible with other Python libraries that work with categorical data.
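The memory point is easy to demonstrate on a toy column of repeated string labels (hypothetical data, not the GSS file):

```python
import pandas as pd
import numpy as np

# 100,000 rows drawn from only three distinct labels
labels = pd.Series(np.random.choice(["AGREE", "DISAGREE", "NEUTRAL"], size=100_000))

# deep=True counts the actual string storage, not just the pointers
object_bytes = labels.memory_usage(deep=True)
category_bytes = labels.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The category version stores each distinct label once plus a small integer code per row, so the savings grow with the number of rows.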
# Create a dictionary of column and data type mappings
conversion_dict = {k: 'category' for k in df.select_dtypes(include='object').columns}
conversion_dict

# Convert our DataFrame and check the data types
df = df.astype(conversion_dict)
df.info()

Already we can see that the memory usage of the DataFrame has been nearly halved, from about 7 MB to 4 MB! This can help when working with large quantities of data, such as the survey we'll be working with.

Cleaning up the labor_status column

df['labor_status'].cat.categories

We can collapse some of these categories. The easiest way to do this is to replace the values in the column using a dictionary, and then set the data type back to category.

# Create a dictionary of categories to collapse
new_labor_status = {"UNEMPL, LAID OFF": "UNEMPLOYED", 
                    "TEMP NOT WORKING": "UNEMPLOYED",
                    "WORKING FULLTIME": "EMPLOYED",
                    "WORKING PARTTIME": "EMPLOYED"
                   }

# Replace the values in the column and reset as a category
df['labor_status_clean'] = df['labor_status'].replace(new_labor_status).astype('category')

# Preview the new column
df['labor_status_clean'].value_counts()
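If `.replace` ever gives you dtype warnings in newer pandas versions, `.map` with a fallback works the same way. A sketch on a made-up series (the values are illustrative):

```python
import pandas as pd

# Toy stand-in for the labor_status column
status = pd.Series(
    ["WORKING FULLTIME", "TEMP NOT WORKING", "RETIRED"], dtype="category"
)

collapse = {
    "TEMP NOT WORKING": "UNEMPLOYED",
    "WORKING FULLTIME": "EMPLOYED",
}

# The lambda's fallback keeps any value not in the dictionary (e.g. RETIRED);
# casting back to category restores the dtype
clean = status.map(lambda v: collapse.get(v, v)).astype("category")
print(clean.tolist())
print(clean.dtype)
```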

Reordering categories

df['environment'].cat.categories

Let's loop through the three variables and give them all an order. While we're at it, let's drop two categories that don't have any use for us: "DK" (don't know) and "IAP" (inapplicable). By removing them as categories, we set them to null so they won't be counted in the final analysis.

# Set the new order
new_order = ["TOO LITTLE", "ABOUT RIGHT", "TOO MUCH", "DK", "IAP"]
categories_to_remove = ["DK", "IAP"]

# Loop through each column
for col in ['environment', 'law_enforcement', 'drugs']:
    # Reorder and remove the categories
    df[col + '_clean'] = df[col].cat.reorder_categories(new_order, ordered=True)
    df[col + '_clean'] = df[col + '_clean'].cat.remove_categories(categories_to_remove)

# Preview one of the columns' categories
df['environment_clean'].value_counts(dropna=False)
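One payoff of setting `ordered=True` is that comparison operators now work on the column, so we can filter by rank. A small sketch on a toy series mirroring the cleaned survey columns:

```python
import pandas as pd

# An ordered categorical with the same category order as above
opinions = pd.Series(
    pd.Categorical(
        ["TOO MUCH", "TOO LITTLE", "ABOUT RIGHT", "TOO MUCH"],
        categories=["TOO LITTLE", "ABOUT RIGHT", "TOO MUCH"],
        ordered=True,
    )
)

# Ordered categories support >=, so we can select "ABOUT RIGHT" and above
at_least_right = opinions[opinions >= "ABOUT RIGHT"]
print(at_least_right.tolist())
```

Without `ordered=True`, the same comparison raises a `TypeError`.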

Now let's also apply these steps to education level in one go: collapsing, removing, and reordering.

# Define a dictionary to map old degree categories to new ones
new_degree = {"LT HIGH SCHOOL": "HIGH SCHOOL", 
              "BACHELOR": "COLLEGE/UNIVERSITY",
              "GRADUATE": "COLLEGE/UNIVERSITY",
              "JUNIOR COLLEGE": "COLLEGE/UNIVERSITY"}

# Replace old degree categories with new ones and convert to categorical data type
df['degree_clean'] = df['degree'].replace(new_degree).astype('category')

# Remove "DK" category from degree_clean column
df['degree_clean'] = df['degree_clean'].cat.remove_categories(['DK'])

# Reorder degree_clean categories and set as ordered
df['degree_clean'] = df['degree_clean'].cat.reorder_categories(['HIGH SCHOOL', "COLLEGE/UNIVERSITY"], ordered=True)

# Preview the new column
df['degree_clean'].value_counts(dropna=False)
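The three steps above can also be condensed into a single `pd.Categorical` call, since passing an explicit `categories` list both sets the order and nulls out anything not listed (so "DK" disappears without a separate removal step). A sketch on made-up values mirroring the GSS labels:

```python
import pandas as pd

# Toy degree values (illustrative)
raw = pd.Series(["LT HIGH SCHOOL", "BACHELOR", "DK", "HIGH SCHOOL"])

new_degree = {"LT HIGH SCHOOL": "HIGH SCHOOL",
              "BACHELOR": "COLLEGE/UNIVERSITY",
              "GRADUATE": "COLLEGE/UNIVERSITY",
              "JUNIOR COLLEGE": "COLLEGE/UNIVERSITY"}

# Collapse, remove, and order in one expression: values not in the
# categories list (here "DK") become NaN automatically
clean = pd.Categorical(
    raw.map(lambda v: new_degree.get(v, v)),
    categories=["HIGH SCHOOL", "COLLEGE/UNIVERSITY"],
    ordered=True,
)
print(list(clean))
```

Whether this is clearer than the step-by-step version is a matter of taste; the explicit steps are easier to debug.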

Let's simplify our year data

We can also bin numerical data to create categorical variables. There are a few reasons that we might want to do this:

  • It can simplify data and allow us to more easily spot trends and patterns.
  • It can make visualizing data easier, such as when you want to use bar plots.
# Set the decade boundaries and labels (intervals are right-closed by
# default, so each tuple's left edge is excluded and 1969 starts the 1970s)
decade_boundaries = [(1969, 1979), (1979, 1989), (1989, 1999), (1999, 2009), (2009, 2019)]
decade_labels = ['1970s', '1980s', '1990s', '2000s', '2010s']

# Set the bins and cut the DataFrame
bins = pd.IntervalIndex.from_tuples(decade_boundaries)
df['decade'] = pd.cut(df['year'],bins)

# Rename the categories (inplace=True was removed in pandas 2.0)
df['decade'] = df['decade'].cat.rename_categories(decade_labels)

# Preview the new column
df[["year", "decade"]]
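The cut-then-rename pattern can also be collapsed into one step, since `pd.cut` accepts a `labels` argument that names the bins directly. A sketch on a few toy years (the GSS runs from 1972 onward; these values are illustrative):

```python
import pandas as pd

# Toy years standing in for df['year']
years = pd.Series([1972, 1985, 1999, 2014])

# labels= names the bins as they are created,
# so no separate rename_categories call is needed
decades = pd.cut(
    years,
    bins=[1969, 1979, 1989, 1999, 2009, 2019],
    labels=["1970s", "1980s", "1990s", "2000s", "2010s"],
)
print(decades.tolist())
```

Note that bins are right-closed by default, so 1999 falls in the "1990s" bin.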

Visualizing categorical variables