Live Training - Working with Categorical Data in Python (Webinar)

    Analyzing Categorical Data from the General Social Survey in Python

    Welcome to your webinar workspace! In this session, we will introduce you to categorical variables in Python. We will be using a subset of data from the General Social Survey.

    The following code block imports some of the main packages we will be using, which are pandas, NumPy, and Plotly. We will also use statsmodels for a special type of categorical plot.

    We will read in our data and preview it as an interactive table. Please follow along with the code and feel free to ask any questions!

    # Import packages
    import pandas as pd
    import numpy as np
    import plotly.express as px
    import matplotlib.pyplot as plt
    from statsmodels.graphics.mosaicplot import mosaic
    
    # Read in csv as a DataFrame and preview it
    
    

    Above we see that our DataFrame contains float64 columns (numerical data) as well as a number of object columns. The object data type is how pandas stores columns of strings here.
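    As a minimal sketch of the load-and-preview step, the block below reads a tiny inline CSV with made-up rows in place of the survey file. Only labor_status is a column name taken from this session; the other column names, the values, and the name gss_demo are assumptions for illustration. Depending on the pandas version, text columns are reported as object or as a dedicated string type.

```python
import io

import pandas as pd

# Stand-in for the survey file: a tiny inline CSV with made-up rows.
# With the real data you would pass the file path to pd.read_csv instead.
csv_text = io.StringIO(
    "year,labor_status,marital_status\n"
    "2016.0,Working full time,Married\n"
    "2016.0,Retired,Widowed\n"
    "2018.0,Working part time,Never married\n"
)
gss_demo = pd.read_csv(csv_text)

# Preview the first rows and list each column's data type
print(gss_demo.head())
print(gss_demo.dtypes)
```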

    Inspecting individual columns

    To inspect the categorical columns, use the .describe() method with the include parameter to select a particular data type (in this case "O", for object). For each selected column, this returns the count of non-missing values, the number of unique values, the most frequent value (the mode), and how often that value occurs.
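    A small sketch of that call on made-up data (the values and the name demo are assumptions; the columns are created as object explicitly so the example behaves the same across pandas versions):

```python
import pandas as pd

# Made-up responses, stored explicitly as object columns
demo = pd.DataFrame(
    {
        "year": [2016.0, 2016.0, 2018.0, 2018.0],
        "labor_status": pd.Series(
            ["Retired", "Working", "Working", "Working"], dtype="object"
        ),
        "marital_status": pd.Series(
            ["Married", "Married", "Widowed", "Never married"], dtype="object"
        ),
    }
)

# Summarize only the object columns: count, unique, top (mode), freq
summary = demo.describe(include="O")
print(summary)
```

    The numeric year column is left out of the summary because it does not match the requested data type.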

    The .value_counts() method can give you a greater insight into the distribution and structure of a column.
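    For instance, on a short made-up column (the labels are illustrative, not the survey's actual categories), .value_counts() can return either raw counts or, with normalize=True, proportions:

```python
import pandas as pd

# Made-up labels for illustration
labor_status = pd.Series(
    ["Working", "Working", "Retired", "Working", "Student", "Retired"],
    name="labor_status",
)

# Raw counts, sorted from most to least frequent
counts = labor_status.value_counts()
print(counts)

# Same breakdown expressed as proportions of all rows
shares = labor_status.value_counts(normalize=True)
print(shares)
```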

    Manipulating categorical data

    Let's convert our object columns to categories

    • The category data type is useful for several reasons:
      • It saves memory when a column has only a few distinct values.
      • You can specify a precise order for the categories when the default order (e.g., alphabetical) is incorrect.
      • It is recognized by other Python libraries, which can treat the column as categorical.
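    The point about ordering can be sketched with pd.CategoricalDtype. The education levels below are hypothetical and are chosen because alphabetical order would rank them incorrectly:

```python
import pandas as pd

# Hypothetical levels, listed from lowest to highest
levels = ["High school", "Bachelor", "Graduate"]
degree = pd.Series(["Graduate", "High school", "Bachelor", "High school"])

# An ordered categorical type with an explicit category order
degree_type = pd.CategoricalDtype(categories=levels, ordered=True)
degree_cat = degree.astype(degree_type)

# Sorting and min/max now follow the declared order, not the alphabet
print(degree_cat.cat.categories)
print(degree_cat.min(), degree_cat.max())
print(degree_cat.sort_values().tolist())
```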

    Let's take our existing categorical variables and convert them from strings to categories. Here, we use .select_dtypes() to return only the object columns, build a dictionary that maps each of those column names to the category data type, and pass that dictionary to .astype().
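    One way this conversion might look, shown on a small made-up DataFrame (the names demo, conversion_dict, and demo_cat, and every value, are assumptions for illustration):

```python
import pandas as pd

# Made-up data with one numeric column and two object columns
demo = pd.DataFrame(
    {
        "year": [2016.0, 2018.0, 2018.0],
        "labor_status": pd.Series(
            ["Working", "Retired", "Working"], dtype="object"
        ),
        "marital_status": pd.Series(
            ["Married", "Widowed", "Married"], dtype="object"
        ),
    }
)

# Map every object column name to the "category" data type
object_columns = demo.select_dtypes(include="object").columns
conversion_dict = {column: "category" for column in object_columns}

# Convert the selected columns and check the resulting data types
demo_cat = demo.astype(conversion_dict)
print(demo_cat.dtypes)
```

    Columns that are not listed in the dictionary, such as year here, keep their original data type.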

    # Create a dictionary of column and data type mappings
    
    
    # Convert our DataFrame and check the data types
    
    

    Already we can see that the memory usage of the DataFrame has fallen from 7 MB to 4 MB, a reduction of roughly 40%! This can help when working with large quantities of data, such as the survey we are using here.
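    The saving can be checked directly with .memory_usage(deep=True). The sketch below uses a made-up column of 30,000 rows with three distinct labels rather than the survey data, so the byte counts it prints will differ from the figures quoted above:

```python
import pandas as pd

# 30,000 made-up rows containing only three distinct labels
labels = pd.Series(["Working", "Retired", "Student"] * 10_000, dtype="object")

# deep=True counts the string contents, not just the pointers to them
bytes_as_object = labels.memory_usage(deep=True)
bytes_as_category = labels.astype("category").memory_usage(deep=True)

print(f"object:   {bytes_as_object:,} bytes")
print(f"category: {bytes_as_category:,} bytes")
```

    A category column stores each distinct label once and keeps a small integer code per row, which is why the saving grows with the number of repeated values.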

    Cleaning up the labor_status column

    To analyze the relationship between employment and attitudes over time, we need to clean up the labor_status column. We can preview the existing categories using the column's .cat.categories attribute.
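    On a pandas Series, the categories are reached through the .cat accessor. A brief sketch with made-up labels (not the survey's actual categories):

```python
import pandas as pd

# Made-up labels stored as a category column
labor_status = pd.Series(
    ["Working full time", "Retired", "Working part time", "Retired"],
    dtype="category",
    name="labor_status",
)

# The distinct categories, returned as an Index
print(labor_status.cat.categories)
```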

    Let's collapse some of these categories. The easiest way to do this is to replace the values inside the column using a dictionary, and then reset the data type back to a category.
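    A sketch of that approach follows. The labels and the new_labels mapping are hypothetical, not the survey's real categories. The column is also cast to plain strings before .replace() is called, which is an assumption made here to avoid version-specific behavior when replacing values directly on a category column:

```python
import pandas as pd

# Made-up labels stored as a category column
labor_status = pd.Series(
    [
        "Working full time",
        "Working part time",
        "Retired",
        "Laid off",
        "Working full time",
    ],
    dtype="category",
    name="labor_status",
)

# Hypothetical mapping from detailed labels to broader groups
new_labels = {
    "Working full time": "Employed",
    "Working part time": "Employed",
    "Laid off": "Not employed",
    "Retired": "Not employed",
}

# Replace on plain strings, then set the data type back to category
collapsed = (
    labor_status.astype("object").replace(new_labels).astype("category")
)

print(collapsed.cat.categories)
print(collapsed.value_counts())
```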