Elias Dabbas














Pandas Tutorial: DataFrames in Python

    Pandas Tutorial: DataFrames in Python

    Explore data analysis with Python. Pandas DataFrames make manipulating your data easy, from selecting or replacing columns and indices to reshaping your data.

    Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. The DataFrame is one of these structures.

    This tutorial covers Pandas DataFrames, from basic manipulations to advanced operations, by tackling 11 of the most popular questions so that you understand (and avoid) the doubts of the Pythonistas who have gone before you.

    Content

    • How To Create a Pandas DataFrame
    • How To Select an Index or Column From a DataFrame
    • How To Add an Index, Row or Column to a DataFrame
    • How To Delete Indices, Rows or Columns From a DataFrame
    • How To Rename the Columns or Indices of a DataFrame
    • How To Format the Data in Your DataFrame
    • How To Create an Empty DataFrame
    • Does Pandas Recognize Dates When Importing Data?
    • When, Why and How You Should Reshape Your DataFrame
    • How To Iterate Over a DataFrame
    • How To Write a DataFrame to a File

    (For more practice, try the first chapter of this Pandas DataFrames course for free!)

    What Are Pandas Data Frames?

    Before you start, let’s have a brief recap of what DataFrames are.

    Those who are familiar with R know the data frame as a way to store data in rectangular grids that can easily be overviewed. Each row of these grids corresponds to measurements or values of an instance, while each column is a vector containing data for a specific variable. This means that a data frame’s rows do not need to contain, but can contain, the same type of values: they can be numeric, character, logical, etc.

    Now, DataFrames in Python are very similar: they come with the Pandas library, and they are defined as two-dimensional labeled data structures with columns of potentially different types.
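
    As a minimal sketch of "columns of potentially different types", the snippet below builds a small DataFrame and inspects its shape and column types. The column names and values are made up for illustration and are not part of the original tutorial.

```python
import pandas as pd

# Hypothetical data: three columns, each holding a different type of value
df_mixed = pd.DataFrame({
    "name": ["a", "b"],
    "score": [1.5, 2.5],
    "passed": [True, False],
})

# Two rows and three columns
print(df_mixed.shape)

# One dtype per column
print(df_mixed.dtypes)
```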

    In general, you could say that the Pandas DataFrame consists of three main components: the data, the index, and the columns.

    1. Firstly, the DataFrame can contain data that is:
    • a Pandas DataFrame
    • a Pandas Series: a one-dimensional labeled array capable of holding any data type with axis labels or index. An example of a Series object is one column from a DataFrame.
    • a NumPy ndarray, which can be a record array or a structured array
    • a two-dimensional ndarray
    • dictionaries of one-dimensional ndarrays, lists, dictionaries or Series.

    Note the difference between np.ndarray and np.array() . The former is an actual data type, while the latter is a function to make arrays from other data structures.
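
    The distinction can be checked directly. In this short sketch, the variable name values is only an illustrative choice:

```python
import numpy as np

# np.array() is a function that builds an array from another structure
values = np.array([[1, 2, 3], [4, 5, 6]])

# np.ndarray is the type of the object that np.array() returns
print(type(values))                    # <class 'numpy.ndarray'>
print(isinstance(values, np.ndarray))  # True
```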

    Structured arrays allow users to manipulate the data by named fields: in the example below, a structured array of three tuples is created. The first element of each tuple will be called foo and will be of type int, while the second element will be named bar and will be a float.

    Record arrays, on the other hand, expand the properties of structured arrays. They allow users to access fields of structured arrays by attribute rather than by index. You see below that the foo values are accessed in the my_array2 record array. An example:

    import pandas as pd
    import numpy as np

    # A structured array with an int field `foo` and a float field `bar`
    my_array = np.ones(3, dtype=[('foo', int), ('bar', float)])
    # Print the `foo` field of the structured array
    print(my_array['foo'])

    # A record array: a view of the same data with attribute access
    my_array2 = my_array.view(np.recarray)
    # Print the `foo` field of the record array
    print(my_array2.foo)

    2. Besides data, you can also specify the index and column names for your DataFrame. The index labels the rows, while the column names label the columns. You will see later that these two components of the DataFrame will come in handy when you’re manipulating your data.
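
    To make the three components concrete, here is a small sketch that passes data, index and columns explicitly. The labels Row1, Row2, Col1 and Col2 are placeholder names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values and labels, chosen only for illustration
df_labeled = pd.DataFrame(
    data=np.array([[1, 2], [3, 4]]),
    index=["Row1", "Row2"],
    columns=["Col1", "Col2"],
)

print(df_labeled)
print(df_labeled.index.tolist())    # ['Row1', 'Row2']
print(df_labeled.columns.tolist())  # ['Col1', 'Col2']
```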

    Note that in this post, most of the time, the libraries that you need have already been loaded in. The Pandas library is usually imported under the alias pd, while the NumPy library is loaded as np. Remember that when you code in your own data science environment, you shouldn’t forget this import step, which you write just like this:

    import numpy as np
    import pandas as pd

    Now that there is no doubt in your mind about what DataFrames are, what they can do and how they differ from other structures, it’s time to tackle the most common questions that users have about working with them!

    1. How To Create a Pandas DataFrame

    Making your DataFrames is the first step in almost anything that you want to do when it comes to data munging in Python. Sometimes, you will want to start from scratch, but you can also convert other data structures, such as lists or NumPy arrays, to Pandas DataFrames. In this section, you'll only cover the latter. However, if you want to read more on making empty DataFrames that you can fill up with data later, go to question 7.

    A NumPy ndarray is one of the many things that can serve as input to make a DataFrame. To make a DataFrame from a NumPy array, you can just pass it to DataFrame() in the data argument.

    data = np.array([['','Col1','Col2'],
                    ['Row1',1,2],
                    ['Row2',3,4]])
                    
    print(pd.DataFrame(data=data[1:,1:],
                      index=data[1:,0],
                      columns=data[0,1:]))

    Pay attention to how the code chunk above selects elements from the NumPy array to construct the DataFrame: you first select the values that are contained in the lists that start with Row1 and Row2, then you select the row labels Row1 and Row2 for the index, and then the column names Col1 and Col2.

    Next, note that each of these selections is a small subset of the data. This works the same as subsetting 2D NumPy arrays: you first indicate the rows that you want to look in for your data, then the columns. Don’t forget that the indices start at 0! For data in the example above, you look in the rows from index 1 to the end, and within those rows you select all elements from column index 1 onward. As a result, you end up selecting 1, 2, 3 and 4.
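
    The three selections can be inspected on their own. The sketch below reuses the data array from the example; the names values, row_labels and col_labels are introduced here only for illustration. One detail worth knowing: because the array mixes text and numbers, NumPy stores every element as a string, so the selected values are '1', '2', '3' and '4' rather than integers.

```python
import numpy as np

data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4]])

values = data[1:, 1:]     # rows 1 onward, columns 1 onward
row_labels = data[1:, 0]  # rows 1 onward, column 0
col_labels = data[0, 1:]  # row 0, columns 1 onward

print(values)
print(row_labels)
print(col_labels)

# The whole array has a Unicode string dtype, so the numbers are stored as text
print(data.dtype.kind)  # U
```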

    This approach to making DataFrames will be the same for all the structures that DataFrame() can take on as input.

    Try it out in the code chunk below: Remember that the Pandas library has already been imported for you as pd.

    # Take a 2D array as input to your DataFrame 
    my_2darray = np.array([[1, 2, 3], [4, 5, 6]])
    print(pd.DataFrame(my_2darray))
    
    # Take a dictionary as input to your DataFrame 
    my_dict = {1: ['1', '3'], 2: ['1', '2'], 3: ['2', '4']}
    print(pd.DataFrame(my_dict))
    
    # Take a DataFrame as input to your DataFrame 
    my_df = pd.DataFrame(data=[4,5,6,7], index=range(0,4), columns=['A'])
    print(pd.DataFrame(my_df))
    
    # Take a Series as input to your DataFrame
    my_series = pd.Series({"Belgium":"Brussels", "India":"New Delhi", "United Kingdom":"London", "United States":"Washington"})
    print(pd.DataFrame(my_series))

    Note that the index of your Series (and DataFrame) contains the keys of the original dictionary. Recent versions of Pandas keep the keys in insertion order, while older versions sorted them; here both give the same result, because the keys are already in alphabetical order: Belgium will be the index at position 0, while the United States will be the index at position 3.

    After you have created your DataFrame, you might want to know a little bit more about it. You can use the shape property or the len() function in combination with the .index property:

    df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
    
    # Use the `shape` property
    print(df.shape)
    
    # Or use the `len()` function with the `index` property
    print(len(df.index))

    These two options give you slightly different information on your DataFrame: the shape property will provide you with the dimensions of your DataFrame. That means that you will get to know the width and the height of your DataFrame. On the other hand, the len() function, in combination with the index property, will only give you information on the height of your DataFrame.

    None of this is surprising, though, since you explicitly pass in the index property.

    You could also use df[0].count() to get to know more about the height of your DataFrame, but this will exclude the NaN values (if there are any). That is why calling .count() on your DataFrame is not always the better option.
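
    A short sketch of that difference, using a hypothetical DataFrame df_nan with one missing value in column 0:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 0 contains one missing value
df_nan = pd.DataFrame({0: [1.0, np.nan, 3.0], 1: [4.0, 5.0, 6.0]})

# len() on the index counts every row
print(len(df_nan.index))  # 3

# count() skips the NaN in column 0
print(df_nan[0].count())  # 2
```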

    If you want more information on your DataFrame columns, you can always execute list(my_dataframe.columns.values). Try this out for yourself on the df DataFrame from the code above!
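
    For instance, assuming my_dataframe is a small DataFrame with columns A, B and C (made-up data for illustration), the call returns a plain Python list of the column names:

```python
import pandas as pd

# Hypothetical DataFrame, used only to show the call
my_dataframe = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})

column_names = list(my_dataframe.columns.values)
print(column_names)  # ['A', 'B', 'C']
```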

    Fundamental DataFrame Operations

    Now that you have put your data in a more convenient Pandas DataFrame structure, it’s time to get to the real work!

    This first section will guide you through the first steps of working with DataFrames in Python. It will cover the basic operations that you can do on your newly created DataFrame: adding, selecting, deleting, renaming, … You name it!

    2. How To Select an Index or Column From a Pandas DataFrame

    Before you start with adding, deleting and renaming the components of your DataFrame, you first need to know how you can select these elements. So, how do you do this?

    Selecting an index, column or value from your DataFrame isn’t that hard, and you might still remember how to do it from the previous section. It’s similar to what you see in other languages (or packages!) that are used for data analysis. If you aren't convinced, consider the following:

    In R, you use the [,] notation to access the data frame’s values.

    Now, let’s say you have a DataFrame like this one:

       A  B  C
    0  1  2  3
    1  4  5  6
    2  7  8  9

    And you want to access the value that is at index 0, in column ‘A’.

    Various options exist to get your value 1 back:

    df = pd.DataFrame({"A":[1,4,7], "B":[2,5,8], "C":[3,6,9]})
    print(df)
    # Using `iloc[]`: row position 0, column position 0
    print(df.iloc[0, 0])
    
    # Using `loc[]`: row label 0, column label 'A'
    print(df.loc[0, 'A'])
    
    # Using `at[]`
    print(df.at[0,'A'])
    
    # Using `iat[]`
    print(df.iat[0,0])

    The most important ones to remember are, without a doubt, .loc[] and .iloc[]. The subtle differences between these two will be discussed in the next sections.
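
    As a quick preview, here is a sketch in which the two give different answers. It assumes a hypothetical DataFrame df_shuffled whose index labels (2, 1, 0) do not match the row positions: .loc[] looks up the label, while .iloc[] looks up the position.

```python
import pandas as pd

# Hypothetical DataFrame: the index labels run in the opposite order to the positions
df_shuffled = pd.DataFrame({"A": [1, 4, 7]}, index=[2, 1, 0])

# `loc[]` selects by label: the row labeled 0 is the last row
print(df_shuffled.loc[0, "A"])  # 7

# `iloc[]` selects by position: position 0 is the first row
print(df_shuffled.iloc[0, 0])   # 1
```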

    Enough for now about selecting values from your DataFrame. What about selecting rows and columns? In that case, you would use: