Introduction to Importing Data in Python
1. Import the course packages
    # Import the course packages
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.io
    import h5py
    from sas7bdat import SAS7BDAT
    from sqlalchemy import create_engine, text
    import pickle

    # Import the course datasets
    titanic = pd.read_csv("datasets/titanic_sub.csv")
    battledeath_2002 = pd.ExcelFile("datasets/battledeath.xlsx").parse("2002")

    # Query the Chinook SQLite database; text() wraps the raw SQL string,
    # which SQLAlchemy 2.0 requires for Connection.execute()
    engine = create_engine('sqlite:///Chinook.sqlite')
    con = engine.connect()
    rs = con.execute(text('SELECT * FROM Album'))
    chinook = pd.DataFrame(rs.fetchall())

    # Load the tab-delimited sea slug data as strings
    seaslug = np.loadtxt("seaslug.txt", delimiter="\t", dtype=str)
1.1.2 Importing entire text files
- The file moby_dick.txt is a text file that contains the opening sentences of Moby Dick.
    # Open a file: file
    file = open('moby_dick.txt', 'r')

    # Print it
    print(file.read())

    # Check whether file is closed
    print(file.closed)

    # Close file
    file.close()

    # Check whether file is closed
    print(file.closed)
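The explicit open()/close() pair can also be written with a context manager, which closes the file on leaving the block even if reading raises an error. The following is a minimal sketch, assuming a throwaway file in a temporary directory; the file name, the sample text, and the variable names are illustrative and not part of the course data.

```python
import os
import tempfile

# Hypothetical sample file so the example runs without moby_dick.txt
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w") as f:
    f.write("Call me Ishmael.\nSome years ago, never mind how long precisely.\n")

# The with-block closes the file automatically when the block ends
with open(sample_path, "r") as file:
    contents = file.read()        # whole file as a single string
    closed_inside = file.closed   # still open inside the block

closed_after = file.closed        # closed once the block has ended

print(contents)
print(closed_inside, closed_after)
```

Because the file is closed automatically, there is no separate file.close() call to forget.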
1.1.3 Importing text files line by line
- For large files, you may not want to print all of the content to the shell; you may wish to print only the first few lines.
- This is where the readline() method comes in: each call returns the next line of the file.
    # Read & print the first 3 lines
    with open('moby_dick.txt') as file:
        print(file.readline())
        print(file.readline())
        print(file.readline())
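Repeated readline() calls work well for a handful of lines. For an arbitrary number of lines, one option is itertools.islice, since a file object is also an iterator over its lines. This is a sketch under the assumption of a small temporary file; the path and its contents are made up for illustration.

```python
import os
import tempfile
from itertools import islice

# Hypothetical five-line file standing in for a large text file
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "lines.txt")
with open(sample_path, "w") as f:
    f.write("line 1\nline 2\nline 3\nline 4\nline 5\n")

# readline() returns one line per call, including the trailing newline
with open(sample_path) as file:
    first = file.readline()
    second = file.readline()

# islice stops after 3 lines without reading the rest of the file
with open(sample_path) as file:
    first_three = [line.rstrip("\n") for line in islice(file, 3)]

print(repr(first), repr(second))
print(first_three)
```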
1.3.1 Using NumPy to import flat files
- We'll load the MNIST digit recognition dataset using the numpy function loadtxt().
    # Import package
    import numpy as np
    import matplotlib.pyplot as plt

    # Assign filename to variable: file
    file = 'digits.csv'

    # Load file as array: digits
    digits = np.loadtxt(file, delimiter=',', skiprows=1)

    # Print datatype of digits
    print(type(digits))

    # Select and reshape a row
    im = digits[21, 1:]
    im_sq = np.reshape(im, (28, 28))

    # Plot reshaped data (matplotlib.pyplot already loaded as plt)
    plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
    plt.show()
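The slicing and reshaping step can be checked on synthetic data. The sketch below assumes, as the slice digits[21, 1:] and the (28, 28) target shape suggest, that each row holds a label followed by 784 pixel values; the row itself is invented for illustration.

```python
import numpy as np

# Hypothetical stand-in for one row of digits.csv:
# a label (7) followed by 784 pixel values (here simply 0..783)
row = np.concatenate(([7.0], np.arange(784, dtype=float)))

label = row[0]
pixels = row[1:]                       # drop the label column
image = np.reshape(pixels, (28, 28))   # 784 values -> 28 x 28 grid, filled row by row

print(label, image.shape)
print(image[0, 27], image[1, 0])       # last pixel of row 0, first pixel of row 1
```

Since np.reshape fills the grid row by row, pixel 27 ends the first image row and pixel 28 starts the second.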
1.3.2 Customizing your NumPy import
- np.loadtxt() takes a number of useful arguments:
  - delimiter sets the delimiter that loadtxt() expects:
    - ',' for comma-delimited files.
    - '\t' for tab-delimited files.
  - skiprows specifies how many rows (not indices) to skip at the top of the file, for example a header row.
  - usecols takes a list of the indices of the columns to keep.
    # Import numpy
    import numpy as np

    # Assign the filename: file
    file = 'digits_header.txt'

    # Load the data: data
    data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 3])

    # Print data
    print(data)
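To see what delimiter, skiprows, and usecols do together, the same call can be run on a small in-memory table. This is a sketch with made-up values; io.StringIO stands in for a file on disk.

```python
import io
import numpy as np

# Hypothetical tab-delimited data with one header row and four columns
raw = "a\tb\tc\td\n1\t2\t3\t4\n5\t6\t7\t8\n"

data = np.loadtxt(
    io.StringIO(raw),
    delimiter="\t",   # columns are separated by tabs
    skiprows=1,       # skip one row: the header
    usecols=[0, 3],   # keep only the first and fourth columns
)

print(data)
```

The result keeps two rows and two columns, and the values are floats because float is the default dtype of np.loadtxt().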
1.3.3 Importing different datatypes
These data consist of the percentage of sea slug larvae that had metamorphosed in a given time period.
Because the file has a header row of strings, importing it with np.loadtxt() and the default float datatype
makes Python throw a ValueError telling you that it could not convert a string to a float.
There are two ways to deal with this:
First, you can set the data type argument dtype equal to str (for string), so that every entry, including the header, is read as a string.
Alternatively, you can skip the first row using the skiprows argument.
    import numpy as np
    import matplotlib.pyplot as plt

    # Assign filename: file
    file = 'seaslug.txt'

    # Import file: data
    data = np.loadtxt(file, delimiter='\t', dtype=str)

    # Print the first element of data
    print(data[0])

    # Import data as floats and skip the first row: data_float
    data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

    # Print the 10th element of data_float
    print(data_float[9])

    # Plot a scatterplot of the data
    plt.scatter(data_float[:, 0], data_float[:, 1])
    plt.xlabel('time (min.)')
    plt.ylabel('percentage of larvae')
    plt.show()
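The error and both workarounds can be reproduced on a small in-memory sample. The sketch below uses invented rows and column names rather than the real seaslug.txt.

```python
import io
import numpy as np

# Hypothetical tab-delimited sample with a string header row
raw = "Time\tPercent\n99\t0.067\n99\t0.133\n"

# With the default float dtype, the header row cannot be converted
try:
    np.loadtxt(io.StringIO(raw), delimiter="\t")
    header_failed = False
except ValueError:
    header_failed = True

# Option 1: read everything, header included, as strings
as_str = np.loadtxt(io.StringIO(raw), delimiter="\t", dtype=str)

# Option 2: skip the header and read the remaining rows as floats
as_float = np.loadtxt(io.StringIO(raw), delimiter="\t", skiprows=1)

print(header_failed)
print(as_str[0])
print(as_float)
```

Option 1 keeps the header but leaves the numbers as strings, so option 2 is usually the better fit when the values are needed for plotting or arithmetic.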
1.3.4 Working with mixed datatypes
- np.genfromtxt() can import data containing mixed datatypes.
- Another function, np.recfromcsv(), behaves similarly to np.genfromtxt(), except that its defaults are delimiter=',', names=True, and dtype=None.
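    import numpy as np

    # Assign the filename: file
    file = 'titanic.csv'

    # Import file using np.recfromcsv: d
    # Note: NumPy 2.0 and later no longer expose np.recfromcsv in the main namespace;
    # np.genfromtxt(file, delimiter=',', names=True, dtype=None) is the closest equivalent
    d = np.recfromcsv(file)

    # Print out first three entries of d
    print(d[:3])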
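The same kind of mixed-type import can be written with np.genfromtxt() directly, which is useful where np.recfromcsv() is unavailable. This is a sketch on a made-up CSV sample, not the real titanic.csv; encoding="utf-8" is passed so that text columns come back as ordinary strings.

```python
import io
import numpy as np

# Hypothetical CSV sample with an integer, a string, and a float column
raw = "PassengerId,Name,Age\n1,Braund,22.0\n2,Cumings,38.0\n"

d = np.genfromtxt(
    io.StringIO(raw),
    delimiter=",",     # comma-separated values
    names=True,        # take field names from the header row
    dtype=None,        # infer a datatype for each column separately
    encoding="utf-8",  # decode text columns as str rather than bytes
)

print(d.dtype.names)
print(d[:2])
print(d["Age"])
```

The result is a one-dimensional structured array in which each column is accessed by its field name, for example d["Age"].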
1.4.1 Using pandas to import flat files as DataFrames
- With NumPy, we were able to import flat files containing columns with different datatypes as numpy arrays.
- However, the DataFrame object in pandas is a more appropriate structure in which to store such data.
- Thankfully, we can easily import files of mixed data types as DataFrames using the pandas functions read_csv() and read_table().
    # Import pandas as pd
    import pandas as pd

    # Assign the filename: file
    file = 'titanic.csv'

    # Read the file into a DataFrame: df
    df = pd.read_csv(file)

    # View the head of the DataFrame
    print(df.head())
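read_csv() and read_table() differ mainly in their default separator: a comma for read_csv() and a tab for read_table(). The sketch below illustrates this on a small made-up table and shows that each column gets its own datatype.

```python
import io
import pandas as pd

# Hypothetical sample: the same table as comma-separated and tab-separated text
csv_text = "Name,Age,Survived\nBraund,22.0,0\nCumings,38.0,1\n"
tsv_text = csv_text.replace(",", "\t")

df_csv = pd.read_csv(io.StringIO(csv_text))    # default separator: ','
df_tab = pd.read_table(io.StringIO(tsv_text))  # default separator: '\t'

print(df_csv.dtypes)           # one dtype per column
print(df_csv.equals(df_tab))   # both calls yield the same DataFrame
```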
1.4.2 Using pandas to import flat files as DataFrames II
- It is straightforward to retrieve the corresponding numpy array from a DataFrame using its values attribute.
    import pandas as pd

    # Assign the filename: file
    file = 'digits.csv'

    # Read the first 5 rows of the file into a DataFrame: data
    data = pd.read_csv(file, nrows=5, header=None)

    # Build a numpy array from the DataFrame: data_array
    data_array = data.values

    # Print the datatype of data_array to the shell
    print(type(data_array))
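Besides the values attribute, pandas also provides the to_numpy() method, which the pandas documentation recommends for new code. The sketch below uses a small made-up DataFrame to show that both return the same numpy array.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column DataFrame
df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

arr_values = df.values        # attribute used in the notes above
arr_to_numpy = df.to_numpy()  # method form of the same conversion

print(type(arr_values))
print(np.array_equal(arr_values, arr_to_numpy))
```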
1.4.3 Customizing your pandas import
- The pandas package handles many of the issues encountered when importing data as a data scientist, such as comments occurring in flat files, empty lines, and missing values.
- Note that missing values are also commonly referred to as NA or NaN.
- In read_csv(), sep sets the delimiter, comment gives the character that starts a comment, and na_values lists the strings to treat as NA/NaN.
    # Import matplotlib.pyplot as plt
    import matplotlib.pyplot as plt
    import pandas as pd

    # Assign filename: file
    file = 'titanic_corrupt.txt'

    # Import file: data
    data = pd.read_csv(file, sep='\t', comment='#', na_values=['Nothing'])

    # Print the head of the DataFrame
    print(data.head())

    # Plot 'Age' variable in a histogram
    pd.DataFrame.hist(data[['Age']])
    plt.xlabel('Age (years)')
    plt.ylabel('count')
    plt.show()
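The effect of sep, comment, and na_values can be checked on a small in-memory sample. This is a sketch with invented rows that imitate a messy file: a comment line, an empty line, and a placeholder string for a missing value.

```python
import io
import pandas as pd

# Hypothetical messy tab-delimited text, not the real titanic_corrupt.txt
raw = (
    "# this line is a comment\n"
    "Name\tAge\n"
    "Braund\t22\n"
    "\n"
    "Cumings\tNothing\n"
    "Heikkinen\t26\n"
)

data = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                # tab-delimited
    comment="#",             # ignore everything after '#'
    na_values=["Nothing"],   # treat the string 'Nothing' as NA/NaN
)

print(data)
print(data["Age"].isna().sum())
```

The comment line and the empty line are dropped, and the 'Nothing' entry becomes NaN, which is why the Age column is read as floats rather than integers.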