Workspace
James Mwangi/

Introduction to Importing Data in Python

1. Import the course packages

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import h5py
from sas7bdat import SAS7BDAT
from sqlalchemy import create_engine, text
import pickle

# Import the course datasets
titanic = pd.read_csv("datasets/titanic_sub.csv")
battledeath_2002 = pd.ExcelFile("datasets/battledeath.xlsx").parse("2002")
engine = create_engine('sqlite:///Chinook.sqlite')
# Wrap raw SQL in text() (required by SQLAlchemy 2.0) and let the
# with-block close the connection once the rows have been fetched
with engine.connect() as con:
    rs = con.execute(text('SELECT * FROM Album'))
    chinook = pd.DataFrame(rs.fetchall(), columns=list(rs.keys()))
seaslug = np.loadtxt("seaslug.txt", delimiter="\t", dtype=str)

1.1.2 Importing entire text files

  • moby_dick.txt is a text file that contains the opening sentences of Moby Dick.
# Open a file: file
file = open('moby_dick.txt','r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

1.1.3 Importing text files line by line

  • For large files, you may not want to print all of the content to the shell; you may wish to print only the first few lines.
  • The readline() method does this: each call reads the next line of the file.
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
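
When more than a handful of lines is needed, repeating readline() gets tedious. The following is a minimal sketch, assuming a small stand-in file written to a temporary directory; the file name sample.txt and the helper read_first_lines are hypothetical and not part of the course. It uses itertools.islice to take the first n lines without reading the rest of the file.

```python
import itertools
import os
import tempfile


def read_first_lines(path, n):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path) as file:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip('\n') for line in itertools.islice(file, n)]


# Stand-in file for illustration only (not the course's moby_dick.txt)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.txt')
with open(sample_path, 'w') as f:
    f.write('line one\nline two\nline three\nline four\n')

first_three = read_first_lines(sample_path, 3)
print(first_three)
```

Asking for more lines than the file contains simply returns every line, so no length check is needed beforehand.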

1.3.1 Using NumPy to import flat files

  • We'll load the MNIST digit recognition dataset using the NumPy function loadtxt().
# Import packages
import numpy as np
import matplotlib.pyplot as plt

# Assign filename to variable: file
file = 'digits.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=',', skiprows=1)

# Print datatype of digits
print(type(digits))

# Select row 21, skip column 0, and reshape the remaining 784 values to 28 x 28
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

1.3.2 Customizing your NumPy import

  • np.loadtxt() takes a number of useful arguments:
  • delimiter changes the delimiter that loadtxt() expects: ',' for comma-delimited files, '\t' for tab-delimited files.
  • skiprows specifies how many rows (not indices) to skip from the top of the file.
  • usecols takes a list of the indices of the columns to keep.
# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,3])

# Print data
print(data)
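
To see the effect of these arguments without the course file, here is a small sketch that reads from an in-memory string instead. The four-column sample data is made up for illustration; only the np.loadtxt() arguments mirror the call above.

```python
import io

import numpy as np

# Hypothetical tab-delimited data with a one-line header (illustration only)
raw = "a\tb\tc\td\n1\t2\t3\t4\n5\t6\t7\t8\n"

# skiprows=1 drops the header; usecols keeps the columns at index 0 and 3
subset = np.loadtxt(io.StringIO(raw), delimiter='\t', skiprows=1, usecols=[0, 3])
print(subset)
```

The result has one row per data line and only the two requested columns, stored as floats because that is the default dtype of loadtxt().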

1.3.3 Importing different datatypes

  • The data consist of the percentage of sea slug larvae that had metamorphosed in a given time period.

  • Because the file has a text header, importing it with np.loadtxt() and the default float dtype raises a ValueError saying that it could not convert a string to float.

  • There are two ways to deal with this:

  • Set the data type argument dtype equal to str (for string), so that every value, including the header, is read as a string.

  • Alternatively, skip the first row using the skiprows argument.

import matplotlib.pyplot as plt

# Assign filename: file
file = 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1) 

# Print the 10th element of data_float
print(data_float[9])

# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()
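
The failure and both workarounds can be reproduced on a tiny in-memory sample. This is a sketch under assumptions: the header names and values below are invented stand-ins, not the contents of seaslug.txt.

```python
import io

import numpy as np

# Hypothetical stand-in for a tab-delimited file with a text header
raw = "Time\tPercent\n99\t0.067\n99\t0.133\n"

# With the default dtype (float), the header cannot be converted
try:
    np.loadtxt(io.StringIO(raw), delimiter='\t')
    header_failed = False
except ValueError as err:
    header_failed = True
    print('ValueError:', err)

# Option 1: read everything, header included, as strings
as_str = np.loadtxt(io.StringIO(raw), delimiter='\t', dtype=str)

# Option 2: skip the header row and keep the numeric data as floats
as_float = np.loadtxt(io.StringIO(raw), delimiter='\t', skiprows=1)

print(as_str[0])
print(as_float[0])
```

Option 1 keeps the header but leaves the numbers as strings, so option 2 is usually the better choice when the values are needed for plotting or arithmetic.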

1.3.4 Working with mixed datatypes

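  • We used np.genfromtxt() to import data containing mixed datatypes.
  • Another function, np.recfromcsv(), behaves similarly to np.genfromtxt(), except that its default dtype is None (its defaults also include delimiter=',' and names=True).
  • In NumPy 2.0 and later, np.recfromcsv() is no longer available in the main namespace; np.genfromtxt(file, delimiter=',', names=True, dtype=None) is the closest equivalent there.
# Assign the filename: file
file = 'titanic.csv'

# Import file using np.recfromcsv: d (requires NumPy < 2.0)
d = np.recfromcsv(file)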

# Print out first three entries of d
print(d[:3])
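
For comparison, the same kind of import can be sketched with np.genfromtxt(), which exists in both older and newer NumPy releases. The sample rows below are made up, loosely modeled on Titanic-style columns, and are not taken from titanic.csv.

```python
import io

import numpy as np

# Hypothetical mixed-type sample (illustration only)
raw = (
    "PassengerId,Survived,Sex,Age\n"
    "1,0,male,22.0\n"
    "2,1,female,38.0\n"
    "3,1,female,26.0\n"
)

# names=True reads field names from the header;
# dtype=None lets NumPy infer a separate type for each column
d = np.genfromtxt(io.StringIO(raw), delimiter=',', names=True,
                  dtype=None, encoding='utf-8')

print(d.dtype.names)
print(d['Sex'])
print(d['Age'].mean())
```

The result is a one-dimensional structured array: each element is a row, and columns are accessed by field name rather than by integer index.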

1.4.1 Using pandas to import flat files as DataFrames

  • We were able to import flat files containing columns with different datatypes as NumPy arrays.
  • The DataFrame object in pandas is a more appropriate structure in which to store such data.
  • Thankfully, we can easily import files of mixed datatypes as DataFrames using the pandas functions read_csv() and read_table().
# Import pandas as pd
import pandas as pd

# Assign the filename: file
file = 'titanic.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)

# View the head of the DataFrame
print(df.head())
 

1.4.2 Using pandas to import flat files as DataFrames II

  • It is straightforward to retrieve the corresponding NumPy array from a DataFrame using its values attribute.
# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = data.values

# Print the datatype of data_array to the shell
print(type(data_array))
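
The pandas documentation recommends the to_numpy() method over the values attribute for new code; both return a NumPy array here. A minimal sketch on a made-up, headerless sample (a stand-in, not the real digits.csv):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical headerless CSV standing in for a digits file
raw = "1,0,0,0\n0,0,255,0\n4,0,0,128\n"

# header=None: the first line is data; nrows=2 reads only the first two rows
data = pd.read_csv(io.StringIO(raw), nrows=2, header=None)

values_array = data.values      # attribute used in this section
numpy_array = data.to_numpy()   # method recommended by the pandas documentation

print(type(numpy_array), numpy_array.shape)
```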

1.4.3 Customizing your pandas import

  • The pandas package handles many of the issues commonly encountered when importing data, such as comments in flat files, empty lines, and missing values.
  • Note that missing values are also commonly referred to as NA or NaN.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
import pandas as pd

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values=['Nothing'])

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
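
The effect of sep, comment and na_values can be checked on a small in-memory sample. This is a sketch under assumptions: the rows below are invented for illustration and do not come from titanic_corrupt.txt.

```python
import io

import pandas as pd

# Hypothetical tab-delimited sample with a comment line, an empty line,
# and the placeholder 'Nothing' standing in for a missing value
raw = (
    "Name\tAge\tFare\n"
    "# comment line that should be ignored\n"
    "Braund\t22\t7.25\n"
    "\n"
    "Cumings\tNothing\t71.28\n"
    "Heikkinen\t26\t7.92\n"
)

data = pd.read_csv(io.StringIO(raw), sep='\t', comment='#',
                   na_values=['Nothing'])

print(data)
print(data['Age'].isna().sum())
```

The comment line and the empty line are dropped, and 'Nothing' becomes NaN, so the Age column is read as floats and can be plotted or summarized directly.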