Course Notes
- In this course we analyze our data step by step
- Read the data from a CSV file
- Summarize the number of missing values and the statistics of the numeric data
- Use a histogram to look at the distribution of numeric data
- How to select numeric or categorical data from a DataFrame and how to create new columns
- Use Seaborn plots: a boxplot shows the median, a barplot shows the mean
- To examine the relationship between two variables, use a scatterplot
- Strategies for addressing missing data
- Imputing summary statistics
- Converting and analyzing categorical data
- How to clean outliers
- DateTime data
- Correlations
- Relative class frequency (crosstab)
- Hypotheses
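The note on boxplots and barplots can be checked numerically: the center line of a Seaborn boxplot marks the median, and a barplot shows the mean by default. The sketch below computes both per group with pandas. The `toy_books` frame and its values are made up for illustration and are not the course dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy_books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 4.2, 4.9, 3.8, 4.6],
})

# Median (what a boxplot's center line shows) and mean (what a barplot's
# bar height shows by default), computed per genre.
rating_by_genre = toy_books.groupby("genre")["rating"].agg(["median", "mean"])
print(rating_by_genre)
```

When the median and the mean of a group differ noticeably, as they do for the toy "Fiction" rows here, the distribution is likely skewed, which is a reason to look at both plots.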
# Import any packages you want to use here
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn as sb
# Display the dataset
clean_books = pd.read_csv('datasets/clean_books.csv', encoding='utf-8')
clean_books
# Summarize the non-null count and data type of each column, plus memory usage
clean_books.info()
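`info()` reports non-null counts, so missing values have to be inferred by comparing against the total row count. A more direct count is possible with `isna()`. This is a minimal sketch on a made-up `toy_books` frame (an assumption, not the course data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few deliberately missing entries.
toy_books = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C", "Book D"],
    "rating": [4.5, np.nan, 4.1, 4.8],
    "genre": ["Fiction", "Non Fiction", None, "Fiction"],
})

# Number of missing values in each column
missing_counts = toy_books.isna().sum()
print(missing_counts)

# Fraction of missing values in each column
missing_share = toy_books.isna().mean()
print(missing_share)
```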
# Summary statistics for the numeric columns
clean_books.describe()
# Plot a histogram to look at the distribution of the numeric ratings
sns.histplot(x="rating", data=clean_books)
plt.show()
# Check the data type of each column
clean_books.dtypes
# Use the isin() method to test whether each genre is one of the listed values
clean_books['genre'].isin(['Fiction', 'Non Fiction'])
# Count the values
clean_books.value_counts('genre')
# Use the tilde operator (~) to negate the mask: True where the genre is NOT 'Fiction'
~clean_books['genre'].isin(['Fiction'])
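The boolean Series returned by `isin()` is usually used as a mask to filter rows. The sketch below shows the mask and its negation applied to a made-up `toy_books` frame (hypothetical names and values).

```python
import pandas as pd

# Hypothetical toy data; not the course dataset.
toy_books = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C", "Book D"],
    "genre": ["Fiction", "Non Fiction", "Childrens", "Fiction"],
})

# Boolean mask: True where the genre is 'Fiction'
is_fiction = toy_books["genre"].isin(["Fiction"])

# Keep only the rows where the mask is True
fiction_only = toy_books[is_fiction]

# Negate the mask with ~ to keep every other row
not_fiction = toy_books[~is_fiction]

print(fiction_only)
print(not_fiction)
```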
# Check whether the year 2020 exists in the data (compare as integers, not strings)
clean_books['year'].astype(int).isin([2020]).any()
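When testing years with `isin()`, the element type matters: pandas treats strings and integers as distinct, so the string `'2020'` never matches the integer `2020`. The sketch below illustrates this on a made-up `toy_years` Series, assuming the year column holds integers.

```python
import pandas as pd

# Hypothetical integer year column; the values are invented.
toy_years = pd.Series([2019, 2020, 2020, 2018], name="year")

# A string never matches an integer, so this mask is all False
matches_as_string = toy_years.isin(["2020"])

# Comparing against an integer finds the matching rows
matches_as_int = toy_years.isin([2020])

print(matches_as_string.any())  # does any row match the string?
print(matches_as_int.any())     # does any row match the integer?
print(matches_as_int.sum())     # how many rows match the integer?
```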
sns.boxplot(x=clean_books["year"].astype(int), y=clean_books['rating'])
plt.xticks(rotation=45)
plt.show()
# What is the median year?
sns.set()
sns.boxenplot(x=clean_books["year"].astype(int))
plt.show()
import numpy as np
print(np.median(clean_books['year']), 'Median')
print(np.max(clean_books['year']), 'Max')
print(np.min(clean_books['year']), 'Min')
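The same three summaries can also be obtained in one call with the pandas `agg()` method, which may read more naturally when the data is already a Series. A minimal sketch on a made-up `toy_years` Series (hypothetical values):

```python
import pandas as pd

# Hypothetical year values; not the course dataset.
toy_years = pd.Series([2009, 2010, 2010, 2015, 2019], name="year")

# Compute the median, maximum, and minimum in a single call;
# the result is a Series indexed by the statistic names.
year_summary = toy_years.agg(["median", "max", "min"])
print(year_summary)
```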