chapter 4

Beta

#4.2 ttest
import pandas as pd
from scipy.stats import shapiro
#import statsmodels as stats
import scipy.stats as stats
import numpy as np
import math

per=pd.read_csv('2018-personality-data.csv')
print(per.size)
per=per[['userid',' agreeableness']]
per.columns=['userid','agreeableness']
per.head()

from numpy.random import seed, randn
syn=np.random.normal(4, 1.0, 1834)
print(len(per.userid))
new=pd.DataFrame({'userid':per['userid'].to_list(),'agreeableness':list(syn)})
per=new
per['agreeableness']=per.agreeableness.round(2)#np.floor)
per

#col=per.columns
df1=per.sample(frac=.5,random_state=1)
df2=per.merge(df1.drop_duplicates(), on=per.columns.to_list(),how='left',indicator=True)
df2=df2[df2._merge=='left_only'].iloc[:,:-1]
print('df1',df1.head(7))
print('df2',df2.head(7))

#OF agreableness column:
#calculate means
df1_mean=df1.agreeableness.mean()
df2_mean=df2.agreeableness.mean()
print('df1mean=',df1_mean)
print('df2mean=',df2_mean)
#check normality of data using Shapiro-Wilks test
df1_norm=stats.shapiro(df1.agreeableness)
df2_norm=stats.shapiro(df2.agreeableness)
print(df1_norm) #pvalue<.05; normal
print(df2_norm) #pvalue<.05; normal

#df1var=levene(df1.agreeableness)
#df2var=levene(df2.agreeableness)
#print('df1var',df1var)
#print('df2var',df2var)
stats.levene(df1.agreeableness,df2.agreeableness)

#calculate ttest
from statsmodels.stats.weightstats import ttest_ind
ttest_ind(df1.agreeableness,df2.agreeableness)

#the pvalue of the test comes out to be equal to .413, which is greater than the significance level alpha, that is .05. THis implies that we can say that the average agreeablenss of people surveyed by one reseracher is not different from the average agreeableness of people surveyed by the other researcher.

'''
Susan is convinced that women are more extraverted that men, simply because her best friend Betsy, is a party animal, and her other best friend Oliver, would rather lock himself up in his room reading books. To support her hypothesis, she posts a survey online, asking all her facebook friends to score themselves on how extraverted they think they are. 1 indicates completely introvert, while 8 indicates completely extrovert.
She wants to compare the average male score from `male_survey`, to the average female score, `female_survey`, using the two-sample ttest, but must first, check out the three assumptions associated with the test.

`Pandas` and `statsmodels` has been uploaded for you as `pd` and `stats`, respectively.
'''

per=pd.read_csv('psyc.csv')
print(per.size)
per=per[['gender','extraversion']]
#per['extraversion']=(per.extraversion-per.extraversion.min())/per.extraversion.std()
#per['extraversion']=(per.extraversion-per.extraversion.min())/(per.extraversion.max()-per.extraversion.min())
per.head()

#gender
male=per[per.gender=='Male']
female=per[per.gender=='Female']
female.size
'''
Calculate the mean extraversion of male_survey and female_survey.
'''

print(male.extraversion.mean())
print(female.extraversion.mean())

'''
Test for normality of extraversion from male_survey and female_survey.

Use the Shapiro-Wilks test here
'''
#test for normality assumption
print('male_norm=',stats.shapiro(male.extraversion))
print('female_norm=',stats.shapiro(female.extraversion))

'''
Test to see if the variance of the extraversion from male_survey and female_survey is identical.

Use the Levene tests here
'''
#test for normality assumption
print('variance_test=',stats.levene(male.extraversion,female.extraversion))

#Even though the male portion of the survey is independent of the female portion of the survey, and the levene pvalue is greater than 0.05, indicating identical variances, the data for each gender group is not normal, as indicated by the p-values being lower than an alpha value of 0.05. We will not be able to the two-sample t-test to analyze both gender's extraversion mean, therefore, Susan can't use conclusively move on with her hypothesis just yet! She will have to use another test to know whether to compare her mean male extraversion with her mean female extraversion.

'''If Susans survey passed all three assumptions for the two sample t-test on `male_survey` and `female_survey`, what syntax would she use to run the test?
Note: assume that ttest_ind is imported from statsmodels.stats.weightstats'''

from statsmodels.stats.weightstats import ttest_ind
print(ttest_ind(df1.agreeableness,df2.agreeableness))
print(stats.ttest_ind(df1.agreeableness,df2.agreeableness))

'''
a. male_survey.extraversion.ttest_ind(female_survey.extraversion) ##Mmmm, not quite.
b. ttest_ind(male_survey,female_survey) #Eesh, tricky one. We're not necessarily comparing between two different dataframes.
c. ttest_ind(male_survey.extraversion,female_survey.extraversion) #Wohooo! So if the pvalue was greater than an alpha value of 0.05, then this would imply that the average extraversion between male and female from the survey are the same, and Susan's personal hypothesis would be challenged.
'''

Congratulations! You've reached the final video of this course. Let's briefly review what you've learned.

In chapter one you learned of the different types of survey variables that are typically analyzed, how to interpret inferential and descriptive statistics in surveys, and visualization functions like .scatter() to aid with determining the most appropriate statistical modeling technique.

In chapter two you learned two ways to create a random sample from a population data survey, account for sampling error with either stratified random sampling, weighted sampling, or cluster sampling, and learned visualization functions such as .pie() and .barh() to help present your survey results.

In chapter three you learned the difference between descriptive and inferential statistics through some practical survey examples, and further interpreted the meaning of different variables, key measures such as central tendency and zscore, and interpreted results for actionable steps.

And finally in chapter four, you learned the relatedness of two numerical variables in surveus with a linear regression, employed the two-sample t-test to compare the significance of two survey averages, and the relatedness of two categorical variable via the chi-square test.

You have dedicated your time to learn some invaluable tools and methods and you should be proud. I hope that you will find these tools and methods invaluable in your own work and I wish you all the best in your learning journey.

#lesson 4.4
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

foot_traffic = pd.read_csv('black_friday_survey.csv').dropna()

# Define variable, x and y
x = foot_traffic.year.tolist()
y = foot_traffic.visitors.tolist()

# Add the constant term
x = sm.add_constant(x)

# Perform .OLS() regression and fit
result = sm.OLS(y,x).fit()

# Print the summary table
print(result.summary())

import numpy as np
# Plot the original values using a scatter plot
x = foot_traffic.year.tolist()
y = foot_traffic.visitors.tolist()
plt.scatter(x,y)
#plt.show()

# Get the range of data
max_x = max(x)
print(max_x)
min_x = min(x)
print(min_x)

# Get the range of values
a = np.arange(min_x,max_x,1)
print(a)
b = 5.9786 * a - 11840

# Plot the regression line
plt.plot(a,b,'r')
plt.show()

foot_traffic.head(20)
#type(foot_traffic.year.min())