chapter 4
#Lesson 4.2: two-sample t-test
import pandas as pd
import scipy.stats as stats
import numpy as np
    
    per=pd.read_csv('2018-personality-data.csv')
    print(per.size)
per=per[['userid',' agreeableness']] #note the leading space in the raw column name
per.columns=['userid','agreeableness']
    per.head()
    
#replace the real scores with synthetic, normally distributed ones
syn=np.random.normal(4, 1.0, 1834)
print(len(per.userid))
new=pd.DataFrame({'userid':per['userid'].to_list(),'agreeableness':list(syn)})
per=new
per['agreeableness']=per.agreeableness.round(2)
    per
#split the survey in half: df1 is a random 50% sample,
#df2 is the remaining rows (anti-join via the merge indicator)
df1=per.sample(frac=.5,random_state=1)
df2=per.merge(df1.drop_duplicates(), on=per.columns.to_list(),how='left',indicator=True)
df2=df2[df2._merge=='left_only'].iloc[:,:-1]
    print('df1',df1.head(7))
    print('df2',df2.head(7))
#calculate means of the agreeableness column
    df1_mean=df1.agreeableness.mean()
    df2_mean=df2.agreeableness.mean()
    print('df1mean=',df1_mean)
    print('df2mean=',df2_mean)
#check normality of data using the Shapiro-Wilk test
df1_norm=stats.shapiro(df1.agreeableness)
df2_norm=stats.shapiro(df2.agreeableness)
print(df1_norm) #p-value>.05; fail to reject normality
print(df2_norm) #p-value>.05; fail to reject normality
    
#check the equal-variance assumption with Levene's test
print(stats.levene(df1.agreeableness,df2.agreeableness))
    #calculate ttest
    from statsmodels.stats.weightstats import ttest_ind
    ttest_ind(df1.agreeableness,df2.agreeableness)
    
#The p-value of the test comes out to .413, which is greater than the significance level alpha of .05. This implies that the average agreeableness of people surveyed by one researcher is not different from the average agreeableness of people surveyed by the other researcher.
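The decision rule used above can be sketched in isolation. This is a minimal illustration on synthetic data (not the survey itself): compare the p-value to alpha, and fail to reject the null hypothesis when p >= alpha.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# two synthetic samples drawn from the same distribution,
# so the null hypothesis of equal means is actually true
g1 = rng.normal(4, 1.0, 200)
g2 = rng.normal(4, 1.0, 200)

t_stat, p_val = stats.ttest_ind(g1, g2)
alpha = 0.05
# p >= alpha: no evidence that the two means differ
print('p =', round(p_val, 3), '| means differ?', p_val < alpha)
```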
    '''
Susan is convinced that women are more extraverted than men, simply because her best friend Betsy is a party animal, while her other best friend Oliver would rather lock himself in his room reading books. To support her hypothesis, she posts a survey online, asking all her Facebook friends to score themselves on how extraverted they think they are: 1 indicates completely introverted, while 8 indicates completely extraverted.
She wants to compare the average male score from `male_survey` to the average female score from `female_survey` using the two-sample t-test, but must first check the three assumptions associated with the test.

`Pandas` and `statsmodels` have been loaded for you as `pd` and `stats`, respectively.
    '''
    
    per=pd.read_csv('psyc.csv')
    print(per.size)
    per=per[['gender','extraversion']]
    per.head()
    
    #gender
    male=per[per.gender=='Male']
    female=per[per.gender=='Female']
    female.size
    '''
    Calculate the mean extraversion of male_survey and female_survey.
    '''
    
    print(male.extraversion.mean())
    print(female.extraversion.mean())
    '''
    Test for normality of extraversion from male_survey and female_survey.
    
Use the Shapiro-Wilk test here
    '''
    #test for normality assumption
    print('male_norm=',stats.shapiro(male.extraversion))
    print('female_norm=',stats.shapiro(female.extraversion))
    '''
    Test to see if the variance of the extraversion from male_survey and female_survey is identical.
    
Use the Levene test here
'''
#test for the equal-variance assumption
    print('variance_test=',stats.levene(male.extraversion,female.extraversion))

#Even though the male portion of the survey is independent of the female portion, and the Levene p-value is greater than 0.05, indicating identical variances, the data for each gender group is not normal, as indicated by Shapiro-Wilk p-values below an alpha of 0.05. We will not be able to use the two-sample t-test to compare the two groups' mean extraversion, so Susan can't conclusively move forward with her hypothesis just yet! She will have to use another test to compare her mean male extraversion with her mean female extraversion.
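One candidate for that "other test" is the nonparametric Mann-Whitney U test, which compares two groups without assuming normality. A minimal sketch with made-up, skewed scores standing in for the real survey columns:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# skewed, non-normal stand-ins for the male/female extraversion scores
male_scores = rng.exponential(scale=2.0, size=80)
female_scores = rng.exponential(scale=2.0, size=80)

# tests whether one group tends to have larger values than the other
u_stat, p_value = mannwhitneyu(male_scores, female_scores, alternative='two-sided')
print('U =', u_stat, 'p =', round(p_value, 3))
```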

'''If Susan's survey passed all three assumptions for the two-sample t-test on `male_survey` and `female_survey`, what syntax would she use to run the test?
    Note: assume that ttest_ind is imported from statsmodels.stats.weightstats'''
    
#both statsmodels' and scipy's ttest_ind run a two-sample t-test (demonstrated here on the earlier df1/df2 data)
from statsmodels.stats.weightstats import ttest_ind
print(ttest_ind(df1.agreeableness,df2.agreeableness))
print(stats.ttest_ind(df1.agreeableness,df2.agreeableness))
    
    '''
    a. male_survey.extraversion.ttest_ind(female_survey.extraversion) ##Mmmm, not quite.
    b. ttest_ind(male_survey,female_survey) #Eesh, tricky one. We're not necessarily comparing between two different dataframes.
c. ttest_ind(male_survey.extraversion,female_survey.extraversion) #Woohoo! So if the p-value were greater than an alpha value of 0.05, this would imply that the average extraversion of males and females in the survey is the same, and Susan's personal hypothesis would be challenged.
    '''

    Congratulations! You've reached the final video of this course. Let's briefly review what you've learned.

    In chapter one you learned of the different types of survey variables that are typically analyzed, how to interpret inferential and descriptive statistics in surveys, and visualization functions like .scatter() to aid with determining the most appropriate statistical modeling technique.

In chapter two you learned two ways to create a random sample from a population data survey, how to account for sampling error with stratified random sampling, weighted sampling, or cluster sampling, and used visualization functions such as .pie() and .barh() to help present your survey results.
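As a quick refresher on one of those methods, here is a sketch of stratified random sampling with pandas. The data and column names are toy stand-ins, not from the course datasets:

```python
import pandas as pd

# toy population with two strata of different sizes
pop = pd.DataFrame({
    'region': ['North'] * 60 + ['South'] * 40,
    'score': range(100),
})

# draw 10% within each region, so every stratum stays
# represented in proportion to its size in the population
strat = pop.groupby('region', group_keys=False).sample(frac=0.1, random_state=1)
print(strat['region'].value_counts())
```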

In chapter three you learned the difference between descriptive and inferential statistics through practical survey examples, further interpreted the meaning of different variables and key measures such as central tendency and the z-score, and turned results into actionable steps.

And finally, in chapter four, you modeled the relatedness of two numerical survey variables with a linear regression, employed the two-sample t-test to compare the significance of two survey averages, and assessed the relatedness of two categorical variables via the chi-square test.
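Since the chi-square test is only mentioned in passing in this chunk, here is a minimal sketch of how it is typically run with scipy, on a made-up contingency table of two categorical survey variables:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# made-up responses for two categorical survey variables
df = pd.DataFrame({
    'recommend': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No'],
    'gender':    ['M',   'F',   'M',  'F',  'F',   'M',  'M',   'F'],
})
# cross-tabulate the two variables into a contingency table
table = pd.crosstab(df['recommend'], df['gender'])

# chi-square test of independence between the two variables
chi2, p, dof, expected = chi2_contingency(table)
print('chi2 =', round(chi2, 3), 'p =', round(p, 3), 'dof =', dof)
```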

You have dedicated your time to learning some invaluable tools and methods, and you should be proud. I hope you will find them useful in your own work, and I wish you all the best in your learning journey.

    #lesson 4.4
    import statsmodels.api as sm
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf
    
    foot_traffic = pd.read_csv('black_friday_survey.csv').dropna()
    
# Define variables x and y
    x = foot_traffic.year.tolist()
    y = foot_traffic.visitors.tolist()
    
    # Add the constant term
    x = sm.add_constant(x)
    
    # Perform .OLS() regression and fit
    result = sm.OLS(y,x).fit()
    
    # Print the summary table
    print(result.summary())
    import numpy as np
    # Plot the original values using a scatter plot
    x = foot_traffic.year.tolist()
    y = foot_traffic.visitors.tolist()
    plt.scatter(x,y)
    #plt.show()
    
    # Get the range of data
    max_x = max(x)
    print(max_x)
    min_x = min(x)
    print(min_x)
    
# Build x-values spanning the data (np.arange excludes the stop value, so add 1)
a = np.arange(min_x, max_x + 1, 1)
print(a)
# Predicted y-values from the fitted coefficients (params[0] is the constant),
# instead of hard-coding numbers copied from the summary table
b = result.params[0] + result.params[1] * a
    
    # Plot the regression line
    plt.plot(a,b,'r')
    plt.show()
    foot_traffic.head(20)
    #type(foot_traffic.year.min())
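Note that statsmodels.formula.api was imported above as smf but never used; the same regression can be fit with the formula interface. A sketch on a tiny synthetic frame (the real code would pass foot_traffic with its year and visitors columns instead):

```python
import pandas as pd
import statsmodels.formula.api as smf

# tiny synthetic stand-in for black_friday_survey.csv
foot = pd.DataFrame({'year': [2014, 2015, 2016, 2017, 2018],
                     'visitors': [100, 106, 112, 118, 124]})

# 'visitors ~ year' mirrors sm.OLS(y, sm.add_constant(x)) above,
# with the intercept added automatically by the formula
model = smf.ols('visitors ~ year', data=foot).fit()
print(model.params)
```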