Richard Pallangyo















Introduction

This project addresses education inequality in U.S. high schools. The quality of a high school education can be measured in multiple ways, but here we will focus on average student performance on the ACT or SAT exams that students take as part of the college application process.

We expect a range of school performance on these exams, but is school performance predicted by socioeconomic factors?

This project investigates whether and how socioeconomic factors (such as household income, unemployment, adult educational attainment, and family structure) affect U.S. high school performance on the ACT or SAT exams. School average scores on these exams will be modeled against these socioeconomic factors. The primary goal is to explain school average ACT or SAT scores through exploratory and predictive analysis that gives insight into:

  • Whether socioeconomic factors can predict U.S. high school performance on the ACT or SAT exams
  • How the different socioeconomic factors affect U.S. high school performance on the ACT or SAT exams
  • Which socioeconomic factors best explain U.S. high school performance on the ACT or SAT exams
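
To make the modeling goal concrete, here is a minimal sketch of regressing a school average score on several socioeconomic predictors. It runs on synthetic data: the data frame `schools`, its column names, and the coefficients used to simulate the scores are all made up for illustration and are not taken from the EdGap data. A linear model is only one plausible choice; the project's actual models may differ.

```r
# Synthetic stand-in data: all names, scales, and effect sizes are assumptions.
set.seed(1)
n <- 200
schools <- data.frame(
  median_income = rnorm(n, mean = 50, sd = 15),  # hypothetical, in thousands of dollars
  unemployment  = runif(n, min = 0.02, max = 0.20),
  pct_college   = runif(n, min = 0.10, max = 0.80)
)

# Simulate an average score that depends on the predictors plus noise.
schools$avg_score <- 15 +
  0.05 * schools$median_income -
  10   * schools$unemployment +
  4    * schools$pct_college +
  rnorm(n, sd = 1)

# Fit a linear model of the school average score on the three predictors.
fit <- lm(avg_score ~ median_income + unemployment + pct_college, data = schools)

# R-squared speaks to "whether" the factors predict scores;
# the coefficient signs and sizes speak to "how".
r2 <- summary(fit)$r.squared
coefs <- coef(fit)
print(r2)
print(coefs)
```

The third question, which factors explain scores best, is a model selection problem; the leaps package loaded later in this document is intended for that step.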

Data Collection

This project utilizes two data sets. The primary data set is the EdGap data set from EdGap.org. This data set from 2016 includes information about average ACT or SAT scores for schools and several socioeconomic characteristics of the school district. The secondary data set is basic information about each school from the National Center for Education Statistics.

EdGap data

All socioeconomic data (household income, unemployment, adult educational attainment, and family structure) is from the Census Bureau's American Community Survey.

EdGap.org reports that the ACT and SAT score data come from each state's department of education or some other public data release. The nature of these other public data releases is not specified.

The quality of the census data and the department of education data can be assumed to be reasonably high.

EdGap.org does not indicate that it processed the data in any way. The data were assembled by the EdGap.org team, so human error is always possible. Because the data are public, we could consult the original sources to check their quality if any questions arose.

School information data

The school information data is from the National Center for Education Statistics. This data set consists of basic identifying information about schools, and its quality can be assumed to be reasonably high. As with the EdGap.org data, the school information data is public, so we could consult the original sources to check its quality if any questions arose.

Data ethics

Did the students provide consent for their test scores to be made public as a school average?

Are there sources of bias in the data?

Can we cause harm to students by publicizing the data?

We will discuss these issues in class.

Data Preparation

Load necessary packages

# tidyverse contains packages we will use for processing and plotting data
library(tidyverse)
# readxl lets us read Excel files
library(readxl)
# GGally has a nice pairs plotting function
library(GGally)
# skimr provides a nice summary of a data set
library(skimr)
# leaps will be used for model selection
library(leaps)
# kableExtra will be used to make tables in the HTML document
library(kableExtra)
# latex2exp lets us use LaTeX in ggplot
library(latex2exp)
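
If any of these `library()` calls fails, the package is probably not installed. The sketch below is an optional helper, not part of the original analysis. It lists which of the packages above are missing without installing anything; the variable names `pkgs` and `not_installed` are illustrative.

```r
# Packages used in this document.
pkgs <- c("tidyverse", "readxl", "GGally", "skimr",
          "leaps", "kableExtra", "latex2exp")

# requireNamespace() returns TRUE when a package can be loaded,
# without attaching it to the search path.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!is_installed]

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
  message("Install them with install.packages() before continuing.")
} else {
  message("All required packages are installed.")
}
```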

Load the data

Load the EdGap set

Load the data set contained in the file EdGap_data.xlsx and name the data frame edgap.

edgap <- read_excel("EdGap_data.xlsx")

Explore the contents of the data set

Look at the first few rows of the data frame.
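
A minimal sketch of this step uses `head()`, which returns the first six rows by default. So that the snippet runs without `EdGap_data.xlsx`, it uses a small stand-in data frame, `edgap_demo`, whose column names and values are made up for illustration and do not reflect the real file. With the real data, the equivalent call would be `head(edgap)`.

```r
# Stand-in for `edgap`; column names and values are hypothetical.
edgap_demo <- data.frame(
  school_id     = 1:8,
  avg_score     = c(20.1, 22.4, 18.9, 24.0, 19.5, 21.2, 23.3, 17.8),
  median_income = c(42, 61, 38, 75, 40, 55, 68, 35)
)

# First six rows (the default for head()).
first_rows <- head(edgap_demo)
print(first_rows)

# Ask for a different number of rows with the n argument.
first_three <- head(edgap_demo, n = 3)
print(first_three)
```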



