Indicators of Education Attainment and Graduation in New York City School Districts

What can socioeconomic factors and academic progress indicators tell us about the graduation rate in NYC? How can we improve education outcomes?

Katharine Shao
Apr 12, 2021

Why is this important?

Education plays a crucial role in one’s life, as it provides the foundation for future success. Unfortunately, access to a quality education is not always equitable, especially in urban areas. In fact, New York City has one of the most segregated school systems in the United States. The city has 1,700 public schools across 32 school districts and a budget of around $25 billion, but resource allocation often favors wealthier zip codes, leaving many low-income areas underserved and undereducated.

The national graduation rate in 2020 was 88%, whereas NYC’s was 78.8%. Although the city’s rate has been rising steadily over the years, it is still nearly 10 percentage points below the national average. Environmental and socioeconomic factors also impact students’ ability to engage in their education and graduate. For example, lower-income areas tend to have less access to healthcare, which can strain a family’s finances, and students may have to spend additional time caring for sick relatives.

Thus, it’s important to identify indicators of education success, both inside and outside of the classroom.

Key Questions

  1. What is the relationship between the four-year graduation rate and median household income?
  2. What are the demographics in these school districts, and what can they reveal about educational disparities in the city?
  3. Are there early indicators in school performance that predict graduation rate?

Insights and Analysis

I analyzed New York City’s graduation data for Cohorts 2001–2011 (Classes of 2005–2015), broken down by school district (NYC has 32 districts across its five boroughs).

Map of New York City’s 32 School Districts

Graduation Rate vs Median Income of Families

Income levels often reflect numerous environmental factors that affect local school quality and engagement in education.

To start, I matched the school data, which is organized by school district, with zip codes, which are easier to compare with outside data sources. Since NYC school districts don’t align with census tracts or zip codes, I used a tool from Proximity One that assigns a single zip code to each school district. Because median income doesn’t typically differ significantly among zip codes in the same school district, one zip code should be reasonably representative of the district. I then used data from the Citizens’ Committee for Children of New York (CCC) to find the median household income for families with children in each corresponding zip code.
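The matching step described above can be sketched in R as two left joins. This is only an illustration under assumptions: the table names, column names, zip labels, and values below are hypothetical placeholders, not the article's actual data or code.

```r
# Hypothetical toy tables; all values are placeholders, not the article's data.
grad <- data.frame(
  district  = c(1, 2, 3),
  grad_rate = c(0.70, 0.80, 0.75)
)
crosswalk <- data.frame(             # one representative zip code per district
  district = c(1, 2, 3),
  zip      = c("zip_a", "zip_b", "zip_c")
)
income <- data.frame(                # median income for families with children
  zip           = c("zip_a", "zip_b", "zip_c", "zip_d"),
  median_income = c(40000, 150000, 140000, 25000)
)

# Left joins keep every district even if a lookup fails, so gaps show up as NA.
grad_zip    <- merge(grad, crosswalk, by = "district", all.x = TRUE)
grad_income <- merge(grad_zip, income, by = "zip", all.x = TRUE)

# Sanity check: the joins should neither add nor drop districts.
stopifnot(nrow(grad_income) == nrow(grad))
grad_income[order(grad_income$district), ]
```

Using `all.x = TRUE` is a design choice: a district whose zip code is missing from the income table stays in the result with `NA` income instead of silently disappearing.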

Figure 1: Graduation Rates by Median Household Income, by Graduation Cohort

In this scatterplot, each point represents an NYC school district, and each color represents a different graduation cohort (the cohort year is the school year in which students entered 9th grade). I ran a linear regression for each cohort year and plotted the fitted lines on the graph.

  • From the plot, we can see there is indeed a positive correlation between median income and four-year graduation rate.
  • Graduation rates rise with each successive cohort. This likely reflects the increased policy focus on education in NYC during this period, such as increasing accountability, raising standards, and opening more than 656 schools. In general, these new schools outperform existing ones.
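A per-cohort regression like the one behind Figure 1 could look roughly like the sketch below. The data here is simulated purely for illustration (the column names, cohort years chosen, and coefficients are assumptions), so the printed numbers say nothing about the real districts.

```r
# Synthetic long-format data: one row per (district, cohort). Not real data.
set.seed(42)
districts <- 1:32
cohorts   <- c(2001, 2006, 2011)
panel <- expand.grid(district = districts, cohort = cohorts)
income_by_district <- runif(length(districts), 0.2, 1.5)  # units of $100,000
panel$income    <- income_by_district[panel$district]
panel$grad_rate <- 0.45 + 0.15 * panel$income +
  0.01 * (panel$cohort - 2001) + rnorm(nrow(panel), sd = 0.05)

# Fit one simple regression per cohort and collect slope and R-squared.
fits <- lapply(split(panel, panel$cohort),
               function(d) lm(grad_rate ~ income, data = d))
summary_table <- data.frame(
  cohort    = as.integer(names(fits)),
  slope     = sapply(fits, function(f) coef(f)[["income"]]),
  r_squared = sapply(fits, function(f) summary(f)$r.squared)
)
summary_table
```

Collecting the slope and R-squared per cohort in one table makes it easier to check whether the income relationship is stable across years, rather than reading it off overlapping lines.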

Figure 2: Average Graduation Rate (Red) and Median Income of Families (Blue) by School District

Figure 2 shows the average graduation rate and the adjusted 2019 median income of families for each school district. The federal poverty threshold in 2019 for a family of two adults and two children was $25,926, which I also plotted on the figure.

  • Graduation rates are relatively consistent with median household income. Higher rates correspond to wealthy or upper-middle median incomes, as in Districts 2, 3, 13, 26, and 28.
  • Certain districts have disproportionately high incomes*, like Districts 2, 3, and 15.
  • However, the average graduation rate in those districts isn’t significantly higher than in other districts with much lower median incomes. In fact, certain districts near or below the federal poverty line, like Districts 4, 7, 9, and 23, have reasonably high graduation rates.
  • District 4 (East Harlem) is especially interesting because, despite its relatively low income, it has a higher average graduation rate than wealthy districts, like 2, 3, and 15.

Looking deeper, I discovered that District 4 is a “choice” district, which means students from other school zones can apply to and attend schools there. Districts 1, 7, and 23 are also choice districts. Thus, these districts could serve students who aren’t representative of the population actually living there. For instance, District 7’s median income is below the poverty threshold, but the district ranks 10th in the dataset by average graduation rate.

*Median incomes are expressed in units of $100,000 for easier comparison with graduation rate, which is in decimal form.
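As a small worked example of the scaling in the footnote, the poverty threshold of $25,926 becomes 0.25926 on the adjusted axis. The district incomes in this sketch are hypothetical placeholders, and `scale_income` is an illustrative helper name, not something from the original analysis.

```r
# Express dollar amounts in units of $100,000, as in the footnote above.
scale_income <- function(dollars) dollars / 100000

poverty_threshold <- 25926            # 2019 federal threshold cited in the article
scaled_threshold  <- scale_income(poverty_threshold)
scaled_threshold                      # 0.25926

# Hypothetical district incomes (placeholders, not the article's values).
district_income <- c(d_a = 24000, d_b = 60000, d_c = 150000)
below_threshold <- scale_income(district_income) < scaled_threshold
below_threshold
```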

Figure 3a: Graduation Rate by Race; Figure 3b: Racial Makeup by District

Figure 3a shows the relationship between graduation rate and the racial makeup of the student population in each district. Figure 3b shows the racial makeup of students in each district more directly, as a proportion of 100%. This figure only displays percentages for the four largest racial groups; each of the other racial groups represented less than 2% of a district’s population.

  • In Figure 3a, there are clear correlations between graduation rate and race. Districts with larger shares of Asian and White students tend to have higher graduation rates, whereas the opposite is true for districts with a high concentration of Black and Hispanic students.
  • Combined with the findings from Figure 2, we can see that areas with lower income, like Districts 4, 7, 12, and 23, are overwhelmingly composed of Black and Hispanic students. The converse is true for districts with a high concentration of Asian and White students.
  • Thus, higher income and a greater concentration of White and Asian students are both positively correlated with graduation rates.

Figure 3c: Linear Regression of Racial Makeup and Median Income by Graduation Rate

Call:
lm(formula = `Average Total Grads % of cohort` ~ `Pct White` +
`Pct Black` + `Pct Asian` + `Pct Hispanic`, data = demo.grad)

Residuals:
      Min        1Q    Median        3Q       Max
-0.132193 -0.050542 -0.002159  0.047749  0.159162

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)       2.353      1.404   1.677    0.105
`Pct White`      -1.582      1.467  -1.078    0.291
`Pct Black`      -1.916      1.440  -1.330    0.195
`Pct Asian`      -1.716      1.448  -1.185    0.246
`Pct Hispanic`   -1.863      1.415  -1.317    0.199

Residual standard error: 0.07966 on 27 degrees of freedom
Multiple R-squared: 0.4203, Adjusted R-squared: 0.3344
F-statistic: 4.894 on 4 and 27 DF, p-value: 0.004241

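In this multiple regression, none of the individual racial-makeup coefficients is statistically significant, even though the model as a whole is (F-test p-value = 0.004, R-squared = 0.42). One likely reason is that the four percentages sum to nearly 100% in every district, so the predictors are highly collinear and their standard errors are inflated. Based on the individual regressions in Figure 3a, there is still a clear correlation between graduation rate and race.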
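To illustrate why including all four group percentages can hide individual effects, here is a small simulation. Everything in it is synthetic and assumed (the shares, the effect sizes, the noise level); it only shows the general mechanism that predictors summing to roughly 100% produce inflated standard errors, and that dropping one group as a baseline shrinks them.

```r
# Simulated district shares that sum to 1; synthetic, not the article's data.
set.seed(1)
n <- 32
raw <- cbind(
  white    = runif(n, 0.05, 0.60),
  black    = runif(n, 0.05, 0.60),
  asian    = runif(n, 0.05, 0.60),
  hispanic = runif(n, 0.05, 0.60),
  other    = runif(n, 0.00, 0.02)      # small residual category
)
shares <- as.data.frame(raw / rowSums(raw))   # each row now sums to 1
shares$grad <- 0.60 + 0.20 * shares$white + 0.15 * shares$asian +
  rnorm(n, sd = 0.05)

# Full model uses all four large groups; reduced model drops one as baseline.
full    <- lm(grad ~ white + black + asian + hispanic, data = shares)
reduced <- lm(grad ~ white + black + asian, data = shares)

se <- function(fit) summary(fit)$coefficients[, "Std. Error"]
groups <- c("white", "black", "asian")
round(rbind(full = se(full)[groups], reduced = se(reduced)[groups]), 3)
```

In the reduced model each coefficient is read relative to the omitted group, which is a common way to handle compositional predictors.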

English Language Arts (ELA) exam outcomes and graduation rate

Learning fundamentals like reading comprehension is crucial for future academic success. New York conducts ELA assessments for students in grades 3 through 8. These tests can serve as an early indicator of academic progress and provide insight into academic preparedness before students enter high school.

To investigate whether exam performance correlates with graduation rate, I matched the graduation cohort data (2001–2011 data) with the ELA exam results from the corresponding years (2006–2012 data). For example, the class of 2015 (2011 graduation cohort) would have been in fourth grade in the school year starting in 2006.
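The year matching follows simple arithmetic: a cohort that enters 9th grade in year C was in grade g in year C - (9 - g), and graduates on time in year C + 4. The helper names below are hypothetical, and the sketch follows the article's own example rather than any official definition.

```r
# Cohort year = school year in which the class entered 9th grade.
ela_test_year <- function(cohort_year, grade) {
  stopifnot(all(grade >= 3 & grade <= 8))   # ELA is given in grades 3 through 8
  cohort_year - (9 - grade)
}

# Four-year graduation year for a cohort.
class_of <- function(cohort_year) cohort_year + 4

ela_test_year(2011, 4:8)   # 2006 2007 2008 2009 2010
class_of(2011)             # 2015
```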

Figure 4: ELA Proficiency Level by District, Cohort 2011

Figure 4a: ELA Proficiency Level by District, Cohort 2011, Grade 4.
Figure 4b and 4c: ELA Proficiency Level by District, Cohort 2011, Grade 5 and 6, respectively
Figure 4d and 4e: ELA Proficiency Level by District, Cohort 2011, Grade 7 and 8, respectively

Collectively, Figure 4 shows the ELA proficiency by district for the 2011 cohort in each grade from fourth through eighth, compared to both graduation rate and dropout rate.

  • From the graph of Grade 4 outcomes, we can see a relationship between the graduation rate and the percentage of students scoring at Level 3*, which means “Meeting Learning Standards.” This pattern holds for most districts throughout the grades.
  • Grade 7 saw the largest improvement in ELA proficiency and relatively low variation among districts, which indicates high average performance throughout the city.
  • Interestingly, we see a big drop in ELA proficiency in Grade 8 in the majority of districts, along with a large increase in the share of students at Level 2. In Grade 8, the gap in both Level 3 and Level 4 outcomes widens between districts with lower income, like Districts 4 and 7, and those with higher income, like Districts 2 and 3.
  • Since graduation rates are still correlated with Levels 3 and 4, the Grade 8 pattern indicates a key intervention point to improve education outcomes, especially as students move from middle school to high school.

*According to the dataset’s variable codebook, categorized exam results are as follows:

  • 1 = Not Meeting Learning Standards
  • 2 = Partially Meeting Learning Standards
  • 3 = Meeting Learning Standards
  • 4 = Meeting Learning Standards with Distinction
  • Level 3 and 4 = Levels 3 and 4 combined

Linear Regressions

I also wanted to know which grades’ assessments are most indicative of graduation rate and dropout rate. To investigate this, I ran linear regressions of both graduation rate and dropout rate on the percent of the cohort at each of the four proficiency levels, including the results from every available grade for that cohort. For example, for the Class of 2015 (Cohort 2011), the regression uses ELA scores from Grade 4 through Grade 8 (Grade 3 is excluded because 2005 data was unavailable).

Regression Key (variable suffixes)

  • x=Grade 4
  • y=Grade 5
  • x.x=Grade 6
  • y.y=Grade 7
  • Grade 8 has no suffixes
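The suffixes in the key above look like the ones R's `merge()` appends when the tables being joined share column names. As an alternative, one could tag each column with its grade before merging so the regression output is self-describing. This is a sketch with hypothetical placeholder tables and a made-up naming scheme, not the article's actual pipeline.

```r
# Hypothetical per-grade tables with identical column names (placeholder values).
make_grade <- function(offset) data.frame(
  district      = 1:3,
  `Pct Level 1` = c(10, 20, 30) + offset,
  check.names   = FALSE
)
by_grade <- list(g4 = make_grade(0), g5 = make_grade(1), g6 = make_grade(2))

# Tag every score column with its grade, then merge everything on district.
tagged <- Map(function(d, g) {
  score_cols <- names(d) != "district"
  names(d)[score_cols] <- paste0(names(d)[score_cols], " (", g, ")")
  d
}, by_grade, names(by_grade))
wide <- Reduce(function(a, b) merge(a, b, by = "district"), tagged)

names(wide)
```

With explicit grade tags, a separate key is no longer needed to read the coefficient table.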

Figure 5a: Graduation Rate Linear Regression:

Call:
lm(formula = `Grad Adjusted` ~ `Pct Level 1.x` + `Pct Level 1.y` +
`Pct Level 1.x.x` + `Pct Level 1.y.y` + `Pct Level 1` + `Pct Level 2.x` +
`Pct Level 2.y` + `Pct Level 2.x.x` + `Pct Level 2.y.y` +
`Pct Level 2` + `Pct Level 3.x` + `Pct Level 3.y` + `Pct Level 3.x.x` +
`Pct Level 3.y.y` + `Pct Level 3` + `Pct Level 4.x` + `Pct Level 4.y` +
`Pct Level 4.x.x` + `Pct Level 4.y.y` + `Pct Level 4` + `Pct Level 3 and 4.x` +
`Pct Level 3 and 4.y` + `Pct Level 3 and 4.x.x` + `Pct Level 3 and 4.y.y` +
`Pct Level 3 and 4`, data = all.testc11)

Residuals:
    Min      1Q  Median      3Q     Max
-9.5283 -2.1988 -0.4108  2.2718  7.8927

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)             187014.02  141306.40   1.323   0.2339
`Pct Level 1.x`           -977.14     488.02  -2.002   0.0921 .
`Pct Level 1.y`          -1073.23     505.33  -2.124   0.0779 .
`Pct Level 1.x.x`         -472.83     452.71  -1.044   0.3365
`Pct Level 1.y.y`          796.19     448.80   1.774   0.1264
`Pct Level 1`             -127.84     507.51  -0.252   0.8095
`Pct Level 2.x`           -970.47     487.20  -1.992   0.0935 .
`Pct Level 2.y`          -1069.29     503.05  -2.126   0.0777 .
`Pct Level 2.x.x`         -478.41     448.89  -1.066   0.3275
`Pct Level 2.y.y`          771.68     448.43   1.721   0.1361
`Pct Level 2`             -119.54     507.67  -0.235   0.8217
`Pct Level 3.x`           -995.60     479.65  -2.076   0.0832 .
`Pct Level 3.y`            158.85     314.50   0.505   0.6315
`Pct Level 3.x.x`          274.86     707.19   0.389   0.7109
`Pct Level 3.y.y`          294.26     341.75   0.861   0.4223
`Pct Level 3`             -497.79     261.60  -1.903   0.1058
`Pct Level 4.x`           -997.73     481.44  -2.072   0.0836 .
`Pct Level 4.y`            163.29     313.58   0.521   0.6212
`Pct Level 4.x.x`          277.85     705.72   0.394   0.7074
`Pct Level 4.y.y`          292.88     343.44   0.853   0.4265
`Pct Level 4`             -497.05     260.82  -1.906   0.1053
`Pct Level 3 and 4.x`       26.39     443.75   0.059   0.9545
`Pct Level 3 and 4.y`    -1232.51     621.35  -1.984   0.0946 .
`Pct Level 3 and 4.x.x`   -752.13    1016.74  -0.740   0.4874
`Pct Level 3 and 4.y.y`    476.60     486.83   0.979   0.3654
`Pct Level 3 and 4`        376.96     575.22   0.655   0.5366
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.352 on 6 degrees of freedom
Multiple R-squared: 0.8378, Adjusted R-squared: 0.1619
F-statistic: 1.24 on 25 and 6 DF, p-value: 0.4257

In the graduation-rate regression, no coefficient is statistically significant at the 5% level. However, some measurements, like the share of students scoring at Level 1 or 2 (Not or Partially Meeting Learning Standards) in Grade 4 or Grade 5, are weakly associated with graduation outcome (p < 0.1). In fact, every proficiency level in Grade 4, along with Levels 1 and 2 and the combined Level 3 and 4 share in Grade 5, appears to weakly predict graduation rate.
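One thing worth keeping in mind when reading this output is the ratio of predictors to observations: with 32 districts and 25 predictors plus an intercept, only 6 residual degrees of freedom remain. The sketch below recomputes the adjusted R-squared from the reported R-squared using the standard formula, which shows how much of the apparent fit disappears after the adjustment. The function and variable names are illustrative.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_districts  <- 32
n_predictors <- 25
residual_df  <- n_districts - n_predictors - 1   # 6, matching the output above

adj <- adjusted_r2(0.8378, n_districts, n_predictors)
round(adj, 3)   # 0.162, in line with the reported 0.1619
```

The same formula applied to the Figure 3c regression (R-squared 0.4203, 4 predictors) gives about 0.334, matching its reported adjusted value.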

Figure 5b: Dropout Rate Linear Regression:

Call:
lm(formula = `Dropped Adjusted` ~ `Pct Level 1.x` + `Pct Level 1.y` +
`Pct Level 1.x.x` + `Pct Level 1.y.y` + `Pct Level 1` + `Pct Level 2.x` +
`Pct Level 2.y` + `Pct Level 2.x.x` + `Pct Level 2.y.y` +
`Pct Level 2` + `Pct Level 3.x` + `Pct Level 3.y` + `Pct Level 3.x.x` +
`Pct Level 3.y.y` + `Pct Level 3` + `Pct Level 4.x` + `Pct Level 4.y` +
`Pct Level 4.x.x` + `Pct Level 4.y.y` + `Pct Level 4` + `Pct Level 3 and 4.x` +
`Pct Level 3 and 4.y` + `Pct Level 3 and 4.x.x` + `Pct Level 3 and 4.y.y` +
`Pct Level 3 and 4`, data = all.testc11)

Residuals:
    Min      1Q  Median      3Q     Max
-3.3906 -0.6556 -0.2364  0.7801  2.5394

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             -84604.964  45405.479  -1.863   0.1117
`Pct Level 1.x`            324.541    156.815   2.070   0.0839 .
`Pct Level 1.y`            172.302    162.377   1.061   0.3295
`Pct Level 1.x.x`          282.308    145.469   1.941   0.1003
`Pct Level 1.y.y`          -91.783    144.210  -0.636   0.5480
`Pct Level 1`              152.898    163.076   0.938   0.3846
`Pct Level 2.x`            322.590    156.549   2.061   0.0850 .
`Pct Level 2.y`            170.591    161.645   1.055   0.3319
`Pct Level 2.x.x`          282.992    144.241   1.962   0.0974 .
`Pct Level 2.y.y`          -81.698    144.091  -0.567   0.5913
`Pct Level 2`              150.551    163.126   0.923   0.3917
`Pct Level 3.x`            330.886    154.124   2.147   0.0754 .
`Pct Level 3.y`            -70.287    101.057  -0.696   0.5128
`Pct Level 3.x.x`         -176.474    227.240  -0.777   0.4669
`Pct Level 3.y.y`          -73.416    109.814  -0.669   0.5287
`Pct Level 3`              104.495     84.061   1.243   0.2602
`Pct Level 4.x`            332.235    154.698   2.148   0.0754 .
`Pct Level 4.y`            -72.937    100.763  -0.724   0.4964
`Pct Level 4.x.x`         -172.757    226.766  -0.762   0.4750
`Pct Level 4.y.y`          -73.019    110.356  -0.662   0.5328
`Pct Level 4`              103.553     83.809   1.236   0.2628
`Pct Level 3 and 4.x`       -8.655    142.590  -0.061   0.9536
`Pct Level 3 and 4.y`      242.359    199.656   1.214   0.2704
`Pct Level 3 and 4.x.x`    458.724    326.705   1.404   0.2099
`Pct Level 3 and 4.y.y`     -7.768    156.432  -0.050   0.9620
`Pct Level 3 and 4`         46.260    184.834   0.250   0.8107
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.005 on 6 degrees of freedom
Multiple R-squared: 0.8694, Adjusted R-squared: 0.3251
F-statistic: 1.597 on 25 and 6 DF, p-value: 0.2922

Similar to graduation rates, dropout rates are weakly predicted (p < 0.1) by every proficiency level in Grade 4 and by the Level 2 share in Grade 6.

Overall, these linear regressions indicate that performances in earlier grades are somewhat predictive of future academic success.

Thus, it’s crucial to concentrate resources in elementary schools and intervene early when scores are low. Of course, sufficient resources should continue into high school, but these early indicators serve as a key intervention point not only to address academic performance, but perhaps other environmental factors in the student’s life.

Conclusions

From our analysis, we’ve confirmed that there is a correlation between median income and four-year graduation rate. Although no individual racial-makeup variable was statistically significant in the multiple regression, racial makeup is still strongly correlated with graduation rate. Crucially, the ELA results suggest that outcomes at an early stage, like fourth and fifth grade, as well as right before high school, like eighth grade, are indicative of graduation outcomes. Thus, concentrating resources and programs at these time points is crucial for improving outcomes. Given the disparities in graduation outcomes across districts with different incomes and racial makeups, it’s also key to ensure more funding and socioeconomic help, not just academic help, in those areas.

Next Steps

  • This analysis scratches the surface of the many nuances in NYC’s public education system.
  • I would like to investigate more factors that affect home life, such as parental education level, history of abuse, health insurance coverage, access to childcare services (without which an older student may have to care for a younger sibling or find work to support their family), and food access (e.g., food deserts, SNAP eligibility, free meals at school). These factors will provide deeper insight into disadvantaged areas and further inform potential policy changes.
  • I also want to look at the relationship between these factors and students who are still enrolled after four years (meaning students who haven't dropped out and will likely graduate in 5–6 years). In particular, I want to understand whether these longer graduation timelines are a result of academic difficulties (and by extension quality of education), external environmental factors, or some combination of both.

Data Sources

  1. 2006–2012 English Language Arts (ELA) Test Results by School (from NYC Open Data)
  2. 2005–2015 Graduation Rates by District (from NYC Open Data)
  3. Citizens’ Committee for Children of New York (median household income for families with children)
  4. Proximity One (corresponding zip codes to NYC school districts; racial makeup)

Software Used

  1. R

Katharine Shao is a junior at The Wharton School. This data project was created for OIDD 245: Analytics and the Digital Economy.
