Skip to contents

How have populations changed in Ohio’s four largest school districts?

To answer this with hercacstables, you should break it down into four component questions.

Who?

What population or populations are you trying to understand? This will help us identify the groups and variables that we will specifically ask for.

All of those groups and values can be accessed offline with hercacstables::METADATA_FOR_ACS_GROUPS and hercacstables::METADATA_FOR_ACS_VARIABLES.

We want to know about populations in school districts. It makes sense then to only look at school-aged children, say 5-18. Our requests should only ask for data about kids in that age range. We can do this with the built-in table hercacstables::GLOSSARY_OF_AGE_AND_SEX.

school_children_glossary <- hercacstables::GLOSSARY_OF_AGE_AND_SEX |>
    dplyr::filter(
        .data$`Lower Age` >= 5,
        .data$`Upper Age` <= 18
    )

knitr::kable(school_children_glossary)
group index variable Sex Lower Age Upper Age
B01001 4 B01001_004E Male 5 9
B01001 5 B01001_005E Male 10 14
B01001 6 B01001_006E Male 15 17
B01001 28 B01001_028E Female 5 9
B01001 29 B01001_029E Female 10 14
B01001 30 B01001_030E Female 15 17

From this table we can see that we will need to include six variables in our API query.

SCHOOL_CHILDREN_VARIABLES <- school_children_glossary$variable
print(SCHOOL_CHILDREN_VARIABLES)
#> [1] "B01001_004E" "B01001_005E" "B01001_006E" "B01001_028E" "B01001_029E"
#> [6] "B01001_030E"

Where?

What kind of place, and specific location, should the data describe? The Census API can provide information at many levels of geographical detail. Some questions might pertain to entire states. Others might involve comparisons among all of the tracts in a metropolitan area. You can find details about the different geographic levels in hercacstables::METADATA_FOR_ACS_GEOGRAPHIES. If you know the real-world name of a place, but not the numeric code that the Census uses for it, then you can look it up here.

For the question that we’re dealing with, we want to look at the geographic area served by a school district. There are actually three different kinds of school district geographic levels.

school_district_geographies <- hercacstables::METADATA_FOR_ACS_GEOGRAPHIES |>
    dplyr::filter(stringr::str_detect(.data$`Geographic Level`, "school"))

school_district_geographies |>
    knitr::kable()
Geographic Level Containing Geographies Wildcard Option ACS1 ACS5
school district (elementary) {state} {state} 2023-01-01 2023-01-01
school district (secondary) {state} {state} 2023-01-01 2023-01-01
school district (unified) {state} {state} 2023-01-01 2023-01-01

It looks like the specific geographic level that we should ask about is “school districts (unified)”.

The “Containing Geographies” column lets us know that we also have to tell the API which state we are interested in. The “Wildcard Option” column tells us that we can use “*” to say “all of the states.”

We just want Ohio, so we must supply its FIPS code. If you don’t know it off the top of your head, you can use a page on the Census’s website to search for the FIPS code of a specific geography.

Ohio’s FIPS code is 39.

UNIFIED_SCHOOL_DISTRICT <- school_district_geographies$`Geographic Level`[3]
print(UNIFIED_SCHOOL_DISTRICT)
#> [1] "school district (unified)"
STATE_FIPS <- 39

When?

A request to the Census API must include year its path. That means that it takes multiple API calls to pull data for multiple years. Each of these calls will be identical to the others, except for the year in the path. This repetitiveness might just be the single biggest motivator for developing hercacstables. It’s very easy in R to iterate over an array of years, making one function call for each year, and then bundling all of the results into one tidy data frame.

The other aspect of the “when” question is whether or not to use 1-year or 5-year estimates. The ACS reports both. Five-year estimates include data from the preceding four years. This increases sample sizes enough to provide information for smaller geographic levels, like tracts and block groups. One-year estimates are more responsive to rapid changes (like unemployment during the COVID-19 pandemic, for example), but are only available for larger geographic levels.

Unified school districts are large enough for the 1-year ACS to include them. For our example, we will use every year of data available. That is 2005 to 2022, except for 2020, which does not have 1-year ACS data.

YEARS_FOR_ACS <- c(2005:2019, 2021:2023)

What? Describe the steps to answer our question

Thus far, we have identified what we need to deal with the “populations” (who), “school districts” and “in Ohio” (where), and “changed” (when) aspects of our question. The final aspect is the “four largest” parts. It will take three steps to find the four largest school districts in Ohio.

The first step is to use the [hercacstables::fetch_data()] function to pull data from the Census API. We need the number of children that live in each school district in Ohio in the most recent year available.

fetch_district_children <- function(.state_fips, .year, .district_geoids = "*") {
    hercacstables::fetch_data(
        variables = c("NAME", SCHOOL_CHILDREN_VARIABLES), # district names, too
        year = .year,                                     # most recent year
        survey_type = "acs",                              # American Community Survey
        table_or_survey_code = "acs1",                    # One-year estimates
        for_geo = UNIFIED_SCHOOL_DISTRICT,                # "school district (unified)"
        for_items = .district_geoids,                     # every school district
        state = STATE_FIPS                                # our state's two-digit code
    )
}

latest_school_districts_raw <- fetch_district_children(STATE_FIPS, 2023)

latest_school_districts_raw |>
    dplyr::slice_sample(n = 8) |>
    knitr::kable()
NAME state school district (unified) Group Index Value Year
Toledo City School District, Ohio 39 04490 B01001 6 3872 2023
Columbus City School District, Ohio 39 04380 B01001 29 14856 2023
West Clermont Local School District, Ohio 39 04635 B01001 28 2916 2023
Fairfield City School District, Ohio 39 04610 B01001 6 1314 2023
Toledo City School District, Ohio 39 04490 B01001 29 6601 2023
Parma City School District, Ohio 39 04463 B01001 4 2072 2023
Lakota Local School District (Butler County), Ohio 39 04611 B01001 5 3350 2023
West Clermont Local School District, Ohio 39 04635 B01001 4 2016 2023

The next step is to sum up the populations for each district and then sort them from biggest to smallest.

wrangle_latest_kids <- function(.raw_children){
    .raw_children |>
        dplyr::count(
            .data$NAME,
            .data[[UNIFIED_SCHOOL_DISTRICT]],
            name = "Children",
            wt = .data$Value
        ) |>
        dplyr::arrange(dplyr::desc(.data$Children))
}

latest_school_districts <- wrangle_latest_kids(latest_school_districts_raw)

latest_school_districts |>
    dplyr::slice(1:8) |>
    dplyr::mutate(Children = scales::label_comma(accuracy = 1)(.data$Children)) |>
    knitr::kable(align = "lrr")
NAME school district (unified) Children
Columbus City School District, Ohio 04380 79,141
Cleveland Municipal School District, Ohio 04378 53,773
Cincinnati City School District, Ohio 04375 48,798
Toledo City School District, Ohio 04490 31,622
Akron City School District, Ohio 04348 31,297
South-Western City School District, Ohio 04480 24,636
Olentangy Local School District, Ohio 04676 22,083
Dayton City School District, Ohio 04384 21,972

The last step is to pull the GEOIDs for each of the four biggest districts.

top_district_geoids <- function(.districts, n){
    .districts[[UNIFIED_SCHOOL_DISTRICT]][1:n]
}

districts_for_query <- top_district_geoids(latest_school_districts, 4)

print(districts_for_query)
#> [1] "04380" "04378" "04375" "04490"

Pulling it all together

Now we have all of the components that we need to pull the population histories for the four biggest school districts in Ohio.

First, we fetch the data. Notice that we can reuse the fetching function that we defined above!

fetch_district_histories <- function(.state_fips, .district_geoids = "*") {
    YEARS_FOR_ACS |>
        purrr::map(
            \(.year) fetch_district_children(.state_fips,
                                             .year, 
                                             .district_geoids)
        ) |>
        purrr::list_rbind()
} 

big_district_histories_raw <- fetch_district_histories(STATE_FIPS,
                                                       districts_for_query)

Next, we sum up the number of children and create human-readable names for each district.

wrangle_histories <- function(.raw_district_histories, .latest_districts){
    .raw_district_histories |>
    dplyr::count(
        .data[[UNIFIED_SCHOOL_DISTRICT]],
        .data$Year,
        name = "Children",
        wt = .data$Value
    ) |>
    dplyr::inner_join(
        dplyr::select(
            .latest_districts,
            tidyselect::all_of(c("NAME", UNIFIED_SCHOOL_DISTRICT))
            ),
        by = UNIFIED_SCHOOL_DISTRICT
    ) |>
    dplyr::mutate(
        District = stringr::str_extract(.data$NAME,
                                        "^\\S*")
    ) |>
    dplyr::select(
        "District",
        "Year",
        "Children"
    )
}

big_district_histories <- wrangle_histories(big_district_histories_raw,
                                            latest_school_districts) 

Finally, we can plot the results.

plot_district_histories <- function(.district_histories){
    .district_histories |>
        ggplot2::ggplot(
            ggplot2::aes(
                x = .data$Year,
                y = .data$Children,
                color = .data$District
            )
        ) +
        ggplot2::geom_line(
            linewidth = 2
        ) +
        ggplot2::geom_point(
            size = 5
        ) +
        ggplot2::scale_x_continuous(
            name = NULL,
            labels = scales::label_number(big.mark = ""),
            breaks = scales::breaks_width(5),
            minor_breaks = scales::breaks_width(1)
        ) +
        ggplot2::scale_y_continuous(
            name = "School-aged children in school district's area",
            labels = scales::label_comma(accuracy = 1,
                                         scale = 0.001,
                                         suffix = "K"),
            limits = c(0, NA)
        ) +
        ggplot2::scale_color_discrete(
            guide = ggplot2::guide_legend(title = NULL,
                                          position = "top")
        ) +
        ggplot2::theme_minimal()
}

plot_district_histories(big_district_histories)

It looks like maybe a lot of people are moving from Cleveland to Columbus, and taking their kids with them.

Everything is reusable

Now that you’ve seen the example with Ohio, you may be thinking “that question is not relevant to me.” However! You could rerun all of the code, changing just the STATE_FIPS to your state’s FIPS code, and get the same analysis for a different state.

Let’s look at changes in population for the five largest school districts in Delaware.

STATE_FIPS <- 10

latest_de_districts <- STATE_FIPS |>
    fetch_district_children(2023) |>
    wrangle_latest_kids()

de_district_geoids <- latest_de_districts |>
    top_district_geoids(5)

big_de_district_histories <- STATE_FIPS |>
    fetch_district_histories(de_district_geoids) |>
    wrangle_histories(latest_de_districts)

plot_district_histories(big_de_district_histories)

You could do the same thing with any other state. That’s the goal of this project: help define workflows for making reports from Census data so that they can be easily rerun for comparable places and times.