Population changes in school districts
How have populations changed in Ohio’s four largest school districts? To answer this with hercacstables, you should break it down into four component questions.
Who?
What population or populations are you trying to understand? This will help us identify the specific groups and variables that we will ask for.

All of those groups and variables can be accessed offline with hercacstables::METADATA_FOR_ACS_GROUPS and hercacstables::METADATA_FOR_ACS_VARIABLES.
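For example, a quick way to browse those metadata tables is to filter them for a keyword. The snippet below is only a sketch: it coerces every column to text rather than assuming particular column names, so inspect the tables with names() or str() before relying on any one column.

# Sketch: find groups whose metadata mentions "sex by age".
# Searching every column as text avoids assuming specific column names.
hercacstables::METADATA_FOR_ACS_GROUPS |>
    dplyr::filter(
        dplyr::if_any(
            dplyr::everything(),
            \(.col) stringr::str_detect(
                as.character(.col),
                stringr::regex("sex by age", ignore_case = TRUE)
            )
        )
    )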
We want to know about populations in school districts. It makes sense, then, to look only at school-aged children, say ages 5 to 18. Our requests should only ask for data about kids in that age range. We can do this with the built-in table hercacstables::GLOSSARY_OF_AGE_AND_SEX.
school_children_glossary <- hercacstables::GLOSSARY_OF_AGE_AND_SEX |>
    dplyr::filter(
        .data$`Lower Age` >= 5,
        .data$`Upper Age` <= 18
    )

knitr::kable(school_children_glossary)
group | index | variable | Sex | Lower Age | Upper Age |
---|---|---|---|---|---|
B01001 | 4 | B01001_004E | Male | 5 | 9 |
B01001 | 5 | B01001_005E | Male | 10 | 14 |
B01001 | 6 | B01001_006E | Male | 15 | 17 |
B01001 | 28 | B01001_028E | Female | 5 | 9 |
B01001 | 29 | B01001_029E | Female | 10 | 14 |
B01001 | 30 | B01001_030E | Female | 15 | 17 |
From this table we can see that we will need to include six variables in our API query.
SCHOOL_CHILDREN_VARIABLES <- school_children_glossary$variable
print(SCHOOL_CHILDREN_VARIABLES)
#> [1] "B01001_004E" "B01001_005E" "B01001_006E" "B01001_028E" "B01001_029E"
#> [6] "B01001_030E"
Where?
What kind of place, and specific location, should the data describe?

The Census API can provide information at many levels of geographical detail. Some questions might pertain to entire states. Others might involve comparisons among all of the tracts in a metropolitan area. You can find details about the different geographic levels in hercacstables::METADATA_FOR_ACS_GEOGRAPHIES. If you know the real-world name of a place, but not the numeric code that the Census uses for it, then you can look it up, for example on the Census Bureau’s website.
For the question that we’re dealing with, we want to look at the geographic area served by a school district. There are actually three different kinds of school district geographic levels.
school_district_geographies <- hercacstables::METADATA_FOR_ACS_GEOGRAPHIES |>
    dplyr::filter(stringr::str_detect(.data$`Geographic Level`, "school"))

school_district_geographies |>
    knitr::kable()
Geographic Level | Containing Geographies | Wildcard Option | ACS1 | ACS5 |
---|---|---|---|---|
school district (elementary) | {state} | {state} | 2023-01-01 | 2023-01-01 |
school district (secondary) | {state} | {state} | 2023-01-01 | 2023-01-01 |
school district (unified) | {state} | {state} | 2023-01-01 | 2023-01-01 |
It looks like the specific geographic level that we should ask about is “school district (unified)”.
The “Containing Geographies” column lets us know that we also have to tell the API which state we are interested in. The “Wildcard Option” column tells us that we can use “*” to say “all of the states.”
We just want Ohio, so we must supply its FIPS code. If you don’t know it off the top of your head, you can use a page on the Census’s website to search for the FIPS code of a specific geography.
Ohio’s FIPS code is 39.
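If you would rather stay inside R, one option is to pull every state’s name from the API and filter for the one you want. This is a hedged sketch: it assumes that fetch_data() accepts the “state” geographic level without a containing geography, and the object name is just illustrative.

# Sketch: look up a state's FIPS code by fetching all state names from the API.
state_names <- hercacstables::fetch_data(
    variables = "NAME",            # just the human-readable name
    year = 2023,
    survey_type = "acs",
    table_or_survey_code = "acs1",
    for_geo = "state",             # one row per state
    for_items = "*"                # the wildcard: all states
)

dplyr::filter(state_names, stringr::str_detect(.data$NAME, "Ohio"))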
UNIFIED_SCHOOL_DISTRICT <- school_district_geographies$`Geographic Level`[3]
print(UNIFIED_SCHOOL_DISTRICT)
#> [1] "school district (unified)"
STATE_FIPS <- 39
When?
A request to the Census API must include the year in its path. That means that it takes multiple API calls to pull data for multiple years. Each of these calls will be identical to the others, except for the year in the path. This repetitiveness might just be the single biggest motivator for developing hercacstables. It’s very easy in R to iterate over an array of years, making one function call for each year, and then bundling all of the results into one tidy data frame.
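In outline, that pattern looks something like the sketch below. Here the .fetch_one_year argument stands in for whatever single-year request you need; it is a placeholder, not a hercacstables function. The concrete version for our question is fetch_district_histories(), defined further down.

# A minimal sketch of the one-call-per-year pattern.
fetch_many_years <- function(.years, .fetch_one_year) {
    .years |>
        purrr::map(.fetch_one_year) |> # one API call per year
        purrr::list_rbind()            # stack the per-year results into one data frame
}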
The other aspect of the “when” question is whether to use 1-year or 5-year estimates. The ACS reports both. Five-year estimates pool data from the reporting year and the preceding four years. This increases sample sizes enough to provide information for smaller geographic levels, like tracts and block groups. One-year estimates are more responsive to rapid changes (such as unemployment during the COVID-19 pandemic), but are only available for larger geographic levels.
Unified school districts are large enough for the 1-year ACS to include them. For our example, we will use every year of data available. That is 2005 to 2023, except for 2020, which does not have standard 1-year ACS estimates.
YEARS_FOR_ACS <- c(2005:2019, 2021:2023)
What? Describe the steps to answer our question
Thus far, we have identified what we need to handle the “populations” (who), “school districts in Ohio” (where), and “changed” (when) aspects of our question. The final aspect is the “four largest” part. It will take three steps to find the four largest school districts in Ohio.
The first step is to use the hercacstables::fetch_data() function to pull data from the Census API. We need the number of children that live in each school district in Ohio in the most recent year available.
fetch_district_children <- function(.state_fips, .year, .district_geoids = "*") {
    hercacstables::fetch_data(
        variables = c("NAME", SCHOOL_CHILDREN_VARIABLES), # district names, too
        year = .year,                      # the year of data to fetch
        survey_type = "acs",               # American Community Survey
        table_or_survey_code = "acs1",     # one-year estimates
        for_geo = UNIFIED_SCHOOL_DISTRICT, # "school district (unified)"
        for_items = .district_geoids,      # specific districts, or "*" for all of them
        state = .state_fips                # the state's two-digit FIPS code
    )
}
latest_school_districts_raw <- fetch_district_children(STATE_FIPS, 2023)

latest_school_districts_raw |>
    dplyr::slice_sample(n = 8) |>
    knitr::kable()
NAME | state | school district (unified) | Group | Index | Value | Year |
---|---|---|---|---|---|---|
Toledo City School District, Ohio | 39 | 04490 | B01001 | 6 | 3872 | 2023 |
Columbus City School District, Ohio | 39 | 04380 | B01001 | 29 | 14856 | 2023 |
West Clermont Local School District, Ohio | 39 | 04635 | B01001 | 28 | 2916 | 2023 |
Fairfield City School District, Ohio | 39 | 04610 | B01001 | 6 | 1314 | 2023 |
Toledo City School District, Ohio | 39 | 04490 | B01001 | 29 | 6601 | 2023 |
Parma City School District, Ohio | 39 | 04463 | B01001 | 4 | 2072 | 2023 |
Lakota Local School District (Butler County), Ohio | 39 | 04611 | B01001 | 5 | 3350 | 2023 |
West Clermont Local School District, Ohio | 39 | 04635 | B01001 | 4 | 2016 | 2023 |
The next step is to sum up the populations for each district and then sort them from biggest to smallest.
wrangle_latest_kids <- function(.raw_children) {
    .raw_children |>
        dplyr::count(
            .data$NAME,
            .data[[UNIFIED_SCHOOL_DISTRICT]],
            name = "Children",
            wt = .data$Value
        ) |>
        dplyr::arrange(dplyr::desc(.data$Children))
}
latest_school_districts <- wrangle_latest_kids(latest_school_districts_raw)

latest_school_districts |>
    dplyr::slice(1:8) |>
    dplyr::mutate(Children = scales::label_comma(accuracy = 1)(.data$Children)) |>
    knitr::kable(align = "lrr")
NAME | school district (unified) | Children |
---|---|---|
Columbus City School District, Ohio | 04380 | 79,141 |
Cleveland Municipal School District, Ohio | 04378 | 53,773 |
Cincinnati City School District, Ohio | 04375 | 48,798 |
Toledo City School District, Ohio | 04490 | 31,622 |
Akron City School District, Ohio | 04348 | 31,297 |
South-Western City School District, Ohio | 04480 | 24,636 |
Olentangy Local School District, Ohio | 04676 | 22,083 |
Dayton City School District, Ohio | 04384 | 21,972 |
The last step is to pull the GEOIDs for each of the four biggest districts.
top_district_geoids <- function(.districts, n) {
    .districts[[UNIFIED_SCHOOL_DISTRICT]][1:n]
}

districts_for_query <- top_district_geoids(latest_school_districts, 4)

print(districts_for_query)
#> [1] "04380" "04378" "04375" "04490"
Pulling it all together
Now we have all of the components that we need to pull the population histories for the four biggest school districts in Ohio.
First, we fetch the data. Notice that we can reuse the fetching function that we defined above!
fetch_district_histories <- function(.state_fips, .district_geoids = "*") {
    YEARS_FOR_ACS |>
        purrr::map(
            \(.year) fetch_district_children(.state_fips,
                                             .year,
                                             .district_geoids)
        ) |>
        purrr::list_rbind()
}
big_district_histories_raw <- fetch_district_histories(STATE_FIPS,
                                                       districts_for_query)
Next, we sum up the number of children and create human-readable names for each district.
wrangle_histories <- function(.raw_district_histories, .latest_districts) {
    .raw_district_histories |>
        dplyr::count(
            .data[[UNIFIED_SCHOOL_DISTRICT]],
            .data$Year,
            name = "Children",
            wt = .data$Value
        ) |>
        dplyr::inner_join(
            dplyr::select(
                .latest_districts,
                tidyselect::all_of(c("NAME", UNIFIED_SCHOOL_DISTRICT))
            ),
            by = UNIFIED_SCHOOL_DISTRICT
        ) |>
        dplyr::mutate(
            District = stringr::str_extract(.data$NAME, "^\\S*")
        ) |>
        dplyr::select(
            "District",
            "Year",
            "Children"
        )
}
big_district_histories <- wrangle_histories(big_district_histories_raw,
                                            latest_school_districts)
Finally, we can plot the results.
plot_district_histories <- function(.district_histories) {
    .district_histories |>
        ggplot2::ggplot(
            ggplot2::aes(
                x = .data$Year,
                y = .data$Children,
                color = .data$District
            )
        ) +
        ggplot2::geom_line(
            linewidth = 2
        ) +
        ggplot2::geom_point(
            size = 5
        ) +
        ggplot2::scale_x_continuous(
            name = NULL,
            labels = scales::label_number(big.mark = ""),
            breaks = scales::breaks_width(5),
            minor_breaks = scales::breaks_width(1)
        ) +
        ggplot2::scale_y_continuous(
            name = "School-aged children in school district's area",
            labels = scales::label_comma(accuracy = 1,
                                         scale = 0.001,
                                         suffix = "K"),
            limits = c(0, NA)
        ) +
        ggplot2::scale_color_discrete(
            guide = ggplot2::guide_legend(title = NULL,
                                          position = "top")
        ) +
        ggplot2::theme_minimal()
}
plot_district_histories(big_district_histories)
It looks like maybe a lot of people are moving from Cleveland to Columbus, and taking their kids with them.
Everything is reusable
Now that you’ve seen the example with Ohio, you may be thinking “that question is not relevant to me.” However! You could rerun all of the code, changing just the STATE_FIPS to your state’s FIPS code, and get the same analysis for a different state.
Let’s look at changes in population for the five largest school districts in Delaware.
STATE_FIPS <- 10

latest_de_districts <- STATE_FIPS |>
    fetch_district_children(2023) |>
    wrangle_latest_kids()

de_district_geoids <- latest_de_districts |>
    top_district_geoids(5)

big_de_district_histories <- STATE_FIPS |>
    fetch_district_histories(de_district_geoids) |>
    wrangle_histories(latest_de_districts)

plot_district_histories(big_de_district_histories)
You could do the same thing with any other state. That’s the goal of this project: help define workflows for making reports from Census data so that they can be easily rerun for comparable places and times.
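To make that reuse explicit, here is a sketch that chains the functions defined above into a single helper. The helper is illustrative rather than part of hercacstables; its name, arguments, and the example FIPS code are placeholders.

# Sketch: the whole workflow as one reusable function.
plot_biggest_districts <- function(.state_fips, .n_districts, .latest_year = 2023) {
    latest_districts <- .state_fips |>
        fetch_district_children(.latest_year) |>
        wrangle_latest_kids()
    district_geoids <- top_district_geoids(latest_districts, .n_districts)
    .state_fips |>
        fetch_district_histories(district_geoids) |>
        wrangle_histories(latest_districts) |>
        plot_district_histories()
}

# For example, the five largest districts in Washington (FIPS code 53):
# plot_biggest_districts(53, 5)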