hercacstables

Welcome

For the newest R users:

If you are new to R, but not data science, then I suggest checking out R for Data Science. If have a background that is not data science, then I have heard good things about this book.

First-time users of `hercacstables`

If this is your first time using this package then you probably want to register with the Census. Registering is important because it helps you make more queries to the API. The Census limits the number of anonymous API queries that it receives from any one IP address. The vignette("set_up_your_api_key") will guide you through the process.

Otherwise, welcome back!

This package helps R users to access data from the United States Census Bureau, especially its American Community Survey.

The American Community Survey (ACS) from the US Census’s website provides a vast amount of useful data.

However, it returns those data in a weirdly idiosyncratic way. Even though the output seems tabular, the data are really organized in a tree-like fashion. This package is intended to make it easy to access and use the ACS data with R.

The database has an enormous amount of information, stored in a versatile but cumbersome format. Data are stored in hundreds of different tables, called “groups,” that give breakdowns of information. For example, group B21005 reports counts of people according to their employment status, military service, and age. Each row in a group reports a value for a specific slice of the topic. For example, row 5 in group B21005 gives the number of employed veterans between the ages of 18 and 34. The Census combines the group code and row into a variable, B21005_005E. This packing of information gives the Census a very flexible way to report all of their data. However, it is challenging for the user because you have to somehow know the variables that report the data that you are interested in.

In light of that, hercacstables is designed to help people make repeated, efficient use of the Census’s API. There are glossary tables that explicitly connect real-world concepts to Census variables. You can then use database-style joins to connect the concepts to the values that you pull from the API.

Many questions that work with Census data follow a common pattern:

How did [measurement] differ among [demographic groups] and across [geographic levels] in [geographic area] during [span of time]?

A Measurement is any of the huge number of things that the Census keeps track of. Measurements are arranged in tables (called “groups”) and rows. If you were drawing maps, this would be what determines the color of each area. Examples include population size (table B01001 and others), median household income (table B19013 and others), types of computers in a household (table B28001), number of vehicles used while commuting (table B08015).

A Demographic Group is a subset of the population that shares specific traits. Demographic groups can be represented either by groups (common when reporting by race or ethnicity) or rows (most other cases, like age, sex, or veteran status). If you were drawing maps, you would probably have a different version of each map for each demographic grouping. Examples include the number of Hispanic girls under 5 years old (table B01001I, row 18), the median income of Asian American households (table B19013D, row 1), the number of households with no computer (table B28001, row 11), the number of vehicles used by women while commuting (table B08015, row 3).

A Geographic Level is a Census-defined hierarchy from the whole country to small blocks. The level that you are interested in is determined by the scale of your question. If you were drawing maps, these would be the areas that appeared as different colors. Examples include congressional districts, incorporated cities, school districts, and Census tracts.

A Geographic Area is a set of specific instances of one or more geographic levels. This is the full geographic scope of your question. If you were drawing maps, this would determine their scale. Examples include whole states, metropolitan areas, Census-designated places, and counties.

The workflow

Identify the groups and rows that contain the data you need
- Find potential groups by searching the Universe and Description fields of hercacstables::METADATA_FOR_ACS_GROUPS.
- Identify specific variables with hercacstables::METADATA_FOR_ACS_VARIABLES: filter by the Group field and examine the Details field.
- Create a glossary that maps from the Census variable name to your variables.
  - You may need several if you are pulling data from several groups.
Define the geographies that you will use
- Use hercacstables::METADATA_FOR_ACS_GEOGRAPHIES for this.
- You need to know at least two levels of geography:
  - The one that you want to pull data for, your level of interest
    - This is often small, like tract or county subdivision.
  - Any and all levels that contain your level of interest
    - These will be larger, like state or county.
- Create a glossary that maps from geographic ids to meaningful names.
Write fetching functions that call hercacstables::fetch_data()
- they will often always use the same variables (from step 1)
- they may need to be parameterized by geography if you’re pulling from multiple levels
- they should always be parameterized by year so that you can use them to look at trends and/or reuse them in subsequent years.
Write wrangling functions that for turning fetched tables into useful ones.
- These will probably involve using dplyr::inner_join() between the fetched data and your glossaries.
- You can also perform calculations like aggregating or finding remainders.
- You will probably only have to write these functions once per question.
Run the workflow in two stages.
- Use purrr::map() and purrr::list_rbind() to download all of the data into one data frame.
  - Cache that result because API calls are slow
- Run the wrangling functions.
  - Save these results with, e.g. base::saveRDS().

Example

Let’s say you want to ask this question:

How did [the number of households] differ between [Hispanic and non-Hispanic people] and across [Manchester, Nashua, and suburban areas] in [Hillsborough County, NH] during [the last 10 years]?

Find the measurement and demographic groups

First, find the variables that describe the numbers of households. One way to do this is to search hercacstables::METADATA_FOR_ACS_GROUPS for groups whose Universe is “Households” and whose Description contains “Hispanic” or “Ethnicity.” A group’s Universe describes what it is measuring, often telling you the units of whatever its values are. A group’s Description is a phrase that summarizes what it reports.

hercacstables::METADATA_FOR_ACS_GROUPS |>
    dplyr::filter(
        .data$ACS5,
        stringr::str_detect(.data$Universe, "Household"),
        stringr::str_detect(.data$Description, "Hispanic|Ethnic")
    ) |>
    dplyr::select(
        "Group",
        "Universe",
        "Description"
    ) |>
    knitr::kable()

Group	Universe	Description
B11001H	Households with a householder who is White alone, not Hispanic or Latino	Household Type (Including Living Alone) (White Alone, Not Hispanic or Latino)
B11001I	Households with a householder who is Hispanic or Latino	Household Type (Including Living Alone) (Hispanic or Latino)
B19001H	Households with a householder who is White alone, not Hispanic or Latino	Household Income in the Past 12 Months (White Alone, Not Hispanic or Latino Householder)
B19001I	Households with a householder who is Hispanic or Latino	Household Income in the Past 12 Months (Hispanic or Latino Householder)
B19013H	Households with a householder who is White alone, not Hispanic or Latino	Median Household Income in the Past 12 Months (White Alone, Not Hispanic or Latino Householder)
B19013I	Households with a householder who is Hispanic or Latino	Median Household Income in the Past 12 Months (Hispanic or Latino Householder)
B19025H	Households with a householder who is White alone, not Hispanic or Latino	Aggregate Household Income in the Past 12 Months (White Alone, Not Hispanic or Latino Householder)
B19025I	Households with a householder who is Hispanic or Latino	Aggregate Household Income in the Past 12 Months (Hispanic or Latino Householder)
B19037H	Households with a householder who is White alone, not Hispanic or Latino	Age of Householder by Household Income in the Past 12 Months (White Alone, Not Hispanic or Latino Householder)
B19037I	Households with a householder who is Hispanic or Latino	Age of Householder by Household Income in the Past 12 Months (Hispanic or Latino Householder)
B22005H	Households with a householder who is White alone, not Hispanic or Latino	Receipt of Food Stamps/SNAP in the Past 12 Months by Race of Householder (White Alone, Not Hispanic or Latino)
B22005I	Households with a householder who is Hispanic or Latino	Receipt of Food Stamps/SNAP in the Past 12 Months by Race of Householder (Hispanic or Latino)

It looks like our best bet is group “B11001I.” It is likely that group “B11001” contains counts of households of any race. The first row of most groups is the total value across any demographic subset that it keeps track of. Since our question does not ask about different household types, we probably just need row one from groups “B11001” and “B11001I.” This gives us a good opportunity to document the ethnicities counted in each.

HOUSEHOLD_GROUPS <- c("B11001", "B11001I")
household_variables <- hercacstables::METADATA_FOR_ACS_VARIABLES |>
    dplyr::filter(
        .data$Dataset == "ACS1",
        .data$Group %in% HOUSEHOLD_GROUPS,
                  .data$Index == 1) |>
    dplyr::mutate(
        Ethnicity = dplyr::case_match(.data$Group,
                                      "B11001" ~ "All",
                                      "B11001I" ~ "Hispanic or Latino")
    ) |>
    dplyr::select("Group", "Index", "Variable", "Ethnicity")

knitr::kable(household_variables,
             align = c("lrll"))

Group	Index	Variable	Ethnicity
B11001I	1	B11001I_001E	Hispanic or Latino
B11001	1	B11001_001E	All

Find the geographic levels and geographic area

The next step is to find the codes that are related to the geographic areas and levels. We need to know the FIPS code for the state, county, and two cities. We also need to know that the geographic level that we’re working with is “county subdivision.” As the last part of this step, we define a glossary to translate from the FIPS codes for the different geographies to a human-readable name.

NEW_HAMPSHIRE <- "33"
COUNTY_LEVEL <- "county"
HILLSBOROUGH_CO <- "011"
CITY_LEVEL <- "county subdivision"
MANCHESTER_NH <- "45140"
NASHUA_NH <- "50260"

geography_definitions <- tibble::tribble(
    ~ FIPS,          ~ Location,
    HILLSBOROUGH_CO, "County-wide",
    MANCHESTER_NH,   "Manchester",
    NASHUA_NH,       "Nashua"
)

geography_definitions |>
    knitr::kable()

FIPS	Location
011	County-wide
45140	Manchester
50260	Nashua

Find the time interval

This example looks for population changes “in the last ten years.” What that really means is the ten most recent years for which we have data. We can use 1-year estimates for our question because we are dealing with geographic levels that have more than 50,000 people in them. In the case of the 1-year ACS dataset, the latest will be one or two years ago. The earliest will be 11 years before that because there was no survey in 2020.

BUILD_DATE <- Sys.Date()
LATEST_YEAR <- hercacstables::most_recent_vintage("acs", "acs1")
TEN_YEARS_AGO <- LATEST_YEAR - 11L

The last ten years available from the Census are, as of 2025-06-02, 2012 through 2023.

YEARS_INCLUDED <- c(TEN_YEARS_AGO:2019, 2021:LATEST_YEAR)

Define functions that use the API

We will need to make multiple calls to the API, so it makes sense to create some reusable functions. We need two calls per year. The first one will pull the county-wide household counts. The second one will pull the household counts for each city. Each function should have “year” as its argument so that we can reuse it.

generalized_fetch_data <- function(.year, .level, .areas, ...) {
    hercacstables::fetch_data(
        variables = household_variables$Variable,
        year = .year,
        survey_type = "acs",
        table_or_survey_code = "acs1",
        for_geo = .level,
        for_items = .areas,
        state = NEW_HAMPSHIRE,
        ...
    )
}

fetch_county_households <- function(.year){
    generalized_fetch_data(.year,
                           "county",
                           HILLSBOROUGH_CO)
}

fetch_city_households <- function(.year){
    generalized_fetch_data(.year,
                           CITY_LEVEL,
                           c(MANCHESTER_NH,
                             NASHUA_NH),
                           county = HILLSBOROUGH_CO)
}

fetch_example_data <- function(.year) {
    dplyr::bind_rows(
        fetch_county_households(.year),
        fetch_city_households(.year)
    )
}

Fetch the data

This is where hercacstables starts to come into its own. We define the fetching process as few times as possible, just tweaking it for related cases. This leads to a lot of code reuse and efficiency, especially for reports that you just need to update once a year.

raw_households <- YEARS_INCLUDED |>
    purrr::map(fetch_example_data) |>
    purrr::list_rbind()

raw_households |>
    dplyr::filter(.data$Year == LATEST_YEAR) |>
    dplyr::mutate(
        dplyr::across(c("Value"),
                      scales::label_comma(accuracy = 1))
    ) |>
    knitr::kable(
        align = "lllrrrl"
    )

state	county	Group	Index	Value	Year	county subdivision
33	011	B11001I	1	11,255	2023	NA
33	011	B11001	1	170,761	2023	NA
33	011	B11001I	1	4,511	2023	45140
33	011	B11001	1	50,053	2023	45140
33	011	B11001I	1	4,770	2023	50260
33	011	B11001	1	36,451	2023	50260

Wrangle the data

The raw data are not very usable. Several of the columns still have codes, rather than human-readable values. The value column also does not explicitly state values for the suburban or for non-Hispanic folks. The “wrangling” process is where we addressing these drawbacks.

Map codes to words

We should get rid of the columns that are Census database codes and create columns about location and demographics that have human-readable values.

households <- raw_households |>
    dplyr::inner_join(
        household_variables,
        by = c("Group", "Index")
    ) |>
    dplyr::mutate(
        FIPS = dplyr::coalesce(.data$`county subdivision`,
                               .data$county)
    ) |>
    dplyr::inner_join(
        geography_definitions,
        by = "FIPS"
    ) |>
    dplyr::select(
        "Location",
        "Year",
        "Ethnicity",
        Households = "Value"
    )

households |>
    dplyr::filter(.data$Year == LATEST_YEAR) |>
    dplyr::mutate(
        dplyr::across(c("Households"),
                      scales::label_comma(accuracy = 1))
    ) |>
    knitr::kable(
        align = "rllr"
    )

Location	Year	Ethnicity	Households
County-wide	2023	Hispanic or Latino	11,255
County-wide	2023	All	170,761
Manchester	2023	Hispanic or Latino	4,511
Manchester	2023	All	50,053
Nashua	2023	Hispanic or Latino	4,770
Nashua	2023	All	36,451

Compute implicit values

Now that the columns are human-readable, we can compute the values that we are actually interested in. That computation involves subtracting either the cities’ households from the county’s, or the number of Hispanic households from the total number of households. This task turns up very frequently when dealing with Census data, so here is a helper function that you can use or adapt to do it for you: hercacstables::subtract_parts_from_whole(). That function does not remove the rows that contain the all-groups category. In our case, we must remove them so all of the calculations come out correctly.

households <- households |>
    hercacstables::subtract_parts_from_whole(
        grouping_column = "Location",
        value_column = "Households",
        whole_name = "County-wide",
        part_names = c("Manchester", "Nashua"),
        remainder_name = "Suburbs"
    ) |>
    dplyr::filter(
        .data$Location != "County-wide"
    ) |>
    hercacstables::subtract_parts_from_whole(
        grouping_column = "Ethnicity",
        value_column = "Households",
        whole_name = "All",
        part_names = "Hispanic or Latino",
        remainder_name = "Not Hispanic or Latino"
    ) |>
    dplyr::filter(
        .data$Ethnicity != "All"
    )

households |>
    dplyr::filter(.data$Year == LATEST_YEAR) |>
    dplyr::mutate(
        dplyr::across(c("Households"),
                      scales::label_comma(accuracy = 1))
    ) |>
    knitr::kable(
        align = "lrlr"
    )

Year	Location	Households	Ethnicity
2023	Manchester	4,511	Hispanic or Latino
2023	Manchester	45,542	Not Hispanic or Latino
2023	Nashua	4,770	Hispanic or Latino
2023	Nashua	31,681	Not Hispanic or Latino
2023	Suburbs	1,974	Hispanic or Latino
2023	Suburbs	82,283	Not Hispanic or Latino

Answer the question

Now we can finally look at trends in number of households in Hillsborough County, NH, comparing between Hispanic and non-Hispanic households among Manchester, Nashua, and the suburbs.

Visualize

The first step is to make some graphs. These data are 4-dimensional because they involve time, location, ethnicity, and number of households. That means we’ll need more than one graph. It looked from the tables above that the number of Hispanic households is much lower than non-Hispanic households in all three locations. Let’s make a two-panel graph, where each panel shows one ethnicity. That way they can have separate y axes. Both graphs will have time on the x-axis, number of households on the y, and designate location with the color of points and lines.

households |>
    ggplot2::ggplot(
        ggplot2::aes(
            x = .data$Year,
            y = .data$Households,
            color = .data$Location,
            group = .data$Location
        )
    ) +
    ggplot2::geom_line(
        linewidth = 2
    ) +
    ggplot2::geom_point(
        size = 5
    ) +
    ggplot2::scale_x_continuous(
        name = NULL,
        breaks = scales::breaks_width(5),
        minor_breaks = scales::breaks_width(1)
    ) +
    ggplot2::scale_y_continuous(
        name = "Number of households",
        limits = c(0, NA),
        labels = scales::label_comma(accuracy = 1)
    ) +
    ggplot2::scale_color_viridis_d(
        name = NULL,
        guide = ggplot2::guide_legend(position = "top")
    ) +
    ggplot2::facet_grid(
        rows = ggplot2::vars(.data$Ethnicity),
        scales = "free_y"
    )

It looks like the number of households is growing, with non-Hispanic households increasing in the suburbs and Hispanic households increasing in the cities.

Analyze

Let’s test this with an ANCOVA. We’ll subtract 2012 from the year so that the intercept estimate gives us the value in 2012, not AD 0.

household_model <- households |>
    dplyr::mutate(
        Year = .data$Year - TEN_YEARS_AGO,
        Location = factor(.data$Location,
                          levels = c("Suburbs",
                                     "Manchester",
                                     "Nashua")),
        Ethnicity = factor(.data$Ethnicity,
                           levels = c("Not Hispanic or Latino",
                                      "Hispanic or Latino"))
    ) |>
    lm(
        Households ~ Year * Ethnicity * Location,
        data = _
    )

I always like to look at the ANOVA table first to get a 10,000 meter view before I try to interpret specific parameters.

household_model |>
    anova() |>
    broom::tidy() |>
    dplyr::mutate(
        dplyr::across(c("sumsq",
                        "meansq"),
                      scales::label_comma(accuracy = 1)),
        dplyr::across("statistic",
                      \(.) signif(., 4)),
        dplyr::across("p.value",
                      \(.) round(., 4))
    ) |>
    knitr::kable(
        align = "lrrrrr"
    )

term	df	sumsq	meansq	statistic	p.value
Year	1	18,087,730	18,087,730	16.080	0.0002
Ethnicity	1	37,636,191,674	37,636,191,674	33460.000	0.0000
Location	2	5,417,599,914	2,708,799,957	2408.000	0.0000
Year:Ethnicity	1	6,301,756	6,301,756	5.602	0.0216
Year:Location	2	6,733,309	3,366,655	2.993	0.0587
Ethnicity:Location	2	6,202,912,707	3,101,456,353	2757.000	0.0000
Year:Ethnicity:Location	2	30,349,623	15,174,812	13.490	0.0000
Residuals	53	59,619,254	1,124,892	NA	NA

It looks like EVERYTHING is significant, so let’s look at all of the parameters that were in the near-significant range.

household_model |>
    broom::tidy() |>
    dplyr::filter(
        .data$`p.value` < 0.1
    ) |>
    dplyr::mutate(
        dplyr::across(c("estimate",
                        "std.error",
                        "statistic"),
                      \(.) signif(., 4)),
        dplyr::across("p.value",
                      \(.) round(., 4))
    ) |>
    knitr::kable(
        align = "lrrrr"
    )

term	estimate	std.error	statistic	p.value
(Intercept)	72600.0	576.10	126.000	0e+00
Year	806.5	90.89	8.873	0e+00
EthnicityHispanic or Latino	-71030.0	814.80	-87.190	0e+00
LocationManchester	-31210.0	814.80	-38.310	0e+00
LocationNashua	-39680.0	814.80	-48.700	0e+00
Year:EthnicityHispanic or Latino	-787.6	128.50	-6.127	0e+00
Year:LocationManchester	-543.7	128.50	-4.230	1e-04
Year:LocationNashua	-754.5	128.50	-5.871	0e+00
EthnicityHispanic or Latino:LocationManchester	32360.0	1152.00	28.090	0e+00
EthnicityHispanic or Latino:LocationNashua	40430.0	1180.00	34.270	0e+00
Year:EthnicityHispanic or Latino:LocationManchester	683.2	181.80	3.758	4e-04
Year:EthnicityHispanic or Latino:LocationNashua	914.8	183.90	4.974	0e+00

Summarize

The number of households in Hillsborough county overall grew from 2012-2023. The number of Hispanic households was much higher in the two cities than in the suburban parts of the county. This difference became more pronounced over the decade, for two reasons. The number of Hispanic households grew more quickly in Nashua and Manchester than in the suburban areas. The number of non-Hispanic households grew more quickly in the suburbs than in the cities. Isn’t it nice when the visual patterns are corroborated by significant statistics?