Reusability
reusability.Rmd
Introduction
A reusable approach incorporates setup and post-processing, which take place before and after calling for data from the API. Let’s say that you are interested in the fraction of women who have given birth in the U.S., broken down by age.
The non-reusable part
Use search_in_columns()
to find tables that are relevant
to our question.
WB_TABLES <- hercacstables::search_in_columns(
hercacstables::METADATA_FOR_ACS_GROUPS,
Group = "\\d$", # the group must end in a digit, so only "all races" tables
Universe = "wom[ae]n", # matches woman and women
Description = c("birth", "year") # the census doesn't use "age" much
)
Group | Universe | Description | ACS1 | ACS5 |
---|---|---|---|---|
B13002 | Women 15 to 50 years | Women 15 to 50 Years Who Had a Birth in the Past 12 Months by Marital Status and Age | TRUE | TRUE |
B13004 | Women 15 to 50 years in households | Women 15 to 50 Years Who Had a Birth in the Past 12 Months by Presence of Spouse or Unmarried Partner | TRUE | TRUE |
B13008 | Women 15 to 50 years | Women 15 to 50 Years Who Had a Birth in the Past 12 Months by Marital Status and Nativity | TRUE | TRUE |
B13010 | Women 15 to 50 years for whom poverty status is determined | Women 15 to 50 Years Who Had a Birth in the Past 12 Months by Marital Status and Poverty Status in the Past 12 Months | TRUE | TRUE |
B13012 | Women 16 to 50 years | Women 16 to 50 Years Who Had a Birth in the Past 12 Months by Marital Status and Labor Force Status | TRUE | TRUE |
B13014 | Women 15 to 50 years | Women 15 to 50 Years Who Had a Birth in the Past 12 Months by Marital Status and Educational Attainment | TRUE | TRUE |
B13015 | Women 15 to 50 years | Women 15 to 50 Years Who Had a Birth in the Past 12 Months by Marital Status and Receipt of Public Assistance Income in the Past 12 Months | TRUE | TRUE |
B13016 | Women 15 to 50 years | Women 15 to 50 Years Who Had a Birth in the Past 12 Months by Age | TRUE | TRUE |
It looks like table B13016 is exactly what we are looking for.
Setup
Define a glossary of variable meanings
First, create a glossary table that maps from the Census variables you need to the real-world meanings that you actually care about.
The following example glosses 12 variables from “B13016.” Each specific variable encodes four columns’ worth of data.
GLOSSARY_OF_WOMEN_AND_BIRTHS <- "B13016" |>
hercacstables::unpack_group_details() |>
dplyr::filter(
.data$Dataset == "ACS1", # there are details for 1- and 5-year datasets
dplyr::if_all(c("A", "B"), \(.)nchar(.) > 0) #
) |>
dplyr::mutate(
`Gave Birth` = stringr::str_detect(.data$A, "had"),
`Lower Age` = stringr::str_extract(.data$B, "^\\d{2}"),
`Upper Age` = stringr::str_extract(.data$B, "(?<!^)\\d{2}"),
dplyr::across(tidyselect::ends_with("Age"),
as.integer)
) |>
dplyr::select(
!tidyselect::any_of(c("A", "B"))
)
Dataset | Group | Index | Variable | Gave Birth | Lower Age | Upper Age |
---|---|---|---|---|---|---|
ACS1 | B13016 | 3 | B13016_003E | TRUE | 15 | 19 |
ACS1 | B13016 | 4 | B13016_004E | TRUE | 20 | 24 |
ACS1 | B13016 | 5 | B13016_005E | TRUE | 25 | 29 |
ACS1 | B13016 | 6 | B13016_006E | TRUE | 30 | 34 |
ACS1 | B13016 | 7 | B13016_007E | TRUE | 35 | 39 |
ACS1 | B13016 | 8 | B13016_008E | TRUE | 40 | 44 |
ACS1 | B13016 | 9 | B13016_009E | TRUE | 45 | 50 |
ACS1 | B13016 | 11 | B13016_011E | FALSE | 15 | 19 |
ACS1 | B13016 | 12 | B13016_012E | FALSE | 20 | 24 |
ACS1 | B13016 | 13 | B13016_013E | FALSE | 25 | 29 |
ACS1 | B13016 | 14 | B13016_014E | FALSE | 30 | 34 |
ACS1 | B13016 | 15 | B13016_015E | FALSE | 35 | 39 |
ACS1 | B13016 | 16 | B13016_016E | FALSE | 40 | 44 |
ACS1 | B13016 | 17 | B13016_017E | FALSE | 45 | 50 |
Calling the API
Define a fetching function
Having defined a reusable glossary, we can now define a reusable fetching function.
fetch_women_and_births <- function(...) {
hercacstables::fetch_data(
variables = GLOSSARY_OF_WOMEN_AND_BIRTHS$Variable,
survey_type = "acs",
table_or_survey_code = "acs1",
...
)
}
Send the request to the API
Run the fetch_data()
command by itself so that you can
cache it. In this example, we will ask for the most recent data for the
whole country.
RAW_WOMEN_AND_BIRTHS <- fetch_women_and_births(
year = hercacstables::most_recent_vintage("acs", "acs1"),
for_geo = "us",
for_items = "*"
)
us | Group | Index | Value | Year |
---|---|---|---|---|
1 | B13016 | 3 | 91,622 | 2023 |
1 | B13016 | 4 | 548,375 | 2023 |
1 | B13016 | 5 | 989,964 | 2023 |
1 | B13016 | 6 | 1,201,077 | 2023 |
1 | B13016 | 7 | 792,711 | 2023 |
1 | B13016 | 8 | 269,902 | 2023 |
1 | B13016 | 9 | 125,282 | 2023 |
1 | B13016 | 11 | 10,712,296 | 2023 |
1 | B13016 | 12 | 9,996,049 | 2023 |
1 | B13016 | 13 | 9,825,750 | 2023 |
1 | B13016 | 14 | 10,392,627 | 2023 |
1 | B13016 | 15 | 10,419,909 | 2023 |
1 | B13016 | 16 | 10,755,886 | 2023 |
1 | B13016 | 17 | 11,862,580 | 2023 |
Postprocessing
Wrangle the result into a helpful format
The raw data do not answer our question all by themselves. We are
interested in rates, not counts. Using the
dplyr::summarize()
function will let us get rid of the
superfluous information about marital status.
wrangle_women_and_births <- function(.raw_api_output, ...) {
.raw_api_output |>
dplyr::inner_join(
GLOSSARY_OF_WOMEN_AND_BIRTHS,
by = c("Group", "Index")
) |>
dplyr::summarize(
`Recent Mothers` = sum(.data$Value * .data$`Gave Birth`),
`All Women` = sum(.data$Value),
Rate = .data$`Recent Mothers` / .data$`All Women`,
.by = tidyselect::all_of(c(..., "Lower Age", "Upper Age"))
)
}
WOMEN_AND_BIRTHS <- wrangle_women_and_births(RAW_WOMEN_AND_BIRTHS)
Lower Age | Upper Age | Recent Mothers | All Women | Rate |
---|---|---|---|---|
15 | 19 | 91,622 | 10,803,918 | 1% |
20 | 24 | 548,375 | 10,544,424 | 5% |
25 | 29 | 989,964 | 10,815,714 | 9% |
30 | 34 | 1,201,077 | 11,593,704 | 10% |
35 | 39 | 792,711 | 11,212,620 | 7% |
40 | 44 | 269,902 | 11,025,788 | 2% |
45 | 50 | 125,282 | 11,987,862 | 1% |
Reusability
The benefit of this approach is that we can reuse the fetching and wrangling functions.
Across Places
In this example, we pull data for three different counties in Wisconsin.
SE_WI_WOMEN_AND_BIRTHS <- fetch_women_and_births(
state = 55,
for_geo = "county",
for_items = c("059", "079", "101"), # Kenosha, Racine, and Milwaukee
year = 2023
) |>
wrangle_women_and_births(
"county"
) |>
dplyr::mutate(
county = dplyr::case_match(.data$county,
"059" ~ "Kenosha",
"079" ~ "Milwaukee",
"101" ~ "Racine")
)
For a simpler table, we’ll exclude the counts of mothers and women, then pivot the rate data by county.
Lower Age | Upper Age | Kenosha | Milwaukee | Racine |
---|---|---|---|---|
15 | 19 | 0% | 1% | 0% |
20 | 24 | 1% | 4% | 2% |
25 | 29 | 16% | 6% | 18% |
30 | 34 | 5% | 9% | 17% |
35 | 39 | 2% | 8% | 10% |
40 | 44 | 3% | 1% | 1% |
45 | 50 | 0% | 1% | 0% |
Through time
In this example, we pull data about Texas from three years. We have to make three separate calls, one for each year. It’s an idiosyncrasy of this particular API.
TEXAS_WOMEN_AND_BIRTHS <- 2021:2023 |>
purrr::map(
\(.y) fetch_women_and_births(
for_geo = "state",
for_items = 48L,
year = .y
)
) |>
purrr::list_rbind() |>
wrangle_women_and_births(
"Year"
)
For a simpler table, we’ll exclude the counts of mothers and women, then pivot the rate data by year.
Lower Age | Upper Age | 2021 | 2022 | 2023 |
---|---|---|---|---|
15 | 19 | 2% | 1% | 1% |
20 | 24 | 8% | 7% | 7% |
25 | 29 | 11% | 11% | 10% |
30 | 34 | 11% | 11% | 10% |
35 | 39 | 7% | 6% | 7% |
40 | 44 | 2% | 3% | 2% |
45 | 50 | 1% | 1% | 1% |