Skip to contents

Why the DHC?

The 2020 Demographic and Housing Characteristics file (DHC) is a table with detailed, block-level information about socioeconomic conditions.

The Census API has other tables from the Decennial Census, but they lack the level of detail in the DHC. Some tables lack spatial resolution. For example, tables like the Demographic Profile (DP) or the Congressional District (cd11X) do not provide data at the block level. Other tables lack demographic resolution. For example the table of Redistricting Data (PL) contains many fewer cross-tabulations between race and housing situation. The DHC provides fine-grained detail, as can be seen in its definition from the link above:

This product will include topics such as age, sex, race, Hispanic or Latino origin, household type, family type, relationship to householder, group quarters population, housing occupancy and housing tenure. Some tables will be iterated by race and ethnicity.

That ambitious portfolio means that the DHC is huge and complex, with 249 groups, or sub-tables, that hold a total of 9067 separate variables. I used the Census API to create a data table in R that captures the information from the link just above. These variables provide a compact way for the API to present information, but they need some assistance to be meaningful to actual humans. That is where glossary tables come in.

Glossary tables for DHC groups

A glossary table explicitly lays out the connection between the row number of a Census API variable and the demographic meaning that it represents. For example, the single variable “H12B_010N”, might reports the number of rented households that are headed by a Black, non-Hispanic, householder. That captures information about race, ethnicity, home ownership, and number of households. A glossary that explains group H12B would therefore need at least 4 columns.

This report documents the process that I will follow to create a glossary table for each group in the DHC.

Characteristics of each group

The glossary for each group will need enough columns to represent all of the detail that is captured by its rows. The first step for glossing a group is to count how many columns appear in its variables’ lists of details.

DHC_VARIABLES |>
    dplyr::summarize(
        `Columns Present` = .data$Details |>
            purrr::map_int(length) |>
            max(),
        .by = "Group"
    ) |>
    dplyr::count(
        .data$`Columns Present`,
        name = "Frequency"
    ) |>
    knitr::kable(
        caption = "How many detail columns are needed by DHC groups."
    )
How many detail columns are needed by DHC groups.
Columns Present Frequency
2 46
3 112
4 45
5 43
6 3

Most of the groups will need three or fewer columns to capture their details, but many will need more. Three different tables will need six columns! We’ll use those as examples as we proceed.

hoist_group_details <- function(.glossary, .group, .fields){
    .field_list <- .fields |>
        seq_along() |>
        rlang::set_names(.fields)
    .glossary |>
        dplyr::filter(.data$Group == .group) |>
        tidyr::hoist(.col = "Details", !!!.field_list)
}

GROUP_DETAILS <- DHC_VARIABLES |>
    dplyr::summarize(
        Concept = dplyr::first(.data$Concept),
        Length = dplyr::n(),
        Width = .data$Details |>
            purrr::map_int(length) |>
            max(),
        .by = "Group"
    ) |>
    dplyr::mutate(
        Variables = purrr::map2(
            .data$Group, .data$Width,
            \(.g, .c) {
                .tmp <- DHC_VARIABLES |>
                    hoist_group_details(
                        .g,
                        LETTERS[1:.c]
                    )
                if (nrow(.tmp) > 1) {
                    .tmp <- dplyr::select(.tmp,
                                          tidyselect::where(
                                              \(.) dplyr::n_distinct(.) > 1
                                          ))
                }
                dplyr::mutate(
                    .tmp,
                    dplyr::across(tidyselect::any_of(LETTERS[1:.c]),
                                  \(.) dplyr::coalesce(., ""))
                )
            }
        )
    )
EXAMPLE_GROUPS <- GROUP_DETAILS |>
    dplyr::filter(
        .data$Width == 6
    ) |>
    dplyr::select(
        !tidyselect::any_of(c("Variables", "Width"))
    )

knitr::kable(
    EXAMPLE_GROUPS,
    caption = "DHC groups that need six columns to capture their details"
)
DHC groups that need six columns to capture their details
Group Concept Length
H14 TENURE BY HOUSEHOLD TYPE BY AGE OF HOUSEHOLDER 69
PCT19 GROUP QUARTERS POPULATION BY SEX BY AGE BY GROUP QUARTERS TYPE 195
PCT2 HOUSEHOLD SIZE BY HOUSEHOLD TYPE BY PRESENCE OF OWN CHILDREN 19

An example glossary: PCT2

This is what the details of table PT2 look like after hoisting.

PCT2 <- GROUP_DETAILS |>
    dplyr::filter(
        .data$Group == "PCT2"
    ) |>
    dplyr::pull(
        "Variables"
    ) |>
    purrr::pluck(1)

knitr::kable(
    PCT2,
    caption = "The details of group PCT2"
)
The details of group PCT2
Index Variable B C D E F
1 PCT2_001N
2 PCT2_002N 1-person household
3 PCT2_003N 1-person household Male householder
4 PCT2_004N 1-person household Female householder
5 PCT2_005N 2-or-more-person household
6 PCT2_006N 2-or-more-person household Family households
7 PCT2_007N 2-or-more-person household Family households Married couple family
8 PCT2_008N 2-or-more-person household Family households Married couple family With own children under 18 years
9 PCT2_009N 2-or-more-person household Family households Married couple family No own children under 18 years
10 PCT2_010N 2-or-more-person household Family households Other family
11 PCT2_011N 2-or-more-person household Family households Other family Male householder, no spouse present
12 PCT2_012N 2-or-more-person household Family households Other family Male householder, no spouse present With own children under 18 years
13 PCT2_013N 2-or-more-person household Family households Other family Male householder, no spouse present No own children under 18 years
14 PCT2_014N 2-or-more-person household Family households Other family Female householder, no spouse present
15 PCT2_015N 2-or-more-person household Family households Other family Female householder, no spouse present With own children under 18 years
16 PCT2_016N 2-or-more-person household Family households Other family Female householder, no spouse present No own children under 18 years
17 PCT2_017N 2-or-more-person household Nonfamily households
18 PCT2_018N 2-or-more-person household Nonfamily households Male householder
19 PCT2_019N 2-or-more-person household Nonfamily households Female householder

One of the devilish things about Census data is that the meaning of the value in a particular row and column depends upon the structure of a table. Consequently, we cannot look at each value by itself. We have to try to make meaning of the entire table at once.

Glossed details from group PCT2
Index Level One Person Children Family Male Householder Female Householder
1 5 NA NA NA NA NA
2 1 TRUE FALSE FALSE NA NA
3 0 TRUE FALSE FALSE TRUE FALSE
4 0 TRUE FALSE FALSE FALSE TRUE
5 4 FALSE NA NA NA NA
6 3 FALSE NA TRUE NA NA
7 1 FALSE NA TRUE TRUE TRUE
8 0 FALSE TRUE TRUE TRUE TRUE
9 0 FALSE FALSE TRUE TRUE TRUE
10 2 FALSE NA TRUE NA NA
11 1 FALSE NA TRUE TRUE FALSE
12 0 FALSE TRUE TRUE TRUE FALSE
13 0 FALSE FALSE TRUE TRUE FALSE
14 1 FALSE NA TRUE FALSE TRUE
15 0 FALSE TRUE TRUE FALSE TRUE
16 0 FALSE FALSE TRUE FALSE TRUE
17 1 FALSE FALSE FALSE NA NA
18 0 FALSE FALSE FALSE TRUE FALSE
19 0 FALSE FALSE FALSE FALSE TRUE

Note that I use NA to represent the “any possible value” state for a factor. I’m not sure that this is a good choice. It might be better to use an explicit value like ““,”All”, or “*“. It does make things consistent across factors and Boolean fields, though.

Levels of summary

The worst thing about Census data files is that they include both stand-alone observations and subtotals. That is what the Level field is intended to capture. I am very open to suggestions for better terminology.

Using the Level field, we can pull stand-alone rows. Notice that none of these variables have NA in any of their values. That was originally my way of detecting stand-alone rows, but I can imagine a situation where some factor is simply irrelevant, rather than aggregated, so I think it is better to explicitly note each row’s level of aggregation.

Variables of PCT2 that are not subtotals or totals.
Variable Index Level One Person Children Family Male Householder Female Householder
PCT2_003N 3 0 TRUE FALSE FALSE TRUE FALSE
PCT2_004N 4 0 TRUE FALSE FALSE FALSE TRUE
PCT2_008N 8 0 FALSE TRUE TRUE TRUE TRUE
PCT2_009N 9 0 FALSE FALSE TRUE TRUE TRUE
PCT2_012N 12 0 FALSE TRUE TRUE TRUE FALSE
PCT2_013N 13 0 FALSE FALSE TRUE TRUE FALSE
PCT2_015N 15 0 FALSE TRUE TRUE FALSE TRUE
PCT2_016N 16 0 FALSE FALSE TRUE FALSE TRUE
PCT2_018N 18 0 FALSE FALSE FALSE TRUE FALSE
PCT2_019N 19 0 FALSE FALSE FALSE FALSE TRUE

We can also pull rows that are aggregations of other rows’ values. Note that each of these rows will definitely have some NA values. The number of NAs in a row is proportional, but not exactly equal, to its level of aggregation.

Variables of PCT2 that are subtotals or totals.
Index Variable Level One Person Children Family Male Householder Female Householder
1 PCT2_001N 5 NA NA NA NA NA
2 PCT2_002N 1 TRUE FALSE FALSE NA NA
5 PCT2_005N 4 FALSE NA NA NA NA
6 PCT2_006N 3 FALSE NA TRUE NA NA
7 PCT2_007N 1 FALSE NA TRUE TRUE TRUE
10 PCT2_010N 2 FALSE NA TRUE NA NA
11 PCT2_011N 1 FALSE NA TRUE TRUE FALSE
14 PCT2_014N 1 FALSE NA TRUE FALSE TRUE
17 PCT2_017N 1 FALSE FALSE FALSE NA NA