A Technical Overview of the District Density Index
By Alexander Chen, CDHD Data Analyst
This is the technical background to how we calculated the 2022 CDHD District Density Index, introduced here. But what exactly do we mean by district density?
US congressional districts are drawn to reflect principles that include population parity and compactness. In other words, each district in a state should have roughly the same number of people and should be contiguous and as compact as possible. However, categorizing congressional districts by population and population density is challenging; there are no clear definitions for rural, suburban, and urban areas. From a health policy perspective, understanding differences in population density across congressional districts can inform the need for, implementation of, and evaluation of federal public health programs.
The 118th Congress is notably different from the 117th Congress, as it reflects redistricting changes as a result of 2020 Decennial Census block population counts, as mandated by Public Law (PL) 94-171. Redistricting changes ranged from no changes (in states with only one at-large district) to substantial change (redrawing, gaining or losing districts, or assigning a district to represent a different part of the state).
CityLab constructed a Congressional District Density Index (CDI) in 2018 to reflect the 116th Congress, whose district boundaries differ from those of the current 118th Congress. The District Density Index (DDI) therefore reflects the updated congressional district boundaries and district allocation.
We constructed the DDI using population and household census tract data from the 2020 American Community Survey. We used a fuzzy c-means clustering algorithm to estimate each district’s degree of membership in six density clusters based on district population, the number of households, and tract area. As tracts do not nest perfectly within congressional districts, we constructed partial tracts reflecting the area of each tract that was also in the congressional district, and assigned block-level population and household counts to those partial tracts. We then calculated the partial tracts’ areas and calculated density from those values.
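The partial-tract step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the derivation code itself: the block records, the areas, the units, and the function name `partial_tract_densities` are all made up for the example, and a real workflow would compute the intersection areas with GIS tooling.

```python
from collections import defaultdict

# Hypothetical block records: (tract_id, district_id, population, households).
# Real inputs would be census block counts; these values are invented.
blocks = [
    ("T1", "D1", 120, 50),
    ("T1", "D1", 80, 30),
    ("T1", "D2", 200, 90),   # tract T1 straddles districts D1 and D2
    ("T2", "D2", 400, 160),
]

# Hypothetical area (square miles) of each tract-district intersection.
partial_area_sq_mi = {("T1", "D1"): 2.0, ("T1", "D2"): 0.5, ("T2", "D2"): 1.0}


def partial_tract_densities(blocks, areas):
    """Sum block counts per (tract, district) piece, then divide by its area."""
    totals = defaultdict(lambda: [0, 0])
    for tract, district, pop, households in blocks:
        totals[(tract, district)][0] += pop
        totals[(tract, district)][1] += households

    result = {}
    for key, (pop, households) in totals.items():
        area = areas[key]
        result[key] = {
            "population": pop,
            "households": households,
            "pop_density": pop / area,
            "household_density": households / area,
        }
    return result


densities = partial_tract_densities(blocks, partial_area_sq_mi)
```

Because each block falls entirely inside one tract and one district, summing blocks per tract-district pair avoids having to split tract-level counts by area.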
Taking a step back—what exactly is fuzzy c-means clustering? And how are we applying it in this context?
Fuzzy c-means clustering is an unsupervised algorithm that is conceptually similar to k-means clustering. It uses model inputs to assign observations (in this instance, districts) to clusters based on their distance to each cluster’s center. Unlike k-means clustering, however, c-means clustering assesses the degree to which a point belongs to every cluster, rather than deterministically assigning each point to a single cluster. C-means clustering acknowledges that an observation may not fit perfectly into only one cluster, which is why it is called fuzzy.
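For readers who want to see the mechanics, here is a minimal sketch of the textbook fuzzy c-means loop in plain Python. It is not the code used to build the DDI. The toy points, the fuzziness exponent `m=2.0`, the deterministic farthest-first starting centers, and all function names are assumptions chosen for illustration (library implementations typically start from random values).

```python
import math


def memberships(points, centers, m=2.0):
    """Degree of membership of every point in every cluster (rows sum to 1)."""
    exponent = 2.0 / (m - 1.0)
    table = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # A point sitting exactly on a center belongs fully to it.
            hit = dists.index(0.0)
            table.append([1.0 if i == hit else 0.0 for i in range(len(centers))])
            continue
        inv = [d ** -exponent for d in dists]
        total = sum(inv)
        table.append([v / total for v in inv])
    return table


def update_centers(points, table, n_clusters, m=2.0):
    """Each center is the mean of all points, weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for i in range(n_clusters):
        weights = [row[i] ** m for row in table]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ))
    return centers


def farthest_first(points, n_clusters):
    """Illustrative deterministic start: spread the initial centers out."""
    centers = [points[0]]
    while len(centers) < n_clusters:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6):
    """Alternate membership and center updates until the centers stop moving."""
    centers = farthest_first(points, n_clusters)
    for _ in range(n_iter):
        table = memberships(points, centers, m)
        new_centers = update_centers(points, table, n_clusters, m)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)


# Toy data: two tight groups plus one point halfway between them.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
          (3.0, 3.0)]
centers, table = fuzzy_c_means(points, n_clusters=2)
```

In this toy run, points inside a group end up with a membership close to 1 for their own cluster, while the halfway point at `(3.0, 3.0)` receives roughly equal membership in both, which is the "fuzzy" behavior described above.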
Think of it this way: let’s say you have a slightly underripe banana that’s both yellow and green. Under k-means clustering, the banana would be assigned to only the yellow cluster or only the green cluster, depending on how close it is to optimal ripeness. Under c-means clustering, if the banana were a couple of days from being ripe, it would receive a degree of membership in each cluster: say, 0.7 for the yellow cluster and 0.3 for the green cluster. Each degree ranges from 0 to 1 and, in the standard formulation of fuzzy c-means, the degrees for one observation sum to 1 across all clusters. They are best read as the extent to which the observation resembles each cluster rather than as probabilities of belonging to it. For both methods, cluster centroids are derived from the data themselves and are not set in advance. So, depending on how many variables are used in classification, the clusters “move” around in the space defined by those variables.
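The banana analogy can be made concrete with a tiny sketch. Everything here is hypothetical: the one-dimensional color scale, the fixed cluster centers, the banana’s color reading, and the function names. The fuzzy degrees use the standard fuzzy c-means membership formula with exponent `m=2.0`, which normalizes the degrees across clusters.

```python
def hard_label(value, centers):
    """k-means style: pick the single nearest cluster center."""
    return min(centers, key=lambda name: abs(value - centers[name]))


def fuzzy_degrees(value, centers, m=2.0):
    """Fuzzy c-means style: a degree of membership in every cluster.

    Assumes the value does not sit exactly on a center (distance zero).
    """
    exponent = 2.0 / (m - 1.0)
    inv = {name: abs(value - c) ** -exponent for name, c in centers.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}


# Hypothetical 1-D color scale: 0.0 is fully green, 1.0 is fully yellow.
centers = {"green": 0.0, "yellow": 1.0}
banana = 0.6  # made-up color reading for a slightly underripe banana

label = hard_label(banana, centers)       # "yellow"
degrees = fuzzy_degrees(banana, centers)  # yellow is about 0.69, green about 0.31
```

The hard label throws away the fact that the banana is still noticeably green; the fuzzy degrees keep that information.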
Though we ultimately assigned districts to density clusters based on each district’s highest degree of similarity, we also examined districts that could have been assigned to more than one cluster: districts on the “cusp”. These tended to be districts that seem quite exurban or that had substantial density differences within the district itself. There were no noticeable regional or state trends, but it is worth acknowledging that these districts exist. This can explain why you might see a district categorized as one density type but with characteristics of another.
For example, Ohio’s 9th Congressional District (OH-09), which covers a long and thin portion of Ohio bordering Lake Erie, is categorized as a Sparse Suburban district. This district includes areas such as suburban Toledo, but also represents quite rural areas, such as Ottawa County. The population distribution across OH-09 led it to appear most similar to Sparse Suburban districts (similarity index: 0.448), but it also demonstrated similarities to Rural-Suburban Mix districts (similarity index: 0.409). Notably, it exhibits similarities to Ohio’s 5th Congressional District (OH-05), which includes areas directly surrounding Toledo, stretches from Lake Erie to the Indiana border, and directly borders OH-09.
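One simple way to flag "cusp" districts like OH-09 is to compare the two highest similarity values. The sketch below is an assumption-laden illustration rather than the rule we applied: the function name `assign_with_cusp_flag` and the 0.05 gap threshold are hypothetical, and only the two OH-09 values reported above are used.

```python
def assign_with_cusp_flag(degrees, cusp_gap=0.05):
    """Return (best cluster, runner-up, on_cusp) for one district.

    cusp_gap is a hypothetical threshold: a district counts as "on the cusp"
    when its top two similarity values are closer together than this.
    """
    ranked = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
    (best, best_u), (second, second_u) = ranked[0], ranked[1]
    return best, second, (best_u - second_u) < cusp_gap


# The two similarity values reported for OH-09 in the text. The values for
# the other four clusters are not reported, so they are left out here.
oh09 = {"Sparse Suburban": 0.448, "Rural-Suburban Mix": 0.409}

best, runner_up, on_cusp = assign_with_cusp_flag(oh09)
```

With these inputs, OH-09 is assigned to Sparse Suburban, with Rural-Suburban Mix as the runner-up, and it is flagged as on the cusp because the two values differ by only about 0.04.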
You can access our derivation code (in an R Markdown file) here. Reach out to us at [email protected] with any questions, suggestions, or bugs you notice.