Version 2.1 | First Created July 7, 2021 | Updated December 7, 2023
Chakraborty (2021) investigates the relationships between COVID-19 rates and demographic characteristics of people with disabilities by county in the continental United States. The aim of the study is to investigate whether people with disabilities (PwDs) face disproportionate challenges due to COVID-19. To do so, Chakraborty examines the statistical relationship between county incidence rates of COVID-19 cases and county-level percentages of people with disabilities and different socio-demographic characteristics. Specifically, Chakraborty tests county-level bivariate correlations between COVID-19 incidence against the percentage of disability as one hypothesis, and tests correlation between COVID-19 incidence and percentage of people with disabilities in 18 different socio-demographic categories of race, ethnicity, poverty status, age, and biological sex. Chakraborty then re-tests for the same county-level associations while controlling for spatial dependence. Spatial dependence is controlled by constructing generalized estimating equation (GEE) models using a combination of state and spatial clusters of COVID-19 incidence as to define the GEE clusters. One GEE model is constructed for each of the four types of socio-demographic category: race, ethnicity, age, and biological sex. Chakraborty (2021) finds significant positive relationships between COVID-19 rates and socially vulnerable demographic categories of race, ethnicity, poverty status, age, and biological sex.
This reproduction study is motivated by expanding the potential impact of Chakraborty’s study for policy, research, and teaching purposes. Measuring the relationship between COVID-19 incidence and socio-demographic and disability characteristics can provide important information for public health policy-making and resource allocation. A fully reproducible study will increase the accessibility, transparency, and potential impact of Chakraborty’s (2021) study by publishing a compendium complete with metadata, data, and code. This will allow other researchers to review, extend, and modify the study and will allow students of geography and spatial epidemiology to learn from the study design and methods.
In this reproduction, we will attempt to identically reproduce all of the results from the original study. This will include the map of county level distribution of COVID-19 incidence rates (Fig. 1), the summary statistics for disability and sociodemographic variables and bivariate correlations with county-level COVID-19 incidence rate (Table 1), and the GEE models for predicting COVID-19 county-level incidence rate (Table 2). A successful reproduction should be able to generate identical results as published by Chakraborty (2021).
The reproduction study data and code are available in public a GitHub
repository at github.com/HEGSRR/RPr-Chakraborty2021
and the analysis plans and reports are registered with OSF at https://doi.org/10.17605/OSF.IO/S5MTQ. The reproduction
is implemented with R markdown using the SpatialEpi
package
for the Kulldorff spatial scan statistic packages and the
geepack
package for the generalized estimating
equation.
Chakraborty, J. 2021. Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S. Disability and Health Journal 14:1-5. https://doi.org/10.1016/j.dhjo.2020.101007
COVID-19; Disability; Intersectionality; Race/ethnicity; Poverty; Reproducibility
The aim of this reproduction study is to implement the original study as closely as possible to reproduce the map of county level distribution of COVID-19 incidence rate, the summary statistics and bivariate correlation for disability characteristics and COVID-19 incidence, and the generalized estimating equations. Our two confirmatory hypotheses are that we will be able to exactly reproduce Chakraborty’s results as presented in table 1 and table 2. Stated as null reproduction study hypotheses (RPr-H):
RPr-H1: There is a less than perfect match between Chakraborty’s bivariate correlation coefficient for each disability/sociodemographic variable and COVID-19 incidence rate and our bivariate correlation coefficient for each disability/sociodemographic variable and COVID-19 incidence rate.
RPr-H2: There is a less than perfect match between Chakraborty’s beta coefficient for the GEE of each disability/sociodemographic variable and our beta coefficient for the GEE of each disability/sociodemographic variable.
There are multiple models being tested within each of the two hypotheses. That is, H1 and H2 both encompass five models, including one for each dimension of socio-demographics: race, ethnicity, poverty status, age, and biological sex.
The original study is observational, with the exploratory objective of determining “whether COVID-19 incidence is significantly greater in counties containing higher percentages of socio-demographically disadvantaged [people with disabilities], based on their race, ethnicity, poverty status, age, and biological sex” (Chakraborty 2021).
In the original study, 18 implicit bivariate hypotheses are tested for correlation between COVID-19 cumulative incidence rates and specific categories of PwDs at the county level. Although the original publication does not state null hypotheses for each bivariate correlation, we may formulate the original research hypotheses (OR-H) as follows:
OR-H1.1: There is no correlation between the COVID-19 incidence rate and the percentage of people with disabilities at the county level. OR-H1.2: There is no correlation between the COVID-19 incidence rate and the percentage of white people with disabilities at the county level. … OR-H1.18 There is no correlation between the COVID-19 incidence rate and the percentage of female people with disabilities at the county level.
Five multi-variate hypotheses are tested for associations between COVID-19 cumulative incidence rates and subgroups of PwDs at the county level. Although the original publication does not state null hypotheses for each model, we may formulate them as follows:
OR-H2.1: The percentages of people with disability, categorized by race, are not associated with COVID-19 incidence at the county level when accounting for the state and risk level of COVID-19 clusters. … OR-H2.5: The percentages of people with disability, categorized by gender, are not associated with COVID-19 incidence at the county level when accounting for the state and risk level of COVID-19 clusters.
The spatial extent of the study is the continental United States (48 contiguous states and Washington D.C.) The spatial scale of the analysis is at the county level. Both COVID-19 incidence rates and demographic variables are all measured at the county level. The temporal extent of the COVID-19 data ranges from 1/22/2020 (when John Hopkins began collecting the data) to 8/1/2020 (when the data was retrieved for the original study). The data on disability and sociodemographic characteristics come from the U.S. Census American Community Survey (ACS) five-year estimates for 2018 (2014-2018).
There is no randomization in the original study.
In this reproduction, we reproduce parts of Chakraborty’s study using excess death data from 2020 instead of COVID-19 cases.
The study was originally conducted using SaTScan software to
implement the Kulldorff spatial scan statistic. Other software are not
specified in the publication; however data files suggest and
communication with the author verifies that spatial analysis and mapping
was conducted in ArcGIS, generalized estimating equation (GEE) models
were calculated in SPSS, and the SaTScan software version was
9.6
.
This reproduction study uses R, including the SpatialEpi package for the Kulldorff spatial scan statistics and the geepack package for GEE models.
The American Community Survey (ACS) five-year estimate (2014-2018) variables used in the study are outlined in the table below. Details on ACS data collection can be found at https://www.census.gov/topics/health/disability/guidance/data-collection-acs.html and details on sampling methods and accuracy can be found at https://www.census.gov/programs-surveys/acs/technical-documentation/code-lists.html.
Variable Name in Study | ACS Variable name |
---|---|
percent of total civilian non-institutionalized population with a disability | S1810_C03_001E |
Race | |
percent w disability: White alone | S1810_C03_004E |
percent w disability: Black alone | S1810_C03_005E |
percent w disability: Native American | S1810_C03_006E |
percent w disability: Asian alone | S1810_C03_007E |
percent w disability: Other race | S1810_C03_009E |
Ethnicity | |
percent w disability: Non-Hispanic White | S1810_C03_0011E |
percent w disability: Hispanic | S1810_C03_012E |
percent w disability: Non-Hispanic non-White | (S1810_C02_001E - S1810_C02_011E - S1810_C02_012E) / (S1810_C01_001E - S1810_C01_011E - S1810_C01_012E) * 100 |
percent w disability: Other race | S1810_C03_009E |
Poverty | |
percent w disability: Below poverty level | (C18130_004E + C18130_011E + C18130_018E) / C18130_001E * 100 |
percent w disability: Above poverty level | (C18130_005E + C18130_012E + C18130_019E) / C18130_001E * 100 |
Age | |
percent w disability: 5-17 | S1810_C03_014E |
percent w disability: 18-34 | S1810_C03_015E |
percent w disability: 35-64 | S1810_C03_016E |
percent w disability: 65-74 | S1810_C03_017E |
percent w disability: 75+ | S1810_C03_018E |
Biological sex | |
percent w disability: male | S1810_C03_001E |
percent w disability: female | S1810_C03_003E |
American Community Survey (ACS) data for sociodemographic
subcategories of people with disabilities can be accessed by using the
tidycensus
package to query the Census API. This requires
an API key which can be acquired at api.census.gov/data/key_signup.html.
The original study extent is the lower 48 states and Washington D.C. Therefore, Alaska, Hawai’i and Puerto Rico are removed from the data (workflow step 1). Data on people with disabilities in poverty is derived from a different census table (C18130) than data on people with disabilities and age, race, ethnicity, age, and biological sex (S1810). Therefore, join the poverty data to the other data using the GEOID (workflow step 3). Also transform the ACS geographic data into Contiguous USA Albers Equal Area projection and fix geometry errors.
Calculate independent socio-demographic variables of people with disabilities as percentages for each sub-category of disability (race, ethnicity, poverty, age, and biological sex) and remove raw census data from the data frame (workflow step 4). Reproject the data into an Albers equal area conic projection.
Data on COVID-19 cases from the Johns Hopkins University dashboard
have been provided directly with the research compendium because the
data is no longer available online in the state in which it was
downloaded on August 1, 2020. The dashboard and cumulative counts of
COVID-19 cases and deaths were continually updated, so an exact
reproduction required communication with the original author, Jayajit
Chakraborty, for assistance with provision of data from August 1, 2020.
The data includes an estimate of the total population
(POP_ESTIMA
) and confirmed COVID-19 cases
(Confirmed
). The COVID-19 case data expresses cumulative
count of reported COVID-19 from 1/22/2020 to 8/1/2020. Although metadata
for this particular resource is no longer available from the original
source, one can reasonably assume that the total population estimate was
based on the 2014-2018 5-year ACS estimate, as the 2019 estimates data
had not been released yet.
Versions of the data can be found at the John Hopkins CCSE COVID-19 Data Repository (https://github.com/CSSEGISandData/COVID-19). However, archived data only provides summaries at the national scale. We received the COVID-19 case data through 8/1/2020 at the county level from the author, as there is no readily apparent way to access archived data from the Johns Hopkins University Center for Systems Science Engineering database.
##Load Excess Mortality Data 2020
Load excess mortality data for 2020 from *The Uncounted Lab’s” GitHub repository.
We load the COVID data, which has XY coordinates necessary for creating Kulldorff models later on. We attach it to the death_table.
#Add XY Data (for running kulldorff)
Add the XY data from the covid data to excess death data in order to run Kulldorff spatial scan later.
Join the excess death data to the ACS demographic data.
Map the county level distribution of excess death incidence rates.
Map the county level distribution of COVID-19 incidence rates, comparing to Figure 1 of the original study.
Unplanned deviation for reproduction: We also map the spatial distribution of the percent of people with any disability to improve our understanding of the geographic patterns and relationships of between the overarching independent variable (percentage of people with disability) and the dependent variable (COVID-19 incidence rate).
Calculate descriptive statistics for dependent excess death rate and independent socio-demographic characteristics, reproducing the min, max, mean, and SD columns of original study table 1.
Planned deviation for reanalysis: We also calculate the Shapiro Wilk test for normality.
min | max | mean | SD | ShapiroWilk | p | |
---|---|---|---|---|---|---|
excess_death_rate_2020 | -0.93 | 6.25 | 1.53 | 0.93 | 0.98 | 0 |
dis_pct | 4.28 | 47.86 | 16.02 | 4.49 | 0.98 | 0 |
white_pct | 0.50 | 47.86 | 13.41 | 4.67 | 0.98 | 0 |
black_pct | 0.00 | 23.93 | 1.49 | 2.72 | 0.60 | 0 |
native_pct | 0.00 | 13.89 | 0.28 | 0.95 | 0.28 | 0 |
asian_pct | 0.00 | 3.46 | 0.10 | 0.20 | 0.52 | 0 |
other_pct | 0.00 | 19.29 | 0.73 | 0.86 | 0.60 | 0 |
non_hisp_white_pct | 0.23 | 47.86 | 12.77 | 4.89 | 0.98 | 0 |
hisp_pct | 0.00 | 31.40 | 1.03 | 2.11 | 0.45 | 0 |
non_hisp_non_white_pct | 0.00 | 23.93 | 2.22 | 2.82 | 0.70 | 0 |
bpov_pct | 0.00 | 17.59 | 3.49 | 1.86 | 0.92 | 0 |
apov_pct | 2.60 | 47.86 | 12.63 | 3.20 | 0.97 | 0 |
pct_5_17 | 0.00 | 5.15 | 1.05 | 0.54 | 0.94 | 0 |
pct_18_34 | 0.00 | 6.65 | 1.62 | 0.71 | 0.96 | 0 |
pct_35_64 | 0.00 | 20.64 | 6.18 | 2.32 | 0.95 | 0 |
pct_65_74 | 0.00 | 9.80 | 3.16 | 1.19 | 0.96 | 0 |
pct_75 | 0.34 | 32.48 | 3.97 | 1.39 | 0.86 | 0 |
male_pct | 0.00 | 47.86 | 8.11 | 2.56 | 0.92 | 0 |
female_pct | 0.00 | 18.44 | 7.91 | 2.30 | 0.98 | 0 |
The county-level Pearson’s rho correlation coefficient was used to test association between intra-categorical rates of disability and excess death rates. As this was a parametric test, normality should be tested. A separate hypothesis was formulated for disability in aggregate and for each sociodemographic disability characteristic.
Calculate Pearson’s R Correlation Coefficient of each independent variable and the excess death rate.
variable | r | t | p |
---|---|---|---|
fips | -0.047 | 2.612 | 0.005 |
dis_pct | 0.164 | 9.272 | 0.000 |
white_pct | -0.030 | 1.700 | 0.045 |
black_pct | 0.292 | 17.013 | 0.000 |
native_pct | 0.100 | 5.580 | 0.000 |
asian_pct | -0.139 | 7.802 | 0.000 |
other_pct | 0.015 | 0.814 | 0.208 |
non_hisp_white_pct | -0.070 | 3.927 | 0.000 |
hisp_pct | 0.112 | 6.298 | 0.000 |
non_hisp_non_white_pct | 0.297 | 17.322 | 0.000 |
bpov_pct | 0.233 | 13.345 | 0.000 |
apov_pct | 0.094 | 5.245 | 0.000 |
pct_5_17 | 0.064 | 3.579 | 0.000 |
pct_18_34 | 0.019 | 1.072 | 0.142 |
pct_35_64 | 0.142 | 7.973 | 0.000 |
pct_65_74 | 0.148 | 8.342 | 0.000 |
pct_75 | 0.137 | 7.699 | 0.000 |
male_pct | 0.123 | 6.935 | 0.000 |
female_pct | 0.186 | 10.559 | 0.000 |
state_fips | -0.047 | 2.641 | 0.004 |
cens_pop_est | -0.070 | 3.928 | 0.000 |
covid_death_rate_2020 | 0.781 | 69.646 | 0.000 |
excess_deaths_2020 | 0.145 | 8.142 | 0.000 |
Unplanned Deviation for Reproduction: The dependent and independent variables in this study do not have normal distributions, as shown in the Shapiro-Wilk test results above. Therefore, we deviate from the original study to use the Spearman’s Rho non-parametric correlation test.
Instabilities between the parametric and non-parametic correlations arise from variables with very skewed distributions and/or weak correlations at the county level. Some difference may also be attributable to the 13 counties with data errors in the COVID-19 Incidence Rate. In such distributions, outlier observations have more weight in the parametric Person’s R test than in the non-parametric Spearman’s Rho test.
## Warning in rm(spearmans_rho, pearsons_r, correlations, table1, df): object
## 'correlations' not found
## Warning in rm(spearmans_rho, pearsons_r, correlations, table1, df): object
## 'table1' not found
Although there were no major geographic transformations in
this study, geographic grouping criteria for the generalized
estimating equation (GEE) models are defined as a combination of states
and COVID-19 risk, which is based on the Kulldorff spatial scan
statistic for geographic clusters of high excess death incidence. The
scan statistic in SaTScan used spherical great circle distance
calculations based upon the latitude and longitude coordinates of the
centroid of each county. For this purpose, we have used the
X
and Y
attributes provided as geographic
variables with ACS data.
We use a Kulldorff spatial scan statistic to detect spatial clusters of high excess death incidence (workflow step 6). The statistic uses a Monte Carlo simulation to calculate statistical significance, and therefore may not produce identical results each time.
The original study uses SaTScan software to implement the Kulldorff
spatial scan statistic model. In SaTScan, models may be specified with
many parameters having significant implications for results. The
original manuscript only specifies that Poisson model should be used. We
can also intuit that the model is discrete (locations are stationary and
non-random), and spatial only (there is no temporal dimension). The
author-provided SaTScan results SatScan_results.txt
contains additional parameters which appear to adhere closely to the
software’s default settings. These include the maximum cluster size of
“50 percent of population at risk”, and the “GINI optimized cluster
collection” and “no geographical overlap” options for detecting
secondary clusters. The “P-value Cutoff” for significant clusters option
did not appear in the v9.6 output, suggesting that the software only
allowed the default “no” option for this at the time of the original
study.
SaTScan software can also output two versions of geographic data:
col
cluster polygon shapefile contains a circle for
each cluster, where each polygon is a circle defined by the cluster
center and radius. The attributes include a variable
REL_RISK
for cluster relative riskgis
location point shapefile contains one point for
each county in a cluster. The attributes include variables
LOC_RR
for local relative risk and CLU_RR
for
cluster relative riskThe SaTScan software implementation of the Kulldorff spatial scan statistic calculates two relative risk scores for locations:
REL_RISK
in the
col
cluster polygon shapefile and as CLU_RR
in
the gis
location point shapefile.LOC_RR
in the
gis
location shapefile, and is not calculated in the
col
cluster polygon shapefile.For the purposes of interpreting the spatial scan statistic, a location is a county centroid while a cluster is a collection of counties with high incidence rates, defined in the shape of a circle with a center location (a county centroid) and a radius.
The original study is not clear about using the cluster geographic
data vs the location geographic data or the cluster relative
risk vs local relative risk. However, The author-provided
SatScan_results.txt
results file indicates a geographic
cluster file but no location file, and the author-provided
Aug1GEEdata.csv
data table contains a REL_RISK
field but no CLU_RR
field or LOC_RR
field.
This suggests that in the original study, the col
polygon
cluster shapefile and cluster relative risk were used to
represent COVID-19 risk and define GEE clusters.
The spatial scan statistic is based on case counts and total population, and is therefore unaffected by errors in the COVID Incidence rate.
Planned deviation for reproduction: We opted to use the SpatialEpi package in R, selecting open source software with R integration over SatSCan software, which is free but not open. The Kulldorff spatial scan statistic model in SpatialEpi also supports a discrete Poisson spatial model, and uses the GINI coefficient to select secondary clusters with no geographical overlap that maximize the difference between locations inside of clusters and locations outside of clusters. We expected that this set of software options could reproduce identical results compared to SaTScan.
First, calculate the Kulldorff spatial scan statistic using SpatialEpi. Optionally, skip this code block due to long run times of more than 10 minutes.
We were unable to get the Kulldorff Spatial Scan statistic to work due to an error with the number of breaks. Just to make sure it wasn’t an issue with there being negative values in the excess_death_rate_2020, I took the absolute values and attempted to input those into the model, but it still didn’t run.
Load pre-calculated Kulldorff spatial scan results. Alternatively, skip or modify this code block to use your own version of the SpatialEpi Kulldorff results.
Report Kulldorff spatial scan results.
The SpatialEpi
implementation of Kulldorff spatial scan
statistics provides output in the form of hierarchical lists analogous
to the text output of SaTScan, but does not output a simple data frame
or tabular output analogous to the shapefiles from SaTScan. Therefore,
additional steps are required to append the Kulldorff scan results to
the acs_covid
simple features data frame. This can be done
by assigning unique cluster ID’s to each county within a cluster.
Clusters include the county at the center of a cluster and all of the
other counties within the cluster radius. Therefore, we use the FIPS
code of the county at the center of each cluster as the unique cluster
ID.
Unplanned deviation for reproduction: The original study does not include visualizations of the spatial structure and distribution of COVID-19 clusters.
First, we must join the Kulldorff spatial scan cluster IDs to the acs_covid simple features dataframe. Although this was planned in workflow step 9, the order of operations between steps 9 and steps 7 and 8 is not important.
Next, calculate a new field isCluster
to identify
counties in COVID-19 clusters. Additionally, distinguish between
counties defining the center of a cluster from counties constituting
other parts of a cluster by comparing the cluster ID (equivalent to the
center county’s fips code) to the county fips code.
Planned deviation for reproduction: Map the
SpatialEpi
cluster results.
Unplanned deviation for reproduction: The
SpatialEpi
implementation of Kulldorff spatial scan
statistics does not calculate local relative risk or cluster relative
risk. Therefore, the next step is to calculate local and cluster
relative risk (workflow step 7).
Classify relative risk on a scale from 1 to 6 (workflow step 8). Risk is classified according to this table:
Relative Risk Values | Relative Risk Class |
---|---|
Outside of cluster | 1 |
RR < 1 | 1 |
1 <= RR < 2 | 2 |
2 <= RR < 3 | 3 |
3 <= RR < 4 | 4 |
4 <= RR < 5 | 5 |
5 <= RR < 6 | 6 |
Counties falling outside of any cluster are assigned a score of 1.
Unplanned deviation for reproduction: It would be helpful to visualize the spatial distributions of local relative risk classes and Kulldorff cluster relative risk classes in advance of using these classes to control for spatial heterogeneity in GEE models.
First, map the spatial distribution of local relative risk score classifications.
Next, map the cluster relative risk scores for comparison. Note that
following the original study classification methodology, counties
outside of clusters are assigned the lowest risk class of
1
.
Comparing the cluster and local relative risk classifications for regions like the Southeast, it is apparent that some areas of high risk are represented with large clusters that have an averaging effect on the cluster-based relative risk score. This effect is more pronounced for clusters with low compactness (e.g. the Southeast cluster stretched over the “black belt” region from Louisiana and Arkansas to Georgia) than clusters with higher compactness (e.g. New York City) because the circular shape of clusters includes more low-risk counties.
The original study did not directly report any results from the
Kulldorff spatial scan statistic. However, the Kulldorff cluster
relative risk scores were combined with states to create clusters for
GEE models, hereafter called “GEE clusters”. The original study reported
102
unique GEE clusters having a range of 1
to
245
counties in each cluster.
In order to compare results, we first create cluster IDs as combinations of the state ID and COVID relative risk class. The first clustering ID (State) and second clustering score (COVID relative risk class) were combined to form IDs for each unique combination of state and relative risk class. Then, we find the number of unique clusters and frequency counties per cluster in our reproduction study for comparison to the original study.
We failed to reproduce the same configuration of GEE clusters as the original study, finding 9 more clusters than the original study and a much smaller maximum cluster of 159 counties compared to 245 counties.
SpatialEpi Clusters
The differences between these three approaches have very significant impacts on the results (see the differences in results in the two maps above) and it is impossible to control for all of the differences with the available parameters. Most fundamentally, SaTScan develops sets of secondary clusters from a universe of just one most likely cluster per location with no default limitation its statistical significance, whereas SpatialEpi may consider multiple possible cluster sizes for each location with a default limitation of maximum 0.05 p for each cluster. These fundamental differences are evident in the spatial distribution of clusters. For example, New York City is the most likely cluster in all analyses. For counties near New York, the radius of the most likely cluster is large and geographically overlaps New York City. Therefore, if only the most significant cluster radius is considered as a possible secondary cluster for counties near New York City, all such clusters are disqualified by their geographical overlap. This is what happens in the SaTScan Hierarchical Clusters model, for which the next nearest clusters are in Ohio and Virginia. In the SaTScan GINI-optimized model, the maximum cluster size is apparently smaller, such that the most likely cluster in New York City is also smaller. This change allows for two other non-overlapping secondary clusters in Rhode Island and New Jersey. In contrast, the SpatialEpi algorithm still considers a variety of possible cluster sizes for each county, allowing for detection of smaller clusters adjacent to more significant ones.
Of course, the relative risk score of each cluster is contingent on the cluster size, so each difference in geographic configuration of clusters also impacts the cluster risk classification of individual locations. The most stable results are for the most likely clusters in distinct regions (New York City, Southeast U.S., Southern California & Nevada), while the most variability appears for secondary clusters close enough for their most likely radius to overlap the more likely clusters. The circular shape could be considered a major limitation of Kulldorff cluster detection, for which the SaTScan methodology enhances the limitation by constraining the possibility of nearby clusters while the SpatialEpi methodology can detect smaller adjacent secondary clusters.
The high significance threshold in the default SaTScan analysis
allows the inclusion of many small clusters with low likelihoods, adding
noise to the results. This could be controlled by overriding the maximum
p value parameter. Combining all of the default parameters,
SaTSCan includes small clusters of relatively low-risk counties in the
Midwest, but excludes relatively high-risk counties adjacent to the
major clusters of New York, the Southeast, and Southern
California/Nevada. This problem does not exist in the SpatialEpi
implementation, and the SpatialEpi parameter alpha level
parameter cannot be practically increased to 1
to match the
SatSCan default. This is because SpatialEpi does not filter counties by
those with local relative risk greater than 1—therefore an alpha
value of 1
results in all counties being
included as clusters.
In sum, there are three construct validity issues with the original study’s COVID-19 high-risk clusters as implemented in SaTScan.
#GEE models ## Preprocess data for GEE modelling
Unplanned deviation for reanalysis: Based on the three observations above, we think that it would be more valid to choose one set of secondary clusters based on a single method rather than combining a set of hierarchical clusters with a set of GINI optimized clusters. We also think that it would be more valid to include risk levels for all counties within a cluster (i.e. all counties within any of the circles above), rather than only the county at the center of a cluster. Finally, we think it would be more valid to treat clusters as a single category rather than five tiers of above-normal risk.
To complete the reproduction/reanalysis study, we will therefore calculate and compare multiple versions of the GEE models:
First, calculate GEE cluster IDs.
We have already calculated: - rp_clusID
based on our
SpatialEpi clusters - ss_clusID
based on our SaTScan
cluster centers, and shown to be identical to the original author’s data
- gini_clusID
based on our SaTScan GINI-optimized
clusters
Second, filter the data for non-zero COVID-19 rates and z-score standardize the independent variables. This accomplishes step 10 of the workflow diagram.
Unplanned deviation for reproduction: We assumed that we should filter for COVID rates > 0 first and then calculate z-scores, however after comparing data in the next code block, we realized that the original study had first calculated z-scores and then filtered for COVID rates > 0. Therefore, to align with the original study, in the next code block we first calculate z-scores and then filter for COVID rates > 0.
If skipping kulldorff, start running code here
When we had filtered for COVID rates > 0 first and then z-score standardized second, the means of differences ranged from -0.012 to 0.004, and standard deviations of differences ranged from 0.000 to 0.016.
After changing the order to first z-score standardize and then filter for COVID rates > 0, we observed no mean difference between our reproduced variables and the original variables, and we find no standard deviation > 0.001 for the difference between reproduction independent variables and original variables. There are no major differences between the independent variables.
Optionally, you may save the preprocessed data to
data/raw/public/gee_data.gpkg
Optionally, you may load the preprocessed data from
data/raw/public/gee_data.gpkg
The generalized estimating equation (GEE) models were used to test association between intra-categorical rates of disability and COVID-19 incidence rates while accounting for spatial clustering. A separate hypothesis was formulated for each type of subcategorization of PwDs, numbered H2.1 through H2.5 in Table 4.
As specified by the author, “GEEs extend the generalized linear model to accommodate clustered data, in addition to relaxing several assumptions of traditional regression (i.e., normality)”. Additionally, the author noted that “clusters of observations must be defined based on the assumption that observations within a cluster are correlated while observations from different clusters are independent.” All five GEE models were specified with exchangeable correlation matrices, gamma distributions, and logarithmic link function. These specifications were chosen after testing each alternative and choosing the models with the best quasilikelihood under the independence model criterion (QIC).
This accomplishes the step 11 of the workflow diagram.
Generalized Estimating Equation parameters:
“The ‘exchangeable’ correlation matrix was selected for the results reported here, since this specification yielded the best statistical fit based on the QIC (quasi- likelihood under the independence) model criterion.” (Chakraborty 2021, Methods paragraph 5)
“The gamma distribution with logarithmic link function was chosen for all GEEs since this model specification provided the lowest QIC value.” (Chakraborty 2021, Methods paragraph 5)
Calculate GEE models with: - Clustering: Reproduced SpatialEpi clusters & State ID - Dependent variable: excess death incidence.
Calculate variance, covariance, correlation, and weights
Map each county’s weight in GEE model.
Looking at our maps of excess deaths and disability for 2020, there does not appear to be much of a correlation between the two at a glance. There are high rates of disability in regions such as New England, and especially upper Maine, as well as in swaths of the South and in the Pacific Northwest (PNW). However, there are not high excess death incidences in New England, in the PNW, nor in any big metropolitan areas, such as along the East Coast. However, there were high concentrations of excess deaths in parts of Texas and in other parts of the south.
Interestingly, in Nevada, there seemed to be notably more instances of excess deaths than in its immediate neighbors, and in some counties in southern Nevada, there were also high rates of disability. However, visually there does not appear to be a perfect correlation.
The county-level Pearson’s rho correlation coefficient was used to test association between intra-categorical rates of disability and excess death rates. There were negative correlations between excess death rates and percentages of white, Asian, and non-Hispanic white populations.
One potential issues with this reproduction is spatial heterogeneity; there are distinct areas where the pandemic spread early on as opposed to later, once it had spread widely throughout the entire country. By nature, our reproduction will mitigate this issue because we will be reproducing the study at a later date (2020 instead of 2018), after the pandemic had spread widely throughout the United States.
Another potential threat to validity would include scale dependency, which refers to the idea of a phenomena in one place being influenced by neighboring locations. Chakraborty was interested in the relationship between disabilities and the spread of COVID-19, however, it is likely that the spread of COVID-19 was not entirely dependent on disability, but rather in part physical proximity of places. This is to say that nearby places likely transmitted COVID-19 between them than places further apart.