Population data are often available at various levels of aggregation, such as census tracts or block groups in the United States or output areas in the United Kingdom. These units are often drawn for convenience and not because they represent some commonly accepted area in residents’ lived experiences. Despite this, researchers often use them as proxies for neighborhoods since they come with readily available data and the techniques for estimating population values for overlapping but incongruent areas, such as a shapefile of neighborhood boundaries for a city, are less accessible. For researchers who do wish to move past these proxies and produce population estimates for features of interest, the lack of accessibility means that they typically resort to developing their own, often manual, implementations of the areal weighted interpolation workflow (Qiu, Zhang, and Zhou 2012).
Unlike interpolation techniques, like inverse distance weighted interpolation, that are appropriate for point data, areal weighted interpolation is designed specifically for working with already aggregated data to some set of polygon spatial features (Lam 1983). Estimating population values from these data to an overlapping but incongruent set of polygon features is known as the polygon overlay problem (Goodchild 1978), with the original data known as the “source” data and the overlapping set of features known as the “target” data (Markoff and Shapiro 1973). The estimation process vary based on whether data are spatially extensive (i.e. count data) or spatially intensive (i.e. ratios, percentages, means, or medians; Goodchild and Lam 1980).
In estimating values for the target features, areal weighted interpolation makes a significant assumption that individuals are evenly spread throughout the source features (Qiu, Zhang, and Zhou 2012). This assumption often breaks down in practice - a census tract with a large park in it, for example, or a neighborhood that has a significant commercial area alongside residential housing. While more complex methods do exist to compensate for violating this assumption, there are few tools to implement them, and the strategies have mainly remained the focus of academics instead of GIS practitioners (Langford 2007; Qiu, Zhang, and Zhou 2012).
The areal
package aims to reduce the barriers to the implementation of areal interpolation in spatial researchers’ work. By implementing the process in R
, we aim to improve the reproducibility of the interpolation process, which is often done manually using point-and-click desktop GIS software. We also aim to provide additional functionality in comparison to the one existing approach found in the sf
package (Pebesma 2018), the st_interpolate_aw()
function, while providing support for modern data management (e.g. tidyverse
) and spatial data (e.g. sf
) frameworks.
Therefore, we provide functions that:
The validation process checks for five conditions that must be met prior to interpolation:
sf
objects?This validation process can be run independently by end users via the ar_validate()
function and is also run initially when aw_interpolate()
is called. The validation process is aimed at helping users new to both R
and areal interpolation ensure they have arranged their data correctly.
Unlike existing approaches, the areal
package offers the ability to interpolate multiple variables in a single function call and provides a wider set of options for making these estimates. For spatially extensive interpolations, we offer a formula that matches the existing functionality in sf
, which is based on the total area of the original source feature (specified with weight = "total"
). Alternatively, we offer a second formula that is more suitable for data where there should be complete overlap between the source and target data, but there is not. Such incongruity could be due to data quality issues or variation in how different sources represent particular geographies. This uses a sum of the source feature areas remaining after the data are intersected as part of the spatial weight calculation process (specified with weight = "sum"
).
Our package also provides a manual approach for calculating estimates alongside the primary function aw_interpolate()
. This is important for users who need to unpack the interpolation workflow and diagnose data issues that occur during interpolation as well as programmers who want to use a portion of the workflow in their software. Finally, areal
provides both spatial (sf
object) and a-spatial (tibble
object) options for output.
Once data have been prepared for interpolation, meaning they share the same projected coordinate system and there are no conflicts between variable names, interpolations can be calculated with a single function aw_interpolate()
. This functionality is illustrated here using source and target data built into the areal
package:
> aw_interpolate(ar_stl_wards, tid = WARD, source = ar_stl_race, sid = GEOID,
+ weight = "sum", output = "tibble",
+ extensive = c("TOTAL_E", "WHITE_E", "BLACK_E"))
# A tibble: 28 x 6
OBJECTID WARD AREA TOTAL_E WHITE_E BLACK_E
<dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 46138761. 7992. 153. 7779.
2 2 2 267817711. 12145. 1323. 10639.
3 3 3 66291644. 7344. 591. 6635.
4 4 4 53210707. 8458. 160. 8203.
5 5 5 60462396. 8783. 1526. 7056.
6 6 6 64337271. 14050. 5840. 7439.
7 7 7 101268146. 15840. 8220. 6629.
8 8 8 45966410. 12188. 7604. 3796.
9 9 9 73993891. 14217. 6838. 6413.
10 10 10 62915358. 11239. 8703. 1667.
# … with 18 more rows
If a sf
object is desired, changing the output
argument in the example above to "sf"
will provide not only the estimated values but the relevant geometric data as well.
Spatially intensive interpolations, which are used by analysts for data that are ratios or percentage values, can be calculated by replacing the extensive
argument with intensive
. Alternatively, both arguments and corresponding vectors of variable names can be provided to interpolate both spatially extensive and intensive data simultaneously.
The source data from the above data along with one of the variables, total population (TOTAL_E
), are mapped in panel A below at the census tract level. The resulting data for the total population variable interpolated to the target political ward data have been mapped in panel B:
Additional examples and vignette illustrations of areal
’s functionality are available on the package’s website.
Since sf
offers similar functionality, three comparisons are made and presented here. Code for these comparisons is available in an appendix within the areal
GitHub repository. The first compares spatially extensive interpolations calculated in both areal
and sf
. The first ten features are shown here for comparison:
Both st_interpolate_aw()
and aw_interpolate()
with the weight = "total"
produce identical results, which is expected.
The above comparison also shows the difference between using "total"
and "sum"
for the weight
argument in aw_interpolate()
. If we expect that our source and target data should overlap completely (but perhaps do not because of data quality issues), the "sum"
approach will allocate all individuals into target features whereas "total"
will not, instead allocating only a proportion of individuals relative to the overlap between the source and target features. In the example data provided in the areal
package, this is the case - the Census tracts do not perfectly map onto the extent of the wards, and so the "total"
approach is inappropriate.
The differences are highlighted in the delta
column in the above figure. With these data, the "total"
approach yields generally smaller results that the "sum"
approach, with an average difference of -30.781. The largest difference between the "total"
and "sum"
approaches was -194.750 individuals. The added functionality within areal
therefore provides a potentially more accurate set of estimations in similar cases where data should overlap, but do not.
The second comparison is with spatially intensive interpolations. For this approach, both st_interpolate_aw()
and aw_interpolate()
should yield identical results. As before, the first ten features are shown here for comparison:
As with the extensive interpolations where weight = "total"
, both st_interpolate_aw()
and aw_interpolate()
produce the expected identical results.
Finally, a comparison of the speed at which both the sf
and areal
packages calculated these values is included in the appendix as well. For the extensive interpolations, aw_interpolate()
performed 28 milliseconds slower on the sample data provided in the package, with a mean execution time of 279.429 milliseconds. Similarly, the intensive interpolations were 20 milliseconds slower when using areal
, which had a mean execution time of 267.878 milliseconds. Both comparisons were made using the R
package microbenchmark
.
One particular performance concern to note are intersections (a step in the interpolation process) that return geometry collections. These will cause st_interpolate_aw()
function from sf
to error. The aw_interpolate()
function will not error, but instead will correctly address these geometry collections before proceeding with the interpolation. However, the correction process can take significantly longer than a similar data set that does not require this additional step.
The appendix contains an example of interpolation population from Missouri’s 115 counties into its 4,506 Census block groups. When evaluated with microbenchmark
, the average length of time to complete this process was 17.689 seconds.
areal
is open source software made available under the GNU GPL-3 license. It can be installed through CRAN (Prener and Revord 2019) using: install.packages("areal")
. areal
can also be installed from its GitHub repository using the remotes
package: remotes::install_github("slu-openGIS/areal")
.
The authors wish to thank their colleagues, J.S. Onésimo Sandoval, Ph.D. and Taylor Harris Braswell, M.A., for reviewing an initial draft of this manuscript.
Goodchild, Michael F. 1978. “Statistical Aspects of the Polygon Overlay Problem.” Harvard Papers on Geographic Information Systems. Addison-Wesley.
Goodchild, Michael F, and Nina Siu Ngan Lam. 1980. Areal Interpolation: A Variant of the Traditional Spatial Problem. Department of Geography, University of Western Ontario London, ON, Canada.
Lam, Nina Siu-Ngan. 1983. “Spatial Interpolation Methods: A Review.” The American Cartographer 10 (2). Taylor & Francis: 129–50.
Langford, Mitchel. 2007. “Rapid Facilitation of Dasymetric-Based Population Interpolation by Means of Raster Pixel Maps.” Computers, Environment and Urban Systems 31 (1). Elsevier: 19–32.
Markoff, John, and Gilbert Shapiro. 1973. “The Linkage of Data Describing Overlapping Geographical Units.” Historical Methods Newsletter 7 (1). Taylor & Francis: 34–46.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
Prener, Christopher, and Charles Revord. 2019. “Areal: Areal Weighted Interpolation.” 2019. https://CRAN.R-project.org/package=areal.
Qiu, Fang, Caiyun Zhang, and Yuhong Zhou. 2012. “The Development of an Areal Interpolation Arcgis Extension and a Comparative Study.” GIScience & Remote Sensing 49 (5). Taylor & Francis: 644–63.