Prevalence mapping for DHS indicators using the surveyPrev package

Qianyu Dong, Zehang Richard Li, Yunhan Wu, Andrea Boskovic, Jon Wakefield

2026-06-03

1 Overview

In this vignette, we describe how prevalence mapping using DHS data can be carried out. Prevalence mapping is Small Area Estimation (SAE) for the proportion of a population with some indicator of interest. The vignette is organized as follows. In Section 2 we first describe the data preparation steps to obtain and process the relevant DHS survey data in R. We then describe three classes of models:

  1. Direct estimates in Section 3.
  2. Area-level Fay-Herriot model in Section 4.
  3. Cluster-level model in Section 5.

Then in Section 6 we show several tools for visualizing the different estimates, and in Section 7 we discuss how to aggregate the results for a given model to produce estimates at higher admin levels. For a more detailed discussion of the workflow of SAE analysis using the surveyPrev package, see Dong et al. (2026).

2 Data Preparation

2.1 Prerequisites

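The following pieces of data are needed to produce SAE for DHS surveys: the DHS survey data containing the indicator of interest (Section 2.5), and the spatial information, namely the GPS locations of the sampled clusters and the administrative boundary maps (Section 2.6).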

2.2 Install surveyPrev

The latest surveyPrev package can be installed from CRAN with

install.packages("surveyPrev")

The development version of the surveyPrev package can be installed from GitHub with

library(devtools)
install_github("richardli/surveyPrev")

After installation, the package can be loaded with

library(surveyPrev)

One of the key required packages, INLA, is not available on CRAN or GitHub and needs to be installed with the following code (Rue, Martino, and Chopin 2009).

install.packages("INLA", repos=c(getOption("repos"), 
                  INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
library(INLA)

The following packages are needed to run the example analysis described in this vignette.

library(geodata)
library(sf)
library(SUMMER)
library(rdhs)
library(ggplot2)
library(patchwork)
library(dplyr)
library(tidyr)
library(kableExtra)
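
Before running the examples, it can help to check which of these packages are already installed. The following is a minimal sketch using only base R; the object names pkgs, installed and missing_pkgs are chosen here for illustration and are not part of surveyPrev.

```r
# Packages named in this vignette (INLA is installed from its own repository)
pkgs <- c("surveyPrev", "INLA", "geodata", "sf", "SUMMER", "rdhs",
          "ggplot2", "patchwork", "dplyr", "tidyr", "kableExtra")

# requireNamespace() returns FALSE, rather than raising an error,
# when a package is not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages used in this vignette are installed.")
}
```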

2.3 Built-in indicators

Currently, the surveyPrev package supports 182 indicators. The full list of indicators and their IDs can be found in the indicatorList dataset. The ID column corresponds to the standardized indicator IDs used in the DHS API, while the Description and Topic columns give a short description and the DHS chapter each indicator belongs to. In previous versions of surveyPrev, a small set of indicators could also be referred to by shorter alternative IDs (for example, ancvisit4+); these alternative IDs are no longer used, and the standard DHS indicator IDs should be used instead. The first 20 entries of the list are shown in Table 2.1.

data(indicatorList)
indicatorList[1:20, ]
Table 2.1: List of built-in indicators in the surveyPrev package.
ID | Description | Topic
CH_SZWT_C_L25 | Birth weight: Less than 2.5 kg | Chapter 10 - Child Health
CH_DIAT_C_ORT | Diarrhea treatment (Children under five with diarrhea treated with either ORS or RHF) | Chapter 10 - Child Health
CH_DIAT_C_ABI | Treatment of diarrhea: Antibiotics | Chapter 10 - Child Health
CH_DIAT_C_ADV | Treatment of diarrhea: Advice or treatment was sought | Chapter 10 - Child Health
CH_DIAT_C_AMO | Treatment of diarrhea: Antimotility drugs | Chapter 10 - Child Health
CH_ARIS_C_ADV | Children with ARI for whom advice or treatment was sought | Chapter 10 - Child Health
CH_DIAT_C_ORS | Treatment of diarrhea: Oral rehydration solution (ORS) | Chapter 10 - Child Health
CH_DIAT_C_OSI | Treatment of diarrhea: ORS or increased fluids | Chapter 10 - Child Health
CH_DIAT_C_RHF | Treatment of diarrhea: Recommended home fluids (RHF) at home | Chapter 10 - Child Health
CH_DIAT_C_NOT | Treatment of diarrhea: No treatment | Chapter 10 - Child Health
CH_FEVR_C_FEV | Children with fever in the last two weeks | Chapter 10 - Child Health
CH_FEVT_C_ADV | Children with fever for whom advice or treatment was sought | Chapter 10 - Child Health
CH_VACC_C_BAS | Children age 12-23 months with all 8 basic vaccinations | Chapter 10 - Child Health
CH_VACC_C_BCG | BCG vaccination received | Chapter 10 - Child Health
CH_VACC_C_PN3 | Pneumococcal 3 vaccination received | Chapter 10 - Child Health
CH_VACC_C_MSL | Percentage of children 12-23 months who had received MCV 1 (Measles containing vaccine) | Chapter 10 - Child Health
CH_VACC_C_NON | Children 12-23 months with no vaccinations | Chapter 10 - Child Health
CH_VACC_C_DP1 | Percentage of children 12-23 months who had received DPT 1 vaccination | Chapter 10 - Child Health
CH_VACC_C_DP3 | Percentage of children 12-23 months who had received DPT 3 vaccination | Chapter 10 - Child Health
CH_DIFP_C_FAL | Feeding practices during diarrhea: ORT and continued feeding | Chapter 10 - Child Health
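
To find the ID of an indicator, the Description column can be searched by keyword. The sketch below uses a small toy data frame (toy_list, a name introduced here) built from four rows of Table 2.1 so that it runs without the package; with surveyPrev loaded, the same subsetting should apply to indicatorList, assuming it is an ordinary data frame with the three columns shown.

```r
# A few rows copied from Table 2.1; with surveyPrev loaded, `indicatorList`
# could be used in place of this toy data frame
toy_list <- data.frame(
  ID = c("CH_SZWT_C_L25", "CH_DIAT_C_ORS", "CH_VACC_C_BCG", "CH_VACC_C_MSL"),
  Description = c(
    "Birth weight: Less than 2.5 kg",
    "Treatment of diarrhea: Oral rehydration solution (ORS)",
    "BCG vaccination received",
    "Percentage of children 12-23 months who had received MCV 1 (Measles containing vaccine)"
  ),
  Topic = "Chapter 10 - Child Health",
  stringsAsFactors = FALSE
)

# Case-insensitive keyword search over the Description column
hits <- toy_list[grepl("vaccin", toy_list$Description, ignore.case = TRUE), ]
hits$ID
```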

In this vignette, we consider estimating the percentage of women who had a live birth in the five years before the survey who had four or more antenatal care visits. This indicator has the standard DHS ID RH_ANCN_W_N4P, which we use throughout this vignette.

2.4 Customized indicators

Details on more common DHS indicators can be found in the Guide to DHS Statistics at https://dhsprogram.com/data/Guide-to-DHS-Statistics/. User-specified functions can also be used to process raw DHS data into new indicators. Please refer to the vignette on Creating Customized Indicators for surveyPrev for more details.

2.5 DHS survey data

Processing the binary indicator from the raw DHS data consists of two steps:

  1. download the relevant DHS data recode using getDHSdata(), or manually from the DHS website,
  2. create a data frame with the binary indicator and relevant information using getDHSindicator().

The getDHSdata() function downloads the relevant DHS data directly from the DHS website using the rdhs package. This step requires users to

  1. register with DHS to gain data access,
  2. set up the DHS account login details in R with rdhs::set_rdhs_config().

rdhs::set_rdhs_config(email = "your_email",
                project = "your_registered_DHS_project_title")

After setting up the DHS account information, getDHSdata() can be used to download the relevant survey files directly into R. If the data download using the API fails, or if it is preferred to work with downloaded data files offline, you may also manually download the file from the DHS website and read it into R. The getDHSdata() function returns a message specifying which file is used (e.g., the Individual Record file for the ANC visit example).

indicator <- "RH_ANCN_W_N4P"
year <- 2018
country <- "Zambia"
dhsData <- getDHSdata(country = country, indicator = indicator, year = year)

We then use getDHSindicator() to process the raw survey data into a data frame, where the column value is the indicator of interest. The data frame also contains the information needed to specify the survey design in the survey package, including cluster ID, household ID, survey weight and strata information.

data <- getDHSindicator(dhsData, indicator = indicator)
head(data)
##   cluster householdID            v022            v023    v024  weight strata
## 1       1           1 eastern - rural eastern - rural eastern 1892890  rural
## 2       1           2 eastern - rural eastern - rural eastern 1892890  rural
## 3       1           3 eastern - rural eastern - rural eastern 1892890  rural
## 4       1           4 eastern - rural eastern - rural eastern 1892890  rural
## 5       1           9 eastern - rural eastern - rural eastern 1892890  rural
## 6       1           9 eastern - rural eastern - rural eastern 1892890  rural
##   value
## 1     1
## 2     1
## 3     1
## 4     0
## 5     0
## 6     0

2.6 Spatial information

The getDHSgeo() function downloads the GPS data for cluster locations, also through the rdhs package. This step loads in the GPS locations as a SpatialPointsDataFrame, but it can also be performed manually by downloading the GPS file from the DHS website. The GPS locations of the clusters are used to match clusters to regions at different admin levels.

geo <- getDHSgeo(country = country, year = year)

In the case of Zambia, the maps are already included in the package as example datasets (data(ZambiaAdm1) and data(ZambiaAdm2)) so we load them directly within R. For other countries, the Admin 1 and Admin 2 boundary shapefiles can be downloaded from the GADM site and read into R manually. Another alternative is to use the geodata and sf packages to download the maps from the GADM site and load them in R directly, in the following steps:

poly.adm1 <- geodata::gadm(country="ZMB", level=1, path=tempdir())
poly.adm1 <- sf::st_as_sf(poly.adm1)
poly.adm2 <- geodata::gadm(country="ZMB", level=2, path=tempdir())
poly.adm2 <- sf::st_as_sf(poly.adm2)

The clusterInfo() function extracts Admin 1 and Admin 2 area information for each cluster. To avoid confusion in countries that have duplicated region names, we create new labels for Admin 2 regions using the variable admin2.name.full. The clusters with invalid GPS locations are saved in the wrong.points object and not used in any models.

cluster.info <- clusterInfo(geo=geo, poly.adm1=poly.adm1, poly.adm2=poly.adm2, 
                            by.adm1 = "NAME_1",by.adm2 = "NAME_2")
head(cluster.info$data)
##   cluster  LONGNUM     LATNUM                   geometry admin2.name
## 1       1 32.06631 -13.895685 POINT (32.06631 -13.89568)      Katete
## 2       2 28.73625 -13.419209 POINT (28.73625 -13.41921)     Masaiti
## 3       3 31.13254  -8.754055 POINT (31.13254 -8.754055)    Mpulungu
## 4       4 28.44930 -14.419112  POINT (28.4493 -14.41911)       Kabwe
## 5       5 29.96905 -12.677275 POINT (29.96905 -12.67727)    Chitambo
## 6       6 32.74745  -9.319054 POINT (32.74745 -9.319054)     Nakonde
##   admin1.name   admin2.name.full
## 1     Eastern     Eastern_Katete
## 2  Copperbelt Copperbelt_Masaiti
## 3    Northern  Northern_Mpulungu
## 4     Central      Central_Kabwe
## 5     Central   Central_Chitambo
## 6    Muchinga   Muchinga_Nakonde
cluster.info$wrong.points
##  [1]  19  33  65 179 236 281 293 360 361 469
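
The role of admin2.name.full can be illustrated with made-up area names. The sketch below is a toy example in base R, not output from clusterInfo(); the area names are hypothetical, and the label format follows the Admin1_Admin2 pattern visible in the output above.

```r
# Toy names (hypothetical): two Admin 1 areas that each contain a district "Central"
toy <- data.frame(
  admin1.name = c("North", "North", "South", "South"),
  admin2.name = c("Central", "Lakeside", "Central", "Hilltop"),
  stringsAsFactors = FALSE
)

# Admin 2 names alone are not unique identifiers here
anyDuplicated(toy$admin2.name) > 0

# Prefixing with the Admin 1 name restores uniqueness
toy$admin2.name.full <- paste0(toy$admin1.name, "_", toy$admin2.name)
anyDuplicated(toy$admin2.name.full) == 0
```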

The adminInfo() function computes the spatial adjacency matrix, and if supplied with additional information on population, combines needed information for each administrative area, including the name, population, and urban population fractions of each administrative area. The admin.mat object stores the adjacency matrix for spatial models. We will ignore the population information for now and focus on only information from the DHS data, and will return to the population information in Section 7.

admin.info1 <- adminInfo(poly.adm = poly.adm1, admin = 1,by.adm = "NAME_1")
admin.info2 <- adminInfo(poly.adm = poly.adm2, admin = 2,by.adm = "NAME_2",by.adm.upper = "NAME_1")
head(admin.info2$data)
##   admin1.name   admin2.name      admin2.name.full population urban
## 1     Central      Chibombo      Central_Chibombo         NA    NA
## 2     Central      Chisamba      Central_Chisamba         NA    NA
## 3     Central      Chitambo      Central_Chitambo         NA    NA
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi         NA    NA
## 5     Central         Kabwe         Central_Kabwe         NA    NA
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi         NA    NA
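
For readers unfamiliar with spatial adjacency matrices, the following toy sketch builds one by hand for four hypothetical areas arranged in a row. It only illustrates the general structure (a square 0/1 matrix indexed by area name); the exact format of the matrix returned by adminInfo() is an assumption and should be checked against the package documentation.

```r
# Hypothetical neighbour list for four areas laid out in a row: A - B - C - D
areas <- c("A", "B", "C", "D")
neighbours <- list(A = "B", B = c("A", "C"), C = c("B", "D"), D = "C")

# Fill a square 0/1 matrix with a 1 wherever two areas share a border
adj <- matrix(0, nrow = length(areas), ncol = length(areas),
              dimnames = list(areas, areas))
for (a in areas) {
  adj[a, neighbours[[a]]] <- 1
}

adj
isSymmetric(adj)   # neighbour relations are mutual
rowSums(adj)       # number of neighbours of each area
```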

3 Direct estimates

3.1 National and Admin 1 direct estimates

Let \(y_j\) be the binary outcome of interest for the \(j^{\text{th}}\) individual in the survey and \(w_j\) be the design weight associated with this individual. For a given area denoted as \(i\), we have the weighted estimator:

\[ \hat p^{W}_{i}=\frac {\sum_{j \in S_i} y_j w_j}{\sum_{j \in S_i} w_j}\]

where \(S_i\) is the set of indices of the sampled individuals in the \(i\)-th area. The directEST() function calculates direct estimates at different Admin levels, internally using the smoothSurvey() function in the SUMMER package (Li et al. 2020). The output of the function includes the direct estimates and their variance, standard error, and confidence intervals based on the specified confidence level. The confidence intervals are computed on the logit scale, i.e., if we use \(D_i\) to denote the design-based variance of \(\hat p^{W}_i\), then the asymptotic design-based variance of \(\text{logit}(\hat p^{W}_i)\) is

\[ V_i = \frac{D_i}{(\hat p^{W}_i)^2(1-\hat p^{W}_i)^2} \]

and we compute the confidence interval on the probability scale to be

\[(\quad \text{expit}[\text{logit}(\hat p^{W}_i) - z_{\alpha/2}V_i^{1/2}], \;\; \text{expit}[\text{logit}(\hat p^{W}_i) + z_{\alpha/2}V_i^{1/2}]\quad). \]
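
The weighted estimator \(\hat p^{W}_{i}\) can be illustrated with a few lines of base R. The data below are made up for illustration (the names toy, area, y and w are not from the package), and the sketch computes only the point estimate; the design-based variance requires the survey design information and is handled by directEST().

```r
# Toy data (hypothetical): binary outcome y, design weight w, area label
toy <- data.frame(
  area = c("A", "A", "A", "B", "B"),
  y    = c(1, 0, 1, 0, 1),
  w    = c(2, 1, 1, 3, 1)
)

# Weighted estimator within each area: sum(y * w) / sum(w)
p_hat <- sapply(split(toy, toy$area), function(d) sum(d$y * d$w) / sum(d$w))
p_hat
```

In area A the weighted proportion is (2 + 0 + 1) / 4, while the unweighted proportion would be 2 / 3, which shows how the design weights change the estimate.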

Currently the package defaults to a two-stage stratified cluster sampling design, with the sampling clusters (enumeration areas) being stratified by Admin 1 areas and urban/rural strata, which is the most common sampling design in the DHS.

res_ad1 <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 1)
head(res_ad1$res.admin1)
##   admin1.name direct.est   direct.var direct.logit.est direct.logit.var
## 1     Central  0.5666503 0.0009124018        0.2681973       0.01513139
## 2  Copperbelt  0.6194303 0.0016407269        0.4871308       0.02952453
## 3     Eastern  0.6849734 0.0005389968        0.7767228       0.01157562
## 4     Luapula  0.6341424 0.0013165862        0.5500295       0.02445973
## 5      Lusaka  0.5950341 0.0007179686        0.3848158       0.01236474
## 6    Muchinga  0.6819640 0.0008074465        0.7628123       0.01716478
##   direct.logit.prec  direct.se direct.lower direct.upper         cv
## 1          66.08777 0.03020599    0.5067753    0.6246405 0.06970350
## 2          33.87014 0.04050589    0.5375183    0.6950648 0.10643487
## 3          86.38845 0.02321631    0.6378051    0.7286127 0.07369633
## 4          40.88353 0.03628479    0.5605757    0.7019415 0.09917737
## 5          80.87516 0.02679494    0.5416220    0.6462870 0.06616591
## 6          58.25882 0.02841560    0.6238751    0.7348939 0.08934713
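
As a check on the interval formula of this section, the logit-scale quantities for one area can be recomputed from the printed direct estimate and variance. This is a sketch in base R using the rounded values for Central copied from the output above, so agreement with the direct.logit.var, direct.lower and direct.upper columns is expected only up to rounding.

```r
# Direct estimate and design-based variance for Central, copied from the output above
p_hat <- 0.5666503
D     <- 0.0009124018

# Delta-method variance on the logit scale
V <- D / (p_hat^2 * (1 - p_hat)^2)

# 95% interval on the logit scale, mapped back with expit (plogis in base R)
z  <- qnorm(0.975)
ci <- plogis(qlogis(p_hat) + c(-1, 1) * z * sqrt(V))

round(c(logit.var = V, lower = ci[1], upper = ci[2]), 4)
```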

At the national level (admin = 0), the function can additionally produce urban- and rural-specific direct estimates through the strata argument.

res_ad0 <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="all")
res_ad0_urban <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="urban")
res_ad0_rural <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="rural")
list(all = res_ad0$res.admin0, 
      urban = res_ad0_urban$res.admin0, 
      rural = res_ad0_rural$res.admin0)
## $all
## NULL
## 
## $urban
## NULL
## 
## $rural
## NULL

3.2 Admin 2 direct estimates

In general, direct estimates at Admin 2 are unstable. For countries with a large number of Admin 2 areas, some of the direct estimates can be undefined if there is no cluster in the area.

res_ad2 <- directEST(data = data,
                     cluster.info = cluster.info,
                     admin = 2,
                     aggregation = FALSE)
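
Whether an Admin 2 area has any sampled cluster can be checked by tabulating clusters by area. The sketch below uses hypothetical area labels in base R; in practice the same tabulation could be applied to the admin2.name.full column of cluster.info$data, with the factor levels taken from the full list of Admin 2 areas.

```r
# Hypothetical cluster-to-area assignment; "North_C" is a valid area
# in which no cluster was sampled
all_areas <- c("North_A", "North_B", "North_C")
cluster_area <- factor(c("North_A", "North_A", "North_B"), levels = all_areas)

# Number of sampled clusters per area, keeping areas with zero clusters
n_clusters <- table(cluster_area)
n_clusters

# Areas for which a direct estimate cannot be computed
names(n_clusters)[n_clusters == 0]
```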

4 Area-level Fay-Herriot model

4.1 Admin 1 Fay-Herriot estimates

Fay-Herriot models provide smoothed estimates at the areal level using the direct estimates \(\hat p^{W}_{i}\) as input. Each direct estimate is modeled as a noisy observation of the true prevalence, with the variance of the noise determined by the design-based variance. We consider the spatial Fay-Herriot model for the logit-transformed direct estimates, which is defined as follows:

\[\text{logit}(\hat p^{W}_{i}) \mid \lambda_{i} \sim\textrm{Normal}(\lambda_{i}, V_{i}),\] \[\lambda_{i}= \alpha + e_i+S_i.\]

Here \(V_i\) is the design-based variance of \(\text{logit}(\hat p^{W}_{i})\) defined in Section 3.1, \(\text{expit}(\lambda_{i})\) is the latent true prevalence, and \(e_i\) and \(S_i\) are unstructured and structured spatial random effects. Inference is carried out using Bayesian methods, and so the model specification is completed by priors on \(\alpha\), \(e\) and \(S\), and their hyperpriors. More details of the Bayesian model setup can be found in Jonathan Wakefield, Okonek, and Pedersen (2020). Area-level Fay-Herriot models are viewed as the most reliable model choice, since they acknowledge the design through the use of a weighted estimate and its associated variance. See chapters 4 to 6 of Rao and Molina (2015).

At present the package allows only an overall intercept \(\alpha\), but future versions of the package will allow area-level covariates to be included. The default prior for the intercept is \(N(0, 1000)\). The structured and unstructured random effects are implemented using the Besag-York-Mollié (BYM) model via the BYM2 parameterization. The default PC priors are such that the marginal standard deviation \(\sigma\) satisfies \(Prob(\sigma > 1) = 0.01\), and the proportion of variation explained by the spatial effect, \(\phi\), satisfies \(Prob(\phi > 0.5) = 2/3\) (Riebler et al. 2016; Simpson et al. 2017).
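
The smoothing behaviour of the area-level model can be previewed with a deliberately simplified calculation. The sketch below is not what fhModel() does: it drops the spatial term \(S_i\), keeps only an i.i.d. effect with variance sigma2, and treats alpha and sigma2 as known (both values are assumptions chosen for illustration), whereas the package places priors on them and fits the model with INLA. Under these simplifications the conditional posterior mean of \(\lambda_i\) is a weighted average of the direct logit estimate and \(\alpha\).

```r
# Logit-scale direct estimates and variances for Central, Copperbelt and Eastern,
# copied from the Admin 1 direct-estimate output in Section 3.1
y <- c(Central = 0.2681973, Copperbelt = 0.4871308, Eastern = 0.7767228)
V <- c(0.01513139, 0.02952453, 0.01157562)

# Assumed values for illustration only: the package estimates these under priors
alpha  <- mean(y)   # overall level on the logit scale
sigma2 <- 0.02      # between-area variance of the i.i.d. random effect

# Conditional posterior mean of lambda_i: a precision-weighted compromise
shrink <- sigma2 / (sigma2 + V)            # weight given to the direct estimate
lambda_post <- alpha + shrink * (y - alpha)

round(cbind(direct = plogis(y), smoothed = plogis(lambda_post), weight = shrink), 3)
```

Areas with a larger design-based variance (here Copperbelt) receive a smaller weight on their direct estimate and are pulled more strongly toward the overall level.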

The fhModel() function calculates spatial Fay-Herriot estimates at different administrative levels using SUMMER::smoothSurvey(). Users can specify either only i.i.d. random effects (i.e., no \(S_i\) term) or the BYM2 model by setting model = "iid" or model = "bym2", respectively. With aggregation = FALSE, the function returns model results at the specified admin level only. Aggregation to higher admin levels is possible with aggregation = TRUE when additional population information is provided. The aggregation steps are described in more detail in Section 7.

smth_res_ad1_bym2 <- fhModel(data,
                        cluster.info = cluster.info,
                        admin.info = admin.info1,
                        admin = 1,
                        model = "bym2",
                        aggregation =FALSE)

smth_res_ad1_iid <- fhModel(data, 
                        cluster.info = cluster.info, 
                        admin.info = admin.info1,
                        admin = 1, 
                        model = "iid",
                        aggregation =FALSE)
head(smth_res_ad1_bym2$res.admin1) 
##   admin1.name      mean    median         sd          var     lower     upper
## 1     Central 0.6044782 0.6065355 0.02540212 0.0006452679 0.5502105 0.6473369
## 2  Copperbelt 0.6274234 0.6287296 0.02615116 0.0006838832 0.5719901 0.6772125
## 3     Eastern 0.6609729 0.6602310 0.02131922 0.0004545093 0.6222575 0.7040662
## 4     Luapula 0.6376078 0.6379043 0.02497230 0.0006236156 0.5861663 0.6869233
## 5      Lusaka 0.6134253 0.6147980 0.02203910 0.0004857219 0.5672111 0.6524969
## 6    Muchinga 0.6569453 0.6557629 0.02282445 0.0005209555 0.6157119 0.7043158
##           cv logit.mean logit.median   logit.var logit.lower logit.upper
## 1 0.06456014  0.4253019    0.4327724 0.011256651   0.2015213   0.2015213
## 2 0.07043695  0.5227582    0.5267704 0.012543047   0.2899753   0.2899753
## 3 0.06274623  0.6691180    0.6643239 0.009159307   0.4991414   0.4991414
## 4 0.06896601  0.5666071    0.5662796 0.011745266   0.3481393   0.3481393
## 5 0.05721439  0.4626948    0.4675257 0.008624177   0.2704817   0.2704817
## 6 0.06630446  0.6513613    0.6444688 0.010409992   0.4713867   0.4713867

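Columns 2 to 7 give the posterior mean, posterior median, posterior standard deviation, posterior variance, and the 2.5% and 97.5% posterior quantiles (lower and upper), respectively. The remaining columns give the coefficient of variation (cv) and the corresponding summaries on the logit scale. The posterior variance can be viewed as an analogue of the mean squared error (MSE) summary.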

4.2 Admin 2 Fay-Herriot estimates

Design-based variances at the Admin 2 level can be unstable due to data sparsity, which sometimes creates numerical issues for the Fay-Herriot model. A fix for this variance issue is described in Jon Wakefield, Jiang, and Wu (2026) and is enabled by setting var.fix = TRUE.

smth_res_ad2 <- fhModel(data,
                    cluster.info = cluster.info,
                    admin.info = admin.info2,
                    admin = 2,
                    model = "bym2",
                    aggregation = FALSE, 
                    var.fix = TRUE)
head(smth_res_ad2$res.admin2)
##        admin2.name.full      mean    median         sd         var     lower
## 1      Central_Chibombo 0.6326583 0.6347531 0.06240667 0.003894592 0.5021236
## 2      Central_Chisamba 0.6118544 0.6128304 0.03869883 0.001497600 0.5328828
## 3      Central_Chitambo 0.6598944 0.6653905 0.09444574 0.008919997 0.4533604
## 4  Central_Itezhi-tezhi 0.6312101 0.6375645 0.10324821 0.010660194 0.4092146
## 5         Central_Kabwe 0.5390667 0.5382750 0.03387439 0.001147475 0.4737059
## 6 Central_Kapiri Mposhi 0.4932258 0.4944318 0.04091956 0.001674410 0.4118850
##       upper         cv  logit.mean logit.median  logit.var  logit.lower
## 1 0.7428049 0.17086159  0.54546226   0.54510837 0.07512336  0.006054826
## 2 0.6847706 0.09995318  0.45188068   0.45192602 0.02677752  0.128656454
## 3 0.8194911 0.28225657  0.67581506   0.67666149 0.19211106 -0.187097097
## 4 0.8030814 0.28487336  0.56775460   0.56859802 0.22795389 -0.370116406
## 5 0.6054594 0.07336486  0.15940455   0.16012419 0.01925647 -0.113810011
## 6 0.5707236 0.08276078 -0.02920066  -0.02812478 0.02669842 -0.351824721
##   logit.upper admin1.name   admin2.name
## 1   1.0835899     Central      Chibombo
## 2   0.7714136     Central      Chisamba
## 3   1.5378288     Central      Chitambo
## 4   1.5060982     Central  Itezhi-tezhi
## 5   0.4308869     Central         Kabwe
## 6   0.2883475     Central Kapiri Mposhi

5 Cluster-level model

5.1 Unstratified model

Cluster-level models assume smoothing models for counts of events in each cluster (Jonathan Wakefield, Okonek, and Pedersen 2020; Li et al. 2020). In terms of traditional SAE literature, cluster-level models are a type of unit-level model. We start with describing an unstratified model without taking into account the urban/rural stratification in the sampling design.

Let \(Y_c\) be the number of events in cluster \(c\), and \(n_c\) be the number of individuals at risk, where \(c= 1,\dots,C\). The unstratified model assumes the hierarchical structure:

\[Y_c \mid p_c,d\sim \textrm{BetaBinomial}(n_c,p_c,d),\] \[p_c=\textrm{expit}(\alpha+e_{i[s_c]}+S_{i[s_c]}),\]

where \(\alpha\) is the intercept, \(s_c\) is the location of cluster \(c\), and \(i[s_c]\) indexes the area within which the cluster resides. Similar to the area-level model, \(e_i\) and \(S_i\) are unstructured and structured spatial random effects with the same priors as before. The Beta-binomial distribution arises from a hierarchical model in which the probability follows a \(\text{Beta}(a, b)\) prior. The overdispersion parameter, \(d=\frac{1}{a+b+1}\), is between 0 and 1 and represents the intracluster correlation between Bernoulli draws within a cluster. The default prior for \(d\) is \(\text{logit}(d) \sim \text{Normal}(0,0.4)\), where 0.4 is the precision.
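
The link between the overdispersion parameter \(d\) and the Beta shape parameters can be made concrete with a short calculation. In the sketch below the shapes are written as a and b, the helper bb_shapes() is a hypothetical function defined here (not part of surveyPrev or SUMMER), and the values of p, d and n are arbitrary. It uses the standard result that a Beta-binomial count has the binomial variance inflated by the factor \(1 + (n_c - 1)d\).

```r
# Map a mean p and overdispersion d in (0, 1) to Beta shape parameters (a, b)
bb_shapes <- function(p, d) {
  total <- (1 - d) / d          # a + b, chosen so that d = 1 / (a + b + 1)
  c(a = p * total, b = (1 - p) * total)
}

p <- 0.6; d <- 0.1; n <- 20
s <- bb_shapes(p, d)

# Recover d from the shapes
1 / (sum(s) + 1)

# Variance of a Beta-binomial count versus a plain binomial count
var_binom <- n * p * (1 - p)
var_bb    <- var_binom * (1 + (n - 1) * d)
c(binomial = var_binom, betabinomial = var_bb)
```

With these values the Beta-binomial variance is 2.9 times the binomial variance, which illustrates how even a modest intracluster correlation inflates the variability of cluster counts.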

The clusterModel() function fits the cluster-level model above. An unstratified model can be specified by setting stratification = FALSE, and either the i.i.d. or the BYM2 model by setting model = "iid" or model = "bym2". For the overdispersion parameter \(d\), users can change the prior mean and precision through overdisp.mean and overdisp.prec.

We fit the unstratified model at both the Admin 1 and Admin 2 levels below. These models differ only in the admin level at which the spatial random effects enter, with the cluster-level model being the same in each case.

cl_res_ad1 <- clusterModel(data=data,
                   cluster.info=cluster.info,
                   admin.info  = admin.info1,
                   stratification = FALSE,
                   model = "bym2",
                   admin = 1,
                   aggregation =FALSE,
                   CI = 0.95)

cl_res_ad2 <- clusterModel(data=data,
                   cluster.info= cluster.info,
                   admin.info = admin.info2,
                   model = "bym2",
                   stratification =FALSE,
                   admin = 2,
                   aggregation =FALSE,
                   CI = 0.95)
head(cl_res_ad2$res.admin2)
##        admin2.name.full      mean    median         sd         var     lower
## 1      Central_Chibombo 0.6304505 0.6297172 0.05155764 0.002658190 0.5248170
## 2      Central_Chisamba 0.6290134 0.6287097 0.05635730 0.003176146 0.5127512
## 3      Central_Chitambo 0.6322612 0.6361675 0.06361386 0.004046723 0.4991826
## 4  Central_Itezhi-tezhi 0.6402539 0.6424507 0.07113662 0.005060418 0.4908513
## 5         Central_Kabwe 0.5707115 0.5721401 0.05257994 0.002764651 0.4629834
## 6 Central_Kapiri Mposhi 0.5711356 0.5708991 0.05200752 0.002704783 0.4643945
##       upper        cv admin1.name   admin2.name population urban
## 1 0.7273955 0.1392386     Central      Chibombo         NA    NA
## 2 0.7343986 0.1517877     Central      Chisamba         NA    NA
## 3 0.7497821 0.1748438     Central      Chitambo         NA    NA
## 4 0.7727964 0.1989561     Central  Itezhi-tezhi         NA    NA
## 5 0.6687750 0.1228906     Central         Kabwe         NA    NA
## 6 0.6679537 0.1212012     Central Kapiri Mposhi         NA    NA

5.2 Stratified model

Cluster-level models will produce biased estimates if the specified model does not hold for all units in the population. In the DHS, urban clusters are often oversampled. If such oversampling occurs and the outcome is associated with urban/rural status, then bias will result when the model does not allow for differences between urban and rural areas. To emphasize, bias will only arise when oversampling of urban (or rural) areas occurs and the indicator is associated with urban/rural status. One may check whether urban/rural terms need to be included in the model, and if so, aggregate over the urban and rural strata.

A fully adjusted model would include interaction terms representing the crossing of Admin 1 areas and urban/rural status. However, for reasons of parsimony we include area-level random effects and a single urban/rural fixed effect. The sampling model becomes

\[Y_c \mid p_c,d\sim \textrm{BetaBinomial}(n_c,p_c,d),\] \[p_c=\textrm{expit}(\alpha+\gamma\times I(s_c \in \textrm{urban})+e_{i[s_c]}+S_{i[s_c]}).\]

The area-level risk is defined as \[p_i=q_i\times \text{expit}(\alpha+\gamma+e_i+S_i)+(1-q_i)\times \text{expit}(\alpha+e_i+S_i),\] where \(q_i\) is the proportion of the population in area \(i\) that is urban, with “urban” defined by the survey sampling frame.
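
The area-level risk is a population-weighted mixture of the urban and rural prevalences, which can be sketched numerically. All parameter values below are assumptions chosen for illustration, not estimates from the Zambia analysis, and u_i stands for the combined area effect \(e_i + S_i\).

```r
# Assumed values for illustration (not estimates from the Zambia analysis)
alpha <- 0.3     # intercept
gamma <- 0.5     # urban effect on the logit scale
u_i   <- -0.1    # combined area effect e_i + S_i
q_i   <- 0.577   # urban population fraction of the area

# Stratum-specific prevalences (plogis is expit in base R)
p_urban <- plogis(alpha + gamma + u_i)
p_rural <- plogis(alpha + u_i)

# Area-level prevalence: population-weighted mix of the two strata
p_area <- q_i * p_urban + (1 - q_i) * p_rural
c(urban = p_urban, rural = p_rural, overall = p_area)
```

The overall value always lies between the rural and urban prevalences, and moves toward the urban value as \(q_i\) increases, which is why the urban fractions are needed for aggregation.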

A stratified cluster-level model can be fit by setting stratification = TRUE. The returned results then consist of urban-specific, rural-specific, and overall area-level prevalences. To facilitate aggregation for the cluster-level model with urban/rural effects, we need to know the fraction of the population that is urban in each administrative region. This computation can be performed with the getUR() function (see the function's examples for details), but it requires external information (i.e., the urban/rural definition by census), so we omit this step from this document for now and assume such information has already been obtained from a different source. The urban proportions for women of age 15 to 49 in Zambia in 2018 have been pre-computed and saved in ZambiaPopWomen$admin1_urban and ZambiaPopWomen$admin2_urban.

head(ZambiaPopWomen$admin2_urban)
##     admin1.name admin2.name       urban
## 1       Eastern     Chadiza 0.049373174
## 2      Muchinga       Chama 0.054972234
## 3       Eastern     Chasefu 0.001718483
## 4 North-Western     Chavuma 0.325830536
## 5       Luapula      Chembe 0.000000000
## 6       Central    Chibombo 0.577152577

Given the urban/rural proportions, we first add the information into admin.info2 using adminInfo().

admin.info2 <- adminInfo(poly.adm = poly.adm2, 
                         admin = 2,
                         proportion = ZambiaPopWomen$admin2_urban,
                         by.adm="NAME_2",by.adm.upper="NAME_1")
head(admin.info2$data)
##   admin1.name   admin2.name      admin2.name.full      urban population
## 1     Central      Chibombo      Central_Chibombo 0.57715258         NA
## 2     Central      Chisamba      Central_Chisamba 0.03825222         NA
## 3     Central      Chitambo      Central_Chitambo 0.00000000         NA
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi 0.00000000         NA
## 5     Central         Kabwe         Central_Kabwe 0.68604463         NA
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi 0.13246326         NA

We can then fit the stratified cluster-level model by specifying stratification = TRUE.

cl_res_ad2_T <- clusterModel(data=data,
                   cluster.info=cluster.info,
                   admin.info = admin.info2,
                   model = "bym2",
                   stratification = TRUE,
                   admin = 2, 
                   CI = 0.95)

6 Visualizing prevalence estimates

We can use the mapPlot() function from the SUMMER package to visualize the estimates and their coefficients of variation (CV).

# Arrange all estimates into a long-format data frame
out1 <- res_ad1$res.admin1[, c("admin1.name", "direct.est", "cv")]
colnames(out1)[2] <- "mean"
out1$model <- "Direct Estimates"
out2 <- smth_res_ad1_bym2$res.admin1[, c("admin1.name", "mean", "cv")]
out2$model <- "Fay-Herriot Model"
out3 <- cl_res_ad1$res.admin1[, c("admin1.name", "mean", "cv")]
out3$model <- "Unstratified Cluster-level Model"

g1 <- mapPlot(data = rbind(out1, out2, out3), geo = poly.adm1,
              by.data = "admin1.name",  by.geo = "NAME_1", is.long = TRUE,
              variable = "model", value = "mean", legend.label = "Mean")

g2 <- mapPlot(data = rbind(out1, out2, out3), geo = poly.adm1,
              by.data = "admin1.name",  by.geo = "NAME_1", is.long = TRUE,
              variable = "model", value = "cv", legend.label = "CV")
g1 / g2

Figure 6.1: Comparing Admin 1 direct estimates, Fay–Herriot model, and unstratified cluster-level posterior mean estimates

We note that for Admin 2 estimates, surveyPrev uses a recreated region name in the form of [admin1_name]_[admin2_name] to avoid duplicated Admin 2 names. This new region identifier is stored in the admin2.name.full column of the output. Thus when plotting, we need to manually create the admin2.name.full column in the spatial polygon object.

poly.adm2$admin2.name.full=paste0(poly.adm2$NAME_1,"_",poly.adm2$NAME_2)

# Arrange all estimates into a long-format data frame
out1<- res_ad2$res.admin2[, c("admin2.name.full", "direct.est", "cv")]
colnames(out1)[2] <- "mean"
out1$model <- "Direct Estimates"
out2 <- smth_res_ad2$res.admin2[, c("admin2.name.full", "mean", "cv")]
out2$model <- "Fay-Herriot Model"
out3 <- cl_res_ad2$res.admin2[, c("admin2.name.full", "mean", "cv")]
out3$model <- "Unstratified Cluster-level Model"
out4 <- cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full", c("admin2.name.full", "mean", "cv")]
out4$model <- "Stratified Cluster-level Model"

g1 <- mapPlot(data = rbind(out1, out2, out3, out4), geo = poly.adm2,
              by.data = "admin2.name.full",  by.geo = "admin2.name.full", 
              is.long = TRUE, variable = "model", value = "mean", 
              legend.label = "Mean", ncol = 4)

g2 <- mapPlot(data = rbind(out1, out2, out3, out4), geo = poly.adm2,
              by.data = "admin2.name.full",  by.geo = "admin2.name.full", 
              is.long = TRUE, variable = "model", value = "cv", 
              legend.label = "CV", ncol = 4)
g1 / g2 

Figure 6.2: Comparing Admin 2 direct estimates, Fay–Herriot model, unstratified and stratified cluster-level posterior mean estimates

The scatterPlot() function takes results from two model outputs and produces a scatter plot of the selected variable. For example, we can compare the direct estimates with the Fay-Herriot model (top row) and the cluster-level model (bottom row) in terms of their point estimates and standard deviations. In the two plots on the left, we see the expected shrinkage (attenuation) of the model-based posterior mean estimates as compared to the direct estimates. In the plots on the right we see the reduction in uncertainty which arises from using all of the data in a single model.

s1 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=smth_res_ad1_bym2$res.admin1,
          value1="direct.est",
          value2="mean",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Fay-Herriot vs Direct estimate",
          label1="Direct Est",
          label2="Spatial Fay-Herriot")
s2 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=smth_res_ad1_bym2$res.admin1,
          value1="direct.se",
          value2="sd",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Fay–Herriot vs Direct SE",
          label1="Direct Est",
          label2="Spatial Fay–Herriot")
s3 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=cl_res_ad1$res.admin1,
          value1="direct.est",
          value2="mean",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Cluster-level model vs Direct estimate",
          label1="Direct Est",
          label2="Cluster-level model")
s4 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=cl_res_ad1$res.admin1,
          value1="direct.se",
          value2="sd",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Cluster-level model vs Direct SE",
          label1="Direct Est",
          label2="Cluster-level model")
(s1 + s2 ) / (s3 + s4)

Figure 6.3: Comparing direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row).
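
The shrinkage in the left panels of Figure 6.3 and the variance reduction in the right panels can be illustrated with a minimal non-spatial sketch. This is not the BYM2 model fitted by fhModel(): it assumes an independent normal random effect on the logit scale with a known variance (sigma2, a hypothetical value), and the direct estimates and their logit-scale variances are made-up numbers. Under these assumptions, each smoothed estimate is a precision-weighted average of the direct estimate and the overall mean.

```r
# Toy direct estimates on the logit scale and their design-based variances
# (made-up numbers, not taken from the Zambia analysis)
logit_direct <- c(0.27, 0.49, 0.78, 0.55, 0.38)
v_direct     <- c(0.015, 0.030, 0.012, 0.024, 0.090)

# Assumed between-area variance of the random effect (hypothetical value)
sigma2 <- 0.02

# Precision-weighted estimate of the overall mean
w  <- 1 / (v_direct + sigma2)
mu <- sum(w * logit_direct) / sum(w)

# Shrinkage factor: share of weight given to the direct estimate
gamma <- sigma2 / (sigma2 + v_direct)

# Smoothed estimate and its conditional variance (mu and sigma2 treated as known)
logit_smooth <- gamma * logit_direct + (1 - gamma) * mu
v_smooth     <- gamma * v_direct

data.frame(direct = plogis(logit_direct),
           smooth = plogis(logit_smooth),
           se_direct_logit = sqrt(v_direct),
           sd_smooth_logit = sqrt(v_smooth))
```

Areas with larger sampling variance get a smaller shrinkage factor and are pulled more strongly toward the overall mean, and every smoothed variance is smaller than the corresponding design-based variance.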

Similar smoothing patterns can be observed when comparing estimates at the Admin 2 level.

s1 <- scatterPlot(
          res1=res_ad2$res.admin2,
          res2=smth_res_ad2$res.admin2,
          value1="direct.est",
          value2="mean",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Fay-Herriot vs Direct estimate",
          label1="Direct Est",
          label2="Spatial Fay-Herriot")
s2 <- scatterPlot(
          res1=res_ad2$res.admin2,
          res2=smth_res_ad2$res.admin2,
          value1="direct.se",
          value2="sd",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Fay–Herriot vs Direct SE",
          label1="Direct Est",
          label2="Spatial Fay–Herriot")
s3 <- scatterPlot(
        res1=res_ad2$res.admin2,
        res2=cl_res_ad2$res.admin2,
        value1="direct.est",
        value2="mean",
        by.res1="admin2.name.full",
        by.res2="admin2.name.full",
        title="Cluster-level model vs Direct estimate",
        label1="Direct Est",
        label2="Cluster-level model")
s4 <- scatterPlot(
        res1=res_ad2$res.admin2,
        res2=cl_res_ad2$res.admin2,
        value1="direct.se",
        value2="sd",
        by.res1="admin2.name.full",
        by.res2="admin2.name.full",
        title="Cluster-level model vs Direct SE",
        label1="Direct Est",
        label2="Cluster-level model")
(s1 + s2 ) / (s3 + s4)

Figure 6.4: Comparing direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row). Red triangles correspond to areas where direct estimates are not available.

Finally, we compare the unstratified and stratified cluster-level models at the Admin 2 level. We observe mild differences, but the estimates from the two models are mostly similar.

s1 <- scatterPlot(
          res1=cl_res_ad2$res.admin2,
          res2=cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full",],
          value1="mean",
          value2="mean",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Stratified vs Unstratified model estimate",
          label1="Unstratified model estimates",
          label2="Stratified model estimates")
s2 <- scatterPlot(
          res1=cl_res_ad2$res.admin2,
          res2=cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full",],
          value1="sd",
          value2="sd",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Stratified vs Unstratified model SD",
          label1="Unstratified model estimates",
          label2="Stratified model estimates")
(s1 + s2 )  

Figure 6.5: Comparing unstratified and stratified cluster-level model at Admin 2 level.

Another visualization function, intervalPlot(), provides interval plots at different admin levels and supports model comparison with compare = TRUE. The input should be a list of model outputs from surveyPrev, each with a user-defined model name (e.g., model = list("name 1" = fit1, "name 2" = fit2)).

intervalPlot(admin = 1, compare = TRUE, model = list(
      "Direct estimate model"= res_ad1,
      "Fay-Herriot model"= smth_res_ad1_bym2,
      "Unstratified Cluster-level model"= cl_res_ad1))

Figure 6.6: Comparing different Admin 1 estimates.

For stratified cluster-level models, intervalPlot() visualizes the urban, rural, and overall Admin 2 estimates within each Admin 1 region, when setting compare = FALSE.

plots <- intervalPlot(compare = FALSE, model = list("model"=cl_res_ad2_T))
patchwork::wrap_plots(plots, ncol = 2)

Figure 6.7: Comparing Admin 2 estimates arranged by the corresponding Admin 1 regions.

7 Aggregation to higher admin levels

7.1 Computing population size from WorldPop raster

Population fractions are necessary to aggregate estimates from finer administrative regions to higher levels. The script below demonstrates the creation of a pixel-level population raster at 100m by 100m resolution for women aged 15 to 49 in Zambia in 2018. The corresponding age-sex-specific population estimates can be found at https://hub.worldpop.org/geodata/summary?id=16429. The path to the downloaded .tiff files is specified in the pop.dir object. The aggPopulation() function aggregates the population raster to admin levels based on the shapefiles read in earlier. Here we aggregate the pixel-level population to both the Admin 1 and Admin 2 levels.

library(raster)
pop.abbrev <- "ZMB"
year <- 2018
pop.dir <- "../data/Zambia/worldpop"

## first sum up the seven five-year age-group rasters for females aged 15-49
f_15_name <- paste0(pop.dir, "/", pop.abbrev, '_f_15_', year, '.tiff')
f_20_name <- paste0(pop.dir, "/", pop.abbrev, '_f_20_', year, '.tiff')
f_25_name <- paste0(pop.dir, "/", pop.abbrev, '_f_25_', year, '.tiff')
f_30_name <- paste0(pop.dir, "/", pop.abbrev, '_f_30_', year, '.tiff')
f_35_name <- paste0(pop.dir, "/", pop.abbrev, '_f_35_', year, '.tiff')
f_40_name <- paste0(pop.dir, "/", pop.abbrev, '_f_40_', year, '.tiff')
f_45_name <- paste0(pop.dir, "/", pop.abbrev, '_f_45_', year, '.tiff')


pop_f_15 <- raster(f_15_name)
pop_f_20 <- raster(f_20_name)
pop_f_25 <- raster(f_25_name)
pop_f_30 <- raster(f_30_name)
pop_f_35 <- raster(f_35_name)
pop_f_40 <- raster(f_40_name)
pop_f_45 <- raster(f_45_name)

pop_raster <- pop_f_45 + pop_f_40 + pop_f_35 + pop_f_30 + pop_f_25 + pop_f_20 + pop_f_15
agg.pop1 <- aggPopulation(
   tiff = pop_raster,
   poly.adm  = ZambiaAdm1,
   by.adm  = "NAME_1")
colnames(agg.pop1)[1] <- "admin1.name"
agg.pop2 <- aggPopulation(
   tiff = pop_raster,
   poly.adm = ZambiaAdm2,
   by.adm  = "NAME_2",
   by.adm.upper= "NAME_1")
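
Conceptually, the two steps above are a layer-wise sum followed by a zonal sum. The following is a minimal base R sketch on toy matrices, which stand in for the age-specific rasters and for a rasterized area label. All values and names here are hypothetical, and aggPopulation() itself works on spatial rasters and polygons rather than plain matrices.

```r
set.seed(1)
# Seven toy 4 x 4 "rasters", one per five-year age group from 15-19 to 45-49
age_groups <- seq(15, 45, by = 5)
layers <- lapply(age_groups, function(a) matrix(runif(16, 0, 10), 4, 4))
names(layers) <- paste0("f_", age_groups)

# Cell-wise sum over all age groups, the analogue of adding the rasters
pop_total <- Reduce(`+`, layers)

# Hypothetical area label for each cell (columns 1-2 are "A", columns 3-4 are "B")
area <- matrix(rep(c("A", "B"), each = 8), 4, 4)

# Zonal sum: total population per area
agg <- tapply(as.vector(pop_total), as.vector(area), sum)
agg
```

Using Reduce() over a list of layers avoids writing one object per age group, which may be convenient when the set of age groups changes.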

The population estimates computed in this example are built into the package and available in the ZambiaPopWomen dataset.

data(ZambiaPopWomen)
head(ZambiaPopWomen$admin1_pop)
##   admin1.name population
## 1     Central   430358.5
## 2  Copperbelt   598032.5
## 3     Eastern   452361.1
## 4     Luapula   342075.2
## 5      Lusaka   808640.2
## 6    Muchinga   214810.4
head(ZambiaPopWomen$admin2_pop)
##        admin2.name.full population admin1.name   admin2.name
## 1      Central_Chibombo  108561.22     Central      Chibombo
## 2      Central_Chisamba   28307.61     Central      Chisamba
## 3      Central_Chitambo   20727.69     Central      Chitambo
## 4  Central_Itezhi-tezhi   22814.54     Central  Itezhi-tezhi
## 5         Central_Kabwe   56875.20     Central         Kabwe
## 6 Central_Kapiri Mposhi   55995.96     Central Kapiri Mposhi

7.2 Estimating population size by survey weight

An alternative approach to obtaining population fractions is to use the survey weights to estimate the population size within each area. This approach, however, can lead to noisy estimates at fine spatial resolutions. Similar to aggPopulation(), the aggSurveyWeight() function produces the estimated population size at the specified admin level. The output can be used as weights for aggregating model results to higher admin levels.

agg.survey1 <- aggSurveyWeight(data = data, cluster.info = cluster.info, admin = 1)
agg.survey2 <- aggSurveyWeight(data = data, cluster.info = cluster.info, admin = 2,
    poly.adm = poly.adm2, by.adm = "NAME_2", by.adm.upper = "NAME_1")
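
The idea behind this approach can be sketched in base R: survey weights are proportional to the number of women each respondent represents, so summing the weights within an area estimates that area's share of the population. The toy data frame below is hypothetical, and aggSurveyWeight() may differ in details such as how the weights are scaled.

```r
# Hypothetical respondent-level data: area membership and survey weight
svy <- data.frame(
  admin1.name = c("North", "North", "North", "South", "South"),
  weight      = c(1.2, 0.8, 1.0, 2.5, 1.5)
)

# Sum of weights per area, then normalize to population fractions
agg <- aggregate(weight ~ admin1.name, data = svy, FUN = sum)
agg$pop.frac <- agg$weight / sum(agg$weight)
agg
```

Only the relative sizes matter for aggregation, so the overall scale of the weights cancels when the fractions are formed.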

As mentioned in Section 2, the adminInfo() function combines information for each administrative area into a single data frame. This includes the name, population, and urban population fraction of each area. Population and urban population fractions are supplied through the agg.pop and proportion arguments, respectively. Alternatively, the aggregated survey weights (the output of aggSurveyWeight()) can be passed to agg.pop in place of population data.

admin.info1 <- adminInfo(poly.adm = poly.adm1, 
                         admin = 1,
                         by.adm="NAME_1",
                         agg.pop  =ZambiaPopWomen$admin1_pop,
                         proportion = ZambiaPopWomen$admin1_urban )

admin.info2 <- adminInfo(poly.adm = poly.adm2, 
                         admin = 2,by.adm="NAME_2",by.adm.upper = "NAME_1",
                         agg.pop =ZambiaPopWomen$admin2_pop,
                         proportion = ZambiaPopWomen$admin2_urban)

admin.info1.survey <- adminInfo(poly.adm = poly.adm1, 
                         admin = 1,
                         by.adm="NAME_1",
                         agg.pop  =agg.survey1,
                         proportion = ZambiaPopWomen$admin1_urban )

admin.info2.survey  <- adminInfo(poly.adm = poly.adm2, 
                         admin = 2,by.adm="NAME_2",by.adm.upper = "NAME_1",
                         agg.pop =agg.survey2,
                         proportion = ZambiaPopWomen$admin2_urban)

head(admin.info2$data)
##   admin1.name   admin2.name      admin2.name.full      urban population
## 1     Central      Chibombo      Central_Chibombo 0.57715258  108561.22
## 2     Central      Chisamba      Central_Chisamba 0.03825222   28307.61
## 3     Central      Chitambo      Central_Chitambo 0.00000000   20727.69
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi 0.00000000   22814.54
## 5     Central         Kabwe         Central_Kabwe 0.68604463   56875.20
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi 0.13246326   55995.96
##   population.admin1
## 1          430358.5
## 2          430358.5
## 3          430358.5
## 4          430358.5
## 5          430358.5
## 6          430358.5

The following maps display each area's share of the national female population aged 15-49, computed from WorldPop and from the aggregated DHS survey weights, at the Admin 1 and Admin 2 levels.

out1 <- rbind(cbind(admin.info1$data, type = "WorldPop"), 
             cbind(admin.info1.survey$data, type = "SurveyWeight"))
out1 <- out1 %>% group_by(type) %>% mutate(pop.frac = population / sum(population)) 
g1 <- mapPlot(data = out1, geo = poly.adm1, size = 0.1, 
              by.data = "admin1.name", by.geo = "NAME_1", 
              variable = "type",  value = "pop.frac", is.long = TRUE,
              legend.label = "Population Fraction")

out2 <- rbind(cbind(admin.info2$data, type = "WorldPop"), 
             cbind(admin.info2.survey$data, type = "SurveyWeight"))
out2 <- out2 %>% group_by(type) %>% mutate(pop.frac = population / sum(population)) 
poly.adm2$admin2.name.full=paste0(poly.adm2$NAME_1,"_",poly.adm2$NAME_2)
g2 <- mapPlot(data = out2, geo = poly.adm2, size = 0.1, 
              by.data = "admin2.name.full", by.geo = "admin2.name.full", 
              variable = "type",  value = "pop.frac", is.long = TRUE,
              legend.label = "Population Fraction")
g1 / g2

Figure 7.1: Survey weight ratio and population ratio at Admin 1 and Admin 2

The scatter plot below compares the fraction of the population residing in each Admin 2 area within its Admin 1 area, computed from the WorldPop raster and from the survey weights. Note that DHS data may not cover every Admin 2 area. For example, this survey includes 112 of the 115 Admin 2 areas in Zambia (red points in the scatter plot). Users should be aware that the survey weights therefore give an incomplete set of weights for aggregation from Admin 2 to Admin 1.

out2$pop.frac.within <- out2$population / out2$population.admin1
out3 <- tidyr::spread(out2[, c("admin2.name.full", "pop.frac.within", "type")], 
                      type, pop.frac.within)
out3 <- left_join(out3, admin.info2$data[, c("admin2.name.full", "population")],
                  by = "admin2.name.full")
ggplot(out3, aes(x = SurveyWeight, y = WorldPop, size = population)) +
    geom_point(alpha = 0.5, color = "red")+
    geom_abline(slope = 1, intercept = 0, linetype = "dashed")

Figure 7.2: Comparing two sets of Admin 2 15-49 female population fractions within Admin 1 area.
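
The consequence of missing Admin 2 areas can be seen in a small hypothetical example. When fractions are normalized over the sampled areas only, they still sum to one within the Admin 1 area, but each sampled area receives more weight than its population share and the unsampled area contributes nothing to the aggregate. The area names, population sizes, and sampling indicator below are made up for illustration.

```r
# Hypothetical Admin 1 area with three Admin 2 areas; "C" has no sampled clusters
pop     <- c(A = 50000, B = 30000, C = 20000)
sampled <- c(A = TRUE,  B = TRUE,  C = FALSE)

# Fractions over the complete set of areas
frac_full <- pop / sum(pop)

# Fractions renormalized over the sampled areas only
frac_obs <- ifelse(sampled, pop, 0) / sum(pop[sampled])

rbind(frac_full, frac_obs)
```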

7.3 Aggregating direct estimates

For direct estimates, we caution that we almost always prefer directly computing direct estimates at the desired level, rather than aggregating direct estimates at finer levels. Nevertheless, here we describe how the aggregated direct estimates are computed.

Let \(i\) denote the index for the lower admin level and \(k[i]\) be the index for the corresponding upper admin level. The aggregated estimate for the \(k\)-th upper admin area is

\[ \hat p^{agg}_{k}=\frac {\sum_{k[i] = k} \hat p^{W}_{i}E_i}{\sum_{k[i] = k} E_i},\] where \(E_i=\widehat{\text{pop}}_{i} \times \textbf{1}(\hat p^{W}_{i} \neq \text{NA})\), i.e., the point estimate is a weighted average of the non-missing direct estimates in the area, where the weights are either given by external population information or estimated from the survey weights. The standard error and confidence intervals are computed by simulation using samples from the design-based asymptotic sampling distributions of \(\text{logit}(\hat p^{W}_i)\).
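
To make the formula concrete, the sketch below aggregates three hypothetical lower-level direct estimates, one of them missing, to a single upper-level area. The estimates, logit-scale variances, and population sizes are made up, and the simulation step is only one plausible reading of the description above (independent normal draws on the logit scale), not necessarily the exact procedure inside directEST().

```r
set.seed(2024)
p_hat   <- c(0.63, 0.61, NA)       # direct estimates; NA where unavailable
v_logit <- c(0.115, 0.030, NA)     # variances of logit(p_hat)
pop     <- c(108561, 28308, 20728) # population sizes used as weights

# E_i = pop_i * 1(p_hat_i is not NA)
ok <- !is.na(p_hat)
E  <- pop * as.numeric(ok)

# Point estimate: weighted average of the non-missing direct estimates
p_agg <- sum(p_hat * E, na.rm = TRUE) / sum(E)

# Uncertainty: draw logit(p_i) from its approximate sampling distribution,
# transform back, and aggregate each draw with the same weights
draws <- sapply(which(ok), function(i)
  plogis(rnorm(5000, mean = qlogis(p_hat[i]), sd = sqrt(v_logit[i]))))
agg_draws <- as.vector(draws %*% E[ok]) / sum(E[ok])

c(estimate = p_agg,
  se = sd(agg_draws),
  quantile(agg_draws, c(0.025, 0.975)))
```

The area with a missing direct estimate receives zero weight, so the aggregate reflects only the two areas with data.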

The directEST() function allows aggregation by survey weight directly by specifying weight = "survey", without going through the steps discussed in Section 7.2. We compute the direct estimates at the Admin 1 level and aggregate them to national estimates. When using survey weights, the aggregated point estimate at the national level is the same as the national-level direct estimate, but the uncertainty calculations are different, as discussed above.

res_ad1agg <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 1, 
                   weight = "population", 
                   admin.info = admin.info1, 
                   aggregation = TRUE)
head(res_ad1agg$res.admin1)
##   admin1.name direct.est   direct.var direct.logit.est direct.logit.var
## 1     Central  0.5666503 0.0009124018        0.2681973       0.01513139
## 2  Copperbelt  0.6194303 0.0016407269        0.4871308       0.02952453
## 3     Eastern  0.6849734 0.0005389968        0.7767228       0.01157562
## 4     Luapula  0.6341424 0.0013165862        0.5500295       0.02445973
## 5      Lusaka  0.5950341 0.0007179686        0.3848158       0.01236474
## 6    Muchinga  0.6819640 0.0008074465        0.7628123       0.01716478
##   direct.logit.prec  direct.se direct.lower direct.upper         cv
## 1          66.08777 0.03020599    0.5067753    0.6246405 0.06970350
## 2          33.87014 0.04050589    0.5375183    0.6950648 0.10643487
## 3          86.38845 0.02321631    0.6378051    0.7286127 0.07369633
## 4          40.88353 0.03628479    0.5605757    0.7019415 0.09917737
## 5          80.87516 0.02679494    0.5416220    0.6462870 0.06616591
## 6          58.25882 0.02841560    0.6238751    0.7348939 0.08934713
res_ad1agg$agg.natl
##      direct.est  direct.se   direct.var direct.lower direct.upper         cv
## 2.5%  0.6254591 0.01097975 0.0001205549     0.602551    0.6468098 0.02931522
res_ad1agg_bysurvey <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 1, 
                   weight = "survey", 
                   admin.info = admin.info1, 
                   aggregation = TRUE)

head(res_ad1agg_bysurvey$agg.natl)
##      direct.est  direct.se   direct.var direct.lower direct.upper         cv
## 2.5%  0.6317635 0.01092154 0.0001192799    0.6094978    0.6529042 0.02965903

Similarly, if we set admin = 2 with aggregation = TRUE, the output includes aggregated results at both the Admin 1 and national levels, computed by weighting the Admin 2 level estimates by their population (as specified by weight = "population").

res_ad2agg <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 2,
                   admin.info = admin.info2,
                   weight = "population",
                   aggregation = TRUE)


res_ad2agg_bysurvey <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 2, 
                   weight = "survey", 
                   admin.info = admin.info2, 
                   aggregation = TRUE)
head(res_ad2agg$res.admin2)
##        admin2.name.full direct.est   direct.var direct.logit.est
## 1      Central_Chibombo  0.6314670 6.230091e-03        0.5385154
## 2      Central_Chisamba  0.6067785 1.724683e-03        0.4337907
## 3      Central_Chitambo  0.6351265 4.825404e-02        0.5542735
## 4         Central_Kabwe  0.5305028 1.289813e-03        0.1221629
## 5 Central_Kapiri Mposhi  0.4712469 1.879603e-03       -0.1151396
## 6         Central_Luano  0.7000000 1.456485e-33        0.8472979
##   direct.logit.var direct.logit.prec    direct.se direct.lower direct.upper
## 1     1.150377e-01      8.692800e+00 7.893093e-02    0.4684794    0.7691079
## 2     3.029524e-02      3.300848e+01 4.152931e-02    0.5231456    0.6845870
## 3     8.985218e-01      1.112939e+00 2.196680e-01    0.2135625    0.9177477
## 4     2.079148e-02      4.809662e+01 3.591397e-02    0.4599735    0.5998344
## 5     3.027354e-02      3.303215e+01 4.335438e-02    0.3878973    0.5562312
## 6     3.302686e-32      3.027839e+31 3.816392e-17    0.7000000    0.7000000
##             cv   admin2.name admin1.name
## 1 2.141760e-01      Chibombo     Central
## 2 1.056130e-01      Chisamba     Central
## 3 6.020389e-01      Chitambo     Central
## 4 7.649454e-02         Kabwe     Central
## 5 9.199930e-02 Kapiri Mposhi     Central
## 6 1.272131e-16         Luano     Central
head(res_ad2agg$agg.admin1)
##   admin1.name direct.est  direct.se direct.lower direct.upper         cv
## 1     Central  0.5886856 0.02743155    0.5309335    0.6386792 0.06669241
## 2  Copperbelt  0.6245591 0.03069591    0.5609564    0.6809169 0.08175964
## 3     Eastern  0.6872843 0.01881645    0.6465393    0.7203940 0.06017109
## 4     Luapula  0.6495958 0.03249481    0.5822051    0.7088149 0.09273523
## 5      Lusaka  0.5894722 0.02492292    0.5399926    0.6371705 0.06070947
## 6    Muchinga  0.6735329 0.01998557    0.6302436    0.7093725 0.06121773
res_ad2agg$agg.natl
##      direct.est   direct.se   direct.var direct.lower direct.upper         cv
## 2.5%  0.6290747 0.008890824 7.904675e-05    0.6093993    0.6439434 0.02396931

We can compare these aggregated results at the national level using intervalPlot().

intervalPlot(admin = 0, group = FALSE, compare = TRUE, 
  model = list(
  "Admin 0 direct estimate" = res_ad0,
  "Admin 2 direct estimate, aggregated by survey" =res_ad2agg_bysurvey,
  "Admin 1 direct estimate, aggregated by survey" =res_ad1agg_bysurvey,
  "Admin 2 direct estimate, aggregated by population" = res_ad2agg,
  "Admin 1 direct estimate, aggregated by population" = res_ad1agg))

Figure 7.3: Comparing different aggregated direct estimates at national level

7.4 Aggregating area-level Fay-Herriot estimates

Model-based results from the Fay-Herriot model can be aggregated in a similar manner. We first re-fit the Admin 1 model with the two updated admin.info objects.

smth_res_ad1_spatial <- fhModel(data=data,
                                cluster.info = cluster.info,
                                admin.info = admin.info1,
                                admin = 1,
                                model = "bym2",
                                aggregation = TRUE)
smth_res_ad1_spatial_survey <- fhModel(data=data,
                                      cluster.info = cluster.info,
                                      admin.info = admin.info1.survey,
                                      admin = 1,
                                      model = "bym2",
                                      aggregation = TRUE)

We also re-fit the Admin 2 model with the variance fix (var.fix = TRUE).

# Re-fit the Admin 2 FH models with var.fix = TRUE
smth_res_ad2_agg_pop <- fhModel(data = data,
                                cluster.info = cluster.info,
                                admin.info = admin.info2,
                                admin = 2,
                                model = "bym2",
                                aggregation = TRUE, 
                                var.fix = TRUE)
smth_res_ad2_agg_survey <- fhModel(data = data,
                                  cluster.info = cluster.info,
                                  admin.info = admin.info2.survey,
                                  admin = 2,
                                  model = "bym2",
                                  aggregation = TRUE, 
                                  var.fix = TRUE)

We can then compare all four models when aggregating to the national level, with the national direct estimate as an additional comparison.

intervalPlot(admin = 0, group = FALSE, compare = TRUE, model = list(
            "Admin 0 direct estimates" = res_ad0,
            "Admin 2 FH model, aggregated by survey" = smth_res_ad2_agg_survey,
            "Admin 1 FH model, aggregated by survey" = smth_res_ad1_spatial_survey,
            "Admin 2 FH model, aggregated by pop" = smth_res_ad2_agg_pop,
            "Admin 1 FH model, aggregated by pop" = smth_res_ad1_spatial))

Figure 7.4: Comparing different aggregated Fay-Herriot estimates at national level

We can also compare the three sets of Fay-Herriot results at the Admin 1 level, with the Admin 1 direct estimates as an additional comparison.

intervalPlot(admin = 1, group = FALSE, compare = TRUE, model = list(
            "Admin 1 direct estimates" = res_ad1,
            "Admin 1 FH model" = smth_res_ad1_spatial_survey,
            "Admin 2 FH model, aggregated by survey" = smth_res_ad2_agg_survey,
            "Admin 2 FH model, aggregated by pop" = smth_res_ad2_agg_pop))

Figure 7.5: Comparing different aggregated Fay-Herriot estimates at Admin 1 level

7.5 Aggregating cluster-level model

In the same manner, we fit the unstratified and stratified cluster-level models at both the Admin 1 and Admin 2 levels. For simplicity of presentation, we use only the population fractions derived from WorldPop. Aggregating with survey weights produces very similar results, which can be obtained by replacing the admin.info argument with the corresponding objects computed from survey weights, i.e., admin.info1.survey and admin.info2.survey.

cl_res_ad1_pop <- clusterModel(data=data,
                               cluster.info=cluster.info,
                               admin.info = admin.info1,
                               stratification = FALSE,
                               model = "bym2",
                               admin = 1, 
                               aggregation = TRUE,
                               CI = 0.95)
cl_res_ad2_pop <- clusterModel(data=data,
                               cluster.info= cluster.info,
                               admin.info = admin.info2,
                               model = "bym2",
                               stratification = FALSE,
                               admin = 2, 
                               aggregation = TRUE,
                               CI = 0.95)
cl_res_ad1_T_pop <- clusterModel(data=data,
                                 cluster.info=cluster.info,
                                 admin.info = admin.info1,
                                 model = "bym2",
                                 stratification = TRUE,
                                 admin = 1, 
                                 aggregation = TRUE,
                                 CI = 0.95)
cl_res_ad2_T_pop <- clusterModel(data=data,
                                 cluster.info = cluster.info,
                                 admin.info = admin.info2,
                                 model = "bym2",
                                 stratification = TRUE,
                                 admin = 2, 
                                 aggregation = TRUE,
                                 CI = 0.95)

We then compare the different aggregated national estimates with the national direct estimates.

intervalPlot(admin = 0, group = FALSE, compare = TRUE, model = list(
            "Admin 0 direct estimates" = res_ad0,
             "Admin 1 unstratified model, aggregated by pop" = cl_res_ad1_pop,
             "Admin 2 unstratified model, aggregated by pop" = cl_res_ad2_pop,
             "Admin 1 stratified model, aggregated by pop" = cl_res_ad1_T_pop,
             "Admin 2 stratified model, aggregated by pop" = cl_res_ad2_T_pop))

Figure 7.6: Comparing different aggregated cluster-level estimates at national level

We can also compare the different model estimates at the Admin 1 level, together with the Admin 1 direct estimates.

intervalPlot(admin = 1, group = FALSE, compare = TRUE, model = list(
             "Admin 1 direct estimates" = res_ad1,
             "Admin 1 unstratified model" = cl_res_ad1_pop,
             "Admin 2 unstratified model, aggregated by pop" = cl_res_ad2_pop,
             "Admin 1 stratified model" = cl_res_ad1_T_pop,
             "Admin 2 stratified model, aggregated by pop" = cl_res_ad2_T_pop))

Figure 7.7: Comparing different cluster-level estimates at Admin 1 level

Finally, we put together all the different Admin 1 level results based on both the Fay-Herriot and cluster-level models. When comparing many estimates, it is often useful to highlight groups of similar models. This can be done by adding a group element to each fitted model object and setting group = TRUE in intervalPlot().

res_ad1$group <- "Direct Estimate"
smth_res_ad1_spatial_survey$group <- "Fay-Herriot"
smth_res_ad2_agg_survey$group <- "Fay-Herriot"
smth_res_ad2_agg_pop$group <- "Fay-Herriot"
cl_res_ad1_pop$group <- "Cluster-Level"
cl_res_ad2_pop$group <- "Cluster-Level"
cl_res_ad1_T_pop$group <- "Cluster-Level"
cl_res_ad2_T_pop$group <- "Cluster-Level"
intervalPlot(admin = 1, group = TRUE, compare = TRUE, model = list(
                 "Admin 1 direct estimates" = res_ad1,
                 "Admin 1 FH model" = smth_res_ad1_spatial_survey,
                 "Admin 2 FH model, aggregated by survey" = smth_res_ad2_agg_survey,
                 "Admin 2 FH model, aggregated by pop" = smth_res_ad2_agg_pop,
                 "Admin 1 unstratified model" = cl_res_ad1_pop,
                 "Admin 2 unstratified model, aggregated by pop" = cl_res_ad2_pop,
                 "Admin 1 stratified model" = cl_res_ad1_T_pop,
                 "Admin 2 stratified model, aggregated by pop" = cl_res_ad2_T_pop))

Figure 7.8: Comparing different model estimates at Admin 1 level

Acknowledgement

We thank Ben Mayala and Trevor Croft from the Demographic and Health Surveys (DHS) program for useful discussions in creating this R package.

References

Dong, Qianyu, Yunhan Wu, Zehang Richard Li, and Jon Wakefield. 2026. “Toward a Principled Workflow for Prevalence Mapping Using Household Survey Data.” Journal of Survey Statistics and Methodology, smaf048.
Li, Zehang R, Bryan D Martin, Tracy Q Dong, Geir-Arne Fuglstad, Jessica Godwin, John Paige, Andrea Riebler, Samuel Clark, and Jon Wakefield. 2020. Space-Time Smoothing of Demographic and Health Indicators Using the R Package SUMMER. arXiv Preprint.
Rao, John NK, and Isabel Molina. 2015. Small Area Estimation. John Wiley & Sons.
Riebler, Andrea, Sigrunn H Sørbye, Daniel Simpson, and Håvard Rue. 2016. “An Intuitive Bayesian Spatial Model for Disease Mapping That Accounts for Scaling.” Statistical Methods in Medical Research 25 (4): 1145–65.
Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society Series B: Statistical Methodology 71 (2): 319–92.
Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, and Sigrunn H Sørbye. 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.”
Wakefield, Jonathan, Taylor Okonek, and Jon Pedersen. 2020. “Small Area Estimation for Disease Prevalence Mapping.” International Statistical Review 88 (2): 398–418.
Wakefield, Jon, Jitong Jiang, and Yunhan Wu. 2026. “Automatic Variance Adjustment for Small Area Estimation.” arXiv Preprint arXiv:2602.14387. https://doi.org/10.48550/arXiv.2602.14387.

  1. https://gadm.org/download_country.html

  2. https://hub.worldpop.org/project/categories?id=8

  3. https://gadm.org/download_country.html