Introduction to unicefData

Motivation: Data Provenance in the Age of AI

This package is motivated by a fundamental principle: data acquisition should be treated as code, not as an external preparatory step. This is increasingly important in the age of AI-assisted research.

As AI tools accelerate analytical workflows while also making it easy to fabricate plausible-looking statistics and citations, anchoring research to authoritative, version-controlled data sources becomes essential infrastructure for scientific credibility. The unicefData package operationalizes this principle by making data provenance an integral component of your analytical script.

When you specify:

df <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("ALB", "USA", "BRA"),
  year = "2015:2023"
)

You are not downloading a spreadsheet from a portal and applying undocumented filters. Instead, you are executing an explicit, reproducible specification of provenance: a command that documents what data were requested, from which source, and under what constraints. This ensures:

Traceability: Others can inspect your code and verify exactly which data were used
Transparency: Data selection decisions are parametrized and visible
Sustainability: If upstream data are revised, rerunning the same command retrieves the revised values without any change to your code
Reproducibility: Your analysis remains verifiable and replicable over time

Design Principles from wbopendata

The unicefData package adopts three design principles from wbopendata (Azevedo, 2026), a similar package for World Bank data:

Data acquisition as code: Indicator selection, country coverage, and time ranges are explicit parameters in your analytical script
Backward compatibility as trust infrastructure: The package prioritizes stable syntax so analyses remain reproducible even as APIs evolve
Domain-specific syntax as error prevention: Rather than exposing HTTP requests and JSON, the interface uses familiar concepts (indicators, countries, years) that constrain input to meaningful values

Together, these principles treat data acquisition as infrastructure for reproducibility, not mere convenience.

The UNICEF Data Warehouse

The United Nations Children’s Fund (UNICEF) maintains one of the world’s most comprehensive databases on child welfare, covering health, nutrition, education, protection, HIV/AIDS, and water, sanitation and hygiene (WASH). The UNICEF Data Warehouse uses the Statistical Data and Metadata eXchange (SDMX) standard, a framework for exchanging statistical information that is published as an ISO standard (ISO 17369).

The warehouse currently maintains 733+ indicators organized across thematic dataflows:

Dataflow	Domain	Indicators
CME	Child Mortality Estimates	39
NUTRITION	Stunting, wasting, underweight	112
IMMUNISATION	Immunization coverage	18
WASH_HOUSEHOLDS	Water and sanitation	57
EDUCATION	Education access and quality	38
HIV_AIDS	HIV-related indicators	38
MNCH	Maternal, newborn, child health	varies
PT	Child protection	varies
CHLD_PVTY	Child poverty	varies
GENDER	Gender equality	varies

What is SDMX?

SDMX (Statistical Data and Metadata eXchange) is an international standard for structuring and exchanging statistical data. Within SDMX:

An indicator is a time-series measure of a specific phenomenon (e.g., under-5 mortality rate, stunting prevalence, DTP3 coverage).
Each indicator is identified by a unique code (e.g., CME_MRY0T4) and belongs to a dataflow, which groups related indicators.
Observations are disaggregated along dimensions: sex, age, wealth quintile, residence (urban/rural), and maternal education.
Each observation also carries attributes (data source, confidence intervals, observation status) that contextualize the value.

While powerful, direct SDMX API interaction requires knowledge of dataflow structures, dimension codes, and RESTful query syntax. The unicefData package removes these barriers.
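
To make "RESTful query syntax" concrete, the sketch below assembles an SDMX REST data URL by hand. It is only an illustration: the endpoint path, the `UNICEF,CME,1.0` dataflow reference, and the dimension order (country, indicator, sex) are assumptions, and `build_sdmx_url()` is a hypothetical helper rather than a function exported by unicefData. Check the dataflow structure before relying on any of these.

```r
# Hypothetical helper: compose an SDMX REST data query by hand.
# The base URL, dataflow reference, and dimension order are assumptions.
build_sdmx_url <- function(flow_ref, dims, start = NULL, end = NULL,
                           base = "https://sdmx.data.unicef.org/ws/public/sdmxapi/rest") {
  # Codes within one dimension are joined with "+", dimensions with "."
  key <- paste(vapply(dims, paste, character(1), collapse = "+"), collapse = ".")
  url <- paste(base, "data", flow_ref, key, sep = "/")
  # startPeriod / endPeriod are the standard SDMX REST time filters
  params <- c(
    if (!is.null(start)) paste0("startPeriod=", start),
    if (!is.null(end)) paste0("endPeriod=", end)
  )
  if (length(params) > 0) url <- paste0(url, "?", paste(params, collapse = "&"))
  url
}

url <- build_sdmx_url(
  flow_ref = "UNICEF,CME,1.0",
  dims = list(c("ALB", "USA", "BRA"), "CME_MRY0T4", "_T"),
  start = 2015, end = 2023
)
print(url)
```

Every piece of this string (dataflow, key order, code lists) must be known in advance, which is the knowledge the package interface is meant to spare the user.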

Why unicefData?

The unicefData package provides a simple R interface to the UNICEF Data Warehouse. It is part of a trilingual ecosystem with parallel implementations in R, Python (unicef_api), and Stata (unicefdata), which share largely the same function names and parameter structures to support cross-team collaboration.

Key features:

Automatic dataflow detection: specify only the indicator code; the package finds the correct dataflow automatically.
Discovery commands: search indicators by keyword, list dataflows, display metadata—no prior SDMX knowledge required.
Flexible filtering: countries, years, sex, age, wealth quintiles, residence, maternal education.
Multiple output formats: long (default), wide (years as columns), wide_indicators (indicators as columns).
Caching with memoisation: repeated queries are fast.

Installation

Install from GitHub:

# install.packages("devtools")
devtools::install_github("unicef-drp/unicefData")

Quick start

library(unicefData)

Discovering indicators

Before downloading data, explore what is available:

# Browse indicator categories (thematic dataflows)
list_categories()

# Search for indicators by keyword
search_indicators("mortality")

# List all indicators in the Child Mortality Estimates dataflow
list_indicators("CME")

# Get detailed information about a specific indicator
get_indicator_info("CME_MRY0T4")

These discovery commands mirror Examples 1–4 of the package paper. The Stata equivalents are:

. unicefdata, categories
. unicefdata, search(mortality)
. unicefdata, indicators(CME)
. unicefdata, info(CME_MRY0T4)

Basic data retrieval

Fetch the under-5 mortality rate for three countries over a range of years:

# Example 5 (paper): Basic data retrieval
df <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("BRA", "IND", "CHN"),
  year = "2015:2023"
)
head(df)

The equivalent Stata command is:

. unicefdata, indicator(CME_MRY0T4) countries(BRA IND CHN) year(2015:2023) clear

Geographic filtering

Fetch data for East African countries for a single year:

# Example 6 (paper): Geographic filtering
df <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("KEN", "TZA", "UGA", "ETH", "RWA"),
  year = 2020
)

Latest values and most recent values

# Example 7 (paper): Get the latest available value per country
df_latest <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("BGD", "IND", "PAK"),
  latest = TRUE
)

# Get the 3 most recent values per country
df_mrv <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("BGD", "IND", "PAK"),
  mrv = 3
)
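
The difference between the two options can be illustrated on a toy long-format table using base R only. This is a sketch of the idea, not the package's implementation: the column names (`iso3`, `period`, `value`) are assumptions, the numbers are invented placeholders rather than UNICEF estimates, and `most_recent()` is a hypothetical helper.

```r
# Toy long-format data; column names and values are illustrative placeholders.
toy <- data.frame(
  iso3   = c("BGD", "BGD", "BGD", "BGD", "IND", "IND", "IND"),
  period = c(2019, 2020, 2021, 2022, 2018, 2020, 2021),
  value  = c(4, 3, 2, 1, 7, 6, 5)
)

# Hypothetical helper: keep the n most recent non-missing observations per country.
most_recent <- function(df, n = 1) {
  df <- df[!is.na(df$value), ]
  parts <- lapply(split(df, df$iso3), function(d) {
    d <- d[order(d$period, decreasing = TRUE), ]
    head(d, n)
  })
  out <- do.call(rbind, parts)
  rownames(out) <- NULL
  out
}

latest_rows <- most_recent(toy, n = 1)  # analogous to latest = TRUE
mrv_rows    <- most_recent(toy, n = 3)  # analogous to mrv = 3
print(latest_rows)
print(mrv_rows)
```

Note that the most recent year can differ by country (2022 for BGD and 2021 for IND in the toy data), so a "latest" extract is not necessarily a single-year cross-section.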

Year specifications

The year parameter supports multiple formats:

# Single year
df <- unicefData(indicator = "CME_MRY0T4", year = 2020)

# Year range
df <- unicefData(indicator = "CME_MRY0T4", year = "2015:2023")

# Non-contiguous years
df <- unicefData(indicator = "CME_MRY0T4", year = "2015,2018,2020")

# Circa mode: find closest available year
df <- unicefData(indicator = "CME_MRY0T4", year = 2015, circa = TRUE)
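
As a rough mental model of these formats, the following base R sketch expands each specification into a vector of years and picks the nearest available year for a circa lookup. Both `parse_year_spec()` and `closest_year()` are hypothetical helpers written for this illustration; how the package itself parses years or breaks ties between equally distant years is not shown here.

```r
# Hypothetical parser for the year formats shown above.
parse_year_spec <- function(year) {
  spec <- as.character(year)
  if (grepl(":", spec, fixed = TRUE)) {
    # "2015:2023" -> every year in the closed range
    bounds <- as.integer(strsplit(spec, ":", fixed = TRUE)[[1]])
    return(seq(bounds[1], bounds[2]))
  }
  # "2015,2018,2020" or a single year -> the listed years
  as.integer(trimws(strsplit(spec, ",", fixed = TRUE)[[1]]))
}

# Hypothetical "circa" lookup: the available year nearest to the target.
# which.min() returns the first minimum, so ties go to the earlier entry.
closest_year <- function(target, available) {
  available[which.min(abs(available - target))]
}

print(parse_year_spec(2020))
print(parse_year_spec("2015:2023"))
print(parse_year_spec("2015,2018,2020"))
print(closest_year(2015, c(2012, 2014, 2019)))  # 2014
```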

Disaggregations

UNICEF data supports rich disaggregation along multiple dimensions. Not all dimensions are available for all indicators—availability depends on the dataflow (see the disaggregation matrix in the package paper).

By sex

# Total only (default)
df <- unicefData(indicator = "CME_MRY0T4", sex = "_T")

# Female only
df <- unicefData(indicator = "CME_MRY0T4", sex = "F")

# All sex categories (total, male, female)
df <- unicefData(indicator = "CME_MRY0T4", sex = "ALL")

By wealth quintile

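# Example 8 (paper): Nutrition indicator by wealth and sex
# (NT_ANT_WHZ_NE2 is a weight-for-height, i.e. wasting, code; the stunting code is NT_ANT_HAZ_NE2)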
df <- unicefData(
  indicator = "NT_ANT_WHZ_NE2",
  countries = "IND",
  sex = "ALL",
  wealth = "ALL"
)

By residence

# Urban only
df <- unicefData(indicator = "NT_ANT_HAZ_NE2", residence = "U")

# Rural only
df <- unicefData(indicator = "NT_ANT_HAZ_NE2", residence = "R")

Output formats

Wide format (years as columns)

Useful for time-series analysis:

# Example 9 (paper): Wide format
df_wide <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("USA", "GBR", "DEU", "FRA"),
  year = "2000,2010,2020,2023",
  format = "wide"
)
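
For readers who want to see what "years as columns" means, the reshaping can be reproduced on a toy table with base R's `reshape()`. This is a sketch under assumptions: the column names (`iso3`, `period`, `value`) are illustrative, the numbers are placeholders rather than real estimates, and the column names produced by the package may differ from the `value.<year>` pattern shown here.

```r
# Toy long-format data: one row per country-year (placeholder values).
long <- data.frame(
  iso3   = rep(c("USA", "GBR"), each = 3),
  period = rep(c(2000, 2010, 2020), times = 2),
  value  = c(1, 2, 3, 4, 5, 6)
)

# stats::reshape() spreads `period` into one `value.<year>` column per year.
wide <- reshape(long, idvar = "iso3", timevar = "period", direction = "wide")
rownames(wide) <- NULL
print(wide)
```

The wide table has one row per country, which is convenient for computing changes between years but loses the per-observation attributes that the long format carries.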

Multiple indicators

Fetch and merge multiple indicators automatically:

# Example 10 (paper): Multiple indicators
df <- unicefData(
  indicator = c("CME_MRM0", "CME_MRY0T4"),
  countries = c("KEN", "TZA", "UGA"),
  year = 2020
)

# Wide indicators format: one column per indicator
df_wide <- unicefData(
  indicator = c("CME_MRY0T4", "CME_MRY0", "IM_DTP3", "IM_MCV1"),
  countries = c("AFG", "ETH", "PAK", "NGA"),
  latest = TRUE,
  format = "wide_indicators"
)
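
Conceptually, an indicators-as-columns table is a join of per-indicator tables on country and year. The base R sketch below shows that idea with `merge()`; it is not the package's implementation, the column names are assumptions, and the values are placeholders rather than real estimates.

```r
# Two per-indicator tables (placeholder values; column names are assumptions).
ind_a <- data.frame(iso3 = c("KEN", "TZA", "UGA"), period = 2020,
                    CME_MRY0T4 = c(10, 20, 30))
ind_b <- data.frame(iso3 = c("KEN", "TZA"), period = 2020,
                    CME_MRM0 = c(1, 2))

# all = TRUE keeps country-years present in only one indicator (filled with NA).
combined <- merge(ind_a, ind_b, by = c("iso3", "period"), all = TRUE)
print(combined)
```

Using a full join makes gaps visible as `NA` instead of silently dropping countries that lack one of the indicators.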

Metadata enrichment

Add regional and income group classifications:

# Example 12 (paper): Regional classifications
df <- unicefData(
  indicator = "CME_MRY0T4",
  add_metadata = c("region", "income_group"),
  latest = TRUE
)

Data cleaning and filtering

Post-processing utilities for downloaded data:

# Clean raw SDMX column names to user-friendly names
df_raw <- unicefData_raw(indicator = "CME_MRY0T4", countries = "BRA")
df_clean <- clean_unicef_data(df_raw)

# Filter to specific disaggregations
df_filtered <- filter_unicef_data(df_clean, sex = "F", wealth = "Q1")

Cache management

The package caches metadata and API responses for performance. To clear and refresh all caches:

# Clear all caches and reload metadata
clear_unicef_cache()

# Clear without reloading (lazy reload on next use)
clear_unicef_cache(reload = FALSE)

# View cache status
get_cache_info()
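
The general idea behind a memoised cache can be sketched in a few lines of base R. Everything here is hypothetical and independent of the package internals: `slow_fetch()` stands in for a network request, and `cached_fetch()` and `clear_cache_sketch()` are illustrative names.

```r
# Results are stored in an environment, keyed by the query parameters.
cache <- new.env(parent = emptyenv())
n_calls <- 0

# Stand-in for an expensive API call; counts how often it actually runs.
slow_fetch <- function(indicator, year) {
  n_calls <<- n_calls + 1
  data.frame(indicator = indicator, period = year)
}

cached_fetch <- function(indicator, year) {
  key <- paste(indicator, year, sep = "|")
  if (!exists(key, envir = cache, inherits = FALSE)) {
    assign(key, slow_fetch(indicator, year), envir = cache)
  }
  get(key, envir = cache, inherits = FALSE)
}

# Dropping all stored entries forces the next call to fetch again.
clear_cache_sketch <- function() rm(list = ls(cache), envir = cache)

first  <- cached_fetch("CME_MRY0T4", 2020)  # runs slow_fetch()
second <- cached_fetch("CME_MRY0T4", 2020)  # served from the cache
n_calls_before_clear <- n_calls
clear_cache_sketch()
third <- cached_fetch("CME_MRY0T4", 2020)   # runs slow_fetch() again
```

The trade-off is the usual one for caches: repeated queries are fast, but a cached result can be older than the upstream data until the cache is cleared.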

Dataflow schemas

Inspect the structure of any dataflow:

# View the dimensions and attributes of a dataflow
schema <- dataflow_schema("CME")
print(schema)

Cross-language parity

The unicefData ecosystem provides identical functionality across R, Python, and Stata. The same analytical workflow translates directly:

Operation	R	Python	Stata
Search	`search_indicators("mortality")`	`search_indicators("mortality")`	`unicefdata, search(mortality)`
Fetch	`unicefData(indicator="CME_MRY0T4")`	`unicefData(indicator="CME_MRY0T4")`	`unicefdata, indicator(CME_MRY0T4) clear`
Latest	`unicefData(..., latest=TRUE)`	`unicefData(..., latest=True)`	`unicefdata, ... latest clear`
Wide	`unicefData(..., format="wide")`	`unicefData(..., format="wide")`	`unicefdata, ... wide clear`
Cache	`clear_unicef_cache()`	`clear_cache()`	`unicefdata, clearcache`
Sync	`sync_metadata()`	`sync_metadata()`	`unicefdata_sync, all`

This parity enables cross-team collaboration: an analyst can prototype in R and a colleague can reproduce the workflow in Stata or Python with minimal translation.

Design Principles and Reproducibility

The unicefData package embodies three design principles that make reproducibility the default rather than the exception:

1. Data acquisition as code

When you write:

df <- unicefData(indicator = "CME_MRY0T4", countries = c("ALB", "USA", "BRA"), year = "2015:2023")

You are not performing manual steps that may later be forgotten or go undocumented. Every data selection decision (indicator, countries, years, disaggregations) is explicitly specified in your script. This ensures that:

Others can audit your data selection
You can reproduce results exactly
Revisions to upstream data are transparent
Your analysis is sustainable across time

2. Backward compatibility as trust infrastructure

The package prioritizes stable syntax and predictable behavior. This matters because:

Historical analyses remain reproducible even as the UNICEF SDMX API evolves
Scripts written in 2026 are intended to keep working in 2030
Backward compatibility reduces the risk that methodology becomes irreproducible due to infrastructure drift

3. Domain-specific syntax as error prevention

Rather than exposing HTTP requests and JSON parsing, the interface uses concepts familiar to development researchers: indicators, countries, years. This constrains input to meaningful values and reduces opportunities for error.

Why This Matters for AI-Assisted Research

In an era where AI tools accelerate analytical workflows, these principles become more important, not less. As generative tools lower the cost of producing plausible analyses and narratives, anchoring empirical work to authoritative and verifiable data sources is essential infrastructure for scientific credibility. The unicefData package provides this foundation by making data provenance explicit and executable.

Acknowledgments

This package was developed at the UNICEF Data and Analytics Section. The author gratefully acknowledges the collaboration of Lucas Rodrigues, Yang Liu, and Karen Avanesian, whose technical contributions and feedback were instrumental in the development of this R package.

Special thanks to Yves Jaques, Alberto Sibileau, and Daniele Olivotti for designing and maintaining the UNICEF SDMX data warehouse infrastructure that makes this package possible.

The author also acknowledges the UNICEF database managers and technical teams who ensure data quality, as well as the country office staff and National Statistical Offices whose data collection efforts make this work possible.

Development of this package was supported by UNICEF institutional funding for data infrastructure and statistical capacity building. The author also acknowledges UNICEF colleagues who provided testing and feedback during development, as well as the broader open-source R community.

Development was assisted by AI coding tools (GitHub Copilot, Claude). All code has been reviewed, tested, and validated by the package maintainers.

Disclaimer

This package is provided for research and analytical purposes.

The unicefData package provides programmatic access to UNICEF’s public data warehouse. While the author is affiliated with UNICEF, this package is not an official UNICEF product and the statements in this documentation are the views of the author and do not necessarily reflect the policies or views of UNICEF.

Data accessed through this package comes from the UNICEF Data Warehouse. Users should verify critical data points against official UNICEF publications at data.unicef.org.

This software is provided “as is”, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or UNICEF be liable for any claim, damages or other liability arising from the use of this software.

The designations employed and the presentation of material in this package do not imply the expression of any opinion whatsoever on the part of UNICEF concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries.

Data Citation and Provenance

Important Note on Data Vintages

Official statistics are subject to revisions as new information becomes available and estimation methodologies improve. UNICEF indicators are regularly updated based on new surveys, censuses, and improved modeling techniques. Historical values may be revised retroactively to reflect better information or methodological improvements.

For reproducible research and proper data attribution, users should:

Document the indicator code: Specify the exact SDMX indicator code(s) used (e.g., CME_MRY0T4)
Record the download date: Note when the data were accessed (e.g., “Data downloaded: 2026-02-09”)
Cite the data source: Reference both the package and the UNICEF Data Warehouse
Archive your dataset: Save a copy of the exact data used in your analysis

Example citation for data used in research:

Under-5 mortality data (indicator: CME_MRY0T4) accessed from UNICEF Data Warehouse via unicefData R package (v2.1.0) on 2026-02-09. Data available at: https://sdmx.data.unicef.org/

This practice ensures that others can verify your results and understand any differences that may arise from data updates. For official UNICEF statistics in publications, always cross-reference with the current version at data.unicef.org.
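
One way to follow these steps in a script is to save the dataset together with a small provenance record. The sketch below uses base R only; `archive_with_provenance()` and `format_citation()` are hypothetical helpers, not functions provided by unicefData, and the one-row data frame is a stand-in for a real download.

```r
# Hypothetical helper: write a dataset together with a provenance record.
archive_with_provenance <- function(df, indicator, pkg_version, path,
                                    accessed = Sys.Date()) {
  provenance <- list(
    indicator   = indicator,
    source      = "UNICEF Data Warehouse (https://sdmx.data.unicef.org/)",
    package     = paste0("unicefData R package (v", pkg_version, ")"),
    accessed_on = format(accessed, "%Y-%m-%d")
  )
  saveRDS(list(data = df, provenance = provenance), file = path)
  invisible(provenance)
}

# Hypothetical helper: turn the provenance record into a citation sentence.
format_citation <- function(p) {
  sprintf("Data (indicator: %s) accessed from %s via %s on %s.",
          p$indicator, p$source, p$package, p$accessed_on)
}

# Stand-in for downloaded data; the value is deliberately left missing.
df <- data.frame(iso3 = "BRA", period = 2020, value = NA_real_)
path <- tempfile(fileext = ".rds")
prov <- archive_with_provenance(df, "CME_MRY0T4", "2.1.0", path,
                                accessed = as.Date("2026-02-09"))
cat(format_citation(prov), "\n")

# The archived file restores both the data and its provenance.
restored <- readRDS(path)
```

In a real script the version string could be taken from `utils::packageVersion("unicefData")` instead of being typed by hand, so the record always matches the installed package.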