This vignette provides a complete workflow for downloading Brazil's quarterly PNADC (Pesquisa Nacional por Amostra de Domicílios Contínua) microdata and preparing it for mensalization. The workflow covers three steps: downloading the quarterly files with the PNADcIBGE package, stacking them into a single dataset, and identifying reference periods with the PNADCperiods package.

If you already have PNADC data and want to learn the package API first, see Get Started. For algorithm details, see How PNADCperiods Works.
PNADC is Brazil’s primary household survey for labor market statistics, conducted by IBGE. The survey uses a rotating panel design where each household is interviewed five times over 15 months. Each quarterly release contains approximately 500,000 observations.
Why stack multiple quarters? The mensalization algorithm identifies reference months by tracking households across their panel interviews. With a single quarter, the determination rate is only ~70%. By stacking multiple quarters, the algorithm leverages the rotating panel structure to achieve over 97% determination.
| Quarters Stacked | Month % | Fortnight % | Week % |
|---|---|---|---|
| 1 (single quarter) | ~70% | ~7% | ~2% |
| 8 (2 years) | ~94% | ~9% | ~3% |
| 20 (5 years) | ~95% | ~8% | ~3% |
| 55+ (full history) | ~97% | ~9% | ~3% |
For most applications, we recommend stacking at least 2 years (8 quarters) of data.
Create a grid of year-quarter combinations. This example uses 2020-2024, which provides a good balance between data size and determination rate:
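A minimal setup sketch for this step; the library calls follow the functions used below, while `data_dir` and its path are assumptions you should adapt to your own machine:

```r
library(PNADcIBGE)  # get_pnadc()
library(fst)        # write_fst() / read_fst()
library(data.table) # rbindlist() and fast aggregation

# Local folder for the downloaded quarters (assumed path)
data_dir <- "data/pnadc"
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Grid of year-quarter combinations for 2020-2024 (20 quarters)
editions <- expand.grid(year = 2020:2024, quarter = 1:4)
```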
The download loop fetches each quarter from IBGE and saves it in FST format for fast loading:
```r
for (i in 1:nrow(editions)) {
  year_i <- editions$year[i]
  quarter_i <- editions$quarter[i]
  filename <- paste0("pnadc_", year_i, "-", quarter_i, "q.fst")

  cat("Downloading:", year_i, "Q", quarter_i, "\n")

  # Download from IBGE
  pnadc_quarter <- get_pnadc(
    year = year_i,
    quarter = quarter_i,
    labels = FALSE,  # IMPORTANT: use numeric codes, not labels
    deflator = FALSE,
    design = FALSE,
    savedir = data_dir
  )

  # Save in FST format (fast serialization)
  write_fst(pnadc_quarter, file.path(data_dir, filename))

  # Clean up temporary files created by PNADcIBGE
  temp_files <- list.files(data_dir,
                           pattern = "\\.(zip|sas|txt)$",
                           full.names = TRUE)
  file.remove(temp_files)

  rm(pnadc_quarter)
  gc()
}
```

**Important:** Always use `labels = FALSE` when downloading. The mensalization algorithm requires numeric codes for the birthday variables (V2008, V20081, V20082); labeled factors will cause errors.
Stack all quarterly files into a single dataset. To save memory, only load the columns needed for mensalization:
```r
# Columns needed for mensalization
cols_needed <- c(
  # Time and identifiers
  "Ano", "Trimestre", "UPA", "V1008", "V1014",
  # Birthday variables (for the reference period algorithm)
  "V2008", "V20081", "V20082", "V2009",
  # Weight and stratification (for weight calibration)
  "V1028", "UF", "posest", "posest_sxi",
  # Labor market variables (used in the analysis examples below)
  "VD4001", "VD4002"
)

# Stack all quarters
files <- list.files(data_dir, pattern = "pnadc_.*\\.fst$", full.names = TRUE)
pnadc_stacked <- rbindlist(lapply(files, function(f) {
  cat("Loading:", basename(f), "\n")
  read_fst(f, columns = cols_needed, as.data.table = TRUE)
}))

cat("Total observations:", format(nrow(pnadc_stacked), big.mark = ","), "\n")
```

Build the crosswalk (identify reference periods) and calibrate weights:
```r
# Step 1: Build crosswalk (identify reference periods)
crosswalk <- pnadc_identify_periods(pnadc_stacked, verbose = TRUE)

# Check determination rates
crosswalk[, .(
  month_rate = mean(determined_month),
  fortnight_rate = mean(determined_fortnight),
  week_rate = mean(determined_week)
)]

# Step 2: Apply crosswalk and calibrate weights
result <- pnadc_apply_periods(
  data = pnadc_stacked,
  crosswalk = crosswalk,
  weight_var = "V1028",
  anchor = "quarter",
  calibrate = TRUE,
  calibration_unit = "month",
  verbose = TRUE
)
```

The verbose output shows progress and determination rates for each phase (month, fortnight, week). With 20 quarters stacked (2020-2024), expect ~95% month determination.
The result contains all original columns plus reference period indicators and calibrated weights:
```r
# Key new columns
names(result)[grep("ref_|determined_|weight_", names(result))]

# Distribution of reference months within quarters
result[, .N, by = ref_month_in_quarter][order(ref_month_in_quarter)]
```

Key output columns:
| Column | Description |
|---|---|
| `ref_month_in_quarter` | Position within the quarter (1, 2, or 3; NA if indeterminate) |
| `ref_month_yyyymm` | Reference month as a YYYYMM integer (e.g., 202301) |
| `determined_month` | Logical flag (TRUE if the month was determined) |
| `weight_monthly` | Calibrated monthly weight (if `calibrate = TRUE`) |
The distribution is approximately equal across months 1, 2, and 3 (each around 31-32%); the remaining observations are NA (indeterminate cases).
Save the mensalized data for future use:
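A saving sketch using the same FST format as the rest of the workflow; the output filename is an assumption:

```r
# Save the mensalized dataset for reuse (assumed filename)
write_fst(result, file.path(data_dir, "pnadc_mensalized.fst"))

# Later sessions can reload it directly:
# result <- read_fst(file.path(data_dir, "pnadc_mensalized.fst"),
#                    as.data.table = TRUE)
```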
To compute monthly estimates, filter to determined observations and
aggregate by ref_month_yyyymm:
```r
# Monthly unemployment rate:
# unemployed (VD4002 == 2) over the labor force (VD4001 == 1)
monthly_unemployment <- result[determined_month == TRUE, .(
  unemployment_rate = sum((VD4002 == 2) * weight_monthly, na.rm = TRUE) /
    sum((VD4001 == 1) * weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]

# Monthly population
monthly_pop <- result[determined_month == TRUE, .(
  population = sum(weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]
```

For more analysis examples, see Applied Examples.
- **Selective column loading:** Only load the columns you need with `read_fst(..., columns = ...)`. This dramatically reduces memory usage.
- **Process in batches:** For very large analyses, process one year at a time and combine results.
- **Use FST format:** FST is much faster than CSV or RDS for large datasets. A typical quarter loads in seconds rather than minutes.
- **Clean up regularly:** Use `rm()` and `gc()` to free memory after processing each quarter.
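The batching tip can be sketched as follows, reusing the per-quarter FST files from the download step; the per-year file pattern and the placeholder summary are illustrative assumptions:

```r
library(data.table)
library(fst)

# Process one year of quarterly FST files at a time, then combine summaries
yearly_summaries <- lapply(2020:2024, function(y) {
  files_y <- list.files(data_dir,
                        pattern = paste0("^pnadc_", y, "-.*\\.fst$"),
                        full.names = TRUE)
  dt <- rbindlist(lapply(files_y, read_fst,
                         columns = cols_needed, as.data.table = TRUE))
  out <- dt[, .(year = y, n_obs = .N)]  # placeholder per-year summary
  rm(dt); gc()                          # free memory before the next year
  out
})
combined <- rbindlist(yearly_summaries)
```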
| Period | Quarters | Observations | FST Size |
|---|---|---|---|
| 2020-2024 | 20 | ~8.9M | ~5 GB |
| 2012-2025 | 55 | ~29M | ~15 GB |
For the best determination rate and longitudinal analysis, download all available quarters:
```r
# Download all available data (2012-present)
editions_full <- expand.grid(
  year = 2012:2025,
  quarter = 1:4
)

# Drop quarters that have not been released yet
editions_full <- editions_full[!(editions_full$year == 2025 &
                                   editions_full$quarter > 3), ]

# Use the same download and stacking workflow as above
```

The full history provides approximately 29 million observations and achieves the highest possible determination rate (~97% month).
- **"Column not found" errors:** Ensure you used `labels = FALSE` when downloading. The algorithm requires numeric codes.
- **Download failures:** IBGE servers can be slow or unavailable. The PNADcIBGE package will retry automatically, but you may need to restart interrupted downloads.
- **Memory errors:** Try processing fewer quarters at a time, or use a machine with more RAM.
- **SIDRA API errors:** Weight calibration requires internet access to the SIDRA API. If it fails, try again later, or use `calibrate = FALSE` to identify reference periods without calibrating weights.
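The `calibrate = FALSE` fallback can be sketched by mirroring the earlier `pnadc_apply_periods()` call with calibration disabled; the result object name is an assumption:

```r
# Identify reference periods without weight calibration
# (no SIDRA access needed)
result_nocal <- pnadc_apply_periods(
  data = pnadc_stacked,
  crosswalk = crosswalk,
  weight_var = "V1028",
  anchor = "quarter",
  calibrate = FALSE,
  verbose = TRUE
)
```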
Working with annual PNADC data? Annual data (visit-specific microdata with comprehensive income variables) requires a different workflow. See Monthly Poverty Analysis with Annual PNADC Data for details on using `pnadc_apply_periods()` with `anchor = "year"`.