This vignette provides a complete workflow for downloading Brazil's quarterly PNADC (Pesquisa Nacional por Amostra de Domicílios Contínua) microdata and preparing it for mensalization. The workflow covers three steps: downloading the quarterly files with the PNADcIBGE package, stacking them into a single dataset, and identifying reference periods with the PNADCperiods package.

If you already have PNADC data and want to learn the package API first, see Get Started. For algorithm details, see How PNADCperiods Works.
PNADC is Brazil’s primary household survey for labor market statistics, conducted by IBGE. The survey uses a rotating panel design where each household is interviewed five times over 15 months. Each quarterly release contains approximately 500,000 observations.
Why stack multiple quarters? The mensalization algorithm identifies reference months by tracking households across their panel interviews. With a single quarter, the determination rate is only ~70%. By stacking multiple quarters, the algorithm leverages the rotating panel structure to achieve over 97% determination.
| Quarters Stacked | Month % | Fortnight % | Week % |
|---|---|---|---|
| 1 (single quarter) | ~70% | ~7% | ~2% |
| 8 (2 years) | ~94% | ~9% | ~3% |
| 20 (5 years) | ~95% | ~8% | ~3% |
| 55+ (full history) | ~97% | ~9% | ~3% |
For most applications, we recommend stacking at least 2 years (8 quarters) of data.
Create a grid of year-quarter combinations. This example uses 2020-2024, which provides a good balance between data size and determination rate:
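A minimal setup sketch for this step; the library calls follow the functions used below, while `data_dir` and its path are assumptions you should adapt to your own machine:

```r
library(PNADcIBGE)  # get_pnadc()
library(fst)        # write_fst() / read_fst()
library(data.table) # rbindlist() and fast aggregation

# Local folder for the downloaded quarters (assumed path)
data_dir <- "data/pnadc"
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Grid of year-quarter combinations for 2020-2024 (20 quarters)
editions <- expand.grid(year = 2020:2024, quarter = 1:4)
```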
The download loop fetches each quarter from IBGE and saves it in FST format for fast loading:
```r
for (i in 1:nrow(editions)) {
  year_i <- editions$year[i]
  quarter_i <- editions$quarter[i]
  filename <- paste0("pnadc_", year_i, "-", quarter_i, "q.fst")

  cat("Downloading:", year_i, "Q", quarter_i, "\n")

  # Download from IBGE
  pnadc_quarter <- get_pnadc(
    year = year_i,
    quarter = quarter_i,
    labels = FALSE,  # IMPORTANT: use numeric codes, not labels
    deflator = FALSE,
    design = FALSE,
    savedir = data_dir
  )

  # Save in FST format (fast serialization)
  write_fst(pnadc_quarter, file.path(data_dir, filename))

  # Clean up temporary files created by PNADcIBGE
  temp_files <- list.files(data_dir,
                           pattern = "\\.(zip|sas|txt)$",
                           full.names = TRUE)
  file.remove(temp_files)

  rm(pnadc_quarter)
  gc()
}
```

**Important:** Always use `labels = FALSE` when downloading. The mensalization algorithm requires numeric codes for the birthday variables (V2008, V20081, V20082); labeled factors will cause errors.
Stack all quarterly files into a single dataset. To save memory, only load the columns needed for mensalization:
```r
# Columns needed for mensalization
cols_needed <- c(
  # Time and identifiers
  "Ano", "Trimestre", "UPA", "V1008", "V1014",
  # Birthday variables (for the reference period algorithm)
  "V2008", "V20081", "V20082", "V2009",
  # Weight and stratification (for weight calibration)
  "V1028", "UF", "posest", "posest_sxi",
  # Labor market variables (used in the analysis examples below)
  "VD4001", "VD4002"
)

# Stack all quarters
files <- list.files(data_dir, pattern = "pnadc_.*\\.fst$", full.names = TRUE)
pnadc_stacked <- rbindlist(lapply(files, function(f) {
  cat("Loading:", basename(f), "\n")
  read_fst(f, columns = cols_needed, as.data.table = TRUE)
}))

cat("Total observations:", format(nrow(pnadc_stacked), big.mark = ","), "\n")
```

Build the crosswalk (identify reference periods) and calibrate weights:
```r
# Step 1: Build crosswalk (identify reference periods)
crosswalk <- pnadc_identify_periods(pnadc_stacked, verbose = TRUE)

# Check determination rates
crosswalk[, .(
  month_rate = mean(determined_month),
  fortnight_rate = mean(determined_fortnight),
  week_rate = mean(determined_week)
)]

# Step 2: Apply crosswalk and calibrate weights
result <- pnadc_apply_periods(
  data = pnadc_stacked,
  crosswalk = crosswalk,
  weight_var = "V1028",
  anchor = "quarter",
  calibrate = TRUE,
  calibration_unit = "month",
  verbose = TRUE
)
```

The verbose output shows progress and determination rates for each phase (month, fortnight, week). With 20 quarters stacked (2020-2024), expect ~95% month determination.
The result contains all original columns plus reference period indicators and calibrated weights:
```r
# Key new columns
names(result)[grep("ref_|determined_|weight_", names(result))]

# Distribution of reference months within quarters
result[, .N, by = ref_month_in_quarter][order(ref_month_in_quarter)]
```

Key output columns:
| Column | Description |
|---|---|
| `ref_month_in_quarter` | Position within the quarter (1, 2, or 3; NA if indeterminate) |
| `ref_month_yyyymm` | Reference month as a YYYYMM integer (e.g., 202301) |
| `determined_month` | Logical flag (TRUE if the month was determined) |
| `weight_monthly` | Calibrated monthly weight (if `calibrate = TRUE`) |
The distribution is approximately equal across months 1, 2, and 3 (each around 31-32%); the remaining observations are NA (indeterminate cases).
Save the mensalized data for future use:
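A saving sketch using the same FST format as the rest of the workflow; the output filename is an assumption:

```r
# Save the mensalized dataset for reuse (assumed filename)
write_fst(result, file.path(data_dir, "pnadc_mensalized.fst"))

# Later sessions can reload it directly:
# result <- read_fst(file.path(data_dir, "pnadc_mensalized.fst"),
#                    as.data.table = TRUE)
```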
To compute monthly estimates, filter to determined observations and
aggregate by ref_month_yyyymm:
```r
# Monthly unemployment rate:
# unemployed (VD4002 == 2) over the labor force (VD4001 == 1)
monthly_unemployment <- result[determined_month == TRUE, .(
  unemployment_rate = sum((VD4002 == 2) * weight_monthly, na.rm = TRUE) /
    sum((VD4001 == 1) * weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]

# Monthly population
monthly_pop <- result[determined_month == TRUE, .(
  population = sum(weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]
```

For more analysis examples, see Applied Examples.
- **Selective column loading:** Only load the columns you need with `read_fst(..., columns = ...)`. This dramatically reduces memory usage.
- **Process in batches:** For very large analyses, process one year at a time and combine results.
- **Use FST format:** FST is much faster than CSV or RDS for large datasets. A typical quarter loads in seconds rather than minutes.
- **Clean up regularly:** Use `rm()` and `gc()` to free memory after processing each quarter.
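The batching tip can be sketched as follows, reusing the per-quarter FST files from the download step; the per-year file pattern and the placeholder summary are illustrative assumptions:

```r
library(data.table)
library(fst)

# Process one year of quarterly FST files at a time, then combine summaries
yearly_summaries <- lapply(2020:2024, function(y) {
  files_y <- list.files(data_dir,
                        pattern = paste0("^pnadc_", y, "-.*\\.fst$"),
                        full.names = TRUE)
  dt <- rbindlist(lapply(files_y, read_fst,
                         columns = cols_needed, as.data.table = TRUE))
  out <- dt[, .(year = y, n_obs = .N)]  # placeholder per-year summary
  rm(dt); gc()                          # free memory before the next year
  out
})
combined <- rbindlist(yearly_summaries)
```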
| Period | Quarters | Observations | FST Size |
|---|---|---|---|
| 2020-2024 | 20 | ~8.9M | ~5 GB |
| 2012-2025 | 55 | ~29M | ~15 GB |
For the best determination rate and longitudinal analysis, download all available quarters:
```r
# Download all available data (2012-present)
editions_full <- expand.grid(
  year = 2012:2025,
  quarter = 1:4
)

# Drop quarters that have not been released yet
editions_full <- editions_full[!(editions_full$year == 2025 &
                                   editions_full$quarter > 3), ]

# Use the same download and stacking workflow as above
```

The full history provides approximately 29 million observations and achieves the highest possible determination rate (~97% month).
- **"Column not found" errors:** Ensure you used `labels = FALSE` when downloading. The algorithm requires numeric codes.
- **Download failures:** IBGE servers can be slow or unavailable. The PNADcIBGE package will retry automatically, but you may need to restart interrupted downloads.
- **Memory errors:** Try processing fewer quarters at a time, or use a machine with more RAM.
- **SIDRA API errors:** Weight calibration requires internet access to the SIDRA API. If it fails, try again later, or use `calibrate = FALSE` to identify reference periods without calibrating weights.
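The `calibrate = FALSE` fallback can be sketched by mirroring the earlier `pnadc_apply_periods()` call with calibration disabled; the result object name is an assumption:

```r
# Identify reference periods without weight calibration
# (no SIDRA access needed)
result_nocal <- pnadc_apply_periods(
  data = pnadc_stacked,
  crosswalk = crosswalk,
  weight_var = "V1028",
  anchor = "quarter",
  calibrate = FALSE,
  verbose = TRUE
)
```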
Working with annual PNADC data? Annual data (visit-specific microdata with comprehensive income variables) requires a different workflow. See Monthly Poverty Analysis with Annual PNADC Data for details on using `pnadc_apply_periods()` with `anchor = "year"`.