Given an observed aggregate price index \(\mathrm{cpi}_t\) and a matrix of (known)
sectoral aggregation weights \(W_{t,k}\) — value-added (VAB) shares — the
goal is to recover the \(K\) latent
sectoral price indices \(\varphi_{t,k}\) that the aggregate is made
of. The sectoral indices then feed a downstream nested
Ornstein–Uhlenbeck model (bayesianOU) as the market price
\(\varphi\).
The disaggregation is genuinely Bayesian: the aggregate enters as evidence (an observation density), and the sectoral indices come out as a posterior with credible intervals, not as a single deterministic re-weighting.
The 0.1.x family advertised “MCMC-free Bayesian disaggregation”, but the aggregate CPI never entered the computation (F1): the “posterior” was derived from the prior weight matrix alone, the Dirichlet concentration cancelled on renormalization (F2), the temporal pattern cancelled too (F3), an “efficiency” term was a fixed constant (F4), there were no recovery tests (F5), and a correlation helper opportunistically picked whichever of Pearson/Spearman was larger (F6). That foundational defect — not using the data — cannot be patched within a deterministic re-weighting; the fix is a model that conditions on the aggregate. The deterministic family has been removed; two honest Bayesian engines replace it.
Latent state in logs, with a random walk plus drift and partial
pooling: \[
\log \varphi_{t,k} = \log \varphi_{t-1,k} + \delta_k +
\tau_k\,\eta_{t,k},
\qquad \eta_{t,k}\sim\mathcal N(0,1),
\] with \(\delta_k \sim \mathcal
N(\delta_\mu,\delta_\sigma)\) and \(\log\tau_k \sim \mathcal N(\mu_{\log\tau},
\sigma_{\log\tau})\) (the drift and the innovation scale are
pooled across sectors). The cross-section at \(t=1\) is anchored at the aggregate level
with an estimable dispersion \(\omega_{\text{struct}}\) (a concentration that is
actually estimated, unlike the old Dirichlet \(\gamma\), which cancelled on renormalization): \[
\log\varphi_{1,k} = \log(\text{phi1\_center}) +
\omega_{\text{struct}}\,z_k .
\] The aggregate is the genuine observation: \[
\mathrm{cpi}_t \sim \mathrm{Student\text{-}t}\!\left(\nu,\
\textstyle\sum_k W_{t,k}\,\varphi_{t,k},\ \sigma\right),
\] (Gaussian if student_obs = FALSE).
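The generative model above can be sketched in base R by simulating it forward. This is only an illustration: every numeric value below (drift, scales, \(\nu\), \(\sigma\), the weights) is an assumption rather than a package default, and \(z_k\) is taken to be standard normal, which the text does not state explicitly.

```r
set.seed(1)
T_len <- 30   # number of periods (named T_len to avoid masking T)
K     <- 4    # number of sectors

# Illustrative hyperparameters (assumptions, not package defaults)
delta_mu <- 0.02;        delta_sigma   <- 0.005  # pooled drift
mu_log_tau <- log(0.01); sigma_log_tau <- 0.3    # pooled innovation scale
omega_struct <- 0.1;     phi1_center   <- 100    # cross-section at t = 1
nu <- 4;                 sigma         <- 0.5    # observation density

# Sector-level drift and innovation scale drawn from the pooling priors
delta <- rnorm(K, delta_mu, delta_sigma)
tau   <- exp(rnorm(K, mu_log_tau, sigma_log_tau))

# Weights: each row is a set of positive shares that sums to 1
W <- matrix(rgamma(T_len * K, shape = 5), T_len, K)
W <- W / rowSums(W)

# Latent log-indices: anchored cross-section, then random walk with drift
log_phi <- matrix(NA_real_, T_len, K)
log_phi[1, ] <- log(phi1_center) + omega_struct * rnorm(K)  # z_k assumed N(0, 1)
for (t in 2:T_len) {
  log_phi[t, ] <- log_phi[t - 1, ] + delta + tau * rnorm(K)
}
phi <- exp(log_phi)   # positive by construction

# Aggregate observation: Student-t noise around the weighted sum
agg <- rowSums(W * phi)
cpi <- agg + sigma * rt(T_len, df = nu)
```

Working in logs guarantees positive sectoral indices, while the observation equation stays in levels because the aggregate is a weighted sum of levels.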
The aggregate \(\sum_k W\varphi\) is strongly identified by the observation density. The per-sector split is only weakly identified: at each period one linear combination of the \(K\) sectors is pinned by the CPI, and the remaining \(K-1\) directions are governed by the cross-sectional prior plus temporal smoothness. So the per-sector intervals are honestly wide and prior-influenced. This is not a defect to hide — it is the correct uncertainty, and it is precisely why we feed the full posterior draws (not a point estimate) to the OU by multiple imputation: the sectoral uncertainty is propagated, not faked away.
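The identification argument can be checked numerically in a deliberately simplified setting: a single period, levels instead of logs, and a Gaussian prior \(\varphi \sim \mathcal N(m, s^2 I)\) with observation \(y = w^\top\varphi + \varepsilon\), \(\varepsilon \sim \mathcal N(0, \sigma^2)\). Standard Gaussian conditioning then gives the posterior covariance \(s^2 I - s^4\, w w^\top / (s^2 w^\top w + \sigma^2)\). The weights and variances below are illustrative assumptions.

```r
K      <- 4
w      <- c(0.4, 0.3, 0.2, 0.1)  # illustrative weights (sum to 1)
s2     <- 25                     # prior variance per sector (assumed)
sigma2 <- 0.01                   # observation variance (assumed)

prior_cov <- s2 * diag(K)
post_cov  <- prior_cov - s2^2 * tcrossprod(w) / (s2 * sum(w^2) + sigma2)

# Variance of the aggregate w' phi before and after seeing y
agg_prior_var <- drop(t(w) %*% prior_cov %*% w)
agg_post_var  <- drop(t(w) %*% post_cov  %*% w)

# Eigenvalues of the posterior covariance: one direction (along w) collapses,
# the remaining K - 1 directions keep the prior variance s2
ev <- eigen(post_cov, symmetric = TRUE)$values

# Per-sector posterior standard deviations remain close to the prior scale
sector_post_sd <- sqrt(diag(post_cov))
```

In this toy case the aggregate variance shrinks to below the observation variance, while \(K-1\) eigenvalues stay at the prior value: the data say nothing about those directions, so the prior (and, in the full model, temporal smoothness) governs them.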
disaggregate_conjugate(). A linear-Gaussian random
walk in levels with the same aggregate observation; its exact
posterior is the Kalman filter + RTS smoother, with no MCMC. Joint
posterior draws come from the Durbin–Koopman simulation smoother. This
is the correct realization of the original “MCMC-free posterior”
idea.

disaggregate_statespace(). The
richer model above (log scale ⇒ positivity, Student-t ⇒ robustness to
aggregate outliers, hierarchical pooling), which is not
conjugate and therefore needs HMC.

Both are Bayesian. Closed form buys speed and exactness at the cost of a simpler (Gaussian, linear) model; MCMC buys richness at the cost of sampling.
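To make the closed-form idea concrete, here is a minimal base-R sketch of the forward Kalman filter for a random walk in levels observed only through a scalar weighted sum. It is not the package's implementation: the function name `kalman_rw`, the isotropic state noise `q`, and all numeric values are assumptions for illustration, and the RTS smoother and the simulation smoother are omitted.

```r
# Hypothetical helper: Kalman filter for phi_t = phi_{t-1} + eta_t, eta_t ~ N(0, q I),
# observed through y_t = w_t' phi_t + eps_t, eps_t ~ N(0, r)
kalman_rw <- function(y, W, m0, P0, q, r) {
  n <- length(y); K <- ncol(W)
  m <- m0; P <- P0
  filtered <- matrix(NA_real_, n, K)
  loglik <- 0
  for (t in seq_len(n)) {
    # Predict: the mean carries over (no drift), the covariance grows by q I
    P <- P + q * diag(K)
    # Update with the scalar aggregate observation
    w    <- W[t, ]
    v    <- y[t] - sum(w * m)                 # innovation
    S    <- drop(crossprod(w, P %*% w)) + r   # innovation variance
    gain <- drop(P %*% w) / S                 # Kalman gain
    m <- m + gain * v
    P <- P - tcrossprod(gain) * S
    filtered[t, ] <- m
    loglik <- loglik + dnorm(v, 0, sqrt(S), log = TRUE)
  }
  list(filtered = filtered, P_last = P, loglik = loglik)
}

# Synthetic check with equal weights (illustration only)
set.seed(1)
n <- 30; K <- 3
W   <- matrix(1 / K, n, K)
phi <- apply(matrix(rnorm(n * K, 0, 0.5), n, K), 2, cumsum) + 100
y   <- rowSums(W * phi) + rnorm(n, 0, 0.1)
out <- kalman_rw(y, W, m0 = rep(100, K), P0 = 10 * diag(K), q = 0.25, r = 0.01)

# The filtered aggregate follows y closely; the individual sectors do not have to
agg_gap <- max(abs(rowSums(W * out$filtered) - y))
```

Because every step is a Gaussian update with a scalar observation, no matrix inversion is needed and the log-likelihood accumulates as a by-product of the filter.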
library(BayesianDisaggregation)
sim <- simulate_disagg(T = 30, K = 4, seed = 1) # synthetic CPI + VAB weights
bl <- disaggregate_conjugate(sim$cpi, sim$W, n_draws = 100, seed = 1)
bl
#> <disagg_conjugate> closed-form linear-Gaussian baseline (Kalman/RTS)
#> periods T = 30, sectors K = 4, joint draws = 100
#> aggregate Gaussian log-likelihood = -64.56
## the smoothed aggregate tracks the CPI tightly (aggregate is well identified)
round(cor(bl$agg_summary[, "median"], sim$cpi), 4)
#> [1] 0.999
## joint posterior draws: the [T, K, draws] contract consumed by the nested OU
dim(bl$phi_draws)
#> [1] 30 4 100

fit <- disaggregate_statespace(sim$cpi, sim$W, chains = 4, iter = 2000, warmup = 1000)
fit$diagnostics # rhat_max, divergences
dim(fit$phi_draws) # T x K x draws
str(fit$phi_summary) # median, q2.5, q97.5 (T x K each)
## couple to the nested OU (uncertainty propagated by Rubin's rule):
## bayesianOU::fit_ou_nested_mi(phi_draws = fit$phi_draws, X = Phi_index, ...)

From Excel directly, reusing the bundled readers:
The model is about index levels, so the CPI must be
a level series (FRED units “Index source base”, aggregation “Average”
for annual data — never a rate-of-change), re-indexed to the
same base as the production prices it will be compared
against (e.g. 1982–1984 = 100 via the project’s
convert_to_index). Feeding a percent-change series here is
a category error: the aggregate would not be on the same scale as \(\sum_k W\varphi\).
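As a rough illustration of the re-indexing step, the sketch below rescales a level series so that its average over the base years equals 100. The helper `rebase_index` is hypothetical and is not the project's convert_to_index (whose signature is not shown here); the input values are made-up numbers on an arbitrary base.

```r
# Hypothetical helper: rescale a level series so its mean over base_years is 100
rebase_index <- function(level, year, base_years) {
  in_base <- year %in% base_years
  stopifnot(any(in_base), all(level > 0))
  100 * level / mean(level[in_base])
}

year      <- 1980:1986
cpi_level <- c(41.2, 45.5, 48.3, 49.8, 52.0, 53.8, 54.8)  # made-up levels, arbitrary base
cpi_8284  <- rebase_index(cpi_level, year, 1982:1984)

# By contrast, a percent-change series is one element shorter and lives on a
# different scale, so it cannot stand in for sum_k W * phi
cpi_pct <- 100 * diff(cpi_level) / head(cpi_level, -1)
```

Rebasing is a pure rescaling, so it preserves ratios between periods; what matters is that the CPI and the production prices share the same base before they enter the model.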
disaggregate_statespace()$phi_draws (or
disaggregate_conjugate(..., n_draws = M)$phi_draws) is a
[T, K, M] array — exactly the multiple-imputation input of
bayesianOU::fit_ou_nested_mi(). The OU refits once per
imputation and combines the analyses by Rubin’s rule, so the
disaggregation uncertainty becomes part of the OU posterior.
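The combination step can be sketched for a single scalar parameter. The function below applies the usual multiple-imputation pooling formulas (pooled estimate as the mean across imputations, total variance as within plus inflated between variance). It is an assumption-laden illustration, not bayesianOU's internal code: `pool_rubin` is a hypothetical name and the per-imputation estimates and variances are placeholder numbers.

```r
# Hypothetical helper: pool one scalar parameter across M imputations
pool_rubin <- function(est, var_within) {
  M <- length(est)
  stopifnot(M >= 2, length(var_within) == M)
  q_bar <- mean(est)                  # pooled point estimate
  w_bar <- mean(var_within)           # average within-imputation variance
  b     <- var(est)                   # between-imputation variance
  total <- w_bar + (1 + 1 / M) * b    # total variance
  c(estimate = q_bar, within = w_bar, between = b, total_var = total)
}

# Placeholder results from M = 5 refits, one per draw of phi
est        <- c(0.42, 0.47, 0.39, 0.45, 0.44)
var_within <- c(0.010, 0.012, 0.011, 0.009, 0.010)
pooled <- pool_rubin(est, var_within)
```

The between-imputation term is where the disaggregation uncertainty enters: if every draw of \(\varphi\) gave the same OU estimate, it would vanish and the total variance would reduce to the within-fit variance.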