Simulation benchmarks: recovery and power

tidyILD authors

2026-04-17

This article shows how to use ild_simulate() and ild_power(..., return_sims = TRUE) to assess parameter recovery for a fixed effect under the package’s bundled Gaussian mixed-model data-generating process (DGP), and how that connects to empirical power. It is an illustrative benchmark for that DGP, not a proof that every tidyILD method is validated for all data.

Power answers: “How often do we reject the null when the effect is present?” Recovery answers: “How close are point estimates (and intervals) to the known coefficient across repeated simulations?” Both use the same replication loop inside ild_power().

Simulation size and precision

Recovery summaries (bias, RMSE, coverage) are Monte Carlo estimates: their sampling variability shrinks as n_sim increases (roughly \(\propto 1/\sqrt{n_\text{sim}}\) for means). The worked example below uses n_sim = 200 as a balance between stability and build time on CRAN; for publication-style tables, consider larger n_sim and reporting uncertainty (e.g. bootstrap across replications or analytical approximations where available). Chunks that re-run ild_power() are marked with optional cache = TRUE so local rebuilds can skip re-simulation when the code is unchanged (not used on CRAN by default).
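As a quick base-R check of what n_sim buys you, the Monte Carlo standard error of an estimated coverage proportion follows the usual binomial formula:

```r
# Monte Carlo standard error of a coverage proportion estimated from n_sim
# independent replications: SE = sqrt(p * (1 - p) / n_sim).
mc_se <- function(p, n_sim) sqrt(p * (1 - p) / n_sim)

mc_se(0.95, 200)   # ~0.015: observed coverage anywhere in ~0.92-0.98 is unremarkable
mc_se(0.95, 2000)  # ~0.005: a tenfold increase in n_sim shrinks the SE by sqrt(10)
```

This is why a coverage of 0.945 at n_sim = 200, as in the worked example below, is fully consistent with the nominal 0.95.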

What ild_simulate() encodes

The base outcome y (before ild_power() adds a predictor) combines a person-level random intercept for id with within-person residual noise, optionally AR(1)-correlated within person (the ar1 argument of ild_simulate() sets \(\phi\)).

The formula you fit should match this structure (e.g. random intercept for id when using ild_lme(..., ar1 = FALSE)). Recovery in this vignette focuses on a single fixed effect added by ild_power() on top of that DGP (see below).
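For intuition only, here is a base-R sketch of a DGP of this general shape: a person-level random intercept plus independent within-person noise. The SDs and sizes are illustrative placeholders, not ild_simulate()'s documented defaults.

```r
set.seed(1)
n_id      <- 25L   # number of persons (illustrative)
n_obs_per <- 12L   # observations per person (illustrative)

id <- rep(seq_len(n_id), each = n_obs_per)
u  <- rnorm(n_id, sd = 1)                  # random intercepts, SD chosen for illustration
y  <- u[id] + rnorm(length(id), sd = 0.7)  # iid residual noise (no AR(1) in this sketch)

d <- data.frame(id = id, time = rep(seq_len(n_obs_per), times = n_id), y = y)
str(d)  # 300 rows: 25 persons x 12 occasions
```

A random-intercept formula such as y ~ x + (1 | id) matches this structure exactly; any AR(1) component in the noise would call for a residual correlation structure as well.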

Fixed-effect recovery with ild_power(..., return_sims = TRUE)

For each replication, ild_power():

  1. Simulates data with ild_simulate().
  2. Adds a standard normal predictor x and sets y <- y + effect_size * x so the true coefficient of x is effect_size (call it β).
  3. Runs ild_prepare() and ild_lme() with your formula.
  4. Records the Wald-based estimate and standard error for the test term.

With return_sims = TRUE, you get a tibble of per-replication estimate, std_error, p_value, and converged. Use ild_recovery_metrics() for bias, RMSE, and nominal Wald interval coverage (normal approximation \(\hat\beta \pm z_{1-\alpha/2} \cdot \mathrm{SE}\)), consistent with the manual calculation shown in earlier package versions.

library(tidyILD)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

truth <- 0.35
n_sim <- 200L

set.seed(2026)
res <- ild_power(
  formula = y ~ x + (1 | id),
  n_sim = n_sim,
  n_id = 25L,
  n_obs_per = 12L,
  effect_size = truth,
  seed = 2026L,
  return_sims = TRUE,
  verbose = FALSE
)

sim <- res$sim_results |> dplyr::filter(converged)
ild_recovery_metrics(res$sim_results, truth = truth, level = 0.95)
#> # A tibble: 1 × 8
#>   truth     n n_total mean_estimate     bias   rmse coverage level
#>   <dbl> <int>   <int>         <dbl>    <dbl>  <dbl>    <dbl> <dbl>
#> 1  0.35   200     200         0.346 -0.00404 0.0368    0.945  0.95
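The same summaries can be reproduced by hand in base R. The data frame below is synthetic (a stand-in for res$sim_results), but the formulas match the ones ild_recovery_metrics() is described as using: bias, RMSE, and normal-approximation Wald coverage.

```r
set.seed(7)
truth <- 0.35
sim_manual <- data.frame(
  estimate  = rnorm(200, mean = truth, sd = 0.04),  # stand-in replication estimates
  std_error = rep(0.04, 200)                        # stand-in Wald standard errors
)

z  <- qnorm(0.975)                                  # z_{1 - alpha/2} for level = 0.95
lo <- sim_manual$estimate - z * sim_manual$std_error
hi <- sim_manual$estimate + z * sim_manual$std_error

c(
  bias     = mean(sim_manual$estimate) - truth,
  rmse     = sqrt(mean((sim_manual$estimate - truth)^2)),
  coverage = mean(lo <= truth & truth <= hi)
)
```

Because the stand-in standard errors equal the true sampling SD here, coverage lands near the nominal 0.95, up to Monte Carlo noise.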

Across replications, the mean estimate should be near β, bias near 0, RMSE positive but modest if the model is well specified, and coverage near the nominal level (e.g. 0.95) for large n_sim—subject to Monte Carlo noise when n_sim is moderate.

ggplot2::ggplot(sim, ggplot2::aes(x = estimate)) +
  ggplot2::geom_histogram(fill = "steelblue", color = "white", bins = 20) +
  ggplot2::geom_vline(xintercept = truth, linetype = 2, linewidth = 0.8) +
  ggplot2::labs(
    x = "Estimated coefficient for x",
    y = "Count",
    title = "Sampling distribution of fixed-effect estimates",
    subtitle = sprintf("True beta = %s (vertical line)", truth)
  )

Histogram of simulation estimates of the fixed effect

From recovery to power

The same object res already contains empirical power: the fraction of converged runs with p_value < alpha.

tibble::tibble(
  power = res$power,
  n_reject = res$n_reject,
  n_converged = res$n_converged,
  alpha = res$alpha
)
#> # A tibble: 1 × 4
#>   power n_reject n_converged alpha
#>   <dbl>    <int>       <int> <dbl>
#> 1     1      200         200  0.05

Interpretation: Recovery summarizes the sampling distribution of \(\hat\beta\). Power summarizes how often that distribution leads to rejection at alpha. A large effect with low variance yields good recovery and high power; a small effect may show acceptable recovery (unbiased estimates) but low power at a fixed sample size.
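To make that link concrete, here is a synthetic Wald-test simulation in base R (the numbers are illustrative, not the package's): with a large standardized effect, essentially every replication rejects, matching the empirical power of 1 above.

```r
set.seed(11)
beta  <- 0.35   # true effect (illustrative)
se    <- 0.04   # sampling SD of the estimator (illustrative)
alpha <- 0.05

est  <- rnorm(200, mean = beta, sd = se)   # simulated point estimates
pval <- 2 * pnorm(-abs(est / se))          # two-sided Wald p-values
mean(pval < alpha)                         # empirical power; ~1 since beta/se is ~8.75
```

Shrinking beta (or inflating se) in this sketch moves the rejection fraction below 1 while leaving the estimates unbiased, which is exactly the "good recovery, low power" regime described above.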

Variance components (illustrative snapshot)

ild_power() is built around a fixed-effect estimand. Recovery of variance components (random-intercept SD, residual variance) is a separate question. As a single-run sanity check, you can inspect VarCorr() on one simulated dataset (lme4 or nlme output, depending on ar1):

set.seed(1)
d_one <- ild_simulate(n_id = 30L, n_obs_per = 15L, seed = 99)
d_one$x <- rnorm(nrow(d_one))
d_one$y <- d_one$y + 0.3 * d_one$x
prep_one <- ild_prepare(d_one, id = "id", time = "time")
fit_one <- ild_lme(y ~ x + (1 | id), prep_one, ar1 = FALSE, warn_no_ar1 = FALSE)
#> Warning: Predictor(s) 'x' vary both within and between persons. Consider using
#> ild_center(x) to separate within- and between-person effects and avoid
#> conflation bias.
lme4::VarCorr(fit_one)
#>  Groups   Name        Std.Dev.
#>  id       (Intercept) 0.97185 
#>  Residual             0.66690

This is not a simulation-based assessment of variance recovery—only a template for where to look when extending benchmarks.

AR(1) in the DGP vs residual correlation in the fit

ild_power() exposes ar1 as a logical flag for the fit, not as the DGP’s \(\phi\). To study AR(1) DGPs with ild_power()-style benchmarks, you would need a custom replication loop that calls ild_simulate(..., ar1 = ...) yourself, keeping the simulated correlation structure aligned with the one you fit. Minimal illustration of a DGP with AR(1) noise plus an nlme residual correlation structure, on one draw:

set.seed(2)
d_ar <- ild_simulate(n_id = 20L, n_obs_per = 14L, ar1 = 0.35, wp_effect = 0.5, seed = 2)
d_ar$x <- rnorm(nrow(d_ar))
d_ar$y <- d_ar$y + 0.25 * d_ar$x
prep_ar <- ild_prepare(d_ar, id = "id", time = "time")
fit_ar <- tryCatch(
  ild_lme(y ~ x, prep_ar, ar1 = TRUE, random = ~ 1 | id, warn_no_ar1 = FALSE),
  error = function(e) NULL
)
#> Warning: Predictor(s) 'x' vary both within and between persons. Consider using
#> ild_center(x) to separate within- and between-person effects and avoid
#> conflation bias.
if (!is.null(fit_ar)) summary(fit_ar) else "nlme fit failed on this platform (skip)"
#> Linear mixed-effects model fit by REML
#>   Data: data 
#>        AIC    BIC    logLik
#>   573.1519 591.29 -281.5759
#> 
#> Random effects:
#>  Formula: ~1 | id
#>         (Intercept)  Residual
#> StdDev:    1.024086 0.8321674
#> 
#> Correlation Structure: AR(1)
#>  Formula: ~.ild_seq | id 
#>  Parameter estimate(s):
#>       Phi 
#> 0.6910938 
#> Fixed effects:  list(formula) 
#>                 Value  Std.Error  DF  t-value p-value
#> (Intercept) 0.9146186 0.25045053 259 3.651893   3e-04
#> x           0.2603221 0.03073403 259 8.470158   0e+00
#>  Correlation: 
#>   (Intr)
#> x -0.021
#> 
#> Standardized Within-Group Residuals:
#>         Min          Q1         Med          Q3         Max 
#> -2.17356216 -0.70843954 -0.07166726  0.61178709  2.15484654 
#> 
#> Number of Observations: 280
#> Number of Groups: 20
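As a base-R aside, the meaning of the DGP’s \(\phi\) can be checked directly on a long simulated AR(1) series (the value 0.35 mirrors the ar1 argument above; the series length is arbitrary):

```r
set.seed(42)
phi <- 0.35
e   <- stats::arima.sim(model = list(ar = phi), n = 5000)   # AR(1) noise series
phi_hat <- stats::acf(e, lag.max = 1, plot = FALSE)$acf[2]  # lag-1 autocorrelation
round(phi_hat, 2)  # close to 0.35 for a long series
```

Note that the fitted Phi in the nlme output above (about 0.69) need not equal the DGP’s 0.35: a fitted model that omits parts of the DGP (such as the wp_effect term) can leave extra serial structure for the residual AR(1) to absorb. Treat that as a hypothesis to investigate, not a documented fact.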

Bayesian and state-space extensions

Cross-backend validation harness (optional)

For repeatable CI-style checks across several engines (lme4, optional nlme / brms / KFAS / ctsem), the package ships a non-exported harness documented in inst/dev/BACKEND_VALIDATION_BENCHMARK_CONTRACT.md. From a source tree, run scripts/run-backend-validation-benchmarks.R with a --tier (smoke, nightly, full) and compare outputs to JSON thresholds via scripts/check-backend-validation-thresholds.R. This complements the ild_power() / ild_recovery_metrics() workflow above by standardizing raw and summary tables for trend comparison; it is not a substitute for study-specific simulation design.

Limitations and scope

See also

#> R version 4.5.3 (2026-03-11)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US/en_US/C/en_US/en_US
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_4.0.0 dplyr_1.1.4   tidyILD_0.4.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.7-4       gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.3    
#>  [5] Rcpp_1.1.0         tidyselect_1.2.1   jquerylib_0.1.4    splines_4.5.3     
#>  [9] scales_1.4.0       boot_1.3-32        yaml_2.3.10        fastmap_1.2.0     
#> [13] lattice_0.22-9     R6_2.6.1           labeling_0.4.3     generics_0.1.4    
#> [17] knitr_1.50         rbibutils_2.3      MASS_7.3-65        tibble_3.3.0      
#> [21] nloptr_2.2.1       lubridate_1.9.4    tsibble_1.2.0      minqa_1.2.8       
#> [25] bslib_0.9.0        pillar_1.11.1      RColorBrewer_1.1-3 rlang_1.1.6       
#> [29] cachem_1.1.0       xfun_0.53          sass_0.4.10        S7_0.2.0          
#> [33] timechange_0.3.0   cli_3.6.5          withr_3.0.2        magrittr_2.0.4    
#> [37] Rdpack_2.6.4       digest_0.6.37      grid_4.5.3         lme4_1.1-37       
#> [41] anytime_0.3.12     lifecycle_1.0.4    nlme_3.1-168       reformulas_0.4.1  
#> [45] vctrs_0.6.5        evaluate_1.0.5     glue_1.8.0         farver_2.1.2      
#> [49] rmarkdown_2.29     tools_4.5.3        pkgconfig_2.0.3    htmltools_0.5.8.1