Simulation benchmarks: recovery and power

tidyILD authors

2026-04-17

This article shows how to use ild_simulate() and ild_power(..., return_sims = TRUE) to assess parameter recovery for a fixed effect under the package’s bundled Gaussian mixed-model data-generating process (DGP), and how that connects to empirical power. It is an illustrative benchmark for that DGP, not a proof that every tidyILD method is validated for all data.

Power answers: “How often do we reject the null when the effect is present?” Recovery answers: “How close are point estimates (and intervals) to the known coefficient across repeated simulations?” Both use the same replication loop inside ild_power().

Simulation size and precision

Recovery summaries (bias, RMSE, coverage) are Monte Carlo estimates: their sampling variability shrinks as n_sim increases (roughly \(\propto 1/\sqrt{n_\text{sim}}\) for means). The worked example below uses n_sim = 200 as a balance between stability and build time on CRAN; for publication-style tables, consider larger n_sim and reporting uncertainty (e.g. bootstrap across replications or analytical approximations where available). Chunks that re-run ild_power() are marked with optional cache = TRUE so local rebuilds can skip re-simulation when the code is unchanged (not used on CRAN by default).
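As a quick base-R check of what n_sim buys you, the Monte Carlo standard error of an estimated coverage proportion follows the usual binomial formula:

```r
# Monte Carlo standard error of a coverage proportion estimated from n_sim
# independent replications: SE = sqrt(p * (1 - p) / n_sim).
mc_se <- function(p, n_sim) sqrt(p * (1 - p) / n_sim)

mc_se(0.95, 200)   # ~0.015: observed coverage anywhere in ~0.92-0.98 is unremarkable
mc_se(0.95, 2000)  # ~0.005: a tenfold increase in n_sim shrinks the SE by sqrt(10)
```

This is why a coverage of 0.945 at n_sim = 200, as in the worked example below, is fully consistent with the nominal 0.95.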

What ild_simulate() encodes

The base outcome y (before ild_power() adds a predictor) combines a person-level random intercept for id with within-person residual noise, optionally AR(1)-correlated within person (the ar1 argument of ild_simulate() sets \(\phi\)).

The formula you fit should match this structure (e.g. random intercept for id when using ild_lme(..., ar1 = FALSE)). Recovery in this vignette focuses on a single fixed effect added by ild_power() on top of that DGP (see below).
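For intuition only, here is a base-R sketch of a DGP of this general shape: a person-level random intercept plus independent within-person noise. The SDs and sizes are illustrative placeholders, not ild_simulate()'s documented defaults.

```r
set.seed(1)
n_id      <- 25L   # number of persons (illustrative)
n_obs_per <- 12L   # observations per person (illustrative)

id <- rep(seq_len(n_id), each = n_obs_per)
u  <- rnorm(n_id, sd = 1)                  # random intercepts, SD chosen for illustration
y  <- u[id] + rnorm(length(id), sd = 0.7)  # iid residual noise (no AR(1) in this sketch)

d <- data.frame(id = id, time = rep(seq_len(n_obs_per), times = n_id), y = y)
str(d)  # 300 rows: 25 persons x 12 occasions
```

A random-intercept formula such as y ~ x + (1 | id) matches this structure exactly; any AR(1) component in the noise would call for a residual correlation structure as well.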

Fixed-effect recovery with ild_power(..., return_sims = TRUE)

For each replication, ild_power():

  1. Simulates data with ild_simulate().
  2. Adds a standard normal predictor x and sets y <- y + effect_size * x so the true coefficient of x is effect_size (call it β).
  3. Runs ild_prepare() and ild_lme() with your formula.
  4. Records the Wald-based estimate and standard error for the test term.

With return_sims = TRUE, you get a tibble of per-replication estimate, std_error, p_value, and converged. Use ild_recovery_metrics() for bias, RMSE, and nominal Wald interval coverage (normal approximation \(\hat\beta \pm z_{1-\alpha/2} \cdot \mathrm{SE}\)), consistent with the manual calculation shown in earlier package versions.

library(tidyILD)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

truth <- 0.35
n_sim <- 200L

set.seed(2026)
res <- ild_power(
  formula = y ~ x + (1 | id),
  n_sim = n_sim,
  n_id = 25L,
  n_obs_per = 12L,
  effect_size = truth,
  seed = 2026L,
  return_sims = TRUE,
  verbose = FALSE
)

sim <- res$sim_results |> dplyr::filter(converged)
ild_recovery_metrics(res$sim_results, truth = truth, level = 0.95)
#> # A tibble: 1 × 8
#>   truth     n n_total mean_estimate     bias   rmse coverage level
#>   <dbl> <int>   <int>         <dbl>    <dbl>  <dbl>    <dbl> <dbl>
#> 1  0.35   200     200         0.346 -0.00404 0.0368    0.945  0.95
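The same summaries can be reproduced by hand in base R. The data frame below is synthetic (a stand-in for res$sim_results), but the formulas match the ones ild_recovery_metrics() is described as using: bias, RMSE, and normal-approximation Wald coverage.

```r
set.seed(7)
truth <- 0.35
sim_manual <- data.frame(
  estimate  = rnorm(200, mean = truth, sd = 0.04),  # stand-in replication estimates
  std_error = rep(0.04, 200)                        # stand-in Wald standard errors
)

z  <- qnorm(0.975)                                  # z_{1 - alpha/2} for level = 0.95
lo <- sim_manual$estimate - z * sim_manual$std_error
hi <- sim_manual$estimate + z * sim_manual$std_error

c(
  bias     = mean(sim_manual$estimate) - truth,
  rmse     = sqrt(mean((sim_manual$estimate - truth)^2)),
  coverage = mean(lo <= truth & truth <= hi)
)
```

Because the stand-in standard errors equal the true sampling SD here, coverage lands near the nominal 0.95, up to Monte Carlo noise.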

Across replications, the mean estimate should be near β, bias near 0, RMSE positive but modest if the model is well specified, and coverage near the nominal level (e.g. 0.95) for large n_sim—subject to Monte Carlo noise when n_sim is moderate.

ggplot2::ggplot(sim, ggplot2::aes(x = estimate)) +
  ggplot2::geom_histogram(fill = "steelblue", color = "white", bins = 20) +
  ggplot2::geom_vline(xintercept = truth, linetype = 2, linewidth = 0.8) +
  ggplot2::labs(
    x = "Estimated coefficient for x",
    y = "Count",
    title = "Sampling distribution of fixed-effect estimates",
    subtitle = sprintf("True beta = %s (vertical line)", truth)
  )

Histogram of simulation estimates of the fixed effect

From recovery to power

The same object res already contains empirical power: the fraction of converged runs with p_value < alpha.

tibble::tibble(
  power = res$power,
  n_reject = res$n_reject,
  n_converged = res$n_converged,
  alpha = res$alpha
)
#> # A tibble: 1 × 4
#>   power n_reject n_converged alpha
#>   <dbl>    <int>       <int> <dbl>
#> 1     1      200         200  0.05

Interpretation: Recovery summarizes the sampling distribution of \(\hat\beta\). Power summarizes how often that distribution leads to rejection at alpha. A large effect with low variance yields good recovery and high power; a small effect may show acceptable recovery (unbiased estimates) but low power at a fixed sample size.
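To make that link concrete, here is a synthetic Wald-test simulation in base R (the numbers are illustrative, not the package's): with a large standardized effect, essentially every replication rejects, matching the empirical power of 1 above.

```r
set.seed(11)
beta  <- 0.35   # true effect (illustrative)
se    <- 0.04   # sampling SD of the estimator (illustrative)
alpha <- 0.05

est  <- rnorm(200, mean = beta, sd = se)   # simulated point estimates
pval <- 2 * pnorm(-abs(est / se))          # two-sided Wald p-values
mean(pval < alpha)                         # empirical power; ~1 since beta/se is ~8.75
```

Shrinking beta (or inflating se) in this sketch moves the rejection fraction below 1 while leaving the estimates unbiased, which is exactly the "good recovery, low power" regime described above.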

Variance components (illustrative snapshot)

ild_power() is built around a fixed-effect estimand. Recovery of variance components (random-intercept SD, residual variance) is a separate question. As a single-run sanity check, you can inspect VarCorr() on one simulated dataset (lme4 or nlme output, depending on ar1):

set.seed(1)
d_one <- ild_simulate(n_id = 30L, n_obs_per = 15L, seed = 99)
d_one$x <- rnorm(nrow(d_one))
d_one$y <- d_one$y + 0.3 * d_one$x
prep_one <- ild_prepare(d_one, id = "id", time = "time")
fit_one <- ild_lme(y ~ x + (1 | id), prep_one, ar1 = FALSE, warn_no_ar1 = FALSE)
#> Warning: Predictor(s) 'x' vary both within and between persons. Consider using
#> ild_center(x) to separate within- and between-person effects and avoid
#> conflation bias.
lme4::VarCorr(fit_one)
#>  Groups   Name        Std.Dev.
#>  id       (Intercept) 0.97185 
#>  Residual             0.66690

This is not a simulation-based assessment of variance recovery—only a template for where to look when extending benchmarks.

AR(1) in the DGP vs residual correlation in the fit

ild_power() exposes ar1 as a logical flag for the fit, not as the DGP’s \(\phi\). To study AR(1) DGPs with ild_power()-style benchmarks, you would need a custom replication loop that calls ild_simulate(..., ar1 = ...) yourself, keeping the simulated correlation structure aligned with the one you fit. Minimal illustration of a DGP with AR(1) noise plus an nlme residual correlation structure, on one draw:

set.seed(2)
d_ar <- ild_simulate(n_id = 20L, n_obs_per = 14L, ar1 = 0.35, wp_effect = 0.5, seed = 2)
d_ar$x <- rnorm(nrow(d_ar))
d_ar$y <- d_ar$y + 0.25 * d_ar$x
prep_ar <- ild_prepare(d_ar, id = "id", time = "time")
fit_ar <- tryCatch(
  ild_lme(y ~ x, prep_ar, ar1 = TRUE, random = ~ 1 | id, warn_no_ar1 = FALSE),
  error = function(e) NULL
)
#> Warning: Predictor(s) 'x' vary both within and between persons. Consider using
#> ild_center(x) to separate within- and between-person effects and avoid
#> conflation bias.
if (!is.null(fit_ar)) summary(fit_ar) else "nlme fit failed on this platform (skip)"
#> Linear mixed-effects model fit by REML
#>   Data: data 
#>        AIC    BIC    logLik
#>   573.1519 591.29 -281.5759
#> 
#> Random effects:
#>  Formula: ~1 | id
#>         (Intercept)  Residual
#> StdDev:    1.024086 0.8321674
#> 
#> Correlation Structure: AR(1)
#>  Formula: ~.ild_seq | id 
#>  Parameter estimate(s):
#>       Phi 
#> 0.6910938 
#> Fixed effects:  list(formula) 
#>                 Value  Std.Error  DF  t-value p-value
#> (Intercept) 0.9146186 0.25045053 259 3.651893   3e-04
#> x           0.2603221 0.03073403 259 8.470158   0e+00
#>  Correlation: 
#>   (Intr)
#> x -0.021
#> 
#> Standardized Within-Group Residuals:
#>         Min          Q1         Med          Q3         Max 
#> -2.17356216 -0.70843954 -0.07166726  0.61178709  2.15484654 
#> 
#> Number of Observations: 280
#> Number of Groups: 20
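As a base-R aside, the meaning of the DGP’s \(\phi\) can be checked directly on a long simulated AR(1) series (the value 0.35 mirrors the ar1 argument above; the series length is arbitrary):

```r
set.seed(42)
phi <- 0.35
e   <- stats::arima.sim(model = list(ar = phi), n = 5000)   # AR(1) noise series
phi_hat <- stats::acf(e, lag.max = 1, plot = FALSE)$acf[2]  # lag-1 autocorrelation
round(phi_hat, 2)  # close to 0.35 for a long series
```

Note that the fitted Phi in the nlme output above (about 0.69) need not equal the DGP’s 0.35: a fitted model that omits parts of the DGP (such as the wp_effect term) can leave extra serial structure for the residual AR(1) to absorb. Treat that as a hypothesis to investigate, not a documented fact.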

Bayesian and state-space extensions

Cross-backend validation harness (optional)

For repeatable CI-style checks across several engines (lme4, optional nlme / brms / KFAS / ctsem), the package ships a non-exported harness documented in inst/dev/BACKEND_VALIDATION_BENCHMARK_CONTRACT.md. From a source tree, run scripts/run-backend-validation-benchmarks.R with a --tier (smoke, nightly, full) and compare outputs to JSON thresholds via scripts/check-backend-validation-thresholds.R. This complements the ild_power() / ild_recovery_metrics() workflow above by standardizing raw and summary tables for trend comparison; it is not a substitute for study-specific simulation design.

Limitations and scope

See also

#> R version 4.5.3 (2026-03-11)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US/en_US/C/en_US/en_US
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_4.0.0 dplyr_1.1.4   tidyILD_0.4.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.7-4       gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.3    
#>  [5] Rcpp_1.1.0         tidyselect_1.2.1   jquerylib_0.1.4    splines_4.5.3     
#>  [9] scales_1.4.0       boot_1.3-32        yaml_2.3.10        fastmap_1.2.0     
#> [13] lattice_0.22-9     R6_2.6.1           labeling_0.4.3     generics_0.1.4    
#> [17] knitr_1.50         rbibutils_2.3      MASS_7.3-65        tibble_3.3.0      
#> [21] nloptr_2.2.1       lubridate_1.9.4    tsibble_1.2.0      minqa_1.2.8       
#> [25] bslib_0.9.0        pillar_1.11.1      RColorBrewer_1.1-3 rlang_1.1.6       
#> [29] cachem_1.1.0       xfun_0.53          sass_0.4.10        S7_0.2.0          
#> [33] timechange_0.3.0   cli_3.6.5          withr_3.0.2        magrittr_2.0.4    
#> [37] Rdpack_2.6.4       digest_0.6.37      grid_4.5.3         lme4_1.1-37       
#> [41] anytime_0.3.12     lifecycle_1.0.4    nlme_3.1-168       reformulas_0.4.1  
#> [45] vctrs_0.6.5        evaluate_1.0.5     glue_1.8.0         farver_2.1.2      
#> [49] rmarkdown_2.29     tools_4.5.3        pkgconfig_2.0.3    htmltools_0.5.8.1