Bayesian Reanalysis of the ICT-107 Trial

Riko Kelter
Institute of Medical Statistics and Computational Biology
Faculty of Medicine
University of Cologne
Cologne, Germany

23 December 2025

Introduction and Overview

In this vignette, we illustrate the basic functionality of the bfbin2arm package and its core functions. The package can be used to design a Bayesian (phase II) clinical trial with two arms and binary endpoints (success or failure) based on Bayes factors. Our main assumption is that the observed data in the two groups come from two random variables \(Y_1,Y_2\), each following a binomial distribution with parameters \(n_1,p_1\) and \(n_2,p_2\), respectively, \[Y_1\sim \mathrm{Bin}(n_1,p_1), \hspace{1cm} Y_2\sim \mathrm{Bin}(n_2,p_2)\]
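As a minimal sketch, the data-generating model above can be simulated in base R. The response probabilities p1 and p2 below are hypothetical; the arm sizes match the ICT-107 example later in this vignette:

```r
# Simulate one two-arm trial from the binomial model (base R sketch).
# p1 and p2 are hypothetical; n1 and n2 match the ICT-107 example below.
set.seed(1)
n1 <- 43; n2 <- 81
p1 <- 0.3; p2 <- 0.6
y1 <- rbinom(1, n1, p1)  # successes in the control arm
y2 <- rbinom(1, n2, p2)  # successes in the treatment arm
c(y1 = y1, y2 = y2)
```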

Hypothesis tests

In its current form, the package implements four different hypothesis tests for the trial:

\[H_0:p_1=p_2 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:p_1\neq p_2\] Alternatively, a well-known parameterization of this test introduces the difference parameter \(\eta=p_2-p_1\) and the grand mean \(\zeta=\frac{1}{2}(p_1+p_2)\). Under this parameterization, we have \[p_1=\zeta-\frac{\eta}{2}, \hspace{1cm} p_2=\zeta+\frac{\eta}{2}\] and the hypotheses can be rewritten as: \[H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta \neq 0\] Besides this two-sided test, three directional tests are available in the package: \[H_+:\eta > 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_-:\eta \leq 0\] \[H_+:\eta > 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_0:\eta = 0\] \[H_-:\eta < 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_0:\eta = 0\]

For each of the four tests, a separate Bayes factor exists and can be used. For the two-sided test, we denote the Bayes factor as \(BF_{01}\), and for the three directional tests above we denote the Bayes factors as \(BF_{+-}\), \(BF_{+0}\) and \(BF_{-0}\).

Design and analysis priors

A natural choice for the priors is the beta distribution: the \(\mathrm{Beta}(a_0,b_0)\) distribution is a conjugate prior for the binomial likelihood, so when it is chosen as the prior, the posterior \(P_{p \mid Y}\) is also Beta-distributed. We assume a Beta design prior under \(H_0\) as follows: \[p_1 =p_2 = p\mid H_0 \sim \mathrm{Beta}(a_0^d,b_0^d)\] Thus, under \(H_0:\eta = 0\), both probabilities are identical, \(p_1=p_2\), and take some common value \(p\in [0,1]\), which is assigned a beta design prior. Likewise, we pick independent Beta design priors under \(H_1:\eta \neq 0\): \[p_1 \mid H_1 \sim \mathrm{Beta}(a_1^d,b_1^d), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^d,b_2^d)\] For the analysis priors \(P_{p_1}^a\), \(P_{p_2}^a\) under \(H_1\), we also choose independent Beta priors, with possibly different hyperparameters \(a_i^a\) and \(b_i^a\) for \(i=1,2\), where the superscript signals that the hyperparameters belong to the analysis instead of the design prior: \[p_1 \mid H_1 \sim \mathrm{Beta}(a_1^a,b_1^a), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^a,b_2^a)\] Lastly, for the analysis prior \(P_{p}^a\) under \(H_0:\eta=0\), we choose a Dirac prior placing all probability mass on \(\eta=p_2-p_1=0\), combined with a uniform prior on \(\zeta\), that is, \[p_1=p_2=p\mid H_0 \sim 1_{\{\eta=0\}}, \hspace{1cm} \zeta \sim U(0,1)\] for the analysis with the Bayes factor.
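As a quick illustration of the conjugacy argument above: after observing \(y\) successes in \(n\) trials under a \(\mathrm{Beta}(a_0,b_0)\) prior, the posterior is \(\mathrm{Beta}(a_0+y,\, b_0+n-y)\). A base R sketch using the control-arm counts from the ICT-107 example below:

```r
# Conjugate Beta-binomial update: a Beta(a0, b0) prior combined with
# Binomial(n, p) data gives a Beta(a0 + y, b0 + n - y) posterior.
# Counts from the ICT-107 control arm (12 responders out of 43).
a0 <- 1; b0 <- 1                         # flat Beta(1, 1) prior
y <- 12; n <- 43
a_post <- a0 + y                         # 13
b_post <- b0 + n - y                     # 32
post_mean <- a_post / (a_post + b_post)  # 13/45, about 0.289
```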

Using the package

First, we load the package after installation:

library(bfbin2arm)

Next, we illustrate the key functions of the package by re-analyzing a phase II trial in oncology. While no Bayesian approach was used in the original statistical analysis of the trial, the step-by-step walkthrough below shows what a structured approach to designing and calibrating a Bayesian phase II trial with the bfbin2arm package looks like. Importantly, the trial must have two arms and binary endpoints, and we assume that one of the four tests detailed above is carried out using a Bayes factor as the test criterion.

ICT-107 Phase II Trial Overview

The ICT-107 trial (Wen et al., 2019) was a randomized phase II study in newly diagnosed glioblastoma patients (n=124, 2:1 randomization). The primary binary endpoint is progression status at 6 months (PFS6), and the secondary binary endpoint is immunologic response status. Here, we focus on the secondary endpoint for illustration purposes.

Reported results (ITT population): in the placebo (control) arm, 12 of 43 patients were immunologic responders; in the ICT-107 (treatment) arm, 49 of 81 patients were responders.

1. Bayes Factor Analysis

We start by calculating the Bayes factor(s) for the ICT-107 trial data:

## -------------------------------------------------------------
## ICT-107 trial (immunologic response)
##    Placebo (control): 12 responders, 31 non-responders
##    ICT-107 (treatment): 49 responders, 32 non-responders  
## -------------------------------------------------------------

y1_ict <- 12      # control successes
n1_ict <- 12 + 31
y2_ict <- 49      # treatment successes
n2_ict <- 49 + 32

cat("\n=== ICT-107 Trial (n1 =", n1_ict, ", n2 =", n2_ict, ") ===\n")
#> 
#> === ICT-107 Trial (n1 = 43 , n2 = 81 ) ===

# BF01
BF01_ict = twoarmbinbf01(y1_ict, y2_ict, n1_ict, n2_ict, 
                         a_0_a = 1, b_0_a = 1, 
                         a_1_a = 1, b_1_a = 1, 
                         a_2_a = 1, b_2_a = 1)

# BF+1
BFp1_ict = BFplus1(y1_ict, y2_ict, n1_ict, n2_ict, 
                   a_1_d = 1, b_1_d = 1, 
                   a_2_d = 1, b_2_d = 1)

# BF-1
BFm1_ict = BFminus1(y1_ict, y2_ict, n1_ict, n2_ict, 
                    a_1_d = 1, b_1_d = 1, 
                    a_2_d = 1, b_2_d = 1)

# BF+0
cat("=== ICT-107 Trial === Bayes factor BF+0 results in ", BFplus0(BFp1_ict, BF01_ict))
#> === ICT-107 Trial === Bayes factor BF+0 results in  186.6192

# BF+-
cat("=== ICT-107 Trial === Bayes factor BF+- results in ", BFplusMinus(BFp1_ict, BFm1_ict))
#> === ICT-107 Trial === Bayes factor BF+- results in  3702.659

The most relevant Bayes factor here is \(BF_{+-}\), because it is directional and leaves open the possibility of the placebo group having a larger response rate than the treatment group. Note that the hyperparameters of the beta analysis priors are specified in twoarmbinbf01 via a_0_a = 1, b_0_a = 1 et cetera.
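The helper functions BFplus0 and BFplusMinus combine previously computed Bayes factors. Assuming they implement the standard transitivity identities \(BF_{+0}=BF_{+1}/BF_{01}\) and \(BF_{+-}=BF_{+1}/BF_{-1}\) (an assumption about the package internals, which are not shown here), the mechanics can be sketched in base R with hypothetical marginal likelihoods:

```r
# Bayes factor transitivity: a Bayes factor is a ratio of marginal
# likelihoods, so a shared denominator cancels. The marginal likelihood
# values below are hypothetical, purely for illustration.
m_plus <- 4e-4; m_minus <- 1e-7; m_0 <- 2e-6; m_1 <- 2e-4
BF01 <- m_0 / m_1     # two-sided test, H0 vs H1
BFp1 <- m_plus / m_1  # H+ vs H1
BFm1 <- m_minus / m_1 # H- vs H1
BFp0 <- BFp1 / BF01   # H+ vs H0, equals m_plus / m_0
BFpm <- BFp1 / BFm1   # H+ vs H-, equals m_plus / m_minus
```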

2. Operating characteristics for actual sample sizes

Now, a key question is which operating characteristics can be expected based on the actual sample sizes used in the trial. The powertwoarmbinbf01 function can provide the answer:

ict_results <- powertwoarmbinbf01(
  n1 = n1_ict, n2 = n2_ict,
  k = 1/3, k_f = 3,
  test = "BF+-",  # H+: p2 > p1 vs H-: p2 <= p1
  a_0_d = 1, b_0_d = 1, a_0_a = 1, b_0_a = 1,
  a_1_d = 1, b_1_d = 1, a_2_d = 1, b_2_d = 1,
  a_1_a = 1, b_1_a = 1, a_2_a = 1, b_2_a = 1,
  output = "numeric",
  compute_freq_t1e = TRUE
)
print(ict_results)
#>                   Power             Type1_Error                   CE_H0 
#>               0.8788106               0.0214111               0.8788106 
#> Frequentist_Type1_Error 
#>               0.2871811 
#> attr(,"hypothesis")
#> [1] "H[+]:~p[2] > p[1] ~~ vs ~~ H[-]:~p[2] <= p[1]"
#> attr(,"compute_freq_t1e")
#> [1] TRUE

We see that, based on the actual sample sizes and a moderate evidence threshold \(k=1/3\), the Bayesian power is sufficiently large at about \(87.9\%\). However, the frequentist type-I-error rate is far too high at \(28.7\%\), so we increase the evidence threshold to \(k=1/10\) (strong evidence) and next use the ntwoarmbinbf01 function to calibrate the design based on our requirements.
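Conceptually, the Bayesian power reported above can be approximated by simple Monte Carlo: draw \((p_1,p_2)\) from the design prior under \(H_+\), simulate trial data, and count how often the Bayes factor crosses the evidence threshold. The sketch below is an illustration only, not the package's internal algorithm; it assumes flat priors and that compelling evidence for \(H_+\) is declared when \(BF_{-+}\leq k\), which under symmetric priors is equivalent to \(P(p_2>p_1\mid \text{data})\geq 1/(1+k)\):

```r
# Conceptual Monte Carlo sketch of Bayesian power for the BF+- test
# (illustration only, NOT the package's internal algorithm).
set.seed(42)
n1 <- 43; n2 <- 81; k <- 1/3
nsim <- 1000; ndraw <- 2000
hit <- logical(nsim)
for (i in seq_len(nsim)) {
  # design prior under H+: flat on the triangle {p2 > p1}
  repeat { p1 <- runif(1); p2 <- runif(1); if (p2 > p1) break }
  y1 <- rbinom(1, n1, p1)
  y2 <- rbinom(1, n2, p2)
  # posterior draws under independent flat Beta(1, 1) analysis priors
  d1 <- rbeta(ndraw, 1 + y1, 1 + n1 - y1)
  d2 <- rbeta(ndraw, 1 + y2, 1 + n2 - y2)
  # compelling evidence for H+ iff P(p2 > p1 | data) >= 1 / (1 + k)
  hit[i] <- mean(d2 > d1) >= 1 / (1 + k)
}
power_est <- mean(hit)
power_est
```

This crude estimate should be broadly comparable to the power of about 88% reported by powertwoarmbinbf01 above, though it will differ in detail from the package's computation.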

3. Power & Sample Size for ICT-107 Design

The core working function to design a Bayesian trial with the package is the ntwoarmbinbf01 function. It provides a method to calibrate a Bayesian design in terms of Bayesian power, Bayesian type-I-error rate, the probability of compelling evidence for the null hypothesis, and, optionally, frequentist power and type-I-error rate.

The function makes use of parallelization, and it is recommended to run it on a computer with multiple cores to speed up the computations. First, we perform a sample size search for an ICT-107-type trial (balanced arms) under flat design priors and substantial evidence thresholds, using the directional Bayes factor \(BF_{+-}\):

ntwoarmbinbf01(
  k = 1/10, k_f = 10,
  power = 0.8, alpha = 0.05, pce_H0 = 0.8,
  test = "BF+-",
  nrange = c(10, 75), n_step = 1,
  progress = FALSE,
  compute_freq_t1e = TRUE,
  p1_power = 0.3, p2_power = 0.6,
  output = "plot"  # Returns recommended n per group
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 75 (step = 1, 66 values)
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#> 
#> 
#> Simulation complete.
#> SUMMARY for BF+-:
#>   Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#>   k = 0.100, k_f = 10.000
#>   Allocation: alloc1 = 0.500, alloc2 = 0.500
#>   Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#>     POWER not reached: max=0.754 at n_total=75
#>     Bayesian Type-I error <= 0.05 achieved at n_total=10
#>     P(CE|H0) not reached: max=0.754 at n_total=75
#>     FREQUENTIST Type-I error TOO HIGH: max(sup)=0.120 > 0.05
#>     Frequentist power >= 0.80 achieved at n_total=50 (p1=0.30, p2=0.60)

The main function arguments are the evidence thresholds k and k_f, the calibration targets power, alpha and pce_H0, the test to be used (test), the sample size search range (nrange) and step size (n_step), the hyperparameters of the design and analysis priors, the assumed success probabilities p1_power and p2_power for the frequentist power computation, and the output format (output).

The resulting output plots the design and analysis priors in the top row, the power and type-I-error rate curves as functions of the sample size, with markers indicating at which sample sizes the design achieves the required calibration thresholds, in the middle row, and the probability of compelling evidence for the null hypothesis (in this case, the hypothesis \(H_-\)) in the bottom row. Note that the oscillations are due to the discrete nature of the binomial distribution; the package algorithm ensures that for the next 10 sample sizes, the power does not drop below the required threshold. Likewise, the package ensures that the type-I-error rate does not rise above the required alpha level and that the probability of compelling evidence does not drop below its required threshold. It is straightforward to check this visually by means of the provided output plots, too. If no plots are required, use the option "numeric" instead of "plot" for the output argument.

The resulting plot shows that while the Bayesian type-I-error rate is calibrated already for \(n=10\) patients in total, the Bayesian power does not reach our desired level of 80% even for \(n=75\) patients in total (across both arms). We could increase the search range or, alternatively, use more informative design priors under which the hypotheses under comparison are better separated. Right now, we essentially assume that everything is equally likely under our design priors, although we should have a clear expectation about the success probabilities in the treatment and control arms. Thus, we modify our design priors next. Note that the plot also shows that the frequentist power is calibrated for \(n=50\) patients in total when assuming \(p_1=0.3\) (control arm probability) and \(p_2=0.6\) (treatment arm probability).
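To see why informative design priors separate the hypotheses better, one can compare the prior probability of \(H_+:p_2>p_1\) under flat priors versus the informative priors used in the next section, \(\mathrm{Beta}(1,2)\) for \(p_1\) and \(\mathrm{Beta}(2,1)\) for \(p_2\). For the latter, a short integration gives \(P(p_2>p_1)=5/6\) exactly; a Monte Carlo check in base R:

```r
# Prior mass on H+: p2 > p1 under flat versus informative design priors.
# Under independent Beta(1, 2) and Beta(2, 1) priors, P(p2 > p1) = 5/6.
set.seed(123)
m <- 1e5
sep_flat <- mean(runif(m) < runif(m))              # flat: about 0.5
sep_inf  <- mean(rbeta(m, 2, 1) > rbeta(m, 1, 2))  # informative: about 5/6
c(flat = sep_flat, informative = sep_inf)
```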

4. Informative design priors

Now, the example above used flat design priors, which might be unrealistic in a variety of settings. Next, we perform a sample size search for a new ICT-107-type trial (balanced arms) under informative design priors with very strong evidence thresholds. Notice the additionally specified hyperparameters a_1_d = 1, b_1_d = 2 and a_2_d = 2, b_2_d = 1, which parameterize the Beta design priors for \(p_1\) and \(p_2\) under \(H_+\).

ntwoarmbinbf01(
  k = 1/30, k_f = 30,
  power = 0.8, alpha = 0.05, pce_H0 = 0.8,
  test = "BF+-",
  nrange = c(10, 100), n_step = 1,
  progress = FALSE,
  a_1_d = 1, b_1_d = 2,
  a_2_d = 2, b_2_d = 1,
  compute_freq_t1e = TRUE,
  p1_power = 0.3, p2_power = 0.6,
  output = "plot"  # Returns recommended n per group
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 100 (step = 1, 91 values)
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#> 
#> 
#> Simulation complete.
#> SUMMARY for BF+-:
#>   Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#>   k = 0.033, k_f = 30.000
#>   Allocation: alloc1 = 0.500, alloc2 = 0.500
#>   Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#>     Power >= 0.80 achieved at n_total=72
#>     Bayesian Type-I error <= 0.05 achieved at n_total=10
#>     P(CE|H0) not reached: max=0.715 at n_total=100
#>     Frequentist Type-I error <= 0.05 achieved (max(sup)=0.041)
#>     Frequentist power >= 0.80 achieved at n_total=77 (p1=0.30, p2=0.60)

We see that the Bayesian power is now calibrated for \(n=72\) patients in total, while the frequentist power is calibrated for \(n=77\) patients in total. Importantly, the frequentist type-I-error rate is now only \(0.041<0.05\), as stated in the console output of the function. Thus, the design is fully calibrated except for the probability of compelling evidence for \(H_-\) shown in the bottom plot.

Therefore, we next perform a sample size search for a new ICT-107-type trial (balanced arms) under informative design priors with very strong evidence thresholds, this time with the design prior under \(H_-\) modified so that the required probability of compelling evidence, \(PCE(H_0)\), is reached for smaller sample sizes as well. Note that now, additionally, the hyperparameters of the Beta design priors for \(p_1\) and \(p_2\) under \(H_-\) are specified via a_1_d_Hminus = 2, b_1_d_Hminus = 1 and a_2_d_Hminus = 1, b_2_d_Hminus = 2:

ntwoarmbinbf01(
  k = 1/30, k_f = 30,
  power = 0.8, alpha = 0.05, pce_H0 = 0.8,
  test = "BF+-",
  nrange = c(10, 100), n_step = 1,
  progress = TRUE,
  a_1_d = 1, b_1_d = 2,
  a_2_d = 2, b_2_d = 1,
  a_1_d_Hminus = 2, b_1_d_Hminus = 1,
  a_2_d_Hminus = 1, b_2_d_Hminus = 2,
  compute_freq_t1e = TRUE,
  p1_power = 0.3, p2_power = 0.6,
  output = "plot"  # Returns recommended n per group
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 100 (step = 1, 91 values)
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#> 
#> 
#> Simulation complete.
#> SUMMARY for BF+-:
#>   Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#>   k = 0.033, k_f = 30.000
#>   Allocation: alloc1 = 0.500, alloc2 = 0.500
#>   Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#>     Power >= 0.80 achieved at n_total=72
#>     Bayesian Type-I error <= 0.05 achieved at n_total=10
#>     P(CE|H0) >= 0.80 achieved at n_total=72
#>     Frequentist Type-I error <= 0.05 achieved (max(sup)=0.041)
#>     Frequentist power >= 0.80 achieved at n_total=77 (p1=0.30, p2=0.60)

The result is a fully calibrated Bayesian design that meets the Bayesian and frequentist power demands, the Bayesian and frequentist type-I-error rate requirements, and our requirement on the probability of compelling evidence for \(H_0\) (that is, \(H_-\) in this case).

The calibration with the bfbin2arm package reveals several aspects. If a balanced design with equal randomization probabilities is desired, then a total sample size of \(n=72\) patients suffices to meet the Bayesian power, Bayesian type-I-error and probability-of-compelling-evidence requirements, while \(n=77\) patients in total are needed to meet the frequentist power requirement; the frequentist type-I-error rate is controlled throughout.

5. Unequal randomization probabilities

In the original ICT-107 trial, \(2/3\) of the patients were randomized into the treatment group, while \(1/3\) were randomized into the control group. We can use the arguments alloc1 and alloc2 to specify the randomization probabilities for the control and treatment arms and carry out the Bayesian sample size calculations accordingly. As an example, we rerun the last calibration, but with the randomization probabilities of the ICT-107 trial:

ntwoarmbinbf01(
  k = 1/30, k_f = 30,
  power = 0.8, alpha = 0.05, pce_H0 = 0.8,
  test = "BF+-",
  nrange = c(10, 100), n_step = 1,
  progress = FALSE,
  a_1_d = 1, b_1_d = 2,
  a_2_d = 2, b_2_d = 1,
  a_1_d_Hminus = 2, b_1_d_Hminus = 1,
  a_2_d_Hminus = 1, b_2_d_Hminus = 2,
  compute_freq_t1e = TRUE,
  p1_power = 0.3, p2_power = 0.6,
  output = "plot",  # Returns recommended n per group
  alloc1 = 1/3,
  alloc2 = 2/3
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 100 (step = 1, 91 values)
#> Allocation: alloc1 = 0.333, alloc2 = 0.667
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#> 
#> 
#> Simulation complete.
#> SUMMARY for BF+-:
#>   Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#>   k = 0.033, k_f = 30.000
#>   Allocation: alloc1 = 0.333, alloc2 = 0.667
#>   Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#>     Power >= 0.80 achieved at n_total=83
#>     Bayesian Type-I error <= 0.05 achieved at n_total=10
#>     P(CE|H0) >= 0.80 achieved at n_total=83
#>     FREQUENTIST Type-I error TOO HIGH: max(sup)=0.050 > 0.05
#>     Frequentist power >= 0.80 achieved at n_total=86 (p1=0.30, p2=0.60)

Remember that the sample size shown on the x-axis of the power and type-I-error rate plots, as well as of the probability of compelling evidence plot, is the total sample size across both arms. We see that we now need \(n=83\) patients in total to reach a Bayesian power of 80%, while \(n=86\) patients in total are required for a frequentist power of 80%. The probability of compelling evidence reaches 80% at \(n=83\) patients in total. Note, however, that the frequentist type-I-error rate now sits exactly at the boundary, which might be too liberal for some. As the frequentist type-I-error rate assumes fixed success probabilities in both trial arms and is independent of the design priors, we must lower the evidence threshold \(k\) slightly to decrease the frequentist type-I-error rate accordingly. Try it out yourself: decrease \(k\) from \(k=1/30\) to \(k=1/40\) and rerun the last code block.

6. Design Recommendations based on the calibration

If the original 2:1 randomization of the ICT-107 trial is used and two thirds of the patients are randomized into the treatment group, it suffices to enroll 32 patients in the control arm and 64 patients in the treatment arm, using the Bayes factor thresholds \(k=1/40\) and \(k_f=30\) for decision making about the hypotheses \(H_+\) and \(H_-\) under consideration. This design then fulfills all four requirements.

7. Predictive Densities

If desired, we can also compare the predictive densities under the different hypotheses directly via:

pred_H0 <- predictiveDensityH0(y1_ict, y2_ict, n1_ict, n2_ict)
pred_H1 <- predictiveDensityH1(y1_ict, y2_ict, n1_ict, n2_ict)
pred_Hplus <- predictiveDensityHplus_trunc(y1_ict, y2_ict, n1_ict, n2_ict)

data.frame(
  Hypothesis = c("H0: p1=p2", "H1: p1 != p2", "H+: p2>p1"),
  "Pred. Density" = round(c(pred_H0, pred_H1, pred_Hplus), 4)
)
#>     Hypothesis Pred..Density
#> 1    H0: p1=p2         0e+00
#> 2 H1: p1 != p2         3e-04
#> 3    H+: p2>p1         6e-04
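The tiny values have a simple explanation for \(H_1\): assuming predictiveDensityH1 uses flat \(\mathrm{Beta}(1,1)\) priors by default (as in the analyses above; the function's defaults themselves are not shown here), the prior predictive of each arm's count is uniform over \(\{0,\dots,n\}\), so the joint predictive density is \(1/((n_1+1)(n_2+1))\):

```r
# Under an independent flat Beta(1, 1) prior, the prior predictive of a
# Binomial(n, p) count is beta-binomial, here uniform on {0, ..., n}.
# The joint predictive density under H1 is therefore constant in (y1, y2).
n1_ict <- 43; n2_ict <- 81
pred_H1_flat <- 1 / ((n1_ict + 1) * (n2_ict + 1))  # 1/3608
round(pred_H1_flat, 4)  # 3e-04, matching the table above
```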

References

Wen PY, et al. (2019). A Randomized Double-Blind Placebo-Controlled Phase II Trial of Dendritic Cell Vaccine ICT-107. Clinical Cancer Research. [PMID: 31320597]