MDgof-hybrid

Note the examples are not actually run but their results are stored in the included data set hybrid.mdgof.vignette. This is done to satisfy CRAN submission rules regarding the execution time of vignettes. If you want to run the Rmarkdown file yourself, set ReRunExamples=TRUE.

In a standard goodness-of-fit problem we have a data set \(x\) and a probability model, given in form of the cumulative distribution function \(F\), and want to test \(H_0: X\sim F\), that is the data set was generated by \(F\).

There are however circumstances where evaluation of the function \(F(x)\) is difficult or even impossible, especially for high-dimensional data. It is however possible to generate data from \(F\). Then a test can be run as follows:

One additional “feature” of this approach is that often we are free to choose the sample size of \(y\). Generally on would expect the power of the tests to increase.

The companion R package MD2sample has a large number of methods already implemented and is used in routines that follow.

Example 1: Continuous data without parameter estimation

We generate a two-dimensional data set of size 100 from a multivariate standard normal distribution and run the test with the null hypothesis \(H_0:X\sim N(\mathbf{0}, \mathbf{I_2})\):

set.seed(123)

rnull=function() mvtnorm::rmvnorm(100, c(0, 0))
x=rnull()

examples[["ex1a"]]=hybrid_test(x, rnull, B=B)

examples[["ex1a"]]
#> $statistics
#>      KS       K     CvM      AD     NN1     NN5      AZ      BF      BG 
#>  0.1100  0.1700  0.1047  0.7069  0.8800  1.7300 -0.0070  0.2167  0.1383 
#> 
#> $p.values
#>    KS     K   CvM    AD   NN1   NN5    AZ    BF    BG 
#> 0.770 0.890 0.540 0.635 0.905 0.985 0.885 0.780 0.725

\(B\) is the number of simulation runs to estimated the p-value, \(B=5000\) is the default but if the data set is large a smaller value should suffice.

Arguments of hybrid_test for continuous data

x a matrix of numbers with different observations in different rows (for continuous data).
rnull routine that generates data under the null hypothesis.
phat routine that finds parameter estimates.
TS a routine to calculate test statistics or p values for new tests.
TSextra additional info passed to TS, if necessary.
maxProcessor maximum number of cores to use. If set to 1 no parallel processing is used.
doMethods=“all” names of method to include.
B=5000 number of simulation runs.

Example 2: Continuous data with parameter estimation

In this example we will test whether the data comes from bivariate normal distribution, with means and variance-covariance estimated from the data.

rnull=function(p) mvtnorm::rmvnorm(200, 
      mean=p[1:2],
      sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
phat=function(x) 
   c(apply(x, 2, mean), c(apply(x, 2, var), cov(x[,1],x[,2])))
x=rnull(c(1, 2, 4, 9,0.6))
phat(x)
#> [1] 1.2812099 1.8038581 3.9441709 8.0894505 0.1389893

examples[["ex2a"]]=hybrid_test(x, rnull, phat, B=B)

examples[["ex2a"]]
#> $statistics
#>        KS         K       CvM        AD       NN1       NN5        AZ        BF 
#>  0.110000  0.120000  0.263900  1.987300  0.955000  1.930000 -0.001000  0.297600 
#>        BG 
#>  0.008009 
#> 
#> $p.values
#>    KS     K   CvM    AD   NN1   NN5    AZ    BF    BG 
#> 0.320 0.940 0.120 0.070 0.750 0.745 0.485 0.245 0.890

In contrast in the next case we will generate data from a multivariate t distribution with 5 degrees of freedom. We also use a larger sample size.

y=mvtnorm::rmvt(300, sigma=diag(2), df=5)

examples[["ex2b"]]=hybrid_test(y, rnull, phat, B=B)

examples[["ex2b"]]
#> $statistics
#>       KS        K      CvM       AD      NN1      NN5       AZ       BF 
#> 0.080000 0.153300 0.092860 0.764500 0.950000 1.973300 0.003255 0.289100 
#>       BG 
#> 0.334200 
#> 
#> $p.values
#>    KS     K   CvM    AD   NN1   NN5    AZ    BF    BG 
#> 0.705 0.425 0.610 0.605 0.800 0.620 0.885 0.290 0.370

New Tests

hybrid_test allows the user to supply their own (two-sample) test method, either one that finds its own p-value or a method that requires simulation.

The routine chiTS.cont (included in the MD2sample package) finds either the test statistic or the p value of a chi square test in two dimensions. First we need to set

TSextra=list(which="statistic", nbins=cbind(c(3,3), c(3,4)))

this list is passed to chiTS.cont and tells the routine to

return the test statistic (and not the p value)
Run two tests, first with 3 and 3 bins in the x and y direction and then 3 and 4 bins.

Example 3

In this example we use the same setup as in example 2.

examples[["ex3a"]]=hybrid_test(x, rnull, phat, 
         TS=MD2sample::chiTS.cont, TSextra=TSextra,  
         B=B)

examples[["ex3a"]]
#> $statistics
#> Chisq Stat 3,3 Chisq Stat 3,4 
#>         5.1979         6.8728 
#> 
#> $p.values
#> Chisq Stat 3,3 Chisq Stat 3,4 
#>          0.700          0.775

examples[["ex3b"]]=hybrid_test(y, rnull, phat, 
         TS=MD2sample::chiTS.cont, TSextra=TSextra,  
         B=B)

examples[["ex3b"]]
#> $statistics
#> Chisq Stat 3,3 Chisq Stat 3,4 
#>         11.449         19.757 
#> 
#> $p.values
#> Chisq Stat 3,3 Chisq Stat 3,4 
#>           0.18           0.07

Arguments and output of new test routine for continuous data

see the vignette MD2sample.

Power Estimation

The routine hybrid_power allows the user to estimate the power of the tests via two-sample methods.

Example 4

Let’s say we want to estimate the power when the null hypothesis specifies a standard bivariate normal distribution, but the data actually comes from a bivariate normal distribution with mean vector \((0,0)\), variances equal to 1 and a covariance \(\rho\). We first need a function that generates these data sets:

rnull=function() mvtnorm::rmvnorm(100, c(0, 0))
ralt=function(s) mvtnorm::rmvnorm(100, c(0, 0),
                sigma=matrix(c(1, s, s, 1), 2, 2))

Now we can run

examples[["ex4a"]]=hybrid_power(rnull, ralt, 
      c(0, 0.8), B=B, maxProcessor=1)

examples[["ex4a"]]
#>        KS     K  CvM    AD  NN1  NN5    AZ   BF    BG    FR NN0   CF1  CF2
#> 0   0.105 0.065 0.09 0.125 0.05 0.06 0.085 0.05 0.065 0.045   1 0.045 0.08
#> 0.8 0.560 0.705 0.63 0.700 0.76 0.95 0.985 0.95 0.065 0.865   1 0.865 0.70
#>       CF3   CF4  Ball    ES    EP
#> 0   0.045 0.085 0.065 0.065 0.035
#> 0.8 0.865 0.800 0.350 0.960 0.995

Arguments of hybrid_power

rnull routine that generates data under the null hypothesis.
ralt routine that generates data under the alternative hypothesis.
param_alt vector of parameter values passed to ralt.
phat routine that finds parameter estimates.
TS a routine to calculate test statistics or p values for new tests.
TSextra additional info passed to TS, if necessary.
With.p.value=FALSE does user supplied routine return p values?
alpha=0.05 desired type I error probability.
maxProcessor maximum number of cores to use. If set to 1 no parallel processing is used.
B=1000 number of simulation runs.

Again the user can provide their own routine:

TSextra=list(which="pvalue", nbins=c(3,4))

examples[["ex4b"]]=hybrid_power(rnull, ralt, c(0, 0.8), TS=MD2sample::chiTS.cont, 
          TSextra=TSextra, B=500, 
          With.p.value=TRUE)

examples[["ex4b"]]
#>     Chisq Pval 3,4
#> 0            0.062
#> 0.8          0.992

Discrete Data

Example 5

We consider the case of data from binomial distributions:

rnull=function() {
   x=rbinom(1000, 5, 0.5)
   y=rbinom(1000, 3, 0.5+x/20)
   MDgof::sq2rec(table(x, y))
}
x=rnull()

examples[["ex5"]]=hybrid_test(x, rnull, B=B)

examples[["ex5"]]
#> $statistics
#>     KS      K    CvM     AD     NN     AZ     BF 
#> 0.0500 0.0790 0.1998 1.3240 4.4895 0.0681 0.1053 
#> 
#> $p.values
#>    KS     K   CvM    AD    NN    AZ    BF 
#> 0.110 0.020 0.340 0.265 0.195 0.100 0.100

The MDgof routine sq2rec above changes the format of the data from that output by the table command to what is required by hybrid_test.

The arguments of hybrid_test for discrete data are the same as those for continuous data. The routines determine that the data is discrete if x is a matrix with three columns and the the values in the third column are integers.

User supplied tests

The routine chiTS.disc (included in the MD2sample package) does a chi-square test for discrete data:

TSextra=list(which="statistic")

examples[["ex6c"]]=hybrid_test(x, rnull,  TS=MD2sample::chiTS.disc, TSextra=TSextra,  B=B)

examples[["ex6c"]]
#> $statistics
#> Chisq Stat 
#>     16.942 
#> 
#> $p.values
#> Chisq Stat 
#>      0.795

Power Estimation

Power estimation for discrete data is done the same way as for continuous data:

Example 7

We consider again the case of data from binomial distributions as in example 5. As an alternative we change the distribution of Y.

rnull=function() {
   x=rbinom(1000, 5, 0.5)
   y=rbinom(1000, 3, 0.5+x/20)
   MDgof::sq2rec(table(x, y))
}
ralt=function(p) {
   x=rbinom(1000, 5, p)
   y=rbinom(1000, 3, 0.5+x/20)
   MDgof::sq2rec(table(x, y))
}
x=ralt(0.475)

examples[["ex7b"]]= hybrid_power(rnull, ralt, c(0.5, 0.475), B=200)

examples[["ex7b"]]
#>         KS     K   CvM    AD   NN    AZ    BF Chisquare
#> 0.5   0.07 0.075 0.035 0.040 0.04 0.030 0.040     0.055
#> 0.475 0.50 0.225 0.525 0.595 0.13 0.675 0.675     0.240

or using a user-supplied test that find a p value:

TSextra=list(which="pvalue")

examples[["ex7d"]]=hybrid_power(rnull, ralt, c(0.5, 0.475), TS=MD2sample::chiTS.disc, TSextra=TSextra,  
         With.p.value = TRUE, B=B)

examples[["ex7d"]]
#>       Chisq P
#> 0.5     0.025
#> 0.475   0.205

saveRDS(examples, "hybrid.mdgof.vignette.rds")