Note the examples are not actually run but their results are stored in the included data set hybrid.mdgof.vignette. This is done to satisfy CRAN submission rules regarding the execution time of vignettes. If you want to run the Rmarkdown file yourself, set ReRunExamples=TRUE.
In a standard goodness-of-fit problem we have a data set \(x\) and a probability model, given in form of the cumulative distribution function \(F\), and want to test \(H_0: X\sim F\), that is the data set was generated by \(F\).
There are however circumstances where evaluation of the function \(F(x)\) is difficult or even impossible, especially for high-dimensional data. It is however possible to generate data from \(F\). Then a test can be run as follows:
In what follows we will call this a hybrid test.
One additional “feature” of this approach is that often we are free to choose the sample size of \(y\). Generally on would expect the power of the tests to increase.
The companion R package MD2sample has a large number of methods already implemented and is used in routines that follow.
We generate a two-dimensional data set of size 100 from a multivariate standard normal distribution and run the test with the null hypothesis \(H_0:X\sim N(\mathbf{0}, \mathbf{I_2})\):
examples[["ex1a"]]
#> $statistics
#> KS K CvM AD NN1 NN5 AZ BF BG
#> 0.1100 0.1700 0.1047 0.7069 0.8800 1.7300 -0.0070 0.2167 0.1383
#>
#> $p.values
#> KS K CvM AD NN1 NN5 AZ BF BG
#> 0.770 0.890 0.540 0.635 0.905 0.985 0.885 0.780 0.725\(B\) is the number of simulation runs to estimated the p-value, \(B=5000\) is the default but if the data set is large a smaller value should suffice.
Arguments of hybrid_test for continuous data
In this example we will test whether the data comes from bivariate normal distribution, with means and variance-covariance estimated from the data.
rnull=function(p) mvtnorm::rmvnorm(200,
mean=p[1:2],
sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
phat=function(x)
c(apply(x, 2, mean), c(apply(x, 2, var), cov(x[,1],x[,2])))
x=rnull(c(1, 2, 4, 9,0.6))
phat(x)
#> [1] 1.2812099 1.8038581 3.9441709 8.0894505 0.1389893examples[["ex2a"]]
#> $statistics
#> KS K CvM AD NN1 NN5 AZ BF
#> 0.110000 0.120000 0.263900 1.987300 0.955000 1.930000 -0.001000 0.297600
#> BG
#> 0.008009
#>
#> $p.values
#> KS K CvM AD NN1 NN5 AZ BF BG
#> 0.320 0.940 0.120 0.070 0.750 0.745 0.485 0.245 0.890In contrast in the next case we will generate data from a multivariate t distribution with 5 degrees of freedom. We also use a larger sample size.
hybrid_test allows the user to supply their own (two-sample) test method, either one that finds its own p-value or a method that requires simulation.
The routine chiTS.cont (included in the MD2sample package) finds either the test statistic or the p value of a chi square test in two dimensions. First we need to set
this list is passed to chiTS.cont and tells the routine to
In this example we use the same setup as in example 2.
examples[["ex3a"]]
#> $statistics
#> Chisq Stat 3,3 Chisq Stat 3,4
#> 5.1979 6.8728
#>
#> $p.values
#> Chisq Stat 3,3 Chisq Stat 3,4
#> 0.700 0.775examples[["ex3b"]]
#> $statistics
#> Chisq Stat 3,3 Chisq Stat 3,4
#> 11.449 19.757
#>
#> $p.values
#> Chisq Stat 3,3 Chisq Stat 3,4
#> 0.18 0.07Arguments and output of new test routine for continuous data
see the vignette MD2sample.
The routine hybrid_power allows the user to estimate the power of the tests via two-sample methods.
Let’s say we want to estimate the power when the null hypothesis specifies a standard bivariate normal distribution, but the data actually comes from a bivariate normal distribution with mean vector \((0,0)\), variances equal to 1 and a covariance \(\rho\). We first need a function that generates these data sets:
rnull=function() mvtnorm::rmvnorm(100, c(0, 0))
ralt=function(s) mvtnorm::rmvnorm(100, c(0, 0),
sigma=matrix(c(1, s, s, 1), 2, 2))Now we can run
examples[["ex4a"]]
#> KS K CvM AD NN1 NN5 AZ BF BG FR NN0 CF1 CF2
#> 0 0.105 0.065 0.09 0.125 0.05 0.06 0.085 0.05 0.065 0.045 1 0.045 0.08
#> 0.8 0.560 0.705 0.63 0.700 0.76 0.95 0.985 0.95 0.065 0.865 1 0.865 0.70
#> CF3 CF4 Ball ES EP
#> 0 0.045 0.085 0.065 0.065 0.035
#> 0.8 0.865 0.800 0.350 0.960 0.995Arguments of hybrid_power
Again the user can provide their own routine:
We consider the case of data from binomial distributions:
rnull=function() {
x=rbinom(1000, 5, 0.5)
y=rbinom(1000, 3, 0.5+x/20)
MDgof::sq2rec(table(x, y))
}
x=rnull()examples[["ex5"]]
#> $statistics
#> KS K CvM AD NN AZ BF
#> 0.0500 0.0790 0.1998 1.3240 4.4895 0.0681 0.1053
#>
#> $p.values
#> KS K CvM AD NN AZ BF
#> 0.110 0.020 0.340 0.265 0.195 0.100 0.100The MDgof routine sq2rec above changes the format of the data from that output by the table command to what is required by hybrid_test.
The arguments of hybrid_test for discrete data are the same as those for continuous data. The routines determine that the data is discrete if x is a matrix with three columns and the the values in the third column are integers.
The routine chiTS.disc (included in the MD2sample package) does a chi-square test for discrete data:
Power estimation for discrete data is done the same way as for continuous data:
We consider again the case of data from binomial distributions as in example 5. As an alternative we change the distribution of Y.
rnull=function() {
x=rbinom(1000, 5, 0.5)
y=rbinom(1000, 3, 0.5+x/20)
MDgof::sq2rec(table(x, y))
}
ralt=function(p) {
x=rbinom(1000, 5, p)
y=rbinom(1000, 3, 0.5+x/20)
MDgof::sq2rec(table(x, y))
}
x=ralt(0.475)examples[["ex7b"]]
#> KS K CvM AD NN AZ BF Chisquare
#> 0.5 0.07 0.075 0.035 0.040 0.04 0.030 0.040 0.055
#> 0.475 0.50 0.225 0.525 0.595 0.13 0.675 0.675 0.240or using a user-supplied test that find a p value: