Introduction

This vignette demonstrates a data analysis application using the DPComb package. We consider a gene-based genetic association study, where the primary goal is to assess whether each gene—represented by a group of genetic markers—is associated with disease status.

In this example, we use the built-in dataset case_control, which consists of 2000 subjects (1000 cases and 1000 controls) measured on 15 genetic markers. The markers are divided into two groups: markers 1–5 (gene1) and markers 6–15 (gene2).

Data Analysis using DPComb_tests

The analysis proceeds as follows:

  1. Data Loading and Grouping:
    Load the built-in dataset and group markers by gene.

  2. Combination Testing:
    For each gene, perform a combination test using one of the following methods (detailed formulas can be found in the documentation for DPComb_tests):

    • Fisher’s combination test, with either the mean-value or median-value chi-squared statistic.
    • Pearson’s combination test.
    • George’s combination test.
    • Stouffer’s combination test.
    • Edgington’s combination test.

How the Combination Tests Work

  • Individual Marker Analysis:
    Each marker’s association with disease status is initially assessed using Fisher’s exact test (via the fisher.test function). This produces discrete p-values, which can be two-sided, right-sided, or left-sided.

  • Null Distribution:
    Under the null hypothesis that none of the markers in the gene group are associated, mutations are assumed to be randomly distributed between cases and controls. Consequently, for each marker, the distribution of mutation count in the case group follows the hypergeometric distribution, with parameters determined by the number of cases, the number of controls, and the total mutation count of the marker. Therefore, we input a list of such distribution descriptions. Here, the input p-values are assumed to be independent (i.e., markers are independent).

  • P-Value Aggregation:
    The discrete p-values are then combined using one of the methods listed above. The output testing p-value indicates the evidence of association between the gene (i.e., the corresponding group of markers) and disease status.

Alternative Input Format

An equivalent and often more convenient input for the DPComb_tests function is:

  • A vector of mutation counts in cases (xs), and
  • A list describing the null distribution.
  • Specify the “sidedness” consistent with fisher.test.

Either data input — using discrete p-values or the mutation counts — yields equivalent combination statistics and testing p-values.

Code and Results

library(DPComb)

# Load built-in data
#data(case_control)
data("case_control", package = "DPComb")

# Group markers: gene1 contains marker1 to marker5; gene2 contains marker6 to marker15.
markers_gene1 <- paste0("marker", 1:5)
markers_gene2 <- paste0("marker", 6:15)


#' Function: perform_comb_test
#' 
#' This function conducts a combination test on a specified set of genetic markers in 
#' a case-control study. It computes the number of mutations in cases (xs) and 
#' performs two tests:
#'   1. X-value based test using a hypergeometric model with parameters constructed 
#'      from the data.
#'   2. Fisher's exact test based combination, where p-values derived from Fisher's exact
#'      test are combined.
#'
#' @param markers A character vector of marker names.
#' @param method (Optional) A string specifying the combination method. Default is 
#'               "fisher_mean". Supported methods include "fisher_mean", "fisher_median", 
#'               "pearson", "edgington", "stouffer", and "george".
#' @param side (Optional) A string indicating the tail option for converting x values to 
#'              p-values. Default is "two". Options are "two" (two-tailed), "left", or "right". 
#'              Ensure consistency with your test specifications.
#'
#' @return A list containing two elements:
#'         xs_test: The result from the combination test using mutation counts (xs).
#'         ps_test: The result from the combination test using Fisher's p-values.
#'
perform_comb_test <- function(markers, Data, method = "fisher_mean", side = "two") {
    ##Method 1: Based on mutation counts and null distribution descriptions.
    # Extract disease status
    disease_status <- Data$disease_status
    
    # Calculate xs: number of mutations (1's) in cases for each marker
    xs <- sapply(markers, function(m) sum(Data[disease_status == 1, m]))
    
    # Total cases and controls
    total_cases <- sum(disease_status == 1)
    total_controls <- sum(disease_status == 0)
    
    # Construct x_distn_params for hyper distribution for each marker:
    # m = number of cases, n = number of controls, k = total mutations in the marker.
    x_distn_params <- lapply(markers, function(m) {
        k <- sum(Data[, m])
        list(distn = "hyper", m = total_cases, n = total_controls, k = k)
    })
    
    # Combination test using xs and x_distn_params:
    res_xs <- DPComb_tests(xs = xs, side = side, x_distn_params = x_distn_params, method = method)
    
    ##Method 2: Based on Fisher's exact test p-values and x_distn_params. 
    # Compute p-values from Fisher's exact test for each marker.
    # define alternative as "two.sided", "greater", or "less" correspond to side 
    # being "two", "right", or "left".
    alternative <- switch(side, "two" = "two.sided", "right" = "greater", "left" = "less")
    ps <- sapply(markers, function(m) {
        tbl <- table(Data$disease_status, Data[, m])
        fisher.test(tbl, alternative = alternative)$p.value
    })
    ps[ps > 1] <- 1 # Ensure p-values are within [0, 1]. 
                    #Results from Fisher's test may exceed 1 slightly.
    res_ps <- DPComb_tests(ps = ps, side = side, x_distn_params = x_distn_params, method = method)
    
    list(xs_test = res_xs, ps_test = res_ps)
}

# Perform tests for gene1 and gene2.
# Use default method "fisher_mean" and side "two". 
result_gene1 <- perform_comb_test(markers_gene1, Data=case_control)
result_gene2 <- perform_comb_test(markers_gene2, Data=case_control)

# Print the results.
cat("Gene 1 Combination Test Results (using xs):\n")
## Gene 1 Combination Test Results (using xs):
print(result_gene1$xs_test)
## $Sn
## [1] 18.99809
## 
## $pval
## [1] 0.03697353
cat("\nGene 1 Combination Test Results (using Fisher's p-values):\n")
## 
## Gene 1 Combination Test Results (using Fisher's p-values):
print(result_gene1$ps_test)
## $Sn
## [1] 18.99809
## 
## $pval
## [1] 0.03697353
cat("\nGene 2 Combination Test Results (using xs):\n")
## 
## Gene 2 Combination Test Results (using xs):
print(result_gene2$xs_test)
## $Sn
## [1] 22.26252
## 
## $pval
## [1] 0.323226
cat("\nGene 2 Combination Test Results (using Fisher's p-values):\n")
## 
## Gene 2 Combination Test Results (using Fisher's p-values):
print(result_gene2$ps_test)
## $Sn
## [1] 22.26252
## 
## $pval
## [1] 0.323226
## === Test all methods and sides ===
methods <- c("fisher_mean", "fisher_median", "pearson", "edgington", "stouffer", "george")
sides <- c("two", "right", "left")

# Initialize a list to store results
results <- list()

# Perform tests for each `method` and `side` option, and store the results
for (method in methods) {
    for (side in sides) {
        result_gene1 <- perform_comb_test(markers_gene1, Data=case_control, method = method, side = side)
        result_gene2 <- perform_comb_test(markers_gene2, Data=case_control, method = method, side = side)
        
        # Store results in the list
        results[[paste(method, side, sep = "_")]] <- list(
            gene1_xs = result_gene1$xs_test,
            gene1_ps = result_gene1$ps_test,
            gene2_xs = result_gene2$xs_test,
            gene2_ps = result_gene2$ps_test
        )
    }
}

# Convert results to data frames for better visualization
results_gene1_df <- do.call(rbind, lapply(names(results), function(name) {
    res <- results[[name]]
    data.frame(
        Method_Side = name,
        XS_Sn = round(res$gene1_xs$Sn, 2),
        XS_pval = round(res$gene1_xs$pval, 4),
        PS_Sn = round(res$gene1_ps$Sn, 2),
        PS_pval = round(res$gene1_ps$pval, 4)
    )
}))

results_gene2_df <- do.call(rbind, lapply(names(results), function(name) {
    res <- results[[name]]
    data.frame(
        Method_Side = name,
        XS_Sn = round(res$gene2_xs$Sn, 2),
        XS_pval = round(res$gene2_xs$pval, 4),
        PS_Sn = round(res$gene2_ps$Sn, 2),
        PS_pval = round(res$gene2_ps$pval, 4)
    )
}))

# Print the results in table format
knitr::kable(results_gene1_df, caption = "Combination Test Results for Gene 1")
Combination Test Results for Gene 1
Method_Side XS_Sn XS_pval PS_Sn PS_pval
fisher_mean_two 19.00 0.0370 19.00 0.0370
fisher_mean_right 25.93 0.0034 25.93 0.0034
fisher_mean_left 0.84 0.9999 0.84 0.9999
fisher_median_two 18.59 0.0367 18.59 0.0367
fisher_median_right 25.52 0.0034 25.52 0.0034
fisher_median_left 0.84 0.9999 0.84 0.9999
pearson_two 1.77 0.0003 1.77 0.0003
pearson_right 0.84 0.0001 0.84 0.0001
pearson_left 25.93 0.9966 25.93 0.9966
edgington_two 0.80 0.0030 0.80 0.0030
edgington_right 0.40 0.0005 0.40 0.0005
edgington_left 4.60 0.9995 4.60 0.9995
stouffer_two -5.11 0.0075 -5.11 0.0075
stouffer_right -7.16 0.0006 -7.16 0.0006
stouffer_left 7.16 0.9994 7.16 0.9994
george_two -8.61 0.0111 -8.61 0.0111
george_right -12.54 0.0009 -12.54 0.0009
george_left 12.54 0.9991 12.54 0.9991
knitr::kable(results_gene2_df, caption = "Combination Test Results for Gene 2")
Combination Test Results for Gene 2
Method_Side XS_Sn XS_pval PS_Sn PS_pval
fisher_mean_two 22.26 0.3232 22.26 0.3232
fisher_mean_right 31.20 0.0496 31.20 0.0496
fisher_mean_left 9.72 0.9756 9.72 0.9756
fisher_median_two 21.58 0.3216 21.58 0.3216
fisher_median_right 30.63 0.0487 30.63 0.0487
fisher_median_left 9.57 0.9759 9.57 0.9759
pearson_two 13.96 0.1079 13.96 0.1079
pearson_right 9.72 0.0244 9.72 0.0244
pearson_left 31.20 0.9504 31.20 0.9504
edgington_two 4.05 0.1347 4.05 0.1347
edgington_right 3.08 0.0160 3.08 0.0160
edgington_left 6.92 0.9840 6.92 0.9840
stouffer_two -2.57 0.1899 -2.57 0.1899
stouffer_right -6.23 0.0227 -6.23 0.0227
stouffer_left 6.23 0.9773 6.23 0.9773
george_two -4.15 0.2145 -4.15 0.2145
george_right -10.74 0.0284 -10.74 0.0284
george_left 10.74 0.9716 10.74 0.9716

Analysis of Gene 1 using test_case_control_fisher

In this section, we demonstrate the analysis of gene1 (markers 1–5) using the function test_case_control_fisher from the DPComb package. This function performs Fisher’s exact test on each marker and combines the p-values. It is an implementation of the above analysis to conveniently analyze such case-control data.

library(DPComb)
data("case_control", package = "DPComb")

# Analyze gene1 using test_case_control_fisher
test_case_control_fisher(Data = case_control, response = "disease_status", 
                         covariates = paste0("marker", 1:5), 
                         method = "fisher_mean",  alternative = "two.sided")
## $Sn
## [1] 18.99809
## 
## $pval
## [1] 0.03697353