The Geno class

The main object of the package is the Geno class, which stores the SNP-level data and manipulates it through its methods.

class genal.Geno(df, CHR='CHR', POS='POS', SNP='SNP', EA='EA', NEA='NEA', BETA='BETA', SE='SE', P='P', EAF='EAF', keep_columns=True)

A class to handle GWAS-derived data, including SNP rsID, genome position, SNP-trait effects, and effect allele frequencies.

data

Main DataFrame containing SNP data.

Type:

pd.DataFrame

phenotype

Tuple with a DataFrame of individual-level phenotype data and a string representing the phenotype trait column. Initialized after running the ‘set_phenotype’ method.

Type:

pd.DataFrame, str

MR_data

Tuple containing DataFrames for associations with exposure and outcome, and a string for the outcome name. Initialized after running the ‘query_outcome’ method.

Type:

pd.DataFrame, pd.DataFrame, str

MR_results

Tuple containing an MR results DataFrame, a DataFrame of harmonized SNPs, an exposure name, and an outcome name. Assigned after calling the MR method and used for plotting with the MR_plot method.

Type:

pd.DataFrame, pd.DataFrame, str, str

ram

Available memory.

Type:

int

cpus

Number of available CPUs.

Type:

int

checks

Dictionary of checks performed on the main DataFrame.

Type:

dict

name

ID of the object (for internal reference and debugging purposes).

Type:

str

reference_panel

Reference population SNP data used for SNP info adjustments. Initialized when first needed.

Type:

pd.DataFrame

reference_panel_name

String identifying the reference panel (path or population string).

Type:

str

preprocess_data()

Clean and preprocess the ‘data’ attribute (the main dataframe of SNP-level data).

clump()

Clump the main data based on reference panels and return a new Geno object with the clumped data.

prs()

Computes Polygenic Risk Score on genomic data.

set_phenotype()

Assigns a DataFrame with individual-level data and a phenotype trait to the ‘phenotype’ attribute.

association_test()

Computes SNP-trait effect estimates, standard errors, and p-values.

query_outcome()

Extracts SNPs from SNP-outcome association data and stores them in the ‘MR_data’ attribute.

MR()

Performs Mendelian Randomization between the SNP-exposure and SNP-outcome data stored in the ‘MR_data’ attribute. Stores the results in the ‘MR_results’ attribute.

MR_plot()

Plot the results of the MR analysis stored in the ‘MR_results’ attribute.

MRpresso()

Executes the MR-PRESSO algorithm for horizontal pleiotropy correction between the SNP-exposure and SNP-outcome data stored in the ‘MR_data’ attribute.

lift()

Lifts SNP data from one genomic build to another.

query_gwas_catalog()

Query the GWAS Catalog for SNP-trait associations.

Main functions

Preprocessing

The preprocessing of the SNP-level data is performed with the preprocess_data() method:

Geno.preprocess_data(preprocessing='Fill', reference_panel='eur', effect_column=None, keep_multi=None, keep_dups=None, fill_snpids=None, fill_coordinates=None)

Clean and preprocess the main dataframe of Single Nucleotide Polymorphisms (SNP) data.

Parameters:
  • preprocessing (str, optional) – Level of preprocessing to apply. Defaults to ‘Fill’. Options include:

      - “None”: The dataframe is not modified.
      - “Fill”: Missing columns are added based on reference data and invalid values set to NaN, but no rows are deleted.
      - “Fill_delete”: Missing columns are added, and rows with missing, duplicated, or invalid values are deleted.

  • reference_panel (str or pd.DataFrame, optional) – Reference panel for SNP adjustments. Can be a string representing ancestry classification (“eur”, “afr”, “eas”, “sas”, “amr”), a DataFrame with [“CHR”, “SNP”, “POS”, “A1”, “A2”] columns, or a path to a .bim file. Defaults to “eur”.

  • effect_column (str, optional) – Specifies the type of effect column (“BETA” or “OR”). If None, the method tries to determine it. Odds Ratios will be log-transformed and the standard error adjusted. Defaults to None.

  • keep_multi (bool, optional) – Determines if multiallelic SNPs should be kept. If None, defers to preprocessing value. Defaults to None.

  • keep_dups (bool, optional) – Determines if rows with duplicate SNP IDs should be kept. If None, defers to preprocessing value. Defaults to None.

  • fill_snpids (bool, optional) – Decides if the SNP (rsID) column should be created or replaced based on CHR/POS columns and a reference genome. If None, defers to preprocessing value. Defaults to None.

  • fill_coordinates (bool, optional) – Decides if CHR and/or POS should be created or replaced based on SNP column and a reference genome. If None, defers to preprocessing value. Defaults to None.
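
The odds-ratio handling described for effect_column can be illustrated with a minimal sketch. It assumes the standard error is rescaled with the delta-method approximation SE(log OR) ≈ SE(OR) / OR; the text above does not state which adjustment genal applies, and the helper name or_to_beta is hypothetical.

```python
import math

def or_to_beta(odds_ratio, se_or):
    """Convert an odds ratio and its standard error to the log-odds scale.

    Assumption: the standard error is rescaled with the delta method,
    SE(log OR) ~= SE(OR) / OR. This is a common convention and not
    necessarily the exact adjustment performed by genal.
    """
    if odds_ratio <= 0:
        raise ValueError("An odds ratio must be strictly positive.")
    beta = math.log(odds_ratio)      # effect on the log-odds scale
    se_beta = se_or / odds_ratio     # delta-method standard error
    return beta, se_beta

beta, se_beta = or_to_beta(1.25, 0.05)
print(f"BETA={beta:.4f}, SE={se_beta:.4f}")
```

An odds ratio of 1 maps to a BETA of 0, which is why a column centred around 1 rather than 0 is a typical sign that it holds odds ratios.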

Clumping

Clumping is performed with the clump() method:

Geno.clump(kb=250, r2=0.1, p1=5e-08, p2=0.01, reference_panel='eur')

Clump the data based on linkage disequilibrium and return another Geno object with the clumped data. The clumping process is executed using plink.

Parameters:
  • kb (int, optional) – Clumping window in kilobases (thousands of base pairs). Default is 250.

  • r2 (float, optional) – Linkage disequilibrium threshold, values between 0 and 1. Default is 0.1.

  • p1 (float, optional) – P-value threshold during clumping. SNPs with a P-value higher than this value are excluded. Default is 5e-8.

  • p2 (float, optional) – P-value threshold post-clumping to further filter the clumped SNPs. If p2 < p1, it won’t be considered. Default is 0.01.

  • reference_panel (str, optional) – The reference population for linkage disequilibrium values. Accepts values “eur”, “sas”, “afr”, “eas”, “amr”. Alternatively, a path leading to a specific bed/bim/fam reference panel can be provided. Default is “eur”.

Returns:

A new Geno object based on the clumped data.

Return type:

genal.Geno
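
To make the roles of kb, r2 and p1 concrete, here is a simplified greedy clumping sketch on toy data. It is an illustration of the general idea only, not plink's implementation: the ld lookup table is a hypothetical stand-in for the r2 values plink derives from the reference panel, and the p2 threshold is left out.

```python
def clump_snps(snps, ld, kb=250, r2=0.1, p1=5e-8):
    """Greedy LD clumping on toy data (illustration only, not plink's code).

    snps: list of dicts with keys "SNP", "CHR", "POS" and "P".
    ld:   dict mapping frozenset({snp_a, snp_b}) to a pairwise r2 value
          (hypothetical lookup; missing pairs are treated as r2 = 0).
    """
    # SNPs with a p-value above p1 are excluded; the rest are visited
    # from the most to the least significant.
    candidates = sorted((s for s in snps if s["P"] <= p1), key=lambda s: s["P"])
    index_snps, clumped_away = [], set()
    for lead in candidates:
        if lead["SNP"] in clumped_away:
            continue
        index_snps.append(lead["SNP"])
        for other in candidates:
            if other["SNP"] == lead["SNP"] or other["SNP"] in clumped_away:
                continue
            # kb is expressed in kilobases, hence the factor 1000.
            in_window = (other["CHR"] == lead["CHR"]
                         and abs(other["POS"] - lead["POS"]) <= kb * 1000)
            pair_r2 = ld.get(frozenset((lead["SNP"], other["SNP"])), 0.0)
            if in_window and pair_r2 >= r2:
                clumped_away.add(other["SNP"])
    return index_snps

toy_snps = [
    {"SNP": "rs1", "CHR": 1, "POS": 1_000, "P": 1e-10},
    {"SNP": "rs2", "CHR": 1, "POS": 50_000, "P": 1e-9},   # in LD with rs1
    {"SNP": "rs3", "CHR": 1, "POS": 900_000, "P": 2e-9},  # outside the window
    {"SNP": "rs4", "CHR": 1, "POS": 2_000, "P": 1e-3},    # fails p1
]
toy_ld = {frozenset(("rs1", "rs2")): 0.8, frozenset(("rs1", "rs3")): 0.9}
print(clump_snps(toy_snps, toy_ld))
```

In this toy input, rs3 survives despite its high r2 with rs1 because it lies more than 250 kb away, which shows how the window and the LD threshold act together.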

Polygenic Risk Scoring

The computation of a polygenic risk score in a target population is performed with the prs() method:

Geno.prs(name=None, weighted=True, path=None, proxy=False, reference_panel='eur', kb=5000, r2=0.6, window_snps=5000)

Compute a Polygenic Risk Score (PRS) and save it as a CSV file in the current directory.

Parameters:
  • name (str, optional) – Name or path of the saved PRS file.

  • weighted (bool, optional) – If True, performs a PRS weighted by the BETA column estimates. If False, performs an unweighted PRS. Default is True.

  • path (str, optional) – Path to a bed/bim/fam set of genetic files for PRS calculation. If files are split by chromosomes, replace the chromosome number with ‘$’. For instance: path = “ukb_chr$_file”. If not provided, it will use the genetic path most recently used (if any). Default is None.

  • position (bool, optional) – Use the genomic positions instead of the SNP names to find the SNPs in the genetic data (recommended).

  • proxy (bool, optional) – If True, proxies are searched. Default is False.

  • reference_panel (str, optional) – The reference population used to derive linkage disequilibrium values and find proxies (only if proxy=True). Acceptable values include “eur”, “sas”, “afr”, “eas”, “amr” or a path to a specific bed/bim/fam panel. Default is “eur”.

  • kb (int, optional) – Width of the genomic window to look for proxies. Default is 5000.

  • r2 (float, optional) – Minimum linkage disequilibrium value with the main SNP for a proxy to be included. Default is 0.6.

  • window_snps (int, optional) – Compute the LD value for SNPs that are not more than x SNPs away from the main SNP. Default is 5000.
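
The ‘$’ placeholder in the path argument can be pictured with the short sketch below. The helper name expand_chromosome_paths and the choice of autosomes 1 to 22 are assumptions made for illustration; the text above does not say which chromosomes genal iterates over.

```python
def expand_chromosome_paths(path, chromosomes=range(1, 23)):
    """Expand a '$' placeholder into one path per chromosome.

    Assumption: only autosomes 1-22 are substituted. A path without '$'
    is treated as a single, unsplit set of genetic files.
    """
    if "$" not in path:
        return [path]
    return [path.replace("$", str(chrom)) for chrom in chromosomes]

paths = expand_chromosome_paths("ukb_chr$_file")
print(paths[0], paths[-1], len(paths))
```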

Returns:

The computed PRS data.

Return type:

pd.DataFrame

Raises:

ValueError – If the data hasn’t been clumped and ‘clumped’ parameter is True.
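
The difference between a weighted and an unweighted score can be sketched as follows. This is a conceptual illustration, not genal's implementation (which delegates the computation to the genetic files given in path); it assumes that the unweighted score simply counts effect alleles, and the function and variable names are hypothetical.

```python
def polygenic_score(dosages, betas, weighted=True):
    """Compute a polygenic score for one individual.

    dosages: dict mapping SNP ID to the number of effect alleles (0, 1 or 2).
    betas:   dict mapping SNP ID to the BETA estimate of the effect allele.
    Assumption: the unweighted score gives every SNP a weight of 1.
    SNPs without a BETA estimate are ignored.
    """
    score = 0.0
    for snp, count in dosages.items():
        if snp not in betas:
            continue
        weight = betas[snp] if weighted else 1.0
        score += weight * count
    return score

dosages = {"rs1": 2, "rs2": 1, "rs3": 0}
betas = {"rs1": 0.5, "rs2": -0.25, "rs3": 1.0}
print(polygenic_score(dosages, betas, weighted=True))
print(polygenic_score(dosages, betas, weighted=False))
```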

Querying outcome data

Before running Mendelian Randomization, the genetic instruments are extracted from the Geno object containing the SNP-outcome association data with the query_outcome() method:

Geno.query_outcome(outcome, name=None, proxy=True, reference_panel='eur', kb=5000, r2=0.6, window_snps=5000)

Prepares dataframes required for Mendelian Randomization (MR) with the SNP information in data as exposure.

Queries the outcome data, with or without proxying, and assigns a tuple to the MR_data attribute: (exposure_data, outcome_data, name) ready for MR methods.

Parameters:
  • outcome – Can be a Geno object (from a GWAS) or a path to an .h5 or .hdf5 file (created with the Geno.save() method).

  • name (str, optional) – Name for the outcome data. Defaults to None.

  • proxy (bool, optional) – If true, proxies are searched. Default is True.

  • reference_panel (str, optional) – The reference population to get linkage disequilibrium values and find proxies (only if proxy=True). Acceptable values include “eur”, “sas”, “afr”, “eas”, “amr” or a path to a specific bed/bim/fam panel. Default is “eur”.

  • kb (int, optional) – Width of the genomic window to look for proxies. Default is 5000.

  • r2 (float, optional) – Minimum linkage disequilibrium value with the main SNP for a proxy to be included. Default is 0.6.

  • window_snps (int, optional) – Compute the LD value for SNPs that are not more than x SNPs away from the main SNP. Default is 5000.

Returns:

Sets the MR_data attribute for the instance.

Return type:

None
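
The proxy search controlled by kb and r2 can be sketched on toy data as below. This is a simplified illustration under stated assumptions: kb is interpreted as the maximum distance on either side of the missing SNP, the best proxy is the eligible SNP with the highest r2, and the function name find_proxy and its inputs are hypothetical. The window_snps parameter is not modelled.

```python
def find_proxy(target_pos, ld_partners, available, kb=5000, r2=0.6):
    """Pick a proxy for an instrument that is absent from the outcome data.

    target_pos:  genomic position of the missing SNP.
    ld_partners: list of (snp_id, position, r2_with_target) tuples on the
                 same chromosome (hypothetical reference-panel lookup).
    available:   set of SNP IDs present in the outcome data.
    Returns the eligible SNP with the highest r2, or None if there is none.
    """
    eligible = [
        (pair_r2, snp)
        for snp, pos, pair_r2 in ld_partners
        if snp in available
        and pair_r2 >= r2
        and abs(pos - target_pos) <= kb * 1000  # kb is in kilobases
    ]
    if not eligible:
        return None
    return max(eligible)[1]

partners = [
    ("rsA", 1_050_000, 0.90),  # strong LD but absent from the outcome data
    ("rsB", 1_010_000, 0.70),  # eligible
    ("rsC", 9_000_000, 0.95),  # outside the 5000 kb window
    ("rsD", 1_020_000, 0.50),  # r2 below the threshold
]
print(find_proxy(1_000_000, partners, available={"rsB", "rsC", "rsD"}))
```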

Mendelian Randomization

Various Mendelian Randomization methods are computed with the MR() method:

Geno.MR(methods=['IVW', 'IVW-FE', 'WM', 'Simple-mode', 'Egger'], action=2, eaf_threshold=0.42, heterogeneity=False, nboot=1000, penk=20, phi=1, exposure_name=None, outcome_name=None, cpus=-1)

Executes Mendelian Randomization (MR) using the SNP-exposure and SNP-outcome data stored in the MR_data attribute by the query_outcome method.

Parameters:
  • methods (list, optional) – List of MR methods to run. Default is [“IVW”, “IVW-FE”, “WM”, “Simple-mode”, “Egger”]. Possible options include:

      - “IVW”: inverse variance-weighted with random effects and under-dispersion correction
      - “IVW-FE”: inverse variance-weighted with fixed effects
      - “IVW-RE”: inverse variance-weighted with random effects and without under-dispersion correction
      - “UWR”: unweighted regression
      - “WM”: weighted median (bootstrapped standard errors)
      - “WM-pen”: penalised weighted median (bootstrapped standard errors)
      - “Simple-median”: simple median (bootstrapped standard errors)
      - “Sign”: sign concordance test
      - “Egger”: Egger regression
      - “Egger-boot”: Egger regression with bootstrapped standard errors
      - “Simple-mode”: simple mode method
      - “Weighted-mode”: weighted mode method

  • action (int, optional) – How to treat palindromes when harmonizing exposure and outcome data. Default is 2. Accepts:

      - 1: Doesn’t flip them (assumes all alleles are on the forward strand)
      - 2: Uses allele frequencies to attempt to flip (conservative, default)
      - 3: Removes all palindromic SNPs (very conservative)

  • eaf_threshold (float, optional) – Max effect allele frequency accepted when flipping palindromic SNPs (relevant if action=2). Default is 0.42.

  • heterogeneity (bool, optional) – If True, includes heterogeneity tests in the results (Cochran’s Q test). Default is False.

  • nboot (int, optional) – Number of bootstrap replications for methods with bootstrapping. Default is 1000.

  • penk (int, optional) – Penalty value for the WM-pen method. Default is 20.

  • phi (int, optional) – Factor for the bandwidth parameter used in the kernel density estimation of the mode methods. Default is 1.

  • exposure_name (str, optional) – Name of the exposure data (only for display purposes).

  • outcome_name (str, optional) – Name of the outcome data (only for display purposes).
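
The three action values can be pictured with the sketch below, which decides whether a SNP is kept before harmonization. It is an illustration under an explicit assumption: for action=2, a palindromic SNP is treated as resolvable only when its minor allele frequency does not exceed eaf_threshold. The exact rule applied by genal is not spelled out above, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindromic(ea, nea):
    """A SNP is palindromic when its two alleles are complementary (A/T or C/G)."""
    return COMPLEMENT.get(ea.upper()) == nea.upper()

def keep_for_harmonization(ea, nea, eaf, action=2, eaf_threshold=0.42):
    """Decide whether a SNP is kept, following the three documented actions.

    Assumption for action=2: the strand of a palindromic SNP can only be
    inferred when its minor allele frequency is at most eaf_threshold.
    """
    if action == 1 or not is_palindromic(ea, nea):
        return True            # nothing ambiguous, or ambiguity is ignored
    if action == 3:
        return False           # drop every palindromic SNP
    if eaf is None:
        return False           # no frequency available to resolve the strand
    minor_allele_freq = min(eaf, 1 - eaf)
    return minor_allele_freq <= eaf_threshold

print(keep_for_harmonization("A", "T", eaf=0.20))  # frequency far from 0.5
print(keep_for_harmonization("A", "T", eaf=0.48))  # frequency too close to 0.5
```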

Returns:

A table with MR results.

Return type:

pd.DataFrame
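
As a point of reference for the “IVW-FE” option, the textbook fixed-effects inverse variance-weighted estimator can be written in a few lines. This sketch uses the standard first-order weights (the uncertainty in the exposure estimates is ignored) and is not taken from genal's source; the function name is hypothetical.

```python
import math

def ivw_fixed_effects(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effects inverse variance-weighted MR estimate.

    Each SNP contributes a Wald ratio beta_outcome / beta_exposure,
    weighted by beta_exposure**2 / se_outcome**2 (first-order weights).
    Returns the pooled causal estimate and its standard error.
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total_weight = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return estimate, standard_error

estimate, standard_error = ivw_fixed_effects(
    beta_exposure=[0.10, 0.20],
    beta_outcome=[0.05, 0.10],
    se_outcome=[0.01, 0.02],
)
print(f"estimate={estimate:.3f}, se={standard_error:.4f}")
```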

MR-PRESSO

The MR-PRESSO algorithm to detect and correct horizontal pleiotropy is executed with the MRpresso() method:

Geno.MRpresso(action=2, eaf_threshold=0.42, n_iterations=10000, outlier_test=True, distortion_test=True, significance_p=0.05, cpus=-1)

Executes the MR-PRESSO Mendelian Randomization algorithm for detection and correction of horizontal pleiotropy.

Parameters:
  • action (int, optional) – Treatment for palindromes when harmonizing exposure and outcome data. Options:

      - 1: Don’t flip (assume all alleles are on the forward strand)
      - 2: Use allele frequencies to flip (default)
      - 3: Remove all palindromic SNPs

  • eaf_threshold (float, optional) – Max effect allele frequency when flipping palindromic SNPs (relevant if action=2). Default is 0.42.

  • n_iterations (int, optional) – Number of random data generation steps for improved result stability. Default is 10000.

  • outlier_test (bool, optional) – Identify outlier SNPs responsible for horizontal pleiotropy if global test p_value < significance_p. Default is True.

  • distortion_test (bool, optional) – Test significant distortion in causal estimates before and after outlier removal if global test p_value < significance_p. Default is True.

  • significance_p (float, optional) – Statistical significance threshold for horizontal pleiotropy detection (both global test and outlier identification). Default is 0.05.

  • cpus (int, optional) – Number of CPU cores to use for the parallel random data generation. Default is -1.

Returns:

Contains the following elements:
  • mod_table: DataFrame containing the original (before outlier removal) and outlier-corrected (after outlier removal) inverse variance-weighted MR results.

  • GlobalTest: p-value of the global MR-PRESSO test indicating the presence of horizontal pleiotropy.

  • OutlierTest: DataFrame assigning a p-value to each SNP representing the likelihood of this SNP being responsible for the global pleiotropy. Set to NaN if global test p_value > significance_p.

  • DistortionTest: p-value for the distortion test.

Return type:

tuple
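
The link between n_iterations and result stability can be seen from how a simulation-based p-value is formed. The sketch below shows a generic empirical p-value, in which an observed statistic is compared with values simulated under the null hypothesis. It is a conceptual illustration rather than genal's code, and the function name and toy numbers are hypothetical.

```python
def empirical_p_value(observed, simulated):
    """Share of simulated null statistics at least as extreme as the observed one.

    With n simulated values, the smallest non-zero p-value that can be
    reported is 1 / n, so more iterations give a finer and more stable result.
    """
    if not simulated:
        raise ValueError("At least one simulated value is required.")
    exceedances = sum(1 for value in simulated if value >= observed)
    return exceedances / len(simulated)

# Toy null distribution: 10 simulated statistics, 2 of them >= the observed one.
null_values = [1.2, 0.8, 3.5, 2.1, 0.4, 1.9, 4.2, 0.7, 1.1, 2.6]
print(empirical_p_value(3.0, null_values))
```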

Phenotype assignment

Before running SNP-association tests, a dataframe with phenotypic data is assigned to the Geno object with the set_phenotype() method:

Geno.set_phenotype(data, IID=None, PHENO=None, PHENO_type=None, alternate_control=False)

Assign a phenotype dataframe to the .phenotype attribute.

Parameters:
  • data (pd.DataFrame) – DataFrame containing individual-level row data with at least an individual IDs column and one phenotype column.

  • IID (str, optional) – Name of the individual IDs column in ‘data’. These IDs should correspond to the genetic IDs in the FAM file that will be used for association testing.

  • PHENO (str, optional) – Name of the phenotype column in ‘data’ which will be used as the dependent variable for association tests.

  • PHENO_type (str, optional) – If not specified, the function will try to infer if the phenotype is binary or quantitative. To bypass this, use “quant” for quantitative or “binary” for binary phenotypes. Default is None.

  • alternate_control (bool, optional) – By default, the function assumes that for a binary trait, the controls have the most frequent value. Set to True if this is not the case. Default is False.

Returns:

Sets the .phenotype attribute for the instance.

Return type:

None

Note

This method sets the .phenotype attribute which is essential to perform single-SNP association tests using the association_test method.
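
The inference described for PHENO_type and alternate_control can be sketched as follows. The rule used here (exactly two distinct non-missing values means a binary trait) is an assumption chosen for illustration, since the text above does not detail how genal infers the type, and the function name is hypothetical.

```python
from collections import Counter

def infer_phenotype_type(values, alternate_control=False):
    """Guess whether a phenotype is binary or quantitative.

    Assumption: a trait with exactly two distinct non-missing values is
    binary. Controls are taken to be the most frequent value, unless
    alternate_control is True, in which case they are the least frequent.
    Returns (phenotype_type, control_value).
    """
    counts = Counter(v for v in values if v is not None)
    if len(counts) != 2:
        return "quant", None
    (most_common, _), (least_common, _) = counts.most_common(2)
    control = least_common if alternate_control else most_common
    return "binary", control

print(infer_phenotype_type([0, 0, 0, 1, None, 0, 1]))
print(infer_phenotype_type([1.7, 1.8, 1.65, 1.9]))
```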

SNP-association tests

SNP-association testing is conducted with the association_test() method:

Geno.association_test(path=None, covar=[], standardize=True)

Conduct single-SNP association tests against a phenotype.

Parameters:
  • path (str, optional) – Path to a bed/bim/fam set of genetic files. If files are split by chromosomes, replace the chromosome number with ‘$’. For instance: path = “ukb_chr$_file”. Default is None.

  • covar (list, optional) – List of columns in the phenotype dataframe to be used as covariates in the association tests. Default is an empty list.

  • standardize (bool, optional) – If True, it will standardize a quantitative phenotype before performing association tests. This is typically done to make results more interpretable. Default is True.

Returns:

Updates the BETA, SE, and P columns of the data attribute based on the results of the association tests.

Return type:

None

Note

This method requires the phenotype to be set using the set_phenotype() method.
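
The effect of standardize=True on a quantitative phenotype can be illustrated with a plain z-score transformation, after which BETA estimates are expressed in standard deviations of the trait. This sketch assumes the sample standard deviation is used, which the text above does not specify, and the function name is hypothetical.

```python
import statistics

def standardize_phenotype(values):
    """Rescale a quantitative phenotype to mean 0 and standard deviation 1.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("A constant phenotype cannot be standardized.")
    return [(value - mean) / sd for value in values]

print(standardize_phenotype([2.0, 4.0, 6.0]))
```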

Genetic lifting

Lifting the SNP data to another genetic build is done with the lift() method:

Geno.lift(start='hg19', end='hg38', replace=False, extraction_file=False, chain_file=None, name=None, liftover_path=None)

Perform a liftover from one genetic build to another.

Parameters:
  • start (str, optional) – Current build of the data. Default is “hg19”.

  • end (str, optional) – Target build for the liftover. Default is “hg38”.

  • replace (bool, optional) – If True, updates the data attribute in place. Default is False.

  • extraction_file (bool, optional) – If True, prints a CHR POS SNP space-delimited file. Default is False.

  • chain_file (str, optional) – Path to a local chain file for the lift. If provided, start and end arguments are not considered. Default is None.

  • name (str, optional) – Filename or filepath (without extension) to save the lifted dataframe. If not provided, the data is not saved.

  • liftover_path (str, optional) – Specify the path to the UCSC liftover executable. If not provided, the lift will be done in Python (slower for large numbers of SNPs).

Returns:

Data after being lifted.

Return type:

pd.DataFrame
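
Conceptually, a chain file describes aligned blocks between two builds, and a position is lifted by shifting it with the offset of the block that contains it. The sketch below shows this idea on made-up coordinates. It ignores strand changes and the chain file syntax itself, and it is not the code path used by genal or by the UCSC tool.

```python
def lift_position(pos, blocks):
    """Map a position from the source build to the target build.

    blocks: list of (source_start, source_end, target_start) tuples with
            half-open intervals [source_start, source_end). The coordinates
            are invented for illustration and come from no real chain file.
    Returns the lifted position, or None when the position falls in a gap.
    """
    for source_start, source_end, target_start in blocks:
        if source_start <= pos < source_end:
            return target_start + (pos - source_start)
    return None  # unmapped: no aligned block covers this position

toy_blocks = [(1000, 2000, 5000), (3000, 4000, 7500)]
print(lift_position(1500, toy_blocks))
print(lift_position(2500, toy_blocks))
```

SNPs that fall in a gap between blocks cannot be lifted, which is why a lifted dataset may contain fewer SNPs than the original.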

GWAS Catalog

Querying the GWAS Catalog to extract traits associated with the SNPs is done with the query_gwas_catalog() method:

Geno.query_gwas_catalog(p_threshold=5e-08, return_p=False, return_study=False, replace=True, max_associations=None, timeout=-1)

Queries the GWAS Catalog REST API and adds an “ASSOC” column containing associated traits for each SNP.

Parameters:
  • p_threshold (float, optional) – Only associations that are at least as significant are reported. Default is 5e-8.

  • return_p (bool, optional) – If True, include the p-value in the results. Default is False.

  • return_study (bool, optional) – If True, include the ID of the study from which the association is derived in the results. Default is False.

  • replace (bool, optional) – If True, updates the data attribute in place. Default is True.

  • max_associations (int, optional) – If not None, only the first max_associations associations are reported for each SNP. Default is None.

  • timeout (int, optional) – Timeout for each query in seconds. Default is -1 (custom timeout based on number of SNPs to query). Choose None for no timeout.

Returns:

Data attribute with an additional column “ASSOC”.

The elements of this column are lists of strings or tuples depending on the return_p and return_study flags. If the SNP could not be queried, the value is set to “FAILED_QUERY”.

Return type:

pd.DataFrame
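
The shape of the “ASSOC” entries under the different flags can be pictured with the sketch below, which post-processes a list of associations for one SNP. The input dictionary keys, the tuple ordering (trait, p-value, study) and the function name are assumptions made for illustration; no request is sent to the GWAS Catalog.

```python
def format_associations(associations, p_threshold=5e-8, return_p=False,
                        return_study=False, max_associations=None):
    """Build one "ASSOC" entry from raw associations (toy illustration).

    associations: list of dicts with hypothetical keys "trait", "p", "study".
    Returns a list of trait strings, or a list of tuples when return_p
    and/or return_study is True.
    """
    # Keep only associations at least as significant as the threshold.
    kept = [a for a in associations if a["p"] <= p_threshold]
    if max_associations is not None:
        kept = kept[:max_associations]
    entries = []
    for assoc in kept:
        if not (return_p or return_study):
            entries.append(assoc["trait"])
            continue
        entry = [assoc["trait"]]
        if return_p:
            entry.append(assoc["p"])
        if return_study:
            entry.append(assoc["study"])
        entries.append(tuple(entry))
    return entries

raw = [
    {"trait": "body mass index", "p": 1e-12, "study": "STUDY_A"},
    {"trait": "height", "p": 3e-9, "study": "STUDY_B"},
    {"trait": "asthma", "p": 2e-4, "study": "STUDY_C"},
]
print(format_associations(raw))
print(format_associations(raw, return_p=True, max_associations=1))
```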