genal package

Submodules

genal.GENO module

class genal.GENO.GENO(df, name='noname', CHR='CHR', POS='POS', SNP='SNP', EA='EA', NEA='NEA', BETA='BETA', SE='SE', P='P', EAF='EAF', preprocessing=1, reference_panel='eur', clumped=None, effect_column=None, keep_columns=None, keep_multi=None, keep_dups=None, fill_snpids=None, fill_coordinates=None)

Bases: object

A class to handle GWAS-derived data, including SNP rsID, genome position, SNP-trait effects, and effect allele frequencies.

name

Name of the object.

Type:

str

data

Main DataFrame containing SNP data.

Type:

pd.DataFrame

data_clumped

DataFrame of clumped data. Initialized after ‘clump’ method execution or when using clumped=True during initialization.

Type:

pd.DataFrame, optional

phenotype

Tuple with a DataFrame of individual-level phenotype data and a string representing the phenotype trait column. Initialized after running the ‘set_phenotype’ method.

Type:

pd.DataFrame, str

MR_data

Tuple containing DataFrames for associations with exposure and outcome, and a string for the outcome name. Initialized after running the ‘query_outcome’ method.

Type:

pd.DataFrame, pd.DataFrame, str

ram

Available memory.

Type:

int

cpus

Number of available CPUs.

Type:

int

checks

List of checks performed on the main DataFrame.

Type:

list

reference_panel

Reference population SNP data used for SNP info adjustments. Initialized when first needed.

Type:

pd.DataFrame

clump()

Clumps the main data and stores the result in data_clumped.

prs()

Computes Polygenic Risk Score on genomic data.

set_phenotype()

Assigns a DataFrame with individual-level data and a phenotype trait to the phenotype attribute.

association_test()

Computes SNP-trait effect estimates, standard errors, and p-values.

query_outcome()

Extracts SNPs from outcome data with proxying and initializes MR_data.

MR()

Performs Mendelian Randomization between SNP-exposure and SNP-outcome data.

MRpresso()

Executes the MR-PRESSO algorithm for horizontal pleiotropy correction between SNP-exposure and SNP-outcome data.

lift()

Lifts SNP data from one genomic build to another.

MR(methods=['IVW', 'IVW-FE', 'UWR', 'WM', 'WM-pen', 'Simple-median', 'Sign', 'Egger', 'Egger-boot'], action=2, eaf_threshold=0.42, heterogeneity=False, nboot=10000, penk=20)

Executes Mendelian Randomization (MR) using the data_clumped attribute as exposure data and MR_data attribute as outcome data queried using the query_outcome method.

Parameters:
  • methods (list, optional) – List of MR methods to run. Possible options include:

      - “IVW”: inverse variance-weighted with random effects and under-dispersion correction
      - “IVW-FE”: inverse variance-weighted with fixed effects
      - “IVW-RE”: inverse variance-weighted with random effects and without under-dispersion correction
      - “UWR”: unweighted regression
      - “WM”: weighted median (bootstrapped standard errors)
      - “WM-pen”: penalised weighted median (bootstrapped standard errors)
      - “Simple-median”: simple median (bootstrapped standard errors)
      - “Sign”: sign concordance test
      - “Egger”: Egger regression
      - “Egger-boot”: Egger regression with bootstrapped standard errors

    Default is [“IVW”, “IVW-FE”, “UWR”, “WM”, “WM-pen”, “Simple-median”, “Sign”, “Egger”, “Egger-boot”].

  • action (int, optional) – How to treat palindromes when harmonizing exposure and outcome data. Accepts:

      - 1: Doesn’t flip them (assumes all alleles are on the forward strand)
      - 2: Uses allele frequencies to attempt to flip them (conservative, default)
      - 3: Removes all palindromic SNPs (very conservative)

  • eaf_threshold (float, optional) – Max effect allele frequency accepted when flipping palindromic SNPs (relevant if action=2). Default is 0.42.

  • heterogeneity (bool, optional) – If True, includes heterogeneity tests (Cochran’s Q test) in the results. Default is False.

  • nboot (int, optional) – Number of bootstrap replications for methods with bootstrapping. Default is 10000.

  • penk (int, optional) – Penalty value for the WM-pen method. Default is 20.

Returns:

A table with MR results.

Return type:

pd.DataFrame

MRpresso(action=2, eaf_threshold=0.42, n_iterations=10000, outlier_test=True, distortion_test=True, significance_p=0.05, cpus=-1)

Executes the MR-PRESSO Mendelian Randomization algorithm for detection and correction of horizontal pleiotropy.

Parameters:
  • action (int, optional) – How to treat palindromes when harmonizing exposure and outcome data. Options:

      - 1: Don’t flip (assume all alleles are on the forward strand)
      - 2: Use allele frequencies to flip (default)
      - 3: Remove all palindromic SNPs

  • eaf_threshold (float, optional) – Max effect allele frequency when flipping palindromic SNPs (relevant if action=2). Default is 0.42.

  • n_iterations (int, optional) – Number of random data generation steps for improved result stability. Default is 10000.

  • outlier_test (bool, optional) – Identify outlier SNPs responsible for horizontal pleiotropy if global test p_value < significance_p. Default is True.

  • distortion_test (bool, optional) – Test significant distortion in causal estimates before and after outlier removal if global test p_value < significance_p. Default is True.

  • significance_p (float, optional) – Statistical significance threshold for horizontal pleiotropy detection (both global test and outlier identification). Default is 0.05.

  • cpus (int, optional) – Number of CPU cores to use for the parallel random data generation. Default is -1.

Returns:

Contains the following elements:
  • mod_table: DataFrame containing the original (before outlier removal) and outlier-corrected (after outlier removal) inverse variance-weighted MR results.

  • GlobalTest: p-value of the global MR-PRESSO test indicating the presence of horizontal pleiotropy.

  • OutlierTest: DataFrame assigning a p-value to each SNP representing the likelihood of this SNP being responsible for the global pleiotropy. Set to NaN if global test p_value > significance_p.

  • DistortionTest: p-value for the distortion test.

Return type:

list

association_test(covar=[], standardize=True, clumped=True)

Conduct single-SNP association tests against a phenotype.

This method requires the phenotype to be set using the set_phenotype() function. The method also expects the extract_snps method to have been called prior to this.

Parameters:
  • covar (list, optional) – List of columns in the phenotype dataframe to be used as covariates in the association tests. Default is an empty list.

  • standardize (bool, optional) – If True, it will standardize a quantitative phenotype before performing association tests. This is typically done to make results more interpretable. Default is True.

  • clumped (bool, optional) – If True, association tests will be run for the clumped list of SNPs. If False, it will use the unclumped list. Default is True.

Returns:

Updates the BETA, SE, and P columns of the data attribute based on the results of the association tests.

Return type:

None

clump(kb=250, r2=0.1, p1=5e-08, p2=0.01, reference_panel='eur')

Clump the data based on linkage disequilibrium and assign it to the .data_clumped attribute. The clumping process is executed using plink.

Parameters:
  • kb (int, optional) – Clumping window in kilobases (thousands of base pairs). Default is 250.

  • r2 (float, optional) – Linkage disequilibrium threshold, values between 0 and 1. Default is 0.1.

  • p1 (float, optional) – P-value threshold during clumping. SNPs above this value are not considered. Default is 5e-8.

  • p2 (float, optional) – P-value threshold post-clumping to further filter the clumped SNPs. If p2 < p1, it won’t be considered. Default is 0.01.

  • reference_panel (str, optional) – The reference population for linkage disequilibrium values. Accepts values “eur”, “sas”, “afr”, “eas”, “amr”. Alternatively, a path leading to a specific bed/bim/fam reference panel can be provided. Default is “eur”.

Returns:

The result is stored in the .data_clumped attribute.

Return type:

None

copy()

Create a deep copy of the GENO instance.

Returns:

A deep copy of the instance.

Return type:

GENO

extract_snps(clumped=True, path=None)

Extract the list of SNPs present in the data.

Parameters:
  • clumped (bool, optional) – If True, SNPs will be extracted from .data_clumped. If False, from .data. Default is True.

  • path (str, optional) – Path to a bed/bim/fam set of genetic files. If files are split by chromosomes, replace the chromosome number with ‘$’. For instance: path = “ukb_chr$_file”. Default is None.

Returns:

The output is a bed/bim/fam triple in the tmp_GENAL folder with the format “{name}_extract_allchr” which includes the SNPs from the UKB.

Return type:

None

Notes

The provided path is saved to the config file. If this function is called again, you don’t need to specify the path if you want to use the same genomic files.
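
To illustrate the ‘$’ placeholder convention described for the path argument, the following minimal sketch expands a template into one bed/bim/fam prefix per chromosome. The helper name expand_chromosome_paths and the choice of autosomes 1 to 22 are assumptions made for this example; they are not part of the genal API, and genal may resolve the placeholder differently.

```python
def expand_chromosome_paths(path, chromosomes=range(1, 23)):
    """Return one file prefix per chromosome for a '$' template path.

    Hypothetical helper for illustration only. If the path contains no '$',
    the files are assumed not to be split and the path is returned as is.
    """
    if "$" not in path:
        return [path]
    return [path.replace("$", str(chrom)) for chrom in chromosomes]


paths = expand_chromosome_paths("ukb_chr$_file")
print(paths[0], paths[-1], len(paths))
```

Each returned prefix would then point to a .bed, .bim and .fam file sharing that prefix.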

get_reference_panel(reference_panel='eur')

Retrieve or set the reference panel for the GENO object.

If the GENO object does not have a reference panel attribute set, this method will try to set it based on the provided reference_panel argument. This can be a string indicating a predefined reference panel, a DataFrame with specific columns, or a path to a .bim file.

Parameters:

reference_panel (str or pd.DataFrame, optional) – A string indicating a predefined reference panel (default is “eur”), a DataFrame with the necessary columns, or a valid path to a .bim file.

Returns:

The reference panel DataFrame for the GENO object.

Return type:

pd.DataFrame

Raises:

ValueError – If the provided DataFrame doesn’t have the necessary columns.

init_attributes(name, clumped)

Initializes several attributes for the GENO object, including the name, outcome, and memory/CPU related attributes. Also guesses whether the provided data is clumped based on the P column.

Parameters:
  • name (str) – Name for the GENO object.

  • clumped (bool, optional) – Specifies if the data is already clumped. If None, the method tries to determine this based on the P column.

name

Name for the GENO object.

Type:

str

outcome

List of outcomes (initialized as empty).

Type:

list

data_clumped

Dataframe of clumped data if clumped is True.

Type:

pd.DataFrame

cpus

Number of CPUs to be used.

Type:

int

ram

Amount of RAM to be used in MBs.

Type:

int

lift(clumped=True, start='hg19', end='hg38', replace=False, extraction_file=False, chain_file=None, name=None, liftover=False, liftover_path=None)

Perform a liftover from one genetic build to another.

Parameters:
  • clumped (bool, optional) – If True, uses data in .data_clumped. If False, uses data in .data. Default is True.

  • start (str, optional) – Current build of the data. Default is “hg19”.

  • end (str, optional) – Target build for the liftover. Default is “hg38”.

  • replace (bool, optional) – If True, updates the object data attributes in place. Default is False.

  • extraction_file (bool, optional) – If True, prints a CHR POS SNP space-delimited file. Default is False.

  • chain_file (str, optional) – Path to a local chain file for the lift. If provided, start and end arguments are not considered. Default is None.

  • name (str, optional) – Filename or filepath (without extension) to save the lifted dataframe. Default saves as [name]_lifted.txt in the current directory.

Returns:

Data after being lifted.

Return type:

pd.DataFrame

prs(name=None, clumped=True, weighted=True, path=None)

Compute a Polygenic Risk Score (PRS) and save it as a CSV file in the current directory.

Parameters:
  • name (str, optional) – Name or path of the saved PRS file. If not given, it defaults to the name of the GENO object.

  • clumped (bool, optional) – If True, uses data in .data_clumped. If False, uses data in .data. Default is True.

  • weighted (bool, optional) – If True, performs a PRS weighted by the BETA column estimates. If False, performs an unweighted PRS. Default is True.

  • path (str, optional) – Path to a bed/bim/fam set of genetic files for PRS calculation. If files are split by chromosomes, replace the chromosome number with ‘$’. For instance: path = “ukb_chr$_file”. If not provided, it will use the genetic data extracted with the .extract_snps method. Default is None.

Returns:

The computed PRS data.

Return type:

pd.DataFrame

Raises:

ValueError – If the data hasn’t been clumped and ‘clumped’ parameter is True.

query_outcome(outcome, name=None, proxy=True, reference_panel='eur', kb=5000, r2=0.6, window_snps=5000)

Prepares dataframes required for Mendelian Randomization (MR) with data_clumped as exposure.

Queries the outcome data, with or without proxying, and assigns a tuple to the MR_data attribute: (exposure_data, outcome_data, name), ready for the MR methods.

Parameters:
  • outcome – Can be a GENO object (from a GWAS) or a filepath of type .h5 or .hdf5 (created with the GENO.save() method).

  • name (str, optional) – Name for the outcome data. Defaults to None.

  • proxy (bool, optional) – If true, proxies are searched. Default is True.

  • reference_panel (str, optional) – The reference population to get linkage disequilibrium values and find proxies (only if proxy=True). Acceptable values include “eur”, “sas”, “afr”, “eas”, “amr” or a path to a specific bed/bim/fam panel. Default is “eur”.

  • kb (int, optional) – Width of the genomic window to look for proxies. Default is 5000.

  • r2 (float, optional) – Minimum linkage disequilibrium value with the main SNP for a proxy to be included. Default is 0.6.

  • window_snps (int, optional) – Compute the LD value for SNPs that are not more than x SNPs away from the main SNP. Default is 5000.

Returns:

Sets the MR_data attribute for the instance.

Return type:

None

save(path='', fmt='h5', sep='\t', header=True, clumped=True)

Save the GENO data to a file.

Parameters:
  • path (str, optional) – Folder path to save the file. Defaults to the current directory.

  • fmt (str, optional) – File format. Options: .h5 (default), .csv, .txt. Future: .vcf, .vcf.gz.

  • sep (str, optional) – Delimiter for .csv and .txt formats. Default is tab.

  • header (bool, optional) – Save column names for .csv and .txt formats. Default is True.

  • clumped (bool, optional) – If True, save clumped data. Otherwise, save full data. Default is True.

Raises:

ValueError – If clumped data is requested but data is not clumped.

set_phenotype(data, IID=None, PHENO=None, PHENO_type=None, alternate_control=False)

Assign a phenotype dataframe to the .phenotype attribute.

This method sets the .phenotype attribute which is essential to perform single-SNP association tests using the association_test method.

Parameters:
  • data (pd.DataFrame) – DataFrame containing individual-level row data with at least an individual IDs column and one phenotype column.

  • IID (str, optional) – Name of the individual IDs column in ‘data’. These IDs should correspond to the genetic IDs in the FAM file that will be used for association testing.

  • PHENO (str, optional) – Name of the phenotype column in ‘data’ which will be used as the dependent variable for association tests.

  • PHENO_type (str, optional) – If not specified, the function will try to infer if the phenotype is binary or quantitative. To bypass this, use “quant” for quantitative or “binary” for binary phenotypes. Default is None.

  • alternate_control (bool, optional) – By default, the function assumes that for a binary trait, the controls have the most frequent value. Set to True if this is not the case. Default is False.

Returns:

Sets the .phenotype attribute for the instance.

Return type:

None

sort_group(method='lowest_p')

Handle duplicate SNPs. Useful if the instance combines different GENOs.

Parameters:

method (str, optional) – How to handle duplicates. Default is “lowest_p”, which retains the lowest P-value for each SNP.

Returns:

None
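
As an illustration of the “lowest_p” strategy, the sketch below keeps one row per SNP, retaining the row with the smallest P-value. It works on plain dictionaries rather than on a GENO instance; the function name keep_lowest_p and the example rsIDs are hypothetical, and this is not a description of how sort_group is implemented.

```python
def keep_lowest_p(rows):
    """Keep, for each SNP, the row with the smallest P-value."""
    best = {}
    for row in rows:
        snp = row["SNP"]
        # Store the row if the SNP is new or if it improves on the stored P-value.
        if snp not in best or row["P"] < best[snp]["P"]:
            best[snp] = row
    return list(best.values())


rows = [
    {"SNP": "rs1", "BETA": 0.10, "P": 1e-4},
    {"SNP": "rs1", "BETA": 0.12, "P": 1e-9},
    {"SNP": "rs2", "BETA": -0.05, "P": 3e-6},
]
deduplicated = keep_lowest_p(rows)
```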

standardize()

Standardize the Betas and adjust the SE column accordingly.

Raises:

ValueError – If the required columns are not found in the data.

genal.MR module

genal.MR.linreg(x, y, w=None)

Helper function to run linear regressions for the parallel egger bootstrapping.

genal.MR.mr_egger_regression(BETA_e, SE_e, BETA_o, SE_o)

Perform a Mendelian Randomization analysis using the egger regression method. See mr_egger_regression_bootstrap() for a version with bootstrapping.

Parameters:
  • BETA_e (numpy array) – Effect sizes of genetic variants on the exposure.

  • SE_e (numpy array) – Standard errors corresponding to BETA_e.

  • BETA_o (numpy array) – Effect sizes of the same genetic variants on the outcome.

  • SE_o (numpy array) – Standard errors corresponding to BETA_o.

Returns:

A list containing two dictionaries with the results for the egger regression estimate and the egger regression intercept (horizontal pleiotropy estimate):
  • ’method’: Name of the analysis method.

  • ’b’: Coefficient of the regression, representing the causal estimate or the intercept.

  • ’se’: Adjusted standard error of the coefficient or intercept.

  • ’pval’: P-value for the causal estimate or intercept.

  • ’nSNP’: Number of genetic variants used in the analysis.

Return type:

list of dict
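
For orientation, the point estimates of an Egger regression can be sketched as a weighted linear regression of the outcome effects on the exposure effects with an intercept term. The sketch below is written under stated assumptions: variants are first oriented so that the exposure effect is positive, weights are the inverse variances of the outcome effects, and standard errors and p-values are left out. The function name egger_point_estimates is hypothetical and the code is not taken from genal.

```python
def egger_point_estimates(beta_e, beta_o, se_o):
    """Return (slope, intercept) of a weighted regression with intercept."""
    # Orient each variant so that its exposure effect is positive.
    signs = [1.0 if be >= 0 else -1.0 for be in beta_e]
    x = [s * be for s, be in zip(signs, beta_e)]
    y = [s * bo for s, bo in zip(signs, beta_o)]
    # Inverse-variance weights from the outcome standard errors.
    w = [1.0 / se ** 2 for se in se_o]

    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))

    slope = sxy / sxx                   # causal estimate
    intercept = y_bar - slope * x_bar   # horizontal pleiotropy estimate
    return slope, intercept


slope, intercept = egger_point_estimates(
    [0.1, -0.2, 0.3, 0.4], [0.07, -0.12, 0.17, 0.22], [0.01, 0.02, 0.01, 0.02]
)
```

In this toy input the oriented outcome effects lie exactly on the line 0.02 + 0.5 x, so the slope and intercept recover those two values up to floating-point error.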

genal.MR.mr_egger_regression_bootstrap(BETA_e, SE_e, BETA_o, SE_o, nboot, cpus=4)

Perform a Mendelian Randomization analysis using the Egger regression method with bootstrapped standard errors. See mr_egger_regression() for a version without bootstrapping.

Parameters:
  • BETA_e (numpy array) – Effect sizes of genetic variants on the exposure.

  • SE_e (numpy array) – Standard errors corresponding to BETA_e.

  • BETA_o (numpy array) – Effect sizes of the same genetic variants on the outcome.

  • SE_o (numpy array) – Standard errors corresponding to BETA_o.

  • nboot (int) – Number of bootstrap iterations used to obtain the standard error and p-value.

  • cpus (int) – Number of CPU cores to use in parallel for the bootstrap iterations.

Returns:

A list containing two dictionaries with the results for the egger regression estimate and the egger regression intercept (horizontal pleiotropy estimate):
  • ’method’: Name of the analysis method.

  • ’b’: Coefficient of the regression, representing the causal estimate or the intercept.

  • ’se’: Adjusted standard error of the coefficient or intercept.

  • ’pval’: P-value for the causal estimate or intercept.

  • ’nSNP’: Number of genetic variants used in the analysis.

Return type:

list of dict

genal.MR.mr_ivw(BETA_e, SE_e, BETA_o, SE_o)

Perform a Mendelian Randomization analysis using the Inverse Variance Weighted (IVW) method with random effects. Standard Error is corrected for under dispersion (as opposed to mr_ivw_re()).

Parameters:
  • BETA_e (numpy array) – Effect sizes of genetic variants on the exposure.

  • SE_e (numpy array) – Standard errors corresponding to BETA_e.

  • BETA_o (numpy array) – Effect sizes of the same genetic variants on the outcome.

  • SE_o (numpy array) – Standard errors corresponding to BETA_o.

Returns:

A list containing a dictionary with the results:
  • ’method’: Name of the analysis method.

  • ’b’: Coefficient of the regression, representing the causal estimate.

  • ’se’: Adjusted standard error of the coefficient.

  • ’pval’: P-value for the causal estimate.

  • ’nSNP’: Number of genetic variants used in the analysis.

  • ’Q’: Cochran’s Q statistic for heterogeneity.

  • ’Q_df’: Degrees of freedom for the Q statistic.

  • ’Q_pval’: P-value for the Q statistic.

Return type:

list of dict

Notes

The function uses weighted least squares regression (WLS) to estimate the causal effect size, weighting by the inverse of the variance of the outcome’s effect sizes. Cochran’s Q statistic is also computed to assess the heterogeneity across the instrumental variables.
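
The weighting scheme described in these notes can be sketched with the standard library alone. The code below is one common formulation, given as an assumption rather than as genal’s implementation: a regression through the origin weighted by 1/SE_o², Cochran’s Q as the weighted residual sum of squares, a fixed-effects standard error, and a random-effects standard error that is never allowed to fall below the fixed-effects one (the under-dispersion correction). The p-value uses a normal approximation, which is also an assumption, and the Q p-value is omitted because it needs a chi-square distribution function. The name ivw_sketch is hypothetical.

```python
import math


def ivw_sketch(beta_e, beta_o, se_o):
    """Inverse variance-weighted estimate via regression through the origin."""
    # Weights are the inverse variances of the outcome effect sizes.
    w = [1.0 / se ** 2 for se in se_o]
    sxx = sum(wi * x * x for wi, x in zip(w, beta_e))
    sxy = sum(wi * x * y for wi, x, y in zip(w, beta_e, beta_o))
    b = sxy / sxx

    # Cochran's Q: weighted sum of squared residuals around the fitted line.
    q = sum(wi * (y - b * x) ** 2 for wi, x, y in zip(w, beta_e, beta_o))
    q_df = len(beta_e) - 1

    se_fixed = math.sqrt(1.0 / sxx)
    sigma = math.sqrt(q / q_df)  # residual standard error
    # Under-dispersion correction: do not shrink the SE when sigma < 1.
    se_random = se_fixed * max(1.0, sigma)

    # Two-sided p-value from a normal approximation (assumption).
    pval = math.erfc(abs(b / se_random) / math.sqrt(2.0))
    return {"b": b, "se": se_random, "se_fixed": se_fixed, "pval": pval,
            "nSNP": len(beta_e), "Q": q, "Q_df": q_df}


result = ivw_sketch([0.10, 0.20, 0.30], [0.04, 0.11, 0.15], [0.01, 0.01, 0.01])
```

Under the same assumptions, a random-effects standard error without the correction would be se_fixed * sigma, and a fixed-effects analysis would report se_fixed directly.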

genal.MR.mr_ivw_fe(BETA_e, SE_e, BETA_o, SE_o)

Perform a Mendelian Randomization analysis using the Inverse Variance Weighted (IVW) method with fixed effects.

Parameters:
  • BETA_e (numpy array) – Effect sizes of genetic variants on the exposure.

  • SE_e (numpy array) – Standard errors corresponding to BETA_e.

  • BETA_o (numpy array) – Effect sizes of the same genetic variants on the outcome.

  • SE_o (numpy array) – Standard errors corresponding to BETA_o.

Returns:

A list containing a dictionary with the results:
  • ’method’: Name of the analysis method.

  • ’b’: Coefficient of the regression, representing the causal estimate.

  • ’se’: Adjusted standard error of the coefficient.

  • ’pval’: P-value for the causal estimate.

  • ’nSNP’: Number of genetic variants used in the analysis.

  • ’Q’: Cochran’s Q statistic for heterogeneity.

  • ’Q_df’: Degrees of freedom for the Q statistic.

  • ’Q_pval’: P-value for the Q statistic.

Return type:

list of dict

Notes

The function uses weighted least squares regression (WLS) to estimate the causal effect size, weighting by the inverse of the variance of the outcome’s effect sizes. Cochran’s Q statistic is also computed to assess the heterogeneity across the instrumental variables.

genal.MR.mr_ivw_re(BETA_e, SE_e, BETA_o, SE_o)

Perform a Mendelian Randomization analysis using the Inverse Variance Weighted (IVW) method with random effects. Standard Error is not corrected for under dispersion (as opposed to mr_ivw()).

Parameters:
  • BETA_e (numpy array) – Effect sizes of genetic variants on the exposure.

  • SE_e (numpy array) – Standard errors corresponding to BETA_e.

  • BETA_o (numpy array) – Effect sizes of the same genetic variants on the outcome.

  • SE_o (numpy array) – Standard errors corresponding to BETA_o.

Returns:

A list containing a dictionary with the results:
  • ’method’: Name of the analysis method.

  • ’b’: Coefficient of the regression, representing the causal estimate.

  • ’se’: Adjusted standard error of the coefficient.

  • ’pval’: P-value for the causal estimate.

  • ’nSNP’: Number of genetic variants used in the analysis.

  • ’Q’: Cochran’s Q statistic for heterogeneity.

  • ’Q_df’: Degrees of freedom for the Q statistic.

  • ’Q_pval’: P-value for the Q statistic.

Return type:

list of dict

Notes

The function uses weighted least squares regression (WLS) to estimate the causal effect size, weighting by the inverse of the variance of the outcome’s effect sizes. Cochran’s Q statistic is also computed to assess the heterogeneity across the instrumental variables.

genal.MR.mr_pen_wm(BETA_e, SE_e, BETA_o, SE_o, nboot, penk)

Perform a Mendelian Randomization analysis using the penalised weighted median method. See https://arxiv.org/abs/1606.03729.

Parameters:
  • BETA_e (numpy array) – Effect sizes of genetic variants on the exposure.

  • SE_e (numpy array) – Standard errors corresponding to BETA_e.

  • BETA_o (numpy array) – Effect sizes of the same genetic variants on the outcome.

  • SE_o (numpy array) – Standard errors corresponding to BETA_o.

  • nboot (int) – Number of bootstrap iterations used to obtain the standard error and p-value.

  • penk (float) – Constant factor used to penalise the weights.

Returns:

A list containing a dictionary with the results:
  • ’method’: Name of the analysis method.

  • ’b’: Coefficient representing the causal estimate.

  • ’se’: Adjusted standard error of the coefficient.

  • ’pval’: P-value for the causal estimate.

  • ’nSNP’: Number of genetic variants used in the analysis.

Return type:

list of dict

genal.MR.mr_sign(BETA_e, BETA_o)

Performs the sign concordance test.

The sign concordance test is used to determine whether there is a consistent direction of effect (i.e., sign) between the exposure and outcome across the variants. The consistent directionality is an assumption of Mendelian Randomization.

Parameters:
  • BETA_e (numpy array) – Effect sizes of genetic variants on the exposure.

  • BETA_o (numpy array) – Effect sizes of the same genetic variants on the outcome.

Returns:

A list containing dictionaries with the following keys:
  • ’method’: Name of the method.

  • ’nSNP’: Number of genetic variants used in the analysis.

  • ’b’: Proportion of concordant signs between exposure and outcome effect sizes minus 0.5, multiplied by 2.

  • ’se’: Not applicable for this method (returns NaN).

  • ’pval’: P-value for the sign concordance test based on a binomial distribution.

Return type:

list of dict

Notes

Effect sizes that are exactly zero are replaced with NaN and are not included in the analysis. A binomial test is then performed to evaluate the probability of observing the given number of concordant signs by chance alone, assuming a null expectation of 50% concordance.
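
The test described above can be sketched as follows. This is a minimal illustration under assumptions, not the genal source: pairs where either effect is exactly zero are dropped, and the two-sided binomial p-value is computed by summing the probabilities of all outcomes that are no more likely than the observed one under 50% concordance. The function name sign_concordance is hypothetical.

```python
import math


def sign_concordance(beta_e, beta_o):
    """Sign concordance test with an exact two-sided binomial p-value."""
    # Drop variants whose exposure or outcome effect is exactly zero.
    pairs = [(be, bo) for be, bo in zip(beta_e, beta_o) if be != 0 and bo != 0]
    n = len(pairs)
    if n == 0:
        raise ValueError("No variant with non-zero effects.")
    # Count variants whose two effects share the same sign.
    k = sum(1 for be, bo in pairs if (be > 0) == (bo > 0))

    # Binomial(n, 0.5) probabilities for 0..n concordant signs.
    pmf = [math.comb(n, i) / 2 ** n for i in range(n + 1)]
    # Sum outcomes at most as likely as the observed count (small tolerance).
    pval = min(1.0, sum(p for p in pmf if p <= pmf[k] * (1 + 1e-9)))

    b = (k / n - 0.5) * 2
    return {"method": "Sign concordance test", "nSNP": n,
            "b": b, "se": float("nan"), "pval": pval}


result = sign_concordance([1, 1, 1, 1, -1, 0], [1, 2, 3, 4, 5, 1])
```

In this toy input one variant is dropped for a zero effect, leaving five variants of which four are concordant.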

genal.MR.mr_simple_median(BETA_e, SE_e, BETA_o, SE_o, nboot)

Perform a Mendelian Randomization analysis using the simple median method.

Parameters:
  • BETA_e (numpy array) – Effect sizes of genetic variants on the exposure.

  • SE_e (numpy array) – Standard errors corresponding to BETA_e.

  • BETA_o (numpy array) – Effect sizes of the same genetic variants on the outcome.

  • SE_o (numpy array) – Standard errors corresponding to BETA_o.

  • nboot (int) – Number of bootstrap iterations used to obtain the standard error and p-value.

Returns:

A list containing a dictionary with the results:
  • ’method’: Name of the analysis method.

  • ’b’: Coefficient representing the causal estimate.

  • ’se’: Adjusted standard error of the coefficient.

  • ’pval’: P-value for the causal estimate.

  • ’nSNP’: Number of genetic variants used in the analysis.

Return type:

list of dict

Notes

The standard error is obtained with bootstrapping.
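
One way to obtain such a bootstrapped standard error is a parametric bootstrap, sketched below as an assumption (the source does not specify the resampling scheme): in each replicate, exposure and outcome effects are redrawn from normal distributions centred on the observed values, the median of the per-variant ratios is recomputed, and the standard deviation of the replicates is used as the standard error. The function name and the seed argument are hypothetical.

```python
import random
import statistics


def simple_median_bootstrap(beta_e, se_e, beta_o, se_o, nboot, seed=0):
    """Return (median ratio estimate, bootstrapped standard error)."""
    rng = random.Random(seed)  # fixed seed for reproducible replicates
    # Point estimate: median of the per-variant ratio estimates.
    estimate = statistics.median([bo / be for be, bo in zip(beta_e, beta_o)])

    replicates = []
    for _ in range(nboot):
        # Redraw each effect from N(observed value, standard error).
        ratios = [
            rng.gauss(bo, so) / rng.gauss(be, se)
            for be, se, bo, so in zip(beta_e, se_e, beta_o, se_o)
        ]
        replicates.append(statistics.median(ratios))
    return estimate, statistics.stdev(replicates)


estimate, se = simple_median_bootstrap(
    [0.1, 0.2, 0.3], [0.01, 0.01, 0.01],
    [0.05, 0.12, 0.12], [0.01, 0.01, 0.01], nboot=200,
)
```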

genal.MR.mr_uwr(BETA_e, SE_e, BETA_o, SE_o)

Performs an unweighted regression Mendelian Randomization analysis.

Parameters:
  • BETA_e (numpy array) – Effect sizes of genetic variants on the exposure.

  • SE_e (numpy array) – Standard errors corresponding to BETA_e.

  • BETA_o (numpy array) – Effect sizes of the same genetic variants on the outcome.

  • SE_o (numpy array) – Standard errors corresponding to BETA_o.

Returns:

A list containing a dictionary with the results:
  • ’method’: Name of the analysis method.

  • ’b’: Coefficient of the regression, representing the causal estimate.

  • ’se’: Adjusted standard error of the coefficient.

  • ’pval’: P-value for the causal estimate.

  • ’nSNP’: Number of genetic variants used in the analysis.

  • ’Q’: Cochran’s Q statistic for heterogeneity.

  • ’Q_df’: Degrees of freedom for the Q statistic.

  • ’Q_pval’: P-value for the Q statistic.

Return type:

list of dict

Notes

The returned causal estimate is not weighted by the inverse variance. The standard error is corrected for under-dispersion. Cochran’s Q statistic is also computed to assess the heterogeneity across the instrumental variables.

genal.MR.mr_weighted_median(BETA_e, SE_e, BETA_o, SE_o, nboot)

Perform a Mendelian Randomization analysis using the weighted median method.

Parameters:
  • BETA_e (numpy array) – Effect sizes of genetic variants on the exposure.

  • SE_e (numpy array) – Standard errors corresponding to BETA_e.

  • BETA_o (numpy array) – Effect sizes of the same genetic variants on the outcome.

  • SE_o (numpy array) – Standard errors corresponding to BETA_o.

  • nboot (int) – Number of bootstrap iterations used to obtain the standard error and p-value.

Returns:

A list containing a dictionary with the results:
  • ’method’: Name of the analysis method.

  • ’b’: Coefficient representing the causal estimate.

  • ’se’: Adjusted standard error of the coefficient.

  • ’pval’: P-value for the causal estimate.

  • ’nSNP’: Number of genetic variants used in the analysis.

Return type:

list of dict

Notes

The standard error is obtained with bootstrapping.

genal.MR.parallel_bootstrap_func(i, BETA_e, SE_e, BETA_o, SE_o)

Helper function to run the egger regression bootstrapping in parallel.

genal.MR.weighted_median(b_iv, weights)

Helper function to compute the weighted median estimate.
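
A weighted median of ratio estimates is commonly computed by sorting the estimates, accumulating normalised weights, and interpolating linearly at the 50% point. The sketch below follows that widely used formulation; it is an assumption that genal’s helper behaves the same way. The function name weighted_median_sketch is hypothetical, and the code expects at least two variants with strictly positive weights.

```python
def weighted_median_sketch(b_iv, weights):
    """Weighted median of ratio estimates with linear interpolation."""
    # Sort estimates and carry their weights along.
    order = sorted(range(len(b_iv)), key=lambda i: b_iv[i])
    b = [b_iv[i] for i in order]
    w = [weights[i] for i in order]
    total = sum(w)

    # Cumulative weight at the midpoint of each estimate, scaled to [0, 1].
    cum = []
    running = 0.0
    for wi in w:
        running += wi
        cum.append((running - 0.5 * wi) / total)

    # Last estimate whose cumulative weight is still below 50%.
    below = max(i for i, s in enumerate(cum) if s < 0.5)
    # Interpolate between that estimate and the next one.
    frac = (0.5 - cum[below]) / (cum[below + 1] - cum[below])
    return b[below] + (b[below + 1] - b[below]) * frac


equal = weighted_median_sketch([3.0, 1.0, 2.0], [1.0, 1.0, 1.0])
unequal = weighted_median_sketch([1.0, 2.0], [1.0, 3.0])
```

With equal weights the result coincides with the ordinary median of an odd number of estimates; with unequal weights it is pulled towards the more heavily weighted estimate.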

genal.MR.weighted_median_bootstrap(BETA_e, SE_e, BETA_o, SE_o, weights, nboot)

Helper function to generate bootstrapped replications.

genal.MR_tools module

genal.MR_tools.MR_func(data, methods, action, heterogeneity, eaf_threshold, nboot, penk, name_exposure, cpus)

Wrapper function corresponding to the GENO.MR() method. Refer to it for more details regarding arguments and return values. The MR algorithms are implemented here: MR.mr_ivw(), MR.mr_weighted_median(), MR.mr_egger_regression(), MR.mr_simple_median()

Notes

  • Validation of the action and methods arguments

  • EAF column check if action is set to 2.

  • Data harmonization between exposure and outcome data based on action and eaf_threshold

  • NA check

  • MR methods execution

  • Compiles results and return a pd.DataFrame

genal.MR_tools.apply_action_2(df, eaf_threshold)

Use EAF_e and EAF_o to align palindromes if both EAFs are outside the intermediate allele frequency range.
  1. Replace NA values in EAF columns by 0.5 (will be flagged and removed in step 3)

  2. Set boundaries for intermediate allele frequencies

  3. Identify palindromes that have an intermediate allele frequency and delete them

  4. Among the remaining palindromes, identify the ones that need to be flipped and flip them
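
The decision logic of these steps can be sketched per SNP as below. Several details are assumptions made for illustration and are not confirmed by the source: the intermediate range is taken as [eaf_threshold, 1 - eaf_threshold] with inclusive bounds, and a palindrome is marked for flipping when the exposure and outcome frequencies fall on opposite sides of 0.5. The function name classify_palindrome and its return labels are hypothetical.

```python
PALINDROMES = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def classify_palindrome(ea, nea, eaf_e, eaf_o, eaf_threshold=0.42):
    """Return 'not_palindromic', 'drop', 'flip' or 'keep' for one SNP."""
    if (ea, nea) not in PALINDROMES:
        return "not_palindromic"
    # Missing frequencies become 0.5, which always lands in the
    # intermediate range and is therefore dropped below.
    eaf_e = 0.5 if eaf_e is None else eaf_e
    eaf_o = 0.5 if eaf_o is None else eaf_o
    # Boundaries of the intermediate allele frequency range (assumed inclusive).
    low, high = eaf_threshold, 1 - eaf_threshold
    # Strand cannot be inferred if either frequency is intermediate.
    if low <= eaf_e <= high or low <= eaf_o <= high:
        return "drop"
    # Frequencies on opposite sides of 0.5 suggest opposite strands (assumption).
    if (eaf_e < 0.5) != (eaf_o < 0.5):
        return "flip"
    return "keep"


labels = [
    classify_palindrome("A", "T", 0.20, 0.80),
    classify_palindrome("A", "T", 0.45, 0.20),
    classify_palindrome("C", "G", None, 0.20),
    classify_palindrome("A", "T", 0.20, 0.30),
    classify_palindrome("A", "G", 0.20, 0.80),
]
```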

genal.MR_tools.check_required_columns(df, columns)

Check if the required columns are present in the dataframe.

genal.MR_tools.flip_alleles(x)

Flip the alleles.

genal.MR_tools.harmonize_MR(df_exposure, df_outcome, action=2, eaf_threshold=0.42)

Harmonize exposure and outcome for MR analyses.

Parameters:
  • df_exposure (pd.DataFrame) – Exposure data with the columns “SNP”, “BETA”, “SE”, “EA”, “NEA”, and “EAF” if action=2.

  • df_outcome (pd.DataFrame) – Outcome data with the columns “SNP”, “BETA”, “SE”, “EA”, “NEA”, and “EAF” if action=2.

  • action (int, optional) – Determines how to treat palindromes. Defaults to 2.

      - 1: Doesn’t attempt to flip them (assumes all alleles are coded on the forward strand)
      - 2: Uses allele frequencies (EAF) to attempt to flip them (conservative, default)
      - 3: Removes all palindromic SNPs (very conservative)

  • eaf_threshold (float, optional) – Maximal effect allele frequency accepted when attempting to flip palindromic SNPs (only applied if action=2). Defaults to 0.42.

Returns:

Harmonized data.

Return type:

pd.DataFrame

Notes

  • Verify the presence of required columns in both dataframes and rename them

  • Merge exposure and outcome data

  • Identify palindromes

  • Classify SNPs into aligned / inverted / need to be flipped

  • Flip the ones that require flipping

  • Switch those that are inverted to align them

  • Remove those that are still not aligned

  • Treat palindromes based on action parameter
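
The classification into aligned, inverted, and strand-flipped SNPs can be illustrated for a single SNP as follows. This is a sketch under assumptions rather than the library’s code: alleles are single nucleotides, the exposure alleles are taken as the reference, and inverting a SNP swaps the outcome alleles and negates the outcome effect. The names classify_alleles and harmonize_row, the status labels, and the dictionary keys are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(ea_e, nea_e, ea_o, nea_o):
    """Compare outcome alleles with exposure alleles for one SNP."""
    if (ea_e, nea_e) == (ea_o, nea_o):
        return "aligned"
    if (ea_e, nea_e) == (nea_o, ea_o):
        return "inverted"
    # Try the complementary strand for the outcome alleles.
    flipped = (COMPLEMENT[ea_o], COMPLEMENT[nea_o])
    if (ea_e, nea_e) == flipped:
        return "flipped_aligned"
    if (ea_e, nea_e) == flipped[::-1]:
        return "flipped_inverted"
    return "remove"


def harmonize_row(row):
    """Return a harmonized copy of the row, or None if it cannot be aligned."""
    status = classify_alleles(row["EA_e"], row["NEA_e"], row["EA_o"], row["NEA_o"])
    if status == "remove":
        return None
    out = dict(row)
    if status.startswith("flipped"):
        # Move the outcome alleles to the other strand.
        out["EA_o"] = COMPLEMENT[row["EA_o"]]
        out["NEA_o"] = COMPLEMENT[row["NEA_o"]]
    if status.endswith("inverted"):
        # Swap effect and non-effect alleles and reverse the effect direction.
        out["EA_o"], out["NEA_o"] = out["NEA_o"], out["EA_o"]
        out["BETA_o"] = -out["BETA_o"]
    return out


harmonized = harmonize_row(
    {"EA_e": "A", "NEA_e": "G", "EA_o": "C", "NEA_o": "T", "BETA_o": 0.3}
)
```

For a palindromic SNP (A/T or C/G), an inverted pair and a strand-flipped pair look identical, which is why palindromes need the separate treatment selected by the action parameter.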

genal.MR_tools.load_outcome_from_filepath(outcome)

Load outcome data from a file path.

genal.MR_tools.load_outcome_from_geno_object(outcome)

Load outcome data from a GENO object.

genal.MR_tools.mrpresso_func(data, action, eaf_threshold, n_iterations, outlier_test, distortion_test, significance_p, cpus)

Wrapper function corresponding to the GENO.MRpresso() method. Refer to it for more details regarding arguments and return values. The MR-PRESSO algorithm is implemented in MRpresso.mr_presso().

Notes

  • EAF column check if action is set to 2.

  • Data harmonization between exposure and outcome data based on action and eaf_threshold

  • NA check

  • MRpresso call and results return

genal.MR_tools.query_outcome_func(data, outcome, name, proxy, reference_panel, kb, r2, window_snps, cpus)

Wrapper function corresponding to the GENO.query_outcome() method. Refer to it for more details on the arguments and return values.

Notes

  • Validation of the required columns

  • Load outcome data from GENO or path.

  • Identify SNPs present in the outcome data

  • Find proxies for the absent SNPs if needed

  • Return exposure dataframe, outcome dataframe, outcome name

genal.MRpresso module

genal.MRpresso.getRSS_LOO(data, BETA_e_columns, returnIV)
genal.MRpresso.getRandomData(data, BETA_e_columns=['BETA_e'])
genal.MRpresso.mr_presso(data, BETA_e_columns=['BETA_e'], n_iterations=1000, outlier_test=True, distortion_test=True, significance_p=0.05, cpus=5)

Perform the MR-PRESSO algorithm for detection of horizontal pleiotropy.

Parameters:
  • data (pd.DataFrame) – DataFrame with at least 4 columns: BETA_o (outcome), SE_o, BETA_e (exposure), SE_e.

  • BETA_e_columns (list) – List of exposure beta columns.

  • n_iterations (int) – Number of steps performed (random data generation).

  • outlier_test (bool) – If True, identifies outlier SNPs responsible for horizontal pleiotropy.

  • distortion_test (bool) – If True, tests significant distortion in the causal estimates.

  • significance_p (float) – Statistical significance threshold for the detection of horizontal pleiotropy.

  • cpus (int) – Number of CPUs to use for parallel processing.

Returns:

  • mod_table (pd.DataFrame) – DataFrame with the original and outlier-corrected inverse variance-weighted MR results.

  • GlobalTest (dict) – Dictionary with p-value of the global MR-PRESSO test.

  • OutlierTest (pd.DataFrame) – DataFrame with p-value for each SNP for the outlier test.

  • BiasTest (dict) – Dictionary with results of the distortion test.
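For orientation, the inverse variance-weighted (IVW) point estimate reported in the results table can be sketched as a weighted regression through the origin of outcome effects on exposure effects, with weights 1/SE_o². The function below is a hypothetical single-exposure illustration in plain Python, not the code used by mr_presso(), and it returns the point estimate only.

```python
def ivw_estimate(beta_e, beta_o, se_o):
    """IVW causal estimate: sum(w * b_e * b_o) / sum(w * b_e**2), with w = 1 / se_o**2."""
    weights = [1.0 / s ** 2 for s in se_o]
    numerator = sum(w * be * bo for w, be, bo in zip(weights, beta_e, beta_o))
    denominator = sum(w * be ** 2 for w, be in zip(weights, beta_e))
    return numerator / denominator


# Every SNP has an outcome effect equal to half its exposure effect.
estimate = ivw_estimate([0.1, 0.2], [0.05, 0.1], [0.01, 0.02])
```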

genal.MRpresso.parallel_RSS_LOO(i, data, BETA_e_columns)
genal.MRpresso.power_eigen(x, n)

genal.association module

genal.association.association_test_func(data, covar_list, standardize, name, data_pheno, pheno_type)

Conduct single-SNP association tests against a phenotype.

This function performs a series of operations:
  1. Checks for necessary preliminary steps.

  2. Updates the FAM file with the phenotype data.

  3. Creates a covariate file if required.

  4. Runs a PLINK association test.

  5. Processes the results and returns them.

Parameters:
  • data (pd.DataFrame) – Genetic data with the standard GENO columns.

  • covar_list (list) – List of column names in the data_pheno DataFrame to use as covariates.

  • standardize (bool) – Flag indicating if the phenotype needs standardization.

  • name (str) – Prefix for the filenames used during the process.

  • data_pheno (pd.DataFrame) – Phenotype data with at least an IID and PHENO columns.

  • pheno_type (str) – Type of phenotype (‘binary’ or ‘quant’).

Returns:

Processed results of the association test.

Return type:

pd.DataFrame

This function corresponds to the following GENO method: GENO.association_test().

genal.association.set_phenotype_func(data, PHENO, PHENO_type, IID, alternate_control)

Set a phenotype dataframe containing individual IDs and phenotype columns formatted for single-SNP association testing.

Parameters:
  • data (pd.DataFrame) – Contains at least an individual IDs column and one phenotype column.

  • IID (str) – Name of the individual IDs column in data.

  • PHENO (str) – Name of the phenotype column in data.

  • PHENO_type (str, optional) – Type of the phenotype column. Either “quant” for quantitative (continuous) or “binary”. The function tries to infer the type if not provided.

  • alternate_control (bool) – By default, the most frequent value of a binary trait is assumed to code the controls. Set to True to reverse this assumption.

Returns:

  • pd.DataFrame – The modified data.

  • str – The inferred or provided PHENO_type.

Raises:

ValueError – For inconsistencies in the provided data or arguments.

This function corresponds to the following GENO method: GENO.set_phenotype().
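The type inference and the control-coding assumption described above can be sketched as follows. Both helpers are hypothetical: the rule "exactly two distinct values means binary" and the 0/1 output coding are assumptions made for illustration, and the actual function may use different criteria and codes.

```python
from collections import Counter


def infer_pheno_type(values):
    """Guess 'binary' when exactly two distinct non-missing values occur."""
    distinct = {v for v in values if v is not None}
    return "binary" if len(distinct) == 2 else "quant"


def code_binary(values, alternate_control=False):
    """Code a two-valued trait as 0 (control) / 1 (case), keeping None as missing.

    The most frequent value is taken as the control unless alternate_control is True.
    """
    counts = Counter(v for v in values if v is not None)
    (most, _), (least, _) = counts.most_common(2)
    control = least if alternate_control else most
    return [None if v is None else int(v != control) for v in values]


raw = ["no", "no", "yes", None, "no"]
coded = code_binary(raw)
coded_alt = code_binary(raw, alternate_control=True)
```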

genal.clump module

genal.clump.clump_data(data, reference_panel='eur', plink19_path='/gpfs/gibbs/pi/falcone/00_General/software/plink1.9/plink', kb=250, r2=0.1, p1=5e-08, p2=0.01, name='noname', ram=10000, checks=[])

Perform clumping on the given data using plink. Corresponds to the GENO.clump() method.

Parameters:
  • data (pd.DataFrame) – Input data with at least ‘SNP’ and ‘P’ columns.

  • reference_panel (str) – The reference population for linkage disequilibrium values. Accepts values “eur”, “sas”, “afr”, “eas”, “amr”. Alternatively, a path leading to a specific bed/bim/fam reference panel can be provided. Default is “eur”.

  • plink19_path (str) – Path to the plink 1.9 software.

  • kb (int, optional) – Clumping window in kilobases (thousands of base pairs). Default is 250.

  • r2 (float, optional) – Linkage disequilibrium threshold, values between 0 and 1. Default is 0.1.

  • p1 (float, optional) – P-value threshold during clumping. SNPs above this value are not considered. Default is 5e-8.

  • p2 (float, optional) – P-value threshold post-clumping to further filter the clumped SNPs. If p2 < p1, it won’t be considered. Default is 0.01.

  • name (str) – Name used for the files created in the tmp_GENAL folder.

  • ram (int) – Amount of RAM in MB to be used by plink.

  • checks (list) – List of column checks already performed on the data.

Returns:

  • pd.DataFrame – Data after clumping, if any.

  • list – Updated checks.
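To give an intuition for how kb, r2 and p1 interact, here is a toy greedy clumping sketch in plain Python. It is not what plink does internally: the ld dictionary of pairwise r2 values is a hypothetical stand-in for the reference panel, and the p2 threshold is left out for brevity.

```python
def clump(snps, ld, kb=250, r2=0.1, p1=5e-8):
    """Return index SNP names after a greedy clumping pass.

    snps: list of dicts with keys SNP, CHR, POS, P.
    ld:   dict mapping frozenset({snp_a, snp_b}) to their r2 (missing pairs count as 0).
    """
    candidates = sorted(snps, key=lambda s: s["P"])  # most significant first
    removed = set()
    index_snps = []
    for lead in candidates:
        if lead["SNP"] in removed or lead["P"] > p1:
            continue
        index_snps.append(lead["SNP"])
        removed.add(lead["SNP"])
        for other in candidates:
            if other["SNP"] in removed or other["CHR"] != lead["CHR"]:
                continue
            close = abs(other["POS"] - lead["POS"]) <= kb * 1000
            linked = ld.get(frozenset((lead["SNP"], other["SNP"])), 0.0) >= r2
            if close and linked:
                removed.add(other["SNP"])  # absorbed into the lead SNP's clump
    return index_snps


toy = [
    {"SNP": "rs1", "CHR": 1, "POS": 1_000, "P": 1e-10},
    {"SNP": "rs2", "CHR": 1, "POS": 2_000, "P": 1e-9},
    {"SNP": "rs3", "CHR": 1, "POS": 5_000_000, "P": 1e-9},
    {"SNP": "rs4", "CHR": 2, "POS": 1_000, "P": 0.5},
]
kept = clump(toy, {frozenset(("rs1", "rs2")): 0.8})
```

In this toy data rs2 is absorbed by rs1 (close and in LD), rs3 is too far away to be clumped with it, and rs4 does not pass p1.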

genal.extract_prs module

genal.extract_prs.create_bedlist(bedlist, output_name, not_found)

Creates a bedlist file for SNP extraction.

Parameters:
  • bedlist (str) – Path to save the bedlist file.

  • output_name (str) – Base name for the output files.

  • not_found (List[int]) – List of chromosome numbers for which no bed/bim/fam files were found.

genal.extract_prs.extract_command_parallel(task_id, name, path, snp_list_path)

Helper function to run SNP extraction in parallel for different chromosomes.

Parameters:
  • task_id (int) – Identifier for the task/chromosome.

  • name (str) – Name prefix for the output files.

  • path (str) – Path to the data set.

  • snp_list_path (str) – Path to the list of SNPs to extract.

Returns:

Returns the task_id if no valid bed/bim/fam files are found.

Return type:

int

genal.extract_prs.extract_snps_from_combined_data(name, path, snp_list_path)

Extract SNPs from combined data.

genal.extract_prs.extract_snps_func(snp_list, name, path=None)

Extracts a list of SNPs from the given path. This function corresponds to the following GENO method: GENO.extract_snps().

Parameters:
  • snp_list (List[str]) – List of SNPs to extract.

  • name (str) – Name prefix for the output files.

  • path (str, optional) – Path to the dataset. Defaults to the path from the configuration.

Raises:

TypeError – Raises an error when no valid path is saved or when there’s an incorrect format in the provided path.

genal.extract_prs.handle_multiallelic_variants(name, merge_command)

Handle multiallelic variants detected during merging.

genal.extract_prs.merge_extracted_snps(name, not_found)

Merge extracted SNPs from each chromosome.

genal.extract_prs.prepare_snp_list(snp_list, name)

Prepare the SNP list for extraction.

genal.extract_prs.process_split_data(name, path, snp_list_path)

Process data that is split by chromosome.

genal.extract_prs.prs_func(data, weighted=True, path=None, checks=None, ram=10000, name='')

Compute a PRS (Polygenic Risk Score) using provided genetic data. Corresponds to the GENO.prs() method.

Parameters:
  • data (pd.DataFrame) – Dataframe containing genetic information.

  • weighted (bool, optional) – Perform a weighted PRS using BETA column estimates. If False, perform an unweighted PRS (equivalent to BETAs set to 1). Defaults to True.

  • path (str, optional) – Path to bed/bim/fam set of genetic files to use for PRS calculation. If not provided, it uses the genetic data extracted with the .extract_snps method. Defaults to None.

  • checks (list, optional) – List of column checks already performed on the data to avoid repeating them. Defaults to None.

  • ram (int, optional) – RAM memory in MB to be used by plink. Defaults to 10000.

  • name (str, optional) – Name used for naming output and intermediate files. Defaults to “”.

Returns:

DataFrame containing PRS results.

Return type:

pd.DataFrame

Raises:
  • ValueError – If mandatory columns are missing in the data.

  • TypeError – If valid bed/bim/fam files are not found.

  • ValueError – If PRS computation was not successful.
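The difference between a weighted and an unweighted score can be shown with a small sketch: each individual's score is the sum of their effect-allele dosages multiplied by the SNP weights, and the unweighted variant uses a weight of 1 for every SNP. The function prs_sketch is hypothetical and computes a plain sum; the scores produced by plink may be normalized differently.

```python
def prs_sketch(dosages, betas, weighted=True):
    """Score each individual as sum(weight_j * dosage_ij) over SNPs j.

    dosages: one row per individual, one effect-allele count (0, 1 or 2) per SNP.
    betas:   one effect size per SNP, in the same order as the dosage columns.
    """
    weights = betas if weighted else [1.0] * len(betas)
    return [sum(w * d for w, d in zip(weights, row)) for row in dosages]


dosages = [[0, 1, 2], [2, 2, 0]]
betas = [0.5, -0.2, 0.1]
weighted_scores = prs_sketch(dosages, betas)
unweighted_scores = prs_sketch(dosages, betas, weighted=False)
```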

genal.extract_prs.report_snps_not_found(nrow, name)

Report the number of SNPs not found in the data.

genal.extract_prs.setup_path(path, config)

Configure the path based on user input and saved configuration.

genal.geno_tools module

genal.geno_tools.Combine_GENO(Gs, name='noname', clumped=False, preprocessing=0)

Combine a list of GENO objects into one.

Parameters:
  • Gs (list) – List of GENO objects.

  • name (str, optional) – Name for the combined object. Default is “noname”.

  • clumped (bool, optional) – If True, uses the clumped data of each object. Default is False.

  • preprocessing (int, optional) – Level of preprocessing to apply. Default is 0.

Returns:

Combined GENO object.

Return type:

GENO

genal.geno_tools.adjust_column_names(data, CHR, POS, SNP, EA, NEA, BETA, SE, P, EAF, keep_columns)

Rename columns to the standard names making sure that there are no duplicated names. Delete other columns if keep_columns=False, keep them if True.

genal.geno_tools.check_allele_column(data, allele_col, keep_multi)

Verify that the corresponding allele column is upper case strings. Set to nan if not formed with A, T, C, G letters. Set to nan if values are multiallelic unless keep_multi=True.
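A per-value version of this check might look like the sketch below. The helper clean_allele is hypothetical, and it assumes that "multiallelic" refers to values longer than one letter; the actual function operates on a whole DataFrame column and may define multiallelic values differently.

```python
import math

VALID_BASES = set("ATCG")


def clean_allele(value, keep_multi=False):
    """Return the upper-cased allele, or NaN if it is invalid or (by default) multi-letter."""
    if not isinstance(value, str):
        return math.nan
    allele = value.strip().upper()
    if not allele or set(allele) - VALID_BASES:
        return math.nan  # empty, or contains letters other than A, T, C, G
    if len(allele) > 1 and not keep_multi:
        return math.nan
    return allele
```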

genal.geno_tools.check_arguments(df, preprocessing, reference_panel, clumped, effect_column, keep_columns, fill_snpids, fill_coordinates, keep_multi, keep_dups)

Verify the arguments passed for the GENO initialization and apply logic based on the preprocessing value. See GENO for more details.

Returns:

Tuple containing updated values for (keep_columns, keep_multi, keep_dups, fill_snpids, fill_coordinates)

Return type:

tuple

Raises:

TypeError – For invalid data types or incompatible argument values.

genal.geno_tools.check_beta_column(data, effect_column, preprocessing)

If the BETA column contains odds ratios, log-transform it. If no effect_column argument is specified, determine whether the BETA column contains beta estimates or odds ratios.
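The transformation itself is the natural logarithm: an odds ratio of 1 corresponds to a beta of 0. The sketch below shows it together with a deliberately crude detection rule; looks_like_odds_ratios is a hypothetical heuristic (odds ratios are strictly positive, whereas beta estimates normally include negative values), and the documentation does not state which rule the function actually applies.

```python
import math


def or_to_beta(odds_ratios):
    """Convert odds ratios to log-odds effect sizes."""
    return [math.log(x) for x in odds_ratios]


def looks_like_odds_ratios(values):
    """Hypothetical heuristic: a column without any value <= 0 is treated as odds ratios."""
    return all(v > 0 for v in values)


betas = or_to_beta([1.0, 2.0, 0.5])
```

The resulting betas are symmetric around zero: an odds ratio of 2 and one of 0.5 map to +0.693 and -0.693 respectively.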

genal.geno_tools.check_int_column(data, int_col)

Set the type of the int_col column to Int32 and non-numeric values to NA.

genal.geno_tools.check_p_column(data)

Verify that the P column contains numeric values in the range [0,1]. Set inappropriate values to NA.
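As a per-value illustration of this check, the hypothetical helper below converts a value to float and returns NaN when it is not numeric or falls outside [0, 1]. The real function works on the DataFrame column as a whole.

```python
import math


def clean_p(value):
    """Return the p-value as a float, or NaN if it is non-numeric or outside [0, 1]."""
    try:
        p = float(value)
    except (TypeError, ValueError):
        return math.nan
    return p if 0.0 <= p <= 1.0 else math.nan
```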

genal.geno_tools.check_snp_column(data)

Remove duplicates in the SNP column.

genal.geno_tools.delete_tmp()

Delete the tmp folder.

genal.geno_tools.fill_coordinates_func(data, reference_panel_df)

Fill in the CHR/POS columns based on reference data.

genal.geno_tools.fill_ea_nea(data, reference_panel_df)

Fill in the EA and NEA columns based on reference data.

genal.geno_tools.fill_nea(data, reference_panel_df)

Fill in the NEA column based on reference data.

genal.geno_tools.fill_se_p(data)

If either P or SE is missing but the other and BETA are present, fill it.
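The relationship that makes this possible is the two-sided normal test: z = BETA / SE and P = 2 · (1 − Φ(|z|)). The sketch below shows both directions using only the standard library. The function names are hypothetical, and the assumption of a normal (rather than t) distribution is mine, not stated in the documentation.

```python
import math
from statistics import NormalDist


def p_from_beta_se(beta, se):
    """Two-sided p-value from a z-score: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))."""
    z = abs(beta / se)
    return math.erfc(z / math.sqrt(2))


def se_from_beta_p(beta, p):
    """Invert the test: |z| = -Phi^-1(p / 2), then SE = |beta| / |z|.

    Requires 0 < p < 1; p = 1 would give z = 0 and an undefined SE.
    """
    z = -NormalDist().inv_cdf(p / 2)
    return abs(beta) / z


p = p_from_beta_se(0.1, 0.05)  # z = 2
se = se_from_beta_p(0.1, p)    # recovers the original standard error
```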

genal.geno_tools.fill_snpids_func(data, reference_panel_df)

Fill in the SNP column based on reference data. If some SNPids are still missing, they will be replaced by a standard name: CHR:POS:EA
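The fallback naming can be sketched as follows. The CHR:POS:EA format comes from the description above; the helper names and the layout of the reference lookup (a dict keyed by chromosome and position) are hypothetical simplifications of the reference panel DataFrame.

```python
def fallback_snp_id(chrom, pos, ea):
    """Build the standard placeholder name CHR:POS:EA."""
    return f"{chrom}:{pos}:{ea}"


def fill_snpid(snp, chrom, pos, ea, reference):
    """Keep an existing ID, else look it up by position, else use the placeholder.

    reference: dict mapping (CHR, POS) to an rsID (hypothetical layout).
    """
    if snp:
        return snp
    return reference.get((chrom, pos)) or fallback_snp_id(chrom, pos, ea)


reference = {(1, 12345): "rs111"}
```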

genal.geno_tools.remove_na(data)

Identify the columns containing NA values. Delete rows with NA values.

genal.geno_tools.save_data(data, name, path='', fmt='h5', sep='\t', header=True)

Save a DataFrame to a file in a given format.

Parameters:
  • data (pd.DataFrame) – The data to be saved.

  • name (str) – The name of the file without extension.

  • path (str, optional) – Directory path for saving. Default is the current directory.

  • fmt (str, optional) – Format for the file, e.g., “h5”, “csv”, “txt”, “vcf”, “vcf.gz”. Default is “h5”.

  • sep (str, optional) – Delimiter for csv or txt files. Default is tab.

  • header (bool, optional) – Whether to include header in csv or txt files. Default is True.

Returns:

None. The data is saved to a file and the file path is printed.

Raises:

ValueError – If the provided format is not recognized.

genal.lift module

genal.lift.lift_coordinates(data, lo)

Perform liftover on data using LiftOver object after handling missing values.

genal.lift.lift_data(data, start='hg19', end='hg38', extraction_file=False, chain_file=None, name='')

Perform a liftover from one genetic build to another. If the chain file required for the liftover is not present, it will be downloaded. It’s also possible to manually provide the path to the chain file. If the dataset is large, it is suggested to use an alternate method (e.g., lift_data_liftover).

Parameters:
  • data (pd.DataFrame) – The input data containing at least “CHR” and “POS” columns.

  • start (str, optional) – The current build of the data. Defaults to “hg19”.

  • end (str, optional) – The target build for liftover. Defaults to “hg38”.

  • extraction_file (bool, optional) – If True, also prints a CHR POS SNP space-delimited file for extraction. Defaults to False.

  • chain_file (str, optional) – Path to a local chain file for the lift. Overrides the start and end arguments if provided.

  • name (str, optional) – Specify a filename or filepath (without extension) for saving. If not provided, uses [name]_lifted.txt.

Raises:

ValueError – If required columns are missing or if provided chain file path is incorrect.

Returns:

Lifted data.

Return type:

pd.DataFrame

Notes

Function for the GENO.lift() method.

genal.lift.lift_data_liftover(data, start='hg19', end='hg38', replace=False, extraction_file=False, chain_file='', name='', genal_tools_path='', liftover_path='')

Perform a liftover from one genetic build to another using the LiftOver software (requires Linux). The software must be installed on your system. It can be downloaded from https://genome-store.ucsc.edu. If the chain file required for the liftover is not present, download it first. It is also possible to manually provide the path to the chain file.

Parameters:
  • start (str, optional) – Current build of the data. Defaults to “hg19”.

  • end (str, optional) – Build to be lifted to. Defaults to “hg38”.

  • replace (bool, optional) – Whether to change the dataframe in place or return a new one. Defaults to False.

  • extraction_file (bool, optional) – If True, also prints a CHR POS SNP space-delimited file for extraction in All of Us (WES data). Defaults to False.

  • chain_file (str, optional) – Path to a local chain file to be used for the lift. If provided, the start and end arguments are not considered.

  • name (str, optional) – Filename or filepath (without extension) used to save the lifted dataframe. If not provided, it will be saved in the current folder as [name]_lifted.txt.

  • genal_tools_path (str, optional) – Path to the Genal_tools folder. If not provided, the tmp_GENAL folder will be used.

  • liftover_path (str, optional) – Path to the LiftOver executable.

genal.lift.post_lift_operations(data, name, extraction_file)

Handle post-liftover operations like reporting, and saving results.

genal.lift.prepare_chain_file(chain_file, start, end)

Handle chain file loading, downloading if necessary. Return LiftOver object.

genal.proxy module

genal.proxy.apply_proxies(df, ld, searchspace=None)

Given a dataframe (coming from GENO.data or GENO.data_clumped attributes) and a dataframe of proxies (output from find_proxies), replace the SNPs in df with their best proxies, if they exist. This function is suited for exposure data (before running a PRS for instance).

Parameters:
  • df (DataFrame) – Dataframe of SNP information with the usual GENO columns (SNP, BETA, SE, EAF, EA, NEA). EAF is not necessary.

  • ld (DataFrame) – Dataframe of proxies (output from find_proxies).

  • searchspace (list, optional) – List of SNPs to restrict the list of potential proxies. By default, includes all the proxies found. Using a searchspace can be done either at the find_proxies step or at this step, but it is much faster to use it here.

Returns:

A DataFrame with SNPs replaced by their best proxies, if they exist.

Return type:

DataFrame
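Selecting the "best" proxy can be sketched as picking, for each SNP, the candidate with the highest r2 among those allowed by the searchspace. The row keys "snp", "proxy" and "r2" are hypothetical and do not reflect the column names of the find_proxies output; treating the highest r2 as "best" is also an assumption.

```python
def best_proxies(ld_rows, searchspace=None):
    """Map each SNP to its highest-r2 proxy, optionally restricted to searchspace.

    ld_rows: iterable of dicts with keys 'snp', 'proxy' and 'r2' (hypothetical layout).
    """
    allowed = None if searchspace is None else set(searchspace)
    best = {}
    for row in ld_rows:
        if allowed is not None and row["proxy"] not in allowed:
            continue
        current = best.get(row["snp"])
        if current is None or row["r2"] > current["r2"]:
            best[row["snp"]] = row
    return {snp: row["proxy"] for snp, row in best.items()}


ld_rows = [
    {"snp": "rs1", "proxy": "rs10", "r2": 0.70},
    {"snp": "rs1", "proxy": "rs11", "r2": 0.95},
    {"snp": "rs2", "proxy": "rs20", "r2": 0.80},
]
```

Restricting the searchspace at this stage only filters an existing table, which is consistent with the remark above that it is faster than restricting the LD computation itself.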

genal.proxy.find_proxies(snp_list, searchspace=None, reference_panel='eur', kb=5000, r2=0.6, window_snps=5000, threads=1)

Given a list of SNPs, return a table of proxies.

Parameters:
  • snp_list (list) – List of rsids.

  • searchspace (list, optional) – List of SNPs to include in the search. By default, includes the whole reference panel.

  • reference_panel (str, optional) – The reference population to get linkage disequilibrium values and find proxies. Accepts values: “eur”, “sas”, “afr”, “eas”, “amr”. Alternatively, provide a path leading to a specific bed/bim/fam reference panel. Defaults to “eur”.

  • kb (int, optional) – Width of the genomic window to look for proxies. Defaults to 5000.

  • r2 (float, optional) – Minimum linkage disequilibrium value with the main SNP for a proxy to be included. Defaults to 0.6.

  • window_snps (int, optional) – Compute the LD value for SNPs that are not more than x SNPs apart from the main SNP. Defaults to 5000.

  • threads (int, optional) – Number of threads to use. Defaults to 1.

Returns:

A DataFrame containing the proxies. Only biallelic SNPs are returned.

Return type:

DataFrame

genal.proxy.query_outcome_proxy(df, ld, snps_to_extract, snps_df=[])

Extract the best proxies from a dataframe, as well as specific SNPs.

Given a dataframe df (originating from GENO.data) and a dataframe of potential proxies (output from find_proxies), this function extracts the best proxies from df as well as the SNPs specified in snps_to_extract. This is suited for querying outcome data.

Parameters:
  • df (pd.DataFrame) – Dataframe of SNP information with the usual GENO columns (SNP, BETA, SE, EAF, EA, NEA). EAF is not necessary.

  • ld (pd.DataFrame) – Dataframe of proxies (output from find_proxies).

  • snps_to_extract (list) – List of SNPs to extract in addition to the proxies.

  • snps_df (list, optional) – List of SNPs to choose the proxy from. Should be the list of SNPs in df. Can be provided to avoid recomputing it. Defaults to an empty list.

Returns:

Dataframe with queried SNPs and their proxies.

Return type:

pd.DataFrame

genal.tools module

genal.tools.check_bfiles(filepath)

Check if the path specified leads to a bed/bim/fam triple.
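A check of this kind can be sketched with pathlib. The helper has_bfiles is hypothetical and assumes the path is given as a common prefix without extension, so that prefix.bed, prefix.bim and prefix.fam are the three expected files; whether check_bfiles expects the same form is not stated here.

```python
import tempfile
from pathlib import Path


def has_bfiles(prefix):
    """Return True if prefix.bed, prefix.bim and prefix.fam all exist as files."""
    return all(Path(f"{prefix}.{ext}").is_file() for ext in ("bed", "bim", "fam"))


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    prefix = Path(tmp) / "panel"
    for ext in ("bed", "bim"):
        Path(f"{prefix}.{ext}").touch()
    incomplete = has_bfiles(prefix)  # the .fam file is still missing
    Path(f"{prefix}.fam").touch()
    complete = has_bfiles(prefix)
```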

genal.tools.default_config()

Return the default config values.

genal.tools.get_plink19_path()

Return the plink19 path if it exists in the config file.

genal.tools.get_reference_panel_path(reference_panel='eur')

Retrieve the path of the specified reference panel.

This function checks if the provided reference panel is a valid path to bed/bim/fam files. If not, it checks if the reference panel exists in the reference folder. If it doesn’t exist, the function attempts to download it.

Parameters:

reference_panel (str, optional) – The name of the reference panel or a path to bed/bim/fam files. Defaults to “eur”.

Raises:
  • ValueError – If the provided reference panel is not recognized.

  • OSError – If there’s an issue creating the directory.

  • FileNotFoundError – If the reference panel is not found.

Returns:

The path to the reference panel.

Return type:

str

genal.tools.load_reference_panel(reference_panel='eur')

Load the bim file from the reference panel specified.

genal.tools.read_config()

Get config file data.

Set the plink 1.9 path and verify that it is the correct version.

genal.tools.set_reference_folder(path='')

Set a folder path to store reference data.

This function allows users to specify a directory where reference data will be stored. If the directory doesn’t exist, it will be created. If no path is provided, a default directory named ‘tmp_GENAL’ in the current working directory will be used.

Parameters:

path (str, optional) – The desired directory path for storing reference data. Defaults to a temporary folder in the current working directory.

Raises:

OSError – If the directory cannot be created.

Returns:

The function prints messages to inform the user of the status and any errors.

Return type:

None

genal.tools.write_config(config)

Write data to the config file.

Module contents