The Geno class¶
The main object of the package is the Geno
class that contains the SNP-level data and manipulates it through its methods.
- class genal.Geno(df, CHR='CHR', POS='POS', SNP='SNP', EA='EA', NEA='NEA', BETA='BETA', SE='SE', P='P', EAF='EAF', keep_columns=True)[source]¶
A class to handle GWAS-derived data, including SNP rsID, genome position, SNP-trait effects, and effect allele frequencies.
- data¶
Main DataFrame containing SNP data.
- Type:
pd.DataFrame
- phenotype¶
Tuple with a DataFrame of individual-level phenotype data and a string representing the phenotype trait column. Initialized after running the ‘set_phenotype’ method.
- Type:
pd.DataFrame, str
- MR_data¶
Tuple containing DataFrames for associations with exposure and outcome, and a string for the outcome name. Initialized after running the ‘query_outcome’ method.
- Type:
pd.DataFrame, pd.DataFrame, str
- MR_results¶
Contains an MR results dataframe, a dataframe of harmonized SNPs, an exposure name, an outcome name. Assigned after calling the MR method and used for plotting with the MR_plot method.
- Type:
pd.DataFrame, pd.DataFrame, str, str
- ram¶
Available memory.
- Type:
int
- cpus¶
Number of available CPUs.
- Type:
int
- checks¶
Dictionary of checks performed on the main DataFrame.
- Type:
dict
- name¶
ID of the object (for internal reference and debugging purposes).
- Type:
str
- reference_panel¶
Reference population SNP data used for SNP info adjustments. Initialized when first needed.
- Type:
pd.DataFrame
- reference_panel_name¶
String identifying the reference panel (a path or a population code).
- Type:
str
- preprocess_data()[source]¶
Clean and preprocess the ‘data’ attribute (the main dataframe of SNP-level data).
- clump()[source]¶
Clump the main data based on reference panels and return a new Geno object with the clumped data.
- set_phenotype()[source]¶
Assigns a DataFrame with individual-level data and a phenotype trait to the ‘phenotype’ attribute.
- query_outcome()[source]¶
Extracts SNPs from SNP-outcome association data and stores it in the ‘MR_data’ attribute.
- MR()[source]¶
Performs Mendelian Randomization between the SNP-exposure and SNP-outcome data stored in the ‘MR_data’ attribute. Stores the results in the ‘MR_results’ attribute.
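The methods listed above are typically chained. The following is a minimal sketch of such a workflow, using only the signatures documented on this page. The function name run_mr_pipeline and its arguments are hypothetical, the input DataFrames are assumed to already use the default column names, and the clumping step assumes plink and a reference panel are available.

```python
def run_mr_pipeline(exposure_df, outcome_df,
                    exposure_name="exposure", outcome_name="outcome"):
    """Chain the documented Geno methods: preprocess, clump, query outcome, MR.

    Hypothetical helper for illustration; it is not part of genal.
    """
    # Imported lazily so that defining this function does not require genal.
    import genal

    # Column names are passed explicitly; they match the documented defaults.
    exposure = genal.Geno(exposure_df, CHR="CHR", POS="POS", SNP="SNP",
                          EA="EA", NEA="NEA", BETA="BETA", SE="SE",
                          P="P", EAF="EAF")
    outcome = genal.Geno(outcome_df)

    # Fill missing columns and delete invalid or duplicated rows.
    exposure.preprocess_data(preprocessing="Fill_delete")
    outcome.preprocess_data(preprocessing="Fill_delete")

    # clump() returns a new Geno object holding the clumped SNPs.
    instruments = exposure.clump(kb=250, r2=0.1, p1=5e-8,
                                 reference_panel="eur")

    # Extract the instruments from the outcome data (fills MR_data).
    instruments.query_outcome(outcome, name=outcome_name, proxy=False)

    # Run a subset of the documented MR methods and return the results table.
    return instruments.MR(methods=["IVW", "WM", "Egger"],
                          exposure_name=exposure_name,
                          outcome_name=outcome_name)
```

Each step is described in detail in the sections that follow.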
Main functions¶
Preprocessing¶
The preprocessing of the SNP-level data is performed with the preprocess_data()
method:
- Geno.preprocess_data(preprocessing='Fill', reference_panel='eur', effect_column=None, keep_multi=None, keep_dups=None, fill_snpids=None, fill_coordinates=None)[source]¶
Clean and preprocess the main dataframe of Single Nucleotide Polymorphisms (SNP) data.
- Parameters:
preprocessing (str, optional) – Level of preprocessing to apply. Defaults to “Fill”. Options include:
- “None”: The dataframe is not modified.
- “Fill”: Missing columns are added based on reference data and invalid values are set to NaN, but no rows are deleted.
- “Fill_delete”: Missing columns are added, and rows with missing, duplicated, or invalid values are deleted.
reference_panel (str or pd.DataFrame, optional) – Reference panel for SNP adjustments. Can be a string representing ancestry classification (“eur”, “afr”, “eas”, “sas”, “amr”) or a DataFrame with [“CHR”,”SNP”,”POS”,”A1”,”A2”] columns or a path to a .bim file. Defaults to “eur”.
effect_column (str, optional) – Specifies the type of effect column (“BETA” or “OR”). If None, the method tries to determine it. Odds Ratios will be log-transformed and the standard error adjusted. Defaults to None.
keep_multi (bool, optional) – Determines if multiallelic SNPs should be kept. If None, defers to preprocessing value. Defaults to None.
keep_dups (bool, optional) – Determines if rows with duplicate SNP IDs should be kept. If None, defers to preprocessing value. Defaults to None.
fill_snpids (bool, optional) – Decides if the SNP (rsID) column should be created or replaced based on CHR/POS columns and a reference genome. If None, defers to preprocessing value. Defaults to None.
fill_coordinates (bool, optional) – Decides if CHR and/or POS should be created or replaced based on SNP column and a reference genome. If None, defers to preprocessing value. Defaults to None.
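The effect_column description states that odds ratios are log-transformed and their standard errors adjusted. A minimal sketch of one common way to do this is shown below. It assumes the standard error is reported on the odds-ratio scale and uses a first-order (delta method) approximation; the function name is hypothetical and the adjustment applied by the package may differ.

```python
import math


def odds_ratio_to_beta(odds_ratio, se_or):
    """Convert an odds ratio and its standard error to the log-odds scale.

    Assumption: se_or is the standard error of the odds ratio itself.
    """
    if odds_ratio <= 0:
        raise ValueError("An odds ratio must be strictly positive.")
    # The effect on the log-odds scale is the natural log of the odds ratio.
    beta = math.log(odds_ratio)
    # Delta method: d/dx log(x) = 1/x, so SE(log OR) is roughly SE(OR) / OR.
    se_beta = se_or / odds_ratio
    return beta, se_beta
```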
Clumping¶
Clumping is performed with the clump()
method:
- Geno.clump(kb=250, r2=0.1, p1=5e-08, p2=0.01, reference_panel='eur')[source]¶
Clump the data based on linkage disequilibrium and return another Geno object with the clumped data. The clumping process is executed using plink.
- Parameters:
kb (int, optional) – Clumping window in kilobases (thousands of base pairs). Default is 250.
r2 (float, optional) – Linkage disequilibrium threshold, values between 0 and 1. Default is 0.1.
p1 (float, optional) – P-value threshold during clumping. SNPs with a P-value higher than this value are excluded. Default is 5e-8.
p2 (float, optional) – P-value threshold post-clumping to further filter the clumped SNPs. If p2 < p1, it won’t be considered. Default is 0.01.
reference_panel (str, optional) – The reference population for linkage disequilibrium values. Accepts values “eur”, “sas”, “afr”, “eas”, “amr”. Alternatively, a path leading to a specific bed/bim/fam reference panel can be provided. Default is “eur”.
- Returns:
A new Geno object based on the clumped data.
- Return type:
Geno
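The clumping itself is delegated to plink. To illustrate what the kb, r2 and p1 parameters control, here is a toy greedy clumping sketch. The function name, the SNP records and the pairwise LD lookup are all hypothetical, and the p2 filter is ignored; this is not the plink implementation.

```python
def greedy_clump(snps, r2_lookup, kb=250, r2=0.1, p1=5e-8):
    """Keep the most significant SNP of each LD clump (toy version).

    snps: list of dicts with keys 'SNP', 'CHR', 'POS', 'P'.
    r2_lookup: dict mapping frozenset({snp_a, snp_b}) to their LD r2.
    """
    # Only SNPs at least as significant as p1 can be index SNPs.
    candidates = sorted((s for s in snps if s["P"] <= p1),
                        key=lambda s: s["P"])
    kept = []
    for snp in candidates:
        in_existing_clump = False
        for index in kept:
            # The window is expressed in kilobases, positions in base pairs.
            same_window = (snp["CHR"] == index["CHR"]
                           and abs(snp["POS"] - index["POS"]) <= kb * 1000)
            pair_r2 = r2_lookup.get(frozenset((snp["SNP"], index["SNP"])), 0.0)
            if same_window and pair_r2 >= r2:
                in_existing_clump = True
                break
        if not in_existing_clump:
            kept.append(snp)
    return [s["SNP"] for s in kept]


toy_snps = [
    {"SNP": "rs1", "CHR": 1, "POS": 1_000, "P": 1e-12},
    {"SNP": "rs2", "CHR": 1, "POS": 50_000, "P": 1e-9},
    {"SNP": "rs3", "CHR": 1, "POS": 900_000, "P": 1e-10},
    {"SNP": "rs4", "CHR": 2, "POS": 1_000, "P": 0.01},
]
toy_r2 = {frozenset(("rs1", "rs2")): 0.8, frozenset(("rs1", "rs3")): 0.9}
clumped = greedy_clump(toy_snps, toy_r2)
```

In this toy data, rs2 falls within 250 kb of rs1 and is in LD with it, so it is dropped; rs3 is in LD with rs1 but lies outside the window, so it is kept; rs4 does not pass p1.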
Polygenic Risk Scoring¶
The computation of a polygenic risk score in a target population is performed with the prs()
method:
- Geno.prs(name=None, weighted=True, path=None, proxy=False, reference_panel='eur', kb=5000, r2=0.6, window_snps=5000)[source]¶
Compute a Polygenic Risk Score (PRS) and save it as a CSV file in the current directory.
- Parameters:
name (str, optional) – Name or path of the saved PRS file.
weighted (bool, optional) – If True, performs a PRS weighted by the BETA column estimates. If False, performs an unweighted PRS. Default is True.
path (str, optional) – Path to a bed/bim/fam set of genetic files for PRS calculation. If files are split by chromosomes, replace the chromosome number with ‘$’. For instance: path = “ukb_chr$_file”. If not provided, it will use the genetic path most recently used (if any). Default is None.
position (bool, optional) – Use the genomic positions instead of the SNP names to find the SNPs in the genetic data (recommended).
proxy (bool, optional) – If True, proxies are searched. Default is False.
reference_panel (str, optional) – The reference population used to derive linkage disequilibrium values and find proxies (only if proxy=True). Acceptable values include “EUR”, “SAS”, “AFR”, “EAS”, “AMR” or a path to a specific bed/bim/fam panel. Default is “eur”.
kb (int, optional) – Width of the genomic window to look for proxies. Default is 5000.
r2 (float, optional) – Minimum linkage disequilibrium value with the main SNP for a proxy to be included. Default is 0.6.
window_snps (int, optional) – Compute the LD value for SNPs that are not more than x SNPs away from the main SNP. Default is 5000.
- Returns:
The computed PRS data.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the data hasn’t been clumped and ‘clumped’ parameter is True.
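Two ideas from the parameters above can be sketched in a few lines: the ‘$’ placeholder for per-chromosome files, and the difference between a weighted and an unweighted score. Both functions are hypothetical illustrations. The first assumes autosomes 1 to 22; the second assumes that an unweighted score counts effect alleles using only the sign of BETA, which is one common convention and not necessarily the one used by the package.

```python
def expand_chromosome_paths(path, chromosomes=range(1, 23)):
    """Replace the '$' placeholder with each chromosome number."""
    if "$" not in path:
        # A single, unsplit set of genetic files.
        return [path]
    return [path.replace("$", str(chrom)) for chrom in chromosomes]


def polygenic_score(dosages, betas, weighted=True):
    """Compute a score for one individual.

    dosages: dict mapping SNP to the effect allele count (0, 1 or 2).
    betas: dict mapping SNP to its BETA estimate.
    """
    score = 0.0
    for snp, beta in betas.items():
        if snp not in dosages:
            # SNP absent from the genetic data: skipped in this sketch.
            continue
        # Assumption: the unweighted score keeps only the direction of effect.
        weight = beta if weighted else (1.0 if beta >= 0 else -1.0)
        score += weight * dosages[snp]
    return score
```

For example, expand_chromosome_paths("ukb_chr$_file") yields "ukb_chr1_file" through "ukb_chr22_file".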
Querying outcome data¶
Before running Mendelian Randomization, the genetic instruments are extracted from the Geno
object containing the SNP-outcome association data with the query_outcome()
method:
- Geno.query_outcome(outcome, name=None, proxy=True, reference_panel='eur', kb=5000, r2=0.6, window_snps=5000)[source]¶
Prepares dataframes required for Mendelian Randomization (MR) with the SNP information in data as exposure.
Queries the outcome data, with or without proxying, and assigns a tuple to the MR_data attribute: (exposure_data, outcome_data, name) ready for MR methods.
- Parameters:
outcome – Can be a Geno object (from a GWAS) or a filepath of types: .h5 or .hdf5 (created with the Geno.save() method).
name (str, optional) – Name for the outcome data. Defaults to None.
proxy (bool, optional) – If true, proxies are searched. Default is True.
reference_panel (str, optional) – The reference population to get linkage disequilibrium values and find proxies (only if proxy=True). Acceptable values include “EUR”, “SAS”, “AFR”, “EAS”, “AMR” or a path to a specific bed/bim/fam panel. Default is “eur”.
kb (int, optional) – Width of the genomic window to look for proxies. Default is 5000.
r2 (float, optional) – Minimum linkage disequilibrium value with the main SNP for a proxy to be included. Default is 0.6.
window_snps (int, optional) – Compute the LD value for SNPs that are not more than x SNPs away from the main SNP. Default is 5000.
- Returns:
Sets the MR_data attribute for the instance.
- Return type:
None
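When an instrument is absent from the outcome data, a proxy in linkage disequilibrium with it can be used instead. The sketch below illustrates the role of the r2 threshold. The function name and the LD table are hypothetical; the genomic window (kb, window_snps) and the allele alignment that a real proxy search needs are left out.

```python
def find_proxy(target, outcome_snps, ld_table, r2=0.6):
    """Return the best available proxy for a SNP, or None.

    outcome_snps: set of SNP IDs present in the outcome data.
    ld_table: dict mapping a SNP to a list of (candidate_snp, r2_value) pairs.
    """
    # Keep candidates that pass the LD threshold and exist in the outcome data.
    candidates = [(value, snp)
                  for snp, value in ld_table.get(target, [])
                  if value >= r2 and snp in outcome_snps]
    if not candidates:
        return None
    # Choose the candidate in strongest LD with the target SNP.
    return max(candidates)[1]
```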
Mendelian Randomization¶
Various Mendelian Randomization methods are computed with the MR()
method:
- Geno.MR(methods=['IVW', 'IVW-FE', 'WM', 'Simple-mode', 'Egger'], action=2, eaf_threshold=0.42, heterogeneity=False, nboot=1000, penk=20, phi=1, exposure_name=None, outcome_name=None, cpus=-1)[source]¶
Executes Mendelian Randomization (MR) using the exposure and outcome association data stored in the MR_data attribute by the query_outcome method.
- Parameters:
methods (list, optional) – List of MR methods to run. Default is [“IVW”, “IVW-FE”, “WM”, “Simple-mode”, “Egger”]. Possible options include:
- “IVW”: inverse variance-weighted with random effects and under-dispersion correction
- “IVW-FE”: inverse variance-weighted with fixed effects
- “IVW-RE”: inverse variance-weighted with random effects and without under-dispersion correction
- “UWR”: unweighted regression
- “WM”: weighted median (bootstrapped standard errors)
- “WM-pen”: penalised weighted median (bootstrapped standard errors)
- “Simple-median”: simple median (bootstrapped standard errors)
- “Sign”: sign concordance test
- “Egger”: Egger regression
- “Egger-boot”: Egger regression with bootstrapped standard errors
- “Simple-mode”: simple mode method
- “Weighted-mode”: weighted mode method
action (int, optional) – How to treat palindromes when harmonizing exposure and outcome data. Default is 2. Accepts:
- 1: Doesn’t flip them (assumes all alleles are on the forward strand)
- 2: Uses allele frequencies to attempt to flip (conservative, default)
- 3: Removes all palindromic SNPs (very conservative)
eaf_threshold (float, optional) – Max effect allele frequency accepted when flipping palindromic SNPs (relevant if action=2). Default is 0.42.
heterogeneity (bool, optional) – If True, includes heterogeneity tests in the results (Cochran’s Q test). Default is False.
nboot (int, optional) – Number of bootstrap replications for methods with bootstrapping. Default is 1000.
penk (int, optional) – Penalty value for the WM-pen method. Default is 20.
phi (int, optional) – Factor for the bandwidth parameter used in the kernel density estimation of the mode methods. Default is 1.
exposure_name (str, optional) – Name of the exposure data (only for display purposes).
outcome_name (str, optional) – Name of the outcome data (only for display purposes).
- Returns:
A table with MR results.
- Return type:
pd.DataFrame
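As an illustration of the simplest method in the list, the fixed-effects inverse variance-weighted estimate and the associated Cochran’s Q statistic can be computed from summary statistics as follows. This is a textbook sketch with first-order weights and a hypothetical function name, not the package’s implementation.

```python
import math


def ivw_fixed_effects(beta_exposure, beta_outcome, se_outcome):
    """Return (estimate, standard error, Cochran's Q) from per-SNP statistics."""
    # First-order weights: w_j = beta_exposure_j^2 / se_outcome_j^2.
    weights = [bx ** 2 / se ** 2
               for bx, se in zip(beta_exposure, se_outcome)]
    # Per-SNP Wald ratios: theta_j = beta_outcome_j / beta_exposure_j.
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total_weight = sum(weights)
    # The IVW estimate is the weighted mean of the Wald ratios.
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    # Fixed-effects standard error.
    standard_error = math.sqrt(1.0 / total_weight)
    # Cochran's Q: weighted squared deviations of the ratios from the estimate.
    cochran_q = sum(w * (r - estimate) ** 2
                    for w, r in zip(weights, ratios))
    return estimate, standard_error, cochran_q
```

Under the null hypothesis of no heterogeneity, Q follows approximately a chi-squared distribution with (number of SNPs - 1) degrees of freedom.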
MR-PRESSO¶
The MR-PRESSO algorithm to detect and correct horizontal pleiotropy is executed with the MRpresso()
method:
- Geno.MRpresso(action=2, eaf_threshold=0.42, n_iterations=10000, outlier_test=True, distortion_test=True, significance_p=0.05, cpus=-1)[source]¶
Executes the MR-PRESSO Mendelian Randomization algorithm for detection and correction of horizontal pleiotropy.
- Parameters:
action (int, optional) – Treatment for palindromes when harmonizing exposure and outcome data. Options:
- 1: Don’t flip (assume all alleles are on the forward strand)
- 2: Use allele frequencies to flip (default)
- 3: Remove all palindromic SNPs
eaf_threshold (float, optional) – Max effect allele frequency when flipping palindromic SNPs (relevant if action=2). Default is 0.42.
n_iterations (int, optional) – Number of random data generation steps for improved result stability. Default is 10000.
outlier_test (bool, optional) – Identify outlier SNPs responsible for horizontal pleiotropy if global test p_value < significance_p. Default is True.
distortion_test (bool, optional) – Test significant distortion in causal estimates before and after outlier removal if global test p_value < significance_p. Default is True.
significance_p (float, optional) – Statistical significance threshold for horizontal pleiotropy detection (both global test and outlier identification). Default is 0.05.
cpus (int, optional) – Number of CPU cores to be used for the parallel random data generation. Default is -1.
- Returns:
- Contains the following elements:
- mod_table: DataFrame containing the original (before outlier removal) and outlier-corrected (after outlier removal) inverse variance-weighted MR results.
- GlobalTest: p-value of the global MR-PRESSO test indicating the presence of horizontal pleiotropy.
- OutlierTest: DataFrame assigning a p-value to each SNP representing the likelihood of this SNP being responsible for the global pleiotropy. Set to NaN if global test p_value > significance_p.
- DistortionTest: p-value for the distortion test.
- Return type:
tuple
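The action and eaf_threshold parameters govern how palindromic SNPs (A/T or C/G, whose alleles read the same on both strands) are handled during harmonization. The sketch below shows the decision logic in a simplified form. The function names and the returned labels are hypothetical, and two points are assumptions: the threshold is applied to the minor allele frequency, and SNPs whose frequency is too close to 0.5 are dropped under action=2.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(ea, nea):
    """True if the two alleles are complementary (A/T or C/G)."""
    return COMPLEMENT.get(ea.upper()) == nea.upper()


def palindrome_decision(ea, nea, eaf, action=2, eaf_threshold=0.42):
    """Return 'keep', 'infer' (use frequencies to resolve strand) or 'drop'."""
    if not is_palindromic(ea, nea):
        # Non-palindromic SNPs can be harmonized from the alleles alone.
        return "keep"
    if action == 1:
        # Assume the forward strand everywhere: nothing is flipped.
        return "keep"
    if action == 3:
        # Most conservative choice: discard every palindromic SNP.
        return "drop"
    # action == 2: frequencies are informative only away from 0.5.
    minor_allele_frequency = min(eaf, 1.0 - eaf)
    return "infer" if minor_allele_frequency <= eaf_threshold else "drop"
```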
Phenotype assignment¶
Before running SNP-association tests, a dataframe with phenotypic data is assigned to the Geno
object with the set_phenotype()
method:
- Geno.set_phenotype(data, IID=None, PHENO=None, PHENO_type=None, alternate_control=False)[source]¶
Assign a phenotype dataframe to the .phenotype attribute.
- Parameters:
data (pd.DataFrame) – DataFrame containing individual-level row data with at least an individual IDs column and one phenotype column.
IID (str, optional) – Name of the individual IDs column in ‘data’. These IDs should correspond to the genetic IDs in the FAM file that will be used for association testing.
PHENO (str, optional) – Name of the phenotype column in ‘data’ which will be used as the dependent variable for association tests.
PHENO_type (str, optional) – If not specified, the function will try to infer if the phenotype is binary or quantitative. To bypass this, use “quant” for quantitative or “binary” for binary phenotypes. Default is None.
alternate_control (bool, optional) – By default, the function assumes that for a binary trait, the controls have the most frequent value. Set to True if this is not the case. Default is False.
- Returns:
Sets the .phenotype attribute for the instance.
- Return type:
None
Note
This method sets the .phenotype attribute which is essential to perform single-SNP association tests using the association_test method.
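The PHENO_type inference and the alternate_control flag described above can be sketched as follows. The function names are hypothetical, and two choices are assumptions: a trait is treated as binary when it has exactly two distinct non-missing values, and controls are coded 0 and cases 1.

```python
from collections import Counter


def infer_phenotype_type(values):
    """Guess 'binary' or 'quant' from the non-missing values."""
    observed = {v for v in values if v is not None}
    return "binary" if len(observed) == 2 else "quant"


def code_binary(values, alternate_control=False):
    """Code a two-valued trait as 0 (control) / 1 (case), keeping None."""
    counts = Counter(v for v in values if v is not None)
    if len(counts) != 2:
        raise ValueError("A binary trait needs exactly two distinct values.")
    (most_frequent, _), (least_frequent, _) = counts.most_common(2)
    # By default the most frequent value is taken to be the control group.
    control = least_frequent if alternate_control else most_frequent
    return [None if v is None else int(v != control) for v in values]
```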
SNP-association tests¶
SNP-association testing is conducted with the association_test()
method:
- Geno.association_test(path=None, covar=[], standardize=True)[source]¶
Conduct single-SNP association tests against a phenotype.
- Parameters:
path (str, optional) – Path to a bed/bim/fam set of genetic files. If files are split by chromosomes, replace the chromosome number with ‘$’. For instance: path = “ukb_chr$_file”. Default is None.
covar (list, optional) – List of columns in the phenotype dataframe to be used as covariates in the association tests. Default is an empty list.
standardize (bool, optional) – If True, it will standardize a quantitative phenotype before performing association tests. This is typically done to make results more interpretable. Default is True.
- Returns:
Updates the BETA, SE, and P columns of the data attribute based on the results of the association tests.
- Return type:
None
Note
This method requires the phenotype to be set using the set_phenotype() function.
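Standardizing a quantitative phenotype means rescaling it to mean 0 and standard deviation 1, so that effect sizes are expressed per standard deviation of the trait. A minimal sketch, assuming the sample standard deviation is used (the function name is hypothetical):

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.fmean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]
```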
Genetic lifting¶
Lifting the SNP data to another genetic build is done with the lift()
method:
- Geno.lift(start='hg19', end='hg38', replace=False, extraction_file=False, chain_file=None, name=None, liftover_path=None)[source]¶
Perform a liftover from one genetic build to another.
- Parameters:
start (str, optional) – Current build of the data. Default is “hg19”.
end (str, optional) – Target build for the liftover. Default is “hg38”.
replace (bool, optional) – If True, updates the data attribute in place. Default is False.
extraction_file (bool, optional) – If True, prints a CHR POS SNP space-delimited file. Default is False.
chain_file (str, optional) – Path to a local chain file for the lift. If provided, start and end arguments are not considered. Default is None.
name (str, optional) – Filename or filepath (without extension) to save the lifted dataframe. If not provided, the data is not saved.
liftover_path (str, optional) – Specify the path to the UCSC liftover executable. If not provided, the lift will be done in Python (slower for large numbers of SNPs).
- Returns:
Data after being lifted.
- Return type:
pd.DataFrame
GWAS Catalog¶
Querying the GWAS Catalog to extract traits associated with the SNPs is done with the query_gwas_catalog()
method:
- Geno.query_gwas_catalog(p_threshold=5e-08, return_p=False, return_study=False, replace=True, max_associations=None, timeout=-1)[source]¶
Queries the GWAS Catalog REST API and adds an “ASSOC” column containing associated traits for each SNP.
- Parameters:
p_threshold (float, optional) – Only associations that are at least as significant are reported. Default is 5e-8.
return_p (bool, optional) – If True, include the p-value in the results. Default is False.
return_study (bool, optional) – If True, include the ID of the study from which the association is derived in the results. Default is False.
replace (bool, optional) – If True, updates the data attribute in place. Default is True.
max_associations (int, optional) – If not None, only the first max_associations associations are reported for each SNP. Default is None.
timeout (int, optional) – Timeout for each query in seconds. Default is -1 (custom timeout based on number of SNPs to query). Choose None for no timeout.
- Returns:
- Data attribute with an additional column “ASSOC”.
The elements of this column are lists of strings or tuples depending on the return_p and return_study flags. If the SNP could not be queried, the value is set to “FAILED_QUERY”.
- Return type:
pd.DataFrame
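To make the shape of the “ASSOC” entries concrete, the sketch below filters and formats a list of associations for a single SNP according to the flags described above. The function name, the record layout (keys ‘trait’, ‘p’, ‘study’) and the order of fields inside each tuple are hypothetical; they are assumptions for illustration and are not taken from the GWAS Catalog API or from the package.

```python
def format_associations(records, p_threshold=5e-8, return_p=False,
                        return_study=False, max_associations=None):
    """Build one ASSOC-like entry from raw association records.

    records: list of dicts with hypothetical keys 'trait', 'p', 'study'.
    """
    # Keep associations at least as significant as the threshold.
    hits = [r for r in records if r["p"] <= p_threshold]
    if max_associations is not None:
        # Report only the first max_associations associations.
        hits = hits[:max_associations]
    formatted = []
    for record in hits:
        if not (return_p or return_study):
            # Plain strings when no extra field is requested.
            formatted.append(record["trait"])
            continue
        entry = [record["trait"]]
        if return_p:
            entry.append(record["p"])
        if return_study:
            entry.append(record["study"])
        formatted.append(tuple(entry))
    return formatted
```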