Complete Documentation#

H2M Output Data Description#
Index | Column | Description |
---|---|---|
0 | gene_name_h | Human gene name |
1 | gene_id_h | Human gene ID |
2 | tx_id_h | Human transcript ID |
3 | chr_h | Human chromosome number |
4 | exon_num_h | Total number of exons of the human transcript |
5 | strand_h | Positive or Negative strand of the human transcript on the chromosome |
6 | match | Whether the reference sequence computed from the given coordinates matches the input reference sequence |
7 | start_h, end_h | Start and end positions of the human variant on the chromosome in MAF format |
8 | ref_seq_h, alt_seq_h | Reference and alternate sequences of the human variant on the chromosome in MAF format |
9 | HGVSc_h, HGVSp_h | HGVSc and HGVSp expressions of the human variant |
10 | classification_h | Human variant effect classification, including missense/nonsense/in-frame indel/frame-shift indel/intron, etc. |
11 | exon_h | Exon/Intron location of the given human mutation, for example, E_7/I_5 |
12 | type_h | Human variant type in MAF format, including SNP/DNP/TNP/ONP/INS/DEL |
13 | status | Whether this mutation can be modeled in the given target transcript, True or False |
14 | class | H2M modeling result class, 0-6 |
15 | statement | Statement of the H2M result class |
16 | flank_size_left, flank_size_right | Length of the identical sequences between human and mouse on the left/right side of the mutation |
17 | gene_name_m | Mouse gene name |
18 | gene_id_m | Mouse gene ID |
19 | tx_id_m | Mouse transcript ID |
20 | chr_m | Mouse chromosome number |
21 | exon_num_m | Total number of exons of the mouse transcript |
22 | strand_m | Positive or negative strand of the mouse transcript on the chromosome |
23 | type_m | Mouse variant type in MAF format, including SNP/DNP/TNP/ONP/INS/DEL |
24 | classification_m | Mouse variant effect classification |
25 | exon_m | Exon/Intron location of the murine mutation |
26 | start_m_ori, end_m_ori | Start and end positions of the mouse variant (with exactly the same DNA change) on the chromosome in MAF format |
27 | ref_seq_m_ori, alt_seq_m_ori | Reference and alternate sequences of the mouse variant (with exactly the same DNA change) on the chromosome in MAF format |
28 | HGVSc_m_ori, HGVSp_m_ori | HGVSc and HGVSp expressions of the mouse variant (with exactly the same DNA change) |
29 | start_m, end_m | Start and end positions of the mouse variant (with the same amino acid change) on the chromosome in MAF format |
30 | ref_seq_m, alt_seq_m | Reference and alternate sequences of the mouse variant (with the same amino acid change) on the chromosome in MAF format |
31 | HGVSc_m, HGVSp_m | HGVSc and HGVSp expressions of the mouse variant (with the same amino acid change) |
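
The `exon_h` and `exon_m` columns store a location as a short label such as `E_7` or `I_5`. Below is a minimal sketch of how such labels could be split for downstream filtering. It assumes that `E` marks an exon and `I` an intron, each followed by its number; the helper `parse_location` is hypothetical and not part of h2m.

```python
def parse_location(label):
    """Split a label such as 'E_7' into ('exon', 7).

    Assumption: 'E' stands for exon and 'I' for intron, as suggested by
    the column description. Any other label is rejected.
    """
    kinds = {"E": "exon", "I": "intron"}
    prefix, _, number = label.partition("_")
    if prefix not in kinds or not number.isdigit():
        raise ValueError(f"unrecognized location label: {label!r}")
    return kinds[prefix], int(number)


print(parse_location("E_7"))
print(parse_location("I_5"))
```
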
H2M Modeling Class Description#
Index | Class | Statement |
---|---|---|
0 | Class 0 | This mutation can be originally modeled. |
1 | Class 1 | This mutation can be alternatively modeled. |
2 | Class 2 | This mutation can be modeled, but the effect may not be consistent. |
3 | Class 3 | This mutation cannot be originally modeled and no alternative is found. |
4 | Class 4 | Mutated sequences are not identical. |
5 | Class 5 | Coordinate error. This mutation is not in the query gene. |
6 | Class 6 | This mutation cannot be originally modeled. |
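
When reviewing a batch of results, it can help to count how many variants fall into each class. The sketch below copies the statements from the table above into a dictionary and summarizes a list of class values. The helper `summarize_classes` and the example values are hypothetical; in practice the values would come from the `class` column of an H2M result table.

```python
from collections import Counter

# Statements copied from the class table above.
CLASS_STATEMENTS = {
    0: "This mutation can be originally modeled.",
    1: "This mutation can be alternatively modeled.",
    2: "This mutation can be modeled, but the effect may not be consistent.",
    3: "This mutation cannot be originally modeled and no alternative is found.",
    4: "Mutated sequences are not identical.",
    5: "Coordinate error. This mutation is not in the query gene.",
    6: "This mutation cannot be originally modeled.",
}


def summarize_classes(classes):
    """Return (class, count, statement) tuples in ascending class order."""
    counts = Counter(classes)
    return [(c, counts[c], CLASS_STATEMENTS[c]) for c in sorted(counts)]


# Hypothetical class values standing in for a result table's `class` column.
example_classes = [0, 0, 1, 3, 0]
for cls, n, statement in summarize_classes(example_classes):
    print(f"class {cls}: {n} variant(s) - {statement}")
```
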
Functions#
- h2m.genome_loader(path)#
Load the reference genome file.
- Parameters:
path (str): path of the genome file.
- Return:
reference genome records and the index list of chromosomes.
- Example:
>>> records_h, index_list_h = h2m.genome_loader(path_h_ref)
- h2m.anno_loader(path)#
Load the GENCODE annotation file.
- Parameters:
path (str): path of the annotation file.
- Return:
a FeatureDB
- Example:
>>> db_h = h2m.anno_loader(path_h_anno)
- h2m.cbio_reader(path=None, df=None, keep=True)#
Generate MAF-style h2m input from cBioPortal data.
- Parameter:
path (str): the path of mutation data in txt format.
keep (bool): True: keep all the original columns in the dataframe; False: keep only the columns necessary for h2m. Defaults to True.
- Output:
An input dataframe for h2m modeling.
- Example:
>>> h2m.cbio_reader('.../data_mutations.txt', keep=False)
- h2m.clinvar_reader(path, list_of_ids=None, keep=True)#
Generate h2m input from ClinVar data.
- Parameter:
path (str): the path of the ClinVar reference vcf.gz data.
list_of_ids (list): the list of variation ids. If omitted, the function outputs all entries in the ClinVar data file.
keep (bool): True: keep all the original columns in the dataframe; False: keep only the columns necessary for h2m. Defaults to True.
- Output:
An input dataframe for h2m modeling.
- Example:
>>> filepath = '.../GrCh37_clinvar_20230923.vcf.gz'
>>> variation_ids = [925574, 925434, 926695, 925707, 325626, 1191613, 308061, 361149, 1205375, 208043]
>>> df = h2m.clinvar_reader(filepath, variation_ids)
- h2m.clinvar_to_maf(df)#
Convert clinvar-style input into maf-style input, while keeping other columns intact.
- h2m.vcf_reader(path, keep=True)#
Generate MAF-style h2m input from VCF-format data, including gnomAD data.
- Parameter:
path (str): the path of input csv data.
keep (bool): True: keep all the original columns in the dataframe; False: keep only the columns necessary for h2m. Defaults to True.
- Output:
An input dataframe for h2m modeling.
- Example:
>>> filepath = '.../gnomAD_v4.0.0_ENSG00000141510_2024_02_07_11_36_03.csv'
>>> df = h2m.vcf_reader(filepath)
- h2m.vcf_to_maf(df)#
Convert vcf-style input into maf-style input, while keeping other columns intact.
- h2m.get_variant_type(df, ref_col, alt_col, col_name='type_h')#
Generate h2m/cbio-style nucleotide variant type annotations.
- Parameter:
df (Pandas DataFrame): a dataframe of mutations that includes columns for both reference and alternate sequences.
ref_col (str): column name for reference sequences.
alt_col (str): column name for alternate sequences.
- Output:
The input dataframe with a column of variant types added.
- Example:
>>> h2m.get_variant_type(df, 'ref_seq_h','alt_seq_h')
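To make the variant type labels concrete, here is a minimal sketch of how a MAF-style type could be derived from a reference and an alternate sequence. This is an illustration under stated assumptions, not h2m's implementation: equal-length substitutions are labeled by length (1 = SNP, 2 = DNP, 3 = TNP, longer = ONP), a longer alternate sequence is labeled INS, a shorter one DEL, and `-` is treated as an empty sequence. The function `variant_type` is hypothetical.

```python
def variant_type(ref, alt):
    """Return a MAF-style variant type for one ref/alt pair.

    Assumption: '-' and '' both mean "no sequence", and unequal lengths
    are classified purely by which sequence is longer.
    """
    ref = "" if ref == "-" else ref
    alt = "" if alt == "-" else alt
    if len(ref) == len(alt):
        if not ref:
            raise ValueError("ref and alt cannot both be empty")
        # Substitutions are named by the number of replaced nucleotides.
        return {1: "SNP", 2: "DNP", 3: "TNP"}.get(len(ref), "ONP")
    return "INS" if len(alt) > len(ref) else "DEL"


for ref, alt in [("C", "T"), ("-", "A"), ("AT", "-"), ("ACGT", "TGCA")]:
    print(ref, alt, variant_type(ref, alt))
```
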
- h2m.get_tx_id(id, species, ver=None, ty='default', show=True)#
Query a human or mouse gene for the coordinates and information of all its transcripts. Requires an internet connection.
- Parameters:
id (str): identifier of the gene to query. Multiple input forms are accepted, including gene name and stable Ensembl gene ID with or without version number.
species (str): 'h' for human or 'm' for mouse.
ver (int): version of the human reference genome, one of 37/38. This parameter is required.
ty (str): OPTIONAL. Type of the input id, one of 'name'/'gene_id'.
show (bool): OPTIONAL. Whether to print a summary of the output.
- Return:
A list: [chromosome, start location of the gene, end location of the gene, canonical transcript ID, list of all transcript IDs (the canonical one is included and always listed first), list of additional information for each transcript].
- Example:
>>> h2m.get_tx_id('TP53','h',ver=37)
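Because the return value is a positional list of six items, giving the fields names can make later code easier to read. The sketch below wraps such a list in a named tuple. The field names, the helper `unpack_tx_query`, and the placeholder values are all hypothetical; only the order of the items follows the Return description above.

```python
from collections import namedtuple

# Field names are illustrative; the order follows the documented return list.
TxQuery = namedtuple(
    "TxQuery",
    ["chromosome", "gene_start", "gene_end",
     "canonical_tx_id", "all_tx_ids", "tx_info"],
)


def unpack_tx_query(result):
    """Wrap a six-item result list in a named tuple and sanity-check it."""
    record = TxQuery(*result)
    # The canonical transcript is documented as always listed first.
    if record.all_tx_ids and record.all_tx_ids[0] != record.canonical_tx_id:
        raise ValueError("canonical transcript is not first in the list")
    return record


# Placeholder values only; a real call would be h2m.get_tx_id(...).
placeholder = ["17", 100, 200, "TX_1", ["TX_1", "TX_2"], ["info 1", "info 2"]]
record = unpack_tx_query(placeholder)
print(record.canonical_tx_id, record.all_tx_ids)
```
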
- h2m.get_tx_batch(df, species, ver=None)#
Batch query of canonical transcript IDs of human or mouse genes.
- Parameters:
df (Pandas DataFrame): Must include a column of gene names named 'gene_name_h' or 'gene_name_m', depending on the species. An index column is recommended.
species (str): 'h' for human or 'm' for mouse.
ver (int): version of the human reference genome, one of 37/38.
- Return:
Two dataframes. The first is the original dataframe with the canonical transcript ID attached in a column named 'tx_id_h' or 'tx_id_m'. The second contains all rows that were not successfully processed.
- Example:
>>> h2m.get_tx_batch(df,'h',ver=37)
- h2m.query(id, db=None, direction='h2m', ty='default', show=True)#
Query homologous mouse genes of human genes.
- Parameters:
id (str): human gene name, gene ID, or transcript ID.
direction (str): OPTIONAL. Query from the human gene to the mouse gene ('h2m') or vice versa ('m2h').
db (FeatureDB): OPTIONAL. The transcript annotation database of a specific version.
ty (str): OPTIONAL. Specify the id type, one of 'gene_id'/'tx_id'/'name'.
- Return:
a list of human gene name, mouse gene name, mapping type and sequence similarity.
- Example:
>>> h2m.query('TP53')
- h2m.query_batch(df, direction='h2m')#
Batch query of the orthologous mouse genes of given human genes.
- Parameters:
df (Pandas DataFrame): Must include a column of gene names named 'gene_name_h'. An index column is recommended.
direction: OPTIONAL. Query from the human gene to the mouse gene ('h2m') or vice versa ('m2h').
- Return:
Two dataframes. The first is the original dataframe with the orthologous mouse gene name attached in a column named 'gene_name_m'. The second contains all rows that were not successfully processed.
- Example:
>>> h2m.query_batch(df)
- h2m.model(records_h, index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, start, end, ref_seq, alt_seq, ty_h=None, ver=None, direction='h2m', param='default', coor='nc', search_alternative=True, max_alternative=5, nonstop_size=300, splicing_size=30, batch=False, show_sequence=False, align_input=None, memory_protect=True, memory_size=10000)#
Model human variants in the mouse genome.
- Parameters:
records_h, index_list_h, records_m, index_list_m: human and mouse reference genome.
db_h, db_m: human and mouse GENCODE annotation.
tx_id_h, tx_id_m: human and mouse transcript IDs (obtainable with h2m.get_tx_id()). With direction = 'h2h' or direction = 'm2m', these are the transcript IDs of the input and output variants.
start, end: int, start and end location of the mutation on the chromosome.
ref_seq: str, human mutation reference sequence. Reference sequence of the input variant when direction = 'm2h'/'h2h'/'m2m'.
alt_seq: str, human mutation alternate sequence. Alternate sequence of the input variant when direction = 'm2h'/'h2h'/'m2m'.
ty_h: str, human variant type in MAF format. One of ['SNP', 'DNP', 'TNP', 'ONP', 'INS', 'DEL'].
ver: int, human reference genome version, 37 or 38.
direction (optional): str, set the modeling direction: 'h2m' (default), 'm2h', 'h2h', or 'm2m'.
param (optional): set param = 'BE' to output only results that can be modeled by base editing.
coor (optional): default = 'nc'. Set coor = 'aa' to accept amino acid variants as input.
search_alternative (optional): set search_alternative = False to output only original modeling results.
max_alternative (optional): the maximum number of alternatives output for one human variant.
nonstop_size (optional): the number of nucleotides included after the stop codon for alignment and translation in the case of nonstop or frame-shift mutations.
splicing_size (optional): the number of amino acids or nucleotides (for non-coding mutations) included after the stop codon when considering frame-shift effects.
batch (optional): set batch = True to use the alignment dictionary passed in align_input, which saves time in batch processing.
show_sequence (optional): set show_sequence = True to output the whole sequences.
align_input (optional): a prepared dictionary of alignment indexes, used to save time in batch processing.
memory_protect (optional): default True. Break long alignments that could otherwise kill the kernel.
memory_size (optional): maximum length of an aligned sequence when memory_protect == True.
- Other rules:
1. If the mutation falls in coding and non-coding regions at the same time, it is processed as an ORIGINAL-MODELING ONLY mutation.
3. The alt_seq input should be on the positive strand, and the start coordinate should be less than or equal to the end coordinate.
4. If ref_seq or alt_seq has no length, it can be input as '' or '-'.
- Example:
>>> h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577120, 7577120, 'C', 'T', ty_h = 'SNP', ver = 37)
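The input rules listed under "Other rules" can be checked before calling `h2m.model()`. The sketch below is one way to do that. The helper `check_variant_input` is hypothetical and not part of h2m; it checks the coordinate order and normalizes `-` to an empty string. The extra check that the two sequences are not both empty is an assumption added here, not a documented rule.

```python
def check_variant_input(start, end, ref_seq, alt_seq):
    """Validate coordinates and normalize empty sequences.

    Mirrors two documented rules: start must not exceed end, and an
    empty sequence may be written as '' or '-'.
    """
    if start > end:
        raise ValueError(f"start ({start}) must be <= end ({end})")
    # Treat '-' and '' as the same "no sequence" marker.
    ref_seq = "" if ref_seq == "-" else ref_seq
    alt_seq = "" if alt_seq == "-" else alt_seq
    # Assumption (not a documented rule): a variant needs some sequence.
    if not ref_seq and not alt_seq:
        raise ValueError("ref_seq and alt_seq cannot both be empty")
    return start, end, ref_seq, alt_seq


# Coordinates and sequences taken from the h2m.model() example above.
print(check_variant_input(7577120, 7577120, "C", "T"))
```
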
- h2m.model_batch(df, records_h, index_list_h, records_m, index_list_m, db_h, db_m, ver, param='default', direction='h2m', coor='nc', search_alternative=True, max_alternative=5, nonstop_size=300, splicing_size=30, show_sequence=False, align_input=None, memory_protect=True, memory_size=10000, bind=False)#
Batch modeling of human variants in the mouse genome.
- Parameters:
df (Pandas DataFrame): Must include columns {'start_h', 'end_h', 'type_h', 'ref_seq_h', 'alt_seq_h', 'tx_id_h', 'tx_id_m', 'index'}.
records_h, index_list_h, records_m, index_list_m: reference genome files
db_h, db_m: genomic annotation files
ver (int): version of the human reference genome, one of 37/38.
param (optional): set param = 'BE' to output only results that can be modeled by base editing.
direction (optional): str, set the modeling direction: 'h2m' (default) or 'm2h'.
coor (optional): default = 'nc'. Set coor = 'aa' to accept amino acid variants as input.
search_alternative (optional): set search_alternative = False to output only original modeling results.
max_alternative (optional): the maximum number of alternatives output for one human variant.
nonstop_size (optional): the number of nucleotides included after the stop codon for alignment and translation in the case of nonstop or frame-shift mutations.
splicing_size (optional): same meaning as the splicing_size parameter of h2m.model().
batch (optional): set batch = True to use the alignment dictionary passed in align_input, which saves time in batch processing.
show_sequence (optional): set show_sequence = True to output the whole sequences.
align_input (optional): a prepared dictionary of alignment indexes, used to save time in batch processing.
memory_protect (optional): default True. Break long alignments that could otherwise kill the kernel.
memory_size (optional): maximum length of an aligned sequence when memory_protect == True.
bind (optional): whether to bind the output dataframe to the original input.
- Return:
Two dataframes. The first dataframe is the processed original dataframe. The second dataframe contains all rows that are not successfully processed.
- Example:
>>> h2m.model_batch(df, records_h, index_list_h, records_m, index_list_m, db_h, db_m, ver = 37, param = 'BE')
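Since `h2m.model_batch()` expects a fixed set of input columns, it can be useful to check for missing ones before starting a long run. The sketch below does this with plain Python sets. The helper `missing_columns` is hypothetical; the column names are the ones listed for the `df` parameter above.

```python
# Column names required by h2m.model_batch(), as listed in its parameters.
REQUIRED_COLUMNS = {
    "start_h", "end_h", "type_h", "ref_seq_h",
    "alt_seq_h", "tx_id_h", "tx_id_m", "index",
}


def missing_columns(columns):
    """Return the required column names absent from `columns`, sorted."""
    return sorted(REQUIRED_COLUMNS - set(columns))


# Example with an incomplete set of column names.
print(missing_columns(["start_h", "end_h", "ref_seq_h", "alt_seq_h"]))
```

With a pandas dataframe, the same check could be written as `missing_columns(df.columns)`, since any iterable of column names is accepted.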
- h2m.visualization(model_result, flank_size=0, print_size=6)#
Visualize h2m modeling results.
- Parameter:
model_result (list): the output of h2m.model(show_sequence = True) function.
flank_size (int) (de).
print_size (int): length of the nucleotide/peptide sequence included on both sides of the flank region.
- Output:
A visualization plot.
- Example:
>>> model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577120, 7577120, 'C','T', ty_h = 'SNP', ver = 37, show_sequence=True)
>>> h2m.visualization(model_result, flank_size = 2, print_size = 4)