Quick Start#

_images/h2m-logo-final.png

Note

Data used in this tutorial can be downloaded from this Dropbox folder.

Package Installation#

H2M is available through the Python Package Index (PyPI). To install it, use pip:

    pip install bioh2m

Attention

Python 3.9 to 3.12 is recommended, since H2M has been tested for compatibility with these versions.
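
To check the interpreter before installing, a minimal sketch such as the following can be used. The helper name `is_tested_python` is hypothetical and not part of H2M; the version bounds are simply the ones stated above.

```python
import sys

# Tested range stated in this tutorial: Python 3.9 through 3.12.
TESTED_MIN, TESTED_MAX = (3, 9), (3, 12)


def is_tested_python(version=None):
    """Return True if (major, minor) falls inside the tested range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return TESTED_MIN <= major_minor <= TESTED_MAX


if not is_tested_python():
    print("Warning: H2M has not been tested on Python %d.%d" % sys.version_info[:2])
```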

Hint

H2M depends on pysam, which is needed only by the function that reads .vcf files. If pysam causes installation problems, you can instead download and install the wheel file from the GitHub repository that omits this function and the pysam dependency; this has been tested to solve most installation issues. The function left out of mini-h2m is also provided in the repository.

Importing packages#

    import bioh2m as h2m
    import pandas as pd

Loading data#

Load the reference genome and GENCODE annotation data for both human and mouse; these can be downloaded directly from this Dropbox folder.
Both the GRCh37 and GRCh38 human reference genome assemblies are available. Load the one that you are going to use.

    path_h_ref, path_m_ref = '.../GCF_000001405.25_GRCh37.p13_genomic.fna.gz', '.../GCF_000001635.27_GRCm39_genomic.fna.gz'
    # remember to replace the paths with yours; for human, GRCh38 reference genome assembly is also provided  
    records_h, index_list_h = h2m.genome_loader(path_h_ref)
    records_m, index_list_m  = h2m.genome_loader(path_m_ref)

    path_h_anno, path_m_anno = '.../gencode_v19_GRCh37.db', '.../gencode_vm33_GRCm39.db'
    # remember to replace the paths with yours
    db_h, db_m = h2m.anno_loader(path_h_anno), h2m.anno_loader(path_m_anno)
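
Because the genome files are large, it can help to confirm that every path exists before calling the loaders. The following is a small sketch using only the standard library; `missing_files` is a hypothetical helper, not an H2M function.

```python
from pathlib import Path


def missing_files(*paths):
    """Return the subset of the given paths that are not existing files."""
    return [p for p in paths if not Path(p).is_file()]


# Example with a placeholder name: the file is reported as missing
# unless it really exists in the working directory.
print(missing_files("no_such_reference_genome.fna.gz"))
```

In the tutorial flow, one would pass `path_h_ref`, `path_m_ref`, `path_h_anno` and `path_m_anno` and only continue if the returned list is empty.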

Batch Processing#

Input format#

Common mutation data formats include MAF (Mutation Annotation Format, used by cBioPortal), VCF (Variant Call Format, used by gnomAD), and the ClinVar format (a modified VCF format, used by ClinVar). Mutation coordinates and reference and alternate sequences are recorded in slightly different ways in each of the three.

_images/format.png

In batch processing, H2M accepts MAF input. More information about the MAF format can be found in the GDC Documentation.

For MAF files, build a pandas DataFrame with columns as in the following example:

_images/1.png

For VCF and ClinVar files, you will need to convert the mutation coordinates and sequence information to MAF format after reading them.

This can be achieved simply by using H2M built-in functions.
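
To make the difference between the formats concrete, here is a minimal sketch of a VCF-to-MAF coordinate conversion for a single record. It is not H2M's implementation. It assumes the common VCF convention in which indels carry shared leading (anchor) bases and the common MAF convention in which an empty allele is written as '-'. The function name and the returned keys are illustrative only.

```python
def vcf_to_maf_fields(pos, ref, alt):
    """Convert a VCF-style (POS, REF, ALT) triple to MAF-style fields."""
    # Drop the bases shared at the start of REF and ALT, shifting POS.
    while ref and alt and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1

    if not ref and not alt:
        raise ValueError("REF and ALT are identical")
    if not ref:
        # Insertion: start and end are the two bases flanking the new sequence.
        return {"start": pos - 1, "end": pos, "ref": "-", "alt": alt, "type": "INS"}
    if not alt:
        # Deletion: start and end span the deleted bases.
        return {"start": pos, "end": pos + len(ref) - 1,
                "ref": ref, "alt": "-", "type": "DEL"}
    if len(ref) != len(alt):
        raise ValueError("complex substitutions are not handled in this sketch")
    # Equal-length substitution: name it by the number of bases involved.
    kind = {1: "SNP", 2: "DNP", 3: "TNP"}.get(len(ref), "ONP")
    return {"start": pos, "end": pos + len(ref) - 1,
            "ref": ref, "alt": alt, "type": kind}


print(vcf_to_maf_fields(100, "AT", "A"))  # a one-base deletion at position 101
```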

Read from cBioPortal - MAF#

This format is compatible with all of the datasets in cBioPortal, as well as TCGA and AACR-GENIE. Download the mutation data file (.txt) from one of these public datasets and load it as follows:

    path_aacr = '.../data_mutations_extended.txt'
    df = h2m.cbio_reader(path_aacr)
    df
_images/cbio_reader.png

Read from gnomAD - VCF#

Search for a specific gene in the gnomAD browser and download the resulting csv file.

_images/genomad.png
    # TP53 variants downloaded from gnomAD
    df = h2m.vcf_reader('.../gnomAD_v4.1.0_ENSG00000141510.csv', keep=False)
    df['gene_name_h'] = 'TP53'
    df
_images/vcf_reader.png

Then convert it to MAF format.

    df = h2m.vcf_to_maf(df)
    df
_images/vcf_to_maf.png

Read from ClinVar#

Download a ClinVar vcf.gz file and choose the Variation IDs that you wish to model. These vcf.gz files are available in this Dropbox folder.

_images/clinvar.png
    filepath = '.../GRCh37_clinvar_20240206.vcf.gz'
    # remember to replace the path with yours
    variation_ids = [32798013, 375926, 325626, 140953, 233866, 1796995, 17578, 573320]
    df = h2m.clinvar_reader(filepath, variation_ids)
    df = h2m.clinvar_to_maf(df)
    df = df[['gene_name_h', 'start_h', 'end_h', 'ref_seq_h', 'alt_seq_h', 'type_h', 'format', 'ID']]
    df = df.rename(columns={'ID':'index'})
    df
_images/clinvar_result.png

Get canonical transcript IDs for human#

Two dataframes are returned: one for the rows processed successfully and one for the failures.

    df, df_fail = h2m.get_tx_batch(df, species='h', ver = 37)
    df
_images/2.png

Query orthologous genes#

    df_queried, df_fail = h2m.query_batch(df, direction='h2m')
    df_queried
_images/3.png

Get canonical transcript IDs for mouse#

    df_queried, df_fail = h2m.get_tx_batch(df_queried, species='m')
    df_queried
_images/4.png

Compute the murine variant equivalents#

    df_result, df_fail = h2m.model_batch(df_queried, records_h, index_list_h, records_m, index_list_m, db_h, db_m, 37)
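
Each batch function above returns a success table and a failure table, and the snippets reuse the name df_fail at every step, so earlier failures are overwritten. The following sketch shows one way to chain such steps while keeping every failure together with the step that rejected it. It uses plain lists of dictionaries and toy steps so that it runs without H2M; `run_steps` and both step functions are hypothetical.

```python
def run_steps(rows, steps):
    """Apply each (name, step) in order; every step returns (passed, failed)."""
    failures = []
    for name, step in steps:
        rows, failed = step(rows)
        # Remember which step rejected each row so it can be inspected later.
        failures.extend({**row, "failed_at": name} for row in failed)
    return rows, failures


def has_gene_name(rows):
    """Toy step: keep only rows that carry a non-empty gene name."""
    passed = [r for r in rows if r.get("gene_name_h")]
    failed = [r for r in rows if not r.get("gene_name_h")]
    return passed, failed


def has_valid_span(rows):
    """Toy step: keep only rows whose start does not exceed their end."""
    passed = [r for r in rows if r["start_h"] <= r["end_h"]]
    failed = [r for r in rows if r["start_h"] > r["end_h"]]
    return passed, failed


rows = [
    {"gene_name_h": "TP53", "start_h": 7577120, "end_h": 7577120},
    {"gene_name_h": "", "start_h": 1, "end_h": 1},
    {"gene_name_h": "TP53", "start_h": 10, "end_h": 5},
]
ok, failures = run_steps(rows, [("gene_name", has_gene_name), ("span", has_valid_span)])
print(len(ok), [f["failed_at"] for f in failures])
```

With pandas, the same idea would amount to tagging each failure dataframe with the step name and concatenating them at the end.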

Single variant input#

Query orthologous genes#

First of all, you can use H2M to query a human gene for the presence of mouse homologs and vice versa.

    query_result = h2m.query('TP53')
    Query human gene: TP53;
    Mouse ortholog(s): Trp53;
    Homology type: one2one;
    Sequence Simalarity(%):77.3537.
    query_result = h2m.query('Trp53', direction='m2h')
    Query human gene: Trp53;
    Mouse ortholog(s): TP53;
    Homology type: one2one;
    Sequence Simalarity(%):77.3537.

The output is a list with one element for each mouse ortholog, if any exist (sometimes there is more than one).
Each element is a dictionary containing the mouse gene name, the mouse gene ID, the homology type (one to one, one to many, or many to many), and the sequence similarity between the human and mouse genes as a percentage.

    h2m.query('U1')
    Query human gene: U1;
    Mouse ortholog(s): Gm22866,Gm25938;
    Homology type: one2many;
    Sequence Simalarity(%):68.75, 62.3457.
    h2m.query('TPT1P6')
    The query human gene: TPT1P6 has no mouse ortholog or this gene id is not included in the database. Please check the input format.
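
When a gene has several orthologs, as U1 does above, you may want to pick the most similar one. The sketch below works on a mocked result list. The dictionary key names (`gene_name_m`, `homology_type`, `similarity`) are assumptions made for illustration and may differ from the keys H2M actually returns; the values are taken from the U1 output above.

```python
# Mock of a query result for a one2many gene; the key names are assumptions.
mock_result = [
    {"gene_name_m": "Gm22866", "homology_type": "one2many", "similarity": 68.75},
    {"gene_name_m": "Gm25938", "homology_type": "one2many", "similarity": 62.3457},
]


def best_ortholog(result, key="similarity"):
    """Return the entry with the highest similarity, or None for an empty list."""
    return max(result, key=lambda entry: entry[key]) if result else None


best = best_ortholog(mock_result)
print(best["gene_name_m"])
```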

In addition to gene names, both ENSEMBL gene IDs and transcript IDs are accepted to identify a human gene. You can use the ty parameter ('tx_id', 'gene_id' or 'name') to specify your input type, but this is optional.
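
As an illustration of how the three input types differ, the following sketch guesses the type from the identifier itself, relying on the Ensembl stable ID prefixes (ENSG and ENST for human, ENSMUSG and ENSMUST for mouse). This is not how H2M necessarily decides; `guess_id_type` is a hypothetical helper.

```python
import re

# Ensembl stable IDs, with an optional ".version" suffix.
_GENE_ID = re.compile(r"^ENS(MUS)?G\d+(\.\d+)?$")
_TX_ID = re.compile(r"^ENS(MUS)?T\d+(\.\d+)?$")


def guess_id_type(identifier):
    """Return 'tx_id', 'gene_id' or 'name' for a gene identifier string."""
    if _TX_ID.match(identifier):
        return "tx_id"
    if _GENE_ID.match(identifier):
        return "gene_id"
    # Anything that is not an Ensembl ID is treated as a gene name.
    return "name"


for example in ("TP53", "ENSG00000141510", "ENST00000269305.4"):
    print(example, guess_id_type(example))
```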

Using gene id:

    query_result = h2m.query('ENSG00000141510')
    Query human gene: TP53;
    Mouse ortholog(s): Trp53;
    Homology type: one2one;
    Sequence Simalarity(%):77.3537.

Using transcript ID. In this case, also pass a db annotation file that matches the reference genome version.

    query_result = h2m.query('ENST00000269305.4', db=db_h, ty='tx_id')
    Query human gene: TP53;
    Mouse ortholog(s): Trp53;
    Homology type: one2one;
    Sequence Simalarity(%):77.3537.

The query results for all human genes, together with the corresponding transcript IDs, are also available as a csv file in this Dropbox folder.

Get transcript ID#

Note

An internet connection is needed for this function.

One gene may have different transcripts. For mutation modeling, it is important to specify one transcript. If you do not have this information in hand, you can use H2M to get it.

Again, both gene IDs and gene names are accepted as identifiers for human and mouse genes.

    list_tx_id_h = h2m.get_tx_id('TP53', 'h', ver=37)
  Genome assembly: GRCh37;
  The canonical transcript is: ENST00000269305.4;
  You can choose from the 17 transcripts below for further analysis:
  (1)ENST00000269305.4 (2)ENST00000413465.2 (3)ENST00000359597.4 (4)ENST00000504290.1 (5)ENST00000510385.1 (6)ENST00000504937.1 (7)ENST00000455263.2 (8)ENST00000420246.2 (9)ENST00000445888.2 (10)ENST00000576024.1 (11)ENST00000509690.1 (12)ENST00000514944.1 (13)ENST00000574684.1 (14)ENST00000505014.1 (15)ENST00000508793.1 (16)ENST00000604348.1 (17)ENST00000503591.1
    list_tx_id_m = h2m.get_tx_id('ENSMUSG00000059552', 'm')
  Genome assembly: GRCm39;
  The canonical transcript is: ENSMUST00000108658.10;
  You can choose from the 6 transcripts below for further analysis:
  (1)ENSMUST00000108658.10 (2)ENSMUST00000171247.8 (3)ENSMUST00000005371.12 (4)ENSMUST00000147512.2 (5)ENSMUST00000108657.4 (6)ENSMUST00000130540.2

Now you can use H2M to model your human mutations of interest.
You need at least the following information:

  1. Transcript ID of the human gene

  2. Transcript ID of the mouse gene

You also need the following information about the human variant, in MAF format:

  1. Start position

  2. End position

  3. Reference sequence

  4. Alternate sequence

  5. Type, one of 'SNP', 'DNP', 'TNP', 'ONP', 'INS', 'DEL'

  6. The version number of the human reference genome (37 or 38)
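
The inputs listed above can be gathered and sanity-checked before modeling. The following is a minimal sketch with a hypothetical `VariantInput` container; it is not part of H2M, and the field names are chosen for illustration. The example values are those of TP53 R273H used later in this tutorial.

```python
from dataclasses import dataclass

ALLOWED_TYPES = {"SNP", "DNP", "TNP", "ONP", "INS", "DEL"}
ALLOWED_VERSIONS = {37, 38}


@dataclass(frozen=True)
class VariantInput:
    tx_id_h: str   # transcript ID of the human gene
    tx_id_m: str   # transcript ID of the mouse gene
    start: int     # MAF start position
    end: int       # MAF end position
    ref_seq: str   # reference sequence
    alt_seq: str   # alternate sequence
    ty: str        # variant type
    ver: int       # human reference genome version

    def __post_init__(self):
        if self.ty not in ALLOWED_TYPES:
            raise ValueError(f"unknown variant type: {self.ty!r}")
        if self.ver not in ALLOWED_VERSIONS:
            raise ValueError(f"unsupported genome version: {self.ver!r}")
        if self.start > self.end:
            raise ValueError("start must not be greater than end")


r273h = VariantInput("ENST00000269305.4", "ENSMUST00000108658.10",
                     7577120, 7577120, "C", "T", "SNP", 37)
print(r273h)
```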

Modeling human variants in the mouse genome#

Basic usage#

Take TP53 R273H (ENST00000269305.4:c.818G>A) as an example.

    tx_id_h, tx_id_m = list_tx_id_h[3], list_tx_id_m[3]
    # use the canonical transcript
    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577120, 7577120, 'C','T', ty_h = 'SNP', ver = 37)
    model_result

Key               Value
gene_name_h       TP53
gene_id_h         ENSG00000141510.11
tx_id_h           ENST00000269305.4
chr_h             chr17
exon_num_h        10
strand_h          -
match             True
start_h           7577120
end_h             7577120
ref_seq_h         C
alt_seq_h         T
HGVSc_h           ENST00000269305.4:c.818G>A
HGVSp_h           R273H
classification_h  Missense
exon_h            E_7
type_h            SNP
status            True
class             0
statement         Class 0: This mutation can be originally modeled.
flank_size_left   4aa
flank_size_right  15aa
gene_name_m       Trp53
gene_id_m         ENSMUSG00000059552.14
tx_id_m           ENSMUST00000108658.10
chr_m             chr11
exon_num_m        10
strand_m          +
type_m            SNP
classification_m  Missense
exon_m            E_7
start_m_ori       69480434
end_m_ori         69480434
ref_seq_m_ori     G
alt_seq_m_ori     A
HGVSc_m_ori       ENSMUST00000108658.10:c.809G>A
HGVSp_m_ori       R270H
start_m           69480434
end_m             69480434
ref_seq_m         G
alt_seq_m         A
HGVSc_m           ENSMUST00000108658.10:c.809G>A
HGVSp_m           R270H

We can see that this human mutation can be originally modeled by introducing the same nucleotide alteration.
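
Note that the genomic input for the human variant is C>T while the coding change is c.818G>A. This is because TP53 lies on the minus strand (strand_h is -), whereas mouse Trp53 lies on the plus strand, so its genomic G>A reads the same as c.809G>A. A small sketch of this strand conversion follows; the helper names are illustrative and not H2M functions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_coding_strand(ref, alt, strand):
    """Express a genomic (ref, alt) pair on the transcript's coding strand."""
    if strand == "-":
        return reverse_complement(ref), reverse_complement(alt)
    return ref, alt


print(to_coding_strand("C", "T", "-"))  # human TP53, minus strand
print(to_coding_strand("G", "A", "+"))  # mouse Trp53, plus strand
```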

Flank Size#

The length of the identical sequence shared by human and mouse on the left and right side of the mutation is reported to give you a sense of the local homology and of how confident you can be in the fidelity of this modeling.

    pd.DataFrame(model_result)[['flank_size_left','flank_size_right']]

flank_size_left  flank_size_right
4aa              15aa
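
The idea behind a flank size can be sketched as follows: starting from the mutated residue, count how many neighbouring residues are identical in the two sequences on each side. How H2M computes the value internally is not shown in this tutorial, so this is only an illustration under that assumption. The two peptide strings are made-up toy sequences, not TP53 or Trp53.

```python
def flank_sizes(seq_a, seq_b, pos_a, pos_b):
    """Count identical residues directly left and right of two 0-based positions."""
    left = 0
    while (pos_a - left - 1 >= 0 and pos_b - left - 1 >= 0
           and seq_a[pos_a - left - 1] == seq_b[pos_b - left - 1]):
        left += 1
    right = 0
    while (pos_a + right + 1 < len(seq_a) and pos_b + right + 1 < len(seq_b)
           and seq_a[pos_a + right + 1] == seq_b[pos_b + right + 1]):
        right += 1
    return left, right


# Toy sequences: the residue of interest is the 'R' in each string.
toy_a = "GSKTARHLLQ"
toy_b = "PKTARHLMQ"
print(flank_sizes(toy_a, toy_b, toy_a.index("R"), toy_b.index("R")))
```

Here three residues match on the left (KTA) and two on the right (HL) before the sequences diverge.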

Result visualization#

By setting show_sequence = True, we can output the sequences of the wild-type and mutated human gene, as well as the wild-type, originally modeled, and alternatively modeled (if any exist) mouse gene. Modeling results obtained with show_sequence = True can be visualized directly with h2m.visualization.

    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577120, 7577120, 'C','T', ty_h = 'SNP', ver = 37, show_sequence=True)
    pd.DataFrame(model_result)
    h2m.visualization(model_result, flank_size=4, print_size=2)
_images/h2m_visual.png

Alternative modeling#

Sometimes the human mutation cannot be originally modeled in the mouse genome by using the same nucleotide alteration. Under this circumstance, alternative modeling strategies may be found by searching the codon list of the target amino acid.

  • Example 1: TP53 R306Q.

    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577021, 7577021, 'C','T', ty_h = 'SNP', ver = 37)
    pd.DataFrame(model_result)[['HGVSc_h','HGVSp_h',
                                'HGVSc_m_ori','HGVSp_m_ori',
                                'HGVSc_m','HGVSp_m']]

HGVSc_h      ENST00000269305.4:c.917G>A            ENST00000269305.4:c.917G>A
HGVSp_h      R306Q                                 R306Q
HGVSc_m_ori  ENSMUST00000108658.10:c.908G>A        ENSMUST00000108658.10:c.908G>A
HGVSp_m_ori  R303K                                 R303K
HGVSc_m      ENSMUST00000108658.10:c.907_908AG>CA  ENSMUST00000108658.10:c.907_909AGA>CAG
HGVSp_m      R303Q                                 R303Q
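
The codon search behind this result can be sketched as follows. The mouse codon at R303 is AGA, so the original G>A change gives AAA (lysine) rather than glutamine. Listing the glutamine codons (CAA, CAG) and comparing each with AGA yields the two alternatives shown above. This is an illustrative sketch built on the standard genetic code, not H2M's own implementation, and `alternative_edits` is a hypothetical name.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def alternative_edits(ref_codon, target_aa):
    """List (new_codon, offset, ref_bases, alt_bases) for each codon of target_aa."""
    edits = []
    for codon, aa in CODON_TABLE.items():
        if aa != target_aa or codon == ref_codon:
            continue
        # Positions that differ; the edit spans the first to the last of them.
        diff = [i for i in range(3) if codon[i] != ref_codon[i]]
        first, last = diff[0], diff[-1]
        edits.append((codon, first, ref_codon[first:last + 1], codon[first:last + 1]))
    # Prefer edits that touch the shortest stretch of bases.
    return sorted(edits, key=lambda e: (len(e[2]), e[0]))


print(CODON_TABLE["AGA"], CODON_TABLE["AAA"])  # original modeling: R becomes K
print(alternative_edits("AGA", "Q"))           # alternatives that give Q
```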

  • Example 2: TP53 R249_T253delinsS.

    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577523, 7577534, 'GTGAGGATGGGC', '-', ty_h = 'DEL', ver = 37)
    pd.DataFrame(model_result)[['HGVSc_h','HGVSp_h',
                                'HGVSc_m_ori','HGVSp_m_ori',
                                'HGVSc_m','HGVSp_m']]
_images/delins.png

The default maximum number of output alternatives is 5. You can change this with the max_alternative parameter.

    model_result_long = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577523, 7577534, 'GTGAGGATGGGC', '-', ty_h = 'DEL', ver = 37, max_alternative=10)
    len(model_result), len(model_result_long)
(5, 6)

If you do not want to alternatively model variants, you can set search_alternative to False.

    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577523, 7577534, 'GTGAGGATGGGC', '-', ty_h = 'DEL', ver = 37, search_alternative= False)
    model_result[0]['statement']
'Class 6: This mutation cannot be originally modeled.'

Original modeling with uncertain effects#

For frame-shifting mutations and mutations in non-coding regions, no alternative modeling strategy with the same protein change effect can be found. H2M will only offer the original modeling and its effect.
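
Whether an insertion or deletion in a coding region shifts the reading frame depends on whether its net length change is a multiple of three. The following sketch illustrates that check; it assumes the variant lies entirely within the coding sequence and treats '-' or an empty string as an empty allele. `is_frameshift` is a hypothetical helper, not an H2M function.

```python
def is_frameshift(ref_seq, alt_seq):
    """Return True if the net length change is not a multiple of three."""
    ref_len = len(ref_seq.replace("-", ""))
    alt_len = len(alt_seq.replace("-", ""))
    return (alt_len - ref_len) % 3 != 0


print(is_frameshift("", "A"))               # one-base insertion
print(is_frameshift("GTGAGGATGGGC", "-"))   # twelve-base deletion
```

The one-base insertion shifts the frame, while the twelve-base deletion removes four whole codons' worth of bases and stays in frame.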

  • Example 1: TP53 C275Lfs*31

    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, 
                                db_h, db_m, tx_id_h, tx_id_m, 
                                7577115, 7577116, '','A', ty_h = 'INS', ver = 37)
    pd.DataFrame(model_result)[['HGVSc_h','HGVSp_h',
                                'HGVSc_m','HGVSp_m']]

HGVSc_h  ENST00000269305.4:c.822_823>T
HGVSp_h  C275Lfs*31
HGVSc_m  ENSMUST00000108658.10:c.813_814>T
HGVSp_m  C272Lfs*24

  • Example 2: TP53 splice site mutation

    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, 
                            db_h, db_m, tx_id_h, tx_id_m, 7578555, 7578555, 
                            'C', 'T', ty_h = 'SNP', ver = 37)
    pd.DataFrame(model_result)[['HGVSc_h','HGVSp_h',
                            'HGVSc_m','HGVSp_m']]

HGVSc_h  ENST00000269305.4:c.376-1G>A
HGVSp_h  X125_splice
HGVSc_m  ENSMUST00000108658.10:c.367-1G>A
HGVSp_m  X122_splice

Additional Usage Hint#

Additional function 1: modeling M2H#

Replace human variant coordinates and sequences with murine ones, and set direction = 'm2h'. Use TP53 R273H as an example.

H2M:

    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577120, 7577120, 'C','T', ty_h = 'SNP', ver = 37)
    pd.DataFrame(model_result)[['start_h','end_h','ref_seq_h','alt_seq_h','HGVSp_h','start_m','end_m','ref_seq_m','alt_seq_m','HGVSp_m']]

start_h    7577120
end_h      7577120
ref_seq_h  C
alt_seq_h  T
HGVSp_h    R273H
start_m    69480434
end_m      69480434
ref_seq_m  G
alt_seq_m  A
HGVSp_m    R270H

M2H:

    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 
                            69480434, 69480434, 'G', 'A', ty_h = 'SNP', ver = 37, 
                            direction='m2h')
    pd.DataFrame(model_result)[['start_h','end_h','ref_seq_h','alt_seq_h','HGVSp_h','start_m','end_m','ref_seq_m','alt_seq_m','HGVSp_m']]

start_h    7577120
end_h      7577120
ref_seq_h  C
alt_seq_h  T
HGVSp_h    R273H
start_m    69480434
end_m      69480434
ref_seq_m  G
alt_seq_m  A
HGVSp_m    R270H

Additional function 2: modeling H2H/M2M paralogs#

Replace the reference genome and GENCODE annotation database input parameters accordingly. Take human SMARCA2 G1164R, modeled in its paralog SMARCA4, as an example.

    df = df[df['class']==1].reset_index(drop=True)
    tx_id_1_h, tx_id_2_h = h2m.get_tx_id('SMARCA2','h',ver=37)[3],h2m.get_tx_id('SMARCA4','h',ver=37)[3]
Genome assembly: GRCh37;
The canonical transcript is: ENST00000382203.1;
You can choose from the 17 transcripts below for further analysis:
(1)ENST00000382203.1 (2)ENST00000450198.1 (3)ENST00000457226.1 (4)ENST00000439732.1 (5)ENST00000382194.1 (6)ENST00000491574.1 (7)ENST00000452193.1 (8)ENST00000302401.3 (9)ENST00000423555.1 (10)ENST00000382186.1 (11)ENST00000417599.1 (12)ENST00000382185.1 (13)ENST00000382183.1 (14)ENST00000416751.1 (15)ENST00000349721.2 (16)ENST00000357248.2 (17)ENST00000324954.5

Genome assembly: GRCh37;
The canonical transcript is: ENST00000429416.3;
You can choose from the 20 transcripts below for further analysis:
(1)ENST00000429416.3 (2)ENST00000344626.4 (3)ENST00000541122.2 (4)ENST00000589677.1 (5)ENST00000444061.3 (6)ENST00000590574.1 (7)ENST00000591545.1 (8)ENST00000592604.1 (9)ENST00000586122.1 (10)ENST00000587988.1 (11)ENST00000591595.1 (12)ENST00000585799.1 (13)ENST00000592158.1 (14)ENST00000586892.1 (15)ENST00000538456.3 (16)ENST00000586985.1 (17)ENST00000586921.1 (18)ENST00000358026.2 (19)ENST00000413806.3 (20)ENST00000450717.3
    model_result = h2m.model(records_h,index_list_h, records_h, index_list_h, db_h, db_h, tx_id_1_h, tx_id_2_h, 
                            2115855, 2115855, 'G', 'A', ty_h = 'SNP', ver = 37,
                            direction='h2h')
    pd.DataFrame(model_result)[['gene_name_h_1','start_h_1','end_h_1','ref_seq_h_1','alt_seq_h_1','HGVSp_h_1','gene_name_h_2','start_h_2','end_h_2','ref_seq_h_2','alt_seq_h_2','HGVSp_h_2']]

gene_name_h_1  SMARCA2
start_h_1      2115855
end_h_1        2115855
ref_seq_h_1    G
alt_seq_h_1    A
HGVSp_h_1      G1164R
gene_name_h_2  SMARCA4
start_h_2      11143999
end_h_2        11143999
ref_seq_h_2    G
alt_seq_h_2    A
HGVSp_h_2      G1194R

Additional function 3: modeling for base editing#

When you set param = 'BE', you will get modeling results that can be achieved by base editing (A->G, G->A, C->T, T->C, AA->GG, etc.). If a mutation can be originally modeled in the mouse genome but not by base editing, alternative BE modeling strategies are returned too.

Take KEAP1 F221L as an example.

    h2m.query('KEAP1')
    Query human gene: KEAP1;
    Mouse ortholog(s): Keap1;
    Homology type: one2one;
    Sequence Simalarity(%):94.0705.
    tx_id_h_2, tx_id_m_2 = h2m.get_tx_id('KEAP1','h',ver=37, show=False)[3], h2m.get_tx_id('Keap1','m', show=False)[3]
    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h_2, tx_id_m_2, 10602915, 10602915, 'G','T', ty_h = 'SNP', ver = 37, param='BE')
    pd.DataFrame(model_result)[['HGVSc_h','HGVSp_h','HGVSc_m_ori','HGVSp_m_ori','statement','HGVSc_m','HGVSp_m']]

HGVSc_h      ENST00000171111.5:c.663C>A
HGVSp_h      F221L
HGVSc_m_ori  ENSMUST00000164812.8:c.663C>A
HGVSp_m_ori  F221L
statement    Class 1: This mutation can be alternatively modeled.
HGVSc_m      ENSMUST00000164812.8:c.661T>C
HGVSp_m      F221L
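
In the KEAP1 result, the original change c.663C>A is not one of the base-editing substitutions listed above, whereas the alternative c.661T>C is. A sketch of such a compatibility check is given below. It assumes that a change is base-editable when ref and alt have equal length and every differing position is one of A->G, G->A, C->T or T->C. The exact rule applied by H2M is not specified in this tutorial, and `is_base_editable` is a hypothetical name.

```python
# Single-base substitutions listed in this tutorial as base-editing changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def is_base_editable(ref_seq, alt_seq):
    """Return True if every differing base is a transition (assumed BE rule)."""
    if len(ref_seq) != len(alt_seq) or ref_seq == alt_seq:
        return False
    return all((r, a) in TRANSITIONS
               for r, a in zip(ref_seq, alt_seq) if r != a)


print(is_base_editable("C", "A"))  # c.663C>A
print(is_base_editable("T", "C"))  # c.661T>C
```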

Additional function 4: modeling by amino acid change input#

Set coor = 'aa' to model variants from amino acid change input. Use TP53 R175H as an example.

    model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 175, 175, 'R', 'H', coor = 'aa', ty_h = 'SNP', ver = 37)
    pd.DataFrame(model_result)

Key            0                                       1
gene_name_h    TP53                                    TP53
gene_id_h      ENSG00000141510.11                      ENSG00000141510.11
tx_id_h        ENST00000269305.4                       ENST00000269305.4
chr_h          chr17                                   chr17
exon_num_h     10                                      10
strand_h       -                                       -
match          True                                    True
start_h        7578405                                 7578405
end_h          7578407                                 7578407
ref_seq_m_ori  CGC                                     CGC
alt_seq_m_ori  CAC                                     CAC
HGVSc_m_ori    ENSMUST00000108658.10:c.514_516CGC>CAC  ENSMUST00000108658.10:c.514_516CGC>CAC
HGVSp_m_ori    R172H                                   R172H
start_m        69479338                                69479338
end_m          69479338                                69479339
ref_seq_m      G                                       GC
alt_seq_m      A                                       AT
HGVSc_m        ENSMUST00000108658.10:c.515G>A          ENSMUST00000108658.10:c.515_516GC>AT
HGVSp_m        R172H                                   R172H
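
The relation between an amino acid position and the coding positions in the output above is simple arithmetic: codon n covers coding positions 3n-2 to 3n. For the mouse residue 172 this gives c.514_516, as reported in HGVSc_m_ori. A minimal sketch, with a hypothetical helper name:

```python
def codon_to_cds_range(aa_position):
    """Return the 1-based (first, last) coding positions of a codon."""
    if aa_position < 1:
        raise ValueError("amino acid positions start at 1")
    last = 3 * aa_position
    return last - 2, last


print(codon_to_cds_range(172))  # mouse R172
print(codon_to_cds_range(175))  # human R175
```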

All of these can also be done in a batch-processing style by using h2m.model_batch.