Package xi_covutils
XI Cov Utils
Utilities to compute and analyze protein covariation
Description
This package provides tools that make it easier to work with protein covariation.
It is compatible with Python 3.
Package content
PDB to sequence mapper
from xi_covutils.pdbmapper import PDBSeqMapper
pdb_file = "WXYZ.pdb"
mapper = PDBSeqMapper()
sequence = "ACDEFGHIKLM"
chain = "A"
# Makes the alignment between the sequence and the PDB file.
mapper.align_sequence_to_pdb(sequence, pdb_file, chain)
# Retrieves the original sequence.
query = mapper.get_sequence() == sequence
# Retrieves the original sequence aligned to the PDB sequence.
aln_seq = mapper.get_aln_sequence()
# Retrieves the amino acid sequence of the PDB chain.
# Non-standard residues, HOH and HETERO are removed.
pdb_seq = mapper.get_pdb_sequence()
# Retrieves the aligned PDB sequence.
aln_pdb_seq = mapper.get_aln_pdb_sequence()
# Retrieves the sequence position index from the residue number
# annotated in the PDB file.
# The index starts at 1.
assert mapper.from_residue_number_to_seq(0) == 4
assert mapper.from_residue_number_to_seq(1) == 5
# Retrieves the residue number from the index position of the sequence.
# The index starts at 1.
assert mapper.from_seq_to_residue_number(4) == 0
assert mapper.from_seq_to_residue_number(5) == 1
Calculate distances between two regions in a PDB structure
from xi_covutils.distances import calculate_distances_between_regions
pdb_file = "5IZE.pdb"
chain1 = "A"
reg1 = [1, 2]
chain2 = "B"
reg2 = [7, 8]
distances = calculate_distances_between_regions(
pdb_file,
chain1,
chain2,
reg1,
reg2
)
assert distances == [
('A', 1, 'MET', 'CE', 'B', 7, 'ILE', 'CD1', 76.32719),
('A', 1, 'MET', 'SD', 'B', 8, 'HIS', 'CD2', 76.88578),
('A', 2, 'ASP', 'N', 'B', 7, 'ILE', 'CD1', 81.55944),
('A', 2, 'ASP', 'N', 'B', 8, 'HIS', 'CD2', 81.966064)
]
Load distances from MIToS
Compute mean number of intramolecular contacts for every chain
dist_data = [
('A', 1, 'A', 2, 1),
('A', 1, 'A', 3, 1),
('A', 1, 'A', 4, 1),
('A', 1, 'A', 5, 1),
('A', 2, 'A', 3, 1),
('A', 2, 'A', 4, 9),
('A', 2, 'A', 5, 1),
('A', 3, 'A', 4, 9),
('A', 3, 'A', 5, 9),
('A', 4, 'A', 5, 1),
('B', 1, 'A', 2, 10)
]
dist = Distances(dist_data)
mean_ic = dist.mean_intramolecular()
# After execution:
mean_ic = {'A': 2.8, 'B': 0.0}
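The value 2.8 for chain A can be checked by hand: seven intra-chain pairs lie within contact distance, each contact counts once for each of its two residues, and chain A has five residues, so 14 / 5 = 2.8. The sketch below reproduces that arithmetic in plain Python without the `Distances` class. The function name `mean_contacts_per_residue` and the cutoff of 6.05 are assumptions made for illustration, not values taken from the library.

```python
from collections import defaultdict

def mean_contacts_per_residue(dist_data, cutoff=6.05):
    """Mean number of same-chain contacts per residue, for each chain.

    The cutoff is a hypothetical contact threshold chosen for this sketch.
    """
    residues = defaultdict(set)  # chain -> residue numbers seen in any pair
    contacts = defaultdict(int)  # chain -> contact ends, counted per residue
    for chain1, pos1, chain2, pos2, dist in dist_data:
        residues[chain1].add(pos1)
        residues[chain2].add(pos2)
        if chain1 == chain2 and dist <= cutoff:
            contacts[chain1] += 2  # one contact for each residue of the pair
    return {chain: contacts[chain] / len(res) for chain, res in residues.items()}

dist_data = [
    ('A', 1, 'A', 2, 1), ('A', 1, 'A', 3, 1), ('A', 1, 'A', 4, 1),
    ('A', 1, 'A', 5, 1), ('A', 2, 'A', 3, 1), ('A', 2, 'A', 4, 9),
    ('A', 2, 'A', 5, 1), ('A', 3, 'A', 4, 9), ('A', 3, 'A', 5, 9),
    ('A', 4, 'A', 5, 1), ('B', 1, 'A', 2, 10),
]
result = mean_contacts_per_residue(dist_data)
print(result)  # {'A': 2.8, 'B': 0.0}
```

Chain B has a single residue that only appears in an inter-chain pair, so its mean is 0.0.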
MSA to sequence mapper
Sequence to MSA mapper
# MSA contains this:
# >Reference
# --------eeQ--D--rrE---G------W---LMG-----Vkesdw---
# >SEQ_1
# amnsrlsklqR--D--rrEatrG------W---LMG-----Vkesdw---
# >SEQ_2
# --------eeQ--D--rrE---Gas--llWc--LMGwiovnVkesdwmet
# >SEQ_3
# --------eeQ--Dth--E---G------W---LMG-----Vkesdw---
# >SEQ_4
# --------eeQaaDth--E---G------W---LMG-----Vkesdw---
msa_file = "my_msa.fasta"
motif = "RRDGWLMG"
mapped = map_sequence_to_reference(msa_file, motif)
# After execution:
mapped = {
1: {'position': 17, 'source': 'R', 'target': 'R'},
2: {'position': 18, 'source': 'R', 'target': 'R'},
3: {'position': 19, 'source': 'D', 'target': 'E'},
4: {'position': 23, 'source': 'G', 'target': 'G'},
5: {'position': 30, 'source': 'W', 'target': 'W'},
6: {'position': 34, 'source': 'L', 'target': 'L'},
7: {'position': 35, 'source': 'M', 'target': 'M'},
8: {'position': 36, 'source': 'G', 'target': 'G'}
}
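The `position` values above correspond to 1-based columns of the reference row in the alignment. The sketch below shows only that part of the mapping, from ungapped reference positions to alignment columns. `ungapped_to_column` is a hypothetical helper, and the step that aligns the motif to the reference is not shown.

```python
def ungapped_to_column(aligned_seq, gap="-"):
    """Map 1-based ungapped positions to 1-based alignment columns."""
    mapping = {}
    position = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != gap:
            position += 1
            mapping[position] = column
    return mapping

reference = "--------eeQ--D--rrE---G------W---LMG-----Vkesdw---"
columns = ungapped_to_column(reference)
# The ungapped reference is "eeQDrrEGWLMGVkesdw"; residues 5 to 12
# ("rrEGWLMG") are the ones paired with the motif in the example.
motif_columns = [columns[i] for i in range(5, 13)]
print(motif_columns)  # [17, 18, 19, 23, 30, 34, 35, 36]
```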
MSA gapstrip
# MSA contains this:
# >Reference
# --------eeQ--D--rrE---G------W---LMG-----Vkesdw---
# >SEQ_1
# amnsrlsklqR--D--rrEatrG------W---LMG-----Vkesdw---
# >SEQ_2
# --------eeQ--D--rrE---Gas--llWc--LMGwiovnVkesdwmet
# >SEQ_3
# --------eeQ--Dth--E---G------W---LMG-----Vkesdw---
# >SEQ_4
# --------eeQaaDth--E---G------W---LMG-----Vkesdw---
msa_file = "my_msa.fasta"
stripped = gapstrip(msa_file, use_reference=True)
# After execution:
stripped = [
SeqRecord(seq='eeQDrrEGWLMGVkesdw', id='Reference',
name='Reference', description='Reference', dbxrefs=[]),
SeqRecord(seq='lqRDrrEGWLMGVkesdw', id='SEQ_1',
name='SEQ_1', description='SEQ_1', dbxrefs=[]),
SeqRecord(seq='eeQDrrEGWLMGVkesdw', id='SEQ_2',
name='SEQ_2', description='SEQ_2', dbxrefs=[]),
SeqRecord(seq='eeQD--EGWLMGVkesdw', id='SEQ_3',
name='SEQ_3', description='SEQ_3', dbxrefs=[]),
SeqRecord(seq='eeQD--EGWLMGVkesdw', id='SEQ_4',
name='SEQ_4', description='SEQ_4', dbxrefs=[])]
stripped = gapstrip(msa_file, use_reference=False)
# After execution:
stripped = [
SeqRecord(seq='--------eeQ--D--rrE---G----W-LMG-----Vkesdw---',
id='Reference', name='Reference', description='Reference', dbxrefs=[]),
SeqRecord(seq='amnsrlsklqR--D--rrEatrG----W-LMG-----Vkesdw---',
id='SEQ_1', name='SEQ_1', description='SEQ_1', dbxrefs=[]),
SeqRecord(seq='--------eeQ--D--rrE---GasllWcLMGwiovnVkesdwmet',
id='SEQ_2', name='SEQ_2', description='SEQ_2', dbxrefs=[]),
SeqRecord(seq='--------eeQ--Dth--E---G----W-LMG-----Vkesdw---',
id='SEQ_3', name='SEQ_3', description='SEQ_3', dbxrefs=[]),
SeqRecord(seq='--------eeQaaDth--E---G----W-LMG-----Vkesdw---',
id='SEQ_4', name='SEQ_4', description='SEQ_4', dbxrefs=[])]
Sequences gapstrip
sequences = ["QW-RT-AS-F",
"-WEXTYAS-F",
"-WEYTYAS-F",
"-WEZTYAS-F"]
stripped = gapstrip_sequences(sequences)
# After execution:
stripped = ["QWRTASF", "-WXTASF", "-WYTASF", "-WZTASF"]
stripped = gapstrip_sequences(sequences, use_reference=False)
# After execution:
stripped = ["QW-RT-ASF", "-WEXTYASF", "-WEYTYASF", "-WEZTYASF"]
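The two modes differ only in which columns are kept: with a reference, every column where the first sequence has a gap is dropped; without one, only columns that are gaps in all sequences are dropped. A minimal sketch of that rule in plain Python follows. `gapstrip_columns` is a hypothetical name, and the sketch assumes that the first sequence acts as the reference, as the example output suggests.

```python
def gapstrip_columns(sequences, use_reference=True, gap="-"):
    """Drop gapped columns from equal-length aligned sequences."""
    if use_reference:
        # Keep the columns where the first (reference) sequence has no gap.
        keep = [i for i, char in enumerate(sequences[0]) if char != gap]
    else:
        # Keep the columns where at least one sequence has no gap.
        keep = [
            i for i in range(len(sequences[0]))
            if any(seq[i] != gap for seq in sequences)
        ]
    return ["".join(seq[i] for i in keep) for seq in sequences]

sequences = ["QW-RT-AS-F", "-WEXTYAS-F", "-WEYTYAS-F", "-WEZTYAS-F"]
print(gapstrip_columns(sequences))
# ['QWRTASF', '-WXTASF', '-WYTASF', '-WZTASF']
print(gapstrip_columns(sequences, use_reference=False))
# ['QW-RT-ASF', '-WEXTYASF', '-WEYTYASF', '-WEZTYASF']
```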
Pop reference of MSA sequence
msa_data = [
('s1', 'ATCTGACA'),
('s2', 'ATCTGACC'),
('s3', 'ATCTGACG'),
('s4', 'ATCTGACT')
]
msa_data = pop_reference(msa_data, 's3')
# After execution
msa_data = [
('s3', 'ATCTGACG'),
('s1', 'ATCTGACA'),
('s2', 'ATCTGACC'),
('s4', 'ATCTGACT')
]
msa_data = {
's1': 'ACTACG',
's2': 'CATCTG'
}
msa_data = pop_reference(msa_data, 's2')
# After execution
msa_data = [
('s2', 'CATCTG'),
('s1', 'ACTACG')
]
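For a list of `(id, sequence)` pairs, the reordering amounts to moving one entry to the front while keeping the others in their original order. A sketch of that idea is shown below. `move_to_front` is a hypothetical stand-in, not the library's implementation, and raising `ValueError` for a missing identifier is an assumption.

```python
def move_to_front(msa_data, reference_id):
    """Return a new list of (id, sequence) pairs with the reference first."""
    reference = [item for item in msa_data if item[0] == reference_id]
    if not reference:
        # Assumed behaviour for this sketch: fail loudly on an unknown id.
        raise ValueError(f"{reference_id} is not in the alignment")
    others = [item for item in msa_data if item[0] != reference_id]
    return reference + others

msa_data = [
    ('s1', 'ATCTGACA'),
    ('s2', 'ATCTGACC'),
    ('s3', 'ATCTGACG'),
    ('s4', 'ATCTGACT'),
]
reordered = move_to_front(msa_data, 's3')
print([seq_id for seq_id, _ in reordered])  # ['s3', 's1', 's2', 's4']
```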
Pick a reference sequence from an MSA
reference_sequence = "amnsrlsklqRDrrEatrGWLMGVkesdw"
msa_file = "some_msa.fasta"
ref = pick_reference(reference_sequence, msa_file)
assert len(ref) == 1
ref_id, ref_seq, match_type = ref[0]
assert ref_id == "SEQ_1"
assert ref_seq == "AMNSRLSKLQRDRREATRGWLMGVKESDW"
assert match_type == "IDENTICAL_MATCH"
Compare two MSA sequences
msa1 = [
("seq1", "QWERTY"),
("seq2", "QWERTY"),
]
msa2 = [
("seq1", "QWERTY"),
("seq2", "QWERTY"),
]
result = compare_two_msa(msa1, msa2)
assert result == {
'msa1_n_sequences': 2,
'msa2_n_sequences': 2,
'has_same_number_of_sequences': True,
'identical_descriptions': True,
'identical_has_same_order': True,
'ungapped': {
'identical_seqs': True,
'has_same_order': True,
'corresponds_with_desc': True
},
'gapped': {
'identical_seqs': True,
'has_same_order': True,
'corresponds_with_desc': True
},
'identical_msa': True
}
Download MSA from Pfam
Calculate gap content
from pytest import approx
msa_data = [
('s1', "QWERTY"),
('s2', "QWERTY"),
('s3', "------")
]
assert gap_content(msa_data) == approx(1.0/3)
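The expected value of 1/3 is the number of gap characters divided by the total number of characters: 6 gaps out of 18 positions. A plain-Python sketch of that ratio is given below; `gap_fraction` is a hypothetical name used only for illustration.

```python
from math import isclose

def gap_fraction(msa_data, gap="-"):
    """Fraction of gap characters over all positions of all sequences."""
    total = sum(len(seq) for _, seq in msa_data)
    gaps = sum(seq.count(gap) for _, seq in msa_data)
    return gaps / total

msa_data = [('s1', "QWERTY"), ('s2', "QWERTY"), ('s3', "------")]
fraction = gap_fraction(msa_data)
print(isclose(fraction, 1.0 / 3))  # True
```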
Calculate gap content by column
msa_data = [
('s1', "-AAAA---"),
('s2', "--BBBA--"),
('s3', "---CCCC-"),
('s4', "----DCCD"),
]
gaps = gap_content_by_column(msa_data)
assert gaps[0] == 1
assert gaps[1] == 0.75
assert gaps[2] == 0.5
assert gaps[3] == 0.25
assert gaps[4] == 0
assert gaps[5] == 0.25
assert gaps[6] == 0.5
assert gaps[7] == 0.75
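The per-column values can be reproduced by transposing the alignment and counting gaps in each column. A minimal sketch, with `gap_fraction_by_column` as a hypothetical name:

```python
def gap_fraction_by_column(msa_data, gap="-"):
    """Fraction of gaps in each column of equal-length aligned sequences."""
    # zip(*...) transposes the rows of the alignment into columns.
    columns = zip(*(seq for _, seq in msa_data))
    return [column.count(gap) / len(column) for column in columns]

msa_data = [
    ('s1', "-AAAA---"),
    ('s2', "--BBBA--"),
    ('s3', "---CCCC-"),
    ('s4', "----DCCD"),
]
by_column = gap_fraction_by_column(msa_data)
print(by_column)  # [1.0, 0.75, 0.5, 0.25, 0.0, 0.25, 0.5, 0.75]
```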
ROC curve and AUC calculation
# Merge covariation scores and contact distance
dist_elems = [
('A', 1, 'A', 2, 6.01),
('A', 1, 'A', 3, 6.02),
('A', 1, 'A', 4, 6.13),
('A', 2, 'A', 3, 6.24),
('A', 2, 'A', 4, 6.35),
]
distances = Distances(dist_elems)
scores = {
(('A', 1), ('A', 2)) : 0.11,
(('A', 1), ('A', 3)) : 0.12,
(('A', 1), ('A', 4)) : 0.13,
(('A', 2), ('A', 3)) : 0.14,
(('A', 2), ('A', 4)) : 0.15,
(('A', 3), ('A', 4)) : 0.16,
}
merged = merge_scores_and_distances(scores, distances)
# After execution (the merged list is not ordered):
merged = [
(0.11, True),
(0.12, True),
(0.13, False),
(0.14, False),
(0.15, False),
]
# Get binary classification of contacts
merged = [
(0.11, True),
(0.12, True),
(0.13, False),
(0.14, False),
(0.15, False),
]
binary = binary_from_merged(merged)
# After execution:
binary = [False, False, False, True, True]
# Calculate the ROC and precision-recall curves from the binary list
binary = [False, False, False, True, True]
curve1 = curve(binary, method='roc')
curve2 = curve(binary, method='precision_recall')
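The return format of `curve` is not shown here, so the following is only a sketch of what a ROC computation over such a binary list involves. It assumes the list is ordered from the highest to the lowest score and contains at least one `True` and one `False`. The names `roc_points` and `auc` are hypothetical and do not refer to the library's functions.

```python
def roc_points(binary):
    """ROC points (FPR, TPR) for labels sorted from highest to lowest score."""
    positives = sum(binary)
    negatives = len(binary) - positives
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for is_contact in binary:
        if is_contact:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points

def auc(points):
    """Area under a curve given as (x, y) points, by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

binary = [False, False, False, True, True]
area = auc(roc_points(binary))
print(area)  # 0.0
```

In this toy data the contacts carry the lowest scores, so the area is 0.0; with the order reversed, the area would be 1.0.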
Load results from ccmpred output
cov_data = from_ccmpred(cov_file)
Calculate smoothed covariation
cov_data = from_ccmpred(cov_file)
smoothed = smooth_cov(cov_data)
Run mkdssp
from xi_covutils.mkdssp import mkdssp
results = mkdssp(a_pdb_file)
# After execution
# results =
# {('A', 1): {'aa': 'K', 'chain': 'A', 'index': 1, 'pdb_num': 1, 'structure': ''},
# ('A', 2): {'aa': 'V', 'chain': 'A', 'index': 2, 'pdb_num': 2, 'structure': ''},
# ('A', 3): {'aa': 'S', 'chain': 'A', 'index': 3, 'pdb_num': 3, 'structure': ''},
# ('A', 4): {'aa': 'G', 'chain': 'A', 'index': 4, 'pdb_num': 4, 'structure': ''},
# ('A', 5): {'aa': 'T', 'chain': 'A', 'index': 5, 'pdb_num': 5, 'structure': ''},
# ('A', 6): {'aa': 'V', 'chain': 'A', 'index': 6, 'pdb_num': 6, 'structure': ''},
# ('A', 7): {'aa': 'C', 'chain': 'A', 'index': 7, 'pdb_num': 7, 'structure': ''},
# etc
# }
Split results of paired MSA in inter and intra chain covariation
Sequence clustering using Hobohm-1 algorithm
from xi_covutils.clustering import hobohm1
sequences = [
'ABCDEFGHIJ',
'ABCDEFGHIZ',
'ABCDEFZXCW',
'ABCDEFZXCK'
]
results = hobohm1(sequences)
# After execution:
# results = [
# Cluster:[2][ABCDEFGHIJ] ABCDEFGHIJ, ABCDEFGHIZ,
# Cluster:[2][ABCDEFZXCW] ABCDEFZXCW, ABCDEFZXCK
# ]
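The grouping above is what a greedy, representative-based clustering produces. The sketch below illustrates the general Hobohm-1 idea: each sequence joins the first cluster whose representative is similar enough, otherwise it starts a new cluster. The identity measure, the cutoff of 0.62, and the names `identity` and `hobohm1_sketch` are assumptions made for this example; the library's own defaults may differ. The usual preliminary sort by length is omitted because all sequences here have the same length.

```python
def identity(seq1, seq2):
    """Fraction of identical positions between two equal-length sequences."""
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return matches / len(seq1)

def hobohm1_sketch(sequences, cutoff=0.62):
    """Greedy clustering; the first member of a cluster is its representative."""
    clusters = []
    for seq in sequences:
        for cluster in clusters:
            if identity(cluster[0], seq) >= cutoff:
                cluster.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append([seq])
    return clusters

sequences = ['ABCDEFGHIJ', 'ABCDEFGHIZ', 'ABCDEFZXCW', 'ABCDEFZXCK']
clusters = hobohm1_sketch(sequences)
print(clusters)
# [['ABCDEFGHIJ', 'ABCDEFGHIZ'], ['ABCDEFZXCW', 'ABCDEFZXCK']]
```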
Sequence clustering using kmers
from xi_covutils.clustering import kmer_clustering
sequences = [
'ABCDEFGHIJ',
'ABCDEFGHIZ',
'ABCDEFZXCW',
'ABCDEFZXCK'
]
results = kmer_clustering(sequences)
# After execution:
# results = [
# Cluster:[2][ABCDEFGHIJ] ABCDEFGHIJ, ABCDEFGHIZ,
# Cluster:[2][ABCDEFZXCW] ABCDEFZXCW, ABCDEFZXCK
# ]
Read contact map files from SCPE
from io import StringIO
content = StringIO("""Tertiary contact map
Warning!. Position 2 has 0 contacts
Quaternary contact map
Warning!. Position 2 has 0 contacts
Terciary
0 1 1
1 0 0
1 0 0
Quaternary
0 0 1
0 0 1
1 1 0
Tertiary Total contacts Quaternary Total contacts
2 2 0
0 0 0
3 3 0
""")
c_map = contact_map_from_scpe(content, quaternary=False)
# After execution:
# c_map = {
# (1, 1) : 0,
# (1, 2) : 1,
# (1, 3) : 1,
# (2, 1) : 1,
# (2, 2) : 0,
# (2, 3) : 0,
# (3, 1) : 1,
# (3, 2) : 0,
# (3, 3) : 0,
# }
Read contact map files from text file
from io import StringIO
content = StringIO("""
0 1 1
1 0 0
1 0 0
""")
c_map = contact_map_from_text(content)
# After execution:
# c_map = {
# (1, 1) : 0,
# (1, 2) : 1,
# (1, 3) : 1,
# (2, 1) : 1,
# (2, 2) : 0,
# (2, 3) : 0,
# (3, 1) : 1,
# (3, 2) : 0,
# (3, 3) : 0,
# }
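The resulting dictionary is keyed by 1-based `(row, column)` pairs. A plain-Python sketch of parsing such a whitespace-separated matrix follows. `parse_contact_matrix` is a hypothetical name, and skipping blank lines is an assumption based on the example input.

```python
from io import StringIO

def parse_contact_matrix(handle):
    """Read a 0/1 matrix into a dict keyed by 1-based (row, column)."""
    rows = [line.split() for line in handle if line.strip()]
    c_map = {}
    for i, row in enumerate(rows, start=1):
        for j, value in enumerate(row, start=1):
            c_map[(i, j)] = int(value)
    return c_map

content = StringIO("""
0 1 1
1 0 0
1 0 0
""")
parsed = parse_contact_matrix(content)
print(parsed[(1, 2)], parsed[(2, 3)])  # 1 0
```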
Create Distances objects from contact map
from io import StringIO
content = StringIO("""
0 1 1
1 0 0
1 0 0
""")
c_map = contact_map_from_text(content)
dist = Distances.from_contact_map(c_map)
Calculate entropy by column
seqs = [
"ACTACTATCTAGCTAGC",
"ACTACTGATGCACTGTG",
"ACTACTGATCTACTGAG"
]
results = entropy(seqs, False, 62)
# expected_results = [
# -0.0, -0.0, -0.0, -0.0, -0.0, -0.0,
# 0.6931471805599453, 0.6931471805599453,
# 0.6931471805599453, 1.0397207708399179,
# 1.0397207708399179, 0.6931471805599453,
# -0.0, -0.0,
# 0.6931471805599453, 1.0397207708399179,
# 0.6931471805599453]
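The expected values are consistent with a Shannon entropy in natural-log units in which the second and third sequences share one cluster and therefore carry half a weight each. This reading is inferred from the numbers above, not from the library's source, and the meaning of the arguments `False` and `62` is not assumed here. A sketch with explicit weights, using the hypothetical name `weighted_column_entropy`:

```python
from collections import defaultdict
from math import isclose, log

def weighted_column_entropy(seqs, weights):
    """Shannon entropy (natural log) of each column with sequence weights."""
    total = sum(weights)
    entropies = []
    for column in zip(*seqs):
        freq = defaultdict(float)
        for char, weight in zip(column, weights):
            freq[char] += weight
        entropies.append(
            -sum(w / total * log(w / total) for w in freq.values())
        )
    return entropies

seqs = [
    "ACTACTATCTAGCTAGC",
    "ACTACTGATGCACTGTG",
    "ACTACTGATCTACTGAG",
]
# Assumed weights: the last two sequences are treated as one cluster.
values = weighted_column_entropy(seqs, [1.0, 0.5, 0.5])
print(isclose(values[6], log(2)), isclose(values[9], 1.0397207708399179))
# True True
```

A column with residues `T`, `G`, `C` gets weighted frequencies 0.5, 0.25, 0.25, which gives 0.5 ln 2 + 0.5 ln 4 ≈ 1.0397.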
How to install
> pip install dist/xi_covutils-x.y.z
Dependencies
- biopython >= 1.72
- requests
Development dependencies
In development_requirements.txt
Running tests
> pytest tests
Documentation
Automatic documentation from code is in the 'docs' folder.
Sub-modules
xi_covutils.blast
xi_covutils.blast_api: Makes Blast API calls …
xi_covutils.clustering: Clustering functions
xi_covutils.compute: Compute covariation using external programs.
xi_covutils.conservation: Computes conservation for a collection of protein sequences.
xi_covutils.distances: Functions and classes to work with residue distances in protein structures
xi_covutils.fastafilter: Filter fasta sequences.
xi_covutils.fastq: A simple module to work with fastq files
xi_covutils.identity
xi_covutils.matrices: Background and joint amino acid substitution frequencies used to calculate BLOSUMs matrix …
xi_covutils.mkdssp: Run mkdssp to get secondary structure information from a pdb …
xi_covutils.msa: MSA functions
xi_covutils.pdb_align: This module has functions to align PDB structures.
xi_covutils.pdbbank: Compute some stuff on PDB file
xi_covutils.pdbmapper: Functions to map positions from PDB files to sequence files.
xi_covutils.primers
xi_covutils.read_results: Read results from covariation files
xi_covutils.roc: Functions to compute ROC curves and calculate AUC scores.
xi_covutils.seqmapper: Sequence Mapper
xi_covutils.seqs
xi_covutils.smooth: Functions to compute smooth covariation scores
xi_covutils.taxonomy: Taxonomy module …
xi_covutils.xi_covutils_app: XI Cov Utils - Command line interface