Documentation¶
sonia.sonia¶
Created on Wed Jan 30 12:06:58 2019
@author: zacharysethna and Giulio Isacchini
Classes¶
Sonia(features=[], data_seqs=[], gen_seqs=[], chain_type='humanTRB', load_dir=None, feature_file=None, model_file=None, data_seq_file=None, gen_seq_file=None, log_file=None, load_seqs=True, l2_reg=0.0, min_energy_clip=-5, max_energy_clip=10, seed=None, vj=False)
Class used to infer a Q selection model.
Attributes
----------
features : ndarray
Array of feature lists. Each list contains individual subfeatures which
all must be satisfied.
features_dict : dict
Dictionary keyed by tuples of the feature lists. Values are the index
of the feature, i.e. self.features[self.features_dict[tuple(f)]] = f.
constant_features : list
List of feature strings whose parameters are not updated during learning.
These features are still used to compute energies (not currently used)
data_seqs : list
Data sequences used to infer selection model. Note, each 'sequence'
is a list where the first element is the CDR3 sequence which is
followed by any V or J genes.
gen_seqs : list
Sequences from generative distribution used to infer selection model.
Note, each 'sequence' is a list where the first element is the CDR3
sequence which is followed by any V or J genes.
data_seq_features : list
Lists of features that data_seqs project onto
gen_seq_features : list
Lists of features that gen_seqs project onto
data_marginals : ndarray
Array of the marginals of each feature over data_seqs
gen_marginals : ndarray
Array of the marginals of each feature over gen_seqs
model_marginals : ndarray
Array of the marginals of each feature over the model weighted gen_seqs
L1_converge_history : list
L1 distance between data_marginals and model_marginals at each
iteration.
chain_type : str
Type of receptor. This specification is used to determine gene names
and allow integrated OLGA sequence generation. Options: 'humanTRA',
'humanTRB' (default), 'humanIGH', 'humanIGL', 'humanIGK' and 'mouseTRB'.
l2_reg : float or None
L2 regularization strength. If None or 0.0 (the constructor default), no regularization is applied.
Methods
----------
seq_feature_proj(feature, seq)
Determines if a feature matches/is found in a sequence.
find_seq_features(seq, features = None)
Determines all model features of a sequence.
compute_seq_energy(seq_features = None, seq = None)
Computes the energy, as determined by the model, of a sequence.
compute_energy(seqs_features)
Computes the energies of a list of seq_features according to the model.
compute_marginals(self, features = None, seq_model_features = None, seqs = None, use_flat_distribution = False)
Computes the marginals of features over a set of sequences.
infer_selection(self, epochs = 10, batch_size=5000, initialize = True, seed = None)
Infers model parameters (energies for each feature).
update_model_structure(self,output_layer=[],input_layer=[],initialize=False)
Sets keras model structure and compiles.
update_model(self, add_data_seqs = [], add_gen_seqs = [], add_features = [], remove_features = [], add_constant_features = [], auto_update_marginals = False, auto_update_seq_features = False)
Updates model by adding/removing model features or data/generated seqs.
Marginals and seq_features can also be updated.
add_generated_seqs(self, num_gen_seqs = 0, reset_gen_seqs = True)
Generates synthetic sequences using OLGA and adds them to gen_seqs.
plot_model_learning(self, save_name = None)
Plots current marginal scatter plot as well as L1 convergence history.
save_model(self, save_dir, attributes_to_save = None)
Saves the model.
load_model(self, load_dir, load_seqs = True)
Loads a model.
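The data layout described in the attributes above can be sketched in plain Python. This is a toy illustration only: the feature strings ('l3', 'aC0', 'vTRBV1', ...) are hypothetical placeholders chosen for the example, and nothing here calls the sonia package.

```python
# Toy layout only; the feature strings below are hypothetical placeholders.
features = [["l3"], ["aC0"], ["aF2"], ["vTRBV1", "jTRBJ1"]]

# Key each feature list by its tuple form; the value is its index in `features`.
features_dict = {tuple(f): i for i, f in enumerate(features)}

# A 'sequence' is a list: the CDR3 first, followed by any V or J genes.
seq = ["CAF", "TRBV1", "TRBJ1"]
cdr3, genes = seq[0], seq[1:]

# Round trip stated in the features_dict description:
# features[features_dict[tuple(f)]] == f for every feature list f.
round_trip_ok = all(features[features_dict[tuple(f)]] == f for f in features)
```

Tuples are used as keys because lists are not hashable in Python.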
Methods¶
add_generated_seqs(self, num_gen_seqs=0, reset_gen_seqs=True, custom_model_folder=None, add_error=False, custom_error=None)
Generates MonteCarlo sequences for gen_seqs using OLGA.
Only generates seqs from a V(D)J model. Requires the OLGA package
(pip install olga).
Parameters
----------
num_gen_seqs : int or float
Number of MonteCarlo sequences to generate and add to the specified
sequence pool.
custom_model_folder : str
Path to a folder specifying a custom IGoR formatted model to be
used as a generative model. Folder must contain 'model_params.txt'
and 'model_marginals.txt'
add_error: bool
Simulate sequencing error. Default is False.
custom_error: int
Custom error rate for the simulated sequencing error.
Default is the one inferred by IGoR.
Attributes set
--------------
gen_seqs : list
MonteCarlo sequences drawn from a VDJ recomb model
gen_seq_features : list
Features gen_seqs have been projected onto.
compute_energy(self, seqs_features)
Computes the energy of a list of sequences according to the model.
Parameters
----------
seqs_features : list
List of sequences encoded as sonia features.
Returns
-------
E : float
Energies of seqs according to the model.
compute_marginals(self, features=None, seq_model_features=None, seqs=None, use_flat_distribution=False, output_dict=False)
Computes the marginals of each feature over sequences.
Computes marginals either with a flat distribution over the sequences
or weighted by the model energies. Note, finding the features of each
sequence takes time and should be avoided if it has already been done.
If computing marginals of model features use the default setting to
prevent searching for the model features a second time. Similarly, if
seq_model_features has already been determined use this to avoid
recalculating it.
Parameters
----------
features : list or None
List of features. This does not need to match the model
features. If None (default) the model features will be used.
seq_model_features : list
Indices of model features seqs project onto.
seqs : list
List of sequences to compute the feature marginals over. Note, each
'sequence' is a list where the first element is the CDR3 sequence
which is followed by any V or J genes.
use_flat_distribution : bool
Marginals will be computed using a flat distribution (each seq is
weighted as 1) if True. If False, the marginals are computed using
model weights (each sequence is weighted as exp(-E) = Q). Default
is False.
Returns
-------
marginals : ndarray or dict
Marginals of model features over seqs.
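The difference between flat and model-weighted marginals can be sketched as follows. This is a minimal stand-alone illustration, not the library's implementation: the function name toy_marginals and its arguments are hypothetical, and the only rule taken from the text above is that each sequence is weighted by 1 (flat) or by exp(-E) = Q (model).

```python
import math

def toy_marginals(seq_features, energies, n_features, use_flat_distribution=False):
    """Weighted fraction of sequences that carry each feature (toy sketch)."""
    if use_flat_distribution:
        weights = [1.0] * len(seq_features)          # every sequence counts as 1
    else:
        weights = [math.exp(-e) for e in energies]   # weight exp(-E) = Q
    total = sum(weights)
    marginals = [0.0] * n_features
    for feats, w in zip(seq_features, weights):
        for i in feats:
            marginals[i] += w / total
    return marginals

# Three sequences, already projected onto feature indices.
seq_features = [[0, 1], [0, 2], [1, 2]]
energies = [0.0, 0.0, math.log(2.0)]   # Q = 1, 1 and 0.5

flat = toy_marginals(seq_features, energies, 3, use_flat_distribution=True)
weighted = toy_marginals(seq_features, energies, 3)
```

Passing already projected feature indices mirrors the advice above to avoid searching for the features of each sequence a second time.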
compute_seq_energy(self, seq=None, seq_features=None)
Computes the energy of a sequence according to the model.
Parameters
----------
seq : list
CDR3 sequence and any associated genes
seq_features : list
Feature indices seq projects onto.
Returns
-------
E : float
Energy of seq according to the model.
find_seq_features(self, seq, features=None)
Finds which features match seq
Parameters
----------
seq : list
CDR3 sequence and any associated genes
features : ndarray
Array of feature lists. Each list contains individual subfeatures which
all must be satisfied.
Returns
-------
seq_features : list
Indices of features seq projects onto.
infer_selection(self, epochs=10, batch_size=5000, initialize=True, seed=None, validation_split=0.2, monitor=False, verbose=0)
Infer model parameters, i.e. energies for each model feature.
Parameters
----------
epochs : int
Maximum number of learning epochs
initialize : bool
Resets data shuffle
batch_size : int
Size of the batches in the inference
seed : int
Sets random seed
Attributes set
--------------
model : keras model
Parameters of the model
model_marginals : array
Marginals over the generated sequences, reweighted by the model.
L1_converge_history : list
L1 distance between data_marginals and model_marginals at each
iteration.
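As an illustration of the convergence measure recorded in L1_converge_history, the sketch below computes an L1 distance between two marginal vectors once per iteration. The helper name and the toy numbers are assumptions, and whether the library sums or averages the absolute differences is not stated here; the sketch sums them.

```python
def toy_l1_distance(data_marginals, model_marginals):
    """Sum of absolute differences between two marginal vectors (assumed form)."""
    return sum(abs(d - m) for d, m in zip(data_marginals, model_marginals))

data_marginals = [0.5, 0.3, 0.2]

# Hypothetical model marginals after three successive iterations.
model_marginals_per_iteration = [
    [0.2, 0.2, 0.2],
    [0.4, 0.3, 0.2],
    [0.5, 0.3, 0.2],
]

toy_history = [toy_l1_distance(data_marginals, m)
               for m in model_marginals_per_iteration]
```

A decreasing history indicates that the model marginals are approaching the data marginals.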
load_model(self, load_dir=None, load_seqs=True, feature_file=None, model_file=None, data_seq_file=None, gen_seq_file=None, log_file=None, verbose=True)
Loads model from directory.
Parameters
----------
load_dir : str
Directory name to load model attributes from.
save_model(self, save_dir, attributes_to_save=None, force=True)
Saves model parameters and sequences
Parameters
----------
save_dir : str
Directory name to save model attributes to.
attributes_to_save: list
name of attributes to save
seq_feature_proj(self, feature, seq)
Checks if a sequence matches all subfeatures of the feature list
Parameters
----------
feature : list
List of individual subfeatures the sequence must match
seq : list
CDR3 sequence and any associated genes
Returns
-------
bool
True if seq matches feature else False.
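The "all subfeatures must be satisfied" rule can be sketched with a toy matcher. The subfeature encoding used here ('l' + length, 'a' + amino acid + position, 'v'/'j' + gene name) is an assumption made for the example and may differ from the encoding the package actually uses; the helper names are hypothetical.

```python
def toy_subfeature_match(sub, seq):
    """Check one subfeature string against a [cdr3, gene, ...] sequence (assumed encoding)."""
    cdr3, genes = seq[0], seq[1:]
    kind, rest = sub[0], sub[1:]
    if kind == "l":                      # CDR3 length, e.g. 'l3'
        return len(cdr3) == int(rest)
    if kind == "a":                      # amino acid at a position, e.g. 'aF2' or 'aF-1'
        aa, pos = rest[0], int(rest[1:])
        return -len(cdr3) <= pos < len(cdr3) and cdr3[pos] == aa
    if kind in ("v", "j"):               # gene name, e.g. 'vTRBV1'
        return rest in genes
    return False

def toy_feature_proj(feature, seq):
    """True only if every subfeature in the feature list matches."""
    return all(toy_subfeature_match(s, seq) for s in feature)

def toy_find_features(seq, features):
    """Indices of the features that seq projects onto."""
    return [i for i, f in enumerate(features) if toy_feature_proj(f, seq)]

toy_features = [["l3"], ["aC0"], ["aF2"], ["l3", "aA1"],
                ["vTRBV1", "jTRBJ1"], ["l4"], ["aW-1"]]
toy_seq = ["CAF", "TRBV1", "TRBJ1"]
matched = toy_find_features(toy_seq, toy_features)
```

The two-element feature ['l3', 'aA1'] matches only because both the length and the amino acid condition hold for the same sequence.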
update_model(self, add_data_seqs=[], add_gen_seqs=[], add_features=[], remove_features=[], add_constant_features=[], auto_update_marginals=False, auto_update_seq_features=False)
Updates the model attributes
This method is used to add/remove model features or data/generated
sequences. These changes will be propagated through the class to update
any other attributes that need to match (e.g. the marginals or
seq_features).
Parameters
----------
add_data_seqs : list
List of CDR3 sequences to add to data_seq pool.
add_gen_seqs : list
List of CDR3 sequences to add to gen_seq pool.
add_features : list
List of feature lists to add to self.features
remove_features : list
List of feature lists and/or indices to remove from self.features
add_constant_features : list
List of feature lists to add to constant features. (Not currently used)
auto_update_marginals : bool
Specifies to update marginals.
auto_update_seq_features : bool
Specifies to update seq features.
Attributes set
--------------
features : list
List of model features
data_seq_features : list
Features data_seqs have been projected onto.
gen_seq_features : list
Features gen_seqs have been projected onto.
data_marginals : ndarray
Marginals over the data sequences for each model feature.
gen_marginals : ndarray
Marginals over the generated sequences for each model feature.
model_marginals : ndarray
Marginals over the generated sequences, reweighted by the model,
for each model feature.
update_model_structure(self, output_layer=[], input_layer=[], initialize=False)
Defines the model structure and compiles it.
Parameters
----------
structure : Sequential Model Keras
structure of the model
initialize: bool
If True, initializes to a linear model; otherwise updates to the new structure.
sonia.sonia_leftpos_rightpos¶
@author: zacharysethna
Classes¶
SoniaLeftposRightpos(data_seqs=[], gen_seqs=[], chain_type='humanTRB', load_dir=None, feature_file=None, data_seq_file=None, gen_seq_file=None, log_file=None, load_seqs=True, max_depth=25, max_L=30, include_indep_genes=False, include_joint_genes=True, min_energy_clip=-5, max_energy_clip=10, seed=None, custom_pgen_model=None, l2_reg=0.0, vj=False)
Class used to infer a Q selection model.
Attributes
----------
features : ndarray
Array of feature lists. Each list contains individual subfeatures which
all must be satisfied.
features_dict : dict
Dictionary keyed by tuples of the feature lists. Values are the index
of the feature, i.e. self.features[self.features_dict[tuple(f)]] = f.
constant_features : list
List of feature strings whose parameters are not updated during learning.
These features are still used to compute energies (not currently used)
data_seqs : list
Data sequences used to infer selection model. Note, each 'sequence'
is a list where the first element is the CDR3 sequence which is
followed by any V or J genes.
gen_seqs : list
Sequences from generative distribution used to infer selection model.
Note, each 'sequence' is a list where the first element is the CDR3
sequence which is followed by any V or J genes.
data_seq_features : list
Lists of features that data_seqs project onto
gen_seq_features : list
Lists of features that gen_seqs project onto
data_marginals : ndarray
Array of the marginals of each feature over data_seqs
gen_marginals : ndarray
Array of the marginals of each feature over gen_seqs
model_marginals : ndarray
Array of the marginals of each feature over the model weighted gen_seqs
L1_converge_history : list
L1 distance between data_marginals and model_marginals at each
iteration.
chain_type : str
Type of receptor. This specification is used to determine gene names
and allow integrated OLGA sequence generation. Options: 'humanTRA',
'humanTRB' (default), 'humanIGH', 'humanIGL', 'humanIGK' and 'mouseTRB'.
l2_reg : float or None
L2 regularization strength. If None or 0.0 (the constructor default), no regularization is applied.
Methods
----------
seq_feature_proj(feature, seq)
Determines if a feature matches/is found in a sequence.
find_seq_features(seq, features = None)
Determines all model features of a sequence.
compute_seq_energy(seq_features = None, seq = None)
Computes the energy, as determined by the model, of a sequence.
compute_energy(seqs_features)
Computes the energies of a list of seq_features according to the model.
compute_marginals(self, features = None, seq_model_features = None, seqs = None, use_flat_distribution = False)
Computes the marginals of features over a set of sequences.
infer_selection(self, epochs = 10, batch_size=5000, initialize = True, seed = None)
Infers model parameters (energies for each feature).
update_model_structure(self,output_layer=[],input_layer=[],initialize=False)
Sets keras model structure and compiles.
update_model(self, add_data_seqs = [], add_gen_seqs = [], add_features = [], remove_features = [], add_constant_features = [], auto_update_marginals = False, auto_update_seq_features = False)
Updates model by adding/removing model features or data/generated seqs.
Marginals and seq_features can also be updated.
add_generated_seqs(self, num_gen_seqs = 0, reset_gen_seqs = True)
Generates synthetic sequences using OLGA and adds them to gen_seqs.
plot_model_learning(self, save_name = None)
Plots current marginal scatter plot as well as L1 convergence history.
save_model(self, save_dir, attributes_to_save = None)
Saves the model.
load_model(self, load_dir, load_seqs = True)
Loads a model.
Methods¶
add_features(self, include_indep_genes=False, include_joint_genes=True, custom_pgen_model=None)
Generates a list of feature_lsts for L/R pos model.
Parameters
----------
include_genes : bool
If true, features for gene selection are also generated. Currently
joint V/J pairs used.
custom_pgen_model: string
path to folder of custom olga model.
compute_seq_energy_from_parameters(self, seqs=None, seqs_features=None)
Computes the energy of a list of sequences according to the model.
This computes according to model parameters instead of the keras model.
As a result, no clipping occurs.
Parameters
----------
seqs : list or None
Sequence list for a single sequence or many.
seqs_features : list or None
list of sequence features for a single sequence or many.
Returns
-------
E : float
Energies of seqs according to the model.
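The remark that no clipping occurs can be illustrated with a toy comparison. The sketch assumes a linear model in which the energy of a sequence is the sum of the energies of its features, and it reuses the constructor defaults min_energy_clip=-5 and max_energy_clip=10 as clip bounds; the function names and numbers are hypothetical.

```python
def toy_energy_from_parameters(seq_features, feature_energies):
    """Unclipped energy: sum of per-feature energies (linear model assumed)."""
    return sum(feature_energies[i] for i in seq_features)

def toy_clipped_energy(seq_features, feature_energies,
                       min_energy_clip=-5, max_energy_clip=10):
    """Same energy, restricted to [min_energy_clip, max_energy_clip]."""
    e = toy_energy_from_parameters(seq_features, feature_energies)
    return max(min_energy_clip, min(max_energy_clip, e))

feature_energies = [4.0, 7.5, -2.0]   # hypothetical learned parameters

unclipped = toy_energy_from_parameters([0, 1], feature_energies)
clipped = toy_clipped_energy([0, 1], feature_energies)
inside_range = toy_clipped_energy([2], feature_energies)
```

The two routes agree whenever the summed energy already lies inside the clip range and differ only at the extremes.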
find_seq_features(self, seq, features=None)
Finds which features match seq
If no features are provided, the left/right indexing amino acid model
features will be assumed.
Parameters
----------
seq : list
CDR3 sequence and any associated genes
features : ndarray
Array of feature lists. Each list contains individual subfeatures which
all must be satisfied.
Returns
-------
seq_features : list
Indices of features seq projects onto.
get_energy_parameters(self, return_as_dict=False)
Extract energy terms from keras model.
save_model(self, save_dir, attributes_to_save=None, force=True)
Saves model parameters and sequences
Parameters
----------
save_dir : str
Directory name to save model attributes to.
attributes_to_save: list
Names of attributes to save
sonia.sonia_length_pos¶
Created on Wed Mar 6 15:12:15 2019
@author: Zachary Sethna
Classes¶
SoniaLengthPos(data_seqs=[], gen_seqs=[], chain_type='humanTRB', load_dir=None, feature_file=None, data_seq_file=None, gen_seq_file=None, log_file=None, min_L=4, max_L=30, include_indep_genes=False, include_joint_genes=True, min_energy_clip=-5, max_energy_clip=10, seed=None, custom_pgen_model=None, l2_reg=0.0, vj=False)
Class used to infer a Q selection model.
Attributes
----------
features : ndarray
Array of feature lists. Each list contains individual subfeatures which
all must be satisfied.
features_dict : dict
Dictionary keyed by tuples of the feature lists. Values are the index
of the feature, i.e. self.features[self.features_dict[tuple(f)]] = f.
constant_features : list
List of feature strings whose parameters are not updated during learning.
These features are still used to compute energies (not currently used)
data_seqs : list
Data sequences used to infer selection model. Note, each 'sequence'
is a list where the first element is the CDR3 sequence which is
followed by any V or J genes.
gen_seqs : list
Sequences from generative distribution used to infer selection model.
Note, each 'sequence' is a list where the first element is the CDR3
sequence which is followed by any V or J genes.
data_seq_features : list
Lists of features that data_seqs project onto
gen_seq_features : list
Lists of features that gen_seqs project onto
data_marginals : ndarray
Array of the marginals of each feature over data_seqs
gen_marginals : ndarray
Array of the marginals of each feature over gen_seqs
model_marginals : ndarray
Array of the marginals of each feature over the model weighted gen_seqs
L1_converge_history : list
L1 distance between data_marginals and model_marginals at each
iteration.
chain_type : str
Type of receptor. This specification is used to determine gene names
and allow integrated OLGA sequence generation. Options: 'humanTRA',
'humanTRB' (default), 'humanIGH', 'humanIGL', 'humanIGK' and 'mouseTRB'.
l2_reg : float or None
L2 regularization strength. If None or 0.0 (the constructor default), no regularization is applied.
Methods
----------
seq_feature_proj(feature, seq)
Determines if a feature matches/is found in a sequence.
find_seq_features(seq, features = None)
Determines all model features of a sequence.
compute_seq_energy(seq_features = None, seq = None)
Computes the energy, as determined by the model, of a sequence.
compute_energy(seqs_features)
Computes the energies of a list of seq_features according to the model.
compute_marginals(self, features = None, seq_model_features = None, seqs = None, use_flat_distribution = False)
Computes the marginals of features over a set of sequences.
infer_selection(self, epochs = 10, batch_size=5000, initialize = True, seed = None)
Infers model parameters (energies for each feature).
update_model_structure(self,output_layer=[],input_layer=[],initialize=False)
Sets keras model structure and compiles.
update_model(self, add_data_seqs = [], add_gen_seqs = [], add_features = [], remove_features = [], add_constant_features = [], auto_update_marginals = False, auto_update_seq_features = False)
Updates model by adding/removing model features or data/generated seqs.
Marginals and seq_features can also be updated.
add_generated_seqs(self, num_gen_seqs = 0, reset_gen_seqs = True)
Generates synthetic sequences using OLGA and adds them to gen_seqs.
plot_model_learning(self, save_name = None)
Plots current marginal scatter plot as well as L1 convergence history.
save_model(self, save_dir, attributes_to_save = None)
Saves the model.
load_model(self, load_dir, load_seqs = True)
Loads a model.
Methods¶
add_features(self, include_indep_genes=False, include_joint_genes=True, custom_pgen_model=None)
Generates a list of feature_lsts for a length dependent L pos model.
Parameters
----------
include_genes : bool
If true, features for gene selection are also generated. Currently
joint V/J pairs used.
custom_pgen_model: string
path to folder of custom olga model.
compute_seq_energy_from_parameters(self, seqs=None, seqs_features=None)
Computes the energy of a list of sequences according to the model.
This computes according to model parameters instead of the keras model.
As a result, no clipping occurs.
Parameters
----------
seqs : list or None
Sequence list for a single sequence or many.
seqs_features : list or None
list of sequence features for a single sequence or many.
Returns
-------
E : float
Energies of seqs according to the model.
find_seq_features(self, seq, features=None)
Finds which features match seq
If no features are provided, the length dependent amino acid model
features will be assumed.
Parameters
----------
seq : list
CDR3 sequence and any associated genes
features : ndarray
Array of feature lists. Each list contains individual subfeatures which
all must be satisfied.
Returns
-------
seq_features : list
Indices of features seq projects onto.
get_energy_parameters(self, return_as_dict=False)
Extract energy terms from keras model and gauge.
For the length dependent position model, the gauge is set so that at a
given position, for a given length, we have:
<q_i,aa;L>_gen|L = 1
Parameters
----------
min_L : int
Minimum length CDR3 sequence, if not given taken from class attribute
max_L : int
Maximum length CDR3 sequence, if not given taken from class attribute
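One possible reading of the gauge condition for get_energy_parameters is that, at a fixed position and length, the selection factors averaged over the generated distribution equal 1. The sketch below rescales a toy set of factors to satisfy that condition. The dictionaries, the helper name, and the choice to rescale by a single common factor are assumptions; how the library redistributes that factor is not described here.

```python
def toy_fix_gauge(q, p_gen):
    """Rescale q so that sum over aa of p_gen[aa] * q[aa] equals 1."""
    mean_q = sum(p_gen[aa] * q[aa] for aa in q)
    return {aa: q[aa] / mean_q for aa in q}

# Hypothetical values for one position and one CDR3 length.
p_gen = {"A": 0.5, "C": 0.3, "F": 0.2}   # generated amino acid frequencies
q = {"A": 2.0, "C": 1.0, "F": 0.5}       # selection factors before gauge fixing

q_gauged = toy_fix_gauge(q, p_gen)
gauged_mean = sum(p_gen[aa] * q_gauged[aa] for aa in q_gauged)
```

Dividing by a common factor leaves the ratios between amino acids unchanged, so only the overall scale at that position is fixed.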
save_model(self, save_dir, attributes_to_save=None, force=True)
Saves model parameters and sequences
Parameters
----------
save_dir : str
Directory name to save model attributes to.
attributes_to_save: list
Names of attributes to save
sonia.sonia_vjl¶
@author: Giulio Isacchini
Classes¶
SoniaVJL(data_seqs=[], gen_seqs=[], chain_type='humanTRB', load_dir=None, feature_file=None, data_seq_file=None, gen_seq_file=None, log_file=None, load_seqs=True, max_depth=25, max_L=30, include_indep_genes=False, include_joint_genes=True, min_energy_clip=-5, max_energy_clip=10, seed=None, custom_pgen_model=None, l2_reg=0.0, vj=False, joint_vjl=False)
Class used to infer a Q selection model.
Attributes
----------
features : ndarray
Array of feature lists. Each list contains individual subfeatures which
all must be satisfied.
features_dict : dict
Dictionary keyed by tuples of the feature lists. Values are the index
of the feature, i.e. self.features[self.features_dict[tuple(f)]] = f.
constant_features : list
List of feature strings whose parameters are not updated during learning.
These features are still used to compute energies (not currently used)
data_seqs : list
Data sequences used to infer selection model. Note, each 'sequence'
is a list where the first element is the CDR3 sequence which is
followed by any V or J genes.
gen_seqs : list
Sequences from generative distribution used to infer selection model.
Note, each 'sequence' is a list where the first element is the CDR3
sequence which is followed by any V or J genes.
data_seq_features : list
Lists of features that data_seqs project onto
gen_seq_features : list
Lists of features that gen_seqs project onto
data_marginals : ndarray
Array of the marginals of each feature over data_seqs
gen_marginals : ndarray
Array of the marginals of each feature over gen_seqs
model_marginals : ndarray
Array of the marginals of each feature over the model weighted gen_seqs
L1_converge_history : list
L1 distance between data_marginals and model_marginals at each
iteration.
chain_type : str
Type of receptor. This specification is used to determine gene names
and allow integrated OLGA sequence generation. Options: 'humanTRA',
'humanTRB' (default), 'humanIGH', 'humanIGL', 'humanIGK' and 'mouseTRB'.
l2_reg : float or None
L2 regularization strength. If None or 0.0 (the constructor default), no regularization is applied.
Methods
----------
seq_feature_proj(feature, seq)
Determines if a feature matches/is found in a sequence.
find_seq_features(seq, features = None)
Determines all model features of a sequence.
compute_seq_energy(seq_features = None, seq = None)
Computes the energy, as determined by the model, of a sequence.
compute_energy(seqs_features)
Computes the energies of a list of seq_features according to the model.
compute_marginals(self, features = None, seq_model_features = None, seqs = None, use_flat_distribution = False)
Computes the marginals of features over a set of sequences.
infer_selection(self, epochs = 10, batch_size=5000, initialize = True, seed = None)
Infers model parameters (energies for each feature).
update_model_structure(self,output_layer=[],input_layer=[],initialize=False)
Sets keras model structure and compiles.
update_model(self, add_data_seqs = [], add_gen_seqs = [], add_features = [], remove_features = [], add_constant_features = [], auto_update_marginals = False, auto_update_seq_features = False)
Updates model by adding/removing model features or data/generated seqs.
Marginals and seq_features can also be updated.
add_generated_seqs(self, num_gen_seqs = 0, reset_gen_seqs = True)
Generates synthetic sequences using OLGA and adds them to gen_seqs.
plot_model_learning(self, save_name = None)
Plots current marginal scatter plot as well as L1 convergence history.
save_model(self, save_dir, attributes_to_save = None)
Saves the model.
load_model(self, load_dir, load_seqs = True)
Loads a model.
Methods¶
add_features(self, custom_pgen_model=None)
Generates a list of feature_lsts for L/R pos model.
Parameters
----------
include_genes : bool
If true, features for gene selection are also generated. Currently
joint V/J pairs used.
custom_pgen_model: string
path to folder of custom olga model.
compute_seq_energy_from_parameters(self, seqs=None, seqs_features=None)
Computes the energy of a list of sequences according to the model.
This computes according to model parameters instead of the keras model.
As a result, no clipping occurs.
Parameters
----------
seqs : list or None
Sequence list for a single sequence or many.
seqs_features : list or None
list of sequence features for a single sequence or many.
Returns
-------
E : float
Energies of seqs according to the model.
find_seq_features(self, seq, features=None)
Finds which features match seq
If no features are provided, the left/right indexing amino acid model
features will be assumed.
Parameters
----------
seq : list
CDR3 sequence and any associated genes
features : ndarray
Array of feature lists. Each list contains individual subfeatures which
all must be satisfied.
Returns
-------
seq_features : list
Indices of features seq projects onto.
get_energy_parameters(self, return_as_dict=False)
Extract energy terms from keras model.
save_model(self, save_dir, attributes_to_save=None)
Saves model parameters and sequences
Parameters
----------
save_dir : str
Directory name to save model attributes to.
attributes_to_save: list
Names of attributes to save
sonia.evaluate_model¶
@author: Giulio Isacchini
Classes¶
EvaluateModel(sonia_model=None, include_genes=True, processes=None, custom_olga_model=None)
Class used to evaluate sequences with the sonia model: Ppost=Q*Pgen
Attributes
----------
sonia_model: object
An already loaded Sonia model object; do not pass a path.
include_genes: bool
Conditioning on gene usage for pgen/ppost evaluation. Default: True
processes: int
Number of processes to use to infer pgen. Default: all.
custom_olga_model: object
Optional: already loaded custom generation_probability olga model.
Methods
----------
evaluate_seqs(seqs=[])
Returns Q, pgen and ppost of a list of sequences.
evaluate_selection_factors(seqs=[])
Returns normalised selection factor Q (Ppost=Q*Pgen) of a list of sequences (faster than evaluate_seqs because it does not compute pgen and ppost)
Methods¶
compute_all_pgens(self, seqs)
Compute Pgen of sequences using OLGA in parallel
Parameters
----------
seqs: list
list of sequences to evaluate.
Returns
-------
pgens: array
generation probabilities of the sequences.
compute_joint_marginals(self)
Computes joint marginals for all.
Attributes Set
-------
gen_marginals_two: array
matrix (i,j) of joint marginals for pre-selection distribution
data_marginals_two: array
matrix (i,j) of joint marginals for data
model_marginals_two: array
matrix (i,j) of joint marginals for post-selection distribution
gen_marginals_two_independent: array
matrix (i,j) of independent joint marginals for pre-selection distribution
data_marginals_two_independent: array
matrix (i,j) of independent joint marginals for data
model_marginals_two_independent: array
matrix (i,j) of independent joint marginals for post-selection distribution
evaluate_selection_factors(self, seqs=[])
Returns normalised selection factor Q (of Ppost=Q*Pgen) of list of sequences (faster than evaluate_seqs because it does not compute pgen and ppost)
Parameters
----------
seqs: list
list of sequences to evaluate
Returns
-------
Q: array
selection factor Q (of Ppost=Q*Pgen) of the sequences
evaluate_seqs(self, seqs=[])
Returns selection factors, pgen and pposts of sequences.
Parameters
----------
seqs: list
list of sequences to evaluate
Returns
-------
Q: array
selection factor Q (of Ppost=Q*Pgen) of the sequences
pgens: array
pgen of the sequences
pposts: array
ppost of the sequences
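The relation Ppost = Q*Pgen can be sketched numerically. In this toy version Q is taken as exp(-E) divided by a normalization Z, with Z estimated as the mean of exp(-E) over generated sequences; that estimate, the helper name, and all numbers are assumptions for illustration and do not call sonia or OLGA.

```python
import math

def toy_selection_factors(energies, gen_energies):
    """Normalised Q = exp(-E) / Z, with Z estimated from generated sequences (assumption)."""
    z = sum(math.exp(-e) for e in gen_energies) / len(gen_energies)
    return [math.exp(-e) / z for e in energies]

gen_energies = [0.0, math.log(2.0)]   # exp(-E) = 1 and 0.5, so Z = 0.75
energies = [0.0]                      # energy of the sequence to evaluate
pgens = [3e-9]                        # hypothetical generation probability

qs = toy_selection_factors(energies, gen_energies)
pposts = [q * p for q, p in zip(qs, pgens)]   # Ppost = Q * Pgen
```

Computing Q alone skips the generation probability entirely, which is why evaluate_selection_factors is described as the faster call.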
joint_marginals(self, features=None, seq_model_features=None, seqs=None, use_flat_distribution=False)
Returns joint marginals P(i,j), where i and j index sonia features (l3, aA6, etc.); the ordering of the features attribute is preserved.
Matrix is upper-triangular.
Parameters
----------
features: list
custom feature list
seq_model_features: list
encoded sequences
seqs: list
seqs to encode.
use_flat_distribution: bool
Use True for data and generated seqs; use False for the model (weights sequences by Q).
Returns
-------
joint_marginals: array
matrix (i,j) of joint marginals
joint_marginals_independent(self, marginals)
Returns independent joint marginals P(i,j)=P(i)*P(j), where i and j index sonia features (l3, aA6, etc.); the ordering of the features attribute is preserved.
Matrix is upper-triangular.
Parameters
----------
marginals: list
marginals.
Returns
-------
joint_marginals: array
matrix (i,j) of joint marginals
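The independent joint marginals P(i,j)=P(i)*P(j) stored in an upper-triangular matrix can be sketched as below. The helper name is hypothetical, nested lists stand in for an array, and keeping the diagonal is an assumption of this sketch.

```python
def toy_joint_marginals_independent(marginals):
    """Upper-triangular matrix with entry (i, j) = P(i) * P(j) for j >= i, else 0."""
    n = len(marginals)
    return [[marginals[i] * marginals[j] if j >= i else 0.0
             for j in range(n)]
            for i in range(n)]

toy_single_marginals = [0.5, 0.2, 0.1]
toy_joint = toy_joint_marginals_independent(toy_single_marginals)
```

Comparing such a matrix with the measured joint marginals shows which feature pairs co-occur more or less often than independence would predict.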
sonia.sequence_generation¶
@author: Giulio Isacchini
Classes¶
SequenceGeneration(sonia_model=None, custom_olga_model=None, custom_genomic_data=None)
Class used to generate sequences with the sonia model
Attributes
----------
sonia_model: object
Required. Sonia model: only object accepted.
custom_olga_model: object
Optional: already loaded custom olga sequence_generation object
custom_genomic_data: object
Optional: already loaded custom olga genomic_data object
Methods
----------
generate_sequences_pre(num_seqs = 1)
Generate sequences using olga
generate_sequences_post(num_seqs,upper_bound=10)
Generate sequences using olga and perform rejection selection.
rejection_sampling(upper_bound=10,energies=None)
Returns acceptance from rejection sampling of a list of energies.
By default uses the generated sequences within the sonia model.
Methods¶
generate_sequences_post(self, num_seqs=1, upper_bound=10, nucleotide=True)
Generates MonteCarlo sequences from Sonia through rejection sampling.
Parameters
----------
num_seqs : int or float
Number of MonteCarlo sequences to generate and add to the specified
sequence pool.
upper_bound: int
accept all above the threshold. Relates to the percentage of
sequences that pass selection.
Returns
--------------
seqs : list
MonteCarlo sequences drawn from a VDJ recomb model that pass selection.
generate_sequences_pre(self, num_seqs=1, nucleotide=True)
Generates MonteCarlo sequences for gen_seqs using OLGA.
Only generates seqs from a V(D)J model. Requires the OLGA package
(pip install olga).
Parameters
----------
num_seqs : int or float
Number of MonteCarlo sequences to generate and add to the specified
sequence pool.
Returns
--------------
seqs : list
MonteCarlo sequences drawn from a VDJ recomb model
rejection_sampling(self, upper_bound=10, energies=None)
Returns acceptance from rejection sampling of a list of energies.
By default uses the generated sequences within the sonia model.
Parameters
----------
upper_bound : int or float
accept all above the threshold. Relates to the percentage of
sequences that pass selection
Returns
-------
rejection selection: array of bool
acceptance of each sequence.
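Rejection sampling on a list of energies can be sketched as follows. This toy version accepts a sequence with probability min(1, Q / upper_bound), where Q = exp(-E), so every sequence with Q at or above the bound is accepted. The helper name, the omission of any normalization of Q, and the rng argument are assumptions of the sketch.

```python
import math
import random

def toy_rejection_sampling(energies, upper_bound=10, rng=None):
    """Boolean acceptance per energy: accept with probability min(1, exp(-E) / upper_bound)."""
    rng = rng or random.Random()
    # rng.random() lies in [0, 1), so a ratio >= 1 is always accepted
    # and a ratio of 0 is never accepted.
    return [rng.random() < math.exp(-e) / upper_bound for e in energies]

toy_rng = random.Random(0)
toy_energies = [-math.log(20.0), float("inf")]   # Q = 20 and Q = 0
toy_accepted = toy_rejection_sampling(toy_energies, upper_bound=10, rng=toy_rng)
```

A larger upper_bound lowers the acceptance rate, which matches the note that the bound relates to the percentage of sequences that pass selection.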
sonia.plotting¶
@author: Giulio Isacchini
Classes¶
Plotter(sonia_model=None)
Class used to do plotting
Attributes
----------
sonia_model: object
An already loaded Sonia model object, not a path.
Methods
----------
plot_model_learning(save_name = None)
Plots L1 convergence curve and marginal scatter.
plot_pgen(pgen_data=[],pgen_gen=[],pgen_model=[],n_bins=100)
Histogram plot of pgen. You need to evaluate them first.
plot_ppost(ppost_data=[],ppost_gen=[],pppst_model=[],n_bins=100)
Histogram plot of ppost. You need to evaluate them first.
plot_model_parameters(low_freq_mask = 0.0)
For LengthPos model only. Plot the model parameters using plot_onepoint_values
plot_marginals_length_corrected(min_L = 8, max_L = 16, log_scale = True)
For LengthPos model only. Plot length normalized marginals.
plot_vjl(save_name = None)
Plots marginals of V gene, J gene and cdr3 length
plot_logQ(save_name=None)
Plots logQ of data and generated sequences
plot_ratioQ(self,save_name=None)
Plots the ratio of P(Q) in data and pre-selected pool. Useful for model validation.
Methods¶
norm_marginals(self, marg, min_L=None, max_L=None)
Renormalizes the marginals according to length, so that the sum of the marginals over all amino acids
for one position/length combination is 1 (and not the fraction of CDR3s of this length)
Parameters
----------
marg : ndarray
the marginal to renormalize
min_L : int
Minimum length CDR3 sequence, if not given taken from class attribute
max_L : int
Maximum length CDR3 sequence, if not given taken from class attribute
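The length renormalization can be illustrated with a toy dictionary keyed by (length, position, amino acid). The real method takes an ndarray of marginals, so this layout and the helper name are assumptions made to keep the example self-contained.

```python
def toy_norm_marginals(marg):
    """Rescale so marginals sum to 1 over amino acids for each (length, position)."""
    totals = {}
    for (length, pos, _aa), value in marg.items():
        totals[(length, pos)] = totals.get((length, pos), 0.0) + value
    return {key: (value / totals[key[:2]] if totals[key[:2]] > 0 else 0.0)
            for key, value in marg.items()}

# Before normalization the length-3 entries sum to 0.25,
# i.e. the fraction of CDR3s of that length.
toy_marg = {(3, 0, "C"): 0.2, (3, 0, "A"): 0.05, (4, 0, "C"): 0.75}
toy_normed = toy_norm_marginals(toy_marg)
```

After normalization, amino acid usage at a position can be compared across lengths regardless of how common each length is.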
plot_logQ(self, save_name=None)
Plots logQ of data and generated sequences
Parameters
----------
save_name : str or None
File name to save output figure. If None (default) does not save.
plot_marginals_length_corrected(self, min_L=8, max_L=16, log_scale=True)
plot length normalized marginals using plot_onepoint_values
Parameters
----------
min_L : int
Minimum length CDR3 sequence, if not given taken from class attribute
max_L : int
Maximum length CDR3 sequence, if not given taken from class attribute
log_scale : bool
if True (default) plots marginals on a log scale
plot_model_learning(self, save_name=None)
Plots L1 convergence curve and marginal scatter.
Parameters
----------
save_name : str or None
File name to save output figure. If None (default) does not save.
plot_model_parameters(self, low_freq_mask=0.0)
plot the model parameters using plot_onepoint_values
Parameters
----------
low_freq_mask : float
threshold on the marginals, anything lower would be grayed out
plot_onepoint_values(self, onepoint=None, onepoint_dict=None, min_L=None, max_L=None, min_val=None, max_value=None, title='', cmap='seismic', bad_color='black', aa_color='white', marginals=False)
plot a function of aa, length and position from left, one heatplot per aa
Parameters
----------
onepoint : ndarray
array containing one-point values to plot, in the same shape as self.features,
expected unless onepoint_dict is given
onepoint_dict : dict
dict of the one-point values to plot, keyed by the feature tuples such as (l12,aA8)
min_L : int
Minimum length CDR3 sequence
max_L : int
Maximum length CDR3 sequence
min_val : float
minimum value to plot
max_value : float
maximum value to plot
title : string
title of plot to display
cmap : colormap
colormap to use for the heatplots
bad_color : string
color to use for nan values - used primarily for cells where position is larger than length
aa_color : string
color to use for amino acid names for each heatplot displayed on the bad_color background
marginals : bool
if true, indicates marginals are to be plotted and this sets cmap, bad_color and aa_color
plot_prob(self, data=[], gen=[], model=[], n_bins=30, save_name=None, bin_min=-20, bin_max=-5, ptype='P_{pre}', figsize=(6, 4))
Histogram plot of Pgen/ppost/Q
Parameters
----------
n_bins: int
number of bins of the histogram
plot_ratioQ(self, save_name=None)
Plots the ratio of P(Q) in data and pre-selected pool. Useful for model validation.
Parameters
----------
save_name : str or None
File name to save output figure. If None (default) does not save.
plot_vjl(self, save_name=None)
Plots marginals of V gene, J gene and cdr3 length
Parameters
----------
save_name : str or None
File name to save output figure. If None (default) does not save.