API¶
Dataset objects¶
Detailed API¶
- class itpseq.DataSet(data_path: Path = '.', result_path: Path | None = None, samples: dict | None = None, keys=None, ref_labels: str | tuple | None = 'noa', cache_path=None, file_pattern=None, aafile_pattern=None)¶
Loads an iTP-Seq dataset and provides methods for analyzing and visualizing the data.
A DataSet object is constructed to handle iTP-Seq Samples with their respective Replicates. By default, it infers the files to use in the provided directory by looking for “*.processed.json” files produced during the initial step of pre-processing and filtering the fastq files. It uses the pattern of the file names to group the Replicates into a Sample, and to define which condition is the reference in the DataSet (the Sample with name “noa” by default).
- data_path¶
Path to the data directory containing the output files from the fastq pre-processing.
- Type:
str or Path
- result_path¶
Path to the directory where the results of the analysis will be saved.
- Type:
str or Path
- samples¶
Dictionary of Samples in the DataSet. By default, it is None and will be populated automatically.
- Type:
dict or None
- keys¶
Properties in the file name to use for identifying the reference.
- Type:
tuple
- ref_labels¶
Specifies the reference: e.g. ‘noa’ or ((‘sample’, ‘noa’),)
- Type:
str or tuple
- cache_path¶
Path used to cache intermediate results. By default, this creates a subdirectory called “cache” in the result_path directory.
- Type:
str or Path
- file_pattern¶
Regex pattern used to identify the sample files in the data_path directory. If None, defaults to r’(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)\.processed\.json’, which matches files like nnn15_noa1.processed.json, nnn15_tcx2.processed.json, etc.
- Type:
str
- aafile_pattern¶
Pattern used to identify the amino acid files in the data_path directory. It will use the values captured in the file_pattern regex to construct the file names. If None, defaults to ‘{lib_type}_{sample}{replicate}_aa.processed.txt’
- Type:
str
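The interplay between file_pattern and aafile_pattern can be sketched with the standard library alone. The snippet below is a minimal illustration, not the library's own code: the helper name `parse_name` is hypothetical, and the use of `re.fullmatch` is an assumption (itpseq may match file names differently).

```python
import re

# Default patterns as documented above.
FILE_PATTERN = r'(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)\.processed\.json'
AAFILE_PATTERN = '{lib_type}_{sample}{replicate}_aa.processed.txt'


def parse_name(filename):
    """Hypothetical helper: extract the named groups from a file name and
    build the matching amino-acid file name from them."""
    match = re.fullmatch(FILE_PATTERN, filename)  # assumption: whole-name match
    if match is None:
        return None
    fields = match.groupdict()
    return fields, AAFILE_PATTERN.format(**fields)


fields, aafile = parse_name('nnn15_noa1.processed.json')
print(fields)  # {'lib_type': 'nnn15', 'sample': 'noa', 'replicate': '1'}
print(aafile)  # nnn15_noa1_aa.processed.txt
```

A custom file_pattern should presumably keep the same group names that aafile_pattern refers to, since the second pattern is filled from the values captured by the first.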
Examples
Creating a DataSet from a simple antibiotic treatment (tcx) vs no treatment (noa) with 3 replicates each (1, 2, 3).
- Load a dataset from the current directory, inferring the samples automatically.
>>> from itpseq import DataSet
>>> data = DataSet(data_path='.')
>>> data
DataSet(data_path=PosixPath('.'),
        file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)\\.processed\\.json',
        samples=[Sample(nnn15.noa:[1, 2, 3]),
                 Sample(nnn15.tcx:[1, 2, 3], ref: nnn15.noa)],
        )
- Same as above, but only use “sample” as key.
>>> data = DataSet(data_path='.', keys=['sample'])
>>> data
DataSet(data_path=PosixPath('.'),
        file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)\\.processed\\.json',
        samples=[Sample(noa:[1, 2, 3]),
                 Sample(tcx:[1, 2, 3], ref: noa)],
        )
- Compute a standard report and export it as PDF
>>> data.report('my_experiment.pdf')
- Display a graph of the inverse-toeprints lengths for each sample
>>> data.itp_len_plot(row='sample')
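To illustrate how the keys argument changes the sample names in the two DataSet representations above (nnn15.noa versus noa), here is a stdlib-only sketch of the grouping step. It is an approximation under stated assumptions: `group_replicates` is a hypothetical helper, the default keys of ('lib_type', 'sample') and the '.' separator are inferred from the printed output rather than taken from the itpseq source.

```python
import re
from collections import defaultdict

PATTERN = re.compile(
    r'(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)\.processed\.json'
)


def group_replicates(filenames, keys=('lib_type', 'sample')):
    """Hypothetical helper: group replicate numbers under a sample name
    built by joining the selected regex groups with '.'."""
    groups = defaultdict(list)
    for filename in sorted(filenames):
        match = PATTERN.fullmatch(filename)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        name = '.'.join(match[key] for key in keys)
        groups[name].append(int(match['replicate']))
    return dict(groups)


files = [f'nnn15_{s}{r}.processed.json' for s in ('noa', 'tcx') for r in (1, 2, 3)]
print(group_replicates(files))
# {'nnn15.noa': [1, 2, 3], 'nnn15.tcx': [1, 2, 3]}
print(group_replicates(files, keys=('sample',)))
# {'noa': [1, 2, 3], 'tcx': [1, 2, 3]}
```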
- Attributes:
- samples_with_ref
Methods
DE([pos]) – Computes the log2-FoldChange for each motif described by pos for each sample in the DataSet relative to their reference.
infos([html]) – Displays summary information about the dataset NGS reads per replicate.
itoeprint
itp_len_plot
reorder_samples
report
- DE(pos='E:A', **kwargs)¶
Computes the log2-FoldChange for each motif described by pos for each sample in the DataSet relative to their reference
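For readers unfamiliar with the quantity, the sketch below shows what a log2 fold change between a sample and its reference means for motif counts. It only illustrates the definition: the function name and the pseudocount are assumptions, and the statistical procedure actually used by DE is not described in this section and may well differ.

```python
import math


def log2_fold_change(sample_counts, ref_counts, pseudocount=1.0):
    """Illustrative only: log2 of the ratio of motif frequencies.

    Counts are normalized by the total of each condition so that
    differences in sequencing depth do not dominate the ratio.
    """
    sample_total = sum(sample_counts.values())
    ref_total = sum(ref_counts.values())
    result = {}
    for motif in sorted(set(sample_counts) | set(ref_counts)):
        s = (sample_counts.get(motif, 0) + pseudocount) / sample_total
        r = (ref_counts.get(motif, 0) + pseudocount) / ref_total
        result[motif] = math.log2(s / r)
    return result


# Made-up counts: 'KKK' is four times more frequent in the treated sample.
treated = {'KKK': 40, 'AAA': 60}
reference = {'KKK': 10, 'AAA': 90}
lfc = log2_fold_change(treated, reference, pseudocount=0)
print(round(lfc['KKK'], 3))  # 2.0
```

A positive value means the motif is enriched relative to the reference, a negative value means it is depleted.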
- infos(html=False)¶
Displays summary information about the dataset NGS reads per replicate.
- This information is computed during the parsing step and includes:
the total number of reads,
the number of reads without adaptors,
the number of reads that are contaminants,
the number of reads with a low quality,
the number of reads that are too short or too long,
the number of extra nucleotides at the 3’-end of the inverse-toeprints.
- Parameters:
html (bool) – if True, returns the table as HTML, otherwise as DataFrame (default).
Example
>>> dataset.infos()
          total_sequences  noadaptor  contaminant  lowqual  tooshort  toolong   extra0   extra1   extra2  MAX_LEN
noa.1             9036255     799955         1219   700374    299502  3376581  2434092  2709762  3092446       44
noa.2             8154560     407750         1318   680813    154587  4158921  2329190  2582052  2835568       44
noa.3             7725561     623037         1065   353401    279104  3505909  2216460  2402957  2483107       44
sample.1          8384889     714414         1192   685017    385537  3341987  2308291  2528638  2833546       44
sample.2          9120203     498202         1659   513308    104071  5664107  2673062  2972850  2976089       44
sample.3          8490958    1043590         1328   409697    187746  4004073  2243720  2555783  2647865       44
- class itpseq.Replicate(*, replicate: str | None = None, filename: Path | None = None, sample: Sample | None = None, labels: dict | None = None, **kwargs)¶
Replicate instances represent a specific biological or experimental replicate of a Sample.
The purpose of the class is to handle, process, and analyze data corresponding to a replicate. Replicate objects provide methods to load associated data, compute statistical measures, and generate graphical representations such as sequence logos.
- filename¶
Path to the file associated with the replicate. This file is expected to contain raw data relevant to the replicate.
- Type:
Optional[Path]
- replicate¶
Identifier or label for the replicate (e.g., “1”).
- Type:
Optional[str]
- labels¶
Dictionary of labels or metadata associated with the replicate.
- Type:
Optional[dict]
- name¶
Name of the sample, derived from sample.name if provided.
- Type:
str
- dataset¶
The DataSet the Sample belongs to, derived from sample.dataset if provided.
- Type:
Any
- kwargs¶
Additional keyword arguments and metadata stored as “meta” during initialization.
- Type:
dict
- Attributes:
- dataset
- sample_name
Methods
get_counts([pos]) – Counts the number of reads for each motif or combination of amino-acid/position.
load_data([min_peptide, max_peptide, how, ...]) – Reads the amino-acid inverse-toeprint file as a pandas Series and filters entries based on peptide length and stop codons.
logo([logo_kwargs, ax, fMet, type]) – Generates a sequence logo based on the aligned inverse-toeprints, using the logomaker library.
rename([name]) – Sets the name of the replicate from a parameter or automatically.
- get_counts(pos=None, **kwargs)¶
Counts the number of reads for each motif or combination of amino-acid/position.
- Parameters:
pos (str, optional) – Position to consider when counting the reads. If None is passed, then this returns a DataFrame with the counts of each amino-acid per position.
kwargs (optional) – Optional parameters to pass to load_data (min_peptide, max_peptide, how, limit, sample)
- Returns:
Returns a DataFrame if pos is None, otherwise a Series.
- Return type:
Series or DataFrame
Examples
- Count the number of reads for each amino-acid/position combination
>>> replicate.get_counts()
           -8         -7         -6  ...         -1          0          1
    2879961.0  2658485.0  2449526.0  ...   793143.0    52640.0        NaN
*         NaN        NaN        NaN  ...        NaN        NaN   910137.0
A         NaN    12240.0    25225.0  ...   111369.0   134995.0   107591.0
..        ...        ...        ...  ...        ...        ...        ...
W         NaN     2686.0     5059.0  ...    17643.0    28095.0    21577.0
Y         NaN     9522.0    19296.0  ...    69671.0    81462.0    93099.0
m    197624.0   221476.0   208959.0  ...   409289.0   740503.0    52640.0

[23 rows x 10 columns]
- Count the number of reads for each motif in the E-P-A sites
>>> replicate.get_counts(pos='E:A')
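The two counting modes can be approximated with `collections.Counter`. This sketch rests on assumptions inferred from the example output above, not on the itpseq source: peptides are right-aligned so that the last residue sits at position 1 (the A-site), shorter peptides are padded with blanks on the left, and 'E:A' corresponds to the last three residues. The helper names are hypothetical.

```python
from collections import Counter


def count_by_position(peptides, width=10):
    """Per-position residue counts for right-aligned peptides.

    Positions run from -(width - 2) to 1; the last residue is at 1.
    """
    counts = {}
    for peptide in peptides:
        padded = peptide.rjust(width)[-width:]  # blank-pad or truncate on the left
        for offset, residue in enumerate(padded):
            position = offset - width + 2  # last character maps to position 1
            counts.setdefault(position, Counter())[residue] += 1
    return counts


def count_motifs(peptides, n_sites=3):
    """Counts of the last n_sites residues (E, P and A sites when n_sites=3)."""
    return Counter(peptide[-n_sites:] for peptide in peptides)


peptides = ['mFIVRGWQV', 'mWQ', 'm*T', 'mWQ']
by_pos = count_by_position(peptides)
print(by_pos[1])   # Counter({'Q': 2, 'V': 1, 'T': 1})
print(count_motifs(peptides))  # Counter({'mWQ': 2, 'WQV': 1, 'm*T': 1})
```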
- load_data(min_peptide=None, max_peptide=None, how='aax', limit=None, sample=None)¶
Reads the amino-acid inverse-toeprint file as a pandas Series and filters entries based on peptide length and stop codons.
- Parameters:
filename (Path or str) – Path to the amino-acid file
min_peptide (int, optional) – Minimum peptide length to keep, by default None
max_peptide (int, optional) – Maximum peptide length to keep, by default None
how (str, optional) – Mode to filter the stops: “aax” will remove peptides with a stop before the A-site, by default ‘aax’
- Returns:
Series of inverse-toeprints for the replicate.
- Return type:
Series
Examples
- Load the inverse-toeprints with a minimum peptide length of 3 and keep the internal stops.
>>> replicate.load_data(min_peptide=3, how='aa')
0           mFIVRGWQV
1                 mWQ
2                 m*T
3          mEVHATTSGQ
4          mHPNYTS*PV
              ...
2828877          mTGA
2828878     mRSATINLQ
2828879    mSLMPHHRGN
2828880          mHWH
2828881     mSSTRSSRS
Length: 2828882, dtype: object
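The filtering described for load_data can be sketched on a plain list of peptide strings. This is an illustration under assumptions, not the library's implementation: `filter_peptides` is a hypothetical name, the peptide length is taken as the string length including the leading m, and "a stop before the A-site" is read as a '*' anywhere except the last character.

```python
def filter_peptides(peptides, min_peptide=None, max_peptide=None, how='aax'):
    """Keep peptides within the length bounds; with how='aax', also drop
    peptides that contain a stop ('*') before the last (A-site) residue."""
    kept = []
    for peptide in peptides:
        length = len(peptide)  # assumption: the leading 'm' counts toward the length
        if min_peptide is not None and length < min_peptide:
            continue
        if max_peptide is not None and length > max_peptide:
            continue
        if how == 'aax' and '*' in peptide[:-1]:
            continue
        kept.append(peptide)
    return kept


peptides = ['mFIVRGWQV', 'mWQ', 'm*T', 'mHPNYTS*PV', 'm*', 'm']
print(filter_peptides(peptides, min_peptide=3, how='aa'))
# ['mFIVRGWQV', 'mWQ', 'm*T', 'mHPNYTS*PV']
print(filter_peptides(peptides))
# ['mFIVRGWQV', 'mWQ', 'm*', 'm']
```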
- logo(logo_kwargs=None, ax=None, fMet=False, type='information', **kwargs)¶
Generates a sequence logo based on the aligned inverse-toeprints, using the logomaker library.
- Parameters:
logo_kwargs (dict, optional) – Additional keyword arguments passed to logomaker.Logo for customizing the sequence logo. Defaults to {‘color_scheme’: ‘NajafabadiEtAl2017’}.
ax (matplotlib.axes.Axes, optional) – Pre-existing matplotlib Axes to draw the logo on. A new Axes is created if not provided.
fMet (bool, optional) – If False, removes m (formyl-methionine / start codon) from the alignment when building the logo. Defaults to False.
type (str, optional) – The transformation type applied to the counts matrix. Possible values include: - ‘information’ for information content. - ‘probability’ for probabilities. Defaults to ‘information’.
**kwargs (dict) – Additional keyword arguments passed to filter the input data (e.g., pos, min_peptide, max_peptide…).
- Returns:
A logomaker.Logo object representing the sequence logo.
- Return type:
logomaker.Logo
Notes
Sequence alignment data is first converted to a counts matrix via the logomaker.alignment_to_matrix method.
The ribosomal site corresponding to each position is annotated on the x-axis.
Transformation of the counts matrix (e.g., counts to information) is performed using logomaker.transform_matrix.
Examples
- Simple logo plot with default settings
>>> logo = obj.logo()
- Logo plot with min_peptide filtering
>>> logo = obj.logo(min_peptide=3)
- Logo plot with custom transformation type and filtering
>>> logo = obj.logo(type='probability', min_peptide=2, fMet=True)
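To make the 'information' transformation more concrete, the sketch below computes letter heights for a logo in pure Python, assuming a uniform background over the alphabet. It is a simplified stand-in: logomaker.transform_matrix, which the method relies on according to the Notes, also deals with pseudocounts and custom backgrounds, so its numbers can differ. The function name and the alphabet_size parameter are hypothetical.

```python
import math
from collections import Counter


def information_heights(sequences, alphabet_size=20):
    """Letter heights per position: probability times information content,
    where information = log2(alphabet_size) - Shannon entropy (in bits)."""
    heights = []
    for position in range(len(sequences[0])):
        column = Counter(sequence[position] for sequence in sequences)
        total = sum(column.values())
        probabilities = {aa: n / total for aa, n in column.items()}
        entropy = -sum(p * math.log2(p) for p in probabilities.values())
        information = math.log2(alphabet_size) - entropy
        heights.append({aa: p * information for aa, p in probabilities.items()})
    return heights


# Toy alignment over a 4-letter alphabet: position 0 is fully conserved.
heights = information_heights(['AK', 'AR', 'AK', 'AR'], alphabet_size=4)
print(heights[0])  # {'A': 2.0}
print(heights[1])  # {'K': 0.5, 'R': 0.5}
```

With type='probability', the heights would simply be the per-position probabilities instead.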
- rename(name=None)¶
Sets the name of the replicate from a parameter or automatically.
- Parameters:
name (str, optional) – name to use as the new name for the replicate.
Examples
- Rename the replicate with a parameter
>>> rep.rename(name='new_name')
- Rename the replicate automatically from its parent sample data
>>> rep.rename()
- class itpseq.Sample(*, labels: dict, reference=None, dataset=None, data=None, keys=('sample',), **kwargs)¶
Represents a sample in a dataset, its replicates, reference, and associated metadata.
The Sample class is used to encapsulate information and behavior related to samples in a dataset. It manages details like labels, references, replicates, and metadata, and provides methods for analyzing replicates, performing differential enrichment analysis, and creating visualizations.
- Attributes:
- name_ref
- name_vs_ref
Methods
copy([name, reference]) – Creates a copy of the sample.
get_counts([pos]) – Counts the number of reads for each motif or combination of amino-acid/position for each replicate in the sample.
get_counts_ratio([pos, factor, exclude_empty])
get_counts_ratio_pos([pos]) – Computes a DataFrame with the enrichment ratios for each ribosome position.
hmap([r, c, pos, col, transform, cmap, ...]) – Generates a heatmap of enrichment for combinations of 2 positions.
hmap_grid([pos, col, transform, cmap, vmax, ...]) – Creates a grid of heatmaps for all combinations of ribosome positions passed in pos.
hmap_pos([pos, cmap, vmax, center, ax]) – Generates a heatmap of enrichment ratios for amino acid positions across ribosome sites.
infos([html]) – Returns a table with information on the NGS reads per replicate.
itp_len_plot([ax, min_codon, max_codon, ...]) – Generates a line plot of inverse-toeprint (ITP) counts per length.
rename(name[, rename_replicates]) – Changes the name of the sample.
DE
all_logos
itoeprint
load_replicates
logo
volcano
- copy(name=None, reference=<no_default>)¶
Creates a copy of the sample.
- Parameters:
name (str, optional) – New name for the sample.
reference (Sample or None, optional) – If a parameter is used, this will set it as the reference sample.
- Returns:
A new Sample object with the same data as the original sample and optionally an updated name and reference.
- Return type:
Sample
Examples
- Create a copy of “sample” and change its name to “new_name”.
>>> new_sample = sample.copy(name='new_name')
>>> new_sample
Sample(new_name:[1, 2, 3], ref: ref_name)
- Create a copy of “sample” with “sample2” as reference.
>>> new_sample = sample.copy(reference=sample2)
>>> new_sample
Sample(new_name:[1, 2, 3], ref: sample2)
- get_counts(pos=None, **kwargs)¶
Counts the number of reads for each motif or combination of amino-acid/position for each replicate in the sample.
- Parameters:
pos (str, optional) – Position to consider when counting the reads. If None is passed, then this returns a DataFrame with the counts of each amino-acid per position.
kwargs (optional) – Optional parameters to pass to load_data (min_peptide, max_peptide, how, limit, sample)
- Returns:
Returns a DataFrame. If pos is None the columns will be a MultiIndex.
- Return type:
DataFrame
Examples
- Count the number of reads for each amino-acid/position combination
>>> sample.get_counts()
     sample.1                        ...   sample.3
           -8         -7         -6  ...         -1          0          1
    2879961.0  2658485.0  2449526.0  ...   724998.0    34748.0        NaN
*         NaN        NaN        NaN  ...        NaN        NaN   880568.0
A         NaN    12240.0    25225.0  ...    92225.0   115164.0    85132.0
..        ...        ...        ...  ...        ...        ...        ...
W         NaN     2686.0     5059.0  ...    14313.0    23730.0    17656.0
Y         NaN     9522.0    19296.0  ...    57431.0    69162.0    81430.0
m    197624.0   221476.0   208959.0  ...   375644.0   690250.0    34748.0

[23 rows x 30 columns]
- Count the number of reads for each motif in the E-P-A sites
>>> sample.get_counts(pos='E:A')
     sample.1  sample.2  sample.3
m*   254850.0  107060.0  258338.0
mS    54993.0   20419.0   50959.0
m     52640.0   17860.0   34748.0
..        ...       ...       ...
WFW       NaN       2.0       NaN
WWW       NaN       1.0       NaN
MMW       NaN       NaN       1.0

[8842 rows x 3 columns]
- get_counts_ratio_pos(pos=None, **kwargs)¶
Computes a DataFrame with the enrichment ratios for each ribosome position.
This method calculates the enrichment for amino acids at the specified positions on the ribosome and organizes the results into a DataFrame. Each row of the DataFrame corresponds to a ribosome position.
- Parameters:
pos (iterable, optional) – An iterable of ribosome positions for which to compute enrichment ratios (e.g., (‘-2’, ‘E’, ‘P’, ‘A’)). If not provided, defaults to (‘-2’, ‘E’, ‘P’, ‘A’).
how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.
**kwargs (dict, optional) – Additional parameters to filter the data or customize the ratio computations.
- Returns:
A DataFrame where rows correspond to ribosome positions and columns correspond to amino acids (ordered by a predefined amino acid sequence). The values in the DataFrame represent the enrichment ratios for each position and amino acid.
- Return type:
pandas.DataFrame
Examples
- Calculate the enrichment relative to the reference for the default -2/E/P/A positions
>>> sample.get_counts_ratio_pos()
amino-acid         H         R         K  ...         W         *         m
site                                      ...
-2          1.062831  1.066174  1.012982  ...  1.046303       NaN  0.907140
E           1.037079  1.018643  0.941939  ...  1.041217       NaN  0.933880
P           1.093492  1.100380  1.045145  ...  1.107238       NaN  0.793043
A           0.831129  1.005783  0.967491  ...  0.995833  1.143702  0.757118

[4 rows x 22 columns]
- hmap(r=None, c=None, *, pos=None, col='auto', transform=<ufunc 'log2'>, cmap='vlag', vmax=None, center=None, ax=None, heatmap_kwargs=None, **kwargs)¶
Generates a heatmap of enrichment for combinations of 2 positions.
- Parameters:
r (str) – The row position on the ribosome for the heatmap.
c (str) – The column position on the ribosome for the heatmap.
pos (str or list) – Either a specific position in the form “r:c” or a list of positions to analyze.
how (str) – Defines the method to compute the counts (e.g., ‘mean’, ‘sum’, ‘count’). If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.
col (str) – The dataset column used for computations.
transform (callable, optional) – A function or callable to apply to the dataset before generating the heatmap.
cmap (str or matplotlib.colors.Colormap) – The colormap to use for the heatmap visualization.
vmax (float, optional) – The maximum value for color scaling in the heatmap.
center (float, optional) – The midpoint value for centering the colormap.
ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. If not provided, a new figure and axes are created.
heatmap_kwargs (dict) – Parameters passed to the sns.heatmap method
kwargs (dict) – Additional parameters used to filter the dataset. This allows for fine-tuning of the data before generating the heatmap.
- Returns:
The heatmap axes object containing the visualization.
- Return type:
matplotlib.axes.Axes
Examples
- Create a heatmap for positions E-P-A
>>> sample.hmap('E:A')
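The data behind a two-position heatmap can be pictured as a table indexed by the residue at the row position and the residue at the column position. The sketch below builds such a table from three-letter E-P-A motif counts. It is only a conceptual illustration: the helper names, the summation over the third position, and the frequency-ratio definition of enrichment are assumptions, not the method hmap is documented to use.

```python
from collections import defaultdict

SITE_INDEX = {'E': 0, 'P': 1, 'A': 2}  # index of each site within an E-P-A motif


def pair_counts(motif_counts, r='E', c='A'):
    """Sum motif counts per (row residue, column residue) pair."""
    table = defaultdict(float)
    for motif, count in motif_counts.items():
        if len(motif) != 3:
            continue  # skip truncated motifs such as 'm*'
        table[(motif[SITE_INDEX[r]], motif[SITE_INDEX[c]])] += count
    return dict(table)


def pair_enrichment(sample_counts, ref_counts, r='E', c='A'):
    """Ratio of pair frequencies in the sample versus the reference."""
    sample = pair_counts(sample_counts, r, c)
    ref = pair_counts(ref_counts, r, c)
    sample_total, ref_total = sum(sample.values()), sum(ref.values())
    return {
        pair: (sample[pair] / sample_total) / (ref[pair] / ref_total)
        for pair in sample
        if ref.get(pair, 0) > 0
    }


treated = {'KPA': 30, 'KGA': 10, 'RPW': 60}
reference = {'KPA': 10, 'KGA': 10, 'RPW': 80}
ratios = pair_enrichment(treated, reference)
print(round(ratios[('K', 'A')], 3))  # 2.0
```

Applying the transform (log2 by default) to such ratios would give the values shown in the heatmap cells.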
- hmap_grid(pos=None, col='auto', transform=<ufunc 'log2'>, cmap='vlag', vmax=None, center=None, **kwargs)¶
Creates a grid of heatmaps for all combinations of ribosome positions passed in pos.
Each cell in the upper triangle of the grid represents a heatmap of enrichment between two positions, with the visualization parameters inherited from the hmap method.
- Parameters:
pos (iterable, optional) – An iterable of ribosome positions for generating combinations (e.g., [‘-2’, ‘E’, ‘P’, ‘A’]). If not provided, defaults to the set of positions [‘-2’, ‘E’, ‘P’, ‘A’].
how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.
col (str, optional) – The dataset column used for computations. Displays the enrichment by default.
transform (callable, optional) – A function or callable to apply to the dataset before generating the heatmaps. Defaults to numpy.log2.
cmap (str or matplotlib.colors.Colormap, optional) – The colormap to use for the heatmap visualizations. Defaults to ‘vlag’.
vmax (float, optional) – The maximum value for color scaling in the heatmaps.
center (float, optional) – The midpoint value for centering the colormap.
kwargs (key, value pairings) – Additional parameters used to filter the dataset or control heatmap generation via the hmap method.
- Returns:
The figure object containing the grid of heatmaps.
- Return type:
matplotlib.figure.Figure
Examples
- Create the default heatmap grid for all combinations of -2/E/P/A
>>> sample.hmap_grid()
- Create a heatmap grid for combinations of E/P/A
>>> sample.hmap_grid(['E', 'P', 'A'])
- hmap_pos(pos=None, *, cmap='vlag', vmax=None, center=0, ax=None, **kwargs)¶
Generates a heatmap of enrichment ratios for amino acid positions across ribosome sites.
This method visualizes the enrichment ratios as a heatmap, where the rows correspond to different ribosome positions and the columns represent amino acids.
- Parameters:
pos (tuple, optional) – Ribosome positions for which to compute and visualize enrichment ratios (e.g., (‘-2’, ‘E’, ‘P’, ‘A’)).
how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded. Default is ‘aax’.
col (str, optional) – The DataFrame column to utilize for enrichment visualization. Defaults to ‘auto’.
transform (callable, optional) – A function or callable to apply to the enrichment matrix before plotting. Defaults to numpy.log2.
cmap (str or matplotlib.colors.Colormap, optional) – The colormap to use for the heatmap visualization. Defaults to ‘vlag’.
vmax (float, optional) – The maximum value for color scaling in the heatmap. If not provided, it defaults to the maximum absolute value in the enrichment matrix.
center (float, optional) – The midpoint of the colormap. Defaults to 0.
ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. A new figure and axes are created if not provided.
**kwargs (dict, optional) – Additional parameters to customize the enrichment computation or filtering.
- Returns:
The axes object containing the heatmap visualization.
- Return type:
matplotlib.axes.Axes
Notes
The rows of the heatmap correspond to ribosome positions, while the columns represent amino acids.
Tick labels are styled using the aa_colors dictionary to match the biochemical categories of amino acids.
Enrichment ratios are automatically log2-transformed by default.
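The default color scaling described above (log2 transform, colormap centered on 0, vmax set to the largest absolute value) can be sketched as follows. The helper names are hypothetical and the snippet works on nested lists rather than on the DataFrame used by the library.

```python
import math


def log2_matrix(ratios):
    """log2-transform a matrix of ratios, keeping NaN for missing or
    non-positive entries."""
    return [[math.log2(v) if v > 0 else math.nan for v in row] for row in ratios]


def symmetric_limits(matrix):
    """Color limits symmetric around 0, based on the largest absolute value."""
    magnitudes = [abs(v) for row in matrix for v in row if not math.isnan(v)]
    vmax = max(magnitudes)
    return -vmax, vmax


ratios = [[2.0, 0.5], [math.nan, 0.25]]
print(symmetric_limits(log2_matrix(ratios)))  # (-2.0, 2.0)
```

Symmetric limits keep a ratio of 1 (log2 of 0) at the neutral color of a diverging colormap such as 'vlag'.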
- infos(html=False)¶
Returns a table with information on the NGS reads per replicate.
- Parameters:
html (bool) – if True, returns the table as HTML, otherwise as DataFrame (default).
Example
>>> sample.infos()
          total_sequences  noadaptor  contaminant  lowqual  tooshort  toolong   extra0   extra1   extra2  MAX_LEN
sample.1          8384889     714414         1192   685017    385537  3341987  2308291  2528638  2833546       44
sample.2          9120203     498202         1659   513308    104071  5664107  2673062  2972850  2976089       44
sample.3          8490958    1043590         1328   409697    187746  4004073  2243720  2555783  2647865       44
- itp_len_plot(ax=None, min_codon=0, max_codon=10, limit=100, norm=False)¶
Generates a line plot of inverse-toeprint (ITP) counts per length.
This method uses the output of itp_len to create a line plot showing the counts of inverse-toeprints across lengths for each replicate. Optionally, counts can be normalized (per million reads), and the plotted lengths can be limited.
- Parameters:
ax (matplotlib.axes.Axes, optional) – Pre-existing axes to draw the plot on. A new figure and axes are created if not provided.
min_codon (int, optional) – The minimum codon position to annotate on the plot. Defaults to 0.
max_codon (int, optional) – The maximum codon position to annotate on the plot. Defaults to 10.
limit (int, optional) – The maximum length to include in the plot. Defaults to 100.
norm (bool, optional) – Whether to normalize counts to reads per million. Defaults to False.
- Returns:
The axes object containing the plotted lineplot.
- Return type:
matplotlib.axes.Axes
Notes
The x-axis represents the distance from the 3’ end of the inverse-toeprint in nucleotides.
The y-axis shows the counts of inverse-toeprints, either absolute or normalized per million reads.
Each replicate is plotted independently and distinguished by the hue attribute in the plot.
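The norm option can be read as a reads-per-million scaling. The sketch below shows the arithmetic on a plain dictionary. Which total the library divides by is not stated in this section, so the denominator is passed explicitly here and the function name is hypothetical.

```python
def per_million(counts_by_length, total_reads):
    """Scale raw counts to counts per million reads."""
    return {
        length: count * 1_000_000 / total_reads
        for length, count in counts_by_length.items()
    }


# Made-up counts for two inverse-toeprint lengths in a 1000-read replicate.
print(per_million({20: 50, 23: 150}, total_reads=1000))
# {20: 50000.0, 23: 150000.0}
```

Normalizing this way makes replicates with different sequencing depths comparable on the same y-axis.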
- rename(name, rename_replicates=True)¶
Changes the name of the sample.
- Parameters:
name (str) – name to use as the new name for the sample.
rename_replicates (bool) – If True (default), also rename the replicates based on the new sample name.
Examples
- Rename the sample to “new_name”.
>>> sample.rename(name='new_name')
- property itp_len¶
Combines the counts of inverse-toeprints (ITPs) for each length across all replicates.
This method extracts the counts of inverse-toeprints for each length from the metadata of each replicate and combines them into a single DataFrame, keeping the data for each replicate independent.
- Returns:
A DataFrame with the following columns:
- length : int
The length of the inverse-toeprints.
- replicate : str
The replicate identifier.
- count : int
The count of inverse-toeprints of the given length for the replicate.
- sample : str
The name of the sample this data belongs to.
- Return type:
pandas.DataFrame
Examples
>>> sample.itp_len
     length  replicate     count  sample
0        51          1  115732.0     spl
1        20          1  444506.0     spl
2        41          1  130495.0     spl
3        23          1  198257.0     spl
4        17          1   55786.0     spl
..      ...        ...       ...     ...
328     106          3       NaN     spl
329     143          3       NaN     spl
330     102          3       NaN     spl
331     104          3       NaN     spl
332     221          3       NaN     spl

[333 rows x 4 columns]
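The long-format layout shown above, including the NaN counts for lengths that a replicate did not observe, can be reproduced in miniature with plain dictionaries. This is a sketch under assumptions: `combine_itp_len` is a hypothetical helper and the per-replicate input format is invented for the illustration.

```python
import math


def combine_itp_len(per_replicate, sample):
    """Build one row per (replicate, length) pair, using NaN when a length
    is absent from a replicate so that all replicates share the same lengths."""
    lengths = sorted({length for counts in per_replicate.values() for length in counts})
    rows = []
    for replicate, counts in per_replicate.items():
        for length in lengths:
            rows.append({
                'length': length,
                'replicate': replicate,
                'count': float(counts.get(length, math.nan)),
                'sample': sample,
            })
    return rows


rows = combine_itp_len({'1': {20: 5, 23: 2}, '2': {20: 7}}, sample='spl')
for row in rows:
    print(row)
```

Keeping one row per replicate and length is what lets a plotting layer draw each replicate as its own line.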