API
Dataset objects
Detailed API
- class itpseq.DataSet(data_path: Path = '.', result_path: Path | None = None, samples: dict | None = None, keys=None, ref_labels: str | tuple | None = 'noa', cache_path=None, file_pattern=None, aafile_pattern=None)
Loads an iTP-Seq dataset and provides methods for analyzing and visualizing the data.
A DataSet object is constructed to handle iTP-Seq Samples with their respective Replicates. By default, it infers the files to use in the provided directory by looking for “*.processed.json” files produced during the initial step of pre-processing and filtering the fastq files. It uses the pattern of the file names to group the Replicates into a Sample, and to define which condition is the reference in the DataSet (the Sample with name “noa” by default).
- data_path
Path to the data directory containing the output files from the fastq pre-processing.
- Type:
str or Path
- result_path
Path to the directory where the results of the analysis will be saved.
- Type:
str or Path
- samples
Dictionary of Samples in the DataSet. By default, it is None and will be populated automatically.
- Type:
dict or None
- keys
Properties in the file name to use for identifying the reference.
- Type:
tuple
- ref_labels
Specifies the reference: e.g. ‘noa’ or ((‘sample’, ‘noa’),)
- Type:
str or tuple
- cache_path
Path used to cache intermediate results. By default, this creates a subdirectory called “cache” in the result_path directory.
- Type:
str or Path
- file_pattern
Regex pattern used to identify the sample files in the data_path directory. If None, defaults to r'(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)\.processed\.json', which matches files like nnn15_noa1.processed.json, nnn15_tcx2.processed.json, etc.
- Type:
str
- aafile_pattern
Pattern used to identify the amino acid files in the data_path directory. It will use the values captured in the file_pattern regex to construct the file names. If None, defaults to ‘{lib_type}_{sample}{replicate}_aa.processed.txt’
- Type:
str
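How file_pattern and aafile_pattern work together can be sketched with the standard re module. This is only an illustration: the pattern below assumes the default regex contains the usual \d and \. escapes, the helper parse_sample_file is hypothetical, and the use of re.fullmatch is a choice made for this sketch rather than a statement about how itpseq matches file names.

```python
import re

# Assumed form of the default file_pattern, with the regex escapes written out.
FILE_PATTERN = r'(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)\.processed\.json'
# Default aafile_pattern as documented above.
AAFILE_PATTERN = '{lib_type}_{sample}{replicate}_aa.processed.txt'


def parse_sample_file(filename):
    """Hypothetical helper: return (labels, amino acid file name), or None if no match."""
    match = re.fullmatch(FILE_PATTERN, filename)
    if match is None:
        return None
    labels = match.groupdict()
    # The captured groups fill the placeholders of the amino acid file pattern.
    return labels, AAFILE_PATTERN.format(**labels)


labels, aafile = parse_sample_file('nnn15_noa1.processed.json')
print(labels)  # {'lib_type': 'nnn15', 'sample': 'noa', 'replicate': '1'}
print(aafile)  # nnn15_noa1_aa.processed.txt
```

A custom file_pattern would need to keep the named groups that aafile_pattern refers to, otherwise the format step has nothing to substitute.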
Examples
Creating a DataSet from a simple antibiotic treatment (tcx) vs no treatment (noa) with 3 replicates each (1, 2, 3).
- Load a dataset from the current directory, inferring the samples automatically
>>> from itpseq import DataSet
>>> data = DataSet(data_path='.')
>>> data
DataSet(data_path=PosixPath('.'),
        reference=Sample(noa:[1, 2, 3]),
        samples=[Sample(noa:[1, 2, 3]),
                 Sample(tcx:[1, 2, 3], ref: noa)],
        )
- Compute a standard report and export it as PDF
>>> data.report('my_experiment.pdf')
- Display a graph of the inverse-toeprints lengths for each sample
>>> data.itp_len_plot(row='sample')
- Attributes:
- samples_with_ref
Methods
DE([pos]): Computes the log2-FoldChange for each motif described by pos for each sample in the DataSet relative to their reference
infos([html]): Displays summary information about the dataset sequences.
itoeprint
itp_len_plot
reorder_samples
report
- DE(pos='E:A', **kwargs)
Computes the log2-FoldChange for each motif described by pos for each sample in the DataSet relative to their reference
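As a reminder of what a log2 fold change expresses, here is a naive sketch using only the standard library. It is not the statistical procedure used by DE, which is not described in this section; the function name, the pseudocount, and the motif counts are all assumptions made for illustration.

```python
import math


def log2_fold_change(sample_count, ref_count, pseudocount=1.0):
    """Naive log2 fold change of a motif count relative to the reference.

    The pseudocount is an assumption of this sketch; it avoids division by zero.
    """
    return math.log2((sample_count + pseudocount) / (ref_count + pseudocount))


# Hypothetical normalized counts (sample, reference) for two E:A motifs.
counts = {'KP': (799, 99), 'GA': (49, 199)}
for motif, (sample_count, ref_count) in counts.items():
    print(motif, log2_fold_change(sample_count, ref_count))
```

A positive value means the motif is more frequent in the sample than in its reference, a negative value means it is depleted.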
- infos(html=False)
Displays summary information about the dataset sequences.
- class itpseq.Replicate(*, replicate: str | None = None, filename: Path | None = None, sample: Sample | None = None, labels: dict | None = None, **kwargs)
Replicate instances represent a specific biological or experimental replicate of a Sample.
The purpose of the class is to handle, process, and analyze data corresponding to a replicate. Replicate objects provide methods to load associated data, compute statistical measures, and generate graphical representations such as sequence logos.
- filename
Path to the file associated with the replicate. This file is expected to contain raw data relevant to the replicate.
- Type:
Optional[Path]
- replicate
Identifier or label for the replicate (e.g., “1”).
- Type:
Optional[str]
- labels
Dictionary of labels or metadata associated with the replicate.
- Type:
Optional[dict]
- name
Name of the sample, derived from sample.name if provided.
- Type:
str
- dataset
The DataSet the Sample belongs to, derived from sample.dataset if provided.
- Type:
Any
- kwargs
Additional keyword arguments and metadata stored as “meta” during initialization.
- Type:
dict
Methods
logo([logo_kwargs, ax, fMet, type]): Generates a sequence logo based on the aligned inverse-toeprints, using the logomaker library.
get_counts
load_data
- logo(logo_kwargs=None, ax=None, fMet=False, type='information', **kwargs)
Generates a sequence logo based on the aligned inverse-toeprints, using the logomaker library.
- Parameters:
logo_kwargs (dict, optional) – Additional keyword arguments passed to logomaker.Logo for customizing the sequence logo. Defaults to {‘color_scheme’: ‘NajafabadiEtAl2017’}.
ax (matplotlib.axes.Axes, optional) – Pre-existing matplotlib Axes to draw the logo on. A new Axes is created if not provided.
fMet (bool, optional) – If False, removes m (formyl-methionine / start codon) from the alignment when building the logo. Defaults to False.
type (str, optional) – The transformation type applied to the counts matrix. Possible values include:
- ‘information’ for information content.
- ‘probability’ for probabilities.
Defaults to ‘information’.
**kwargs (dict) – Additional keyword arguments passed to filter the input data (e.g., pos, min_peptide, max_peptide…).
- Returns:
A logomaker.Logo object representing the sequence logo.
- Return type:
logomaker.Logo
Notes
Sequence alignment data is first converted to a counts matrix via the logomaker.alignment_to_matrix method.
The ribosomal site corresponding to each position is annotated on the x-axis.
Transformation of the counts matrix (e.g., counts to information) is performed using logomaker.transform_matrix.
Examples
# Simple logo plot with default settings
logo = obj.logo()
# Logo plot with min_peptide filtering
logo = obj.logo(min_peptide=3)
# Logo plot with custom transformation type and filtering
logo = obj.logo(type='probability', min_peptide=2, fMet=True)
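The two steps mentioned in the Notes (alignment to counts matrix, then counts to information) can be illustrated without logomaker. The sketch below is a simplified stand-in: it ignores pseudocounts and background frequencies, which logomaker may apply, and the function names and peptide sequences are hypothetical.

```python
import math
from collections import Counter


def counts_matrix(sequences):
    """Count each character at each position of equal-length aligned sequences."""
    length = len(sequences[0])
    return [Counter(seq[i] for seq in sequences) for i in range(length)]


def information_content(column, alphabet_size=20):
    """Information (bits) of one column: log2(alphabet size) minus Shannon entropy."""
    total = sum(column.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in column.values())
    return math.log2(alphabet_size) - entropy


aligned = ['KPA', 'KPG', 'RPA', 'KPS']  # hypothetical aligned peptides
matrix = counts_matrix(aligned)
bits = [information_content(col) for col in matrix]
print(bits)
```

The fully conserved middle position carries the most information, which is why it would appear as the tallest stack in an information logo.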
- class itpseq.Sample(*, labels: dict, reference=None, dataset=None, data=None, keys=('sample',), **kwargs)
Represents a sample in a dataset, its replicates, reference, and associated metadata.
The Sample class is used to encapsulate information and behavior related to samples in a dataset. It manages details like labels, references, replicates, and metadata, and provides methods for analyzing replicates, performing differential enrichment analysis, and creating visualizations.
- Attributes:
- name_ref
- name_vs_ref
Methods
get_counts_ratio([pos, factor, exclude_empty])
get_counts_ratio_pos([pos]): Computes a DataFrame with the enrichment ratios for each ribosome position.
hmap([r, c, pos, col, transform, cmap, ...]): Generates a heatmap of enrichment for combinations of 2 positions.
hmap_grid([pos, col, transform, cmap, vmax, ...]): Creates a grid of heatmaps for all combinations of ribosome positions passed in pos.
hmap_pos([pos, cmap, vmax, center, ax]): Generates a heatmap of enrichment ratios for amino acid positions across ribosome sites.
itp_len_plot([ax, min_codon, max_codon, ...]): Generates a line plot of inverse-toeprint (ITP) counts per length.
DE
all_logos
get_counts
infos
itoeprint
load_replicates
logo
volcano
- get_counts_ratio_pos(pos=None, **kwargs)
Computes a DataFrame with the enrichment ratios for each ribosome position.
This method calculates the enrichment for amino acids at the specified positions on the ribosome and organizes the results into a DataFrame. Each row of the DataFrame corresponds to a ribosome position.
- Parameters:
pos (iterable, optional) – An iterable of ribosome positions for which to compute enrichment ratios (e.g., (‘-2’, ‘E’, ‘P’, ‘A’)). If not provided, defaults to (‘-2’, ‘E’, ‘P’, ‘A’).
how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.
**kwargs (dict, optional) – Additional parameters to filter the data or customize the ratio computations.
- Returns:
A DataFrame where rows correspond to ribosome positions and columns correspond to amino acids (ordered by a predefined amino acid sequence). The values in the DataFrame represent the enrichment ratios for each position and amino acid.
- Return type:
pandas.DataFrame
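The shape of such a result (positions as rows, amino acids as columns) can be sketched with plain dictionaries. The definition of the ratio used here, the frequency in the sample divided by the frequency in the reference, is one plausible reading and an assumption of this sketch; the function name and the counts are hypothetical.

```python
def enrichment_ratios(sample_counts, ref_counts):
    """Per-position, per-amino-acid ratio of sample frequency to reference frequency."""
    ratios = {}
    for pos, sample_aa in sample_counts.items():
        ref_aa = ref_counts[pos]
        sample_total = sum(sample_aa.values())
        ref_total = sum(ref_aa.values())
        ratios[pos] = {
            aa: (sample_aa[aa] / sample_total) / (ref_aa[aa] / ref_total)
            for aa in sample_aa
        }
    return ratios


# Hypothetical counts for two amino acids at two ribosome positions.
sample = {'P': {'K': 30, 'G': 10}, 'A': {'K': 20, 'G': 20}}
reference = {'P': {'K': 20, 'G': 20}, 'A': {'K': 20, 'G': 20}}
ratios = enrichment_ratios(sample, reference)
print(ratios['P'])  # {'K': 1.5, 'G': 0.5}
```

A nested dictionary of this form maps naturally onto a table with one row per position and one column per amino acid.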
- hmap(r=None, c=None, *, pos=None, col='auto', transform=<ufunc 'log2'>, cmap='vlag', vmax=None, center=None, ax=None, heatmap_kwargs=None, **kwargs)
Generates a heatmap of enrichment for combinations of 2 positions.
- Parameters:
r (str) – The row position on the ribosome for the heatmap.
c (str) – The column position on the ribosome for the heatmap.
pos (str or list) – Either a specific position in the form “r:c” or a list of positions to analyze.
how (str) – Defines the method to compute the counts (e.g., ‘mean’, ‘sum’, ‘count’). If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.
col (str) – The dataset column used for computations.
transform (callable, optional) – A function or callable to apply to the dataset before generating the heatmap.
cmap (str or matplotlib.colors.Colormap) – The colormap to use for the heatmap visualization.
vmax (float, optional) – The maximum value for color scaling in the heatmap.
center (float, optional) – The midpoint value for centering the colormap.
ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. If not provided, a new figure and axes are created.
heatmap_kwargs (dict) – Parameters passed to the sns.heatmap method
kwargs (dict) – Additional parameters used to filter the dataset. This allows for fine-tuning of the data before generating the heatmap.
- Returns:
The heatmap axes object containing the visualization.
- Return type:
matplotlib.axes.Axes
- hmap_grid(pos=None, col='auto', transform=<ufunc 'log2'>, cmap='vlag', vmax=None, center=None, **kwargs)
Creates a grid of heatmaps for all combinations of ribosome positions passed in pos.
Each cell in the upper triangle of the grid represents a heatmap of enrichment between two positions, with the visualization parameters inherited from the hmap method.
- Parameters:
pos (iterable, optional) – An iterable of ribosome positions for generating combinations (e.g., [‘-2’, ‘E’, ‘P’, ‘A’]). If not provided, defaults to the set of positions [‘-2’, ‘E’, ‘P’, ‘A’].
how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.
col (str, optional) – The dataset column used for computations. Displays the enrichment by default.
transform (callable, optional) – A function or callable to apply to the dataset before generating the heatmaps. Defaults to numpy.log2.
cmap (str or matplotlib.colors.Colormap, optional) – The colormap to use for the heatmap visualizations. Defaults to ‘vlag’.
vmax (float, optional) – The maximum value for color scaling in the heatmaps.
center (float, optional) – The midpoint value for centering the colormap.
kwargs (key, value pairings) – Additional parameters used to filter the dataset or control heatmap generation via the hmap method.
- Returns:
The figure object containing the grid of heatmaps.
- Return type:
matplotlib.figure.Figure
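The set of position pairs that fill the upper triangle of the grid can be enumerated with itertools.combinations. This only illustrates which pairs exist for the default positions; which member of each pair ends up on the rows or the columns of a given heatmap is an assumption here.

```python
from itertools import combinations

positions = ['-2', 'E', 'P', 'A']  # default positions documented above

# Each unordered pair of positions corresponds to one cell in the upper triangle.
pairs = list(combinations(positions, 2))
for row, col in pairs:
    # "r:c" is the pair notation accepted by the pos parameter of hmap.
    print(f'{row}:{col}')
```

With four positions this gives six pairs, so the default grid contains six heatmaps.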
- hmap_pos(pos=None, *, cmap='vlag', vmax=None, center=0, ax=None, **kwargs)
Generates a heatmap of enrichment ratios for amino acid positions across ribosome sites.
This method visualizes the enrichment ratios as a heatmap, where the rows correspond to different ribosome positions and the columns represent amino acids.
- Parameters:
pos (tuple, optional) – Ribosome positions for which to compute and visualize enrichment ratios (e.g., (‘-2’, ‘E’, ‘P’, ‘A’)).
how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded. Default is ‘aax’.
col (str, optional) – The DataFrame column to utilize for enrichment visualization. Defaults to ‘auto’.
transform (callable, optional) – A function or callable to apply to the enrichment matrix before plotting. Defaults to numpy.log2.
cmap (str or matplotlib.colors.Colormap, optional) – The colormap to use for the heatmap visualization. Defaults to ‘vlag’.
vmax (float, optional) – The maximum value for color scaling in the heatmap. If not provided, it defaults to the maximum absolute value in the enrichment matrix.
center (float, optional) – The midpoint of the colormap. Defaults to 0.
ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. A new figure and axes are created if not provided.
**kwargs (dict, optional) – Additional parameters to customize the enrichment computation or filtering.
- Returns:
The axes object containing the heatmap visualization.
- Return type:
matplotlib.axes.Axes
Notes
The rows of the heatmap correspond to ribosome positions, while the columns represent amino acids.
Tick labels are styled using the aa_colors dictionary to match the biochemical categories of amino acids.
Enrichment ratios are log2-transformed by default.
- itp_len_plot(ax=None, min_codon=0, max_codon=10, limit=100, norm=False)
Generates a line plot of inverse-toeprint (ITP) counts per length.
This method uses the output of itp_len to create a line plot showing the counts of inverse-toeprints across lengths for each replicate. Optionally, counts can be normalized (per million reads), and the plotted lengths can be limited.
- Parameters:
ax (matplotlib.axes.Axes, optional) – Pre-existing axes to draw the plot on. A new figure and axes are created if not provided.
min_codon (int, optional) – The minimum codon position to annotate on the plot. Defaults to 0.
max_codon (int, optional) – The maximum codon position to annotate on the plot. Defaults to 10.
limit (int, optional) – The maximum length to include in the plot. Defaults to 100.
norm (bool, optional) – Whether to normalize counts to reads per million. Defaults to False.
- Returns:
The axes object containing the line plot.
- Return type:
matplotlib.axes.Axes
Notes
The x-axis represents the distance from the 3’ end of the inverse-toeprint in nucleotides.
The y-axis shows the counts of inverse-toeprints, either absolute or normalized per million reads.
Each replicate is plotted independently and distinguished by the hue attribute in the plot.
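The per-million normalization enabled by norm=True can be sketched as follows. The sketch assumes that the denominator is the total count of the replicate being plotted, which this section does not specify; the function name and the counts are hypothetical.

```python
def normalize_per_million(length_counts):
    """Scale counts so that they sum to one million."""
    total = sum(length_counts.values())
    return {length: count / total * 1_000_000 for length, count in length_counts.items()}


# Hypothetical inverse-toeprint counts keyed by length in nucleotides.
counts = {21: 500, 24: 1500, 27: 2000}
normalized = normalize_per_million(counts)
print(normalized)  # {21: 125000.0, 24: 375000.0, 27: 500000.0}
```

Normalizing this way makes replicates with different sequencing depths comparable on the same y-axis.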
- property itp_len
Combines the counts of inverse-toeprints (ITPs) for each length across all replicates.
This property extracts the counts of inverse-toeprints for each length from the metadata of each replicate and combines them into a single DataFrame, keeping the data for each replicate independent.
- Returns:
A DataFrame with the following columns:
- length : int
The length of the inverse-toeprints.
- replicate : str
The replicate identifier.
- count : int
The count of inverse-toeprints of the given length for the replicate.
- sample : str
The name of the sample this data belongs to.
- Return type:
pandas.DataFrame
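The long format described above, with one row per replicate and length, can be sketched with plain Python records. The layout of the per-replicate metadata (a mapping from length to count) and the function name are assumptions made for this illustration.

```python
def combine_itp_len(sample_name, replicate_counts):
    """Flatten {replicate: {length: count}} into one record per (replicate, length)."""
    return [
        {'length': length, 'replicate': replicate, 'count': count, 'sample': sample_name}
        for replicate, counts in replicate_counts.items()
        for length, count in sorted(counts.items())
    ]


# Hypothetical per-replicate length counts for the "noa" sample.
records = combine_itp_len('noa', {'1': {24: 10, 21: 5}, '2': {21: 7}})
for record in records:
    print(record)
```

A list of records like this can be passed to the pandas.DataFrame constructor to obtain the four columns listed above.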