API

Dataset objects

Detailed API

class itpseq.DataSet(data_path: Path = '.', result_path: Path | None = None, samples: dict | None = None, keys=None, ref_labels: str | tuple | None = 'noa', cache_path=None, file_pattern=None, aafile_pattern=None)

Loads an iTP-Seq dataset and provides methods for analyzing and visualizing the data.

A DataSet object is constructed to handle iTP-Seq Samples with their respective Replicates. By default, it infers the files to use in the provided directory by looking for “*.processed.json” files produced during the initial step of pre-processing and filtering the fastq files. It uses the pattern of the file names to group the Replicates into a Sample, and to define which condition is the reference in the DataSet (the Sample with name “noa” by default).

data_path

Path to the data directory containing the output files from the fastq pre-processing.

Type:

str or Path

result_path

Path to the directory where the results of the analysis will be saved.

Type:

str or Path

samples

Dictionary of Samples in the DataSet. By default, it is None and will be populated automatically.

Type:

dict or None

keys

Properties in the file name to use for identifying the reference.

Type:

tuple

ref_labels

Specifies the reference: e.g. ‘noa’ or ((‘sample’, ‘noa’),)

Type:

str or tuple

cache_path

Path used to cache intermediate results. By default, this creates a subdirectory called “cache” in the result_path directory.

Type:

str or Path

file_pattern

Regex pattern used to identify the sample files in the data_path directory. If None, defaults to r’(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)\.processed\.json’, which matches files like nnn15_noa1.processed.json, nnn15_tcx2.processed.json, etc.

Type:

str
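
As an illustration of how such a pattern splits a file name into labels, here is a minimal sketch using only the standard `re` module. The pattern is copied from the DataSet repr shown in the Examples; the helper `parse_name` and the use of `re.fullmatch` are assumptions made for this sketch, not part of the itpseq API.

```python
import re

# Pattern as it appears in the DataSet repr of the Examples section.
FILE_PATTERN = (
    r'(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)'
    r'\.processed\.json'
)


def parse_name(filename):
    """Hypothetical helper: return the captured labels, or None if no match."""
    match = re.fullmatch(FILE_PATTERN, filename)
    return match.groupdict() if match else None


labels = parse_name('nnn15_noa1.processed.json')
print(labels)
```

Each named group becomes one label (`lib_type`, `sample`, `replicate`), which is what allows the replicates of a sample to be grouped together.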

aafile_pattern

Pattern used to identify the amino acid files in the data_path directory. It will use the values captured in the file_pattern regex to construct the file names. If None, defaults to ‘{lib_type}_{sample}{replicate}_aa.processed.txt’

Type:

str
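
The template above can be filled in with the labels captured by file_pattern. The sketch below assumes plain `str.format` substitution, which matches the brace syntax of the default value but is not confirmed to be how itpseq builds the name internally.

```python
# Labels as they would be captured from 'nnn15_tcx2.processed.json'.
labels = {'lib_type': 'nnn15', 'sample': 'tcx', 'replicate': '2'}

# Default template quoted in the aafile_pattern description.
AAFILE_PATTERN = '{lib_type}_{sample}{replicate}_aa.processed.txt'

aa_name = AAFILE_PATTERN.format(**labels)
print(aa_name)  # nnn15_tcx2_aa.processed.txt
```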

Examples

Creating a DataSet from a simple antibiotic treatment (tcx) vs no treatment (noa) with 3 replicates each (1, 2, 3).

Load a dataset from the current directory, inferring the samples automatically.
>>> from itpseq import DataSet
>>> data = DataSet(data_path='.')
>>> data
DataSet(data_path=PosixPath('.'),
        file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)\\.processed\\.json',
        samples=[Sample(nnn15.noa:[1, 2, 3]),
                 Sample(nnn15.tcx:[1, 2, 3], ref: nnn15.noa)],
        )
Same as above, but only use “sample” as key.
>>> data = DataSet(data_path='.', keys=['sample'])
>>> data
DataSet(data_path=PosixPath('.'),
        file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)\\.processed\\.json',
        samples=[Sample(noa:[1, 2, 3]),
                 Sample(tcx:[1, 2, 3], ref: noa)],
        )
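
The effect of keys on the sample names in the two outputs above can be sketched as follows. This is a simplified model, not the library's code: the helper `group_replicates` is hypothetical, and it assumes that the sample name is built by joining the values of the selected keys with a dot, as the names “nnn15.noa” and “noa” suggest.

```python
from collections import defaultdict

# Labels as captured from the file names (two replicates shown for brevity).
parsed = [
    {'lib_type': 'nnn15', 'sample': 'noa', 'replicate': '1'},
    {'lib_type': 'nnn15', 'sample': 'noa', 'replicate': '2'},
    {'lib_type': 'nnn15', 'sample': 'tcx', 'replicate': '1'},
    {'lib_type': 'nnn15', 'sample': 'tcx', 'replicate': '2'},
]


def group_replicates(parsed_labels, keys):
    """Hypothetical helper: group replicate numbers by a name built from keys."""
    groups = defaultdict(list)
    for labels in parsed_labels:
        name = '.'.join(labels[key] for key in keys)
        groups[name].append(int(labels['replicate']))
    return dict(groups)


by_lib_and_sample = group_replicates(parsed, ('lib_type', 'sample'))
by_sample = group_replicates(parsed, ('sample',))
print(by_lib_and_sample)
print(by_sample)
```
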
Compute a standard report and export it as PDF
>>> data.report('my_experiment.pdf')
Display a graph of the inverse-toeprints lengths for each sample
>>> data.itp_len_plot(row='sample')
Attributes:
samples_with_ref

Methods

DE([pos])

Computes the log2-FoldChange for each motif described by pos for each sample in the DataSet relative to their reference

infos([html])

Displays summary information about the dataset NGS reads per replicate.

itoeprint

itp_len_plot

reorder_samples

report

DE(pos='E:A', **kwargs)

Computes the log2-FoldChange for each motif described by pos for each sample in the DataSet relative to their reference
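
For readers unfamiliar with the quantity, a log2 fold change compares the relative abundance of a motif in a sample with its abundance in the reference. The sketch below only illustrates that definition on made-up counts; the function name and the normalization by total counts are assumptions, and the actual DE method may use replicate-aware statistics that are not shown here.

```python
import math


def log2_fold_change(sample_counts, ref_counts):
    """Illustrative only: log2 ratio of motif frequencies, sample vs reference."""
    total_sample = sum(sample_counts.values())
    total_ref = sum(ref_counts.values())
    result = {}
    # Only motifs observed in both conditions have a defined ratio.
    for motif in sample_counts.keys() & ref_counts.keys():
        freq_sample = sample_counts[motif] / total_sample
        freq_ref = ref_counts[motif] / total_ref
        result[motif] = math.log2(freq_sample / freq_ref)
    return result


# Made-up counts for two motifs.
lfc = log2_fold_change({'PPP': 400, 'AAA': 600}, {'PPP': 100, 'AAA': 900})
print(lfc)
```

A positive value means the motif is enriched in the sample relative to the reference, and a negative value means it is depleted.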

infos(html=False)

Displays summary information about the dataset NGS reads per replicate.

This information is computed during the parsing step and includes:
  • the total number of reads,

  • the number of reads without adaptors,

  • the number of reads that are contaminants,

  • the number of reads with a low quality,

  • the number of reads that are too short or too long,

  • the number of extra nucleotides at the 3’-end of the inverse-toeprints.

Parameters:

html (bool) – if True, returns the table as HTML, otherwise as DataFrame (default).

Example

>>> dataset.infos()
          total_sequences  noadaptor  contaminant  lowqual  tooshort  toolong   extra0   extra1   extra2  MAX_LEN
noa.1             9036255     799955         1219   700374    299502  3376581  2434092  2709762  3092446       44
noa.2             8154560     407750         1318   680813    154587  4158921  2329190  2582052  2835568       44
noa.3             7725561     623037         1065   353401    279104  3505909  2216460  2402957  2483107       44
sample.1          8384889     714414         1192   685017    385537  3341987  2308291  2528638  2833546       44
sample.2          9120203     498202         1659   513308    104071  5664107  2673062  2972850  2976089       44
sample.3          8490958    1043590         1328   409697    187746  4004073  2243720  2555783  2647865       44
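
To put these numbers in perspective, each filter category can be expressed as a share of the total reads. The sketch below does this for the noa.1 row of the table above using plain Python. Reading each column as a fraction of total_sequences is an interpretation made for this example; the documentation does not state whether the categories are mutually exclusive.

```python
# Values copied from the noa.1 row of the infos() output.
row = {
    'total_sequences': 9036255,
    'noadaptor': 799955,
    'contaminant': 1219,
    'lowqual': 700374,
    'tooshort': 299502,
    'toolong': 3376581,
}

total = row['total_sequences']
fractions = {
    name: count / total
    for name, count in row.items()
    if name != 'total_sequences'
}

for name, fraction in fractions.items():
    print(f'{name:12s} {fraction:.2%}')
```
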
class itpseq.Replicate(*, replicate: str | None = None, filename: Path | None = None, sample: Sample | None = None, labels: dict | None = None, **kwargs)

Replicate instances represent a specific biological or experimental replicate of a Sample.

The purpose of the class is to handle, process, and analyze data corresponding to a replicate. Replicate objects provide methods to load associated data, compute statistical measures, and generate graphical representations such as sequence logos.

filename

Path to the file associated with the replicate. This file is expected to contain raw data relevant to the replicate.

Type:

Optional[Path]

sample

The sample object this replicate belongs to.

Type:

Optional[Sample]

replicate

Identifier or label for the replicate (e.g., “1”).

Type:

Optional[str]

labels

Dictionary of labels or metadata associated with the replicate.

Type:

Optional[dict]

name

Name of the sample, derived from sample.name if provided.

Type:

str

dataset

The DataSet the Sample belongs to, derived from sample.dataset if provided.

Type:

Any

kwargs

Additional keyword arguments and metadata stored as “meta” during initialization.

Type:

dict

Attributes:
dataset
sample_name

Methods

get_counts([pos])

Counts the number of reads for each motif or combination of amino-acid/position.

load_data([min_peptide, max_peptide, how, ...])

Reads the amino-acid inverse-toeprint file as a pandas Series, filters entries based on peptide length and stop codons.

logo([logo_kwargs, ax, fMet, type])

Generates a sequence logo based on the aligned inverse-toeprints, using the logomaker library.

rename([name])

Sets the name of the replicate from a parameter or automatically.

get_counts(pos=None, **kwargs)

Counts the number of reads for each motif or combination of amino-acid/position.

Parameters:
  • pos (str, optional) – Position to consider when counting the reads. If None is passed, then this returns a DataFrame with the counts of each amino-acid per position.

  • kwargs (optional) – Optional parameters to pass to load_data (min_peptide, max_peptide, how, limit, sample)

Returns:

A DataFrame if pos is None, otherwise a Series.

Return type:

Series or DataFrame

Examples

Count the number of reads for each amino-acid/position combination
>>> replicate.get_counts()
           -8         -7         -6  ...        -1         0         1
    2879961.0  2658485.0  2449526.0  ...  793143.0   52640.0       NaN
*         NaN        NaN        NaN  ...       NaN       NaN  910137.0
A         NaN    12240.0    25225.0  ...  111369.0  134995.0  107591.0
..        ...        ...        ...  ...       ...       ...       ...
W         NaN     2686.0     5059.0  ...   17643.0   28095.0   21577.0
Y         NaN     9522.0    19296.0  ...   69671.0   81462.0   93099.0
m    197624.0   221476.0   208959.0  ...  409289.0  740503.0   52640.0
[23 rows x 10 columns]
Count the number of reads for each motif in the E-P-A sites
>>> replicate.get_counts(pos='E:A')
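
The two kinds of counts can be mimicked on a toy list of peptides. This sketch is an assumption-laden model rather than the library's implementation: it assumes peptides are right-aligned so that the last residue sits at position 1, with shorter peptides padded by spaces, which is what the example output above suggests. The helper names are hypothetical.

```python
from collections import Counter, defaultdict


def position_counts(peptides, width=None):
    """Count residues per position after right-aligning the peptides."""
    width = width or max(len(pep) for pep in peptides)
    counts = defaultdict(Counter)
    for pep in peptides:
        padded = pep.rjust(width)[-width:]
        for offset, residue in enumerate(padded):
            # The last residue maps to position 1, the previous one to 0, etc.
            counts[offset - (width - 2)][residue] += 1
    return counts


def motif_counts(peptides, size=3):
    """Count the motifs formed by the last `size` residues (space-padded)."""
    return Counter(pep.rjust(size)[-size:] for pep in peptides)


peptides = ['mWQ', 'm', 'mA*']
counts = position_counts(peptides)
motifs = motif_counts(peptides)
print(dict(counts))
print(motifs)
```
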
load_data(min_peptide=None, max_peptide=None, how='aax', limit=None, sample=None)

Reads the amino-acid inverse-toeprint file as a pandas Series, filters entries based on peptide length and stop codons.

Parameters:
  • filename (Path or str) – Path to the amino-acid file

  • min_peptide (int, optional) – Minimum peptide length to keep, by default None

  • max_peptide (int, optional) – Maximum peptide length to keep, by default None

  • how (str, optional) – Mode to filter the stops: “aax” will remove peptides with a stop before the A-site, by default ‘aax’

Returns:

Series of inverse-toeprints for the replicate.

Return type:

Series
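
The filtering described by the parameters can be sketched on a plain list of peptide strings. This is not the library's code: the function `filter_peptides` is hypothetical, it assumes that the peptide length includes the leading “m”, and it treats any value of how other than 'aax' as “keep internal stops”.

```python
def filter_peptides(peptides, min_peptide=None, max_peptide=None, how='aax'):
    """Hypothetical filter mirroring the documented parameters."""
    kept = []
    for pep in peptides:
        if min_peptide is not None and len(pep) < min_peptide:
            continue
        if max_peptide is not None and len(pep) > max_peptide:
            continue
        # 'aax': drop peptides with a stop (*) anywhere before the last residue.
        if how == 'aax' and '*' in pep[:-1]:
            continue
        kept.append(pep)
    return kept


peptides = ['mFIVRGWQV', 'mWQ', 'm*T', 'mHPNYTS*PV', 'mA*', 'm']
no_internal_stops = filter_peptides(peptides, min_peptide=3)
with_internal_stops = filter_peptides(peptides, min_peptide=3, how='aa')
print(no_internal_stops)
print(with_internal_stops)
```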

Examples

Load the inverse-toeprints with a minimum peptide length of 3 and keep the internal stops.
>>> replicate.load_data(min_peptide=3, how='aa')
0           mFIVRGWQV
1                 mWQ
2                 m*T
3          mEVHATTSGQ
4          mHPNYTS*PV
              ...
2828877          mTGA
2828878     mRSATINLQ
2828879    mSLMPHHRGN
2828880          mHWH
2828881     mSSTRSSRS
Length: 2828882, dtype: object
logo([logo_kwargs, ax, fMet, type])

Generates a sequence logo based on the aligned inverse-toeprints, using the logomaker library.

Parameters:
  • logo_kwargs (dict, optional) – Additional keyword arguments passed to logomaker.Logo for customizing the sequence logo. Defaults to {‘color_scheme’: ‘NajafabadiEtAl2017’}.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing matplotlib Axes to draw the logo on. A new Axes is created if not provided.

  • fMet (bool, optional) – If False, removes m (formyl-methionine / start codon) from the alignment when building the logo. Defaults to False.

  • type (str, optional) – The transformation type applied to the counts matrix. Possible values include ‘information’ (information content) and ‘probability’ (probabilities). Defaults to ‘information’.

  • **kwargs (dict) – Additional keyword arguments passed to filter the input data (e.g., pos, min_peptide, max_peptide…).

Returns:

A logomaker.Logo object representing the sequence logo.

Return type:

logomaker.Logo

Notes

  • Sequence alignment data is first converted to a counts matrix via the logomaker.alignment_to_matrix method.

  • The ribosomal site corresponding to each position is annotated on the x-axis.

  • Transformation of the counts matrix (e.g., counts to information) is performed using logomaker.transform_matrix.
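
To give an idea of what the ‘information’ transformation measures, the sketch below computes the textbook information content of a single alignment column, in bits, against a uniform background. It is a conceptual illustration only: logomaker's own transformation may apply pseudocounts and other adjustments, and the function name and alphabet_size parameter are assumptions for this example.

```python
import math


def information_content(column_counts, alphabet_size=20):
    """Information content (bits) of one column: log2(alphabet) - entropy."""
    total = sum(column_counts.values())
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in column_counts.values()
        if count > 0
    )
    return math.log2(alphabet_size) - entropy


# A fully conserved column carries the maximum information,
# a uniform column carries none.
conserved = information_content({'A': 10}, alphabet_size=4)
uniform = information_content({'A': 5, 'C': 5, 'G': 5, 'T': 5}, alphabet_size=4)
print(conserved, uniform)
```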

Examples

Simple logo plot with default settings
>>> logo = obj.logo()
Logo plot with min_peptide filtering
>>> logo = obj.logo(min_peptide=3)
Logo plot with custom transformation type and filtering
>>> logo = obj.logo(type='probability', min_peptide=2, fMet=True)
rename(name=None)

Sets the name of the replicate from a parameter or automatically.

Parameters:

name (str, optional) – name to use as the new name for the replicate.

Examples

Rename the replicate with a parameter
>>> rep.rename(name='new_name')
Rename the replicate automatically from its parent sample data
>>> rep.rename()
class itpseq.Sample(*, labels: dict, reference=None, dataset=None, data=None, keys=('sample',), **kwargs)

Represents a sample in a dataset, its replicates, reference, and associated metadata.

The Sample class is used to encapsulate information and behavior related to samples in a dataset. It manages details like labels, references, replicates, and metadata, and provides methods for analyzing replicates, performing differential enrichment analysis, and creating visualizations.

Attributes:
name_ref
name_vs_ref

Methods

copy([name, reference])

Creates a copy of the sample.

get_counts([pos])

Counts the number of reads for each motif or combination of amino-acid/position for each replicate in the sample.

get_counts_ratio([pos, factor, exclude_empty])

get_counts_ratio_pos([pos])

Computes a DataFrame with the enrichment ratios for each ribosome position.

hmap([r, c, pos, col, transform, cmap, ...])

Generates a heatmap of enrichment for combinations of 2 positions.

hmap_grid([pos, col, transform, cmap, vmax, ...])

Creates a grid of heatmaps for all combinations of ribosome positions passed in pos.

hmap_pos([pos, cmap, vmax, center, ax])

Generates a heatmap of enrichment ratios for amino acid positions across ribosome sites.

infos([html])

Returns a table with information on the NGS reads per replicate.

itp_len_plot([ax, min_codon, max_codon, ...])

Generates a line plot of inverse-toeprint (ITP) counts per length.

rename(name[, rename_replicates])

Changes the name of the sample.

DE

all_logos

itoeprint

load_replicates

logo

volcano

copy(name=None, reference=<no_default>)

Creates a copy of the sample.

Parameters:
  • name (str, optional) – New name for the sample.

  • reference (Sample or None, optional) – If a parameter is used, this will set it as the reference sample.

Returns:

A new Sample object with the same data as the original sample and optionally an updated name and reference.

Return type:

Sample

Examples

Create a copy of “sample” and change its name to “new_name”.
>>> new_sample = sample.copy(name='new_name')
>>> new_sample
Sample(new_name:[1, 2, 3], ref: ref_name)
Create a copy of “sample” with “sample2” as reference.
>>> new_sample = sample.copy(reference=sample2)
>>> new_sample
Sample(new_name:[1, 2, 3], ref: sample2)
get_counts(pos=None, **kwargs)

Counts the number of reads for each motif or combination of amino-acid/position for each replicate in the sample.

Parameters:
  • pos (str, optional) – Position to consider when counting the reads. If None is passed, then this returns a DataFrame with the counts of each amino-acid per position.

  • kwargs (optional) – Optional parameters to pass to load_data (min_peptide, max_peptide, how, limit, sample)

Returns:

Returns a DataFrame. If pos is None the columns will be a MultiIndex.

Return type:

DataFrame

Examples

Count the number of reads for each amino-acid/position combination
>>> sample.get_counts()
     sample.1                        ...  sample.3
           -8         -7         -6  ...        -1         0         1
    2879961.0  2658485.0  2449526.0  ...  724998.0   34748.0       NaN
*         NaN        NaN        NaN  ...       NaN       NaN  880568.0
A         NaN    12240.0    25225.0  ...   92225.0  115164.0   85132.0
..        ...        ...        ...  ...       ...       ...       ...
W         NaN     2686.0     5059.0  ...   14313.0   23730.0   17656.0
Y         NaN     9522.0    19296.0  ...   57431.0   69162.0   81430.0
m    197624.0   221476.0   208959.0  ...  375644.0  690250.0   34748.0
[23 rows x 30 columns]
Count the number of reads for each motif in the E-P-A sites
>>> sample.get_counts(pos='E:A')
     sample.1  sample.2  sample.3
 m*  254850.0  107060.0  258338.0
 mS   54993.0   20419.0   50959.0
  m   52640.0   17860.0   34748.0
 ..        ...       ...       ...
WFW       NaN       2.0       NaN
WWW       NaN       1.0       NaN
MMW       NaN       NaN       1.0
[8842 rows x 3 columns]
get_counts_ratio_pos(pos=None, **kwargs)

Computes a DataFrame with the enrichment ratios for each ribosome position.

This method calculates the enrichment for amino acids at the specified positions on the ribosome and organizes the results into a DataFrame. Each row of the DataFrame corresponds to a ribosome position.

Parameters:
  • pos (iterable, optional) – An iterable of ribosome positions for which to compute enrichment ratios (e.g., (‘-2’, ‘E’, ‘P’, ‘A’)). If not provided, defaults to (‘-2’, ‘E’, ‘P’, ‘A’).

  • how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.

  • **kwargs (dict, optional) – Additional parameters to filter the data or customize the ratio computations.

Returns:

A DataFrame where rows correspond to ribosome positions and columns correspond to amino acids (ordered by a predefined amino acid sequence). The values in the DataFrame represent the enrichment ratios for each position and amino acid.

Return type:

pandas.DataFrame
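
The shape of this result can be illustrated with nested dictionaries in place of a DataFrame. The sketch assumes that an enrichment ratio is the frequency of an amino acid at a site in the sample divided by its frequency at the same site in the reference. How itpseq combines replicates or normalizes counts is not described here, so treat the function below as hypothetical.

```python
def enrichment_by_position(sample_counts, ref_counts):
    """Per-site, per-amino-acid ratio of frequencies (sample / reference)."""
    ratios = {}
    for site, sample_site in sample_counts.items():
        ref_site = ref_counts[site]
        total_sample = sum(sample_site.values())
        total_ref = sum(ref_site.values())
        ratios[site] = {
            aa: (count / total_sample) / (ref_site[aa] / total_ref)
            for aa, count in sample_site.items()
            if ref_site.get(aa)  # skip amino acids absent from the reference
        }
    return ratios


# Made-up counts for two sites and two amino acids.
sample_counts = {'P': {'K': 30, 'R': 70}, 'A': {'K': 50, 'R': 50}}
ref_counts = {'P': {'K': 50, 'R': 50}, 'A': {'K': 50, 'R': 50}}
ratios = enrichment_by_position(sample_counts, ref_counts)
print(ratios)
```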

Examples

Calculate the enrichment relative to the reference for the default -2/E/P/A positions
>>> sample.get_counts_ratio_pos()
amino-acid         H         R         K  ...         W         *         m
site                                      ...
-2          1.062831  1.066174  1.012982  ...  1.046303       NaN  0.907140
E           1.037079  1.018643  0.941939  ...  1.041217       NaN  0.933880
P           1.093492  1.100380  1.045145  ...  1.107238       NaN  0.793043
A           0.831129  1.005783  0.967491  ...  0.995833  1.143702  0.757118
[4 rows x 22 columns]
hmap(r=None, c=None, *, pos=None, col='auto', transform=<ufunc 'log2'>, cmap='vlag', vmax=None, center=None, ax=None, heatmap_kwargs=None, **kwargs)

Generates a heatmap of enrichment for combinations of 2 positions.

Parameters:
  • r (str) – The row position on the ribosome for the heatmap.

  • c (str) – The column position on the ribosome for the heatmap.

  • pos (str or list) – Either a specific position in the form “r:c” or a list of positions to analyze.

  • how (str) – Defines the method to compute the counts (e.g., ‘mean’, ‘sum’, ‘count’). If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.

  • col (str) – The dataset column used for computations.

  • transform (callable, optional) – A function or callable to apply to the dataset before generating the heatmap.

  • cmap (str or matplotlib.colors.Colormap) – The colormap to use for the heatmap visualization.

  • vmax (float, optional) – The maximum value for color scaling in the heatmap.

  • center (float, optional) – The midpoint value for centering the colormap.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. If not provided, a new figure and axes are created.

  • heatmap_kwargs (dict) – Parameters passed to the sns.heatmap method

  • kwargs (dict) – Additional parameters used to filter the dataset. This allows for fine-tuning of the data before generating the heatmap.

Returns:

The heatmap axes object containing the visualization.

Return type:

matplotlib.axes.Axes

Examples

Create a heatmap for positions E-P-A
>>> sample.hmap('E:A')
hmap_grid(pos=None, col='auto', transform=<ufunc 'log2'>, cmap='vlag', vmax=None, center=None, **kwargs)

Creates a grid of heatmaps for all combinations of ribosome positions passed in pos.

Each cell in the upper triangle of the grid represents a heatmap of enrichment between two positions, with the visualization parameters inherited from the hmap method.

Parameters:
  • pos (iterable, optional) – An iterable of ribosome positions for generating combinations (e.g., [‘-2’, ‘E’, ‘P’, ‘A’]). If not provided, defaults to the set of positions [‘-2’, ‘E’, ‘P’, ‘A’].

  • how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.

  • col (str, optional) – The dataset column used for computations. Displays the enrichment by default.

  • transform (callable, optional) – A function or callable to apply to the dataset before generating the heatmaps. Defaults to numpy.log2.

  • cmap (str or matplotlib.colors.Colormap, optional) – The colormap to use for the heatmap visualizations. Defaults to ‘vlag’.

  • vmax (float, optional) – The maximum value for color scaling in the heatmaps.

  • center (float, optional) – The midpoint value for centering the colormap.

  • kwargs (key, value pairings) – Additional parameters used to filter the dataset or control heatmap generation via the hmap method.

Returns:

The figure object containing the grid of heatmaps.

Return type:

matplotlib.figure.Figure
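
The pairs of positions that fill the upper triangle of the grid can be enumerated with `itertools.combinations`. The sketch below only lists the pairs and a possible (row, column) cell for each; the exact layout of the grid in itpseq is an assumption.

```python
from itertools import combinations

positions = ['-2', 'E', 'P', 'A']

# Every unordered pair of distinct positions, in input order.
pairs = list(combinations(positions, 2))

# Assumed layout: the pair (positions[i], positions[j]) with i < j
# lands in row i, column j, i.e. in the upper triangle of the grid.
cells = {
    (positions.index(r), positions.index(c)): (r, c)
    for r, c in pairs
}

for cell, pair in cells.items():
    print(cell, pair)
```

With the four default positions this yields six heatmaps.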

Examples

Create the default heatmap grid for all combinations of -2/E/P/A
>>> sample.hmap_grid()
Create a heatmap grid for combinations of E/P/A
>>> sample.hmap_grid(['E', 'P', 'A'])
hmap_pos(pos=None, *, cmap='vlag', vmax=None, center=0, ax=None, **kwargs)

Generates a heatmap of enrichment ratios for amino acid positions across ribosome sites.

This method visualizes the enrichment ratios as a heatmap, where the rows correspond to different ribosome positions and the columns represent amino acids.

Parameters:
  • pos (tuple, optional) – Ribosome positions for which to compute and visualize enrichment ratios (e.g., (‘-2’, ‘E’, ‘P’, ‘A’)).

  • how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded. Default is ‘aax’.

  • col (str, optional) – The DataFrame column to utilize for enrichment visualization. Defaults to ‘auto’.

  • transform (callable, optional) – A function or callable to apply to the enrichment matrix before plotting. Defaults to numpy.log2.

  • cmap (str or matplotlib.colors.Colormap, optional) – The colormap to use for the heatmap visualization. Defaults to ‘vlag’.

  • vmax (float, optional) – The maximum value for color scaling in the heatmap. If not provided, it defaults to the maximum absolute value in the enrichment matrix.

  • center (float, optional) – The midpoint of the colormap. Defaults to 0.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. A new figure and axes are created if not provided.

  • **kwargs (dict, optional) – Additional parameters to customize the enrichment computation or filtering.

Returns:

The axes object containing the heatmap visualization.

Return type:

matplotlib.axes.Axes

Notes

  • The rows of the heatmap correspond to ribosome positions, while the columns represent amino acids.

  • Tick labels are styled using the aa_colors dictionary to match the biochemical categories of amino acids.

  • Enrichment ratios are log2-transformed by default.
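
The default color scaling described for vmax and center can be sketched as follows: after the log2 transformation, vmax is the largest absolute value in the matrix. Using -vmax as the lower bound, so that the scale is symmetric around 0, is an assumption made for this example, and the ratios are made up.

```python
import math

# Made-up enrichment ratios for two sites and two amino acids.
ratios = {'P': {'K': 0.5, 'R': 2.0}, 'A': {'K': 1.0, 'R': 4.0}}

log_ratios = {
    site: {aa: math.log2(value) for aa, value in row.items()}
    for site, row in ratios.items()
}

# Largest absolute log2 ratio, used as the upper bound of the color scale.
vmax = max(abs(value) for row in log_ratios.values() for value in row.values())
vmin = -vmax  # assumed symmetric scale around center=0

print(vmin, vmax)
```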

infos(html=False)

Returns a table with information on the NGS reads per replicate.

Parameters:

html (bool) – if True, returns the table as HTML, otherwise as DataFrame (default).

Example

>>> sample.infos()
          total_sequences  noadaptor  contaminant  lowqual  tooshort  toolong   extra0   extra1   extra2  MAX_LEN
sample.1          8384889     714414         1192   685017    385537  3341987  2308291  2528638  2833546       44
sample.2          9120203     498202         1659   513308    104071  5664107  2673062  2972850  2976089       44
sample.3          8490958    1043590         1328   409697    187746  4004073  2243720  2555783  2647865       44
itp_len_plot(ax=None, min_codon=0, max_codon=10, limit=100, norm=False)

Generates a line plot of inverse-toeprint (ITP) counts per length.

This method uses the output of itp_len to create a line plot showing the counts of inverse-toeprints across lengths for each replicate. Optionally, counts can be normalized (per million reads), and the plotted lengths can be limited.

Parameters:
  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes to draw the plot on. A new figure and axes are created if not provided.

  • min_codon (int, optional) – The minimum codon position to annotate on the plot. Defaults to 0.

  • max_codon (int, optional) – The maximum codon position to annotate on the plot. Defaults to 10.

  • limit (int, optional) – The maximum length to include in the plot. Defaults to 100.

  • norm (bool, optional) – Whether to normalize counts to reads per million. Defaults to False.

Returns:

The axes object containing the plotted lineplot.

Return type:

matplotlib.axes.Axes

Notes

  • The x-axis represents the distance from the 3’ end of the inverse-toeprint in nucleotides.

  • The y-axis shows the counts of inverse-toeprints, either absolute or normalized per million reads.

  • Each replicate is plotted independently and distinguished by the hue attribute in the plot.
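
The per-million normalization enabled by norm=True can be illustrated on a small dictionary of counts per length. The sketch assumes that the denominator is the sum of the counts passed in; the library may instead use the total number of reads of the replicate.

```python
def per_million(length_counts):
    """Scale counts so that they sum to one million."""
    total = sum(length_counts.values())
    return {
        length: count * 1_000_000 / total
        for length, count in length_counts.items()
    }


# Made-up counts of inverse-toeprints for three lengths.
normalized = per_million({17: 50, 20: 400, 23: 550})
print(normalized)
```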

rename(name, rename_replicates=True)

Changes the name of the sample.

Parameters:
  • name (str) – name to use as the new name for the sample.

  • rename_replicates (bool) – If True (default), also rename the replicates based on the new sample name.

Examples

Rename the sample to “new_name”.
>>> sample.rename(name='new_name')
property itp_len

Combines the counts of inverse-toeprints (ITPs) for each length across all replicates.

This property extracts the counts of inverse-toeprints for each length from the metadata of each replicate and combines them into a single DataFrame, keeping the data for each replicate independent.

Returns:

A DataFrame with the following columns:

  • length (int) – The length of the inverse-toeprints.

  • replicate (str) – The replicate identifier.

  • count (int) – The count of inverse-toeprints of the given length for the replicate.

  • sample (str) – The name of the sample this data belongs to.

Return type:

pandas.DataFrame
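
The long format of this table (one row per length and replicate) can be sketched with plain Python. The helper `combine_lengths` is hypothetical; filling lengths that are missing from a replicate with NaN is an assumption based on the NaN counts visible in the example output.

```python
import math


def combine_lengths(sample_name, per_replicate):
    """Build one row per (replicate, length), with NaN for missing lengths."""
    lengths = sorted({n for counts in per_replicate.values() for n in counts})
    rows = []
    for replicate, counts in per_replicate.items():
        for length in lengths:
            rows.append({
                'length': length,
                'replicate': replicate,
                'count': counts.get(length, math.nan),
                'sample': sample_name,
            })
    return rows


# Made-up per-replicate counts; replicate '3' has no reads of length 20.
rows = combine_lengths('spl', {'1': {17: 55786, 20: 444506}, '3': {17: 50000}})
for row in rows:
    print(row)
```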

Examples

>>> sample.itp_len
     length replicate     count sample
0        51         1  115732.0    spl
1        20         1  444506.0    spl
2        41         1  130495.0    spl
3        23         1  198257.0    spl
4        17         1   55786.0    spl
..      ...       ...       ...    ...
328     106         3       NaN    spl
329     143         3       NaN    spl
330     102         3       NaN    spl
331     104         3       NaN    spl
332     221         3       NaN    spl
[333 rows x 4 columns]