pacbio_data_processing package

Subpackages

Submodules

pacbio_data_processing.bam module

class pacbio_data_processing.bam.BamFile(bam_file_name, mode='r')[source]

Bases: object

Proxy class for _BamFileSamtools and _BamFilePysam. This is a high-level class whose only roles are to choose between _ReadableBamFile and _WritableBamFile and to select the underlying implementation used to interact with the BAM file:

- _BamFileSamtools: implementation that simply wraps the 'samtools' command line, and
- _BamFilePysam: implementation that uses 'pysam'
__init__(bam_file_name, mode='r')[source]
pacbio_data_processing.bam.pack_lines(lines)[source]
pacbio_data_processing.bam.set_pysam_verbosity()[source]

Ad-hoc function to suppress unpleasant error messages emitted by pysam.

pacbio_data_processing.bam_file_filter module

This module contains the high level functions necessary to apply some filters to a given input BAM file.

class pacbio_data_processing.bam_file_filter.BamFilter(parameters)[source]

Bases: object

__call__()[source]

Call self as a function.

__init__(parameters)[source]
pacbio_data_processing.bam_file_filter.main()[source]

pacbio_data_processing.bam_utils module

Some helper functions to manipulate BAM files

class pacbio_data_processing.bam_utils.CircularDNAPosition(pos: int, ref_len: int = 0)[source]

Bases: object

A type that allows arithmetic with positions in a circular topology.

>>> p = CircularDNAPosition(5, ref_len=9)

The class has a decent repr:

>>> p
CircularDNAPosition(5, ref_len=9)

And we can use it in arithmetic contexts:

>>> p + 1
CircularDNAPosition(6, ref_len=9)
>>> int(p+1)
6
>>> int(p+5)
1
>>> int(20+p)
7
>>> p - 1
CircularDNAPosition(4, ref_len=9)
>>> int(p-6)
8
>>> int(p-16)
7
>>> int(2-p)
6
>>> int(8-p)
3

Also boolean equality is supported:

>>> p == CircularDNAPosition(5, ref_len=9)
True
>>> p == CircularDNAPosition(6, ref_len=9)
False
>>> p == CircularDNAPosition(14, ref_len=9)
True
>>> p == CircularDNAPosition(5, ref_len=8)
False
>>> p == 5
False

But also < is supported:

>>> p < p+1
True
>>> p < p
False
>>> p < p-1
False

Of course two instances cannot be compared if their underlying references are not equally long:

>>> s = CircularDNAPosition(5, ref_len=10)
>>> p < s
Traceback (most recent call last):
...
ValueError: cannot compare positions if topologies differ

or if they are not both CircularDNAPosition’s:

>>> s < 6
Traceback (most recent call last):
...
TypeError: '<' not supported between instances of 'CircularDNAPosition' and 'int'

The class has a convenience method:

>>> p.as_1base()
6

If the ref_len input parameter is less than or equal to 0, the topology is assumed to be linear:

>>> q = CircularDNAPosition(5, ref_len=-1)
>>> q
CircularDNAPosition(5, ref_len=0)
>>> q + 1001
CircularDNAPosition(1006, ref_len=0)
>>> q - 100
CircularDNAPosition(-95, ref_len=0)
>>> int(10-q)
5

Linear topology is the default behaviour:

>>> r = CircularDNAPosition(5)
>>> r
CircularDNAPosition(5, ref_len=0)

It is possible to use them as indices in slices:

>>> seq = "ABCDEFGHIJ"
>>> seq[r:r+2]
'FG'

And CircularDNAPosition instances can be hashed (so that they can be elements of a set or keys in a dictionary):

>>> positions = {p, q, r}

And, very conveniently, a CircularDNAPosition converts to str as ints do:

>>> str(r) == '5'
True
__init__(pos: int, ref_len: int = 0)[source]

The parameter ‘ref_len’ represents the length of the sequence, which has full meaning only if the reference is truly circular. If the length is 0 or less, it is set to 0 and it is understood that the reference has a linear topology.
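The wrap-around arithmetic shown in the doctests above can be illustrated with the modulo operator. This is only a minimal sketch, not the class's actual implementation; the helper name wrap_position is hypothetical.

```python
def wrap_position(pos: int, ref_len: int = 0) -> int:
    """Hypothetical helper: map pos onto a reference of length ref_len.

    A ref_len of 0 or less is treated as a linear topology (no wrapping).
    """
    if ref_len <= 0:
        return pos
    # For a positive modulus, Python's % never returns a negative number,
    # so negative positions wrap around to the end of the reference.
    return pos % ref_len


print(wrap_position(5 + 5, ref_len=9))   # prints 1, like int(p+5)
print(wrap_position(5 - 16, ref_len=9))  # prints 7, like int(p-16)
print(wrap_position(5 - 100))            # prints -95 (linear topology)
```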

as_1base() int[source]

It returns the raw 1-based position.

class pacbio_data_processing.bam_utils.Molecule(id: int, src_bam_path: Optional[Union[str, pathlib.Path]] = None, _best_ccs_line: Optional[tuple[bytes]] = None)[source]

Bases: object

Abstraction around a single molecule from a Bam file

__init__(id: int, src_bam_path: Optional[Union[str, pathlib.Path]] = None, _best_ccs_line: Optional[tuple[bytes]] = None) None
property ascii_quals: str

Ascii qualities of sequencing the molecule. Each symbol refers to one base.

property cigar: pacbio_data_processing.cigar.Cigar
property dna: str
property end: pacbio_data_processing.bam_utils.CircularDNAPosition

Computes the end of a molecule as CircularDNAPosition(start + length of reference) which, obviously, takes into account the possible circular topology of the reference.

find_gatc_positions() list[pacbio_data_processing.bam_utils.CircularDNAPosition][source]

The function returns the position of all the GATCs found in the Molecule’s sequence, taking into account the topology of the reference.

The return value is the 0-based index of the GATC motif, i.e., the index of the G in the Python convention.
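As a rough illustration of the 0-based convention, the following sketch scans a sequence for GATC and reports the index of each G, wrapped onto the reference when it is circular. The function gatc_positions and its start and ref_len parameters are hypothetical and return plain ints; the real method works on a Molecule and returns CircularDNAPosition instances.

```python
def gatc_positions(dna: str, start: int = 0, ref_len: int = 0) -> list[int]:
    """Hypothetical helper: 0-based reference positions of each GATC in dna.

    start is the reference position of dna[0]; a ref_len of 0 or less
    means a linear reference (no wrapping).
    """
    positions = []
    idx = dna.find("GATC")
    while idx != -1:
        pos = start + idx
        if ref_len > 0:
            pos %= ref_len  # wrap around the origin of a circular reference
        positions.append(pos)
        idx = dna.find("GATC", idx + 1)
    return positions


print(gatc_positions("AAGATCGGATC"))                       # prints [2, 7]
print(gatc_positions("AAGATCGGATC", start=8, ref_len=10))  # prints [0, 5]
```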

id: int
is_crossing_origin(*, ori_pi_shifted=False) bool[source]

This method answers the question of whether the molecule crosses the origin, assuming a circular topology of the chromosome. The answer is True if the last base of the molecule is located before the first base. Otherwise the answer is False. It will return False if the molecule starts at the origin, but it will be True if it ends at the origin. There is an optional keyword-only boolean parameter, ori_pi_shifted, that indicates whether the reference has been shifted by pi radians.
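The rule described above can be sketched as a comparison between the wrapped end and the start. The sketch assumes that the end is the start plus the molecule's aligned length, taken modulo the reference length, and that the molecule is shorter than the reference; it ignores the ori_pi_shifted option. The name crosses_origin is hypothetical.

```python
def crosses_origin(start: int, length: int, ref_len: int) -> bool:
    """Hypothetical helper: True if the wrapped end lies before the start."""
    end = (start + length) % ref_len
    return end < start


print(crosses_origin(start=2, length=3, ref_len=10))  # prints False
print(crosses_origin(start=8, length=5, ref_len=10))  # prints True
print(crosses_origin(start=0, length=4, ref_len=10))  # prints False (starts at origin)
print(crosses_origin(start=7, length=3, ref_len=10))  # prints True (ends at origin)
```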

pi_shift_back() None[source]

Method that shifts back the (start, end) positions of the molecule assuming that they were shifted before by pi radians.

src_bam_path: Optional[Union[str, pathlib.Path]] = None
property start: pacbio_data_processing.bam_utils.CircularDNAPosition

Readable/writable attribute. It was originally read-only, but the SingleMoleculeAnalysis class relies on it being writable to simplify shifting back pi-shifted positions, which are computed from this attribute. The logic is: by default, the value is taken from the _best_ccs_line attribute until it is modified, in which case the new value is simply stored and returned upon request.

pacbio_data_processing.bam_utils.count_subreads_per_molecule(bam: pacbio_data_processing.bam.BamFile) collections.defaultdict[int, collections.Counter][source]

Given a read-open BamFile instance, it returns a defaultdict with keys being molecule ids (str) and values, a counter with subreads classified by strand. The possible keys of the returned counter are: +, -, ? meaning direct strand, reverse strand and unknown, respectively.
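The shape of the returned structure can be sketched with the standard library alone. The helper below is hypothetical and takes (molecule id, strand) pairs instead of a BamFile, so it only illustrates the nesting of defaultdict and Counter.

```python
from collections import Counter, defaultdict


def count_by_molecule(subreads):
    """Hypothetical helper: count strands per molecule id.

    subreads is an iterable of (mol_id, strand) pairs, with strand
    being one of '+', '-' or '?'.
    """
    counts = defaultdict(Counter)
    for mol_id, strand in subreads:
        counts[mol_id][strand] += 1
    return counts


counts = count_by_molecule([(23, "+"), (23, "-"), (23, "+"), (45, "?")])
print(counts[23]["+"], counts[23]["-"], counts[45]["?"])  # prints 2 1 1
```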

pacbio_data_processing.bam_utils.flag2strand(flag: int) Literal['+', '-', '?'][source]

Given a FLAG (see the BAM format specification), it transforms it to the corresponding strand.

Returns

+, - or ? depending on the strand the input FLAG can be assigned to (? means: it could not be assigned to any strand).
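One possible mapping, based only on the FLAG bits defined in the SAM/BAM specification (0x10: sequence is reverse complemented; 0x4: segment unmapped), is sketched below. Treating unmapped reads as ? is an assumption made for illustration; the actual rules used by flag2strand may differ. The name strand_from_flag is hypothetical.

```python
REVERSE = 0x10   # SAM spec: SEQ is reverse complemented
UNMAPPED = 0x4   # SAM spec: segment unmapped


def strand_from_flag(flag: int) -> str:
    """Hypothetical helper: map a FLAG to '+', '-' or '?'.

    Assumption: an unmapped read cannot be assigned to a strand.
    """
    if flag & UNMAPPED:
        return "?"
    return "-" if flag & REVERSE else "+"


print(strand_from_flag(0), strand_from_flag(16), strand_from_flag(4))  # prints + - ?
```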

pacbio_data_processing.bam_utils.gen_index_single_molecule_bams(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], program: pathlib.Path) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None][source]

It generates indices in the form of .pbi files using program, which must be the path to a working pbindex executable. For each molecule read from the input pipe, program is called as follows (the argument is the BAM associated with the current molecule):

pbindex blasr.pMA683.subreads.bam

The success of the operation is determined by inspecting the return code. If the call succeeds (i.e., the return code is 0), the corresponding MoleculeWorkUnit is yielded.

If the call fails (the return code is NOT 0), an error is reported.
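The return-code check can be sketched with subprocess.run. The helper index_bam is hypothetical; it only reports success or failure and does not reproduce the logging or the generator interface of the real function.

```python
import subprocess
from pathlib import Path


def index_bam(program: Path, bam: Path) -> bool:
    """Hypothetical helper: run `program bam` and report whether it succeeded.

    Success means a return code of 0. A missing executable raises
    FileNotFoundError, which is deliberately not handled here.
    """
    result = subprocess.run([str(program), str(bam)], capture_output=True)
    return result.returncode == 0
```

A caller would then yield the work unit only when index_bam returns True and report an error otherwise.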

pacbio_data_processing.bam_utils.join_gffs(work_units: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], out_file_path: Union[str, pathlib.Path]) collections.abc.Generator[pathlib.Path, None, None][source]

The gff files related to the molecules provided in the input are read and joined in a single file. The individual gff files are yielded back.

Probably this function is useless and should be removed in the future: it only provides a joint gff file that is not a valid gff file and that is never used in the rest of the processing.

pacbio_data_processing.bam_utils.split_bam_file_in_molecules(in_bam_file: Union[str, pathlib.Path], tempdir: Union[str, pathlib.Path], todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None][source]

All the individual molecules in the bam file path given, in_bam_file, that are found in todo, will be isolated and stored individually in the directory tempdir. The yielded Molecule instances will have their src_bam_path updated accordingly.

pacbio_data_processing.bam_utils.subreads_per_molecule(lines: collections.abc.Iterable, header: bytes, file_name_prefix: pathlib.Path, todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None][source]

This generator yields 2-tuples of (mol-id, Molecule) after having isolated the subreads corresponding to that molecule id from the lines (coming from the iteration over a BamFile instance). Before yielding, a one-molecule BAM file is created.

pacbio_data_processing.bam_utils.write_one_molecule_bam(buf: collections.abc.Iterable, header: bytes, in_file_name: pathlib.Path, suffix: Any) pathlib.Path[source]

Given a sequence of BAM lines, a header, the source name and a suffix, a new BamFile is created containing the data provided and with a suitable name.

pacbio_data_processing.cigar module

This module provides basic ‘re-invented’ functionality to handle Cigars. A Cigar describes the differences between two sequences by providing a series of operations that one has to apply to one sequence to obtain the other one. For instance, given these two sequences:

sequence 1 (e.g. from the reference):

AAGTTCCGCAAATT

and

sequence 2 (e.g. from the aligner):

AAGCTCCCGCAATT

The Cigar that brings us from sequence 1 to sequence 2 is:

3=1X3=1I4=1D2=

where the numbers refer to the amount of letters and the symbols’ meaning can be found in the table below. Therefore the Cigar in the example is a shorthand for:

3 equal bases followed by 1 replacement followed by 3 equal bases followed by 1 insertion followed by 4 equal bases followed by 1 deletion followed by 2 equal bases

symbol    meaning
=         equal
I         insertion
D         deletion
X         replacement
S         soft clip
H         hard clip
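A Cigar string of this kind can be split into (count, symbol) pairs with a regular expression. This is a minimal sketch restricted to the six symbols listed in the table; parse_cigar is a hypothetical helper, not part of the Cigar class.

```python
import re

# One or more digits followed by one of the symbols from the table.
CIGAR_ITEM = re.compile(r"(\d+)([=IDXSH])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Hypothetical helper: split a Cigar string into (count, symbol) pairs."""
    return [(int(count), symbol) for count, symbol in CIGAR_ITEM.findall(cigar)]


print(parse_cigar("3=1X3=1I4=1D2="))
# prints [(3, '='), (1, 'X'), (3, '='), (1, 'I'), (4, '='), (1, 'D'), (2, '=')]
```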

class pacbio_data_processing.cigar.Cigar(incigar)[source]

Bases: object

__init__(incigar)[source]
property diff_ratio

difference ratio: 1 means that each base is different; 0 means that all the bases are equal.

property number_diff_items
property number_diff_types
property number_pb_diffs
property number_pbs
property sim_ratio

similarity ratio: 1 means that all the bases are equal; 0 means that each base is different.

This is computed from diff_ratio().
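The relation between the two ratios can be sketched as follows. The sketch assumes that the difference ratio is the number of bases in non-= operations divided by the total number of bases in all operations; the exact definition used by the class (for example, how clips are counted) may differ. Both function names are hypothetical.

```python
import re


def diff_ratio(cigar: str) -> float:
    """Hypothetical helper: fraction of bases that are not in '=' operations."""
    items = [(int(n), op) for n, op in re.findall(r"(\d+)([=IDXSH])", cigar)]
    total = sum(n for n, _ in items)
    if total == 0:
        raise ValueError("empty or unparsable cigar")
    diffs = sum(n for n, op in items if op != "=")
    return diffs / total


def sim_ratio(cigar: str) -> float:
    """Hypothetical helper: similarity as the complement of diff_ratio."""
    return 1 - diff_ratio(cigar)
```

Under this assumption, the Cigar 3=1X3=1I4=1D2= has 3 differing bases out of 15, which gives a difference ratio of 0.2 and a similarity ratio of 0.8.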

pacbio_data_processing.constants module

pacbio_data_processing.errors module

exception pacbio_data_processing.errors.SMAPipelineError[source]

Bases: Exception

pacbio_data_processing.errors.high_level_handler(func)[source]

pacbio_data_processing.external module

class pacbio_data_processing.external.Blasr(path: Union[pathlib.Path, str])[source]

Bases: pacbio_data_processing.external.ExternalProgram

An object to interact with the blasr aligner.

__call__(in_bamfile: Union[pathlib.Path, str], fasta: Union[pathlib.Path, str], out_bamfile: Union[pathlib.Path, str], nprocs: int = 1) Optional[int][source]

It runs the executable with the given parameters. The return code of the associated process is returned by this method if the executable could run at all; otherwise None is returned.

One case where the executable cannot run is when the sentinel file is there before the executable process is run.

class pacbio_data_processing.external.CCS(path: Union[pathlib.Path, str])[source]

Bases: pacbio_data_processing.external.ExternalProgram

An object to interact with the ccs program.

__call__(in_bamfile: Union[pathlib.Path, str], out_bamfile: Union[pathlib.Path, str]) Optional[int][source]

It runs the executable with the given parameters. The return code of the associated process is returned by this method if the executable could run at all; otherwise None is returned.

One case where the executable cannot run is when the sentinel file is there before the executable process is run.

class pacbio_data_processing.external.ExternalProgram(path: Union[pathlib.Path, str])[source]

Bases: object

A base class with common functionality to all external programs’ classes that:

  1. produce an output file, and

  2. its production is to be protected by a Sentinel.

This base class provides the interface and the Sentinel protection.

__call__(infile: Union[pathlib.Path, str], outfile: Union[pathlib.Path, str], *args, **kwargs) Optional[int][source]

It runs the executable with the given parameters. The return code of the associated process is returned by this method if the executable could run at all; otherwise None is returned.

One case where the executable cannot run is when the sentinel file is there before the executable process is run.
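The contract described above (return code when the program ran, None when a sentinel file blocked it) can be sketched like this. The function run_protected and the ".wip" sentinel naming are assumptions made for illustration; the real classes delegate the protection to the Sentinel class.

```python
import subprocess
from pathlib import Path
from typing import Optional


def run_protected(cmd: list[str], outfile: Path) -> Optional[int]:
    """Hypothetical helper: run cmd unless a sentinel for outfile exists."""
    # Assumed naming scheme: the sentinel sits next to the output file.
    sentinel = outfile.with_name(outfile.name + ".wip")
    try:
        sentinel.touch(exist_ok=False)
    except FileExistsError:
        # Someone else seems to be producing outfile: do not run at all.
        return None
    try:
        return subprocess.run(cmd).returncode
    finally:
        sentinel.unlink()
```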

__init__(path: Union[pathlib.Path, str]) None[source]

pacbio_data_processing.filters module

pacbio_data_processing.filters.cleanup_molecules(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None][source]

Generator of MoleculeWorkUnit’s that pass all the standard filters, ie the sequence of filters needed by sm-analysis to select what molecules (and what subreads in those molecules) will be IPD-analyzed.

It is assumed that each file contains subreads corresponding to only ONE molecule (ie, ‘molecules’ is a generator of tuples (mol id, Molecule), with Molecule being related to a single molecule id). [Note for developers: Should we allow multiple molecules per file?]

If there are subreads surviving the filtering process, the bam file is overwritten with the filtered data and the tuple (mol id, Molecule) is yielded. If no subread survives the process, nothing is done (no bam is written, no tuple is yielded).

pacbio_data_processing.filters.empty_buffer(buf: collections.deque, threshold: int, flags_seen: set) Generator[tuple[bytes], None, None][source]

This generator empties the passed-in buffer, either yielding its items if the conditions are met or discarding them if not.

The conditions are:

  1. the number of items is at least threshold, and

  2. flags_seen is a (not necessarily proper) superset of {'+', '-'}.
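The two conditions translate directly into a length check and a set comparison. The generator below is a minimal sketch with a hypothetical name (drain_buffer); it is not the library's implementation.

```python
from collections import deque
from typing import Iterator


def drain_buffer(buf: deque, threshold: int, flags_seen: set) -> Iterator:
    """Hypothetical helper: empty buf, yielding its items only if both
    conditions hold."""
    # >= on sets tests for a (not necessarily proper) superset.
    keep = len(buf) >= threshold and flags_seen >= {"+", "-"}
    while buf:
        item = buf.popleft()
        if keep:
            yield item


buf = deque(["a", "b", "c"])
print(list(drain_buffer(buf, 2, {"+", "-", "?"})))  # prints ['a', 'b', 'c']
```

Because this is a generator, the buffer is only emptied once the caller iterates over it.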

pacbio_data_processing.filters.filter_enough_data_per_molecule(lines: collections.abc.Iterable[tuple], threshold: int) Generator[tuple[bytes], None, None][source]

This generator yields the input data if (WIP)

pacbio_data_processing.filters.filter_mappings_binary(lines, mappings, *rest)[source]

Simply take or reject mappings depending on passed sequence

pacbio_data_processing.filters.filter_mappings_ratio(lines, mappings, ratio)[source]

Take or reject mappings depending on ratio of wished mappings vs total

pacbio_data_processing.filters.filter_quality(lines, quality_th)[source]
pacbio_data_processing.filters.filter_seq_len(lines, len_th)[source]

pacbio_data_processing.ipd module

exception pacbio_data_processing.ipd.UnknownErrorIpdSummary[source]

Bases: Exception

pacbio_data_processing.ipd.ipd_summary(molecule: tuple[int, pacbio_data_processing.bam_utils.Molecule], fasta: Union[str, pathlib.Path], program: pathlib.Path, nprocs: int, mod_types_comma_sep: str, ipd_model: Union[str, pathlib.Path], skip_if_present: bool) Optional[tuple[int, pacbio_data_processing.bam_utils.Molecule]][source]

Lowest level interface to ipdSummary: all calls to that program are expected to be done through this function. It runs ipdSummary with an input bam file like this:

ipdSummary  blasr.pMA683.subreads.bam --reference pMA683.fa      --identify m6A --gff blasr.pMA683.subreads.476.bam.gff

As a result of this, a gff file is created. This function sets an attribute in the target Molecule with the path to that file.

If the process went well (ipdSummary returns 0), the input MoleculeWorkUnit is returned, otherwise the molecule is tagged as being problematic (had_processing_problems is set to True) and None is returned.

Missing features:

  • skip_if_present

pacbio_data_processing.ipd.multi_ipd_summary(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], None, None]

Generator that yields MoleculeWorkUnit resulting from ipd_summary (None results are skipped). Parallel implementation driven by a pool of threads.

pacbio_data_processing.ipd.multi_ipd_summary_direct(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], None, None][source]

Generator that yields MoleculeWorkUnit resulting from ipd_summary (None results are skipped). Serial implementation (one file produced after the other).

pacbio_data_processing.ipd.multi_ipd_summary_threads(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], None, None][source]

Generator that yields MoleculeWorkUnit resulting from ipd_summary (None results are skipped). Parallel implementation driven by a pool of threads.
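The general pattern (a pool of threads applying a function to each item, with None results skipped) can be sketched with concurrent.futures. The name map_skipping_none is hypothetical and the sketch does not call ipdSummary; it only shows the pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def map_skipping_none(func, items, num_workers):
    """Hypothetical helper: apply func to items in a thread pool and
    yield the results that are not None, in input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for result in pool.map(func, items):
            if result is not None:
                yield result


odd_only = lambda x: x if x % 2 else None
print(list(map_skipping_none(odd_only, range(6), num_workers=3)))  # prints [1, 3, 5]
```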

pacbio_data_processing.logs module

pacbio_data_processing.logs.config_logging(verbosity: int) None[source]

pacbio_data_processing.parameters module

class pacbio_data_processing.parameters.BamFilteringParameters(cl_input)[source]

Bases: pacbio_data_processing.parameters.ParametersBase

property filter_mappings
property limit_mappings
property min_relative_mapping_ratio
property out_bam_file
class pacbio_data_processing.parameters.ParametersBase(cl_input)[source]

Bases: object

__init__(cl_input)[source]
class pacbio_data_processing.parameters.SingleMoleculeAnalysisParameters(cl_input)[source]

Bases: pacbio_data_processing.parameters.ParametersBase

property ipd_model
property joint_gff_filename
property one_line_per_mod_filename
property partition
property summary_report_html_filename

pacbio_data_processing.plots module

pacbio_data_processing.plots.make_barsplot(dataframe: pandas.core.frame.DataFrame, plot_title: str, filename: Union[pathlib.Path, str]) None[source]
pacbio_data_processing.plots.make_continuous_rolled_data(data: dict[typing.NewType.<locals>.new_type, typing.NewType.<locals>.new_type], window: int) pandas.core.frame.DataFrame[source]

Auxiliary function used by make_rolling_history to produce a dataframe with the rolling average of the input data. The resulting dataframe starts at the min input position and ends at the max input position. The holes are set to 0 in the input data.
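The idea can be sketched in plain Python: fill the missing positions with 0 and average over a window. The real function returns a pandas DataFrame; the trailing window and the shorter windows at the left edge used here are assumptions, and the name rolled is hypothetical.

```python
def rolled(data: dict[int, float], window: int) -> dict[int, float]:
    """Hypothetical helper: rolling average over a gap-free range of positions.

    Positions missing from data count as 0. Each value is the mean of the
    last `window` positions (fewer at the left edge).
    """
    lo, hi = min(data), max(data)
    dense = [data.get(pos, 0) for pos in range(lo, hi + 1)]
    out = {}
    for i in range(len(dense)):
        chunk = dense[max(0, i - window + 1): i + 1]
        out[lo + i] = sum(chunk) / len(chunk)
    return out


print(rolled({1: 2, 4: 4}, window=2))  # prints {1: 2.0, 2: 1.0, 3: 0.0, 4: 2.0}
```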

pacbio_data_processing.plots.make_histogram(dataframe: pandas.core.frame.DataFrame, plot_title: str, filename: Union[pathlib.Path, str], legend: bool = True) None[source]
pacbio_data_processing.plots.make_multi_histogram(data: dict[str, pandas.core.series.Series], plot_title: str, filename: Union[pathlib.Path, str], legend: bool = True) None[source]
pacbio_data_processing.plots.make_rolling_history(data: dict[typing.NewType.<locals>.new_type, typing.NewType.<locals>.new_type], plot_title: str, filename: Union[pathlib.Path, str], legend: bool = True, window: int = 1000) None[source]

pacbio_data_processing.sam module

pacbio_data_processing.sentinel module

class pacbio_data_processing.sentinel.Sentinel(checkpoint: pathlib.Path)[source]

Bases: object

This class creates objects that are expected to be used as context managers. At __enter__ a sentinel file is created. At __exit__ the sentinel file is removed. If the file is there before entering the context, or is not there when the context is exited, an exception is raised.
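The enter/exit behaviour can be sketched with pathlib alone. In this sketch the built-in FileExistsError and FileNotFoundError stand in for SentinelFileFound and SentinelFileNotFound, and the anti-aging mechanism is left out; the class name SentinelSketch is hypothetical.

```python
from pathlib import Path


class SentinelSketch:
    """Hypothetical, simplified sentinel usable as a context manager."""

    def __init__(self, checkpoint: Path):
        self.checkpoint = checkpoint

    def __enter__(self):
        # Raises FileExistsError if the sentinel file is already there.
        self.checkpoint.touch(exist_ok=False)
        return self

    def __exit__(self, exc_type, exc, traceback):
        # Raises FileNotFoundError if the sentinel file has vanished.
        self.checkpoint.unlink()
        return False
```

Typical usage wraps the protected computation in a with block, e.g. with SentinelSketch(Path("out.bam.wip")): followed by the code that produces the output.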

__init__(checkpoint: pathlib.Path)[source]
_anti_aging()[source]

Method that updates the modification time of the sentinel file every SLEEP_SECONDS seconds. This is part of the mechanism to ensure that the sentinel does not get fooled by an abandoned leftover sentinel file.

property is_file_too_old

Property that answers the question: is the sentinel file too old to be taken as an active sentinel file, or not?

exception pacbio_data_processing.sentinel.SentinelFileFound[source]

Bases: Exception

Exception expected when the sentinel file is there before its creation.

exception pacbio_data_processing.sentinel.SentinelFileNotFound[source]

Bases: Exception

Exception expected if the sentinel file is missing before the Sentinel removes it.

pacbio_data_processing.sm_analysis module

This module contains the high level functions necessary to run the ‘Single Molecule Analysis’ on an input BAM file.

class pacbio_data_processing.sm_analysis.MethylationReport(detections_csv, molecules, modification_types, filtered_bam_statistics=None)[source]

Bases: object

PRELOG = '[methylation report]'
__init__(detections_csv, molecules, modification_types, filtered_bam_statistics=None)[source]
property modification_types
save()[source]
class pacbio_data_processing.sm_analysis.SingleMoleculeAnalysis(parameters)[source]

Bases: object

property CCS_bam_file

It produces a Circular Consensus Sequence (CCS) version of the input BAM file and returns its name. It uses generate_CCS_file() to generate the file.

__call__()[source]

Main entry point to perform a single molecule analysis: this method triggers the analysis.

__init__(parameters)[source]
_align_bam_if_no_candidate_found(inbam: pacbio_data_processing.bam.BamFile, bam_type: str, variant: str = 'straight') Optional[str][source]

[Internal method] Auxiliary method used by _ensure_input_bam_aligned. Given a bam_type (among input and ccs) and a variant, an initial BAM file is selected and a target aligned BAM filename is constructed. The method first checks whether the aligned file is there. If a plausible candidate is not found, the initial BAM is aligned (straight or π-shifted, depending on the variant and using the proper reference). If, on the other hand, a candidate is found, its computation is skipped.

If the aligner cannot be run (i.e. calling the aligner returns None), None is returned, meaning that the aligner was not called. This can happen when the aligner finds a sentinel file indicating that the computation is work in progress. (See pacbio_data_processing.blasr.Blasr.__call__() for more details on the implementation.) This mechanism allows reentrancy.

Returns

the aligned input bam file, if it is there, or None if it could not be computed (yet).

_create_references()[source]

[Internal method] DNA reference sequences are created here. The ‘true’ reference must exist as fasta beforehand, with its index. A π-shifted reference is created from the original one. Its index is also made.

This method sets two attributes, both of which are mappings with two keys (‘straight’ and ‘pi-shifted’) and values as follows:

  • reference: the values are DNASeq objects

  • fasta: the values are Path objects

_disable_pi_shifted_analysis() None[source]

[Internal method] If the pi-shifted analysis cannot be carried out, it is disabled with this method.

_ensure_ccs_bam_aligned() None[source]

[Internal method] As its name suggests, it is ensured that the aligned variants of the CCS file exist. The summary report is informed about the aligned CCS files.

_ensure_input_bam_aligned() None[source]

[Internal method] Main check point for aligned input bam files: this method calls whatever is necessary to ensure that the input bam is aligned, which means: normal (straight) alignment and π-shifted alignment.

Warning! If the input is aligned, the method looks for a pi-shifted aligned BAM and accepts it based on whether:

  1. a file with a suitable filename is found, and

  2. it is aligned.

_exists_pi_shifted_variant_from_aligned_input() bool[source]

[Internal method] It checks that the expected pi-shifted aligned file exists and is an aligned BAM file.

property partition: pacbio_data_processing.utils.Partition

The target Partition of the input BAM file that must be processed by the current analysis, according to the input provided by the user.

produce_methylation_report()[source]
property workdir: tempfile.TemporaryDirectory

This attribute returns the necessary temporary working directory on demand and it ensures that only one temporary dir is created by caching.
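One common way to obtain this create-once behaviour is functools.cached_property. The sketch below uses a hypothetical class name and is not necessarily how the real attribute is implemented.

```python
import tempfile
from functools import cached_property


class AnalysisSketch:
    """Hypothetical class showing an on-demand, cached temporary directory."""

    @cached_property
    def workdir(self) -> tempfile.TemporaryDirectory:
        # Runs only on first access; later accesses reuse the same object.
        return tempfile.TemporaryDirectory()


analysis = AnalysisSketch()
print(analysis.workdir is analysis.workdir)  # prints True
analysis.workdir.cleanup()
```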

pacbio_data_processing.sm_analysis.add_to_own_output(gffs, own_output_file_name, modification_types)[source]

From a set of .gff files, a csv file (delimiter=”,”) is saved with the following columns:

  • mol id: taken from the name of each gff file (e.g. ‘a.b.c.gff’ -> mol id: ‘b’)

  • modtype: column number 3 (idx: 2) of the gffs (feature type)

  • GATC position: column number 5 (idx: 4) of the gffs, which corresponds to the ‘end coordinate of the feature’ in the GFF3 standard

  • score of the feature: column number 6 (idx: 5); floating point (Phred-transformed pvalue that a kinetic deviation exists at this position)

  • strand: strand of the feature. It can be +, - with obvious meanings. It can also be ? (meaning unknown) or . (for non stranded features)

There are more columns, but they are not fixed in number. They correspond to the values given in the ‘attributes’ column of the gffs (col 9, idx 8). For example, given the following attributes column:

coverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228

we would get the following ‘extra’ columns:

134,TCA...,3.91,228

and this is exactly what happens with the m6A modification type.
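Extracting the ‘extra’ columns from an attributes string amounts to splitting on ; and keeping the right-hand side of each key=value pair. The helper attribute_values is hypothetical and only illustrates that transformation.

```python
def attribute_values(attributes: str) -> list[str]:
    """Hypothetical helper: values of a GFF3 attributes column, in order."""
    return [
        item.split("=", 1)[1]
        for item in attributes.split(";")
        if "=" in item
    ]


attrs = "coverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228"
print(",".join(attribute_values(attrs)))  # prints 134,TCA...,3.91,228
```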

All the lines starting with ‘#’ in the gff files are ignored. The format of the gff file is GFF3: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

The value of identificationQV is a Phred-transformed probability of having a detection. See eq. (8) in [1]

[1]: “Detection and Identification of Base Modifications with Single Molecule Real-Time Sequencing Data”

pacbio_data_processing.sm_analysis.generate_CCS_file(ccs: pacbio_data_processing.external.CCS, in_bam: pathlib.Path, ccs_bam_file: pathlib.Path) Optional[pathlib.Path][source]

Idempotent computation of the Circular Consensus Sequence (CCS) version of the passed-in in_bam file, done with the passed-in ccs object.

Returns

the CCS bam file, if it is there, or None if it could not be computed (yet).

pacbio_data_processing.sm_analysis.main_cl()[source]

Entry point for sm-analysis executable.

pacbio_data_processing.sm_analysis.map_molecules_with_highest_sim_ratio(bam_file_name: Optional[Union[pathlib.Path, str]]) dict[int, pacbio_data_processing.bam_utils.Molecule][source]

Given the path to a bam file, it returns a dictionary, whose keys are mol ids (ints) and the values are the corresponding Molecules. If multiple lines in the given BAM file share the mol id, only the first line found with the highest similarity ratio (computed from the cigar) is chosen: if multiple lines share the molecule ID and the highest similarity ratio (say, 1), ONLY the first one is taken, irrespective of other factors.
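The "first line with the highest similarity ratio wins" rule can be sketched with a strict comparison, so that later ties never replace an earlier entry. The helper best_per_molecule and its (mol id, ratio, payload) input are hypothetical simplifications of the BAM lines and Molecule objects.

```python
def best_per_molecule(records):
    """Hypothetical helper: keep, per molecule id, the first record with
    the highest ratio. records yields (mol_id, ratio, payload) triples."""
    best = {}
    for mol_id, ratio, payload in records:
        # Strict '>' means that a later tie does not replace the first hit.
        if mol_id not in best or ratio > best[mol_id][0]:
            best[mol_id] = (ratio, payload)
    return {mol_id: payload for mol_id, (_, payload) in best.items()}


records = [(1, 0.9, "a"), (1, 1.0, "b"), (1, 1.0, "c"), (2, 0.5, "d")]
print(best_per_molecule(records))  # prints {1: 'b', 2: 'd'}
```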

pacbio_data_processing.sm_analysis.match_methylation_states_m6A(pos_plus, ipd_meth_states)[source]
pacbio_data_processing.sm_analysis.restore_old_run(old_path, new_path)[source]

pacbio_data_processing.sm_analysis_gui module

pacbio_data_processing.sm_analysis_gui.main_gui()[source]

Entry point for sm-analysis-gui executable.

pacbio_data_processing.summary module

class pacbio_data_processing.summary.AlignedCCSBamsAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.BarsPlotAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.ROAttribute

class pacbio_data_processing.summary.GATCCoverageBarsPlot(name=None)[source]

Bases: pacbio_data_processing.summary.BarsPlotAttribute

data_definition = {'GATCs NOT in BAM file (%)': ('perc_all_gatcs_not_identified_in_bam',), 'GATCs NOT in methylation report (%)': ('perc_all_gatcs_not_in_meth',), 'GATCs in BAM file (%)': ('perc_all_gatcs_identified_in_bam',), 'GATCs in methylation report (%)': ('perc_all_gatcs_in_meth',)}
dependency_names = ('aligned_ccs_bam_files', 'methylation_report')
index_labels = ('Percentage',)
title = 'GATCs in BAM file and Methylation report'
class pacbio_data_processing.summary.HistoryPlotAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.ROAttribute

class pacbio_data_processing.summary.InputBamAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.InputReferenceAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.MethTypeBarsPlot(name=None)[source]

Bases: pacbio_data_processing.summary.BarsPlotAttribute

data_definition = {'Fully methylated (%)': ('fully_methylated_gatcs_wrt_meth',), 'Fully unmethylated (%)': ('fully_unmethylated_gatcs_wrt_meth',), 'Hemi-methylated in + strand (%)': ('hemi_plus_methylated_gatcs_wrt_meth',), 'Hemi-methylated in - strand (%)': ('hemi_minus_methylated_gatcs_wrt_meth',)}
dependency_names = ('methylation_report',)
index_labels = ('Percentage',)
title = 'Methylation types in methylation report'
class pacbio_data_processing.summary.MethylationReport(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.MoleculeLenHistogram(name=None)[source]

Bases: pacbio_data_processing.summary.HistoryPlotAttribute

column_name = 'len(molecule)'
data_name = 'length'
dependency_name = 'methylation_report'
labels = ('Initial subreads', 'Analyzed molecules')
legend = True
make_data_for_plot(instance)[source]
title = 'Initial subreads and analyzed molecule length histogram'
class pacbio_data_processing.summary.MoleculeTypeBarsPlot(name=None)[source]

Bases: pacbio_data_processing.summary.BarsPlotAttribute

data_definition = {'Filtered out': ('perc_filtered_out_mols', 'perc_filtered_out_subreads'), 'In Methylation report with GATC': ('perc_mols_in_meth_report_with_gatcs', 'perc_subreads_in_meth_report_with_gatcs'), 'In Methylation report without GATC': ('perc_mols_in_meth_report_without_gatcs', 'perc_subreads_in_meth_report_without_gatcs'), 'Mismatch discards': ('perc_mols_dna_mismatches', 'perc_subreads_dna_mismatches'), 'Used in aligned CCS': ('perc_mols_used_in_aligned_ccs', 'perc_subreads_used_in_aligned_ccs')}
dependency_names = ('mols_used_in_aligned_ccs', 'mols_dna_mismatches', 'filtered_out_mols', 'methylation_report')
index_labels = ('Number of molecules (%)', 'Number of subreads (%)')
title = 'Processed molecules and subreads'
class pacbio_data_processing.summary.MolsSetAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.PercAttribute(total_attr, pref='perc_', suf='_wrt_meth', name=None)[source]

Bases: pacbio_data_processing.summary.ROAttribute

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

__init__(total_attr, pref='perc_', suf='_wrt_meth', name=None)[source]
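
The behaviour described above can be pictured with a minimal descriptor sketch. It is not the package's implementation: the class name PercAttributeSketch, the owner class ReportSketch, the way the source attribute is derived from the descriptor's name (by stripping pref and suf) and the two-decimal formatting are all assumptions made for illustration.

```python
class PercAttributeSketch:
    """Hypothetical read-only descriptor returning a percentage as str."""

    def __init__(self, total_attr, pref="perc_", suf="_wrt_meth", name=None):
        self.total_attr = total_attr
        self.pref = pref
        self.suf = suf
        self.name = name

    def __set_name__(self, owner, name):
        # Remember the attribute name unless one was given explicitly.
        if self.name is None:
            self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        # Assumption: the source attribute is the name without prefix/suffix.
        base = self.name.removeprefix(self.pref).removesuffix(self.suf)
        part = getattr(instance, base)
        total = getattr(instance, self.total_attr)
        # Assumption: two decimals; the real formatting is not documented here.
        return f"{100 * part / total:.2f}"


class ReportSketch:
    """Hypothetical owner class standing in for SummaryReport."""

    perc_filtered_out_mols = PercAttributeSketch("mols_ini")

    def __init__(self, mols_ini, filtered_out_mols):
        self.mols_ini = mols_ini
        self.filtered_out_mols = filtered_out_mols
```

Under these assumptions, ReportSketch(200, 50).perc_filtered_out_mols evaluates 100 * 50 / 200 and returns it as a string.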
class pacbio_data_processing.summary.PositionCoverageBarsPlot(name=None)[source]

Bases: pacbio_data_processing.summary.BarsPlotAttribute

data_definition = {'Positions NOT covered by molecules in BAM file (%)': ('perc_all_positions_not_in_bam',), 'Positions NOT covered by molecules in methylation report (%)': ('perc_all_positions_not_in_meth',), 'Positions covered by molecules in BAM file (%)': ('perc_all_positions_in_bam',), 'Positions covered by molecules in methylation report (%)': ('perc_all_positions_in_meth',)}
dependency_names = ('aligned_ccs_bam_files', 'methylation_report')
index_labels = ('Percentage',)
title = 'Position coverage in BAM file and Methylation report'
class pacbio_data_processing.summary.PositionCoverageHistory(name=None)[source]

Bases: pacbio_data_processing.summary.HistoryPlotAttribute

dependency_name = 'methylation_report'
labels = ('Positions',)
legend = False
len_column_name = 'len(molecule)'
make_data_for_plot(instance)[source]
start_column_name = 'start of molecule'
title = 'Sequencing positions covered by analyzed molecules'
class pacbio_data_processing.summary.ROAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.SimpleAttribute(name=None)[source]

Bases: object

The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.

__init__(name=None)[source]
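
As an illustration of a descriptor that wraps the _data dictionary of its owning instance, a minimal sketch could look as follows. The names SimpleAttributeSketch and ReportSketch are hypothetical, and the real class may do more (for example, triggering the computation of dependent attributes).

```python
class SimpleAttributeSketch:
    """Hypothetical descriptor storing its value in instance._data."""

    def __init__(self, name=None):
        self.name = name

    def __set_name__(self, owner, name):
        # Fall back to the attribute name used in the class body.
        if self.name is None:
            self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance._data[self.name]

    def __set__(self, instance, value):
        instance._data[self.name] = value


class ReportSketch:
    """Hypothetical owner class standing in for SummaryReport."""

    gff_result = SimpleAttributeSketch()

    def __init__(self):
        self._data = {}
```

Setting report.gff_result on a ReportSketch instance writes to report._data['gff_result'], and reading the attribute looks the value up in the same dictionary.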
class pacbio_data_processing.summary.SummaryReport(bam_path, dnaseq)[source]

Bases: collections.abc.Mapping

Final summary report generated by sm-analysis initially intended for humans.

This class has been crafted to carefully control its attributes. Data can be fed into the class by setting some attributes. That process triggers the generation of other attributes, that are typically read-only.

After instantiating the class with the path to the input BAM and the DNA sequence of the reference (an instance of DNASeq), one must set some attributes to be able to save the summary report:

s = SummaryReport(bam_path, dnaseq)
s.methylation_report = path_to_meth_report
s.raw_detections = path_to_raw_detections_file
s.gff_result = path_to_gff_result_file
s.mols_dna_mismatches = {20, 49, ...} # set of ints
s.filtered_out_mols = {22, 493, ...} # set of ints
s.mols_used_in_aligned_ccs = {3, 67, ...} # set of ints
s.aligned_ccs_bam_files = {
    'straight': aligned_ccs_path,
    'pi-shifted': pi_shifted_aligned_ccs_path
}

At this point all the necessary data is there and the report can be created:

s.save('summary_whatever.html')
__init__(bam_path, dnaseq)[source]
aligned_ccs_bam_files
all_gatcs_identified_in_bam
all_gatcs_in_meth
all_gatcs_not_identified_in_bam
all_gatcs_not_in_meth
all_positions_in_bam
all_positions_in_meth
all_positions_not_in_bam
all_positions_not_in_meth
property as_html
body_md5sum
filtered_out_mols
filtered_out_subreads
full_md5sum
fully_methylated_gatcs
fully_methylated_gatcs_wrt_meth

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

fully_unmethylated_gatcs
fully_unmethylated_gatcs_wrt_meth

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

gatc_coverage_bars
gff_result

The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.

hemi_methylated_gatcs
hemi_methylated_gatcs_wrt_meth

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

hemi_minus_methylated_gatcs
hemi_minus_methylated_gatcs_wrt_meth

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

hemi_plus_methylated_gatcs
hemi_plus_methylated_gatcs_wrt_meth

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

input_bam
input_bam_size
input_reference
keys() a set-like object providing a view on D's keys[source]
max_possible_methylations
meth_type_bars
methylation_report
molecule_len_histogram
molecule_type_bars
mols_dna_mismatches
mols_in_meth_report
mols_in_meth_report_with_gatcs
mols_in_meth_report_without_gatcs
mols_ini
mols_used_in_aligned_ccs
perc_all_gatcs_identified_in_bam

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_all_gatcs_in_meth

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_all_gatcs_not_identified_in_bam

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_all_gatcs_not_in_meth

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_all_positions_in_bam

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_all_positions_in_meth

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_all_positions_not_in_bam

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_all_positions_not_in_meth

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_filtered_out_mols

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_filtered_out_subreads

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_mols_dna_mismatches

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_mols_in_meth_report

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_mols_in_meth_report_with_gatcs

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_mols_in_meth_report_without_gatcs

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_mols_used_in_aligned_ccs

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_subreads_dna_mismatches

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_subreads_in_meth_report

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_subreads_in_meth_report_with_gatcs

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_subreads_in_meth_report_without_gatcs

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

perc_subreads_used_in_aligned_ccs

From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.

position_coverage_bars
position_coverage_history
raw_detections

The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.

ready_to_go(*attrs)[source]

Method used to check if some attributes are already usable or not (in other words if they have been already set or not).

reference_base_pairs
reference_md5sum
reference_name
save(filename)[source]
subreads_dna_mismatches
subreads_in_meth_report
subreads_in_meth_report_with_gatcs
subreads_in_meth_report_without_gatcs
subreads_ini
subreads_used_in_aligned_ccs
switch_on(attribute)[source]

Method used by descriptors to inform the instance of ``SummaryReport`` that some computed attributes needed by the plots are already computed and usable.

total_gatcs_in_ref

pacbio_data_processing.templates module

pacbio_data_processing.types module

pacbio_data_processing.utils module

class pacbio_data_processing.utils.AlmostUUID[source]

Bases: object

A class that provides a 5-letter summary of a UUID. It is intended to be used as a prefix in all log messages. It is not necessary that two instances are different. But it is necessary that:

  1. the string representation is short, and

  2. given two instances their string representations most probably differ.

The underlying UUID is obtained from the stdlib using uuid.uuid1. The class is implemented using the Borg pattern: all instances running in the same interpreter share a common _uuid attribute.
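
The Borg pattern mentioned above can be sketched as follows. This is only an illustration: the class name AlmostUUIDSketch is hypothetical, and taking the first five characters of the UUID string as the summary is an assumption, since the way the real class derives its 5-letter summary is not described here.

```python
import uuid


class AlmostUUIDSketch:
    """Hypothetical Borg-style class: all instances share one state."""

    _shared_state: dict = {}

    def __init__(self) -> None:
        # Borg pattern: every instance uses the same attribute dictionary.
        self.__dict__ = self._shared_state
        if "_uuid" not in self.__dict__:
            self._uuid = uuid.uuid1()

    def __str__(self) -> str:
        # Assumption: the short summary is the first five hex characters.
        return str(self._uuid)[:5]
```

Two instances created in the same interpreter are distinct objects, but they share the _uuid attribute and therefore print the same short prefix.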

__init__() None[source]
class pacbio_data_processing.utils.DNASeq(raw_seq: pacbio_data_processing.utils.DNASeqLike, name: str = '', description: str = '')[source]

Bases: Generic[pacbio_data_processing.utils.DNASeqLike]

Wrapper around ‘Bio.Seq.Seq’.

__init__(raw_seq: pacbio_data_processing.utils.DNASeqLike, name: str = '', description: str = '')[source]
classmethod from_fasta(fasta_name: str) pacbio_data_processing.utils.DNASeqType[source]

Returns a DNASeq from the first DNA sequence stored in the fasta named ‘fasta_name’.

property md5sum: str

It returns the hexdigest of the MD5 checksum of the upper-case version of the sequence, as a string.

pi_shifted() pacbio_data_processing.utils.DNASeqType[source]

Method to return a pi-shifted DNASeq from the original one. pi-shifted means that a circular topology is assumed in the DNA sequence and the origin is shifted by π radians, i.e. the sequence is split in two parts and the two parts are swapped.
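
A minimal sketch of the pi-shift on a plain string, assuming the split point is len(seq) // 2 (this choice agrees with the 7 base pair example given in the documentation of shift_me_back). The function name pi_shifted_sketch is hypothetical; the real method operates on DNASeq objects.

```python
def pi_shifted_sketch(seq: str) -> str:
    """Swap the two halves of a sequence assumed to be circular."""
    half = len(seq) // 2
    # The second part becomes the beginning of the shifted sequence.
    return seq[half:] + seq[:half]
```

With this convention, a 7-character sequence whose characters are labelled 1 to 7 is reordered as 4 5 6 7 1 2 3.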

upper() Bio.Seq.Seq[source]
write_fasta(output_file_name: Union[pathlib.Path, str]) None[source]
class pacbio_data_processing.utils.Partition(partition_specification: Optional[tuple[int, int]], bamfile: pacbio_data_processing.bam.BamFile)[source]

Bases: object

A Partition is a class that helps to answer the following question: assuming that we are interested in processing a fraction of a BamFile, does the molecule ID mol_id belong to that fraction, or not? A prior implementation consisted in storing all the molecule IDs in the BamFile for a given partition in a set, so that the answer was obtained by querying whether a molecule ID belongs to the set or not. That former implementation is not enough for the case of multiple alignment processes for the same raw BamFile (e.g., when a combined analysis of the so-called ‘straight’ and ‘pi-shifted’ variants is performed). In that case the partition is decided with one file, and all molecule IDs belonging to the non-empty intersection with the other file must be unambiguously accommodated in a certain partition. This class has been designed to solve that problem.

__init__(partition_specification: Optional[tuple[int, int]], bamfile: pacbio_data_processing.bam.BamFile) None[source]
_delimit_partitions() None[source]

[Internal method] This method decides what are the limits of all partitions given the number of partitions. The method sets an internal mapping, self._lower_limits, of the type {partition number [int]: lower limit [int]} with that information. This mapping is populated with all the partition numbers and corresponding values.

_set_current_limits() None[source]

[Internal method] Auxiliary method for __contains__. Here it is determined what is the range of molecule IDs, as ints, that belong to the partition. The method sets two integer attributes, namely:

  • _lower_limit_current: the minimum molecule ID of the current partition, and

  • _higher_limit_current: the maximum molecule ID of the current partition; it can be None, meaning that there is no maximum (last partition).
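
The idea of deciding membership from limits, rather than from a stored set of IDs, can be sketched as below. Everything in this block is hypothetical: the function names, the way the limits are chosen, the first partition being open from below, and the use of an exclusive upper bound (the next partition's lower limit) instead of an inclusive maximum.

```python
from typing import Optional


def lower_limits_sketch(mol_ids: list[int], num_partitions: int) -> dict[int, int]:
    """Map each partition number (1-based) to its lower molecule ID limit."""
    ordered = sorted(set(mol_ids))
    chunk = max(1, len(ordered) // num_partitions)
    limits = {1: 0}  # assumption: the first partition has no real lower bound
    for part in range(2, num_partitions + 1):
        idx = min((part - 1) * chunk, len(ordered) - 1)
        limits[part] = ordered[idx]
    return limits


def in_partition_sketch(mol_id: int, part: int, limits: dict[int, int]) -> bool:
    """True if mol_id falls in the ID range assigned to partition `part`."""
    lower = limits[part]
    higher: Optional[int] = limits.get(part + 1)  # None for the last partition
    return lower <= mol_id and (higher is None or mol_id < higher)
```

Because membership depends only on ID ranges, a molecule ID that is absent from the file used to build the limits (but present in another variant's file) still falls into exactly one partition.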

pacbio_data_processing.utils.combine_scores(scores: collections.abc.Sequence[float]) float[source]

It computes the combined phred transformed score of the scores provided. Some examples:

>>> combine_scores([10])
10.0
>>> q = combine_scores([10, 12, 14])
>>> print(round(q, 6))
7.204355
>>> q = combine_scores([30, 20, 100, 92])
>>> print(round(q, 6))
19.590023
>>> q_500 = combine_scores([30, 20, 500])
>>> q_no_500 = combine_scores([30, 20])
>>> q_500 == q_no_500
True
>>> combine_scores([200, 300, 500])
200.0
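
The examples above are consistent with treating each score q as a phred-scaled error probability p = 10**(-q/10) and combining them as the probability that at least one error occurs, 1 - prod(1 - p_i). The sketch below follows that reading; it is an assumption about the formula, not the package's code, and the name combine_scores_sketch is hypothetical. It uses log1p and expm1 to reduce rounding problems with very high scores, so its results may differ from the real function in the last digits, and it does not handle a score of 0.

```python
import math
from collections.abc import Sequence


def combine_scores_sketch(scores: Sequence[float]) -> float:
    """Phred score of 'at least one error' given individual phred scores."""
    # log of the probability that every item is correct: sum of log(1 - p_i)
    log_all_correct = sum(math.log1p(-(10 ** (-q / 10))) for q in scores)
    # probability that at least one item is wrong: 1 - exp(log_all_correct)
    p_any_error = -math.expm1(log_all_correct)
    return -10 * math.log10(p_any_error)
```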
pacbio_data_processing.utils.find_gatc_positions(seq: str, offset: int = 0) set[int][source]

Convenience function that computes the positions of all GATCs found in the given sequence. The values are relative to the offset.

>>> find_gatc_positions('AAAGAGAGATCGCGCGATC') == {7, 15}
True
>>> find_gatc_positions('AAAGAGAGTCGCGCCATC')
set()
>>> find_gatc_positions('AAAGAGAGATCGgaTcCGCGATC') == {7, 12, 19}
True
>>> s = find_gatc_positions('AAAGAGAGATCGgaTcCGCGATC', offset=23)
>>> s == {30, 35, 42}
True
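
One way to obtain the same results as the examples above is a case-insensitive regular expression search. This is a sketch under that assumption, with the hypothetical name gatc_positions_sketch; the actual implementation may differ.

```python
import re


def gatc_positions_sketch(seq: str, offset: int = 0) -> set[int]:
    """0-based start positions of every GATC in seq, shifted by offset."""
    # GATC cannot overlap with itself, so non-overlapping matches suffice.
    return {m.start() + offset for m in re.finditer("GATC", seq, re.IGNORECASE)}
```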
pacbio_data_processing.utils.pishift_back_positions_in_gff(gff_path: Union[str, pathlib.Path]) None[source]

A function that parses the input GFF file (assumed to be a valid GFF3 file) and shifts back the positions found in it (4th and 5th columns of lines not starting with #). It is assumed that the positions in the input file (gff_path) refer to a pi-shifted origin. To undo the shift, the length of each sequence is read from the GFF3 directives (lines starting with ##), in particular from the ##sequence-region pragmas. This function can handle the case of multiple sequences.

Warning! The function overwrites the input gff_path.

pacbio_data_processing.utils.shift_me_back(pos: int, nbp: int) int[source]

Unshifts a given position taking into account that it has been previously shifted by half of the number of base pairs. It takes into account the possibility of having a sequence with an odd length.

@params:

  • pos - 1-based position of a base pair to unshift

  • nbp - number of base pairs in the reference

@returns:

  • unshifted position

Some examples:

>>> shift_me_back(3, 10)
8
>>> shift_me_back(1, 20)
11
>>> shift_me_back(3, 7)
6
>>> shift_me_back(4, 7)
7
>>> shift_me_back(5, 7)
1
>>> shift_me_back(7, 7)
3
>>> shift_me_back(1, 7)
4

To understand the operation of this function consider the following example. Given a sequence of 7 base pairs with the following indices found in the reference in the natural order, ie

1 2 3 4 5 6 7

then, after being pi-shifted, the base pairs in the sequence are reordered, and the indices become (in parentheses, the former indices):

1’(=4) 2’(=5) 3’(=6) 4’(=7) 5’(=1) 6’(=2) 7’(=3)

The current function accepts primed indices and transforms them to the unprimed indices, ie, the positions returned refer to the original reference.
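
The mapping from primed to unprimed indices can be written with modular arithmetic. The sketch below reproduces the examples listed above, assuming the shift was done by nbp // 2 positions; the name shift_me_back_sketch is hypothetical and the real function may be written differently.

```python
def shift_me_back_sketch(pos: int, nbp: int) -> int:
    """Map a 1-based position in a pi-shifted sequence to the original one."""
    # Work with 0-based indices, add back the shift, wrap around the circle.
    return (pos - 1 + nbp // 2) % nbp + 1
```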

pacbio_data_processing.utils.try_computations_with_variants_until_done(func: Callable, variants: collections.abc.Sequence[str], *args: Any) None[source]

This function runs the passed-in function func with the arguments ``*args`` for each variant in variants, e.g. something like this:

for v in variants:
    result = func(*args, variant=v)

but it keeps doing so until each result returned by func is not None. Whenever func returns None, the function sleeps before continuing. The time slept depends on how many times it has already slept; the sleep time grows exponentially with every iteration:

t -> 2*t

until all the computations (results of func for each variant) are completed, ie all are not None. The main application of this function is to ensure that some common operations of the SingleMoleculeAnalysis are done once and only once irrespective of how many parallel instances of the analysis (with different partitions each) are carried out. For example, this function can be used to avoid collisions in the generation of aligned BAM files since pacbio_data_processing.blasr.Blasr has a mechanism that allows concurrent computations. This function delegates the decision on whether the computation is done or not to func.

Note

A special case is when a variant is None, in that case the function func is called without the variant argument:

result = func(*args)

Therefore, if variants is, e.g., (None,), then func is called only once in each iteration, WITHOUT the variant keyword argument. That is useful if func must be called until it is done but takes no variant argument.
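
The retry loop with a doubling sleep time can be sketched as follows. This is an illustration only: the name retry_variants_sketch and the parameters initial_sleep and sleep are hypothetical, and retrying only the variants that are still pending is an assumption (the real function may call func again for every variant in each iteration).

```python
import time
from collections.abc import Callable, Sequence
from typing import Any, Optional


def retry_variants_sketch(
    func: Callable,
    variants: Sequence[Optional[str]],
    *args: Any,
    initial_sleep: float = 1.0,
    sleep: Callable[[float], Any] = time.sleep,
) -> None:
    """Call func for each variant until none of the calls returns None."""
    pending = list(variants)
    wait = initial_sleep
    while pending:
        still_pending = []
        for v in pending:
            # A None variant means: call func without the variant keyword.
            result = func(*args) if v is None else func(*args, variant=v)
            if result is None:
                still_pending.append(v)
        pending = still_pending
        if pending:
            sleep(wait)
            wait *= 2  # t -> 2*t
```

Passing the sleep function as a parameter is a design choice of this sketch: it lets the waiting be replaced by a stub when experimenting, so that nothing actually blocks.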

Module contents

Top-level package for PacBio data processing.