pacbio_data_processing package¶
Subpackages¶
Submodules¶
pacbio_data_processing.bam module¶
- class pacbio_data_processing.bam.BamFile(bam_file_name, mode='r')[source]¶
Bases:
object
Proxy class for _BamFileSamtools and _BamFilePysam. This is a high level class whose only role is to choose among different possible states, _ReadableBamFile and _WritableBamFile, and to select the underlying implementation (strategy) to interact with the BAM file:
_BamFileSamtools: implementation that simply wraps the ‘samtools’ command line, and
_BamFilePysam: implementation that uses ‘pysam’.
The code is ready to permit the choice of strategy. With the current implementation it is, intentionally, a bit convoluted. For instance, instead of the default implementation (pysam), another one can be chosen as follows:
from pacbio_data_processing.bam import BamFile
BamFile.bamfile_strategy_name = "samtools"
bam = BamFile("my.bam")
and samtools will be used under the hood to get access to the data in a BAM file.
- bamfile_strategy_name = '_BamFilePysam'¶
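The strategy selection described above can be illustrated with a small, self-contained sketch. It is not the package's implementation: Backend, PysamBackend, SamtoolsBackend, STRATEGIES and BamProxy are hypothetical names, and the toy backends only record which implementation would be used.

```python
from typing import Protocol


class Backend(Protocol):
    """Hypothetical minimal interface shared by all strategies."""

    name: str


class PysamBackend:
    name = "pysam"


class SamtoolsBackend:
    name = "samtools"


# Hypothetical registry mapping strategy names to implementations.
STRATEGIES: dict[str, type] = {
    "pysam": PysamBackend,
    "samtools": SamtoolsBackend,
}


class BamProxy:
    """Toy proxy: a class attribute decides which backend is built."""

    strategy_name = "pysam"

    def __init__(self, file_name: str) -> None:
        self.file_name = file_name
        # The strategy is looked up at construction time, so changing the
        # class attribute only affects instances created afterwards.
        self.backend: Backend = STRATEGIES[self.strategy_name]()


default = BamProxy("my.bam")
BamProxy.strategy_name = "samtools"
switched = BamProxy("my.bam")
```

Resolving the strategy inside the constructor is what makes a class-level switch possible without touching the call sites.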
- class pacbio_data_processing.bam.BamFileStrategy(*args, **kwargs)[source]¶
Bases:
Protocol
- __init__(*args, **kwargs)¶
- pacbio_data_processing.bam._strategy_factory(name: str = '_BamFilePysam') pacbio_data_processing.bam.BamFileStrategy [source]¶
Internal function that returns the strategy class in a concrete
BamFile
instance.
pacbio_data_processing.bam_file_filter module¶
This module contains the high level functions necessary to apply some filters to a given input BAM file.
pacbio_data_processing.bam_utils module¶
Some helper functions to manipulate BAM files
- class pacbio_data_processing.bam_utils.CircularDNAPosition(pos: int, ref_len: int = 0)[source]¶
Bases:
object
A type that supports arithmetic with positions in a circular topology.
>>> p = CircularDNAPosition(5, ref_len=9)
The class has a decent repr:
>>> p
CircularDNAPosition(5, ref_len=9)
And we can use it in arithmetic contexts:
>>> p + 1
CircularDNAPosition(6, ref_len=9)
>>> int(p+1)
6
>>> int(p+5)
1
>>> int(20+p)
7
>>> p - 1
CircularDNAPosition(4, ref_len=9)
>>> int(p-6)
8
>>> int(p-16)
7
>>> int(2-p)
6
>>> int(8-p)
3
Boolean equality is also supported:
>>> p == CircularDNAPosition(5, ref_len=9)
True
>>> p == CircularDNAPosition(6, ref_len=9)
False
>>> p == CircularDNAPosition(14, ref_len=9)
True
>>> p == CircularDNAPosition(5, ref_len=8)
False
>>> p == 5
False
The < operator is supported too:
>>> p < p+1
True
>>> p < p
False
>>> p < p-1
False
Of course two instances cannot be compared if their underlying references are not equally long:
>>> s = CircularDNAPosition(5, ref_len=10)
>>> p < s
Traceback (most recent call last):
...
ValueError: cannot compare positions if topologies differ
or if they are not both CircularDNAPosition instances:
>>> s < 6
Traceback (most recent call last):
...
TypeError: '<' not supported between instances of 'CircularDNAPosition' and 'int'
The class has a convenience method:
>>> p.as_1base()
6
If the ref_len input parameter is less than or equal to 0, the topology is assumed to be linear:
>>> q = CircularDNAPosition(5, ref_len=-1)
>>> q
CircularDNAPosition(5, ref_len=0)
>>> q + 1001
CircularDNAPosition(1006, ref_len=0)
>>> q - 100
CircularDNAPosition(-95, ref_len=0)
>>> int(10-q)
5
Linear topology is the default behaviour:
>>> r = CircularDNAPosition(5)
>>> r
CircularDNAPosition(5, ref_len=0)
It is possible to use them as indices in slices:
>>> seq = "ABCDEFGHIJ"
>>> seq[r:r+2]
'FG'
CircularDNAPosition instances can be hashed (so that they can be elements of a set or keys in a dictionary):
>>> positions = {p, q, r}
And, very conveniently, a CircularDNAPosition converts to str as ints do:
>>> str(r) == '5'
True
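The integer results in the doctests above are consistent with plain modular arithmetic. The following sketch reproduces them with a standalone helper; wrap is a hypothetical name and not part of the package, and the rule for non-positive ref_len follows the linear-topology behaviour described above.

```python
def wrap(pos: int, ref_len: int = 0) -> int:
    """Reduce ``pos`` modulo ``ref_len``.

    A non-positive ``ref_len`` is taken to mean a linear topology,
    in which case ``pos`` is returned unchanged.
    """
    if ref_len <= 0:
        return pos
    # Python's % always returns a non-negative result for a positive
    # modulus, which is what a circular coordinate needs.
    return pos % ref_len


# Same arithmetic as int(p+5), int(20+p) and int(p-6) for p = 5, ref_len = 9.
examples = [wrap(5 + 5, 9), wrap(20 + 5, 9), wrap(5 - 6, 9)]
```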
- class pacbio_data_processing.bam_utils.Molecule(id: int, src_bam_path: Optional[Union[str, pathlib.Path]] = None, _best_ccs_line: Optional[tuple[bytes]] = None)[source]¶
Bases:
object
Abstraction around a single molecule from a Bam file
- __init__(id: int, src_bam_path: Optional[Union[str, pathlib.Path]] = None, _best_ccs_line: Optional[tuple[bytes]] = None) None ¶
- property ascii_quals: str¶
Ascii qualities of sequencing the molecule. Each symbol refers to one base.
- property cigar: pacbio_data_processing.cigar.Cigar¶
- property dna: str¶
- property end: pacbio_data_processing.bam_utils.CircularDNAPosition¶
Computes the end of a molecule as CircularDNAPosition(start + length of reference), which takes into account the possible circular topology of the reference.
- find_gatc_positions() list[pacbio_data_processing.bam_utils.CircularDNAPosition] [source]¶
The function returns the position of all the GATCs found in the Molecule’s sequence, taking into account the topology of the reference.
Each returned value is the 0-based index of a GATC motif, i.e. the index of the G in the Python convention.
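A rough sketch of such a motif search is given below. It assumes that a position in the reference is obtained as the molecule's start plus the index within its sequence, wrapped for circular references; the function name and its parameters are hypothetical and the real method may differ.

```python
def find_motif_positions(
    dna: str, start: int, ref_len: int = 0, motif: str = "GATC"
) -> list[int]:
    """Return 0-based reference positions of ``motif`` within ``dna``.

    Assumption: ``start`` is the reference position of ``dna[0]`` and a
    positive ``ref_len`` means a circular reference of that length.
    """
    positions = []
    index = dna.find(motif)
    while index != -1:
        pos = start + index
        # Wrap around the origin only for circular references.
        positions.append(pos % ref_len if ref_len > 0 else pos)
        index = dna.find(motif, index + 1)
    return positions
```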
- id: int¶
- is_crossing_origin(*, ori_pi_shifted=False) bool [source]¶
This method answers the question of whether the molecule crosses the origin, assuming a circular topology of the chromosome. The answer is True if the last base of the molecule is located before the first base. Otherwise the answer is False. It will return False if the molecule starts at the origin, but it will return True if it ends at the origin. There is an optional keyword-only boolean parameter, namely ori_pi_shifted, to indicate whether the reference has been shifted by pi radians.
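The rule above can be sketched with integer coordinates. This is only an illustration under two assumptions: the end is taken as start plus length, wrapped around the reference, and the ori_pi_shifted option is ignored. The function name is hypothetical.

```python
def crosses_origin(start: int, length: int, ref_len: int) -> bool:
    """Return True if the wrapped end lies before the wrapped start."""
    first = start % ref_len
    end = (start + length) % ref_len
    return end < first
```

With this rule a molecule starting at position 0 never crosses the origin, while one whose end wraps exactly to 0 does, which matches the two corner cases described in the text.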
- pi_shift_back() None [source]¶
Method that shifts back the (start, end) positions of the molecule assuming that they were shifted before by pi radians.
- src_bam_path: Optional[Union[str, pathlib.Path]] = None¶
- property start: pacbio_data_processing.bam_utils.CircularDNAPosition¶
Readable/writable attribute. It was originally only readable but the SingleMoleculeAnalysis class relies on it being writable to make it easier to shift back pi-shifted positions, which are computed from this attribute. The logic is: by default, the value is taken from the _best_ccs_line attribute, until it is modified, in which case the value is simply stored and returned upon request.
- pacbio_data_processing.bam_utils.count_subreads_per_molecule(bam: pacbio_data_processing.bam.BamFile) collections.defaultdict[int, collections.Counter] [source]¶
Given a read-open BamFile instance, it returns a defaultdict with keys being molecule ids (str) and values, a counter with subreads classified by strand. The possible keys of the returned counter are: +, -, ? meaning direct strand, reverse strand and unknown, respectively.
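The shape of the returned structure can be illustrated with a minimal sketch. It assumes the input has already been reduced to (molecule id, strand) pairs; count_by_strand is a hypothetical helper, not the package function.

```python
from collections import Counter, defaultdict
from collections.abc import Iterable


def count_by_strand(
    subreads: Iterable[tuple[int, str]],
) -> defaultdict[int, Counter]:
    """Count subreads per molecule id, classified by strand symbol."""
    counts: defaultdict[int, Counter] = defaultdict(Counter)
    for mol_id, strand in subreads:
        counts[mol_id][strand] += 1
    return counts


# Hypothetical input: three subreads of molecule 23 and one of molecule 45.
counts = count_by_strand([(23, "+"), (23, "-"), (23, "+"), (45, "?")])
```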
- pacbio_data_processing.bam_utils.flag2strand(flag: int) Literal['+', '-', '?'] [source]¶
Given a FLAG (see the BAM format specification), it transforms it to the corresponding strand.
- Returns
+, - or ? depending on the strand the input FLAG can be assigned to (? means: it could not be assigned to any strand).
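One possible mapping is sketched below. It relies on two bits defined by the SAM/BAM specification (0x4, segment unmapped; 0x10, sequence reverse complemented). Whether the package uses exactly this rule is an assumption; it may, for instance, accept only specific FLAG values. The name strand_from_flag is hypothetical.

```python
UNMAPPED = 0x4   # SAM flag bit: segment unmapped
REVERSE = 0x10   # SAM flag bit: SEQ is reverse complemented


def strand_from_flag(flag: int) -> str:
    """Map a SAM/BAM FLAG to '+', '-' or '?' (assumed rule)."""
    if flag & UNMAPPED:
        # No strand can be assigned to an unmapped read.
        return "?"
    return "-" if flag & REVERSE else "+"
```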
- pacbio_data_processing.bam_utils.gen_index_single_molecule_bams(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], program: pathlib.Path) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
It generates indices in the form of .pbi files using program, which must be the path to a working pbindex executable. For each molecule read from the input pipe, program is called as follows (the argument is the BAM associated with the current molecule):
pbindex aligned.pMA683.subreads.bam
The success of the operation is determined by inspecting the return code. If the call succeeds (i.e. the return code is 0), the corresponding MoleculeWorkUnit is yielded. If the call fails (the return code is not 0), an error is reported.
- pacbio_data_processing.bam_utils.join_gffs(work_units: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], out_file_path: Union[str, pathlib.Path]) collections.abc.Generator[pathlib.Path, None, None] [source]¶
The gff files related to the molecules provided in the input are read and joined in a single file. The individual gff files are yielded back.
Probably this function is useless and should be removed in the future: it only provides a joint gff file that is not a valid gff file and that is never used in the rest of the processing.
- pacbio_data_processing.bam_utils.old_single_molecule_work_units_gen(lines: collections.abc.Iterable, header: bytes, file_name_prefix: pathlib.Path, todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
This generator yields 2-tuples of (mol-id, Molecule) after having isolated the subreads corresponding to that molecule id from the lines (coming from the iteration over a BamFile instance). Before yielding, a one-molecule BAM file is created.
Warning
This generator assumes that the subreads are sorted by molecule_id, aka ZMW number. In that case, this implementation is probably much faster in most situations than the functionally equivalent single_molecule_work_units_gen.
- pacbio_data_processing.bam_utils.single_molecule_work_units_gen(inbam: pacbio_data_processing.bam.BamFile, out_name_without_molid: pathlib.Path, todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
This generator yields 2-tuples of (mol-id, Molecule) after having isolated the subreads corresponding to that molecule id from inbam. The generator relies on inbam having a mapping, inbam.last_subreads_map, that, for each molecule id, gives the last subread index corresponding to that molecule id. This generator properly handles the case of BAM files where the subreads are not grouped by molecule id, i.e. BAM files that are not sorted by molecule id (or ZMW).
Before yielding, a one-molecule BAM file is created with all the subreads of that molecule.
Warning
The current implementation keeps in memory a dictionary with all subreads of molecules that are not yet completely read. For large BAM files that can be a large memory footprint.
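The buffering strategy described above (collect subreads per molecule and release them once the last subread of that molecule has been seen) can be sketched as follows. This is a simplified illustration, not the package code: subreads are plain (molecule id, payload) pairs, no BAM file is written, and the last-index map is built locally instead of being read from inbam.last_subreads_map.

```python
from collections import defaultdict
from collections.abc import Iterable, Iterator


def group_by_molecule(
    subreads: Iterable[tuple[int, str]],
) -> Iterator[tuple[int, list[str]]]:
    """Yield (mol_id, subreads) as soon as a molecule is complete."""
    subreads = list(subreads)
    # For each molecule id, remember the index of its last subread.
    last_index = {mol_id: i for i, (mol_id, _) in enumerate(subreads)}
    # Subreads of molecules that are not yet completely read; this is
    # the part that can grow large for big, unsorted inputs.
    pending: defaultdict[int, list[str]] = defaultdict(list)
    for i, (mol_id, read) in enumerate(subreads):
        pending[mol_id].append(read)
        if last_index[mol_id] == i:
            yield mol_id, pending.pop(mol_id)


# Hypothetical unsorted input: molecules 1 and 2 are interleaved.
unsorted_input = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]
groups = list(group_by_molecule(unsorted_input))
```

Molecules are released in the order in which they become complete, so the pending dictionary only holds molecules whose subreads are still arriving.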
- pacbio_data_processing.bam_utils.split_bam_file_in_molecules(in_bam_file: Union[str, pathlib.Path], tempdir: Union[str, pathlib.Path], todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
All the individual molecules in the given bam file path, in_bam_file, that are found in todo, will be isolated and stored individually in the directory tempdir. The yielded Molecule instances will have their src_bam_path updated accordingly.
- pacbio_data_processing.bam_utils.write_one_molecule_bam(subreads: collections.abc.Iterable, header: bytes, in_file_name: pathlib.Path, pre_suffix: Any) pathlib.Path [source]¶
Given a sequence of BAM lines, a header, the source name and a suffix, a new BAM file is created containing the data provided and with a suitable name.
pacbio_data_processing.cigar module¶
This module provides basic ‘re-invented’ functionality to handle Cigars. A Cigar describes the differences between two sequences by providing a series of operations that one has to apply to one sequence to obtain the other one. For instance, given these two sequences:
sequence 1 (e.g. from the reference):
AAGTTCCGCAAATT
and
sequence 2 (e.g. from the aligner):
AAGCTCCCGCAATT
The Cigar that brings us from sequence 1 to sequence 2 is:
3=1X3=1I4=1D2=
where the numbers refer to the amount of letters and the symbols’ meaning can be found in the table below. Therefore the Cigar in the example is a shorthand for:
3 equal bases, followed by 1 replacement, followed by 3 equal bases, followed by 1 insertion, followed by 4 equal bases, followed by 1 deletion, followed by 2 equal bases
symbol | meaning |
---|---|
= | equal |
I | insertion |
D | deletion |
X | replacement |
S | soft clip |
H | hard clip |
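The example above can be checked mechanically. The sketch below parses a Cigar string into (count, symbol) pairs and verifies that it is consistent with a pair of sequences. The names parse_cigar and cigar_matches are hypothetical, and the treatment of clips (soft clips consume bases of the second sequence, hard clips consume nothing) follows the usual SAM convention rather than anything stated here.

```python
import re

# One Cigar item: a positive count followed by one operation symbol.
CIGAR_ITEM = re.compile(r"(\d+)([=IDXSH])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a Cigar string into (count, symbol) pairs."""
    return [(int(n), op) for n, op in CIGAR_ITEM.findall(cigar)]


def cigar_matches(seq1: str, seq2: str, cigar: str) -> bool:
    """Check that ``cigar`` describes how to go from seq1 to seq2."""
    i = j = 0  # cursors in seq1 and seq2
    for n, op in parse_cigar(cigar):
        a, b = seq1[i:i + n], seq2[j:j + n]
        if op == "=":
            if len(a) != n or a != b:
                return False
            i, j = i + n, j + n
        elif op == "X":
            if len(a) != n or len(b) != n or any(x == y for x, y in zip(a, b)):
                return False
            i, j = i + n, j + n
        elif op in "IS":   # bases present only in seq2
            j += n
        elif op == "D":    # bases present only in seq1
            i += n
        # "H": hard-clipped bases are absent from both cursors
    return i == len(seq1) and j == len(seq2)


items = parse_cigar("3=1X3=1I4=1D2=")
ok = cigar_matches("AAGTTCCGCAAATT", "AAGCTCCCGCAATT", "3=1X3=1I4=1D2=")
```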
- class pacbio_data_processing.cigar.Cigar(incigar)[source]¶
Bases:
object
- property diff_ratio¶
difference ratio: 1 means that each base is different; 0 means that all the bases are equal.
- property number_diff_items¶
- property number_diff_types¶
- property number_pb_diffs¶
- property number_pbs¶
- property sim_ratio¶
similarity ratio: 1 means that all the bases are equal; 0 means that each base is different.
This is computed from diff_ratio().
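A possible way to compute both ratios is sketched below. The exact definition used by the class is not given here, so this sketch assumes that the difference ratio is the number of bases under any operation other than = divided by the total number of bases in the Cigar, and that the similarity ratio is its complement. The function names are hypothetical.

```python
import re

CIGAR_ITEM = re.compile(r"(\d+)([=IDXSH])")


def diff_ratio(cigar: str) -> float:
    """Fraction of bases that are not '=' (assumed definition)."""
    items = [(int(n), op) for n, op in CIGAR_ITEM.findall(cigar)]
    total = sum(n for n, _ in items)
    if total == 0:
        raise ValueError("empty or invalid Cigar")
    different = sum(n for n, op in items if op != "=")
    return different / total


def sim_ratio(cigar: str) -> float:
    """Similarity ratio, computed from the difference ratio."""
    return 1 - diff_ratio(cigar)
```

For the example Cigar 3=1X3=1I4=1D2= this definition counts 3 differing bases out of 15.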
pacbio_data_processing.constants module¶
pacbio_data_processing.errors module¶
pacbio_data_processing.external module¶
- class pacbio_data_processing.external.AlignerMixIn[source]¶
Bases:
object
A MixIn providing common functionality for aligner wrappers.
- class pacbio_data_processing.external.Blasr(path: Union[pathlib.Path, str])[source]¶
Bases:
pacbio_data_processing.external.AlignerMixIn, pacbio_data_processing.external.ExternalProgram
A simple wrapper around the
blasr
aligner (https://github.com/BioinformaticsArchive/blasr).
- class pacbio_data_processing.external.CCS(path: Union[pathlib.Path, str])[source]¶
Bases:
pacbio_data_processing.external.ExternalProgram
A simple wrapper around the ccs program, from the pbccs package (https://ccs.how/)
- __call__(in_bamfile: Union[pathlib.Path, str], out_bamfile: Union[pathlib.Path, str]) Optional[int] [source]¶
It runs the executable with the given parameters. The return code of the associated process is returned by this method if the executable could run at all; otherwise None is returned.
One case where the executable cannot run is when the sentinel file is there before the executable process is run.
- class pacbio_data_processing.external.ExternalProgram(path: Union[pathlib.Path, str])[source]¶
Bases:
object
A base class with functionality common to all classes wrapping external programs that:
produce an output file, and
need the production of that file to be protected by a Sentinel.
This base class provides the interface and the Sentinel protection.
- __call__(infile: Union[pathlib.Path, str], outfile: Union[pathlib.Path, str], *args, **kwargs) Optional[int] [source]¶
It runs the executable with the given parameters. The return code of the associated process is returned by this method if the executable could run at all; otherwise None is returned.
One case where the executable cannot run is when the sentinel file is there before the executable process is run.
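The return-value contract described above (a return code if the program ran, None if a sentinel was already present) can be sketched as follows. This is not the class's implementation: run_protected is a hypothetical function, and the naming rule for the sentinel file is an assumption made only for this example.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def run_protected(cmd: list[str], outfile: Path) -> Optional[int]:
    """Run ``cmd`` unless a sentinel for ``outfile`` already exists."""
    # Hypothetical naming rule for the sentinel file.
    sentinel = outfile.with_name(f".{outfile.name}.wip")
    try:
        # Mode "x" fails if the file exists, so checking for the
        # sentinel and creating it happen in a single step.
        sentinel.open("x").close()
    except FileExistsError:
        return None  # somebody else is producing outfile
    try:
        return subprocess.run(cmd).returncode
    finally:
        sentinel.unlink()


if __name__ == "__main__":
    # Harmless demo command; a real call would run an aligner or ccs.
    print(run_protected([sys.executable, "-c", "pass"], Path("out.bam")))
```

Returning None instead of raising lets a caller treat "work in progress elsewhere" as a normal outcome and simply try again later.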
- exception pacbio_data_processing.external.MissingExternalToolError[source]¶
Bases:
FileNotFoundError
- class pacbio_data_processing.external.Pbmm2(path: Union[pathlib.Path, str])[source]¶
Bases:
pacbio_data_processing.external.AlignerMixIn, pacbio_data_processing.external.ExternalProgram
A simple wrapper around the
pbmm2
aligner (https://github.com/PacificBiosciences/pbmm2).
pacbio_data_processing.filters module¶
- pacbio_data_processing.filters.cleanup_molecules(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
Generator of MoleculeWorkUnit objects that pass all the standard filters, i.e. the sequence of filters needed by sm-analysis to select what molecules (and what subreads in those molecules) will be IPD-analyzed.
It is assumed that each file contains subreads corresponding to only ONE molecule (i.e. ‘molecules’ is a generator of tuples (mol id, Molecule), with Molecule being related to a single molecule id). [Note for developers: Should we allow multiple molecules per file?]
If there are subreads surviving the filtering process, the bam file is overwritten with the filtered data and the tuple (mol id, Molecule) is yielded. If no subread survives the process, nothing is done (no bam written, no tuple yielded).
- pacbio_data_processing.filters.empty_buffer(buf: collections.deque, threshold: int, flags_seen: set) Generator[tuple[bytes], None, None] [source]¶
This generator empties the passed-in buffer, either yielding its items, if the conditions are met, or throwing them away if not.
The conditions are:
the number of items is at least threshold, and
flags_seen is a (not necessarily proper) superset of {'+', '-'}.
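The two conditions above translate almost literally into code. The following is a sketch under the stated conditions only; the name drain is hypothetical, and details such as whether flags_seen is reset afterwards are not covered.

```python
from collections import deque
from collections.abc import Iterator


def drain(buf: deque, threshold: int, flags_seen: set) -> Iterator:
    """Empty ``buf``, yielding its items only if both conditions hold."""
    # ">=" on sets is the (not necessarily proper) superset test.
    keep = len(buf) >= threshold and flags_seen >= {"+", "-"}
    while buf:
        item = buf.popleft()
        if keep:
            yield item
```

The decision is taken once, before the buffer starts shrinking, so the threshold refers to the number of items initially buffered.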
- pacbio_data_processing.filters.filter_enough_data_per_molecule(lines: collections.abc.Iterable[tuple], threshold: int) Generator[tuple[bytes], None, None] [source]¶
This generator yields the input data if there is enough data to yield. Enough means at least threshold number of data items.
- pacbio_data_processing.filters.filter_mappings_binary(lines, mappings, *rest)[source]¶
Simply take or reject mappings depending on passed sequence
pacbio_data_processing.ipd module¶
- pacbio_data_processing.ipd.ipd_summary(molecule: tuple[int, pacbio_data_processing.bam_utils.Molecule], fasta: Union[str, pathlib.Path], program: pathlib.Path, nprocs: int, mod_types_comma_sep: str, ipd_model: Union[str, pathlib.Path], skip_if_present: bool) Optional[tuple[int, pacbio_data_processing.bam_utils.Molecule]] [source]¶
Lowest level interface to ipdSummary: all calls to that program are expected to be done through this function. It runs ipdSummary with an input bam file like this:
ipdSummary aligned.pMA683.subreads.bam --reference pMA683.fa --identify m6A --gff aligned.pMA683.subreads.476.bam.gff
As a result of this, a gff file is created. This function sets an attribute in the target Molecule with the path to that file.
If the process went well (ipdSummary returns 0), the input MoleculeWorkUnit is returned; otherwise the molecule is tagged as being problematic (had_processing_problems is set to True) and None is returned.
Missing features: skip_if_present
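For illustration, the command line shown above could be assembled as an argument list like this. Only the options that appear in the example are used; the helper name and the rule for naming the gff file (the BAM path plus a .gff suffix) are assumptions of this sketch.

```python
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def build_ipd_summary_command(
    program: PathLike,
    bam: PathLike,
    fasta: PathLike,
    mod_types_comma_sep: str,
) -> list[str]:
    """Build the argument list for one ipdSummary call (sketch)."""
    gff = f"{bam}.gff"  # assumed naming rule for the output file
    return [
        str(program), str(bam),
        "--reference", str(fasta),
        "--identify", mod_types_comma_sep,
        "--gff", gff,
    ]


cmd = build_ipd_summary_command(
    "ipdSummary", "aligned.pMA683.subreads.476.bam", "pMA683.fa", "m6A"
)
```

A list of arguments, rather than a single string, can be passed to subprocess.run without involving a shell.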
- pacbio_data_processing.ipd.multi_ipd_summary(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], None, None] ¶
Generator that yields the MoleculeWorkUnit objects resulting from ipd_summary (None results are skipped). Parallel implementation driven by a pool of threads.
- pacbio_data_processing.ipd.multi_ipd_summary_direct(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], None, None] [source]¶
Generator that yields the MoleculeWorkUnit objects resulting from ipd_summary (None results are skipped). Serial implementation (one file produced after the other).
- pacbio_data_processing.ipd.multi_ipd_summary_threads(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], None, None] [source]¶
Generator that yields the MoleculeWorkUnit objects resulting from ipd_summary (None results are skipped). Parallel implementation driven by a pool of threads.
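The pattern of mapping a function over work units with a thread pool while dropping None results can be sketched with the standard library. The name parallel_skip_none and its parameters are hypothetical; the sketch does not reproduce the package function's signature.

```python
from collections.abc import Callable, Iterable, Iterator
from concurrent.futures import ThreadPoolExecutor
from typing import Optional, TypeVar

T = TypeVar("T")


def parallel_skip_none(
    func: Callable[[T], Optional[T]],
    items: Iterable[T],
    num_workers: int,
) -> Iterator[T]:
    """Apply ``func`` in a thread pool and yield the non-None results."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map returns results in input order.
        for result in pool.map(func, items):
            if result is not None:
                yield result


def keep_even(x: int) -> Optional[int]:
    """Toy stand-in for a processing step that may fail (None)."""
    return x if x % 2 == 0 else None


evens = list(parallel_skip_none(keep_even, range(6), num_workers=3))
```

Threads are a reasonable fit when each work unit mostly waits on an external process, since the waiting does not hold the interpreter lock.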
pacbio_data_processing.logs module¶
pacbio_data_processing.methylation module¶
A module containing methylation related code.
pacbio_data_processing.parameters module¶
This module defines mediator classes to interact with user given parameters.
- class pacbio_data_processing.parameters.BamFilteringParameters(cl_input)[source]¶
Bases:
pacbio_data_processing.parameters.ParametersBase
Mediator class: intermediary between the user input and the BamFilter instance.
- property filter_mappings¶
- property limit_mappings¶
- property min_relative_mapping_ratio¶
- property out_bam_file¶
- class pacbio_data_processing.parameters.SingleMoleculeAnalysisParameters(cl_input)[source]¶
Bases:
pacbio_data_processing.parameters.ParametersBase
Mediator class: intermediary between the user input and the SingleMoleculeAnalysis instance.
- property ipd_model: Optional[pathlib.Path]¶
- property joint_gff_filename¶
- property partition: Optional[tuple[int, int]]¶
It validates the input partition and interfaces with API clients.
- property partition_done_filename¶
- property raw_detections_filename¶
- property summary_report_html_filename¶
pacbio_data_processing.plots module¶
- pacbio_data_processing.plots.make_barsplot(dataframe: pandas.core.frame.DataFrame, plot_title: str, filename: Union[pathlib.Path, str]) None [source]¶
- pacbio_data_processing.plots.make_continuous_rolled_data(data: dict[typing.NewType.<locals>.new_type, typing.NewType.<locals>.new_type], window: int) pandas.core.frame.DataFrame [source]¶
Auxiliary function used by
make_rolling_history
to produce a dataframe with the rolling average of the input data. The resulting dataframe starts at the min input position and ends at the max input position. The holes are set to 0 in the input data.
- pacbio_data_processing.plots.make_histogram(dataframe: pandas.core.frame.DataFrame, plot_title: str, filename: Union[pathlib.Path, str], legend: bool = True) None [source]¶
pacbio_data_processing.sam module¶
pacbio_data_processing.sentinel module¶
- class pacbio_data_processing.sentinel.Sentinel(checkpoint: pathlib.Path)[source]¶
Bases:
object
This class creates objects that are expected to be used as context managers. At __enter__ a sentinel file is created. At __exit__ the sentinel file is removed. If the file is there before entering the context, or is not there when the context is exited, an exception is raised.
- _anti_aging()[source]¶
Method that updates the modification time of the sentinel file every SLEEP_SECONDS seconds. This is part of the mechanism to ensure that the sentinel does not get fooled by an abandoned leftover sentinel file.
- property is_file_too_old¶
Property that answers the question: is the sentinel file too old to be taken as an active sentinel file, or not?
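The context manager protocol described for this class can be sketched in a few lines. SentinelSketch is a hypothetical, simplified stand-in: it leaves out the anti-aging thread and the age check, and it uses the built-in FileExistsError and FileNotFoundError rather than whatever exceptions the package raises.

```python
from pathlib import Path


class SentinelSketch:
    """Create a marker file on entry and remove it on exit."""

    def __init__(self, checkpoint: Path) -> None:
        self.checkpoint = checkpoint

    def __enter__(self) -> "SentinelSketch":
        # exist_ok=False makes touch raise FileExistsError if the
        # sentinel file is already there.
        self.checkpoint.touch(exist_ok=False)
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # unlink raises FileNotFoundError if the file has vanished.
        self.checkpoint.unlink()
```

A long computation would be wrapped in a with block, so the marker exists exactly while the work is in progress.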
pacbio_data_processing.sm_analysis module¶
This module contains the high level functions necessary to run the ‘Single Molecule Analysis’ on an input BAM file.
- class pacbio_data_processing.sm_analysis.SingleMoleculeAnalysis(parameters)[source]¶
Bases:
object
- property CCS_bam_file¶
It produces a Circular Consensus Sequence (CCS) version of the input BAM file and returns its name. It uses
generate_CCS_file()
to generate the file.
- __call__() None [source]¶
Main entry point to perform a single molecule analysis: this method triggers the analysis.
- _align_bam_if_no_candidate_found(inbam: pacbio_data_processing.bam.BamFile, bam_type: str, variant: str = 'straight') Optional[str] [source]¶
[Internal method] Auxiliary method used by _ensure_input_bam_aligned. Given a bam_type (among input and ccs) and a variant, an initial BAM file is selected and a target aligned BAM filename is constructed. The method checks first whether the aligned file is there. If a plausible candidate is not found, the initial BAM is aligned (straight or π-shifted, depending on the variant and using the proper reference). If, on the other hand, a candidate is found, its computation is skipped.
If the aligner cannot be run (i.e. calling the aligner returns None), None is returned, meaning that the aligner was not called. This can happen when the aligner finds a sentinel file indicating that the computation is work in progress. (See pacbio_data_processing.external.Blasr.__call__() for more details on the implementation.) This mechanism allows reentrancy.
- Returns
the aligned input bam file, if it is there, or None if it could not be computed (yet).
- _collect_statistics() None [source]¶
[Internal method] It sets an attribute: ‘filtered_bam_statistics’ that contains some data to be consumed by the MethylationReport. For now the only data is the number of subreads per molecule and per strand.
- _collect_suitable_molecules_from_ccs() dict[int, pacbio_data_processing.bam_utils.Molecule] [source]¶
[Internal method] Auxiliary routine of _select_molecules in charge of choosing suitable molecules from the aligned CCS bam files. The resulting mapping contains all suitable molecules in the ‘straight’ variant and the suitable molecules in the ‘π-shifted’ variant that are not in the ‘straight’ variant. The molecules corresponding to both variants will be joined. Among all the possible subreads of each molecule in the aligned CCS, one is chosen by map_molecules_with_highest_sim_ratio. The choice of suitable molecules is done by the method _discard_molecules_with_seq_mismatch. Moreover the molecules are labeled with the variant they belong to. It is necessary to do this labeling, so that we can later trace what reference each molecule is attached to.
- _create_references()[source]¶
[Internal method] DNA reference sequences are created here. The ‘true’ reference must exist as fasta beforehand, with its index. A π-shifted reference is created from the original one. Its index is also made.
This method sets two attributes which are, both, mappings with two keys (‘straight’ and ‘pi-shifted’) and values as follows:
reference: the values are DNASeq objects
fasta: the values are Path objects
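A π-shifted reference can be pictured as the original circular sequence rotated by half a turn. The sketch below assumes that "shifted by π" means a rotation by half the sequence length (integer division for odd lengths); the function names are hypothetical and the package may define the shift or its direction differently.

```python
def pi_shift(sequence: str) -> str:
    """Rotate a circular sequence by half its length (assumed rule)."""
    half = len(sequence) // 2
    return sequence[half:] + sequence[:half]


def shift_back(pos: int, ref_len: int) -> int:
    """Map a position in the shifted sequence to the original one."""
    return (pos + ref_len // 2) % ref_len


original = "ABCDEFGHI"
shifted = pi_shift(original)
```

With these definitions, the base found at position i of the shifted sequence is the base at position shift_back(i, len(original)) of the original one, which is the kind of correspondence needed to translate positions found on the shifted reference.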
- _crosscheck_molecules_in_partition_with_ccs(molecules_from_ccs: dict[int, pacbio_data_processing.bam_utils.Molecule]) None [source]¶
[Internal method] This method ensures that only the molecules in the current partition are processed. It does it by crosschecking the sets corresponding to the partition (for all variants) with the set of valid molecules in the ccs file. The attribute _molecules_todo is set, and its type is: dict[int, Molecule]
- _disable_pi_shifted_analysis() None [source]¶
[Internal method] If the pi-shifted analysis cannot be carried out, it is disabled with this method.
- _discard_molecules_with_seq_mismatch(molecules_from_ccs: dict[int, pacbio_data_processing.bam_utils.Molecule]) dict[int, pacbio_data_processing.bam_utils.Molecule] [source]¶
[Internal method] The aligned CCS molecules are filtered in this method to keep only molecules that match perfectly the corresponding reference (ie, taking into account variants).
- _dump_results() None [source]¶
[Internal method] All the output generated is driven by this method:
a joint gff file
a per detection csv file
a methylation report
a summary report
the molecules sets (see pacbio_data_processing.summary.SummaryReport)
- _ensure_ccs_bam_aligned() None [source]¶
[Internal method] As its name suggests, it is ensured that the aligned variants of the CCS file exist. The summary report is informed about the aligned CCS files.
Note
The CCS BAM file is created before checking if its aligned variants are present. It might seem a logic error to proceed this way instead of checking first for the existence of the aligned variants of the CCS BAM before deciding if the computation of the CCS BAM file is needed, but it is not an error: in order to decide if a given file can be an aligned version of the CCS BAM, we need the CCS BAM itself.
- _ensure_input_bam_aligned() None [source]¶
[Internal method] Main check point for aligned input bam files: this method calls whatever is necessary to ensure that the input bam is aligned, which means: normal (straight) alignment and π-shifted alignment.
Warning! The method tries to find a pi-shifted aligned BAM if the input is aligned based on whether
a file with suitable filename is found, and
it is aligned.
- _exists_pi_shifted_variant_from_aligned_input() bool [source]¶
[Internal method] It checks that the expected pi-shifted aligned file exists and is an aligned BAM file.
- _filter_molecules() None [source]¶
[Internal method] The _molecules_todo mapping is here reduced by removing molecules that do not fulfil a minimum requirement of quality. The summary report is updated accordingly. See the cleanup_molecules auxiliary function for details on the filtering process. An attribute called _filtered_molecules_generator is set, which produces MoleculeWorkUnit objects.
- _fix_positions() None [source]¶
[Internal method] The purpose is to shift back the shifted positions in the π-shifted molecules. Two operations are required to complete that task:
fixing positions in the gff files, and
fixing positions in the molecules themselves.
- _fix_positions_in_gffs() None [source]¶
[Internal method] In the case that some molecules have been processed, the positions in the gff files corresponding to molecules that have been π-shifted are shifted back.
- _fix_positions_in_molecules() None [source]¶
[Internal method] All positions of π-shifted molecules are shifted back in the
_molecules_todo
dictionary (which will be used to generate the methylation report).
- _generate_indices() None [source]¶
[Internal method] Indices are generated for all files that need to be analyzed by ipdSummary.
- _init_summary() None [source]¶
[Internal method] This method creates an instance of
SummaryReport
and sets an attribute with it.
- _ipd_analysis() None [source]¶
[Internal method] Performs the IPD analysis of the single molecule files. Sets a generator with Paths to produced GFF files.
- _keep_only_pishifted_molecules_crossing_origin(molecules_from_ccs: dict[int, pacbio_data_processing.bam_utils.Molecule]) dict[int, pacbio_data_processing.bam_utils.Molecule] [source]¶
[Internal method] This method filters out molecules from the CCS aligned list that:
1. Belong to the π-shifted variant, and
2. Do not cross the origin.
These molecules are unwanted because the point of including π-shifting in the analysis is to catch molecules crossing the origin.
- _merge_partitions_if_needed() None [source]¶
[Internal method] This method merges properly the output files produced during the processing of all the partitions, if they are ready.
Warning
It is assumed that this is called within the _post_process_partition phase.
- _post_process_partition() None [source]¶
[Internal method] After the analysis is done, if only a fraction (aka partition) was processed, this method declares that the analysis of the current partition is complete and tries to merge the partitions (which will only occur if the proper conditions are met).
- _remove_partition_done_files()[source]¶
[Internal method] Remove the partition done marker files after the partitions have been successfully merged.
Warning
It is assumed that this is called within the _post_process_partition phase.
- _report_discarded_molecules_with_seq_mismatch(mols_in_raw_ccs_files: dict[str, dict[int, pacbio_data_processing.bam_utils.Molecule]], molecules_from_ccs: dict[int, pacbio_data_processing.bam_utils.Molecule]) None [source]¶
[Internal method] This method simply logs the ids of discarded molecules and passes the information to the
SummaryReport
instance.
- _report_faulty_molecules() None [source]¶
[Internal method] The molecules that had any problem in their processing are passed to the
SummaryReport
as a set.
- _select_molecules() None [source]¶
[Internal method] This method is part of the main sequence irrespective of whether the user selects to only produce the methylation report, or the full analysis. After this method the mapping _molecules_todo is created, of type dict[int, Molecule], with molecules that:
Belong to the partition,
Are correctly mapped in the aligned CCS file, and
If they belong to the pi-shifted variant (molecules obtained after aligning with a pi-shifted reference) then they cross the origin.
- _set_aligner() None [source]¶
[Internal method] This method decides what aligner to use, sets an attribute with it and sets the prefixes accordingly.
- _split_bam() None [source]¶
[Internal method] Produces a generator with 2-tuples of the type (mol_id[int], Molecule) where the Molecule is related to a single molecule BAM file that has been generated by split_bam_file_in_molecules. It sets an attribute called _per_molecule_bam_generator that refers to that generator.
- property all_partition_done_filenames: list[pathlib.Path]¶
Attribute that returns a list of ``Path`` objects corresponding to the files expected to be found when all the partitions are processed (in case of partitioning the input BAM).
- property all_partitions_ready: bool¶
Attribute that answers the question: are all the partitions ready?
- property partition: pacbio_data_processing.utils.Partition¶
The target ``Partition`` of the input BAM file that must be processed by the current analysis, according to the input provided by the user.
- property workdir: tempfile.TemporaryDirectory¶
This attribute returns the necessary temporary working directory on demand and it ensures that only one temporary dir is created by caching.
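The on-demand, create-once behaviour described for ``workdir`` can be sketched with ``functools.cached_property``. This is only an illustration: the class name ``AnalysisSketch`` is hypothetical, and whether the package uses ``cached_property`` or some other caching mechanism is not stated here.

```python
import tempfile
from functools import cached_property


class AnalysisSketch:
    """Hypothetical stand-in for the class owning ``workdir``."""

    @cached_property
    def workdir(self) -> tempfile.TemporaryDirectory:
        # Executed only on first access; the result is stored on the
        # instance, so later accesses return the same directory object.
        return tempfile.TemporaryDirectory()


analysis = AnalysisSketch()
first = analysis.workdir
second = analysis.workdir
same_object = first is second
first.cleanup()  # remove the temporary directory explicitly
```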
- pacbio_data_processing.sm_analysis._main(config) None [source]¶
This function drives the Single Molecule Analysis once the input has been parsed.
- pacbio_data_processing.sm_analysis.create_raw_detections_file(gffs: collections.abc.Iterable[Union[pathlib.Path, str]], detections_filename: Union[pathlib.Path, str], modification_types: list[str])[source]¶
Function in charge of creating the raw detections file. Starting from a set of .gff files, a csv file (delimiter=”,”), the raw detections file, is saved with the following columns:
mol id: taken from each gff filename (e.g. ‘a.b.c.gff’ -> mol id: ‘b’);
modtype: column number 3 (idx: 2) of the gffs (feature type) (e.g. ‘m6A’);
GATC position: column number 5 (idx: 4) of each gff which corresponds to the ‘end coordinate of the feature’ in the GFF3 standard;
score of the feature: column number 6 (idx: 5); floating point (Phred-transformed pvalue that a kinetic deviation exists at this position)
strand: strand of the feature. It can be +, - with obvious meanings. It can also be ? (meaning unknown) or . (for non stranded features)
There are more columns. Although their number is not fixed by this function, in practice they are 4 in the case of a detected modification. In that case these 4 last columns correspond to the values given in the ‘attributes’ column of the gffs (col 9; idx 8). For example, given the following attributes column:
coverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228
we would get the following 4 ‘extra’ columns in our raw detections file:
134,TCA...,3.91,228
and this is exactly what happens with the m6A modification type. Notice that the value of identificationQv is, again, a phred transformed probability of having a detection. See eq. (8) in [1].
Parsing: All the lines starting with ‘#’ in the gff files are ignored. The format of the gff file is GFF3: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
[1]: “Detection and Identification of Base Modifications with Single Molecule Real-Time Sequencing Data”
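The column mapping described above can be illustrated with a small sketch that turns one GFF3 feature line into one row of the raw detections file. The helper names ``mol_id_from_name`` and ``gff_line_to_row`` are hypothetical, and taking the second dot-separated token of the filename as the mol id is an assumption based only on the ‘a.b.c.gff’ example.

```python
import csv
import io
from pathlib import Path


def mol_id_from_name(gff_name: str) -> str:
    # Assumption: 'a.b.c.gff' -> 'b' (second dot-separated token).
    return Path(gff_name).name.split(".")[1]


def gff_line_to_row(mol_id: str, line: str) -> list[str]:
    cols = line.rstrip("\n").split("\t")
    # GFF3 columns: idx 2 = type, idx 4 = end, idx 5 = score, idx 6 = strand
    modtype, position, score, strand = cols[2], cols[4], cols[5], cols[6]
    # idx 8 = attributes: keep only the values, in the order given
    extras = [item.split("=", 1)[1] for item in cols[8].split(";") if "=" in item]
    return [mol_id, modtype, position, score, strand, *extras]


gff_line = (
    "chr1\tkinModCall\tm6A\t100\t100\t35\t+\t.\t"
    "coverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228"
)
row = gff_line_to_row(mol_id_from_name("a.b.c.gff"), gff_line)

buffer = io.StringIO()
csv.writer(buffer, delimiter=",").writerow(row)
csv_text = buffer.getvalue()
```

The sample feature line (sequence id, source, coordinates) is made up for the illustration; only the attributes column is taken from the text above.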
- pacbio_data_processing.sm_analysis.generate_CCS_file(ccs: pacbio_data_processing.external.CCS, in_bam: pathlib.Path, ccs_bam_file: pathlib.Path) Optional[pathlib.Path] [source]¶
Idempotent computation of the Circular Consensus Sequence (CCS) version of the passed in ``in_bam`` file done with the passed-in ``ccs`` object.
- Returns
the CCS bam file, if it is there, or ``None`` if it could not be computed (yet).
- pacbio_data_processing.sm_analysis.map_molecules_with_highest_sim_ratio(bam_file_name: Optional[Union[pathlib.Path, str]]) dict[int, pacbio_data_processing.bam_utils.Molecule] [source]¶
Given the path to a bam file, it returns a dictionary, whose keys are mol ids (ints) and the values are the corresponding Molecules. If multiple lines in the given BAM file share the mol id, only the first line found with the highest similarity ratio (computed from the cigar) is chosen: if multiple lines share the molecule ID and the highest similarity ratio (say, 1), ONLY the first one is taken, irrespective of other factors.
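The "first line with the highest similarity ratio wins" rule can be sketched independently of BAM parsing. Here the input is a plain sequence of ``(mol_id, ratio, payload)`` tuples, which is an assumption made for the illustration; how the ratio is computed from the cigar is not shown.

```python
def pick_best_per_molecule(records):
    """Keep, per mol id, the first record with the highest ratio."""
    best = {}
    for mol_id, ratio, payload in records:
        # Strict '>' means a later record with an equal ratio never
        # replaces the one found first.
        if mol_id not in best or ratio > best[mol_id][0]:
            best[mol_id] = (ratio, payload)
    return {mol_id: payload for mol_id, (_, payload) in best.items()}


records = [
    (1, 0.9, "line a"),
    (1, 1.0, "line b"),
    (1, 1.0, "line c"),  # same ratio as "line b", but found later
    (2, 0.5, "line d"),
]
chosen = pick_best_per_molecule(records)
```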
pacbio_data_processing.sm_analysis_gui module¶
pacbio_data_processing.summary module¶
- class pacbio_data_processing.summary.GATCCoverageBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'GATCs NOT in BAM file (%)': ('perc_all_gatcs_not_identified_in_bam',), 'GATCs NOT in methylation report (%)': ('perc_all_gatcs_not_in_meth',), 'GATCs in BAM file (%)': ('perc_all_gatcs_identified_in_bam',), 'GATCs in methylation report (%)': ('perc_all_gatcs_in_meth',)}¶
- dependency_names = ('aligned_ccs_bam_files', 'methylation_report')¶
- index_labels = ('Percentage',)¶
- title = 'GATCs in BAM file and Methylation report'¶
- class pacbio_data_processing.summary.MethTypeBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'Fully methylated (%)': ('fully_methylated_gatcs_wrt_meth',), 'Fully unmethylated (%)': ('fully_unmethylated_gatcs_wrt_meth',), 'Hemi-methylated in + strand (%)': ('hemi_plus_methylated_gatcs_wrt_meth',), 'Hemi-methylated in - strand (%)': ('hemi_minus_methylated_gatcs_wrt_meth',)}¶
- dependency_names = ('methylation_report',)¶
- index_labels = ('Percentage',)¶
- title = 'Methylation types in methylation report'¶
- class pacbio_data_processing.summary.MoleculeLenHistogram(name=None)[source]¶
Bases:
pacbio_data_processing.summary.HistoryPlotAttribute
- column_name = 'len(molecule)'¶
- data_name = 'length'¶
- dependency_name = 'methylation_report'¶
- labels = ('Initial subreads', 'Analyzed molecules')¶
- legend = True¶
- title = 'Initial subreads and analyzed molecule length histogram'¶
- class pacbio_data_processing.summary.MoleculeTypeBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'Faulty (with processing error)': ('perc_faulty_mols', 'perc_faulty_subreads'), 'Filtered out': ('perc_filtered_out_mols', 'perc_filtered_out_subreads'), 'In Methylation report with GATC': ('perc_mols_in_meth_report_with_gatcs', 'perc_subreads_in_meth_report_with_gatcs'), 'In Methylation report without GATC': ('perc_mols_in_meth_report_without_gatcs', 'perc_subreads_in_meth_report_without_gatcs'), 'Mismatch discards': ('perc_mols_dna_mismatches', 'perc_subreads_dna_mismatches'), 'Used in aligned CCS': ('perc_mols_used_in_aligned_ccs', 'perc_subreads_used_in_aligned_ccs')}¶
- dependency_names = ('mols_used_in_aligned_ccs', 'methylation_report')¶
- index_labels = ('Number of molecules (%)', 'Number of subreads (%)')¶
- title = 'Processed molecules and subreads'¶
- class pacbio_data_processing.summary.PercAttribute(total_attr, pref='perc_', suf='_wrt_meth', name=None)[source]¶
Bases:
pacbio_data_processing.summary.ROAttribute
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- class pacbio_data_processing.summary.PositionCoverageBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'Positions NOT covered by molecules in BAM file (%)': ('perc_all_positions_not_in_bam',), 'Positions NOT covered by molecules in methylation report (%)': ('perc_all_positions_not_in_meth',), 'Positions covered by molecules in BAM file (%)': ('perc_all_positions_in_bam',), 'Positions covered by molecules in methylation report (%)': ('perc_all_positions_in_meth',)}¶
- dependency_names = ('aligned_ccs_bam_files', 'methylation_report')¶
- index_labels = ('Percentage',)¶
- title = 'Position coverage in BAM file and Methylation report'¶
- class pacbio_data_processing.summary.PositionCoverageHistory(name=None)[source]¶
Bases:
pacbio_data_processing.summary.HistoryPlotAttribute
- dependency_name = 'methylation_report'¶
- labels = ('Positions',)¶
- legend = False¶
- len_column_name = 'len(molecule)'¶
- start_column_name = 'start of molecule'¶
- title = 'Sequencing positions covered by analyzed molecules'¶
- class pacbio_data_processing.summary.SimpleAttribute(name=None)[source]¶
Bases:
object
The base class of all other descriptor managed attributes of ``SummaryReport``. It is a wrapper around the ``_data`` dictionary of the instance owning this attribute.
- class pacbio_data_processing.summary.SummaryReport(bam_path, dnaseq, figures_prefix='')[source]¶
Bases:
collections.abc.Mapping
Final summary report generated by ``sm-analysis``, initially intended for humans.
This class has been crafted to carefully control its attributes. Data can be fed into the class by setting some attributes. That process triggers the generation of other attributes, which are typically read-only.
After instantiating the class with the path to the input BAM and the DNA sequence of the reference (instance of ``DNASeq``), one must set some attributes to be able to save the summary report:
    s = SummaryReport(bam_path, dnaseq)
    s.methylation_report = path_to_meth_report
    s.raw_detections = path_to_raw_detections_file
    s.gff_result = path_to_gff_result_file
    s.aligned_ccs_bam_files = {
        'straight': aligned_ccs_path,
        'pi-shifted': pi_shifted_aligned_ccs_path
    }
    # Some information about what happened with some molecules must
    # be given as well. There are two options for that. First, in the
    # *normal flow* the following would be done:
    s.mols_used_in_aligned_ccs = {3, 67, ...}  # set of ints
    # Optionally you can provide:
    s.mols_dna_mismatches = {20, 49, ...}  # set of ints
    # or/and:
    s.filtered_out_mols = {22, 493, ...}  # set of ints
    # or/and:
    s.faulty_mols = {332, 389, ...}  # set of ints
    # The second possibility is to load the data about the molecules
    # from file(s). That is an option if a partitioned
    # ``SingleMoleculeAnalysis`` has been carried out and the results
    # must be merged. In that case, you would do:
    s.load_molecule_sets("file1.pickle")
    s.load_molecule_sets("file2.pickle")
    ...
    # and as many files as necessary can be loaded. Their information
    # will be added together.
    # The names of the files can be also ``Path`` instances (which is
    # the usual case).
At this point all the necessary data is there and the report can be created:
s.save('summary_whatever.html')
- aligned_ccs_bam_files¶
- all_gatcs_identified_in_bam¶
- all_gatcs_in_meth¶
- all_gatcs_not_identified_in_bam¶
- all_gatcs_not_in_meth¶
- all_positions_in_bam¶
- all_positions_in_meth¶
- all_positions_not_in_bam¶
- all_positions_not_in_meth¶
- property as_html: str¶
- body_md5sum¶
- dump_molecule_sets(filename: pathlib.Path) None [source]¶
This method stores in a file the ``_molecule_sets`` attribute. It is done using ``pickle``. The motivation for that is to be able to easily combine several ``SummaryReport`` instances coming from different partitioned analyses. To be able to do that without repeating the filtering process, etc., it is necessary to have the information about what molecules have been discarded for different reasons and what molecules are used from the aligned files.
- faulty_mols¶
- faulty_subreads¶
- filtered_out_mols¶
- filtered_out_subreads¶
- full_md5sum¶
- fully_methylated_gatcs¶
- fully_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- fully_unmethylated_gatcs¶
- fully_unmethylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- gatc_coverage_bars¶
- gff_result¶
The base class of all other descriptor managed attributes of ``SummaryReport``. It is a wrapper around the ``_data`` dictionary of the instance owning this attribute.
- hemi_methylated_gatcs¶
- hemi_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- hemi_minus_methylated_gatcs¶
- hemi_minus_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- hemi_plus_methylated_gatcs¶
- hemi_plus_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- input_bam¶
- input_bam_size¶
- input_reference¶
- load_molecule_sets(filename: pathlib.Path) None [source]¶
This method reads data from the file ``filename`` (using ``pickle``); it assumes that a dictionary is obtained with the sets of molecule ids (``int``) that are important to re-create the state of the ``SummaryReport`` without going through the ``SingleMoleculeAnalysis`` process all over again.
It can be used multiple times and the sets obtained each time will update the current ones (a mathematical union of sets).
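The dump/load round trip with a union of sets can be sketched with the standard library alone. The helper names ``dump_sets`` and ``load_sets`` are hypothetical, and the dictionary layout (one set of ints per molecule category) is an assumption; the real ``_molecule_sets`` structure may differ.

```python
import pickle
import tempfile
from pathlib import Path


def dump_sets(molecule_sets: dict, filename: Path) -> None:
    with open(filename, "wb") as f:
        pickle.dump(molecule_sets, f)


def load_sets(current: dict, filename: Path) -> None:
    with open(filename, "rb") as f:
        loaded = pickle.load(f)
    for key, ids in loaded.items():
        # Union with whatever was already loaded from previous files.
        current.setdefault(key, set()).update(ids)


with tempfile.TemporaryDirectory() as tmp:
    part1 = Path(tmp) / "part1.pickle"
    part2 = Path(tmp) / "part2.pickle"
    dump_sets({"filtered_out_mols": {22, 493}, "faulty_mols": {332}}, part1)
    dump_sets({"filtered_out_mols": {22, 500}, "faulty_mols": set()}, part2)
    merged: dict = {}
    load_sets(merged, part1)
    load_sets(merged, part2)
```

As with any use of ``pickle``, files should only be loaded if they come from a trusted source.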
- max_possible_methylations¶
- meth_type_bars¶
- methylation_report¶
- molecule_len_histogram¶
- molecule_type_bars¶
- mols_dna_mismatches¶
- mols_in_meth_report¶
- mols_in_meth_report_with_gatcs¶
- mols_in_meth_report_without_gatcs¶
- mols_ini¶
- mols_used_in_aligned_ccs¶
- perc_all_gatcs_identified_in_bam¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_all_gatcs_in_meth¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_all_gatcs_not_identified_in_bam¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_all_gatcs_not_in_meth¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_all_positions_in_bam¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_all_positions_in_meth¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_all_positions_not_in_bam¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_all_positions_not_in_meth¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_faulty_mols¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_faulty_subreads¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_filtered_out_mols¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_filtered_out_subreads¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_mols_dna_mismatches¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_mols_in_meth_report¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_mols_in_meth_report_with_gatcs¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_mols_in_meth_report_without_gatcs¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_mols_used_in_aligned_ccs¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_subreads_dna_mismatches¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_subreads_in_meth_report¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_subreads_in_meth_report_with_gatcs¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_subreads_in_meth_report_without_gatcs¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- perc_subreads_used_in_aligned_ccs¶
From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in ``s.total_attr``) and returned as str.
- position_coverage_bars¶
- position_coverage_history¶
- raw_detections¶
The base class of all other descriptor managed attributes of ``SummaryReport``. It is a wrapper around the ``_data`` dictionary of the instance owning this attribute.
- ready_to_go(*attrs) bool [source]¶
Method used to check if some attributes are already usable or not (in other words if they have been already set or not).
- reference_base_pairs¶
- reference_md5sum¶
- reference_name¶
- subreads_dna_mismatches¶
- subreads_in_meth_report¶
- subreads_in_meth_report_with_gatcs¶
- subreads_in_meth_report_without_gatcs¶
- subreads_ini¶
- subreads_used_in_aligned_ccs¶
- switch_on(attribute: str) None [source]¶
Method used by descriptors to inform the instance of ``SummaryReport`` that some computed attributes needed by the plots are already computed and usable.
- total_gatcs_in_ref¶
pacbio_data_processing.templates module¶
pacbio_data_processing.types module¶
pacbio_data_processing.utils module¶
- class pacbio_data_processing.utils.AlmostUUID[source]¶
Bases:
object
A class that provides a 5-letter summary of a UUID. It is intended to be used as a prefix in all log messages. It is not necessary that two instances are different. But it is necessary that:
the string representation is short, and
given two instances their string representations most probably differ.
The underlying UUID is obtained from the stdlib using ``uuid.uuid1``. The class is implemented using the Borg pattern: all instances running in the same interpreter share a common ``_uuid`` attribute.
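A minimal sketch of the Borg pattern mentioned above, in which every instance shares the same ``__dict__``. The class name ``ShortId`` is hypothetical, and taking the first five characters of the UUID string as the summary is an assumption; the actual rule used by ``AlmostUUID`` is not documented here.

```python
import uuid


class ShortId:
    """Borg-style class: all instances share one state dictionary."""

    _shared_state: dict = {}

    def __init__(self) -> None:
        # Every instance points its __dict__ at the same dictionary.
        self.__dict__ = self._shared_state
        if "_uuid" not in self.__dict__:
            self._uuid = uuid.uuid1()

    def __str__(self) -> str:
        # Assumption: the summary is the first 5 characters of the UUID.
        return str(self._uuid)[:5]


a = ShortId()
b = ShortId()
prefix_a, prefix_b = str(a), str(b)
```

Within one interpreter both prefixes are equal, while a separate process would normally get a different UUID, so its log lines can be told apart.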
- class pacbio_data_processing.utils.DNASeq(raw_seq: pacbio_data_processing.utils.DNASeqLike, name: str = '', description: str = '')[source]¶
Bases:
Generic[pacbio_data_processing.utils.DNASeqLike]
Wrapper around ‘Bio.Seq.Seq’.
- __init__(raw_seq: pacbio_data_processing.utils.DNASeqLike, name: str = '', description: str = '')[source]¶
- classmethod from_fasta(fasta_name: str) pacbio_data_processing.utils.DNASeqType [source]¶
Returns a DNASeq from the first DNA sequence stored in the fasta named ‘fasta_name’ after ensuring that the fasta index is there.
- property md5sum: str¶
It returns the MD5 checksum’s hexdigest of the upper-case version of the sequence as a string.
- pi_shifted() pacbio_data_processing.utils.DNASeqType [source]¶
Method to return a pi-shifted DNASeq from the original one. pi-shifted means that a circular topology is assumed in the DNA sequence and the origin is shifted by π radians, ie the sequence is split into two parts and the two parts are swapped.
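On a plain string the operation can be sketched as follows. The split point ``len(seq) // 2`` is an assumption chosen to agree with the 7-base-pair example given for ``shift_me_back`` in this module; the function name ``pi_shift`` is hypothetical and the real method works on ``DNASeq`` objects.

```python
def pi_shift(seq: str) -> str:
    """Swap the two halves of a sequence assumed to be circular."""
    half = len(seq) // 2
    # The second part comes first, the first part goes to the end.
    return seq[half:] + seq[:half]


shifted_odd = pi_shift("ABCDEFG")      # 7 symbols
shifted_even = pi_shift("ABCDEFGHIJ")  # 10 symbols
```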
- class pacbio_data_processing.utils.Partition(partition_specification: Optional[tuple[int, int]], bamfile: pacbio_data_processing.bam.BamFile)[source]¶
Bases:
object
A ``Partition`` is a class that helps answer the following question: assuming that we are interested in processing a fraction of a ``BamFile``, does the molecule ID ``mol_id`` belong to that fraction, or not? A prior implementation consisted in storing all the molecule IDs in the ``BamFile`` corresponding to a given partition in a set, and the answer was just obtained by querying whether a molecule ID belongs to the set or not. That former implementation is not enough for the case of multiple alignment processes for the same raw ``BamFile`` (eg, when a combined analysis of the so-called ‘straight’ and ‘pi-shifted’ variants is performed). In that case the partition is decided with one file, and all molecule IDs belonging to the non-empty intersection with the other file must be unambiguously accommodated in a certain partition. This class has been designed to solve that problem.
- __init__(partition_specification: Optional[tuple[int, int]], bamfile: pacbio_data_processing.bam.BamFile) None [source]¶
Creates a ``Partition`` object without validating the ``partition_specification``, which is done at the time of reading the input given by the user. See ``pacbio_data_processing.parameters.SingleMoleculeAnalysisParameters``.
- _delimit_partitions() None [source]¶
[Internal method] This method decides what are the limits of all partitions given the number of partitions. The method sets an internal mapping, ``self._lower_limits``, of the type ``{partition number [int]: lower limit [int]}`` with that information. This mapping is populated with all the partition numbers and corresponding values.
- _set_current_limits() None [source]¶
[Internal method] Auxiliary method for ``__contains__``. Here it is determined what is the range of molecule IDs, as ints, that belong to the partition. The method sets two integer attributes, namely:
``_lower_limit_current``: the minimum molecule ID of the current partition, and
``_higher_limit_current``: the maximum molecule ID of the current partition; it can be ``None``, meaning that there is no maximum (last partition).
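The idea of deciding membership through ID ranges rather than through a stored set can be sketched as below. Everything here is a hypothetical illustration: the function names, the way the lower limits are derived from the sorted IDs, and the choice of 0 as the first lower limit are assumptions, not the package's algorithm.

```python
from typing import Optional


def lower_limits(mol_ids: list[int], partitions: int) -> dict[int, int]:
    """Map each partition number (1-based) to its lowest molecule ID."""
    ids = sorted(set(mol_ids))
    limits = {}
    for number in range(1, partitions + 1):
        idx = (number - 1) * len(ids) // partitions
        limits[number] = ids[idx]
    # Assumption: the first partition starts at 0, so that smaller IDs
    # seen only in another file still belong to some partition.
    limits[1] = 0
    return limits


def current_limits(limits: dict[int, int], number: int) -> tuple[int, Optional[int]]:
    lower = limits[number]
    nxt = limits.get(number + 1)
    higher = None if nxt is None else nxt - 1  # None: last partition
    return lower, higher


def belongs(mol_id: int, limits: dict[int, int], number: int) -> bool:
    lower, higher = current_limits(limits, number)
    return mol_id >= lower and (higher is None or mol_id <= higher)


limits = lower_limits([5, 9, 12, 20, 31, 40], partitions=3)
```

With ranges, an ID that is missing from the file used to define the partitions (say 1000, or 3) still falls in exactly one partition.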
- property is_proper: bool¶
A proper partition is one that refers to a proper subset of the given ``BamFile``. Since an empty set is not permitted by the ``SingleMoleculeAnalysisParameters`` class, an improper partition can only be a partition that refers to the whole ``BamFile``.
- pacbio_data_processing.utils.combine_scores(scores: collections.abc.Sequence[float]) float [source]¶
It computes the combined phred transformed score of the ``scores`` provided. Some examples:
>>> combine_scores([10])
10.0
>>> q = combine_scores([10, 12, 14])
>>> print(round(q, 6))
7.204355
>>> q = combine_scores([30, 20, 100, 92])
>>> print(round(q, 6))
19.590023
>>> q_500 = combine_scores([30, 20, 500])
>>> q_no_500 = combine_scores([30, 20])
>>> q_500 == q_no_500
True
>>> combine_scores([200, 300, 500])
200.0
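One formula that reproduces the rounded values 7.204355 and 19.590023 from the examples is the following: convert each score q to a probability p = 10^(-q/10), compute the probability that at least one of the events occurs, 1 - Π(1 - p), and transform the result back to the phred scale. The sketch below, with the hypothetical name ``combine_phred``, is an assumption about the underlying formula, not the package's code, and it makes no claim about the exact results for extreme scores such as ``[200, 300, 500]``.

```python
import math


def combine_phred(scores):
    """Phred score of 'at least one event', for strictly positive scores."""
    # log(prod(1 - p_i)), accumulated with log1p for numerical stability
    log_none = sum(math.log1p(-(10 ** (-q / 10))) for q in scores)
    p_any = -math.expm1(log_none)  # 1 - prod(1 - p_i)
    return -10 * math.log10(p_any)


q_three = combine_phred([10, 12, 14])
q_four = combine_phred([30, 20, 100, 92])
```

In this formula the lowest scores dominate the result, which is consistent with the example where adding a score of 500 does not change the combined value.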
- pacbio_data_processing.utils.find_gatc_positions(seq: str, offset: int = 0) set[int] [source]¶
Convenience function that computes the positions of all GATCs found in the given sequence. The values are relative to the offset.
>>> find_gatc_positions('AAAGAGAGATCGCGCGATC') == {7, 15}
True
>>> find_gatc_positions('AAAGAGAGTCGCGCCATC')
set()
>>> find_gatc_positions('AAAGAGAGATCGgaTcCGCGATC') == {7, 12, 19}
True
>>> s = find_gatc_positions('AAAGAGAGATCGgaTcCGCGATC', offset=23)
>>> s == {30, 35, 42}
True
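The behaviour shown in the examples (0-based start positions, case-insensitive matching, offset added to every position) can be reproduced with a short regular-expression sketch. The name ``gatc_positions`` is hypothetical and this is not necessarily how the package implements the function.

```python
import re


def gatc_positions(seq: str, offset: int = 0) -> set[int]:
    # 'GATC' cannot overlap with itself, so non-overlapping matching
    # with finditer finds every occurrence.
    return {m.start() + offset for m in re.finditer("GATC", seq, re.IGNORECASE)}
```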
- pacbio_data_processing.utils.make_partition_prefix(partition: int, partitions: int) str [source]¶
Simple function to act as a Single Source of Truth for the partition prefix used elsewhere in the project. No validation is done. It just blindly returns a string constructed with the arguments.
- pacbio_data_processing.utils.merge_files(infiles: list[pathlib.Path], outfile: pathlib.Path, keep_only_first_header=False) None [source]¶
Utility function that concatenates files, optionally handling one-line headers correctly: if the files have a (one-line) header, it must be declared at call time and then the function will only keep the header found in the first file. All other headers (first line of the remaining files) will be discarded.
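A minimal sketch of such a concatenation for text files, assuming the same argument names as the documented signature. The function name ``concat_files`` is hypothetical and details such as encoding or binary handling in the real function are not known.

```python
import tempfile
from pathlib import Path


def concat_files(infiles, outfile, keep_only_first_header=False):
    with open(outfile, "w") as out:
        for i, infile in enumerate(infiles):
            with open(infile) as src:
                if keep_only_first_header and i > 0:
                    next(src, None)  # skip the header of later files
                for line in src:
                    out.write(line)


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "part1.csv"
    second = Path(tmp) / "part2.csv"
    target = Path(tmp) / "merged.csv"
    first.write_text("id,score\n1,30\n")
    second.write_text("id,score\n2,25\n")
    concat_files([first, second], target, keep_only_first_header=True)
    merged_text = target.read_text()
```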
- pacbio_data_processing.utils.pishift_back_positions_in_gff(gff_path: Union[str, pathlib.Path]) None [source]¶
A function that parses the input GFF file (assumed to be a valid GFF file) and shifts back the positions found in it (4th and 5th columns of lines not starting with ``#``). It is assumed that the positions in the input file (``gff_path``) refer to a pi-shifted origin. To undo the shift, the length of the sequence(s) is (are) read from the GFF3 directives (lines starting with ``##``), in particular from the ``##sequence-region`` pragmas. This function can handle the case of multiple sequences.
Warning
The function overwrites the input ``gff_path``.
- pacbio_data_processing.utils.shift_me_back(pos: int, nbp: int) int [source]¶
Unshifts a given position taking into account that it has been previously shifted by half of the number of base pairs. It takes into account the possibility of having a sequence with an odd length.
@params:
pos - 1-based position of a base pair to unshift
nbp - number of base pairs in the reference
@returns:
unshifted position
Some examples:
>>> shift_me_back(3, 10)
8
>>> shift_me_back(1, 20)
11
>>> shift_me_back(3, 7)
6
>>> shift_me_back(4, 7)
7
>>> shift_me_back(5, 7)
1
>>> shift_me_back(7, 7)
3
>>> shift_me_back(1, 7)
4
To understand the operation of this function consider the following example. Given a sequence of 7 base pairs with the following indices found in the reference in the natural order, ie
1 2 3 4 5 6 7
then, after being pi-shifted the base pairs in the sequence are reordered, and the indices become (in parenthesis the former indices):
1’(=4) 2’(=5) 3’(=6) 4’(=7) 5’(=1) 6’(=2) 7’(=3)
The current function accepts primed indices and transforms them to the unprimed indices, ie, the positions returned refer to the original reference.
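The mapping from primed to unprimed indices can be written as one line of modular arithmetic. The sketch below, with the hypothetical name ``unshift_position``, agrees with all the examples listed for ``shift_me_back``, but it is not claimed to be the package's implementation.

```python
def unshift_position(pos: int, nbp: int) -> int:
    """Map a 1-based pi-shifted position back to the original reference."""
    # Go to 0-based, add the shift (half of the length, rounded down),
    # wrap around the circular sequence, and return to 1-based.
    return (pos - 1 + nbp // 2) % nbp + 1


original_positions = [unshift_position(p, 7) for p in range(1, 8)]
```

For the 7-base-pair example this yields 4, 5, 6, 7, 1, 2, 3, ie the unprimed indices written in parentheses above.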
- pacbio_data_processing.utils.try_computations_with_variants_until_done(func: Callable, variants: collections.abc.Sequence[str], *args: Any) None [source]¶
This function runs the passed in function ``func`` with the arguments ``*args`` and for each ``variant`` in ``variants``, eg. something like this:
    for v in variants:
        result = func(*args, variant=v)
but it keeps doing so until each result returned by ``func`` is not ``None``. When a ``None`` is returned by ``func``, a call to ``sleep`` is warranted before continuing. The time slept depends on how many times it was sleeping before; the sleep time grows exponentially with every iteration:
    t -> 2*t
until all the computations (results of ``func`` for each variant) are completed, ie all are not ``None``. The main application of this function is to ensure that some common operations of the ``SingleMoleculeAnalysis`` are done once and only once irrespective of how many parallel instances of the analysis (with different partitions each) are carried out. For example, this function can be used to avoid collisions in the generation of aligned BAM files since ``pacbio_data_processing.external.Blasr`` has a mechanism that allows concurrent computations. This function delegates the decision on whether the computation is done or not to ``func``.
Note
A special case is when a ``variant`` is ``None``; in that case the function ``func`` is called without the ``variant`` argument:
    result = func(*args)
Therefore, if ``variants`` is, e.g. ``(None,)``, then ``func`` is only called once in each iteration WITHOUT the ``variant`` keyword argument. That is useful if the function ``func`` must be called until it is done, but it takes no variant argument.
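The retry loop with a doubling sleep time can be sketched as follows. The name ``run_until_done``, the ``initial_wait`` and ``sleep`` parameters, the choice of sleeping once per round, and retrying only the variants that are still pending are all assumptions made for the illustration; the documented function may differ in each of these details.

```python
import time


def run_until_done(func, variants, *args, initial_wait=1.0, sleep=time.sleep):
    pending = list(variants)
    wait = initial_wait
    while pending:
        still_pending = []
        for v in pending:
            # A variant of None means: call func without the keyword.
            result = func(*args) if v is None else func(*args, variant=v)
            if result is None:
                still_pending.append(v)
        pending = still_pending
        if pending:
            sleep(wait)
            wait *= 2  # t -> 2*t


calls = {"straight": 0, "pi-shifted": 0}


def fake_alignment(variant):
    # Pretends that the 'pi-shifted' result is only ready at the 3rd call.
    calls[variant] += 1
    ready = variant == "straight" or calls[variant] >= 3
    return f"{variant}.bam" if ready else None


slept = []
run_until_done(fake_alignment, ["straight", "pi-shifted"], sleep=slept.append)
```

Passing ``slept.append`` instead of ``time.sleep`` records the waiting times without actually waiting, which keeps the example fast.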
Module contents¶
Top-level package for PacBio data processing.