Modules
sm_analysis
This module contains the high-level functions needed to run the ‘Single Molecule Analysis’ on an input BAM file.
- pacbio_data_processing.sm_analysis.add_to_own_output(gffs, own_output_file_name, modification_types)
From a set of .gff files, a csv file (delimiter=",") is saved with the following columns:
mol id: taken from the name of each gff file (e.g. ‘a.b.c.gff’ -> mol id: ‘b’)
modtype: column number 3 (idx: 2) of the gffs (feature type)
GATC position: column number 5 (idx: 4) of the gffs, which corresponds to the ‘end coordinate of the feature’ in the GFF3 standard
score of the feature: column number 6 (idx: 5); floating point (Phred-transformed p-value that a kinetic deviation exists at this position)
strand: strand of the feature. It can be + or -, with the obvious meanings. It can also be ? (meaning unknown) or . (for non-stranded features)
There are more columns, but they are not fixed in number. They correspond to the values given in the ‘attributes’ column of the gffs (col 9, idx 8). For example, given the following attributes column:
coverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228
we would get the following ‘extra’ columns:
134,TCA...,3.91,228
and this is exactly what happens with the m6A modification type.
All the lines starting with ‘#’ in the gff files are ignored. The format of the gff files is GFF3: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
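The column mapping described above can be sketched as follows. This is a minimal illustration, not the code of add_to_own_output: the helper names gff_line_to_row and gff_text_to_csv are hypothetical, the mol id is assumed to be the second dot-separated field of the file name, and the role of modification_types is not modelled.

```python
import csv
import io


def gff_line_to_row(mol_id, line):
    """Turn one GFF3 feature line into a list of csv fields (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    # GFF3 columns: type (idx 2), end (idx 4), score (idx 5), strand (idx 6)
    modtype, end, score, strand = cols[2], cols[4], cols[5], cols[6]
    # Attributes (idx 8): keep only the values of the 'key=value' pairs, in order.
    extras = [pair.split("=", 1)[1] for pair in cols[8].split(";") if "=" in pair]
    return [mol_id, modtype, end, score, strand] + extras


def gff_text_to_csv(file_name, gff_text):
    """Convert the text of one gff file into csv text (hypothetical helper)."""
    # Assumption: 'a.b.c.gff' -> 'b', i.e. the second dot-separated field.
    mol_id = file_name.split(".")[1]
    out = io.StringIO()
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # comment and directive lines are ignored
        writer.writerow(gff_line_to_row(mol_id, line))
    return out.getvalue()


example = (
    "##gff-version 3\n"
    "chr1\tkinModCall\tm6A\t100\t100\t35\t+\t.\t"
    "coverage=134;context=TCA;IPDRatio=3.91;identificationQv=228\n"
)
csv_text = gff_text_to_csv("a.b.c.gff", example)
```

The number of fields per row depends on how many attributes each feature carries, which is why the extra columns are not fixed in number.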
The value of identificationQv is a Phred-transformed probability of having a detection. See eq. (8) in [1]
[1]: “Detection and Identification of Base Modifications with Single Molecule Real-Time Sequencing Data”
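For orientation, the generic Phred transformation maps a probability p to -10·log10(p). The sketch below shows only this generic conversion and its inverse; it does not reproduce eq. (8) of [1], and the function names are hypothetical.

```python
import math


def phred(p):
    """Generic Phred transform of a probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def phred_to_probability(q):
    """Inverse of the generic Phred transform."""
    return 10.0 ** (-q / 10.0)
```

Under this convention, larger values correspond to smaller probabilities: a value of 30 corresponds to a probability of 0.001.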
- pacbio_data_processing.sm_analysis.generate_CCS_file(ccs: pacbio_data_processing.external.CCS, in_bam: pathlib.Path, ccs_bam_file: pathlib.Path) -> Optional[pathlib.Path]
Idempotent computation of the Circular Consensus Sequence (CCS) version of the passed-in in_bam file, done with the passed-in ccs object.
- Returns
the CCS BAM file, if it is there, or None if it could not be computed (yet).
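The idempotent pattern described here can be sketched as below. This is an illustration under assumptions, not the implementation of generate_CCS_file: the interface of pacbio_data_processing.external.CCS is not documented in this section, so a plain callable (run_tool, hypothetical) that returns an exit code stands in for it, and compute_once is a hypothetical name.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def compute_once(
    run_tool: Callable[[Path, Path], int], in_file: Path, out_file: Path
) -> Optional[Path]:
    """Return out_file, producing it with run_tool only if it is missing."""
    if out_file.exists():
        # Nothing to do: a previous call already produced the result.
        return out_file
    returncode = run_tool(in_file, out_file)
    if returncode == 0 and out_file.exists():
        return out_file
    # The result is not available (yet).
    return None


calls = []


def fake_tool(in_file: Path, out_file: Path) -> int:
    """Hypothetical stand-in for an external program; records each run."""
    calls.append(in_file)
    out_file.write_bytes(in_file.read_bytes())
    return 0


def failing_tool(in_file: Path, out_file: Path) -> int:
    """Hypothetical stand-in for a run that fails without writing output."""
    return 1


with tempfile.TemporaryDirectory() as tmp:
    in_bam = Path(tmp) / "in.bam"
    in_bam.write_bytes(b"not a real BAM")
    ccs_bam = Path(tmp) / "ccs.in.bam"
    first = compute_once(fake_tool, in_bam, ccs_bam)
    second = compute_once(fake_tool, in_bam, ccs_bam)  # reuses the existing file
    failed = compute_once(failing_tool, in_bam, Path(tmp) / "other.bam")
    tool_runs = len(calls)
```

Calling the function twice runs the stand-in tool only once; the second call returns the file that is already there.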
- pacbio_data_processing.sm_analysis.map_molecules_with_highest_sim_ratio(bam_file_name: Optional[Union[pathlib.Path, str]]) -> dict[int, pacbio_data_processing.bam_utils.Molecule]
Given the path to a BAM file, it returns a dictionary whose keys are mol ids (ints) and whose values are the corresponding Molecules. If multiple lines in the given BAM file share the same mol id, only the line with the highest similarity ratio (computed from the CIGAR) is chosen. If several of those lines tie for the highest similarity ratio (say, 1), ONLY the first one found is taken, irrespective of other factors.
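The selection rule can be sketched with plain records instead of a real BAM file. Everything here is illustrative: Record, sim_ratio and pick_best are hypothetical names, and the similarity ratio is assumed to be the fraction of ‘=’ (sequence match) operations in the CIGAR string, which may differ from the definition used by the package.

```python
import re
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for one line of a BAM file."""
    mol_id: int
    cigar: str
    name: str


_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def sim_ratio(cigar: str) -> float:
    """Assumed definition: length of '=' operations over total operation length."""
    ops = _CIGAR_OP.findall(cigar)
    total = sum(int(n) for n, _ in ops)
    matches = sum(int(n) for n, op in ops if op == "=")
    return matches / total if total else 0.0


def pick_best(records: Iterable[Record]) -> Dict[int, Record]:
    """Keep, per mol id, the first record with the highest similarity ratio."""
    best: Dict[int, Record] = {}
    for rec in records:
        current = best.get(rec.mol_id)
        # Strict '>' keeps the earlier record when the ratios tie.
        if current is None or sim_ratio(rec.cigar) > sim_ratio(current.cigar):
            best[rec.mol_id] = rec
    return best


records = [
    Record(1, "10=", "a"),
    Record(1, "10=", "b"),   # ties with "a": ignored because it comes later
    Record(1, "8=2X", "c"),
    Record(2, "5=5I", "d"),
    Record(2, "9=1X", "e"),  # higher ratio than "d": replaces it
]
best = pick_best(records)
```

In this example, mol id 1 maps to record "a" and mol id 2 maps to record "e".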