pacbio_data_processing package¶
Subpackages¶
Submodules¶
pacbio_data_processing.bam module¶
- class pacbio_data_processing.bam.BamFile(bam_file_name, mode='r')[source]¶
Bases:
object
Proxy class for _BamFileSamtools and _BamFilePysam. This is a high level class whose only roles are to choose among _ReadableBamFile and _WritableBamFile and to select the underlying implementation to interact with the BAM file:
- _BamFileSamtools: implementation that simply wraps the 'samtools' command line, and
- _BamFilePysam: implementation that uses 'pysam'
pacbio_data_processing.bam_file_filter module¶
This module contains the high level functions necessary to apply some filters to a given input BAM file.
pacbio_data_processing.bam_utils module¶
Some helper functions to manipulate BAM files
- class pacbio_data_processing.bam_utils.CircularDNAPosition(pos: int, ref_len: int = 0)[source]¶
Bases:
object
A type that allows doing arithmetic with positions in a circular topology.
>>> p = CircularDNAPosition(5, ref_len=9)
The class has a decent repr:
>>> p
CircularDNAPosition(5, ref_len=9)
And we can use it in arithmetic contexts:
>>> p + 1
CircularDNAPosition(6, ref_len=9)
>>> int(p+1)
6
>>> int(p+5)
1
>>> int(20+p)
7
>>> p - 1
CircularDNAPosition(4, ref_len=9)
>>> int(p-6)
8
>>> int(p-16)
7
>>> int(2-p)
6
>>> int(8-p)
3
Also boolean equality is supported:
>>> p == CircularDNAPosition(5, ref_len=9)
True
>>> p == CircularDNAPosition(6, ref_len=9)
False
>>> p == CircularDNAPosition(14, ref_len=9)
True
>>> p == CircularDNAPosition(5, ref_len=8)
False
>>> p == 5
False
But also < is supported:
>>> p < p+1
True
>>> p < p
False
>>> p < p-1
False
Of course two instances cannot be compared if their underlying references are not equally long:
>>> s = CircularDNAPosition(5, ref_len=10)
>>> p < s
Traceback (most recent call last):
    ...
ValueError: cannot compare positions if topologies differ
or if they are not both CircularDNAPosition’s:
>>> s < 6
Traceback (most recent call last):
    ...
TypeError: '<' not supported between instances of 'CircularDNAPosition' and 'int'
The class has a convenience method:
>>> p.as_1base()
6
If the ref_len input parameter is less than or equal to 0, the topology is assumed to be linear:
>>> q = CircularDNAPosition(5, ref_len=-1)
>>> q
CircularDNAPosition(5, ref_len=0)
>>> q + 1001
CircularDNAPosition(1006, ref_len=0)
>>> q - 100
CircularDNAPosition(-95, ref_len=0)
>>> int(10-q)
5
Linear topology is the default behaviour:
>>> r = CircularDNAPosition(5)
>>> r
CircularDNAPosition(5, ref_len=0)
It is possible to use them as indices in slices:
>>> seq = "ABCDEFGHIJ"
>>> seq[r:r+2]
'FG'
And CircularDNAPosition instances can be hashed (so that they can be elements of a set or keys in a dictionary):
>>> positions = {p, q, r}
And, very conveniently, a CircularDNAPosition converts to str as ints do:
>>> str(r) == '5'
True
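The wrap-around arithmetic shown in the doctests can be reproduced with plain modular arithmetic. What follows is a minimal sketch, not the package's implementation; the helper name wrap_position is hypothetical.

```python
def wrap_position(pos: int, ref_len: int = 0) -> int:
    """Map an integer position onto a reference of length ref_len.

    Hypothetical helper: a non-positive ref_len is treated as a linear
    topology, in which case the position is returned unchanged.
    """
    if ref_len <= 0:
        return pos
    # Python's % returns a non-negative result for a positive modulus,
    # so negative positions wrap around to the end of the reference.
    return pos % ref_len


if __name__ == "__main__":
    p, ref_len = 5, 9
    print(wrap_position(p + 5, ref_len))   # 10 wraps to 1
    print(wrap_position(p - 16, ref_len))  # -11 wraps to 7
    print(wrap_position(5 - 100))          # linear topology: stays -95
```

Two positions on the same circular reference are equal exactly when their wrapped values coincide, which is why 5 and 14 compare equal for ref_len=9.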
- class pacbio_data_processing.bam_utils.Molecule(id: int, src_bam_path: Optional[Union[str, pathlib.Path]] = None, _best_ccs_line: Optional[tuple[bytes]] = None)[source]¶
Bases:
object
Abstraction around a single molecule from a Bam file
- __init__(id: int, src_bam_path: Optional[Union[str, pathlib.Path]] = None, _best_ccs_line: Optional[tuple[bytes]] = None) None ¶
- property ascii_quals: str¶
Ascii qualities of sequencing the molecule. Each symbol refers to one base.
- property cigar: pacbio_data_processing.cigar.Cigar¶
- property dna: str¶
- property end: pacbio_data_processing.bam_utils.CircularDNAPosition¶
Computes the end of a molecule as CircularDNAPosition(start+length of reference) which, obviously, takes into account the possible circular topology of the reference.
- find_gatc_positions() list[pacbio_data_processing.bam_utils.CircularDNAPosition] [source]¶
The function returns the position of all the GATCs found in the Molecule’s sequence, taking into account the topology of the reference.
The return value is the 0-based index of the GATC motif, i.e., the index of the G in the Python convention.
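As an illustration of the 0-based convention, the sketch below locates every GATC in a sequence and maps each hit onto the reference. It is only a sketch under assumptions: the function name find_motif_positions is hypothetical, the molecule's sequence is assumed to be given as one linear string, and start is assumed to be the reference position of its first base.

```python
def find_motif_positions(seq: str, start: int = 0, ref_len: int = 0,
                         motif: str = "GATC") -> list[int]:
    """Return the 0-based reference positions of each motif occurrence.

    A non-positive ref_len means a linear topology (no wrap-around).
    """
    positions = []
    idx = seq.find(motif)
    while idx != -1:
        pos = start + idx  # index of the first base of the motif (the G)
        if ref_len > 0:
            pos %= ref_len  # wrap around the origin in a circular reference
        positions.append(pos)
        idx = seq.find(motif, idx + 1)
    return positions
```

For a molecule starting near the end of a circular reference, hits located past the origin come back as small indices again.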
- id: int¶
- is_crossing_origin(*, ori_pi_shifted=False) bool [source]¶
This method answers the question of whether the molecule crosses the origin, assuming a circular topology of the chromosome. The answer is True if the last base of the molecule is located before the first base. Otherwise the answer is False. It will return False if the molecule starts at the origin; but it will be True if it ends at the origin. There is an optional keyword-only boolean parameter, namely ori_pi_shifted, to indicate that the reference has been shifted by pi radians, or not.
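The rule described here can be written as a comparison between the first and last base positions. This is a hedged sketch, not the package's code: the function name crosses_origin is hypothetical, it ignores the ori_pi_shifted option, and it assumes 0 <= start < ref_len and 1 <= length <= ref_len.

```python
def crosses_origin(start: int, length: int, ref_len: int) -> bool:
    """True if the last base lies before the first one on a circular reference."""
    last = (start + length - 1) % ref_len  # position of the last base
    return last < start
```

A molecule starting at position 0 can never satisfy last < start, while a molecule whose last base sits at position 0 always does (unless it is a single base at the origin), which matches the two edge cases in the description.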
- pi_shift_back() None [source]¶
Method that shifts back the (start, end) positions of the molecule assuming that they were shifted before by pi radians.
- src_bam_path: Optional[Union[str, pathlib.Path]] = None¶
- property start: pacbio_data_processing.bam_utils.CircularDNAPosition¶
Readable/Writable attribute. It was originally only readable but the SingleMoleculeAnalysis class relies on it being writable to simplify shifting back the pi-shifted positions, which are computed from this attribute. The logic is: by default, the value is taken from the _best_ccs_line attribute, until it is modified, in which case the value is simply stored and returned upon request.
- pacbio_data_processing.bam_utils.count_subreads_per_molecule(bam: pacbio_data_processing.bam.BamFile) collections.defaultdict[int, collections.Counter] [source]¶
Given a read-open BamFile instance, it returns a defaultdict with keys being molecule ids (str) and values, a counter with subreads classified by strand. The possible keys of the returned counter are: +, -, ? meaning direct strand, reverse strand and unknown, respectively.
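The shape of the returned structure can be illustrated with the standard library alone. This is a minimal sketch under assumptions: the function name count_by_strand is hypothetical, and the input is simplified to an iterable of (molecule id, strand) pairs instead of a BamFile.

```python
from collections import Counter, defaultdict


def count_by_strand(subreads):
    """Count subreads per molecule, classified by strand symbol."""
    counts = defaultdict(Counter)
    for mol_id, strand in subreads:
        counts[mol_id][strand] += 1
    return counts
```

Because the values are Counter objects, asking for a strand that was never seen returns 0 instead of raising KeyError.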
- pacbio_data_processing.bam_utils.flag2strand(flag: int) Literal['+', '-', '?'] [source]¶
Given a FLAG (see the BAM format specification), it transforms it to the corresponding strand.
- Returns
+, - or ? depending on the strand the input FLAG can be assigned to (? means: it could not be assigned to any strand).
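One possible mapping from FLAG to strand is sketched below using two bits defined in the SAM/BAM specification: 0x4 (segment unmapped) and 0x10 (sequence reverse complemented). The exact rule used by flag2strand is not stated here, so treat this as an assumption; the package may apply a stricter criterion. The function name flag_to_strand_sketch is hypothetical.

```python
UNMAPPED = 0x4   # SAM spec: segment unmapped
REVERSE = 0x10   # SAM spec: SEQ is reverse complemented


def flag_to_strand_sketch(flag: int) -> str:
    """Assumed rule: unmapped -> '?', reverse bit set -> '-', else '+'."""
    if flag & UNMAPPED:
        return "?"
    return "-" if flag & REVERSE else "+"
```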
- pacbio_data_processing.bam_utils.gen_index_single_molecule_bams(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], program: Union[str, pathlib.Path], skip_if_present: bool = False) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
It generates indices using program (the path to pbindex) in this way:
pbindex blasr.pMA683.subreads.bam
The generator yields the original MoleculeWorkUnit.
Note for developers: maybe it should check for errors and report them (since we are using an external tool), and not yield the molecule if an error happens.
- pacbio_data_processing.bam_utils.join_gffs(work_units: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], out_file_path: Union[str, pathlib.Path]) collections.abc.Generator[pathlib.Path, None, None] [source]¶
The gff files related to the molecules provided in the input are read and joined in a single file. The individual gff files are yielded back.
Probably this function is useless and should be removed in the future: it only provides a joint gff file that is not a valid gff file and that is never used in the rest of the processing.
- pacbio_data_processing.bam_utils.split_bam_file_in_molecules(in_bam_file: Union[str, pathlib.Path], tempdir: Union[str, pathlib.Path], todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
All the individual molecules in the bam file path given, in_bam_file, that are found in todo, will be isolated and stored individually in the directory tempdir. The yielded Molecule instances will have their src_bam_path updated accordingly.
- pacbio_data_processing.bam_utils.subreads_per_molecule(lines: collections.abc.Iterable, header: bytes, file_name_prefix: pathlib.Path, todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
This generator yields 2-tuples of (mol-id, Molecule) after having isolated the subreads corresponding to that molecule id from the lines (coming from the iteration over a BamFile instance). Before yielding, a one-molecule BAM file is created.
- pacbio_data_processing.bam_utils.write_one_molecule_bam(buf: collections.abc.Iterable, header: bytes, in_file_name: pathlib.Path, suffix: Any) pathlib.Path [source]¶
Given a sequence of BAM lines, a header, the source name and a suffix, a new bamFile is created containing the data provided and with a suitable name.
pacbio_data_processing.cigar module¶
This module provides basic ‘re-invented’ functionality to handle Cigars. A Cigar describes the differences between two sequences by providing a series of operations that one has to apply to one sequence to obtain the other one. For instance, given these two sequences:
sequence 1 (e.g. from the reference):
AAGTTCCGCAAATT
and
sequence 2 (e.g. from the aligner):
AAGCTCCCGCAATT
The Cigar that brings us from sequence 1 to sequence 2 is:
3=1X3=1I4=1D2=
where the numbers refer to the amount of letters and the symbols’ meaning can be found in the table below. Therefore the Cigar in the example is a shorthand for:
3 equal bases followed by 1 replacement followed by 3 equal bases followed by 1 insertion followed by 4 equal bases followed by 1 deletion followed by 2 equal bases
symbol | meaning
---|---
= | equal
I | insertion
D | deletion
X | replacement
S | soft clip
H | hard clip
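The worked example can be checked mechanically. The sketch below parses a Cigar string into (count, symbol) pairs and walks both sequences to confirm that the operations are consistent with them. It is not the module's implementation: the names parse_cigar and cigar_matches are hypothetical, and the handling of S (consumes bases of sequence 2 only) and H (consumes nothing) follows the usual SAM convention, which is an assumption here.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([=IDXSH])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split e.g. '3=1X' into [(3, '='), (1, 'X')]."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def cigar_matches(seq1: str, seq2: str, cigar: str) -> bool:
    """Check that applying the cigar to seq1 is consistent with seq2."""
    i = j = 0  # cursors in seq1 and seq2
    for n, op in parse_cigar(cigar):
        if op == "=":
            if seq1[i:i + n] != seq2[j:j + n]:
                return False
            i, j = i + n, j + n
        elif op == "X":
            # every replaced base must actually differ
            if any(a == b for a, b in zip(seq1[i:i + n], seq2[j:j + n])):
                return False
            i, j = i + n, j + n
        elif op in "IS":
            j += n  # bases present only in seq2
        elif op == "D":
            i += n  # bases present only in seq1
        # 'H' consumes neither sequence
    return i == len(seq1) and j == len(seq2)


if __name__ == "__main__":
    print(parse_cigar("3=1X3=1I4=1D2="))
    print(cigar_matches("AAGTTCCGCAAATT", "AAGCTCCCGCAATT", "3=1X3=1I4=1D2="))
```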
- class pacbio_data_processing.cigar.Cigar(incigar)[source]¶
Bases:
object
- property diff_ratio¶
difference ratio: 1 means that each base is different; 0 means that all the bases are equal.
- property number_diff_items¶
- property number_diff_types¶
- property number_pb_diffs¶
- property number_pbs¶
- property sim_ratio¶
similarity ratio: 1 means that all the bases are equal; 0 means that each base is different.
This is computed from diff_ratio().
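A plausible way to obtain both ratios from a Cigar string is sketched below. The exact definition used by the Cigar class is not given on this page, so the formula is an assumption: every base covered by a symbol other than = counts as a difference, and the similarity ratio is taken as one minus the difference ratio. The function names are hypothetical.

```python
import re


def diff_ratio_sketch(cigar: str) -> float:
    """Assumed definition: bases in non-'=' operations / all bases."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([=IDXSH])", cigar)]
    total = sum(n for n, _ in ops)
    if total == 0:
        return 0.0  # assumption: an empty cigar has no differences
    diffs = sum(n for n, op in ops if op != "=")
    return diffs / total


def sim_ratio_sketch(cigar: str) -> float:
    """Similarity as the complement of the difference ratio."""
    return 1 - diff_ratio_sketch(cigar)
```

Under this assumed definition, the Cigar 3=1X3=1I4=1D2= covers 15 bases, 3 of which are differences.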
pacbio_data_processing.constants module¶
pacbio_data_processing.errors module¶
pacbio_data_processing.external module¶
- class pacbio_data_processing.external.Blasr(path: Union[pathlib.Path, str])[source]¶
Bases:
pacbio_data_processing.external.ExternalProgram
An object to interact with the blasr aligner.
- __call__(in_bamfile: Union[pathlib.Path, str], fasta: Union[pathlib.Path, str], out_bamfile: Union[pathlib.Path, str], nprocs: int = 1) Optional[int] [source]¶
It runs the executable with the given parameters. The return code of the associated process is returned by this method if the executable could run at all, else None is returned.
One case where the executable cannot run is when the sentinel file is there before the executable process is run.
- class pacbio_data_processing.external.CCS(path: Union[pathlib.Path, str])[source]¶
Bases:
pacbio_data_processing.external.ExternalProgram
An object to interact with the ccs program.
- __call__(in_bamfile: Union[pathlib.Path, str], out_bamfile: Union[pathlib.Path, str]) Optional[int] [source]¶
It runs the executable with the given parameters. The return code of the associated process is returned by this method if the executable could run at all, else None is returned.
One case where the executable cannot run is when the sentinel file is there before the executable process is run.
- class pacbio_data_processing.external.ExternalProgram(path: Union[pathlib.Path, str])[source]¶
Bases:
object
A base class with functionality common to all the classes wrapping external programs that:
- produce an output file, and
- need the production of that file to be protected by a Sentinel.
This base class provides the interface and the Sentinel protection.
- __call__(infile: Union[pathlib.Path, str], outfile: Union[pathlib.Path, str], *args, **kwargs) Optional[int] [source]¶
It runs the executable with the given parameters. The return code of the associated process is returned by this method if the executable could run at all, else None is returned.
One case where the executable cannot run is when the sentinel file is there before the executable process is run.
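The contract of __call__ (a return code if the program ran, None if a sentinel file blocked it) can be sketched with the standard library. This is an illustration under assumptions, not the package's code: the names sentinel_path_for and run_protected are hypothetical, and so is the naming scheme of the sentinel file.

```python
import subprocess
from pathlib import Path
from typing import Optional


def sentinel_path_for(outfile: Path) -> Path:
    """Hypothetical naming scheme for the sentinel guarding outfile."""
    return outfile.with_name("." + outfile.name + ".wip")


def run_protected(cmd: list[str], outfile: Path) -> Optional[int]:
    """Run cmd unless a sentinel for outfile already exists."""
    sentinel = sentinel_path_for(outfile)
    try:
        # Creating the file fails if someone else is already working
        sentinel.touch(exist_ok=False)
    except FileExistsError:
        return None  # the executable was not run at all
    try:
        return subprocess.run(cmd, check=False).returncode
    finally:
        sentinel.unlink(missing_ok=True)
```

Returning None instead of raising lets a caller simply skip the step and retry later, which is one way to make a pipeline reentrant.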
pacbio_data_processing.filters module¶
- pacbio_data_processing.filters.cleanup_molecules(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
Generator of MoleculeWorkUnit’s that pass all the standard filters, i.e. the sequence of filters needed by sm-analysis to select what molecules (and what subreads in those molecules) will be IPD-analyzed.
It is assumed that each file contains subreads corresponding to only ONE molecule (i.e., ‘molecules’ is a generator of tuples (mol id, Molecule), with Molecule being related to a single molecule id). [Note for developers: Should we allow multiple molecules per file?]
If there are subreads surviving the filtering process, the bam file is overwritten with the filtered data and the tuple (mol id, Molecule) is yielded. If no subread survives the process, nothing is done (no bam is written, no tuple is yielded).
- pacbio_data_processing.filters.empty_buffer(buf: collections.deque, threshold: int, flags_seen: set) Generator[tuple[bytes], None, None] [source]¶
This generator cleans the passed-in buffer, either yielding its items if the conditions are met, or throwing them away if not.
The conditions are:
- the number of items is at least threshold, and
- flags_seen is a (not necessarily proper) superset of {'+', '-'}.
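The two conditions translate directly into a length check and a set comparison. The following is a minimal sketch of such a generator, not the package's implementation; the name empty_buffer_sketch is hypothetical.

```python
from collections import deque


def empty_buffer_sketch(buf: deque, threshold: int, flags_seen: set):
    """Drain buf, yielding its items only if both conditions hold."""
    # Decide once, before draining, since popping changes len(buf)
    keep = len(buf) >= threshold and {"+", "-"} <= flags_seen
    while buf:
        item = buf.popleft()
        if keep:
            yield item
```

Since this is a generator, the buffer is only emptied once the generator is consumed, for example with list(...) or a for loop.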
- pacbio_data_processing.filters.filter_enough_data_per_molecule(lines: collections.abc.Iterable[tuple], threshold: int) Generator[tuple[bytes], None, None] [source]¶
This generator yields the input data if (WIP)
- pacbio_data_processing.filters.filter_mappings_binary(lines, mappings, *rest)[source]¶
Simply take or reject mappings depending on passed sequence
pacbio_data_processing.ipd module¶
- pacbio_data_processing.ipd.ipd_summary(molecule: tuple[int, pacbio_data_processing.bam_utils.Molecule], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], nprocs: int, mod_types_comma_sep: str, ipd_model: Union[str, pathlib.Path], skip_if_present: bool) tuple[int, pacbio_data_processing.bam_utils.Molecule] [source]¶
Lowest level interface to ipdSummary: all calls to that program are expected to be done through this function. It runs ipdSummary with an input bam file like this:
ipdSummary blasr.pMA683.subreads.bam --reference pMA683.fa --identify m6A --gff blasr.pMA683.subreads.476.bam.gff
As a result of this, a gff file is created. This function sets an attribute in the target Molecule with the path to that file.
Missing features:
skip_if_present
logging
error handling
check output and raise error if != 0
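The command line shown above can be assembled as an argument list. This sketch only uses the options that appear in the example (--reference, --identify and --gff); the function name ipd_summary_command is hypothetical, and naming the gff file by appending .gff to the bam file name is an assumption based on that example.

```python
from pathlib import Path


def ipd_summary_command(program, bam, fasta, mod_types_comma_sep):
    """Return (argument list, expected gff path) for one ipdSummary call."""
    bam = Path(bam)
    gff = bam.with_name(bam.name + ".gff")  # assumed naming convention
    cmd = [
        str(program), str(bam),
        "--reference", str(fasta),
        "--identify", mod_types_comma_sep,
        "--gff", str(gff),
    ]
    return cmd, gff
```

Running the list with subprocess.run(cmd, check=True) would raise CalledProcessError on a non-zero exit status, which is one way to cover the missing error check listed above.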
- pacbio_data_processing.ipd.multi_ipd_summary(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[pathlib.Path, None, None] ¶
Generator that yields gff files as they are produced in parallel. Implementation driven by a pool of threads.
- pacbio_data_processing.ipd.multi_ipd_summary_direct(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[pathlib.Path, None, None] [source]¶
Generator that yields gff files as they are produced. Serial implementation (one file produced after the other).
- pacbio_data_processing.ipd.multi_ipd_summary_threads(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[pathlib.Path, None, None] [source]¶
Generator that yields gff files as they are produced in parallel. Implementation driven by a pool of threads.
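A generator that yields results as soon as worker threads finish them can be built on concurrent.futures. The sketch below shows the general pattern only and is not the package's code; the name yield_as_completed and the generic task argument are hypothetical stand-ins for the per-molecule ipdSummary call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def yield_as_completed(work_units: Iterable[T], task: Callable[[T], R],
                       max_workers: int) -> Iterator[R]:
    """Run task on every work unit in a thread pool; yield in completion order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Note: this consumes the whole input iterable up front
        futures = [pool.submit(task, unit) for unit in work_units]
        for future in as_completed(futures):
            yield future.result()  # re-raises any exception from the task
```

Threads are a reasonable fit when each task mostly waits on an external process, since the waiting does not hold the interpreter lock. Results arrive in completion order, not submission order.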
pacbio_data_processing.logs module¶
pacbio_data_processing.parameters module¶
- class pacbio_data_processing.parameters.BamFilteringParameters(cl_input)[source]¶
Bases:
pacbio_data_processing.parameters.ParametersBase
- property filter_mappings¶
- property limit_mappings¶
- property min_relative_mapping_ratio¶
- property out_bam_file¶
pacbio_data_processing.plots module¶
- pacbio_data_processing.plots.make_barsplot(dataframe: pandas.core.frame.DataFrame, plot_title: str, filename: Union[pathlib.Path, str]) None [source]¶
- pacbio_data_processing.plots.make_continuous_rolled_data(data: dict[typing.NewType.<locals>.new_type, typing.NewType.<locals>.new_type], window: int) pandas.core.frame.DataFrame [source]¶
Auxiliary function used by make_rolling_history to produce a dataframe with the rolling average of the input data. The resulting dataframe starts at the min input position and ends at the max input position. The holes are set to 0 in the input data.
- pacbio_data_processing.plots.make_histogram(dataframe: pandas.core.frame.DataFrame, plot_title: str, filename: Union[pathlib.Path, str], legend: bool = True) None [source]¶
pacbio_data_processing.sam module¶
pacbio_data_processing.sentinel module¶
- class pacbio_data_processing.sentinel.Sentinel(checkpoint: pathlib.Path)[source]¶
Bases:
object
This class creates objects that are expected to be used as context managers. At __enter__ a sentinel file is created. At __exit__ the sentinel file is removed. If the file is there before entering the context, or is not there when the context is exited, an exception is raised.
- _anti_aging()[source]¶
Method that updates the modification time of the sentinel file every SLEEP_SECONDS seconds. This is part of the mechanism to ensure that the sentinel does not get fooled by an abandoned leftover sentinel file.
- property is_file_too_old¶
Property that answers the question: is the sentinel file too old to be taken as an active sentinel file, or not?
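The enter/exit behaviour and the age check can be illustrated with a small context manager. This is a simplified sketch, not the package's class: the names SentinelSketch and is_too_old are hypothetical, the exception types are whatever pathlib raises, and the background refresh done by _anti_aging is left out.

```python
import time
from pathlib import Path


class SentinelSketch:
    """Create a sentinel file on enter and remove it on exit."""

    def __init__(self, checkpoint: Path):
        self.checkpoint = checkpoint

    def __enter__(self):
        # Raises FileExistsError if the sentinel is already there
        self.checkpoint.touch(exist_ok=False)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Raises FileNotFoundError if the sentinel vanished meanwhile
        self.checkpoint.unlink()
        return False  # do not swallow exceptions from the with-block


def is_too_old(path: Path, max_age_seconds: float) -> bool:
    """True if the file's modification time is older than max_age_seconds."""
    return time.time() - path.stat().st_mtime > max_age_seconds
```

Refreshing the modification time periodically is what would let an age check like is_too_old tell an active sentinel from an abandoned one.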
pacbio_data_processing.sm_analysis module¶
This module contains the high level functions necessary to run the ‘Single Molecule Analysis’ on an input BAM file.
- class pacbio_data_processing.sm_analysis.MethylationReport(detections_csv, molecules, modification_types, filtered_bam_statistics=None)[source]¶
Bases:
object
- PRELOG = '[methylation report]'¶
- property modification_types¶
- class pacbio_data_processing.sm_analysis.SingleMoleculeAnalysis(parameters)[source]¶
Bases:
object
- property CCS_bam_file¶
It produces a Circular Consensus Sequence (CCS) version of the input BAM file and returns its name. It uses
generate_CCS_file()
to generate the file.
- __call__()[source]¶
Main entry point to perform a single molecule analysis: this method triggers the analysis.
- _align_bam_if_no_candidate_found(inbam: pacbio_data_processing.bam.BamFile, bam_type: str, variant: str = 'straight') Optional[str] [source]¶
[Internal method] Auxiliary method used by _ensure_input_bam_aligned. Given a bam_type (among input and ccs) and a variant, an initial BAM file is selected and a target aligned BAM filename is constructed. The method checks first whether the aligned file is there. If a plausible candidate is not found, the initial BAM is aligned (straight or π-shifted, depending on the variant and using the proper reference). If, on the other hand, a candidate is found, its computation is skipped.
If the aligner cannot be run (i.e. calling the aligner returns None), None is returned, meaning that the aligner was not called. This can happen when the aligner finds a sentinel file indicating that the computation is work in progress. (See pacbio_data_processing.blasr.Blasr.__call__() for more details on the implementation.) This mechanism allows reentrancy.
- Returns
the aligned input bam file, if it is there, or None if it could not be computed (yet).
- _create_references()[source]¶
[Internal method] DNA reference sequences are created here. The ‘true’ reference must exist as fasta beforehand, with its index. A π-shifted reference is created from the original one. Its index is also made.
This method sets two attributes which are both mappings with two keys (‘straight’ and ‘pi-shifted’) and values as follows:
- reference: the values are DNASeq objects
- fasta: the values are Path objects
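One way to picture a π-shifted reference is as the original circular sequence rotated by half its length. The sketch below rests on that assumption (including the direction of the rotation and the use of integer division for odd lengths), neither of which is confirmed by this page; the function names are hypothetical.

```python
def pi_shift(seq: str) -> str:
    """Rotate a circular sequence by half its length (assumed convention)."""
    half = len(seq) // 2
    return seq[half:] + seq[:half]


def to_shifted(pos: int, ref_len: int) -> int:
    """Map a position in the original reference to the shifted one."""
    return (pos - ref_len // 2) % ref_len


def shift_back(pos: int, ref_len: int) -> int:
    """Inverse of to_shifted: recover the original position."""
    return (pos + ref_len // 2) % ref_len
```

With this convention a molecule that crosses the origin of the straight reference falls in the middle of the shifted one, so it can be aligned without being split.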
- _disable_pi_shifted_analysis() None [source]¶
[Internal method] If the pi-shifted analysis cannot be carried out, it is disabled with this method.
- _ensure_ccs_bam_aligned() None [source]¶
[Internal method] As its name suggests, it is ensured that the aligned variants of the CCS file exist. The summary report is informed about the aligned CCS files.
- _ensure_input_bam_aligned() None [source]¶
[Internal method] Main check point for aligned input bam files: this method calls whatever is necessary to ensure that the input bam is aligned, which means: normal (straight) alignment and π-shifted alignment.
Warning! If the input is aligned, the method decides whether a pi-shifted aligned BAM is available based on whether 1. a file with a suitable filename is found, and 2. it is aligned.
- _exists_pi_shifted_variant_from_aligned_input() bool [source]¶
[Internal method] It checks that the expected pi-shifted aligned file exists and is an aligned BAM file.
- property partition: pacbio_data_processing.utils.Partition¶
The target
Partition
of the input BAM file that must be processed by the current analysis, according to the input provided by the user.
- property workdir: tempfile.TemporaryDirectory¶
This attribute returns the necessary temporary working directory on demand and it ensures that only one temporary dir is created by caching.
- pacbio_data_processing.sm_analysis.add_to_own_output(gffs, own_output_file_name, modification_types)[source]¶
From a set of .gff files, a csv file (delimiter=”,”) is saved with the following columns:
mol id: taken from each gff file (e.g. ‘a.b.c.gff’ -> mol id: ‘b’)
modtype: column number 3 (idx: 2) of the gffs (feature type)
GATC position: column number 5 (idx: 4) of the gffs which corresponds to the ‘end coordinate of the feature’ in the GFF3 standard
score of the feature: column number 6 (idx: 5); floating point (Phred-transformed pvalue that a kinetic deviation exists at this position)
strand: strand of the feature. It can be +, - with obvious meanings. It can also be ? (meaning unknown) or . (for non stranded features)
There are more columns, but they are not fixed in number. They correspond to the values given in the ‘attributes’ column of the gffs (col 9, idx 8). For example, given the following attributes column:
coverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228
we would get the following ‘extra’ columns:
134,TCA...,3.91,228
and this is exactly what happens with the m6A modification type.
All the lines starting by ‘#’ in the gff files are ignored. The format of the gff file is GFF3: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
The value of identificationQV is a Phred-transformed probability of having a detection. See eq. (8) in [1]
[1]: “Detection and Identification of Base Modifications with Single Molecule Real-Time Sequencing Data”
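The column mapping described above can be sketched for a single GFF3 line as follows. This is an illustration, not the package's code: the function names are hypothetical, the strand is read from GFF3 column 7 (idx 6), and taking the third-from-last dot-separated piece of the file name as the mol id is an assumption inferred from the ‘a.b.c.gff’ example.

```python
from pathlib import Path


def gff_line_to_row(gff_path, line: str) -> list[str]:
    """Turn one tab-separated GFF3 feature line into a csv row."""
    mol_id = Path(gff_path).name.split(".")[-3]  # 'a.b.c.gff' -> 'b' (assumed rule)
    cols = line.rstrip("\n").split("\t")
    # attributes look like 'key1=value1;key2=value2': keep only the values
    extra = [item.split("=", 1)[1] for item in cols[8].split(";") if "=" in item]
    return [mol_id, cols[2], cols[4], cols[5], cols[6], *extra]


def rows_from_gff(gff_path, lines):
    """Yield one row per feature line, ignoring lines starting with '#'."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield gff_line_to_row(gff_path, line)
```

The resulting rows can be written with csv.writer from the standard library, whose default delimiter is the comma.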
- pacbio_data_processing.sm_analysis.generate_CCS_file(ccs: pacbio_data_processing.external.CCS, in_bam: pathlib.Path, ccs_bam_file: pathlib.Path) Optional[pathlib.Path] [source]¶
Idempotent computation of the Circular Consensus Sequence (CCS) version of the passed-in in_bam file, done with the passed-in ccs object.
- Returns
the CCS bam file, if it is there, or None if it could not be computed (yet).
- pacbio_data_processing.sm_analysis.map_molecules_with_highest_sim_ratio(bam_file_name: Optional[Union[pathlib.Path, str]]) dict[int, pacbio_data_processing.bam_utils.Molecule] [source]¶
Given the path to a bam file, it returns a dictionary, whose keys are mol ids (ints) and the values are the corresponding Molecules. If multiple lines in the given BAM file share the mol id, only the first line found with the highest similarity ratio (computed from the cigar) is chosen: if multiple lines share the molecule ID and the highest similarity ratio (say, 1), ONLY the first one is taken, irrespective of other factors.
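The selection rule (highest similarity ratio, first occurrence wins on ties) amounts to a strict greater-than comparison while scanning the lines. The sketch below works on simplified (mol id, similarity ratio, line) triples instead of a BAM file; the function name best_line_per_molecule is hypothetical.

```python
def best_line_per_molecule(records):
    """Keep, per molecule id, the first line with the highest sim ratio."""
    best = {}
    for mol_id, sim_ratio, line in records:
        # Strict '>' means a later line with an equal ratio never replaces
        # the one found first.
        if mol_id not in best or sim_ratio > best[mol_id][0]:
            best[mol_id] = (sim_ratio, line)
    return {mol_id: line for mol_id, (_, line) in best.items()}
```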
pacbio_data_processing.sm_analysis_gui module¶
pacbio_data_processing.summary module¶
- class pacbio_data_processing.summary.GATCCoverageBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'GATCs NOT in BAM file (%)': ('perc_all_gatcs_not_identified_in_bam',), 'GATCs NOT in methylation report (%)': ('perc_all_gatcs_not_in_meth',), 'GATCs in BAM file (%)': ('perc_all_gatcs_identified_in_bam',), 'GATCs in methylation report (%)': ('perc_all_gatcs_in_meth',)}¶
- dependency_names = ('aligned_ccs_bam_files', 'methylation_report')¶
- index_labels = ('Percentage',)¶
- title = 'GATCs in BAM file and Methylation report'¶
- class pacbio_data_processing.summary.MethTypeBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'Fully methylated (%)': ('fully_methylated_gatcs_wrt_meth',), 'Fully unmethylated (%)': ('fully_unmethylated_gatcs_wrt_meth',), 'Hemi-methylated in + strand (%)': ('hemi_plus_methylated_gatcs_wrt_meth',), 'Hemi-methylated in - strand (%)': ('hemi_minus_methylated_gatcs_wrt_meth',)}¶
- dependency_names = ('methylation_report',)¶
- index_labels = ('Percentage',)¶
- title = 'Methylation types in methylation report'¶
- class pacbio_data_processing.summary.MoleculeLenHistogram(name=None)[source]¶
Bases:
pacbio_data_processing.summary.HistoryPlotAttribute
- column_name = 'len(molecule)'¶
- data_name = 'length'¶
- dependency_name = 'methylation_report'¶
- labels = ('Initial subreads', 'Analyzed molecules')¶
- legend = True¶
- title = 'Initial subreads and analyzed molecule length histogram'¶
- class pacbio_data_processing.summary.MoleculeTypeBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'Filtered out': ('perc_filtered_out_mols', 'perc_filtered_out_subreads'), 'In Methylation report with GATC': ('perc_mols_in_meth_report_with_gatcs', 'perc_subreads_in_meth_report_with_gatcs'), 'In Methylation report without GATC': ('perc_mols_in_meth_report_without_gatcs', 'perc_subreads_in_meth_report_without_gatcs'), 'Mismatch discards': ('perc_mols_dna_mismatches', 'perc_subreads_dna_mismatches'), 'Used in aligned CCS': ('perc_mols_used_in_aligned_ccs', 'perc_subreads_used_in_aligned_ccs')}¶
- dependency_names = ('mols_used_in_aligned_ccs', 'mols_dna_mismatches', 'filtered_out_mols', 'methylation_report')¶
- index_labels = ('Number of molecules (%)', 'Number of subreads (%)')¶
- title = 'Processed molecules and subreads'¶
- class pacbio_data_processing.summary.PercAttribute(total_attr, pref='perc_', suf='_wrt_meth', name=None)[source]¶
Bases:
pacbio_data_processing.summary.ROAttribute
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- class pacbio_data_processing.summary.PositionCoverageBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'Positions NOT covered by molecules in BAM file (%)': ('perc_all_positions_not_in_bam',), 'Positions NOT covered by molecules in methylation report (%)': ('perc_all_positions_not_in_meth',), 'Positions covered by molecules in BAM file (%)': ('perc_all_positions_in_bam',), 'Positions covered by molecules in methylation report (%)': ('perc_all_positions_in_meth',)}¶
- dependency_names = ('aligned_ccs_bam_files', 'methylation_report')¶
- index_labels = ('Percentage',)¶
- title = 'Position coverage in BAM file and Methylation report'¶
- class pacbio_data_processing.summary.PositionCoverageHistory(name=None)[source]¶
Bases:
pacbio_data_processing.summary.HistoryPlotAttribute
- dependency_name = 'methylation_report'¶
- labels = ('Positions',)¶
- legend = False¶
- len_column_name = 'len(molecule)'¶
- start_column_name = 'start of molecule'¶
- title = 'Sequencing positions covered by analyzed molecules'¶
- class pacbio_data_processing.summary.SimpleAttribute(name=None)[source]¶
Bases:
object
The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.
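A descriptor that stores its value in a per-instance _data dictionary can be sketched in a few lines. This is not the package's class: SimpleAttributeSketch and ReportSketch are hypothetical names, and only the general descriptor protocol (__set_name__, __get__, __set__) is shown.

```python
class SimpleAttributeSketch:
    """Descriptor that reads and writes instance._data[name]."""

    def __init__(self, name=None):
        self.name = name

    def __set_name__(self, owner, name):
        # Fall back to the attribute name used in the class body
        if self.name is None:
            self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class itself
        return instance._data[self.name]

    def __set__(self, instance, value):
        instance._data[self.name] = value


class ReportSketch:
    """Hypothetical owner class keeping all its values in _data."""

    input_bam = SimpleAttributeSketch()

    def __init__(self, bam_path):
        self._data = {}
        self.input_bam = bam_path
```

Keeping every value in one dictionary makes it straightforward for the owning class to behave as a mapping over its own attributes.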
- class pacbio_data_processing.summary.SummaryReport(bam_path, dnaseq)[source]¶
Bases:
collections.abc.Mapping
Final summary report generated by sm-analysis, initially intended for humans.
This class has been crafted to carefully control its attributes. Data can be fed into the class by setting some attributes. That process triggers the generation of other attributes, that are typically read-only.
After instantiating the class with the path to the input BAM and the dna sequence of the reference (instance of DNASeq), one must set some attributes to be able to save the summary report:
s = SummaryReport(bam_path, dnaseq)
s.methylation_report = path_to_meth_report
s.raw_detections = path_to_raw_detections_file
s.gff_result = path_to_gff_result_file
s.mols_dna_mismatches = {20, 49, ...} # set of ints
s.filtered_out_mols = {22, 493, ...} # set of ints
s.mols_used_in_aligned_ccs = {3, 67, ...} # set of ints
s.aligned_ccs_bam_files = {
    'straight': aligned_ccs_path,
    'pi-shifted': pi_shifted_aligned_ccs_path
}
At this point all the necessary data is there and the report can be created:
s.save('summary_whatever.html')
- aligned_ccs_bam_files¶
- all_gatcs_identified_in_bam¶
- all_gatcs_in_meth¶
- all_gatcs_not_identified_in_bam¶
- all_gatcs_not_in_meth¶
- all_positions_in_bam¶
- all_positions_in_meth¶
- all_positions_not_in_bam¶
- all_positions_not_in_meth¶
- property as_html¶
- body_md5sum¶
- filtered_out_mols¶
- filtered_out_subreads¶
- full_md5sum¶
- fully_methylated_gatcs¶
- fully_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- fully_unmethylated_gatcs¶
- fully_unmethylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- gatc_coverage_bars¶
- gff_result¶
The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.
- hemi_methylated_gatcs¶
- hemi_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- hemi_minus_methylated_gatcs¶
- hemi_minus_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- hemi_plus_methylated_gatcs¶
- hemi_plus_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- input_bam¶
- input_bam_size¶
- input_reference¶
- max_possible_methylations¶
- meth_type_bars¶
- methylation_report¶
- molecule_len_histogram¶
- molecule_type_bars¶
- mols_dna_mismatches¶
- mols_in_meth_report¶
- mols_in_meth_report_with_gatcs¶
- mols_in_meth_report_without_gatcs¶
- mols_ini¶
- mols_used_in_aligned_ccs¶
- perc_all_gatcs_identified_in_bam¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_gatcs_in_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_gatcs_not_identified_in_bam¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_gatcs_not_in_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_positions_in_bam¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_positions_in_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_positions_not_in_bam¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_positions_not_in_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_filtered_out_mols¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_filtered_out_subreads¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_mols_dna_mismatches¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_mols_in_meth_report¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_mols_in_meth_report_with_gatcs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_mols_in_meth_report_without_gatcs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_mols_used_in_aligned_ccs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_subreads_dna_mismatches¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_subreads_in_meth_report¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_subreads_in_meth_report_with_gatcs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_subreads_in_meth_report_without_gatcs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_subreads_used_in_aligned_ccs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- position_coverage_bars¶
- position_coverage_history¶
- raw_detections¶
The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.
- ready_to_go(*attrs)[source]¶
Method used to check if some attributes are already usable or not (in other words if they have been already set or not).
- reference_base_pairs¶
- reference_md5sum¶
- reference_name¶
- subreads_dna_mismatches¶
- subreads_in_meth_report¶
- subreads_in_meth_report_with_gatcs¶
- subreads_in_meth_report_without_gatcs¶
- subreads_ini¶
- subreads_used_in_aligned_ccs¶
- switch_on(attribute)[source]¶
Method used by descriptors to inform the instance of SummaryReport that some computed attributes needed by the plots are already computed and usable.
- total_gatcs_in_ref¶
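The percentage attributes listed above (the perc_* and *_wrt_meth members) are described as descriptors that read one attribute and express it relative to a total. A minimal sketch of that pattern follows. The names PercentageOf and Report, the two-decimal formatting and the constructor arguments are assumptions made for illustration; they are not taken from the package.

```python
class PercentageOf:
    """Hypothetical read-only descriptor: attr as a percentage of total_attr."""

    def __init__(self, attr: str, total_attr: str) -> None:
        self.attr = attr
        self.total_attr = total_attr

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class itself: return the descriptor object.
            return self
        part = getattr(instance, self.attr)
        total = getattr(instance, self.total_attr)
        # Assumed format: two decimals, returned as str.
        return f"{100 * part / total:.2f}"


class Report:
    """Hypothetical stand-in for a report object with two plain counters."""

    perc_used = PercentageOf("used", "initial")

    def __init__(self, used: int, initial: int) -> None:
        self.used = used
        self.initial = initial


print(Report(used=50, initial=200).perc_used)
```

Keeping the computation in a descriptor means the percentage is always derived from the current values and never stored separately.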
pacbio_data_processing.templates module¶
pacbio_data_processing.types module¶
pacbio_data_processing.utils module¶
- class pacbio_data_processing.utils.DNASeq(raw_seq: pacbio_data_processing.utils.DNASeqLike, name: str = '', description: str = '')[source]¶
Bases:
Generic[pacbio_data_processing.utils.DNASeqLike]
Wrapper around ‘Bio.Seq.Seq’.
- __init__(raw_seq: pacbio_data_processing.utils.DNASeqLike, name: str = '', description: str = '')[source]¶
- classmethod from_fasta(fasta_name: str) pacbio_data_processing.utils.DNASeqType [source]¶
Returns a DNASeq from the first DNA sequence stored in the fasta named ‘fasta_name’.
- property md5sum: str¶
It returns the hexdigest of the MD5 checksum of the upper-case version of the sequence, as a string.
- pi_shifted() pacbio_data_processing.utils.DNASeqType [source]¶
Method to return a pi-shifted DNASeq from the original one. pi-shifted means that a circular topology is assumed in the DNA sequence and the origin is shifted by π radians, i.e. the sequence is split in two parts and the two parts are swapped.
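The pi-shift can be illustrated on a plain string. This is a sketch, not the DNASeq implementation: the function name pi_shift is hypothetical, and the split point len(seq) // 2 is an assumption chosen to agree with the 7 base pair example given for shift_me_back.

```python
def pi_shift(seq: str) -> str:
    """Swap the two halves of a sequence (assumed split at len(seq) // 2)."""
    half = len(seq) // 2
    return seq[half:] + seq[:half]


# Positions 1..7 become 4 5 6 7 1 2 3 after the shift.
print(pi_shift("1234567"))
```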
- class pacbio_data_processing.utils.Partition(partition_specification: Optional[tuple[int, int]], bamfile: pacbio_data_processing.bam.BamFile)[source]¶
Bases:
object
A Partition is a class that helps answer the following question: assuming that we are interested in processing a fraction of a BamFile, does the molecule ID mol_id belong to that fraction or not? A prior implementation stored all the molecule IDs in the BamFile for a given partition in a set, and the answer was obtained by querying whether a molecule ID belongs to the set. That former implementation is not enough for the case of multiple alignment processes for the same raw BamFile (e.g., when a combined analysis of the so-called ‘straight’ and ‘pi-shifted’ variants is performed). In that case the partition is decided with one file, and all molecule IDs belonging to the non-empty intersection with the other file must be unambiguously accommodated in a certain partition. This class has been designed to solve that problem.
- __init__(partition_specification: Optional[tuple[int, int]], bamfile: pacbio_data_processing.bam.BamFile) None [source]¶
- _delimit_partitions() None [source]¶
[Internal method] This method decides what the limits of all partitions are, given the number of partitions. The method sets an internal mapping, self._lower_limits, of the type {partition number [int]: lower limit [int]} with that information. This mapping is populated with all the partition numbers and corresponding values.
- _set_current_limits() None [source]¶
[Internal method] Auxiliary method for __contains__. Here it is determined what is the range of molecule IDs, as ints, that belong to the partition. The method sets two integer attributes, namely:
- _lower_limit_current: the minimum molecule ID of the current partition, and
- _higher_limit_current: the maximum molecule ID of the current partition; it can be None, meaning that there is no maximum (last partition).
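The range-based idea behind Partition can be sketched with plain functions. Everything here is an assumption made for illustration: the function names, the 1-based partition numbers, the equal-size chunking of the sorted molecule IDs, the lower limit of 0 for the first partition and the exclusive upper bound. The class itself may choose its limits differently.

```python
def lower_limits(mol_ids: list[int], num_partitions: int) -> dict[int, int]:
    """Hypothetical helper: map each partition number to its lowest molecule ID."""
    ordered = sorted(set(mol_ids))
    if len(ordered) < num_partitions:
        raise ValueError("fewer molecule IDs than partitions")
    size = len(ordered) // num_partitions
    limits = {part: ordered[(part - 1) * size]
              for part in range(1, num_partitions + 1)}
    limits[1] = 0  # assumed: the first partition also takes any smaller ID
    return limits


def in_partition(mol_id: int, partition: int, limits: dict[int, int]) -> bool:
    """Range test: lower limit inclusive, next partition's limit exclusive."""
    lower = limits[partition]
    higher = limits.get(partition + 1)  # None for the last partition
    if higher is None:
        return mol_id >= lower
    return lower <= mol_id < higher


limits = lower_limits([10, 20, 30, 40, 50, 60], num_partitions=3)
print(limits)
# An ID absent from the file that defined the limits still has a home.
print([p for p in limits if in_partition(45, p, limits)])
```

Because membership is decided by ranges and not by a stored set, a molecule ID that appears only in the other BAM file still falls in exactly one partition.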
- pacbio_data_processing.utils.combine_scores(scores: collections.abc.Sequence[float]) float [source]¶
It computes the combined phred transformed score of the scores provided. Some examples:
>>> combine_scores([10])
10.0
>>> q = combine_scores([10, 12, 14])
>>> print(round(q, 6))
7.204355
>>> q = combine_scores([30, 20, 100, 92])
>>> print(round(q, 6))
19.590023
>>> q_500 = combine_scores([30, 20, 500])
>>> q_no_500 = combine_scores([30, 20])
>>> q_500 == q_no_500
True
>>> combine_scores([200, 300, 500])
200.0
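The examples above are consistent with treating each score q as an error probability 10^(-q/10) and combining them as the probability that at least one error occurs. The sketch below follows that reading. It is not the package's implementation: the name combined_phred is hypothetical, and the results may differ from combine_scores in the last digits (for instance it gives a value very close to, but not exactly, 200.0 for [200, 300, 500]).

```python
import math
from collections.abc import Sequence


def combined_phred(scores: Sequence[float]) -> float:
    """Phred score of the event 'at least one of the calls is wrong'."""
    # log of the probability that every call is correct; log1p keeps
    # precision when the individual error probabilities are tiny.
    log_all_correct = sum(math.log1p(-10 ** (-q / 10)) for q in scores)
    # 1 - exp(x), computed without cancellation for x close to 0.
    p_error = -math.expm1(log_all_correct)
    # Raises ValueError if p_error underflows to 0 or if any score is 0.
    return -10 * math.log10(p_error)


print(round(combined_phred([10, 12, 14]), 6))
```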
- pacbio_data_processing.utils.find_gatc_positions(seq: str, offset: int = 0) set[int] [source]¶
Convenience function that computes the positions of all GATCs found in the given sequence. The values are relative to the offset.
>>> find_gatc_positions('AAAGAGAGATCGCGCGATC') == {7, 15}
True
>>> find_gatc_positions('AAAGAGAGTCGCGCCATC')
set()
>>> find_gatc_positions('AAAGAGAGATCGgaTcCGCGATC') == {7, 12, 19}
True
>>> s = find_gatc_positions('AAAGAGAGATCGgaTcCGCGATC', offset=23)
>>> s == {30, 35, 42}
True
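The examples show 0-based, case-insensitive matching with the offset added to every hit. One way to reproduce that behaviour is sketched below with a regular expression. The name gatc_positions is hypothetical and the package may implement the search differently.

```python
import re


def gatc_positions(seq: str, offset: int = 0) -> set[int]:
    """0-based start positions of every GATC in seq, shifted by offset."""
    # GATC cannot overlap with itself, so non-overlapping matching is enough.
    return {m.start() + offset for m in re.finditer("GATC", seq.upper())}


print(sorted(gatc_positions("AAAGAGAGATCGgaTcCGCGATC", offset=23)))
```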
- pacbio_data_processing.utils.pishift_back_positions_in_gff(gff_path: Union[str, pathlib.Path]) None [source]¶
A function that parses the input GFF file (assumed to be a valid GFF3 file) and shifts back the positions found in it (4th and 5th columns of lines not starting with #). It is assumed that the positions in the input file (gff_path) refer to a pi-shifted origin. To undo the shift, the length of each sequence is read from the GFF3 directives (lines starting with ##), in particular from the ##sequence-region pragmas. This function can handle the case of multiple sequences.
Warning! The function overwrites the input gff_path.
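The procedure described for pishift_back_positions_in_gff can be sketched on a list of lines held in memory. This is an illustration under assumptions, not the package's code: both function names are hypothetical, the unshift formula is the one inferred from the shift_me_back examples, and nothing is written back to disk here.

```python
def shift_back(pos: int, nbp: int) -> int:
    """Assumed unshift of a 1-based position on a circular reference."""
    return (pos - 1 + nbp // 2) % nbp + 1


def unshift_gff_lines(lines: list[str]) -> list[str]:
    """Return the lines with columns 4 and 5 mapped back to the original origin."""
    lengths: dict[str, int] = {}
    out = []
    for line in lines:
        if line.startswith("##sequence-region"):
            # GFF3 pragma: ##sequence-region seqid start end
            _, seqid, start, end = line.split()
            lengths[seqid] = int(end) - int(start) + 1
            out.append(line)
        elif line.startswith("#") or not line.strip():
            out.append(line)  # other directives, comments, blank lines
        else:
            cols = line.split("\t")
            nbp = lengths[cols[0]]  # the pragma must precede the features
            cols[3] = str(shift_back(int(cols[3]), nbp))
            cols[4] = str(shift_back(int(cols[4]), nbp))
            out.append("\t".join(cols))
    return out


gff = [
    "##gff-version 3",
    "##sequence-region chr1 1 7",
    "chr1\tsrc\tfeature\t5\t5\t.\t+\t.\tID=x",
]
print(unshift_gff_lines(gff)[-1])
```

Looking up the length by the sequence ID in the first column is what lets a single pass handle several sequences.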
- pacbio_data_processing.utils.shift_me_back(pos: int, nbp: int) int [source]¶
Unshifts a given position taking into account that it has been previously shifted by half of the number of base pairs. It takes into account the possibility of having a sequence with an odd length.
@params:
pos - 1-based position of a base pair to unshift
nbp - number of base pairs in the reference
@returns:
unshifted position
Some examples:
>>> shift_me_back(3, 10)
8
>>> shift_me_back(1, 20)
11
>>> shift_me_back(3, 7)
6
>>> shift_me_back(4, 7)
7
>>> shift_me_back(5, 7)
1
>>> shift_me_back(7, 7)
3
>>> shift_me_back(1, 7)
4
To understand the operation of this function consider the following example. Given a sequence of 7 base pairs with the following indices found in the reference in the natural order, ie
1 2 3 4 5 6 7
then, after being pi-shifted the base pairs in the sequence are reordered, and the indices become (in parenthesis the former indices):
1’(=4) 2’(=5) 3’(=6) 4’(=7) 5’(=1) 6’(=2) 7’(=3)
The current function accepts primed indices and transforms them to the unprimed indices, ie, the positions returned refer to the original reference.
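The mapping from primed to unprimed indices can be written as modular arithmetic on 1-based positions. The sketch below reproduces every example listed for shift_me_back, but it is a reconstruction from those examples and not necessarily how the function is written. The name unshift is hypothetical.

```python
def unshift(pos: int, nbp: int) -> int:
    """Map a 1-based pi-shifted position back to the original reference."""
    # Go to 0-based, advance by half the length (rounded down), wrap
    # around the circle, and return to 1-based.
    return (pos - 1 + nbp // 2) % nbp + 1


# The 7 base pair example: 1' .. 7' map to 4 5 6 7 1 2 3.
print([unshift(p, 7) for p in range(1, 8)])
```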
- pacbio_data_processing.utils.try_computations_with_variants_until_done(func: Callable, variants: collections.abc.Sequence[str], *args: Any) None [source]¶
This function runs the passed in function func with the arguments *args and for each variant in variants, e.g. something like this:
for v in variants:
    result = func(*args, variant=v)
but it keeps doing so until each result returned by func is not None. When a None is returned by func, a call to sleep is made before continuing. The time slept depends on how many times it has slept before; the sleep time grows exponentially with every iteration:
t -> 2*t
until all the computations (results of func for each variant) are completed, i.e. all are not None. The main application of this function is to ensure that some common operations of the SingleMoleculeAnalysis are done once and only once irrespective of how many parallel instances of the analysis (with different partitions each) are carried out. For example, this function can be used to avoid collisions in the generation of aligned BAM files since pacbio_data_processing.blasr.Blasr has a mechanism that allows concurrent computations. This function delegates the decision on whether the computation is done or not to func.
Note
A special case is when a variant is None; in that case the function func is called without the variant argument:
result = func(*args)
Therefore, if variants is, e.g., (None,), then func is only called once in each iteration WITHOUT the variant keyword argument. That is useful if the function func must be called until it is done, but it takes no variant argument.
Module contents¶
Top-level package for PacBio data processing.