Description of data in the Methylation Reports
- Author
David Palao
- Date
6 May 2021
- Last updated
13 September 2021
- Version
3
- Tags
sm-analysis PacbioDataProcessing output methylation
- Abstract
The sm-analysis pipeline produces a so-called methylation report. Its contents are described in this document.
Format
Note
This section describes the most recent version of the methylation report’s format.
The methylation report produced by sm-analysis is a CSV file with ; (semicolon) as separator and 21 columns, with the following header:
molecule id;sequence;start of molecule;end of molecule;len(molecule);count(subreads+);count(subreads-);combined QUAL;mean(QUAL);sim ratio;count(GATC);positions of GATCs;count(methylation states);methylation states;combined score;mean(score);min(IPDRatio);mean(IPDRatio);combined idQV;mean(idQV);mean(coverage)
Some columns hold several values (see e.g. columns 12 and 14). In that case an internal separator, namely , (comma), is used.
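As a minimal sketch of how the two separators could be handled, the following uses Python's standard csv module for the outer ; and a plain split for the inner , separator. The helper name read_methylation_report is hypothetical and not part of sm-analysis, and the sample is reduced to three columns with illustrative values.

```python
import csv
import io

# Columns that hold several values joined by the internal "," separator
# (columns 12 and 14 of the report).
MULTI_VALUED = ("positions of GATCs", "methylation states")


def read_methylation_report(handle):
    """Yield one dict per molecule from an open methylation report.

    Hypothetical helper for illustration; not part of sm-analysis.
    """
    reader = csv.DictReader(handle, delimiter=";")
    for row in reader:
        for name in MULTI_VALUED:
            # An empty field becomes an empty list instead of [""].
            row[name] = row[name].split(",") if row[name] else []
        yield row


# Tiny in-memory sample, reduced to three columns with illustrative values.
sample = io.StringIO(
    "molecule id;positions of GATCs;methylation states\n"
    "23480;315,699,902;f,-,0\n"
)
rows = list(read_methylation_report(sample))
print(rows[0]["positions of GATCs"])
```

All values stay strings in this sketch; converting positions to int is left to the caller.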
The following table summarizes the meaning of each column.
| col num | field name | possible values | description | example |
|---|---|---|---|---|
| 1 | molecule id | positive int | value provided by the sequencer | 23480 |
| 2 | sequence | [ACGT]* | DNA sequence corresponding to the molecule, as reported by CCS | AGACTTTC… |
| 3 | start of molecule | positive int | 1-based start position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the first base of the sequence plus 1 (the minimum position is 1) | 312 |
| 4 | end of molecule | positive int | 1-based end position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the last base of the sequence plus 1 | 1509 |
| 5 | len(molecule) | positive int | length of the DNA sequence corresponding to the molecule, according to the aligned CCS file | 1198 |
| 6 | count(subreads+) | int >= 0 | number of subreads in the + strand found in the input BAM | 51 |
| 7 | count(subreads-) | int >= 0 | number of subreads in the - strand found in the input BAM | 48 |
| 8 | combined QUAL | positive float | combined QUAL (ASCII of base quality plus 33). Each QUAL is the Phred-transformed probability value that the base is wrong. | 95.2 |
| 9 | mean(QUAL) | positive float | mean QUAL (ASCII of base quality plus 33). Each QUAL is the Phred-transformed probability value that the base is wrong. | 101.4 |
| 10 | sim ratio | float between 0 and 1 | ratio of similarity between the molecule’s sequence and the corresponding piece in the reference | 1.0 |
| 11 | count(GATC) | positive int | number of GATCs found in the DNA sequence | 3 |
| 12 | positions of GATCs | comma separated sequence of int>0 | 1-based absolute positions of the Gs for all the GATCs present in col 2 (i.e., in the current molecule) | 315,699,902 |
| 13 | count(methylation states) | positive int | in how many positions the molecule was detected to have a methylation ( | 2 |
| 14 | methylation states | comma separated sequence of [0-+f] | each element corresponds to the methylation state of one GATC in the sequence as returned by the | f,-,0 |
| 15 | combined score | positive float | combined score of the feature for each detection (each score is the Phred-transformed probability value that a kinetic deviation exists at a position) | 118 |
| 16 | mean(score) | positive float | mean score of the feature over all detections in the molecule (each score is the Phred-transformed probability value that a kinetic deviation exists at a position) | 150 |
| 17 | min(IPDRatio) | positive float | min of tMean/modelPredictions (tMean is the capped mean of normalized IPDs observed at this position) | 3.4 |
| 18 | mean(IPDRatio) | positive float | mean of tMean/modelPredictions (tMean is the capped mean of normalized IPDs observed at this position) | 5.2 |
| 19 | combined idQV | positive float | combined idQV | 19.6 |
| 20 | mean(idQV) | positive float | mean idQV | 30 |
| 21 | mean(coverage) | positive float | mean value of the coverage levels used to assign the modification type label | 42 |
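The coordinate conventions in the table can be checked against each other. The sketch below assumes that the sequence is aligned base by base to the reference (no insertions or deletions), which the report itself does not guarantee. Under that assumption the length equals end - start + 1, and the absolute position of each G is the start position plus the 0-based offset of the GATC within the sequence. Both function names are hypothetical.

```python
def molecule_length(start, end):
    """Length implied by 1-based inclusive start and end positions."""
    return end - start + 1


def gatc_positions(sequence, start):
    """1-based absolute positions of the G of every GATC in `sequence`.

    Assumes a gap-free alignment, so that offset i in the sequence maps
    to reference position start + i.
    """
    positions = []
    offset = sequence.find("GATC")
    while offset != -1:
        positions.append(start + offset)
        offset = sequence.find("GATC", offset + 1)
    return positions


print(molecule_length(312, 1509))            # 1198, matching the example column
print(gatc_positions("AAAGATCTTGATC", 312))  # [315, 321]
```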
Some notes:
- the number of elements in columns 12 and 14 must be equal to the value in column 11
- idQV is the Phred-transformed QV of having a modification at a given position
- The meaning of the methylation state symbols:
  - 0: not methylated
  - -: hemi-methylated. Negative strand
  - +: hemi-methylated. Positive strand
  - f: fully methylated
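The consistency rule and the symbol meanings above can be expressed as a small check. This is only a sketch: STATE_MEANING and check_row are hypothetical names, and the function works on the raw strings of columns 12 and 14 plus the integer from column 11.

```python
STATE_MEANING = {
    "0": "not methylated",
    "-": "hemi-methylated, negative strand",
    "+": "hemi-methylated, positive strand",
    "f": "fully methylated",
}


def check_row(count_gatc, positions, states):
    """Return the decoded states, or raise ValueError if the row is inconsistent.

    `positions` and `states` are the raw comma-separated strings of
    columns 12 and 14; `count_gatc` is the integer from column 11.
    """
    pos_items = positions.split(",") if positions else []
    state_items = states.split(",") if states else []
    if not (len(pos_items) == len(state_items) == count_gatc):
        raise ValueError("columns 11, 12 and 14 disagree")
    unknown = set(state_items) - set(STATE_MEANING)
    if unknown:
        raise ValueError(f"unknown methylation state(s): {sorted(unknown)}")
    return [STATE_MEANING[s] for s in state_items]


print(check_row(3, "315,699,902", "f,-,0"))
```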
Format (version 2)
The methylation report produced by sm-analysis is a CSV file with ; (semicolon) as separator and 7 columns, with the following header:
molecule id;count(GATC);sequence;start of molecule;end of molecule;positions of GATCs;methylation states
The following table summarizes the meaning of each column.
| col num | field name | possible values | description | example |
|---|---|---|---|---|
| 1 | molecule id | positive int | value provided by the sequencer | 23480 |
| 2 | count(GATC) | positive int | number of GATCs found in the DNA sequence | 3 |
| 3 | sequence | [ACGT]* | DNA sequence corresponding to the molecule, as reported by CCS | AGACTTTC… |
| 4 | start of molecule | positive int | 1-based start position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the first base of the sequence plus 1 (the minimum position is 1) | 312 |
| 5 | end of molecule | positive int | 1-based end position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the last base of the sequence plus 1 | 1509 |
| 6 | positions of GATCs | comma separated sequence of int>0 | 1-based absolute positions of the Gs for all the GATCs present in col 3 (i.e., in the current molecule) | 315,699,1002 |
| 7 | methylation states | comma separated sequence of [0-+f] | each element corresponds to the methylation state of one GATC in the sequence as returned by the | f,-,0 |
Some notes:
- the number of elements in columns 6 and 7 must be equal to the value in column 2
- The meaning of the methylation state symbols:
  - 0: not methylated
  - -: hemi-methylated. Negative strand
  - +: hemi-methylated. Positive strand
  - f: fully methylated
Format (version 1)
Note
This version, v1, is an old format that is no longer used. The decision to replace it with version 2 (described above) was taken in a work meeting (with DP, DV and TW) on 18 June 20201.
The methylation report produced by sm-analysis is a CSV file with ; (semicolon) as separator and 6 columns, with the following header:
molecule id;count(GATC);sequence;start-end of molecule;positions of GATCs;methylation states
The following table summarizes the meaning of each column.
| col num | field name | possible values | description |
|---|---|---|---|
| 1 | molecule id | positive int | value provided by the sequencer |
| 2 | count(GATC) | positive int | number of GATCs found in the DNA sequence |
| 3 | sequence | [ACGT]* | DNA sequence corresponding to the molecule, as reported by CCS |
| 4 | start-end of molecule | [int>=0,int>0] | inclusive interval corresponding to the start and end of the molecule within the reference; the values are taken from the aligner but shifted such that the minimum position is 0 (i.e., 0-based indexing is used) |
| 5 | positions of GATCs | space separated sequence of int>0 | 0-based positions of the A in all the GATCs present in col 3, relative to that sequence |
| 6 | methylation states | space separated sequence of [0-+f] | each element corresponds to the methylation state of one GATC in the sequence as returned by the |
Some notes:
- the number of elements in columns 5 and 6 must be equal to the value in column 2
- The meaning of the methylation state symbols:
  - 0: not methylated
  - -: hemi-methylated. Negative strand
  - +: hemi-methylated. Positive strand
  - f: fully methylated
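Version 1 and version 2 use different coordinate conventions: version 1 stores 0-based positions of the A relative to the sequence, while version 2 stores 1-based absolute positions of the G. The following sketch shows how one could translate between them. It assumes that offsets in the sequence map one to one onto reference positions and that the interval of column 4 has already been parsed into two integers, since the exact serialization of that column is not shown here. The function name is hypothetical.

```python
def v1_to_v2_coordinates(start0, end0, a_offsets):
    """Convert version 1 coordinates to the version 2 conventions.

    start0, end0: 0-based inclusive interval from column 4 of version 1.
    a_offsets: the space-separated string from column 5 of version 1.
    Returns (start, end, comma-separated G positions), all 1-based.
    """
    start = start0 + 1
    end = end0 + 1
    # The A sits at 0-based absolute position start0 + a. The G is one base
    # earlier, and switching to 1-based adds one, so the two shifts cancel.
    g_positions = [start0 + int(a) for a in a_offsets.split()]
    return start, end, ",".join(str(p) for p in g_positions)


print(v1_to_v2_coordinates(311, 1508, "4 388 591"))
```

With these illustrative inputs the result is the start, end and G positions 312, 1509 and 315,699,902.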