You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: book/src/dmr_scoring_details.md
+2-1Lines changed: 2 additions & 1 deletion
Original file line number
Diff line number
Diff line change
@@ -118,7 +118,8 @@ The model is a simple 2-state hidden Markov model, shown below, where the two hi
118
118

119
119
120
120
</div>
121
-
The model is run over the intersection of the modified positions in a [pileup](https://nanoporetech.github.io/modkit/intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples.
121
+
122
+
The model is run over the intersection of the modified positions in a [pileup](./intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples.
122
123
123
124
## Transition parameters
124
125
There are two transition probability parameters, \\(p\\) and \\(d\\).
## How are base modification probabilities calculated?
4
+
5
+
Base modifications are assigned a probability reflecting the confidence the base modification detection algorithm has in making a decision about the modification state of the molecule at a particular position.
6
+
The probabilities are parsed from the `ML` tag in the BAM record. These values reflect the probability of the base having a specific modification, `modkit` uses these values and calculates the probability for each modification as well as the probability the base is canonical:
where \\(\textbf{M}\\) is the set of all of the potential modifications for the base.
13
+
14
+
For example, consider using a m6A model that predicts m6A or canonical bases at adenine residues, if the \\( P_{\text{m6A}} = 0.9 \\) then the probability of canonical \\( \text{A} \\) is \\( P_{\text{canonical}} = 1 - P_{\text{m6A}} = 0.1 \\).
15
+
Or considering a typical case for cytosine modifications where the model predicts 5hmC, 5mC, and canonical cytosine:
A potential confusion is that `modkit` does not assume a base is canonical if the probability of modification is close to \\( \frac{1}{N_{\text{classes}}} \\), the lowest probability the algorithm may assign.
24
+
25
+
## What value for `--filter-threshold` should I use?
26
+
27
+
The same way that you may remove low quality data as a first step to any processing, `modkit` will filter out the lowest confidence base modification probabilities.
28
+
The filter threshold (or pass threshold) defines the minimum probability required for a read's base modification information at a particular position to be used in a downstream step.
29
+
This does not remove the whole read from consideration, just the base modification information attributed to a particular position in the read will be removed.
30
+
The most common place to encounter filtering is in `pileup`, where base modification probabilities falling below the pass threshold will be tabulated in the \\( \text{N}\_{\text{Fail}} \\) column instead of the \\( \text{N}\_{\text{valid}} \\) column.
31
+
For highest accuracy, the general recommendation is to let `modkit` estimate this value for you based on the input data.
32
+
The value is calculated by first taking a sample of the base modification probabilities from the input dataset and determining the \\(10^{\text{th}}\\) percentile probability value.
33
+
This percentile can be changed with the `--filter-percentile` option.
34
+
Passing a value to `--filter-threshold` and/or `--mod-threshold` that is higher or lower than the estimated value will have the effect of excluding or including more probabilities, respectively.
35
+
It may be a good idea to inspect the distribution of probability values in your data, the `modkit sample-probs`[command](./intro_sample_probs.md) is designed for this task.
36
+
Use the `--hist` and `--out-dir` options to collect a histogram of the prediction probabilities for each canonical base and modification.
37
+
38
+
39
+
40
+
<!-- ## How can I perform differential methylation analysis? -->
> For details on how base modification probabilities are calculated, see the [FAQ page](./faq.html#how-are-base-modification-probabilities-calculated)
4
+
5
+
For most use cases the automatic filtering enabled in `modkit` will produce nearly ideal results.
6
+
However, in some cases such as exotic organisms or specialized assays, you may want to interrogate the base modification probabilities directly and tune the pass thresholds.
7
+
The `modkit sample-probs` command is designed for this task.
8
+
There are two ways to use this command, first by simply running `modkit sample-probs $mod_bam` to get a tab-separated file of threshold values for each modified base.
9
+
This can save time in downstream steps where you wish to re-use the threshold value by passing `--filter-threshold` and skip re-estimating the value.
10
+
To generate more advanced output, add `--hist --out-dir $output_dir` to the command and generate per-modification histograms of the output probabilities.
11
+
Using the command this way produces 3 files in the `$output_dir`:
12
+
1. An HTML document containing a histogram of the total counts of each probability emitted for each modification code (including canonical) in the sampled reads.
13
+
1. Another HTML document containing the proportions of each probability emitted.
14
+
1. A tab-separated table with the same information as the histograms and the percentile rank of each probability value.
| 1 | code | modification code or '-' for canonical | string |
21
+
| 2 | primary base | the primary DNA base for which the code applies | string |
22
+
| 3 | range_start | the inclusive start probability of the bin | float |
23
+
| 4 | range_end | the exclusive end probability of the bin | float |
24
+
| 5 | count | the total count of probabilities falling in this bin | int |
25
+
| 6 | frac | the fraction of the total calls for this code/primary base in this bin | float |
26
+
| 7 | percentile_rank | the [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) of this probability bin | float |
27
+
28
+
From these plots and tables you can decide on a pass threshold per-modification code and use `--mod-threshold`/`--filter-threshold`[accordingly](./filtering.md).
0 commit comments