
Commit 828697b

Merge branch 'ar/docs-0.3.2' into 'master'
Docs 0.3.2 See merge request machine-learning/modkit!206
2 parents 9d65d2a + d86f367 commit 828697b

39 files changed, 863 additions and 85 deletions

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -4,6 +4,15 @@ All notable changes to this project will be documented in this file.
44
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
55
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
66

7+
## [v0.3.2]
8+
### Fixes
9+
- [thresholds] OOB panic fix, #244
10+
- [dmr, pair] Allow strand information in regions to be provided. #240
11+
- [dmr, multi] Fix problem when many pairs are provided (#229)
12+
### Adds
13+
- [sample-probs] Change output format of probabilities table to make it easier to parse, also change schema. Output HTML documents with nicer tables.
14+
- [ci] Build in Ubuntu-16 due to Centos7 EOL.
15+
716
## [v0.3.1]
817
### Fixes
918
- [call-mods] Always change model to "explicit", dropped base modification probabilities should not be interpreted as canonical.

book/src/SUMMARY.md

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,7 @@
33
- [Quick Start guides](./quick_start.md)
44
- [Constructing bedMethyl tables](./intro_bedmethyl.md)
55
- [Updating and adjusting MM tags](./intro_adjust.md)
6+
- [Inspecting base modification probabilities](./intro_sample_probs.md)
67
- [Summarizing a modBAM](./intro_summary.md)
78
- [Making a motif BED file](./intro_motif_bed.md)
89
- [Extracting read information to a table](./intro_extract.md)
@@ -17,6 +18,7 @@
1718
- [Narrow output to specific positions](./intro_include_bed.md)
1819
- [Extended subcommand help](./advanced_usage.md)
1920
- [Troubleshooting](./troubleshooting.md)
21+
- [Frequently asked questions](./faq.md)
2022
- [Current limitations](./limitations.md)
2123
- [Performance considerations](./perf_considerations.md)
2224
- [Algorithm details](./algo_details.md)

book/src/advanced_usage.md

Lines changed: 56 additions & 13 deletions
@@ -281,8 +281,8 @@ Usage: modkit adjust-mods [OPTIONS] <IN_BAM> <OUT_BAM>
281281
282282
Arguments:
283283
<IN_BAM>
284-
BAM file to collapse mod call from. Can be a path to a file or one of `-` or `stdin` to
285-
specify a stream from standard input.
284+
Input BAM file, can be a path to a file or one of `-` or `stdin` to specify a stream from
285+
standard input.
286286
287287
<OUT_BAM>
288288
File path to new BAM file to be created. Can be a path to a file or one of `-` or `stdin`
@@ -325,6 +325,58 @@ Options:
325325
--output-sam
326326
Output SAM format instead of BAM.
327327
328+
--filter-probs
329+
Filter out the lowest confidence base modification probabilities.
330+
331+
-n, --num-reads <NUM_READS>
332+
Sample approximately this many reads when estimating the filtering threshold. If
333+
alignments are present reads will be sampled evenly across aligned genome. If a region is
334+
specified, either with the --region option or the --sample-region option, then reads will
335+
be sampled evenly across the region given. This option is useful for large BAM files. In
336+
practice, 10-50 thousand reads is sufficient to estimate the model output distribution and
337+
determine the filtering threshold.
338+
339+
[default: 10042]
340+
341+
--sample-region <SAMPLE_REGION>
342+
Specify a region for sampling reads from when estimating the threshold probability. If
343+
this option is not provided, but --region is provided, the genomic interval passed to
344+
--region will be used. Format should be <chrom_name>:<start>-<end> or <chrom_name>.
345+
346+
--sampling-interval-size <SAMPLING_INTERVAL_SIZE>
347+
Interval chunk size to process concurrently when estimating the threshold probability, can
348+
be larger than the pileup processing interval.
349+
350+
[default: 1000000]
351+
352+
-p, --filter-percentile <FILTER_PERCENTILE>
353+
Filter out modified base calls where the probability of the predicted variant is below
354+
this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence
355+
modification calls.
356+
357+
[default: 0.1]
358+
359+
--filter-threshold <FILTER_THRESHOLD>
360+
Specify the filter threshold globally or per primary base. A global filter threshold can
361+
be specified by a decimal number (e.g. 0.75). Per-base thresholds can be specified by
362+
colon-separated values, for example C:0.75 specifies a threshold value of 0.75 for
363+
cytosine modification calls. Additional per-base thresholds can be specified by repeating
364+
the option: for example --filter-threshold C:0.75 --filter-threshold A:0.70 or specify a
365+
single base option and a default for all other bases with: --filter-threshold A:0.70
366+
--filter-threshold 0.9 will specify a threshold value of 0.70 for adenine and 0.9 for all
367+
other base modification calls.
368+
369+
--mod-threshold <MOD_THRESHOLDS>
370+
Specify a passing threshold to use for a base modification, independent of the threshold
371+
for the primary sequence base or the default. For example, to set the pass threshold for
372+
5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as
373+
usual and used for canonical cytosine and other modifications unless the
374+
`--filter-threshold` option is also passed. See the online documentation for more details.
375+
376+
--only-mapped
377+
Only use base modification probabilities from bases that are aligned when estimating the
378+
filter threshold (i.e. ignore soft-clipped, and inserted bases).
379+
328380
--suppress-progress
329381
Hide the progress bar
330382
@@ -398,13 +450,9 @@ Options:
398450
399451
-o, --out-dir <OUT_DIR>
400452
Directory to deposit result tables into. Required for model probability histogram output.
401-
Creates two files probabilities.tsv and probabilities.txt The .txt contains
402-
ASCII-histograms and the .tsv contains tab-separated variable data represented by the
403-
histograms.
404453
405454
--prefix <PREFIX>
406-
Label to prefix output files with. E.g. 'foo' will output foo_thresholds.tsv,
407-
foo_probabilities.tsv, and foo_probabilities.txt.
455+
Label to prefix output files with.
408456
409457
--force
410458
Overwrite results if present.
@@ -431,11 +479,6 @@ Options:
431479
--hist
432480
Output histogram of base modification prediction probabilities.
433481
434-
--buckets <BUCKETS>
435-
Number of buckets for the histogram, if used.
436-
437-
[default: 128]
438-
439482
-n, --num-reads <NUM_READS>
440483
Approximate maximum number of reads to use, especially recommended when using a large BAM
441484
without an index. If an indexed BAM is provided, the reads will be sampled evenly over the
@@ -1397,7 +1440,7 @@ sample is input as a bgzip pileup bedMethyl (produced by pileup, for example) th
13971440
tabix index. Output is a BED file with the score column indicating the magnitude of the difference
13981441
in methylation between the two samples. See the online documentation for additional details.
13991442
1400-
Usage: modkit dmr pair [OPTIONS] -a <CONTROL_BED_METHYL> -b <EXP_BED_METHYL> --ref <REFERENCE_FASTA>
1443+
Usage: modkit dmr pair [OPTIONS] --ref <REFERENCE_FASTA>
14011444
14021445
Options:
14031446
-a <CONTROL_BED_METHYL>

book/src/dmr_scoring_details.md

Lines changed: 2 additions & 1 deletion
@@ -118,7 +118,8 @@ The model is a simple 2-state hidden Markov model, shown below, where the two hi
118118
![hmm](./images/hmm2.png "2-state segmenting HMM")
119119

120120
</div>
121-
The model is run over the intersection of the modified positions in a [pileup](https://nanoporetech.github.io/modkit/intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples.
121+
122+
The model is run over the intersection of the modified positions in a [pileup](./intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples.
122123

123124
## Transition parameters
124125
There are two transition probability parameters, \\(p\\) and \\(d\\).

book/src/faq.md

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
1+
# Frequently asked questions
2+
3+
## How are base modification probabilities calculated?
4+
5+
Base modifications are assigned a probability reflecting the confidence the base modification detection algorithm has in making a decision about the modification state of the molecule at a particular position.
6+
The probabilities are parsed from the `ML` tag in the BAM record. Each value is the probability that the base has a specific modification. `modkit` uses these values to calculate the probability of each modification as well as the probability that the base is canonical:
7+
8+
\\[ \
9+
P_{\text{canonical}} = 1 - \sum_{m \in \textbf{M}} P_{m} \
10+
\\]
11+
12+
where \\(\textbf{M}\\) is the set of all of the potential modifications for the base.
13+
14+
For example, consider an m6A model that predicts m6A or canonical bases at adenine residues. If \\( P_{\text{m6A}} = 0.9 \\), then the probability of canonical \\( \text{A} \\) is \\( P_{\text{canonical}} = 1 - P_{\text{m6A}} = 0.1 \\).
15+
Or considering a typical case for cytosine modifications where the model predicts 5hmC, 5mC, and canonical cytosine:
16+
17+
\\[
18+
P_{\text{5mC}} = 0.7, \\\\
19+
P_{\text{5hmC}} = 0.2, \\\\
20+
P_{\text{canonical}} = 1 - (P_{\text{5mC}} + P_{\text{5hmC}}) = 0.1, \\\\
21+
\\]
22+
23+
One potential source of confusion is that `modkit` does not assume a base is canonical when the probability of modification is close to \\( \frac{1}{N_{\text{classes}}} \\), the lowest probability the algorithm may assign.
24+
25+
## What value for `--filter-threshold` should I use?
26+
27+
Just as you might remove low-quality data as a first step in any processing, `modkit` will filter out the lowest confidence base modification probabilities.
28+
The filter threshold (or pass threshold) defines the minimum probability required for a read's base modification information at a particular position to be used in a downstream step.
29+
This does not remove the whole read from consideration. Only the base modification information attributed to a particular position in the read is removed.
30+
The most common place to encounter filtering is in `pileup`, where base modification probabilities falling below the pass threshold will be tabulated in the \\( \text{N}\_{\text{Fail}} \\) column instead of the \\( \text{N}\_{\text{valid}} \\) column.
31+
For highest accuracy, the general recommendation is to let `modkit` estimate this value for you based on the input data.
32+
The value is calculated by first taking a sample of the base modification probabilities from the input dataset and determining the \\(10^{\text{th}}\\) percentile probability value.
33+
This percentile can be changed with the `--filter-percentile` option.
34+
Passing a value to `--filter-threshold` and/or `--mod-threshold` that is higher or lower than the estimated value will have the effect of excluding or including more probabilities, respectively.
35+
It may be a good idea to inspect the distribution of probability values in your data. The `modkit sample-probs` [command](./intro_sample_probs.md) is designed for this task.
36+
Use the `--hist` and `--out-dir` options to collect a histogram of the prediction probabilities for each canonical base and modification.
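
The estimation described above (sample probabilities, take the 10th percentile, then treat calls below that value as failing) can be sketched as follows. This is an illustration under stated assumptions, not `modkit`'s code: the function names are hypothetical, the input is assumed to be a list holding the probability of the predicted class for each sampled call, and the simple index-based percentile is one of several possible percentile definitions.

```python
def estimate_pass_threshold(call_probs, filter_percentile=0.1):
    """Return the probability at `filter_percentile` of the sampled calls.

    Assumption: a lower, index-based percentile over the sorted sample.
    """
    ordered = sorted(call_probs)
    if not ordered:
        raise ValueError("need at least one sampled probability")
    index = int(filter_percentile * (len(ordered) - 1))
    return ordered[index]


def split_calls(call_probs, threshold):
    """Partition calls into passing (>= threshold) and failing (< threshold)."""
    passing = [p for p in call_probs if p >= threshold]
    failing = [p for p in call_probs if p < threshold]
    return passing, failing


# Illustrative, made-up sample of per-call probabilities.
sampled = [0.55, 0.6, 0.7, 0.8, 0.85, 0.9, 0.92, 0.95, 0.97, 0.99, 0.995]
threshold = estimate_pass_threshold(sampled)
passing, failing = split_calls(sampled, threshold)
print(threshold, len(passing), len(failing))
```

Raising the threshold above the estimated value moves more calls into the failing group, and lowering it does the opposite, which is the effect described for `--filter-threshold` and `--mod-threshold`.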
37+
38+
39+
40+
<!-- ## How can I perform differential methylation analysis? -->

book/src/intro_sample_probs.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
# Inspecting base modification probabilities
2+
3+
> For details on how base modification probabilities are calculated, see the [FAQ page](./faq.html#how-are-base-modification-probabilities-calculated).
4+
5+
For most use cases the automatic filtering enabled in `modkit` will produce nearly ideal results.
6+
However, in some cases such as exotic organisms or specialized assays, you may want to interrogate the base modification probabilities directly and tune the pass thresholds.
7+
The `modkit sample-probs` command is designed for this task.
8+
There are two ways to use this command. The first is to simply run `modkit sample-probs $mod_bam` to get a tab-separated file of threshold values for each modified base.
9+
This can save time in downstream steps where you wish to re-use the threshold value by passing `--filter-threshold` and skip re-estimating the value.
10+
For more detailed output, add `--hist --out-dir $output_dir` to the command to generate per-modification histograms of the output probabilities.
11+
Using the command this way produces 3 files in the `$output_dir`:
12+
1. An HTML document containing a histogram of the total counts of each probability emitted for each modification code (including canonical) in the sampled reads.
13+
1. Another HTML document containing the proportions of each probability emitted.
14+
1. A tab-separated table with the same information as the histograms and the percentile rank of each probability value.
15+
16+
The schema of the table is as follows:
17+
18+
| column | name | description | type |
19+
|--------|-----------------|----------------------------------------------------------------------------------------------|--------|
20+
| 1 | code | modification code or '-' for canonical | string |
21+
| 2 | primary base | the primary DNA base for which the code applies | string |
22+
| 3 | range_start | the inclusive start probability of the bin | float |
23+
| 4 | range_end | the exclusive end probability of the bin | float |
24+
| 5 | count | the total count of probabilities falling in this bin | int |
25+
| 6 | frac | the fraction of the total calls for this code/primary base in this bin | float |
26+
| 7 | percentile_rank | the [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) of this probability bin | float |
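
A table with this schema can be read with Python's standard `csv` module. The sketch below makes several assumptions that the schema alone does not confirm: that the file has a header row, that rows for each code are sorted by ascending `range_start`, and that `percentile_rank` is on a 0-100 scale. The function name `threshold_from_table` and the rows in `example` are made up for illustration and are not real `modkit` output.

```python
import csv
import io


def threshold_from_table(tsv_text, code, target_rank):
    """Return range_start of the first bin for `code` whose percentile_rank
    reaches `target_rank`, or None if there is no such bin.

    Columns are read by position, following the schema above.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # assumption: the first row is a header
    for row in reader:
        if len(row) < 7:
            continue  # skip blank or malformed rows
        row_code, _base, range_start, _end, _count, _frac, rank = row[:7]
        if row_code == code and float(rank) >= target_rank:
            return float(range_start)
    return None


# Made-up rows, for illustration only.
rows = [
    ["code", "primary base", "range_start", "range_end",
     "count", "frac", "percentile_rank"],
    ["m", "C", "0.5", "0.6", "10", "0.10", "5.0"],
    ["m", "C", "0.6", "0.7", "10", "0.10", "15.0"],
    ["m", "C", "0.7", "1.0", "80", "0.80", "60.0"],
]
example = "\n".join("\t".join(r) for r in rows)
print(threshold_from_table(example, "m", 10.0))
```

To read a real file, pass its contents in as `tsv_text`, after first checking the header and the scale of `percentile_rank` against your own output.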
27+
28+
From these plots and tables you can decide on a pass threshold per-modification code and use `--mod-threshold`/`--filter-threshold` [accordingly](./filtering.md).
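
The way these options combine, as described in the `--filter-threshold` and `--mod-threshold` help text, can be sketched in Python. This is an illustrative model of the documented precedence, not `modkit`'s parser: a per-modification threshold applies first, then a per-base threshold, then the default. The function names are hypothetical.

```python
def parse_filter_thresholds(specs):
    """Split --filter-threshold style values into (default, per_base).

    "C:0.75" sets a per-base threshold; a bare number such as "0.9"
    sets the default for all other bases.
    """
    default = None
    per_base = {}
    for spec in specs:
        if ":" in spec:
            base, value = spec.split(":", 1)
            per_base[base] = float(value)
        else:
            default = float(spec)
    return default, per_base


def resolve_threshold(mod_code, primary_base, default, per_base=None, per_mod=None):
    """Pick the pass threshold for one call.

    Assumed precedence: modification code, then primary base, then default.
    """
    per_base = per_base or {}
    per_mod = per_mod or {}
    if mod_code in per_mod:
        return per_mod[mod_code]
    if primary_base in per_base:
        return per_base[primary_base]
    return default


# Mirrors: --filter-threshold A:0.70 --filter-threshold 0.9 --mod-threshold h:0.8
default, per_base = parse_filter_thresholds(["A:0.70", "0.9"])
per_mod = {"h": 0.8}
print(resolve_threshold("a", "A", default, per_base, per_mod))
print(resolve_threshold("m", "C", default, per_base, per_mod))
print(resolve_threshold("h", "C", default, per_base, per_mod))
```

In this example adenine calls use 0.70, 5hmC calls (code `h`) use 0.8, and all other calls fall back to 0.9.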
