Commit f882a43

Merge branch 'ar/prep-040' into 'master'
Prep docs and loose ends for 0.4.0

See merge request machine-learning/modkit!224
2 parents 3d4f2ec + eca57ca commit f882a43

55 files changed

+2597
-2033
lines changed

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,22 @@ All notable changes to this project will be documented in this file.
44
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
55
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
66

7+
## [v0.4.0]
8+
### Adds
9+
- [motif] Add `search` and `evaluate` subcommands under `motif` command hierarchy.
10+
- [stats, localize] Add `stats` and `localize` commands, see documentation for details.
11+
- [dmr, multi] Combine samples when they have the same name.
12+
- [extract] Add option to emit bgzf-compressed output.
13+
### Fixes
14+
- [validate] Only consider modification codes attributed to the primary sequence base being validated.
15+
- [pileup, extract] Improve iteration over regions when `--include-bed` is provided.
16+
### Changes
17+
- [dmr] Use `htslib` `tbx` module for reading tabix index instead of `noodles`.
18+
- [pileup] Require `.fai` FASTA index when a reference is provided, load only sections of reference that are necessary.
19+
- [extract] Separate `calls` and `full` commands to produce the read calls, and full tables, respectively.
20+
- [validate] Change "any-mod" code from {`A`, `C`, `G`, `T`} to `*`.
21+
22+
723
## [v0.3.3]
824
### Fixes
925
- [sample-probs, summary, pileup] Refactor sampling algorithm so that it will not over-sample reads leading to excessive memory usage.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
[package]
22
name = "mod_kit"
3-
version = "0.3.3"
3+
version = "0.4.0"
44
edition = "2021"
55

66
[[bin]]

book/src/SUMMARY.md

Lines changed: 8 additions & 4 deletions
@@ -2,18 +2,22 @@
22

33
- [Quick Start guides](./quick_start.md)
44
- [Constructing bedMethyl tables](./intro_bedmethyl.md)
5+
- [Make hemi-methylation bedMethyl tables](./intro_pileup_hemi.md)
56
- [Updating and adjusting MM tags](./intro_adjust.md)
67
- [Inspecting base modification probabilities](./intro_sample_probs.md)
78
- [Summarizing a modBAM](./intro_summary.md)
8-
- [Making a motif BED file](./intro_motif_bed.md)
9-
- [Extracting read information to a table](./intro_extract.md)
9+
- [Calculating modification statistics in regions](./intro_stats.md)
1010
- [Calling mods in a modBAM](./intro_call_mods.md)
1111
- [Removing modification calls at the ends of reads](./intro_edge_filter.md)
1212
- [Repair MM/ML tags on trimmed reads](./intro_repair.md)
13-
- [Make hemi-methylation bedMethyl tables](./intro_pileup_hemi.md)
13+
- [Working with sequence motifs](./intro_motif.md)
14+
- [Making a motif BED file](./intro_motif_bed.md)
15+
- [Find highly modified motif sequences](./intro_find_motifs.md)
16+
- [Evaluate and refine a table of known motifs](./evaluate_motif.md)
17+
- [Extracting read information to a table](./intro_extract.md)
18+
- [Investigating patterns with `localise`](./intro_localize.md)
1419
- [Perform differential methylation scoring](./intro_dmr.md)
1520
- [Validate ground truth results](./intro_validate.md)
16-
- [Find highly modified motif sequences](./intro_find_motifs.md)
1721
- [Calculating methylation entropy](./intro_entropy.md)
1822
- [Narrow output to specific positions](./intro_include_bed.md)
1923
- [Extended subcommand help](./advanced_usage.md)

book/src/advanced_usage.md

Lines changed: 597 additions & 499 deletions
Large diffs are not rendered by default.

book/src/evaluate_motif.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
# Evaluate a table of known motifs
2+
3+
The `modkit motif search` command has an option to provide any number of known motifs with `--known-motif`.
4+
If you already have a list of candidate motifs (e.g., from a previous run of `modkit motif search`) you can check these motifs quickly against a bedMethyl table with `modkit motif evaluate`.
5+
6+
```bash
7+
modkit motif evaluate -i ${bedmethyl} --known-motifs-table motifs.tsv -r ${ref}
8+
```
9+
10+
Similarly, the search [algorithm](./intro_find_motifs.md#simple-description-of-the-search-algorithm) can be run using known motifs as seeds:
11+
12+
```bash
13+
modkit motif refine -i ${bedmethyl} --known-motifs-table motifs.tsv -r ${ref}
14+
```
15+
16+
The output tables from both of these commands have the same schema:
17+
18+
| column | name | description | type |
19+
|--------|------------|-------------------------------------------------------------------------------------------------|-------|
20+
| 1 | mod_code | code specifying the modification found in the motif | str |
21+
| 2 | motif | sequence of identified motif using [IUPAC](https://www.bioinformatics.org/sms/iupac.html) codes | str |
22+
| 3 | offset | 0-based offset into the motif sequence of the modified base | int |
23+
| 4 | frac_mod | fraction of time this sequence is found in the _high-modified_ set, col-5 / (col-5 + col-6) | float |
24+
| 5 | high_count | number of occurrences of this sequence in the _high-modified_ set | int |
25+
| 6 | low_count | number of occurrences of this sequence in the _low-modified_ set | int |
26+
| 7 | mid_count | number of occurrences of this sequence in the _mid-modified_ set | int |
27+
| 8 | log_odds | log2 odds of the motif being in the high-modified set | int |
28+
29+
In the human-readable table, columns (1) and (2) are merged to show the modification code in the motif sequence context; the rest of the columns are the same as in the machine-readable table.
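
As a rough illustration of how `frac_mod` relates to the count columns, the sketch below recomputes it from `high_count` and `low_count`. Only the formula col-5 / (col-5 + col-6) comes from the schema; the row values and the helper name `frac_mod` are invented for the example, and returning NaN for an empty denominator is a choice made here, not necessarily what `modkit` emits.

```python
def frac_mod(high_count: int, low_count: int) -> float:
    """Fraction of occurrences in the high-modified set: col-5 / (col-5 + col-6)."""
    total = high_count + low_count
    if total == 0:
        # Assumption: report NaN when the motif has no high or low occurrences.
        return float("nan")
    return high_count / total


# Hypothetical rows: (mod_code, motif, offset, high_count, low_count)
rows = [
    ("a", "GATC", 1, 950, 50),
    ("m", "CCWGG", 1, 300, 700),
]

for mod_code, motif, offset, high, low in rows:
    print(f"{mod_code}\t{motif}\t{offset}\t{frac_mod(high, low):.3f}")
```

Recomputing the column this way can be a useful sanity check after filtering or merging motif tables by hand.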
30+

book/src/intro_bedmethyl.md

Lines changed: 5 additions & 4 deletions
@@ -12,7 +12,7 @@ for details.
1212

1313
## Basic usage
1414

15-
In its simplest form `modkit` creates a bedMethyl file using the following:
15+
In its simplest form `modkit pileup` creates a bedMethyl file using the following:
1616

1717
```text
1818
modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log
@@ -36,6 +36,8 @@ in the reference:
3636
modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta
3737
```
3838

39+
**Note** that when passing a reference with `--ref` a FASTA index `.fai` file is required to be at `path/to/reference.fasta.fai`.
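
A quick pre-flight check can catch a missing index before a long run starts. The sketch below assumes only what the note states, namely that the index sits at the reference path with `.fai` appended; the function names are hypothetical and not part of `modkit`. A missing index can be created with `samtools faidx path/to/reference.fasta`.

```python
from pathlib import Path


def fai_path_for(reference: str) -> Path:
    """Return the expected index location: the FASTA path with '.fai' appended."""
    return Path(reference + ".fai")


def has_fasta_index(reference: str) -> bool:
    """True if the expected .fai file exists next to the reference."""
    return fai_path_for(reference).is_file()


ref = "path/to/reference.fasta"  # placeholder path from the example command
if not has_fasta_index(ref):
    print(f"missing FASTA index, expected: {fai_path_for(ref)}")
```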
40+
3941
To restrict output to only certain CpGs, pass the `--include-bed` option with the CpGs to be used,
4042
see [this page](./intro_include_bed.md) for more details.
4143

@@ -180,6 +182,5 @@ CG->CH substitution such that no modification call was produced by the basecalle
180182

181183
## Performance considerations
182184

183-
The `--interval-size`, `--threads`, `--chunk-size`, and `--max-depth` parameters can be used to tweak the parallelism and
184-
memory consumption of `modkit pileup`. The defaults should be suitable for most use cases, for more details see the
185-
[advanced usage](./advanced_usage.md) and [performance considerations](./perf_considerations.md) sections.
185+
The `--interval-size`, `--threads`, `--chunk-size`, and `--max-depth` parameters can be used to tweak the parallelism and memory consumption of `modkit pileup`.
186+
The defaults should be suitable for most use cases; for more details see the [performance considerations](./perf_considerations.md) section.

book/src/intro_dmr.md

Lines changed: 3 additions & 0 deletions
@@ -96,6 +96,9 @@ The full schema is described [below](#differential-methylation-output-format).
9696
## 2. Perform differential methylation detection on all pairs of samples over regions from the genome.
9797
The `modkit dmr multi` command runs all pairwise comparisons for more than two samples for all regions provided in the regions BED file.
9898
The preparation of the data is identical to that for the [previous section](#preparing-the-input-data) (for each sample, of course).
99+
100+
**Note** that if multiple samples are given the same name, they will be combined.
101+
99102
An example command could be:
100103

101104
```bash

book/src/intro_extract.md

Lines changed: 18 additions & 24 deletions
@@ -1,8 +1,7 @@
11
# Extracting base modification information
22

3-
The `modkit extract` sub-command will produce a table containing the base modification probabilities,
4-
the read sequence context, and optionally aligned reference information.
5-
For `extract`, if a correct `MN` tag is found, secondary and supplementary alignments may be output with the `--allow-non-primary` flag.
3+
The `modkit extract full` sub-command will produce a table containing the base modification probabilities, the read sequence context, and optionally aligned reference information.
4+
For `extract full` and `extract calls`, if a correct `MN` tag is found, secondary and supplementary alignments may be output with the `--allow-non-primary` flag.
65
See [troubleshooting](./troubleshooting.md) for details.
76

87
The table will by default contain unmapped sections of the read (soft-clipped sections, for example).
@@ -13,7 +12,7 @@ the size of the BAM). You may want to either use the `--num-reads` option, the `
1312
pre-filter the modBAM ahead of time. You can also stream the output to stdout by setting the output to `-`
1413
or `stdout` and filter the columns before writing to disk.
1514

16-
## Description of output table
15+
## Description of output table for `extract full`
1716

1817
| column | name | description | type |
1918
|--------|-----------------------|---------------------------------------------------------------------------------|------|
@@ -38,11 +37,10 @@ or `stdout` and filter the columns before writing to disk.
3837
| 19 | flag | FLAG from alignment record | str |
3938

4039

41-
# Tabulating base modification _calls_ for each read position
42-
Passing `--read-calls <file-path>` option will generate a table of read-level base modification calls using the
43-
same [thresholding](./filtering.md) algorithm employed by `modkit pileup`. The resultant table has, for each read,
44-
one row for each base modification call in that read. If a base is called as modified then `call_code` will be the
45-
code in the `MM` tag. If the base is called as canonical the `call_code` will be `-` (`A`, `C`, `G`, and `T` are
40+
# Tabulating base modification _calls_ for each read position with `extract calls`
41+
The `modkit extract calls` command will generate a table of read-level base modification calls using the same [thresholding](./filtering.md) algorithm employed by `modkit pileup`.
42+
The resultant table has, for each read, one row for each base modification call in that read.
43+
If a base is called as modified then `call_code` will be the code in the `MM` tag. If the base is called as canonical the `call_code` will be `-` (`A`, `C`, `G`, and `T` are
4644
reserved for "any modification"). The full schema of the table is below:
4745

4846
| column | name | description | type |
@@ -88,44 +86,40 @@ For secondary and supplementary alignments, soft-clipped positions are not repea
8886

8987
## Example usages:
9088

91-
### Extract a table from an aligned and indexed BAM
89+
### Extract a table of base modification probabilities from an aligned and indexed BAM
9290
```
93-
modkit extract <input.bam> <output.tsv>
91+
modkit extract full <input.bam> <output.tsv>
9492
```
9593
If the index `input.bam.bai` can be found, intervals along the aligned genome can be processed
9694
in parallel.
9795

9896
### Extract a table from a region of a large modBAM
9997
The below example will extract reads from only chr20, and include reference sequence context
10098
```
101-
modkit extract <intput.bam> <output.tsv> --region chr20 --ref <ref.fasta>
99+
modkit extract full <input.bam> <output.tsv> --region chr20 --ref <ref.fasta>
102100
```
103101

104102
### Extract only sites aligned to a CG motif
105103
```
106-
modkit motif-bed <reference.fasta> CG 0 > CG_motifs.bed
107-
modkit extract <in.bam> <out.tsv> --ref <ref.fasta> --include-bed CG_motifs.bed
104+
modkit motif bed <reference.fasta> CG 0 > CG_motifs.bed
105+
modkit extract full <in.bam> <out.tsv> --ref <ref.fasta> --include-bed CG_motifs.bed
108106
```
109107

110108
### Extract only sites that are at least 50 bases from the ends of the reads
111109
```
112-
modkit extract <in.bam> <out.tsv> --edge-filter 50
110+
modkit extract full <in.bam> <out.tsv> --edge-filter 50
113111
```
114112

115113
### Extract read-level base modification calls
116-
```
117-
modkit extract <input.bam> null --read-calls <calls.tsv>
118-
```
119-
Using "null" in the place of the normal output will direct the normal extract output
120-
to /dev/null, to keep this output specify a file or `-` for standard out.
121114

122115
```
123-
modkit extract <input.bam> <output.tsv> --read-calls <calls.tsv>
116+
modkit extract calls <input.bam> <calls.tsv>
124117
```
118+
125119
Use `--allow-non-primary` to get secondary and supplementary mappings in the output.
120+
126121
```
127-
modkit extract <input.bam> <output.tsv> --read-calls <calls.tsv> --allow-non-primary
122+
modkit extract calls <input.bam> <output.tsv> --allow-non-primary
128123
```
129124

130-
131-
See the help string and/or [advanced_usage](./advanced_usage.md) for more details.
125+
See the help string and/or [advanced_usage](./advanced_usage.md) for more details, and [performance considerations](./perf_considerations.md) if you encounter issues with memory usage.

book/src/intro_find_motifs.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ For example, to run the command with default settings (recommended):
88
bedmethyl=/path/to/pileup.bed
99
ref=/path/to/reference.fasta
1010

11-
modkit find-motifs \
11+
modkit motif search \
1212
-i ${bedmethyl} \
1313
-r ${ref} \
1414
-o ./motifs.tsv \

book/src/intro_localize.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
# Investigating patterns with localise
2+
3+
Once a bedMethyl table has been created, `modkit localise` will use the pileup to calculate per-base aggregate modification information around genomic features of interest.
4+
For example, we can investigate base modification patterns around CTCF binding sites.
5+
6+
<p align="center">
7+
<img src="./images/modkit_localise_ctcf_5mC.png" alt="5mC patterns at CTCF sites" width="500" />
8+
</p>
9+
10+
The input requirements to `modkit localise` are simple:
11+
1. BedMethyl table that has been bgzf-compressed and tabix-indexed
12+
1. Regions file in BED format (plaintext).
13+
1. Genome sizes tab-separated file: `<chrom>\t<size_in_bp>`
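
If a FASTA index is already on hand, one way to produce the genome sizes file is to keep its first two columns, since a `.fai` index lists the sequence name and then its length. This is only a sketch under that assumption; the function name and the example index content are made up for illustration.

```python
import io


def fai_to_genome_sizes(fai_lines):
    """Yield 'chrom<TAB>size_in_bp' strings from the lines of a .fai index."""
    for line in fai_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The first two tab-separated fields of a .fai are name and length.
        name, length = line.split("\t")[:2]
        yield f"{name}\t{int(length)}"


# Invented .fai content standing in for a real index file.
example_fai = io.StringIO(
    "chr20\t64444167\t7\t60\t61\n"
    "chrM\t16569\t65518244\t60\t61\n"
)
sizes = list(fai_to_genome_sizes(example_fai))
print("\n".join(sizes))
```

With a real index, the same function can be fed an open file handle and its output written to the path passed as `--genome-sizes`.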
14+
15+
An example command:
16+
17+
```bash
18+
modkit localise ${bedmethyl} --regions ${ctcf} --genome-sizes ${sizes}
19+
```
20+
21+
The output table has the following schema:
22+
23+
| column | Name | Description | type |
24+
|--------|------------------|---------------------------------------------------------------------------------------------------------------------|-------|
25+
| 1 | mod code | modification code as present in the bedmethyl | str |
26+
| 2 | offset | distance in base pairs from the center of the genomic feature; negative values are toward the 5' end of the genome | int |
27+
| 3 | n_valid | number of valid calls at this offset for this modification code | int |
28+
| 4 | n_mod | number of calls for this modification code at this offset | int |
29+
| 5 | percent_modified | `n_mod` / `n_valid` * 100 | float |
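
The `percent_modified` column can be recomputed from the two count columns, which is handy when re-binning offsets. A minimal sketch, assuming rows have already been parsed into tuples in the column order above; the row values and function names are invented, and returning NaN when `n_valid` is zero is a choice made here rather than documented `modkit` behaviour.

```python
def percent_modified(n_mod: int, n_valid: int) -> float:
    """`n_mod` / `n_valid` * 100, as in column 5 of the schema."""
    if n_valid == 0:
        return float("nan")  # assumption: no valid calls means undefined
    return n_mod / n_valid * 100


# Hypothetical rows: (mod_code, offset, n_valid, n_mod)
rows = [
    ("m", -1, 200, 150),
    ("m", 0, 200, 50),
    ("m", 1, 200, 100),
    ("h", 0, 200, 10),
]


def profile(rows, mod_code):
    """Map offset -> percent modified for a single modification code."""
    return {
        offset: percent_modified(n_mod, n_valid)
        for code, offset, n_valid, n_mod in rows
        if code == mod_code
    }


m_profile = profile(rows, "m")
for offset in sorted(m_profile):
    print(offset, m_profile[offset])
```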
30+
31+
Optionally the `--chart` argument can be used to create HTML charts of the modification patterns.

book/src/intro_motif.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
1+
# Working with sequence motifs
2+
3+
The `modkit motif` suite contains tools for discovery and exploration of short degenerate sequences (motifs) that may be enriched in a sample.
4+
A common use case is to discover the motifs enriched for modification in a native bacterial sample, which can indicate the methyltransferase enzymes present in the genomes in the sample.
5+
6+
The following tools are available:
7+
8+
1. [Find enriched motifs _de novo_ from a bedMethyl with `search`.](./intro_find_motifs.md)
9+
1. [`evaluate` or `refine` a table of known motifs](./evaluate_motif.md)
10+
1. [Making a motif BED file with `motif bed`](./intro_motif_bed.md)

book/src/intro_stats.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
# Calculating modification statistics in regions
2+
3+
There are many analysis operations available in `modkit` once you've generated a bedMethyl table.
4+
One such operation is to calculate aggregation statistics on specific regions, for example in CpG islands or gene promoters.
5+
The `modkit stats` command is designed for this purpose.
6+
7+
```bash
8+
# these files can be found in the modkit repository
9+
cpgs=tests/resources/cpg_chr20_with_orig_names_selection.bed
10+
sample=tests/resources/lung_00733-m_adjacent-normal_5mc-5hmc_chr20_cpg_pileup.bed.gz
11+
modkit stats ${sample} --regions ${cpgs} -o ./stats.tsv [--mod-codes "h,m"]
12+
```
13+
14+
> Note that the argument `--mod-codes` can alternatively be passed multiple times, e.g. this is equivalent: <br />
15+
> `--mod-codes h --mod-codes m`
16+
17+
The output TSV has the following schema:
18+
19+
| column | Name | Description | type |
20+
|--------|----------------|-------------------------------------------------------------------------------|-------|
21+
| 1 | chrom | name of reference sequence from BAM header | str |
22+
| 2 | start position | 0-based start position | int |
23+
| 3 | end position | 0-based exclusive end position | int |
24+
| 4 | name | name of the region from input BED (`.` if not provided) | str |
25+
| 5 | strand | Strand (`+`, `-`, `.`) from the input BED (`.` assumed when not provided) | str |
26+
| 6+ | count_x | total number of `x` base modification codes in the region | int |
27+
| 7+ | count_valid_x | total valid calls for the primary base modified by code `x` | int |
28+
| 8+ | percent_x | `count_x` / `count_valid_x` * 100 | float |
29+
30+
Columns 6, 7, and 8 are repeated for each modification code found in the bedMethyl file or provided with `--mod-codes` argument.
31+
32+
An example output:
33+
34+
```text
35+
chrom start end name strand count_h count_valid_h percent_h count_m count_valid_m percent_m
36+
chr20 9838623 9839213 CpG: 47 . 12 1777 0.6752954 45 1777 2.532358
37+
chr20 10034962 10035266 CpG: 35 . 7 1513 0.46265697 0 1513 0
38+
chr20 10172120 10172545 CpG: 35 . 15 1229 1.2205045 28 1229 2.278275
39+
chr20 10217487 10218336 CpG: 59 . 29 2339 1.2398461 108 2339 4.617358
40+
chr20 10433628 10434345 CpG: 71 . 29 2750 1.0545455 2 2750 0.07272727
41+
chr20 10671925 10674963 CpG: 255 . 43 9461 0.45449743 24 9461 0.25367296
42+
```
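
Because columns 6 to 8 repeat once per modification code, downstream parsing is easier if the codes are discovered from the header rather than hard-coded. The sketch below assumes the output is tab-separated with a header row as in the example above (note that the region name `CpG: 47` contains a space, so splitting on whitespace would break). The function name is hypothetical, and the single data row is copied from the first line of the example output.

```python
import csv
import io

HEADER = ["chrom", "start", "end", "name", "strand",
          "count_h", "count_valid_h", "percent_h",
          "count_m", "count_valid_m", "percent_m"]
ROW = ["chr20", "9838623", "9839213", "CpG: 47", ".",
       "12", "1777", "0.6752954", "45", "1777", "2.532358"]
EXAMPLE_TSV = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"


def percent_by_code(row: dict) -> dict:
    """Recompute percent_x = count_x / count_valid_x * 100 for every code x."""
    prefix = "count_valid_"
    codes = [key[len(prefix):] for key in row if key.startswith(prefix)]
    result = {}
    for code in codes:
        count = int(row[f"count_{code}"])
        valid = int(row[f"count_valid_{code}"])
        # Assumption: treat regions without valid calls as undefined (NaN).
        result[code] = count / valid * 100 if valid else float("nan")
    return result


reader = csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t")
results = [percent_by_code(row) for row in reader]
print(results[0])
```

The recomputed values should agree with the `percent_h` and `percent_m` columns up to floating-point rounding.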
43+
