Commit 1e64664

Merge branch 'ar/prep-031' into 'master'

Prepare v0.3.1

See merge request machine-learning/modkit!195
2 parents 67a2eea + 2454273 commit 1e64664

8 files changed: +132 -17 lines changed

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -4,6 +4,14 @@ All notable changes to this project will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+ ## [v0.3.1]
+ ### Fixes
+ - [call-mods] Always change model to "explicit", dropped base modification probabilities should not be interpreted as canonical.
+ - [dmr, segment] Add pseudo-count to avoid -inf in HMM.
+ - [find-motifs] Fix crash in exhaustive search.
+ ### Adds
+ - [dmr] Allow specification of mod code-to-primary base on the command line with `--assign-code`.
+
  ## [v0.3.1rc1]
  ### Fixes
  - [find-motifs] Bug where error would be reported when output tables are specified. Fixes #195
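
The `[dmr, segment]` changelog entry refers to a common numerical issue in HMMs that work in log space: the log of a zero probability is -inf, which can then propagate through the dynamic programming. The sketch below only illustrates the general idea of additive smoothing with a pseudo-count. It is not modkit's implementation; the function name `log_prob`, the two-outcome assumption, and the pseudo-count value are all hypothetical.

```python
import math


def log_prob(count: int, total: int, pseudo_count: float = 0.0) -> float:
    """Log of an estimated probability, optionally smoothed with a pseudo-count.

    Hypothetical helper for illustration only; not taken from modkit.
    """
    # Additive smoothing over two outcomes (assumed: modified vs. unmodified).
    denominator = total + 2 * pseudo_count
    if denominator == 0:
        return float("-inf")
    p = (count + pseudo_count) / denominator
    return math.log(p) if p > 0 else float("-inf")


# Without smoothing, an outcome that was never observed has log-probability -inf.
unsmoothed = log_prob(0, 10)
# A small pseudo-count (value chosen arbitrarily here) keeps the result finite.
smoothed = log_prob(0, 10, pseudo_count=1e-3)
```

With the pseudo-count, the unobserved outcome gets a very small but non-zero probability, so sums of log-probabilities along a path stay finite.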

book/src/advanced_usage.md

Lines changed: 32 additions & 5 deletions
@@ -797,7 +797,15 @@ Options:
  row for each base modification call in that read using the same thresholding algorithm as
  in pileup, or summary (see online documentation for details on thresholds). Passing this
  option will cause `modkit` to estimate the pass thresholds from the data unless a
- `--filter-threshold` value is passed to the command. (alias: --read-calls)
+ `--filter-threshold` value is passed to the command. Use 'stdout' to stream this table to
+ stdout, but note that you cannot stream this table and the raw extract table to stdout.
+
+ --pass-only
+ Only output base modification calls that pass the minimum confidence threshold. (alias:
+ pass)
+
+ --no-headers
+ Don't print the header lines in the output tables.

  --reference <REFERENCE>
  Path to reference FASTA to extract reference context information from. If no reference is
@@ -1415,7 +1423,7 @@ Options:
  compared at each site.

  --ref <REFERENCE_FASTA>
- Path to reference fasta for used in the pileup/alignment.
+ Path to reference fasta for used in the pileup/alignment

  --segment <SEGMENTATION_FP>
  Run segmentation, output segmented differentially methylated regions to this file.
@@ -1454,10 +1462,20 @@ Options:
  Preset HMM segmentation parameters for higher propensity to switch from "Same" to
  "Different" state. Results will be shorter segments, but potentially higher sensitivity.

- -m <MODIFIED_BASES>
+ -m, --base <MODIFIED_BASES>
  Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
  methylated regions using only cytosine modifications use --base C.
-
+
+ --assign-code <MOD_CODE_ASSIGNMENTS>
+ Extra assignments of modification codes to their respective primary bases. In general,
+ modkit dmr will use the SAM specification to know which modification codes are appropriate
+ to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
+ for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
+ codes or codes that are not part of the specification, you can specify which primary base
+ they belong to here with --assign-code x:C meaning associate modification code "x" with
+ cytosine (C) primary sequence bases. If a code is encountered that is not part of the
+ specification, the bedMethyl record will not be used, this will be logged.
+
  --log-filepath <LOG_FILEPATH>
  File to write logs to, it's recommended to use this option.

@@ -1564,9 +1582,18 @@ Options:
  Prefix files in directory with this label.
  --ref <REFERENCE_FASTA>
  Path to reference fasta for the pileup.
- -m <MODIFIED_BASES>
+ -m, --base <MODIFIED_BASES>
  Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
  methylated regions using only cytosine modifications use --base C.
+ --assign-code <MOD_CODE_ASSIGNMENTS>
+ Extra assignments of modification codes to their respective primary bases. In general,
+ modkit dmr will use the SAM specification to know which modification codes are appropriate
+ to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
+ for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
+ codes or codes that are not part of the specification, you can specify which primary base
+ they belong to here with --assign-code x:C meaning associate modification code "x" with
+ cytosine (C) primary sequence bases. If a code is encountered that is not part of the
+ specification, the bedMethyl record will not be used, this will be logged.
  --log-filepath <LOG_FILEPATH>
  File to write logs to, it's recommended to use this option.
  -t, --threads <THREADS>

book/src/intro_dmr.md

Lines changed: 10 additions & 0 deletions
@@ -158,6 +158,16 @@ modkit dmr pair \

  Keep in mind that the MAP-based p-value provided in single-site analysis is based on a "modified" vs "unmodified" model, see the [scoring section](./dmr_scoring_details.md) and [limitations](./limitations.md) for additional details.

+ ### Note about modification codes
+ The `modkit dmr` commands require the `--base` option to determine which genome positions to compare, i.e. `--base C` tells `modkit` to compare methylation at cytosine bases.
+ You may use this option multiple times to compare methylation at multiple primary sequence bases.
+ It is possible that, during `pileup` a read will have a mismatch and a modification call, such as a C->A mismatch and a 6mA call on that A, and you may not want to use that 6mA call when calculating the differential methylation metrics.
+ To filter out bedMethyl records like this, `modkit` uses the [SAM specification](https://samtools.github.io/hts-specs/SAMtags.pdf) (page 9) of modification codes to determine which modification codes apply to which primary sequence bases.
+ For example, `h` is 5hmC and applies to cytosine bases, `a` is 6mA and applies to adenine bases.
+ However, `modkit pileup` does not require that you use modification codes only in the specification.
+ If your bedMethyl has records with custom modification codes or codes that aren't in the specification yet, use `--assign-code <mod_code>:<primary_base>` to indicate the code applies to a given primary sequence base.
+
+
  ## Differential methylation output format
  The output from `modkit dmr pair` (and for each pairwise comparison with `modkit dmr multi`) is (roughly)
  a BED file with the following schema:
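
The filtering rule described in the "Note about modification codes" text of this hunk can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not modkit's code: the record layout (a dict with a `code` key), the helper names `parse_assign_code` and `usable_records`, and the three-entry code table are all hypothetical. Only the `<mod_code>:<primary_base>` format and the `h`/`a` examples come from the documentation being added in this commit; `m` (5mC) is included as another code from the SAM specification.

```python
# Small subset of SAM-specification modification codes, for illustration only.
SPEC_CODE_TO_BASE = {"h": "C", "m": "C", "a": "A"}


def parse_assign_code(values):
    """Parse `<mod_code>:<primary_base>` strings into a {code: base} dict."""
    assignments = {}
    for value in values:
        code, sep, base = value.partition(":")
        if not sep or not code or base not in {"A", "C", "G", "T"}:
            raise ValueError(f"expected <mod_code>:<primary_base>, got {value!r}")
        assignments[code] = base
    return assignments


def usable_records(records, bases, extra_assignments=None):
    """Keep records whose modification code maps to one of the requested bases."""
    code_to_base = {**SPEC_CODE_TO_BASE, **(extra_assignments or {})}
    kept = []
    for record in records:
        base = code_to_base.get(record["code"])
        if base is None:
            # Unknown code with no assignment: the record is skipped.
            continue
        if base in bases:
            kept.append(record)
    return kept


# Hypothetical bedMethyl-like records: a 5hmC call, a 6mA call, and a custom code "x".
records = [{"code": "h"}, {"code": "a"}, {"code": "x"}]
without_flag = usable_records(records, bases={"C"})
with_flag = usable_records(records, bases={"C"}, extra_assignments=parse_assign_code(["x:C"]))
```

In this sketch, comparing at cytosine bases keeps only the `h` record by default, drops the `a` record (the C->A mismatch case from the text), and keeps the custom `x` record only once it has been assigned to `C`.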

docs/advanced_usage.html

Lines changed: 32 additions & 5 deletions
@@ -956,7 +956,15 @@ <h2 id="extract"><a class="header" href="#extract">extract</a></h2>
  row for each base modification call in that read using the same thresholding algorithm as
  in pileup, or summary (see online documentation for details on thresholds). Passing this
  option will cause `modkit` to estimate the pass thresholds from the data unless a
- `--filter-threshold` value is passed to the command. (alias: --read-calls)
+ `--filter-threshold` value is passed to the command. Use 'stdout' to stream this table to
+ stdout, but note that you cannot stream this table and the raw extract table to stdout.
+
+ --pass-only
+ Only output base modification calls that pass the minimum confidence threshold. (alias:
+ pass)
+
+ --no-headers
+ Don't print the header lines in the output tables.

  --reference &lt;REFERENCE&gt;
  Path to reference FASTA to extract reference context information from. If no reference is
@@ -1562,7 +1570,7 @@ <h2 id="dmr-pair"><a class="header" href="#dmr-pair">dmr pair</a></h2>
  compared at each site.

  --ref &lt;REFERENCE_FASTA&gt;
- Path to reference fasta for used in the pileup/alignment.
+ Path to reference fasta for used in the pileup/alignment

  --segment &lt;SEGMENTATION_FP&gt;
  Run segmentation, output segmented differentially methylated regions to this file.
@@ -1601,10 +1609,20 @@ <h2 id="dmr-pair"><a class="header" href="#dmr-pair">dmr pair</a></h2>
  Preset HMM segmentation parameters for higher propensity to switch from "Same" to
  "Different" state. Results will be shorter segments, but potentially higher sensitivity.

- -m &lt;MODIFIED_BASES&gt;
+ -m, --base &lt;MODIFIED_BASES&gt;
  Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
  methylated regions using only cytosine modifications use --base C.
-
+
+ --assign-code &lt;MOD_CODE_ASSIGNMENTS&gt;
+ Extra assignments of modification codes to their respective primary bases. In general,
+ modkit dmr will use the SAM specification to know which modification codes are appropriate
+ to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
+ for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
+ codes or codes that are not part of the specification, you can specify which primary base
+ they belong to here with --assign-code x:C meaning associate modification code "x" with
+ cytosine (C) primary sequence bases. If a code is encountered that is not part of the
+ specification, the bedMethyl record will not be used, this will be logged.
+
  --log-filepath &lt;LOG_FILEPATH&gt;
  File to write logs to, it's recommended to use this option.

@@ -1709,9 +1727,18 @@ <h2 id="dmr-multi"><a class="header" href="#dmr-multi">dmr multi</a></h2>
  Prefix files in directory with this label.
  --ref &lt;REFERENCE_FASTA&gt;
  Path to reference fasta for the pileup.
- -m &lt;MODIFIED_BASES&gt;
+ -m, --base &lt;MODIFIED_BASES&gt;
  Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
  methylated regions using only cytosine modifications use --base C.
+ --assign-code &lt;MOD_CODE_ASSIGNMENTS&gt;
+ Extra assignments of modification codes to their respective primary bases. In general,
+ modkit dmr will use the SAM specification to know which modification codes are appropriate
+ to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
+ for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
+ codes or codes that are not part of the specification, you can specify which primary base
+ they belong to here with --assign-code x:C meaning associate modification code "x" with
+ cytosine (C) primary sequence bases. If a code is encountered that is not part of the
+ specification, the bedMethyl record will not be used, this will be logged.
  --log-filepath &lt;LOG_FILEPATH&gt;
  File to write logs to, it's recommended to use this option.
  -t, --threads &lt;THREADS&gt;

docs/intro_dmr.html

Lines changed: 8 additions & 0 deletions
@@ -309,6 +309,14 @@ <h2 id="3-detecting-differential-modification-at-single-base-positions"><a class
  --log-filepath dmr.log
  </code></pre>
  <p>Keep in mind that the MAP-based p-value provided in single-site analysis is based on a "modified" vs "unmodified" model, see the <a href="./dmr_scoring_details.html">scoring section</a> and <a href="./limitations.html">limitations</a> for additional details.</p>
+ <h3 id="note-about-modification-codes"><a class="header" href="#note-about-modification-codes">Note about modification codes</a></h3>
+ <p>The <code>modkit dmr</code> commands require the <code>--base</code> option to determine which genome positions to compare, i.e. <code>--base C</code> tells <code>modkit</code> to compare methylation at cytosine bases.
+ You may use this option multiple times to compare methylation at multiple primary sequence bases.
+ It is possible that, during <code>pileup</code> a read will have a mismatch and a modification call, such as a C-&gt;A mismatch and a 6mA call on that A, and you may not want to use that 6mA call when calculating the differential methylation metrics.
+ To filter out bedMethyl records like this, <code>modkit</code> uses the <a href="https://samtools.github.io/hts-specs/SAMtags.pdf">SAM specification</a> (page 9) of modification codes to determine which modification codes apply to which primary sequence bases.
+ For example, <code>h</code> is 5hmC and applies to cytosine bases, <code>a</code> is 6mA and applies to adenine bases.
+ However, <code>modkit pileup</code> does not require that you use modification codes only in the specification.
+ If your bedMethyl has records with custom modification codes or codes that aren't in the specification yet, use <code>--assign-code &lt;mod_code&gt;:&lt;primary_base&gt;</code> to indicate the code applies to a given primary sequence base.</p>
  <h2 id="differential-methylation-output-format"><a class="header" href="#differential-methylation-output-format">Differential methylation output format</a></h2>
  <p>The output from <code>modkit dmr pair</code> (and for each pairwise comparison with <code>modkit dmr multi</code>) is (roughly)
  a BED file with the following schema:</p>

docs/print.html

Lines changed: 40 additions & 5 deletions
@@ -898,6 +898,14 @@ <h2 id="3-detecting-differential-modification-at-single-base-positions"><a class
  --log-filepath dmr.log
  </code></pre>
  <p>Keep in mind that the MAP-based p-value provided in single-site analysis is based on a "modified" vs "unmodified" model, see the <a href="./dmr_scoring_details.html">scoring section</a> and <a href="./limitations.html">limitations</a> for additional details.</p>
+ <h3 id="note-about-modification-codes"><a class="header" href="#note-about-modification-codes">Note about modification codes</a></h3>
+ <p>The <code>modkit dmr</code> commands require the <code>--base</code> option to determine which genome positions to compare, i.e. <code>--base C</code> tells <code>modkit</code> to compare methylation at cytosine bases.
+ You may use this option multiple times to compare methylation at multiple primary sequence bases.
+ It is possible that, during <code>pileup</code> a read will have a mismatch and a modification call, such as a C-&gt;A mismatch and a 6mA call on that A, and you may not want to use that 6mA call when calculating the differential methylation metrics.
+ To filter out bedMethyl records like this, <code>modkit</code> uses the <a href="https://samtools.github.io/hts-specs/SAMtags.pdf">SAM specification</a> (page 9) of modification codes to determine which modification codes apply to which primary sequence bases.
+ For example, <code>h</code> is 5hmC and applies to cytosine bases, <code>a</code> is 6mA and applies to adenine bases.
+ However, <code>modkit pileup</code> does not require that you use modification codes only in the specification.
+ If your bedMethyl has records with custom modification codes or codes that aren't in the specification yet, use <code>--assign-code &lt;mod_code&gt;:&lt;primary_base&gt;</code> to indicate the code applies to a given primary sequence base.</p>
  <h2 id="differential-methylation-output-format"><a class="header" href="#differential-methylation-output-format">Differential methylation output format</a></h2>
  <p>The output from <code>modkit dmr pair</code> (and for each pairwise comparison with <code>modkit dmr multi</code>) is (roughly)
  a BED file with the following schema:</p>
@@ -2026,7 +2034,15 @@ <h2 id="extract"><a class="header" href="#extract">extract</a></h2>
  row for each base modification call in that read using the same thresholding algorithm as
  in pileup, or summary (see online documentation for details on thresholds). Passing this
  option will cause `modkit` to estimate the pass thresholds from the data unless a
- `--filter-threshold` value is passed to the command. (alias: --read-calls)
+ `--filter-threshold` value is passed to the command. Use 'stdout' to stream this table to
+ stdout, but note that you cannot stream this table and the raw extract table to stdout.
+
+ --pass-only
+ Only output base modification calls that pass the minimum confidence threshold. (alias:
+ pass)
+
+ --no-headers
+ Don't print the header lines in the output tables.

  --reference &lt;REFERENCE&gt;
  Path to reference FASTA to extract reference context information from. If no reference is
@@ -2632,7 +2648,7 @@ <h2 id="dmr-pair"><a class="header" href="#dmr-pair">dmr pair</a></h2>
  compared at each site.

  --ref &lt;REFERENCE_FASTA&gt;
- Path to reference fasta for used in the pileup/alignment.
+ Path to reference fasta for used in the pileup/alignment

  --segment &lt;SEGMENTATION_FP&gt;
  Run segmentation, output segmented differentially methylated regions to this file.
@@ -2671,10 +2687,20 @@ <h2 id="dmr-pair"><a class="header" href="#dmr-pair">dmr pair</a></h2>
  Preset HMM segmentation parameters for higher propensity to switch from "Same" to
  "Different" state. Results will be shorter segments, but potentially higher sensitivity.

- -m &lt;MODIFIED_BASES&gt;
+ -m, --base &lt;MODIFIED_BASES&gt;
  Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
  methylated regions using only cytosine modifications use --base C.
-
+
+ --assign-code &lt;MOD_CODE_ASSIGNMENTS&gt;
+ Extra assignments of modification codes to their respective primary bases. In general,
+ modkit dmr will use the SAM specification to know which modification codes are appropriate
+ to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
+ for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
+ codes or codes that are not part of the specification, you can specify which primary base
+ they belong to here with --assign-code x:C meaning associate modification code "x" with
+ cytosine (C) primary sequence bases. If a code is encountered that is not part of the
+ specification, the bedMethyl record will not be used, this will be logged.
+
  --log-filepath &lt;LOG_FILEPATH&gt;
  File to write logs to, it's recommended to use this option.

@@ -2779,9 +2805,18 @@ <h2 id="dmr-multi"><a class="header" href="#dmr-multi">dmr multi</a></h2>
  Prefix files in directory with this label.
  --ref &lt;REFERENCE_FASTA&gt;
  Path to reference fasta for the pileup.
- -m &lt;MODIFIED_BASES&gt;
+ -m, --base &lt;MODIFIED_BASES&gt;
  Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
  methylated regions using only cytosine modifications use --base C.
+ --assign-code &lt;MOD_CODE_ASSIGNMENTS&gt;
+ Extra assignments of modification codes to their respective primary bases. In general,
+ modkit dmr will use the SAM specification to know which modification codes are appropriate
+ to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
+ for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
+ codes or codes that are not part of the specification, you can specify which primary base
+ they belong to here with --assign-code x:C meaning associate modification code "x" with
+ cytosine (C) primary sequence bases. If a code is encountered that is not part of the
+ specification, the bedMethyl record will not be used, this will be logged.
  --log-filepath &lt;LOG_FILEPATH&gt;
  File to write logs to, it's recommended to use this option.
  -t, --threads &lt;THREADS&gt;

docs/searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

docs/searchindex.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.
