Skip to content

Design of nucleotide summarizer scoring and depth estimates

Robert Edgar edited this page Jun 30, 2020 · 17 revisions

Score

There is a score for each virus family and each reference sequence (score=xxx field). Collectively, families and reference sequences are "targets" of interest. The design goals for the score are as follows.

  • Range 0 to 100 for intuitive interpretation.

  • Can be interpreted as classifier confidence score that the virus family (or a closely related reference sequence) is present in the reads.

  • Zero if no alignments, greater than zero if at least one target alignment is found.

Heuristics for scoring

  • Coverage across the pan-genome or reference sequence is the best signal for presence / absence.

  • Very low coverage should be down-weighted because presence has less value if assembly is not possible.

  • Low identity should be up-weighted because coverage and depth may be under-estimated due to reduced sensitivity of alignments.

Score calculation

Divide the reference sequence or pan-genome into 25 equal-sized bins. Count the number of alignments in each bin.

Cvg8 = number of bins with at least 8 alignments.

Cvg1 = number of bins with 1 to 7 alignments.

raw_score = Cvg1 + Cvg8*4. This is in range 0 to 100. It is always > 0 if there is an alignment. Maximum value is 100 when all bins have at least 8 alignments.

identity_weight = 100 / (pctid^3).

The identity weight is a number >= 1.0 which increases with lower pctid, increasing slowly at identities close to 100% and rapidly below 80%id. This is used to up-weight raw_score at lower identities. For example, at 90%id identity_weight = 1.6 and at 80%id identity_weight = 2.0. This is a simple model of the sensitivity of bowtie2 as a function of identity.

score = raw_score * identity_weight, pegged at 100 (i.e., if the value is > 100 set to 100).

Depth estimate

The goal is to estimate typical depth over a recoverable contig that is long enough to be useful. We don't want to require that the entire target can be assembled because a long contig often has value. We don't want mean or median depth over the complete target sequence because of outlier bins. Commonly observed outliers include (a) empty bins due to low coverage or missing gene, and (b) highly abundant genes due to false-positive alignments and expression artifacts (e.g., leaders).

typical_depth is calculated by discarding the two most abundant and two least abundant bins, and taking the mean number of reads from the remaining 21 bins. Similarly to coverage, depth will be under-estimated at low identities. The final estimate is depth = raw_depth * identity_weight.

Clone this wiki locally