Skip to content

Design of nucleotide summarizer scoring and depth estimates

Robert Edgar edited this page Jun 30, 2020 · 17 revisions

Score

There is a score for each virus family and each reference sequence (score=xxx field). Collectively, families and reference sequences are "targets" of interest. The design goals for the score are as follows.

  • Range 0 to 100 for intuitive interpretation.

  • Can be interpreted as classifier confidence score that the virus family (or a closely related reference sequence) is present in the reads.

  • Zero if no alignments, greater than zero if at least one target alignment is found.

Heuristics for scoring

  • Coverage across the pan-genome or reference sequence is the best signal for presence / absence.

  • Very low coverage should be down-weighted because presence has less value if assembly is not possible.

  • Low identity should be up-weighted because coverage and depth may be under-estimated due to reduced sensitivity of alignments.

Score calculation

Divide the reference sequence or pan-genome into 25 equal-sized bins. Count the number of alignments in each bin. Let Cvg8 = number of bins with at least 8 alignments. Let Cvg1 be number of bins with 1 to 7 alignments.

Define raw_score = Cvg1 + Cvg8*4. This is in range 0 to 100. It is always > 0 if there is an alignment. Maximum value is 100 when all bins have at least 8 alignments.

Define identity_weight = 100 / (pctid^3). This is a number >= 1.0 which increases with lower pctid, increasing slowly at identities close to 100% and rapidly below 80%id. This is used to up-weight raw_score at lower identities. For example, at 90%id identity_weight = 1.6 and at 80%id identity_weight = 2.0. This is a very rough model of the sensitivity of bowtie2 as a function of identity.

Calculation is score = raw_score * identity_weight, pegged at 100 (i.e., if the value is > 100 set to 100).

Clone this wiki locally