-
Notifications
You must be signed in to change notification settings - Fork 34
Design of nucleotide summarizer scoring and depth estimates
There is a score for each virus family and each reference sequence (score=xxx field). Collectively, families and reference sequences are "targets" of interest. The design goals for the score are as follows.
-
Range 0 to 100 for intuitive interpretation.
-
Can be interpreted as classifier confidence score that the virus family (or a closely related reference sequence) is present in the reads.
-
Zero if no alignments, greater than zero if at least one target alignment is found.
-
Coverage across the pan-genome or reference sequence is the best signal for presence / absence.
-
Very low coverage should be down-weighted because presence has less value if assembly is not possible.
-
Low identity should be up-weighted because coverage and depth may be under-estimated due to reduced sensitivity of alignments.
Divide the reference sequence or pan-genome into 25 equal-sized bins. Count the number of alignments in each bin. Let Cvg8 = number of bins with at least 8 alignments. Let Cvg1 be number of bins with 1 to 7 alignments.
Define raw_score = Cvg1 + Cvg8*4. This is in range 0 to 100. It is always > 0 if there is an alignment. Maximum value is 100 when all bins have at least 8 alignments.
Define identity_weight = 100 / (pctid^3). This is a number >= 1.0 which increases with lower pctid, increasing slowly at identities close to 100% and rapidly below 80%id. This is used to up-weight raw_score at lower identities. For example, at 90%id identity_weight = 1.6 and at 80%id identity_weight = 2.0. This is a very rough model of the sensitivity of bowtie2 as a function of identity.
Calculation is score = raw_score * identity_weight, pegged at 100 (i.e., if the value is > 100 set to 100).