Design of nucleotide summarizer scoring and depth estimates

Score

There is a score for each virus family and each reference sequence (score=xxx field). Collectively, families and reference sequences are "targets" of interest. The design goals for the score are as follows.

Range 0 to 100 for intuitive interpretation.
Can be interpreted as classifier confidence score that the virus family (or a closely related reference sequence) is present in the reads.
Zero if no alignments, greater than zero if at least one target alignment is found.

Heuristics for scoring

Coverage across the pan-genome or reference sequence is the best signal for presence / absence.
Very low coverage should be down-weighted because presence has less value if assembly is not possible.
Low identity should be up-weighted because coverage and depth may be under-estimated due to reduced sensitivity of alignments.

Score calculation

Divide the reference sequence or pan-genome into 25 equal-sized bins. Count the number of alignments in each bin. Let Cvg8 = number of bins with at least 8 alignments. Let Cvg1 be number of bins with 1 to 7 alignments.

Define raw_score = Cvg1 + Cvg8*4. This is in range 0 to 100. It is always > 0 if there is an alignment. Maximum value is 100 when all bins have at least 8 alignments.

Define identity_weight = 100 / (pctid^3). This is a number >= 1.0 which increases with lower pctid, increasing slowly at identities close to 100% and rapidly below 80%id. This is used to up-weight raw_score at lower identities. For example, at 90%id identity_weight = 1.6 and at 80%id identity_weight = 2.0. This is a very rough model of the sensitivity of bowtie2 as a function of identity.

Calculation is score = raw_score * identity_weight, pegged at 100 (i.e., if the value is > 100 set to 100).

Overview

Architecture and Pipeline

Raw Data

Serratus Explorer (serratus.io)

Usage

Running Serratus
- Serratus-Lite, local
Finding Novel Viruses (tutorials)
Papers using Serratus
Containers
Summarizer usage
Cloud Budgeting
Serratus SQL Database Management
Data Policy

Design of nucleotide summarizer scoring and depth estimates

Score

Heuristics for scoring

Score calculation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Overview

Raw Data