In collaboration with the team at Twist Bioscience, we've developed rapid workflows on the One Codex platform designed specifically to analyze data from the Twist Respiratory Virus Research Panel and the Twist Comprehensive Viral Research Panel. Both of these workflows are included with the purchase of the respective product from Twist Bioscience, and are designed to help you take your raw sequence data and generate a detailed yet easy-to-interpret report.

Both workflows share a similar methodology but use different reference assemblies for alignment. In both cases, the analysis aligns sequence reads against a set of curated viral reference genomes to estimate the depth and breadth of coverage, and the identity of the reads to each viral genome detected in the sample.

Methodology

When you upload data from either of the Twist Viral Research Panels, we first analyze it against our One Codex Database. This is a large, curated reference database consisting of over 115k genomes, including over 48k whole viral reference genomes. The One Codex Database uses a k-mer based classification method, which lets us rapidly and accurately identify viral reads in your sample. You can read more about our One Codex Database analysis here.

Once we've identified which viruses are present in your sample, we gather the reference genomes for the viruses in the corresponding panel that are assigned at least 100 reads. Note that this threshold is adjustable when you launch your Twist Viral Panel analysis. From there, we perform a secondary sequence alignment step to provide more detailed coverage, depth, and identity metrics for each of the viruses present in the sample.
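The selection step above can be sketched roughly as follows. This is a minimal illustration and not the One Codex implementation: the function name select_references, the dictionary and set inputs, and the example virus names are all hypothetical. Only the default threshold of 100 reads comes from the description above.

```python
from typing import Dict, List, Set


def select_references(read_counts: Dict[str, int],
                      panel_viruses: Set[str],
                      min_reads: int = 100) -> List[str]:
    """Return the panel viruses that were assigned at least `min_reads` reads."""
    return sorted(
        virus
        for virus, count in read_counts.items()
        if virus in panel_viruses and count >= min_reads
    )


# Hypothetical classification results (virus name -> number of assigned reads).
counts = {"Virus A": 2500, "Virus B": 99, "Virus C": 100, "Off-panel virus": 5000}
panel = {"Virus A", "Virus B", "Virus C"}

selected = select_references(counts, panel)
print(selected)
```

Lowering min_reads in this sketch mirrors adjusting the read threshold at launch: more low-abundance viruses pass on to the alignment step.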

Reference Genomes

Twist Respiratory Virus Research Panel

The analysis for the Twist Respiratory Virus Research Panel uses a set of 29 viral reference genomes curated by the team at Twist Bioscience. A complete list of viruses included in the panel is available here.

Twist Comprehensive Viral Research Panel

The Twist Comprehensive Viral Research Panel covers more than 3,153 viruses. Learn more about the Twist Comprehensive Viral Research Panel here.

Sequence Alignment

Both workflows use minimap2 as the sequence alignment tool. We determine whether your reads are short or long, and launch minimap2 with the appropriate settings for that read length. All other settings are left at their defaults. Since there is significant homology between many of the viruses in the panel, we allow for multi-mapping by retaining all secondary alignments of equal quality to the primary alignment.
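As a rough sketch of what such an invocation might look like, the helper below assembles a minimap2 command line. The exact settings used by the workflows are not given here, so everything specific is an assumption: the 1,000 bp cutoff between short and long reads, the choice of the "sr" and "map-ont" presets, and the use of "-p 1.0" to keep only secondary alignments that score as well as the primary. The function name minimap2_command is hypothetical.

```python
from typing import List


def minimap2_command(reference_fasta: str,
                     reads_fastq: str,
                     mean_read_length: float,
                     long_read_cutoff: int = 1000) -> List[str]:
    """Build an illustrative minimap2 command for short or long reads."""
    # Assumption: the cutoff and the presets are illustrative choices only.
    preset = "map-ont" if mean_read_length >= long_read_cutoff else "sr"
    return [
        "minimap2",
        "-a",                # write alignments in SAM format
        "-x", preset,        # preset matched to the read length
        "--secondary=yes",   # report secondary alignments
        "-p", "1.0",         # min secondary-to-primary score ratio (assumed value)
        reference_fasta,
        reads_fastq,
    ]


short_cmd = minimap2_command("refs.fasta", "reads.fastq", mean_read_length=150)
long_cmd = minimap2_command("refs.fasta", "reads.fastq", mean_read_length=8000)
print(" ".join(short_cmd))
```

Note that minimap2 also caps the number of secondary alignments it retains per read with the -N option, so a real pipeline that wants every equally good hit may need to raise that value as well.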

After the sequence alignment is complete, we calculate mean sequencing depth across the entire reference ("average depth"), fraction of the reference covered by at least one read ("coverage"), and cumulative sequence identity ("identity").
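The three metrics can be illustrated with a small sketch that works from a per-position depth list. The function name coverage_metrics and its inputs are hypothetical, and the identity calculation (matching bases divided by all aligned bases) is an assumed reading of "cumulative sequence identity", not a confirmed formula.

```python
from typing import Dict, Sequence


def coverage_metrics(depths: Sequence[int],
                     matching_bases: int,
                     aligned_bases: int) -> Dict[str, float]:
    """Compute average depth, coverage, and identity for one reference."""
    genome_length = len(depths)
    if genome_length == 0:
        raise ValueError("reference must have at least one position")

    # Mean depth across every position of the reference, covered or not.
    average_depth = sum(depths) / genome_length
    # Fraction of positions covered by at least one read.
    coverage = sum(1 for depth in depths if depth >= 1) / genome_length
    # Assumption: identity = matching bases / aligned bases over all alignments.
    identity = matching_bases / aligned_bases if aligned_bases else 0.0

    return {
        "average_depth": average_depth,
        "coverage": coverage,
        "identity": identity,
    }


# Toy 10 bp reference with reads piled up over two short regions.
metrics = coverage_metrics([0, 0, 4, 4, 2, 0, 0, 0, 10, 0],
                           matching_bases=19, aligned_bases=20)
print(metrics)
```

In this toy example the average depth is 2.0x even though only 40% of positions are covered, which is why depth and coverage are reported separately.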

Reporting

The final PDF report identifies a given virus as being either "Present" or "Indeterminate" according to the following thresholds:

Present: A virus is reported as "Present" when at least 20% of its reference genome is covered, with an average depth of at least 10x across the entire genome.

Indeterminate: A virus that falls short of "Present" is still reported as "Indeterminate" as long as at least 5% of its reference genome is covered.

If a virus does not pass the "Indeterminate" threshold, it is considered not detected and is excluded from the report.
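The reporting thresholds above can be expressed as a short decision function. This is a sketch only: the function name classify_virus is hypothetical, coverage is assumed to be a fraction between 0 and 1, and both thresholds are assumed to be inclusive, with 10x treated as a minimum average depth.

```python
def classify_virus(coverage: float, average_depth: float) -> str:
    """Classify one virus from its coverage (0-1) and average depth (x)."""
    # "Present": at least 20% of the genome covered at 10x average depth.
    if coverage >= 0.20 and average_depth >= 10.0:
        return "Present"
    # "Indeterminate": at least 5% of the genome covered, regardless of depth.
    if coverage >= 0.05:
        return "Indeterminate"
    # Below both thresholds: excluded from the report.
    return "Not detected"


print(classify_virus(coverage=0.50, average_depth=25.0))
print(classify_virus(coverage=0.50, average_depth=3.0))
print(classify_virus(coverage=0.04, average_depth=100.0))
```

Note that a virus with broad coverage but shallow depth lands in "Indeterminate" rather than being dropped, since only the 5% coverage floor applies at that tier.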

Consensus Assembly Generation

Consensus genome sequences are generated for every target that is detected or indeterminate. Reads are aligned with minimap2 [1] to the high-quality reference genomes selected as described above (see "Reference Genomes"). Variants are called with bcftools [2], and variant positions with quality scores below 150 are excluded. Bedtools [3] is used to mark sites with low aligned sequence depth (fewer than 10 reads per site) with "N" and deletions with "-". The consensus sequences are hosted on the One Codex platform in FASTA format and are available via links provided in the report.
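To make the masking rules concrete, here is a simplified pure-Python sketch of how a consensus string could be derived from a reference. It does not call bcftools or bedtools and is not the workflow's actual code. The function name build_consensus, the 0-based positions, the restriction to single-base substitutions, and the choice to let deletion markers take precedence over low-depth masking are all assumptions made for illustration.

```python
from typing import Dict, Sequence, Tuple


def build_consensus(reference: str,
                    depths: Sequence[int],
                    snps: Dict[int, Tuple[str, float]],
                    deleted_positions: Sequence[int] = (),
                    min_depth: int = 10,
                    min_qual: float = 150.0) -> str:
    """Apply high-quality SNPs, then mask low-depth sites and deletions."""
    if len(reference) != len(depths):
        raise ValueError("reference and depths must have the same length")

    consensus = list(reference)

    # Keep only variants whose quality score is at least `min_qual`.
    for position, (alt_base, quality) in snps.items():
        if quality >= min_qual:
            consensus[position] = alt_base

    # Mask sites supported by fewer than `min_depth` reads with "N".
    for position, depth in enumerate(depths):
        if depth < min_depth:
            consensus[position] = "N"

    # Assumption: deletions are marked last, so "-" overrides "N".
    for position in deleted_positions:
        consensus[position] = "-"

    return "".join(consensus)


consensus = build_consensus(
    reference="ACGTACGTAC",
    depths=[12, 12, 12, 3, 12, 12, 0, 12, 12, 12],
    snps={1: ("T", 200.0), 4: ("G", 90.0)},  # second SNP fails the quality filter
    deleted_positions=[6],
)
print(consensus)
```

In this toy case the SNP at position 1 is applied, the low-quality SNP at position 4 is ignored, position 3 is masked for low depth, and position 6 is marked as a deletion.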

For ease of downstream analysis, each consensus .fasta file is given a unique name of the form assemblyID_speciesID_onecodexSampleID. The files refs.fasta and snps.vcf.gz contain the reference sequences and the corresponding variant calls for the consensus FASTA files of all detected and indeterminate viruses.

Output file	Description
*.fasta(s)	One consensus FASTA file for each species detected/indeterminate in the sample
refs.fasta	Multi-FASTA file of the concatenated reference genomes used to generate the consensus sequences
snps.vcf	VCF file containing variants between sample and reference(s) for all detected/indeterminate targets in *.fasta(s)
aln.bam.gz	BAM file containing alignments between sample and concatenated reference sequence(s)
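As one example of downstream processing, the sketch below reads FASTA text and reports how much of a consensus sequence was masked with "N". It uses only the Python standard library. The function names parse_fasta and masked_fraction and the record name example_consensus are hypothetical and do not reflect the real output file names.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {record name: sequence} mapping."""
    records: Dict[str, str] = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def masked_fraction(sequence: str) -> float:
    """Return the fraction of positions in `sequence` that are "N"."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# Hypothetical consensus record with a low-depth stretch and one deletion.
fasta_text = ">example_consensus\nACGTNNNN\nACGT-ACG\n"
records = parse_fasta(fasta_text)
fraction = masked_fraction(records["example_consensus"])
print(fraction)
```

A high masked fraction indicates that much of the genome fell below the depth cutoff, which is worth checking before using a consensus sequence for phylogenetics or lineage assignment.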

References

[1] Li, Heng. "Minimap2: pairwise alignment for nucleotide sequences." Bioinformatics 34.18 (2018): 3094-3100.
[2] Danecek, Petr, et al. "Twelve years of SAMtools and BCFtools." GigaScience 10.2 (2021): giab008.
[3] Quinlan, Aaron R., and Ira M. Hall. "BEDTools: a flexible suite of utilities for comparing genomic features." Bioinformatics 26.6 (2010): 841-842.

Bioinformatics Details of the Twist Viral Panels