Bioinformatics Details of the Twist Viral Panels

This guide describes the methods used to analyze data for the Twist Respiratory Virus and Comprehensive Viral Research Panels on One Codex.

Written by Christopher Smith
Updated over 6 months ago

In collaboration with the team at Twist Bioscience, we've developed rapid workflows on the One Codex platform designed specifically to analyze data from the Twist Respiratory Virus Research Panel and the Twist Comprehensive Viral Research Panel. Both of these workflows are included with the purchase of the respective product from Twist Bioscience, and are designed to help you take your raw sequence data and generate a detailed yet easy-to-interpret report.

Both workflows share a similar methodology, with a couple of small differences that are outlined below. Principally, these analyses align sequence reads against a set of curated viral reference genomes to estimate the depth and breadth of coverage, and the identity of the reads to each viral genome detected in the sample.

Reference Genomes

Twist Respiratory Virus Research Panel

The analysis for the Twist Respiratory Virus Research Panel uses a set of 29 viral reference genomes curated by the team at Twist Bioscience. A complete list of viruses included in the panel is available here. The One Codex platform automatically aligns your sequence data against these viral reference genomes, before summarizing the results into a convenient PDF report.

Twist Comprehensive Viral Research Panel

When you upload data from the Twist Comprehensive Viral Research Panel, we first analyze it against our One Codex Database. This is a large, curated reference database consisting of over 115k genomes, including over 48k whole viral reference genomes. The One Codex Database uses a k-mer based classification method, which lets us rapidly and accurately identify any of the more than 3,153 viruses in the Panel. You can read more about our One Codex Database analysis here.

Once we've identified which viruses are present in your sample, we gather the reference genomes for the viruses that are assigned at least 100 reads. We then perform a secondary sequence alignment step to provide more detailed coverage, depth, and identity metrics for each of the viruses present in the sample.
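The read-count filter described above can be sketched in a few lines of Python. This is only an illustration, not the One Codex implementation: the function name, the input dictionary, and the virus names are hypothetical, and the only value taken from the text is the 100-read minimum.

```python
from typing import Dict, List

MIN_READS = 100  # minimum assigned reads, as stated in the text


def select_references(read_counts: Dict[str, int],
                      min_reads: int = MIN_READS) -> List[str]:
    """Return the viruses with at least `min_reads` assigned reads.

    `read_counts` maps a virus name to its number of classified reads
    (a hypothetical input format). Results are sorted most abundant first.
    """
    selected = [name for name, count in read_counts.items()
                if count >= min_reads]
    return sorted(selected, key=lambda name: read_counts[name], reverse=True)


# Hypothetical classification results for one sample.
counts = {"virus_A": 2500, "virus_B": 100, "virus_C": 42}
print(select_references(counts))
```

Only the viruses returned by a filter like this would have their reference genomes passed on to the alignment step.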

Sequence Alignment

Both workflows use minimap2 as the sequence alignment tool. We run minimap2 with default alignment settings, using the built-in preset for aligning short input reads. Since there is significant homology between many of the viruses in the panel, we allow for multi-mapping by retaining all secondary alignments of equal quality to the primary alignment.
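As a rough illustration, a minimap2 invocation with the short-read preset could be assembled as shown below. The exact command line used on the platform is not given in this guide, so treat this as an assumption: `-a` (SAM output) and `-x sr` (short-read preset) are standard minimap2 options, while the file names are placeholders. The sketch does not reproduce the secondary-alignment filtering; minimap2 exposes options such as `-p` and `-N` that control which secondary alignments are kept, but the values used by the workflow are not stated here.

```python
import shlex
from typing import List


def build_minimap2_command(reference_fasta: str, reads_fastq: str) -> List[str]:
    """Build a minimap2 command line for short-read alignment.

    -a     write alignments in SAM format
    -x sr  use minimap2's built-in short-read preset
    All other settings are left at minimap2's defaults.
    """
    return ["minimap2", "-a", "-x", "sr", reference_fasta, reads_fastq]


# Placeholder file names; pass the list to subprocess.run() to execute it.
cmd = build_minimap2_command("refs.fasta", "sample.fastq.gz")
print(shlex.join(cmd))
```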

After the sequence alignment is complete, we calculate mean sequencing depth across the entire reference ("average depth"), fraction of the reference covered by at least one read ("coverage"), and cumulative sequence identity ("identity").
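To make the three metrics concrete, here is a minimal Python sketch that computes them from simplified inputs. It assumes per-base depths are already available for every position of the reference, and it assumes "cumulative identity" means total matching bases divided by total aligned bases across all alignments; the guide does not spell out the exact formula, so that definition and the function names are illustrative only.

```python
from typing import Sequence, Tuple


def coverage_metrics(depths: Sequence[int]) -> Tuple[float, float]:
    """Return (average depth, coverage) for one reference genome.

    `depths` holds the read depth at every reference position,
    including positions with zero depth.
    """
    if not depths:
        raise ValueError("reference has zero length")
    average_depth = sum(depths) / len(depths)
    # Coverage: fraction of positions covered by at least one read.
    coverage = sum(1 for d in depths if d >= 1) / len(depths)
    return average_depth, coverage


def cumulative_identity(alignments: Sequence[Tuple[int, int]]) -> float:
    """Return total matches / total aligned bases (an assumed definition).

    Each alignment is a (matching_bases, aligned_bases) pair.
    """
    total_aligned = sum(aligned for _, aligned in alignments)
    if total_aligned == 0:
        return 0.0
    return sum(matches for matches, _ in alignments) / total_aligned


# Toy 10-base reference with four uncovered positions.
depths = [0, 0, 4, 6, 10, 10, 0, 2, 8, 0]
print(coverage_metrics(depths))
print(cumulative_identity([(98, 100), (147, 150)]))
```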

Reporting

The final PDF report identifies a given virus as being either "Present" or "Indeterminate" according to the following thresholds:

Present: A virus is considered "Present" when at least 20% of its reference genome is covered, with an average depth of at least 10x across the entire genome.

Indeterminate: A virus that falls short of "Present" is still considered "Indeterminate" as long as at least 5% of its reference genome is covered.

If a virus does not pass the "Indeterminate" threshold, it is considered not detected and is excluded from the report.
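The reporting rules above amount to a small decision function. The sketch below assumes that coverage is expressed as a fraction between 0 and 1, that the thresholds are inclusive, and that the 10x depth figure is a minimum; the function name is hypothetical.

```python
from typing import Optional


def classify_virus(coverage: float, average_depth: float) -> Optional[str]:
    """Apply the reporting thresholds to one virus.

    coverage       fraction of the reference covered by at least one read
    average_depth  mean depth across the entire reference
    Returns "Present", "Indeterminate", or None (not reported).
    """
    if coverage >= 0.20 and average_depth >= 10:
        return "Present"
    if coverage >= 0.05:
        return "Indeterminate"
    return None  # below both thresholds: excluded from the report


print(classify_virus(0.85, 42.0))  # well covered and deep
print(classify_virus(0.30, 3.5))   # covered, but too shallow for "Present"
print(classify_virus(0.02, 0.1))   # below the "Indeterminate" threshold
```

Note that a virus with broad coverage but low average depth falls into "Indeterminate" under these rules, because the 5% coverage check has no depth requirement.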

Consensus Assembly Generation

Consensus genome sequences are generated for every target that is detected or indeterminate. Reads are aligned with minimap2 [1] to the high-quality reference genomes selected as described above (see "Reference Genomes"). Variants are called with bcftools [2], and variant positions with quality scores below 150 are excluded. Bedtools [3] is used to mark sites with low aligned sequence depth (fewer than 10 reads per site) with "N" and deletions with "-". The consensus sequences are hosted on the One Codex platform in FASTA format and are available via links provided in the report.
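The masking logic can be illustrated with a toy Python function. This is not the bcftools/bedtools pipeline itself, only a sketch of its effect on a single sequence. It handles single-base substitutions only, takes deleted positions as an explicit input, and assumes that a deletion mark takes precedence over the low-depth mask; those simplifications, the function name, and the input formats are all assumptions. The two thresholds (quality 150, depth 10) come from the text.

```python
from typing import Dict, FrozenSet, Sequence, Tuple

MIN_DEPTH = 10   # sites with fewer reads are masked with "N"
MIN_QUAL = 150   # variants with lower quality are excluded


def build_consensus(reference: str,
                    depths: Sequence[int],
                    snps: Dict[int, Tuple[str, float]],
                    deleted: FrozenSet[int] = frozenset()) -> str:
    """Build a consensus string from a reference and simplified call data.

    snps     maps a 0-based position to (alternate base, variant quality)
    deleted  0-based positions called as deleted in the sample
    """
    if len(reference) != len(depths):
        raise ValueError("depths must cover every reference position")
    consensus = []
    for pos, ref_base in enumerate(reference):
        if pos in deleted:
            consensus.append("-")            # deletion
        elif depths[pos] < MIN_DEPTH:
            consensus.append("N")            # low-depth site
        elif pos in snps and snps[pos][1] >= MIN_QUAL:
            consensus.append(snps[pos][0])   # confident variant
        else:
            consensus.append(ref_base)       # keep the reference base
    return "".join(consensus)


print(build_consensus(
    reference="ACGTACGT",
    depths=[12, 15, 3, 20, 20, 0, 11, 30],
    snps={1: ("T", 200.0), 3: ("A", 90.0)},  # second variant fails the quality cut
    deleted=frozenset({5}),
))
```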

To simplify post-processing, each consensus .fasta file is given a unique name of the form assemblyID_speciesID_onecodexSampleID. For all detected and indeterminate viruses, refs.fasta contains the reference sequences and snps.vcf.gz contains the variant calls that correspond to the consensus FASTA files.

| Output file | Description |
| --- | --- |
| *.fasta(s) | Separate consensus FASTA file with reference to each species detected/indeterminate in the sample |
| refs.fasta | Concatenated references multi-FASTA file containing reference genomes used to generate consensus sequence |
| snps.vcf | VCF file containing variants between sample and reference(s) for all detected/indeterminate targets in *.fasta(s) |
| aln.bam.gz | BAM file containing alignments between sample and concatenated reference sequence(s) |
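For downstream scripting, the consensus file naming scheme (assemblyID_speciesID_onecodexSampleID) can be split back into its parts. The sketch below assumes that the species ID and sample ID never contain underscores while the assembly ID may, so it splits from the right. That assumption, the helper names, and the example file name are hypothetical and should be checked against real output files.

```python
from typing import NamedTuple


class ConsensusName(NamedTuple):
    assembly_id: str
    species_id: str
    sample_id: str


def parse_consensus_filename(filename: str) -> ConsensusName:
    """Split 'assemblyID_speciesID_onecodexSampleID.fasta' into its parts.

    Splits on the last two underscores, so an assembly ID that itself
    contains underscores is kept intact (an assumption about the IDs).
    """
    suffix = ".fasta"
    stem = filename[:-len(suffix)] if filename.endswith(suffix) else filename
    parts = stem.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected consensus file name: {filename!r}")
    return ConsensusName(*parts)


# Made-up file name, for illustration only.
print(parse_consensus_filename("GCF_000000000.1_12345_abc123.fasta"))
```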

References

  1. Li, Heng. "Minimap2: pairwise alignment for nucleotide sequences." Bioinformatics 34.18 (2018): 3094-3100.

  2. Danecek, Petr, et al. "Twelve years of SAMtools and BCFtools." Gigascience 10.2 (2021): giab008.

  3. Quinlan, Aaron R., and Ira M. Hall. "BEDTools: a flexible suite of utilities for comparing genomic features." Bioinformatics 26.6 (2010): 841-842.
