After classifying your samples against the One Codex Database, you may want to know which reads were classified to a particular taxon. We provide a file that lists the taxon for each read (the "Read-Level Data" download button on a sample's results page), but you may also want to pull the sequences for just those reads out of your original sequence file for further analysis. The onecodex command line interface can extract these taxon-specific reads. To use it, you need the following:
You’ll need the raw reads file, which should be interleaved. If you uploaded paired-end data as two separate files (R1 and R2), the platform interleaves them on upload, so our scripts expect interleaved input. Interleaving combines the R1 and R2 files into a single file: sequence 1 from R1 is followed by sequence 1 from R2, then sequence 2 from R1, sequence 2 from R2, and so on. If you are unsure how to interleave your files, feel free to reach out to us.
You’ll need the classification ID for that sample's One Codex Database classification results. You can find it in the URL when viewing the classification results for the sample, which has the form https://app.onecodex.com/classification/<classification_UUID>
You’ll need the TaxID for the species of interest. You can find it in the Complete Results Table at the bottom of the results page for that sample. Note that you can include or exclude multiple taxa by passing the -t flag multiple times. You can also find the TaxID for any species in our reference list: https://app.onecodex.com/references
Lastly, you’ll need the onecodex command line interface installed. There are instructions to install that here.
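If your paired-end reads are still in separate R1 and R2 files on your own machine, the interleaving described above can be sketched in Python as follows. This is a minimal illustration, not the code the platform runs. It assumes standard four-line FASTQ records (no line-wrapped sequences) and that both files list the reads in the same order. The function name interleave_fastq and the file names are hypothetical.

```python
import gzip
from itertools import islice


def _open_text(path, mode):
    """Open a plain or gzip-compressed file in text mode, based on its extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def _records(handle):
    """Yield FASTQ records as lists of four lines (blank lines are skipped)."""
    lines = (line.rstrip("\r\n") for line in handle if line.strip())
    while True:
        record = list(islice(lines, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave_fastq(r1_path, r2_path, out_path):
    """Write R1 read 1, R2 read 1, R1 read 2, R2 read 2, ... Return the pair count."""
    pairs = 0
    with _open_text(r1_path, "r") as r1, \
            _open_text(r2_path, "r") as r2, \
            _open_text(out_path, "w") as out:
        r2_records = _records(r2)
        for rec1 in _records(r1):
            rec2 = next(r2_records, None)
            if rec2 is None:
                raise ValueError("R2 has fewer reads than R1")
            out.write("\n".join(rec1) + "\n")
            out.write("\n".join(rec2) + "\n")
            pairs += 1
        if next(r2_records, None) is not None:
            raise ValueError("R1 has fewer reads than R2")
    return pairs
```

For example, interleave_fastq("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample_interleaved.fastq") would write one combined file. The sketch stops with an error if the two files contain different numbers of reads, since that usually means the inputs are not a matched pair.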
Once you have those, you can create a file of the taxon-specific reads using the following command:
onecodex scripts subset_reads -t <TaxID> --include-lowconf <classification_id> <fastq_filename>
Note that the --include-lowconf flag is optional. When it is set, the output includes reads for which we have low confidence. You can choose to exclude low-confidence reads by removing this flag from the command.
If you wish to exclude the reads for a given taxon from your results, pass the --exclude-reads flag to the command. The resulting FASTA/FASTQ file will then contain all reads except those classified to the indicated TaxID.
Another useful flag, --with-children, lets you subset to the reads of your taxon of interest plus any more specific taxa that fall under it. For instance, to get the reads for a genus and all species and strains that belong to it, provide the genus-level TaxID together with the --with-children flag. The output will contain the reads that classify at the genus level as well as the reads that classify to species or strains in that genus.
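If you run this command for many samples, it can help to assemble it programmatically. The sketch below is one possible way to do that in Python, using only the flags described above and the argument order shown in the example command. The helper names (classification_id_from_url, build_subset_reads_command), the example ID, the TaxID, and the file name are all hypothetical and not part of the onecodex package.

```python
from urllib.parse import urlparse


def classification_id_from_url(url):
    """Return the path segment that follows /classification/ in a results URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "classification" not in parts:
        raise ValueError("not a classification results URL")
    idx = parts.index("classification")
    if idx + 1 >= len(parts):
        raise ValueError("URL does not contain a classification ID")
    return parts[idx + 1]


def build_subset_reads_command(classification_id, fastq_filename, tax_ids,
                               include_lowconf=False, exclude_reads=False,
                               with_children=False):
    """Build the argument list for `onecodex scripts subset_reads`."""
    if not tax_ids:
        raise ValueError("at least one TaxID is required")
    cmd = ["onecodex", "scripts", "subset_reads"]
    for tax_id in tax_ids:
        # The -t flag is repeated once per taxon.
        cmd += ["-t", str(tax_id)]
    if include_lowconf:
        cmd.append("--include-lowconf")
    if exclude_reads:
        cmd.append("--exclude-reads")
    if with_children:
        cmd.append("--with-children")
    cmd += [classification_id, fastq_filename]
    return cmd


# Example values only; substitute your own URL, TaxID, and file name.
example_id = classification_id_from_url(
    "https://app.onecodex.com/classification/abc123def456"
)
example_cmd = build_subset_reads_command(
    example_id, "sample_interleaved.fastq", [1301], with_children=True
)
print(" ".join(example_cmd))

# To actually run it (requires the onecodex CLI to be installed and logged in):
#   import subprocess
#   subprocess.run(example_cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.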